<a href="https://colab.research.google.com/github/andrybrew/bigdatanalysis-bi/blob/master/001_machine_learning_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Classification - German Credit Risk**

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. In this section, we will use credit risk as our classification case study.

“It takes money to make more money.” When an individual or company lends money, it sets critical parameters or guidelines to assess the credit risk. In this project, we analyze good and bad credit risk associated with individuals. The goal of this stage is to build classifiers that predict whether an individual is a good or bad credit risk. The analysis is based on the German Credit dataset, sourced from Kaggle.

Each person is classified as a good (1) or bad (0) credit risk according to these attributes:

1.   Age (numeric)
2.   Sex (text: male, female)
3.   Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
4.   Housing (text: own, rent, or free)
5.   Saving accounts (text: little, moderate, quite rich, rich)
6.   Checking account (text: little, moderate, rich)
7.   Credit amount (numeric, in DM - Deutsche Mark)
8.   Duration (numeric, in months)
9.   Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)

#### **Install and Import Libraries**

Before we begin to implement our classifier, we need to import some libraries to use them later. Here are the libraries we need to import.

***Install Library***

In [None]:
# Install Category Encoders
! pip install category_encoders

***Import Libraries***

In [None]:
# Import Library for Data Manipulation
import pandas as pd
import category_encoders as ce

# Import Library for Machine Learning
import sklearn.metrics as metrics

# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#### **Import Dataset**

Next, load the credit risk dataset into this notebook with the pandas library, then inspect its structure and summary statistics.

***Credit Risk Data***

In [None]:
# Import Dataset
df_credit = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/german_credit_data3.csv', sep =';')
df_credit

In [None]:
# Prints the Dataset Information
df_credit.info()

In [None]:
# Prints Descriptive Statistics
df_credit.describe().transpose()

#### **Explore the Dataset**

We need to visualize the data before implementing our classifier. Data visualization places data in a visual context, such as a map or graph, which makes it easier to understand and to detect patterns, trends, and outliers. Here we use the Seaborn library.

***Visualize Data using Pairplot***

In [None]:
# Set Graph Size
plt.rcParams['figure.figsize'] = (16, 8)

# Visualize Pair Plot with Colors
sns.pairplot(df_credit, hue='risk')

***Visualize Correlation between Features***

In [None]:
# Draw Correlation Map (numeric columns only; the numeric_only argument needs pandas >= 1.5)
sns.clustermap(df_credit.corr(numeric_only=True), center=0, cmap='vlag', linewidths=.75)

#### **Preprocess the Data**

We need to transform the raw data into a format the models can consume. Raw data cannot be passed to a model directly because issues such as text columns or missing values would cause errors, so we preprocess the data first.

***Handling Missing Values***

In [None]:
# Check for Missing Values
df_credit.isnull().sum()

***Encode Categorical Data***

Data encoding transforms categorical data into a binary numeric format. Here we use the OneHotEncoder class from scikit-learn to encode our categorical columns.
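
To make the idea concrete, the following minimal sketch one-hot encodes a toy `housing` column in plain Python. The `one_hot` helper is a hypothetical name used only for illustration; it shows the transformation itself, not how scikit-learn implements its encoder.

```python
def one_hot(values):
    """Return (categories, rows) where each row is a list of 0/1 indicators."""
    categories = sorted(set(values))  # fixed, alphabetical column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Toy 'housing' column using the categories from the dataset description
housing = ['own', 'rent', 'free', 'own']
categories, rows = one_hot(housing)

# Column names follow the '<column>_<category>' pattern
print([f'housing_{c}' for c in categories])
for value, row in zip(housing, rows):
    print(value, row)
```

Each category becomes its own column, and every row has exactly one 1 marking the category it belongs to.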

In [None]:
# Import Module
from sklearn.preprocessing import OneHotEncoder

# Columns Holding Categorical Data
categorical_columns = ['sex', 'housing', 'saving', 'checking', 'purpose']

# Encoder (scikit-learn >= 1.2; older releases use sparse=False and get_feature_names)
encoder = OneHotEncoder(sparse_output=False)

# Encode Categorical Data
df_credit2 = pd.DataFrame(encoder.fit_transform(df_credit[categorical_columns]))
df_credit2.columns = encoder.get_feature_names_out(categorical_columns)

# Concat the Encoded Data (drop returns a copy, so df_credit itself stays unchanged)
df_credit_encoded = pd.concat([df_credit.drop(columns=categorical_columns), df_credit2], axis=1)

# Show Encoded Dataframe
df_credit_encoded

***Select Feature and Target***

Features are the independent variables that act as the model's input, while the target is the output the model learns to predict from them.

In [None]:
# Select Features
feature = df_credit_encoded.drop(['risk'], axis=1)
feature

In [None]:
# Select Target
target = df_credit_encoded['risk']
target

***Set Training and Testing Data***

The next step is to split our data into training and test sets. For this purpose, we use scikit-learn's train_test_split function.
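
As an illustration of what a 70:30 split does, here is a minimal pure-Python sketch. The `split_train_test` helper and its rounding rule are assumptions made for this example; scikit-learn's own function supports more options and will not select the same rows.

```python
import random

def split_train_test(rows, test_size=0.3, seed=1):
    """Shuffle row indices reproducibly, then reserve a `test_size` share for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle
    n_test = round(len(rows) * test_size)     # rounding rule is an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state=1` in the notebook: rerunning the split yields the same partition, so results stay comparable between runs.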

In [None]:
# Import Module
from sklearn.model_selection import train_test_split, cross_val_score

# Set Training and Testing Data (70:30)
feature_train, feature_test, target_train, target_test = train_test_split(feature , target, shuffle = True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(feature_train.shape)
print(feature_test.shape)
print(target_train.shape)
print(target_test.shape)

#### **Modeling**

##### **Decision Tree Classifier**

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
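
The sketch below illustrates how a candidate split can be scored, assuming Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`) and the weighted impurity decrease described in the scikit-learn documentation for `min_impurity_decrease`. The helper names and toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - weighted_children)

# Toy node with four good (1) and four bad (0) credit risks
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(gini(parent), decrease)
```

With `min_impurity_decrease=0.01`, as used in this notebook, a split like this one (decrease 0.125) would be kept, while a split that reduces impurity by less than 0.01 would not be made.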

***Build Model***

In [None]:
# Import library
from sklearn import tree

# Modeling Decision Tree
dtree = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtree.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_dtree = dtree.predict(feature_test)
target_predicted_dtree

In [None]:
# Visualize Tree
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Export the Fitted Tree to DOT Format, Labelling Nodes with the Training Column Names
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=['bad', 'good'],
                feature_names=list(feature_train.columns))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

***Model Evaluation***
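
For reference, the metrics reported in this section can all be derived from the four cells of a binary confusion matrix. scikit-learn's `confusion_matrix` puts true labels in rows and predicted labels in columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. The sketch below uses hypothetical counts, not results from this notebook, and `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
    precision = tp / (tp + fp)                   # share of predicted positives that are correct
    recall = tp / (tp + fn)                      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, not results from this notebook
acc, prec, rec, f1 = binary_metrics(tn=50, fp=40, fn=20, tp=190)
print(acc, prec, rec, f1)
```

Precision and recall look at the positive class from two sides, and F1 is their harmonic mean, so it is high only when both are high.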

In [None]:
# Confusion Matrix
cm_dtree = metrics.confusion_matrix(target_test, target_predicted_dtree)
cm_dtree

In [None]:
# Accuracy, Precision, Recall
acc_dtree = metrics.accuracy_score(target_test, target_predicted_dtree)
prec_dtree = metrics.precision_score(target_test, target_predicted_dtree)
rec_dtree = metrics.recall_score(target_test, target_predicted_dtree)
f1_dtree = metrics.f1_score(target_test, target_predicted_dtree)
kappa_dtree = metrics.cohen_kappa_score(target_test, target_predicted_dtree)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_dtree )
print('Precision:', prec_dtree)
print('Recall:', rec_dtree)
print('F1 Score:', f1_dtree)
print('Cohens Kappa Score:', kappa_dtree)

In [None]:
# Set Graph Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve (column 1 holds the predicted probability of class 1)
target_predicted_dtree_prob = dtree.predict_proba(feature_test)[:, 1]
fp_rate_dtree, tp_rate_dtree, _ = metrics.roc_curve(target_test,  target_predicted_dtree_prob)
auc_dtree = metrics.roc_auc_score(target_test, target_predicted_dtree_prob)
plt.plot(fp_rate_dtree, tp_rate_dtree, label='Decision Tree, auc='+str(auc_dtree))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

##### **K-Nearest Neighbor Classifier**

K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the early 1970s.
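
A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote. The `knn_predict` helper and the toy data are hypothetical and far smaller than the credit dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: (age, duration in months) -> credit risk label
train_points = [(25, 48), (30, 36), (28, 42), (45, 12), (50, 6), (40, 18)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, query=(47, 10)))
print(knn_predict(train_points, train_labels, query=(27, 40)))
```

There is no training step beyond storing the data: all the work happens at prediction time, when distances to every stored case are computed.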

***Build Model***

In [None]:
# Import Module
from sklearn.neighbors import KNeighborsClassifier

# Modeling K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors= 71)
knn.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_knn = knn.predict(feature_test)
target_predicted_knn

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_knn = metrics.confusion_matrix(target_test, target_predicted_knn)
cm_knn

In [None]:
# Accuracy, Precision, Recall
acc_knn = metrics.accuracy_score(target_test, target_predicted_knn)
prec_knn = metrics.precision_score(target_test, target_predicted_knn)
rec_knn = metrics.recall_score(target_test, target_predicted_knn)
f1_knn = metrics.f1_score(target_test, target_predicted_knn)
kappa_knn = metrics.cohen_kappa_score(target_test, target_predicted_knn)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_knn)
print('Precision:', prec_knn)
print('Recall:', rec_knn)
print('F1 Score:', f1_knn)
print('Cohens Kappa Score:', kappa_knn)

In [None]:
# Set Graph Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve (column 1 holds the predicted probability of class 1)
target_predicted_knn_prob = knn.predict_proba(feature_test)[:, 1]
fp_rate_knn, tp_rate_knn, _ = metrics.roc_curve(target_test,  target_predicted_knn_prob)
auc_knn = metrics.roc_auc_score(target_test, target_predicted_knn_prob)
plt.plot(fp_rate_knn, tp_rate_knn, label='KNN, auc='+str(auc_knn))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

##### **Naive Bayes Classifier**

Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
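
The following sketch shows the Gaussian variant of this idea in plain Python: each class gets a prior plus a per-feature mean and variance, and a new row is assigned to the class with the highest log posterior. The helper names, the small variance constant, and the toy data are assumptions for illustration, not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # Small constant keeps every variance strictly positive (an assumption here)
        variances = [sum((x - m) ** 2 for x in col) / len(col) + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (len(members) / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
            for x, mean, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)

# Hypothetical toy data: (age, duration in months) -> credit risk label
rows = [(25, 48), (30, 36), (28, 42), (45, 12), (50, 6), (40, 18)]
labels = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, (47, 10)))
print(predict_gaussian_nb(model, (27, 40)))
```

Summing per-feature log likelihoods is where the "naive" independence assumption enters: each feature contributes separately, with no interaction terms.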

***Build Model***

In [None]:
# Import Module
from sklearn.naive_bayes import GaussianNB 

# Modeling Naive Bayes
nb = GaussianNB()
nb.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_nb = nb.predict(feature_test)
target_predicted_nb

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_nb = metrics.confusion_matrix(target_test, target_predicted_nb)
cm_nb

In [None]:
# Accuracy, Precision, Recall
acc_nb = metrics.accuracy_score(target_test, target_predicted_nb)
prec_nb = metrics.precision_score(target_test, target_predicted_nb)
rec_nb = metrics.recall_score(target_test, target_predicted_nb)
f1_nb = metrics.f1_score(target_test, target_predicted_nb)
kappa_nb = metrics.cohen_kappa_score(target_test, target_predicted_nb)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_nb)
print('Precision:', prec_nb)
print('Recall:', rec_nb)
print('F1 Score:', f1_nb)
print('Cohens Kappa Score:', kappa_nb)

In [None]:
# Set Graph Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve (column 1 holds the predicted probability of class 1)
target_predicted_nb_prob = nb.predict_proba(feature_test)[:, 1]
fp_rate_nb, tp_rate_nb, _ = metrics.roc_curve(target_test,  target_predicted_nb_prob)
auc_nb = metrics.roc_auc_score(target_test, target_predicted_nb_prob)
plt.plot(fp_rate_nb, tp_rate_nb, label='Naive Bayes, auc='+str(auc_nb))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

#### **Evaluating Models**

***Compare Model Performance***

In [None]:
# Comparing Model Performance
print('Decision Tree Accuracy =',acc_dtree)
print('Decision Tree Precision =',prec_dtree)
print('Decision Tree Recall =',rec_dtree)
print('Decision Tree F1-Score =', f1_dtree)
print('_______________________')
print('k-NN Accuracy =', acc_knn)
print('k-NN Precision =', prec_knn)
print('k-NN Recall =', rec_knn)
print('k-NN F1-Score =', f1_knn)
print('_______________________')
print('Naive Bayes Accuracy =', acc_nb)
print('Naive Bayes Precision =', prec_nb)
print('Naive Bayes Recall =', rec_nb)
print('Naive Bayes F1-Score =', f1_nb)

***Compare ROC Curve***
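
As a rough sketch of what a ROC curve and its AUC measure, the example below sweeps a decision threshold over predicted scores, records the false positive and true positive rates at each step, and integrates the curve with the trapezoidal rule. The helper names and the four toy scores are hypothetical; `metrics.roc_curve` may, by default, drop some intermediate points.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, y_true))
        fp = sum(p and y == 0 for p, y in zip(predicted, y_true))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
print(fpr, tpr, auc(fpr, tpr))
```

An AUC of 0.5 corresponds to ranking at random and 1.0 to a perfect ranking; the toy scores above give 0.75.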

In [None]:
# Comparing ROC Curve
plt.plot(fp_rate_dtree,tp_rate_dtree,label='Decision Tree, auc='+str(auc_dtree))
plt.plot(fp_rate_knn,tp_rate_knn,label='K-NN, auc='+str(auc_knn))
plt.plot(fp_rate_nb,tp_rate_nb,label='Naive Bayes, auc='+str(auc_nb))
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()

#### **Predict New Data**

***Import New Credit Data***

In [None]:
# Import New Dataset
df_new_credit = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/german_new_credit_data.csv', sep =';')
df_new_credit

***Preprocess the New Credit Data***
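
One point worth keeping in mind when encoding new data: the one-hot columns depend on the categories seen when the encoder was fitted, and a model expects the same columns it was trained on. The sketch below, with hypothetical helper names and toy values, shows how encoding against the training categories keeps the layout stable even when a category is absent from the new data.

```python
def fit_categories(values):
    """Learn the column order from the training data."""
    return sorted(set(values))

def encode(values, categories):
    """One-hot encode `values` against categories learned earlier."""
    return [[1 if v == c else 0 for c in categories] for v in values]

train_housing = ['own', 'rent', 'free', 'own']
new_housing = ['rent', 'own']  # 'free' never appears in the new data

# Reusing the training categories keeps three columns
categories = fit_categories(train_housing)
new_rows = encode(new_housing, categories)
print(categories, new_rows)

# Fitting on the new data alone would yield only two columns
print(fit_categories(new_housing))
```

For this reason it is generally safer to call `transform` on an encoder already fitted on the training data than to fit a fresh encoder on the new data.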

In [None]:
# Columns Holding Categorical Data
new_categorical_columns = ['sex', 'housing', 'saving', 'checking', 'purpose']

# Encode Categorical Data with the Encoder Fitted on the Training Data,
# so the New Columns Match the Ones the Models Were Trained on
df_new_credit2 = pd.DataFrame(encoder.transform(df_new_credit[new_categorical_columns]))
df_new_credit2.columns = encoder.get_feature_names_out(new_categorical_columns)

# Concat the Encoded Data (drop returns a copy, so df_new_credit itself stays unchanged)
df_new_credit_encoded = pd.concat([df_new_credit.drop(columns=new_categorical_columns), df_new_credit2], axis=1)

# Show Encoded Dataframe
df_new_credit_encoded

In [None]:
# Select Features
new_feature = df_new_credit_encoded
new_feature

***Predict New Customer Data***

In [None]:
# Predict using Decision Tree Classifier
new_predicted_dtree = pd.DataFrame(dtree.predict(new_feature), columns=['creditrisk_dtree'])
new_predicted_dtree

In [None]:
# Predict using K-Nearest Neighbor Classifier
new_predicted_knn = pd.DataFrame(knn.predict(new_feature), columns=['creditrisk_knn'])
new_predicted_knn

In [None]:
# Predict using Naive Bayes Classifier
new_predicted_nb = pd.DataFrame(nb.predict(new_feature), columns=['creditrisk_nb'])
new_predicted_nb

***Show Prediction Comparison***

In [None]:
# Show Prediction Result
pred_new_credit = pd.concat([df_new_credit, new_predicted_dtree, new_predicted_knn, new_predicted_nb], axis=1)
pred_new_credit

***Save Prediction Result***

In [None]:
# Save Prediction Result
pred_new_credit.to_csv('new_credit_prediction.csv', index=False)

# **Classification - Telco Customer Churn**

Customer churn is a major problem and one of the most important concerns for large companies. Because churn directly affects revenue, especially in the telecom field, companies seek ways to predict which customers are likely to churn. Finding the factors that increase customer churn is therefore important for taking the actions needed to reduce it.

The main contribution of our work is a churn prediction model that helps telecom operators identify the customers most likely to churn. The model developed in this work uses machine learning techniques on a big data platform and introduces a new approach to feature engineering and selection.

Here we build a classification model from telco customer churn data. The data consists of customer profiles, subscription history, and churn information. Predicting churn lets us analyze the relevant customer data and develop focused customer retention programs.

Each row represents a customer, each column contains customer’s attributes described below:
1.   customerID : Customer ID
2.   gender : Whether the customer is a male or a female
3.   SeniorCitizen : Whether the customer is a senior citizen or not (1, 0)
4.   Partner : Whether the customer has a partner or not (Yes, No)
5.   Dependents : Whether the customer has dependents or not (Yes, No)
6.   tenure : Number of months the customer has stayed with the company
7.   PhoneService : Whether the customer has a phone service or not (Yes, No)
8.   MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service)
9.   InternetService : Customer’s internet service provider (DSL, Fiber optic, No)
10.   OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
11.   OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
12.   DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
13.   TechSupport : Whether the customer has tech support or not (Yes, No, No internet service)
14.   StreamingTV : Whether the customer has streaming TV or not (Yes, No, No internet service)
15.   StreamingMovies : Whether the customer has streaming movies or not (Yes, No, No internet service)
16.   Contract : The contract term of the customer (Month-to-month, One year, Two year)
17.   PaperlessBilling : Whether the customer has paperless billing or not (Yes, No)
18.   PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
19.   MonthlyCharges : The amount charged to the customer monthly
20.   TotalCharges : The total amount charged to the customer
21.   Churn : Whether the customer churned or not (Yes or No)

Source: https://www.kaggle.com/blastchar/telco-customer-churn, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6

#### **Import Dataset**

Import our customer churn dataset into this notebook with the pandas library, then inspect its structure and summary statistics.

***Customer Churn Data***

In [None]:
# Import Dataset
df_churn = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/customer_churn.csv', sep =';')
df_churn

In [None]:
# Prints the Dataset Information
df_churn.info()

In [None]:
# Prints Descriptive Statistics
df_churn.describe().transpose()

#### **Explore the Dataset**

We need to visualize the data before implementing our classifier. Data visualization places data in a visual context, such as a map or graph, which makes it easier to understand and to detect patterns, trends, and outliers. Here we use the Seaborn library.

***Visualize Correlation between Features***

In [None]:
# Draw Correlation Map (numeric columns only; the numeric_only argument needs pandas >= 1.5)
sns.clustermap(df_churn.corr(numeric_only=True), center=0, cmap='vlag', linewidths=.75)

#### **Preprocess the Data**

We need to transform the raw data into a format the models can consume. Raw data cannot be passed to a model directly because issues such as text columns or missing values would cause errors, so we preprocess the data first.

***Handling Missing Values***
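
This section fills missing `TotalCharges` values with the column median. As a minimal sketch of that idea in plain Python, the example below replaces missing entries (marked as `None`) with the median of the observed values. The helper name and the numbers are hypothetical.

```python
import statistics

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Hypothetical TotalCharges values; None marks a missing entry
total_charges = [29.85, 1889.5, None, 108.15, 1840.75, None]
filled = fill_missing_with_median(total_charges)
print(filled)
```

The median is a common choice for skewed columns because, unlike the mean, it is not pulled toward a few very large values.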

In [None]:
# Check for Missing Values
df_churn.isnull().sum()

In [None]:
# Search for Median Value
median = df_churn['TotalCharges'].median()

# Use Median to Replace Missing Values (assign back instead of chained inplace)
df_churn['TotalCharges'] = df_churn['TotalCharges'].fillna(median)

# Check for Missing Values
df_churn.isnull().sum()

***Encode Categorical Data***

Data encoding transforms categorical data into a binary numeric format. Here we use the OneHotEncoder class from scikit-learn to encode our categorical columns.

In [None]:
# Import Module
from sklearn.preprocessing import OneHotEncoder

# Columns Holding Categorical Data
churn_categorical_columns = ['gender', 'InternetService', 'Contract', 'PaymentMethod']

# Encoder (scikit-learn >= 1.2; older releases use sparse=False and get_feature_names)
encoder = OneHotEncoder(sparse_output=False)

# Encode Categorical Data
df_churn2 = pd.DataFrame(encoder.fit_transform(df_churn[churn_categorical_columns]))
df_churn2.columns = encoder.get_feature_names_out(churn_categorical_columns)

# Replace Categorical Data with Encoded Data (drop returns a copy, so df_churn stays unchanged)
df_churn_encoded = pd.concat([df_churn.drop(columns=churn_categorical_columns), df_churn2], axis=1)

# Show Encoded Dataframe
df_churn_encoded

***Select Feature and Target***

Features are the independent variables that act as the model's input, while the target is the output the model learns to predict from them.

In [None]:
# Select Features
feature = df_churn_encoded.drop(['customerID', 'Churn'], axis=1)
feature

In [None]:
# Select Target
target = df_churn_encoded['Churn']
target

***Set Training and Testing Data***

The next step is to split our data into training and test sets. For this purpose, we use scikit-learn's train_test_split function.

In [None]:
# Import Module
from sklearn.model_selection import train_test_split, cross_val_score

# Set Training and Testing Data (70:30)
feature_train, feature_test, target_train, target_test = train_test_split(feature, target, shuffle=True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(feature_train.shape)
print(feature_test.shape)
print(target_train.shape)
print(target_test.shape)

#### **Modeling**

##### **Decision Tree Classifier**

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
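
To illustrate how a fitted tree goes from branches to a leaf, the sketch below walks a small hand-built tree. The features, thresholds, and labels are hypothetical and were not learned from the churn data; `toy_tree` and `predict_with_tree` are illustrative names.

```python
# Internal nodes test one feature against a threshold; leaves hold a class label.
toy_tree = {
    'feature': 'tenure', 'threshold': 12,
    'left': {                                    # tenure <= 12
        'feature': 'MonthlyCharges', 'threshold': 70,
        'left': {'label': 'notchurn'},           # MonthlyCharges <= 70
        'right': {'label': 'churn'},             # MonthlyCharges > 70
    },
    'right': {'label': 'notchurn'},              # tenure > 12
}

def predict_with_tree(node, customer):
    """Follow branches until a leaf is reached, then return its label."""
    while 'label' not in node:
        go_left = customer[node['feature']] <= node['threshold']
        node = node['left'] if go_left else node['right']
    return node['label']

print(predict_with_tree(toy_tree, {'tenure': 3, 'MonthlyCharges': 95.0}))
print(predict_with_tree(toy_tree, {'tenure': 40, 'MonthlyCharges': 95.0}))
```

Each prediction only inspects the features along one path from the root to a leaf, which is what makes a drawn tree easy to read as a set of rules.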

***Build Model***

In [None]:
# Import library
from sklearn import tree

# Modeling Decision Tree
dtree = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)
dtree.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_dtree = dtree.predict(feature_test)
target_predicted_dtree

In [None]:
# Visualize Tree
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Export the Fitted Tree to DOT Format, Labelling Nodes with the Training Column Names
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=['notchurn', 'churn'],
                feature_names=list(feature_train.columns))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

***Model Evaluation***
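
Besides accuracy, precision, recall, and F1, the cells below report Cohen's kappa, which compares the observed agreement between predictions and true labels with the agreement expected by chance. A minimal sketch from confusion-matrix counts follows; the counts are hypothetical, not results from this notebook, and `cohen_kappa` is an illustrative helper name.

```python
def cohen_kappa(tn, fp, fn, tp):
    """Agreement between predictions and labels, corrected for chance agreement."""
    total = tn + fp + fn + tp
    observed = (tp + tn) / total
    # Chance agreement from the marginal frequencies of each class
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical counts, not results from this notebook
kappa = cohen_kappa(tn=50, fp=40, fn=20, tp=190)
print(kappa)
```

A kappa of 1 means perfect agreement and 0 means no better than chance, which makes it a useful complement to accuracy when one class is much more frequent than the other.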

In [None]:
# Confusion Matrix
cm_dtree = metrics.confusion_matrix(target_test, target_predicted_dtree)
cm_dtree

In [None]:
# Accuracy, Precision, Recall
acc_dtree = metrics.accuracy_score(target_test, target_predicted_dtree)
prec_dtree = metrics.precision_score(target_test, target_predicted_dtree)
rec_dtree = metrics.recall_score(target_test, target_predicted_dtree)
f1_dtree = metrics.f1_score(target_test, target_predicted_dtree)
kappa_dtree = metrics.cohen_kappa_score(target_test, target_predicted_dtree)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_dtree )
print('Precision:', prec_dtree)
print('Recall:', rec_dtree)
print('F1 Score:', f1_dtree)
print('Cohens Kappa Score:', kappa_dtree)

In [None]:
# Set Graph Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve (column 1 holds the predicted probability of class 1)
target_predicted_dtree_prob = dtree.predict_proba(feature_test)[:, 1]
fp_rate_dtree, tp_rate_dtree, _ = metrics.roc_curve(target_test,  target_predicted_dtree_prob)
auc_dtree = metrics.roc_auc_score(target_test, target_predicted_dtree_prob)
plt.plot(fp_rate_dtree, tp_rate_dtree, label='Decision Tree, auc='+str(auc_dtree))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

##### **K-Nearest Neighbor Classifier**

K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the early 1970s.
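
With the default uniform weights, the probability that `KNeighborsClassifier.predict_proba` reports for a class is the share of the k nearest neighbours belonging to it, and these scores are what the KNN ROC curve is built from. A minimal sketch, assuming Euclidean distance; the helper name and toy data are hypothetical.

```python
import math

def knn_positive_probability(train_points, train_labels, query, k=3):
    """Share of the k nearest neighbours that carry the positive label (1)."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return sum(label for _, label in nearest) / k

# Hypothetical toy data: (tenure in months, monthly charges) -> churn label
train_points = [(2, 90), (5, 85), (8, 75), (40, 30), (60, 25), (50, 50)]
train_labels = [1, 1, 0, 0, 0, 0]

probability = knn_positive_probability(train_points, train_labels, query=(4, 88))
print(probability)
```

With `n_neighbors=71`, as in this notebook, the score can therefore take at most 72 distinct values (0/71 through 71/71).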

***Build Model***

In [None]:
# Import Module
from sklearn.neighbors import KNeighborsClassifier

# Modeling K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors= 71)
knn.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_knn = knn.predict(feature_test)
target_predicted_knn

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_knn = metrics.confusion_matrix(target_test, target_predicted_knn)
cm_knn

In [None]:
# Accuracy, Precision, Recall
acc_knn = metrics.accuracy_score(target_test, target_predicted_knn)
prec_knn = metrics.precision_score(target_test, target_predicted_knn)
rec_knn = metrics.recall_score(target_test, target_predicted_knn)
f1_knn = metrics.f1_score(target_test, target_predicted_knn)
kappa_knn = metrics.cohen_kappa_score(target_test, target_predicted_knn)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_knn)
print('Precision:', prec_knn)
print('Recall:', rec_knn)
print('F1 Score:', f1_knn)
print('Cohens Kappa Score:', kappa_knn)

In [None]:
# Set Graph Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve (column 1 holds the predicted probability of class 1)
target_predicted_knn_prob = knn.predict_proba(feature_test)[:, 1]
fp_rate_knn, tp_rate_knn, _ = metrics.roc_curve(target_test,  target_predicted_knn_prob)
auc_knn = metrics.roc_auc_score(target_test, target_predicted_knn_prob)
plt.plot(fp_rate_knn, tp_rate_knn, label='KNN, auc='+str(auc_knn))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

##### **Naive Bayes Classifier**

Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.
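
For reference, the decision rule behind the Gaussian variant used in this notebook (`GaussianNB`) can be written as follows, where $C_k$ is a class, $x_i$ is the $i$-th feature value, and $\mu_{k,i}$ and $\sigma_{k,i}^2$ are the mean and variance of feature $i$ within class $k$, estimated from the training data. The symbols are notation introduced here for illustration.

$$
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),
\qquad
p(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{k,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{k,i})^{2}}{2\sigma_{k,i}^{2}}\right)
$$

The product over features is the "naive" part: it treats the features as independent given the class.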

***Build Model***

In [None]:
# Import Module
from sklearn.naive_bayes import GaussianNB 

# Modeling Naive Bayes
nb = GaussianNB()
nb.fit(feature_train, target_train)

# Predict Test Data 
target_predicted_nb = nb.predict(feature_test)
target_predicted_nb

***Model Evaluation***

In [None]:
# Confusion Matrix
cm_nb = metrics.confusion_matrix(target_test, target_predicted_nb)
cm_nb

In [None]:
# Accuracy, Precision, Recall
acc_nb = metrics.accuracy_score(target_test, target_predicted_nb)
prec_nb = metrics.precision_score(target_test, target_predicted_nb)
rec_nb = metrics.recall_score(target_test, target_predicted_nb)
f1_nb = metrics.f1_score(target_test, target_predicted_nb)
kappa_nb = metrics.cohen_kappa_score(target_test, target_predicted_nb)

# Show Accuracy, Precision, Recall
print('Accuracy:', acc_nb)
print('Precision:', prec_nb)
print('Recall:', rec_nb)
print('F1 Score:', f1_nb)
print('Cohens Kappa Score:', kappa_nb)

In [None]:
# Set Graph Size and Style
plt.rcParams['figure.figsize'] = (10, 10)
plt.style.use('ggplot')

# Visualize ROC Curve (column 1 holds the predicted probability of class 1)
target_predicted_nb_prob = nb.predict_proba(feature_test)[:, 1]
fp_rate_nb, tp_rate_nb, _ = metrics.roc_curve(target_test,  target_predicted_nb_prob)
auc_nb = metrics.roc_auc_score(target_test, target_predicted_nb_prob)
plt.plot(fp_rate_nb, tp_rate_nb, label='Naive Bayes, auc='+str(auc_nb))
plt.xlabel('false positive rate') 
plt.ylabel('true positive rate')
plt.legend(loc=4)
plt.show()

#### **Evaluating Models**

***Compare Model Performance***

In [None]:
# Comparing Model Performance
print('Decision Tree Accuracy =',acc_dtree)
print('Decision Tree Precision =',prec_dtree)
print('Decision Tree Recall =',rec_dtree)
print('Decision Tree F1-Score =', f1_dtree)
print('_______________________')
print('k-NN Accuracy =', acc_knn)
print('k-NN Precision =', prec_knn)
print('k-NN Recall =', rec_knn)
print('k-NN F1-Score =', f1_knn)
print('_______________________')
print('Naive Bayes Accuracy =', acc_nb)
print('Naive Bayes Precision =', prec_nb)
print('Naive Bayes Recall =', rec_nb)
print('Naive Bayes F1-Score =', f1_nb)

***Compare ROC Curve***

In [None]:
# Comparing ROC Curve
plt.plot(fp_rate_dtree,tp_rate_dtree,label='Decision Tree, auc='+str(auc_dtree))
plt.plot(fp_rate_knn,tp_rate_knn,label='K-NN, auc='+str(auc_knn))
plt.plot(fp_rate_nb,tp_rate_nb,label='Naive Bayes, auc='+str(auc_nb))
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()