# Project objective
This project reviews the random forest, logistic regression and naive Bayes machine learning methods and their Python implementations for predicting the probability of default of credit card clients. The performance of the models is compared using k-fold cross validation (k = 10 in this notebook) to choose the best model. Then the best model is used to predict the labels of the test set.

**Probability of Default (PD)**: likelihood that a borrower will be unable to meet its debt obligations.


Information about the dataset, some technical details about the machine learning methods used and the mathematical details of the evaluation metrics are provided throughout the notebook.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)
* **pandas**: Pandas provides easy-to-use data structures and data analysis tools for Python. (https://pandas.pydata.org/)

We also use **warnings** to stop the notebook from returning warning messages.


In [0]:
import numpy as np
import pandas as pd
import sklearn as sk

import warnings
warnings.filterwarnings('ignore')

# Introduction to the dataset

**Name**: Default of credit card clients dataset

**Summary**: A description of the dataset and its features (attributes) is provided at the dataset link.

**Number of attributes**: 24 (real, integer); the dataframe below holds 23 input features (X1 to X23) plus the output variable Y

**Number of data points (instances)**: 30,000

**Link to the dataset**: http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients




## Importing the dataset
We can import the dataset in multiple ways.

**Colab Notebook**: You can download the dataset file (or files) from the link (if provided), upload it to your Google Drive and then import the file (or files) as follows:

**Note.** When you run the following cell, it tries to connect Colab with Google Drive. Follow steps 1 to 5 in this link (https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/) to complete the connection.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

# This path is common for everybody
# This is the path to your google drive
input_path = '/content/gdrive/My Drive/'
# reading the data (target)
target_dataset = pd.read_csv(input_path + 'default of credit card clients.csv')

# dropping the first row (index 0), which does not hold a data point
target_dataset = target_dataset.drop([0], axis=0)
target_dataset.head

Mounted at /content/gdrive


<bound method NDFrame.head of           ID      X1 X2 X3 X4  X5  ...    X19    X20   X21    X22   X23  Y
1          1   20000  2  2  1  24  ...    689      0     0      0     0  1
2          2  120000  2  2  2  26  ...   1000   1000  1000      0  2000  1
3          3   90000  2  2  2  34  ...   1500   1000  1000   1000  5000  0
4          4   50000  2  2  1  37  ...   2019   1200  1100   1069  1000  0
5          5   50000  1  2  1  57  ...  36681  10000  9000    689   679  0
...      ...     ... .. .. ..  ..  ...    ...    ...   ...    ...   ... ..
29996  29996  220000  1  3  1  39  ...  20000   5003  3047   5000  1000  0
29997  29997  150000  1  3  2  43  ...   3526   8998   129      0     0  0
29998  29998   30000  1  2  2  37  ...      0  22000  4200   2000  3100  1
29999  29999   80000  1  3  1  41  ...   3409   1178  1926  52964  1804  1
30000  30000   50000  1  2  1  46  ...   1800   1430  1000   1000  1000  1

[30000 rows x 25 columns]>

**Local directory**: If you save the data in a local directory, change "input_path" to the directory in which you saved the file (or files).

**GitHub**: If you use my GitHub (or your own GitHub) repo, change "input_path" to the location of the file (or files) in the repo. For example, when I clone ***ml_in_practice*** from my GitHub, I need to change "input_path" to 'data/', as the file (or files) is saved in the data directory of this repository.

**Note.**: You can also clone my ***ml_in_practice*** repository (here: https://github.com/alimadani/ml_in_practice) and follow the same process.

### Separating features from output variable
The dataframe of the target dataset has a column whose values we would like to predict (the output variable). We need to separate this column from the rest of the dataframe, which includes the features we want to use to build the model.

In [3]:
output_var = target_dataset['Y']
input_features = target_dataset[[col for col in target_dataset.columns if 'X' in col]]
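# shape[0] is the number of rows (data points), shape[1] the number of columns (features)
print('number of data points: {}'.format(input_features.shape[0]))
print('number of features: {}'.format(input_features.shape[1]))

number of data points: 30000
number of features: 23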


### Checking balance of classes
We need to determine if there is a class imbalance in the dataset as it will be important for choosing the metric for the performance assessment.

In [4]:
from collections import Counter

Counter(output_var)

Counter({'0': 23364, '1': 6636})
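The counts above can be turned into class fractions to make the imbalance explicit. The snippet below is a small sketch that reuses the two counts printed by `Counter`; the variable names `class_counts` and `class_fractions` are introduced here only for illustration.

```python
from collections import Counter

# Class counts copied from the Counter output above
class_counts = Counter({'0': 23364, '1': 6636})

total = sum(class_counts.values())
# Fraction of the dataset that belongs to each class
class_fractions = {label: count / total for label, count in class_counts.items()}

for label, fraction in sorted(class_fractions.items()):
    print("class {}: {:.1%}".format(label, fraction))
```

Roughly 22% of the clients belong to class 1, so plain accuracy would look good even for a model that always predicts class 0. This is why metrics that treat both classes equally are used later in the notebook.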

## Splitting data to training and testing sets

We need to check the generalizability of the model. As we do not have a separate dataset for validation and/or testing, we split the data into training and test sets and implement the following process:
1) splitting the data into a training set (70%) and a test set (30%)
2) training and validating the models built with the different algorithms using 10-fold cross validation on the training set
3) testing the best model on the test set

**Note.**: We need the validation and test sets to be big enough for checking generalizability of our model. At the same time we would like to have as much data as possible in the training set to train a better model.

**random_state**, as the name suggests, is used for initializing the internal random number generator, which decides how the data is split into train and test indices.
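To make the role of the seed and of the folds more concrete, here is a minimal pure-Python sketch of a seeded 70/30 split followed by a 10-fold partition of the training indices. It is not how scikit-learn implements `train_test_split` or cross validation, and the same seed will not reproduce scikit-learn's split; the helper names `split_indices` and `k_fold_indices` are hypothetical.

```python
import random

def split_indices(n_samples, test_fraction, seed):
    """Shuffle indices with a seeded generator and cut them into train/test."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

train_idx, test_idx = split_indices(n_samples=30000, test_fraction=0.30, seed=5)
print(len(train_idx), len(test_idx))  # 21000 9000

# Each fold is used once for validation while the other nine are used for training
fold_sizes = [len(val) for _, val in k_fold_indices(train_idx, k=10)]
print(fold_sizes)  # ten folds of 2100 indices each
```

Calling `split_indices` again with the same seed returns the same split, which is the point of fixing `random_state`.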


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)

## Building the supervised learning model
We want to build classification models using three algorithms: random forest, logistic regression and naive Bayes.

### Decision tree
A decision tree is built by starting from the feature that best splits the data points into the 2 purest possible groups. Each group is then split again by the next best features to purify the groups further. Although this process can be continued until reaching 100% purity (having only one class) in each group, doing so would probably lower the generalizability of the model. Hence, we usually stop growing the tree before reaching 100% purity.
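One way to quantify "purity" is the Gini impurity, which is 0 for a group containing a single class. The text above does not name a purity measure, so treating Gini as the criterion is an assumption of this sketch, as are the toy feature values and the helper names `gini_impurity` and `best_threshold`.

```python
def gini_impurity(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary group."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    n = len(labels)
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must produce two non-empty groups
        weighted = (len(left) / n) * gini_impurity(left) \
            + (len(right) / n) * gini_impurity(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy feature and class labels, chosen only for illustration
feature = [0, 0, 1, 1, 2, 3, 3, 4]
label = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(feature, label)
print(threshold, impurity)  # 1 0.0
```

Here a single split at `feature <= 1` already yields two pure groups. On real data the impurity rarely reaches zero this quickly, and forcing it to zero is what leads to overfitting.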

### Random forest
Decision trees usually have high variance, meaning their prediction performance varies largely between datasets. To overcome this issue we can rely on the concept of ensemble learning, in which we use the wisdom of the crowd instead of a single classifier. For example, random forest, as an ensemble model, uses multiple decision trees to predict the class of each data point. Here is the process of building a random forest model:

1) Randomly sampling data points with replacement (bootstrapping)

2) Randomly selecting the features 

3) Building a decision tree using the data points and features randomly selected in steps 1 and 2

4) Building multiple decision trees as described in steps 1 to 3

5) Using the majority vote of all the decision trees as the identified class for a given data point

Note. We don't need to write code for these steps, as they are done automatically when using random forest in Python, but we need to know how it works.
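The sampling and voting steps listed above can be sketched in a few lines of plain Python. This is only an illustration of steps 1, 2, 4 and 5: no trees are actually trained, the votes are hypothetical, and scikit-learn's `RandomForestClassifier` differs in details (for instance, it considers a random subset of features at each split rather than once per tree).

```python
import random
from collections import Counter

rng = random.Random(10)  # seeded so the sketch is reproducible

n_samples, n_features, n_trees = 8, 6, 5

def bootstrap_sample(n, rng):
    """Step 1: draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def random_feature_subset(n_features, subset_size, rng):
    """Step 2: pick a subset of feature indices without replacement."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(votes):
    """Step 5: return the most common predicted class."""
    return Counter(votes).most_common(1)[0][0]

# Steps 1, 2 and 4: one bootstrap sample and one feature subset per tree
for tree_id in range(n_trees):
    rows = bootstrap_sample(n_samples, rng)
    features = random_feature_subset(n_features, subset_size=2, rng=rng)
    print("tree", tree_id, "rows", rows, "features", features)

# Hypothetical votes of the five trees for a single data point
votes = [1, 0, 1, 1, 0]
print("predicted class:", majority_vote(votes))  # predicted class: 1
```

Because rows are drawn with replacement, some indices appear more than once in a bootstrap sample and others not at all, which is what makes the individual trees differ from each other.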

### Logistic regression
If we have a set of features X1 to Xn, y can be obtained as:
\begin{equation*} y=b0+b1X1+b2X2+...+bnXn\end{equation*}

where y is the predicted value obtained as the weighted sum of the feature values.

Then the probability of each class (for example, a malignant tumor) can be obtained using the logistic function

\begin{equation*} p(class=malignant)=\frac{1}{(1+exp(-y))} \end{equation*}

Based on the class labels and the features given in the training data, the coefficients b0 to bn are obtained during the optimization process.

b0 to bn are fixed for all samples, while X1 to Xn are feature values specific to each sample. Hence, the logistic function gives us the probability of each class for each sample. Finally, the model chooses the class with the highest probability for each sample.


**Note.** The logistic regression model is parametric, and its parameters are the regression coefficients b0 to bn.
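As a small worked example of the two formulas above, the following sketch computes the weighted sum and passes it through the logistic function. The coefficient values and the sample are hypothetical and are not fitted to any data; `predict_probability` is a helper name introduced here.

```python
import math

def predict_probability(coefficients, intercept, features):
    """Weighted sum of the features passed through the logistic function."""
    y = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical intercept b0 and coefficients b1, b2 (not fitted to any data)
b0 = -1.0
b = [0.8, -0.5]

sample = [2.0, 1.0]  # feature values X1, X2 of one sample
p = predict_probability(b, b0, sample)

print("p(class=1) = %0.3f" % p)
print("predicted class:", 1 if p >= 0.5 else 0)
```

For this sample y = -1.0 + 0.8*2.0 - 0.5*1.0 = 0.1, so the probability is slightly above 0.5 and class 1 is chosen.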

### Naive Bayes
To understand the Naive Bayes algorithm, we need to know Bayes theorem. Bayes theorem relates conditional probabilities as follows:

\begin{equation*} p(A|B)p(B)=p(B|A)p(A) \end{equation*}
that can be rewritten as

\begin{equation*} p(A|B)=\frac{p(B|A)p(A)}{p(B)} \end{equation*}

where p(A) and p(B) are the probabilities of events A and B, respectively, and p(A|B) and p(B|A) are the conditional probabilities of A given B and B given A, respectively.

**Example without numbers**

Now let's assume we have 3 features X1, X2 and X3 and we want to identify the probability of class C for sample A with feature values *x1*, *x2* and *x3*:

\begin{equation*} p(class=C|X1=x1, X2=x2 , X3=x3)=\frac{p(X1=x1|class=C)p(X2=x2|class=C)p(X3=x3|class=C)p(class=C)}{p(X1=x1)p(X2=x2)p(X3=x3)} \end{equation*}

where 
\begin{equation*} p(X1=x1, X2=x2 , X3=x3)=p(X1=x1)p(X2=x2)p(X3=x3) \end{equation*}
and
\begin{equation*} p(X1=x1, X2=x2 , X3=x3|class=C)=p(X1=x1|class=C)p(X2=x2|class=C)p(X3=x3|class=C) \end{equation*}

as the features are assumed to be independent variables.
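The numerator of the formula above is easy to compute once the prior and the per-feature likelihoods are known. The sketch below uses hypothetical probabilities for two classes and three features. As an assumption that departs from the product form of the denominator written above, it obtains the denominator by summing the numerators over the classes, so that the resulting class probabilities add up to one.

```python
# Hypothetical class priors and per-feature likelihoods p(Xi=xi | class)
priors = {"C": 0.3, "not C": 0.7}
likelihoods = {
    "C": [0.8, 0.6, 0.5],
    "not C": [0.2, 0.4, 0.5],
}

def naive_bayes_posteriors(priors, likelihoods):
    """Multiply prior and per-feature likelihoods, then normalize over classes."""
    unnormalized = {}
    for label, prior in priors.items():
        score = prior
        for likelihood in likelihoods[label]:
            score *= likelihood  # independence assumption: simple product
        unnormalized[label] = score
    evidence = sum(unnormalized.values())
    return {label: score / evidence for label, score in unnormalized.items()}

posteriors = naive_bayes_posteriors(priors, likelihoods)
for label, probability in posteriors.items():
    print("p(class=%s | x1, x2, x3) = %0.2f" % (label, probability))
```

With these numbers the numerators are 0.072 for class C and 0.028 for the other class, giving a probability of 0.72 for class C.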

**Real life example with numbers**

We want to know the chance of having breast cancer if the diagnosis test is positive for a woman with the age between 40 and 60. This example is mainly for understanding Bayes theorem not Naive Bayes classifier. In case of Naive Bayes algorithm, this process can be easily extended to multiple features as described in the above example.

***Assumptions (not necessarily correct)***

* 2% of women between 40 and 60 have breast cancer
* The true positive rate is 95% (if a woman has breast cancer, it will be diagnosed with 95% probability).
* The false positive rate is 5% (5% of the time, women without breast cancer will be diagnosed positively by the test).

Now the question is: *What is the chance of having breast cancer if a woman has a positive result from a diagnosis test?*

\begin{equation*} p(having \quad breast \quad cancer|positive)=\frac{p(positive|having \quad breast \quad cancer)p(having \quad breast \quad cancer)}{p(positive)} \end{equation*}

where 


\begin{equation*} p(positive) = p(positive|having \quad breast \quad cancer)p(having \quad breast \quad cancer) \\+ p(positive|not \quad having \quad breast \quad cancer)p(not \quad having \quad breast \quad cancer)\\=
0.95*0.02+0.05*0.98\\=0.068\end{equation*}

Therefore,

\begin{equation*} p(having \quad breast  \quad cancer|positive)=\frac{p(positive|breast \quad cancer)p(having \quad breast \quad cancer)}{p(positive)}\\= \frac{0.95*0.02}{0.068}\\=0.28\end{equation*}


As we can see, there is only a 28% chance of having cancer upon a positive test result. Although the numbers are not clinically valid, we deal with similar results in disease diagnosis. This is one of the reasons that further checkups by physicians are mandatory upon positive results. Do not panic when you have a positive result, but follow up with your doctor immediately.
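The arithmetic of this example can be reproduced with a few lines of Python, using the same assumed (not clinically valid) rates as above. The variable names are introduced here only for illustration.

```python
prevalence = 0.02           # p(having breast cancer)
true_positive_rate = 0.95   # p(positive | having breast cancer)
false_positive_rate = 0.05  # p(positive | not having breast cancer)

# Law of total probability: overall chance of a positive test
p_positive = (true_positive_rate * prevalence
              + false_positive_rate * (1 - prevalence))

# Bayes theorem: chance of cancer given a positive test
p_cancer_given_positive = true_positive_rate * prevalence / p_positive

print("p(positive) = %0.3f" % p_positive)                        # 0.068
print("p(cancer | positive) = %0.2f" % p_cancer_given_positive)  # 0.28
```

The posterior stays low because the disease is rare: the 5% of false positives among the 98% of healthy women outnumber the true positives.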

**Note.** The Naive Bayes classifier is called ***Naive*** as it assumes that each feature contributes independently to the prediction of a class for each data point (sample).

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier 

# Create random forest object
rf = RandomForestClassifier(random_state=10)

# Create logistic regression object
logreg = LogisticRegression(random_state=10)

# Create naive Bayes object
nb = GaussianNB()


# assessing performance of the models using 10-fold cross validation (macro F1 score)
scores_rf = cross_val_score(rf, X_train, y_train, cv=10, scoring='f1_macro')

scores_logreg = cross_val_score(logreg, X_train, y_train, cv=10, scoring='f1_macro')

scores_nb = cross_val_score(nb, X_train, y_train, cv=10, scoring='f1_macro')

# average performance across all folds
print("Average macro F1 of random forest model across the folds: %0.2f" % (scores_rf.mean()))
print("Average macro F1 of logistic regression model across the folds: %0.2f" % (scores_logreg.mean()))
print("Average macro F1 of naive Bayes model across the folds: %0.2f" % (scores_nb.mean()))

Average macro F1 of random forest model across the folds: 0.68
Average macro F1 of logistic regression model across the folds: 0.44
Average macro F1 of naive Bayes model across the folds: 0.38


As random forest has the highest performance in the cross-validation setting, we use it on the test set to further assess its performance.

## Prediction of test (or validation) set
We now have to use the trained model to predict y_test.

In [0]:
# Let's train the model using all the training set together
rf.fit(X_train, y_train)

# Make predictions using the testing set
y_pred_test = rf.predict(X_test)

## Evaluating performance of the model
We need to assess the performance of the model using the predictions for the test set. We use balanced accuracy and the macro-averaged F1 score. Here are the definitions of these metrics and of the quantities they are built from:

* **recall**, in this context also referred to as the true positive rate or sensitivity, is the fraction of actual positives that are correctly predicted as positive




$${\displaystyle {\text{recall}}={\frac {tp}{tp+fn}}\,} $$

 

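* **specificity**, also referred to as the true negative rate, is the fraction of actual negatives that are correctly predicted as negative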



$${\displaystyle {\text{true negative rate}}={\frac {tn}{tn+fp}}\,}$$


* **precision** is the fraction of true positives out of all the positive predictions

$${\displaystyle {\text{precision}}={\frac {tp}{tp+fp}}\,} $$


* **balanced accuracy**: This measure gives you a sense of performance for all the classes together as follows:

$${\displaystyle {\text{balanced accuracy}}={\frac {recall+specificity}{2}}\,}$$

* **F1 score** is the harmonic mean of precision and recall as follows

$${\displaystyle {\text{F1}}={\frac {2}{\frac {1}{precision}+ \frac {1}{recall}}}\,} $$
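The definitions above can be checked on a small confusion matrix. The counts below are hypothetical and chosen only for illustration; `binary_metrics` is a helper name introduced here, and it does not guard against divisions by zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "balanced accuracy": balanced_accuracy,
        "F1": f1,
    }

# Hypothetical counts: 100 actual positives and 400 actual negatives
m = binary_metrics(tp=40, fp=20, tn=380, fn=60)
for name, value in m.items():
    print("%s: %0.3f" % (name, value))
```

With these counts the plain accuracy would be (40+380)/500 = 0.84, while the balanced accuracy is only 0.675, which shows how the majority class can hide a low recall.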


In [8]:
from sklearn import metrics

print("\n\nperformance of the random forest model on the test set")
print("balanced accuracy of the predictions using random forest:", metrics.balanced_accuracy_score(y_test, y_pred_test))
print("macro F1 of the predictions using random forest:", metrics.f1_score(y_test, y_pred_test, average='macro'))



performance of the random forest model on the test set
balanced accuracy of the predictions using random forest: 0.6538865118529945
macro F1 of the predictions using random forest: 0.6765296683804681


The 'macro' average is equivalent to the arithmetic mean of the F1 scores of the classes: (F1(class 0)+F1(class 1))/2
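A minimal pure-Python sketch of this averaging is given below. The labels are hypothetical and written as strings to mirror the '0' and '1' labels of this dataset; the helpers `f1_for_class` and `macro_f1` are illustrative and are not scikit-learn functions.

```python
def f1_for_class(y_true, y_pred, positive_label):
    """F1 score when `positive_label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    if tp == 0:
        return 0.0  # no correct positive prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Arithmetic mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Hypothetical true and predicted labels
y_true = ['0', '0', '0', '0', '1', '1']
y_pred = ['0', '0', '0', '1', '1', '0']

print("F1(class 0):", f1_for_class(y_true, y_pred, '0'))  # 0.75
print("F1(class 1):", f1_for_class(y_true, y_pred, '1'))  # 0.5
print("macro F1:", macro_f1(y_true, y_pred))              # 0.625
```

Both classes contribute equally to the average regardless of how many samples they contain, which is why the macro average suits an imbalanced dataset.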