# Contents <a id='top'></a>

1. <a href=#eda>Data Exploration</a>
1. <a href=#sup>Supervised Learning</a>
1. <a href=#skl>sklearn</a>
1. <a href=#metrics>Classification Error Metrics</a>
1. <a href=#app>Application</a>
    1. <a href=#knn>k-Nearest Neighbours</a>
    1. <a href=#hyper>Hyperparameter Search</a>
1. <a href=#ref>References and Links</a>

<a id='eda'></a>
# 1. Introduction to Lending Club Data
<a href=#top>(back to top)</a>

The data set we will be using comes from Lending Club, a peer-to-peer lending company. It offers loans that are funded by other people:

* A borrower applies for a loan of a certain amount.
* The company assesses the risk of lending. 
* Even if an application is accepted, the requested loan might not be fully funded by investors.

The full dataset can be obtained from the [Kaggle website](https://www.kaggle.com/wordsforthewise/lending-club). It is approximately 650MB in size. However, for our session, we are only going to work with a partial dataset, from 2007 to 2011. It is available on LumiNUS as `loans.xlsx`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
loans = pd.read_excel('../data/loans.xlsx', index_col=0)

In [None]:
loans[['loan_amnt', 'funded_amnt_inv', 'y']].head()

Our version of the loans dataset consists of 42,535 rows and 76 columns. The Kaggle site contains a data dictionary that explains what each column means. Our target variable is contained in the column `y`: we wish to predict it using the remaining columns. 

It was computed from the existing columns as follows. First, for each loan (i.e. row), we compute the proportion of loan not funded:

$$
\text{prop. not funded} = \frac{(\text{loan amnt}) - (\text{funded amnt inv})}{\text{loan amnt}}
$$

Our target variable takes on the value 1 if the proportion not funded is at least 0.05:

$$
y = \begin{cases}
1 & \text{if prop. not funded} \ge 0.05 \\
0 & \text{otherwise}
\end{cases}
$$
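
As a rough illustration, the definition displayed above can be reproduced on a toy data frame. The column names match the loans dataset, but the four rows are made up for this sketch.

```python
import pandas as pd

# Toy data: column names follow the loans dataset; the values are made up.
toy = pd.DataFrame({
    'loan_amnt':       [10000, 5000, 20000, 8000],
    'funded_amnt_inv': [10000, 4800, 15000, 7000],
})

# Proportion of the requested amount that investors did not fund.
prop_not_funded = (toy['loan_amnt'] - toy['funded_amnt_inv']) / toy['loan_amnt']

# Following the displayed definition, y is 1 when the proportion is >= 0.05.
toy['y'] = (prop_not_funded >= 0.05).astype(int)
print(toy)
```

Here the second loan is 4% short of its requested amount, so it still counts as successful (y=0).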

In [None]:
loans.y.mean()

As we see, only about 19% of the loans were unsuccessful, so the dataset is moderately unbalanced. This gives us a benchmark: if we were to always predict that a loan is successful (y=0), we would be correct approximately 81% of the time.

> *Can we do better than this?*
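
To make the benchmark concrete, here is a small sketch with made-up labels that mimic the 19% / 81% split. The arrays are illustrative only and are not drawn from the loans data.

```python
import numpy as np

# Made-up labels: 19 unsuccessful loans (1) and 81 successful ones (0).
y_true = np.array([1] * 19 + [0] * 81)

# Baseline "model": always predict the majority class, y = 0.
y_base = np.zeros_like(y_true)

# Fraction of predictions that match the true labels.
baseline_accuracy = (y_base == y_true).mean()

# Number of unsuccessful loans the baseline manages to flag.
unsuccessful_found = int(((y_base == 1) & (y_true == 1)).sum())
print(baseline_accuracy, unsuccessful_found)
```

The baseline is right 81% of the time, yet it never identifies a single unsuccessful loan, which is why accuracy alone can be misleading on unbalanced data.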

<a id='sup'></a>
# 2. Supervised Learning
<a href=#top>(back to top)</a>

In supervised learning, our goal is to develop a model that can predict a quantity of interest from a set of features. In this process,

* Algorithms learn from a training set of labelled examples.
* This training set is meant to be representative of the set of all possible inputs.
* Example algorithms include logistic regression, support vector machines, decision trees and random forests.

Here are some examples:

1. We wish to predict if a student will graduate from university or not, based on his/her 'A' level results.
2. We wish to predict tomorrow's stock price based on today's price.

The other main sort of learning is unsupervised learning. Here are some examples:

1. We wish to identify the salient topics from a set of English documents.
2. We wish to estimate the probability density function that a sample of observations came from.

In our class, we shall focus only on *supervised* learning.

## Classification versus Regression

If the answer to the question (supervised learning problem) we are facing is either YES or NO, then we have a **classification** problem.

* Given the results of a clinical test, does this patient suffer from diabetes?
* Given an MRI, is there a tumor?

On the other hand, if we are trying to predict a real-valued quantity,
then we are faced with a **regression** problem.

* Given the details about an apartment, what will the rental be?
* Given historical transactions of a customer, how much is he likely to spend on his next purchase?

## Supervised Learning Overview

<img src="../figs/sup_learning.png" style="width: 900px;"/>

In words:

1. Split up a dataset into training and testing data. **Do not touch the test data again until the end.**
2. Preprocess/clean the training data and store the parameters for later use on the test data.
    * Example preprocessing steps are scaling, one-hot encoding, 
    principal component (PC) decomposition, etc.
3. Decide on what models you wish to try. Each model has parameters to be fit (from the data), and **hyperparameters** to be chosen by you.
    * Example models are k-nearest neighbours (KNN) and random forests.
    * A hyperparameter for KNN is the number of neighbours to use.
    * A hyperparameter for random forests is the number of trees.
    * Hyperparameters usually control the **complexity** of a 
    model. If a model is too complex, it will over-fit to the 
    training data but fare poorly on the test data.
4. Use **cross-validation** or a set-aside validation set to decide on the hyperparameters for your chosen estimator. These procedures typically minimise a loss function or error metric.
5. Once you are satisfied with your choices, evaluate the selected model on the test set to obtain an estimate of your generalisation error.
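
The five steps above can be sketched end to end as follows. This is only a minimal outline under some assumptions: the data is synthetic (from `make_classification`), the grid of neighbour counts is arbitrary, and a `Pipeline` is used to bundle the scaler with the model, which is one convenient option rather than the only one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the loans dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Step 1: hold out a test set and leave it alone until the end.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Steps 2-3: preprocessing and model in one pipeline, so that the scaler's
# parameters are always learned from training data only.
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Step 4: 5-fold cross-validation over an (arbitrary) hyperparameter grid.
search = GridSearchCV(pipe, {'knn__n_neighbors': [3, 7, 11, 15]},
                      scoring='f1', cv=5)
search.fit(X_tr, y_tr)

# Step 5: a single evaluation of the selected model on the test set.
test_f1 = search.score(X_te, y_te)
print(search.best_params_, round(test_f1, 3))
```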

<a id='skl'></a>
# 3. Scikit-learn
<a href=#top>(back to top)</a>

* Scikit-learn is a library in Python which has several useful functions used in machine learning.
* The library has many algorithms for classification, regression, clustering and other machine learning methods.
* It uses other libraries like NumPy and matplotlib which are also
used in this course.
* The website for scikit-learn is an excellent source of examples and tips on using the functions within this package:
http://scikit-learn.org/stable/index.html

All objects in scikit-learn have common access points. The three main 
interfaces are:

1. Estimator interface - `fit()` method.
    * This function allows us to build and fit models.
    * Any object that can estimate some parameters based on a dataset 
    is an *estimator*.
	* Estimation is performed by the `fit()` method. This method 
    takes in two datasets as arguments (the input data, and the
    corresponding output/labels).
    
2. Predictor interface - `predict()` method.
    * This function allows us to make predictions.
	* Estimators capable of making predictions when given a 
    dataset are called *predictors*.
	* A predictor has a `predict()` method. It takes in a dataset 
    of new instances and returns a dataset of corresponding 
    predictions.
    
3. Transformer interface - `transform()` method.
    * This function is for converting data.
	* Estimators which can also transform a dataset are called *transformers*.
	* Transformations are carried out by the `transform()` method.
    * This method takes in the dataset to transform as a parameter and
    returns the transformed dataset.
    * We will not have too much time to spend on the transformer 
    interface in this course.
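
The three interfaces can be seen side by side in a short sketch. The six data points below are made up; `StandardScaler` plays the role of a transformer and `KNeighborsClassifier` that of an estimator and predictor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Made-up data: 6 samples, 2 features, binary labels.
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
              [8.0, 900.0], [9.0, 950.0], [10.0, 870.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformer: fit() learns each column's mean and standard deviation,
# transform() applies them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Estimator and predictor: fit() takes inputs and labels,
# predict() returns labels for new instances.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

# New instances must go through the same fitted scaler before prediction.
X_new = scaler.transform(np.array([[2.5, 210.0], [9.5, 910.0]]))
y_new = knn.predict(X_new)
print(y_new)
```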

### Input Data Structure

For supervised learning problems in scikit-learn, the input data has to be structured in NumPy-like arrays.


The **feature matrix X**, of shape $n$ by $d$, contains the features:
* $n$ rows: the number of samples
* $d$ columns: the number of features or distinct traits used to
describe each item in a quantitative manner

Each row in the feature matrix is referred to as a sample, example or an instance.

A **label vector y** stores the target values. This vector stores the true output value for each corresponding sample (row) in
matrix X.
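
As a small sketch of this structure, a data frame can be split into a feature matrix and a label vector as below. The five records are made up, and the choice of feature columns is purely for illustration.

```python
import pandas as pd

# Made-up records; values are illustrative only.
df = pd.DataFrame({
    'loan_amnt': [10000, 5000, 20000, 8000, 12000],
    'int_rate':  [0.11, 0.07, 0.15, 0.09, 0.13],
    'y':         [0, 0, 1, 1, 0],
})

X = df[['loan_amnt', 'int_rate']].to_numpy()  # feature matrix, shape (n, d)
y = df['y'].to_numpy()                        # label vector, shape (n,)
print(X.shape, y.shape)
```

Row `i` of `X` and entry `i` of `y` describe the same sample, so the two must always be kept in the same order.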

<img src="../figs/05_input_structure.png" style="width: 500px;"/>


<a id='metrics'></a>
# 4. Measures of Performance
<a href=#top>(back to top)</a>

Before we head into creating classifiers which will help us predict if a loan will be successful, let's understand what determines the usefulness of a classifier.

A basic measure of performance would be the *accuracy* of predictions.

$$ \text{accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} 
$$

When more detailed analysis is needed, partial performance metrics can be presented in a *confusion matrix*.

It considers various scenarios depending on the classifier's prediction and the actual outcome.

<img src="../figs/day_07_confusion.png" style="width: 500px;"/>


### TP, FP, TN, FN

In the confusion matrix, there are 4 possible cases:
* True positives (TP)
    * Classifier predicts sample as positive and it really is so.
* False positives (FP)
    * Classifier predicts sample as positive but in truth, it is 
    negative.
    * An inaccurate prediction.
* True negatives (TN)
    * Classifier predicts sample as negative and it really is so.
* False negatives (FN)
    * Classifier predicts sample as negative but in truth, it is 
    positive.
    * An inaccurate prediction.
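
The four counts can be obtained with `confusion_matrix` from scikit-learn. The labels and predictions below are made up for this sketch.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are true classes and columns are predicted classes, so for labels
# {0, 1} ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is (TP + TN) divided by the total number of predictions.
accuracy = accuracy_score(y_true, y_pred)
print(tn, fp, fn, tp, accuracy)
```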

### Precision and Recall

With the confusion matrix, more performance metrics can be defined besides the accuracy of a classifier.

* The **recall** of a classifier is the proportion of actual positives that are
correctly identified:
$$
\text{recall}=  \frac{\text{TP}}{\text{TP + FN}} $$
* The **precision** of a classifier is the proportion of predicted
positives that are truly positive:
$$
\text{precision} =  \frac{\text{TP}}{\text{TP + FP}} $$
* Above, we have defined recall and precision for the *positive* category outcome. There are analogous definitions for the *negative* outcome.
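* Recall is also referred to as the True Positive Rate (TPR). The False Positive Rate (FPR) is the proportion of actual negatives that are wrongly predicted as positive, $\text{FP}/(\text{FP} + \text{TN})$. It is not the same as one minus the precision, which is known as the false discovery rate.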

### F1 score

* The harmonic mean of two numbers $x_1$ and $x_2$ is 
$$
\left( \frac{1/x_1 + 1/x_2}{2} \right)^{-1}
$$
* We can combine precision and recall into one score using their harmonic mean:
$$
\text{F1} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} +
\text{recall}}
$$
Roughly, the F1 score is a summary of how good the classifier is in terms of both precision and recall.
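
As a quick check of these formulas, the sketch below computes recall, precision and F1 by hand from made-up counts and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions: TP = 3, FP = 2, FN = 1, TN = 4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, fn = 3, 2, 1

recall = tp / (tp + fn)       # 3 / 4
precision = tp / (tp + fp)    # 3 / 5

# F1 written as in the formula above, and as an explicit harmonic mean.
f1 = 2 * precision * recall / (precision + recall)
harmonic = 1 / ((1 / precision + 1 / recall) / 2)

print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(f1, harmonic, f1_score(y_true, y_pred))
```

The harmonic mean is pulled towards the smaller of the two values, so a classifier needs both reasonable precision and reasonable recall to obtain a high F1 score.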

<a id='app'></a>
# 5. Application to Lending Club Data
<a href=#top>(back to top)</a>

## Pre-processing Data

In [None]:
from sklearn import neighbors, metrics, preprocessing
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, validation_curve
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, f1_score, precision_recall_curve
from sklearn.metrics import matthews_corrcoef
#from sklearn.ensemble import RandomForestClassifier

In [None]:
loans.loc[:, 'issue_yr'] = loans.issue_d.apply(lambda x: x.year)
loans.loc[:, 'issue_mth'] = loans.issue_d.apply(lambda x: x.month)

First, we drop the columns that have fewer than 40000 non-missing values.

In [None]:
# Flag the columns with fewer than 40000 non-missing values.
drop_these_columns = loans.notna().sum() < 40000
drop_these_columns

In [None]:
loans.drop(columns=loans.columns[drop_these_columns], inplace=True)

In [None]:
loans.shape

Next, we drop all rows that have even 1 missing value. A better way would be to impute the missing values, but we save that for another time.

In [None]:
# drop missing values rows
no_miss = loans[pd.notna(loans).all(axis=1)].copy()
no_miss.shape
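
As an aside, here is a rough sketch of what imputation could look like, using scikit-learn's `SimpleImputer` with the column median. The toy values are made up and the median is only one possible strategy. This is not the approach used in the rest of this notebook, which continues with `no_miss`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({'annual_inc': [50000.0, np.nan, 72000.0, 61000.0],
                    'dti':        [12.5, 8.0, np.nan, 20.1]})

# Replace each missing entry with the median of its column.
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In a full workflow the imputer would be fitted on the training set only, and then applied to the test set, in the same way as any other preprocessing step.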

In [None]:
no_miss.sample(2)

There are a couple of columns that contain date information in our dataset: `issue_d` and `earliest_cr_line`. We already extracted the year and month from `issue_d` above, so here we do the same for `earliest_cr_line`.

In [None]:
#no_miss.loc[:, 'issue_yr'] = no_miss.issue_d.apply(lambda x: x.year)
#no_miss.loc[:, 'issue_mth'] = no_miss.issue_d.apply(lambda x: x.month)

In [None]:
no_miss.earliest_cr_line.str.split('-', expand=True).head()

In [None]:
cr_line_cols = no_miss.earliest_cr_line.str.split('-', expand=True)
cr_line_cols.columns = ['ecrl_mth', 'ecrl_yr']
cr_line_cols.ecrl_yr = cr_line_cols.ecrl_yr.astype(int)

In [None]:
no_miss = pd.concat([no_miss, cr_line_cols],axis=1)

In [None]:
no_miss.shape

Now, we split the data into a training and test set before proceeding.

In [None]:
y = no_miss.y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(no_miss, y, test_size=0.3, random_state=41, 
                                                 stratify=y)

In [None]:
y_test.mean()

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn import set_config

set_config(display='diagram')

### Finding useful Features

In this section, we whittle down the number of features we have in the dataset.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

In [None]:
X_num = X_train.loc[:, ['loan_amnt', 'int_rate', 'installment', 
                   'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 
                   'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 
                   'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 
                   'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 
                   'collection_recovery_fee', 'collections_12_mths_ex_med',
                   'acc_now_delinq', 'issue_yr', 'issue_mth', 'ecrl_yr']]

In [None]:
skb = SelectKBest(mutual_info_classif, k=8)

In [None]:
skb.fit(X_num, y_train)

In [None]:
skb.scores_.round(4)

In [None]:
X_num.columns[skb.scores_ >= np.sort(skb.scores_)[-8]]
#X_num.columns[np.argsort(-skb.scores_)[:8]]

In [None]:
skb.transform(X_num).shape

In [None]:
X_num.loc[:, X_num.columns[skb.scores_ >= np.sort(skb.scores_)[-8]]]

In [None]:
X_train

In [None]:
X_num.shape

We can score the categorical columns in a similar way, after ordinal-encoding them, using the following code. None of them stood out clearly, so I kept all of them.

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [None]:
X_cat_df = X_train.loc[:,['term', 'grade', 'sub_grade', 'emp_length',
       'home_ownership', 'verification_status', 
       'loan_status', 'purpose', 'zip_code','addr_state'] ]
X_cat_df = X_cat_df.astype('str')

In [None]:
X_cat_df.head()

In [None]:
oe1 = OrdinalEncoder()
oe1.fit(X_cat_df)

In [None]:
X_cat = oe1.transform(X_cat_df)

In [None]:
X_cat[0:5, :]

In [None]:
oe1.inverse_transform(X_cat[:2,])

In [None]:
oe1.categories_[9]

In [None]:
skb2 = SelectKBest(mutual_info_classif, k = 10)
#skb2a = SelectKBest(chi2, k = 10)

skb2.fit(X_cat, y_train)
#skb2a.fit(X_cat, y_train)

#X_cat_df.columns[np.argsort(skb2a.scores_)]
X_cat_df.columns[np.argsort(skb2.scores_)]

With the following chosen numerical and categorical features, we scale the numerical and one-hot encode the categorical.

In [None]:
num_features = ['loan_amnt', 'int_rate', 'installment', 'total_pymnt', 
                'total_pymnt_inv', 'total_rec_prncp', 'issue_yr']
cat_features = ['term', 'grade', 'emp_length', 'home_ownership', 
                'loan_status', 'purpose', 'addr_state']

In [None]:
all_features = num_features + cat_features

In [None]:
X_train = X_train.loc[:, all_features]
X_test = X_test.loc[:, all_features]

To do the scaling and encoding in one go, and because we need to store the fitted parameters for later use on the test set, we use a `ColumnTransformer`, which applies a different transformer to each group of columns.

In [None]:
ct = ColumnTransformer([
      ('scale', StandardScaler(),
      make_column_selector(dtype_include=np.number)),
      ('onehot', OneHotEncoder(),
      make_column_selector(dtype_include=object))])

In [None]:
ct.fit(X_train)

In [None]:
X_ttrain = ct.transform(X_train)
X_ttest = ct.transform(X_test)

In [None]:
X_train.iloc[:10, :7]

In [None]:
X_ttrain[:10, :7].toarray().round(5)

<a id='knn'></a>
## k-Nearest Neighbours

k-Nearest Neighbours (KNN) is a simple model that classifies a data point according to the groups of the data points closest to it.

* The *k* in KNN refers to the number of nearest data points (i.e. number of neighbours) the algorithm should consult before classifying a new point into a group.
    * This is a hyperparameter you get to set through the `n_neighbors` argument of `KNeighborsClassifier()`.
* We start with a single data point in the picture. Each time a new data point is added, its *k* closest neighbours (data points) are identified.
* Note that the definition of 'nearest' is subjective; we can choose the metric appropriate for the situation.
* Since its neighbours have already been classified into different groups, the new data point will be added to the group that the majority of its neighbours are in.


If we added a new data point *c* with *k = 3*, it is grouped into *b*, since two out of three points in the neighbourhood belong to *b*.
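
The majority vote described above can be sketched from scratch in a few lines. This is a simplified illustration, assuming Euclidean distance and made-up coordinates; `knn_predict` is a hypothetical helper written for this example and is not how `KNeighborsClassifier` is implemented internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Most common label among those k neighbours.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up points belonging to groups 'a' and 'b'.
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 3.5], [3.5, 3.0], [4.0, 4.0]])
y_demo = np.array(['a', 'a', 'b', 'b', 'b'])

# New point c: its 3 nearest neighbours are one 'a' and two 'b's.
c = np.array([2.5, 2.5])
label = knn_predict(X_demo, y_demo, c, k=3)
print(label)
```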

<img src="../figs/05_knn.png" style="width: 500px;"/>


In [None]:
nn10 = neighbors.KNeighborsClassifier(n_neighbors=10, n_jobs=8)

In [None]:
nn10.fit(X_ttrain, y_train)

In [None]:
y_train_pred = nn10.predict(X_ttrain)

In [None]:
print(classification_report(y_train, y_train_pred))

In [None]:
1 - y.mean()

<a id='hyper'></a>
## Grid Search for Hyperparameters

In [None]:
#nn_range= np.arange(10, 1, -1)
nn_range = np.arange(50, 5, -4)

In [None]:
from sklearn.model_selection import GridSearchCV

test1 = GridSearchCV(neighbors.KNeighborsClassifier(), {'n_neighbors':nn_range[:2]}, 
                     scoring='f1', cv=5, verbose=2)

test1.fit(X_ttrain, y_train)

In [None]:
train_scores, cv_scores = validation_curve(neighbors.KNeighborsClassifier(), X_ttrain, y_train, 
                                           param_name='n_neighbors', cv=5, n_jobs=8,
                                           param_range=nn_range, scoring='f1', verbose=2)

In [None]:
train_means = np.mean(train_scores, axis=1)
train_sd = np.std(train_scores, axis=1)

cv_means = np.mean(cv_scores, axis=1)
cv_sd = np.std(cv_scores, axis=1)

plt.plot(1/nn_range, train_means, 'o-', label='Training', color='blue')
plt.fill_between(1/nn_range, train_means-train_sd, train_means+train_sd, color='blue', alpha=0.2)

plt.plot(1/nn_range, cv_means, 'o-', label='CV (Test)', color='red')
plt.fill_between(1/nn_range, cv_means-cv_sd, cv_means+cv_sd, color='red', alpha=0.2)

plt.legend(loc='lower right')
plt.ylabel('F1-score')
plt.xlabel('Complexity (1 / n_neighbors)')
plt.title('Validation Curve');

In [None]:
X_ttrain.shape

## ROC Curves and Precision-Recall Curves

In [None]:
nn20 = neighbors.KNeighborsClassifier(n_neighbors=20, n_jobs=8)
nn20.fit(X_ttrain, y_train)

In [None]:
y_test_probs = nn20.predict_proba(X_ttest)

In [None]:
y_test_pred = nn20.predict(X_ttest)

In [None]:
print(classification_report(y_test, y_test_pred))

In [None]:
fpr, tpr, threshold = roc_curve(y_test, y_test_probs[:, 1])

plt.plot(fpr, tpr,'b-');
plt.title('ROC Curve')
plt.xlabel('FPR');plt.ylabel('TPR');

In [None]:
print(f'The area under the ROC curve (AUC) is {auc(fpr, tpr):.3f}.')

In [None]:
nn_precision, nn_recall, thresholds = precision_recall_curve(y_test, y_test_probs[:, 1])

In [None]:
plt.plot(nn_recall, nn_precision,'b-');
plt.title('Precision-Recall Curve')
plt.xlabel('Recall');plt.ylabel('Precision');
# No-skill baseline for a precision-recall plot: the proportion of positives.
plt.axhline(y_test.mean(), color='red', linestyle="--");

In [None]:
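# The final (precision, recall) pair is (1, 0) and has no threshold, so drop it to align the columns.
pd.DataFrame({'prec':nn_precision[:-1], 'rec':nn_recall[:-1], 'thresh':thresholds})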

<a id='ref'></a>
# 6. References
<a href=#top>(back to top)</a>

1. [An Introduction to Statistical Learning](https://faculty.marshall.usc.edu/gareth-james/ISL/). This is one of the most complete introductory books on the topic.
2. [scikit-learn docs](https://scikit-learn.org/stable/user_guide.html)
3. *Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems* by Aurélien Géron (electronic copy available at NUS libraries)
4. *Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications* by Laura Igual (electronic version available at NUS libraries)