# Prediction

In this stage, we'll take our cleaned dataset and experiment with different models to predict whether a loan will default.

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [3]:
df = pd.read_csv("../../data/processed/loan_sampled_50000.csv", 
                low_memory = False)

In [4]:
df.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,...,is_payment_plan,is_whole_loan,is_individual_app,term_months,is_cash,is_grade_a,is_grade_a_or_b,is_grade_f_or_g,is_verified,loan:income_ratio
0,6200,10.41,201.26,1.0,62000.0,5.23,0,1,8,0,...,False,True,True,36,True,False,True,False,False,0.1
1,19600,25.69,583.25,9.0,110000.0,15.01,0,3,12,1,...,False,False,True,60,True,False,False,True,True,0.178182
2,16000,11.99,355.84,1.0,60000.0,20.64,1,0,14,0,...,False,True,True,60,True,False,False,False,False,0.266667
3,16400,15.77,574.72,3.0,41000.0,4.51,0,3,4,0,...,False,True,True,36,True,False,False,False,False,0.4
4,7100,14.49,244.36,2.0,24000.0,25.35,0,2,8,0,...,False,False,True,36,True,False,False,False,True,0.295833


## Train-test split and normalization

We split our data into a train set and a test set, then normalise each set independently. Because outliers were removed in the data processing step, we can be confident that the normalisation is not skewed by extreme values.


In [51]:
X = df.drop(columns = ["target"])
y = df["target"]

In [52]:
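X_train_orig, X_test_orig, y_train, y_test = train_test_split(X, y, 
                                                              random_state = 10)
# Report the shapes of the unscaled splits (the scaled frames are created in later cells)
print(f"Train size: {X_train_orig.shape}\nTest size: {X_test_orig.shape}")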

Train size: (19103, 82)
Test size: (6368, 82)


In [53]:
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train_orig.values), 
                       columns=X_train_orig.columns, index=X_train_orig.index)



In [54]:
scaler = MinMaxScaler()
X_test = pd.DataFrame(scaler.fit_transform(X_test_orig.values), 
                       columns=X_test_orig.columns, index=X_test_orig.index)
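
The two cells above fit a separate `MinMaxScaler` on each set, so the train and test features end up scaled with different minima and maxima. A common alternative is to fit the scaler on the training features only and reuse it on the test features. The following is a minimal sketch of that idea on toy data; `X_demo`, `y_demo`, `demo_scaler` and the values in them are hypothetical stand-ins, not the loan dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the loan features (hypothetical values, not the real data).
X_demo = pd.DataFrame({
    "loan_amnt": [6200, 19600, 16000, 16400, 7100, 12000, 9000, 30000],
    "int_rate": [10.41, 25.69, 11.99, 15.77, 14.49, 9.5, 13.2, 18.0],
})
y_demo = pd.Series([False, True, False, False, True, False, False, True],
                   name="target")

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=10)

# Fit the scaler on the training features only...
demo_scaler = MinMaxScaler()
X_tr_scaled = pd.DataFrame(demo_scaler.fit_transform(X_tr.values),
                           columns=X_tr.columns, index=X_tr.index)

# ...then reuse the training minima and maxima for the test features.
X_te_scaled = pd.DataFrame(demo_scaler.transform(X_te.values),
                           columns=X_te.columns, index=X_te.index)
```

With this approach the scaled test values can fall slightly outside [0, 1] when a test row exceeds the training range, which is expected.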



We notice that there is a class imbalance in our training set, as with our original dataset. While this likely reflects the business context (it is more likely that a loan will not be defaulted on), we want to ensure that our models are not biased because the training data contains more examples of one class.

We can take two resampling approaches to deal with this class imbalance problem:

- Oversampling the minority class

- Undersampling the majority class

There is a good discussion of approaches here: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets. For simplicity, I will run the models on two datasets: the original training data, and an undersampled dataset (created by randomly removing instances of the majority class until the dataset is balanced).

In [55]:
y_train.value_counts()

False    15011
True      4092
Name: target, dtype: int64

In [59]:
# Recombine the unscaled training features with their labels
df_train = pd.concat([X_train_orig, y_train], axis = 1)
# value_counts() sorts by frequency, so the majority class (False) comes first
count_class_0, count_class_1 = df_train.target.value_counts()
df_class_0 = df_train[df_train['target'] == 0]
df_class_1 = df_train[df_train['target'] == 1]
# Randomly keep as many majority rows as there are minority rows
df_class_0_under = df_class_0.sample(count_class_1, random_state = 10)
df_train_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print(df_train_under.target.value_counts())
X_train_under = df_train_under.drop(columns=["target"])
y_train_under = df_train_under["target"]

# Scale the undersampled training features to [0, 1]
scaler = MinMaxScaler()
X_train_under = pd.DataFrame(scaler.fit_transform(X_train_under.values), 
                       columns=X_train_under.columns, index=X_train_under.index)

True     4092
False    4092
Name: target, dtype: int64
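
For comparison, the other resampling approach listed earlier, oversampling the minority class, can be sketched in much the same way. This is only an illustration on toy data: `df_demo`, `majority`, `minority` and `df_demo_over` are hypothetical names, and random duplication is just one simple form of oversampling.

```python
import pandas as pd

# Toy training frame with an imbalanced boolean target (hypothetical values).
df_demo = pd.DataFrame({
    "loan_amnt": [6200, 19600, 16000, 16400, 7100, 12000, 9000, 30000],
    "target": [False, True, False, False, False, False, True, False],
})

majority = df_demo[~df_demo["target"]]
minority = df_demo[df_demo["target"]]

# Draw minority rows with replacement until both classes are the same size.
minority_over = minority.sample(n=len(majority), replace=True, random_state=10)
df_demo_over = pd.concat([majority, minority_over], axis=0)

print(df_demo_over["target"].value_counts())
```

As with undersampling, this would be applied to the training split only, so that the test set keeps its original class balance.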




## Model testing

We're going to try the following models:

- Logistic regression

- Decision trees

- SVM

- Naive Bayes

- Random forest

- Linear SVC
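
One way the models listed above could be set up with scikit-learn is sketched below. The choice of `GaussianNB` for Naive Bayes, the hyperparameters, and the synthetic data from `make_classification` are all assumptions made for illustration; `candidate_models` and `fitted_models` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data (not the loan dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=10,
                                   random_state=10)

# One estimator per model in the list; settings are illustrative only.
candidate_models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=10),
    "SVM": SVC(random_state=10),
    "Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(random_state=10),
    "Linear SVC": LinearSVC(max_iter=10000, random_state=10),
}

# Fit every candidate on the same training data.
fitted_models = {name: model.fit(X_syn, y_syn)
                 for name, model in candidate_models.items()}
```

Keeping the estimators in a dictionary makes it easy to loop over them once for the original training data and once for the undersampled data.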

To evaluate our models, we'll consider the following metrics:

- Confusion matrix

- Accuracy, precision, and recall

- ROC and AUC 
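
The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them for a single classifier on synthetic data; the data, the logistic regression model and the `metrics` dictionary are assumptions for illustration rather than results from the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data.
X_syn, y_syn = make_classification(n_samples=400, n_features=10,
                                   weights=[0.8, 0.2], random_state=10)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, stratify=y_syn,
                                      random_state=10)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)
y_pred = clf.predict(X_b)                 # hard class labels
y_score = clf.predict_proba(X_b)[:, 1]    # probability of the positive class

metrics = {
    "confusion_matrix": confusion_matrix(y_b, y_pred),
    "accuracy": accuracy_score(y_b, y_pred),
    "precision": precision_score(y_b, y_pred, zero_division=0),
    "recall": recall_score(y_b, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_b, y_score),
}

# Points of the ROC curve, ready for plotting.
fpr, tpr, thresholds = roc_curve(y_b, y_score)
```

Note that the ROC curve and AUC are computed from scores or probabilities rather than hard labels, so models without `predict_proba` (such as `LinearSVC`) would need `decision_function` instead.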