# Loan Prediction classification exercise

We consider the dataset file `dataset.csv`, contained in the `data/loan-prediction` directory.
A description of the dataset is available in the `README.txt` file in the same directory.

The **goal** is to use the information about past loan applicants contained in `dataset.csv` to predict whether a *new applicant* should be granted a loan or not.

### Load the dataset and handle missing values

In [9]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

DATASET_PATH = "./data/loan-prediction/dataset.csv"

# loading the dataset
data = pd.read_csv(DATASET_PATH, sep=",", index_col="Loan_ID")
print(f"The shape of the dataset is: {data.shape}")
print(data.head())

# handling missing values
from pandas.api.types import is_numeric_dtype
data = data.apply(lambda x: x.fillna(x.median()) if is_numeric_dtype(x) else x.fillna(x.mode().iloc[0]))

The shape of the dataset is: (614, 12)
         Gender Married Dependents     Education Self_Employed  \
Loan_ID                                                          
LP001002   Male      No          0      Graduate            No   
LP001003   Male     Yes          1      Graduate            No   
LP001005   Male     Yes          0      Graduate           Yes   
LP001006   Male     Yes          0  Not Graduate            No   
LP001008   Male      No          0      Graduate            No   

          ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
Loan_ID                                                                      
LP001002             5849                0.0         NaN             360.0   
LP001003             4583             1508.0       128.0             360.0   
LP001005             3000                0.0        66.0             360.0   
LP001006             2583             2358.0       120.0             360.0   
LP001008             6000     
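
The imputation rule used in the cell above (median for numeric columns, most frequent value for the others) can be checked on a small example. This is only a minimal sketch: the `toy` frame and its values are made up for illustration, and `impute` is a hypothetical helper name that spells out the same lambda.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame (made-up values) with one gap in a numeric column
# and one gap in a categorical column
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 120.0, 300.0],
    "Gender": ["Male", "Female", None, "Male"],
})

def impute(column: pd.Series) -> pd.Series:
    """Fill gaps with the median (numeric) or the most frequent value (otherwise)."""
    if is_numeric_dtype(column):
        return column.fillna(column.median())
    return column.fillna(column.mode().iloc[0])

# DataFrame.apply calls impute once per column
toy_filled = toy.apply(impute)
print(toy_filled)
```

The missing `LoanAmount` is filled with the median of the observed amounts, and the missing `Gender` with the most frequent label.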

### Handling outliers

Winsorization is a technique for handling outliers. It replaces the values that fall beyond a chosen lower and upper percentile with the nearest value inside those bounds, so extreme observations are pulled towards the bulk of the distribution instead of being discarded.
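
To see what winsorization does to the tails, here is a minimal sketch on made-up numbers (the `values` array is purely illustrative and not taken from the dataset). It assumes `scipy.stats.mstats.winsorize` with an explicit `(lower, upper)` pair of limits.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up sample: ten values with one extreme observation on each side
values = np.array([1, 50, 52, 55, 57, 60, 61, 63, 65, 900])

# limits=(0.1, 0.1) clips the lowest 10% and the highest 10% of the sorted
# values (one value per side here) to the nearest value that is kept
clipped = np.asarray(winsorize(values, limits=(0.1, 0.1)))
print(clipped)
```

In this sample the smallest value is raised to 50 and the largest is lowered to 65, while everything in between is left untouched. The replacements are the nearest retained observations, not the median.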

In [None]:
# winsorize ApplicantIncome, CoapplicantIncome and LoanAmount:
# the lowest and highest 5% of each column are clipped to the nearest retained value
from scipy.stats.mstats import winsorize
for col in ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]:
    # assign the result back: inplace=True on a pandas Series is not guaranteed to update the dataframe
    data[col] = np.asarray(winsorize(data[col].to_numpy(), limits=0.05))

# Apply log-transformation to ApplicantIncome and assign it to a new column
# (np.log requires strictly positive values)
data["Log_ApplicantIncome"] = np.log(data["ApplicantIncome"])
# Apply log-transformation to LoanAmount and assign it to a new column
data["Log_LoanAmount"] = np.log(data["LoanAmount"])

### Encoding categorical features: one-hot encoding

One-hot encoding converts a categorical feature into a numerical representation by creating one binary column per category. Each row is thus described by a vector whose length equals the number of categories, with a 1 (`True`) in the position of its own category and a 0 (`False`) everywhere else.
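
As a small illustration of the idea, the sketch below encodes a single made-up `Education` column with `pd.get_dummies`. The `toy` frame is hypothetical; only the category labels are borrowed from the dataset preview.

```python
import pandas as pd

# Made-up column with two categories
toy = pd.DataFrame({"Education": ["Graduate", "Not Graduate", "Graduate"]})

# One new column per category, named <column>_<category>
encoded = pd.get_dummies(toy, columns=["Education"])
print(encoded)

# Exactly one indicator is set in every row
print(encoded.sum(axis=1))
```

Each row has exactly one indicator set, which is what makes the representation "one-hot".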

In [21]:
# get all columns which are not numeric and not the loan status
categorical_features = [col for col in data.columns if not is_numeric_dtype(data[col]) and col != "Loan_Status"]
data_with_dummies = pd.get_dummies(data, columns=categorical_features)

# as a convention, I prefer to place the column to be predicted as the last one
columns = data_with_dummies.columns.tolist()
columns.append(columns.pop(columns.index("Loan_Status")))
data_with_dummies = data_with_dummies.loc[:, columns]

# encoding the Loan_Status label
data = data_with_dummies
data.Loan_Status = data.Loan_Status.map(lambda x: 1 if x == "Y" else -1)

print(data.head())

          ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
Loan_ID                                                                      
LP001002             5849                0.0       128.0             360.0   
LP001003             4583             1508.0       128.0             360.0   
LP001005             3000                0.0        66.0             360.0   
LP001006             2583             2358.0       120.0             360.0   
LP001008             6000                0.0       141.0             360.0   

          Credit_History  Gender_Female  Gender_Male  Married_No  Married_Yes  \
Loan_ID                                                                         
LP001002             1.0          False         True        True        False   
LP001003             1.0          False         True       False         True   
LP001005             1.0          False         True       False         True   
LP001006             1.0          False         

### Building a predictive model

In [29]:
# split the dataset into training and testing

# extract the feature matrix X from the dataframe: it is composed of
# all the columns except "Loan_Status" (the target class label)
X = data.iloc[:, :-1]

# extract the target class labels into the column vector y
y = data.Loan_Status

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape {X_test.shape}")

Training set shape: (491, 20)
Test set shape (123, 20)
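
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` in both parts of the split. A minimal sketch on made-up labels (the 70/30 class balance and the placeholder feature are assumptions for illustration, not properties of the loan dataset) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 70 positives (+1) and 30 negatives (-1)
y_toy = np.array([1] * 70 + [-1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=43, stratify=y_toy
)

# With stratification the share of positives is (almost) identical in both parts
print(f"train positives: {np.mean(y_tr == 1):.2f}")
print(f"test positives:  {np.mean(y_te == 1):.2f}")
```

Without stratification a random split of a small dataset can leave the test set with a noticeably different class balance, which makes accuracy figures harder to compare.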


### Feature scaling: why/when