<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Load-the-libraries" data-toc-modified-id="Load-the-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the libraries</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Data-Processing" data-toc-modified-id="Data-Processing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Processing</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Train-validation-split" data-toc-modified-id="Train-validation-split-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Train validation split</a></span></li></ul></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Tpot" data-toc-modified-id="Tpot-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Tpot</a></span></li><li><span><a href="#Tpot-best" data-toc-modified-id="Tpot-best-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Tpot best</a></span></li><li><span><a href="#Extra-Trees" data-toc-modified-id="Extra-Trees-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Extra Trees</a></span></li></ul></li></ul></div>

# Description
Reference: https://datahack.analyticsvidhya.com/contest/all/  


**Predict Loan Eligibility for Dream Housing Finance company**  
Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To that end, it has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

**Data Dictionary**  
Train file: CSV containing the customers whose loan eligibility is known, recorded in 'Loan_Status'.

| Variable | Description |
| :---|:---|
| Loan_ID | Unique Loan ID |
| Gender | Male/ Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate/ Not Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines (1.0/ 0.0) |
| Property_Area | Urban/ Semi Urban/ Rural |
| Loan_Status | (Target) Loan approved (Y/N) |


**Evaluation Metric**  
Your model performance will be evaluated on your predictions of loan status for the test data (test.csv), which contains the same fields as the train data except for the loan status to be predicted. Submissions must follow the format shown in the sample submission.

We hold the actual loan status for the test dataset, and your predictions will be evaluated against it using accuracy.
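
Accuracy is the fraction of predictions that match the true label. A minimal sketch with made-up labels (not contest data), computing it by hand and with scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only; 1 = approved (Y), 0 = rejected (N).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Accuracy = number of correct predictions / number of predictions.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sk_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sk_accuracy)
```

Accuracy ignores which class the errors fall in, which is why precision and recall are also tracked later in this notebook.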



**Public and Private Split**   
The test file is further divided into Public (25%) and Private (75%) parts.

Your initial responses are checked and scored on the Public data.
The final rankings are based on your Private score, which is published once the competition is over.

# Load the libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

# use the full option names; the short forms are ambiguous in newer pandas
pd.set_option('display.max_columns',100)
pd.set_option('display.max_colwidth',500)
SEED=100

In [2]:
import sklearn
import xgboost

[(x.__name__,x.__version__) for x in [sklearn,xgboost]]

[('sklearn', '0.23.1'), ('xgboost', '1.1.1')]

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

from xgboost import XGBClassifier

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier

# Load the data

In [4]:
df_train = pd.read_csv('../data/raw/train.csv')
print(df_train.shape)
df_train.head()

(614, 13)


| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [5]:
df_test = pd.read_csv('../data/raw/test.csv')
print(df_test.shape)
df_test.head()

(367, 12)


| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 | NaN | Urban |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |


# Data Processing

In [6]:
def clean_data(df):
    df = df.copy()
    # drop unwanted features
    df = df.drop('Loan_ID',axis=1)

    # missing values imputation
    ## fill missing Married from Education: Graduate -> Yes, otherwise No
    cond = (df['Education']=='Graduate') & (df['Married'].isnull()) 
    df.loc[cond, 'Married'] = 'Yes'
    cond = (df['Education']!='Graduate') & (df['Married'].isnull()) 
    df.loc[cond, 'Married'] = 'No'

    ## fill with mode
    cols_mode = ['Gender', 'Dependents', 'Self_Employed', 'Credit_History']
    for c in cols_mode:
        df[c] = df[c].fillna(df[c].mode()[0])

    ## fill with mean
    cols_mean = ['LoanAmount','Loan_Amount_Term' ]
    for c in cols_mean:
        df[c] = df[c].fillna(df[c].mean())

    # mapping string to integers
    df['Gender'] = df['Gender'].map({'Male':1, 'Female': 0})
    df['Married'] = df['Married'].map({'Yes':1, 'No': 0 })
    df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0})
    df['Self_Employed'] = df['Self_Employed'].map({'Yes':1, 'No': 0})
    
    # target 
    target = 'Loan_Status'
    if target in df.columns:
        df[target] = df[target].map({'Y':1, 'N': 0})

    # one hot encoding
    cols = ['Dependents','Property_Area']
    df = pd.get_dummies(df,columns=cols,drop_first=True)
    return df
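
Note that `clean_data` computes the mode and mean from whichever frame it receives, so the test set is imputed with its own statistics rather than the training ones. A common alternative is to learn the fill values on the training frame and reuse them everywhere. The sketch below illustrates that idea on toy frames; `fit_fill_values` and `apply_fill_values` are hypothetical helpers, not part of this notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, cols_mode, cols_mean):
    """Learn one fill value per column from the training frame (hypothetical helper)."""
    fill = {c: df[c].mode()[0] for c in cols_mode}
    fill.update({c: df[c].mean() for c in cols_mean})
    return fill

def apply_fill_values(df, fill):
    """Return a copy of df with missing values replaced by the learned values."""
    return df.fillna(value=fill)

# Toy frames for illustration only; not the contest data.
toy_train = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', np.nan],
                          'LoanAmount': [100.0, 200.0, np.nan, 300.0]})
toy_test = pd.DataFrame({'Gender': [np.nan, 'Female'],
                         'LoanAmount': [np.nan, 50.0]})

# Fill values come from the training frame and are then applied to the test frame.
fill = fit_fill_values(toy_train, cols_mode=['Gender'], cols_mean=['LoanAmount'])
toy_test_filled = apply_fill_values(toy_test, fill)
print(fill)
print(toy_test_filled)
```

Whether this changes the leaderboard score here is not something the notebook measures; it mainly keeps train and test preprocessing consistent.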

In [7]:
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

df_train = clean_data(df_train)
df_test = clean_data(df_test)

print(df_train.shape)
print(df_test.shape)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df_train.head(2), df_train.tail(2)])

(614, 15)
(367, 14)


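| | Gender | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | Dependents_1 | Dependents_2 | Dependents_3+ | Property_Area_Semiurban | Property_Area_Urban |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 1 | 0 | 1 | 0 | 5849 | 0.0 | 146.412162 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 612 | 1 | 1 | 1 | 0 | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 613 | 0 | 0 | 1 | 1 | 4583 | 0.0 | 133.0 | 360.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |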


# Modelling

## Train validation split

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
df_train_orig = df_train.copy()
target = 'Loan_Status'
df_Xtrain, df_Xvalid, ser_ytrain, ser_yvalid = train_test_split(
    df_train_orig.drop(target,axis=1), df_train_orig[target],
    test_size = 0.2,
    random_state=SEED,
    stratify=df_train_orig[target]
)

ytrain = ser_ytrain.to_numpy().ravel()
yvalid = ser_yvalid.to_numpy().ravel()

print(f'train shape: {df_Xtrain.shape}')
print(f'valid shape: {df_Xvalid.shape}')

df_Xtrain.head(2)

train shape: (491, 14)
valid shape: (123, 14)


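| | Gender | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Dependents_1 | Dependents_2 | Dependents_3+ | Property_Area_Semiurban | Property_Area_Urban |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 203 | 1 | 1 | 0 | 0 | 3500 | 1083.0 | 135.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 1 |
| 369 | 1 | 1 | 1 | 0 | 19730 | 5266.0 | 570.0 | 360.0 | 1.0 | 0 | 0 | 0 | 0 | 0 |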
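
The split above passes `stratify`, which keeps the class ratio of the target the same in the train and validation parts. A small sketch on synthetic labels (not the loan data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 70% ones, 30% zeros.
y = np.array([1] * 70 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_vd, y_tr, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

# With stratify=y, both parts keep the 70/30 class ratio.
print(y_tr.mean(), y_vd.mean())
```

Without `stratify`, a small validation set can end up with a noticeably different approval rate than the training set, which makes accuracy harder to interpret.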


In [10]:
Xtr = df_Xtrain
ytr = ytrain
Xvd = df_Xvalid
yvd = yvalid

In [11]:
from sklearn import metrics

df_eval = pd.DataFrame({
    'Model': [],
    'Description': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F-score': [],
    'Time_Taken': [],
})
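
The modelling cells further down repeat the same fit, predict, score and timing steps for each model. One possible way to factor that out is a small helper that returns one results row per model. This is only a sketch: `evaluate_model` is a hypothetical name, and the data and `LogisticRegression` model are synthetic stand-ins.

```python
import time

import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def evaluate_model(model, name, description, X_train, y_train, X_valid, y_valid):
    """Fit the model, score it on the validation set, return one results row (hypothetical helper)."""
    start = time.time()
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    elapsed = time.time() - start
    return {
        'Model': name,
        'Description': description,
        'Accuracy': metrics.accuracy_score(y_valid, preds),
        'Precision': metrics.precision_score(y_valid, preds),
        'Recall': metrics.recall_score(y_valid, preds),
        'F-score': metrics.f1_score(y_valid, preds),
        'Time_Taken': '{:.0f} min {:.2f} s'.format(*divmod(elapsed, 60)),
    }

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=100)
X_tr, X_vd, y_tr, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

# Collect rows in a list and build the frame once at the end.
rows = [evaluate_model(LogisticRegression(max_iter=1000), 'lr', 'default',
                       X_tr, y_tr, X_vd, y_vd)]
df_results = pd.DataFrame(rows)
print(df_results)
```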

# Modelling

## Tpot

In [12]:
from tpot import TPOTClassifier



In [13]:
%%time

# generations=5

# tpot = TPOTClassifier(max_time_mins=2, population_size=50, verbosity=2, random_state=100)
# tpot.fit(Xtr, ytr)
# print(tpot.score(Xvd, yvd))
# tpot.export('../scripts/tpot_loan_prediction_pipeline.py')

"""
Best pipeline: RandomForestClassifier(MinMaxScaler(MLPClassifier(input_matrix, alpha=0.1, learning_rate_init=0.001)), bootstrap=True, criterion=entropy, max_features=0.7000000000000001, min_samples_leaf=10, min_samples_split=18, n_estimators=100)
0.8211382113821138
CPU times: user 4min 6s, sys: 4.33 s, total: 4min 11s
Wall time: 2min 30s
""";


Generation 1 - Current best internal CV score: 0.80863739435168
Generation 2 - Current best internal CV score: 0.80863739435168
Generation 3 - Current best internal CV score: 0.8106782106782106
Generation 4 - Current best internal CV score: 0.8106782106782106
Generation 5 - Current best internal CV score: 0.8106782106782106
Best pipeline: RandomForestClassifier(MinMaxScaler(MLPClassifier(input_matrix, alpha=0.1, learning_rate_init=0.001)), bootstrap=True, criterion=entropy, max_features=0.7000000000000001, min_samples_leaf=10, min_samples_split=18, n_estimators=100)
0.8211382113821138
CPU times: user 4min 6s, sys: 4.33 s, total: 4min 11s
Wall time: 2min 30s


## Tpot best

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler
from tpot.builtins import StackingEstimator
import time

In [15]:
time_start = time.time()
model = make_pipeline(
    StackingEstimator(estimator=MLPClassifier(alpha=0.1, learning_rate_init=0.001)),
    MinMaxScaler(),
    RandomForestClassifier(bootstrap=True, criterion="entropy",
                           max_features=0.7000000000000001,
                           min_samples_leaf=10, min_samples_split=18,
                           n_estimators=100)
)

model.fit(Xtr,ytr)
vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['tpot', 'rf+minmax+stacking', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | tpot | rf+minmax+stacking | 0.813008 | 0.803922 | 0.964706 | 0.877005 | 0 min 0.60 s |
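
In the pipeline above, `StackingEstimator` wraps the MLP so that its outputs are passed on as extra input columns for the steps that follow. The sketch below imitates that idea with scikit-learn only. `PredictionAppender` is a hypothetical class, the column order is an assumption, and this is not TPOT's actual implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

class PredictionAppender(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: append a fitted classifier's outputs to X as extra features."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in stays untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        labels = self.estimator_.predict(X).reshape(-1, 1)
        probas = self.estimator_.predict_proba(X)
        # Assumed layout: predicted label, class probabilities, then the original columns.
        return np.hstack([labels, probas, X])

# Synthetic data for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=100)
X_aug = PredictionAppender(LogisticRegression(max_iter=1000)).fit(X, y).transform(X)
print(X.shape, X_aug.shape)
```

For a binary target this adds three columns (one label, two probabilities), so the downstream random forest sees the MLP's opinion alongside the raw features.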


## Extra Trees

In [16]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import metrics

In [17]:
time_start = time.time()
model = ExtraTreesClassifier(bootstrap=True,
                             criterion='entropy', max_features=0.8,
                             min_samples_leaf=19,
                             min_samples_split=5, n_estimators=100)

model.fit(Xtr,ytr)
vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['ET', 'tuned', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | tpot | rf+minmax+stacking | 0.813008 | 0.803922 | 0.964706 | 0.877005 | 0 min 0.60 s |
| 1 | 1 | ET | tuned | 0.813008 | 0.792453 | 0.988235 | 0.879581 | 0 min 0.16 s |
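
Both models reach the same validation accuracy, 0.813008, which is 100 correct out of 123 rows, so a single split cannot separate them. Cross-validation over several folds gives a mean and a spread instead of one number. The sketch below runs on synthetic data as a stand-in for `Xtr` and `ytr`; the hyperparameters are borrowed loosely from the cells above, and `random_state` is added so reruns give the same scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for Xtr/ytr; swap in the real training data to use this.
X, y = make_classification(n_samples=300, n_features=14, random_state=100)

candidates = {
    'rf': RandomForestClassifier(criterion='entropy', min_samples_leaf=10,
                                 n_estimators=100, random_state=100),
    'et': ExtraTreesClassifier(bootstrap=True, criterion='entropy',
                               min_samples_leaf=19, n_estimators=100,
                               random_state=100),
}

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
cv_scores = {name: cross_val_score(model, X, y, cv=cv, scoring='accuracy')
             for name, model in candidates.items()}

for name, scores in cv_scores.items():
    print(f'{name}: mean={scores.mean():.3f} std={scores.std():.3f}')
```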
