<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Load-the-libraries" data-toc-modified-id="Load-the-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the libraries</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Train-validation-split" data-toc-modified-id="Train-validation-split-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Train validation split</a></span></li></ul></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Grid-Search" data-toc-modified-id="Grid-Search-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Grid Search</a></span></li><li><span><a href="#Take-log" data-toc-modified-id="Take-log-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Take log</a></span></li></ul></div>

# Description
Reference: https://datahack.analyticsvidhya.com/contest/all/  


**Predict Loan Eligibility for Dream Housing Finance company**
Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for it.

The company wants to automate the loan eligibility process (in real time) based on the details customers provide when filling in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

**Data Dictionary**
Train file: CSV containing the customers whose loan eligibility is known, recorded in the 'Loan_Status' column.

| Variable | Description |
| :---|:---|
| Loan_ID | Unique Loan ID |
| Gender | Male/ Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant Education (Graduate/ Not Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines |
| Property_Area | Urban/ Semi Urban/ Rural |
| Loan_Status | (Target) Loan approved (Y/N) |


**Evaluation Metric**  
Your model performance will be evaluated on your predictions of loan status for the test data (test.csv), which contains the same fields as the train file except for the loan status to be predicted. Your submission needs to follow the format shown in the sample submission.

We hold the actual loan status for the test dataset, and your predictions will be evaluated against it. We will use Accuracy to judge your submission.
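
As a quick illustration of the metric, accuracy is the share of predictions that match the true label. The sketch below uses made-up labels (not contest data) and compares a manual computation with `sklearn.metrics.accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels for illustration only (1 = approved, 0 = rejected).
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1])

# Accuracy = number of matching predictions / number of predictions.
acc_manual = (y_true == y_pred).mean()
acc_sklearn = accuracy_score(y_true, y_pred)

print(acc_manual, acc_sklearn)  # 4 of the 6 predictions match
```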



**Public and Private Split**   
The test file is further divided into Public (25%) and Private (75%) parts.

Your initial responses will be checked and scored on the Public data.
The final rankings will be based on your Private score, which will be published once the competition is over.

# Load the libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.plotting.backend = "matplotlib"
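# use the full option names; the short form 'max_columns' is ambiguous in newer pandas
pd.set_option('display.max_columns',100)
pd.set_option('display.max_colwidth',200)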

import time,os,json
time_start_notebook = time.time()
home = os.path.expanduser('~')
SEED=100

In [2]:
import sklearn
import xgboost as xgb

# Load the data

In [3]:
def clean_data(df):
    df = df.copy()
    # drop unwanted features
    df = df.drop('Loan_ID',axis=1)

    # missing values imputation
    ## fill missing Married values from Education (Graduate -> Yes, otherwise No)
    cond = (df['Education']=='Graduate') & (df['Married'].isnull()) 
    df.loc[cond, 'Married'] = 'Yes'
    cond = (df['Education']!='Graduate') & (df['Married'].isnull()) 
    df.loc[cond, 'Married'] = 'No'

    ## fill with mode
    cols_mode = ['Gender', 'Dependents', 'Self_Employed', 'Credit_History']
    for c in cols_mode:
        df[c] = df[c].fillna(df[c].mode()[0])

    ## fill with mean
    cols_mean = ['LoanAmount','Loan_Amount_Term' ]
    for c in cols_mean:
        df[c] = df[c].fillna(df[c].mean())

    # mapping string to integers
    df['Gender'] = df['Gender'].map({'Male':1, 'Female': 0})
    df['Married'] = df['Married'].map({'Yes':1, 'No': 0 })
    df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0})
    df['Self_Employed'] = df['Self_Employed'].map({'Yes':1, 'No': 0})
    
    # target 
    target = 'Loan_Status'
    if target in df.columns:
        df[target] = df[target].map({'Y':1, 'N': 0})

    # one hot encoding
    cols = ['Dependents','Property_Area']
    df = pd.get_dummies(df,columns=cols,drop_first=True)
    return df
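
Note that `clean_data` computes the mode and mean on whichever frame it receives, so the test file is imputed with its own statistics. One possible alternative, sketched below under assumptions, is to learn the fill values on the training frame and reuse them everywhere. The helpers `fit_fill_values` and `apply_fill_values` are hypothetical names, not part of the notebook, and the frames are made up.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, cols_mode, cols_mean):
    """Learn one fill value per column from a training frame (hypothetical helper)."""
    fill = {c: df[c].mode()[0] for c in cols_mode}
    fill.update({c: df[c].mean() for c in cols_mean})
    return fill

def apply_fill_values(df, fill):
    """Fill missing values with statistics learned elsewhere (hypothetical helper)."""
    return df.fillna(value=fill)

# Tiny made-up frames, only to show the mechanics.
toy_train = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', np.nan],
                          'LoanAmount': [100.0, 200.0, np.nan, 300.0]})
toy_test = pd.DataFrame({'Gender': [np.nan, 'Female'],
                         'LoanAmount': [np.nan, 50.0]})

fill = fit_fill_values(toy_train, cols_mode=['Gender'], cols_mean=['LoanAmount'])
toy_test_filled = apply_fill_values(toy_test, fill)

print(fill)
print(toy_test_filled)
```

Whether this changes the scores for this dataset is not something the notebook measures; the sketch only shows the mechanics.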

In [4]:
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

df_train = clean_data(df_train)
df_test = clean_data(df_test)

print(df_train.shape)
print(df_test.shape)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df_train.head(2), df_train.tail(2)])

(614, 15)
(367, 14)


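| | Gender | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | Dependents_1 | Dependents_2 | Dependents_3+ | Property_Area_Semiurban | Property_Area_Urban |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 1 | 0 | 1 | 0 | 5849 | 0.0 | 146.412162 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 612 | 1 | 1 | 1 | 0 | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 613 | 0 | 0 | 1 | 1 | 4583 | 0.0 | 133.0 | 360.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |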
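
Because `pd.get_dummies` is applied to the train and test frames separately, a category missing from one file would produce different dummy columns. The shapes printed above suggest the columns already line up in this run. The sketch below, on made-up data, shows one way to guard against a mismatch by reindexing the test frame to the training columns.

```python
import pandas as pd

# Made-up frames: the test frame happens to contain no 'Semiurban' rows.
toy_train = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban']})
toy_test = pd.DataFrame({'Property_Area': ['Rural', 'Urban']})

train_enc = pd.get_dummies(toy_train, columns=['Property_Area'], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=['Property_Area'], drop_first=True)

# Reindex to the training columns; dummies absent from the test frame become 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(list(test_aligned.columns))
```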


# Modelling

## Train validation split

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
df_train_orig = df_train.copy()
target = 'Loan_Status'
df_Xtrain, df_Xvalid, ser_ytrain, ser_yvalid = train_test_split(
    df_train_orig.drop(target,axis=1), df_train_orig[target],
    test_size = 0.2,
    random_state=SEED,
    stratify=df_train_orig[target]
)

ytrain = ser_ytrain.to_numpy().ravel()
yvalid = ser_yvalid.to_numpy().ravel()

print(f'train shape: {df_Xtrain.shape}')
print(f'valid shape: {df_Xvalid.shape}')

df_Xtrain.head(2)

train shape: (491, 14)
valid shape: (123, 14)


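| | Gender | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Dependents_1 | Dependents_2 | Dependents_3+ | Property_Area_Semiurban | Property_Area_Urban |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 203 | 1 | 1 | 0 | 0 | 3500 | 1083.0 | 135.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 1 |
| 369 | 1 | 1 | 1 | 0 | 19730 | 5266.0 | 570.0 | 360.0 | 1.0 | 0 | 0 | 0 | 0 | 0 |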
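
The `stratify` argument used in the split above keeps the class ratio roughly equal in both parts. A minimal sketch on synthetic labels (70% positive, an arbitrary choice for illustration) shows how this can be checked:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 70 positives and 30 negatives (illustrative only).
y = np.array([1] * 70 + [0] * 30)
X = np.arange(100).reshape(-1, 1)

X_tr, X_vd, y_tr, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

# With stratification the positive rate is close to 0.7 in every part.
print(y.mean(), y_tr.mean(), y_vd.mean())
```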


In [7]:
Xtr = df_Xtrain
ytr = ytrain
Xvd = df_Xvalid
yvd = yvalid

In [8]:
from sklearn import metrics

df_eval = pd.DataFrame({
    'Model': [],
    'Description': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F-score': [],
    'Time_Taken': [],
})
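
The fit, predict, score and timing steps are repeated for every model in the cells that follow. One way to cut the repetition is a small helper that returns a results row. The function `evaluate_model` below is a hypothetical sketch, not part of the notebook, and it is demonstrated on synthetic data from `make_classification` rather than on the loan data.

```python
import time
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def evaluate_model(model, description, X_tr, y_tr, X_vd, y_vd, name='Logreg'):
    """Fit the model, score it on the validation split, return one results row."""
    t0 = time.time()
    model.fit(X_tr, y_tr)
    preds = model.predict(X_vd)
    elapsed = "{:.0f} min {:.2f} s".format(*divmod(time.time() - t0, 60))
    return {
        'Model': name,
        'Description': description,
        'Accuracy': metrics.accuracy_score(y_vd, preds),
        'Precision': metrics.precision_score(y_vd, preds),
        'Recall': metrics.recall_score(y_vd, preds),
        'F-score': metrics.f1_score(y_vd, preds),
        'Time_Taken': elapsed,
    }

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=100)
X_tr, X_vd, y_tr, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

rows = [
    evaluate_model(LogisticRegression(C=c, max_iter=1000), f'C={c}',
                   X_tr, y_tr, X_vd, y_vd)
    for c in (0.1, 1.0)
]
df_eval_demo = pd.DataFrame(rows)
print(df_eval_demo)
```

Building the frame from a list of dictionaries also avoids growing an initially empty DataFrame row by row.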

# Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import time

In [10]:
time_start = time.time()
model = LogisticRegression(random_state=SEED)
model.fit(Xtr,ytr)

vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['Logreg', 'default', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | Logreg | default | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
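
The convergence warning above suggests either raising `max_iter` or scaling the data. The notebook takes the first route in the next cell. As a sketch of the second route, the `StandardScaler` imported earlier can be chained with the model in a pipeline. The data here is synthetic, with one column inflated to an income-like magnitude, so the numbers say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first column is rescaled to mimic an income-sized feature.
X, y = make_classification(n_samples=300, n_features=6, random_state=100)
X = X.copy()
X[:, 0] *= 5000

# The scaler is fitted on the data passed to fit() and applied before the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=100))
pipe.fit(X, y)

# Number of lbfgs iterations actually used (default max_iter is 100).
n_iter = int(pipe[-1].n_iter_[0])
print(n_iter, pipe.score(X, y))
```

Putting the scaler inside the pipeline also means that, under cross-validation, it is refitted on each training fold only.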


In [11]:
time_start = time.time()
model = LogisticRegression(random_state=SEED,max_iter=100_000)
model.fit(Xtr,ytr)

vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['Logreg', 'max_iter=100_000', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | Logreg | default | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
| 1 | 1 | Logreg | max_iter=100_000 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.03 s |


# Grid Search

```python
LogisticRegression(
    penalty='l2',
    *,
    dual=False,
    tol=0.0001,
    C=1.0,
    fit_intercept=True,
    intercept_scaling=1,
    class_weight=None,
    random_state=None,
    solver='lbfgs',
    max_iter=100,
    multi_class='auto',
    verbose=0,
    warm_start=False,
    n_jobs=None,
    l1_ratio=None,
)

```
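
Not every solver in this signature supports every penalty: `lbfgs` handles only L2 or no penalty, while `liblinear` handles L1 and L2. `GridSearchCV` accepts a list of dictionaries, which makes it possible to search only over valid solver and penalty pairs. The following is a minimal sketch on synthetic data, with grid values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=100)

# One dictionary per solver, each listing only penalties that solver supports.
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1.0, 10]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1.0, 10]},
]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
grid_demo = GridSearchCV(
    LogisticRegression(max_iter=10_000, random_state=100),
    param_grid, scoring='accuracy', cv=skf,
)
grid_demo.fit(X, y)

n_candidates = len(grid_demo.cv_results_['params'])  # 3 + 6 combinations
print(n_candidates, grid_demo.best_params_)
```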

In [12]:
from sklearn.model_selection import GridSearchCV

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold

In [14]:
%%time

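# NOTE: with solver='lbfgs' (set below) only 'l2' and no penalty are supported,
# so the 'l1' and 'elasticnet' candidates cannot be fitted. Newer scikit-learn
# releases also expect None instead of the string 'none'.
params = dict(
    penalty = ['none','l1','l2','elasticnet'],
    C = [0.01, 0.1, 1.0, 10, 100]
)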

best_params = dict(
    n_jobs=-1,
    random_state=SEED,
    solver='lbfgs',
    fit_intercept=True,
    max_iter=10_000,
    )


# rskf = RepeatedStratifiedKFold(n_splits=5,n_repeats=3,random_state=SEED)
# model =  LogisticRegression(**best_params)
# grid = GridSearchCV(model, params, n_jobs=-1,
#                     scoring='accuracy', cv=rskf, verbose=False)

# grid.fit(Xtr,ytr)

# print(grid.best_params_)
# print(f'Best score = {grid.best_score_}')

"""
{'C': 1.0, 'penalty': 'l2'}
Best score = 0.8051673194530338
CPU times: user 371 ms, sys: 15.6 ms, total: 386 ms
Wall time: 4.39 s

""";

CPU times: user 12 µs, sys: 1 µs, total: 13 µs
Wall time: 14.3 µs


"\n{'C': 1.0, 'penalty': 'l2'}\nBest score = 0.8051673194530338\nCPU times: user 371 ms, sys: 15.6 ms, total: 386 ms\nWall time: 4.39 s\n\n"

In [15]:
%%time

params = dict(
    solver=['lbfgs','liblinear','newton-cg']
)

best_params = dict(
    n_jobs=-1,
    C=1,
    penalty='l2',
    random_state=SEED,
    solver='lbfgs',
    fit_intercept=True,
    max_iter=10_000,
    )

# skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=SEED)
# model =  LogisticRegression(**best_params)
# grid = GridSearchCV(model, params, n_jobs=-1,
#                     scoring='accuracy', cv=skf, verbose=False)

# grid.fit(Xtr,ytr)

# print(grid.best_params_)
# print(f'Best score = {grid.best_score_}')

"""
{'solver': 'lbfgs'}
Best score = 0.8085343228200372
CPU times: user 53.7 ms, sys: 3.97 ms, total: 57.7 ms
Wall time: 606 ms
""";

CPU times: user 20 µs, sys: 0 ns, total: 20 µs
Wall time: 23.1 µs


"\n{'solver': 'lbfgs'}\nBest score = 0.8085343228200372\nCPU times: user 53.7 ms, sys: 3.97 ms, total: 57.7 ms\nWall time: 606 ms\n"

In [16]:
%%time

params = dict(
    C=[0.8,0.9,1.0,1.1,1.2,1.3,1.4]
)

best_params = dict(
    n_jobs=-1,
    C=1,
    penalty='l2',
    random_state=SEED,
    solver='lbfgs',
    fit_intercept=True,
    max_iter=10_000,
    )

rskf = RepeatedStratifiedKFold(n_splits=5,n_repeats=5,random_state=SEED)
model =  LogisticRegression(**best_params)
grid = GridSearchCV(model, params, n_jobs=-1,
                    scoring='accuracy', cv=rskf, verbose=False)

grid.fit(Xtr,ytr)

print(grid.best_params_)
print(f'Best score = {grid.best_score_}')

{'C': 1.4}
Best score = 0.8040568954854669
CPU times: user 613 ms, sys: 70.9 ms, total: 684 ms
Wall time: 6.19 s
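
The best score alone does not show how close the other candidates were. A sketch of inspecting `cv_results_` is given below. It uses synthetic data and an illustrative grid, so the values will differ from the run above. When the gap between candidates is smaller than `std_test_score`, the ranking may be mostly noise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=100)

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=100)
grid_demo = GridSearchCV(
    LogisticRegression(max_iter=10_000),
    {'C': [0.8, 1.0, 1.2, 1.4]},
    scoring='accuracy', cv=rskf,
)
grid_demo.fit(X, y)

# Mean and spread of the accuracy across the 25 folds, per candidate.
cv_table = (
    pd.DataFrame(grid_demo.cv_results_)
    [['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(cv_table)
```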


In [17]:
time_start = time.time()
model = LogisticRegression(random_state=SEED,max_iter=100_000,C=1.4)
model.fit(Xtr,ytr)

vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['Logreg', 'max_iter=100_000,C=1.4', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | Logreg | default | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
| 1 | 1 | Logreg | max_iter=100_000 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.03 s |
| 2 | 2 | Logreg | max_iter=100_000,C=1.4 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |


In [18]:
%%time

params = dict(
    C=[1.30,1.32,1.34,1.38,1.42,1.44,1.46,1.48]
)

best_params = dict(
    n_jobs=-1,
    C=1,
    penalty='l2',
    random_state=SEED,
    solver='lbfgs',
    fit_intercept=True,
    max_iter=10_000,
    )


rskf = RepeatedStratifiedKFold(n_splits=5,n_repeats=5,random_state=SEED)
model =  LogisticRegression(**best_params)
grid = GridSearchCV(model, params, n_jobs=-1,
                    scoring='accuracy', cv=rskf, verbose=False)

grid.fit(Xtr,ytr)

print(grid.best_params_)
print(f'Best score = {grid.best_score_}')

{'C': 1.42}
Best score = 0.8056895485466913
CPU times: user 414 ms, sys: 15.4 ms, total: 429 ms
Wall time: 5.76 s


In [19]:
time_start = time.time()
model = LogisticRegression(random_state=SEED,max_iter=100_000,C=1.42)
model.fit(Xtr,ytr)

vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['Logreg', 'max_iter=100_000,C=1.42', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | Logreg | default | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
| 1 | 1 | Logreg | max_iter=100_000 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.03 s |
| 2 | 2 | Logreg | max_iter=100_000,C=1.4 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
| 3 | 3 | Logreg | max_iter=100_000,C=1.42 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.03 s |


# Take log

In [20]:
df_Xtrain.head(2)

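| | Gender | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Dependents_1 | Dependents_2 | Dependents_3+ | Property_Area_Semiurban | Property_Area_Urban |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 203 | 1 | 1 | 0 | 0 | 3500 | 1083.0 | 135.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 1 |
| 369 | 1 | 1 | 1 | 0 | 19730 | 5266.0 | 570.0 | 360.0 | 1.0 | 0 | 0 | 0 | 0 | 0 |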


In [21]:
cols_log = ['ApplicantIncome','CoapplicantIncome','LoanAmount']

for c in cols_log:
    df_Xtrain[c+'_log'] = np.log1p(df_Xtrain[c])
    df_Xvalid[c+'_log'] = np.log1p(df_Xvalid[c])
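
`np.log1p(x)` computes `log(1 + x)`, which keeps zero incomes finite where a plain `np.log` would return `-inf`. A short sketch with illustrative values (3500 matches the first row shown above; the zero is there to show the edge case):

```python
import numpy as np
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    'ApplicantIncome': [3500, 19730],
    'CoapplicantIncome': [0.0, 5266.0],
})

logged = np.log1p(toy)       # log(1 + x), defined at x = 0
restored = np.expm1(logged)  # inverse transform, exp(x) - 1

print(logged.round(6))
print(restored)
```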

In [22]:
df_Xtrain.head(2)

| | Gender | Married | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Dependents_1 | Dependents_2 | Dependents_3+ | Property_Area_Semiurban | Property_Area_Urban | ApplicantIncome_log | CoapplicantIncome_log | LoanAmount_log |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 203 | 1 | 1 | 0 | 0 | 3500 | 1083.0 | 135.0 | 360.0 | 1.0 | 1 | 0 | 0 | 0 | 1 | 8.160804 | 6.988413 | 4.912655 |
| 369 | 1 | 1 | 1 | 0 | 19730 | 5266.0 | 570.0 | 360.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | 9.889946 | 8.569216 | 6.347389 |


In [25]:
features = [i for i in df_Xtrain.columns if i not in cols_log]

print(features)
Xtr = df_Xtrain[features]
Xvd = df_Xvalid[features]

['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Dependents_1', 'Dependents_2', 'Dependents_3+', 'Property_Area_Semiurban', 'Property_Area_Urban', 'ApplicantIncome_log', 'CoapplicantIncome_log', 'LoanAmount_log']


In [24]:
time_start = time.time()
model = LogisticRegression(random_state=SEED,max_iter=10_000)
model.fit(Xtr,ytr)

vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['Logreg', 'log, max_iter=10_000', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

| | index | Model | Description | Accuracy | Precision | Recall | F-score | Time_Taken |
| :---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 0 | Logreg | default | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
| 1 | 1 | Logreg | max_iter=100_000 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.03 s |
| 2 | 2 | Logreg | max_iter=100_000,C=1.4 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.04 s |
| 3 | 3 | Logreg | max_iter=100_000,C=1.42 | 0.821138 | 0.8 | 0.988235 | 0.884211 | 0 min 0.03 s |
| 4 | 4 | Logreg | log, max_iter=10_000 | 0.813008 | 0.792453 | 0.988235 | 0.879581 | 0 min 0.08 s |
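
The reported metrics are consistent with a simple confusion-matrix reading. Assuming the 123 validation rows contain 85 approved loans (inferred from the recall of 0.988235 = 84/85, not printed by the notebook), the untransformed models would have 84 true positives and 21 false positives, and the log-feature model one more false positive. The sketch below only redoes that arithmetic. The counts are an inference, not notebook output.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output).
n_valid, n_pos = 123, 85
n_neg = n_valid - n_pos  # 38 rejected loans

def summarize(tp, fp):
    """Return (accuracy, precision, recall, f1) from true and false positives."""
    fn = n_pos - tp
    tn = n_neg - fp
    accuracy = (tp + tn) / n_valid
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1

raw_metrics = [round(v, 6) for v in summarize(tp=84, fp=21)]
log_metrics = [round(v, 6) for v in summarize(tp=84, fp=22)]
baseline = round(n_pos / n_valid, 6)  # always predicting "approved"

print(raw_metrics)  # [0.821138, 0.8, 0.988235, 0.884211]
print(log_metrics)  # [0.813008, 0.792453, 0.988235, 0.879581]
print(baseline)     # 0.691057
```

Under this reading, every model tried here approves almost all applicants, and the difference between the raw and log-transformed features amounts to a single validation row.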
