<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Load-the-libraries" data-toc-modified-id="Load-the-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the libraries</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modelling</a></span></li></ul></div>

# Description
Reference: https://datahack.analyticsvidhya.com/contest/all/  


**Predict Loan Eligibility for Dream Housing Finance company**
Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for it.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

**Data Dictionary**
Train file: CSV containing the customers for whom loan eligibility is known, recorded in the `Loan_Status` column.

| Variable | Description |
| :---|:---|
| Loan_ID | Unique Loan ID |
| Gender | Male/ Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate/ Not Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines (1/0) |
| Property_Area | Urban/ Semi Urban/ Rural |
| Loan_Status | (Target) Loan approved (Y/N) |


**Evaluation Metric**  
Your model's performance is evaluated on your predicted loan status for the test data (test.csv), which contains the same fields as the train file except for the loan status to be predicted. Your submission must follow the format shown in the sample submission.

The organizers hold the actual loan status for the test dataset, and your predictions are evaluated against it. Accuracy is the metric used to score your submission.
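
As a rough illustration of the metric, accuracy is the fraction of predictions that exactly match the true labels. The sketch below uses made-up labels, not contest data, and the `accuracy` helper is a hypothetical name introduced here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))

# Hypothetical labels, for illustration only (not contest data)
y_true = ['Y', 'N', 'Y', 'Y']
y_pred = ['Y', 'Y', 'Y', 'N']

score = accuracy(y_true, y_pred)
print(score)  # 2 of the 4 predictions match, so this prints 0.5
```

`sklearn.metrics.accuracy_score` computes the same quantity and is the more usual choice in practice.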



**Public and Private Split**   
The test file is further divided into Public (25%) and Private (75%) parts.

Your initial submissions are checked and scored on the Public data.
The final rankings are based on your Private score, which is published once the competition is over.

# Load the libraries

In [1]:
import numpy as np
import pandas as pd
import sklearn

# the fully qualified option name is required by recent pandas versions
pd.set_option('display.max_columns', 200)
SEED = 100

# Load the data

In [2]:
df_train = pd.read_csv('../data/raw/train.csv')
df_test_raw = pd.read_csv('../data/raw/test.csv')

df_train.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
def clean_data(df):
    df = df.copy()
    
    # remove the id column (the target, when present, is kept)
    df = df.drop(columns='Loan_ID')

    # missing values imputations
    m = {1.0:'Yes', 0.0:'No'}
    c = 'Credit_History'
    df[c] = df[c].map(m)
    df[c] = df[c].fillna('Unknown')
    
    
    m = {'0':'0', '1': '1', '2': '2', '3+':'3'}
    c = 'Dependents'
    df[c] = df[c].map(m)
    df[c] = df[c].fillna('Unknown')
    
    cols = ['Gender', 'Self_Employed','Married']
    for c in cols:
        df[c] = df[c].fillna('Unknown')
    
    
    cols = ['LoanAmount','Loan_Amount_Term']
    for c in cols:
        df[c] = df[c].fillna(df[c].mean())

        
    # one-hot encode the categorical columns
    cols = ['Gender', 'Married', 'Dependents',
            'Education', 'Self_Employed',
            'Credit_History', 'Property_Area']
    df = pd.get_dummies(df, columns=cols)
    
    return df
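
Note that `clean_data` fills numeric gaps with the mean of whichever frame it receives, so the test set is imputed with its own means. One possible variation, which is not what this notebook does, is to compute the means on the training data and reuse them for the test data. A minimal sketch with made-up toy frames:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, for illustration only (not contest data)
train = pd.DataFrame({'LoanAmount': [100.0, np.nan, 140.0]})
test = pd.DataFrame({'LoanAmount': [np.nan, 200.0]})

# Compute the imputation values once, on the training data only
train_means = train[['LoanAmount']].mean()

# Apply the same values to both frames; fillna matches on column name
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(test_filled)
```

Here the missing test value is filled with the training mean of the two observed values (100 and 140).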

In [4]:
df_train_raw = pd.read_csv('../data/raw/train.csv')
df_test_raw = pd.read_csv('../data/raw/test.csv')

df_train = clean_data(df_train_raw)
df_test = clean_data(df_test_raw)

# shuffle train
df_train = df_train.sample(frac=1,random_state=SEED)

print(df_train.shape)
print(df_test.shape)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([df_train.head(2), df_test.tail(2)])

(614, 27)
(367, 25)


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Loan_Status,Gender_Female,Gender_Male,Gender_Unknown,Married_No,Married_Unknown,Married_Yes,Dependents_0,Dependents_1,Dependents_2,Dependents_3,Dependents_Unknown,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Unknown,Self_Employed_Yes,Credit_History_No,Credit_History_Unknown,Credit_History_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
253,2661,7101.0,279.0,180.0,Y,0,1,0,0,0.0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,1,0
506,20833,6667.0,480.0,360.0,Y,0,1,0,0,0.0,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1
365,5000,2393.0,158.0,360.0,,0,1,0,0,,1,1,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0
366,9200,0.0,98.0,180.0,,0,1,0,1,,0,1,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0


In [5]:
pd.concat([df_train_raw.isna().sum(),
           df_test_raw.isna().sum()],axis=1).sort_values(0)

# There are no NaNs in the test data for Married,
# so clean_data creates no Married_Unknown column for test.

Unnamed: 0,0,1
Loan_ID,0,0.0
Education,0,0.0
ApplicantIncome,0,0.0
CoapplicantIncome,0,0.0
Property_Area,0,0.0
Loan_Status,0,
Married,3,0.0
Gender,13,11.0
Loan_Amount_Term,14,6.0
Dependents,15,10.0


In [6]:
# Add the missing Married_Unknown column to test and move it to the end
# of train, so that both frames end with Married_Unknown as the last column
df_test['Married_Unknown'] = 0
x = df_train['Married_Unknown'].to_numpy()
del df_train['Married_Unknown']
df_train['Married_Unknown'] = x

df_test.head(2)

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Gender_Female,Gender_Male,Gender_Unknown,Married_No,Married_Yes,Dependents_0,Dependents_1,Dependents_2,Dependents_3,Dependents_Unknown,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Unknown,Self_Employed_Yes,Credit_History_No,Credit_History_Unknown,Credit_History_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Married_Unknown
0,5720,0,110.0,360.0,0,1,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0
1,3076,1500,126.0,360.0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0
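
The manual column shuffle above works because only one dummy column differs between the two frames. A more general way to line up one-hot encoded frames is `DataFrame.reindex`, which adds any missing columns and enforces a common order in one step. A minimal sketch on made-up toy frames (in this notebook the target column would first need to be excluded from the train column list):

```python
import pandas as pd

# Hypothetical toy frames, for illustration only (not contest data)
train = pd.DataFrame({'Married': ['Yes', 'No', None], 'Income': [1, 2, 3]})
test = pd.DataFrame({'Married': ['Yes', 'No'], 'Income': [4, 5]})

# Same treatment as clean_data: fill missing categories, then one-hot encode
train = pd.get_dummies(train.fillna({'Married': 'Unknown'}), columns=['Married'])
test = pd.get_dummies(test.fillna({'Married': 'Unknown'}), columns=['Married'])

# Give test exactly the columns of train, in the same order;
# columns absent from test (here Married_Unknown) are filled with 0
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(list(test_aligned.columns))
```

This also drops any column that exists only in the test frame, which may or may not be what you want.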


In [7]:
# only Loan_Status should be missing, since test has no target column
pd.concat([df_train.head(2), df_test.tail(2)]).isna().sum().loc[lambda x: x>0]

Loan_Status    2
dtype: int64

In [8]:
target = 'Loan_Status'

# Modelling

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

In [10]:
params = {'min_samples_leaf': 17,
          'max_features': 0.95,
          'max_depth': 3,
          'learning_rate': 0.03,
          'n_estimators': 500}

model = GradientBoostingClassifier(**params)
# pass the axis by keyword; the positional form no longer works in pandas 2.0
model.fit(df_train.drop(columns=target), df_train[target])

GradientBoostingClassifier(learning_rate=0.03, max_features=0.95,
                           min_samples_leaf=17, n_estimators=500)
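
The model above is fitted on the full training set, so the notebook has no local estimate of the Accuracy it will score. One way to get such an estimate before submitting is k-fold cross-validation. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` in place of `df_train`, fewer trees (`n_estimators=50`) so it runs quickly, and a 5-fold stratified split chosen arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 100

# Synthetic stand-in for the training features and target (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=SEED)

# Same hyperparameters as above, except fewer trees and a fixed seed
model = GradientBoostingClassifier(min_samples_leaf=17,
                                   max_features=0.95,
                                   max_depth=3,
                                   learning_rate=0.03,
                                   n_estimators=50,
                                   random_state=SEED)

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print(scores.mean(), scores.std())
```

With the real data, `X` and `y` would be `df_train.drop(columns=target)` and `df_train[target]`.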

In [11]:
test_preds = model.predict(df_test)

In [12]:
df_sub = pd.DataFrame({'Loan_ID': df_test_raw['Loan_ID'],
                       'Loan_Status': test_preds})

df_sub.head(2)

Unnamed: 0,Loan_ID,Loan_Status
0,LP001015,Y
1,LP001022,Y


In [13]:
df_sub.to_csv('../outputs/sub_gbc.csv',index=False)

In [14]:
!head ../outputs/sub_gbc.csv

Loan_ID,Loan_Status
LP001015,Y
LP001022,Y
LP001031,Y
LP001035,Y
LP001051,Y
LP001054,Y
LP001055,Y
LP001056,N
LP001059,Y
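
Since the submission has to follow the sample submission format, a quick sanity check before uploading can catch simple mistakes. The sketch below assumes the expected format is the two columns written above (`Loan_ID`, `Loan_Status`) with `Y`/`N` labels; `check_submission` is a hypothetical helper, not part of the contest tooling.

```python
import pandas as pd

def check_submission(df_sub, expected_ids):
    """Hypothetical helper: raise ValueError if the frame looks malformed."""
    if list(df_sub.columns) != ['Loan_ID', 'Loan_Status']:
        raise ValueError('expected columns Loan_ID, Loan_Status')
    if list(df_sub['Loan_ID']) != list(expected_ids):
        raise ValueError('Loan_ID values do not match the test file order')
    if not df_sub['Loan_Status'].isin(['Y', 'N']).all():
        raise ValueError('Loan_Status must contain only Y or N')
    return True

# The first two rows of the submission shown above
ids = ['LP001015', 'LP001022']
df_demo = pd.DataFrame({'Loan_ID': ids, 'Loan_Status': ['Y', 'Y']})

ok = check_submission(df_demo, ids)
print(ok)
```

In the notebook, the call would be `check_submission(df_sub, df_test_raw['Loan_ID'])`.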
