<img src='LG_IMG.png' height="350" width="350"></img>

<h3> Problem </h3>

When a customer applies for a loan, banks and other credit providers use statistical models to determine whether or not to grant the loan based on the likelihood of the loan being repaid. 
The factors involved in determining this likelihood are complex, and extensive statistical analysis and modelling are required to predict the outcome for each individual case. 
So the problem we are tackling is to prepare a model that predicts loan repayment or default based on the data provided.

The dataset can be downloaded from the following link:

https://gallery.cortanaintelligence.com/Competition/Loan-Granting-Binary-Classification-1

<h3> Data Dictionary </h3>

The dataset consists of the following fields:

• Loan ID: A unique Identifier for the loan information.

• Customer ID: A unique identifier for the customer. Customers may have more than one loan.

• Loan Status: A categorical variable indicating if the loan was paid back or defaulted.

• Current Loan Amount: This is the loan amount that was either completely paid off, or the amount that was defaulted.

• Term: A categorical variable indicating if it is a short term or long term loan.

• Credit Score: A value between 0 and 800 indicating the riskiness of the borrower's credit history.

• Years in current job: A categorical variable indicating how many years the customer has been in their current job.

• Home Ownership: A categorical variable indicating home ownership. Values are "Rent", "Home Mortgage", and "Own". If the value is "Own", the customer is a home owner with no mortgage.

• Annual Income: The customer's annual income

• Purpose: A description of the purpose of the loan.

• Monthly Debt: The customer's monthly payment for their existing loans

• Years of Credit History: The years since the first entry in the customer’s credit history

• Months since last delinquent: Months since the last loan delinquent payment

• Number of Open Accounts: The total number of open credit cards

• Number of Credit Problems: The number of credit problems in the customer records.

• Current Credit Balance: The current total debt for the customer

• Maximum Open Credit: The maximum credit limit for all credit sources.

• Bankruptcies: The number of bankruptcies

• Tax Liens: The number of tax liens.

<b> For the analysis, let's import some useful libraries and functions to get more intuition </b>

In [4]:
##Importing useful files and libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import loangrantmodel


The full code of the loangrantmodel module is provided at the end of this notebook.

In [5]:
#loading data
data = loangrantmodel.read_data()
data.head()

Unnamed: 0,Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Years in current job,Home Ownership,Annual Income,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts,Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens
0,6cf51492-02a2-423e-b93d-676f05b9ad53,7c202b37-2add-44e8-9aea-d5b119aea935,Charged Off,12232,Short Term,7280.0,< 1 year,Rent,46643.0,Debt Consolidation,777.39,18.0,10.0,12,0,6762,7946,0.0,0.0
1,552e7ade-4292-4354-9ff9-c48031697d72,e7217b0a-07ac-47dd-b379-577b5a35b7c6,Charged Off,25014,Long Term,7330.0,10+ years,Home Mortgage,81099.0,Debt Consolidation,892.09,26.7,,14,0,35706,77961,0.0,0.0
2,9b5e32b3-8d76-4801-afc8-d729d5a2e6b9,0a62fc41-16c8-40b5-92ff-9e4b763ce714,Charged Off,16117,Short Term,7240.0,9 years,Home Mortgage,60438.0,Home Improvements,1244.02,16.7,32.0,11,1,11275,14815,1.0,0.0
3,5419b7c7-ac11-4be2-a8a7-b131fb6d6dbe,30f36c59-5182-4482-8bbb-5b736849ae43,Charged Off,11716,Short Term,7400.0,3 years,Rent,34171.0,Debt Consolidation,990.94,10.0,,21,0,7009,43533,0.0,0.0
4,1450910f-9495-4fc9-afaf-9bdf4b9821df,70c26012-bba5-42c0-8dcb-75295ada31bb,Charged Off,9789,Long Term,6860.0,10+ years,Home Mortgage,47003.0,Home Improvements,503.71,16.7,25.0,13,1,16913,19553,1.0,0.0


<h3> Data Preprocessing </h3>

Based on our exploratory data analysis, we applied the following pre-processing steps:

- Removing duplicates
- Handling wrong credit scores
- Handling the missing values
- Clipping values 
- Attribute merging 
- Converting the metadata of the features
- Mapping categorical attributes in a feature to numeric values
- Normalizing the numeric values


<h4> Removing duplicates </h4>

Each row describes a single loan, so according to the data dictionary every 'Loan ID' should appear only once. Rows that repeat a 'Loan ID' are therefore dropped.

In [6]:
data = loangrantmodel.remove_duplicates(data)
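
As a rough illustration of this step, the sketch below removes repeated loan identifiers from a small made-up frame. The column values are invented for the example; only the 'Loan ID' column name comes from the dataset.

```python
import pandas as pd

# Toy frame with a repeated 'Loan ID' (values are invented for illustration).
loans = pd.DataFrame({
    "Loan ID": ["a1", "a1", "b2"],
    "Current Loan Amount": [1000, 1000, 2500],
})

# Keep the first row seen for each loan identifier.
deduped = loans.drop_duplicates(subset="Loan ID")
print(len(loans), "->", len(deduped))
```

Keeping the first occurrence is the pandas default; the choice only matters if the repeated rows differ in other columns.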

<h4> Handling wrong credit scores </h4>

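Credit scores are defined on the range 0 - 800. 
In our EDA phase we observed values greater than 1000 (for example 7280 in the sample above), which most likely result from a typing error that added an extra digit. These values are divided by 10 to bring them back into range, and missing scores are filled with the median. 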

In [7]:
data = loangrantmodel.creditscore_process(data)
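
A minimal sketch of this correction on a toy series is shown below. It assumes, as the module does, that any score above 800 carries one extra digit and that missing scores can be filled with the median of the corrected values; the sample numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Toy credit scores: two out-of-range values and one missing value.
scores = pd.Series([7280.0, 710.0, np.nan, 7400.0, 650.0])

# Keep in-range scores; divide scores above 800 by 10 (assumed extra digit).
fixed = scores.where(scores <= 800, scores / 10)

# Fill the remaining gaps with the median of the corrected scores.
fixed = fixed.fillna(fixed.median())
print(fixed.tolist())
```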

<h4> Handling the missing values </h4>

In [8]:
data = loangrantmodel.annualincome_process(data)
data = loangrantmodel.fillingNull_process(data)
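
The two calls above use different fill strategies: annual income is filled with the column median, while count-like fields such as bankruptcies are filled with 0. The sketch below shows both on a made-up frame; the derived column names mirror the ones the module creates.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are invented).
df = pd.DataFrame({
    "Annual Income": [40000.0, np.nan, 60000.0],
    "Bankruptcies": [0.0, np.nan, 1.0],
})

# Median fill for a skewed monetary field.
df["new_annualincome"] = df["Annual Income"].fillna(df["Annual Income"].median())

# Zero fill for a count, treating "missing" as "none recorded" (an assumption).
df["new_bankruptcies"] = df["Bankruptcies"].fillna(0)
print(df)
```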

<h4> Clipping values </h4>

In [9]:
data = loangrantmodel.currentloanamount_process(data)
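
The module treats loan amounts of 99999998 or more as placeholder values and replaces them with the column median. A small sketch of that idea on invented amounts, using the same threshold, might look like this:

```python
import pandas as pd

# Toy loan amounts; 99999999 stands in for a placeholder value.
amounts = pd.Series([12232.0, 99999999.0, 25014.0, 16117.0, 9789.0])

# Replace placeholder amounts with the median of the column.
clipped = amounts.mask(amounts >= 99999998, amounts.median())
print(clipped.tolist())
```

Here the median is computed over the whole column, placeholders included, which matches the module; computing it over the valid rows only would be a reasonable alternative.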

<h4> Attribute merging </h4>

In [10]:
data = loangrantmodel.homeownership_process(data)

<h4> Converting the metadata of the features </h4>

In [11]:
data = loangrantmodel.convert_numeric_process(data)
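
The original module relies on `convert_objects`, which has been removed from recent pandas releases. A sketch of the same conversion with `pd.to_numeric` is given below; the input strings are invented, and unparseable entries become NaN so they can be filled later.

```python
import pandas as pd

# Toy 'Monthly Debt' column read as text, with one malformed entry.
raw = pd.Series(["777.39", "$1051.41", "892.09"])

# errors="coerce" turns anything that is not a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```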

<h4> Mapping categorical attributes in a feature to numeric values </h4>

In [12]:
data = loangrantmodel.fillingnan_md_moc_process(data)
data = loangrantmodel.purpose_process(data)
data = loangrantmodel.responsevariable_process(data)
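
The purpose mapping groups similar loan purposes under one integer code. The sketch below shows the same idea with a dictionary and `Series.map`, using a subset of the groups defined in the module; the sample rows are invented.

```python
import pandas as pd

# Toy 'Purpose' values.
purpose = pd.Series(["Debt Consolidation", "Business Loan", "Home Improvements", "other"])

# Subset of the grouping used in purpose_process.
groups = {
    "other": 0,
    "Business Loan": 2,
    "Home Improvements": 3,
    "Debt Consolidation": 4,
}

# Categories missing from the dictionary would become NaN.
codes = purpose.map(groups)
print(codes.tolist())
```

A single dictionary keeps the grouping in one place, which makes it easier to audit than a chain of `replace` calls.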

Looking at the pre-processed data

In [13]:
data.head(10)

Unnamed: 0,Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Years in current job,Home Ownership,Annual Income,Purpose,...,Tax Liens,new_creditscore,new_annualincome,new_bankruptcies,new_taxliens,new_msld,new_monthlydebt,new_maximumopencredit,new_purpose,new_loanstatus
17782,87b6a064-b524-4cff-a968-b176bdc70075,000bbb5d-3a62-4712-908e-caacd7a815d5,Fully Paid,33231.0,Short Term,,2 years,Rent,,Debt Consolidation,...,0.0,730.0,61494.0,0.0,0.0,0.0,1334.28,68602.0,4,2.0
17783,4850727e-1ab2-4af1-b269-a95dd99ae975,4498fc97-e3f6-4789-b81b-c971fe967bb3,Fully Paid,23609.0,Short Term,,2 years,Rent,,Debt Consolidation,...,0.0,730.0,61494.0,0.0,0.0,0.0,86.91,13772.0,4,2.0
17784,54a17b9a-c581-4f08-bc2f-21e2765e57c6,2b0ca10b-5a2c-4521-9c3f-7a55de743006,Charged Off,35651.0,Long Term,,9 years,Rent,,Debt Consolidation,...,0.0,730.0,61494.0,0.0,0.0,73.0,1489.53,10183.0,4,1.0
17785,02fa33d7-7fab-4f13-b30a-418d12221ba1,5c4c3d3b-6199-402b-8d36-0952ffcc700f,Charged Off,8086.0,Long Term,,10+ years,Home Mortgage,,Debt Consolidation,...,0.0,730.0,61494.0,1.0,0.0,46.0,662.09,22219.0,4,1.0
17786,42d53764-b5f5-4161-8132-dc1ac88efa62,c7be3971-5532-4ba8-b01a-caad8c48f034,Fully Paid,10977.0,Short Term,,10+ years,Rent,,Debt Consolidation,...,0.0,730.0,61494.0,1.0,0.0,0.0,968.12,10370.0,4,2.0
17787,8a56a292-8fbd-442d-81e0-4d4a0efd64db,51786c25-90c4-4707-bca2-829b8e54ae17,Fully Paid,15231.0,Short Term,,3 years,Rent,,Debt Consolidation,...,0.0,730.0,61494.0,1.0,0.0,0.0,794.04,25915.0,4,2.0
17788,d77b643f-f69c-482a-a5ae-c202b9c71836,4776e4fb-d0c7-4dc6-8627-554733df8561,Fully Paid,13972.0,Long Term,,10+ years,Home Mortgage,,Debt Consolidation,...,0.0,730.0,61494.0,0.0,0.0,0.0,1439.68,24430.0,4,2.0
17789,938d7ae6-4de8-41ae-8907-1e871484a377,f823e5a0-ec22-4c16-80d5-c54ca952ee67,Fully Paid,13838.0,Short Term,,7 years,Home Mortgage,,Debt Consolidation,...,0.0,730.0,61494.0,0.0,0.0,0.0,1347.21,36184.0,4,2.0
17790,7586ae85-3dd4-4271-9346-a7d129eca484,abb2b12d-5657-4ba6-a713-a44b25ee131d,Charged Off,4953.0,Short Term,,3 years,Rent,,Debt Consolidation,...,0.0,730.0,61494.0,0.0,0.0,21.0,135.2,2775.0,4,1.0
17791,f229d187-3d18-47f4-bef8-d7c2172ccbd0,808889cf-fd4e-4b95-bc40-74a1ea63f7f8,Charged Off,7825.0,Short Term,,< 1 year,Rent,,Business Loan,...,0.0,730.0,61494.0,0.0,0.0,0.0,198.55,8314.0,2,1.0


<h3> Modeling & Evaluation </h3>

We tried different ML algorithms to build a predictive model that works well for the data. 
Since it's a classification problem, we tried out multiple classifiers such as the following:

- Ensemble Model (Random Forest)
- Naive Bayes
- Ensemble Model (Ada Boost)

All of the above ML classifiers were evaluated with <b><i> k-fold cross validation </i></b>, in which each fold is held out once for testing while the model is trained on the remaining folds.
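
To make the evaluation procedure concrete, the sketch below runs 5-fold cross validation with scikit-learn's `cross_val_score` on a synthetic dataset. The data, classifier settings, and random seed are assumptions for illustration and are unrelated to the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# One accuracy score per fold; each fold is used once as the test set.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores)
print(scores.mean())
```

The mean of the per-fold scores is the figure reported for each model in the cells that follow.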

In [14]:
help(loangrantmodel.randomforest)
help(loangrantmodel.randomforest_crossvalidation)
help(loangrantmodel.naivebayes_crossvalidation)
help(loangrantmodel.adaboost_crossvalidation)

Help on function randomforest in module loangrantmodel:

randomforest(data, features, trees=10)

Help on function randomforest_crossvalidation in module loangrantmodel:

randomforest_crossvalidation(data, features, trees=10, folds=5)

Help on function naivebayes_crossvalidation in module loangrantmodel:

naivebayes_crossvalidation(data, features, folds=5)

Help on function adaboost_crossvalidation in module loangrantmodel:

adaboost_crossvalidation(data, features, est=50, folds=5)



In [15]:
include_features = ['new_loanstatus','Years of Credit History','new_creditscore','new_annualincome','new_bankruptcies','new_taxliens','new_msld','new_monthlydebt','new_maximumopencredit','new_purpose']

In [16]:
sc_rf = loangrantmodel.randomforest_crossvalidation(data, include_features, trees = 150, folds = 10)

[ 0.70130454  0.69748088  0.71063878  0.71308064  0.71375548  0.71488022
  0.71533011  0.71878515  0.71451069  0.71226097]


In [17]:
sc_rf.mean()

0.71120274583965659

In [18]:
sc_nb = loangrantmodel.naivebayes_crossvalidation(data,include_features,folds= 10)

[ 0.71648673  0.71502474  0.65890688  0.61680351  0.60409403  0.54425824
  0.63142504  0.6064117   0.61169854  0.35073116]


In [19]:
sc_nb.mean()

0.60558405604590315

In [20]:
sc_adaboost = loangrantmodel.adaboost_crossvalidation(data,include_features)

[ 0.71680819  0.72518698  0.71904173  0.7266183   0.71773241]


In [21]:
sc_adaboost.mean()

0.72107752058876984

In [22]:
d = {'Model' : pd.Series(['NB','RF','ADABOOST']),'Accuracy' : pd.Series([sc_nb.mean(),sc_rf.mean(),sc_adaboost.mean()])}

In [23]:
df_accuracy = pd.DataFrame(d)

In [24]:
df_accuracy.head()

Unnamed: 0,Accuracy,Model
0,0.605584,NB
1,0.711203,RF
2,0.721078,ADABOOST


<h4> Full Code </h4>

In [None]:
# %load loangrantmodel.py
#!/usr/bin/env python2
"""
Created on Thu Nov 16 00:55:54 2017

@author: ahadmushir
@description: Loan Grant model structure 
"""

import pandas as pd
import numpy as np

cache = dict() #for Sanity Cache

def read_data():
    data = pd.read_csv("original_data.csv")
    return data

ddata = read_data()

def remove_duplicates(data):
    d1 = data.drop_duplicates('Loan ID')
    return d1

#After removing the duplicate rows 
def creditscore_process(data):
    data['new_creditscore'] = data['Credit Score']
    d2 = data.loc[data['new_creditscore'] > 800 ]
    d2['new_creditscore'] = (d2['new_creditscore'] / 10)
    
    d3 = pd.concat([data,d2], axis=0)
    d3 = d3.drop_duplicates('Loan ID',keep='last')
    
    temp_df = d3['new_creditscore']
    temp_df = temp_df.dropna()
    med = np.median(temp_df)
    
    d3['new_creditscore'] = d3['new_creditscore'].fillna(med)
    # Cache statistics of the corrected, median-filled column (d3), not the raw input
    sanity_cache(d3,'new_creditscore')
    
    return d3

#After dealing with credit score
def annualincome_process(data):
    temp_df = data['Annual Income']
    temp_df = temp_df.dropna()
    
    med = np.median(temp_df)
    data['new_annualincome'] = data['Annual Income'].fillna(med)
    sanity_cache(data, 'new_annualincome')
        
    return data

#Dealing with features of 'Bankruptcies', 'Tax Liens' & 'Months since last delinquent'
def fillingNull_process(data):
    data['new_bankruptcies'] = data['Bankruptcies'].fillna(0)
    data['new_taxliens'] = data['Tax Liens'].fillna(0)
    data['new_msld'] = data['Months since last delinquent'].fillna(0)
    
    sanity_cache(data, 'new_bankruptcies')
    sanity_cache(data, 'new_taxliens')
    sanity_cache(data, 'new_msld')
    
    return data

#After dealing with Annual Income 
#Clip values of Current Loan Amount
def currentloanamount_process(data):
    temp_df = data.loc[data['Current Loan Amount'] >= 99999998]
    temp_df['Current Loan Amount'] = np.median(data['Current Loan Amount'])
    d1 = pd.concat([data,temp_df])
    d1 = d1.drop_duplicates('Loan ID',keep='last')
    return d1

#Dealing with Home ownership feature
def homeownership_process(data):
    data.loc[data['Home Ownership'] == 'HaveMortgage', 'Home Ownership'] = "Home Mortgage" 
    return data

#Dealing with features 'Monthly Debt', 'Maximum Open Credit' & 'MSLD'
def convert_numeric_process(data):
    
    # convert_objects was removed from pandas; pd.to_numeric coerces bad entries to NaN
    data['Monthly Debt'] = pd.to_numeric(data['Monthly Debt'], errors='coerce')
    data['Maximum Open Credit'] = pd.to_numeric(data['Maximum Open Credit'], errors='coerce')
#    data['Months since last delinquent'] = data['Months since last delinquent'].convert_objects(convert_numeric=True)
    
    
    return data

#Monthly Debt & MaxOpenCredit ~ Missing values
##
def fillingnan_md_moc_process(data):
    temp_df = data['Monthly Debt']
    temp_df = temp_df.dropna()
    med1 = np.median(temp_df)
    
    temp_df1 = data['Maximum Open Credit']
    temp_df1 = temp_df1.dropna()
    med2 = np.median(temp_df1)
    
    data['new_monthlydebt'] = data['Monthly Debt'].fillna(med1)
    data['new_maximumopencredit'] = data['Maximum Open Credit'].fillna(med2)
    
    return data

def purpose_process(data):
    data['new_purpose'] = data['Purpose'] 
    data['new_purpose']= data['new_purpose'].replace('other','0')
    data['new_purpose']= data['new_purpose'].replace('Other','0')
    data['new_purpose']= data['new_purpose'].replace('major_purchase','0')
    
    data['new_purpose']= data['new_purpose'].replace('moving','1')
    data['new_purpose']= data['new_purpose'].replace('vacation','1')
    data['new_purpose']= data['new_purpose'].replace('Take a Trip','1')
    
    data['new_purpose']= data['new_purpose'].replace('Business Loan','2')
    data['new_purpose']= data['new_purpose'].replace('small_business','2')
    
    data['new_purpose']= data['new_purpose'].replace('Home Improvements','3')
    data['new_purpose']= data['new_purpose'].replace('Buy House','3')
    data['new_purpose']= data['new_purpose'].replace('Buy a Car','3')
    
    data['new_purpose']= data['new_purpose'].replace('Debt Consolidation','4')

    data['new_purpose']= data['new_purpose'].replace('Educational Expenses','5')
    data['new_purpose']= data['new_purpose'].replace('renewable_energy','5')
    data['new_purpose']= data['new_purpose'].replace('wedding','5')
    data['new_purpose']= data['new_purpose'].replace('Medical Bills','5')
    data['new_purpose'] = pd.to_numeric(data['new_purpose'], errors='coerce')
    return data

def responsevariable_process(data):
    # Encode the target: 'Charged Off' -> 1, 'Fully Paid' -> 2
    data['new_loanstatus'] = data['Loan Status'].str.strip().replace({'Charged Off': '1', 'Fully Paid': '2'})
    data['new_loanstatus'] = pd.to_numeric(data['new_loanstatus'], errors='coerce')
    # Any other or missing status is treated as 'Charged Off'
    data['new_loanstatus'] = data['new_loanstatus'].fillna(1)
    
    return data
##Locked!

#TODO: Working here!
##
##MIN MAX Normalizing 
def norm_process(data,features):
    from sklearn import preprocessing

    data = data[features]
    response_variable = data['new_loanstatus']
    data = data.drop('new_loanstatus', axis=1)
    
    x = data.values #returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    data = pd.DataFrame(x_scaled)
    data = pd.concat([data,response_variable],axis = 1)
    
    return data

##
##
#Functions for our model sanity check
def sanity_cache(data,feature):
    temp_l = list()
    temp_mean = np.mean(data[feature])
    temp_med = np.median(data[feature])
    
    temp_l.append(temp_mean)
    temp_l.append(temp_med)
    
    cache[feature] = (temp_l)
    
    return 'Saved in Cache'
     

def sanity_check(data,features):
    result = ''
    for i in features:
        if i in cache:
            if (cache[i][0] == np.mean(data[i]) and cache[i][1] == np.median(data[i])):
                result = ('Values distribution is same! for feature ' + str(i))
                print(result)
            else:
                result = ('Values distribution is not the same! for feature ' + str(i))
                print(result + ' result should be ' + str(cache[i][0]) + ',' + str(cache[i][1]))
        else:
            result = ('Feature: ' + str(i) + ' not present in cache')
            print(result)
            
    return result

def roc(y_test, y_pred_prob):
    from sklearn import metrics
    import matplotlib.pyplot as plt
    
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for loan status classifier')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    print('AUC: ' + str(metrics.roc_auc_score(y_test, y_pred_prob)))
    print ('Accuracy: ' + str(metrics.accuracy_score(y_test, y_pred_prob)))

def randomforest(data, features, trees = 10):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    
    rf = RandomForestClassifier(n_estimators=trees)
    final_df = data[features]
    train_set, test_set = train_test_split(final_df, test_size = 0.2)
    # Separate the target from the predictors so the label never leaks into the model inputs
    target = train_set['new_loanstatus']
    test_target = test_set['new_loanstatus']
    train_set = train_set.drop('new_loanstatus', axis=1)
    test_set = test_set.drop('new_loanstatus', axis=1)
    
    rf.fit(train_set,target.values.ravel())
    predicted_value = rf.predict(test_set)
    #print('AUC: ' + str(metrics.roc_auc_score(test_target, predicted_value)))
    print ('Accuracy: ' + str(metrics.accuracy_score(test_target, predicted_value)))
    print (metrics.confusion_matrix(test_target, predicted_value))
    #roc(test_target,predicted_value)
    
    return rf

def randomforest_crossvalidation(data, features, trees = 10, folds =5 ):
    # sklearn.cross_validation was removed; model_selection provides cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    
    rf = RandomForestClassifier(n_estimators=trees)
    final_df = data[features]

    train_set = final_df
    target = train_set[['new_loanstatus']]
    train_set = train_set.drop('new_loanstatus', axis=1)

    scores = cross_val_score(rf, train_set, target.values.ravel(), cv=folds)
    print(scores)
    
    return scores

def naivebayes_crossvalidation(data, features, folds = 5 ):
    # sklearn.cross_validation was removed; model_selection provides cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score
    
    gnb = GaussianNB()
    
    final_df = data[features]

    train_set = final_df
    target = train_set[['new_loanstatus']]
    train_set = train_set.drop('new_loanstatus', axis=1)

    scores = cross_val_score(gnb, train_set, target.values.ravel(), cv=folds)
    print(scores)
    
    return scores

def adaboost_crossvalidation(data,features , est = 50, folds = 5):
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    
    clf = AdaBoostClassifier(n_estimators=est)
    
    final_df = data[features]

    train_set = final_df
    target = train_set[['new_loanstatus']]
    train_set = train_set.drop('new_loanstatus', axis=1)

    scores = cross_val_score(clf, train_set, target.values.ravel(), cv=folds)    
    print(scores)
    
    return scores

ddata = responsevariable_process(purpose_process(fillingnan_md_moc_process(convert_numeric_process(homeownership_process(currentloanamount_process(fillingNull_process(annualincome_process(creditscore_process(remove_duplicates(ddata))))))))))
include_features = ['new_loanstatus','Years of Credit History','new_creditscore','new_annualincome','new_bankruptcies','new_taxliens','new_msld','new_monthlydebt','new_maximumopencredit','new_purpose']
ddata_norm = norm_process(ddata,include_features)
include_features_norm = [0, 1, 2, 3, 4, 5, 6, 7, 8, u'new_loanstatus']

##0.711 accuracy mean from rf_crossvalidation (trees= 150, folds= 10)
##0.605 accuracy mean from naivebayes_crossvalidation
##0.721 accuracy mean from adaboost_crossvalidation (est = 50, folds = 5)
##0.720 accuracy mean from adaboost_crossvalidation (est = 1500, folds = 10)
##0.710 accuracy mean from rf_crossvalidations (est = 250, folds = 10) with Norm
##0.716 accuracy mean from adaboost_crossvalidation (est = 250, folds = 10) with Norm


   