# Data Challenge - Fair
***Arpita Jena***

### Importing required packages

In [31]:
import pandas as pd
import numpy as np

pd.options.display.max_rows = 1000
pd.options.display.max_columns = 20

import matplotlib.pyplot as plt
%matplotlib inline

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.plotly as py
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, cross_validation
from sklearn.model_selection import cross_validate,train_test_split,GridSearchCV
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

### Loading data

In [43]:
raw_data = pd.read_csv('train_data.csv')
raw_data.head(5)

Unnamed: 0,id,loan_status,mths_since_last_major_derog,revol_bal,purpose,addr_state,title,home_ownership,application_type,verification_status_joint,earliest_cr_line,apply_date,verification_status,emp_length,dti,emp_title,annual_inc
0,63480419,Current,,15954,debt_consolidation,KS,Debt consolidation,MORTGAGE,INDIVIDUAL,,5-Aug,15-Nov,Not Verified,5 years,24.85,Branch Office Administrator,62000.0
1,51386490,Current,,73814,debt_consolidation,MD,Debt consolidation,MORTGAGE,INDIVIDUAL,,Sep-83,15-Jun,Verified,10+ years,26.38,IIntelligence Analyst,113000.0
2,13567696,Current,,30013,major_purchase,TX,Major purchase,OWN,INDIVIDUAL,,Dec-99,14-Apr,Not Verified,10+ years,14.41,Global Service Delivery Lead,180000.0
3,22252931,Fully Paid,,10768,credit_card,DE,Credit card refinancing,MORTGAGE,INDIVIDUAL,,1-Sep,14-Jul,Not Verified,10+ years,24.31,Operations Manager,66000.0
4,6539569,Fully Paid,,35551,home_improvement,WI,Home Improvement,MORTGAGE,INDIVIDUAL,,Dec-92,13-Aug,Verified,2 years,1.7,Coram Specialty Infusion,110000.0


### Checking for null values

In [3]:
print(raw_data.isnull().sum())

id                                  0
loan_status                         0
mths_since_last_major_derog    150255
revol_bal                           0
purpose                             0
addr_state                          0
title                              36
home_ownership                      0
application_type                    0
verification_status_joint      199889
earliest_cr_line                   11
apply_date                          0
verification_status                 0
emp_length                      10265
dti                                 0
emp_title                       11793
annual_inc                          2
dtype: int64


As we can see, some columns have a high number of null values *(mths_since_last_major_derog, verification_status_joint, emp_length, emp_title)*. Out of 200,000 rows, mths_since_last_major_derog is missing in about 75% and verification_status_joint in about 99.9%, which leaves very few data points to impute from. One approach is to remove those columns from the data. The other three columns *(title, earliest_cr_line, annual_inc)* have very few missing values. We can either impute them or remove those observations, since the proportion is very low.
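
The drop-or-impute decision above can be made more systematic by looking at the share of missing values per column rather than the raw counts. The snippet below is a minimal sketch on a toy frame; the helper name `null_share`, the toy column names and the 0.5 cutoff are all hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def null_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# Toy data standing in for the loan table (hypothetical columns)
demo = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'rarely_missing': [1.0, 2.0, np.nan, 4.0],
    'complete': [1, 2, 3, 4],
})

shares = null_share(demo)
# Assumed rule of thumb: flag columns with more than half their values missing
to_drop = shares[shares > 0.5].index.tolist()
print(shares)
print(to_drop)
```

On the real data the same call would be `null_share(raw_data)`; where to put the cutoff is a judgment call.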

# Q1

## Feature Engineering
### Converting to datetime format

In [4]:
# Convert 'earliest_cr_line' to date format
cr_line = raw_data.earliest_cr_line.astype(str).values
for i,x in enumerate(cr_line):
    if x =='nan':
        cr_line[i]=np.nan
    elif not x[-3:].isalpha():
        if len(x)==5:
            cr_line[i] = '0' + '-'.join(x.split('-')[::-1])
        else:
            cr_line[i] = '-'.join(x.split('-')[::-1])
    else:
        if len(x)==5:
            cr_line[i] = '0' + cr_line[i]
            
raw_data['earliest_cr_line'] = cr_line
raw_data['earliest_cr_line'] = pd.to_datetime(raw_data.earliest_cr_line, format = '%y-%b')

In [5]:
# Convert 'apply_date' to date format
raw_data['apply_date'] = raw_data.apply_date.apply(lambda x: '0'+x if len(x)==5 else x )
raw_data['apply_date'] = pd.to_datetime(raw_data.apply_date,format = '%y-%b')
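
One caveat with the `%y` format used above: two-digit years from 00 to 68 are read as 2000 to 2068, so a credit line opened in 1965 would be parsed as 2065. The sketch below shows one possible guard, assuming no valid date can lie after a chosen cutoff. The function name `parse_yy_mon` and the `latest_valid` default are assumptions for illustration only.

```python
import pandas as pd

def parse_yy_mon(values, latest_valid='2016-12-31'):
    """Parse 'yy-Mon' strings and roll back any date that lands after the cutoff."""
    parsed = pd.to_datetime(pd.Series(values), format='%y-%b')
    # Two-digit years 00-68 are read as 2000-2068, so '65-Jan' becomes 2065
    in_future = parsed > pd.Timestamp(latest_valid)
    # Shift only the impossible (future) dates back by one century
    return parsed.where(~in_future, parsed - pd.DateOffset(years=100))

demo = parse_yy_mon(['65-Jan', '83-Sep', '05-Aug'])
years = demo.dt.year.tolist()
print(years)
```

Without such a guard, `cr_line_age` would come out negative for the affected rows.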

In [6]:
# Creating a new feature 'cr_line_age': how old the credit line is
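# Drop rows without a credit-line date; .copy() avoids assigning into a view of the original frame
raw_data = raw_data[raw_data.earliest_cr_line.notnull()].copy()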
raw_data['cr_line_age'] = np.round(raw_data['apply_date'].\
                                   sub(raw_data['earliest_cr_line'], axis=0) / np.timedelta64(1, 'M'))
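
Dividing by `np.timedelta64(1, 'M')` relies on an average month length and may be rejected by newer pandas versions. Since both dates fall on the first of a month, the age can also be computed from the calendar fields directly. This is a sketch of that alternative, not the method used for the results in this notebook; `months_between` and the demo series are hypothetical names.

```python
import pandas as pd

def months_between(later, earlier):
    """Whole calendar months from `earlier` to `later` (both datetime Series)."""
    return (later.dt.year - earlier.dt.year) * 12 + (later.dt.month - earlier.dt.month)

apply_demo = pd.to_datetime(pd.Series(['2015-11-01', '2014-04-01']))
first_line_demo = pd.to_datetime(pd.Series(['2005-08-01', '1999-12-01']))

age_demo = months_between(apply_demo, first_line_demo).tolist()
print(age_demo)
```

On the real data this would read `months_between(raw_data['apply_date'], raw_data['earliest_cr_line'])`.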

In [7]:
# Extracting month and year fields from 'apply_date','earliest_cr_line'
dates = ['apply_date','earliest_cr_line']
attr = ['year', 'month']

for date_field in dates:
    for field in attr:
        raw_data[date_field + '_' + field] = getattr(raw_data[date_field].dt, field)

In [8]:
# categorizing variable 'emp_length'
emp_length_map = {
    np.nan:-1,
    '< 1 year':0,
    '1 year':1,
    '2 years':2,
    '3 years':3,
    '4 years':4,
    '5 years':5,
    '6 years':6,
    '7 years':7,
    '8 years':8,
    '9 years':9,
    '10+ years':10
}
# Treat 'n/a' as missing, map the labels, and send anything missing or unmapped to -1
raw_data['emp_length'] = raw_data['emp_length'].replace('n/a', np.nan).map(emp_length_map).fillna(-1).astype(int)

### Missing values imputation

In [9]:
raw_data['mths_since_last_major_derog'] = raw_data['mths_since_last_major_derog'].fillna(0)
raw_data['annual_inc'] = raw_data['annual_inc'].fillna(raw_data['annual_inc'].median())

Assumptions for the above:
1. A missing mths_since_last_major_derog means the borrower has no bad (90-day or worse) rating on record, so it is filled with 0; borrowers who do have such a rating carry a numerical value greater than 0.
2. The median is a good measure for imputing continuous variables (i.e. annual_inc).

In [10]:
# Let's create categories for annual_income assuming most of the bad loans fall under low income category
# 0: up to 100k, 1: above 100k and up to 200k, 2: above 200k
raw_data['inc_cat'] = np.nan
raw_data.loc[raw_data['annual_inc'] <= 100000, 'inc_cat'] = 0
raw_data.loc[(raw_data['annual_inc'] > 100000) & (raw_data['annual_inc'] <= 200000), 'inc_cat'] = 1
raw_data.loc[raw_data['annual_inc'] > 200000, 'inc_cat'] = 2

### Relabelling target column loan_status

In [11]:
print(raw_data.loan_status.unique())
raw_data.loan_status.value_counts()

['Current' 'Fully Paid' 'Charged Off' 'In Grace Period'
 'Late (31-120 days)' 'Issued'
 'Does not meet the credit policy. Status:Charged Off' 'Late (16-30 days)'
 'Default' 'Does not meet the credit policy. Status:Fully Paid']


Current                                                135528
Fully Paid                                              46837
Charged Off                                             10270
Late (31-120 days)                                       2626
Issued                                                   1919
In Grace Period                                          1411
Late (16-30 days)                                         509
Does not meet the credit policy. Status:Fully Paid        443
Default                                                   288
Does not meet the credit policy. Status:Charged Off       158
Name: loan_status, dtype: int64

I have assumed *'Charged Off', 'Late (31-120 days)', 'Default', 'Does not meet the credit policy. Status:Charged Off'* fall under the default category and all other statuses are non-default loans.

In [12]:
mapping = {'Current':0, 'Fully Paid':0, 'Charged Off':1,\
           'In Grace Period':0, 'Late (31-120 days)':1, 'Issued':0,\
           'Does not meet the credit policy. Status:Charged Off':1,\
           'Late (16-30 days)':0, 'Default':1,\
           'Does not meet the credit policy. Status:Fully Paid':0}
# Keep the original text labels in 'status' for the EDA plots, then binarize 'loan_status'
raw_data['status'] = raw_data['loan_status'].astype('category')
raw_data['loan_status'] = raw_data['loan_status'].map(mapping)
raw_data.loan_status.value_counts()

0    186647
1     13342
Name: loan_status, dtype: int64

Only about 6.7% of the loans (13,342 of 199,989) are labelled as defaults. This is a high class imbalance, which we need to treat for better model performance.

In [13]:
raw_data.application_type.value_counts()

INDIVIDUAL    199878
JOINT            111
Name: application_type, dtype: int64

### Data Cleaning

Assuming **'title', 'application_type', 'verification_status_joint', 'emp_title', 'id', 'emp_length'** would not contribute much to the models, I removed them from the data. Before removing them I performed some EDA to check these assumptions, and they held up. For example, there is not enough data to see the effect of joint applications on default loans, since they only started in October 2015 (111 of 199,989 loans). Also, annual_inc is a better indicator than emp_title. The raw date columns apply_date and earliest_cr_line are dropped as well, because the year, month and cr_line_age features derived from them are kept.

In [38]:
data = raw_data.copy()
data.drop(['title','application_type', 'verification_status_joint',\
           'emp_title','id', 'emp_length', 'apply_date', 'earliest_cr_line'], axis=1, inplace=True)

## Exploratory data analysis

In [14]:
count_loan_status = raw_data.groupby('apply_date')['loan_status'].mean().reset_index().sort_values(by='apply_date')

# Use a separate name for the plot traces so the modelling DataFrame `data` is not overwritten
traces = [go.Scatter(x=count_loan_status.apply_date, y=count_loan_status.loan_status)]

layout = dict(
    title = "Average Default per Month",
    xaxis=dict(
        title='Month'
    ),
    yaxis=dict(
        title='Average default'
    )
)

fig = dict(data=traces, layout=layout)
iplot(fig)

The graph above gives an intuitive view of borrower behavior over time. The default rate appears to fall for more recent application dates.

In [15]:
count_inc_status = raw_data.groupby('apply_date')['annual_inc'].mean().reset_index().sort_values(by='apply_date')

# Use a separate name for the plot traces so the modelling DataFrame `data` is not overwritten
traces = [go.Scatter(x=count_inc_status.apply_date, y=count_inc_status.annual_inc)]

layout = dict(
    title = "Average income of borrowers",
    xaxis=dict(
        title='Month'
    ),
    yaxis=dict(
        title='Average income'
    )
)

fig = dict(data=traces, layout=layout)
iplot(fig)

Seeing the increasing trend in borrowers' annual income, I felt this could be one possible reason for the lower default rate over time.

In [16]:
count_status = raw_data.groupby(['apply_date','status'])['id'].count().reset_index().sort_values(by='apply_date')

# One trace per loan status; `traces` keeps the modelling DataFrame `data` intact
traces = []
for cat in count_status.status.unique():
    traces.append(
        go.Scatter(x=count_status[count_status.status==cat].apply_date,
                   y=count_status[count_status.status==cat].id,
                   name = cat))

layout = dict(
    title = "Number of Loans Extended for each Loan Status per Month",
    xaxis=dict(
        title='Month'
    ),
    yaxis=dict(
        title='Number of loans'
    )
)

fig = dict(data=traces, layout=layout)
iplot(fig)

The above shows that more loans were extended in recent years, which also helps explain the lower default rate in recent years: some of these loans are at a very early stage and have not yet had time to be classified as default or non-default.

In [17]:
purpose_loan = raw_data.groupby('purpose')['loan_status'].mean().reset_index().sort_values(by='loan_status')

# Use a separate name for the plot traces so the modelling DataFrame `data` is not overwritten
traces = [go.Bar(
    x=purpose_loan.purpose,
    y=purpose_loan.loan_status,
    name='Average Default Loans per Loan Purpose',
    marker=dict(
        color='rgb(49,130,189)'
    )
)]
layout = dict(
    barmode='group',
    title = "Average Default Loans per Loan Purpose",
    xaxis=dict(
        title='Purpose'
    ),
    yaxis=dict(
        title='Default rate'
    )
)

fig = dict(data=traces, layout=layout)
iplot(fig)

Based on the graph above, small business and educational loans are among the purposes with the highest default rates. The former might be driven by poor business performance; the latter might have reasons such as students dropping out midway or not finding a proper job after graduation to pay back the loan.

In [18]:
count_addr_state = raw_data.groupby('addr_state').size().reset_index(name = 'frequency').\
sort_values('frequency')
count_default_state = raw_data[raw_data['loan_status'] == 1].groupby('addr_state').\
size().reset_index(name = 'defaults').sort_values('defaults')
df = count_addr_state.join(count_default_state.set_index(['addr_state']), lsuffix='_left',\
                               rsuffix='_right', on=['addr_state'], how = 'left')
df['ratio'] = df['defaults']/df['frequency']
df['ratio'] = df['ratio'].fillna(0)

for col in df.columns:
    df[col] = df[col].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

df['text'] = df['addr_state'] + '<br>' + 'ratio: '+ df['ratio'] +'<br>'+'loans: '+ df['frequency']\
+'<br>'+'defaults: '+ df['defaults']

# Use a separate name for the plot traces so the modelling DataFrame `data` is not overwritten
traces = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = df['addr_state'],
        z = df['ratio'].astype(float),
        locationmode = 'USA-states',
        text = df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Default ratio")
        ) ]

layout = dict(
        title = 'Statewise default proportion',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=traces, layout=layout )
iplot( fig, filename='d3-cloropleth-map' )

We can see groups of states with comparable default proportions. Nevada, California, Florida and Virginia are some of the high-default states.

In [19]:
home_loan = raw_data.groupby(['home_ownership'])['loan_status'].mean().reset_index().sort_values(by='loan_status')

# Use a separate name for the plot traces so the modelling DataFrame `data` is not overwritten
traces = [go.Bar(
    x=home_loan.home_ownership,
    y=home_loan.loan_status,
    name='Average Default Loans per Home Ownership',
    marker=dict(
        color='rgb(49,130,189)'
    )
)]
layout = dict(
    title = "Average Default Loans per Home Ownership Status",
    xaxis=dict(
        title='Home ownership'
    ),
    yaxis=dict(
        title='Default rate'
    )
)

fig = dict(data=traces, layout=layout)
iplot(fig)

The highest default rate is in the None category. My assumption is that these borrowers might be homeless, but further details are needed to study this category.

# Q2
## Model implementation

In [20]:
data.drop('status',axis=1,inplace=True)

In [21]:
# Checking if any null values are remaining
data.isnull().sum()

loan_status                    0
mths_since_last_major_derog    0
revol_bal                      0
purpose                        0
addr_state                     0
home_ownership                 0
verification_status            0
dti                            0
annual_inc                     0
cr_line_age                    0
apply_date_year                0
apply_date_month               0
earliest_cr_line_year          0
earliest_cr_line_month         0
inc_cat                        0
dtype: int64

In [22]:
# Label encoding (note: apply() runs the encoder on every column, numeric ones included)
le = preprocessing.LabelEncoder()
data = data.apply(le.fit_transform)
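
Because `data.apply(le.fit_transform)` also replaces numeric columns such as dti and annual_inc with integer codes, a gentler variant is to encode only the non-numeric columns. The sketch below illustrates this on a toy frame; `encode_categoricals` is a hypothetical helper, and the results reported in this notebook were produced with the all-columns encoding above, not with this variant.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df):
    """Label-encode non-numeric columns only; numeric columns are left untouched."""
    out = df.copy()
    encoders = {}
    for col in out.select_dtypes(exclude='number').columns:
        enc = LabelEncoder()
        out[col] = enc.fit_transform(out[col].astype(str))
        encoders[col] = enc  # kept so the same mapping can be reused later
    return out, encoders

demo = pd.DataFrame({
    'purpose': ['credit_card', 'car', 'credit_card'],
    'dti': [24.85, 26.38, 14.41],
})
encoded, encoders = encode_categoricals(demo)
print(encoded)
```

Keeping one encoder per column also makes it possible to apply the same mapping to new data later.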

In [23]:
target = raw_data['loan_status']
data = data.drop('loan_status', axis = 1)

### Baseline logistic model on imbalanced data

In [33]:
from sklearn.model_selection import cross_val_predict

# 10-fold out-of-fold predictions from an untuned logistic regression
y_pred = cross_val_predict(LogisticRegression(), data, raw_data['loan_status'], cv=10)
print('Accuracy Score: {}'.format(metrics.accuracy_score(raw_data['loan_status'], y_pred)))
# Note: log loss is computed here from hard 0/1 labels rather than predicted probabilities
print('Logloss Score: {}'.format(metrics.log_loss(raw_data['loan_status'], y_pred)))
print('F1 Score: {}'.format(metrics.f1_score(raw_data['loan_status'], y_pred)))
print(metrics.classification_report(raw_data['loan_status'], y_pred))

Accuracy Score: 0.9329563125971928
Logloss Score: 2.31560730372731
F1 Score: 0.0041592394533571005
             precision    recall  f1-score   support

          0       0.93      1.00      0.97    186647
          1       0.23      0.00      0.00     13342

avg / total       0.89      0.93      0.90    199989



The model has an accuracy of 93%; however, because of the class imbalance it almost never predicts a default: recall on the default class rounds to 0.00 and the F1 score is about 0.004. The class imbalance has to be treated before training the model.
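
To see why 93% accuracy says little here, it helps to compute the accuracy of a do-nothing model that labels every loan as non-default. The short calculation below uses the class counts from the loan_status value counts above; the variable names are illustrative.

```python
# Class counts taken from the loan_status value counts above
n_non_default = 186647
n_default = 13342

# A model that always predicts "non-default" is right on every non-default loan
majority_accuracy = n_non_default / (n_non_default + n_default)

# ...but it finds no defaults at all, so its recall on the default class is zero
majority_recall = 0 / n_default

print(round(majority_accuracy, 4))
print(majority_recall)
```

The cross-validated accuracy reported above (about 0.933) is essentially the same as this baseline, which is why accuracy alone is not a useful metric for this problem.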

### Train-test split for model training
The 'stratify' option keeps the proportion of default to non-default loans the same in the train and test data, so the test set stays representative of the class balance in the full data.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.30, random_state=0, stratify=target)

### Removing class imbalance using SMOTE
SMOTE (Synthetic Minority Over-sampling Technique): an oversampling method that does not replicate minority class rows. Instead, it constructs new minority class instances by interpolating between an existing minority sample and one of its nearest minority-class neighbours.

In [61]:
sm = SMOTE(random_state=42)
# fit_sample was renamed to fit_resample in imbalanced-learn 0.4
X_res, y_res = sm.fit_resample(X_train, y_train)
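
To make the interpolation idea concrete, the snippet below generates a single synthetic point the way SMOTE does conceptually. It is a simplified NumPy sketch, not the imbalanced-learn implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, seed=None):
    """Create one synthetic minority point by interpolating towards a near neighbour."""
    rng = np.random.default_rng(seed)
    # Pick a random minority sample as the starting point
    x = minority[rng.integers(len(minority))]
    # Find its k nearest minority neighbours (position 0 is the point itself)
    dists = np.linalg.norm(minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    # Move a random fraction of the way from x towards the chosen neighbour
    gap = rng.random()
    return x + gap * (neighbour - x)

minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_demo, k=2, seed=0)
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so it stays inside the region the minority class already occupies.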

### Model evaluation
I have used F1 score to compare performance of the models as it takes into account both the precision and recall of the predictions. 
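
As a reminder of what the metric measures, F1 is the harmonic mean of precision and recall. The sketch below computes it by hand; `f1_from` is a hypothetical helper, and the inputs are the rounded default-class precision and recall that the logistic regression report further down shows on the test data.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded test-set values for the default class (precision 0.14, recall 0.65)
f1_demo = f1_from(0.14, 0.65)
print(round(f1_demo, 2))
```

Because the harmonic mean is dominated by the smaller of the two values, a model with high recall but low precision still gets a modest F1 score.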

### Logistic regression on balanced data

In [44]:
def eval_model(model, X_train, X_test, y_train, y_test):
    y_pred = model.predict(X_train)
    print('Train')
    print('Accuracy Score: {}'.format(metrics.accuracy_score( y_train, y_pred)))
    # Score against the labels passed in, not the global y_res
    print('F1 Score: {}'.format(metrics.f1_score( y_train, y_pred)))
    print(metrics.classification_report( y_train, y_pred))

    y_test_pred = model.predict(X_test)
    print('Test')
    print ('Accuracy Score: {}'.format(metrics.accuracy_score( y_test, y_test_pred)))
    print ('F1 Score: {}'.format(metrics.f1_score( y_test, y_test_pred)))
    print(metrics.classification_report( y_test, y_test_pred))

In [63]:
parameters = {'C':[1, 10, 20], 'penalty':['l1', 'l2']}
# The liblinear solver supports both the l1 and l2 penalties in the grid
clf = GridSearchCV(LogisticRegression(solver='liblinear'), parameters, cv=10, scoring='f1', verbose=1, n_jobs = -1)
clf.fit(X_res, y_res)
eval_model(clf, X_res, X_test, y_res, y_test)

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  9.0min finished


Train
Accuracy Score: 0.7019165269836897
F1 Score: 0.6967203859377251
             precision    recall  f1-score   support

          0       0.70      0.72      0.71    130653
          1       0.71      0.68      0.70    130653

avg / total       0.70      0.70      0.70    261306

Test
Accuracy Score: 0.712735636781839
F1 Score: 0.23095801169068758
             precision    recall  f1-score   support

          0       0.97      0.72      0.82     55994
          1       0.14      0.65      0.23      4003

avg / total       0.91      0.71      0.78     59997



This model is a significant improvement over the baseline. Accuracy drops to about 0.71, but the F1 score on the test data rises to 0.23, with a recall of 0.65 on defaults; precision on defaults is still low (0.14). Accuracy is also slightly higher on the test data (0.71) than on the resampled train data (0.70).

### Random Forest

In [61]:
X_rand = pd.DataFrame(X_res, columns = X_train.columns)
params = {
    'max_features':[0.8],
    'min_samples_leaf':[3, 5],
    'min_samples_split':[10, 8]
}
clf = GridSearchCV(RandomForestClassifier(n_jobs = -1, n_estimators=100), params,\
                   cv=10,scoring='f1',verbose=1, n_jobs= -1)
clf.fit(X_rand, y_res)

eval_model(clf, X_rand, X_test, y_res, y_test)

Fitting 10 folds for each of 4 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 41.2min finished


Train
Accuracy Score: 0.9758750277452488
Logloss Score: 0.833247055900319
F1 Score: 0.9752807578894536
             precision    recall  f1-score   support

          0       0.95      1.00      0.98    130653
          1       1.00      0.95      0.98    130653

avg / total       0.98      0.98      0.98    261306

Test
Accuracy Score: 0.9319465973298665
Logloss Score: 2.3504825638109987
F1 Score: 0.00873998543335761
             precision    recall  f1-score   support

          0       0.93      1.00      0.96     55994
          1       0.16      0.00      0.01      4003

avg / total       0.88      0.93      0.90     59997



In [67]:
clf.best_params_

{'max_features': 0.8, 'min_samples_leaf': 3, 'min_samples_split': 8}


In [66]:
X_rand = pd.DataFrame(X_res, columns = X_train.columns)
m = RandomForestClassifier(n_estimators=500, max_features = 0.8, min_samples_leaf = 3, min_samples_split = 8)
m.fit(X_rand,  y_res)
eval_model(m, X_rand, X_test, y_res, y_test)

Train
Accuracy Score: 0.9754770269339396
F1 Score: 0.9748625048054668
             precision    recall  f1-score   support

          0       0.95      1.00      0.98    130653
          1       1.00      0.95      0.97    130653

avg / total       0.98      0.98      0.98    261306

Test
Accuracy Score: 0.9321632748304082
F1 Score: 0.00876765708718948
             precision    recall  f1-score   support

          0       0.93      1.00      0.96     55994
          1       0.17      0.00      0.01      4003

avg / total       0.88      0.93      0.90     59997



#### Feature importance
Let's have a look at the feature importance score from random forest and keep only the important features

In [69]:
len(y_train)

139992

In [72]:
def rf_feat_importance(m, df):
    # Pair each column name with its importance and sort from most to least important
    return pd.DataFrame({'col' : df.columns, 'feat_imp' : m.feature_importances_}).\
        sort_values('feat_imp', ascending = False)
m = RandomForestClassifier(n_estimators=100, max_features = 0.8, min_samples_leaf = 3, min_samples_split = 8)
m.fit(X_rand,  y_res)
fi = rf_feat_importance(m, X_rand)
fi

Unnamed: 0,col,feat_imp
9,apply_date_year,0.430486
4,home_ownership,0.18706
5,verification_status,0.180644
2,purpose,0.048049
10,apply_date_month,0.028179
1,revol_bal,0.024375
7,annual_inc,0.022202
6,dti,0.022116
3,addr_state,0.014735
8,cr_line_age,0.013073


Above we can see the features in the order of significance. I will keep only those variables which have importance score higher than 0.005.

In [76]:
to_keep  =  fi[fi.feat_imp > 0.005].col
X_keep = X_rand[to_keep].copy()
X_test_keep = X_test[to_keep].copy()
m = RandomForestClassifier(n_estimators=100, max_features = 0.8, min_samples_leaf = 3, min_samples_split = 8)
m.fit(X_keep, y_res)

eval_model(m, X_keep, X_test_keep, y_res, y_test)

Train
Accuracy Score: 0.9759247778466625
Logloss Score: 0.8315287421519602
F1 Score: 0.9753326040159509
             precision    recall  f1-score   support

          0       0.95      1.00      0.98    130653
          1       1.00      0.95      0.98    130653

avg / total       0.98      0.98      0.98    261306

Test
Accuracy Score: 0.9318799273296998
Logloss Score: 2.3527852906935847
F1 Score: 0.007768875940762322
             precision    recall  f1-score   support

          0       0.93      1.00      0.96     55994
          1       0.14      0.00      0.01      4003

avg / total       0.88      0.93      0.90     59997



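Random forest performs much worse than logistic regression in terms of F1 score on the test data (about 0.01 versus 0.23). The near-perfect scores on the SMOTE-resampled train data (F1 of about 0.98), alongside a test recall of 0.00 on defaults, suggest that the forest overfits the resampled training set.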

### XGBoost
XGBoost is a gradient boosting method that builds weak classifiers (trees) sequentially, with each new tree correcting the errors of the ensemble built so far. The model trains quickly and is highly flexible because of the number of parameters that can be tuned.

In [75]:
# Convert the resampled train set and the test set to plain arrays for XGBoost
Xgb_train = np.asarray(X_res)
Xgb_test = np.asarray(X_test)
# Note: the evaluation set used for early stopping is the training set itself
eval_set=[(Xgb_train, y_res)]
print('Initializing xgboost.sklearn.XGBClassifier and starting training...')


clf = XGBClassifier(
    objective="binary:logistic", 
    learning_rate=0.004, 
    random_state=10,
    colsample_bytree = 0.8,
    max_depth=3, 
    gamma=10, 
    n_estimators=500,
    reg_lambda=0.1,
    colsample_bylevel=0.5,
    booster='gbtree')

clf.fit(Xgb_train, y_res, early_stopping_rounds=20, eval_metric="map", eval_set=eval_set, verbose=True)

eval_model(clf, Xgb_train, Xgb_test, y_res, y_test)

Initializing xgboost.sklearn.XGBClassifier and starting training...
[0]	validation_0-map:0.765
Will train until validation_0-map hasn't improved in 20 rounds.
[1]	validation_0-map:0.864035
[2]	validation_0-map:0.850718
[3]	validation_0-map:0.894323
[4]	validation_0-map:0.911025
[5]	validation_0-map:0.92308
[6]	validation_0-map:0.919676
[7]	validation_0-map:0.912283
[8]	validation_0-map:0.918921
[9]	validation_0-map:0.921334
[10]	validation_0-map:0.918211
[11]	validation_0-map:0.919275
[12]	validation_0-map:0.917594
[13]	validation_0-map:0.919209
[14]	validation_0-map:0.916376
[15]	validation_0-map:0.914563
[16]	validation_0-map:0.923443
[17]	validation_0-map:0.924804
[18]	validation_0-map:0.92714
[19]	validation_0-map:0.928933
[20]	validation_0-map:0.931526
[21]	validation_0-map:0.929926
[22]	validation_0-map:0.932618
[23]	validation_0-map:0.930538
[24]	validation_0-map:0.928923
[25]	validation_0-map:0.931493
[26]	validation_0-map:0.930689
[27]	validation_0-map:0.930975
[28]	validation

XGBoost performs better than random forest (F1 score of 0.22), but it still falls just short of logistic regression (0.23).

# Q3

### Conclusion and final thoughts for production
- Of the three models compared above, logistic regression performed best on the test data in terms of F1 score.
- The model gives the probability that a borrower will default. We can set a cutoff on this probability and flag any borrower above it as a likely future loan defaulter.
- The threshold has to balance false positives against false negatives. To determine it, I would consult someone with domain knowledge, such as a risk analyst or a risk management team, so that the web app does not extend too many loans that will eventually default.
- The borrower's credit history could also be included in the model. For instance, the web app could request a credit report, or a more comprehensive report on the borrower, to judge whether they will be able to pay back their loans.
- Having some metric for the credit score of borrowers and feeding that information into the model is also likely to improve model performance.
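
The thresholding idea in the list above can be sketched as follows. The example picks the cutoff that maximizes F1 on a handful of made-up probabilities. Using F1 as the criterion is an assumption made for illustration; in practice the risk team's cost of false positives versus false negatives would drive the choice. `best_f1_threshold` and the demo arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, proba):
    """Return the probability cutoff with the highest F1, and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # The final precision/recall pair has no matching threshold, so drop it
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]

# Made-up labels and predicted default probabilities (illustration only)
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba_demo = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

cutoff, f1_at_cutoff = best_f1_threshold(y_demo, proba_demo)
print(cutoff, f1_at_cutoff)
```

For a fitted classifier, `proba` would come from `model.predict_proba(X)[:, 1]` on a validation set kept separate from the final test data.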