# Credit Modelling 

### Can Machine Learning Model If A Borrower Pays Off A Loan On Time Or Not?

###### We will be looking at credit risk using finacial lending data from [lending club](https://www.lendingclub.com/). The dataset includes only the approved loans nad 

###### The dictionary explaining the columns of data is found at [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit)

###### They offer a marketplace for peer lending. Borrowers risk are evaluated by the club using a wide range of the borrowers finanicial information to assing them an interest rate.

###### A high interest rate is a measure of the risk. A low risk investment will therefore give low but nearly guaranteed returns.

###### Investors are mainly interested in receiving the best overall returns on investments so balancing risk and rewards is key.

###### In this project we will be putting our minds in that of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. 

###### To do this we will need to understand the data, ensure no data 'leakages', perform feature engineering then create and test machine learning models.

In [1]:
import pandas as pd
import numpy as np
import random as random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

loans_2007 = pd.read_csv('loans_2007.csv')
data_dictionary = pd.read_csv('LCDataDictionary.xlsx - LoanStats.csv')

  from numpy.core.umath_tests import inner1d


In [2]:
loans_2007

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.000000,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.000000,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.000000,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.000000,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.000000,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,1075269,1311441.0,5000.0,5000.0,5000.000000,36 months,7.90%,156.46,A,A4,...,161.03,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
6,1069639,1304742.0,7000.0,7000.0,7000.000000,60 months,15.96%,170.08,C,C5,...,1313.76,May-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
7,1072053,1288686.0,3000.0,3000.0,3000.000000,36 months,18.64%,109.43,E,E1,...,111.34,Dec-2014,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
8,1071795,1306957.0,5600.0,5600.0,5600.000000,60 months,21.28%,152.39,F,F2,...,152.39,Aug-2012,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
9,1071570,1306721.0,5375.0,5375.0,5350.000000,60 months,12.69%,121.45,B,B5,...,121.45,Mar-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [3]:
loans_2007.shape

(42538, 52)

###### The Data Dictionary contains rows for Loans, Rejections and Internal Notes.

###### Lets filter this to show only rows that are of interest to Loans

In [4]:
data_dictionary.sample(5)

Unnamed: 0,LoanStatNew,Description
95,title,The loan title provided by the borrower
29,initial_list_status,The initial listing status of the loan. Possib...
89,recoveries,post charge off gross recovery
67,num_rev_tl_bal_gt_0,Number of revolving trades with balance >0
22,fico_range_low,The lower boundary range the borrower’s FICO a...


In [5]:
loans_dictionary = data_dictionary[data_dictionary['LoanStatNew'].isin(loans_2007.columns)]
#Print each row as Juptyer truncates the description usually#
for index, row in loans_dictionary.iterrows():
    print(row['LoanStatNew'], '-----', row['Description'])
    print('First value:', loans_2007[row['LoanStatNew']][0], 'of type', type(loans_2007[row['LoanStatNew']][0]))
    print('\n')

acc_now_delinq ----- The number of accounts on which the borrower is now delinquent.
First value: 0.0 of type <class 'numpy.float64'>


addr_state ----- The state provided by the borrower in the loan application
First value: AZ of type <class 'str'>


annual_inc ----- The self-reported annual income provided by the borrower during registration.
First value: 24000.0 of type <class 'numpy.float64'>


application_type ----- Indicates whether the loan is an individual application or a joint application with two co-borrowers
First value: INDIVIDUAL of type <class 'str'>


chargeoff_within_12_mths ----- Number of charge-offs within 12 months
First value: 0.0 of type <class 'numpy.float64'>


collection_recovery_fee ----- post charge off collection fee
First value: 0.0 of type <class 'numpy.float64'>


collections_12_mths_ex_med ----- Number of collections in 12 months excluding medical collections
First value: 0.0 of type <class 'numpy.float64'>


delinq_2yrs ----- The number of 30+ days pas

### Feature Engineering - Removing Redundant Features

###### Our learning model will be predicting if a loan will be approved. Certain features here show information that could not be know pre-application:
- collection_recovery_fee, funded_amt, funded_amnt_inv, grade, 'initial_list_status', issue_d, 'last_credit_pull_d'
- last_pymnt_amnt, last_pymnt_d, 'loan_amnt', out_prncp, out_prncp_inv, 'pymnt_plan', 'revol_util', sub_grade, total_pymnt, total_pymnt_inv, total_rec_int, total_rec_late_fee, 

###### Some features provide no value:
- id, member_id, 'policy_code', recoveries, zip_code

###### Some features provide to much information to the point that per unique value there is very little information (can cause high bias):
- emp_title

In [6]:
redundant_columns = ['collection_recovery_fee', 'emp_title', 'funded_amnt',
 'funded_amnt_inv', 'grade', 'id', 'initial_list_status',
 'issue_d', 'last_credit_pull_d', 'last_pymnt_amnt', 'last_pymnt_d',
 'loan_amnt',  'member_id', 'out_prncp', 'out_prncp_inv', 'policy_code', 
 'pymnt_plan', 'recoveries', 'revol_util', 'sub_grade', 
 'total_pymnt', 'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee',
 'total_rec_prncp', 'zip_code']
loans_2007 = loans_2007.drop(columns = redundant_columns)

In [7]:
loans_2007.describe(include = 'all')

Unnamed: 0,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,title,...,pub_rec,revol_bal,total_acc,collections_12_mths_ex_med,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
count,42535,42535,42535.0,41423,42535,42531.0,42535,42535,42535,42522,...,42506.0,42535.0,42506.0,42390.0,42535,42506.0,42390.0,42506.0,41170.0,42430.0
unique,2,394,,11,5,,3,9,14,21264,...,,,,,1,,,,,
top,36 months,10.99%,,10+ years,RENT,,Not Verified,Fully Paid,debt_consolidation,Debt Consolidation,...,,,,,INDIVIDUAL,,,,,
freq,31534,970,,9369,20181,,18758,33136,19776,2259,...,,,,,42535,,,,,
mean,,,322.623063,,,69136.56,,,,,...,0.058156,14297.86,22.124406,0.0,,9.4e-05,0.0,0.143039,0.045227,2.4e-05
std,,,208.927216,,,64096.35,,,,,...,0.245713,22018.44,11.592811,0.0,,0.0097,0.0,29.359579,0.208737,0.004855
min,,,15.67,,,1896.0,,,,,...,0.0,0.0,1.0,0.0,,0.0,0.0,0.0,0.0,0.0
25%,,,165.52,,,40000.0,,,,,...,0.0,3635.0,13.0,0.0,,0.0,0.0,0.0,0.0,0.0
50%,,,277.69,,,59000.0,,,,,...,0.0,8821.0,20.0,0.0,,0.0,0.0,0.0,0.0,0.0
75%,,,428.18,,,82500.0,,,,,...,0.0,17251.0,29.0,0.0,,0.0,0.0,0.0,0.0,0.0


###### Some features here have no variation in their values: 'collections_12_mths_ex_med'  & 'chargeoff_within_12_mths' all have 0.0 results throughout. There is no value in these columns. We will drop these also.

In [8]:
loans_2007 = loans_2007.drop(columns = ['collections_12_mths_ex_med', 'chargeoff_within_12_mths', 'tax_liens'])

###### Next lets review any columns that have only one unique_value. Clearly here there will be no value for machine learning.

In [9]:
drop_columns = []
for column in loans_2007.columns:
    values = loans_2007[column].dropna()
    unique_values = values.unique()
    if len(unique_values) == 1:
        drop_columns.append(column)
        
loans_2007 = loans_2007.drop(columns = drop_columns)
print(sorted(drop_columns))

['application_type']


### Target Engineering

###### Next lets look at our target column which here is the 'loan_status'

In [10]:
loans_2007['loan_status'].value_counts(dropna = False)

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
NaN                                                        3
Name: loan_status, dtype: int64

###### Our interest here lies only with loans that were fully paid or not.

###### Therefore we will only use rows with Fully Paid and Charged Off

In [11]:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]

###### Quick check to confirm it works.

In [12]:
loans_2007['loan_status'].value_counts()

Fully Paid     33136
Charged Off     5634
Name: loan_status, dtype: int64

###### Now that we have just two types a loan repaid and a loan not repaid we must represent this in numerics for machine learning. 

###### A paid loan will be set to 1 and unpaid set to 0

In [13]:
paid_unpaid_dict = {'loan_status' : {'Fully Paid' :1, 'Charged Off': 0}}
loans_2007 = loans_2007.replace(paid_unpaid_dict)
loans_2007.head()

Unnamed: 0,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,title,...,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,total_acc,acc_now_delinq,delinq_amnt,pub_rec_bankruptcies
0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,credit_card,Computer,...,0.0,Jan-1985,1.0,3.0,0.0,13648.0,9.0,0.0,0.0,0.0
1,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,car,bike,...,0.0,Apr-1999,5.0,3.0,0.0,1687.0,4.0,0.0,0.0,0.0
2,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,small_business,real estate business,...,0.0,Nov-2001,2.0,2.0,0.0,2956.0,10.0,0.0,0.0,0.0
3,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,other,personel,...,0.0,Feb-1996,1.0,10.0,0.0,5598.0,37.0,0.0,0.0,0.0
5,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,wedding,My wedding loan I promise to pay back,...,0.0,Nov-2004,3.0,9.0,0.0,7963.0,12.0,0.0,0.0,0.0


### Feature Engineering - Missing Data

###### Machine learning cannot handle missing data, unfortunately real data is often 'messy' and incomplete.

###### Lets check for missing data

In [14]:
null_counts = loans_2007.isnull().sum()
print('Null Counts:','\n')
print(null_counts[null_counts > 0])
print('\n')
print('Total Rows =', len(loans_2007))

Null Counts: 

emp_length              1036
title                     11
pub_rec_bankruptcies     697
dtype: int64


Total Rows = 38770


###### Lets see the value counts (as decimal proportions) of each of these variables, making sure to include any N/A values.

In [15]:
for column in ['emp_length', 'title', 'pub_rec_bankruptcies']:
    null_counts_df = pd.DataFrame()
    print(column)
    null_counts_df['value'] = loans_2007[column].value_counts(normalize = True).index
    null_counts_df['counts'] = loans_2007[column].value_counts(normalize = True).values
    print(null_counts_df.sort_values(by='counts'))
    print('n/a proportion:')
    print(loans_2007[column].isnull().sum() / len(loans_2007))
    print('\n')

emp_length
        value    counts
10    9 years  0.032570
9     8 years  0.038268
8     7 years  0.045529
7     6 years  0.057799
6      1 year  0.084354
5     5 years  0.085043
4     4 years  0.089097
3     3 years  0.106694
2     2 years  0.114168
1    < 1 year  0.119971
0   10+ years  0.226507
n/a proportion:
0.02672169202992004


title
                                             value    counts
9682                                       medloan  0.000026
12912                        Bye Bye Credit Cards!  0.000026
12911                             Volkswagen Jetta  0.000026
12910                                        brown  0.000026
12909                           DECEMBER 2011 LOAN  0.000026
12908                                    A's2Zee's  0.000026
12907                          Roll it into 1 loan  0.000026
12913                     get out credit card debt  0.000026
12906                                Shed the Debt  0.000026
12904                    Tired of Credit Compan

###### Pub_rec_bankruptcies (The number of public record bankruptcies) is disproportionately one value. This will cause high bias and we will drop this column.

###### Removing the rows with employment length and title will lose 3% and 0.03% of the data respectively. We will drop rows where title data is missing.

###### For employment length we know this is likely to be a significant factor (reviewed later), and could cause issues if we try to impute (enter values using a statistical metric) the values.

In [16]:
loans_2007 = loans_2007.drop(columns= ['pub_rec_bankruptcies'])
loans_2007 = loans_2007.dropna(axis = 'rows')

### Feature Engineering - Numerical Conversions

###### Next step is to check we have data suitable for machine learning. This will need to be of type int or float. (A true integer that is recorded as a float will not affect the model)

In [17]:
loans_2007.dtypes.value_counts()

float64    11
object      9
int64       1
dtype: int64

In [18]:
loans_2007.columns[loans_2007.dtypes == object]

Index(['term', 'int_rate', 'emp_length', 'home_ownership',
       'verification_status', 'purpose', 'title', 'addr_state',
       'earliest_cr_line'],
      dtype='object')

In [19]:
non_numerical_data = loans_2007[loans_2007.columns[loans_2007.dtypes == object]]
non_numerical_data.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004


In [20]:
for column in non_numerical_data.columns:
    print(column)
    print('Top 5 values and their proportion as a decimal=')
    print(non_numerical_data[column].value_counts(normalize = True, dropna = False).head(5))
    print('number of unique values=', len(non_numerical_data[column].value_counts(normalize = True, dropna = False).index))
    print('-----')

term
Top 5 values and their proportion as a decimal=
 36 months    0.749655
 60 months    0.250345
Name: term, dtype: float64
number of unique values= 2
-----
int_rate
Top 5 values and their proportion as a decimal=
 10.99%    0.024017
 11.49%    0.020411
  7.51%    0.020040
 13.49%    0.019802
  7.88%    0.018582
Name: int_rate, dtype: float64
number of unique values= 371
-----
emp_length
Top 5 values and their proportion as a decimal=
10+ years    0.226540
< 1 year     0.119977
2 years      0.114198
3 years      0.106723
4 years      0.089068
Name: emp_length, dtype: float64
number of unique values= 11
-----
home_ownership
Top 5 values and their proportion as a decimal=
RENT        0.480967
MORTGAGE    0.442583
OWN         0.073773
OTHER       0.002598
NONE        0.000080
Name: home_ownership, dtype: float64
number of unique values= 5
-----
verification_status
Top 5 values and their proportion as a decimal=
Not Verified       0.432377
Verified           0.314495
Source Verified    0

###### From these columns labelled as object we can see:

###### Categorical columns:
> ###### Home ownership, verification status, purpose and title

###### Numerical  columns:

> ###### Term, Int rate and emp_length,


###### For the other columns:
> ###### Earliest credit line might potentially have some unforseen correlation with approval (perhaps more loans are approved in December if there was christmas bonus for example). For now we shall remove this out given the large feature engineering required for the potentially little benefit.

> ###### The same logic applies for Address State

In [21]:
loans_2007 = loans_2007.drop(columns = ['addr_state','earliest_cr_line'])

### Feature Engineering - Numerical Conversions: Categories 

###### Title has 19k unique entries, Purpose has only 14. Lets preview some of these.

In [22]:
random.seed = 1
loans_2007[['title', 'purpose']].sample(20)

Unnamed: 0,title,purpose
27576,At&T payoff,credit_card
25758,Home Improvement,home_improvement
38752,Paying off debt,debt_consolidation
30287,EHS 2010 Lending Club,debt_consolidation
38260,Help Paying for College Credits,educational
34436,New Business Website Launch,small_business
17906,Home improvement & car restoration,home_improvement
38196,Consolidate debt,debt_consolidation
39365,Pay credit card off,debt_consolidation
28149,Debt Consolidation,credit_card


###### It seems that purpose has fixed categories and title allows for seemingly any text to describe in more detail. We shall drop title as too many variables can easily cause an imbalance in the train test sets and increase bias.

In [23]:
loans_2007 = loans_2007.drop(columns = ['title'])

###### In deciding whether to keep purpose, home ownership and verification status lets convert the categories to [dummy variables](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). By doing this we can have for instance a column of 1 (yes) or 0 (no) for a home ownership of mortgage (and similar columns for each possible result of home ownership)

In [24]:
purpose_dummies_dataframe = pd.get_dummies(loans_2007['purpose'], prefix = 'purpose')
purpose_dummies_dataframe['loan_status'] = loans_2007['loan_status']
purpose_dummies_dataframe.corr()['loan_status'].sort_values()

purpose_small_business       -0.078474
purpose_debt_consolidation   -0.020934
purpose_other                -0.015799
purpose_renewable_energy     -0.006878
purpose_house                -0.006255
purpose_educational          -0.005710
purpose_medical              -0.003972
purpose_moving               -0.002909
purpose_vacation              0.000003
purpose_wedding               0.019350
purpose_home_improvement      0.020961
purpose_car                   0.021304
purpose_major_purchase        0.029213
purpose_credit_card           0.043370
loan_status                   1.000000
Name: loan_status, dtype: float64

###### The correlations are quite revealing. Starting a small business and debts generally the worst returning loans, the exception here is a credit card which appears to be the best return on investment. As the effects are quite broad and the variable size is not too large we will convert these into dummy variables.

In [25]:
loans_2007 = pd.concat([loans_2007, purpose_dummies_dataframe.drop(columns = 'loan_status')], axis = 'columns')
loans_2007 = loans_2007.drop(columns = 'purpose')

In [26]:
home_ownership_dummies_dataframe = pd.get_dummies(loans_2007['home_ownership'], prefix = 'home_ownership')
home_ownership_dummies_dataframe['loan_status'] = loans_2007['loan_status']
home_ownership_dummies_dataframe.corr()['loan_status'].sort_values()

home_ownership_RENT       -0.020954
home_ownership_OTHER      -0.005884
home_ownership_OWN        -0.000654
home_ownership_NONE        0.003647
home_ownership_MORTGAGE    0.021960
loan_status                1.000000
Name: loan_status, dtype: float64

###### Loans for mortgages are more likely to be repaid, where as loans for rent are not. Loans to homeowners are 50:50 likely to be repaid. This is interesting and will also be made to dummie variables.

In [27]:
loans_2007 = pd.concat([loans_2007, home_ownership_dummies_dataframe.drop(columns = 'loan_status')], axis = 'columns')
loans_2007 = loans_2007.drop(columns = 'home_ownership')

In [28]:
verification_status_dummies_dataframe = pd.get_dummies(loans_2007['verification_status'], prefix = 'verification_status')
verification_status_dummies_dataframe['loan_status'] = loans_2007['loan_status']
verification_status_dummies_dataframe.corr()['loan_status'].sort_values()

verification_status_Verified          -0.042067
verification_status_Source Verified   -0.005193
verification_status_Not Verified       0.043985
loan_status                            1.000000
Name: loan_status, dtype: float64

###### An income source that is unverified is more likely to be repaid. This seems counterintuitive. You would expect that unverified incomes would have a higher interest rate applied given the unknown risk and if the income proportion were the same as verified then the higher rate would naturally be less likely to be repaid.

###### Perhaps the higher rates make the applicant more cautious in accepting the loan offer and only ones with more stable finances accept. 

###### Eitherway we will also accept these as dummy variables.

In [29]:
loans_2007 = pd.concat([loans_2007, verification_status_dummies_dataframe.drop(columns = 'loan_status')], axis = 'columns')
loans_2007 = loans_2007.drop(columns = 'verification_status')

### Feature Engineering - Numerical Conversions: String Numericals

###### Next lets convert the strings in employment length and interest rate into their numerical equivalents.

###### For employment length we will assume that n/a and <1year is equal to 0 years. For 10+ we will assume 10 years.

In [30]:
loans_2007['emp_length'].value_counts(dropna = False)

10+ years    8546
< 1 year     4526
2 years      4308
3 years      4026
4 years      3360
5 years      3207
1 year       3182
6 years      2180
7 years      1717
8 years      1444
9 years      1228
Name: emp_length, dtype: int64

In [31]:
emp_length_str_mapping = {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
loans_2007['emp_length'] = loans_2007['emp_length'].replace(emp_length_str_mapping)

In [32]:
loans_2007['emp_length'] = loans_2007['emp_length'].astype('float')

In [33]:
loans_2007['int_rate'] = loans_2007['int_rate'].str.rstrip('%').astype('float')

In [34]:
non_numerical_data['term'].value_counts(normalize = True, dropna = False).head(5).index

Index([' 36 months', ' 60 months'], dtype='object')

###### We will set 36 months and 60 months to the binary representations as there are only two variables here.

In [35]:
loans_str_mapping = {' 36 months':0, ' 60 months': 1}
loans_2007['term'] = loans_2007['term'].replace(loans_str_mapping)

###### Quick check to see all columns are now numerical.

In [36]:
loans_2007.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37724 entries, 0 to 39785
Data columns (total 37 columns):
term                                   37724 non-null int64
int_rate                               37724 non-null float64
installment                            37724 non-null float64
emp_length                             37724 non-null float64
annual_inc                             37724 non-null float64
loan_status                            37724 non-null int64
dti                                    37724 non-null float64
delinq_2yrs                            37724 non-null float64
inq_last_6mths                         37724 non-null float64
open_acc                               37724 non-null float64
pub_rec                                37724 non-null float64
revol_bal                              37724 non-null float64
total_acc                              37724 non-null float64
acc_now_delinq                         37724 non-null float64
delinq_amnt                

### Feature Engineering - Class Imbalance

###### As we saw before we have disproportionate amount of fully paid to charged off loans.

###### Recall the classification divide:

In [37]:
loans_2007['loan_status'].value_counts(normalize = True)

1    0.856723
0    0.143277
Name: loan_status, dtype: float64

###### For loans paid off to unpaid loans the ratio is ~6:1.

###### One method is to avoid the divide in the classes by ensuring proportions are similar in the train and test set.

###### Another is to use an error metric that can account for such an imbalance

### Error Metric

###### Recall that we want to know whether a loan is likely to be repaid. A repaid loan is a positive and unpaid is a negative.

###### A true positive will represent profit and a true negative is avoided loss.

###### A false positive is unavoided loss, a false negative is avoided profit.

###### Ideally we want to maximise profit and avoided loss while minimising unavoided loss and avoided profit.

###### We will attempt to minimise the false positive rate which is :
> False Positives / (False Positivies + True Negatives)

###### and maximise the true positive rate which is:
> True Positivies / (True Positives + False Negatives)

###### We can think of these as:
>False Positive Rate: "the percentage of the loans that shouldn't be funded that I would fund".

>True Positive Rate: "the percentage of loans that should be funded that I would fund".

###### Given average profits and average losses we would of course weight these accordingly into the error metric. We will look at this later

### Machine Learning - Logistical Regression
###### As this is a classification problem (0 or 1) we will use logistical regression ([like this](https://upload.wikimedia.org/wikipedia/commons/6/6d/Exam_pass_logistic_curve.jpeg))

In [38]:
def tpr_fpr_calculator(predictions, y_test):
    predictions = pd.Series(predictions)
    true_positives = len(predictions[(predictions ==1) & (y_test == 1)])
    true_negatives = len(predictions[(predictions ==0) & (y_test == 0)])
    false_positives = len(predictions[(predictions ==1) & (y_test == 0)])
    false_negatives = len(predictions[(predictions ==0) & (y_test == 1)])
    true_positive_rate = true_positives / (true_positives + false_negatives)
    false_positive_rate = false_positives / (false_positives + true_negatives)
    tpr_fpr_calculator_data_labels = {'true_positives':[true_positives],
                                    'false_negatives':[false_negatives], 
                                    'true_positive_rate':[true_positive_rate],  
                                    'false_positives':[false_positives], 
                                    'true_negatives':[true_negatives],
                                    'false_positive_rate':[false_positive_rate]}
    temp_dataframe = pd.DataFrame(data = tpr_fpr_calculator_data_labels)
    return (temp_dataframe)

###### Determine base line rates as a comparison by using all data in the model and predicting all loan_statuses

In [39]:
lr = LogisticRegression()
lr.fit(loans_2007.drop(columns = ['loan_status']), loans_2007['loan_status'])
baseline_predictions = lr.predict(loans_2007.drop(columns = ['loan_status']))
tpr_fpr_calculator(baseline_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,30576,3,0.999902,5082,2,0.999607


###### True positive rate is high, which is good. But False positive rate is high which means out of all the loans we should not be funding we are funding all the loans which will lose money. 

###### Now perform cross validation and see the error rate.

In [40]:
#randomise the data#
loans_2007 = loans_2007.sample(frac = 1)
#split into features and target#
features = loans_2007.drop(columns = ['loan_status'])
target = loans_2007['loan_status']
#Instantiate a linear regression model
lr = LogisticRegression()
#Use cross validation. 20% of the data is selected to be the test set and this is shuffled 5 times to cover all data#
lr_cross_val_predictions = cross_val_predict(lr, features, target, cv=5)

In [41]:
tpr_fpr_calculator(lr_cross_val_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,30575,4,0.999869,5084,0,1.0


###### A higher true positive rate. However the high false positive rate persists.

### Machine Learning - Logistical Regression with class balancing

###### One way to tackle this issue is to reduce the effect of the class imbalance.

###### Ideally the test and train sets will contain 1:1 ratios of rows for loan_status. Currently this ratio is 6:1

###### Methods to tackle this include:
> undersampling (where a massive reduction in the data is made to achieve the ratio)
> oversampling (repeat rows of the lower sized class)
> generate 'fake' data (impute new rows with the lower size class and create a larger dataset)

###### These techniques all have their issues in difficulty balancing.

###### A simpler technique to try is weighting the classifier to increase the importance of rightly predicting the smaller class. (in this case 6 times more important)

In [42]:
balanced_lr = LogisticRegression(class_weight = 'balanced')
#Use cross validation. 20% of the data is selected to be the test set and this is shuffled 5 times to cover all data#
balanced_lr_cross_val_predictions = cross_val_predict(balanced_lr, features, target, cv=5)
tpr_fpr_calculator(balanced_lr_cross_val_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,18970,11609,0.62036,3143,1941,0.618214


###### A drop in false positive rate but a drop false negative rate. We can see now the cause of this is false positives and true positives have dropped but true negatives and false negatives have increased

###### Recall that false positives is possibly the worst effect so reducing this is desirable.

### Machine Learning - Logistical Regression with variable class balancing
###### Our balancing has worked, but now lets try a range of values to see the best:


In [43]:
testing_range = np.arange(1,10,2)
for i in testing_range:
    penalty = {0: i, 1: 1}
    balanced_lr = LogisticRegression(class_weight = penalty)
    balanced_lr_cross_val_predictions = cross_val_predict(balanced_lr, features, target, cv=5)
    results = tpr_fpr_calculator(balanced_lr_cross_val_predictions, loans_2007['loan_status'])
    print('for a penalty ratio of:', i)
    print(results)
    print('\n')

for a penalty ratio of: 1
   true_positives  false_negatives  true_positive_rate  false_positives  \
0           30575                4            0.999869             5084   

   true_negatives  false_positive_rate  
0               0                  1.0  


for a penalty ratio of: 3
   true_positives  false_negatives  true_positive_rate  false_positives  \
0           28096             2483              0.9188             4654   

   true_negatives  false_positive_rate  
0             430             0.915421  


for a penalty ratio of: 5
   true_positives  false_negatives  true_positive_rate  false_positives  \
0           20930             9649            0.684457             3501   

   true_negatives  false_positive_rate  
0            1583             0.688631  


for a penalty ratio of: 7
   true_positives  false_negatives  true_positive_rate  false_positives  \
0           16540            14039            0.540894             2742   

   true_negatives  false_positive_rate  

###### The class balancer is not very fine tuning. Selecting a higher and higher penalty really means we will accept less loans and therefore less profit albeit with the benefit of less loss. But after all making no loss is as easy as accepting no loans. 

###### We must go further, lets move on to another model to see how that fairs.

### Machine Learning - Random Forests.

###### The ensemble of multiple decision trees is involved in this classification, lets see how it fairs.

In [44]:
balanced_forests = RandomForestClassifier(class_weight = 'balanced', random_state = 1)
balanced_lr_cross_val_predictions = cross_val_predict(balanced_forests, features, target, cv=5)
tpr_fpr_calculator(balanced_lr_cross_val_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,29547,1032,0.966251,4916,168,0.966955


In [45]:
balanced_forests = RandomForestClassifier(class_weight = 'balanced', random_state = 1,  max_depth=None, min_samples_split=1000)
balanced_lr_cross_val_predictions = cross_val_predict(balanced_forests, features, target, cv=5)
tpr_fpr_calculator(balanced_lr_cross_val_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,18646,11933,0.609765,3082,2002,0.606216


In [46]:
penalty = {0: 100, 1: 1}
balanced_forests = RandomForestClassifier(class_weight = 'balanced_subsample', random_state = 1)
balanced_lr_cross_val_predictions = cross_val_predict(balanced_forests, features, target, cv=5)
tpr_fpr_calculator(balanced_lr_cross_val_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,29579,1000,0.967298,4920,164,0.967742


###### Different variables similar outcomes. Lowering the false positive rate has the effect of lowering the true positive rate.

### Machine Learning - K Nearest Neighbours

In [47]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=2)
KNN_cross_val_predictions = cross_val_predict(KNN, features, target, cv=5)
tpr_fpr_calculator(KNN_cross_val_predictions, loans_2007['loan_status'])

Unnamed: 0,true_positives,false_negatives,true_positive_rate,false_positives,true_negatives,false_positive_rate
0,22556,8023,0.73763,3799,1285,0.747246


###### The same effect persists with K Nearest Neighbours. Our model is not getting better.

### Machine Learning - Other Classifiers:

###### Next we will look at some more classifiers:

In [49]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

classifiers = [AdaBoostClassifier(), GaussianNB(), QuadraticDiscriminantAnalysis()]

np.random.RandomState(1)

for classifier in classifiers:
    predictions = cross_val_predict(classifier, features, target, cv=5)
    results = tpr_fpr_calculator(predictions, loans_2007['loan_status'])
    print(classifier, ',Results')
    print(results)
    print('---')

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None) ,Results
   true_positives  false_negatives  true_positive_rate  false_positives  \
0           30307              272            0.991105             5044   

   true_negatives  false_positive_rate  
0              40             0.992132  
---
GaussianNB(priors=None) ,Results
   true_positives  false_negatives  true_positive_rate  false_positives  \
0           30426              153            0.994997             5063   

   true_negatives  false_positive_rate  
0              21             0.995869  
---
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001) ,Results
   true_positives  false_negatives  true_positive_rate  false_positives  \
0              57            30522            0.001864               14   

   true_negatives  false_positive_rate  
0            5070       

###### The inability to raise the true positive rate whilst lowering the false positive rate is highlighted by all the classifications we have performed.

###### Ultimately this tells us there is great overlap between good and bad investments and currently model predictions are not good enough.

###### Although the conclusion is disappointing, the journey was not.