# LendingClub Predictions

Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/public/how-peer-lending-works.action).

Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. 
- Lending Club evaluates each borrower's credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower. 
- The interest rate is the percentage, on top of the requested loan amount, that the borrower has to pay back. 
- A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. 
- Each borrower is given a grade according to the interest rate they were assigned. 
- If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.
- Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application.
- The borrower then makes monthly payments back to Lending Club either over 36 months or over 60 months. Many loans aren't completely paid off on time, however, and some borrowers [default](https://www.lendingclub.com/investing/investor-education/collection-of-monthly-payments) on the loan.
- Lending Club releases data for all of the approved and declined loan applications periodically on their [website](https://www.lendingclub.com/info/download-data.action).
- The data dictionary is available [here](https://github.com/ajdatahub/ProjectDS/tree/master/LendingClub%20Predictions). The LoanStats sheet describes the approved loans datasets and the RejectStats sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll focus on the approved loans data only.
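To make the link between the interest rate and the monthly payment concrete, here is a minimal sketch. It assumes a standard fixed-rate amortization formula with monthly compounding; the source does not say how Lending Club computes the installment, and the function name `monthly_installment` is hypothetical.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment for a fully amortizing loan (assumed formula)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)


# First loan shown in the data preview: 5000.0 at 10.65% over 36 months
payment = monthly_installment(5000.0, 10.65, 36)
print(round(payment, 2))
```

Under that assumption, the result lands within a cent of the 162.87 listed in the installment column for that loan.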


- The approved loans datasets contain information on current loans, completed loans, and defaulted loans. Let's now define the problem statement for this machine learning project:

    __Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?__


Before we can start doing machine learning, we need to clean our data, decide which features we want to use, and identify which column represents the target we want to predict.

### Data Cleaning

In [221]:
import pandas as pd

# Reading data into a dataframe and removing the first row 
loan_2007 = pd.read_csv('LoanStats3a.csv', skiprows = 1)



- Removing the desc column:
    - which contains a long text explanation for each loan
- Removing the url column:
    - which contains a link to each loan on Lending Club which can only be accessed with an investor account
- Removing all columns containing more than 50% missing values:
    - which allows us to move faster since we can spend less time trying to fill these values
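As a toy illustration of the 50% rule, the sketch below shows how the `thresh` argument of `DataFrame.dropna` works: it is the minimum number of non-missing values a column needs in order to be kept. The frame `toy` and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with different amounts of missing data per column
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
    "half": [1.0, np.nan, 2.0, np.nan],
})

# Keep only columns with at least half of their values present
min_non_null = len(toy) // 2
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

A column that is exactly half empty survives, so only columns with more than 50% missing values are dropped.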

In [222]:
loan_2007 = loan_2007.drop(['url','desc', 'hardship_flag'],axis = 1)
half_row = len(loan_2007)/2
loan_2007 = loan_2007.dropna(thresh = half_row, axis = 1)
loan_2007.head()


Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,...,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens,disbursement_method,debt_settlement_flag
0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,...,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,Cash,N
1,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,...,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,Cash,N
2,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,...,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,Cash,N
3,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,...,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,Cash,N
4,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,...,0.0,1.0,Individual,0.0,0.0,0.0,0.0,0.0,Cash,N


A cleaner file with the appropriate columns is available for our feature analysis and predictions. We will use that file going forward.

In [223]:
loans = pd.read_csv('loans_2007.csv')
loans.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


We need to explore the dataset to understand the features to be used in the prediction process. We will use the data dictionary to become familiar with what each column represents. We want to pay attention to any features that:

- leak information from the future (after the loan has already been funded). __This can cause our model to overfit.__
- don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
- are formatted poorly and need to be cleaned up
- require more data or a lot of processing to turn into a useful feature
- contain redundant information

I have created a spreadsheet that contains the name, data type, first row's value, and description from the data dictionary. This will make analysis of the columns easier. 

In [224]:
data_dict = pd.read_excel('DataDict.xlsx')
data_dict

Unnamed: 0,name,dtype,description
0,id,object,A unique LC assigned ID for the loan listing.
1,member_id,float64,A unique LC assigned Id for the borrower member.
2,loan_amnt,float64,The listed amount of the loan applied for by t...
3,funded_amnt,float64,The total amount committed to that loan at tha...
4,funded_amnt_inv,float64,The total amount committed by investors for th...
5,term,object,The number of payments on the loan. Values are...
6,int_rate,object,Interest Rate on the loan
7,installment,float64,The monthly payment owed by the borrower if th...
8,grade,object,LC assigned loan grade
9,sub_grade,object,LC assigned loan subgrade


After analyzing each column, we can conclude that the following features need to be removed:

- __id__: randomly generated field by Lending Club for unique identification purposes only (already removed from the raw file above, where most of its values were empty)
- __member_id__: also a randomly generated field by Lending Club for unique identification purposes only (already removed from the raw file above, where most of its values were empty)
- __funded_amnt__: leaks data from the future (after the loan has started to be funded)
- __funded_amnt_inv__: also leaks data from the future (after the loan has started to be funded)
- __grade__: redundant with the interest rate column (int_rate)
- __sub_grade__: also redundant with the interest rate column (int_rate)
- __emp_title__: requires other data and a lot of processing to potentially be useful
- __issue_d__: leaks data from the future (after the loan has been completely funded)

    - Lending Club assigns a grade and a sub-grade based on the borrower's interest rate. While the grade and sub_grade values are categorical, the int_rate column contains continuous values, which are better suited for machine learning.

- __zip_code__: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which can only be used to identify the state the borrower lives in)
- __out_prncp__: leaks data from the future (after the loan has started to be paid off)
- __out_prncp_inv__: also leaks data from the future (after the loan has started to be paid off)
- __total_pymnt__: also leaks data from the future (after the loan has started to be paid off)
- __total_pymnt_inv__: also leaks data from the future (after the loan has started to be paid off)
- __total_rec_prncp__: also leaks data from the future (after the loan has started to be paid off)

    - The out_prncp and out_prncp_inv both describe the outstanding principal amount for a loan, which is the remaining amount the borrower still owes. These 2 columns as well as the total_pymnt column describe properties of the loan after it's fully funded and started to be paid off. This information isn't available to an investor before the loan is fully funded and we don't want to include it in our model.
    
- __total_rec_int__: leaks data from the future (after the loan has started to be paid off)
- __total_rec_late_fee__: also leaks data from the future (after the loan has started to be paid off)
- __recoveries__: also leaks data from the future (after the loan has started to be paid off)
- __collection_recovery_fee__: also leaks data from the future (after the loan has started to be paid off)
- __last_pymnt_d__: also leaks data from the future (after the loan has started to be paid off)
- __last_pymnt_amnt__: also leaks data from the future (after the loan has started to be paid off)


In [225]:
loans = loans.drop(["id", "member_id", "funded_amnt", 
                  "funded_amnt_inv", "grade", "sub_grade", 
                  "emp_title", "issue_d", "zip_code", "out_prncp", 
                  "out_prncp_inv", "total_pymnt", "total_pymnt_inv", 
                  "total_rec_prncp", "total_rec_int", "total_rec_late_fee", 
                  "recoveries", "collection_recovery_fee", "last_pymnt_d", 
                  "last_pymnt_amnt"], axis=1)

print('Total Number of Columns: ',loans.shape[1])


Total Number of Columns:  32


We reduced the number of columns in the dataset from 54 to 32. 

We need to decide on a target column that we want to use for modeling.

We should use the __loan_status__ column, since it's the only column that directly describes whether a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. 
- Currently, this column contains text values and we need to convert it to a numerical one for training a model. 

In [226]:
# Explore the different values for loan_status column
status = loans['loan_status'].value_counts()
status

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. 
- Only the __Fully Paid__ and __Charged Off__ values describe the final outcome of the loan.  
- While the __Default__ status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance.

Since we are keeping only two values to predict, we can treat our problem as binary classification.
- We will remove all the rows with a status other than Fully Paid or Charged Off.
- We will transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case. 


In [227]:
# Dropping rows where status is not 'Fully Paid','Charged Off'
# loans = loans.drop(loans[~loans['loan_status'].isin (['Fully Paid','Charged Off' ])].index)
loans = loans[(loans['loan_status'] == "Fully Paid") | (loans['loan_status'] == "Charged Off")]

# Transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case. 
status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans = loans.replace(status_replace)

In [228]:
loans.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,n,...,f,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [229]:
'''
Find out columns that contain only one unique value and remove them. 
These columns won't be useful for the model since they don't add any information to each loan application.
'''

def unique_count(data):
    drop_columns = []
    columns = data.columns
    for each in columns:
        col = data[each].dropna()
        count = len(col.unique())
        if count == 1:
            drop_columns.append(each)
    return drop_columns

drop_cols = unique_count(loans)
drop_cols

['pymnt_plan',
 'initial_list_status',
 'collections_12_mths_ex_med',
 'policy_code',
 'application_type',
 'acc_now_delinq',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'tax_liens']
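The same check can be written more compactly with `DataFrame.nunique()`, which ignores missing values by default. This is an alternative sketch on a made-up frame (`toy` and its columns are hypothetical), not the code used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two single-valued columns and one that varies
toy = pd.DataFrame({
    "constant": ["n", "n", "n"],
    "constant_with_gap": [1.0, np.nan, 1.0],
    "varied": [1, 2, 3],
})

n_unique = toy.nunique()  # NaN is excluded from the count by default
single_valued = n_unique[n_unique == 1].index.tolist()
print(single_valued)
```

Applied to the real frame, `loans.nunique()` should flag the same columns as `unique_count(loans)`, since both ignore missing values before counting.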

In [230]:
# Dropping columns with one unique value

loans = loans.drop(drop_cols, axis = 1)
loans.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'loan_status',
       'purpose', 'title', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'last_credit_pull_d',
       'pub_rec_bankruptcies'],
      dtype='object')

In [231]:
loans.shape[0]

38770

### Feature Engineering - Preparing the Features

For our data to be ready for ML algorithms, we will - 
- Handle missing values
- Convert categorical variables to numerical values
- Remove other variables which are not useful for making predictions.


In [232]:
# Counting number of missing values in the data frame

null_counts = loans.isnull().sum()
null_counts

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

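We will delete columns where more than 1% of the values are missing, and then remove the remaining rows that contain null values.
- pub_rec_bankruptcies has more than 1% missing values, so we drop the column.
- emp_length also exceeds 1% (1,036 of 38,770 rows), but the code below keeps the column and drops only its rows with missing values.
- Drop the remaining rows with null values.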
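The 1% threshold can be checked directly from the counts printed above. The sketch below hard-codes those counts and the 38,770 rows reported earlier; the variable names are illustrative only.

```python
n_rows = 38770  # loans.shape[0] reported above

# Non-zero null counts copied from the output above
null_counts = {
    "emp_length": 1036,
    "title": 11,
    "revol_util": 50,
    "last_credit_pull_d": 2,
    "pub_rec_bankruptcies": 697,
}

for col, n_missing in null_counts.items():
    print(f"{col}: {n_missing / n_rows:.2%}")

# Columns whose share of missing values exceeds 1%
over_threshold = [c for c, n in null_counts.items() if n / n_rows > 0.01]
print(over_threshold)
```

Two columns cross the threshold here. On the real frame, `loans.isnull().mean()` should give the same fractions without hard-coding anything.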


In [233]:
# Dropping the column pub_rec_bankruptcies
loans = loans.drop('pub_rec_bankruptcies', axis = 1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())

object     11
float64    10
int64       1
dtype: int64


In [234]:
# Selecting  the columns of object type from loans
object_columns_df = loans.select_dtypes('object')
object_columns_df[:1]

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016


In [235]:
# Counting the number of unique values of object data elements

cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts())

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
 36 months    28234
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
AL     420
LA     420
KY     311
OK     285
KS     249
UT     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
AK      76
WY      76
SD      60
VT  

We should analyze the unique value counts for the __purpose__ and __title__ columns to understand which column we want to keep.

In [242]:
purpose_unique_counts = loans["purpose"].value_counts()
title_unique_counts = loans["title"].value_counts()

print(purpose_unique_counts.shape[0])
print(title_unique_counts.shape[0])


14
18881


In [243]:
purpose_unique_counts.head()

debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
Name: purpose, dtype: int64

In [244]:
title_unique_counts.head()

Debt Consolidation         2068
Debt Consolidation Loan    1599
Personal Loan               624
Consolidation               488
debt consolidation          466
Name: title, dtype: int64

The purpose and title columns contain overlapping information, but we'll keep the purpose column since it contains only a few discrete values (14, versus 18,881 for title). 
- In addition, the title column has data quality issues, since many of the values are repeated with slight variations (e.g. Debt Consolidation, Debt Consolidation Loan, and debt consolidation).
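To illustrate the data quality issue, the sketch below takes the five most common titles from the output above and normalizes only case and surrounding whitespace. It is a rough illustration, not a cleaning step used in the notebook.

```python
from collections import Counter

# Top five title values and their counts, copied from the output above
title_counts = {
    "Debt Consolidation": 2068,
    "Debt Consolidation Loan": 1599,
    "Personal Loan": 624,
    "Consolidation": 488,
    "debt consolidation": 466,
}

# Merge titles that differ only by case or surrounding whitespace
normalized = Counter()
for title, count in title_counts.items():
    normalized[title.strip().lower()] += count

print(normalized.most_common(2))
```

Even this minimal normalization merges two of the five entries, and variants such as "Debt Consolidation Loan" and "Consolidation" would still need manual rules, which is why purpose is the easier column to keep.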


The __home_ownership__, __verification_status__, __emp_length__, and __term__ columns each contain a few discrete categorical values. We should encode __home_ownership__, __verification_status__, and __term__ as dummy variables and keep them, and map __emp_length__ to numbers.

We can use the following mapping to clean the __emp_length__ column:

- "10+ years": 10
- "9 years": 9
- "8 years": 8
- "7 years": 7
- "6 years": 6
- "5 years": 5
- "4 years": 4
- "3 years": 3
- "2 years": 2
- "1 year": 1
- "< 1 year": 0
- "n/a": 0

- We assume that people who have been working for more than 10 years have worked for only 10 years. 
- We also assume that people who have worked for less than a year, or whose employment length is not available, have worked for 0 years. 

Lastly, the __addr_state__ column contains many discrete values, and we'd need to add 49 dummy variable columns to use it for classification. This would make our DataFrame much larger and could slow down how quickly the code runs. 
- We will remove this column from the dataset.

In [245]:
emp_dict = {
    'emp_length' : {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

loans = loans.replace(emp_dict)

In [246]:
# Removing the last_credit_pull_d, addr_state, title, and earliest_cr_line columns from loans.

loans = loans.drop(['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line'], axis = 1)

In [247]:
# Removing the '%' symbol from the values in int_rate and revol_util and converting the column to float type

loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float64')
loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float64')

In [248]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37675 entries, 0 to 39785
Data columns (total 18 columns):
loan_amnt              37675 non-null float64
term                   37675 non-null object
int_rate               37675 non-null float64
installment            37675 non-null float64
emp_length             37675 non-null int64
home_ownership         37675 non-null object
annual_inc             37675 non-null float64
verification_status    37675 non-null object
loan_status            37675 non-null int64
purpose                37675 non-null object
dti                    37675 non-null float64
delinq_2yrs            37675 non-null float64
inq_last_6mths         37675 non-null float64
open_acc               37675 non-null float64
pub_rec                37675 non-null float64
revol_bal              37675 non-null float64
revol_util             37675 non-null float64
total_acc              37675 non-null float64
dtypes: float64(12), int64(2), object(4)
memory usage: 5.5+ MB


We will now encode the __home_ownership__, __verification_status__, __purpose__, and __term__ columns as dummy variables so we can use them in our model.

In [249]:

dummies_data = pd.get_dummies(loans[['home_ownership', 'verification_status', 'purpose','term']])

# Combining the dummies dataframe with the existing dataset
loans = pd.concat([loans,dummies_data], axis = 1)

# Dropping the original columns -'home_ownership','term','verification_status','purpose'
loans = loans.drop(['home_ownership','term','verification_status','purpose'], axis = 1)

In [250]:
loans.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0,0,0,0,0,1,0,0,1,0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0,0,0,1,0,0,0,0,1,0
5,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0,0,0,0,0,0,0,1,1,0


We have applied data preparation techniques to -
- clean the data
- remove columns that had data leakage issues 
- remove variables that are not useful for making predictions
- remove columns that contained redundant information
- remove columns that required additional processing to turn into useful features 
- clean the features that had formatting issues
- convert categorical columns to dummy variables

Our goal was to generate features from the data that help us make predictions using machine learning techniques. 

We want to make predictions about whether or not a loan will be paid off on time, which is contained in the __loan_status__ column of the dataset.


In [252]:
loans['loan_status'].value_counts()

1    32286
0     5389
Name: loan_status, dtype: int64

We see that there's a class imbalance in our target column, loan_status. There are about 6 times as many loans that were paid off on time (positive case, label of 1) as loans that weren't (negative case, label of 0). 
- Imbalances can cause issues with many machine learning algorithms, which can appear to have high accuracy without actually learning from the training data. __Because of its potential to cause issues, we need to keep the class imbalance in mind as we build machine learning models.__
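A quick calculation from the counts above shows how misleading accuracy can be here. The sketch computes the class ratio and the accuracy of a hypothetical "model" that predicts 1 for every loan.

```python
# Class counts copied from the value_counts() output above
paid_off, charged_off = 32286, 5389

ratio = paid_off / charged_off
# Accuracy of always predicting 1: every paid-off loan is "correct"
baseline_accuracy = paid_off / (paid_off + charged_off)

print(f"ratio of 1s to 0s: {ratio:.2f}")
print(f"accuracy of always predicting 1: {baseline_accuracy:.1%}")
```

Any model has to beat that trivial baseline to show it has learned something beyond the class frequencies.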

### Error Metric 

We are trying to solve a binary classification problem, which is why we converted the loan_status column to 0s and 1s. 
- Before selecting an algorithm to apply to the data, we should select an error metric.

In our case, we're primarily concerned with false positives and false negatives:
- With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. 
- With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

Since we're taking the viewpoint of a conservative investor, we need to treat false positives differently than false negatives. 
- A conservative investor would want to minimize risk, and avoid false positives as much as possible.
- They'd be more okay with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives).

Because of the __class imbalance__ we saw earlier in the loan_status column, accuracy is a poor metric for this problem. 
- With this imbalance, a classifier can predict 1 for every row and still have high accuracy.

That is why it is important to be aware of the class imbalance while implementing machine learning models and to choose the error metric accordingly. 
- In our case, we don't want to use accuracy, and should instead use metrics that tell us the number of __false positives and false negatives.__

We should optimize for:

__False Positive Rate__ - The false positive rate is the number of false positives divided by the number of false positives plus the number of true negatives. This divides all the cases where we thought a loan would be paid off but it wasn't by all the loans that weren't paid off.

- What percentage of the actual 0s did I incorrectly predict as 1?
- In our case, what percentage of the loans that would not be repaid would I fund?

__True Positive Rate__ - The true positive rate is the number of true positives divided by the number of true positives plus the number of false negatives. This divides all the cases where we thought a loan would be paid off and it was by all the loans that were paid off.

- What percentage of the actual 1s did I correctly predict as 1?
- In our case, what percentage of the loans that would be repaid would I fund?
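The two definitions can be sketched as a small helper. The function name `tpr_fpr` and the toy label lists are made up for illustration; the notebook computes the same quantities with pandas filters.

```python
def tpr_fpr(actual, predicted):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


actual = [1, 1, 1, 1, 0, 0]

# Predicting 1 everywhere funds every good loan and every bad loan
print(tpr_fpr(actual, [1, 1, 1, 1, 1, 1]))  # (1.0, 1.0)

# A more selective set of predictions trades some TPR for a lower FPR
print(tpr_fpr(actual, [1, 1, 0, 1, 0, 1]))  # (0.75, 0.5)
```

The all-ones case is the one to watch for: both rates equal to 1 means the model is not separating good loans from bad ones at all.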


### Applying Machine Learning to Make Predictions

We will now apply machine learning techniques to make predictions. 
To start off, we will use logistic regression. 

The cleaned data is stored in a separate file, cleaned_loans_2007.csv. We will load the file and start applying machine learning techniques to it.

In [253]:
clean_loans = pd.read_csv('cleaned_loans_2007.csv')
clean_loans.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


In [254]:
features = clean_loans.columns
# Removing the target column
features =  features.drop('loan_status')

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(clean_loans[features], clean_loans['loan_status'])

# Make Predictions
lr_predictions = lr.predict(clean_loans[features])
lr_predictions

array([1, 1, 1, ..., 1, 1, 1])

The predictions made above are overfit because we used the same data both to train the model and to make predictions. 
- When we evaluate the error metric, the value will look unrealistically good because the model has already seen the correct answers. 

To get a realistic estimate of the error metric, we can use k-fold cross validation. We can use the cross_val_predict() function from the sklearn.model_selection package.
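The idea behind out-of-fold predictions can be sketched in plain Python. This is a deliberately simplified stand-in: the "model" just predicts the most common label in its training folds, and rows are assigned to folds by index modulo `k`. Both choices are assumptions for illustration; scikit-learn's splitters assign folds differently and the notebook uses a real classifier.

```python
def out_of_fold_predictions(labels, k=3):
    """Predict each row with a toy model that never saw that row."""
    n = len(labels)
    predictions = [None] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_labels = [labels[i] for i in range(n) if i % k != fold]
        # "Training": remember the most common label in the other folds
        majority = max(set(train_labels), key=train_labels.count)
        # "Predicting": apply it only to the held-out fold
        for i in test_idx:
            predictions[i] = majority
    return predictions


labels = [1, 1, 0, 1, 1, 1, 0, 1, 1]
preds = out_of_fold_predictions(labels)
print(preds)
```

Every row receives exactly one prediction, and that prediction comes from a model fitted without that row, which is what makes the resulting error metrics more trustworthy.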

In [255]:
from sklearn.model_selection import cross_val_predict

lr = LogisticRegression()

X = clean_loans[features]
Y = clean_loans['loan_status']

# 3-fold cross validation
cross_val_predictions = cross_val_predict(lr, X, Y, cv = 3)

# Series of predictions
cross_val_predictions = pd.Series(cross_val_predictions)
cross_val_predictions.head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

In [256]:
# Calculate the True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN)
TP = cross_val_predictions[(cross_val_predictions == 1) & (clean_loans['loan_status'] == 1)]
FP = cross_val_predictions[(cross_val_predictions == 1) & (clean_loans['loan_status'] == 0)]
TN = cross_val_predictions[(cross_val_predictions == 0) & (clean_loans['loan_status'] == 0)]
FN = cross_val_predictions[(cross_val_predictions == 0) & (clean_loans['loan_status'] == 1)]

# Calculate True Positive Rate (TPR)
TPR = len(TP)/(len(TP) + len(FN))

# Calculate False Positive Rate (FPR)
FPR = len(FP)/(len(FP) + len(TN))

print('True Positive Rate : ', TPR)
print('False Positive Rate : ', FPR)

True Positive Rate :  0.9989121566494424
False Positive Rate :  0.9967943009795192


The True Positive Rate and False Positive Rate are both close to 1, which is what we'd expect if the model were predicting all ones. 
Even though we are not using accuracy as the error metric, the classifier effectively is, because it weights every row equally during training. 
- __The model is not accounting for the class imbalance.__

One way to handle this class imbalance is to tell the classifier to __penalize misclassifications of the less prevalent class more than the others.__

We can do this by setting the __class_weight__ parameter to balanced when creating the LogisticRegression instance. 
- This tells scikit-learn to penalize misclassification of the minority class more heavily during the training process. 
- The penalty means that the logistic regression classifier __pays more attention to correctly classifying rows where loan_status is 0__. 
- This lowers accuracy when loan_status is 1, but raises accuracy when loan_status is 0.

By setting the class_weight parameter to balanced, the penalty is set to be inversely proportional to the class frequencies.
- For our data, this means that, for the classifier, correctly classifying a row where loan_status is 0 is about 6 times more important than correctly classifying a row where loan_status is 1.
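The "about 6 times" figure can be reproduced from the class counts. The sketch below applies the formula given in the scikit-learn documentation for the balanced mode, `n_samples / (n_classes * np.bincount(y))`, to the full-dataset counts. During cross validation the weights are computed on each training fold, so the exact values there will differ slightly.

```python
# Class counts from the full cleaned dataset (label -> number of rows)
counts = {0: 5389, 1: 32286}

n_samples = sum(counts.values())
n_classes = len(counts)

# Balanced weight per class: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}
relative_penalty = weights[0] / weights[1]

for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")
print(f"class 0 is weighted {relative_penalty:.2f}x class 1")
```

The relative penalty is simply the ratio of the class counts, which is why a manual dictionary such as `{0: 10, 1: 1}` is a harsher setting than balanced.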

In [267]:
lr = LogisticRegression(class_weight = 'balanced')

X = clean_loans[features]
Y = clean_loans['loan_status']

# 3-fold cross validation
predictions = cross_val_predict(lr, X, Y, cv = 3)


predictions = pd.Series(predictions)

predictions.head()

0    1
1    0
2    0
3    0
4    1
dtype: int64

In [258]:

# Calculate the True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN)
TP = predictions[(predictions == 1) & (clean_loans['loan_status'] == 1)]
FP = predictions[(predictions == 1) & (clean_loans['loan_status'] == 0)]
TN = predictions[(predictions == 0) & (clean_loans['loan_status'] == 0)]
FN = predictions[(predictions == 0) & (clean_loans['loan_status'] == 1)]

# Calculate True Positive Rate (TPR)
TPR = len(TP)/(len(TP) + len(FN))

# Calculate False Positive Rate (FPR)
FPR = len(FP)/(len(FP) + len(TN))

print('True Positive Rate : ', TPR)
print('False Positive Rate : ', FPR)

True Positive Rate :  0.6664551415707249
False Positive Rate :  0.3889581478183437


We were able to reduce the False Positive Rate to around 39%. 
- From a conservative investor's standpoint, it's reassuring that the false positive rate is lower because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything.

Balancing the classes significantly improved the false positive rate, but it also reduced the true positive rate to around 67%.
- This means that we will potentially be rejecting a good number of loans that would have been good candidates for investment.

We can lower our false positive rate even further by setting up harsher penalties for misclassifying the negative class.

- While setting class_weight to balanced automatically sets a penalty based on the number of 1s and 0s in the column, we can also specify the penalty manually if we want to adjust the rates further.
- We can do this by passing a dictionary of penalty values to the class_weight parameter.

In [266]:
penalty = {
    0: 10,
    1: 1
}

lr_p = LogisticRegression(class_weight = penalty)

X = clean_loans[features]
Y = clean_loans['loan_status']

# 3-fold cross validation
predictions_p = cross_val_predict(lr_p, X, Y, cv = 3)

# Series to hold predictions
predictions_p = pd.Series(predictions_p)

predictions_p.head()


0    0
1    0
2    0
3    0
4    0
dtype: int64

In [263]:
# Calculate the True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN)
TP = predictions_p[(predictions_p == 1) & (clean_loans['loan_status'] == 1)]
FP = predictions_p[(predictions_p == 1) & (clean_loans['loan_status'] == 0)]
TN = predictions_p[(predictions_p == 0) & (clean_loans['loan_status'] == 0)]
FN = predictions_p[(predictions_p == 0) & (clean_loans['loan_status'] == 1)]

# Calculate True Positive Rate (TPR)
TPR = len(TP)/(len(TP) + len(FN))

# Calculate False Positive Rate (FPR)
FPR = len(FP)/(len(FP) + len(TN))

print('True Positive Rate : ', TPR)
print('False Positive Rate : ', FPR)

True Positive Rate :  0.23551808539570301
False Positive Rate :  0.08726625111308994


- We could further reduce the false positive rate to about 9%. This will help the investor avoid bad loans.

- However, this also lowered the true positive rate to about 24%. This means that we are passing up good opportunities to invest. But from the standpoint of a conservative investor, avoiding bad loans is the priority.


### Applying Random Forests Algorithm

Let's try a more complex algorithm, the random forest.
- Random forests are able to work with nonlinear data and learn complex conditionals.
- Logistic regression can only learn a linear decision boundary over the features.
- Training a random forest may give us more accurate predictions if some columns relate nonlinearly to loan_status.

In [272]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state = 1, class_weight = 'balanced')

X = clean_loans[features]
Y = clean_loans['loan_status']

predictions_rf = cross_val_predict(rf, X, Y, cv = 3)

# Series to hold predictions
predictions_rf = pd.Series(predictions_rf)

predictions_rf.head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

In [273]:
# Calculate the True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN)
TP = predictions_rf[(predictions_rf == 1) & (clean_loans['loan_status'] == 1)]
FP = predictions_rf[(predictions_rf == 1) & (clean_loans['loan_status'] == 0)]
TN = predictions_rf[(predictions_rf == 0) & (clean_loans['loan_status'] == 0)]
FN = predictions_rf[(predictions_rf == 0) & (clean_loans['loan_status'] == 1)]

# Calculate True Positive Rate (TPR)
TPR = len(TP)/(len(TP) + len(FN))

# Calculate False Positive Rate (FPR)
FPR = len(FP)/(len(FP) + len(TN))

print('True Positive Rate : ', TPR)
print('False Positive Rate : ', FPR)

True Positive Rate :  0.9708699725017376
False Positive Rate :  0.9271593944790739


Using a random forest classifier didn't improve our false positive rate, which is still around 93%.
- The model is likely still weighting the 1 class too heavily and mostly predicting 1s. We could fix this by applying a harsher penalty for misclassifications of 0s.

So far, our best model, the logistic regression with the manual penalty, had a false positive rate of about 9% and a true positive rate of about 24%.
- For a conservative investor, this means that they make money as long as the interest earned is high enough to offset the losses from the roughly 9% of defaulting loans that the model still approves, and as long as the roughly 24% of good loans that it approves is a large enough pool to earn that interest.

If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them. Our model does better than that, although it excludes more loans than a random strategy would.
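
The comparison with random picking can be sketched as a short calculation. It combines the rates printed for the manually penalized logistic regression with the 14.5% overall default rate quoted above; treating that rounded figure as the exact base rate is an assumption, and the variable names are illustrative only.

```python
# Rates printed for the manually penalized logistic regression
tpr = 0.23551808539570301   # share of repaid loans the model approves
fpr = 0.08726625111308994   # share of defaulted loans the model approves

# Assumption: 14.5% of all loans default (rounded figure quoted in the text)
default_rate = 0.145

good_funded = (1 - default_rate) * tpr   # approved loans that are repaid
bad_funded = default_rate * fpr          # approved loans that default

funded_share = good_funded + bad_funded
default_share_of_funded = bad_funded / funded_share

print('Share of all loans funded        : ', round(funded_share, 3))
print('Share of funded loans defaulting : ', round(default_share_of_funded, 3))
```

Under these assumptions the model would fund roughly a fifth of all loans, and about 6% of the funded loans would default, compared with 14.5% when funding at random.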