# Lending Club Risk Prediction

The aim of this project is to build a machine learning model that accurately predicts whether a borrower will pay off their loan on time.

The dataset we'll be using is the complete loan data for all [Lending Club](https://www.lendingclub.com) loans issued from 2012 to 2013, which can be downloaded [here](https://www.lendingclub.com/info/download-data.action).

A description of each column can be found [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit#gid=2081333097).

# Importing and Cleaning the Data

In [1]:
import pandas as pd
import numpy as np

# We'll skip the first row because it provides information about the dataset's origin rather than column titles.
loans_2013 = pd.read_csv('loans2013.csv', skiprows=1)

og_cols = loans_2013.shape[1]
# Drop columns in which more than half of the values are missing
# (thresh is the minimum number of non-null values a column needs to be kept)
missing_info_thresh = loans_2013.shape[0]/2
loans_2013.dropna(thresh=missing_info_thresh, axis=1, inplace=True)

# Save the processed dataset in case we want to revisit it later
loans_2013.to_csv('processed_loans2013.csv', index=False)

print('{} columns dropped'.format(og_cols - loans_2013.shape[1]))

loans_2013.head(1)


58 columns dropped


Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,12000.0,12000.0,12000.0,36 months,7.62%,373.94,A,A3,Systems Engineer,3 years,...,100.0,0.0,0.0,233004.0,46738.0,14800.0,53404.0,N,Cash,N
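
To make the `thresh` argument of the `dropna` call above concrete, here is a minimal sketch on a toy frame. The column names are hypothetical and are not part of the Lending Club data. `thresh` counts non-null values, so a column with exactly half of its values present is kept.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (hypothetical column names).
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],         # 3 of 4 values present
    "half_full": [1.0, np.nan, 2.0, np.nan],        # 2 of 4 values present
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 values present
})

# thresh is the minimum number of non-null values a column needs to be kept.
min_non_null = len(toy) // 2
kept = toy.dropna(thresh=min_non_null, axis=1)

# Share of missing values per column, handy for auditing what was dropped.
missing_share = toy.isna().mean()

print(list(kept.columns))
print(missing_share.to_dict())
```

Checking `missing_share` before dropping is one way to confirm that the cutoff removes only the columns we expect.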


That's a lot of columns, and deciding which ones to keep is going to take some time. Let's familiarize ourselves with them and identify which ones:

- Leak information from the future
- Don't affect a borrower's ability to pay back a loan
- Need to be cleaned because of poor formatting
- Require more data or too much processing to be usefully turned into a feature
- Contain redundant information
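
As an illustration of the "poor formatting" case, the preview above shows `int_rate` stored as a string such as `7.62%` and `term` as `36 months`. The following is a rough sketch of how such columns might be converted to numbers. It uses toy rows, the second row is invented for illustration, and `term_months` is a hypothetical column name.

```python
import pandas as pd

# Toy rows mirroring the string formats seen in the preview
# (the second row is made up for illustration).
raw = pd.DataFrame({
    "int_rate": ["7.62%", "13.53%"],
    "term": ["36 months", "60 months"],
})

cleaned = pd.DataFrame({
    # Strip the trailing percent sign and convert to float.
    "int_rate": raw["int_rate"].str.rstrip("%").astype(float),
    # Extract the leading integer number of months.
    "term_months": raw["term"].str.extract(r"(\d+)", expand=False).astype(int),
})

print(cleaned)
```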

We'll want to import the data dictionary to make this process a little less painful.

# Columns 1-20

In [2]:
# We'll need to expand the maximum column width to view the descriptions
pd.options.display.max_colwidth = 250

# Import the data dictionary to easily reference what each column describes
data_dict = pd.read_excel('LCDataDictionary.xlsx', sheet_name='LoanStats')

data_dict.head()

Unnamed: 0,LoanStatNew,Description
0,acc_now_delinq,The number of accounts on which the borrower is now delinquent.
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan application
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by the borrower during registration.


In [3]:
def explain_columns(start, end):
    '''
    Returns a dataframe with column name, description of column, and first value in column

    start: starting index
    end: ending index (exclusive)
    '''
    cols = loans_2013.columns[start:end]

    # Keep only the data dictionary rows that describe the selected columns
    described = data_dict[data_dict['LoanStatNew'].isin(cols)].copy()

    # Look up the first value of each described column by position.
    # (Series.append was removed in pandas 2.0, so we build the column directly.)
    described['first value'] = [loans_2013[col].iloc[0] for col in described['LoanStatNew']]

    return described.reset_index(drop=True)


explain_columns(0, 20)

Unnamed: 0,LoanStatNew,Description,first value
0,addr_state,The state provided by the borrower in the loan application,TX
1,annual_inc,The self-reported annual income provided by the borrower during registration.,96500
2,emp_length,Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.,3 years
3,emp_title,The job title supplied by the Borrower when applying for the loan.*,Systems Engineer
4,funded_amnt,The total amount committed to that loan at that point in time.,12000
5,funded_amnt_inv,The total amount committed by investors for that loan at that point in time.,12000
6,grade,LC assigned loan grade,A
7,home_ownership,"The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.",MORTGAGE
8,installment,The monthly payment owed by the borrower if the loan originates.,373.94
9,int_rate,Interest Rate on the loan,7.62%
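
The table above lists only the columns that have a matching row in the data dictionary. As a possible alternative, the sketch below builds the same kind of table with a left merge, so columns without a dictionary entry stay visible with a missing description. The function name `explain_columns_merge`, the toy frames, and the placeholder descriptions are all hypothetical and not part of the original notebook.

```python
import pandas as pd

def explain_columns_merge(df, dictionary, start, end):
    """Hypothetical merge-based variant: one row per selected column."""
    preview = pd.DataFrame({
        "LoanStatNew": df.columns[start:end],
        "first value": df.iloc[0, start:end].to_numpy(),
    })
    # A left merge keeps columns that have no dictionary entry.
    merged = preview.merge(dictionary, on="LoanStatNew", how="left")
    return merged[["LoanStatNew", "Description", "first value"]]

# Toy stand-ins for loans_2013 and data_dict.
toy_loans = pd.DataFrame({
    "loan_amnt": [12000.0],
    "term": ["36 months"],
    "mystery_col": [1],  # deliberately absent from the toy dictionary
})
toy_dict = pd.DataFrame({
    "LoanStatNew": ["loan_amnt", "term"],
    "Description": ["placeholder description A", "placeholder description B"],
})

result = explain_columns_merge(toy_loans, toy_dict, 0, 3)
print(result)
```

Rows with a missing `Description` point to columns whose names do not match the data dictionary and may need to be looked up manually.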


In [4]:
# It may be meaningless, but let's find out what that asterisk signifies
data_dict.iloc[-1:]

Unnamed: 0,LoanStatNew,Description
116,,* Employer Title replaces Employer Name for all loans listed after 9/23/2013


Columns to drop:

- emp_title: relevance is questionable, and this free-text field likely contains so many distinct values that turning it into a useful feature would take too much processing
- funded_amnt: leaks future data
- funded_amnt_inv: leaks future data
- grade: essentially a less informative version of int_rate
- issue_d: leaks future data
- sub_grade: same reasoning as grade
- zip_code: largely redundant with addr_state
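
The concern about `emp_title` having too many distinct values can be checked with `nunique`. The sketch below uses toy values rather than the real data, so the counts are illustrative only. It also shows that simple normalization, such as lowercasing, merges only some of the variants.

```python
import pandas as pd

# Toy values; the real column would be loans_2013['emp_title'].
toy = pd.DataFrame({
    "emp_title": ["Systems Engineer", "systems engineer", "Teacher", "Nurse", "RN"],
    "addr_state": ["TX", "TX", "CA", "CA", "NY"],
})

# Number of distinct values per column.
cardinality = toy.nunique()

# Lowercasing merges the two spellings of the same title, but
# "Nurse" and "RN" would still need manual mapping.
normalized_titles = toy["emp_title"].str.lower().nunique()

print(cardinality.to_dict())
print(normalized_titles)
```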

In [5]:
dropped_cols = ['emp_title', 'funded_amnt', 'funded_amnt_inv', 'grade', 'issue_d', 'sub_grade', 'zip_code']
loans_2013.drop(columns=dropped_cols, inplace=True)
loans_2013.shape

(188183, 80)
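
One practical note: because the drop above is done in place, re-running that cell raises a `KeyError`, since the columns no longer exist. A minimal sketch of a re-runnable variant, assuming we are happy to skip labels that are already gone, passes `errors="ignore"`. The toy frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame with a subset of the real column names (values are made up).
toy = pd.DataFrame({"grade": ["A"], "int_rate": ["7.62%"], "zip_code": ["750xx"]})
to_drop = ["grade", "zip_code"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
first_pass = toy.drop(columns=to_drop, errors="ignore")
second_pass = first_pass.drop(columns=to_drop, errors="ignore")

print(list(second_pass.columns))
```

The trade-off is that a misspelled column name is skipped silently, so it is worth checking the resulting shape afterwards.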

# Columns 21-40

In [6]:
explain_columns(20, 40)

Unnamed: 0,LoanStatNew,Description,first value
0,acc_now_delinq,The number of accounts on which the borrower is now delinquent.,0
1,application_type,Indicates whether the loan is an individual application or a joint application with two co-borrowers,Individual
2,collection_recovery_fee,post charge off collection fee,0
3,collections_12_mths_ex_med,Number of collections in 12 months excluding medical collections,0
4,initial_list_status,"The initial listing status of the loan. Possible values are – W, F",f
5,last_credit_pull_d,The most recent month LC pulled credit for this loan,Mar-2019
6,last_pymnt_amnt,Last total payment amount received,2927.22
7,last_pymnt_d,Last month payment was received,Jun-2016
8,out_prncp,Remaining outstanding principal for total amount funded,0
9,out_prncp_inv,Remaining outstanding principal for portion of total amount funded by investors,0


collection_recovery_fee
collections_12_mths_ex_med
last_credit_pull_d
last_pymnt_amnt
last_pymnt_d
