# 2. Modeling Probability of Default

Modeling the probability that a loan defaults is critical to our investment strategy. After exploring the 2013 and 2014 data, we select predictors that show promising separation between defaulted and repaid loans. We also exclude any predictor that captures information we would not have at the time we choose which loans to invest in.

---

### Summary of variables

**Response:**

`paid`, a newly defined binary variable that equals 1 when a loan has any status other than 'Fully Paid' and 0 when the loan is fully paid. Despite its name, `paid = 1` marks the loans we treat as defaulted.

**Predictors:**

- `funded_amnt`: The total amount committed to that loan at that point in time.
- `home_ownership`: The home ownership status provided by the borrower during registration or obtained from the credit report.
- `int_rate`: Interest rate on the loan.
- `purpose`: A category provided by the borrower for the loan request.
- `annual_inc`: The self-reported annual income provided by the borrower during registration.
- `verification_status`: Indicates if income was verified by LC, not verified, or if the income source was verified.
- `dti`: Debt to income ratio - ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
- `revol_util`: Revolving utilization - ratio of total current balance to credit limit for all revolving accounts.
- `grade`: Lending Club assigned loan grade.
- `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.

    *We use one hot encoding for the categorical variables (employment length, home ownership, verification status, grade and purpose).*    
    *We normalize all columns with values that are not percentages (funded amount, annual income and debt-to-income ratio).*
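
The two preprocessing notes above can be sketched on a toy frame as follows. This is only an illustration: the rows are made-up values, `toy` and `encoded` are hypothetical names, and we assume that "normalize" means standardizing to zero mean and unit variance with `preprocessing.StandardScaler`, which the notebook does not confirm.

```python
import pandas as pd
from sklearn import preprocessing

# Toy rows using a few of the column names listed above (values are invented).
toy = pd.DataFrame({
    "funded_amnt": [5000.0, 12000.0, 20000.0, 8000.0],
    "annual_inc": [40000.0, 85000.0, 120000.0, 56000.0],
    "dti": [10.5, 22.1, 15.0, 30.2],
    "grade": ["A", "C", "B", "A"],
    "home_ownership": ["RENT", "MORTGAGE", "OWN", "RENT"],
})

cols_encode = ["grade", "home_ownership"]               # categorical columns
cols_normalize = ["funded_amnt", "annual_inc", "dti"]   # numeric columns to rescale

# One-hot encode: each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=cols_encode)

# Assumption: "normalize" = standardize to zero mean and unit variance.
scaler = preprocessing.StandardScaler()
encoded[cols_normalize] = scaler.fit_transform(encoded[cols_normalize])

print(encoded.columns.tolist())
```

In a real pipeline the scaler would be fit on the training data only and then reused, via `scaler.transform`, on the held-out data.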
    
    
---

## HIDE

In [68]:
import pandas as pd
import datetime
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE


In [15]:

## read in csv files
data_2013 = pd.read_csv("../data/2012_13_loan_data/LoanStats3b.csv", low_memory=False, encoding='latin-1')

data_a_2014 = pd.read_csv("../data/2014_loan_data/LoanStats_2014_A.csv", low_memory=False)
data_b_2014 = pd.read_csv("../data/2014_loan_data/LoanStats_2014_B.csv", low_memory=False)
data_c_2014 = pd.read_csv("../data/2014_loan_data/LoanStats_2014_C.csv", low_memory=False)
data_d_2014 = pd.read_csv("../data/2014_loan_data/LoanStats_2014_D.csv", low_memory=False)

data_a_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_A.csv", low_memory=False)
data_b_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_B.csv", low_memory=False)
data_c_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_C.csv", low_memory=False)
data_d_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_D.csv", low_memory=False)
data_e_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_E.csv", low_memory=False)
data_f_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_F.csv", low_memory=False)
data_g_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_G.csv", low_memory=False)
data_h_2015 = pd.read_csv("../data/2015_loan_data/LoanStats_2015_H.csv", low_memory=False)


## concatenate data into training and test sets
data_train = pd.concat([data_a_2014, data_b_2014, data_c_2014, data_d_2014, data_2013], ignore_index=True)

data_2015 = pd.concat([data_a_2015, data_b_2015, data_c_2015, data_d_2015, 
                       data_e_2015, data_f_2015, data_g_2015, data_h_2015], 
                       ignore_index=True)

In [27]:
data_train_ = data_train.copy()
# create response var
data_train_["paid"] = 1
data_train_.loc[data_train_.loan_status == 'Fully Paid', 'paid'] = 0

### MAKE TABLE TO SHOW IMBALANCE OF CLASSES
class_cts = data_train_.groupby(['paid']).agg({
    'paid' : 'count'})
class_cts['Percent'] = class_cts.paid/len(data_train)
class_cts.index = ['repaid','defaulted']
class_cts.columns = ['number of loans', 'percent of loans']

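### ADD IN SMOTE
def split_data(df, cols):
    # separate predictors (all columns except `cols`) from the response `paid`
    x = df.drop(cols, axis=1)
    y = df.paid
    return x, y

# drop the response from the predictors so SMOTE does not interpolate on `paid`
cols_to_drop_training = ['paid']
x_train_initial, y_train_initial = split_data(data_train_.loc[:, ['paid', 'funded_amnt']], cols_to_drop_training)
# recent imbalanced-learn releases use `sampling_strategy` and `fit_resample`
# (older releases used `ratio` and `fit_sample`)
sm = SMOTE(random_state=1, sampling_strategy=1.0)
x_train, y_train = sm.fit_resample(x_train_initial, y_train_initial)
# row order matches class_cts.index: repaid (paid == 0) first, then defaulted (paid == 1)
n_defaulted = sum(y_train)
class_cts['SMOTE number of loans'] = [len(y_train) - n_defaulted, n_defaulted]
class_cts['SMOTE percent of loans'] = [1 - n_defaulted/len(y_train), n_defaulted/len(y_train)]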

### Handling Imbalanced Classes

The algorithms we select to classify future loans as defaulting or not defaulting have trouble learning to predict underrepresented classes. As seen in the table below, the defaulted class is underrepresented in the training data.

To address this class imbalance, we synthesize new minority class instances with [SMOTE](https://jair.org/index.php/jair/article/view/10302/24590). For each minority class instance, SMOTE finds its k nearest neighbors within the minority class. Depending on the amount of oversampling required, one or more of those neighbors are selected, and synthetic examples are created along the line segments joining the instance to the selected neighbors. These synthetic examples augment the training dataset. We oversample the defaulted loan class to achieve a 50-50 split of classes using the `imbalanced-learn` library.
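
To make the neighbor-interpolation step concrete, here is a simplified NumPy sketch of the idea. It is not the `imbalanced-learn` implementation that the notebook actually uses; `smote_sketch`, `X_minority`, and the toy points are hypothetical names and values chosen only for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbor

    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors per row

    base = rng.integers(0, len(X_min), size=n_new)  # minority row to start from
    pick = rng.integers(0, k, size=n_new)           # which of its k neighbors to use
    gap = rng.random((n_new, 1))                    # interpolation weight in [0, 1)

    chosen = X_min[neighbors[base, pick]]
    # Each synthetic row lies on the segment between a row and one of its neighbors.
    return X_min[base] + gap * (chosen - X_min[base])

# Toy minority-class points (invented values).
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                       [1.2, 2.5], [1.8, 2.1], [1.1, 1.9]])
synthetic = smote_sketch(X_minority, n_new=10, k=3)
print(synthetic.shape)
```

For a 50-50 split, the number of synthetic rows equals the majority count minus the minority count. With 289,177 repaid and 81,266 defaulted loans, that is 207,911 synthetic defaulted loans.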

In [91]:
class_cts.round(2)

| | number of loans | percent of loans | SMOTE number of loans | SMOTE percent of loans |
|---|---|---|---|---|
| repaid | 289177 | 0.78 | 289177 | 0.5 |
| defaulted | 81266 | 0.22 | 289177 | 0.5 |


### Summary of models

We select a handful of methods for this classification problem.

In [None]:

    cols = ['funded_amnt', 'emp_length', 'home_ownership', 'int_rate', 'purpose', 'total_pymnt',
            'annual_inc', 'verification_status', 'dti', 'loan_status', 'revol_util', 'grade',
            'term', 'term_adj', 'zip_code', 'delinq_2yrs']
    # Cols to one hot encode
    cols_encode = ['emp_length', 'home_ownership', 'verification_status', 'grade', 'purpose']
    # Cols to normalize
    cols_normalize = ['funded_amnt', 'annual_inc', 'dti']
    
    cols_to_drop_training = ['loan_status', 'paid', 'amnt', 'total_pymnt', 'term_adj', 'zip_code']
