# 2. Modeling Probability of Default

Modeling the probability that a loan defaults is critical to our investment strategy. After exploring the 2013 and 2014 data, we select predictors that show a promising separation between defaulted and repaid loans. We also exclude any predictor that captures information we would not have at the time of choosing which loans to invest in.

---

### Summary of variables

**Response:**

A newly defined binary variable, `paid`, indicating whether a loan had any status other than 'paid' or 'Fully Paid'.

**Predictors:**

- `funded_amnt`: The total amount committed to that loan at that point in time.
- `home_ownership`: The home ownership status provided by the borrower during registration or obtained from the credit report.
- `int_rate`: Interest rate on the loan.
- `purpose`: A category provided by the borrower for the loan request.
- `annual_inc`: The self-reported annual income provided by the borrower during registration.
- `verification_status`: Indicates whether income was verified by LC, not verified, or whether the income source was verified.
- `dti`: Debt-to-income ratio: the borrower's total monthly debt payments on all debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
- `revol_util`: Revolving utilization: the ratio of total current balance to credit limit across all revolving accounts.
- `grade`: Lending Club assigned loan grade.
- `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.

*We use one-hot encoding for the categorical variables (employment length, home ownership, verification status, grade and purpose).*

*We normalize all columns with values that are not percentages (funded amount, annual income and debt-to-income ratio).*
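
The two preprocessing steps described above can be illustrated with a small dependency-free sketch. The toy rows and the helper names `one_hot` and `standardize` are hypothetical, and the sketch assumes that "normalize" means rescaling to zero mean and unit standard deviation, which the text does not specify.

```python
from statistics import mean, pstdev

# Hypothetical toy rows; field names follow the predictor list above.
loans = [
    {"funded_amnt": 5000.0, "annual_inc": 40000.0, "dti": 12.5, "grade": "A"},
    {"funded_amnt": 12000.0, "annual_inc": 85000.0, "dti": 18.0, "grade": "C"},
    {"funded_amnt": 20000.0, "annual_inc": 120000.0, "dti": 7.2, "grade": "B"},
    {"funded_amnt": 8000.0, "annual_inc": 56000.0, "dti": 22.1, "grade": "C"},
]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


def standardize(rows, column):
    """Rescale a numeric column to zero mean and unit (population) std.

    Assumption: z-score scaling; the write-up only says "normalize".
    """
    values = [row[column] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**row, column: (row[column] - mu) / sigma} for row in rows]


prepared = one_hot(loans, "grade")
for col in ("funded_amnt", "annual_inc", "dti"):
    prepared = standardize(prepared, col)
```

In a pandas workflow, `pandas.get_dummies` performs the same kind of encoding on a DataFrame; the plain-Python version here only makes the mechanics explicit.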
    
    
---


### Handling Imbalanced Class

The algorithms we select to classify future loans as defaulting or not defaulting have trouble learning to predict underrepresented classes: when one class dominates the other, a classifier can score well by mostly predicting the majority class. For example, imagine writing an algorithm to classify iris flowers based on sepal length and width when the training data contain far more flowers of one species than of another.



There are four ways of addressing class imbalance problems like these:

- Synthesis of new minority class instances
- Over-sampling of the minority class
- Under-sampling of the majority class
- Tweaking the cost function to make misclassification of minority instances more important than misclassification of majority instances

This blog post is exclusively about the first solution: synthesis of new minority class instances via SMOTE, as implemented in the R library smotefamily.
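
For contrast with SMOTE, the two resampling options listed above can be sketched in a few lines of plain Python. The function names and the toy `defaulted` / `repaid` lists are hypothetical, and balancing the classes to exactly equal size is an assumption made for illustration rather than a recommendation.

```python
import random


def random_oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority rows until both classes match in size."""
    rng = random.Random(seed)
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra, majority


def random_undersample(minority, majority, seed=0):
    """Keep a random subset of majority rows the same size as the minority class."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))


# Hypothetical toy data: 10 defaulted loans versus 90 repaid loans.
defaulted = [("loan", i) for i in range(10)]
repaid = [("loan", i) for i in range(10, 100)]

over_min, over_maj = random_oversample(defaulted, repaid)
under_min, under_maj = random_undersample(defaulted, repaid)
```

Over-sampling keeps every repaid loan but repeats defaulted ones verbatim, while under-sampling discards most repaid loans; SMOTE instead creates new minority points rather than copies.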

**How does SMOTE resolve the class imbalance problem?**

SMOTE synthesises new minority instances between existing (real) minority instances. Imagine that SMOTE draws lines between existing minority instances and places new synthetic instances along those lines.
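
The interpolation idea can be sketched as follows. This is a simplified illustration, not the smotefamily implementation: the function name `smote_sketch`, the toy points, and the choice to pick the base instance at random for each synthetic point are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points on line segments between minority neighbours.

    Simplified: each synthetic point starts from a randomly chosen minority
    instance, moves toward one of its k nearest minority neighbours, and
    stops at a random fraction of the way along that segment.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # fraction of the way from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points in a two-feature space.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2),
                   (2.5, 2.4), (3.0, 1.8), (1.2, 2.6)]
new_points = smote_sketch(minority_points, n_synthetic=4, k=3)
```

Because every synthetic point lies on a segment between two real minority instances, it stays inside the region the minority class already occupies.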


https://jair.org/index.php/jair/article/view/10302/24590

### Summary of models

We select a handful of methods for this classification problem to 


    # Columns of interest
    cols = ['funded_amnt', 'emp_length', 'home_ownership', 'int_rate', 'purpose', 'total_pymnt',
            'annual_inc', 'verification_status', 'dti', 'loan_status', 'revol_util', 'grade',
            'term', 'term_adj', 'zip_code', 'delinq_2yrs']

    # Categorical columns to one-hot encode
    cols_encode = ['emp_length', 'home_ownership', 'verification_status', 'grade', 'purpose']

    # Numeric columns to normalize
    cols_normalize = ['funded_amnt', 'annual_inc', 'dti']

    # Columns to drop before training (e.g. the response `paid` and outcome information such as `total_pymnt`)
    cols_to_drop_training = ['loan_status', 'paid', 'amnt', 'total_pymnt', 'term_adj', 'zip_code']
