# Modelling Credit Risk
## Introduction
In this project I focus on credit modelling, a well-known data science problem concerned with estimating a borrower's credit risk. I'll be working with lending data from Lending Club, a marketplace for personal loans that matches borrowers seeking a loan with investors looking to lend money and earn a return.

### Lending Club Process
Each borrower fills out an application, providing their financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score and financial history to assign an interest rate to the borrower. A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. Each borrower is given a grade according to the interest rate they were assigned. If the borrower accepts the interest rate, the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. On the Lending Club marketplace, qualified investors can browse recently approved loans along with the borrower's credit score, the purpose of the loan, and other information from the application. When they want to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges. The borrower then makes monthly payments back to Lending Club over either 36 or 60 months. Lending Club redistributes these payments to the investors, so investors don't have to wait until the full amount is paid off to start seeing money back. If a loan is fully paid off on time, the investors earn a return that corresponds to the interest the borrower paid in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.
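
To make the repayment mechanics concrete, here is a minimal sketch of how a fixed monthly payment could be derived from a loan amount, an interest rate, and a term. It assumes the standard amortizing-loan (annuity) formula with monthly compounding; the text above does not state how Lending Club actually computes installments, and the function name `monthly_installment` is hypothetical. The origination fee is ignored. The inputs are taken from the first sample loan in the data preview further down (5,000 at 10.65% over 36 months).

```python
def monthly_installment(principal: float, annual_rate_pct: float, term_months: int) -> float:
    """Fixed monthly payment for a fully amortizing loan (standard annuity formula).

    Assumption: the annual rate is a nominal rate compounded monthly.
    """
    monthly_rate = annual_rate_pct / 100 / 12
    if monthly_rate == 0:
        # No interest: the principal is simply split evenly across the term.
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


# First sample loan from the dataset preview: 5,000 at 10.65% over 36 months.
payment = monthly_installment(5000, 10.65, 36)
total_repaid = payment * 36
interest_paid = total_repaid - 5000  # the investors' gross return if every payment is made

print(f"monthly payment: {payment:.2f}")
print(f"total repaid:    {total_repaid:.2f}")
print(f"interest paid:   {interest_paid:.2f}")
```

Under these assumptions the computed payment rounds to 162.87, which agrees with the `installment` value listed for that loan in the preview, so the formula appears to be a reasonable approximation for this dataset.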
### Project Scope
While Lending Club has to be extremely savvy and rigorous with its credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. Most investors use a portfolio strategy to invest small amounts in many loans, with a healthy mix of low-, medium-, and high-interest loans. In this project, I'll take the perspective of a conservative investor who only wants to invest in loans that have a good chance of being paid off on time. To do that, I'll first need to understand the features in the dataset and then experiment with building machine learning models that reliably predict whether a loan will be paid off.

## Problem Statement
Can I build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?
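
One way to make "accurately" concrete for a conservative investor is to look beyond plain accuracy at how often the model recommends a loan that ends up not being repaid. The sketch below is only an illustration: the labels and predictions are made up, and the choice of true positive rate and false positive rate as the metrics is an assumption rather than something stated in this write-up.

```python
import numpy as np

# Made-up outcomes for eight loans: 1 = paid off on time, 0 = not paid off.
actual = np.array([1, 1, 1, 1, 1, 0, 0, 1])
# Made-up model output for the same loans: 1 = "predicted to be paid off".
predicted = np.array([1, 1, 1, 0, 1, 1, 0, 1])

tp = int(np.sum((predicted == 1) & (actual == 1)))  # recommended and repaid
fp = int(np.sum((predicted == 1) & (actual == 0)))  # recommended but not repaid
tn = int(np.sum((predicted == 0) & (actual == 0)))  # rejected and not repaid
fn = int(np.sum((predicted == 0) & (actual == 1)))  # rejected but would have been repaid

accuracy = (tp + tn) / len(actual)
true_positive_rate = tp / (tp + fn)   # share of good loans the model finds
false_positive_rate = fp / (fp + tn)  # share of bad loans the model lets through

print(f"accuracy: {accuracy:.2f}")
print(f"true positive rate: {true_positive_rate:.2f}")
print(f"false positive rate: {false_positive_rate:.2f}")
```

In this toy example the accuracy is 0.75 even though half of the bad loans are let through, which is why a conservative investor would probably care more about the false positive rate than about accuracy alone.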

## Data
Lending Club releases data for all of the approved and declined loan applications periodically on [their website](https://www.lendingclub.com/info/statistics.action). The data dictionary can be found [here](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit).

The `LoanStats` sheet describes the approved loans datasets and the `RejectStats` sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, I'll focus on the approved loans only. The approved loans datasets contain information on current loans, completed loans, and defaulted loans.

I'll be focusing on approved loan data from 2007-2011 since most of the loans during this period have already been resolved. More recent datasets contain too many loans that are still in the process of being paid off.

First, I'll need to explore the data to determine which features I want to use and which column represents the target I want to predict.
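
As a rough sketch of what this exploration could look like, the helper below summarizes each column by its dtype, its share of missing values, and its number of distinct values. The function name `summarize_columns` is hypothetical, and the small stand-in frame only reuses a few values visible in the first two rows of the dataset; the same function could be applied to the full `loans` DataFrame once it is loaded.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })


# Stand-in frame built from a few values shown in the first two rows of the data.
sample = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0],
    "term": [" 36 months", " 60 months"],
    "int_rate": ["10.65%", "15.27%"],
    "policy_code": [1.0, 1.0],
})

summary = summarize_columns(sample)
print(summary)
```

A summary like this helps flag columns that may need attention before modelling: `int_rate` is stored as text with a percent sign, and a column with a single distinct value (as `policy_code` has in these two rows) would carry no information for a model if that also holds across the full dataset.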

```python
import pandas as pd
import numpy as np

loans = pd.read_csv('loans_2007.csv')
loans.shape, loans.head(2)
```

((42538, 52),
         id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
 0  1077501  1296599.0     5000.0       5000.0           4975.0   36 months   
 1  1077430  1314167.0     2500.0       2500.0           2500.0   60 months   
 
   int_rate  installment grade sub_grade  ... last_pymnt_amnt  \
 0   10.65%       162.87     B        B2  ...          171.62   
 1   15.27%        59.83     C        C4  ...          119.66   
 
   last_credit_pull_d collections_12_mths_ex_med  policy_code application_type  \
 0           Jun-2016                        0.0          1.0       INDIVIDUAL   
 1           Sep-2013                        0.0          1.0       INDIVIDUAL   
 
   acc_now_delinq chargeoff_within_12_mths delinq_amnt pub_rec_bankruptcies  \
 0            0.0                      0.0         0.0                  0.0   
 1            0.0                      0.0         0.0                  0.0   
 
   tax_liens  
 0       0.0  
 1       0.0  
 
 [2 rows x 52 co