# Capstone Project 1 Milestone Report

## Background

Lending Club is an online loan platform that allows individuals to take out personal loans of up to $40,000. Borrowers can apply for a loan online and will typically receive their money within a few days of submitting their application. Unlike a bank, the platform uses investors to fund loans and acts as the intermediary between investors and borrowers.

Occasionally, a borrower does not pay back a loan in full and Lending Club must "Charge Off" the loan. This typically happens once a loan payment is at least 150 days past due, but it can also occur earlier or later depending on the circumstances (e.g., a borrower files for bankruptcy).

## Problem

How much money does Lending Club lose to charged-off loans? Is it possible to help Lending Club predict the risk of a specific borrower failing to pay off their loan? Can a model be built to minimize the risk to Lending Club investors and decrease the amount of money lost each year?

## The Data

To answer these questions, I will be using Lending Club's __[dataset](https://www.lendingclub.com/info/download-data.action)__, which contains all loan information from 2007-2011.

The original dataset includes over 140 features. I decided to start by reducing the data to 17 features of interest:
* **Funded Amount:** The amount loaned to the borrower
* **Term:** The length of the loan (either 36 months or 60 months)
* **Interest Rate:** Interest rate on the loan
* **Installment:** Loan payments
* **Grade:** LC assigned loan grade
* **Sub Grade:** LC assigned loan sub-grade 
* **Employment Title:** The job title supplied by the borrower when applying for a loan
* **Employment Length:** Borrower's length of employment
* **Home Ownership:** Home ownership status provided by borrower: RENT, OWN, MORTGAGE, OTHER
* **Annual Income:** Annual income provided by borrower
* **Verification Status:** Indicates if income was verified by LC
* **Issue Date:** Month and year the loan was issued
* **Loan Status:** Lists whether a loan is in good standing (recorded as Fully Paid) or Charged Off
* **Purpose of Loan:** Purpose of loan provided by borrower
* **Title of Loan:** Loan title provided by borrower
* **State of Borrower:** State of residence provided by borrower
* **dti:** Debt to income ratio calculated using borrower's total monthly debt payments on the total debt obligations, divided by borrower's self-reported annual income 

### Cleaning the Dataset
#### Missing Data

I found several columns with missing data and handled them as follows: 
* **Employment Title:** Missing 2624 entries. I replaced all missing values with 'Unknown'. Several titles also appeared fewer than 20 times, so I grouped those under 'Other'.
* **Annual Income:** Four entries were missing income data, so I replaced those with the mean annual income of \$69,136.56.
* **Title:** Missing 12 values, which I replaced with 'Unknown'.
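
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy frame, not the project's actual code: the column names (`emp_title`, `annual_inc`, `title`) are assumptions, and the rare-title threshold is lowered from 20 so the toy data exercises it.

```python
import pandas as pd

# Toy frame standing in for the Lending Club data; column names are assumed.
loans = pd.DataFrame({
    "emp_title": ["Teacher", None, "Teacher", "Nurse", None],
    "annual_inc": [50000.0, None, 70000.0, 60000.0, 80000.0],
    "title": ["Car", None, "Debt", "Home", "Car"],
})

MIN_TITLE_COUNT = 2  # the report uses 20; lowered here for the toy data

# Fill missing employment titles, then group rare titles under 'Other'.
loans["emp_title"] = loans["emp_title"].fillna("Unknown")
counts = loans["emp_title"].value_counts()
rare_titles = counts[counts < MIN_TITLE_COUNT].index
loans.loc[loans["emp_title"].isin(rare_titles), "emp_title"] = "Other"

# Impute missing income with the mean of the observed incomes.
loans["annual_inc"] = loans["annual_inc"].fillna(loans["annual_inc"].mean())

# Fill missing loan titles.
loans["title"] = loans["title"].fillna("Unknown")
```

Mean imputation is reasonable when only a handful of rows are affected, as here; with more missing values, the median would be less sensitive to very high incomes.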

#### Date Issued

In case I wanted to look deeper at the month and year that a loan was issued, I decided to create two additional columns: 
* **Month Issued:** the month a loan was issued
* **Year Issued:** the year a loan was issued
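
One way to derive these two columns, assuming the issue date is stored as a string such as `Dec-2011` in a column named `issue_d` (both the format and the column names are assumptions for illustration):

```python
import pandas as pd

# Toy data; the 'Mon-YYYY' string format is an assumption about the raw file.
loans = pd.DataFrame({"issue_d": ["Dec-2011", "Jan-2008", "Jul-2010"]})

# Parse once, then pull out the month and year as integers.
issued = pd.to_datetime(loans["issue_d"], format="%b-%Y")
loans["month_issued"] = issued.dt.month
loans["year_issued"] = issued.dt.year
```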

#### Loan Status

Since loan status is the dependent variable (the outcome I am trying to predict) throughout this project, I decided to turn it into a binary variable as follows:
* Fully Paid: 0
* Charged Off: 1

Note that Fully Paid means that the loan is currently up to date with all payments and is in good standing. It does not necessarily mean that the loan has been repaid in full.
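
A minimal sketch of this encoding, assuming the status lives in a column named `loan_status` and the new column is called `charged_off` (both names are illustrative):

```python
import pandas as pd

# Toy data; column names are assumed for illustration.
loans = pd.DataFrame({"loan_status": ["Fully Paid", "Charged Off", "Fully Paid"]})

status_map = {"Fully Paid": 0, "Charged Off": 1}
loans["charged_off"] = loans["loan_status"].map(status_map)

# Any label missing from status_map would become NaN, so check for that.
assert loans["charged_off"].notna().all()
```

Using an explicit mapping rather than a comparison makes any unexpected status label show up as a missing value instead of being silently counted as 0.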

## Exploratory Data Analysis


From 2007-2011, Lending Club issued over \$460 million in loans. Of the 42,535 loans issued during that time, 15.1 percent were charged off. These loans totaled over \$73.9 million. This amount does not take into account how much a borrower repaid before the loan was charged off, or how much interest Lending Club investors would have earned on the loan, but it's safe to say it is still a lot of money for Lending Club investors to lose!
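
Summary figures like the ones above can be computed along these lines. The toy numbers below are made up, and the column names (`funded_amnt`, `charged_off`) are assumptions:

```python
import pandas as pd

# Toy data; values and column names are illustrative only.
loans = pd.DataFrame({
    "funded_amnt": [10000, 5000, 20000, 15000],
    "charged_off": [0, 1, 0, 1],
})

# Share of loans that were charged off (mean of a 0/1 column).
charge_off_rate = loans["charged_off"].mean()

# Total dollars funded on loans that were later charged off.
charged_off_amount = loans.loc[loans["charged_off"] == 1, "funded_amnt"].sum()

# Charged-off dollars as a share of all dollars funded.
share_of_dollars = charged_off_amount / loans["funded_amnt"].sum()
```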

<img src="visuals/graph1.png">

### Trends over time

Next, I looked at the number of loans issued over time and compared it to the number of loans that were charged off over time.

<img src="visuals/graph2.png">

<img src="visuals/graph3.png">

Looking at the graphs, the number of charged-off loans appears to have stayed roughly proportional to the number of loans issued over time. Additional statistical analysis will allow me to see whether there is a statistically significant relationship here.
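
To check the visual impression numerically, the charge-off rate can be computed per year. A rough sketch on made-up data, with assumed column names `year_issued` and `charged_off`:

```python
import pandas as pd

# Toy data; values and column names are illustrative only.
loans = pd.DataFrame({
    "year_issued": [2010, 2010, 2011, 2011, 2011, 2011],
    "charged_off": [0, 1, 0, 0, 0, 1],
})

# For each year: loans issued, loans charged off, and the charge-off rate.
by_year = loans.groupby("year_issued")["charged_off"].agg(["count", "sum", "mean"])
by_year.columns = ["loans_issued", "loans_charged_off", "charge_off_rate"]
print(by_year)
```

If the `charge_off_rate` column stays roughly flat from year to year, that supports the reading that charge-offs scale with loan volume.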

### Statistical Analysis

To better understand whether or not there is a significant relationship between a feature and whether or not a borrower's loan will be charged off, I performed t-tests and logistic regressions on each feature of interest. 

Before I began the statistical analysis, I realized I needed to further narrow down which features to examine. When Lending Club receives a loan application, it gives that loan a grade (A-G). This grade is based on the applicant's loan information and credit score and is used to determine the interest rate on the loan. Since values like grade, interest rate, and installment are assigned by Lending Club based on the borrower's potential risk, I decided to focus only on the information provided by the borrower and use that to see if I can predict whether or not their loan will be charged off. This left the following features:

* Funded Amount
* Employment Length
* Home Ownership
* Annual Income 
* Date Loan was Issued
* Purpose of Loan
* State of Residence
* Debt to Income Ratio 

My statistical analysis revealed that all 8 features are significantly associated with whether or not a loan is charged off. I will be further examining them as I move into the machine learning portion of my project.
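
As an illustration of the kind of t-test described above, here is a sketch comparing debt-to-income ratio between charged-off and paid loans using `scipy.stats.ttest_ind`. The data is made up, the column names are assumptions, and the choice of Welch's variant (`equal_var=False`) is mine; the report does not state which variant was used.

```python
import pandas as pd
from scipy import stats

# Toy data; values and column names are illustrative only.
loans = pd.DataFrame({
    "dti": [5.0, 8.0, 10.0, 12.0, 18.0, 20.0, 22.0, 25.0],
    "charged_off": [0, 0, 0, 0, 1, 1, 1, 1],
})

paid = loans.loc[loans["charged_off"] == 0, "dti"]
charged = loans.loc[loans["charged_off"] == 1, "dti"]

# Welch's t-test: does mean dti differ between the two groups?
result = stats.ttest_ind(charged, paid, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value here only says the group means differ; it does not by itself indicate how useful the feature will be for prediction.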

A detailed breakdown of the statistical findings can be found in the __[project code](https://github.com/ameenamarie/Springboard-Data-Science-Career-Track/blob/master/Capstone%20Project%201/Capstone%20Project%201%20Milestone%20Code.ipynb)__.

