### Package Imports and Options

In [8]:
from os import getcwd
import glob
from pprint import pprint

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

In [12]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
# None disables truncation of long cell contents (e.g. field descriptions)
pd.set_option('display.max_colwidth', None)

### Load Data

In [3]:
dic = pd.read_excel('data/LCDataDictionary.xlsx')

In [5]:
csv_files = glob.glob(f'{getcwd()}/data/*.csv')
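# header=1 skips the first line of each file; ignore_index avoids duplicate row labels across files
df = pd.concat((pd.read_csv(f, header=1, low_memory=False) for f in csv_files), ignore_index=True)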

### Analysis

The goal of this analysis is to build a simple baseline model that predicts default or delinquency for borrowers in the Lending Club dataset. This model will serve as a starting point for more robust analysis.

The Lending Club dataset provides a field called `loan_status`, which reports the state of each loan: fully paid, current, delinquent, in default, and so on. This is our variable of interest.

In [6]:
loan_status = df.loan_status

In [7]:
loan_status.value_counts()

Fully Paid                                             928104
Current                                                798972
Charged Off                                            233602
Late (31-120 days)                                      23027
In Grace Period                                         11950
Late (16-30 days)                                        5636
Does not meet the credit policy. Status:Fully Paid       1988
Does not meet the credit policy. Status:Charged Off       761
Default                                                    22
Name: loan_status, dtype: int64
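
Since the goal is to predict default or delinquency, these status labels eventually need to be collapsed into a single target. The sketch below shows one possible way to do that. Which statuses count as "bad", and the decision to leave `Current` and `In Grace Period` loans unlabeled because their outcome is not yet known, are assumptions made for illustration and not choices taken from this notebook. The names `STATUS_TO_TARGET` and `make_binary_target` are hypothetical.

```python
import pandas as pd

# Assumption: 1 = bad outcome (charged off, in default, or late), 0 = fully repaid.
# Statuses missing from this mapping (e.g. 'Current', 'In Grace Period') are
# treated as "outcome not yet known" and left unlabeled.
STATUS_TO_TARGET = {
    'Fully Paid': 0,
    'Does not meet the credit policy. Status:Fully Paid': 0,
    'Charged Off': 1,
    'Does not meet the credit policy. Status:Charged Off': 1,
    'Default': 1,
    'Late (16-30 days)': 1,
    'Late (31-120 days)': 1,
}


def make_binary_target(status: pd.Series) -> pd.Series:
    """Map loan_status labels to 1 (bad), 0 (good), or <NA> (unlabeled)."""
    # Unmapped labels become NaN; the nullable Int64 dtype keeps them as <NA>.
    return status.map(STATUS_TO_TARGET).astype('Int64')


# Toy example; in the notebook this would be applied to df.loan_status.
toy_status = pd.Series(['Fully Paid', 'Current', 'Charged Off',
                        'Late (31-120 days)', 'In Grace Period'])
toy_target = make_binary_target(toy_status)
print(toy_target)
```

Rows with an unlabeled target could then be dropped with `dropna()` before fitting a model, although whether late loans belong in the "bad" class is a modeling decision worth revisiting.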

To start, I will just use a few of the features that seem relevant based on my limited domain knowledge:

In [19]:
features = ['annual_inc', 'avg_cur_bal', 'int_rate', 'emp_length',
            'funded_amnt', 'grade', 'home_ownership', 'loan_amnt', 
            'term', 'installment', 'sub_grade', 'verification_status', 
            'purpose', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 
            'pub_rec', 'revol_bal', 'revol_util', 'total_acc']
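
Before relying on a hand-picked feature list like the one above, it can help to check how each column is typed and how much of it is missing. The following is a minimal sketch under stated assumptions: `summarize_features` is a hypothetical helper, and the toy frame contains made-up values only so that the example runs on its own. In the notebook it would be called as `summarize_features(df, features)`.

```python
import pandas as pd


def summarize_features(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return the dtype and missing-value fraction of each requested column.

    Columns that are not present in the frame are silently skipped.
    """
    present = [c for c in columns if c in frame.columns]
    return pd.DataFrame({
        'dtype': frame[present].dtypes.astype(str),
        'missing_frac': frame[present].isna().mean(),
    })


# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({
    'annual_inc': [50000.0, None, 72000.0, 64000.0],
    'emp_length': ['10', '0', None, None],
    'grade': ['A', 'B', 'C', 'A'],
})
summary = summarize_features(toy, ['annual_inc', 'emp_length', 'grade', 'dti'])
print(summary)
```

Columns that come back with a text dtype or a high missing fraction are likely to need cleaning or imputation before they can be fed to a model.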

In [20]:
dic[dic['LoanStatNew'].isin(features)]

Unnamed: 0,LoanStatNew,Description
4,annual_inc,The self-reported annual income provided by the borrower during registration.
7,avg_cur_bal,Average current balance of all accounts
13,delinq_2yrs,The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
16,dti,"A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income."
19,emp_length,Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
23,funded_amnt,The total amount committed to that loan at that point in time.
25,grade,LC assigned loan grade
26,home_ownership,"The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER"
32,inq_last_6mths,The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
33,installment,The monthly payment owed by the borrower if the loan originates.
