# Lending Club Case Study

## 1. Introduction

The goal of this case study is to identify at least five driver variables that are strong indicators of loan default.

---

## 2. Setup & Data Understanding

### 2a. Loading the data for analysis

In [81]:
# Import the libraries required for performing the analysis
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns

In [82]:
# Read the dataset file into a Pandas Dataframe
loan_dataset = pd.read_csv('./loan.csv', low_memory=False)

In [83]:
# Sample the dataset to get a preview of different columns that exist in the dataset
loan_dataset.sample(5)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
5719,980395,1203578,30000,30000,30000.0,60 months,11.71%,662.95,B,B3,...,,,,,0.0,0.0,,,,
16964,723487,918478,27300,27300,27300.0,60 months,17.51%,685.99,E,E4,...,,,,,1.0,0.0,,,,
7652,878380,1093156,7200,7200,7200.0,36 months,11.49%,237.4,B,B4,...,,,,,0.0,0.0,,,,
32074,486784,620418,10000,10000,9000.0,36 months,7.51%,311.1,A,A4,...,,,,,0.0,0.0,,,,
10729,827524,1036480,4900,4900,4900.0,36 months,6.99%,151.28,A,A3,...,,,,,0.0,0.0,,,,


### 2b. Understanding the data

In [84]:
loan_dataset.shape

(39717, 111)

In [85]:
# Load the data dictionary as a Pandas Dataframe for easier reference during analysis
data_dictionary = pd.read_excel('./Data_Dictionary.xlsx', sheet_name='LoanStats')

In [86]:
# Create a method that can be called with the column names to get the column definitions
def get_column_definition(column_list):
    return data_dictionary[data_dictionary.LoanStatNew.isin(column_list)]

In [87]:
# Verify that the above method is working
get_column_definition(['loan_status'])

Unnamed: 0,LoanStatNew,Description
42,loan_status,Current status of the loan


In [88]:
loan_dataset.loan_status.unique()

array(['Fully Paid', 'Charged Off', 'Current'], dtype=object)

- The dataset has __111 columns__ and __39,717 rows__.
- The data sample in step 2a suggests that many of the columns on the right are empty.
- Looking at the data dictionary, the target variable seems to be `loan_status`, which indicates the current status of a loan. If the value of `loan_status` for a particular loan is `Charged Off`, it indicates a default. 
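
As a minimal sketch of how the target variable could be encoded, the snippet below builds a binary default flag from `loan_status` on a toy frame. The column name `is_default` and the decision to set aside `Current` loans (which have no final outcome yet) are assumptions made for illustration, not steps taken in this notebook.

```python
import pandas as pd

# Toy stand-in for loan_dataset; only the loan_status column matters here.
toy_loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
})

# Assumption: loans that are still 'Current' have no final outcome,
# so this sketch excludes them before computing a default rate.
closed_loans = toy_loans[toy_loans["loan_status"] != "Current"].copy()

# Hypothetical helper column: 1 for a default ('Charged Off'), 0 otherwise.
closed_loans["is_default"] = (closed_loans["loan_status"] == "Charged Off").astype(int)

# Share of closed loans that defaulted in the toy data.
default_rate = closed_loans["is_default"].mean()
print(default_rate)
```

A numeric flag like this makes it straightforward to compute default rates per category with `groupby(...).mean()` in later sections.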

---

## 3. Data Cleaning

### 3a. Drop empty and mostly empty columns

- From step 2a, it can be seen that there are a lot of columns that are empty. Since these empty columns do not add any value to the analysis they can be dropped. 

In [89]:
# For each column, determine the percentage of values that are missing
# and show only the columns where more than 50% of the values are missing.
percentage_of_missing_values_by_column = (loan_dataset.isna().sum() / loan_dataset.shape[0]) * 100
percentage_of_missing_values_by_column[percentage_of_missing_values_by_column > 50]

mths_since_last_delinq             64.662487
mths_since_last_record             92.985372
next_pymnt_d                       97.129693
mths_since_last_major_derog       100.000000
annual_inc_joint                  100.000000
dti_joint                         100.000000
verification_status_joint         100.000000
tot_coll_amt                      100.000000
tot_cur_bal                       100.000000
open_acc_6m                       100.000000
open_il_6m                        100.000000
open_il_12m                       100.000000
open_il_24m                       100.000000
mths_since_rcnt_il                100.000000
total_bal_il                      100.000000
il_util                           100.000000
open_rv_12m                       100.000000
open_rv_24m                       100.000000
max_bal_bc                        100.000000
all_util                          100.000000
total_rev_hi_lim                  100.000000
inq_fi                            100.000000
total_cu_t

- Although `mths_since_last_delinq` has approximately __64.7%__ of its values missing, it could be one of the driver variables for loan default, so it is kept.
- Most of the other columns have more than 90% of their values missing and hence can be dropped.

In [90]:
# Drop the columns where more than 90% of the values are missing.
columns_to_drop = list(percentage_of_missing_values_by_column[percentage_of_missing_values_by_column > 90].index)
loan_dataset.drop(columns_to_drop, axis='columns', inplace=True)

In [91]:
loan_dataset.shape

(39717, 55)

As the new shape shows, 111 - 55 = __56 columns__ (~50% of the columns) have been dropped because more than 90% of their values were missing.
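
The effect of the 90% threshold can be checked on a small example. The sketch below uses a toy frame with hypothetical column names; `isna().mean()` is used as an equivalent shorthand for dividing the missing count by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns: one complete, one 70% missing, one empty.
toy = pd.DataFrame({
    "full": list(range(10)),
    "mostly_missing": [1.0, 2.0, 3.0] + [np.nan] * 7,
    "empty": [np.nan] * 10,
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

# Only columns above the 90% threshold are dropped; the 70% column survives,
# just as mths_since_last_delinq does in the real dataset.
to_drop = list(missing_pct[missing_pct > 90].index)
trimmed = toy.drop(columns=to_drop)
print(to_drop, list(trimmed.columns))
```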

### 3b. Drop columns with only one value

In [92]:
unique_value_count_by_column = {column: loan_dataset[column].unique().size for column in loan_dataset.columns}
single_valued_columns = [ k for (k,v) in unique_value_count_by_column.items() if v == 1 ]
single_valued_columns

['pymnt_plan',
 'initial_list_status',
 'policy_code',
 'application_type',
 'acc_now_delinq',
 'delinq_amnt']

The columns listed above have only a single value and hence can be dropped from the analysis. 

In [93]:
# Drop the columns that only have a single value as they aren't useful for analysis. 
loan_dataset.drop(single_valued_columns, axis='columns', inplace=True)

In [94]:
loan_dataset.shape

(39717, 49)

- After removing the 6 columns having only one value, the total count of columns now stands at 49. 
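
One detail of the check above is that `unique()` counts a missing value as a distinct value, so a column holding a single value plus some missing entries is not flagged. The sketch below, on a toy frame with hypothetical column names, shows the same behaviour with `DataFrame.nunique` and how `dropna` changes the result. Whether such columns should also be dropped is a judgment call that this notebook does not make.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns.
toy = pd.DataFrame({
    "constant": ["n", "n", "n"],
    "constant_or_missing": [0.0, np.nan, 0.0],
    "varied": [1, 2, 3],
})

# Counting NaN as a value matches the behaviour of Series.unique().size.
counts_with_nan = toy.nunique(dropna=False)
single_valued = list(counts_with_nan[counts_with_nan == 1].index)

# Ignoring NaN also flags columns that are "one value or missing".
counts_without_nan = toy.nunique()
single_valued_ignoring_nan = list(counts_without_nan[counts_without_nan == 1].index)

print(single_valued)
print(single_valued_ignoring_nan)
```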

### 3c. Fixing column data types

In [95]:
# Get the column information for dataframe
loan_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39717 entries, 0 to 39716
Data columns (total 49 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          39717 non-null  int64  
 1   member_id                   39717 non-null  int64  
 2   loan_amnt                   39717 non-null  int64  
 3   funded_amnt                 39717 non-null  int64  
 4   funded_amnt_inv             39717 non-null  float64
 5   term                        39717 non-null  object 
 6   int_rate                    39717 non-null  object 
 7   installment                 39717 non-null  float64
 8   grade                       39717 non-null  object 
 9   sub_grade                   39717 non-null  object 
 10  emp_title                   37258 non-null  object 
 11  emp_length                  38642 non-null  object 
 12  home_ownership              39717 non-null  object 
 13  annual_inc                  397

From the info output above, it can be noticed that many of the columns are of data type `object`. In order to proceed in an organized fashion, the analysis can begin with the columns having the `object` data type. 

In [96]:
# Show all the columns that are of data type object
loan_dataset.select_dtypes('object').head(1).transpose()

Unnamed: 0,0
term,36 months
int_rate,10.65%
grade,B
sub_grade,B2
emp_title,
emp_length,10+ years
home_ownership,RENT
verification_status,Verified
issue_d,Dec-11
loan_status,Fully Paid


- There are 20 columns which are of data type `object`. 
- Looking at the columns, it appears the following columns can be better analyzed with a different data type. 
    - `term`: The duration of the loan should be of integer data type.
    - `int_rate`: The loan interest rate should be of float data type. 
    - `emp_length`: Although it seems like this can be an integer, it's better left as a string since it can be treated as an ordered categorical variable.
    - `issue_d`: Loan Issue Date can be transformed to 2 columns: `issue_month` and `issue_year`.
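
The conversions listed above could be sketched as follows on a toy frame. The sample values follow the formats shown in the outputs (`36 months`, `10.65%`, `10+ years`, `Dec-11`). The full list of `emp_length` labels and the English month abbreviations in `issue_d` are assumptions, since only a few values are visible here.

```python
import pandas as pd

# Toy rows mimicking the string formats seen in the object columns.
toy = pd.DataFrame({
    "term": ["36 months", "60 months"],
    "int_rate": ["10.65%", "15.27%"],
    "emp_length": ["10+ years", "< 1 year"],
    "issue_d": ["Dec-11", "Nov-10"],
})

# term: keep only the number of months as an integer.
toy["term"] = toy["term"].str.extract(r"(\d+)", expand=False).astype(int)

# int_rate: strip the percent sign and convert to float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype(float)

# emp_length: ordered categorical. The label set below is assumed.
emp_length_order = (
    ["< 1 year", "1 year"] + [f"{n} years" for n in range(2, 10)] + ["10+ years"]
)
toy["emp_length"] = pd.Categorical(
    toy["emp_length"], categories=emp_length_order, ordered=True
)

# issue_d: parse 'Mon-YY' and split into month and year columns.
issue_date = pd.to_datetime(toy["issue_d"], format="%b-%y")
toy["issue_month"] = issue_date.dt.month
toy["issue_year"] = issue_date.dt.year

print(toy.dtypes)
```

Keeping `emp_length` as an ordered categorical preserves the natural ordering for sorting and plotting without pretending that `10+ years` is an exact number.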

### 3d. Handling missing values

In [97]:
# Get the column wise percentage of missing values again. 
percentage_of_missing_values_by_column = (loan_dataset.isna().sum() / loan_dataset.shape[0]) * 100

In [101]:
# Columns that have missing values with the percentage of values that are missing
percentage_of_missing_values_by_column[percentage_of_missing_values_by_column > 0].sort_values(ascending=False)

mths_since_last_delinq        64.662487
desc                          32.580507
emp_title                      6.191303
emp_length                     2.706650
pub_rec_bankruptcies           1.754916
last_pymnt_d                   0.178765
collections_12_mths_ex_med     0.140998
chargeoff_within_12_mths       0.140998
revol_util                     0.125891
tax_liens                      0.098195
title                          0.027696
last_credit_pull_d             0.005036
dtype: float64
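
The output above only reports where values are missing. As a hedged illustration of some common options, and not necessarily the treatment used in this analysis, the sketch below applies three different strategies to a toy frame. The placeholder label `Unknown` and the indicator column `has_delinq_record` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame reusing three column names from the output above.
toy = pd.DataFrame({
    "revol_util": [83.7, np.nan, 9.4, 21.0],
    "emp_length": ["10+ years", "< 1 year", None, "3 years"],
    "mths_since_last_delinq": [np.nan, 35.0, 12.0, np.nan],
})

# Option 1: for a column with very few missing values, drop the affected rows.
cleaned = toy.dropna(subset=["revol_util"]).copy()

# Option 2: for a categorical column, keep the rows and use an explicit label.
cleaned["emp_length"] = cleaned["emp_length"].fillna("Unknown")

# Option 3: for a mostly missing column, leave the values as NaN and add an
# indicator recording whether a value is present at all.
cleaned["has_delinq_record"] = cleaned["mths_since_last_delinq"].notna()

print(cleaned)
```

Which option fits each column depends on why the value is missing, so the choice should be revisited column by column.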

---

## 4. Univariate Analysis

---

## 5. Segmented Univariate Analysis

---

## 6. Bivariate Analysis

---

## 7. Conclusion