# Lending Club Case Study

## Problem Context

#### The consumer finance company is the largest online loan marketplace, offering personal, business, and medical loans with easy access to lower interest rates. As with other lenders, loans issued to risky applicants are the primary source of financial loss, termed credit loss. Credit loss arises when borrowers refuse to pay or disappear with the money owed. Such borrowers are labelled defaulters, and those marked 'charged-off' cause the largest losses to the lender.


## Objective and Goal

#### To mitigate financial loss by using Exploratory Data Analysis (EDA) to identify the key factors that drive loan defaults, and to explore how this analysis can reduce the number of risky applicants who are approved.

## Business Understanding

#### The company's lending decision involves two types of risk:

1. Denying the loan to an applicant likely to repay leads to lost business for the company.
2. Approving the loan for an applicant unlikely to repay, essentially a potential defaulter, may result in financial loss for the company.

## 1. Importing Libraries and Dataset

In [1]:
#importing libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

In [2]:
#reading data
loan_data = pd.read_csv("loan.csv")

## 2. Information about the Dataset

In [3]:
#previewing first 5 rows of the data
loan_data.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,1077501,1296599,5000,5000,4975.0,36 months,10.65%,162.87,B,B2,...,,,,,0.0,0.0,,,,
1,1077430,1314167,2500,2500,2500.0,60 months,15.27%,59.83,C,C4,...,,,,,0.0,0.0,,,,
2,1077175,1313524,2400,2400,2400.0,36 months,15.96%,84.33,C,C5,...,,,,,0.0,0.0,,,,
3,1076863,1277178,10000,10000,10000.0,36 months,13.49%,339.31,C,C1,...,,,,,0.0,0.0,,,,
4,1075358,1311748,3000,3000,3000.0,60 months,12.69%,67.79,B,B5,...,,,,,0.0,0.0,,,,


In [None]:
#descriptive statistics
loan_data.describe()

In [None]:
#data type of each column
loan_data.dtypes

In [None]:
#previewing column names
loan_data.columns

In [None]:
print("Initial data frame size : ", loan_data.shape)

## 3. Cleaning Dataset

### a. removing columns that are entirely null (axis 1)

In [None]:
loan_data.dropna(axis = 1, how = 'all', inplace = True)
loan_data.head()

In [None]:
print("Data frame size after removing null value columns : ", loan_data.shape)

### b. dropping single-valued columns


In [None]:
#counting distinct values per column; columns with a single distinct value have no impact on the analysis
loan_data.nunique().sort_values()

In [None]:
#keeping only the columns that have more than one distinct value
loan_data = loan_data.loc[:,loan_data.nunique()>1]

In [None]:
print("Data frame size after removing single-valued columns : ", loan_data.shape)
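The `nunique() > 1` filter above can be illustrated on a small example. This is a minimal sketch on a hypothetical toy frame (the values are made up for illustration). It shows that `nunique()` ignores NaN by default, so a column holding one value plus nulls is also treated as single-valued and dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "policy_code": [1, 1, 1, 1],              # constant column
    "tax_liens": [0.0, np.nan, 0.0, np.nan],  # one value plus NaN
    "loan_amnt": [5000, 2500, 2400, 10000],   # varies across rows
})

# nunique() skips NaN by default, so 'tax_liens' has 1 distinct value.
distinct_counts = toy.nunique()

# Keep only columns with more than one distinct value.
kept = toy.loc[:, distinct_counts > 1]
print(distinct_counts.to_dict())
print(list(kept.columns))
```

If nulls should count as a separate value, `nunique(dropna=False)` is the variant to use instead.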

### c. dropping irrelevant columns

#### i. Eliminating columns that are populated after loan approval, or that this analysis treats as such, and therefore hold no relevance to the analysis. The features below are still present in the dataframe after cleaning steps a and b:
##### 'total_rec_late_fee', 'collection_recovery_fee', 'total_acc', 'earliest_cr_line', 'total_pymnt', 'revol_bal', 'open_acc', 'total_pymnt_inv', 'out_prncp', 'revol_util', 'inq_last_6mths', 'total_rec_int', 'pub_rec', 'last_pymnt_amnt', 'total_rec_prncp', 'delinq_2yrs', 'out_prncp_inv', 'recoveries', 'mths_since_last_delinq', 'last_pymnt_d'

In [None]:
loan_data=loan_data.drop(['total_rec_late_fee','collection_recovery_fee','total_acc','earliest_cr_line','total_pymnt','revol_bal','open_acc','total_pymnt_inv','out_prncp','revol_util','inq_last_6mths','total_rec_int','pub_rec','last_pymnt_amnt','total_rec_prncp','delinq_2yrs','out_prncp_inv',
 'recoveries','mths_since_last_delinq','last_pymnt_d'],axis=1)

In [None]:
print("Data frame size after dropping post loan approval columns : ", loan_data.shape)

#### ii. The following columns carry information that is irrelevant to explaining loan defaults, so they are removed:
#### 'emp_title', 'title', 'url', 'last_credit_pull_d', 'zip_code', 'addr_state', 'member_id', 'id', 'desc'.

In [None]:
loan_data = loan_data.drop(['emp_title', 'title', 'url', 'last_credit_pull_d', 'zip_code', 'addr_state', 'member_id', 'id', 'desc'], axis=1)

In [None]:
print("Data frame size after dropping irrelevant columns : ", loan_data.shape)

In [None]:
#displaying column names after data cleaning
loan_data.columns

### d. filtering null values

#### i. dropping all columns which have more than 70% null values

In [None]:
loan_data = loan_data.loc[:,loan_data.isnull().sum()/loan_data.shape[0]*100<70]

In [None]:
print("Data frame size after dropping columns with 70%+ null values: ", loan_data.shape)
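The 70% null threshold can also be expressed with `isnull().mean()`, which gives the null fraction per column directly. The following is a minimal sketch on a hypothetical toy frame, not the loan data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% null
    "half_missing": [np.nan, 2.0, np.nan, 4.0],       # 50% null
    "complete": [1, 2, 3, 4],                         # 0% null
})

# Mean of a boolean mask is the fraction of True values per column.
null_pct = toy.isnull().mean() * 100

# Keep columns whose null percentage is below the 70% threshold.
filtered = toy.loc[:, null_pct < 70]
print(null_pct.to_dict())
print(list(filtered.columns))
```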

In [None]:
#previewing column names
loan_data.columns

#### ii. handling columns which have null values < 70%

In [None]:
# checking null/missing values in the dataframe
loan_data.isnull().sum()

#### Null values remain in the pub_rec_bankruptcies and emp_length columns. Each can be either removed or imputed, depending on the column's relevance to the analysis objective.

In [None]:
#analysis on pub_rec_bankruptcies
loan_data.pub_rec_bankruptcies.value_counts()

In [None]:
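#replacing the null values with zeros would be a better approach for this column
#assigning the result back avoids relying on an in-place fill through attribute access
loan_data['pub_rec_bankruptcies'] = loan_data['pub_rec_bankruptcies'].fillna(0)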

In [None]:
#analysis on emp_length
loan_data.emp_length.value_counts()

In [None]:
#as per the data dictionary we can replace '10+ years' with '10', '< 1 year' with '0' and 'x years' with 'x'
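The mapping described in the comment above can be sketched as a single function. `emp_length_to_years` is a hypothetical helper introduced here for illustration, and the sample labels are assumed to follow the '10+ years', '< 1 year', 'x years' pattern.

```python
import pandas as pd

def emp_length_to_years(label: str) -> int:
    """Hypothetical helper: map an emp_length label to whole years."""
    if label == "10+ years":
        return 10
    if label == "< 1 year":
        return 0
    # Remaining labels look like '1 year' or '3 years': take the leading number.
    return int(label.split()[0])

# Illustrative sample of labels, not taken from the dataset.
labels = pd.Series(["10+ years", "< 1 year", "1 year", "3 years"])
years = labels.map(emp_length_to_years)
print(years.tolist())
```

The helper expects strings, so rows with a null emp_length need to be dropped or filled before it is applied.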

In [None]:
#removing rows where emp_length is null
loan_data = loan_data.dropna(subset=['emp_length'])

In [None]:
#convert the column to string for easy processing
loan_data['emp_length'] = loan_data['emp_length'].astype(str)

In [None]:
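#replace the '10+ years' and '< 1 year' values, restricted to the emp_length column
#(calling replace on the whole dataframe would also rewrite matching values in other columns)
loan_data['emp_length'] = loan_data['emp_length'].replace({'10+ years': '10', '< 1 year': '0'})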

In [None]:
#replace x years with x
loan_data.emp_length = pd.to_numeric(loan_data.emp_length.apply(lambda x:x.split()[0]))

In [None]:
loan_data.emp_length.value_counts()

### e. eliminating duplicate rows in the dataframe

In [None]:
loan_data = loan_data.drop_duplicates()

In [None]:
print("Data frame size after data cleaning: ", loan_data.shape)

## 4. Treating Outliers

### i. Examining outliers in continuous variables using box plots and eliminating them if needed

#### identified continuous columns for analysis: dti, int_rate, annual_inc, loan_amnt

In [None]:
#analysis on column dti
px.box(loan_data,x='dti',title='Dispersion of the Debt-to-Income Ratio',labels={'dti':'DTI ratio'}).show()

### No outliers are present in dti, so we can proceed with the analysis.

In [None]:
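#analysis on column int_rate
#int_rate is still stored as text such as '10.65%' at this point, so it is converted to float on a temporary copy for plotting
#plotly is used so the upper fence can be read interactively from the chart
px.box(loan_data.assign(int_rate=loan_data['int_rate'].astype(str).str.rstrip('%').astype(float)),x='int_rate',title='Distribution of Interest Rate',labels={'int_rate':'Interest Rate'}).show()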

### No outliers are present in int_rate, so we can proceed with the analysis.

In [None]:
#analysis on column annual_inc
px.box(loan_data,x='annual_inc',title='Dispersion of the Borrower\'s Annual Income',labels={'annual_inc':'Annual Income'}).show()

### The upper fence is 145.9k, while the maximum is 6000k, far above the upper fence. Therefore, we will eliminate the outliers in the annual_inc column.
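For reference, the conventional upper fence of a box plot is Q3 + 1.5 × IQR. The sketch below computes it with NumPy on hypothetical values. Plotly derives its fences with its own quartile method, so the figure read from the chart may differ slightly from this calculation.

```python
import numpy as np

# Hypothetical income values (in thousands); not taken from the dataset.
incomes = np.array([10, 20, 30, 40, 50, 60, 70, 80, 1000], dtype=float)

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1

# Conventional (Tukey) upper fence.
upper_fence = q3 + 1.5 * iqr

# Points above the fence are flagged as outliers.
outliers = incomes[incomes > upper_fence]
print(upper_fence, outliers.tolist())
```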

In [None]:
#examining the pattern of values in annual_inc through a line chart to determine the suitable quantile for removing outliers.
px.line(sorted(loan_data.annual_inc),title='Pattern in Annual Income',labels={'value':'Annual Income','index':'Data Points'}).show()

In [None]:
loan_data.annual_inc.quantile([0.25, 0.5, 0.75,0.90, 0.95, 0.96, 0.97,0.98, 0.99])

### The line chart and quantile distribution show that annual_inc rises steeply around the 99th percentile. Therefore, it is advisable to exclude values above the 99th percentile.

In [None]:
##eliminating outliers in annual_inc that exceed the 99th percentile
loan_data = loan_data[loan_data.annual_inc<=np.percentile(loan_data.annual_inc,99)]
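The effect of the percentile cut can be checked on a small example. This is a minimal sketch with hypothetical incomes, in which a single extreme value sits far above the rest. The threshold is computed once from the unfiltered column and then used as an inclusive upper bound.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (in thousands): 99 ordinary values and one extreme one.
toy = pd.DataFrame({"annual_inc": list(range(1, 100)) + [6000]})

# Compute the 99th percentile on the unfiltered column.
threshold = np.percentile(toy["annual_inc"], 99)

# Keep rows at or below the threshold; the extreme value is removed.
trimmed = toy[toy["annual_inc"] <= threshold]
print(len(toy), len(trimmed), trimmed["annual_inc"].max())
```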

In [None]:
#analysis on column loan_amnt
px.box(loan_data,x='loan_amnt',title='Distribution of Loan Amount',labels={'loan_amnt':'Loan Amount'}).show()

### The upper fence is 29.175k, while the maximum is 35k, only marginally above the upper fence. These values are therefore expected to have minimal impact on the analysis and are retained.

In [None]:
print("Data frame size after removing outliers: ", loan_data.shape)

## 5. Rectifying Data Types and Creating New Columns