# Predicting Loan Default

We build a classifier to predict the probability of default for a given loan. We use Lending Club loan data from 2007-2017, which can be found on Kaggle.

## Packages

In [10]:
import numpy as np 
import optuna 
import plotly
import plotly.express as px
import polars as pl 
import polars.selectors as cs

## get file path of the data
from private import FILE_PATH

In [24]:
# %pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-15.0.2-cp311-cp311-manylinux_2_28_x86_64.whl (38.3 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-15.0.2
Note: you may need to restart the kernel to use updated packages.


## Data

In [2]:
## load file
loans = pl.read_csv(FILE_PATH, ignore_errors=True)

## drop those that have null id 
loans = loans.drop_nulls(subset=["id"])

In [3]:
## glimpse() prints the summary itself; wrapping it in print() would also print "None"
loans.glimpse()

Rows: 2260668
Columns: 151
$ id                                         <i64> 68407277, 68355089, 68341763, 66310712, 68476807, 68426831, 68476668, 67275481, 68466926, 68616873
$ member_id                                  <str> None, None, None, None, None, None, None, None, None, None
$ loan_amnt                                  <f64> 3600.0, 24700.0, 20000.0, 35000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0
$ funded_amnt                                <f64> 3600.0, 24700.0, 20000.0, 35000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0
$ funded_amnt_inv                            <f64> 3600.0, 24700.0, 20000.0, 35000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0
$ term                                       <str> ' 36 months', ' 36 months', ' 60 months', ' 60 months', ' 60 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months'
$ int_rate                                   <f64> 13.99, 11.99, 10.78, 14.85, 22.45, 13.44, 9.17, 8.49, 6.49

## Cleaning and Feature Elimination

In this section, we clean the data substantially and eliminate most of the features.

### Getting the Default Column 

Since we want to predict default, we look at the `loan_status` column and use `Charged Off` as our proxy for default. These rows are encoded as `1` in a new `default` column. Of the remaining rows, only those that are `Fully Paid` are encoded as `0`. Everything else is dropped.

In [4]:
loans_df = (
    loans.with_columns(
        # cast the boolean comparison to 0/1 natively instead of a per-row Python call
        (pl.col("loan_status") == "Charged Off")
        .cast(pl.UInt8).alias("default")
    )
)

# filter 
loans_df = (
    loans_df.filter(
        (pl.col("loan_status") == "Fully Paid") | 
        (pl.col("loan_status") == "Charged Off")
    )
)

# drop loan status 
loans_df = loans_df.drop("loan_status")

### Feature Elimination

Given the large number of features, we will perform significant feature elimination. We use the following methodology: 

1. Eliminate features with more than 25% missing values. 
1. Eliminate features that appear to be irrelevant to default.

In [5]:
## eliminate features with more than 25% missing
null_fractions = loans_df.null_count() / loans_df.height
drop_list = [name for name in loans_df.columns
             if null_fractions[0, name] > 0.25]
loans_df = loans_df.drop(drop_list)

We will keep only the features which intuitively appear useful for predicting default. We first inspect the columns that remain, then consult the data dictionary to choose which ones to keep.

In [6]:
loans_df.glimpse()

Rows: 1345310
Columns: 93
$ id                         <i64> 68407277, 68355089, 68341763, 68476807, 68426831, 68476668, 67275481, 68466926, 68616873, 68338832
$ loan_amnt                  <f64> 3600.0, 24700.0, 20000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0, 1400.0
$ funded_amnt                <f64> 3600.0, 24700.0, 20000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0, 1400.0
$ funded_amnt_inv            <f64> 3600.0, 24700.0, 20000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0, 1400.0
$ term                       <str> ' 36 months', ' 36 months', ' 60 months', ' 60 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months'
$ int_rate                   <f64> 13.99, 11.99, 10.78, 22.45, 13.44, 9.17, 8.49, 6.49, 11.48, 12.88
$ installment                <f64> 123.03, 820.28, 432.66, 289.91, 405.18, 637.58, 631.26, 306.45, 263.74, 47.1
$ grade                      <str> 'C', 'C', 'B', 'F', 'C', 'B', 'B', 'A', 'B', 'C'
$ sub_

We keep features which contain relevant credit details of the borrower, including income, credit scores, and debt-to-income ratio. We also keep features which are available to investors when considering an investment in the loan, such as interest rate, loan grade, home ownership, and employment. Broadly, we pick features that would commonly be found on a loan application.

In [7]:
keep_list = [
 'id', 'loan_amnt', 'term', 'int_rate', 'installment',
 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership',
 'annual_inc', 'verification_status', 'purpose', 'title',
 'last_pymnt_amnt', 'num_actv_rev_tl', 'mo_sin_rcnt_rev_tl_op',
 'mo_sin_old_rev_tl_op', 'avg_cur_bal', 'acc_open_past_24mths', 'zip_code',
 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low',
 'fico_range_high', 'open_acc', 'pub_rec', 'pub_rec_bankruptcies',
 'initial_list_status', 'revol_bal', 'revol_util', 'total_acc', 
 'bc_open_to_buy', 'bc_util', 'default'
]

drop_list = [col.name for col in loans_df.iter_columns() 
             if col.name not in keep_list]

loans_df = loans_df.drop(drop_list)

loans_df.shape

(1345310, 37)

We have reduced the dataset to 37 columns, including `id` and the `default` target.

## EDA and Feature Selection

In this section, we perform EDA on the reduced dataset and use our analysis to further trim the features in our model. 

In [8]:
loans_df.glimpse()

Rows: 1345310
Columns: 37
$ id                    <i64> 68407277, 68355089, 68341763, 68476807, 68426831, 68476668, 67275481, 68466926, 68616873, 68338832
$ loan_amnt             <f64> 3600.0, 24700.0, 20000.0, 10400.0, 11950.0, 20000.0, 20000.0, 10000.0, 8000.0, 1400.0
$ term                  <str> ' 36 months', ' 36 months', ' 60 months', ' 60 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months', ' 36 months'
$ int_rate              <f64> 13.99, 11.99, 10.78, 22.45, 13.44, 9.17, 8.49, 6.49, 11.48, 12.88
$ installment           <f64> 123.03, 820.28, 432.66, 289.91, 405.18, 637.58, 631.26, 306.45, 263.74, 47.1
$ grade                 <str> 'C', 'C', 'B', 'F', 'C', 'B', 'B', 'A', 'B', 'C'
$ sub_grade             <str> 'C4', 'C1', 'B4', 'F1', 'C3', 'B2', 'B1', 'A2', 'B5', 'C2'
$ emp_title             <str> 'leadman', 'Engineer', 'truck driver', 'Contract Specialist', 'Veterinary Tecnician', 'Vice President of Recruiting Operations', 'road driver', 'SERVICE MANAGE

### Categorical Variables 

We first look at the categorical variables: `term`, `grade`, `sub_grade`, `emp_title`, `emp_length`, `home_ownership`, `verification_status`, `purpose`, `title`, `zip_code`, `addr_state`, `earliest_cr_line`, `initial_list_status`. 

In [13]:
## select non-numeric columns with the selectors API (pl.NUMERIC_DTYPES is deprecated)
cat_vars = loans_df.select(~cs.numeric()).columns

## note to_pandas requires pyarrow
loans_df[cat_vars].to_pandas().describe(include='all')

| | term | grade | sub_grade | emp_title | emp_length | home_ownership | verification_status | purpose | title | zip_code | addr_state | earliest_cr_line | initial_list_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1345310 | 1345310 | 1345310 | 1259525 | 1266799 | 1345310 | 1345310 | 1345310 | 1328651 | 1345309 | 1345310 | 1345310 | 1345310 |
| unique | 2 | 7 | 35 | 378353 | 11 | 6 | 3 | 14 | 61682 | 943 | 51 | 739 | 2 |
| top | 36 months | B | C1 | Teacher | 10+ years | MORTGAGE | Source Verified | debt_consolidation | Debt consolidation | 945xx | CA | Aug-2001 | w |
| freq | 1020743 | 392741 | 85494 | 21268 | 442199 | 665579 | 521273 | 780321 | 660960 | 15005 | 196528 | 9391 | 784010 |


As we can see, `emp_title`, `title`, `zip_code`, and `earliest_cr_line` all have many unique values (> 100). While these features may be useful, they are too granular for us to consider. We will therefore drop these features from our dataset. 

For further work, a sentiment analysis of `emp_title` and `title` could be useful. Also, since geography may be informative, we keep `addr_state` as a coarser geographic feature in place of `zip_code`.

In [14]:
loans_df = loans_df.drop(["emp_title", "title", "zip_code", "earliest_cr_line"])