# Predicting Loan Default

We build a classifier to predict the probability that a given loan will default. We use Lending Club loan data covering 2007-2017, which is available on Kaggle.

The dataset contains 150 explanatory variables, so we apply feature selection to reduce them to a manageable set.

## Packages

In [1]:
import polars as pl 
import numpy as np 
import optuna 
import plotly
import plotly.express as px

## get file path of the data
from private import FILE_PATH

## Data

In [2]:
## load file
loans = pl.read_csv(FILE_PATH, ignore_errors=True)

## drop those that have null id 
loans = loans.drop_nulls(subset=["id"])

## Cleaning and Feature Elimination

Given the large number of features, we will perform significant feature elimination. We use the following methodology:

1. Eliminate features with more than 25% missing values.
2. Eliminate features with low correlation with the predicted variable. 

In [4]:
## define the default category we will predict
default_categories = ['Default', 
                      'Charged Off', 
                      'Does not meet the credit policy. Status:Charged Off']
loans = loans.with_columns(
    pl.col("loan_status")
    .is_in(default_categories)
    .cast(pl.UInt8)  ## vectorized cast: True -> 1, False -> 0
    .alias("default")
)
loans = loans.drop("loan_status")