# Loan default prediction

The dataset comes from [this Kaggle competition](https://www.kaggle.com/c/credit-default-prediction-ai-big-data/overview).

Our approach:
* Standard imports
* Read in data and look at summary statistics
* EDA (distributions, relationships)
* Model selection, model building (may try a few)
* Answer the 'so what'

## Data cleaning notebook - to be converted to .py script later

**TODO:**
* To drop:
    * accounts with loan balance of 99,999,999
    * 'Leaky' variables: drop credit score, ID
* Deal with missing values
    * Income - median
    * Years in current job - median
    * Months since last delinquency - nulls are probably people without previous delinquencies, so we may be able to fill them with a very high value (e.g. 999) and create a related dummy variable
    * Bankruptcies - if null then assume no bankruptcies
    * (credit score missing values don't matter because we will drop this anyway)
* Extract number from years in current job
* Make dummy variables out of categorical variables
    * 'purpose' seems like a free text field - may make a dummy variable for is/isn't debt consolidation
    * Home ownership - 'have mortgage' can be grouped with home mortgage
    * Term - 'short/long term' flag
* Deal with imbalanced dataset
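
The two missing-value rules above that are not plain median imputation (months since last delinquency, bankruptcies) could look roughly like the sketch below. The function names, the `No previous delinquency` column name and the toy frame are assumptions made for illustration; only the column names `Months since last delinquent` and `Bankruptcies` come from the data.

```python
import numpy as np
import pandas as pd

DELINQ_COL = "Months since last delinquent"


def fill_delinquency(df, fill_value=999):
    """Flag rows with no recorded delinquency, then fill the gaps with a sentinel."""
    df = df.copy()
    # dummy variable: 1 where the value was missing (assumed: never delinquent)
    df["No previous delinquency"] = df[DELINQ_COL].isnull().astype(int)
    df[DELINQ_COL] = df[DELINQ_COL].fillna(fill_value)
    return df


def fill_bankruptcies(df):
    """Assume a missing bankruptcy count means no bankruptcies."""
    df = df.copy()
    df["Bankruptcies"] = df["Bankruptcies"].fillna(0)
    return df


# toy data, not taken from train.csv
example = pd.DataFrame({
    DELINQ_COL: [12.0, np.nan, 40.0],
    "Bankruptcies": [0.0, 1.0, np.nan],
})
filled = example.pipe(fill_delinquency).pipe(fill_bankruptcies)
```

The dummy variable is created before filling so that a model can tell the sentinel 999 apart from a genuinely long gap since the last delinquency.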

In [5]:
# standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline

### SHOULD REALLY ADD SOME UNIT TESTS HERE

# drop accounts with large loan balance
def drop_large_loans(df):
    """
    Drop rows with loan balance of 99,999,999
    """
    return df[df["Current Loan Amount"] != 99999999].copy()

# drop leaky columns (ID, credit score)
def drop_cols(df, cols_to_drop=("Id", "Credit Score")):
    """
    Drop the columns specified; default is ID and credit score but these
    can be specified manually if required
    """
    # tuple default avoids a shared mutable default argument
    return df.drop(columns=list(cols_to_drop))

# Years in current job -> numeric
def convert_job_year_col(df):
    """
    Convert years in current job column to numeric by extracting numbers
    """
    # first map "< 1 year" to "0" so that extracting numbers doesn't turn it into 1
    # (kept as a string because .str.extract returns NaN for non-string values)
    df.loc[df["Years in current job"].str.strip() == "< 1 year", "Years in current job"] = "0"

    # extract number from string
    df["Years in current job"] = (
        df["Years in current job"]
        .str.extract(r"(\d+)", expand=False)  # match one or more digits
        .astype("float64")
    )
    
    return df

# deal with months since last delinquency


# impute zero for records without bankruptcy


# impute median for relevant columns (income, years in current job)
def impute_median(df, cols_to_fill=("Annual Income", "Years in current job")):
    """
    Fills the specified columns with the median of the column
    """
    # TO ADD: unit test to check that years in current job is numeric
    # for now, this assert checks that convert_job_year_col has been applied
    assert df["Years in current job"].dtype == "float64"
    
    for col in cols_to_fill:
        # assign back rather than calling fillna(inplace=True) on a column
        # selection, which may act on a temporary copy and leave df unchanged
        df[col] = df[col].fillna(df[col].median())
    
    # check columns have no nulls
    for col in cols_to_fill:
        assert df[col].isnull().sum() == 0, f"Still null values in column: {col}!"
    
    return df


# Make flag out of "purpose" column


# Categorical variables (home ownership, term)


# Imbalanced target class
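
One possible way to fill in the imbalanced-target stub is to compute per-class weights and pass them to a model later. The sketch below is an assumption about how this could be done, not a settled choice: `balanced_class_weights` is a hypothetical helper, and the toy series stands in for the target column, whose name is not shown in this notebook.

```python
import pandas as pd


def balanced_class_weights(target):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = target.value_counts()
    weights = len(target) / (len(counts) * counts)
    return weights.to_dict()


# toy target with a 3:1 imbalance, not taken from train.csv
y = pd.Series([0, 0, 0, 1])
weights = balanced_class_weights(y)
```

This is the same weighting rule that scikit-learn applies for `class_weight="balanced"`; resampling the minority class would be an alternative worth comparing.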



In [2]:
# read in data
df = pd.read_csv("data/train.csv")

# pipe cleaning functions together
df_clean = (
    df
    .pipe(drop_large_loans)
    .pipe(drop_cols)
    .pipe(convert_job_year_col)
    .pipe(impute_median)
)
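
The categorical steps from the TODO list (purpose flag, home ownership grouping, term flag) are not implemented yet. A sketch of how they might be written as further `.pipe` stages follows. The function names, the new column names and the category spellings in the toy frame are assumptions; values are lower-cased and stripped first so the sketch does not depend on the exact casing in `train.csv`. It also assumes these three columns have no nulls, which matches the `info()` output in this notebook.

```python
import pandas as pd


def add_debt_consolidation_flag(df):
    """Replace the free-text purpose column with an is/isn't debt consolidation flag."""
    df = df.copy()
    purpose = df["Purpose"].str.strip().str.lower()
    df["Is debt consolidation"] = (purpose == "debt consolidation").astype(int)
    return df.drop(columns=["Purpose"])


def group_home_ownership(df):
    """Fold 'have mortgage' into 'home mortgage', then one-hot encode."""
    df = df.copy()
    normalised = df["Home Ownership"].str.strip().str.lower()
    df["Home Ownership"] = normalised.replace({"have mortgage": "home mortgage"})
    return pd.get_dummies(df, columns=["Home Ownership"], dtype=int)


def add_long_term_flag(df):
    """Replace the term column with a single long-term flag."""
    df = df.copy()
    term = df["Term"].str.strip().str.lower()
    df["Is long term"] = term.str.startswith("long").astype(int)
    return df.drop(columns=["Term"])


# toy data, not taken from train.csv
example = pd.DataFrame({
    "Home Ownership": ["Have Mortgage", "Home Mortgage", "Rent"],
    "Purpose": ["debt consolidation", "other", "debt consolidation"],
    "Term": ["Short Term", "Long Term", "Short Term"],
})
encoded = (
    example
    .pipe(add_debt_consolidation_flag)
    .pipe(group_home_ownership)
    .pipe(add_long_term_flag)
)
```

Each function returns a new frame, so the stages can be appended to the existing chain after `impute_median` without changing the earlier steps.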

In [3]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6630 entries, 1 to 7499
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                6630 non-null   object 
 1   Annual Income                 6630 non-null   float64
 2   Years in current job          6630 non-null   float64
 3   Tax Liens                     6630 non-null   float64
 4   Number of Open Accounts       6630 non-null   float64
 5   Years of Credit History       6630 non-null   float64
 6   Maximum Open Credit           6630 non-null   float64
 7   Number of Credit Problems     6630 non-null   float64
 8   Months since last delinquent  3048 non-null   float64
 9   Bankruptcies                  6619 non-null   float64
 10  Purpose                       6630 non-null   object 
 11  Term                          6630 non-null   object 
 12  Current Loan Amount           6630 non-null   float64
 13  Cur

In [4]:
df = drop_large_loans(df)
df = drop_cols(df)
df = convert_job_year_col(df)
df = impute_median(df)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6630 entries, 1 to 7499
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                6630 non-null   object 
 1   Annual Income                 6630 non-null   float64
 2   Years in current job          6630 non-null   float64
 3   Tax Liens                     6630 non-null   float64
 4   Number of Open Accounts       6630 non-null   float64
 5   Years of Credit History       6630 non-null   float64
 6   Maximum Open Credit           6630 non-null   float64
 7   Number of Credit Problems     6630 non-null   float64
 8   Months since last delinquent  3048 non-null   float64
 9   Bankruptcies                  6619 non-null   float64
 10  Purpose                       6630 non-null   object 
 11  Term                          6630 non-null   object 
 12  Current Loan Amount           6630 non-null   float64
 13  Cur