# Credit Default Prediction

The purpose of this project is to predict whether individuals will default on or repay their loans, based on many different factors. Many individuals with little or no credit history are either rejected when they apply for a loan or are taken advantage of by being charged extremely high interest rates. I will use data such as credit card balance, age, job title, annual pay, and education to examine whether clients are capable of repaying their loans.

## 1. Data

Home Credit is an international consumer finance provider with operations in 9 countries. The data from the Home Credit Default Risk Kaggle competition will be used for this capstone. The competition data can be found [here](https://www.kaggle.com/c/home-credit-default-risk/data). The following 7 tables were available from Home Credit. Below is a general overview of each table:

- application_{train|test}.csv
    - This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
    - Static data for all applications. One row represents one loan in our data sample.

- bureau.csv
    - All of the client's previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample).

- bureau_balance.csv
    - Monthly balances of previous credits in Credit Bureau.

- POS_CASH_balance.csv
    - Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.

- credit_card_balance.csv
    - Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.

- previous_application.csv
    - All previous applications for Home Credit loans of clients who have loans in our sample.

- installments_payments.csv
    - Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
    
For this project I started by cleaning and exploring the data in application_train.csv, since it contains a large amount of data about each applicant. I wanted to explore the effect of using simple features such as the age of the applicant, total credit loaned, total credit paid off, whether the applicant is a home owner, and education level. Initially I used only the data in application_train.csv; I then added data from bureau.csv and other engineered features.

## 2. Data Cleaning

Upon loading the data, the application dataframe has 307,511 entries (applications) and 122 columns (features), and the bureau dataframe has 1,716,427 entries and 17 columns.
I started by examining the quality of the data using .describe() and checking the percentage of missing values per column.
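
The missing-value check described above can be sketched roughly as follows. This is only an illustration: the tiny dataframe and its values are made up, and `missing_percentage` is a hypothetical helper name, not something from the original notebook.

```python
import numpy as np
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Tiny stand-in for the application dataframe (values are made up).
app = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 50000.0, np.nan, 75000.0],
    "OWN_CAR_AGE": [np.nan, np.nan, np.nan, 4.0],
    "CODE_GENDER": ["F", "M", "F", "F"],
})

print(app.describe())        # summary statistics for the numeric columns
missing = missing_percentage(app)
print(missing)               # percentage of missing values per column
```

Sorting the result makes it easy to spot which columns sit above a chosen cut-off for dropping versus imputing.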

Even though some features are not required to apply and get approved for a loan (such as whether the house you live in is brick or not), I am curious whether those features will improve or worsen our predictions. I will prepare the dataset in 3 different ways, using the same models on each:
1. Use the application dataframe only, without removing any columns, and impute the missing values.
2. Drop the features (columns) that are not important or required.
3. Add new features from the bureau dataframe and examine whether engineered features help the different models produce more accurate results.
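
For the third variant, one possible way to engineer features from the bureau dataframe is to aggregate it per applicant and merge the result onto the application dataframe. The sketch below assumes the join key is `SK_ID_CURR` and uses the `SK_ID_BUREAU` and `AMT_CREDIT_SUM` columns; these column names, the helper name, and the two example features are assumptions for illustration, not the features actually used in this project.

```python
import pandas as pd


def add_bureau_features(app: pd.DataFrame, bureau: pd.DataFrame) -> pd.DataFrame:
    """Aggregate bureau records per applicant and left-join them onto app."""
    agg = (
        bureau.groupby("SK_ID_CURR")
        .agg(
            BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),        # number of previous credits
            BUREAU_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),  # average credit amount
        )
        .reset_index()
    )
    merged = app.merge(agg, on="SK_ID_CURR", how="left")
    # Applicants with no bureau record have zero previous credits.
    merged["BUREAU_LOAN_COUNT"] = merged["BUREAU_LOAN_COUNT"].fillna(0).astype(int)
    return merged


# Made-up example data.
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})
merged = add_bureau_features(app, bureau)
print(merged)
```

A left join keeps every application row, so applicants without any bureau history are not silently dropped.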

Below is a summary of the main steps I took to clean the main issues I found:
- **Number of years columns:** All the columns holding a number of years were stored as a negative number of days. For instance, the age of an applicant who is 20 years old is stored as -7300.
    - Divided the number of days by 365.
    - Took the absolute value of each result from the previous step.
- **Missing values:** There are 122 columns in total in the application dataframe; 49 columns contained between 48% and 69% null values. I asked an expert in the field which of these columns are actually required when evaluating whether an individual will be granted a loan. It was concluded that the 49 columns are in fact not needed.
    - 49 columns were dropped from the data.
    - The mean or median was imputed for columns with 45% or less missing values.
- **Outliers:** Some values were definitely wrong (such as YEARS_EMPLOYED=1000.7 years).
    - There were 55,374 observations where the years employed is 1000.7 years.
    - Extremely high values, such as an applicant having 19 children, were replaced by the median.
    - YEARS_EMPLOYED contains a value of 1000.7 years - this row was deleted
- **Reduce number of columns:** There are 20 columns that flag whether a document has been submitted or not (1 for submitted and 0 for not). The sum of all of those columns was added as a new column (TOTAL_DOCUMENTS).
    - Deleted the 20 columns FLAG_DOCUMENT_2 to FLAG_DOCUMENT_21.
- **Categorical Variables:**
    - 31.5% of OCCUPATION_TYPE values were missing; imputed the most frequent category ('Laborers').
    - 4 rows contained 'XNA' as the gender; those rows were deleted.
    - 0.42% of NAME_TYPE_SUITE values were missing; imputed the mode.
    
- After discussing my project with a domain expert in finance, it was concluded that a lot of the features are neither needed nor required by financial institutions when applying for a loan. I was still curious to see whether those features would affect the prediction model positively or negatively, so I decided to use 2 different dataframes for this project:
    - The first dataframe includes all 122 features, with all missing values imputed using SimpleImputer.
    - The second dataframe has only those features that are in fact required when applying for a loan, which ended up being 53 columns.
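
A few of the cleaning steps above (days to years, dropping sparse columns, median imputation, and summing the document flags) can be sketched as below. The helper names, the `DAYS_BIRTH` column name, and the toy data are assumptions for illustration; the 45% threshold comes from the cut-off described above.

```python
import numpy as np
import pandas as pd


def days_to_years(df, day_cols):
    """Convert negative day counts to positive years (e.g. -7300 -> 20.0)."""
    out = df.copy()
    for col in day_cols:
        out[col.replace("DAYS_", "YEARS_")] = (out[col] / 365).abs()
    return out.drop(columns=day_cols)


def drop_sparse_columns(df, threshold=0.45):
    """Keep only columns whose share of missing values is at most threshold."""
    keep = df.columns[df.isna().mean() <= threshold]
    return df[keep]


def total_documents(df):
    """Sum the FLAG_DOCUMENT_* columns into TOTAL_DOCUMENTS and drop the flags."""
    flag_cols = [c for c in df.columns if c.startswith("FLAG_DOCUMENT_")]
    out = df.copy()
    out["TOTAL_DOCUMENTS"] = out[flag_cols].sum(axis=1)
    return out.drop(columns=flag_cols)


# Made-up example rows.
raw = pd.DataFrame({
    "DAYS_BIRTH": [-7300, -14600, -10950],
    "FLAG_DOCUMENT_2": [1, 0, 0],
    "FLAG_DOCUMENT_3": [1, 1, 0],
    "MOSTLY_MISSING": [np.nan, np.nan, 1.0],   # 67% missing, gets dropped
    "AMT_ANNUITY": [1000.0, np.nan, 3000.0],   # 33% missing, gets imputed
})

clean = total_documents(drop_sparse_columns(days_to_years(raw, ["DAYS_BIRTH"])))
clean["AMT_ANNUITY"] = clean["AMT_ANNUITY"].fillna(clean["AMT_ANNUITY"].median())
print(clean)
```

Each helper returns a new dataframe, so the steps can be chained or reordered without modifying the raw data.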

## 3. Exploratory Data Analysis

### a) Distribution of paid vs. unpaid loans

The first step of the EDA for this project is to take a closer look at the distribution of paid and unpaid loans. In Figure 1 we can see that 91.9% of the loans are paid, while only 8.1% are unpaid. Going forward we'll take a closer look at features that may influence unpaid loans.

#### Summary of Findings:
- 91.9% of Loans are paid back while 8.1% of the applicants fail to pay their loans back. (Figure 1)
- Based on our Target (0 = Paid and 1 = Unpaid Loan), the data is imbalanced.
- Females represent 65.8% of applicants and males 34.2%.
    - Females who fail to pay the loan back make up 4.6% of all applicants (7.5% of the total female population default on the loan).
    - Males who fail to pay the loan back make up 3.5% of all applicants (11.4% of the total male population default on the loan).
- 70% of applicants don't have any children, 19.9% have 1 child, and 8.7% have 2 children.
    - Applicants who don't have children represent 66.6% of unpaid loans (5.4% of the total population, paid and unpaid).
- 90.6% of loans are cash loans, while 9.4% are revolving loans.
    - Unpaid cash loans make up 7.6% of all loans, while unpaid revolving loans make up 0.5%.
- 49.3% of applicants have an occupation type of 'Laborers', 10.4% are 'sales staff' and 9.0% are 'core staff'.
- 71.1% of applicants hold a secondary education and 24.3% have higher education.
    - Secondary education holders represent 6.3% of unpaid loans (the highest), but they also account for the highest number of total applications.
- 69.4% of applicants are realty owners, while 66% of applicants don't own a car.
- There is a strong correlation between the amount of credit an applicant applies for and the price of the goods they are buying with the loan.
- The three variables with the strongest negative correlation with TARGET are EXT_SOURCE1, 2 and 3, meaning that as EXT_SOURCE increases, unpaid loans decrease. According to the documentation, EXT_SOURCE represents a "normalized score from external data source".
    - Taking the mean of the 3 EXT_SOURCE variables shows a stronger negative correlation, with a magnitude of 0.22.
- Age has a big impact on loan repayment: younger applicants are more likely to fail to pay their loans back. As the applicant's age increases, the percentage of non-payment decreases significantly.

 ![Distribution paid and unpaid loans](Charts/PaidUnpaid1.png)

![Loan Distribution by Gender](Charts/PaidUnpaidByGender.png)

![Distribution_EXT_SOURCES](Charts/EXT_KDE2.png)

 ![Failure to pay age groups](Charts/age.png)
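
The two kinds of percentages in the summary (default rate within a group versus a group's defaulters as a share of all applicants) and the correlation of the mean EXT_SOURCE with TARGET could be computed along these lines. This is a sketch on made-up rows; the helper names and the column names `CODE_GENDER` and `EXT_SOURCE_1` to `EXT_SOURCE_3` are assumptions.

```python
import numpy as np
import pandas as pd


def default_rate_within(df, col, target="TARGET"):
    """Percentage of each group that defaults (TARGET = 1)."""
    return df.groupby(col)[target].mean() * 100


def defaulters_share_of_all(df, col, target="TARGET"):
    """Each group's defaulters as a percentage of ALL applicants."""
    return df.loc[df[target] == 1, col].value_counts() / len(df) * 100


def ext_source_mean_corr(df, target="TARGET"):
    """Correlation between the row-wise mean of EXT_SOURCE_* and TARGET."""
    ext_cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]
    ext_mean = df[ext_cols].mean(axis=1)  # NaN values are skipped per row
    return ext_mean.corr(df[target])


# Made-up example rows.
df = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 0],
    "CODE_GENDER": ["F", "F", "F", "F", "M", "M"],
    "EXT_SOURCE_1": [0.8, 0.7, np.nan, 0.2, 0.1, 0.6],
    "EXT_SOURCE_2": [0.9, 0.6, 0.7, 0.3, 0.2, np.nan],
    "EXT_SOURCE_3": [0.7, np.nan, 0.8, 0.1, 0.3, 0.7],
})

within = default_rate_within(df, "CODE_GENDER")
share = defaulters_share_of_all(df, "CODE_GENDER")
corr = ext_source_mean_corr(df)
print(within, share, corr, sep="\n")
```

Keeping the two percentage helpers separate makes it explicit which denominator each reported figure uses.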

## 4. Pre-Processing

For data preprocessing, a few steps were taken:
 - Created a new dataframe to hold all columns with data type "OBJECT"
 - Created a new dataframe to hold all numerical values (in order to scale them)
 - Created a new dataframe to hold all flags (flags are numerical values but they don't need to be scaled)
 - Used SimpleImputer to impute any NaN values that I might have missed during the cleaning process
 - Used MinMaxScaler to scale the data, since the values were on very different scales (such as income, age, number of children)
 - Used get_dummies to transform categorical variables
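
The preprocessing steps above might look roughly like the following. This is a minimal sketch on made-up data: the `preprocess` helper, the median imputation strategy, and the example column names are assumptions, and it fits the imputer and scaler on the whole frame for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def preprocess(df, flag_cols):
    """Split columns by kind, impute and scale numerics, one-hot encode categoricals."""
    cat = df.select_dtypes(exclude="number")      # non-numeric (object/string) columns
    flags = df[flag_cols]                         # 0/1 flags, left unscaled
    num = df.drop(columns=list(cat.columns) + list(flag_cols))

    num_imputed = SimpleImputer(strategy="median").fit_transform(num)
    num_scaled = pd.DataFrame(
        MinMaxScaler().fit_transform(num_imputed),
        columns=num.columns,
        index=df.index,
    )
    cat_encoded = pd.get_dummies(cat)
    return pd.concat([num_scaled, flags, cat_encoded], axis=1)


# Made-up example rows.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100.0, 200.0, np.nan, 300.0],
    "FLAG_OWN_PHONE": [1, 0, 1, 0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans", "Cash loans"],
})
processed = preprocess(df, flag_cols=["FLAG_OWN_PHONE"])
print(processed)
```

In a train/test workflow, the imputer and scaler would normally be fit on the training split only and then applied to the test split, so that no information from the test data leaks into the transformation.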

## 5. Modeling

I will be using 2 different dataframes and testing 3 different algorithms to determine whether each algorithm provides better predictions with more features (122 features) or with fewer features (53 features):
1. Logistic regression
2. Gradient Boosting
3. Random Forest
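
The comparison described above can be sketched as a small loop over the three algorithms, run once per feature set. The example uses synthetic data from `make_classification` rather than the Home Credit data, and the `compare_models` helper, the hyperparameters, and the 75/25 split are assumptions, not the settings used for the results reported below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, random_state=0):
    """Fit the three classifiers on one split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    models = {
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Gradient boosting": GradientBoostingClassifier(
            n_estimators=50, random_state=random_state
        ),
        "Random forest": RandomForestClassifier(
            n_estimators=50, random_state=random_state
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in: a "full" feature set and a reduced one.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
scores_all = compare_models(X, y)
scores_reduced = compare_models(X[:, :10], y)
print(scores_all)
print(scores_reduced)
```

Using the same split and random seed for both feature sets keeps the comparison between them as like-for-like as possible.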
    

### 5.1 Logistic Regression

#### 5.1.1 Logistic Regression with 122 features

The accuracy of the logistic regression model on the test set was 49%, which is low. On closer inspection, the model did not predict any TARGET=1 values. Even though the data is balanced, I am not sure why none of the unpaid loans were predicted.

![logistic regression report](Charts/logReg1.png)

![logistic regression ROC](Charts/logReg1-ROC.png)
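
One way to check for the behaviour described above (a model that never predicts TARGET=1) is to look at the confusion matrix and the count of positive predictions alongside accuracy. The sketch below uses made-up labels and a hypothetical `summarize_predictions` helper; it shows that on a balanced set, an all-zero predictor lands at about 50% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def summarize_predictions(y_true, y_pred):
    """Return accuracy, the number of predicted positives, and the confusion matrix."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "predicted_positive": int(np.sum(np.asarray(y_pred) == 1)),
        # Rows are true classes (0, 1); columns are predicted classes (0, 1).
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
    }


# Made-up balanced labels and a model that always predicts 0.
y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.zeros_like(y_true)
summary = summarize_predictions(y_true, y_pred)
print(summary)
```

If the second column of the confusion matrix is all zeros, the model is not predicting the positive class at all, which accuracy alone does not reveal.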

#### 5.1.2 Logistic Regression with 53 features

The results with the reduced columns are exactly the same as the results found above with the complete dataset.

![logistic regression report](Charts/logReg2.png)

![logistic regression ROC](Charts/logReg2-ROC.png)

### 5.2 Gradient Boosting

#### 5.2.1 Gradient boosting with 122 features

The accuracy score for gradient boosting is 61.4%, which is a much better performance than the logistic regression model; however, the results are still not accurate enough.

Accuracy score (training): 0.623

Accuracy score (validation): 0.614

Accuracy: 0.614

![Gradient boosting ROC](Charts/GB-ROC2.png)

#### 5.2.2 Gradient boosting with 53 features

The accuracy score for gradient boosting is 62.3%, which is a much better performance than the logistic regression model; however, the results are still not accurate enough.

Accuracy score (training): 0.629

Accuracy score (validation): 0.623

Accuracy: 0.623

![Gradient boosting ROC](Charts/GB-ROC1.png)

The performance of the GB model is slightly better with fewer features: the accuracy score increases from 61.4% to 62.3%.

### 5.3 Random Forest

#### 5.3.1 Random Forest with 122 features

So far this is the best performing model, with an accuracy score of 68.2% and an area under the ROC curve of 0.742.

Random Forest: Accuracy=0.682

Random Forest: f1-score=0.682

![Random Forest ROC](Charts/RF_ROC1.png)
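
Metrics like the ones reported above (accuracy, f1-score, and area under the ROC curve) can be computed as in the sketch below. It runs on synthetic data, so its numbers will not match the results above, and the hyperparameters and the default binary averaging for the f1-score are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and TARGET.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)               # hard class labels, used for accuracy and f1
y_proba = rf.predict_proba(X_test)[:, 1]  # probability of class 1, used for ROC AUC

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
print(f"Random Forest: Accuracy={accuracy:.3f}")
print(f"Random Forest: f1-score={f1:.3f}")
print(f"Random Forest: ROC AUC={auc:.3f}")
```

The ROC AUC is computed from predicted probabilities rather than hard labels, since it measures how well the model ranks unpaid loans above paid ones across all thresholds.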

#### 5.3.2 Random Forest with 53 Features

The accuracy with the reduced columns is exactly the same as the accuracy with the 122-feature dataset.

Random Forest: Accuracy=0.682

Random Forest: f1-score=0.682

![Random Forest ROC](Charts/RF_ROC2.png)

## 6. Variable Importance

The two random forest models were the best performing models. Both had EXT_SOURCE3 and EXT_SOURCE2 as the variables with the highest importance.

![Variable Importance](Charts/RF_Var_importance.png)
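
A ranking like the one in the chart above can be read from a fitted random forest's `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names; `top_importances` is a hypothetical helper, not code from the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_importances(model, feature_names, n=10):
    """Return the n largest feature importances as a Series, highest first."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(n)


# Synthetic stand-in with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top = top_importances(rf, feature_names, n=5)
print(top)
```

These impurity-based importances are relative scores that sum to 1 across all features, so they indicate ranking rather than the direction of a feature's effect.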

## 7. Future Work

- Create a user interface where users are able to enter their information (age, income, total owing credit, number of active loans, years of employment) and get a prediction of the likelihood of defaulting on a loan.
- Filter the datasets and remove EXT_SOURCE and check performance.
- Use the dataset with added engineered features and check whether the accuracy increases or decreases. 
