# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether or not a person will default on their bank loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 


#### Data Set Information:

This research examined customers' default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods. From a risk management perspective, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). Because the real probability of default is unknown, the study introduced the novel Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by an artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero and its regression coefficient (B) is close to one. Therefore, among the six data mining techniques, the artificial neural network is the only one that can accurately estimate the real probability of default. 

- NT is the abbreviation for New Taiwan. 


#### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- X2: Gender (1 = male; 2 = female). 
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- X4: Marital status (1 = married; 2 = single; 3 = others). 
- X5: Age (year). 
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005; 
    - X7 = the repayment status in August, 2005; . . .;
    - etc...
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005;
    - X13 = amount of bill statement in August, 2005; . . .; 
    - etc...
    - X17 = amount of bill statement in April, 2005. 
- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; . . .;
    - etc...
    - X23 = amount paid in April, 2005. 




You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) to predict credit card defaults and use grid search to find the best hyperparameters for each model. You will then compare the performance of the three models on a test set to find the best one.


## Process/Expectations

- You will be working in pairs for this assessment

### Please have ONE notebook and be prepared to explain how you worked in your pair.

1. Clean up your data set so that you can perform an EDA. 
    - This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Engineer new features. 
    - Create polynomial and/or interaction features. 
    - Additionally, you must create **at least 2 new features** that are not interactions or polynomial transformations. 
        - *For example, you can create a new dummy variable based on the value of a continuous variable (billamount6 > 2000) or take the average of some past amounts.*
4. Perform some feature selection. 
    
5. You must fit **three** models to your data and tune **at least 1 hyperparameter** per model. 
6. Using the F-1 Score, evaluate how well your models perform and identify your best model.
7. Using information from your EDA process and your model(s) output, provide insight as to which borrowers are more likely to default.


- The past payments (X6-X11) can be used as an indicator of risky behavior. People with values higher than 3 can be considered to be defaulting on their 

In [10]:
# import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline



## 1. Data Cleaning

In [42]:
df = pd.read_csv('training_data.csv')
pd.set_option("display.max_columns", 100)

In [43]:
df.head()


Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [44]:
df.Y.value_counts()

0                             17471
1                              5028
default payment next month        1
Name: Y, dtype: int64
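
The single `default payment next month` entry suggests that one row repeats the original header labels, which would also explain why every column loads as `object`. A minimal sketch of one way to remove such a row, assuming a toy frame in place of `training_data.csv` (the helper name `drop_non_numeric_rows` is hypothetical):

```python
import pandas as pd

# Toy stand-in for the raw CSV: all values are strings because one row
# repeats the original header labels.
raw = pd.DataFrame({
    "X1": ["220000", "LIMIT_BAL", "10000"],
    "X3": ["1", "EDUCATION", "2"],
    "Y": ["1", "default payment next month", "0"],
})


def drop_non_numeric_rows(frame):
    """Coerce every column to numbers and drop rows that fail to parse."""
    # Values that cannot be parsed become NaN instead of raising an error.
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    # Rows with any NaN are the non-numeric ones; the rest are cast to int.
    return numeric.dropna().astype(int)


clean = drop_non_numeric_rows(raw)
print(clean.dtypes)
print(clean["Y"].value_counts())
```

This approach does not depend on knowing the index of the offending row, so it stays safe to re-run. It would also drop rows with genuine missing values, so check `isna()` counts first if that matters.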

In [45]:
# Split data to be used in the models
# Create matrix of features
X = df.drop('Y', axis = 1) # grabs every column except the target 'Y'


# Create target variable
y = df['Y'] # y is the column we're trying to predict

In [46]:
df['X3'].value_counts()

2            10516
1             7919
3             3713
5              208
4               90
6               42
0               11
EDUCATION        1
Name: X3, dtype: int64
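
The counts include education codes 0, 5, and 6, which are not listed in the attribute description (1 to 4 only). One possible treatment, shown here purely as an assumption and not as something the data source prescribes, is to fold the undocumented codes into 4 (others). The sketch uses a small toy series:

```python
import pandas as pd

# Toy series containing both documented and undocumented education codes.
education = pd.Series([2, 1, 3, 5, 4, 6, 0], name="education")

DOCUMENTED = {1, 2, 3, 4}  # codes listed in the attribute description
OTHER = 4                  # "others" category

# Assumption: any code outside the documented set is treated as "others".
# where() keeps values where the condition holds and replaces the rest.
education_clean = education.where(education.isin(DOCUMENTED), OTHER)
print(education_clean.value_counts().sort_index())
```

Whether these codes really belong with "others" is a judgment call; keeping them as a separate category is equally defensible.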

## 2. EDA

In [47]:
df.columns.to_list()

['Unnamed: 0',
 'X1',
 'X2',
 'X3',
 'X4',
 'X5',
 'X6',
 'X7',
 'X8',
 'X9',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'Y']

In [48]:
df.rename(columns={"X1":"credit_given",
                   "X2":"gender",
                   "X3":"education",
                   "X4":"marital_status",
                   "X5":"age",
                   "X6":"past_pay_sept",
                   "X7":"past_pay_aug",
                   "X8":"past_pay_july",
                   "X9":"past_pay_june",
                   "X10":"past_pay_may",
                   "X11":"past_pay_april",
                   "X12":"due_pay_sept",
                   "X13":"due_pay_aug",
                   "X14":"due_pay_july",
                   "X15":"due_pay_june",
                   "X16": "due_pay_may",
                   "X17":"due_pay_april",
                   "X18":"amount_paid_sept",
                   "X19":"amount_paid_aug",
                   "X20":"amount_paid_july",
                   "X21":"amount_paid_june",
                   "X22":"amount_paid_may",
                   "X23":"amount_paid_april",
                  }, inplace = True)

In [49]:
df.drop(columns=["Unnamed: 0"], inplace = True)
df.head()

Unnamed: 0,credit_given,gender,education,marital_status,age,past_pay_sept,past_pay_aug,past_pay_july,past_pay_june,past_pay_may,past_pay_april,due_pay_sept,due_pay_aug,due_pay_july,due_pay_june,due_pay_may,due_pay_april,amount_paid_sept,amount_paid_aug,amount_paid_july,amount_paid_june,amount_paid_may,amount_paid_april,Y
0,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [50]:
df.describe()

Unnamed: 0,credit_given,gender,education,marital_status,age,past_pay_sept,past_pay_aug,past_pay_july,past_pay_june,past_pay_may,past_pay_april,due_pay_sept,due_pay_aug,due_pay_july,due_pay_june,due_pay_may,due_pay_april,amount_paid_sept,amount_paid_aug,amount_paid_july,amount_paid_june,amount_paid_may,amount_paid_april,Y
count,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500,22500
unique,81,3,8,5,56,12,12,12,12,11,11,17665,17341,17116,16768,16326,16093,6631,6569,6258,5757,5750,5789,3
top,50000,2,2,2,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
freq,2534,13572,10516,12026,1243,11057,11804,11823,12330,12706,12233,1492,1849,2129,2390,2594,3000,3905,4036,4440,4840,5015,5418,17471


In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22500 entries, 0 to 22499
Data columns (total 24 columns):
credit_given         22500 non-null object
gender               22500 non-null object
education            22500 non-null object
marital_status       22500 non-null object
age                  22500 non-null object
past_pay_sept        22500 non-null object
past_pay_aug         22500 non-null object
past_pay_july        22500 non-null object
past_pay_june        22500 non-null object
past_pay_may         22500 non-null object
past_pay_april       22500 non-null object
due_pay_sept         22500 non-null object
due_pay_aug          22500 non-null object
due_pay_july         22500 non-null object
due_pay_june         22500 non-null object
due_pay_may          22500 non-null object
due_pay_april        22500 non-null object
amount_paid_sept     22500 non-null object
amount_paid_aug      22500 non-null object
amount_paid_july     22500 non-null object
amount_paid_june     22500 non-

In [52]:
df.shape

(22500, 24)

In [62]:
# Preview the float conversion; astype returns a new DataFrame, so df itself is unchanged
df.astype(dtype=float)

Unnamed: 0,credit_given,gender,education,marital_status,age,past_pay_sept,past_pay_aug,past_pay_july,past_pay_june,past_pay_may,past_pay_april,due_pay_sept,due_pay_aug,due_pay_july,due_pay_june,due_pay_may,due_pay_april,amount_paid_sept,amount_paid_aug,amount_paid_july,amount_paid_june,amount_paid_may,amount_paid_april,Y
0,220000.0,2.0,1.0,2.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0,222598.0,222168.0,217900.0,221193.0,181859.0,184605.0,10000.0,8018.0,10121.0,6006.0,10987.0,143779.0,1.0
1,200000.0,2.0,3.0,2.0,29.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,0.0
2,180000.0,2.0,1.0,2.0,27.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,80000.0,1.0,2.0,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,51372.0,51872.0,47593.0,43882.0,42256.0,42527.0,1853.0,1700.0,1522.0,1548.0,1488.0,1500.0,0.0
4,10000.0,1.0,2.0,2.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,8257.0,7995.0,4878.0,5444.0,2639.0,2697.0,2000.0,1100.0,600.0,300.0,300.0,1000.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22495,40000.0,2.0,2.0,1.0,38.0,0.0,0.0,3.0,2.0,2.0,2.0,35183.0,39197.0,39477.0,39924.0,39004.0,41462.0,4600.0,1200.0,1400.0,0.0,3069.0,0.0,1.0
22496,350000.0,1.0,1.0,1.0,42.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,3800.0,3138.0,4150.0,3750.0,1362.0,8210.0,3138.0,4160.0,3750.0,2272.0,8210.0,9731.0,0.0
22497,100000.0,2.0,3.0,2.0,46.0,1.0,-1.0,2.0,2.0,-1.0,0.0,0.0,203.0,203.0,0.0,7856.0,16544.0,203.0,0.0,0.0,7856.0,10000.0,865.0,0.0
22498,20000.0,2.0,3.0,1.0,50.0,-1.0,-1.0,-1.0,-1.0,-2.0,-2.0,5141.0,3455.0,6906.0,0.0,0.0,0.0,3754.0,6906.0,290.0,0.0,0.0,0.0,1.0


In [59]:
df.loc[df['credit_given'] =="LIMIT_BAL"]

Unnamed: 0,credit_given,gender,education,marital_status,age,past_pay_sept,past_pay_aug,past_pay_july,past_pay_june,past_pay_may,past_pay_april,due_pay_sept,due_pay_aug,due_pay_july,due_pay_june,due_pay_may,due_pay_april,amount_paid_sept,amount_paid_aug,amount_paid_july,amount_paid_june,amount_paid_may,amount_paid_april,Y


In [60]:
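# Drop the row at index 18381. Re-running this cell after the row is gone
# raises KeyError: '[18381] not found in axis' unless errors="ignore" is set.
df.drop(index = 18381, inplace = True, errors = "ignore")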

In [61]:
# astype returns a new DataFrame, so assign the result to keep the integer dtypes
df = df.astype(dtype=int)
df

Unnamed: 0,credit_given,gender,education,marital_status,age,past_pay_sept,past_pay_aug,past_pay_july,past_pay_june,past_pay_may,past_pay_april,due_pay_sept,due_pay_aug,due_pay_july,due_pay_june,due_pay_may,due_pay_april,amount_paid_sept,amount_paid_aug,amount_paid_july,amount_paid_june,amount_paid_may,amount_paid_april,Y
0,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
1,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
2,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
4,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22495,40000,2,2,1,38,0,0,3,2,2,2,35183,39197,39477,39924,39004,41462,4600,1200,1400,0,3069,0,1
22496,350000,1,1,1,42,-1,-1,-1,-1,-1,-1,3800,3138,4150,3750,1362,8210,3138,4160,3750,2272,8210,9731,0
22497,100000,2,3,2,46,1,-1,2,2,-1,0,0,203,203,0,7856,16544,203,0,0,7856,10000,865,0
22498,20000,2,3,1,50,-1,-1,-1,-1,-2,-2,5141,3455,6906,0,0,0,3754,6906,290,0,0,0,1


## 3. Feature Engineering
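
A minimal sketch of the two kinds of non-polynomial features suggested in the expectations: a dummy variable from a continuous column and an average of past amounts. It assumes the renamed columns from the EDA step; the helper `add_features`, the new column names, and the toy values are illustrative only.

```python
import pandas as pd

# Toy rows using the renamed columns from the EDA step.
toy = pd.DataFrame({
    "due_pay_april": [184605, 326, 0, 42527],
    "amount_paid_sept": [10000, 326, 0, 1853],
    "amount_paid_aug": [8018, 326, 0, 1700],
    "amount_paid_july": [10121, 326, 0, 1522],
})


def add_features(frame, threshold=2000):
    """Return a copy with a high-bill dummy and the mean amount paid."""
    out = frame.copy()
    # Dummy variable: 1 when the April bill exceeds the threshold, else 0.
    out["high_bill_april"] = (out["due_pay_april"] > threshold).astype(int)
    # Row-wise average over every amount_paid_* column that is present.
    paid_cols = [c for c in frame.columns if c.startswith("amount_paid_")]
    out["avg_amount_paid"] = out[paid_cols].mean(axis=1)
    return out


features = add_features(toy)
print(features[["high_bill_april", "avg_amount_paid"]])
```

The threshold of 2000 mirrors the example in the expectations; any other cutoff should be justified by the EDA.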

## 4. Feature Selection
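
One common option for this step is univariate selection with scikit-learn's `SelectKBest`. The sketch below is only an illustration on synthetic data (the column names `informative`, `noise_a`, and `noise_b` are made up); it is not the selection method this notebook settled on.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one column tracks the target, two are pure noise.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# Score each feature with the ANOVA F-statistic and keep the top k.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

To avoid leaking information, fit the selector on the training split only and reuse it to transform the test split.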

## 5. Model Fitting and Hyperparameter Tuning
KNN, Logistic Regression, Decision Tree
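
A minimal sketch of tuning one hyperparameter per model with `GridSearchCV`, scored by F1. It runs on synthetic data from `make_classification` rather than the loan data, and the parameter grids are small illustrative choices, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the cleaned loan data.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.78, 0.22], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One grid search per model; KNN and logistic regression are scaled first.
# Pipeline step names are the lowercased class names.
searches = {
    "knn": GridSearchCV(
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 11]},
        scoring="f1", cv=3,
    ),
    "logreg": GridSearchCV(
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
        scoring="f1", cv=3,
    ),
    "tree": GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, 8]},
        scoring="f1", cv=3,
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    # score() applies the same F1 scorer to the held-out test split.
    print(name, search.best_params_, round(search.score(X_test, y_test), 3))
```

Scaling sits inside the pipeline so that each cross-validation fold is standardized using its own training portion only.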

## 6. Model Evaluation
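
The comparison by F1 score can be sketched as follows. The labels and the two sets of predictions are hand-made toy values (the names `model_a` and `model_b` are placeholders), chosen so the arithmetic is easy to check by hand.

```python
from sklearn.metrics import f1_score

# Toy test labels and predictions from two hypothetical models.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 1, 1, 0],  # 3 TP, 1 FP, 1 FN
    "model_b": [0, 0, 0, 1, 0, 0, 0, 0],  # 1 TP, 0 FP, 3 FN
}

# F1 is the harmonic mean of precision and recall for the positive class.
scores = {name: f1_score(y_test, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
print("best:", best)
```

Here `model_a` has precision 3/4 and recall 3/4, giving F1 = 0.75, while `model_b` has precision 1 and recall 1/4, giving F1 = 0.40. With roughly 22% defaults in the training data, F1 is more informative than plain accuracy.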

## 7. Final Model