# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether or not a person will default on their bank loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 


#### Data Set Information:

This research examines the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default. 

- NT is the abbreviation for New Taiwan. 


#### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- X2: Gender (1 = male; 2 = female). 
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- X4: Marital status (1 = married; 2 = single; 3 = others). 
- X5: Age (year). 
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005; 
    - X7 = the repayment status in August, 2005; . . .;
    - etc...
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005;
    - X13 = amount of bill statement in August, 2005; . . .; 
    - etc...
    - X17 = amount of bill statement in April, 2005. 
- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; . . .;
    - etc...
    - X23 = amount paid in April, 2005. 




You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) to predict credit card defaults and use gridsearch to find the best hyperparameters for those models. Then you will compare the performance of those three models on a test set to find the best one.  


## Process/Expectations

- You will be working in pairs for this assessment

### Please have ONE notebook and be prepared to explain how you worked in your pair.

1. Clean up your data set so that you can perform an EDA. 
    - This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Engineer new features. 
    - Create polynomial and/or interaction features. 
    - Additionally, you must create **at least 2 new features** that are not interactions or polynomial transformations. 
        - *For example, you can create a new dummy variable based on the value of a continuous variable (billamount6 > 2000) or take the average of some past amounts.*
4. Perform some feature selection. 
    
5. You must fit **three** models to your data and tune **at least 1 hyperparameter** per model. 
6. Using the F-1 Score, evaluate how well your models perform and identify your best model.
7. Using information from your EDA process and your model(s) output, provide insight as to which borrowers are more likely to default.
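
The two kinds of non-polynomial features suggested in step 3 can be sketched as follows. This is a minimal illustration on a made-up toy DataFrame; the column names mirror the cleaned data (`bill_amt1` to `bill_amt6`), the 2000 threshold comes from the example in step 3, and the feature names `high_bill6` and `avg_bill` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned data; the values are made up for illustration.
toy = pd.DataFrame({
    "bill_amt1": [500, 3000, 0],
    "bill_amt2": [700, 2800, 0],
    "bill_amt3": [900, 2600, 0],
    "bill_amt4": [1100, 2400, 0],
    "bill_amt5": [1300, 2200, 0],
    "bill_amt6": [2500, 1000, 0],
})

bill_cols = [f"bill_amt{i}" for i in range(1, 7)]

# Dummy variable derived from a continuous column (hypothetical name).
toy["high_bill6"] = (toy["bill_amt6"] > 2000).astype(int)

# Average of the six past bill amounts (hypothetical name).
toy["avg_bill"] = toy[bill_cols].mean(axis=1)

print(toy[["high_bill6", "avg_bill"]])
```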


In [3]:
# import libraries

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.options.display.max_columns = 500 

## 1. Data Cleaning

In [4]:
df = pd.read_csv('training_data.csv' , index_col=0)

In [5]:
oldnames = list(df.columns)

#change data types

In [6]:
oldnames

['X1',
 'X2',
 'X3',
 'X4',
 'X5',
 'X6',
 'X7',
 'X8',
 'X9',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16',
 'X17',
 'X18',
 'X19',
 'X20',
 'X21',
 'X22',
 'X23',
 'Y']

In [7]:
df.columns = ['pay_1' if (name == 'PAY_0') else 'Y' if (name == list(df.loc['ID'])[-1])  else name.lower() for name in list(df.loc['ID'])]

In [8]:
df.head()

Unnamed: 0,limit_bal,sex,education,marriage,age,pay_1,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,Y
28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [9]:
df = df.drop(['ID'], axis=0)

In [10]:
# cast every column to int (the values are read in as strings because of the extra header row)
df = df.astype(int)

In [11]:
df.shape

(22499, 24)

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others). 0 is divorced?
- X5: Age (year).
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - The measurement scale for the repayment status is: -2 = no payment due; -1 = pay duly; 0 = paid minimum; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 

In [12]:
df.education.unique()

array([1, 3, 2, 4, 6, 5, 0])

In [13]:
#df.pay_1.plot(kind='bar')

In [14]:
#df[df.Y==1]

In [15]:
#df.columns

In [16]:
#df.Y.value_counts()

## 2. EDA

In [17]:
#df.limit_bal.describe()

In [18]:
#df.limit_bal.plot(kind='hist')

In [19]:
df['billamt_to_limit']=df.bill_amt1/(df.limit_bal)

In [20]:
df['billamt1_to_billamt6']=df.bill_amt1/(df.bill_amt6+1)


In [21]:
df['avg_stat']=(df.pay_1+df.pay_2+df.pay_3+df.pay_4+df.pay_5+df.pay_6)/6

In [22]:
df['pay_to_bill']=(df.pay_amt1+df.pay_amt2+df.pay_amt3+df.pay_amt4+df.pay_amt5+df.pay_amt6)/(df.bill_amt1+df.bill_amt2+df.bill_amt3+df.bill_amt4+df.bill_amt5+df.bill_amt6+1)

In [23]:
df['avg_pay']=(df.pay_amt1/(df.bill_amt1+1)+df.pay_amt2/(df.bill_amt2+1)+df.pay_amt3/(df.bill_amt3+1)+df.pay_amt4/(df.bill_amt4+1)+df.pay_amt5/(df.bill_amt5+1)+df.pay_amt6/(df.bill_amt6+1))/6


## 3. Feature Engineering

In [24]:
for i in range(1,6):
    df['bill_change_rate'+str(i)]=(df['bill_amt'+str(i)]-df['bill_amt'+str(i+1)])/(df['bill_amt'+str(i+1)]+1)
                                                

In [25]:
# sum of the absolute month-to-month change rates
df['rate_aggr'] = sum([df['bill_change_rate'+str(i)].abs() for i in range(1,6)])

In [26]:
#df.columns

In [27]:
# Note: range(1, 6) sums months 1-5 only; bill_amt6 and pay_amt6 are not included
df['total_bill'] = sum([df['bill_amt'+str(i)] for i in range(1,6)])
df['total_paid'] = sum([df['pay_amt'+str(i)] for i in range(1,6)])

In [28]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(value=0, inplace=True)
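
The cleanup above matters because the ratio features divide by `bill_amt + 1`, which is zero whenever a bill amount equals -1. A minimal sketch on made-up values shows what pandas produces in that case and what the replace/fill step turns it into; the column `ratio` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal case, a nonzero payment over a zero denominator,
# and a zero payment over a zero denominator.
toy = pd.DataFrame({"pay_amt1": [100, 50, 0], "bill_amt1": [199, -1, -1]})

# Same pattern as the engineered ratios: add 1 to the denominator.
toy["ratio"] = toy["pay_amt1"] / (toy["bill_amt1"] + 1)
# row 0: 100 / 200 = 0.5, row 1: 50 / 0 = inf, row 2: 0 / 0 = NaN

# Same cleanup as the cell above: infinities become NaN, then NaN becomes 0.
cleaned = toy["ratio"].replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())
```

Filling with 0 treats an undefined ratio the same as "nothing paid", which is a modelling choice rather than a neutral default.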

In [29]:
df['thirties']= np.where((df.age.values>29) & (df.age.values<40),1,0)
df['forties'] = np.where((df.age.values>39) & (df.age.values<50),1,0)
df['fifties'] = np.where((df.age.values>49) & (df.age.values<60),1,0)
df['sixties'] = np.where((df.age.values>59) & (df.age.values<70),1,0)
df['seventies'] = np.where((df.age.values>69) & (df.age.values<80),1,0)
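
The same decade flags can also be produced in one pass with `pd.cut`. This is a sketch of an alternative, not what the cells above do; the ages are made up and the label `under_30` is a hypothetical name for the baseline group that the `np.where` version leaves as all zeros.

```python
import pandas as pd

# Made-up ages covering each bucket.
ages = pd.Series([24, 30, 39, 40, 59, 61, 75])

# Right-inclusive bins: (0, 29], (29, 39], ..., (69, 79].
bins = [0, 29, 39, 49, 59, 69, 79]
labels = ["under_30", "thirties", "forties", "fifties", "sixties", "seventies"]

decade = pd.cut(ages, bins=bins, labels=labels)

# One indicator column per decade label.
dummies = pd.get_dummies(decade)
print(dummies)
```

Ages of 80 and above fall outside the bins and come back as missing, so they would need their own bin if they occur in the data.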

In [30]:
df['missed_payment']=np.where((df.pay_1>0)|(df.pay_2>0)|(df.pay_3>0)|(df.pay_4>0)|(df.pay_5>0)|(df.pay_6>0),1,0)

In [31]:
df.marriage = np.where(df.marriage==0, 3, df.marriage)

In [32]:
df.education = np.where((df.education>4)|(df.education==0),4, df.education)

In [33]:
#df = pd.get_dummies(df, columns=['pay_'+str(i) for i in range(1,7)])

In [34]:
df.corr()['Y']

limit_bal              -0.155958
sex                    -0.037953
education               0.037384
marriage               -0.032121
age                     0.014586
pay_1                   0.324772
pay_2                   0.266810
pay_3                   0.241575
pay_4                   0.219143
pay_5                   0.208229
pay_6                   0.193485
bill_amt1              -0.016858
bill_amt2              -0.011762
bill_amt3              -0.012578
bill_amt4              -0.009667
bill_amt5              -0.007187
bill_amt6              -0.005506
pay_amt1               -0.071469
pay_amt2               -0.057635
pay_amt3               -0.054053
pay_amt4               -0.054540
pay_amt5               -0.054176
pay_amt6               -0.055296
Y                       1.000000
billamt_to_limit        0.089255
billamt1_to_billamt6   -0.009966
avg_stat                0.286357
pay_to_bill            -0.007305
avg_pay                -0.023866
bill_change_rate1      -0.005655
bill_chang

## 4. Feature Selection

In [35]:
# Split data to be used in the models
# Create matrix of features
X = df.drop('Y', axis = 1) # grabs every column except the target 'Y'


# Create target variable
y = df['Y'] # y is the column we're trying to predict

In [36]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures
from sklearn.linear_model import Lasso, Ridge, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE

In [37]:
X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
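
The split above is neither seeded nor stratified, so the class ratio in each part and the resulting scores can change from run to run. A minimal sketch of a reproducible, stratified split, assuming synthetic data with roughly the class imbalance seen here; `X_demo`, `y_demo` and the seed 42 are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 22% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 220 + [0] * 780)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```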


In [38]:

scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test) 
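
Fitting the scaler on the training data only means the test rows are transformed with the training mean and standard deviation. A small NumPy sketch of that arithmetic, on made-up numbers, shows what `fit` followed by `transform` amounts to:

```python
import numpy as np

# Made-up training and test rows with two features.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 70.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation, as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics

print(test_scaled)
```

The scaled test row is not guaranteed to have zero mean or unit variance; it is simply expressed in the training data's units.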

In [39]:
sm = SMOTE()
# fit_resample replaces the deprecated fit_sample; only the training data is oversampled
X_train, y_train = sm.fit_resample(X_train, y_train)

In [40]:
y_train.value_counts()

1    13971
0    13971
Name: Y, dtype: int64
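
SMOTE balances the classes by creating synthetic minority rows rather than duplicating existing ones. The core interpolation step can be sketched in NumPy as below. This is a simplified illustration, not imblearn's implementation: it always uses the single nearest neighbour, whereas the library picks among several nearest neighbours. The function name `synthesize` and the points are made up.

```python
import numpy as np

def synthesize(points, n_new, rng):
    """Create n_new rows, each on the segment between a random point and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distance from the chosen point to every other point.
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        j = int(np.argmin(dists))
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

rng = np.random.default_rng(0)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])
synthetic = synthesize(minority, 5, rng)
print(synthetic)
```

Because every synthetic row lies between two real minority rows, oversampled data should only ever be used for training, never for evaluation.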

## 5. Model Fitting and Hyperparameter Tuning
KNN, Logistic Regression, Decision Tree
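
The brief asks for a grid search over KNN, logistic regression and a decision tree, scored with F1. A minimal sketch of that loop, assuming synthetic data from `make_classification` in place of the credit data; the parameter grids are illustrative, not tuned values, and `X_demo`, `y_demo`, `searches` and `results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small imbalanced synthetic data set standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.78, 0.22], random_state=0
)

# One estimator and one small grid per required model.
searches = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 7, 11]}),
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 6]}),
}

results = {}
for name, (model, grid) in searches.items():
    search = GridSearchCV(model, grid, scoring="f1", cv=5)
    search.fit(X_demo, y_demo)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(name, params, round(score, 3))
```

The cross-validated `best_score_` is only a guide for choosing hyperparameters; the final comparison between models belongs on the held-out test set.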

In [41]:
from sklearn.ensemble import RandomForestClassifier

# fit a random forest with fixed hyperparameters
rand_forest = RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_depth = 9, max_features = 6, class_weight='balanced')
rand_forest.fit(X_train, y_train)

RandomForestClassifier(class_weight='balanced', max_depth=9, max_features=6)

In [42]:
# f1_score expects (y_true, y_pred)
y_pred = rand_forest.predict(X_train)
print(f1_score(y_train, y_pred))
y_pred = rand_forest.predict(X_test)
print(f1_score(y_test, y_pred))

0.7826620636747219
0.5281105990783409


In [43]:
#w/ SMOTE
y_pred = rand_forest.predict(X_train)
print(f1_score(y_train, y_pred))
y_pred = rand_forest.predict(X_test)
print(f1_score(y_test, y_pred))

0.7826620636747219
0.5281105990783409
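
The gap between the training and test scores is easier to interpret with the definition of F1 at hand: it is the harmonic mean of precision and recall. A pure-Python sketch with made-up counts (the helper `f1_from_counts` is hypothetical):

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 60 defaults caught, 40 false alarms, 60 defaults missed.
score = f1_from_counts(tp=60, fp=40, fn=60)

# Swapping the roles of truth and prediction swaps fp and fn.
swapped = f1_from_counts(tp=60, fp=60, fn=40)

print(round(score, 4), round(swapped, 4))
```

Both calls give the same value, so binary F1 does not depend on which array is treated as the truth, although scikit-learn documents the order as `(y_true, y_pred)`. True negatives do not appear in the formula at all, which is why F1 is preferred over accuracy when defaults are the minority class.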


### Feature Importance


In [48]:
rand_forest.oob_score

False

In [49]:
importance=rand_forest.feature_importances_
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.show()

Feature: 0, Score: 0.02702
Feature: 1, Score: 0.00249
Feature: 2, Score: 0.02260
Feature: 3, Score: 0.00766
Feature: 4, Score: 0.01027
Feature: 5, Score: 0.16762
Feature: 6, Score: 0.06362
Feature: 7, Score: 0.03921
Feature: 8, Score: 0.02643
Feature: 9, Score: 0.02849
Feature: 10, Score: 0.02080
Feature: 11, Score: 0.01529
Feature: 12, Score: 0.01168
Feature: 13, Score: 0.01053
Feature: 14, Score: 0.00945
Feature: 15, Score: 0.01073
Feature: 16, Score: 0.01063
Feature: 17, Score: 0.02091
Feature: 18, Score: 0.02212
Feature: 19, Score: 0.01504
Feature: 20, Score: 0.01371
Feature: 21, Score: 0.01394
Feature: 22, Score: 0.01833
Feature: 23, Score: 0.02442
Feature: 24, Score: 0.01316
Feature: 25, Score: 0.09857
Feature: 26, Score: 0.01796
Feature: 27, Score: 0.01090
Feature: 28, Score: 0.01314
Feature: 29, Score: 0.01378
Feature: 30, Score: 0.01277
Feature: 31, Score: 0.01169
Feature: 32, Score: 0.01150
Feature: 33, Score: 0.01478
Feature: 34, Score: 0.01554
Feature: 35, Score: 0.03890
Fe

In [51]:
feature_importances = pd.DataFrame(rand_forest.feature_importances_, index =df.drop('Y', axis = 1).columns,  columns=['importance']).sort_values('importance', ascending=False)

In [52]:
feature_importances

Unnamed: 0,importance
pay_1,0.167621
missed_payment,0.110918
avg_stat,0.098574
pay_2,0.063618
pay_3,0.039208
total_paid,0.038898
pay_5,0.028491
limit_bal,0.027025
pay_4,0.026431
billamt_to_limit,0.024418


#### serialization

In [49]:
import pickle

In [50]:
# Persist the fitted model and the fitted scaler so both can be reloaded together
with open('rfc_model.pk1', 'wb') as model_pickle:
    pickle.dump(rand_forest, model_pickle)

with open('rfc_scaler.pk1', 'wb') as scaler_pickle:
    pickle.dump(scaler, scaler_pickle)
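
Saving the scaler alongside the model matters because new data has to be scaled with the same training statistics before prediction. A minimal round-trip sketch, assuming a tiny stand-in scaler and a temporary file rather than the files written above; `demo_scaler` and `demo_scaler.pkl` are hypothetical names.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the fitted scaler.
demo_scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_scaler.pkl")

    # Write the fitted object to disk.
    with open(path, "wb") as f:
        pickle.dump(demo_scaler, f)

    # Read it back, as a separate prediction script would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler applies the same transformation as the original.
same = np.allclose(restored.transform([[2.5]]), demo_scaler.transform([[2.5]]))
print(same)
```

Pickled scikit-learn objects are only reliably loadable with the library version that wrote them, and pickle files from untrusted sources should never be loaded.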

In [None]:
'''
k_range = list(range(1, 32, 2))
k_scores=[]

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    
    f1 = f1_score(y_test, y_pred)
    
    k_scores.append(f1)

'''

In [None]:
#list(zip(k_range, k_scores))

In [None]:
#knn = KNeighborsClassifier(n_neighbors=31)
#knn.fit(X_train, y_train)

In [None]:
#y_pred=knn.predict(X_test)

In [None]:
#acc = accuracy_score(y_test, y_pred)

In [None]:
#acc

Logistic Regression

In [None]:
'''
logreg = LogisticRegression(class_weight='balanced', C=0.8, n_jobs=-1)

logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
'''

In [None]:
'''
f1 = f1_score(y_test, y_pred)
f1
'''

In [None]:
'''
logreg.coef_
'''

In [None]:
'''
dtc= DecisionTreeClassifier(criterion='entropy')
'''

In [None]:
'''
dtc = DecisionTreeClassifier(max_depth=6)

dtc.fit(X_train,y_train)


y_pred_train = dtc.predict(X_train)

y_pred_test = dtc.predict(X_test)

# F1 score on the training and test sets
print("Training F1 Score:", f1_score(y_train, y_pred_train))
print("Testing F1 Score:", f1_score(y_test, y_pred_test))
'''

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [None]:

'''
param_grid = { 
    'n_estimators': [100,300,500,700,1000],
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(2,10)),
    'max_features': list(range(3,7))
}
grid_tree=GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1', verbose=1, n_jobs=-1)
grid_tree.fit(X_train, y_train)
'''

## 6. Model Evaluation

In [None]:
'''
print(grid_tree.best_score_)
print(grid_tree.best_params_)
print(grid_tree.best_estimator_)
'''

In [None]:
'''
y_pred = grid_tree.best_estimator_.predict(X_test)

# Test-set F1 of the best estimator found by the grid search
print("F1:", f1_score(y_test, y_pred))
'''

In [47]:

import xgboost as xgb

In [None]:
'''
clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': [100,300,500],
              'learning_rate': [0.1,0.07,0.05,0.03,0.01],
              'max_depth': list(range(3,14,2)),
              'colsample_bytree': [0.5,0.45,0.4],
              'min_child_weight': [1, 2, 3]
             }
gsearch1 = GridSearchCV(
    estimator = clf_xgb,
    param_grid = param_dist, 
    scoring='f1',
    n_jobs=-1,
    verbose=1,
    cv=5)
'''

In [None]:
'''
gsearch1.fit(X_train, y_train)
'''

In [80]:
print(gsearch1.best_params_)
print(gsearch1.best_score_)
preds = gsearch1.best_estimator_.predict(X_test)


test_f1 = f1_score(y_test, preds)
test_acc = accuracy_score(y_test, preds)

print("Accuracy: %f" % (test_acc))
print("F1: %f" % (test_f1))

{'colsample_bytree': 0.5, 'learning_rate': 0.01, 'max_depth': 15, 'min_child_weight': 1, 'n_estimators': 500}
0.8506878300390399
Accuracy: 0.813556
F1: 0.506180
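
An accuracy of 0.81 next to an F1 of 0.51 is less contradictory than it looks once the class balance is taken into account. The sketch below works out the baseline of always predicting "no default", using the row count from `df.shape` and the per-class count after SMOTE (which equals the number of non-defaults in the training split). It assumes the test split has roughly the same class balance as the training split.

```python
import math

n_total = 22499                      # rows, from df.shape
n_test = math.ceil(0.2 * n_total)    # train_test_split rounds the test size up
n_train = n_total - n_test

n_majority = 13971                   # non-defaults in the training split (SMOTE output)

# Always predicting "no default" is right for every majority-class row.
baseline_accuracy = n_majority / n_train

# That baseline never predicts a default, so it has no true positives and F1 is 0.
tp = 0
baseline_f1 = 0.0 if tp == 0 else None

print(n_train, round(baseline_accuracy, 4), baseline_f1)
```

Under that assumption the baseline already reaches an accuracy of about 0.78, so 0.81 is only a modest improvement, while F1 rewards only the defaults that are actually caught.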


In [48]:

x_model = xgb.XGBClassifier(objective = 'binary:logistic', colsample_bytree= 0.5, learning_rate= 0.01, max_depth= 15, min_child_weight= 1, n_estimators= 500)
x_model.fit(X_train, y_train)
preds= x_model.predict(X_test)

test_f1 = f1_score(y_test, preds)
test_acc = accuracy_score(y_test, preds)

print("Accuracy: %f" % (test_acc))
print("F1: %f" % (test_f1))

KeyboardInterrupt: 

## 7. Final Model