# Decision Tree to classify defaulters


Data: the Taiwan credit card default dataset from the UCI Machine Learning Repository.

Background (from the original study of this dataset):

The original research examined customers' default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods. From a risk management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. Because the real probability of default is unknown, the study introduced the Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predicted probability of default as the independent variable (X), a simple linear regression (Y = A + BX) showed that the forecasting model produced by an artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero and its regression coefficient (B) is close to one.

For this notebook, I compared several classical algorithms and found the Decision Tree model to perform best among them.

The notebook predicts customer default payments using a tuned Decision Tree model with the top 15 of 22 features selected. F1 score achieved: about 0.5 on the test set. I concluded that a neural network can learn more complex patterns.


## Importing libraries

In [0]:
import copy
# linear algebra
import numpy as np 
# data processing
import pandas as pd 
# Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV
from imblearn.over_sampling import SMOTE



## Loading data

In [0]:
#Loading Data
################################################################
def load_data():
    test_df = pd.read_csv("test.csv")
    train_df = pd.read_csv("train.csv")
    #Dropping 'Unnamed: 0' column and setting ID to index
    train_df_copy=copy.copy(train_df.drop(['Unnamed: 0'],axis=1))
    train=copy.copy(train_df_copy.set_index('ID'))
    test_df_copy=copy.copy(test_df.drop(['Unnamed: 0'],axis=1))
    test=copy.copy(test_df_copy.set_index('ID'))
    return(train,test)
################################################################

## Data Preparation

### Preprocessing

In [0]:
#Standardizing the Non categorical Data
################################################################
def scale(X):
    rescale=['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
    X_std=copy.deepcopy(X)
    for col in rescale:
        col_mean=np.mean(X_std.loc[:,col])
        col_std=np.std(X_std.loc[:,col])
        X_std.loc[:,col]=((X_std.loc[:,col]-col_mean)/col_std)
    return(X_std)
################################################################


#Oversampling the under represented class
################################################################
def oversample(X,y):
    X_copy=copy.deepcopy(X)
    X_resampled, y_resampled = SMOTE().fit_resample(X, y)
    X_resampled1=pd.DataFrame(X_resampled,columns=X_copy.columns)
    return(X_resampled1,y_resampled)
################################################################
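
The `scale` function above computes the mean and standard deviation from whichever frame it receives. In the cells shown in this notebook it is applied to the full `X` before the split, and `Test` reaches the model without being scaled. A minimal sketch of one alternative, assuming the statistics should come from the training rows only and then be reused on any other frame, could look like this. The helper names `fit_scaler` and `apply_scaler` and the toy frames are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_scaler(train_frame, columns):
    """Return per-column mean and std computed on the training frame only."""
    means = train_frame[columns].mean()
    stds = train_frame[columns].std(ddof=0)  # ddof=0 mirrors NumPy's default for np.std
    return means, stds

def apply_scaler(frame, columns, means, stds):
    """Standardize `columns` of `frame` using previously fitted statistics."""
    scaled = frame.copy()
    scaled[columns] = (frame[columns] - means) / stds
    return scaled

# Toy frames standing in for the training data and for validation or test data.
train_toy = pd.DataFrame({"LIMIT_BAL": [10000.0, 20000.0, 30000.0], "SEX": [1, 2, 2]})
other_toy = pd.DataFrame({"LIMIT_BAL": [20000.0, 40000.0], "SEX": [2, 1]})

means, stds = fit_scaler(train_toy, ["LIMIT_BAL"])
train_scaled = apply_scaler(train_toy, ["LIMIT_BAL"], means, stds)
other_scaled = apply_scaler(other_toy, ["LIMIT_BAL"], means, stds)
```

Because a decision tree learns thresholds on the scaled values, any frame it later predicts on needs the same transformation; categorical columns such as `SEX` are left untouched.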

### Feature Selection

In [0]:
#Selecting K Best features
################################################################
def select_features(X_train,Y_train,X_test,Y_test,X,Test):
    #Select 15 best (Based on multiple runs)
    select_feature = SelectKBest(mutual_info_classif, k=15).fit(X_train, Y_train)
    selected_features_df = pd.DataFrame({'Feature':list(X_train.columns),'Scores':select_feature.scores_})
    selected_features_df_sorted=selected_features_df.sort_values(by='Scores', ascending=False)
    Kbest_features=selected_features_df_sorted.iloc[0:15,0]
    
    #Selecting K feature columns in Train and Val df
    X_train15=copy.deepcopy(X_train[Kbest_features])
    #Use the X_test argument instead of relying on the global X_val
    X_val15=copy.deepcopy(X_test[Kbest_features])
    X_15=copy.deepcopy(X[Kbest_features])
    Test_15=copy.deepcopy(Test[Kbest_features])
    return(X_train15,X_val15,X_15,Test_15)
################################################################ 
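
The selection step above ranks features by their mutual information with the target and keeps the top k. As a small illustration on synthetic data, the fitted `SelectKBest` object also exposes the chosen columns through `get_support()`, which avoids sorting `scores_` by hand. The toy columns below are made up, and fixing `random_state` through `functools.partial` is an assumption added here for reproducibility, since `mutual_info_classif` is randomized.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n_rows = 400
signal = rng.normal(size=n_rows)

# One column determines the label; the other two are pure noise.
toy_X = pd.DataFrame({
    "informative": signal,
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
toy_y = (signal > 0).astype(int)

# Fix the estimator's random_state so repeated runs give the same scores.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=1).fit(toy_X, toy_y)

# Boolean mask over the input columns, True for the k selected features.
selected = list(toy_X.columns[selector.get_support()])
print(selected)
```

The resulting list of names can then be used to index any other frame with the same columns, for example a validation or test frame.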

## Cross Validation function

In [0]:
#Cross-validated F1 score on the Train set, plus F1 score of the fitted model on the Validation set
################################################################
def f1score_(X_train, X_test, Y_train, Y_test,model):
    model.fit(X_train,Y_train)
    cv_score=np.mean(cross_validate(model, X_train, Y_train, cv=6,scoring='f1',return_train_score=True)['test_score'])
    y_pred1 = model.predict_proba(X_test)
    y_pred = y_pred1.argmax(axis=-1)
    #f1_score expects the true labels first, then the predictions
    return(cv_score,f1_score(Y_test, y_pred))
################################################################

#Cross-validated F1 score on the data passed in, plus predictions for the test frame
#The model is fitted on the passed data before predicting
################################################################
def f1score_m(X_train,Y_train,test,model):
    model.fit(X_train,Y_train)
    y_pred1 = model.predict_proba(test)
    y_pred = y_pred1.argmax(axis=-1)
    cv_score=np.mean(cross_validate(model, X_train, Y_train, cv=6,scoring='f1',return_train_score=True)['test_score'])
    return(cv_score,y_pred)
################################################################
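
Both helpers above report the F1 score, which for the positive class (here, a default) is the harmonic mean of precision and recall. The following sketch computes it directly from the counts of true positives, false positives, and false negatives; the function name `f1_from_labels` and the toy labels are illustrative only.

```python
import numpy as np

def f1_from_labels(y_true, y_pred):
    """Binary F1 for the positive class (label 1), computed from raw counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # defaults correctly flagged
    fp = np.sum((y_true == 0) & (y_pred == 1))  # non-defaults flagged as defaults
    fn = np.sum((y_true == 1) & (y_pred == 0))  # defaults that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true_toy, y_pred_toy)
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and so is F1. Swapping the two arguments swaps precision and recall but leaves the binary F1 unchanged.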

## Main

### Data

In [0]:
Train, Test=load_data()
X = copy.copy(Train.drop("default payment next month", axis=1))
Y = copy.copy(Train["default payment next month"])
X

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
1,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
3,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
4,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
5,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679
6,50000,1,1,2,37,0,0,0,0,0,...,57608,19394,19619,20024,2500,1815,657,1000,1000,800
7,500000,1,1,2,29,0,0,0,0,0,...,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770
8,100000,2,2,2,23,0,-1,-1,0,0,...,601,221,-159,567,380,601,0,581,1687,1542
9,140000,2,3,1,28,0,0,2,0,0,...,12108,12211,11793,3719,3329,0,432,1000,1000,1000
11,200000,2,3,2,34,0,0,2,0,0,...,5535,2513,1828,3731,2306,12,50,300,3738,66
13,630000,2,2,2,41,-1,0,-1,-1,-1,...,6500,6500,6500,2870,1000,6500,6500,6500,2870,0


### Model training and Evaluation

In [0]:
#STANDARDIZING: Standardizing Non Categorical data
X_std=scale(X)
#TRAIN-TEST SPLIT : Using a Train set, Validate set split of 0.35 (Based on multiple runs with different splits)
X_train, X_val, Y_train, Y_val = train_test_split(X_std, Y, test_size=0.35,random_state=42)
#SMOTE OVERSAMPLING : Oversampling underrepresented class in Train set   
X_resampled,Y_resampled=oversample(X_train,Y_train)
#FEATURE SELECTION : Selecting 15 best features in Train set(Based on multiple runs) 
X_train15,X_val15,X_15,Test_15=select_features(X_resampled,Y_resampled,X_val,Y_val,X_resampled,Test)

#CLASSIFIER : Decision Tree classifier (Checked on multiple classifiers, Decision Tree gave best results)
#Best max_depth is 6
clf=DecisionTreeClassifier(random_state=42,max_depth=6,criterion='gini')
#SCORING : F1_score with k best features
fscore_kbest=(f1score_(X_train15, X_val15, Y_resampled, Y_val,clf))
print("Cross Val F1_score on Train set and F1_score on Validation set",fscore_kbest)
#FITTING : Final fit of model on the oversampled Train split (X_15 is built from X_resampled, so the CV score matches the one above)
f1_score_15best,Y_predicted=f1score_m(X_15,Y_resampled,Test_15,clf)
print("F1_score of full Train set",f1_score_15best)

Cross Val F1_score on Train set and F1_score on Validation set (0.6949873556177072, 0.5213675213675214)
F1_score of full Train set 0.6949873556177072


### Saving

In [0]:
Test['Prediction']=Y_predicted
Test.to_csv('Test_predicted.csv')

The F1 score on the Validation set changes little with oversampling.



On the Validation set:

    Without Oversampling: F1 score ~ 0.48
    With Oversampling: F1 score ~ 0.50

Within the Train set:

    Without Oversampling: F1 score ~ 0.48
    With Oversampling: F1 score ~ 0.70
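
One plausible reading of the gap between roughly 0.70 within the Train set and roughly 0.50 on the Validation set is that cross-validation ran on data that had already been oversampled, so the held-out folds also contained synthetic minority rows. The sketch below shows the alternative of oversampling only the training part of each fold and scoring on untouched rows. It is an illustration on synthetic data, not a rerun of this notebook: `duplicate_minority` is a hypothetical helper that uses plain random duplication as a stand-in for SMOTE, and it assumes label 1 is the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def duplicate_minority(X, y, rng):
    """Random oversampling: repeat minority rows (label 1) until classes balance.
    A simple stand-in for SMOTE, used only to keep this sketch small."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

def cv_f1_oversampling_inside_folds(X, y, n_splits=6, seed=42):
    """Mean F1 over folds, with oversampling applied to the training part only."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, held_out_idx in folds.split(X, y):
        X_bal, y_bal = duplicate_minority(X[train_idx], y[train_idx], rng)
        model = DecisionTreeClassifier(max_depth=6, random_state=seed)
        model.fit(X_bal, y_bal)
        # The held-out fold keeps its original class balance.
        scores.append(f1_score(y[held_out_idx], model.predict(X[held_out_idx])))
    return float(np.mean(scores))

# Synthetic imbalanced data: roughly 80% class 0 and 20% class 1.
toy_X, toy_y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)
mean_f1 = cv_f1_oversampling_inside_folds(toy_X, toy_y)
print("Mean F1 with oversampling inside folds:", mean_f1)
```

With this arrangement the cross-validated score is measured on rows with the original class balance, which makes it more directly comparable to the Validation score.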