## Finishing the task (a.k.a. homework)

Try to improve the performance of a classifier on `data` with the following conditions:

1. use binning (on features of your choice, with your choice of parameters) and comment on its effects on classification
1. use at least 2 other preprocessing techniques (your choice!) on the data set and comment on the classification results
1. run all classification tests at least twice: once on the original unbalanced data and once on balanced data (choose a balancing technique), then compare and comment on the results
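
As a rough illustration of the balancing step in the last item, the sketch below applies random oversampling of the minority class with `sklearn.utils.resample`. The toy frame and the target column name `Y` are assumptions made for the example and are not taken from the data set; any other balancing technique would satisfy the task equally well.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic stand-in for the credit data: 'Y' is a hypothetical binary target
# with roughly 20% positives (the real column name may differ).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "X1": rng.integers(10_000, 1_000_000, size=1_000),
    "X5": rng.integers(21, 80, size=1_000),
    "Y": (rng.random(1_000) < 0.2).astype(int),
})

majority = toy[toy["Y"] == 0]
minority = toy[toy["Y"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Concatenate and shuffle so the classes are not stored in two blocks.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)

print(toy["Y"].value_counts().to_dict())
print(balanced["Y"].value_counts().to_dict())
```

In practice the oversampling would normally be applied to the training split only, so that duplicated rows cannot appear in both the training and the test set.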

In [25]:
import numpy as np
import pandas as pd

from scipy.stats import norm, ttest_ind
from scipy.optimize import minimize_scalar

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import f_regression, mutual_info_regression, RFECV
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_log_error, make_scorer, mean_squared_error, auc, accuracy_score
from sklearn import model_selection, linear_model, metrics, preprocessing, feature_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

### The data

In [6]:
# Source: Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques
# for the predictive accuracy of probability of default of credit card clients.
# Expert Systems with Applications, 36(2), 2473-2480.
# Data description: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
data = pd.read_csv('default_of_credit_card_clients.csv', sep=';')
data.describe()

data_copy = data.copy()

### 1) Use binning (on features of your choice, with your choice of parameters) and comment on its effects on classification


- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.

#### Let's use binning in order to improve the performance of the KNN classifier.

We will focus on the following two features, because they are easy to track and monitor:
- X1: Amount of the given credit
- X5: Age (year)

In [7]:
features = ['X1', 'X5']
results = []

# Let's look at summary statistics for these two features before binning
for i in features:
    display(data[i].describe())

count      30000.000000
mean      167484.322667
std       129747.661567
min        10000.000000
25%        50000.000000
50%       140000.000000
75%       240000.000000
max      1000000.000000
Name: X1, dtype: float64

count    30000.000000
mean        35.485500
std          9.217904
min         21.000000
25%         28.000000
50%         34.000000
75%         41.000000
max         79.000000
Name: X5, dtype: float64

In [8]:
# X1 ranges from 10⁴ to 10⁶ (10,000 to 1,000,000) and is right-skewed.
# We use quantile-based binning with 10 bins, so each bin holds roughly the same number of rows.

labels_X1 = list(range(1, 11))

binned_X1 = pd.qcut(data['X1'], 10, labels=labels_X1)

# X5 ranges from 21 to 79. Decade-wide bins would be (20-30, 30-40, 40-50, 50-60, 60-70, 70-80).
# Note: bin_age defines those intervals but is not passed to pd.cut below, which instead
# builds 6 equal-width bins over the observed range (about 9.7 years each).
bin_age = pd.IntervalIndex.from_tuples([(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)])

labels_X5 = list(range(1, 7))

binned_X5 = pd.cut(data['X5'], 6, labels=labels_X5)
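
The difference between the two binning calls used here, `pd.qcut` (equal-frequency) and `pd.cut` with an integer bin count (equal-width), is easiest to see on a tiny skewed sample. The values below are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Small skewed sample (hypothetical values): nine small numbers and one outlier.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the range [1, 100] is split into 2 intervals of equal length,
# so the outlier ends up alone in the upper bin.
equal_width = pd.cut(s, 2, labels=[1, 2])

# Equal-frequency: the cut point is the median, so each bin gets half the rows.
equal_freq = pd.qcut(s, 2, labels=[1, 2])

print(equal_width.value_counts().sort_index().tolist())  # [9, 1]
print(equal_freq.value_counts().sort_index().tolist())   # [5, 5]
```

For a right-skewed column, quantile bins keep the classes roughly balanced, while equal-width bins can leave most rows in the first few bins.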

In [9]:
def knn_clf(mydata, cols):
    
    # Use the selected column as the target and all remaining columns as predictors
    X = mydata.drop(columns=[cols])
    y = mydata[cols]
    
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    #Instantiate the model with 3 neighbors. 
    knn = KNeighborsClassifier(n_neighbors=3)

    # Fit the model on the training data.
    knn.fit(X_train, y_train)

    #Predict the response
    pred = knn.predict(X_test)

    #Evaluate the accuracy

    accur = accuracy_score(y_test, pred)
    results.append(accur)
    
    print("The accuracy for the feature", cols, "is :", accur)

In [22]:
# Run the accuracy test on our selected features
for i in features:
    knn_clf(data, i)

The accuracy for the feature X1 is : 0.22053333333333333
The accuracy for the feature X5 is : 0.052


In [54]:
data_copy['X1'] = binned_X1.astype(np.int64)
data_copy['X5'] = binned_X5.astype(np.int64)

# Run the accuracy test on our selected features
for i in features:
    knn_clf(data_copy, i)

The accuracy for the feature X1 is : 0.3184
The accuracy for the feature X5 is : 0.35733333333333334


In [47]:
print("The accuracy is better after binning, especially for the feature X5 (0.052 -> 0.357, versus 0.221 -> 0.318 for X1).")

The accuracy is better after binning, especially for the feature X5 (0.052 -> 0.357, versus 0.221 -> 0.318 for X1).
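
One plausible reason for the improvement is that the column being predicted is itself binned, so the classifier has far fewer classes to choose from. A simple way to put the accuracies in context is to compare them with a majority-class baseline. The sketch below does this on synthetic data with `DummyClassifier`; the data, the bin counts, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical continuous column standing in for a target such as X1 or X5.
raw = pd.Series(rng.normal(size=5_000))
# Predictors that carry no information about the target.
X = rng.normal(size=(5_000, 3))

baseline_scores = {}
for n_bins in (100, 10):
    # Bin the target into n_bins equal-frequency classes (integer codes).
    y = pd.qcut(raw, n_bins, labels=False)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Always predicts the most frequent training class, ignoring the features.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    baseline_scores[n_bins] = baseline.score(X_test, y_test)
    print(n_bins, "classes -> baseline accuracy", baseline_scores[n_bins])
```

With fewer classes the baseline accuracy rises even though the features are uninformative, so an accuracy gain after binning the target should be read against the matching baseline.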


### 2) Use at least 2 other preprocessing techniques (your choice!) on the data set and comment the classification results

Here we will use:
- Standardization
- Min-max normalization
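
As a quick reminder of what the two techniques do, the sketch below applies `StandardScaler` and `MinMaxScaler` to a small made-up array whose two columns live on very different scales (the values are hypothetical and only loosely resemble X1 and X5).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two columns on very different scales (hypothetical values).
X = np.array([[10_000.0, 21.0],
              [50_000.0, 35.0],
              [90_000.0, 49.0]])

# Standardization: (x - mean) / std, computed per column.
standardized = StandardScaler().fit_transform(X)

# Min-max normalization: (x - min) / (max - min), computed per column.
normalized = MinMaxScaler().fit_transform(X)

print(standardized.round(3))
print(normalized)
```

Both transformations put the columns on a comparable scale, which matters for KNN because its distance computation is otherwise dominated by the column with the largest numeric range.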

In [48]:
def knn_clf_stand(mydata, cols, i):
    
    # Use the selected column as the target; standardize the remaining columns
    X = mydata.drop(columns=[cols])
    X = StandardScaler().fit_transform(X)
    y = mydata[cols]
    
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    #Instantiate the model with 3 neighbors. 
    knn = KNeighborsClassifier(n_neighbors=3)

    # Fit the model on the training data.
    knn.fit(X_train, y_train)

    #Predict the response
    pred = knn.predict(X_test)

    #Evaluate the accuracy

    accur = accuracy_score(y_test, pred)
    results.append(accur)
    
    # i flags the data variant: 0 = original data, otherwise binned data
    if i == 0:
        print("The basic accuracy for the feature", cols, "is :", accur)
    else:
        print("The accuracy with binning for the feature", cols, "is :", accur)


In [49]:
# Run the accuracy test on our selected features (original data)
for i in features:
    knn_clf_stand(data, i, 0)

# Run the accuracy test on our selected features (binned data)
for i in features:
    knn_clf_stand(data_copy, i, 1)

The basic accuracy for the feature X1 is : 0.14493333333333333
The basic accuracy for the feature X5 is : 0.0652
The accuracy with binning for the feature X1 is : 0.26213333333333333
The accuracy with binning for the feature X5 is : 0.4376


In [36]:
def knn_clf_norm(mydata, cols, i):
    
    # Use the selected column as the target; rescale the remaining columns to [0, 1]
    X = mydata.drop(columns=[cols])
    X = MinMaxScaler().fit_transform(X)
    y = mydata[cols]
    
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    #Instantiate the model with 3 neighbors. 
    knn = KNeighborsClassifier(n_neighbors=3)

    # Fit the model on the training data.
    knn.fit(X_train, y_train)

    #Predict the response
    pred = knn.predict(X_test)

    #Evaluate the accuracy

    accur = accuracy_score(y_test, pred)
    results.append(accur)
    # i flags the data variant: 0 = original data, otherwise binned data
    if i == 0:
        print("The basic accuracy for the feature", cols, "is :", accur)
    else:
        print("The accuracy with binning for the feature", cols, "is :", accur)


In [51]:
# Run the accuracy test on our selected features (original data)
for i in features:
    knn_clf_norm(data, i, 0)

# Run the accuracy test on our selected features (binned data)
for i in features:
    knn_clf_norm(data_copy, i, 1)

The basic accuracy for the feature X1 is : 0.12333333333333334
The basic accuracy for the feature X5 is : 0.06493333333333333
The accuracy with binning for the feature X1 is : 0.2538666666666667
The accuracy with binning for the feature X5 is : 0.4388


In [55]:
print("With both preprocessing techniques, the accuracy is better for the X5 feature (binned: 0.357 -> about 0.44) but lower for the X1 feature (binned: 0.318 -> about 0.26).")

With both preprocessing techniques, the accuracy is better for the X5 feature (binned: 0.357 -> about 0.44) but lower for the X1 feature (binned: 0.318 -> about 0.26).
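
The helper functions above fit the scaler on the full data before splitting, so the test rows influence the scaling statistics, and the split itself is not seeded, so the accuracies vary between runs. A minimal sketch of one way to avoid both issues is shown below, using a scikit-learn `Pipeline` and a fixed `random_state`. The synthetic data from `make_classification` is only a stand-in; it is an assumption for the example and not the credit card data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would pass its own X and y instead.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# A fixed random_state makes the split, and therefore the score, reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training rows only, then reuses those
# statistics to transform the test rows before prediction.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Swapping `StandardScaler()` for `MinMaxScaler()` in the pipeline gives the min-max variant with the same leakage-free behavior.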
