<center>
  <h2 style="color:#000000;"> Impact of feature engineering on models' performance</h2>

<h3 style="color:#000000;"> Abdelwahid Benslimane</h3>
    <h3 style="color:#000000;"> wahid.benslimane@gmail.com</h3>
</center>

<br>


The purpose of the present experiment is to show that a little work on the data, both in terms of dimensionality and format, can improve the performance of a model, provided that hyper-parameters may also be modified.<br>

Also, depending on the model, the optimal data transformation may differ. 
    
The procedure is simple. It uses the German Credit dataset, which is available on many sites and contains both numeric and categorical data. The dataset labels 1,000 individuals, each described by 20 variables (target variable not included), as either at risk or not at risk for granting credit. Of these individuals, 300 are considered at risk and the other 700 are considered not at risk. The dataset has no missing values. It can be downloaded here: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

I carried out several different transformations:

- First, I discretized the real-valued numerical variables (or numerical variables with very many values) so that the dataset contained only categorical variables. The resulting dataset is called data0 in the following explanations. It contains 20 variables (target variable not included).
    

- I created a first binary dataset using the complete disjunctive array (one-hot encoding) obtained from data0. This dataset contains 77 binary variables, corresponding to the total number of categories (modalities) across all the categorical variables in data0. This is the first dataset.


- I performed a multiple correspondence analysis (MCA) on data0 to obtain a dataset of real-valued synthetic variables. The maximum number of factors requested was 77, but the MCA retained only 76, because the first 76 components contain the total inertia of the data. This is the second dataset. 


- I selected the first 35 components, which together contain more than 77% of the data inertia. This is the third data set.


- I transformed the String variables in data0 into ordinal variables, as the models used only work with numerical values. This fourth dataset therefore contains 20 variables, just like data0.

<h3><u>It is important to mention that ordinal encoding is not entirely appropriate in this specific case, because the values of some variables have no natural order; it is nevertheless worth checking how the models perform with it.</u></h3>
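The caveat about ordinal encoding can be illustrated with a minimal sketch. The toy column below is hypothetical (it is not taken from the German Credit data), and the example only assumes the default behaviour of scikit-learn's `OrdinalEncoder`, which numbers categories in sorted order. The resulting integers impose an order that has no meaning for the variable, whereas one-hot encoding creates one unordered binary column per category.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, used only for illustration
toy = pd.DataFrame({"purpose": ["radio/tv", "car", "education", "car"]})

# Ordinal encoding: categories are sorted alphabetically, then numbered
# car -> 0, education -> 1, radio/tv -> 2 (an arbitrary order)
ordinal_codes = OrdinalEncoder().fit_transform(toy)

# One-hot encoding: one binary column per category, no order implied
one_hot = pd.get_dummies(toy)

print(ordinal_codes.ravel())
print(list(one_hot.columns))
```

A model that relies on distances or thresholds will treat "radio/tv" as being further from "car" than "education" is, which is purely an artifact of the alphabetical numbering.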

For the demonstration I worked on 2 fairly simple models, the decision tree and SVM with the kernel trick. 
For the decision tree, a first test was carried out with a complete tree, then a pruned decision tree in order to improve the tree's generalization capacity by limiting its depth.<br>

As for the SVM, tests were made with several kernels and several values for each parameter of each kernel.<br>

Model optimization was simply performed using a grid search.

<h3> I) Data loading, discretization of numerical variables and creation of the binary dataset</h3>

In [1]:
#The Prince library (for factor analysis) can be installed from a Jupyter notebook using the following commands:

#import sys
#!{sys.executable} -m pip install prince

In [2]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split



#data prep

#building a DataFrame from the structured data file

#columns in german.data are separated by whitespace
df = pd.read_table('german.data', sep=r'\s+', names=('Status of existing checking account', 'Duration (months)',
                                                     'Credit history', 'Purpose', 'Credit amount',
                                                     'Saving account/bounds', 'Present employment since',
                                                     'Install. rate of disposable income', 'Personal status and sex',
                                                     'Other debtors / guarantors', 'Present residence since',
                                                     'Property', 'Age (years)', 'Other installment plans',
                                                     'Housing', 'Number of existing credits at this bank',
                                                     'Job',
                                                     'Number of people being liable to provide maintenance for',
                                                     'Telephone', 'Foreign worker', 'Customer risk category'))

#"Customer risk category" is the target variable

#convert integer variables with a small number of finite values to string type

df["Number of people being liable to provide maintenance for"] = df["Number of people being liable to provide maintenance for"]\
.astype(str)
df["Install. rate of disposable income"] = df["Install. rate of disposable income"].astype(str)
df["Present residence since"]= df["Present residence since"].astype(str)
df["Number of existing credits at this bank"]= df["Number of existing credits at this bank"].astype(str)

#transforming other numerical variables into categorical variables 
#by replacing values by the intervals to which they belong  
df["Age (years)"] = pd.qcut(df["Age (years)"], 3, labels=['young', 'medium-aged', 'senior'])
df["Duration (months)"] = pd.qcut(df["Duration (months)"], 3, labels=['low', 'medium', 'high'])
df["Credit amount"] = pd.qcut(df["Credit amount"], 3, labels=['low', 'medium', 'high'])

#extraction of the target variable then conversion of value 2 to 1 and value 1 to 0
Y = df["Customer risk category"].copy() -1

#extraction of the explanatory variables 
data0 = df.drop(columns = "Customer risk category")

#one-hot encoding of the explanatory variables
X_dummy = pd.get_dummies(data0)

print("Number of columns of the complete disjunctive array (one-hot encoding):")
print(np.shape(X_dummy)[1])
print("Number of individuals categorized as customers at risk:")
print(sum(Y))


Number of columns of the complete disjunctive array (one-hot encoding):
77
Number of individuals categorized as customers at risk:
300
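To make the discretization step more concrete, here is a minimal sketch of what `pd.qcut` does with three quantile bins. The ages below are hypothetical values chosen for illustration; only the call pattern matches the cell above.

```python
import pandas as pd

# Hypothetical ages, used only to illustrate tercile binning
ages = pd.Series([19, 22, 25, 31, 35, 40, 48, 57, 66])

# Three bins of (roughly) equal size, delimited by the 1/3 and 2/3 quantiles
age_bins = pd.qcut(ages, 3, labels=["young", "medium-aged", "senior"])

# Each label receives the same number of individuals here
counts = age_bins.value_counts()
print(age_bins.tolist())
print(counts)
```

Because the bin edges are quantiles of the data, the three categories are balanced by construction, which is not the case with fixed-width intervals.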


<h3>II) MCA on data0 and analysis of the first 40 components</h3>

In [3]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

import prince

#Multiple Correspondence Analysis to convert categorical variables 
#into real-valued numerical variables
#the total number of categories across all the explanatory variables is 77
#therefore, the maximum number of components requested is 77
mca = prince.MCA(n_components=77,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=40
)

mca = mca.fit(data0)
X = mca.transform(data0)

#inertia explained by the first 40 components 
mca.eigenvalues_summary[:40]

component  eigenvalue  % of variance  % of variance (cumulative)
0               0.159          5.58%                       5.58%
1               0.113          3.97%                       9.55%
2               0.106          3.73%                      13.27%
3                0.09          3.14%                      16.41%
4               0.081          2.83%                      19.24%
5                0.08          2.80%                      22.05%
6               0.075          2.62%                      24.67%
7               0.073          2.58%                      27.24%
8                0.07          2.45%                      29.69%
9               0.069          2.41%                      32.10%


The first 35 components obtained from the MCA carry 77.66% of the data inertia. I therefore create both a dataset with all the synthetic variables and a dataset with only the first 35 synthetic variables.
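The choice of a cutoff from cumulative inertia can be sketched as follows. This is only an illustration: the helper `components_needed` and the toy eigenvalues are hypothetical, and the function simply returns the smallest number of leading components whose cumulative share of inertia reaches a given threshold.

```python
import numpy as np

def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative
    share of the total inertia reaches `threshold` (between 0 and 1)."""
    values = np.asarray(eigenvalues, dtype=float)
    cumulative = np.cumsum(values) / values.sum()
    # index of the first cumulative share >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(values))

# Hypothetical eigenvalues, sorted in decreasing order
toy_eigenvalues = [0.4, 0.3, 0.2, 0.1]
print(components_needed(toy_eigenvalues, 0.77))  # 3
```

With the eigenvalues of the MCA above, the same kind of computation is what leads to keeping 35 components for a little over 77% of the inertia.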

In [4]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#The first 35 components explain more than 77% of the inertia, therefore
#I create a dataset with only the first 35 synthetic variables 
#I also create a dataset with all the synthetic variables

X_real = X.to_numpy()[:,:35]

X_real_complete = X.to_numpy()



<h3> III) Creation of the dataset with ordinal variables from data0</h3>

In [5]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com
#It is important to mention that ordinal encoding is not entirely appropriate 
#in this specific case, because the values of some variables have no natural order, 
#but it is worth checking how the models perform with it
from sklearn.preprocessing import OrdinalEncoder

#ordinal encoding of the explanatory variables
encoder = OrdinalEncoder()
X_ordinal = encoder.fit_transform(data0)


<br><u> Shapes of all the datasets summarized</u>:<br>

In [6]:
#Abdelwahid Benslimane

print("Shape of the one-hot encoded (binary) dataset:")
print(np.shape(X_dummy))
print("Shape of the complete real-valued dataset:")
print(np.shape(X_real_complete))
print("Shape of the real-valued dataset carrying 77.66 % of the inertia:")
print(np.shape(X_real))
print("Shape of dataset with ordinal variables:")
print(np.shape(X_ordinal))

Shape of the one-hot encoded (binary) dataset:
(1000, 77)
Shape of the complete real-valued dataset:
(1000, 76)
Shape of the real-valued dataset carrying 77.66 % of the inertia:
(1000, 35)
Shape of dataset with ordinal variables:
(1000, 20)


<h3> IV) Split of the data into train and test datasets</h3>

In [7]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#train and test split

X_dummy_train, X_dummy_test, y_dummy_train, y_dummy_test = train_test_split(X_dummy, Y, train_size=0.7, random_state=0)

X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(X_real, Y, train_size=0.7, random_state=0)

X_real_complete_train, X_real_complete_test, y_real_complete_train, y_real_complete_test = \
train_test_split(X_real_complete, Y, train_size=0.7, random_state=0)

X_ordinal_train, X_ordinal_test, y_ordinal_train, y_ordinal_test = \
train_test_split(X_ordinal, Y, train_size=0.7, random_state=0)
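Since the four datasets are split separately, it is worth checking that the same individuals end up in each test set. The sketch below, on hypothetical toy arrays, suggests that `train_test_split` with the same `random_state` and the same number of rows selects the same row positions regardless of the number of columns, which is what makes the scores comparable across datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Two hypothetical datasets describing the same 20 individuals;
# the first column stores the row identifier in both cases
row_ids = np.arange(20)
data_a = row_ids.reshape(-1, 1)
data_b = np.column_stack([row_ids, row_ids * 10])
labels = row_ids % 2

a_train, a_test, _, _ = train_test_split(data_a, labels, train_size=0.7, random_state=0)
b_train, b_test, _, _ = train_test_split(data_b, labels, train_size=0.7, random_state=0)

# The same individuals are held out in both splits
same_rows = np.array_equal(a_test[:, 0], b_test[:, 0])
print(same_rows)  # True
```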

<h3>V) Decision tree</h3>
    
For each type of dataset, I printed the score obtained with a complete tree, the best score obtained once the tree was pruned, the optimal depth of the tree, the confusion matrix and the classification report. 

<b>It is important to mention that scikit-learn's decision tree construction involves some randomness, which can affect the results. A better approach would therefore be to generate several trees and classify by majority vote (each individual is assigned the most frequent class among the trees' predictions), but this is not done here, for the sake of simplicity.</b>
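For reference, the majority-vote idea could look like the sketch below. It is not used in the rest of this notebook. The data is synthetic (generated with `make_classification`), and training each tree on a bootstrap sample is an assumption of mine, in the spirit of bagging, so that the trees actually differ from one another.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a vote between two classes cannot tie
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for i in range(n_trees):
    # bootstrap sample of the training set (assumption: bagging-style resampling)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    member = DecisionTreeClassifier(max_depth=3, random_state=i)
    member.fit(X_tr[idx], y_tr[idx])
    votes[i] = member.predict(X_te)

# class 1 is assigned when more than half of the trees predict it
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_te).mean()
print(accuracy)
```

Averaging over many trees reduces the dependence of the result on any single random seed, which is the concern raised above.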

<h3> V.1) Decision tree and real-valued dataset with 35 variables</h3>

In [8]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

#numerical variables (35 components)
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_real_train, y_real_train)
print("Score with a tree of maximum size and real-valued dataset with 35 numerical variables: "+\
      str(clf.score(X_real_test, y_real_test)))


#candidate values for max_depth: 1 to 38
depth = list(range(1, 39))
    
pgrid = {"max_depth": depth}

#looking for the best value for the max_depth parameter with 35 numerical variables 

grid_search = GridSearchCV(tree.DecisionTreeClassifier(random_state=1), param_grid=pgrid, cv=5, n_jobs=4)

grid_search.fit(X_real_train, y_real_train)


print("Score with optimal depth (obtained with a grid search): " + str(grid_search.best_estimator_.score(X_real_test, \
                                                                                                              y_real_test)))
print("\n")
print("Optimal depth: " + str(grid_search.best_estimator_.get_depth()))
print("\n")

y_pred = grid_search.predict(X_real_test)
print("Confusion matrix:")
print(confusion_matrix(y_real_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_real_test, y_pred))


Score with a tree of maximum size and real-valued dataset with 35 numerical variables: 0.67
Score with optimal depth (obtained with a grid search): 0.71


Optimal depth: 3


Confusion matrix:
[[187  27]
 [ 60  26]]


Classification report:
              precision    recall  f1-score   support

           0       0.76      0.87      0.81       214
           1       0.49      0.30      0.37        86

    accuracy                           0.71       300
   macro avg       0.62      0.59      0.59       300
weighted avg       0.68      0.71      0.69       300
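As a reading aid, the figures of the classification report can be recomputed by hand from the confusion matrix printed above (rows are the true classes, columns the predicted classes). The variable names below are mine.

```python
import numpy as np

# Confusion matrix of the pruned tree on the 35-component dataset
cm = np.array([[187, 27],
               [60, 26]])

# With labels ordered (0, 1): true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

precision_risky = tp / (tp + fp)   # 26 / 53
recall_risky = tp / (tp + fn)      # 26 / 86
accuracy = (tp + tn) / cm.sum()    # 213 / 300

print(round(precision_risky, 2), round(recall_risky, 2), round(accuracy, 2))
# 0.49 0.3 0.71
```

The low recall for class 1 means that only 26 of the 86 risky customers in the test set are detected, even though the overall accuracy is 0.71.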



<h3>V.2) Decision tree and complete real-valued dataset</h3>

In [9]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com


#numerical variables (76 components)
clf = tree.DecisionTreeClassifier(random_state=2)
clf.fit(X_real_complete_train, y_real_complete_train)
print("Score with a tree of maximum size and complete numerical variables: "+str(clf.score(X_real_complete_test, \
                                                                                            y_real_complete_test)))

#looking for the best value for the max_depth parameter with the complete numerical variables 

grid_search = GridSearchCV(tree.DecisionTreeClassifier(random_state=3), param_grid=pgrid, cv=5, n_jobs=4)

grid_search.fit(X_real_complete_train, y_real_complete_train)


print("Score with optimal depth (obtained with a grid search): " + str(grid_search.best_estimator_.score(\
                                                                   X_real_complete_test, y_real_complete_test)))
print("\n")
print("Optimal depth: " + str(grid_search.best_estimator_.get_depth()))
print("\n")

y_pred = grid_search.predict(X_real_complete_test)
print("Confusion matrix:")
print(confusion_matrix(y_real_complete_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_real_complete_test, y_pred))
print()

Score with a tree of maximum size and complete numerical variables: 0.6166666666666667
Score with optimal depth (obtained with a grid search): 0.7033333333333334


Optimal depth: 2


Confusion matrix:
[[196  18]
 [ 71  15]]


Classification report:
              precision    recall  f1-score   support

           0       0.73      0.92      0.81       214
           1       0.45      0.17      0.25        86

    accuracy                           0.70       300
   macro avg       0.59      0.55      0.53       300
weighted avg       0.65      0.70      0.65       300




<h3>V.3) Decision tree and ordinal dataset</h3>

In [10]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#ordinal variables
clf = tree.DecisionTreeClassifier(random_state=4)
clf.fit(X_ordinal_train, y_ordinal_train)
print("Score for a tree of maximum size and ordinal dataset: "+str(clf.score(X_ordinal_test, y_ordinal_test)))

#looking for the best value for the max_depth parameter with ordinal variables

grid_search = GridSearchCV(tree.DecisionTreeClassifier(random_state=5), param_grid=pgrid, cv=5, n_jobs=4)

grid_search.fit(X_ordinal_train, y_ordinal_train)


print("Score with optimal depth (obtained with a grid search): " + str(grid_search.best_estimator_.score(X_ordinal_test, \
                                                                                                              y_ordinal_test)))
print("\n")
print("Optimal depth: " + str(grid_search.best_estimator_.get_depth()))
print("\n")

y_pred = grid_search.predict(X_ordinal_test)
print("Confusion matrix:")
print(confusion_matrix(y_ordinal_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_ordinal_test, y_pred))

Score for a tree of maximum size and ordinal dataset: 0.6833333333333333
Score with optimal depth (obtained with a grid search): 0.7366666666666667


Optimal depth: 3


Confusion matrix:
[[185  29]
 [ 50  36]]


Classification report:
              precision    recall  f1-score   support

           0       0.79      0.86      0.82       214
           1       0.55      0.42      0.48        86

    accuracy                           0.74       300
   macro avg       0.67      0.64      0.65       300
weighted avg       0.72      0.74      0.72       300



<h3> V.4) Decision tree and binary dataset</h3>

In [11]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#dummy variables
clf = tree.DecisionTreeClassifier(random_state=6)
clf.fit(X_dummy_train, y_dummy_train)
print("Score for a tree of maximum size and binary dataset: "+str(clf.score(X_dummy_test, y_dummy_test)))

#looking for the best value for the max_depth parameter with dummy variables

grid_search = GridSearchCV(tree.DecisionTreeClassifier(random_state=7), param_grid=pgrid, cv=5, n_jobs=4)

grid_search.fit(X_dummy_train, y_dummy_train)


print("Score with optimal depth (obtained with a grid search): " + str(grid_search.best_estimator_.score(X_dummy_test, \
                                                                                                              y_dummy_test)))
print("\n")
print("Optimal depth: " + str(grid_search.best_estimator_.get_depth()))
print("\n")

y_pred = grid_search.predict(X_dummy_test)
print("Confusion matrix:")
print(confusion_matrix(y_dummy_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_dummy_test, y_pred))


Score for a tree of maximum size and binary dataset: 0.6233333333333333
Score with optimal depth (obtained with a grid search): 0.7433333333333333


Optimal depth: 4


Confusion matrix:
[[188  26]
 [ 51  35]]


Classification report:
              precision    recall  f1-score   support

           0       0.79      0.88      0.83       214
           1       0.57      0.41      0.48        86

    accuracy                           0.74       300
   macro avg       0.68      0.64      0.65       300
weighted avg       0.73      0.74      0.73       300



<h3>VI) SVM with kernel trick</h3>

<h3>VI.1) SVM with kernel trick and real-valued dataset with 35 variables</h3>

In [12]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#SVM with kernel trick
from sklearn import svm


#definition of different values for each parameter
#of each kernel (rbf, polynomial or linear)
#151 possible combinations

param_grid = [
    {'kernel': ['rbf'], 'gamma': ['auto', 'scale', 0.1, 1, 2, 5, 10, 12, 15, 17, 20], \
     'C': [0.01, 0.1, 1.0, 10, 20, 30, 50, 70, 100]},
    {'kernel': ['poly'], 'degree': [3, 10, 30], 'C': [0.01, 0.1, 1.0, 5, 7, 10, 12, 15, 17, 20, 30, 50, 70, 100]},
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1.0, 10, 15, 20, 30, 50, 70, 100]}
]

#use of grid search to find the best parameters with 35 numerical variables

clf = GridSearchCV(svm.SVC(max_iter=100000), param_grid, cv=5, n_jobs=4)

clf.fit(X_real_train, y_real_train)

print("Best score with an SVM with kernel trick and the real-valued dataset with 35 variables: " + str(\
                                                                                        clf.best_estimator_.score(X_real_test,\
                                                                                        y_real_test)))
print("\n")
print("Kernel and optimal parameters: " + str(clf.best_params_))
print("\n")

y_pred = clf.best_estimator_.predict(X_real_test)
print("Confusion matrix:")
print(confusion_matrix(y_real_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_real_test, y_pred))


Best score with an SVM with kernel trick and the real-valued dataset with 35 variables: 0.74


Kernel and optimal parameters: {'C': 50, 'gamma': 'auto', 'kernel': 'rbf'}


Confusion matrix:
[[180  34]
 [ 44  42]]


Classification report:
              precision    recall  f1-score   support

           0       0.80      0.84      0.82       214
           1       0.55      0.49      0.52        86

    accuracy                           0.74       300
   macro avg       0.68      0.66      0.67       300
weighted avg       0.73      0.74      0.73       300
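The number of combinations announced in the grid definition (151) can be verified with scikit-learn's `ParameterGrid`, which enumerates the same candidates as `GridSearchCV`. The grid below is copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'kernel': ['rbf'], 'gamma': ['auto', 'scale', 0.1, 1, 2, 5, 10, 12, 15, 17, 20],
     'C': [0.01, 0.1, 1.0, 10, 20, 30, 50, 70, 100]},
    {'kernel': ['poly'], 'degree': [3, 10, 30],
     'C': [0.01, 0.1, 1.0, 5, 7, 10, 12, 15, 17, 20, 30, 50, 70, 100]},
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1.0, 10, 15, 20, 30, 50, 70, 100]},
]

# rbf: 11 x 9 = 99, poly: 3 x 14 = 42, linear: 10
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 151

# with cv=5, each combination is fitted on 5 folds
n_cv_fits = 5 * n_combinations
print(n_cv_fits)  # 755
```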



<h3>VI.2) SVM with kernel trick and complete real-valued dataset</h3>

In [13]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#use of grid search to find the best parameters with the complete real variables:

clf = GridSearchCV(svm.SVC(max_iter=100000), param_grid, cv=5, n_jobs=4)

clf.fit(X_real_complete_train, y_real_complete_train)

print("Best score with an SVM with kernel trick and the complete real-valued dataset: " + str(clf.best_estimator_.score(\
                                                                                                    X_real_complete_test,\
                                                                                                    y_real_complete_test)))
print("\n")
print("Kernel and optimal parameters: " + str(clf.best_params_))
print("\n")

y_pred = clf.best_estimator_.predict(X_real_complete_test)
print("Confusion matrix:")
print(confusion_matrix(y_real_complete_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_real_complete_test, y_pred))

Best score with an SVM with kernel trick and the complete real-valued dataset: 0.76


Kernel and optimal parameters: {'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}


Confusion matrix:
[[193  21]
 [ 51  35]]


Classification report:
              precision    recall  f1-score   support

           0       0.79      0.90      0.84       214
           1       0.62      0.41      0.49        86

    accuracy                           0.76       300
   macro avg       0.71      0.65      0.67       300
weighted avg       0.74      0.76      0.74       300



<h3>VI.3) SVM with kernel trick and ordinal dataset</h3>

In [14]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#use of grid search to find the best parameters with ordinal variables:

clf = GridSearchCV(svm.SVC(max_iter=100000), param_grid, cv=5, n_jobs=4)

clf.fit(X_ordinal_train, y_ordinal_train)

print("Best score with an SVM with kernel trick and the ordinal dataset: " + str(clf.best_estimator_.score(X_ordinal_test,\
                                                                                                            y_ordinal_test)))
print("\n")
print("Kernel and optimal parameters: " + str(clf.best_params_))
print("\n")

y_pred = clf.best_estimator_.predict(X_ordinal_test)
print("Confusion matrix:")
print(confusion_matrix(y_ordinal_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_ordinal_test, y_pred))

Best score with an SVM with kernel trick and the ordinal dataset: 0.77


Kernel and optimal parameters: {'C': 1.0, 'gamma': 0.1, 'kernel': 'rbf'}


Confusion matrix:
[[189  25]
 [ 44  42]]


Classification report:
              precision    recall  f1-score   support

           0       0.81      0.88      0.85       214
           1       0.63      0.49      0.55        86

    accuracy                           0.77       300
   macro avg       0.72      0.69      0.70       300
weighted avg       0.76      0.77      0.76       300



<h3>VI.4) SVM with kernel trick and binary dataset </h3>

In [15]:
#Abdelwahid Benslimane
#wahid.benslimane@gmail.com

#use of grid search to find the best parameters with binary (one-hot encoded) variables:

clf = GridSearchCV(svm.SVC(max_iter=100000), param_grid, cv=5, n_jobs=4)

clf.fit(X_dummy_train, y_dummy_train)

print("Best score with an SVM with kernel trick and the binary dataset: " + str(clf.best_estimator_.score(X_dummy_test,\
                                                                                                            y_dummy_test)))
print("\n")
print("Kernel and optimal parameters: " + str(clf.best_params_))
print("\n")

y_pred = clf.best_estimator_.predict(X_dummy_test)
print("Confusion matrix:")
print(confusion_matrix(y_dummy_test, y_pred))
print("\n")
print("Classification report:")
print(classification_report(y_dummy_test, y_pred))

Best score with an SVM with kernel trick and the binary dataset: 0.7433333333333333


Kernel and optimal parameters: {'C': 20, 'gamma': 'auto', 'kernel': 'rbf'}


Confusion matrix:
[[179  35]
 [ 42  44]]


Classification report:
              precision    recall  f1-score   support

           0       0.81      0.84      0.82       214
           1       0.56      0.51      0.53        86

    accuracy                           0.74       300
   macro avg       0.68      0.67      0.68       300
weighted avg       0.74      0.74      0.74       300



<h3>VII) Synthesis of the results</h3>

The results are instructive. The best result with a complete decision tree was obtained with the ordinal dataset, whereas the best score with an optimized (pruned) tree was obtained with the binary dataset. The optimal depth also varies according to the dataset used. 

It is also interesting to note that the decision tree, complete or pruned, generalizes better when trained on the real-valued dataset containing only the first 35 synthetic variables, which carry 77.66% of the total inertia, than on the complete real-valued dataset. The difference is more significant with the complete tree (0.67 versus about 0.62) than with the pruned tree (0.71 versus about 0.70).

For the SVM with kernel trick, the best score was obtained with the ordinal dataset. The optimal hyper-parameters of the model also differ according to the dataset used. Unlike the decision tree, the SVM with kernel trick generalizes better when trained on the complete real-valued dataset (0.76) than on the real-valued dataset containing only the first 35 synthetic variables (0.74).
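To make the comparison easier to follow, the test accuracies reported in sections V and VI can be gathered in a single table. The sketch below only re-enters the scores printed above (rounded to four decimals); the row and column labels are mine.

```python
import pandas as pd

# Test accuracies copied from the outputs of sections V and VI
scores = pd.DataFrame(
    {
        "complete tree": [0.6700, 0.6167, 0.6833, 0.6233],
        "pruned tree": [0.7100, 0.7033, 0.7367, 0.7433],
        "SVM": [0.7400, 0.7600, 0.7700, 0.7433],
    },
    index=["MCA, 35 components", "MCA, 76 components", "ordinal", "binary (one-hot)"],
)

# dataset giving the best test accuracy for each model
best_dataset = scores.idxmax()

print(scores)
print(best_dataset)
```

The best dataset is the ordinal one for the complete tree and the SVM, and the binary one for the pruned tree, which matches the observations above.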

Precision and recall are always higher for non-risky customers than for risky customers, most likely because the original dataset contains only 1,000 individuals and risky customers are under-represented (30%).
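Given this imbalance, one possible reference point (not computed in the experiment itself) is the accuracy of a trivial classifier that always predicts the majority class. The sketch below derives it from the class supports shown in the classification reports.

```python
# Class supports in the test set, as shown in the classification reports
support_not_risky = 214
support_risky = 86

# accuracy obtained by always predicting "not at risk"
baseline_accuracy = support_not_risky / (support_not_risky + support_risky)
print(round(baseline_accuracy, 4))  # 0.7133
```

The accuracies reported above can be read against this value of about 0.71, which is why the per-class precision and recall are more informative than accuracy alone on this dataset.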

It is important to mention again that scikit-learn's decision tree construction involves some randomness, which can affect the results. A better approach would therefore be to generate several trees and classify by majority vote (each individual is assigned the most frequent class among the trees' predictions). 
