# Replicable Machine Learning Model to Determine Respondent Honesty

Authors:
- Gita Rayung Andadari (2041509)
- Endrit Sveçla (2041500)
- Brenda Tellez (2041498)
- Michele Zanatta (2045739)

Psychologists often use questionnaires to understand people's behavior from their responses. The results can be used in many ways, for example:

1. To determine the degree of someone's fitness in a job environment,
2. To determine if a person is ready or not before adopting a child,
3. To determine if someone has a specific type of mental disorder,
4. To determine if someone is a victim of an accident.

However, as these questionnaires become more popular, people are becoming more aware of the link between their answers and the consequences that follow. As a result, some respondents fake their answers to obtain the most favorable outcome. There are two types of faking:

1. Faking good: Behavior in which subjects present themselves in a **favorable manner**, endorsing desirable traits and rejecting undesirable ones.

   In example 1, people lie on the job fitness questionnaire so that the employer sees them as a better candidate, increasing their chance of being hired.

2. Faking bad: Behavior in which subjects present themselves in a **less favorable manner**, endorsing less desirable traits and rejecting desirable ones.

   In example 3, people lie on the mental disorder questionnaire to receive certain health benefits.

Therefore, a system is needed that can tell whether a response to these questionnaires is honest or dishonest. From a machine-learning perspective, this is a classic binary classification problem. Traditional machine learning algorithms, such as logistic regression, can be trained to detect dishonest responses with decent accuracy. However, a new problem arises. Every questionnaire has its own properties and a different tendency to elicit faking good or faking bad. Hence, a different machine-learning model has had to be used each time, which makes psychologists question the reliability of the proposed models. The ideal method should be **replicable** across different types of datasets. A research study (or, in this project, a machine learning model) is considered replicable when the entire research process is conducted again, using the same methods but new data, and still yields the same results. This shows that the results of the original study are reliable.

In this project, our team attempted to find and propose a **replicable machine learning model to distinguish whether a questionnaire response is honest or dishonest**. Our model will be tested against 16 different datasets, 8 faking good and 8 faking bad. We explored four different avenues for our problem:

1. Machine learning algorithm paired with model-dependent feature selection (SelectFromModel() and Forward Selection)
2. Model-independent feature selection (Chi-Square, Mutual Information)
3. Machine learning algorithm paired with model-agnostic feature selection (Permutation Importance)
4. Machine learning algorithm paired with dimensionality reduction techniques (Principal Component Analysis [PCA], Sparse PCA)

The four machine learning algorithms used for the first and third avenues are: Logistic Regression, Random Forest, SVM, and KNN.

A method that produces high and stable accuracy across datasets is considered the replicable method.
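
One way to make "high and stable" concrete is to summarize each method's accuracy across datasets by its mean, its spread, and its worst case. The sketch below is only an illustration: the helper name `stability_summary` and the accuracy values are hypothetical placeholders, not results from this project.

```python
import numpy as np

def stability_summary(acc_by_dataset):
    """Return mean, sample standard deviation, and worst-case accuracy."""
    acc = np.asarray(acc_by_dataset, dtype=float)
    return {"mean": acc.mean(), "std": acc.std(ddof=1), "min": acc.min()}

# Hypothetical placeholder accuracies (NOT results from this project)
method_a = [0.80, 0.79, 0.81, 0.80]   # moderate accuracy, very stable
method_b = [0.95, 0.60, 0.90, 0.78]   # similar mean, but unstable

summary_a = stability_summary(method_a)
summary_b = stability_summary(method_b)
print(summary_a)
print(summary_b)
```

Under this reading, the first method would be preferred: its mean is comparable, but its spread is much smaller and its worst case is much better.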

We will walk you through the idea that we proposed using one dataset, DT_df_CC.csv.
Later, we will apply the same method to all 16 datasets.

# Preparation

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
# Import dataset from google drive
faking_good = {'DT_df_CC.csv' : 'https://drive.google.com/file/d/1DGibOwzy1sXa9wMmbwyzSAVoyyJhV-H9/view?usp=share_link'}

file_name = 'DT_df_CC.csv'
url = faking_good[file_name]
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]

if file_name.split('.')[1] == 'csv':
    df = pd.read_csv(path)
    if df.shape[1] == 1:
        df = pd.read_csv(path, sep = ';')
else:
    df = pd.read_excel(path)

In [None]:
# Checking the shape, null values, and duplicates
print(df.shape)
print(df.isna().any().any())
print(df.duplicated().any())

(482, 28)
False
True


In [None]:
""" Function prepare_data: Used to split the data using train_test_split method
Input: Dataframe
Output: Split dataframes (train+validation, test, train, and validation sets). """
from sklearn.model_selection import train_test_split
def prepare_data(df):

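    X = df.loc[:, df.columns != 'CONDITION']
    # Encode the target on a new Series instead of modifying a slice of df in place:
    # honest (H) = 1, dishonest (D) = 0
    y = df.loc[:, 'CONDITION'].replace({'H': 1, 'D': 0})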

    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, train_size = 0.9, random_state=0)

    X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, train_size = 0.8, random_state=0)

    return X_train_val, X_test, y_train_val, y_test, X_train, X_val, y_train, y_val

In [None]:
# Preparing the data
X_train_val, X_test, y_train_val, y_test, X_train, X_val, y_train, y_val = prepare_data(df)


n_features = X_train_val.shape[1]
names = np.array(X_train_val.columns)

perc_best_features = 0.2
n_best_features = int(perc_best_features*n_features)

# Model dependent feature selection

Model-dependent feature selection uses the learning algorithm itself to choose the features, so the selection and the fitting of the model happen together.

Methods:

1. Forward Model Selection:

Forward Model Selection (FMS) is a feature selection method that starts with an empty set of features and iteratively adds features to the set, one at a time, based on a predefined criterion, such as the improvement in model performance.
  
The goal of FMS is to find the smallest subset of features that results in the best performance of the model.

2. SelectFromModel Method:

SelectFromModel is a feature selection method that is implemented as a transformer in the scikit-learn library. It is designed to work with any estimator with a "coef_" or "feature_importances_" attribute after fitting.
  
The idea behind SelectFromModel is to use an estimator's coefficients or feature importances to select the most informative features.
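
The greedy idea behind Forward Model Selection can be sketched by hand as follows. This is a minimal illustration on synthetic data, not the implementation used in this notebook (which relies on scikit-learn's SequentialFeatureSelector); the helper name `forward_select` and the toy dataset are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(model, X, y, n_features_to_select, cv=5):
    """Greedy forward selection: repeatedly add the feature that gives
    the best mean cross-validated accuracy together with those already chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        scores = []
        for f in remaining:
            cols = selected + [f]
            scores.append(cross_val_score(model, X[:, cols], y, cv=cv).mean())
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=3,
                                   n_redundant=0, n_clusters_per_class=1,
                                   random_state=0)
chosen = forward_select(LogisticRegression(max_iter=1000), X_toy, y_toy,
                        n_features_to_select=3)
print(chosen)
```

Each round refits the model once per remaining feature, which is why forward selection is much more expensive than a single-fit method such as SelectFromModel.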

For each machine learning model, we apply the procedure below:

1. Fit the models with 100% of features;
2. Apply a model-dependent feature selection method and keep the best 20% of features. Fit the models with this set of features;
3. [just for interpretable models] Select the best 20% of features based on coefficient values;
4. Apply model-agnostic and model-independent feature selection techniques, and dimensionality reduction methods, keeping the best 20% of features. Fit the models with this set of features;
5. Compare the results.
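
Step 3 of the procedure above (ranking features by coefficient values) can be sketched as follows for a linear model. This is an illustration under assumptions: the data are synthetic, the features are standardized so that coefficient magnitudes are comparable, and the variable names are not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=4,
                                   n_redundant=0, n_clusters_per_class=1,
                                   random_state=0)
X_scaled = StandardScaler().fit_transform(X_toy)

clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_toy)

# Keep the 20% of features with the largest absolute coefficients
n_keep = int(0.2 * X_scaled.shape[1])
ranking = np.argsort(np.abs(clf.coef_[0]))[::-1]
top_idx = np.sort(ranking[:n_keep])
X_reduced = X_scaled[:, top_idx]

print(top_idx, X_reduced.shape)
```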



In [None]:
# Importing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score

from math import exp
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import SelectFromModel

from sklearn.feature_selection import chi2, SelectPercentile
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif



In [None]:
""" Function fit_predict: Is a method that combines the fit and predict methods
of a machine learning model into a single function call.
Inputs: model: A certain machine learning model
        X,y: Train data
        X_test,y_test: Test data
Output: acc_test_model: Accuracy of the model in the test data.  """
def fit_predict(model, X, y, X_test, y_test):

    model.fit(X, y)

    y_train_pred = model.predict(X)
    y_test_pred = model.predict(X_test)

    acc_train_model = accuracy_score(y, y_train_pred)
    acc_test_model = accuracy_score(y_test, y_test_pred)

    return acc_test_model

In [None]:
""" Function seq_fit_transform: Implements Forward Model Selection Method as
described previously.
Inputs: model: Machine learning model
        X, y, X_test: Train and test data
        n_features_to_select: Number of features to select
Output: Selected train and test data. """
def seq_fit_transform(model, X, y, X_test, n_features_to_select = None):
    np.random.seed(0)

    sfs = SequentialFeatureSelector(model, direction='forward', cv = 5,
                                    n_features_to_select = n_features_to_select)

    sfs.fit(X, y)
    X_train_selection = sfs.transform(X)
    X_test_selection = sfs.transform(X_test)
    return X_train_selection, X_test_selection, sfs

In [None]:
""" Function model_sel: Implements Select From Model Method as
described previously.
Inputs: model: Machine learning model
        X, y, X_test: Train and test data
        n_features_to_select: Number of features to select
Output: Selected train and test data. """
def model_sel(model, X, y, X_test, n_features_to_select = None):
    np.random.seed(0)

    sfs = SelectFromModel(model, max_features = n_features_to_select)

    sfs.fit(X, y)
    X_train_selection = sfs.transform(X)
    X_test_selection = sfs.transform(X_test)
    return X_train_selection, X_test_selection, sfs

## Logistic Regression (MD)

In [None]:
# Implementation of Logistic Regression with Forward Selection and SelectFromModel methods
lr = LogisticRegression()
acc_lr_100 = fit_predict(lr, X_train, y_train, X_test, y_test)


#forward selection
X_train_selection_lr, X_test_selection_lr, sfs_lr = seq_fit_transform(lr, X_train_val, y_train_val, X_test, n_best_features)
acc_lr_fw = fit_predict(lr, X_train_selection_lr, y_train_val, X_test_selection_lr, y_test)
support_lr_fw= X_train.columns[(sfs_lr.get_support())]

#model selection
X_train_selection_lr, X_test_selection_lr, sfs_lr = model_sel(lr, X_train_val, y_train_val, X_test, n_best_features)
acc_lr_ms = fit_predict(lr, X_train_selection_lr, y_train_val, X_test_selection_lr, y_test)
support_lr_ms = X_train.columns[(sfs_lr.get_support())]

In [None]:
# The features selected from the methods
# We observe that 'Psycho1' and 'Psycho5 H' features were picked by both methods
print(support_lr_fw)
print(support_lr_ms)

Index(['Mach6 ', 'Psycho1 ', 'Psycho5 H', 'Psycho6', 'Psycho8 '], dtype='object')
Index(['Mach3 ', 'Mach7 ', 'Psycho1 ', 'Psycho2 ', 'Psycho5 H'], dtype='object')


## Random Forest (MD)

In [None]:
#fine tuning the hyper parameter
rf = RandomForestClassifier(n_estimators = 25)

parameters_rf = {'max_depth': [4, 8, 12], 'min_samples_leaf' : [1, 3, 5], 'max_leaf_nodes' : [None, 3, 5]}
model_rf = GridSearchCV(rf, parameters_rf, cv = 5)
model_rf.fit(X_train_val, y_train_val)

GridSearchCV(cv=5, estimator=RandomForestClassifier(n_estimators=25),
             param_grid={'max_depth': [4, 8, 12],
                         'max_leaf_nodes': [None, 3, 5],
                         'min_samples_leaf': [1, 3, 5]})

In [None]:
# Implementation of Random Forest Classifier with Forward Selection and SelectFromModel methods
model = RandomForestClassifier(n_estimators = 25,
                             max_depth = model_rf.best_params_['max_depth'],
                             min_samples_leaf = model_rf.best_params_['min_samples_leaf'],
                            max_leaf_nodes = model_rf.best_params_['max_leaf_nodes'])

rf = model
acc_rf_100 = fit_predict(model, X_train_val, y_train_val, X_test, y_test)


#forward selection
X_train_selection, X_test_selection, sfs = seq_fit_transform(model, X_train_val, y_train_val, X_test, n_best_features)
acc_rf_fw = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
support_rf_fw= X_train.columns[(sfs.get_support())]

#model selection
X_train_selection, X_test_selection, sfs = model_sel(model, X_train_val, y_train_val, X_test, n_best_features)
acc_rf_ms = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
support_rf_ms = X_train.columns[(sfs.get_support())]

In [None]:
# The features selected from the methods
# We observe that 'Mach7', 'Psycho1', and 'Psycho5 H' features were picked by both methods
print(support_rf_fw)
print(support_rf_ms)

Index(['Mach7 ', 'Psycho1 ', 'Psycho5 H', 'Psycho6', 'Narc1'], dtype='object')
Index(['Mach7 ', 'Mach9 ', 'Psycho1 ', 'Psycho2 ', 'Psycho5 H'], dtype='object')


## SVM (MD)

In [None]:
#fine tuning hyper parameter
svm = SVC()
C_values = np.log(np.linspace(1.1, exp(1), 5))
parameters_svm = {'C' : C_values}
model_svm = GridSearchCV(svm, parameters_svm, cv = 5)
model_svm.fit(X_train_val, y_train_val)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': array([0.09531018, 0.40850745, 0.64665336, 0.83885289, 1.        ])})

In [None]:
# Implementation of SVC with Forward Selection method
model = SVC(C = model_svm.best_params_['C'])
svm = model
acc_svm_100 = fit_predict(model, X_train, y_train, X_test, y_test)


#forward selection
X_train_selection, X_test_selection, sfs = seq_fit_transform(model, X_train_val, y_train_val, X_test, n_best_features)
acc_svm_fw = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
support_svm_fw= X_train.columns[(sfs.get_support())]

#model selection -- can't be done because SVC with the default RBF kernel does not expose coef_ or feature_importances_

In [None]:
# The features selected from the method
print(support_svm_fw)

Index(['Psycho1 ', 'Psycho4 ', 'Psycho5 H', 'Narc7 ', 'Narc8 '], dtype='object')


## KNN (MD)

In [None]:
#fine tuning hyper parameter
knn = KNeighborsClassifier()
parameters_knn = {'n_neighbors' : [1, 5, 10, 20]}

model_knn = GridSearchCV(knn, parameters_knn, cv = 5)
model_knn.fit(X_train_val, y_train_val)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 5, 10, 20]})

In [None]:
# Implementation of KNN with Forward Selection method
model = KNeighborsClassifier(n_neighbors = model_knn.best_params_['n_neighbors'])
knn = model
acc_knn_100 = fit_predict(model, X_train, y_train, X_test, y_test)


#forward selection
X_train_selection, X_test_selection, sfs = seq_fit_transform(model, X_train_val, y_train_val, X_test, n_best_features)
acc_knn_fw = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
support_knn_fw= X_train.columns[(sfs.get_support())]

#model selection -- can't be done because KNN does not expose coef_ or feature_importances_

In [None]:
# The features selected from the method
print(support_knn_fw)

Index(['Psycho1 ', 'Psycho2 ', 'Psycho5 H', 'Psycho8 ', 'Narc7 '], dtype='object')


## Comparison (MD)

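We compare the feature subsets selected by the different models using two overlap measures: Jaccard similarity and intersection percentage.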

In [None]:
""" Function jaccard_similarity: Measures the similarity between two sets of data,
in this case between two results we obtained from different combinations of models and methods.
Input: Two lists of results
Output: Jaccard similarity coefficient. """
def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union


In [None]:
#forward selection
support_models_fw = [support_lr_fw, support_rf_fw, support_svm_fw, support_knn_fw]
results_js_fw = np.zeros(shape = (len(support_models_fw), len(support_models_fw)))
results_int_fw = np.zeros(shape = (len(support_models_fw), len(support_models_fw)))

for i in range(len(support_models_fw)):
    for j in range(len(support_models_fw)):
        results_js_fw[i, j] = round(jaccard_similarity(support_models_fw[i], support_models_fw[j]), 2)
        # Index.intersection replaces the deprecated set-style `&` on pandas Index objects
        results_int_fw[i, j] = round(len(support_models_fw[i].intersection(support_models_fw[j]))/len(support_models_fw[i]), 2)



In [None]:
#model selection
support_models_ms = [support_lr_ms, support_rf_ms]
results_js_ms = np.zeros(shape = (len(support_models_ms), len(support_models_ms)))
results_int_ms = np.zeros(shape = (len(support_models_ms), len(support_models_ms)))

for i in range(len(support_models_ms)):
    for j in range(len(support_models_ms)):
        results_js_ms[i, j] = round(jaccard_similarity(support_models_ms[i], support_models_ms[j]), 2)
        # Index.intersection replaces the deprecated set-style `&` on pandas Index objects
        results_int_ms[i, j] = round(len(support_models_ms[i].intersection(support_models_ms[j]))/len(support_models_ms[i]), 2)



In [None]:
# Jaccard Similarity results
models = ['lr', 'rf', 'svm', 'knn']
results1 = pd.DataFrame(results_js_fw, index = models, columns = models)
results2 = pd.DataFrame(results_int_fw, index = models, columns = models)
print('Jaccard similarity forward selection : \n', results1)
print('\n')
print('Intersection percentage forward selection: \n', results2)
print('\n')


jaccard_fs = results1.mean()
int_fs = results2.mean()

models = ['lr', 'rf']
results1 = pd.DataFrame(results_js_ms, index = models, columns = models)
results2 = pd.DataFrame(results_int_ms, index = models, columns = models)
print('Jaccard similarity model selection : \n', results1)
print('\n')
print('Intersection percentage model selection: \n', results2)
jaccard_ms = results1.mean()
int_ms = results2.mean()


Jaccard similarity forward selection : 
        lr    rf   svm   knn
lr   1.00  0.43  0.25  0.43
rf   0.43  1.00  0.25  0.25
svm  0.25  0.25  1.00  0.43
knn  0.43  0.25  0.43  1.00


Intersection percentage forward selection: 
       lr   rf  svm  knn
lr   1.0  0.6  0.4  0.6
rf   0.6  1.0  0.4  0.4
svm  0.4  0.4  1.0  0.6
knn  0.6  0.4  0.6  1.0


Jaccard similarity model selection : 
       lr    rf
lr  1.00  0.67
rf  0.67  1.00


Intersection percentage model selection: 
      lr   rf
lr  1.0  0.8
rf  0.8  1.0
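
To make the forward selection tables concrete, the lr/rf entry can be recomputed by hand from the two feature sets printed earlier in this notebook. The variable names below are chosen for the example only.

```python
# Feature sets printed above for forward selection (logistic regression, random forest)
lr_fw = {'Mach6 ', 'Psycho1 ', 'Psycho5 H', 'Psycho6', 'Psycho8 '}
rf_fw = {'Mach7 ', 'Psycho1 ', 'Psycho5 H', 'Psycho6', 'Narc1'}

shared = lr_fw & rf_fw                       # features picked by both models
jaccard = len(shared) / len(lr_fw | rf_fw)   # |A ∩ B| / |A ∪ B|
overlap = len(shared) / len(lr_fw)           # |A ∩ B| / |A|

print(sorted(shared))
print(round(jaccard, 2), round(overlap, 2))  # 0.43 0.6
```

Three shared features out of seven distinct ones give a Jaccard similarity of 3/7 ≈ 0.43, and three out of the five selected by logistic regression give an intersection percentage of 0.6, matching the lr/rf entries in the tables.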


In [None]:
# Accuracy results
acc_100 = [acc_lr_100, acc_rf_100, acc_svm_100, acc_knn_100]
acc_20_fw = [acc_lr_fw, acc_rf_fw, acc_svm_fw, acc_knn_fw]
acc_20_ms = [acc_lr_ms, acc_rf_ms, None, None]

acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms])

models = ['lr','rf','svm','knn']
results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% forward selection', '20% model selection'])

print(results_accuracy)

                             lr        rf       svm       knn
100%                   0.734694  0.755102  0.755102  0.693878
20% forward selection   0.77551  0.653061  0.693878  0.714286
20% model selection    0.755102  0.714286      None      None


# Model independent feature selection

Model independent feature selection is a feature selection method in which features are selected based on a statistical measure of association between each input variable and the output variable, without involving any machine learning model. In this project, two model independent feature selection methods are considered:

1. Chi Square
2. Mutual Information

The procedure for model independent feature selection is:

1. Fit the models with 100% of features;
2. Calculate the statistical measure of association between each input and the output;
3. Select the best 20% of features based on the measure from step 2;
4. Train the machine learning model using only the selected features;
5. Compare the results.


## Chi Squared Filter Method

We calculate the Chi-square statistic between each feature and the target and select the desired number of features with the best Chi-square scores.
The Chi-square test assesses whether the association observed between two categorical variables in the sample is likely to reflect a real association in the population.
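
As a small illustration of how the Chi-square score separates informative from uninformative features, the sketch below builds synthetic Likert-style answers in which only the first column depends on the label. The data and variable names are assumptions for the example; note that scikit-learn's chi2 requires non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

rng = np.random.default_rng(0)
n = 400
y_toy = rng.integers(0, 2, size=n)

# Answers on a 1..5 scale: column 0 shifts upward with the label, the rest are noise
informative = rng.integers(1, 4, size=n) + 2 * y_toy
noise = rng.integers(1, 6, size=(n, 3))
X_toy = np.column_stack([informative, noise])

scores, p_values = chi2(X_toy, y_toy)
selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

print(np.round(scores, 2))
print(selector.get_support())
```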

In [None]:
# Chi square method and the features that were selected

model_chi2 = SelectKBest(chi2, k=5)

model_chi2.fit(X_train_val, y_train_val)
support_chi2 = model_chi2.get_support()

X_train_selected_chi2 = model_chi2.transform(X_train_val)
X_test_selected_chi2 = model_chi2.transform(X_test)
chi_support = names[support_chi2]

print('Chi support {}'.format(chi_support))

Chi support ['Mach5 ' 'Psycho1 ' 'Psycho2 ' 'Psycho5 H' 'Psycho8 ']


In [None]:
# Jaccard Similarity between the methods

results_js = np.zeros(shape = (len(support_models_fw), 1))
results_int = np.zeros(shape = (len(support_models_fw), 1))


for i in range(len(support_models_fw)):
    results_js[i] = round(jaccard_similarity(support_models_fw[i], chi_support), 2)
    # Index.intersection replaces the deprecated set-style `&` on pandas Index objects
    results_int[i] = round(len(support_models_fw[i].intersection(chi_support))/len(chi_support), 2)

models = ['lr', 'rf', 'svm', 'knn']
results5 = pd.DataFrame(results_js, index = models, columns = ['Chi2'])
results6 = pd.DataFrame(results_int, index = models, columns = ['Chi2'])
print('Jaccard similarity between forward selection and chi square filter method: \n', results5)
print('\n')
print('Intersection percentage between forward selection and chi square filter method: \n', results6)
print('\n')


results_js = np.zeros(shape = (len(support_models_ms), 1))
results_int = np.zeros(shape = (len(support_models_ms), 1))


for i in range(len(support_models_ms)):
    results_js[i] = round(jaccard_similarity(support_models_ms[i], chi_support), 2)
    # Index.intersection replaces the deprecated set-style `&` on pandas Index objects
    results_int[i] = round(len(support_models_ms[i].intersection(chi_support))/len(chi_support), 2)

models = ['lr', 'rf']
results5 = pd.DataFrame(results_js, index = models, columns = ['Chi2'])
results6 = pd.DataFrame(results_int, index = models, columns = ['Chi2'])
print('Jaccard similarity between model selection and chi square filter method: \n', results5)
print('\n')
print('Intersection percentage between model selection and chi square filter method: \n', results6)



Jaccard similarity between forward selection and chi square filter method: 
      Chi2
lr   0.43
rf   0.25
svm  0.25
knn  0.67


Intersection percentage between forward selection and chi square filter method: 
      Chi2
lr    0.6
rf    0.4
svm   0.4
knn   0.8


Jaccard similarity between model selection and chi square filter method: 
     Chi2
lr  0.43
rf  0.43


Intersection percentage between model selection and chi square filter method: 
     Chi2
lr   0.6
rf   0.6




In [None]:
#train the models with the reduced feature set
acc_lr3 = fit_predict(lr, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)

rf3 = rf
acc_rf3 = fit_predict(rf3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)

svm3 = svm
acc_svm3 = fit_predict(svm3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)

knn3 = knn
acc_knn3 = fit_predict(knn3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)

In [None]:
acc_chi2 = [acc_lr3, acc_rf3, acc_svm3, acc_knn3]

acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms, acc_chi2])

models = ['lr', 'rf', 'svm', 'knn']
results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% forward selection', '20% model selection','chi2'])

print(results_accuracy)

                             lr        rf       svm       knn
100%                   0.734694  0.755102  0.755102  0.693878
20% forward selection   0.77551  0.653061  0.693878  0.714286
20% model selection    0.755102  0.714286      None      None
chi2                   0.734694  0.755102  0.734694   0.77551


## Mutual Information

The entropy of a random variable is the average level of "information," "surprise," or "uncertainty" inherent in its possible outcomes.

Mutual Information (Information Gain) measures the statistical dependence between two variables as the reduction in entropy of one variable once the other is known.

For feature selection, we evaluate the information gain of each feature with respect to the target variable and keep the features with the highest scores.
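
For discrete variables, the definition above can be written as I(X; Y) = H(Y) - H(Y | X). The sketch below implements this textbook formula directly on toy data; the helper names are assumptions for the example, and scikit-learn's mutual_info_classif may use a different estimator internally, so its scores need not match this calculation.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (in bits) of a discrete sample."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X) for two discrete samples."""
    x, y = np.asarray(x), np.asarray(y)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = x == value
        # Weight the entropy of y within each group by the group's frequency
        h_y_given_x += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_x

y_toy = np.array([0, 0, 1, 1, 0, 0, 1, 1])
x_copy = y_toy.copy()                            # fully determines the target
x_unrelated = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # carries no information about it

mi_copy = mutual_information(x_copy, y_toy)
mi_unrelated = mutual_information(x_unrelated, y_toy)
print(mi_copy, mi_unrelated)
```

A feature identical to the target removes all of its uncertainty, so its mutual information equals the target's entropy, while a feature unrelated to the target removes none.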


In [None]:
# feature selection - mutual information
fs = SelectKBest(score_func=mutual_info_classif, k=n_best_features)
fs.fit(X_train_val, y_train_val)

# Reuse the selector fitted on the training data so that both sets keep the
# same n_best_features columns and the test labels are never used for selection
X_train_mi = fs.transform(X_train_val)
X_test_mi = fs.transform(X_test)

In [None]:
#train the models with the reduced feature set
acc_lr_mi = fit_predict(lr, X_train_mi, y_train_val, X_test_mi, y_test)

rf3 = rf
acc_rf_mi = fit_predict(rf, X_train_mi, y_train_val, X_test_mi, y_test)

svm3 = svm
acc_svm_mi = fit_predict(svm, X_train_mi, y_train_val, X_test_mi, y_test)

knn3 = knn
acc_knn_mi = fit_predict(knn, X_train_mi, y_train_val, X_test_mi, y_test)

In [None]:
acc_mi = [acc_lr_mi, acc_rf_mi, acc_svm_mi, acc_knn_mi]

acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms, acc_chi2, acc_mi])

models = ['lr', 'rf', 'svm', 'knn']
results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% forward selection', '20% model selection','chi2' ,'Mutual Info.'])
#results_accuracy.index.name = file_name
print(results_accuracy)

                             lr        rf       svm       knn
100%                   0.734694  0.755102  0.755102  0.693878
20% forward selection   0.77551  0.653061  0.693878  0.714286
20% model selection    0.755102  0.714286      None      None
chi2                   0.734694  0.755102  0.734694   0.77551
Mutual Info.           0.591837  0.734694  0.734694  0.693878


# Model agnostic feature selection

Model agnostic methods are a broad category of machine learning methods that do not rely on assumptions about the underlying distribution or structure of the data, or on the specific characteristics of the model being used.

These methods can be applied to a wide variety of data and models, and they are often considered to be more flexible and generalizable than model-specific methods.

Model agnostic interpretation separates the explanations from the machine learning model. The advantages of model-agnostic interpretation methods are:

1. Model flexibility
2. Explanation flexibility
3. Representation flexibility

## Feature permutation


Permutation feature importance measures the increase in the model’s prediction error after the feature’s values are permuted, which breaks the relationship between the feature and the true outcome.

A feature is “important” if shuffling its values increases the model error because, in this case, the model relied on the feature for the prediction.

A feature is “unimportant” if shuffling its values leaves the model error unchanged because, in this case, the model ignored the feature for the prediction.
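
The shuffling procedure described above can be sketched by hand as follows. This is only an illustration on synthetic data, with importance measured as the drop in accuracy; the helper name `manual_permutation_importance` is an assumption, and the notebook itself uses scikit-learn's permutation_importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        importances[col] = np.mean(drops)
    return importances

# Synthetic data: with shuffle=False the two informative features are columns 0 and 1
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=2,
                                   n_redundant=0, n_clusters_per_class=1,
                                   class_sep=2.0, shuffle=False, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
imp = manual_permutation_importance(clf, X_toy, y_toy)
print(np.round(imp, 3))
```

The model is fitted only once; each feature's importance comes from re-scoring that fixed model on shuffled copies of the data, which is what makes the method applicable to any estimator.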


In [None]:
from sklearn.inspection import permutation_importance

In [None]:
""" Function imporance_permutation: Implements Permutation Importance feature selection method
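previously discussed, together with a ML model.
Inputs: model: Result returned by permutation_importance for a fitted model
        X_train_val: Training data
        perc_features_selected: Fraction of features to keep; if None, keep the features
                                whose mean importance exceeds two standard deviations
Output: Boolean mask of the selected features. """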
def importance_permutation(model, X_train_val, perc_features_selected = None):
    n_features = X_train_val.shape[1]
    features_selected = np.full((n_features), False)

    if perc_features_selected:
        number_features = int(perc_features_selected*n_features)
        for i in model.importances_mean.argsort()[::-1][:number_features]:
            features_selected[i] = True
            #print(f"{names[i]:<8}\t"
            #     f"{model.importances_mean[i]:.3f}"
            #     f" +/- {model.importances_std[i]:.3f}")
    else:
        for i in model.importances_mean.argsort()[::-1]:
            if model.importances_mean[i] - 2 * model.importances_std[i] > 0:
                features_selected[i] = True
            #    print(f"{names[i]:<8}\t"
            #         f"{model.importances_mean[i]:.3f}"
            #         f" +/- {model.importances_std[i]:.3f}")

    return features_selected

In [None]:
# Execution of the method with each model
# Logistic Regression
lr4 = lr
lr4.fit(X_train_val, y_train_val)
lr4_pi = permutation_importance(lr4, X_train_val, y_train_val,
                           n_repeats=30,
                           random_state=0)

perm_lr = importance_permutation(lr4_pi,X_train_val,perc_features_selected = perc_best_features)

X_train_selected_lr_perm = X_train_val.loc[:, names[perm_lr]]
X_test_selected_lr_perm = X_test.loc[:, names[perm_lr]]

#train the model with reduced features
lr4 = lr
acc_lr4 = fit_predict(lr4, X_train_selected_lr_perm, y_train_val, X_test_selected_lr_perm, y_test)

# Random Forest
rf4 = rf
rf4.fit(X_train_val, y_train_val)
rf4_pi = permutation_importance(rf4, X_train_val, y_train_val,
                           n_repeats=30,
                           random_state=0)

perm_rf = importance_permutation(rf4_pi,X_train_val,perc_features_selected = perc_best_features)

X_train_selected_rf_perm = X_train_val.loc[:, names[perm_rf]]
X_test_selected_rf_perm = X_test.loc[:, names[perm_rf]]

rf4 = rf

acc_rf4 = fit_predict(rf4, X_train_selected_rf_perm, y_train_val, X_test_selected_rf_perm, y_test)

# SVM
svm4 = svm

svm4.fit(X_train_val, y_train_val)
svm4_pi = permutation_importance(svm4, X_train_val, y_train_val,
                           n_repeats=30,
                           random_state=0)

perm_svm = importance_permutation(svm4_pi,X_train_val,perc_features_selected = perc_best_features)

X_train_selected_svm_perm = X_train_val.loc[:, names[perm_svm]]
X_test_selected_svm_perm = X_test.loc[:, names[perm_svm]]

svm4 = svm

acc_svm4 = fit_predict(svm4, X_train_selected_svm_perm, y_train_val, X_test_selected_svm_perm, y_test)

# KNN
knn4 = knn

knn4.fit(X_train_val, y_train_val)
knn4_pi = permutation_importance(knn4, X_train_val, y_train_val,
                           n_repeats=30,
                           random_state=0)

perm_knn = importance_permutation(knn4_pi,X_train_val,perc_features_selected = perc_best_features)

X_train_selected_knn_perm = X_train_val.loc[:, names[perm_knn]]
X_test_selected_knn_perm = X_test.loc[:, names[perm_knn]]

knn4 = knn

acc_knn4 = fit_predict(knn4, X_train_selected_knn_perm, y_train_val, X_test_selected_knn_perm, y_test)

In [None]:
# Jaccard similarity results from permutation importance method
support_models3 = [perm_lr, perm_rf, perm_svm, perm_knn]

results_js = np.zeros(shape = (len(support_models3), len(support_models3)))
results_int = np.zeros(shape = (len(support_models3), len(support_models3)))

for i in range(len(support_models3)):
    for j in range(len(support_models3)):
        results_js[i, j] = round(jaccard_similarity(names[support_models3[i]], names[support_models3[j]]), 2)
        results_int[i, j] = round(len(names[support_models3[i] & support_models3[j]])/len(names[support_models3[i]]), 2)

models = ['lr', 'rf', 'svm', 'knn']
results7 = pd.DataFrame(results_js, index = models, columns = models)
results8 = pd.DataFrame(results_int, index = models, columns = models)
jaccard_pi = results7.mean()
int_pi = results8.mean()
print('Jaccard similarity: \n', results7)
print('\n')
print('Intersection percentage: \n', results8)

Jaccard similarity: 
        lr    rf   svm   knn
lr   1.00  0.67  0.67  0.43
rf   0.67  1.00  0.67  0.25
svm  0.67  0.67  1.00  0.25
knn  0.43  0.25  0.25  1.00


Intersection percentage: 
       lr   rf  svm  knn
lr   1.0  0.8  0.8  0.6
rf   0.8  1.0  0.8  0.4
svm  0.8  0.8  1.0  0.4
knn  0.6  0.4  0.4  1.0


In [None]:
# Accuracies of the models implemented until now
acc_perm = [acc_lr4, acc_rf4, acc_svm4, acc_knn4]

acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms, acc_chi2,acc_mi, acc_perm])

results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% FW', '20% ms','chi2','Mutual info.' ,'Perm_imp'])

print(results_accuracy)

                    lr        rf       svm       knn
100%          0.734694  0.755102  0.755102  0.693878
20% FW         0.77551  0.653061  0.693878  0.714286
20% ms        0.755102  0.714286      None      None
chi2          0.734694  0.755102  0.734694   0.77551
Mutual info.  0.591837  0.734694  0.734694  0.693878
Perm_imp      0.755102  0.734694  0.734694  0.734694


# Dimensionality Reduction Techniques

Dimensionality reduction is a technique used to reduce the number of features or variables in a dataset. It is often used to remove redundant or irrelevant features, improve the interpretability of the data, and speed up the training of machine learning models.

## PCA

PCA is a linear dimensionality reduction technique that finds a new set of uncorrelated variables, called principal components, that capture as much of the variation in the original data as possible. It is often used to visualize high-dimensional data in a lower-dimensional space.
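
The two properties mentioned above (components that are uncorrelated and that capture most of the variance) can be checked on a small synthetic example. The data below are an assumption made for illustration: six observed variables generated from two hidden factors plus a little noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))    # two hidden factors
mixing = rng.normal(size=(2, 6))      # how the factors map to six observed variables
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca_toy = PCA(n_components=2).fit(X_toy)
scores = pca_toy.transform(X_toy)

# Share of the total variance captured by each principal component
print(np.round(pca_toy.explained_variance_ratio_, 3))
# Correlation matrix of the component scores: off-diagonal entries are numerically zero
print(np.round(np.corrcoef(scores, rowvar=False), 3))
```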

In [None]:
# Initiating sklearn PCA
from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components = n_best_features)
# Fit PCA in the train set
pca.fit(X_train_val)

# Transform train and test set

X_train_pca = pca.transform(X_train_val)
X_test_pca = pca.transform(X_test)

In [None]:
# Create a dataframe from the transformed data
#X_train
X_train_pca_df = pd.DataFrame(X_train_pca, columns=['PC{}'.format(i+1) for i in range(X_train_pca.shape[1])])

#X_test
X_test_pca_df = pd.DataFrame(X_test_pca, columns=['PC{}'.format(i+1) for i in range(X_test_pca.shape[1])])

In [None]:
#train the models with reduced features
lr5 = lr
acc_lr_pca = fit_predict(lr5, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

rf5 = rf
acc_rf_pca = fit_predict(rf5, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

svm5 = svm
acc_svm_pca = fit_predict(svm5, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

knn5 = knn
acc_knn_pca = fit_predict(knn5, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

In [None]:
#Accuracies
acc_pca = [acc_lr_pca, acc_rf_pca, acc_svm_pca, acc_knn_pca]

acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms, acc_chi2, acc_perm, acc_mi,acc_pca])

models = ['lr', 'rf', 'svm', 'knn']
results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% forward selection', '20% model selection','chi2', 'permutation importance','mutual information','PCA'])
#results_accuracy.index.name = file_name
print(results_accuracy)

                              lr        rf       svm       knn
100%                    0.734694  0.755102  0.755102  0.693878
20% forward selection    0.77551  0.653061  0.693878  0.714286
20% model selection     0.755102  0.714286      None      None
chi2                    0.734694  0.755102  0.734694   0.77551
permutation importance  0.755102  0.734694  0.734694  0.734694
mutual information      0.591837  0.734694  0.734694  0.693878
PCA                      0.77551   0.77551  0.734694  0.734694


## Sparse PCA

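Sparse PCA is a variant of PCA that also looks for high-variance directions in the data, but adds an L1 penalty (controlled by `alpha`) so that each component is built from only a few of the original features. The resulting components are easier to interpret than ordinary principal components.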
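
A minimal sketch of the difference from ordinary PCA, on made-up data in which two hypothetical independent factors each drive three of six items. The value `alpha=1.0` is an arbitrary choice for illustration (the notebook uses `alpha=0.01`); a larger `alpha` generally pushes more loadings to exactly zero, but the exact count depends on the data.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)

# Toy data (made up): two independent latent factors, each driving three of six items
f1 = rng.normal(size=(200, 1))
f2 = rng.normal(size=(200, 1))
X = np.hstack([f1 + 0.1 * rng.normal(size=(200, 3)),
               f2 + 0.1 * rng.normal(size=(200, 3))])

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Count loadings that are exactly zero in each set of components
print("zero loadings, PCA:       ", int(np.sum(pca.components_ == 0)))
print("zero loadings, Sparse PCA:", int(np.sum(spca.components_ == 0)))

# The fitted model projects data onto the sparse components as usual
X_spca = spca.transform(X)
```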

In [None]:
from sklearn.decomposition import SparsePCA

# Initialize the model with the desired number of components
sparse_pca = SparsePCA(n_components = n_best_features, alpha=0.01)


# fit the train set and transform train and test set
sparse_pca.fit(X_train_val)
X_train_spca = sparse_pca.transform(X_train_val)
X_test_spca = sparse_pca.transform(X_test)


In [None]:
#X_train
X_train_spca_df = pd.DataFrame(X_train_spca, columns=['PC{}'.format(i+1) for i in range(X_train_spca.shape[1])])

#X_test
X_test_spca_df = pd.DataFrame(X_test_spca, columns=['PC{}'.format(i+1) for i in range(X_test_spca.shape[1])])

In [None]:
#train the models with reduced features
lr6 = lr
acc_lr_spca = fit_predict(lr6, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

rf6 = rf
acc_rf_spca = fit_predict(rf6, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

svm6 = svm
acc_svm_spca = fit_predict(svm6, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

knn6 = knn
acc_knn_spca = fit_predict(knn6, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

In [None]:
# Accuracies
acc_spca = [acc_lr_spca, acc_rf_spca, acc_svm_spca, acc_knn_spca]

acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms, acc_chi2, acc_perm, acc_mi,acc_pca, acc_spca])

models = ['lr', 'rf', 'svm', 'knn']
results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% forward selection', '20% model selection','chi2', 'permutation importance','mutual information','PCA', 'Sparse PCA'])
results_accuracy['file_name'] = file_name

#results_accuracy.index.name = file_name
print(results_accuracy)

                              lr        rf       svm       knn     file_name
100%                    0.734694  0.755102  0.755102  0.693878  DT_df_CC.csv
20% forward selection    0.77551  0.653061  0.693878  0.714286  DT_df_CC.csv
20% model selection     0.755102  0.714286      None      None  DT_df_CC.csv
chi2                    0.734694  0.755102  0.734694   0.77551  DT_df_CC.csv
permutation importance  0.755102  0.734694  0.734694  0.734694  DT_df_CC.csv
mutual information      0.591837  0.734694  0.734694  0.693878  DT_df_CC.csv
PCA                      0.77551   0.77551  0.734694  0.734694  DT_df_CC.csv
Sparse PCA               0.77551  0.734694  0.734694   0.77551  DT_df_CC.csv


# Repeat for all datasets

In [None]:
""" The following function repeatability: Repeats the procedure to all the other 15 datasets.
Input: file_name: Name of the file
       path: File path """
def repeatability(file_name,path):
    results_accuracy = pd.DataFrame()

    if file_name.split('.')[1] == 'csv':
        df = pd.read_csv(path)
        if df.shape[1] == 1:
            df = pd.read_csv(path, sep = ';')

    X_train_val, X_test, y_train_val, y_test, X_train, X_val, y_train, y_val = prepare_data(df)

    n_features = X_train_val.shape[1]
    names = np.array(X_train_val.columns)

    perc_best_features = 0.2
    n_best_features = int(perc_best_features*n_features)

    ########################
    # Model dependent - LR #
    ########################
    lr = LogisticRegression()
    acc_lr_100 = fit_predict(lr, X_train, y_train, X_test, y_test)

    #forward selection
    X_train_selection_lr, X_test_selection_lr, sfs_lr = seq_fit_transform(lr, X_train_val, y_train_val, X_test, n_best_features)
    acc_lr_fw = fit_predict(lr, X_train_selection_lr, y_train_val, X_test_selection_lr, y_test)
    support_lr_fw= X_train.columns[(sfs_lr.get_support())]

    #model selection
    X_train_selection_lr, X_test_selection_lr, sfs_lr = model_sel(lr, X_train_val, y_train_val, X_test, n_best_features)
    acc_lr_ms = fit_predict(lr, X_train_selection_lr, y_train_val, X_test_selection_lr, y_test)
    support_lr_ms = X_train.columns[(sfs_lr.get_support())]

    ########################
    # Model dependent - RF #
    ########################
    #fine tuning the hyper parameter
    rf = RandomForestClassifier(n_estimators = 25)

    parameters_rf = {'max_depth': [4, 8, 12], 'min_samples_leaf' : [1, 3, 5], 'max_leaf_nodes' : [None, 3, 5]}
    model_rf = GridSearchCV(rf, parameters_rf, cv = 5)
    model_rf.fit(X_train_val, y_train_val)

    model = RandomForestClassifier(n_estimators = 25,
                                 max_depth = model_rf.best_params_['max_depth'],
                                 min_samples_leaf = model_rf.best_params_['min_samples_leaf'],
                                max_leaf_nodes = model_rf.best_params_['max_leaf_nodes'])

    rf = model
    acc_rf_100 = fit_predict(model, X_train_val, y_train_val, X_test, y_test)


    #forward selection
    X_train_selection, X_test_selection, sfs = seq_fit_transform(model, X_train_val, y_train_val, X_test, n_best_features)
    acc_rf_fw = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
    support_rf_fw= X_train.columns[(sfs.get_support())]

    #model selection
    X_train_selection, X_test_selection, sfs = model_sel(model, X_train_val, y_train_val, X_test, n_best_features)
    acc_rf_ms = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
    support_rf_ms = X_train.columns[(sfs.get_support())]

    ########################
    # Model dependent - SVM #
    ########################

    #fine tuning hyper parameter
    svm = SVC()
    C_values = np.log(np.linspace(1.1, np.exp(1), 5))
    parameters_svm = {'C' : C_values}
    model_svm = GridSearchCV(svm, parameters_svm, cv = 5)
    model_svm.fit(X_train_val, y_train_val)

    model = SVC(C = model_svm.best_params_['C'])
    svm = model
    acc_svm_100 = fit_predict(model, X_train, y_train, X_test, y_test)


    #forward selection
    X_train_selection, X_test_selection, sfs = seq_fit_transform(model, X_train_val, y_train_val, X_test, n_best_features)
    acc_svm_fw = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
    support_svm_fw= X_train.columns[(sfs.get_support())]

    ########################
    # Model dependent - KNN #
    ########################

    #fine tuning hyper parameter
    knn = KNeighborsClassifier()
    parameters_knn = {'n_neighbors' : [1, 5, 10, 20]}

    model_knn = GridSearchCV(knn, parameters_knn, cv = 5)
    model_knn.fit(X_train_val, y_train_val)

    model = KNeighborsClassifier(n_neighbors = model_knn.best_params_['n_neighbors'])
    knn = model
    acc_knn_100 = fit_predict(model, X_train, y_train, X_test, y_test)


    #forward selection
    X_train_selection, X_test_selection, sfs = seq_fit_transform(model, X_train_val, y_train_val, X_test, n_best_features)
    acc_knn_fw = fit_predict(model, X_train_selection, y_train_val, X_test_selection, y_test)
    support_knn_fw= X_train.columns[(sfs.get_support())]

    #######################################
    # summarizing model dependent result  #
    #######################################

    acc_100 = [acc_lr_100, acc_rf_100, acc_svm_100, acc_knn_100]
    acc_20_fw = [acc_lr_fw, acc_rf_fw, acc_svm_fw, acc_knn_fw]
    acc_20_ms = [acc_lr_ms, acc_rf_ms, None, None]

    ##################################
    # Model independent - Chi Square #
    ##################################

    model_chi2 = SelectKBest(chi2, k=n_best_features)

    model_chi2.fit(X_train_val, y_train_val)
    support_chi2 = model_chi2.get_support()

    X_train_selected_chi2 = model_chi2.transform(X_train_val)
    X_test_selected_chi2 = model_chi2.transform(X_test)
    chi_support = names[support_chi2]

    #train the model with reduce feature
    lr3 = lr
    acc_lr3 = fit_predict(lr3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)
    rf3 = rf
    acc_rf3 = fit_predict(rf3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)
    svm3 = svm
    acc_svm3 = fit_predict(svm3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)
    knn3 = knn
    acc_knn3 = fit_predict(knn3, X_train_selected_chi2, y_train_val, X_test_selected_chi2, y_test)
    acc_chi2 = [acc_lr3, acc_rf3, acc_svm3, acc_knn3]

    ########################################
    # Model Agnostic - Feature Permutation #
    ########################################

    #print('Logistic Regression')
    lr4 = LogisticRegression()
    lr4.fit(X_train_val, y_train_val)
    lr4_pi = permutation_importance(lr4, X_train_val, y_train_val,
                               n_repeats=30,
                               random_state=0)

    perm_lr = importance_permutation(lr4_pi,X_train_val,perc_features_selected = perc_best_features)

    X_train_selected_lr_perm = X_train_val.loc[:, names[perm_lr]]
    X_test_selected_lr_perm = X_test.loc[:, names[perm_lr]]

    #train the model with reduced features
    lr4 = LogisticRegression()
    acc_lr4 = fit_predict(lr4, X_train_selected_lr_perm, y_train_val, X_test_selected_lr_perm, y_test)

    #print('\nRandom Forest')
    rf4 = rf
    rf4.fit(X_train_val, y_train_val)
    rf4_pi = permutation_importance(rf4, X_train_val, y_train_val,
                               n_repeats=30,
                               random_state=0)

    perm_rf = importance_permutation(rf4_pi,X_train_val,perc_features_selected = perc_best_features)

    X_train_selected_rf_perm = X_train_val.loc[:, names[perm_rf]]
    X_test_selected_rf_perm = X_test.loc[:, names[perm_rf]]

    rf4 = rf

    acc_rf4 = fit_predict(rf4, X_train_selected_rf_perm, y_train_val, X_test_selected_rf_perm, y_test)

    #print('\nSVM')
    svm4 = svm

    svm4.fit(X_train_val, y_train_val)
    svm4_pi = permutation_importance(svm4, X_train_val, y_train_val,
                               n_repeats=30,
                               random_state=0)

    perm_svm = importance_permutation(svm4_pi,X_train_val,perc_features_selected = perc_best_features)

    X_train_selected_svm_perm = X_train_val.loc[:, names[perm_svm]]
    X_test_selected_svm_perm = X_test.loc[:, names[perm_svm]]

    svm4 = svm

    acc_svm4 = fit_predict(svm4, X_train_selected_svm_perm, y_train_val, X_test_selected_svm_perm, y_test)


    #print('\nKNN')
    knn4 = knn

    knn4.fit(X_train_val, y_train_val)
    knn4_pi = permutation_importance(knn4, X_train_val, y_train_val,
                               n_repeats=30,
                               random_state=0)

    perm_knn = importance_permutation(knn4_pi,X_train_val,perc_features_selected = perc_best_features)

    X_train_selected_knn_perm = X_train_val.loc[:, names[perm_knn]]
    X_test_selected_knn_perm = X_test.loc[:, names[perm_knn]]

    knn4 = knn

    acc_knn4 = fit_predict(knn4, X_train_selected_knn_perm, y_train_val, X_test_selected_knn_perm, y_test)

    acc_perm = [acc_lr4, acc_rf4, acc_svm4, acc_knn4]

    ########################################
    # Model Independent - Mutual Information  #
    ########################################

    fs = SelectKBest(score_func=mutual_info_classif, k=n_best_features)
    fs.fit(X_train_val, y_train_val)

    # reuse the selector fitted on the training data so train and test keep the same features
    X_train_mi = fs.transform(X_train_val)
    X_test_mi = fs.transform(X_test)

    #train the model with reduce feature
    #print('Logistic Regression')
    lr5 = lr
    acc_lr_mi = fit_predict(lr5, X_train_mi, y_train_val, X_test_mi, y_test)

    #print('Random Forest')
    rf5 = rf
    acc_rf_mi = fit_predict(rf5, X_train_mi, y_train_val, X_test_mi, y_test)

    #print('SVM')
    svm5 = svm
    acc_svm_mi = fit_predict(svm5, X_train_mi, y_train_val, X_test_mi, y_test)

    #print('KNN')
    knn5 = knn
    acc_knn_mi = fit_predict(knn5, X_train_mi, y_train_val, X_test_mi, y_test)

    acc_mi = [acc_lr_mi, acc_rf_mi, acc_svm_mi, acc_knn_mi]

    ########################
    # Model Agnostic - PCA #
    ########################

    # Fit PCA on the training set, then transform train and test sets
    pca = PCA(n_components=n_best_features)
    pca.fit(X_train_val)
    X_train_pca = pca.transform(X_train_val)
    X_test_pca = pca.transform(X_test)
    X_train_pca_df = pd.DataFrame(X_train_pca, columns=['PC{}'.format(i+1) for i in range(X_train_pca.shape[1])])

    #X_test
    # Create a dataframe from the transformed data
    X_test_pca_df = pd.DataFrame(X_test_pca, columns=['PC{}'.format(i+1) for i in range(X_test_pca.shape[1])])

    #train the model with reduce feature
    #print('Logistic Regression')
    lr6 = lr
    acc_lr_pca = fit_predict(lr6, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

    #print('Random Forest')
    rf6 = rf
    acc_rf_pca = fit_predict(rf6, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

    #print('SVM')
    svm6 = svm
    acc_svm_pca = fit_predict(svm6, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

    #print('KNN')
    knn6 = knn
    acc_knn_pca = fit_predict(knn6, X_train_pca_df, y_train_val, X_test_pca_df, y_test)

    acc_pca = [acc_lr_pca, acc_rf_pca, acc_svm_pca, acc_knn_pca]

    ###############################
    # Model Agnostic - Sparse PCA #
    ###############################

    sparse_pca = SparsePCA(n_components=n_best_features, alpha=0.01)

    # fit on the training set, then transform train and test sets
    sparse_pca.fit(X_train_val)
    X_train_spca = sparse_pca.transform(X_train_val)
    X_test_spca = sparse_pca.transform(X_test)

    #X_train
    X_train_spca_df = pd.DataFrame(X_train_spca, columns=['PC{}'.format(i+1) for i in range(X_train_spca.shape[1])])

    #X_test
    X_test_spca_df = pd.DataFrame(X_test_spca, columns=['PC{}'.format(i+1) for i in range(X_test_spca.shape[1])])

    #train the model with reduce feature
    lr7 = lr
    acc_lr_spca = fit_predict(lr7, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

    rf7 = rf
    acc_rf_spca = fit_predict(rf7, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

    svm7 = svm
    acc_svm_spca = fit_predict(svm7, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

    knn7 = knn
    acc_knn_spca = fit_predict(knn7, X_train_spca_df, y_train_val, X_test_spca_df, y_test)

    acc_spca = [acc_lr_spca, acc_rf_spca, acc_svm_spca, acc_knn_spca]

    ###############################
    # Summarizing all result #
    ###############################
    acc_comp = np.array([acc_100, acc_20_fw, acc_20_ms, acc_chi2, acc_perm, acc_mi,acc_pca, acc_spca])

    models = ['lr', 'rf', 'svm', 'knn']
    results_accuracy = pd.DataFrame(acc_comp, columns = models, index = ['100%', '20% forward selection', '20% model selection','chi2', 'permutation importance','mutual information','PCA', 'Sparse PCA'])
    results_accuracy['file_name'] = file_name

    return results_accuracy
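
Fitting a selector on the training split and applying the same selection to the test split keeps the two feature sets aligned and keeps test labels out of the selection step. A minimal sketch on synthetic Likert-style data (the array sizes, the toy label rule, and `k=3` are assumptions chosen for illustration, not values from the notebook):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic answers on a 1-5 scale; the label depends on the first two items only
X_train = rng.integers(1, 6, size=(120, 10)).astype(float)
y_train = (X_train[:, 0] + X_train[:, 1] > 6).astype(int)
X_test = rng.integers(1, 6, size=(40, 10)).astype(float)

# Fit the selector on the training data only
selector = SelectKBest(score_func=mutual_info_classif, k=3)
selector.fit(X_train, y_train)

# Apply the same selection to both splits
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

kept = selector.get_support(indices=True)
print("selected column indices:", kept)
```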

In [None]:
final_result = pd.DataFrame()

In [None]:
file_url = {'R_NEO_PI.csv' : 'https://drive.google.com/file/d/1XcPEKvV7f0EYuTmHNmqcg1lMBzMHbYdG/view?usp=share_link',
            'PRMQ_df.csv' : 'https://drive.google.com/file/d/1AQNixB6CtGyk-H6jVUIcBPJRojKVFCpC/view?usp=share_link',
            'PCL5_df.csv' : 'https://drive.google.com/file/d/1KMyTxops8osxDpsL7WY-fF2Fe2JB2AAg/view?usp=share_link',
            'NAQ_R.csv' : 'https://drive.google.com/file/d/1u6ChVqTcQZQqpYKIBfoSB0wZHYVqn1le/view?usp=share_link',
            'PHQ9_GAD7.csv' : 'https://drive.google.com/file/d/1i6Y2URPTo7GtmOcrkaAxyTy037SMafQv/view?usp=share_link',
            'PID5.csv' : 'https://drive.google.com/file/d/1jBToUOA3sUEdPh3Wenk9RHzm1zBQy9ua/view?usp=share_link',
            'IESR_df.csv' : 'https://drive.google.com/file/d/1cefqrKIn_C9ym40S3MsLSghTqjKo5rEt/view?usp=share_link',
            'RAW_DDDT.csv' : 'https://drive.google.com/file/d/1WrChF-LbVIhOeVTke_ZkWaUGwhEy_4Ys/view?usp=share_link',
            'IADQ_DF.csv' : 'https://drive.google.com/file/d/1tJrvQakEJj-fOpDru4gveVHPM6jw7csX/view?usp=share_link',
            'DT_df_CC.csv' : 'https://drive.google.com/file/d/1DGibOwzy1sXa9wMmbwyzSAVoyyJhV-H9/view?usp=share_link',
            'DT_df_JI.csv' : 'https://drive.google.com/file/d/1-H57PXuri9sKNtk_XpQjX9P3ey1fu54K/view?usp=share_link',
            'PRFQ_df.csv' : 'https://drive.google.com/file/d/1CwkrbPaGRSoaX7YpkZj0Vx6rqBZxbjub/view?usp=share_link',
            'BF_df_CTU.csv' : 'https://drive.google.com/file/d/15WC2c0SWZ_aQFxOhNWG8JqEDW5aid4cV/view?usp=share_link',
            'BF_df_OU.csv' : 'https://drive.google.com/file/d/1gCHDMwsRV2apCbE8r7InYOPZAohxp_sW/view?usp=share_link',
            'shortPID5.csv' : 'https://drive.google.com/file/d/19wOlnr9Td2-NWWb51513LPrW_yFbkVE0/view?usp=share_link',
            'BF_df_V.csv' : 'https://drive.google.com/file/d/1nfjxBodeLP3cZ0a4LE4nkw4oVYJ02ndu/view?usp=share_link'
             }

file_name_list = file_url.keys()

In [None]:
i = 1
for file_name in file_name_list:
    if file_name in ['R_NEO_PI.csv', 'PID5.csv']:
        continue
    url = file_url[file_name]
    path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
    print(str(i)+". "+file_name)

    result = repeatability(file_name,path)
    final_result = pd.concat([final_result, result])
    i+=1
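
The loop above builds a direct-download address from each Google Drive share link by taking the file id out of the URL. The same step as a small helper (the name `drive_direct_url` is hypothetical; the logic mirrors the `url.split('/')[-2]` expression used in the loop and assumes links of the form `.../file/d/<id>/view?...`):

```python
def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive '.../file/d/<id>/view?...' link into a direct-download URL."""
    file_id = share_url.split('/')[-2]  # the id is the second-to-last path segment
    return 'https://drive.google.com/uc?export=download&id=' + file_id


share = 'https://drive.google.com/file/d/1DGibOwzy1sXa9wMmbwyzSAVoyyJhV-H9/view?usp=share_link'
print(drive_direct_url(share))
```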

In [None]:
final_result

                              lr        rf       svm       knn     file_name
100%                    0.734694  0.755102  0.755102  0.693878  DT_df_CC.csv
20% forward selection    0.77551  0.653061  0.693878  0.714286  DT_df_CC.csv
20% model selection     0.755102  0.714286      None      None  DT_df_CC.csv
chi2                    0.734694  0.755102  0.734694   0.77551  DT_df_CC.csv
permutation importance  0.755102  0.734694  0.734694  0.734694  DT_df_CC.csv
...                          ...       ...       ...       ...           ...
chi2                    0.734694  0.795918  0.734694  0.673469   BF_df_V.csv
permutation importance  0.734694  0.755102  0.795918  0.693878   BF_df_V.csv
mutual information      0.673469  0.693878  0.591837  0.612245   BF_df_V.csv
PCA                     0.755102  0.734694  0.755102  0.755102   BF_df_V.csv


## Overall Method Analysis

In [None]:
final_result_2 = final_result.copy()

In [None]:
final_result_2 = final_result_2.drop_duplicates(keep='first')

In [None]:
file_name_list =  file_url.keys()
file_name_list

dict_keys(['R_NEO_PI.csv', 'PRMQ_df.csv', 'PCL5_df.csv', 'NAQ_R.csv', 'PHQ9_GAD7.csv', 'PID5.csv', 'IESR_df.csv', 'RAW_DDDT.csv', 'IADQ_DF.csv', 'DT_df_CC.csv', 'DT_df_JI.csv', 'PRFQ_df.csv', 'BF_df_CTU.csv', 'BF_df_OU.csv', 'shortPID5.csv', 'BF_df_V.csv'])

In [None]:
final_result_t = pd.DataFrame()
for file_name in file_name_list:
    # Accuracy table of this dataset; after the transpose the rows are models
    dum = final_result_2.loc[final_result_2['file_name'] == file_name]
    dum = dum.loc[:, dum.columns != 'file_name']
    dum = dum.T
    # Flatten into a single row; the columns stay a (method, model) MultiIndex
    df_new = dum.unstack().to_frame().T

    df_new['file_name'] = file_name
    final_result_t = pd.concat([final_result_t, df_new])
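
The reshaping in the loop above turns each dataset's accuracy table into one wide row whose columns are (method, model) pairs. A minimal sketch of that step on a made-up two-by-two table (the values and the file name `toy.csv` are hypothetical):

```python
import pandas as pd

# Toy per-dataset accuracy table: rows are methods, columns are models (values are made up)
toy = pd.DataFrame(
    {"lr": [0.70, 0.75], "rf": [0.72, 0.71]},
    index=["100%", "PCA"],
)

# Same steps as in the loop: transpose, unstack into a Series, turn it into one row
row = toy.T.unstack().to_frame().T
row["file_name"] = "toy.csv"

print(row.columns.tolist())
print(row)
```

Stacking one such row per dataset gives a frame in which a column-wise mean or median summarizes each (method, model) pair across datasets.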

In [None]:
final_result_t

Unnamed: 0_level_0,100%,100%,100%,100%,20% forward selection,20% forward selection,20% forward selection,20% forward selection,20% model selection,20% model selection,...,chi2,file_name,mutual information,mutual information,mutual information,mutual information,permutation importance,permutation importance,permutation importance,permutation importance
Unnamed: 0_level_1,knn,lr,rf,svm,knn,lr,rf,svm,knn,lr,...,svm,Unnamed: 13_level_1,knn,lr,rf,svm,knn,lr,rf,svm
0,0.836879,0.865248,0.914894,0.921986,0.879433,0.836879,0.907801,0.893617,,0.815603,...,0.886525,PRMQ_df.csv,0.907801,0.865248,0.921986,0.900709,0.843972,0.843972,0.886525,0.893617
0,0.878049,0.853659,0.829268,0.902439,0.853659,0.878049,0.878049,0.853659,,0.926829,...,0.853659,PCL5_df.csv,0.902439,0.682927,0.804878,0.926829,0.829268,0.902439,0.780488,0.878049
0,0.944444,0.944444,0.944444,0.944444,0.944444,0.944444,0.930556,0.944444,,0.930556,...,,NAQ_R.csv,,,,,0.930556,0.944444,0.902778,0.944444
0,0.982143,1.0,1.0,1.0,0.982143,0.973214,0.991071,0.982143,,0.964286,...,0.991071,PHQ9_GAD7.csv,1.0,0.991071,1.0,0.991071,0.991071,0.991071,0.955357,0.955357
0,0.972222,0.916667,0.972222,1.0,0.972222,0.972222,0.972222,0.972222,,0.972222,...,,IESR_df.csv,1.0,0.972222,1.0,1.0,0.944444,0.972222,0.972222,0.972222
0,0.747475,0.69697,0.808081,0.808081,0.787879,0.79798,0.767677,0.767677,,0.79798,...,0.787879,RAW_DDDT.csv,0.656566,0.59596,0.575758,0.575758,0.787879,0.79798,0.727273,0.727273
0,0.866667,0.911111,0.866667,0.844444,0.911111,0.866667,0.866667,0.866667,,0.866667,...,0.844444,IADQ_DF.csv,0.866667,0.888889,0.911111,0.888889,0.866667,0.866667,0.866667,0.866667
0,0.693878,0.734694,0.755102,0.755102,0.714286,0.77551,0.653061,0.693878,,0.755102,...,0.734694,DT_df_CC.csv,0.693878,0.714286,0.714286,0.591837,0.734694,0.755102,0.734694,0.734694
0,0.597701,0.609195,0.609195,0.597701,0.494253,0.563218,0.574713,0.528736,,0.574713,...,0.574713,DT_df_JI.csv,0.517241,0.45977,0.448276,0.45977,0.574713,0.609195,0.586207,0.586207
0,0.897059,0.897059,0.867647,0.926471,0.882353,0.838235,0.867647,0.852941,,0.867647,...,0.867647,PRFQ_df.csv,0.852941,0.794118,0.882353,0.794118,0.882353,0.838235,0.882353,0.838235


In [None]:
#take average accuracy for every method
final_result_t_mean = final_result_t.mean(axis=0, skipna= True)
final_result_t_mean.sort_values(ascending=False)

100%                    svm    0.859979
                        rf     0.851015
permutation importance  lr     0.850229
20% model selection     rf     0.850102
PCA                     svm    0.848719
100%                    lr     0.847996
PCA                     rf     0.847388
20% forward selection   svm    0.846986
permutation importance  svm    0.844839
20% forward selection   lr     0.843326
PCA                     lr     0.841695
100%                    knn    0.839671
chi2                    rf     0.839276
Sparse PCA              rf     0.836688
20% forward selection   rf     0.835542
chi2                    lr     0.835346
PCA                     knn    0.835242
Sparse PCA              svm    0.832580
permutation importance  knn    0.832240
20% model selection     lr     0.830804
Sparse PCA              lr     0.830057
20% forward selection   knn    0.828440
chi2                    svm    0.827758
permutation importance  rf     0.825486
chi2                    knn    0.825298


In [None]:
#take median accuracy for every method
final_result_t_median = final_result_t.median(axis=0, skipna= True)
final_result_t_median.sort_values(ascending=False)

100%                    svm    0.895664
PCA                     svm    0.886525
100%                    lr     0.881154
                        knn    0.872358
PCA                     rf     0.872340
chi2                    lr     0.870254
20% forward selection   rf     0.867157
permutation importance  rf     0.866667
mutual information      knn    0.866667
Sparse PCA              svm    0.866667
permutation importance  svm    0.866667
100%                    rf     0.866667
20% forward selection   knn    0.866546
PCA                     lr     0.865248
Sparse PCA              lr     0.865248
20% forward selection   svm    0.861905
chi2                    knn    0.860163
20% model selection     rf     0.860163
chi2                    rf     0.860163
Sparse PCA              rf     0.858156
20% model selection     lr     0.857246
permutation importance  lr     0.857246
20% forward selection   lr     0.852451
chi2                    svm    0.849051
PCA                     knn    0.844444


Now we are going to analyze the performance of the proposed model.

In total, we evaluate 32 combinations of methods. We regard the combination with the highest mean and median accuracy across datasets as the most stable. We analyze both statistics because the mean is sensitive to outliers, whereas the median shows the typical performance with those cases discounted. However, since we are looking for a stable method, edge cases should not be overlooked, so we also consider the mean.

The two rankings above show that keeping all the features tends to give the best classification performance, especially with the SVM algorithm.

However, we often deal with large datasets with many features, so reducing the number of features is necessary to keep processing time manageable. Among the reduced-feature combinations in the rankings above, permutation importance paired with logistic regression and model selection paired with random forest have the highest mean accuracy (both about 0.850), and the latter's median (0.860) is close to its mean. PCA paired with SVM has the highest median accuracy (0.887). These methods tend to work well but may not be a one-stop solution for every dataset.
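
The reason for reporting both statistics can be illustrated with a small made-up example: one poorly handled dataset pulls the mean down noticeably while the median barely moves. The accuracies below are hypothetical and are not taken from the results above.

```python
from statistics import mean, median

# Hypothetical accuracies of one method on five datasets; the last one is an edge case
accuracies = [0.90, 0.88, 0.91, 0.89, 0.46]

print(f"mean   = {mean(accuracies):.3f}")    # pulled down by the edge case
print(f"median = {median(accuracies):.3f}")  # reflects the typical dataset
```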





## Analysis based on fake good and fake bad datasets

In [None]:
#group by fake good fake bad

faking_good = {'DT_df_CC.csv' : 'https://drive.google.com/file/d/1DGibOwzy1sXa9wMmbwyzSAVoyyJhV-H9/view?usp=share_link',
           'DT_df_JI.csv' : 'https://drive.google.com/file/d/1-H57PXuri9sKNtk_XpQjX9P3ey1fu54K/view?usp=share_link',
           'PRFQ_df.csv' : 'https://drive.google.com/file/d/1CwkrbPaGRSoaX7YpkZj0Vx6rqBZxbjub/view?usp=share_link',
            'BF_df_CTU.csv' : 'https://drive.google.com/file/d/15WC2c0SWZ_aQFxOhNWG8JqEDW5aid4cV/view?usp=share_link',
            'BF_df_OU.csv' : 'https://drive.google.com/file/d/1gCHDMwsRV2apCbE8r7InYOPZAohxp_sW/view?usp=share_link',
              'shortPID5.csv' : 'https://drive.google.com/file/d/19wOlnr9Td2-NWWb51513LPrW_yFbkVE0/view?usp=share_link',
            'BF_df_V.csv' : 'https://drive.google.com/file/d/1nfjxBodeLP3cZ0a4LE4nkw4oVYJ02ndu/view?usp=share_link',
            'R_NEO_PI.csv' : 'https://drive.google.com/file/d/14iQKKDX1LZicize8f5ogn9EzbzjnzC-T/view?usp=share_link'
           }

faking_bad = {'PRMQ_df.csv' : 'https://drive.google.com/file/d/1AQNixB6CtGyk-H6jVUIcBPJRojKVFCpC/view?usp=share_link',
             'PCL5_df.csv' : 'https://drive.google.com/file/d/1KMyTxops8osxDpsL7WY-fF2Fe2JB2AAg/view?usp=share_link',
              'NAQ_R.csv' : 'https://drive.google.com/file/d/1dkLpSMhsej9NhqoNWZYQu9huL4j-voXH/view?usp=share_link',
              'PHQ9_GAD7.csv' : 'https://drive.google.com/file/d/1i6Y2URPTo7GtmOcrkaAxyTy037SMafQv/view?usp=share_link',
              'PID5.csv' : 'https://drive.google.com/file/d/1jBToUOA3sUEdPh3Wenk9RHzm1zBQy9ua/view?usp=share_link',
              'IESR_df.csv' : 'https://drive.google.com/file/d/1cefqrKIn_C9ym40S3MsLSghTqjKo5rEt/view?usp=share_link',
              'RAW_DDDT.csv' : 'https://drive.google.com/file/d/1WrChF-LbVIhOeVTke_ZkWaUGwhEy_4Ys/view?usp=share_link',
              'IADQ_DF.csv' : 'https://drive.google.com/file/d/1tJrvQakEJj-fOpDru4gveVHPM6jw7csX/view?usp=share_link'
             }

fake_good = faking_good.keys()
fake_bad = faking_bad.keys()


In [None]:
final_result_t_fg = pd.DataFrame()
for file_name in fake_good:
    # Accuracy table of this dataset; after the transpose the rows are models
    dum = final_result_2.loc[final_result_2['file_name'] == file_name]
    dum = dum.loc[:, dum.columns != 'file_name']
    dum = dum.T
    # Flatten into a single row; the columns stay a (method, model) MultiIndex
    df_new = dum.unstack().to_frame().T

    df_new['file_name'] = file_name
    final_result_t_fg = pd.concat([final_result_t_fg, df_new])

In [None]:
#take average accuracy for every method
final_result_t_fg_mean = final_result_t_fg.mean(axis=0, skipna= True)
final_result_t_fg_mean.sort_values(ascending=False)

100%                    lr     0.811977
chi2                    rf     0.807887
Sparse PCA              lr     0.807320
PCA                     lr     0.807320
                        rf     0.804594
100%                    svm    0.802616
Sparse PCA              svm    0.801030
PCA                     svm    0.799656
chi2                    lr     0.799388
Sparse PCA              rf     0.799129
20% model selection     rf     0.798957
permutation importance  svm    0.798588
                        lr     0.797773
100%                    rf     0.796947
Sparse PCA              knn    0.796769
20% forward selection   svm    0.796768
chi2                    svm    0.795645
20% forward selection   lr     0.791015
chi2                    knn    0.790710
100%                    knn    0.789646
PCA                     knn    0.784479
permutation importance  rf     0.780784
                        knn    0.779644
20% forward selection   rf     0.769078
20% model selection     lr     0.765303


In [None]:
final_result_t_fg_median = final_result_t_fg.median(axis=0, skipna= True)
final_result_t_fg_median.sort_values(ascending=False)

20% forward selection   svm    0.852941
permutation importance  lr     0.838235
chi2                    rf     0.826087
20% forward selection   lr     0.826087
                        rf     0.826087
permutation importance  svm    0.826087
20% model selection     rf     0.826087
chi2                    svm    0.804348
PCA                     rf     0.804348
permutation importance  knn    0.804348
chi2                    knn    0.804348
100%                    lr     0.804348
chi2                    lr     0.804348
100%                    svm    0.804348
mutual information      knn    0.804348
                        svm    0.794118
PCA                     lr     0.782609
Sparse PCA              lr     0.782609
PCA                     knn    0.782609
Sparse PCA              rf     0.782609
                        svm    0.782609
PCA                     svm    0.782609
Sparse PCA              knn    0.782609
mutual information      rf     0.782609
                        lr     0.782609


Looking at performance on the fake-good datasets, keeping all features is the most stable choice, especially paired with logistic regression (mean = 81.2%, median = 80.4%). **The best feature selection method for fake-good datasets is the chi-square method paired with the random forest algorithm (mean = 80.8%, median = 82.6%).**
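
Reading the best reduced-feature combination off a ranking like the one above can also be done programmatically. A minimal sketch, using the top four mean accuracies from the fake-good ranking and dropping the full-feature baseline before taking the maximum (the variable names are illustrative):

```python
import pandas as pd

# Top entries of the fake-good mean ranking shown above, keyed by (method, model)
mean_acc = pd.Series({
    ("100%", "lr"): 0.811977,
    ("chi2", "rf"): 0.807887,
    ("Sparse PCA", "lr"): 0.807320,
    ("PCA", "lr"): 0.807320,
})

# Exclude the full-feature baseline, then pick the best remaining combination
reduced = mean_acc.drop("100%", level=0)
best = reduced.idxmax()
print(best, round(reduced.max(), 3))
```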

In [None]:
final_result_t_fb = pd.DataFrame()
for file_name in fake_bad:
    # Accuracy table of this dataset; after the transpose the rows are models
    dum = final_result_2.loc[final_result_2['file_name'] == file_name]
    dum = dum.loc[:, dum.columns != 'file_name']
    dum = dum.T
    # Flatten into a single row; the columns stay a (method, model) MultiIndex
    df_new = dum.unstack().to_frame().T

    df_new['file_name'] = file_name
    final_result_t_fb = pd.concat([final_result_t_fb, df_new])

In [None]:
#take average accuracy for every method

final_result_t_fb_mean = final_result_t_fb.mean(axis=0, skipna= True)
final_result_t_fb_mean.sort_values(ascending=False)



100%                    svm    0.917342
PCA                     svm    0.905959
100%                    rf     0.905082
20% forward selection   knn    0.904413
permutation importance  lr     0.902685
Sparse PCA              rf     0.902418
20% forward selection   rf     0.902006
20% model selection     rf     0.901248
PCA                     rf     0.897314
20% forward selection   svm    0.897204
20% model selection     lr     0.896306
20% forward selection   lr     0.895637
PCA                     knn    0.894466
permutation importance  svm    0.891090
100%                    knn    0.889697
mutual information      knn    0.888912
Sparse PCA              svm    0.887793
chi2                    lr     0.885687
permutation importance  knn    0.884837
100%                    lr     0.884014
chi2                    rf     0.883221
PCA                     lr     0.881800
mutual information      svm    0.880543
chi2                    knn    0.873721
                        svm    0.872716


In [None]:
final_result_t_fb_median = final_result_t_fb.median(axis=0, skipna= True)
final_result_t_fb_median.sort_values(ascending=False)

20% model selection     lr     0.926829
100%                    svm    0.921986
mutual information      rf     0.916548
100%                    rf     0.914894
mutual information      svm    0.913769
20% forward selection   knn    0.911111
100%                    lr     0.911111
20% model selection     rf     0.911111
20% forward selection   rf     0.907801
mutual information      knn    0.905120
permutation importance  lr     0.902439
Sparse PCA              rf     0.895745
PCA                     svm    0.894482
20% forward selection   svm    0.893617
permutation importance  svm    0.893617
chi2                    lr     0.888889
permutation importance  rf     0.886525
100%                    knn    0.878049
20% forward selection   lr     0.878049
mutual information      lr     0.877069
PCA                     knn    0.876751
Sparse PCA              svm    0.876596
PCA                     rf     0.875195
chi2                    knn    0.866667
                        rf     0.866667


For the fake bad datasets, keeping all the features during training is the most stable method, especially with the SVM algorithm (mean=91.7%, median=92.2%); this is the same method that has the highest mean accuracy across all datasets. From the median perspective, the highest value belongs to 20% model selection paired with logistic regression (92.7%). Sparse PCA paired with random forest also performs well: its mean accuracy (90.2%) is the sixth highest of the 32 methods, and its median is 89.6%.

# Conclusion

We have conducted several experiments to propose the most stable method to determine respondent honesty in a questionnaire. The stability of a method is an important criterion for judging its applicability: the higher its mean and median accuracy across all datasets, the more stable the method is.
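The stability criterion described above can be illustrated with a small sketch. The accuracy values and dataset names below are hypothetical toy numbers, not results from this project; only the mean/median ranking logic mirrors what the notebook does.

```python
import pandas as pd

# Hypothetical accuracies (toy values, NOT project results):
# one row per dataset, one column per (feature selection, model) pair.
accuracy = pd.DataFrame(
    {
        ("100%", "svm"): [0.90, 0.88, 0.92],
        ("chi2", "rf"): [0.95, 0.70, 0.91],
        ("PCA", "lr"): [0.80, 0.82, 0.78],
    },
    index=["dataset_a", "dataset_b", "dataset_c"],
)

# Summarise every method by its mean and median accuracy across datasets.
summary = pd.DataFrame(
    {"mean": accuracy.mean(axis=0), "median": accuracy.median(axis=0)}
)

# A method counts as more stable the higher both statistics are.
summary = summary.sort_values(["mean", "median"], ascending=False)
print(summary)
```

In this toy example, the method with one very high and one very low score ends up with a high median but a lower mean, which is why both statistics are inspected together.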

After examining accuracy on the fake good and fake bad datasets separately, as well as across all datasets together, using all the features paired with the SVM algorithm showed the most stable accuracy among the 32 proposed methods. Hence, our team considers that the procedure below can be applied to any dataset regardless of the class:

1. Preprocess dataset
2. Train the SVM algorithm with 100% of the features
3. Use the trained algorithm to make the prediction on an unseen dataset

The above procedure should give a decent result. In our experiments, its overall mean accuracy is 86.6% and its median is 90%.
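The three steps above can be sketched as follows. This is a minimal illustration, not the exact code used in the experiments: the synthetic data, the standard-scaling step, the default `SVC` settings, and the 75/25 split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a questionnaire dataset (assumption):
# rows are respondents, columns are item responses, y is 1 for dishonest.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)

# Hold out part of the data to play the role of the unseen dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 1 and 2: preprocess (here only scaling, an assumption) and
# train the SVM on 100% of the features.
svm_all_features = make_pipeline(StandardScaler(), SVC())
svm_all_features.fit(X_train, y_train)

# Step 3: predict on the unseen data and measure accuracy.
predictions = svm_all_features.predict(X_test)
accuracy = svm_all_features.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```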

However, we acknowledge that feature selection is sometimes needed to reduce the processing time. Some datasets are too big to be processed if we keep all the available features. To accommodate the need for feature selection, our team considers that the procedure below can be applied to any dataset regardless of the class:

1. Preprocess dataset (import, handle missing values, etc.)
2. Select the important features that can explain most of the data variance using Sparse PCA
3. Train the Random Forest algorithm with the dataset containing only important features from step 2
4. Use the trained algorithm to predict on an unseen dataset
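The four steps above could look roughly like the sketch below. The synthetic data, the scaling step, `n_components=5`, and the forest size are assumptions chosen for illustration. Note that scikit-learn's `SparsePCA` builds a few components from sparse combinations of the original features rather than returning a subset of columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import SparsePCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a questionnaire dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

sparse_pca_rf = make_pipeline(
    StandardScaler(),                            # step 1: preprocessing (assumed)
    SparsePCA(n_components=5, random_state=0),   # step 2: reduce to 5 sparse components
    RandomForestClassifier(n_estimators=200, random_state=0),  # step 3
)
sparse_pca_rf.fit(X_train, y_train)

# Step 4: predict on the unseen data and measure accuracy.
accuracy = sparse_pca_rf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Wrapping the reduction step and the classifier in one pipeline ensures that the Sparse PCA components are learned from the training data only and then reused unchanged on the unseen data.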

Across all datasets, 20% model selection paired with random forest gives, on average, the same result as the proposed procedure, but this selection method cannot be paired with all the models.

We have also compared method accuracy separately on the fake good and fake bad datasets. In general, the above procedure remains valid for both types. For the fake good datasets, however, accuracy can be improved by using the Chi-square method for feature selection instead of Sparse PCA.
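A minimal sketch of the Chi-square variant is given below. The toy Likert-style data, the choice to keep 4 of 20 items, and the forest size are assumptions made for the example. The chi-square score in scikit-learn requires non-negative features, which holds for raw questionnaire scores such as 1 to 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy Likert-style responses (assumption): 300 respondents, 20 items scored 1-5.
rng = np.random.default_rng(0)
n_respondents, n_items = 300, 20
y = rng.integers(0, 2, size=n_respondents)            # 1 = dishonest
X = rng.integers(1, 6, size=(n_respondents, n_items))
# Dishonest respondents answer one point higher on the first four items (capped at 5).
X[:, :4] = np.clip(X[:, :4] + y[:, None], 1, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

chi2_rf = make_pipeline(
    SelectKBest(chi2, k=4),  # keep the 4 items with the highest chi-square score
    RandomForestClassifier(n_estimators=200, random_state=0),
)
chi2_rf.fit(X_train, y_train)

selected_items = np.flatnonzero(chi2_rf.named_steps["selectkbest"].get_support())
accuracy = chi2_rf.score(X_test, y_test)
print("Selected item indices:", selected_items)
print(f"Held-out accuracy: {accuracy:.3f}")
```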

To close this project, we highlight that finding a method that works reasonably well on all datasets is difficult. However, these findings can serve as a good baseline for analysts starting their analysis. Considering all the existing methods can be overwhelming given the vast number of options available. Our proposed procedure can therefore be taken as a starting point for further research.