# Model Statistical Analysis and Comparison

To conclude our project, this section describes how we compare the performance of our models using a Wilcoxon signed-rank test. Our objective is to determine whether the models differ significantly in their performance on a given metric.

We start by importing relevant libraries and our dataset.

In [None]:
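# Project helpers for cross-validation and metric summaries.
from kfold_and_metrics import k_fold_cv, k_fold_cv_keras, mean_std_results_k_fold_CV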

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
import tensorflow as tf
tf.keras.backend.clear_session()

import pandas as pd
import scipy.stats as ss

In [None]:
df = pd.read_csv("final.csv")
df = df.drop(columns=["id"])
print(df.shape)
df.head()

We use the following hyperparameters, taken from the hyperparameter tuning process.

In [None]:
rf_params = {
    'n_estimators': 25,
    'max_depth': 25,
    'min_samples_leaf': 5,
    'criterion': "log_loss"
}

xgb_params = {
    'n_estimators': 25,
    'max_depth': 5,
    'min_child_weight': 15
}

nn_params = {
    'hidden_layer_nodes': 60,
    'hidden_layer_activation': "relu",
    'learning_rate': 0.01
}

With these, we instantiate the Random Forest and XGBoost classifiers, and build and compile the neural network.

In [None]:
rf = RandomForestClassifier(**rf_params)
xgb = XGBClassifier(**xgb_params)

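# One hidden layer on top of the 100 PCA components used in cross-validation below.
nn = tf.keras.models.Sequential([
    tf.keras.layers.Input((100,), name="input"),
    tf.keras.layers.Dense(nn_params['hidden_layer_nodes'], activation=nn_params['hidden_layer_activation']),
    # Two output units with softmax, one per class.
    tf.keras.layers.Dense(2, activation='softmax')
])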

nn.compile(
    optimizer=tf.keras.optimizers.SGD(nn_params['learning_rate']), 
    loss=tf.keras.losses.SparseCategoricalCrossentropy(), 
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

### Confusion Matrices

The confusion matrices give us a first idea of each model's performance.

#### Random Forest

In [None]:
rf_scores = k_fold_cv(model=rf, df=df, pca_components=100, show_confusion_matrix=True)

#### XGBoost

In [None]:
xgb_scores = k_fold_cv(xgb, df, pca_components=100, show_confusion_matrix=True)

#### Neural Network

In [None]:
nn_scores = k_fold_cv_keras(compiled_model=nn, df=df, pca_components=100, show_confusion_matrix=True)

### Main Metrics

We also take a first look at the mean and standard deviation of the F1, accuracy and AUC scores across folds.

#### Random Forest

In [None]:
rf_metrics = mean_std_results_k_fold_CV(rf_scores)
rf_metrics

#### XGBoost

In [None]:
xgb_metrics = mean_std_results_k_fold_CV(xgb_scores)
xgb_metrics

#### Neural Network

In [None]:
nn_metrics = mean_std_results_k_fold_CV(nn_scores)
nn_metrics
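`mean_std_results_k_fold_CV` comes from our `kfold_and_metrics` module and is not reproduced in this notebook. As a rough stand-in only, the sketch below shows one way such a summary could be computed. The function name `summarize_fold_scores`, the input layout (metric name mapped to a list of per-fold values) and the numbers are assumptions made for illustration, and the real helper may differ, for example in whether it uses the sample or the population standard deviation.

```python
from statistics import mean, stdev


def summarize_fold_scores(fold_scores):
    """Map each metric name to (mean, sample standard deviation) over folds.

    Hypothetical helper: assumes fold_scores is {metric name: [per-fold values]}.
    """
    return {
        metric: (mean(values), stdev(values))
        for metric, values in fold_scores.items()
    }


# Made-up per-fold values, not results from our models.
toy_scores = {
    "accuracy_score": [0.80, 0.82, 0.84],
    "f1_score": [0.70, 0.75, 0.80],
}
summary = summarize_fold_scores(toy_scores)
print(summary)
```

Reporting the spread next to the mean matters here because the statistical comparison later in this section works on the per-fold values rather than on the means alone.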

### Performance Comparison

We use the Wilcoxon signed-rank test to compare these metrics between pairs of models. It is a non-parametric test that assesses, at a fixed significance level, whether two paired samples differ; here, the paired samples are the per-fold values of a metric for two models. Under the null hypothesis, the per-fold differences are distributed symmetrically around zero. The test suits our use case because it makes no normality assumption about the scores.

By performing this test on our models' performance metrics, we can make data-driven decisions about which models perform significantly better or worse on each metric. This analysis helps us choose the most suitable model for our problem and understand the statistical differences in performance.

#### Why a Wilcoxon test?

We chose this test considering the following:

- **Non-parametric Test**: The Wilcoxon test does not assume that the scores are normally distributed, which is advantageous because we cannot take normality of the per-fold scores for granted.

- **Paired Data**: Every model is evaluated with k-fold cross-validation on the same dataset, so the per-fold scores of two models form natural pairs. A paired test accounts for fold-to-fold variation; the pairing is exact only when each model is evaluated on the same folds.

- **Statistical Significance**: By calculating p-values with the Wilcoxon test, we can determine whether the observed differences are statistically significant at a fixed significance level, which is crucial for meaningful model comparison.
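As a minimal sketch of the test itself, the snippet below applies `scipy.stats.wilcoxon` to two sets of per-fold scores. The values and the names `model_a_scores` and `model_b_scores` are made up for illustration and are not results from our models.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical per-fold accuracies for two models evaluated on the same folds.
model_a_scores = np.array([0.81, 0.84, 0.86, 0.88, 0.90, 0.92])
model_b_scores = np.array([0.80, 0.82, 0.83, 0.84, 0.85, 0.86])

# Paired, two-sided Wilcoxon signed-rank test on the per-fold differences.
result = ss.wilcoxon(model_a_scores, model_b_scores)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```

A small p-value indicates that differences like these would be unlikely if the two models performed equally well; it says nothing about how large the difference is.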

#### Implementation

To achieve our goal, we defined a function that compares the models with a Wilcoxon signed-rank test and organizes the results into a set of dataframes for easy interpretation. The function proceeds as follows:

1. Create one dataframe per metric, in which each column holds the values for a model and each row holds the value of that metric in a specific fold.

2. For each dataframe created in step 1, use the Wilcoxon test to check whether the per-fold values of two models differ on that metric.

3. Repeat the test for every combination of two models and store the results as rows of the form (model 1, model 2, p-value) in a dataframe.

4. Collect the p-value dataframes in a dictionary, with each metric name as the key and the associated dataframe as the value.

In [None]:
def models_performance_comparison(scores_dict):
    """Return {metric name: DataFrame of pairwise Wilcoxon p-values}.

    scores_dict maps model name -> {metric name -> list of per-fold values}.
    """
    # Step 1: one DataFrame per metric (rows = folds, columns = models).
    metrics_dfs_of_fold_per_model = {}
    for model_name, model_metrics_folds_results in scores_dict.items():
        for metric_name, metric_folds_results in model_metrics_folds_results.items():
            if metric_name not in metrics_dfs_of_fold_per_model:
                metrics_dfs_of_fold_per_model[metric_name] = pd.DataFrame(
                    index=[f"fold{i}" for i in range(1, len(metric_folds_results) + 1)],
                    data=metric_folds_results,
                    columns=[model_name]
                )
            else:
                metrics_dfs_of_fold_per_model[metric_name][model_name] = metric_folds_results

    # Steps 2-4: test every pair of models on every metric and collect the p-values.
    metrics_dfs_of_pvalues = {}
    for metric_name, folds_models_values in metrics_dfs_of_fold_per_model.items():
        pvalues_df_data = []
        for i in range(len(folds_models_values.columns)):
            for j in range(i + 1, len(folds_models_values.columns)):
                model1 = folds_models_values.columns[i]
                model2 = folds_models_values.columns[j]
                # Paired, two-sided test on the per-fold values of the two models.
                pvalue = ss.wilcoxon(folds_models_values[model1].to_numpy(), folds_models_values[model2].to_numpy()).pvalue
                pvalues_df_data.append({"model1": model1, "model2": model2, "pvalue": pvalue})
        metrics_dfs_of_pvalues[metric_name] = pd.DataFrame(data=pvalues_df_data, columns=["model1", "model2", "pvalue"])

    return metrics_dfs_of_pvalues
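The function expects each model's scores as a dictionary that maps a metric name to a list of per-fold values; this layout is inferred from the loops above, since the cross-validation helpers are not shown in this notebook. The sketch below builds the intermediate per-metric table for a single metric on made-up numbers, to make that layout concrete. The name `toy_scores` and its values are illustrative only.

```python
import pandas as pd

# Made-up scores in the layout the function expects:
# model name -> metric name -> list of per-fold values.
toy_scores = {
    "RF": {"accuracy_score": [0.80, 0.82, 0.81]},
    "XGB": {"accuracy_score": [0.83, 0.85, 0.84]},
}

# Equivalent of step 1 for one metric: rows are folds, columns are models.
metric = "accuracy_score"
fold_table = pd.DataFrame(
    {model: scores[metric] for model, scores in toy_scores.items()},
    index=[f"fold{i}" for i in range(1, 4)],
)
print(fold_table)
```

Each pair of columns in such a table is what gets passed to the Wilcoxon test, so every model must provide the same number of folds for every metric.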

We pass the scores obtained previously to our function.

In [None]:
scores_dict = {
    'RF': rf_scores,
    'XGB': xgb_scores,
    'NN': nn_scores
}
# Despite its name, this dictionary maps each metric to a DataFrame of pairwise p-values.
metrics_dfs_of_fold_per_model = models_performance_comparison(scores_dict)

### Results Interpretation

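For the purposes of this project, we use a significance level of 0.05 (corresponding to a 95% confidence level). We consider the difference between two models significant on a metric when the associated p-value is below 0.05.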
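One point worth checking before reading the tables against this level is how small a p-value the test can produce at all. For an exact two-sided signed-rank test on k pairs, with no zero differences and no tied absolute differences, the smallest attainable p-value is 2 / 2^k. The sketch below computes this floor for a few fold counts; the function name is hypothetical, and the number of folds used by our cross-validation helpers is not shown in this notebook.

```python
def smallest_two_sided_pvalue(n_folds):
    """Smallest exact two-sided signed-rank p-value for n_folds pairs.

    Assumes no zero differences and no tied absolute differences.
    """
    # Under the null hypothesis, each of the 2**n_folds sign patterns is
    # equally likely; the most extreme outcome (all differences share one
    # sign) counts once in each tail.
    return 2 / 2 ** n_folds


for k in (5, 6, 10):
    print(k, smallest_two_sided_pvalue(k))
```

With 5 folds the floor is 0.0625, so no pair of models could fall below 0.05 under these assumptions, whereas 6 or more folds make a significant result possible.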

#### Accuracy

In [None]:
metrics_dfs_of_fold_per_model['accuracy_score']
# Write conclusions from the results in the table

#### F1 Score

In [None]:
metrics_dfs_of_fold_per_model['f1_score']
# Write conclusions from the results in the table

#### AUC Score

In [None]:
metrics_dfs_of_fold_per_model['roc_auc_score']
# Write conclusions from the results in the table