# Regression Stats Tests

**Friedman Test and Nemenyi Test**

The Friedman test is a non-parametric, rank-based test that suits R^2 and RMSE/MSE, so it is the leading candidate. Its H0 is "All models perform equally".
Nemenyi post-hoc: run after a significant Friedman result to identify which pairs of models differ (rather than the set as a whole). Its results can be drawn as a critical difference diagram (useful for presentation).
Alternative: repeated-measures ANOVA, but it is rarely robust on noisy cheminformatics data.
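
As a rough illustration of what the two tests compute, the sketch below ranks the models within each fold, derives the Friedman statistic from the mean ranks, and computes the Nemenyi critical difference. The scores are random placeholders, `Q_ALPHA_05` is a hypothetical name for a lookup of the α = 0.05 critical values tabulated in the 2006 JMLR paper listed under References, and the no-ties form of the statistic is assumed.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Placeholder data (assumption): 10 folds x 6 models, higher score = better
rng = np.random.default_rng(0)
demo_scores = rng.random((10, 6))
n_folds, n_models = demo_scores.shape

# Rank models within each fold; rank 1 goes to the highest score
fold_ranks = rankdata(-demo_scores, axis=1)
mean_ranks = fold_ranks.mean(axis=0)

# Friedman chi-squared statistic from the mean ranks (valid when there are no ties)
stat_manual = (12 * n_folds / (n_models * (n_models + 1))) * (
    np.sum(mean_ranks ** 2) - n_models * (n_models + 1) ** 2 / 4
)
stat_scipy, p_value = friedmanchisquare(*demo_scores.T)

# Nemenyi critical values q_alpha at alpha = 0.05, keyed by number of models
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}
critical_difference = Q_ALPHA_05[n_models] * np.sqrt(
    n_models * (n_models + 1) / (6 * n_folds)
)

print(f"manual = {stat_manual:.4f}, scipy = {stat_scipy:.4f}, p = {p_value:.4f}")
print(f"mean ranks: {np.round(mean_ranks, 2)}")
print(f"critical difference: {critical_difference:.3f}")
```

Two models whose mean ranks differ by more than the critical difference would be reported as significantly different, which is what a critical difference diagram visualises.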

# Classification Stats Tests

**Friedman Test and Nemenyi Test**

Same as above; works well for ROC AUC and F1 scores.

**Cochran's Q Test**

Designed for binary outcomes, so it should be applicable (all our class labels are 1/0). An extension of McNemar's test to more than two models.
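
A minimal sketch of the statistic, assuming the input is a matrix recording whether each model classified each sample correctly (1) or not (0); that framing is an assumption about how the test would be applied here. `cochrans_q` is a hypothetical helper written for illustration, and the small matrix is made-up data.

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(correct):
    """Cochran's Q for an (n_samples, n_models) matrix of 0/1 outcomes."""
    correct = np.asarray(correct)
    k = correct.shape[1]                 # number of models
    col_totals = correct.sum(axis=0)     # successes per model
    row_totals = correct.sum(axis=1)     # successes per sample
    total = correct.sum()
    # Undefined (division by zero) if every sample is all-0 or all-1 across models
    q = (k - 1) * (k * np.sum(col_totals ** 2) - total ** 2) / (
        k * total - np.sum(row_totals ** 2)
    )
    p = chi2.sf(q, df=k - 1)             # Q is approximately chi-squared with k - 1 df
    return float(q), float(p)

# Made-up example: 6 samples (rows) x 3 models (columns)
demo_correct = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
q_stat, q_p = cochrans_q(demo_correct)
print(f"Cochran's Q = {q_stat:.4f}, p-value = {q_p:.4f}")
```

With two models this reduces to McNemar's test without continuity correction, which is why Cochran's Q is described as its extension.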

**Permutation Testing**

Pairwise comparisons; useful for small sample sizes (not our case) or unknown score distributions (very much our case).
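
One common form for comparing two models on the same folds is a paired sign-flip permutation test, sketched below. `paired_permutation_test` is a hypothetical helper, the scores are made up, and exact enumeration of all sign patterns is assumed to be affordable (2^n patterns for n folds, so 1024 for 10-fold CV).

```python
import itertools
import numpy as np

def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on the mean of per-fold differences."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    hits = 0
    total = 0
    # Under H0 the sign of each per-fold difference is arbitrary, so try every pattern
    for signs in itertools.product([1.0, -1.0], repeat=len(diffs)):
        total += 1
        if abs(np.mean(diffs * np.array(signs))) >= observed - 1e-12:
            hits += 1
    return float(diffs.mean()), hits / total

# Made-up per-fold scores for two models evaluated on the same 4 folds
model_a = [0.60, 0.70, 0.80, 0.65]
model_b = [0.50, 0.55, 0.60, 0.50]
mean_diff, perm_p = paired_permutation_test(model_a, model_b)
print(f"mean difference = {mean_diff:.3f}, p-value = {perm_p:.3f}")
```

No distribution is assumed for the scores, which is the property that makes the approach attractive here. With only 4 folds the smallest achievable two-sided p-value is 2/16, so more folds are needed before a 0.05 threshold can be met.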

### ***References***

- J. Demšar, "Statistical Comparisons of Classifiers over Multiple Data Sets", J. Mach. Learn. Res., 2006
- I. H. Witten, E. Frank, M. A. Hall, "Data Mining: Practical Machine Learning Tools and Techniques", 2011
- J. Chem. Inf., generally

# Analysis
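*Friedman Test Statistic* - Chi-squared statistic computed from the per-fold ranks of the models; H0 is rejected when it exceeds the critical value for the chosen significance level

*P-Value* - Probability of seeing a statistic at least this large if H0 were true; H0 is rejected when it falls below the significance level (0.05 here)

*Output Matrix* - If H0 is rejected, the Nemenyi matrix of pairwise p-values, one entry per pair of models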
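
The relationship between the statistic, the critical value, and the p-value can be sketched as follows, assuming the chi-squared approximation with k - 1 degrees of freedom for k models. The statistic plugged in is the one this notebook reports for the JAK1 R² run.

```python
from scipy.stats import chi2

n_models = 6               # k models -> k - 1 degrees of freedom
friedman_stat = 16.0857    # statistic reported for the JAK1 R² run in this notebook
alpha = 0.05

# Critical value: the statistic must exceed this for H0 to be rejected at alpha
critical_value = chi2.ppf(1 - alpha, df=n_models - 1)

# P-value: probability of a statistic at least this large under H0
p_value = chi2.sf(friedman_stat, df=n_models - 1)

reject_h0 = friedman_stat > critical_value
print(f"critical value = {critical_value:.4f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are the same decision, so checking the p-value alone is enough.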

These tests rely on the validation-set scores of the tuned models. Each comparison should be run on (at least) two metrics, so that a significant difference between **models** can be found, not just between **model outputs**. That is to say, a model can have an artificially high R^2 score that is statistically better than the other models' without the model itself being statistically better. If multiple key metrics are statistically better, we can assume the model itself is better. If a significant difference between models is found (P-Value < 0.05, equivalently a Friedman statistic above the chi-squared critical value, about 11.07 for 6 models), the Nemenyi matrix of pairwise p-values will be output.
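
The "at least two metrics" rule could be applied mechanically along these lines. `pairs_significant_on_all` is a hypothetical helper, and the model names and p-values are invented purely to show the shape of the data (one Nemenyi p-value matrix per metric).

```python
import pandas as pd

def pairs_significant_on_all(pvalue_tables, alpha=0.05):
    """Return model pairs whose p-value is below alpha in every metric's table."""
    tables = list(pvalue_tables.values())
    names = list(tables[0].index)
    agreed = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if all(table.loc[first, second] < alpha for table in tables):
                agreed.append((first, second))
    return agreed

# Invented p-value matrices for three placeholder models
names = ["A", "B", "C"]
r2_pvalues = pd.DataFrame(
    [[1.00, 0.01, 0.03], [0.01, 1.00, 0.60], [0.03, 0.60, 1.00]],
    index=names, columns=names,
)
mse_pvalues = pd.DataFrame(
    [[1.00, 0.02, 0.20], [0.02, 1.00, 0.70], [0.20, 0.70, 1.00]],
    index=names, columns=names,
)

agreed_pairs = pairs_significant_on_all({"R2": r2_pvalues, "MSE": mse_pvalues})
print(agreed_pairs)
```

Here A and C differ on R^2 only, so that pair is not reported; only pairs that differ on every metric survive.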

In [None]:
!pip install scikit-posthocs

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

In [5]:
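# Example: 10-fold CV results for 6 models
# Model names, in the same order as the columns of every score matrix;
# run_friedman_nemenyi uses this list to label the Nemenyi output
models = ["LGB", "XGB", "MLP", "RF", "SVM", "CatBoost"]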

# Simulated scores: shape (folds, models) (for now, see below)
r2_scores = np.random.rand(10, 6)
mse_scores = np.random.rand(10, 6)
roc_auc_scores = np.random.rand(10, 6)
f1_scores = np.random.rand(10, 6)

# Will need an import function for data from parameter tuning (validation sets)
# Using tuned models allows us to assess the best performing model at its optimal performance
# Some models (should) improve more drastically from tuning

In [10]:
def run_friedman_nemenyi(metric_matrix, metric_name):
    print(f"\nTesting: {metric_name}")
    
    # Friedman test
    stat, p = friedmanchisquare(*[metric_matrix[:, i] for i in range(metric_matrix.shape[1])])
    print(f"Friedman test statistic: {stat:.4f}, p-value: {p:.4f}")
    
    if p < 0.05:
        print(f"Significant differences found. \n\nNemenyi post-hoc test for {metric_name}:")
        # `models` is a module-level list of column names; it must be defined before this branch runs
        df = pd.DataFrame(metric_matrix, columns=models)
        nemenyi = sp.posthoc_nemenyi_friedman(df)
        print(nemenyi)
    else:
        print("No significant differences found.")

In [7]:
run_friedman_nemenyi(r2_scores, "R² (Regression)")
run_friedman_nemenyi(mse_scores, "MSE (Regression)")
run_friedman_nemenyi(roc_auc_scores, "ROC AUC (Classification)")
run_friedman_nemenyi(f1_scores, "F1-score (Classification)")


Testing: R² (Regression)
Friedman test statistic: 1.8857, p-value: 0.8647
No significant differences found.

Testing: MSE (Regression)
Friedman test statistic: 4.8000, p-value: 0.4408
No significant differences found.

Testing: ROC AUC (Classification)
Friedman test statistic: 5.8857, p-value: 0.3175
No significant differences found.

Testing: F1-score (Classification)
Friedman test statistic: 0.5714, p-value: 0.9893
No significant differences found.


In [11]:
# Import function for the tuning results (standardised CSV layout)
models = ["LGB", "XGB", "MLP", "RF", "SVM", "CatBoost"]
def import_func(path):
    scores = pd.read_csv(path)
    scores = scores.to_numpy(copy=True)
    scores = scores[:, 1:]  # drop the first (label) column
    scores = scores.T       # transpose so rows are iterations and columns are models
    return scores
# After the transpose, each column is a model and each row is an iteration
r2_scores = import_func("JAK1ParamTuning.csv")
run_friedman_nemenyi(r2_scores, "R² (Regression)")


Testing: R² (Regression)
Friedman test statistic: 16.0857, p-value: 0.0066
Significant differences found. 

Nemenyi post-hoc test for R² (Regression):
               LGB       XGB       MLP        RF       SVM  CatBoost
LGB       1.000000  0.845079  0.016639  0.999420  1.000000  0.427525
XGB       0.845079  1.000000  0.326040  0.958997  0.845079  0.984591
MLP       0.016639  0.326040  1.000000  0.046736  0.016639  0.755551
RF        0.999420  0.958997  0.046736  1.000000  0.999420  0.650490
SVM       1.000000  0.845079  0.016639  0.999420  1.000000  0.427525
CatBoost  0.427525  0.984591  0.755551  0.650490  0.427525  1.000000
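
Judging from the slicing and transpose in `import_func`, the CSV seems to hold one row per model with a label in the first column and one column per iteration. That layout is inferred from the code rather than from `JAK1ParamTuning.csv` itself, and the header names and values below are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: one row per model, one column per tuning iteration
csv_text = """model,iter_1,iter_2,iter_3
LGB,0.58,0.57,0.59
XGB,0.54,0.55,0.53
"""

raw = pd.read_csv(io.StringIO(csv_text))
# Same steps as import_func, plus an explicit cast to float
demo = raw.to_numpy(copy=True)[:, 1:].T.astype(float)
print(demo.shape)  # (3, 2): 3 iterations x 2 models
print(demo)
```

The explicit `astype(float)` is an addition: a text label column makes `to_numpy` return an object array, and the cast makes the numeric type explicit before the scores reach the statistical tests.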


In [12]:
# extracting ranks (WIP)
r2 = r2_scores.T

LGB = [r2[0]]
XGB = [r2[1]] 
MLP = [r2[2]] 
RF = [r2[3]]
SVM = [r2[4]]
CatBoost = [r2[5]]

rankings = {
    "LGB": [],
    "XGB": [], 
    "MLP": [], 
    "RF": [],
    "SVM": [],
    "CatBoost": []
}
del r2
print(rankings)

{'LGB': [], 'XGB': [], 'MLP': [], 'RF': [], 'SVM': [], 'CatBoost': []}
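
One way the rank extraction could be completed is sketched below, assuming a score matrix shaped (iterations, models) where higher is better. `demo_r2` is random placeholder data standing in for `r2_scores`, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

model_names = ["LGB", "XGB", "MLP", "RF", "SVM", "CatBoost"]

# Placeholder for r2_scores (assumption): 10 iterations x 6 models
rng = np.random.default_rng(1)
demo_r2 = rng.random((10, len(model_names)))

# Rank the models within each iteration; rank 1 is the highest R²
per_iteration_ranks = rankdata(-demo_r2, axis=1)

# One list of ranks per model, plus the mean rank used by Friedman/Nemenyi
demo_rankings = {
    name: per_iteration_ranks[:, j].tolist() for j, name in enumerate(model_names)
}
mean_rank = {name: float(np.mean(ranks)) for name, ranks in demo_rankings.items()}
print(mean_rank)
```

For an error metric such as MSE, where lower is better, the negation in `rankdata(-demo_r2, ...)` would be dropped.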


In [14]:
#The average (best) r2 value for each model:

avg_r2 = np.mean(r2_scores, axis=0)
print(avg_r2)

[0.57499046 0.54247067 0.07955306 0.55913691 0.56740179 0.543811  ]


# Review


An ensemble model will be more robust anyway, so there is no use in statistical testing for now.