## Setting up the environment, loading and splitting the data, and analyzing the class distribution

In [2]:
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import matplotlib.pyplot as plt
import xgboost as xgb
import pandas as pd
import numpy as np
import time

# Load the dataset and split
data = pd.read_csv("spambase.data", header=None)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

n_splits = 10
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)


models = {
    "Logistic Regression": LogisticRegression(max_iter=5000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": xgb.XGBClassifier(eval_metric='logloss', random_state=42)
}

class_distribution = y.value_counts(normalize=True) * 100
print("Class Distribution (%):\n", class_distribution)


Class Distribution (%):
 57
0    60.595523
1    39.404477
Name: proportion, dtype: float64


The class distribution is <mark>**60.6:39.4**</mark> (not spam : spam), which indicates a mild imbalance that just meets the threshold for treating the data as imbalanced. Because the task is spam detection, <mark>**Recall**</mark> is a critical metric: minimizing false negatives (missed spam emails) is essential for a reliable filter. Precision matters as well, since every false positive is a legitimate email flagged as spam. The F1 score is the harmonic mean of **Precision** and **Recall** and therefore balances both, so we choose the <mark>**F1 score**</mark> as the primary metric for further comparison and evaluation.
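To make the relationship between the three metrics concrete, here is a minimal sketch that computes precision, recall, and F1 from raw counts. The function name `precision_recall_f1` and the toy labels are illustrative assumptions and not part of the notebook, which relies on `sklearn.metrics.f1_score` for this.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 1 = spam, 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

One missed spam email lowers recall and one wrongly flagged email lowers precision; F1 drops when either of them does.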

## Helper function definitions

The following three helper functions are used to:
- Perform stratified k-fold evaluation
- Compare F1 scores across models
- Compute the average-rank statistics needed for the Friedman statistic


In [6]:

# Function to perform stratified k-fold and log results
def stratified_k_fold_eval(model, model_name):
    # Create a DataFrame to store fold results
    results = []
    
    for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        # Split data
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        
        # Measure training time
        start_time = time.time()
        model.fit(X_train, y_train)
        train_time = time.time() - start_time
        
        # Predictions and metrics
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        
        # Store results
        results.append({
            "Fold": fold_idx,
            "Training Time (s)": train_time,
            "Accuracy": accuracy,
            "F1 Score": f1
        })
    
    # Convert results to a DataFrame
    results_df = pd.DataFrame(results)
    
    # Compute Mean and Std
    mean_row = {
        "Fold": "Mean",
        "Training Time (s)": results_df["Training Time (s)"].mean(),
        "Accuracy": results_df["Accuracy"].mean(),
        "F1 Score": results_df["F1 Score"].mean()
    }
    std_row = {
        "Fold": "Std",
        "Training Time (s)": results_df["Training Time (s)"].std(),
        "Accuracy": results_df["Accuracy"].std(),
        "F1 Score": results_df["F1 Score"].std()
    }
    results_df = pd.concat([results_df, pd.DataFrame([mean_row, std_row])], ignore_index=True)
    
    # Print the results
    print(f"\nResults for {model_name}:\n")
    print(results_df.to_string(index=False))
    
    return results_df

def f1_score_comparison(models):
    # Initialize storage for F1 Score results
    f1_results = {model_name: [] for model_name in models}
    
    for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        fold_f1_scores = {}
        
        for model_name, model in models.items():
            # Split data
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            
            # Train model
            model.fit(X_train, y_train)
            
            # Predictions and metrics
            y_pred = model.predict(X_test)
            f1 = f1_score(y_test, y_pred)
            
            # Store F1 Score for this fold
            fold_f1_scores[model_name] = f1
            f1_results[model_name].append(f1)
        
        # Rank F1 scores for the current fold
        sorted_f1 = sorted(fold_f1_scores.items(), key=lambda x: x[1], reverse=True)
        rankings = {name: rank + 1 for rank, (name, _) in enumerate(sorted_f1)}
        
        # Store F1 Score with rank in the results
        for model_name in models:
            f1_results[model_name][-1] = f"{fold_f1_scores[model_name]:.4f} ({rankings[model_name]})"
    
    # Generate a DataFrame for F1 Score comparison
    table = {"Fold": [f"Fold {i}" for i in range(1, n_splits + 1)]}
    for model_name in models:
        table[model_name] = f1_results[model_name]
    
    # Calculate average rank for each model
    avg_rank_row = ["Avg Rank"]
    for model_name in models:
        # Extract ranks from strings
        ranks = [int(val.split('(')[-1].strip(')')) for val in f1_results[model_name]]
        avg_rank = sum(ranks) / len(ranks)
        avg_rank_row.append(f"{avg_rank:.2f}")
    
    # Add average rank row to the table
    table["Fold"].append(avg_rank_row[0])  # Add label to the Fold column
    for idx, model_name in enumerate(models.keys(), start=1):
        table[model_name].append(avg_rank_row[idx])
    
    # Convert table to DataFrame
    results_df = pd.DataFrame(table)
    
    # Print the table
    print("\nF1 Score Comparison:\n")
    print(results_df.to_string(index=False))
    
    return results_df

def calculate_rank_statistics(results_df):
    # Extract the number of folds (n) and number of models (k)
    n = len(results_df) - 1  # Exclude the last row (average ranks)
    k = len(results_df.columns) - 1  # Exclude the "Fold" column

    # Extract average ranks (last row excluding "Fold")
    avg_ranks = results_df.iloc[-1, 1:].astype(float).values

    # Calculate R_bar (average of the average ranks)
    R_bar = np.mean(avg_ranks)

    # Calculate N * SUM_j (Rj - R_bar)^2
    N = n
    term2 = N * np.sum((avg_ranks - R_bar) ** 2)

    # Calculate (1 / (n * (k - 1))) * SUM_ij (Rij - R_bar)^2
    # Note: DataFrame.map requires pandas >= 2.1; use applymap on older versions
    ranks_matrix = results_df.iloc[:n, 1:].map(lambda x: int(x.split('(')[-1].strip(')'))).values
    term3 = (1 / (n * (k - 1))) * np.sum((ranks_matrix - R_bar) ** 2)

    # Display results
    print("\n")
    print(f"\tR̅ = {R_bar:.4f}")
    print(f"\tN * Σj (Rj - R̅)^2 = {term2:.4f}")
    print(f"\t(1 / (n * (k - 1))) * Σij (Rij - R̅)^2 = {term3:.4f}")

    # Update DataFrame with R_bar on the top bar
    results_df.loc[-1] = ["R_bar"] + [f"{R_bar:.4f}"] * k
    results_df.index = results_df.index + 1
    results_df.sort_index(inplace=True)

    return results_df

In [7]:
# Run evaluation for all models
all_results = {}
for model_name, model in models.items():
    all_results[model_name] = stratified_k_fold_eval(model, model_name)
    
# Run F1 Score comparison
f1_comparison_table = f1_score_comparison(models)

# Calculate the rank statistics used for the Friedman statistic
updated_df = calculate_rank_statistics(f1_comparison_table)



Results for Logistic Regression:

Fold  Training Time (s)  Accuracy  F1 Score
   1           1.476165  0.915401  0.892562
   2           1.459002  0.923913  0.901961
   3           1.156858  0.932609  0.913165
   4           1.335482  0.936957  0.918768
   5           1.211622  0.913043  0.890110
   6           1.087201  0.936957  0.920110
   7           1.698693  0.934783  0.915730
   8           1.284989  0.932609  0.912676
   9           1.353487  0.936957  0.915452
  10           0.930187  0.915217  0.890141
Mean           1.299368  0.927844  0.907067
 Std           0.218571  0.009952  0.012163

Results for Random Forest:

Fold  Training Time (s)  Accuracy  F1 Score
   1           0.811631  0.952278  0.938889
   2           0.752213  0.956522  0.944134
   3           0.879899  0.960870  0.950000
   4           0.858138  0.967391  0.957983
   5           0.844638  0.945652  0.931129
   6           0.838471  0.956522  0.944134
   7           0.871480  0.958696  0.947368
   8        

Since the Friedman statistic (15) is greater than the critical value (6.2), we reject the **Null Hypothesis** that all models perform equally. This suggests a significant difference in performance among the tested models on this data set.
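The value of 15 can be reproduced with a short sketch, assuming the statistic is the ratio of the two terms printed by `calculate_rank_statistics`. The function name `friedman_statistic` and the rank matrix below are hypothetical: the matrix is one arrangement that is consistent with the reported average ranks, not the notebook's actual per-fold output.

```python
def friedman_statistic(ranks):
    """ranks: list of per-fold rank lists, one rank per model."""
    n = len(ranks)       # number of folds
    k = len(ranks[0])    # number of models

    # Average rank of each model across folds
    avg_ranks = [sum(row[j] for row in ranks) / n for j in range(k)]
    r_bar = sum(avg_ranks) / k

    # n * sum_j (R_j - R_bar)^2
    between = n * sum((r - r_bar) ** 2 for r in avg_ranks)
    # (1 / (n * (k - 1))) * sum_ij (R_ij - R_bar)^2
    within = sum((r - r_bar) ** 2 for row in ranks for r in row) / (n * (k - 1))
    return avg_ranks, between / within


# Hypothetical ranks; columns: Logistic Regression, Random Forest, XGBoost
hypothetical_ranks = [[3, 1, 2]] * 5 + [[3, 2, 1]] * 5
avg_ranks, statistic = friedman_statistic(hypothetical_ranks)
print(avg_ranks, statistic)
```

With these ranks the first term is 15 and the second is 1, so the ratio matches the statistic quoted above.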

## Pairwise level analysis 

Critical difference (CD) calculation for the Nemenyi test, with $k = 3$ models and $N = 10$ folds:

$CD = q_\alpha \cdot \sqrt{\frac{k \cdot (k+1)}{6 \cdot N}}$  
$= 2.343 \cdot \sqrt{\frac{3 \cdot 4}{6 \cdot 10}}$  
$\approx 2.343 \cdot 0.447$  
$\approx 1.05$
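The same arithmetic can be checked in a few lines of Python. The helper name `critical_difference` is an illustrative assumption; the values of `q_alpha`, `k`, and `n` are the ones used in the calculation above.

```python
import math


def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference for k models compared over n folds."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


cd = critical_difference(q_alpha=2.343, k=3, n=10)
print(f"CD = {cd:.2f}")
```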


### Final Analysis
For $R_j = [3.0, 1.5, 1.5]$ (Logistic Regression, Random Forest, XGBoost) and $CD = 1.05$, the pairwise comparison is:

| Comparison                           | Rank Difference | Significant |
|--------------------------------------|-----------------|-------------|
| Logistic Regression vs Random Forest | 1.50            | True        |
| Random Forest vs XGBoost             | 0.00            | False       |
| Logistic Regression vs XGBoost       | 1.50            | True        |

### Explanation
1. **Logistic Regression vs Random Forest:**  
   $|3.0 - 1.5| = 1.5$  
   Since $1.5 > 1.05$, this is **significant**.

2. **Random Forest vs XGBoost:**  
   $|1.5 - 1.5| = 0.0$  
   Since $0.0 \leq 1.05$, this is **not significant**.

3. **Logistic Regression vs XGBoost:**  
   $|3.0 - 1.5| = 1.5$  
   Since $1.5 > 1.05$, this is **significant**.

The table above summarizes the Nemenyi test results.
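The pairwise comparisons can also be generated programmatically. This is a minimal sketch: the function name `nemenyi_pairs` is hypothetical, and the average ranks and critical difference are hard-coded from the analysis in this section rather than read from the notebook's DataFrame.

```python
from itertools import combinations


def nemenyi_pairs(avg_ranks, cd):
    """Return (comparison, rank difference, significant) for every model pair."""
    rows = []
    for (name_a, rank_a), (name_b, rank_b) in combinations(avg_ranks.items(), 2):
        diff = abs(rank_a - rank_b)
        # A pair differs significantly when the rank gap exceeds the CD
        rows.append((f"{name_a} vs {name_b}", diff, diff > cd))
    return rows


avg_ranks = {"Logistic Regression": 3.0, "Random Forest": 1.5, "XGBoost": 1.5}
rows = nemenyi_pairs(avg_ranks, cd=1.05)
for comparison, diff, significant in rows:
    print(f"{comparison:<40} {diff:.2f}  {significant}")
```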
