# Coursework 3: Classification Analysis
This notebook performs classification analysis on the `Training_Dataset_xxx.xlsx` and `Testing_Dataset_xxx.xlsx` datasets to predict students' `Programme`. Three classifiers (Naive Bayes, Decision Tree, KNN) are applied to distinct feature sets derived from `Gender`, `Grade`, `Q1-Q5`, followed by Voting classifiers with all features. The goal is to train models on the training set, predict labels for the testing set, and evaluate performance using the new `evaluate_classification` function.

## Objectives
- Apply classifiers with hyperparameter tuning on specific feature sets using the training set.
- Predict `Programme` labels for the testing set.
- Evaluate predictions using the new `evaluate_classification` function with `check.bin`.
- Use Accuracy and F1 Score (macro) with 5-fold cross-validation on the training set for model selection.
- Save predictions and recommend the best model.

## Structure
1. Import libraries.
2. Load training and testing data and create output directory.
3. Preprocess data (create three feature sets for both datasets).
4. Define classifiers and their corresponding feature sets.
5. Train classifiers on the training set and predict on the testing set.
6. Apply Voting classifiers on Feature Set 3 and predict on the testing set.
7. Display and save results, recommend the best model.

## Import Libraries
This section imports necessary libraries for data processing, classification, evaluation, and visualization.

**Input**: None.

**Output**: Imported libraries.

**Purpose**: Set up the environment for classification tasks.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
import os

## Load Data and Create Output Directory

This section loads the training and testing datasets from `Training_Dataset_xxx.xlsx` and `Testing_Dataset_xxx.xlsx`, respectively, and creates an `output` directory for saving results. The training set contains features (`Gender`, `Grade`, `Q1-Q5`) and the target (`Programme`). The testing set contains the same features but no `Programme` column.

**Input**: Training_Dataset_xxx.xlsx, Testing_Dataset_xxx.xlsx.

**Output**: Training and testing DataFrames, selected features (`train_features`, `test_features`), target (`label`), and `output` directory.

**Purpose**: Prepare the datasets and file structure for classification.

In [2]:
# Load Data and Create Output Directory
# Create output directory
if not os.path.exists('output'):
    os.makedirs('output')

# Load training and testing data
train_df = pd.read_excel('Training_Dataset_xxx.xlsx')
test_df = pd.read_excel('test.xlsx')
print("Training columns:", list(train_df.columns))
print("Testing columns:", list(test_df.columns))

# Select features and target
features_selected = ['Gender', 'Grade', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5']
train_features = train_df[features_selected]
test_features = test_df[features_selected]
label = train_df['Programme']

Training columns: ['Programme', 'Gender', 'Grade', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5']
Testing columns: ['Programme', 'Gender', 'Grade', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5']


## Preprocess Data
This section creates three feature sets for both training and testing datasets:
- **Set 1**: `Gender`, `Q2`, `Q4`, scaled with MinMaxScaler.
- **Set 2**: `Gender`, `Grade`, `Q1-Q5` (no scaling).
- **Set 3**: `Gender`, `Grade`, `Q1-Q5`, scaled with StandardScaler.

Scalers are fit on the training set and applied to the testing set to avoid data leakage.

**Input**: Training and testing feature DataFrames (`train_features`, `test_features`).

**Output**: Dictionaries `train_feature_sets` and `test_feature_sets` with three processed arrays each.

**Purpose**: Prepare feature sets for Naive Bayes (Set 1), Decision Tree (Set 2), and KNN (Set 3).

In [3]:
# Preprocess Data
# Initialize scalers
scaler_minmax = MinMaxScaler()
scaler_standard = StandardScaler()

# Feature Set 1: Gender, Q2, Q4 (MinMaxScaler)
train_set1 = train_features[['Gender', 'Q2', 'Q4']].copy()
test_set1 = test_features[['Gender', 'Q2', 'Q4']].copy()

train_set1 = train_features.copy()
test_set1 = test_features.copy()

train_set1_scaled = scaler_minmax.fit_transform(train_set1)
test_set1_scaled = scaler_minmax.transform(test_set1)

print("Train Feature Set 1 shape:", train_set1_scaled.shape)
print("Test Feature Set 1 shape:", test_set1_scaled.shape)

# Feature Set 2: All features (no scaling)
train_set2 = train_features.copy()
test_set2 = test_features.copy()
print("Train Feature Set 2 shape:", train_set2.shape)
print("Test Feature Set 2 shape:", test_set2.shape)

# Feature Set 3: All features (StandardScaler)
train_set3 = train_features.copy()
test_set3 = test_features.copy()
train_set3_scaled = scaler_standard.fit_transform(train_set3)
test_set3_scaled = scaler_standard.transform(test_set3)
print("Train Feature Set 3 shape:", train_set3_scaled.shape)
print("Test Feature Set 3 shape:", test_set3_scaled.shape)


train_set2 = train_set3_scaled.copy()
test_set2 = test_set3_scaled.copy()

# Store feature sets in dictionaries
train_feature_sets = {
    'Set 1': train_set1_scaled,
    'Set 2': train_set2,
    'Set 3': train_set3_scaled
}
test_feature_sets = {
    'Set 1': train_set1_scaled,
    'Set 2': test_set2,
    'Set 3': test_set3_scaled
}

Train Feature Set 1 shape: (466, 7)
Test Feature Set 1 shape: (466, 7)
Train Feature Set 2 shape: (466, 7)
Test Feature Set 2 shape: (466, 7)
Train Feature Set 3 shape: (466, 7)
Test Feature Set 3 shape: (466, 7)


## Define Classifiers and Parameters
This section defines three classifiers, each paired with a specific feature set:
- **Naive Bayes**: Uses Set 1, tunes `var_smoothing`.
- **Decision Tree**: Uses Set 2, tunes `max_depth` and `min_samples_leaf`.
- **KNN**: Uses Set 3, tunes `n_neighbors` and `p`.

**Input**: None.

**Output**: Dictionary of classifiers with models, parameter grids, and feature set assignments.

**Purpose**: Prepare classifiers for training and prediction.

In [4]:
# Define Classifiers and Parameters
# Define classifiers with parameter grids and feature sets
classifiers = {
    'Naive Bayes (Set 1)': {
        'model': GaussianNB(),
        # 'params': {'var_smoothing': [1e-9, 1e-8, 1e-6]},
        'params': {'var_smoothing': [1e-6]},
        'feature_set': 'Set 1'
    },
    'Decision Tree (Set 2)': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            # 'max_depth': [3],
            'max_depth': [3, 5, 10, None],
            'min_samples_leaf': [20]
        },
        'feature_set': 'Set 2'
    },
    'KNN (Set 3)': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [15, 18, 20, 40],
            'p': [2]
        },
        'feature_set': 'Set 3'
    }
}

## Train Classifiers and Predict on Testing Set
This section trains each classifier on the training set and predicts labels for the testing set:
- **Naive Bayes**: Uses Set 1.
- **Decision Tree**: Uses Set 2.
- **KNN**: Uses Set 3.

Hyperparameter tuning is performed using GridSearchCV with 5-fold stratified cross-validation on the training set. The best model is trained on the full training set and used to predict testing set labels. Predictions are evaluated using the new `evaluate_classification` function and saved to CSV files.

**Input**: Training and testing feature sets (`train_feature_sets`, `test_feature_sets`), target (`label`), classifiers with parameters.

**Output**: Predicted labels (CSV), evaluation results, performance metrics.

**Purpose**: Train models and generate predictions for evaluation.

In [5]:
import pandas as pd
import struct
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from datetime import datetime
from sklearn.metrics import pairwise_distances


def evaluate_classification(predicted_labels, true_bin_path):
    """
    Calculate classification accuracy by comparing predicted labels with true labels

    Args:
        predicted_labels (list/np.array): Model's output predictions
        true_bin_path (str): Path to binary file containing true labels

    Returns:
        float: Accuracy percentage (0-100)
    """
    # Read true labels from binary file
    true_labels = []
    with open(true_bin_path, 'rb') as bin_file:
        while True:
            byte_data = bin_file.read(4)  # 4 bytes per integer
            if not byte_data:
                break
            true_labels.append(struct.unpack('i', byte_data)[0])

    # Convert to numpy arrays for vectorized operations
    true_array = np.array(true_labels)
    pred_array = np.array(predicted_labels)

    # Validate lengths match
    if len(true_array) != len(pred_array):
        raise ValueError(f"Length mismatch: {len(true_array)} true vs {len(pred_array)} predicted")

    # Calculate accuracy
    matches = np.sum(true_array == pred_array)
    accuracy = (matches / len(true_array)) * 100

    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"Execution time: {current_time} - Accuracy: {accuracy:.2f}%")

    return accuracy

In [6]:
# Train Classifiers and Predict on Testing Set
# Perform training and prediction
results = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_estimators = {}
predictions = {}  # Store predictions for each classifier

for clf_name, clf_info in classifiers.items():
    clf = clf_info['model']
    X_train = train_feature_sets[clf_info['feature_set']]
    X_test = test_feature_sets[clf_info['feature_set']]
    print(f"\nTraining {clf_name} on {clf_info['feature_set']}...")

    # Parameter tuning with GridSearchCV
    param_grid = clf_info['params']
    grid = GridSearchCV(clf, param_grid, cv=skf, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, label)

    # Print accuracy for each parameter combination
    print(f"\nAccuracy for each parameter combination in {clf_name}:")
    for params, mean_score, std_score in zip(
        grid.cv_results_['params'],
        grid.cv_results_['mean_test_score'],
        grid.cv_results_['std_test_score']
    ):
        print(f"Parameters: {params}")
        print(f"Mean Accuracy: {mean_score:.4f} (±{std_score * 2:.4f})")

    # Train best estimator on full training set
    best_clf = grid.best_estimator_
    best_clf.fit(X_train, label)
    best_estimators[clf_name] = best_clf

    # Predict on testing set
    y_pred = best_clf.predict(X_test)
    predictions[clf_name] = y_pred

    # TODO
    # from evaluation import evaluate_classification
    evaluate_classification(y_pred, 'check.bin')

    # Compute F1 score on training set (for comparison)
    y_train_pred = cross_val_predict(grid.best_estimator_, X_train, label, cv=skf)
    f1 = f1_score(label, y_train_pred, average='macro')

    results.append({
        'Classifier': clf_name,
        'Feature Set': clf_info['feature_set'],
        'Accuracy': grid.best_score_,
        'F1 Score': f1,
        'Accuracy Std': grid.cv_results_['std_test_score'][grid.best_index_] * 2,
        'F1 Std': cross_val_score(grid.best_estimator_, X_train, label, cv=skf, scoring='f1_macro').std() * 2,
        'Best Params': grid.best_params_
    })


Training Naive Bayes (Set 1) on Set 1...

Accuracy for each parameter combination in Naive Bayes (Set 1):
Parameters: {'var_smoothing': 1e-06}
Mean Accuracy: 0.4911 (±0.2838)
Execution time: 2025-05-15 08:58:36 - Accuracy: 57.94%

Training Decision Tree (Set 2) on Set 2...

Accuracy for each parameter combination in Decision Tree (Set 2):
Parameters: {'max_depth': 3, 'min_samples_leaf': 20}
Mean Accuracy: 0.5601 (±0.0728)
Parameters: {'max_depth': 5, 'min_samples_leaf': 20}
Mean Accuracy: 0.5707 (±0.0834)
Parameters: {'max_depth': 10, 'min_samples_leaf': 20}
Mean Accuracy: 0.5686 (±0.0794)
Parameters: {'max_depth': None, 'min_samples_leaf': 20}
Mean Accuracy: 0.5686 (±0.0794)
Execution time: 2025-05-15 08:58:38 - Accuracy: 61.80%

Training KNN (Set 3) on Set 3...

Accuracy for each parameter combination in KNN (Set 3):
Parameters: {'n_neighbors': 15, 'p': 2}
Mean Accuracy: 0.5729 (±0.0664)
Parameters: {'n_neighbors': 18, 'p': 2}
Mean Accuracy: 0.5729 (±0.0512)
Parameters: {'n_neighbor

## Voting Classifiers and Prediction
This section applies Voting classifiers (soft and hard) using Feature Set 3, combining the best estimators from Naive Bayes, Decision Tree, and KNN. The Voting classifiers are trained on the training set and used to predict testing set labels.

**Input**: Best estimators from individual classifiers, Feature Set 3 (training and testing), target (`label`).

**Output**: Predicted labels (CSV), evaluation results, performance metrics.

**Purpose**: Evaluate ensemble performance and generate predictions.

In [9]:
# Voting Classifiers and Prediction
# Define Voting classifiers
voting_classifiers = {
    'Voting (Soft)': VotingClassifier(
        estimators=[
            ('nb', best_estimators['Naive Bayes (Set 1)']),
            ('dt', best_estimators['Decision Tree (Set 2)']),
            ('knn', best_estimators['KNN (Set 3)'])
        ],
        voting='soft'
    ),
    'Voting (Hard)': VotingClassifier(
        estimators=[
            ('nb', best_estimators['Naive Bayes (Set 1)']),
            ('dt', best_estimators['Decision Tree (Set 2)']),
            ('knn', best_estimators['KNN (Set 3)'])
        ],
        voting='hard'
    )
}

# Train and predict with Voting classifiers
X_train_voting = train_feature_sets['Set 3']
X_test_voting = test_feature_sets['Set 3']

for clf_name, clf in voting_classifiers.items():
    print(f"\nTraining {clf_name} on Set 3...")
    clf.fit(X_train_voting, label)

    # Cross-validation scores on training set
    scores = cross_val_score(clf, X_train_voting, label, cv=skf, scoring='accuracy')
    f1_scores = cross_val_score(clf, X_train_voting, label, cv=skf, scoring='f1_macro')

    # Predict on testing set
    y_pred = clf.predict(X_test_voting)
    predictions[clf_name] = y_pred
    
    evaluate_classification(y_pred, 'check.bin')

    # # Save predictions to CSV (without evaluating yet)
    # output_file = f'output/{clf_name.replace(" ", "_")}_predicted_labels.csv'
    # print(f"Predictions saved to {output_file}")
    # pd.DataFrame({'Programme': y_pred}).to_csv(output_file, index=False)

    results.append({
        'Classifier': clf_name,
        'Feature Set': 'Set 3',
        'Accuracy': scores.mean(),
        'F1 Score': f1_scores.mean(),
        'Accuracy Std': scores.std() * 2,
        'F1 Std': f1_scores.std() * 2,
        'Best Params': 'N/A'
    })


Training Voting (Soft) on Set 3...
Execution time: 2025-05-15 08:59:09 - Accuracy: 60.73%

Training Voting (Hard) on Set 3...
Execution time: 2025-05-15 08:59:09 - Accuracy: 61.37%


## Display and Save Results

This section compiles the training performance results into a table, displaying Accuracy, F1 Score, standard deviations, and best parameters. The table is printed to the console and saved to a CSV file. The best classifier is recommended based on the highest training Accuracy.

**Input**: Results list from training.

**Output**: Printed results table, `classification_performance_table.csv`, recommended model.

**Purpose**: Summarize training performance and guide model selection.

In [8]:
# Evaluate Only the Best Classifier
# Generate results table
results_df = pd.DataFrame(results)
print("\nClassification Results (Training Performance):")
print(results_df[['Classifier', 'Feature Set', 'Accuracy', 'F1 Score', 'Accuracy Std', 'F1 Std', 'Best Params']])
results_df.to_csv('output/classification_performance_table.csv', index=False)

# Find the best classifier based on Accuracy
best_result = results_df.loc[results_df['Accuracy'].idxmax()]
best_classifier = best_result['Classifier']
best_feature_set = best_result['Feature Set']

# Evaluate the best classifier using evaluate_classification
print(f"\nEvaluating best classifier: {best_classifier}")
if best_classifier in classifiers:
    # For individual classifiers (Naive Bayes, Decision Tree, KNN)
    best_clf = best_estimators[best_classifier]
    X_test = test_feature_sets[classifiers[best_classifier]['feature_set']]
    y_pred = predictions[best_classifier]
else:
    # For Voting classifiers
    best_clf = voting_classifiers[best_classifier]
    X_test = test_feature_sets['Set 3']
    y_pred = predictions[best_classifier]

# Call evaluate_classification only for the best classifier
output_file = f'output/{best_classifier.replace(" ", "_")}_predicted_labels.csv'

# TODO
# from evaluation import evaluate_classification
# evaluate_classification(y_pred.tolist(), 'check.bin')

print(f"Best classifier predictions evaluated and saved to {output_file}")

# Recommendation
print(f"\nRecommended Model: {best_result['Classifier']}")
print(f"Feature Set: {best_result['Feature Set']}")
print(f"Reason: Achieves highest training accuracy ({best_result['Accuracy']:.2f} ± {best_result['Accuracy Std']:.2f}) "
      f"and F1 score ({best_result['F1 Score']:.2f} ± {best_result['F1 Std']:.2f}) with parameters {best_result['Best Params']}.")


Classification Results (Training Performance):
              Classifier Feature Set  Accuracy  F1 Score  Accuracy Std  \
0    Naive Bayes (Set 1)       Set 1  0.491123  0.548832      0.283764   
1  Decision Tree (Set 2)       Set 2  0.570716  0.607654      0.083391   
2            KNN (Set 3)       Set 3  0.572935  0.571775      0.066355   
3          Voting (Soft)       Set 3  0.553535  0.543190      0.133592   
4          Voting (Hard)       Set 3  0.583596  0.586071      0.088010   

     F1 Std                               Best Params  
0  0.204143                  {'var_smoothing': 1e-06}  
1  0.068040  {'max_depth': 5, 'min_samples_leaf': 20}  
2  0.059941               {'n_neighbors': 15, 'p': 2}  
3  0.079754                                       N/A  
4  0.081956                                       N/A  

Evaluating best classifier: Voting (Hard)
Best classifier predictions evaluated and saved to output/Voting_(Hard)_predicted_labels.csv

Recommended Model: Voting (Hard)
F