# Phase 5: Evaluation and Comparison

**Objective**: Compare baseline models (Phase 3) and mitigated models (Phase 4) to assess the effectiveness of imbalance mitigation strategies (SMOTE, Random Undersampling, NearMiss, Weighted Loss) for 3-class (Negative, Neutral, Positive) sentiment classification on the Bangla Sentiment Dataset. Evaluate performance on the test set, analyze source-specific performance (newspapers, social media, blogs) to test hypothesis H3 (source-specific differences in sentiment classification), and perform statistical tests to determine significant improvements.


### Step 1: Load Test Data and Models

- **Objective**: Load test TF-IDF matrix, labels, source metadata, and all models (baseline and mitigated)

In [3]:
import pandas as pd
import scipy.sparse as sp
import joblib
import os
import logging
from tqdm import tqdm

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Define paths
data_dir = "text_representation/"

files = {
    'tfidf_test': f"{data_dir}tfidf_test.npz",
    'labels_test': f"{data_dir}labels_test.csv",
}

# Check file existence
for name, path in files.items():
    if not os.path.exists(path):
        logging.error(f"Missing file: {path}")
        raise FileNotFoundError(f"Missing file: {path}")

# Load test data
tfidf_test = sp.load_npz(files['tfidf_test'])
y_test = pd.read_csv(files['labels_test'], encoding='utf-8')['Label'].values

logging.info("Test data loaded successfully")

# Validate shapes
assert tfidf_test.shape[0] == len(y_test), "Test data mismatch"
logging.info("Data shapes validated")

2025-06-25 10:56:14,740 - INFO - Test data loaded successfully
2025-06-25 10:56:14,742 - INFO - Data shapes validated


In [4]:
# Load models
model_dir_baseline = "models/baseline_models/"
model_dir_mitigated = "models/mitigated_models/"

model_configs = [
    ('baseline', model_dir_baseline, ['LogisticRegression_tuned_grid', 'SVM_tuned_grid', 'NaiveBayes_tuned_grid', 'RandomForest_tuned_grid']),
    ('mitigated', model_dir_mitigated, [
        f"{model}_{mitigation}_tuned"
        for model in ['LogisticRegression', 'SVM', 'NaiveBayes', 'RandomForest']
        for mitigation in ['smote', 'undersampled', 'nearmiss', 'weighted']
    ])
]
models = {}
for config_type, model_dir, model_names in model_configs:
    for name in tqdm(model_names, desc=f"Loading {config_type} models"):
        try:
            models[f"{config_type}_{name}"] = joblib.load(f"{model_dir}{name}.joblib")
            logging.info(f"Loaded model: {config_type}_{name}")
        except Exception as e:
            logging.error(f"Error loading {name}: {str(e)}")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
2025-06-25 10:58:40,317 - INFO - Loaded model: baseline_LogisticRegression_tuned_grid
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
2025-06-25 10:58:40,338 - INFO - Loaded model: baseline_SVM_tuned_grid
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
2025-06-25 10:58:40,352 - INFO - Loaded model: baseline_NaiveBayes_tuned_grid
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
2025-06-25 10:58:41,359 - INFO - Loaded model: baseline_RandomForest_tuned_grid
Loading baseline models: 100%|██████████| 4/4 [00:02<00:00,  1.42it/s]
Loading mitigated models:   0%|          | 0/16 [00:00<?, ?it/s]2025-06-25 10:58:41,373 - INFO - Loaded model: mitigated_LogisticRegression_smote_tuned
20

### Step 2: Evaluate Models on Test Set

- **Objective**: Compute accuracy, precision, recall, F1-score (weighted and per-class), and ROC-AUC for all models on the test set.

In [6]:
import os
import logging
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import LabelBinarizer
import pandas as pd
from tabulate import tabulate

# Ensure evaluation directory exists
os.makedirs("evaluation", exist_ok=True)

# Binarize labels for ROC AUC
lb = LabelBinarizer()
y_test_bin = lb.fit_transform(y_test)

results = {
    'Model': [], 'Type': [], 'Mitigation': [],
    'Accuracy': [], 'F1_Weighted': [], 'F1_Negative': [], 
    'F1_Positive': [], 'F1_Neutral': [], 'ROC_AUC': []
}

for model_key, model in tqdm(models.items(), desc="Evaluating models"):
    try:
        config_type, name = model_key.split('_', 1)
        mitigation = name.split('_')[-1] if config_type == 'mitigated' else 'none'
        model_name = '_'.join(name.split('_')[:-1]) if config_type == 'mitigated' else name.replace('_tuned_grid', '')

        y_pred = model.predict(tfidf_test)
        y_pred_proba = model.predict_proba(tfidf_test)
        
        precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
        f1_per_class = precision_recall_fscore_support(y_test, y_pred)[2]
        roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr')

        results['Model'].append(model_name)
        results['Type'].append(config_type)
        results['Mitigation'].append(mitigation)
        results['Accuracy'].append(accuracy_score(y_test, y_pred))
        results['F1_Weighted'].append(f1)
        results['F1_Negative'].append(f1_per_class[0])
        results['F1_Positive'].append(f1_per_class[1])
        results['F1_Neutral'].append(f1_per_class[2])
        results['ROC_AUC'].append(roc_auc)

        logging.info(f"Evaluated {model_key}")
    except Exception as e:
        logging.error(f"Error evaluating {model_key}: {str(e)}")

# Save and show results
results_df = pd.DataFrame(results)
csv_path = "evaluation/comparative_results.csv"
results_df.to_csv(csv_path, index=False)
logging.info(f"Comparative results saved: {csv_path}")

Evaluating models:   0%|          | 0/20 [00:00<?, ?it/s]2025-06-25 11:05:26,289 - INFO - Evaluated baseline_LogisticRegression_tuned_grid


2025-06-25 11:05:31,714 - INFO - Evaluated baseline_SVM_tuned_grid
Evaluating models:  10%|█         | 2/20 [00:05<00:49,  2.73s/it]2025-06-25 11:05:31,736 - INFO - Evaluated baseline_NaiveBayes_tuned_grid
2025-06-25 11:05:32,233 - INFO - Evaluated baseline_RandomForest_tuned_grid
Evaluating models:  20%|██        | 4/20 [00:05<00:20,  1.27s/it]2025-06-25 11:05:32,248 - INFO - Evaluated mitigated_LogisticRegression_smote_tuned
2025-06-25 11:05:32,264 - INFO - Evaluated mitigated_LogisticRegression_undersampled_tuned
2025-06-25 11:05:32,280 - INFO - Evaluated mitigated_LogisticRegression_nearmiss_tuned
2025-06-25 11:05:32,296 - INFO - Evaluated mitigated_LogisticRegression_weighted_tuned
2025-06-25 11:05:34,653 - INFO - Evaluated mitigated_SVM_smote_tuned
Evaluating models:  45%|████▌     | 9/20 [00:08<00:08,  1.35it/s]2025-06-25 11:05:36,354 - INFO - Evaluated mitigated_SVM_undersampled_tuned
Evaluating models:  50%|█████     | 10/20 [00:10<00:08,  1.12it/s]2025-06-25 11:05:37,947 - IN

In [7]:
# Display as table in notebook output
from IPython.display import display, HTML
print("\n=== Evaluation Results ===\n")
print(tabulate(results_df, headers='keys', tablefmt='github', showindex=False))


=== Evaluation Results ===

| Model                           | Type      | Mitigation   |   Accuracy |   F1_Weighted |   F1_Negative |   F1_Positive |   F1_Neutral |   ROC_AUC |
|---------------------------------|-----------|--------------|------------|---------------|---------------|---------------|--------------|-----------|
| LogisticRegression              | baseline  | none         |   0.629677 |      0.612208 |      0.715596 |      0.521739 |     0.51715  |  0.753958 |
| SVM                             | baseline  | none         |   0.625806 |      0.612925 |      0.707692 |      0.530351 |     0.52551  |  0.734816 |
| NaiveBayes                      | baseline  | none         |   0.621935 |      0.595792 |      0.710098 |      0.503597 |     0.48433  |  0.744348 |
| RandomForest                    | baseline  | none         |   0.618065 |      0.593155 |      0.711602 |      0.478571 |     0.493151 |  0.743115 |
| LogisticRegression_smote        | mitigated | tuned        |   