# Phase 5: Evaluation and Comparison

**Objective**: Compare the baseline models (Phase 3) with the mitigated models (Phase 4) to assess how effective the imbalance mitigation strategies (SMOTE, Random Undersampling, NearMiss, Weighted Loss) are for 3-class (Negative, Neutral, Positive) sentiment classification on the Bangla Sentiment Dataset. Evaluate every model on the test set, break performance down by source (newspapers, social media, blogs) to test hypothesis H3 (source-specific differences in sentiment classification), and run statistical tests to determine whether the improvements are significant.


### Step 1: Load Test Data and Models

- **Objective**: Load the test TF-IDF matrix, labels, source metadata, and all models (baseline and mitigated).

In [2]:
import pandas as pd
import scipy.sparse as sp
import joblib
import os
import logging
from tqdm import tqdm

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Define paths
data_dir = "text_representation/"
model_dir_baseline = "models/baseline_models/"
model_dir_mitigated = "models/mitigated_models/"
files = {
    'tfidf_test': f"{data_dir}tfidf_test.npz",
    'labels_test': f"{data_dir}labels_test.csv",
}

# Check file existence
for name, path in files.items():
    if not os.path.exists(path):
        logging.error(f"Missing file: {path}")
        raise FileNotFoundError(f"Missing file: {path}")

# Load test data
tfidf_test = sp.load_npz(files['tfidf_test'])
y_test = pd.read_csv(files['labels_test'], encoding='utf-8')['Label'].values

logging.info("Test data loaded successfully")

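# Validate shapes: the TF-IDF matrix must have exactly one row per test label
assert tfidf_test.shape[0] == len(y_test), (
    f"Test data mismatch: {tfidf_test.shape[0]} TF-IDF rows vs {len(y_test)} labels"
)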
logging.info("Data shapes validated")

2025-06-25 10:55:28,975 - INFO - Test data loaded successfully
2025-06-25 10:55:28,978 - INFO - Data shapes validated
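
The cell above defines `model_dir_baseline` and `model_dir_mitigated` but loads only the test matrix and labels. As a minimal sketch of how the models could be loaded from those directories, the helper below collects every serialized file in a directory into a dictionary keyed by file name. The helper name `load_models`, the `.pkl`/`.joblib` extensions, and the one-model-per-file layout are assumptions for illustration; they are not confirmed by the notebook.

```python
from pathlib import Path


def load_models(model_dir, loader=None, extensions=(".pkl", ".joblib")):
    """Load every serialized model in model_dir into a dict keyed by file stem.

    Assumption: each model was saved as a single file whose suffix is listed
    in `extensions`. Adjust `extensions` to match the actual file names.
    """
    if loader is None:
        # joblib is the library the notebook imports; it is imported lazily
        # here so that a different loader can be supplied instead.
        import joblib
        loader = joblib.load

    directory = Path(model_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Missing model directory: {directory}")

    models = {}
    # Sort for a deterministic load order across runs.
    for path in sorted(directory.iterdir()):
        if path.suffix in extensions:
            models[path.stem] = loader(path)

    if not models:
        raise FileNotFoundError(f"No model files found in {directory}")
    return models
```

With the paths defined earlier, this might be called as `baseline_models = load_models(model_dir_baseline)` and `mitigated_models = load_models(model_dir_mitigated)`. Failing loudly on an empty or missing directory mirrors the file-existence check used for the test data, so a misconfigured path surfaces before any evaluation runs.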
