## Phase 3: Baseline Model Training 


**Objective**: Establish baseline performance for 3-class (Negative, Neutral, Positive) sentiment classification on the Bangla Sentiment Dataset without any imbalance mitigation. Four models are trained: Logistic Regression, SVM, Naive Bayes (MultinomialNB), and Random Forest. Each is then tuned by exhaustive grid search (GridSearchCV, scored on weighted F1 so that class frequencies are reflected in the selection metric), and confusion matrices and ROC curves are used to quantify the effect of class imbalance.

### Step 1: Load Preprocessed Data

- **Objective**: Load TF-IDF matrices and labels from Phase 2.

In [2]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
import os
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


# Define paths
data_dir = "text_representation/"
files = {
    'tfidf_train': f"{data_dir}tfidf_train.npz",
    'tfidf_val': f"{data_dir}tfidf_val.npz",
    'tfidf_test': f"{data_dir}tfidf_test.npz",
    'labels_train': f"{data_dir}labels_train.csv",
    'labels_val': f"{data_dir}labels_val.csv",
    'labels_test': f"{data_dir}labels_test.csv"
}

# Check file existence
for name, path in files.items():
    if not os.path.exists(path):
        logging.error(f"Missing file: {path}")
        raise FileNotFoundError(f"Missing file: {path}")

# Load TF-IDF matrices
tfidf_train = sp.load_npz(files['tfidf_train'])
tfidf_val = sp.load_npz(files['tfidf_val'])
tfidf_test = sp.load_npz(files['tfidf_test'])
logging.info("TF-IDF matrices loaded successfully")

# Load labels
y_train = pd.read_csv(files['labels_train'], encoding='utf-8')['Label'].values
y_val = pd.read_csv(files['labels_val'], encoding='utf-8')['Label'].values
y_test = pd.read_csv(files['labels_test'], encoding='utf-8')['Label'].values
logging.info("Labels loaded successfully")

# Validate shapes
assert tfidf_train.shape[0] == len(y_train), "Train data mismatch"
assert tfidf_val.shape[0] == len(y_val), "Validation data mismatch"
assert tfidf_test.shape[0] == len(y_test), "Test data mismatch"
logging.info("Data shapes validated")

# Print shapes and distribution
print("TF-IDF Train Shape:", tfidf_train.shape)
print("Labels Train Shape:", y_train.shape)
print("Label Distribution (Train):\n", pd.Series(y_train).value_counts(normalize=True) * 100)

2025-06-22 10:47:58,832 - INFO - TF-IDF matrices loaded successfully
2025-06-22 10:47:58,847 - INFO - Labels loaded successfully
2025-06-22 10:47:58,850 - INFO - Data shapes validated


TF-IDF Train Shape: (6193, 5000)
Labels Train Shape: (6193,)
Label Distribution (Train):
 0    47.359922
2    29.081221
1    23.558857
Name: proportion, dtype: float64
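
The label distribution above sets a floor for the metrics reported in later steps: a classifier that always predicts the most frequent class (label 0) would be correct on about 47% of the training samples. The sketch below recomputes that floor and the max/min class ratio from the printed percentages. The dictionary `train_distribution` is copied from the output above; whether the validation and test splits share exactly this distribution is an assumption, since the source does not print it.

```python
# Class shares (percent) copied from the "Label Distribution (Train)" output above.
train_distribution = {0: 47.359922, 2: 29.081221, 1: 23.558857}

# The majority class is the label with the largest share.
majority_label = max(train_distribution, key=train_distribution.get)

# Accuracy of a trivial classifier that always predicts the majority label.
majority_accuracy = train_distribution[majority_label] / 100

# Imbalance ratio: most frequent class relative to least frequent class.
imbalance_ratio = max(train_distribution.values()) / min(train_distribution.values())

print(f"Majority label: {majority_label}")
print(f"Always-majority accuracy: {majority_accuracy:.4f}")
print(f"Imbalance ratio (max/min): {imbalance_ratio:.2f}")
```

Any model whose accuracy is not clearly above this floor is learning little beyond the class prior.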


### Step 2: Train Logistic Regression, SVM, Naive Bayes, and Random Forest

- **Objective**: Train baseline models with largely default hyperparameters on the imbalanced TF-IDF training set and evaluate them on the validation set.

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import LabelBinarizer
import joblib
from tqdm import tqdm

In [4]:
# Create output directory
os.makedirs("models/baseline_models", exist_ok=True)
logging.info("Output directory created: models/baseline_models")

2025-06-22 10:53:31,405 - INFO - Output directory created: models/baseline_models


In [6]:
# Initialize models
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, multi_class='multinomial', penalty='l2', random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'NaiveBayes': MultinomialNB(),
    'RandomForest': RandomForestClassifier(random_state=42)
}

results = {} # store the performance of models
lb = LabelBinarizer()  # turns y_val into binary format for ROC-AUC

In [7]:
# Train and evaluate
for name, model in tqdm(models.items(), desc="Training Models"):
    try:
        # Train the model
        model.fit(tfidf_train, y_train)
        logging.info(f"{name} training completed")

        # Predict on validation set
        y_pred = model.predict(tfidf_val)

        # Compute evaluation metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_val, y_pred, average='weighted'
        )
        f1_per_class = precision_recall_fscore_support(y_val, y_pred)[2]

        # ROC-AUC Score (One-vs-Rest strategy)
        y_val_bin = lb.fit_transform(y_val)
        y_pred_proba = model.predict_proba(tfidf_val)
        roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')

        # Store results
        results[name] = {
            'accuracy': accuracy,
            'precision_weighted': precision,
            'recall_weighted': recall,
            'f1_weighted': f1,
            'f1_per_class': f1_per_class,
            'roc_auc': roc_auc
        }

        # Save model to disk
        joblib.dump(model, f"models/baseline_models/{name}_baseline.joblib")
        logging.info(f"{name} model saved: {name}_baseline.joblib")

    except Exception as e:
        logging.error(f"Error training {name}: {str(e)}")
    

Training Models:   0%|          | 0/4 [00:00<?, ?it/s]

2025-06-22 10:57:52,029 - INFO - LogisticRegression training completed
2025-06-22 10:57:52,126 - INFO - LogisticRegression model saved: LogisticRegression_baseline.joblib
Training Models:  25%|██▌       | 1/4 [00:01<00:03,  1.04s/it]
2025-06-22 10:58:49,098 - INFO - SVM training completed
2025-06-22 10:58:50,913 - INFO - SVM model saved: SVM_baseline.joblib
Training Models:  50%|█████     | 2/4 [00:59<01:10, 35.01s/it]
2025-06-22 10:58:50,923 - INFO - NaiveBayes training completed
2025-06-22 10:58:50,945 - INFO - NaiveBayes model saved: NaiveBayes_baseline.joblib
2025-06-22 10:59:07,200 - INFO - RandomForest training completed
2025-06-22 10:59:07,591 - INFO - RandomForest model saved: RandomForest_baseline.joblib
Training Models: 100%|██████████| 4/4 [01:16<00:00, 19.13s/it]


In [9]:
# Print results
for name, metrics in results.items():
    print(f"\n{name} Validation Metrics:")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


LogisticRegression Validation Metrics:
Accuracy: 0.5935
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.68304094 0.46938776 0.49376559]
ROC-AUC: 0.7188

SVM Validation Metrics:
Accuracy: 0.6103
Weighted F1: 0.5811
F1 per Class (Neg, Pos, Neu): [0.7063922  0.45864662 0.47645429]
ROC-AUC: 0.7183

NaiveBayes Validation Metrics:
Accuracy: 0.6155
Weighted F1: 0.5797
F1 per Class (Neg, Pos, Neu): [0.71518987 0.44186047 0.47093023]
ROC-AUC: 0.7241

RandomForest Validation Metrics:
Accuracy: 0.5832
Weighted F1: 0.5631
F1 per Class (Neg, Pos, Neu): [0.67730901 0.45296167 0.46632124]
ROC-AUC: 0.7129
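
The gap between the first per-class F1 value and the other two is easy to miss when only the weighted F1 is read, because `average='weighted'` weights each class's F1 by its support and so lets the most frequent class dominate. The sketch below shows the difference between macro and weighted averaging in plain Python. The confusion matrix is hypothetical (counts invented for illustration, not taken from the Bangla dataset), and `per_class_f1` is an illustrative helper, not part of the notebook.

```python
# Hypothetical 3x3 confusion matrix (rows = true class, columns = predicted class).
# These counts are invented for illustration; they are NOT from the Bangla dataset.
confusion = [
    [80, 10, 10],  # true class 0 (majority)
    [20, 20, 10],  # true class 1
    [20, 10, 30],  # true class 2
]

def per_class_f1(cm):
    """Return a list of (f1, support) pairs for a square confusion matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[row][k] for row in range(n))  # column sum
        support = sum(cm[k])                               # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((f1, support))
    return scores

scores = per_class_f1(confusion)
total = sum(support for _, support in scores)

# Macro: every class counts equally. Weighted: classes count by support.
macro_f1 = sum(f1 for f1, _ in scores) / len(scores)
weighted_f1 = sum(f1 * support for f1, support in scores) / total

print("Per-class F1:", [round(f1, 4) for f1, _ in scores])
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

In this toy case the weighted F1 comes out higher than the macro F1 because the best-predicted class is also the largest one, which is the same pattern the per-class values above suggest.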


### Step 3: Hyperparameter Tuning with GridSearchCV

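- **Objective**: Optimize hyperparameters using GridSearchCV with 5-fold cross-validation on the training set and `f1_weighted` scoring, then evaluate each best estimator on the validation set.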

In [10]:
from sklearn.model_selection import GridSearchCV

# Grid parameter definitions
param_grids = {
    'LogisticRegression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'],
        'multi_class': ['multinomial'],
        'penalty': ['l2'],
        'max_iter': [1000]
    },
    'SVM': {
        'C': [0.01, 0.1, 1, 10, 100],
        'kernel': ['linear', 'rbf']
    },
    'NaiveBayes': {
        'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]
    },
    'RandomForest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5]
    }
}

# Grid search loop
tuned_results = {}

for name, model in tqdm(models.items(), desc="Tuning Models (GridSearchCV)"):
    try:
        # Initialize GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=5,
            n_jobs=-1,
            verbose=1
        )

        # Fit on training data
        grid_search.fit(tfidf_train, y_train)
        logging.info(f"{name} GridSearchCV completed")

        best_model = grid_search.best_estimator_
        y_pred = best_model.predict(tfidf_val)

        # Compute evaluation metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_val, y_pred, average='weighted'
        )
        f1_per_class = precision_recall_fscore_support(y_val, y_pred)[2]

        y_val_bin = lb.fit_transform(y_val)
        y_pred_proba = best_model.predict_proba(tfidf_val)
        roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')

        # Store results
        tuned_results[name] = {
            'best_params': grid_search.best_params_,
            'accuracy': accuracy,
            'f1_weighted': f1,
            'f1_per_class': f1_per_class,
            'roc_auc': roc_auc
        }

        # Save model
        joblib.dump(best_model, f"models/baseline_models/{name}_tuned_grid.joblib")
        logging.info(f"{name} tuned model saved: {name}_tuned_grid.joblib")

    except Exception as e:
        logging.error(f"Error tuning {name}: {str(e)}")


Tuning Models (GridSearchCV):   0%|          | 0/4 [00:00<?, ?it/s]

Fitting 5 folds for each of 20 candidates, totalling 100 fits


2025-06-22 11:14:44,891 - INFO - LogisticRegression GridSearchCV completed
2025-06-22 11:14:45,021 - INFO - LogisticRegression tuned model saved: LogisticRegression_tuned_grid.joblib
Tuning Models (GridSearchCV):  25%|██▌       | 1/4 [00:13<00:41, 13.95s/it]

Fitting 5 folds for each of 10 candidates, totalling 50 fits


2025-06-22 11:23:07,405 - INFO - SVM GridSearchCV completed
2025-06-22 11:23:09,082 - INFO - SVM tuned model saved: SVM_tuned_grid.joblib
Tuning Models (GridSearchCV):  50%|█████     | 2/4 [08:38<10:04, 302.25s/it]
2025-06-22 11:23:09,263 - INFO - NaiveBayes GridSearchCV completed
2025-06-22 11:23:09,280 - INFO - NaiveBayes tuned model saved: NaiveBayes_tuned_grid.joblib
Tuning Models (GridSearchCV):  75%|███████▌  | 3/4 [08:38<02:44, 164.33s/it]

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits


2025-06-22 11:24:57,322 - INFO - RandomForest GridSearchCV completed
2025-06-22 11:24:58,031 - INFO - RandomForest tuned model saved: RandomForest_tuned_grid.joblib
Tuning Models (GridSearchCV): 100%|██████████| 4/4 [10:26<00:00, 156.74s/it]
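
The "Fitting 5 folds for each of N candidates" lines follow directly from the grids: the number of candidates is the product of the number of values listed for each hyperparameter, and each candidate is fitted once per fold. The sketch below recomputes those counts with the standard library only. `param_grids` is copied from the cell above and `cv_folds = 5` mirrors `cv=5`; the `fit_counts` dictionary is introduced here for illustration.

```python
from math import prod

# Grids copied from the tuning cell above.
param_grids = {
    'LogisticRegression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'],
        'multi_class': ['multinomial'],
        'penalty': ['l2'],
        'max_iter': [1000],
    },
    'SVM': {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']},
    'NaiveBayes': {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]},
    'RandomForest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5],
    },
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

fit_counts = {}
for name, grid in param_grids.items():
    # One candidate per combination of values, so multiply the list lengths.
    n_candidates = prod(len(values) for values in grid.values())
    fit_counts[name] = (n_candidates, n_candidates * cv_folds)
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {n_candidates * cv_folds} fits")
```

These totals exclude the one additional fit that GridSearchCV performs on the full training set when `refit=True` (its default), which is what produces `best_estimator_`.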


In [11]:
# Print summary
for name, metrics in tuned_results.items():
    print(f"\n{name} Tuned Validation Metrics:")
    print(f"Best Parameters: {metrics['best_params']}")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


LogisticRegression Tuned Validation Metrics:
Best Parameters: {'C': 1, 'max_iter': 1000, 'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'lbfgs'}
Accuracy: 0.5935
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.68304094 0.46938776 0.49376559]
ROC-AUC: 0.7188

SVM Tuned Validation Metrics:
Best Parameters: {'C': 10, 'kernel': 'rbf'}
Accuracy: 0.5935
Weighted F1: 0.5817
F1 per Class (Neg, Pos, Neu): [0.6811071  0.49180328 0.49275362]
ROC-AUC: 0.7081

NaiveBayes Tuned Validation Metrics:
Best Parameters: {'alpha': 0.5}
Accuracy: 0.6168
Weighted F1: 0.5935
F1 per Class (Neg, Pos, Neu): [0.71061453 0.47826087 0.49604222]
ROC-AUC: 0.7265

RandomForest Tuned Validation Metrics:
Best Parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
Accuracy: 0.5781
Weighted F1: 0.5543
F1 per Class (Neg, Pos, Neu): [0.67410714 0.44599303 0.44686649]
ROC-AUC: 0.7180
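
To see how much tuning changed validation performance, the weighted F1 scores before and after grid search can be compared directly. The sketch below uses only the values printed in the two metric listings above; the dictionary names `baseline_f1`, `tuned_f1`, and `deltas` are introduced here for illustration.

```python
# Weighted F1 on the validation set, copied from the printed metrics above.
baseline_f1 = {
    'LogisticRegression': 0.5776,
    'SVM': 0.5811,
    'NaiveBayes': 0.5797,
    'RandomForest': 0.5631,
}
tuned_f1 = {
    'LogisticRegression': 0.5776,
    'SVM': 0.5817,
    'NaiveBayes': 0.5935,
    'RandomForest': 0.5543,
}

# Change in weighted F1 after tuning (positive = improvement).
deltas = {name: round(tuned_f1[name] - baseline_f1[name], 4) for name in baseline_f1}

# Print models from largest gain to largest loss.
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {baseline_f1[name]:.4f} -> {tuned_f1[name]:.4f} ({delta:+.4f})")
```

Two things stand out. The Logistic Regression scores are unchanged because the selected parameters (`C=1`, `solver='lbfgs'`) are the same as those of the untuned model. Random Forest scores slightly lower after tuning, which is possible because GridSearchCV selects on cross-validated folds of the training set rather than on the validation set.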


In [12]:
# Save all tuning results to CSV
pd.DataFrame(tuned_results).to_csv("models/baseline_models/tuned_results_grid.csv")
logging.info("Tuning results saved: tuned_results_grid.csv")

2025-06-22 11:26:07,604 - INFO - Tuning results saved: tuned_results_grid.csv


### Step 4: Evaluate on Test Set

- **Objective**: Evaluate the tuned models on the held-out test set, with confusion matrices and one-vs-rest ROC curves.

In [13]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (
    confusion_matrix,
    roc_curve,
    roc_auc_score,
    accuracy_score,
    precision_recall_fscore_support
)
import numpy as np
import pandas as pd
import joblib
from tqdm import tqdm
import logging

# Define class names
class_names = ['Negative', 'Positive', 'Neutral']

# Evaluate on test set
test_results = {}

for name in tqdm(['LogisticRegression', 'SVM', 'NaiveBayes', 'RandomForest'], desc="Evaluating Test Set"):
    try:
        # Load model
        model = joblib.load(f"models/baseline_models/{name}_tuned_grid.joblib")
        logging.info(f"Loaded model: {name}_tuned_grid.joblib")

        # Predict
        y_pred = model.predict(tfidf_test)

        # Compute metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average='weighted'
        )
        f1_per_class = precision_recall_fscore_support(y_test, y_pred)[2]
        y_test_bin = lb.fit_transform(y_test)
        y_pred_proba = model.predict_proba(tfidf_test)
        roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr')

        test_results[name] = {
            'accuracy': accuracy,
            'f1_weighted': f1,
            'f1_per_class': f1_per_class,
            'roc_auc': roc_auc
        }

        # Confusion matrix plot
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(
            cm,
            annot=True,
            fmt='d',
            cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names
        )
        plt.title(f"{name} Confusion Matrix")
        plt.xlabel("Predicted")
        plt.ylabel("True")
        plt.savefig(f"models/baseline_models/{name}_cm.png")
        plt.close()
        logging.info(f"Saved confusion matrix: {name}_cm.png")

        # ROC-AUC curves (one-vs-rest)
        plt.figure(figsize=(10, 8))
        for i in range(len(class_names)):
            fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_pred_proba[:, i])
            auc = roc_auc_score(y_test_bin[:, i], y_pred_proba[:, i])
            plt.plot(fpr, tpr, label=f"{class_names[i]} (AUC = {auc:.2f})")
        plt.plot([0, 1], [0, 1], 'k--', label='Chance')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f"{name} ROC-AUC Curves (One-vs-Rest)")
        plt.legend(loc='lower right')
        plt.grid(True)
        plt.savefig(f"models/baseline_models/{name}_roc_auc.png")
        plt.close()
        logging.info(f"Saved ROC-AUC plot: {name}_roc_auc.png")

    except Exception as e:
        logging.error(f"Error evaluating {name}: {str(e)}")

# Save test results
pd.DataFrame(test_results).to_csv("models/baseline_models/test_results.csv")
logging.info("Test results saved: test_results.csv")

Evaluating Test Set:   0%|          | 0/4 [00:00<?, ?it/s]
2025-06-22 11:27:42,056 - INFO - Loaded model: LogisticRegression_tuned_grid.joblib
2025-06-22 11:27:42,819 - INFO - Saved confusion matrix: LogisticRegression_cm.png
2025-06-22 11:27:43,167 - INFO - Saved ROC-AUC plot: LogisticRegression_roc_auc.png
Evaluating Test Set:  25%|██▌       | 1/4 [00:01<00:03,  1.12s/it]
2025-06-22 11:27:43,176 - INFO - Loaded model: SVM_tuned_grid.joblib
2025-06-22 11:27:45,375 - INFO - Saved confusion matrix: SVM_cm.png
2025-06-22 11:27:45,667 - INFO - Saved ROC-AUC plot: SVM_roc_auc.png
Evaluating Test Set:  50%|█████     | 2/4 [00:03<00:03,  1.93s/it]
2025-06-22 11:27:45,674 - INFO - Loaded model: NaiveBayes_tuned_grid.joblib
2025-06-22 11:27:46,645 - INFO - Saved confusion matrix: NaiveBayes_cm.png
2025-06-22 11:27:47,004 - INFO - Saved ROC-AUC plot: NaiveBayes_roc_auc.png
Evaluating Test Set:  75%|███████▌  | 3/4 [00:04<00:01,  1.66s/it]
2025-06-22 11:27:47,551 - INFO - Loaded model: RandomForest_

In [14]:
# Print test results
for name, metrics in test_results.items():
    print(f"\n{name} Test Metrics:")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


LogisticRegression Test Metrics:
Accuracy: 0.6297
Weighted F1: 0.6122
F1 per Class (Neg, Pos, Neu): [0.71559633 0.52173913 0.5171504 ]
ROC-AUC: 0.7540

SVM Test Metrics:
Accuracy: 0.6258
Weighted F1: 0.6129
F1 per Class (Neg, Pos, Neu): [0.70769231 0.53035144 0.5255102 ]
ROC-AUC: 0.7348

NaiveBayes Test Metrics:
Accuracy: 0.6219
Weighted F1: 0.5958
F1 per Class (Neg, Pos, Neu): [0.71009772 0.50359712 0.48433048]
ROC-AUC: 0.7443

RandomForest Test Metrics:
Accuracy: 0.6181
Weighted F1: 0.5932
F1 per Class (Neg, Pos, Neu): [0.71160221 0.47857143 0.49315068]
ROC-AUC: 0.7431
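
The single ROC-AUC figure reported for each model is a one-vs-rest score: each class is treated in turn as the positive class against the other two, a binary AUC is computed from that class's predicted probability, and the per-class values are averaged. The sketch below reproduces that idea in plain Python on hypothetical toy data (the labels and probabilities are invented, not taken from the test set). `binary_auc` is an illustrative helper that uses the pairwise-ranking definition of AUC; the notebook itself relies on scikit-learn's `roc_auc_score`, and the unweighted (macro) average shown here is assumed to match its default averaging.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: true labels and predicted probabilities (columns = classes 0, 1, 2).
y_true = [0, 0, 0, 1, 1, 2]
y_proba = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]

per_class_auc = []
for k in range(3):
    binary_labels = [1 if y == k else 0 for y in y_true]  # class k vs. the rest
    class_scores = [row[k] for row in y_proba]            # probability of class k
    per_class_auc.append(binary_auc(binary_labels, class_scores))

# Unweighted mean over classes (macro average).
macro_auc = sum(per_class_auc) / len(per_class_auc)

print("Per-class OvR AUC:", [round(a, 4) for a in per_class_auc])
print(f"Macro OvR AUC: {macro_auc:.4f}")
```

Because the macro average treats all classes equally, it is less affected by class frequencies than weighted F1, which is one reason the two metrics can rank the models differently.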


In [15]:
# Write README describing the saved outputs (mode "w" overwrites any existing file)
with open("models/baseline_models/README.txt", "w", encoding='utf-8') as f:
    f.write(
        "Baseline Model Outputs:\n"
        "- *_baseline.joblib: Initial trained models\n"
        "- *_tuned_grid.joblib: Tuned models (GridSearchCV, f1_weighted optimized)\n"
        "- tuned_results_grid.csv: Tuning results\n"
        "- test_results.csv: Test set metrics\n"
        "- *_cm.png: Confusion matrix plots\n"
        "- *_roc_auc.png: ROC-AUC curve plots"
    )

logging.info("README updated")

2025-06-22 11:28:04,744 - INFO - README updated
