## Phase 3: Baseline Model Training

**Objective**: Establish baseline sentiment-classification performance on the Bangla Sentiment Dataset with no imbalance mitigation. The baselines are Logistic Regression, SVM, Multinomial Naive Bayes (MultinomialNB), and BanglaBERT; their scores serve as the reference point for quantifying the impact of class imbalance.

### Step 1: Load Preprocessed Data

- **Objective**: Load TF-IDF matrices, labels, and BERT tokens from Phase 2 for model training.

In [2]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

# Load TF-IDF matrices
tfidf_train = sp.load_npz("text_representation/tfidf_train.npz")
tfidf_val = sp.load_npz("text_representation/tfidf_val.npz")
tfidf_test = sp.load_npz("text_representation/tfidf_test.npz")

# Load labels
y_train = pd.read_csv("text_representation/labels_train.csv")['Label'].values
y_val = pd.read_csv("text_representation/labels_val.csv")['Label'].values
y_test = pd.read_csv("text_representation/labels_test.csv")['Label'].values

# Load BERT input IDs and attention masks
bert_input_ids = np.load("text_representation/bert_input_ids.npy")
bert_attention_masks = np.load("text_representation/bert_attention_masks.npy")

# Verify shapes (the BERT arrays are loaded from a single unsplit file,
# so their row count can differ from the TF-IDF training matrix)
print("TF-IDF Train Shape:", tfidf_train.shape)
print("Labels Train Shape:", y_train.shape)
print("BERT Input IDs Shape:", bert_input_ids.shape)
print("Label Distribution (Train):\n", pd.Series(y_train).value_counts(normalize=True) * 100)

TF-IDF Train Shape: (6193, 5000)
Labels Train Shape: (6193,)
BERT Input IDs Shape: (7743, 128)
Label Distribution (Train):
 0    47.359922
2    29.081221
1    23.558857
Name: proportion, dtype: float64
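
The label distribution above implies a simple reference point: a classifier that always predicts the most frequent class (label 0) would be right on about 47.4% of the training samples. The sketch below computes that majority-class accuracy from a list of labels. `majority_baseline` is a hypothetical helper, and the toy labels are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, share of samples carrying that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels (illustrative only): 5 of the 10 samples belong to class 0.
toy_labels = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]
majority_label, baseline_acc = majority_baseline(toy_labels)
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_acc:.2f}")
```

Assuming `y_val` is loaded as in the cell above, `majority_baseline(y_val)` would give an accuracy floor against which the validation scores of the trained models can be read.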


### Step 2: Train Logistic Regression, SVM, and Naive Bayes Models

- **Objective**: Train baseline Logistic Regression, SVM, and Multinomial Naive Bayes models on the imbalanced TF-IDF training set.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# Initialize models
models = {
    'NaiveBayes': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
}

In [5]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import LabelBinarizer
from tqdm import tqdm
import joblib
import os

# Train and evaluate
results = {}  # validation metrics per model
model_dir = "models/baseline_models"
os.makedirs(model_dir, exist_ok=True)  # make sure the output directory exists

# One-hot encode y_val once for ROC-AUC; it does not change between models
lb = LabelBinarizer()
y_val_bin = lb.fit_transform(y_val)

for name, model in tqdm(models.items(), desc="Training Models...."):
    # Train
    model.fit(tfidf_train, y_train)

    # Predict on validation set
    y_pred = model.predict(tfidf_val)

    # Compute overall (support-weighted) and per-class metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_val, y_pred, average='weighted')
    precision_per_class, recall_per_class, f1_per_class, _ = precision_recall_fscore_support(y_val, y_pred)

    # ROC-AUC (one-vs-rest, macro-averaged over classes)
    y_pred_proba = model.predict_proba(tfidf_val)
    roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')

    # Store results
    results[name] = {
        'accuracy': accuracy,
        'precision_weighted': precision,
        'recall_weighted': recall,
        'f1_weighted': f1,
        'precision_per_class': precision_per_class,
        'recall_per_class': recall_per_class,
        'f1_per_class': f1_per_class,
        'roc_auc': roc_auc
    }

    # Save model
    joblib.dump(model, os.path.join(model_dir, f"{name}_baseline.joblib"))

Training Models....: 100%|██████████| 3/3 [01:05<00:00, 21.94s/it]
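
The one-vs-rest ROC-AUC computed in the loop can be read as follows: for each class, samples of that class are treated as positives and all others as negatives, a binary AUC is computed from that class's probability column, and the per-class values are averaged. The following is a minimal pure-Python sketch of that idea, using the pairwise-ranking definition of AUC rather than scikit-learn's implementation. The helper names and the toy probabilities are hypothetical.

```python
def binary_auc(is_positive, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def ovr_macro_auc(y_true, proba, classes):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    aucs = []
    for k, cls in enumerate(classes):
        is_positive = [y == cls for y in y_true]   # current class vs. the rest
        class_scores = [row[k] for row in proba]   # probability column for this class
        aucs.append(binary_auc(is_positive, class_scores))
    return sum(aucs) / len(aucs)

# Toy data (illustrative only): 4 samples, 3 classes, rows are predicted probabilities.
y_true = [0, 0, 1, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
macro_auc = ovr_macro_auc(y_true, proba, classes=[0, 1, 2])
print(f"One-vs-rest macro AUC: {macro_auc:.4f}")
```

The quadratic pair loop is only meant to make the definition visible; on real data the `roc_auc_score` call in the notebook is the practical choice.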


In [6]:
# Print results
for name, metrics in results.items():
    print(f"\n{name} Validation Metrics:")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


NaiveBayes Validation Metrics:
Accuracy: 0.6155
Weighted F1: 0.5797
F1 per Class (Neg, Pos, Neu): [0.71518987 0.44186047 0.47093023]
ROC-AUC: 0.7241

LogisticRegression Validation Metrics:
Accuracy: 0.5935
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.68304094 0.46938776 0.49376559]
ROC-AUC: 0.7188

SVM Validation Metrics:
Accuracy: 0.6103
Weighted F1: 0.5811
F1 per Class (Neg, Pos, Neu): [0.7063922  0.45864662 0.47645429]
ROC-AUC: 0.7183
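
Weighted F1 averages the per-class F1 scores by class support, so the majority class dominates it. An unweighted (macro) average treats the three classes equally and can be derived directly from the per-class values printed above. The sketch below does that arithmetic; the dictionary simply copies the reported numbers, and `macro_f1` is a name introduced here for illustration.

```python
# Per-class F1 scores copied from the validation output above.
f1_per_class = {
    "NaiveBayes": [0.71518987, 0.44186047, 0.47093023],
    "LogisticRegression": [0.68304094, 0.46938776, 0.49376559],
    "SVM": [0.7063922, 0.45864662, 0.47645429],
}

# Macro F1: unweighted mean over the classes, ignoring class support.
macro_f1 = {name: sum(scores) / len(scores) for name, scores in f1_per_class.items()}

for name, score in macro_f1.items():
    print(f"{name}: macro F1 = {score:.4f}")
```

By this measure Logistic Regression comes out slightly ahead (about 0.549, against roughly 0.543 for Naive Bayes and 0.547 for SVM) even though it has the lowest accuracy, which is consistent with the majority class carrying much of the accuracy and weighted F1 figures.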
