## Phase 3: Baseline Model Training 


**Objective**: Establish baseline performance for sentiment classification on the Bangla Sentiment Dataset without imbalance mitigation, using Logistic Regression, SVM, Naive Bayes (MultinomialNB), and BanglaBERT, to quantify the impact of class imbalance.

### Step 1: Load Preprocessed Data

- **Objective**: Load TF-IDF matrices, labels, and BERT tokens from Phase 2 for model training.

In [11]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
    
# Load TF-IDF matrices
tfidf_train = sp.load_npz("text_representation/tfidf_train.npz")
tfidf_val = sp.load_npz("text_representation/tfidf_val.npz")
tfidf_test = sp.load_npz("text_representation/tfidf_test.npz")
    
# Load labels
y_train = pd.read_csv("text_representation/labels_train.csv")['Label'].values
y_val = pd.read_csv("text_representation/labels_val.csv")['Label'].values
y_test = pd.read_csv("text_representation/labels_test.csv")['Label'].values
    
# Load BERT tokens
bert_input_ids = np.load("text_representation/bert_input_ids.npy")
bert_attention_masks = np.load("text_representation/bert_attention_masks.npy")
    
# Verify shapes
print("TF-IDF Train Shape:", tfidf_train.shape)
print("Labels Train Shape:", y_train.shape)
print("BERT Input IDs Shape:", bert_input_ids.shape)
print("Label Distribution (Train):\n", pd.Series(y_train).value_counts(normalize=True) * 100)

TF-IDF Train Shape: (6193, 5000)
Labels Train Shape: (6193,)
BERT Input IDs Shape: (7743, 128)
Label Distribution (Train):
 0    47.359922
2    29.081221
1    23.558857
Name: proportion, dtype: float64
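
To put later scores in context, it helps to know what a classifier that always predicts the majority class would achieve. The sketch below recovers the class counts from the proportions printed above (share times 6193, rounded) and derives that trivial baseline. It describes the training split only; whether the validation and test splits have the same shares is an assumption, and the variable names are illustrative.

```python
# Class counts recovered from the printed proportions (share * 6193, rounded).
train_size = 6193
shares = {0: 47.359922, 1: 23.558857, 2: 29.081221}  # percent, from the output above
counts = {label: round(train_size * pct / 100) for label, pct in shares.items()}

majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / sum(counts.values())

# Always predicting the majority class: recall = 1, precision = its share.
precision = majority_accuracy
recall = 1.0
majority_f1 = 2 * precision * recall / (precision + recall)
# Every other class has F1 = 0, so weighted F1 = share * F1 of the majority class.
weighted_f1 = majority_accuracy * majority_f1

imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)
print(f"Majority-class accuracy:    {majority_accuracy:.4f}")
print(f"Majority-class weighted F1: {weighted_f1:.4f}")
print(f"Imbalance ratio (largest/smallest): {imbalance_ratio:.2f}")
```

Any trained model should clear these numbers comfortably; a model that only matches them has learned little beyond the class prior.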


### Step 2: Train Logistic Regression, SVM, and Naive Bayes Models

- **Objective**: Train baseline Logistic Regression, SVM, and Multinomial Naive Bayes models on the imbalanced TF-IDF training set.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# Initialize models
models = {
    'NaiveBayes': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
}

In [3]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import LabelBinarizer
from tqdm import tqdm
import joblib

# Train and evaluate
results = {} # store the performance of models
lb = LabelBinarizer()  # turns y_val into binary format for ROC-AUC

for name, model in tqdm(models.items(), desc="Training Models...."):
    # Train
    model.fit(tfidf_train, y_train)
        
    # Predict on validation set
    y_pred = model.predict(tfidf_val)
        
    # Compute metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_val, y_pred, average='weighted')
    precision_per_class, recall_per_class, f1_per_class, _ = precision_recall_fscore_support(y_val, y_pred)
        
    # ROC-AUC (one-vs-rest)
    y_val_bin = lb.fit_transform(y_val)
    y_pred_proba = model.predict_proba(tfidf_val)
    roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')
        
    # Store results
    results[name] = {
        'accuracy': accuracy,
        'precision_weighted': precision,
        'recall_weighted': recall,
        'f1_weighted': f1,
        'f1_per_class': f1_per_class,
        'roc_auc': roc_auc
    }
        
    # Save model
    joblib.dump(model, f"models/baseline_models/{name}_baseline.joblib")
    

Training Models....: 100%|██████████| 3/3 [01:05<00:00, 21.67s/it]
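
For readers unfamiliar with the one-vs-rest ROC-AUC computed in the loop above, the following standard-library sketch shows the idea: each class is scored against all the others, and the per-class AUCs are averaged with equal weight. The helper names `binary_auc` and `ovr_macro_auc` and the toy labels and probabilities are illustrative assumptions, not part of the notebook, and `roc_auc_score` uses a more efficient implementation.

```python
def binary_auc(is_positive, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = 0.0
    for ps in pos:
        for ns in neg:
            if ps > ns:
                wins += 1.0
            elif ps == ns:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba, classes):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    aucs = []
    for col, cls in enumerate(classes):
        is_positive = [label == cls for label in y_true]  # same role as LabelBinarizer
        class_scores = [row[col] for row in proba]
        aucs.append(binary_auc(is_positive, class_scores))
    return sum(aucs) / len(aucs), aucs


# Toy example (hypothetical data): 6 samples, 3 classes
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
]
macro_auc, per_class_auc = ovr_macro_auc(y_true, proba, classes=[0, 1, 2])
print("Per-class AUC:", per_class_auc)
print(f"Macro OvR AUC: {macro_auc:.4f}")
```

Because the average is unweighted, a poorly separated minority class pulls the score down as much as the majority class would.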


In [4]:
# Print results
for name, metrics in results.items():
    print(f"\n{name} Validation Metrics:")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


NaiveBayes Validation Metrics:
Accuracy: 0.6155
Weighted F1: 0.5797
F1 per Class (Neg, Pos, Neu): [0.71518987 0.44186047 0.47093023]
ROC-AUC: 0.7241

LogisticRegression Validation Metrics:
Accuracy: 0.5935
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.68304094 0.46938776 0.49376559]
ROC-AUC: 0.7188

SVM Validation Metrics:
Accuracy: 0.6103
Weighted F1: 0.5811
F1 per Class (Neg, Pos, Neu): [0.7063922  0.45864662 0.47645429]
ROC-AUC: 0.7183
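
Weighted F1 is the support-weighted mean of the per-class F1 scores, which is why it sits closer to the Negative-class score than to the other two. The sketch below recomputes it for Naive Bayes from the per-class values printed above. The validation class supports are not shown in this notebook, so the training proportions from Step 1 are used as stand-in weights, assuming a stratified split; the result is therefore only approximate.

```python
# Per-class F1 for Naive Bayes, copied from the output above (Neg, Pos, Neu)
f1_per_class = [0.71518987, 0.44186047, 0.47093023]

# Assumption: validation class shares match the training shares from Step 1
# (labels 0, 1, 2); the true validation supports are not printed in the notebook.
class_weights = [0.47359922, 0.23558857, 0.29081221]

weighted_f1 = sum(w * f for w, f in zip(class_weights, f1_per_class)) / sum(class_weights)
macro_f1 = sum(f1_per_class) / len(f1_per_class)

print(f"Approx. weighted F1: {weighted_f1:.4f}")
print(f"Macro F1:            {macro_f1:.4f}")
```

Macro F1 weights all three classes equally, so it comes out lower here and is the more sensitive number when studying imbalance.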


### Step 3: Hyperparameter Tuning

- **Objective**: Optimize hyperparameters for Logistic Regression, SVM, and Naive Bayes.

**Use RandomizedSearchCV**

#### Tuning 

In [5]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, loguniform
import pandas as pd

# Parameter distributions
param_dists = {
    'LogisticRegression': {
        'C': loguniform(1e-3, 1e3),
        'solver': ['lbfgs', 'liblinear']
    },
    'SVM': {
        'C': loguniform(1e-3, 1e3),
        'kernel': ['linear', 'rbf']
    },
    'NaiveBayes': {
        'alpha': uniform(0.01, 2.0)  # samples values from [0.01, 2.01)
    }
}

# Randomized search
tuned_results = {}
for name, model in tqdm(models.items(), desc="Tuning Models (RandomizedSearchCV)"):
    search = RandomizedSearchCV(
        model,
        param_distributions=param_dists[name],
        n_iter=20,  # number of combinations to try
        scoring='f1_weighted',
        cv=5,
        random_state=42,
        n_jobs=-1
    )
    search.fit(tfidf_train, y_train)

    best_model = search.best_estimator_
    y_pred = best_model.predict(tfidf_val)

    accuracy = accuracy_score(y_val, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_val, y_pred, average='weighted')
    f1_per_class = precision_recall_fscore_support(y_val, y_pred)[2]
    y_val_bin = lb.fit_transform(y_val)
    y_pred_proba = best_model.predict_proba(tfidf_val)
    roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')

    tuned_results[name] = {
        'best_params': search.best_params_,
        'accuracy': accuracy,
        'f1_weighted': f1,
        'f1_per_class': f1_per_class,
        'roc_auc': roc_auc
    }

    joblib.dump(best_model, f"models/baseline_models/{name}_tuned_random.joblib")

# Save results
pd.DataFrame(tuned_results).to_csv("models/baseline_models/tuned_results_random.csv")

# Print results
for name, metrics in tuned_results.items():
    print(f"\n{name} Tuned Validation Metrics (Randomized Search):")
    print(f"Best Parameters: {metrics['best_params']}")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")

Tuning Models (RandomizedSearchCV): 100%|██████████| 3/3 [13:15<00:00, 265.02s/it]


NaiveBayes Tuned Validation Metrics (Randomized Search):
Best Parameters: {'alpha': 0.32203728088487305}
Accuracy: 0.6090
Weighted F1: 0.5888
F1 per Class (Neg, Pos, Neu): [0.70642202 0.46853147 0.49489796]
ROC-AUC: 0.7264

LogisticRegression Tuned Validation Metrics (Randomized Search):
Best Parameters: {'C': 1.2173252504194045, 'solver': 'liblinear'}
Accuracy: 0.5987
Weighted F1: 0.5795
F1 per Class (Neg, Pos, Neu): [0.68792711 0.47552448 0.48704663]
ROC-AUC: 0.7193

SVM Tuned Validation Metrics (Randomized Search):
Best Parameters: {'C': 5.132347525675523, 'kernel': 'rbf'}
Accuracy: 0.5897
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.67788462 0.48684211 0.48792271]
ROC-AUC: 0.7080
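
The two distribution types in `param_dists` behave differently. `loguniform(1e-3, 1e3)` spreads samples evenly across orders of magnitude, while `uniform(0.01, 2.0)` takes a start (`loc`) and a width (`scale`), so it covers 0.01 to 2.01. The sketch below mimics both with the standard library only to make the shapes visible. The helper names are illustrative, and the code is a stand-in for the SciPy distributions rather than their implementation.

```python
import math
import random

rng = random.Random(42)


def sample_loguniform(low, high):
    """Draw x so that log10(x) is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def sample_uniform(loc, scale):
    """SciPy-style parameterization: uniform on [loc, loc + scale]."""
    return loc + scale * rng.random()


c_samples = [sample_loguniform(1e-3, 1e3) for _ in range(10_000)]
alpha_samples = [sample_uniform(0.01, 2.0) for _ in range(10_000)]

# Roughly half of the log-uniform draws fall below 1 (the midpoint in log space)
share_below_one = sum(c < 1 for c in c_samples) / len(c_samples)
print(f"C range: {min(c_samples):.4g} to {max(c_samples):.4g}")
print(f"Share of C draws below 1: {share_below_one:.2f}")
print(f"alpha range: {min(alpha_samples):.3f} to {max(alpha_samples):.3f}")
```

With only 20 draws per model, a log-uniform search over six orders of magnitude samples each decade about three times on average, which is coarse coverage for `C`.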





#### Optimized for multiclass

In [6]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, loguniform
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
import pandas as pd
from tqdm import tqdm
import joblib

# Parameter distributions for randomized search
param_dists = {
    'LogisticRegression': {
        'C': loguniform(1e-3, 1e3),
        'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'],
        'multi_class': ['multinomial'],  # explicit multinomial loss; this parameter is deprecated as of scikit-learn 1.5
        'penalty': ['l2'],
        'max_iter': [1000]
    },
    'SVM': {
        'C': loguniform(1e-3, 1e3),
        'kernel': ['linear', 'rbf']
    },
    'NaiveBayes': {
        'alpha': uniform(0.01, 2.0)
    }
}

# Run randomized hyperparameter tuning
tuned_results = {}

for name, model in tqdm(models.items(), desc="Tuning Models (Multiclass Optimized)"):
    try:
        search = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_dists[name],
            n_iter=20,
            scoring='f1_weighted',
            cv=5,
            random_state=42,
            n_jobs=-1
        )
        search.fit(tfidf_train, y_train)

        best_model = search.best_estimator_
        y_pred = best_model.predict(tfidf_val)

        # Compute evaluation metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(y_val, y_pred, average='weighted')
        f1_per_class = precision_recall_fscore_support(y_val, y_pred)[2]

        y_val_bin = lb.fit_transform(y_val)
        y_pred_proba = best_model.predict_proba(tfidf_val)
        roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')

        # Store results
        tuned_results[name] = {
            'best_params': search.best_params_,
            'accuracy': accuracy,
            'f1_weighted': f1,
            'f1_per_class': f1_per_class,
            'roc_auc': roc_auc
        }

        # Save the best model
        joblib.dump(best_model, f"models/baseline_models/{name}_tuned_multiclass.joblib")

    except Exception as e:
        print(f"Error tuning {name}: {str(e)}")

# Save all tuning results to CSV
pd.DataFrame(tuned_results).to_csv("models/baseline_models/tuned_results_multiclass.csv")

# Print the tuned results
for name, metrics in tuned_results.items():
    print(f"\n{name} Tuned Validation Metrics (Multiclass Randomized Search):")
    print(f"Best Parameters: {metrics['best_params']}")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


Tuning Models (Multiclass Optimized): 100%|██████████| 3/3 [11:58<00:00, 239.54s/it]


NaiveBayes Tuned Validation Metrics (Multiclass Randomized Search):
Best Parameters: {'alpha': 0.32203728088487305}
Accuracy: 0.6090
Weighted F1: 0.5888
F1 per Class (Neg, Pos, Neu): [0.70642202 0.46853147 0.49489796]
ROC-AUC: 0.7264

LogisticRegression Tuned Validation Metrics (Multiclass Randomized Search):
Best Parameters: {'C': 1.2173252504194045, 'max_iter': 1000, 'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'saga'}
Accuracy: 0.5897
Weighted F1: 0.5757
F1 per Class (Neg, Pos, Neu): [0.67852906 0.47333333 0.49140049]
ROC-AUC: 0.7179

SVM Tuned Validation Metrics (Multiclass Randomized Search):
Best Parameters: {'C': 5.132347525675523, 'kernel': 'rbf'}
Accuracy: 0.5897
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.67788462 0.48684211 0.48792271]
ROC-AUC: 0.7080





**Try GridSearchCV**

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import LabelBinarizer
from tqdm import tqdm
import joblib
import pandas as pd

# Grid parameter definitions
param_grids = {
    'LogisticRegression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'],
        'multi_class': ['multinomial'],  # deprecated as of scikit-learn 1.5
        'penalty': ['l2'],
        'max_iter': [1000]
    },
    'SVM': {
        'C': [0.01, 0.1, 1, 10, 100],
        'kernel': ['linear', 'rbf']
    },
    'NaiveBayes': {
        'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]
    }
}

# Grid search loop
tuned_results = {}
lb = LabelBinarizer()

for name, model in tqdm(models.items(), desc="Tuning Models (GridSearchCV)"):
    try:
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring='f1_weighted',
            cv=5,
            n_jobs=-1
        )
        grid_search.fit(tfidf_train, y_train)

        best_model = grid_search.best_estimator_
        y_pred = best_model.predict(tfidf_val)

        # Compute evaluation metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(y_val, y_pred, average='weighted')
        f1_per_class = precision_recall_fscore_support(y_val, y_pred)[2]

        y_val_bin = lb.fit_transform(y_val)
        y_pred_proba = best_model.predict_proba(tfidf_val)
        roc_auc = roc_auc_score(y_val_bin, y_pred_proba, multi_class='ovr')

        # Save results
        tuned_results[name] = {
            'best_params': grid_search.best_params_,
            'accuracy': accuracy,
            'f1_weighted': f1,
            'f1_per_class': f1_per_class,
            'roc_auc': roc_auc
        }

        # Save model
        joblib.dump(best_model, f"models/baseline_models/{name}_tuned_grid.joblib")

    except Exception as e:
        print(f"Error tuning {name}: {str(e)}")

# Save all tuning results
pd.DataFrame(tuned_results).to_csv("models/baseline_models/tuned_results_grid.csv")

# Print summary
for name, metrics in tuned_results.items():
    print(f"\n{name} Tuned Validation Metrics (Multiclass Grid Search):")
    print(f"Best Parameters: {metrics['best_params']}")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


Tuning Models (GridSearchCV): 100%|██████████| 3/3 [06:38<00:00, 132.77s/it]


NaiveBayes Tuned Validation Metrics (Multiclass Grid Search):
Best Parameters: {'alpha': 0.5}
Accuracy: 0.6168
Weighted F1: 0.5935
F1 per Class (Neg, Pos, Neu): [0.71061453 0.47826087 0.49604222]
ROC-AUC: 0.7265

LogisticRegression Tuned Validation Metrics (Multiclass Grid Search):
Best Parameters: {'C': 1, 'max_iter': 1000, 'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'lbfgs'}
Accuracy: 0.5935
Weighted F1: 0.5776
F1 per Class (Neg, Pos, Neu): [0.68304094 0.46938776 0.49376559]
ROC-AUC: 0.7188

SVM Tuned Validation Metrics (Multiclass Grid Search):
Best Parameters: {'C': 10, 'kernel': 'rbf'}
Accuracy: 0.5935
Weighted F1: 0.5817
F1 per Class (Neg, Pos, Neu): [0.6811071  0.49180328 0.49275362]
ROC-AUC: 0.7081





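Grid search gives the best validation weighted F1 for Naive Bayes (0.5935 vs. 0.5888 with randomized search) and SVM (0.5817 vs. 0.5776). For Logistic Regression it selects `C=1` with `lbfgs` and matches the untuned baseline (0.5776), marginally below the first randomized search (0.5795). The differences are small (under 0.01 in every case). The grid-tuned models are the ones evaluated on the test set in Step 5.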
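
The two strategies also differ in how many model fits they run. The sketch below counts candidates and cross-validation fits for the grids defined in the GridSearchCV cell, using only `itertools`. The dictionary mirrors `param_grids` from the notebook, and the helper name `count_fits` is illustrative.

```python
from itertools import product

param_grids = {
    'LogisticRegression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'],
        'multi_class': ['multinomial'],
        'penalty': ['l2'],
        'max_iter': [1000],
    },
    'SVM': {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']},
    'NaiveBayes': {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]},
}


def count_fits(grid, cv=5):
    """Return (number of candidates, number of CV fits) for an exhaustive grid."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * cv


fit_counts = {name: count_fits(grid) for name, grid in param_grids.items()}
for name, (n_candidates, n_fits) in fit_counts.items():
    print(f"{name}: {n_candidates} candidates, {n_fits} CV fits")
```

By comparison, each randomized search with `n_iter=20` and `cv=5` runs 100 cross-validation fits per model. Both search classes then refit the best candidate once on the full training set by default.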

### Step 4:  BanglaBERT Training

- **Objective**: Train BanglaBERT 

See the separate notebook: BanglaBert.ipynb

### Step 5: Evaluate on Test Set

- **Objective**: Evaluate tuned models on test set with confusion matrices and ROC-AUC curves.

In [17]:
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
import pandas as pd
import joblib
from tqdm import tqdm
import logging
logging.basicConfig(level=logging.INFO)  # without this, the logging.info() calls below are dropped
import scipy.sparse as sp
from sklearn.preprocessing import LabelBinarizer


In [18]:
# Load TF-IDF matrices
tfidf_train = sp.load_npz("text_representation/tfidf_train.npz")
tfidf_val = sp.load_npz("text_representation/tfidf_val.npz")
tfidf_test = sp.load_npz("text_representation/tfidf_test.npz")
    
# Load labels
y_train = pd.read_csv("text_representation/labels_train.csv")['Label'].values
y_val = pd.read_csv("text_representation/labels_val.csv")['Label'].values
y_test = pd.read_csv("text_representation/labels_test.csv")['Label'].values
    
# Load BERT tokens
bert_input_ids = np.load("text_representation/bert_input_ids.npy")
bert_attention_masks = np.load("text_representation/bert_attention_masks.npy")
    
# Verify shapes
print("TF-IDF Train Shape:", tfidf_train.shape)
print("Labels Train Shape:", y_train.shape)
print("BERT Input IDs Shape:", bert_input_ids.shape)
print("Label Distribution (Train):\n", pd.Series(y_train).value_counts(normalize=True) * 100)

TF-IDF Train Shape: (6193, 5000)
Labels Train Shape: (6193,)
BERT Input IDs Shape: (7743, 128)
Label Distribution (Train):
 0    47.359922
2    29.081221
1    23.558857
Name: proportion, dtype: float64


In [19]:
# Define class names
class_names = ['Negative', 'Positive', 'Neutral']

# Evaluate on test set
test_results = {}
lb = LabelBinarizer()  # turns y_test into one-hot format for ROC-AUC


In [23]:
for name in tqdm(['LogisticRegression', 'SVM', 'NaiveBayes'], desc="Evaluating Test Set"):
    try:
        model = joblib.load(f"models/baseline_models/{name}_tuned_grid.joblib")
        logging.info(f"Loaded model: {name}_tuned_grid.joblib")
        y_pred = model.predict(tfidf_test)

        accuracy = accuracy_score(y_test, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
        f1_per_class = precision_recall_fscore_support(y_test, y_pred)[2]
        y_test_bin = lb.fit_transform(y_test)
        y_pred_proba = model.predict_proba(tfidf_test)
        roc_auc = roc_auc_score(y_test_bin, y_pred_proba, multi_class='ovr')

        test_results[name] = {
            'accuracy': accuracy,
            'f1_weighted': f1,
            'f1_per_class': f1_per_class,
            'roc_auc': roc_auc
        }

        # Confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
        plt.title(f"{name} Confusion Matrix")
        plt.xlabel("Predicted")
        plt.ylabel("True")
        plt.savefig(f"models/baseline_models/{name}_cm.png")
        plt.close()
        logging.info(f"Saved confusion matrix: {name}_cm.png")

        # ROC-AUC curves (one-vs-rest)
        plt.figure(figsize=(10, 8))
        for i in range(len(class_names)):
            fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_pred_proba[:, i])
            auc = roc_auc_score(y_test_bin[:, i], y_pred_proba[:, i])
            plt.plot(fpr, tpr, label=f"{class_names[i]} (AUC = {auc:.2f})")
        plt.plot([0, 1], [0, 1], 'k--', label='Chance')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f"{name} ROC-AUC Curves (One-vs-Rest)")
        plt.legend(loc='lower right')
        plt.grid(True)
        plt.savefig(f"models/baseline_models/{name}_roc_auc.png")
        plt.close()
        logging.info(f"Saved ROC-AUC plot: {name}_roc_auc.png")

    except Exception as e:
        logging.error(f"Error evaluating {name}: {str(e)}")


Evaluating Test Set: 100%|██████████| 3/3 [00:02<00:00,  1.41it/s]


In [24]:
test_results

{'LogisticRegression': {'accuracy': 0.6296774193548387,
  'f1_weighted': 0.612207681458976,
  'f1_per_class': array([0.71559633, 0.52173913, 0.5171504 ]),
  'roc_auc': 0.7539578154887493},
 'SVM': {'accuracy': 0.6258064516129033,
  'f1_weighted': 0.61292540121353,
  'f1_per_class': array([0.70769231, 0.53035144, 0.5255102 ]),
  'roc_auc': 0.7348161708242125},
 'NaiveBayes': {'accuracy': 0.6219354838709678,
  'f1_weighted': 0.5957916071584985,
  'f1_per_class': array([0.71009772, 0.50359712, 0.48433048]),
  'roc_auc': 0.7443484209937052}}

In [25]:

pd.DataFrame(test_results).to_csv("models/baseline_models/test_results.csv")
logging.info("Test results saved: test_results.csv")
   

In [26]:
for name, metrics in test_results.items():
    print(f"\n{name} Test Metrics:")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Weighted F1: {metrics['f1_weighted']:.4f}")
    print(f"F1 per Class (Neg, Pos, Neu): {metrics['f1_per_class']}")
    print(f"ROC-AUC: {metrics['roc_auc']:.4f}")


LogisticRegression Test Metrics:
Accuracy: 0.6297
Weighted F1: 0.6122
F1 per Class (Neg, Pos, Neu): [0.71559633 0.52173913 0.5171504 ]
ROC-AUC: 0.7540

SVM Test Metrics:
Accuracy: 0.6258
Weighted F1: 0.6129
F1 per Class (Neg, Pos, Neu): [0.70769231 0.53035144 0.5255102 ]
ROC-AUC: 0.7348

NaiveBayes Test Metrics:
Accuracy: 0.6219
Weighted F1: 0.5958
F1 per Class (Neg, Pos, Neu): [0.71009772 0.50359712 0.48433048]
ROC-AUC: 0.7443
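
The per-class F1 values above can be read directly off a confusion matrix. As a reading aid for the saved `*_cm.png` plots, the sketch below derives precision, recall, and F1 per class from a matrix laid out the same way (rows are true labels, columns are predictions). The counts are hypothetical and chosen only for illustration; they are not the values from this notebook's plots, and `per_class_metrics` is an illustrative helper name.

```python
class_names = ['Negative', 'Positive', 'Neutral']

# Hypothetical counts for illustration only (rows = true class, columns = predicted);
# these are NOT the values from the saved *_cm.png plots.
cm = [
    [80, 8, 12],
    [20, 25, 5],
    [25, 5, 30],
]


def per_class_metrics(cm):
    """Precision, recall and F1 per class from a square confusion matrix."""
    n = len(cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        predicted_i = sum(cm[r][i] for r in range(n))  # column sum
        actual_i = sum(cm[i])                          # row sum
        precision = tp / predicted_i if predicted_i else 0.0
        recall = tp / actual_i if actual_i else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics


metrics = per_class_metrics(cm)
for name, (precision, recall, f1) in zip(class_names, metrics):
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In a matrix like this one, minority-class rows that leak into the Negative column are the typical signature of imbalance: recall drops for Positive and Neutral while Negative precision suffers.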


In [27]:
with open("models/baseline_models/README.txt", "w", encoding='utf-8') as f:
    f.write("Baseline Model Outputs:\n"
            "- *_baseline.joblib: Initial trained models\n"
            "- *_tuned_grid.joblib: Tuned models (GridSearchCV, multiclass optimized)\n"
            "- tuned_results_grid.csv: Tuning results\n"
            "- test_results.csv: Test set metrics\n"
            "- *_cm.png: Confusion matrix plots\n"
            "- *_roc_auc.png: ROC-AUC curve plots\n"
            "- banglabert_baseline/: BanglaBERT model (if trained)")
logging.info("README updated")
