## Phase 3: Baseline Model Training

**Objective**: Establish baseline sentiment-classification performance on the Bangla Sentiment Dataset with no imbalance mitigation applied. Four models are trained: Logistic Regression, SVM, Naive Bayes (MultinomialNB), and BanglaBERT. Their results serve as the reference point for quantifying the impact of class imbalance.

### Step 1: Load Preprocessed Data

```python
import pandas as pd
import numpy as np
import scipy.sparse as sp

# Load TF-IDF matrices (sparse matrices saved in .npz format)
tfidf_train = sp.load_npz("text_representation/tfidf_train.npz")
tfidf_val = sp.load_npz("text_representation/tfidf_val.npz")
tfidf_test = sp.load_npz("text_representation/tfidf_test.npz")

# Load labels (one integer class per row, in the 'Label' column)
y_train = pd.read_csv("text_representation/labels_train.csv")['Label'].values
y_val = pd.read_csv("text_representation/labels_val.csv")['Label'].values
y_test = pd.read_csv("text_representation/labels_test.csv")['Label'].values

# Load BERT tokens (input IDs and attention masks).
# Note: the printed shape shows more rows here than in the training split,
# so these arrays must be aligned with the labels before training.
bert_input_ids = np.load("text_representation/bert_input_ids.npy")
bert_attention_masks = np.load("text_representation/bert_attention_masks.npy")

# Verify shapes and inspect the class balance (percent per class)
print("TF-IDF Train Shape:", tfidf_train.shape)
print("Labels Train Shape:", y_train.shape)
print("BERT Input IDs Shape:", bert_input_ids.shape)
print("Label Distribution (Train):\n", pd.Series(y_train).value_counts(normalize=True) * 100)
```

```text
TF-IDF Train Shape: (6193, 5000)
Labels Train Shape: (6193,)
BERT Input IDs Shape: (7743, 128)
Label Distribution (Train):
 0    47.359922
2    29.081221
1    23.558857
Name: proportion, dtype: float64
```
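
Two things in the printed output are worth checking before any model is trained. First, the BERT input IDs have 7743 rows while the training labels have 6193. The ratio is roughly 0.80, which would be consistent with the BERT arrays covering the full, unsplit dataset, though that is an inference from the shapes alone. Second, the majority class (label 0) makes up about 47.4% of the training set, so a classifier that always predicts class 0 would already reach that accuracy. The sketch below illustrates both checks. The helper names `check_row_counts` and `majority_baseline` are hypothetical, and the per-class counts (2933, 1801, 1459) are back-computed from the printed percentages rather than read from the data files.

```python
import numpy as np


def check_row_counts(n_labels, shapes):
    """Return the names whose first dimension differs from n_labels.

    `shapes` maps a descriptive name to an array's .shape tuple.
    """
    return [name for name, shape in shapes.items() if shape[0] != n_labels]


def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()


# Shapes copied from the printed output; in the notebook, pass
# tfidf_train.shape and bert_input_ids.shape instead.
shapes = {
    "tfidf_train": (6193, 5000),
    "bert_input_ids": (7743, 128),
}
mismatched = check_row_counts(6193, shapes)
print("Row-count mismatches vs. y_train:", mismatched)

# Stand-in labels rebuilt from the printed percentages (assumed counts);
# in the notebook, pass y_train directly.
y_demo = np.repeat([0, 2, 1], [2933, 1801, 1459])
baseline = majority_baseline(y_demo)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Any baseline model trained later should be compared against this majority-class figure, and against per-class metrics such as macro-F1, rather than against raw accuracy alone.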
