#### TF-IDF + Logistic Regression baseline

Before moving to more advanced models, we first establish a simple baseline: TF-IDF vectorization combined with Logistic Regression. Its validation and test scores serve as the benchmark that more complex approaches must beat.



1. Load the splits

In [None]:
import pandas as pd

# Load the pre-made train / validation / test splits
train_df = pd.read_csv("../data/train.csv")
val_df   = pd.read_csv("../data/val.csv")
test_df  = pd.read_csv("../data/test.csv")

print("Train shape:", train_df.shape)
print("Val shape:", val_df.shape)
print("Test shape:", test_df.shape)


Train shape: (35918, 10)
Val shape: (4490, 10)
Test shape: (4490, 10)
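
Beyond the shapes, it can help to check that the splits have similar class proportions and that no validation or test text also appears in the training split. The helper below is a minimal sketch, not part of the original notebook: `split_summary` is a hypothetical name, the toy frames stand in for `train_df`, `val_df`, and `test_df`, and it assumes the `combined_text` and `label` columns used in the later steps.

```python
import pandas as pd

def split_summary(splits, text_col="combined_text", label_col="label"):
    """Hypothetical helper: row counts, label shares, and overlap with train."""
    train_texts = set(splits["train"][text_col])
    rows = []
    for name, df in splits.items():
        row = {"split": name, "rows": len(df)}
        # Share of each label within this split
        for label, share in df[label_col].value_counts(normalize=True).items():
            row[f"share_label_{label}"] = round(share, 3)
        # Texts that also occur in the training split (possible leakage)
        row["overlap_with_train"] = (
            0 if name == "train" else int(df[text_col].isin(train_texts).sum())
        )
        rows.append(row)
    return pd.DataFrame(rows).set_index("split")

# Toy stand-ins for the real splits (assumed data, for illustration only)
toy_splits = {
    "train": pd.DataFrame({"combined_text": ["a b", "c d", "e f", "g h"],
                           "label": [0, 1, 0, 1]}),
    "val": pd.DataFrame({"combined_text": ["a b", "x y"], "label": [0, 1]}),
    "test": pd.DataFrame({"combined_text": ["p q", "r s"], "label": [1, 1]}),
}
summary = split_summary(toy_splits)
print(summary)
```

On the real data, the same call would be `split_summary({"train": train_df, "val": val_df, "test": test_df})`. A non-zero overlap count or strongly different label shares would be worth investigating before trusting the baseline scores.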


2. Vectorize text with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=10000,     # keep the 10,000 most frequent terms in the training corpus
    ngram_range=(1, 2),     # unigrams + bigrams
    stop_words="english"    # drop scikit-learn's built-in English stop words
)

X_train = vectorizer.fit_transform(train_df["combined_text"])
X_val   = vectorizer.transform(val_df["combined_text"])
X_test  = vectorizer.transform(test_df["combined_text"])

y_train = train_df["label"]
y_val   = val_df["label"]
y_test  = test_df["label"]

3. Train logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=500,       # iteration cap; raise it if a convergence warning appears
    solver="liblinear"  # good for small/medium datasets
)
log_reg.fit(X_train, y_train)

4. Evaluate on the validation set

In [None]:
from sklearn.metrics import classification_report, accuracy_score

y_val_pred = log_reg.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print("\nClassification Report (Validation):\n", classification_report(y_val, y_val_pred))
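
The precision and recall figures in the classification report come from the confusion matrix. As a small worked example with made-up labels (not results from this dataset), the sketch below computes them by hand for the positive class and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up binary labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(cm)
print("Precision (by hand):", precision, "| sklearn:", precision_score(y_true, y_pred))
print("Recall (by hand):", recall, "| sklearn:", recall_score(y_true, y_pred))
```

Running `confusion_matrix(y_val, y_val_pred)` on the real predictions shows which classes the baseline confuses, which accuracy alone does not reveal.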

5. Evaluate on the test set

In [None]:
y_test_pred = log_reg.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nClassification Report (Test):\n", classification_report(y_test, y_test_pred))