## Bag of Words (Simple Baseline Model)

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score

In [2]:
# Load datasets
train_dataset = pd.read_csv("../dataset/dpm_pcl_train.csv")
val_dataset = pd.read_csv("../dataset/dpm_pcl_val.csv")
test_dataset = pd.read_csv("../dataset/dpm_pcl_test.csv")

# Binarize labels: original labels 0 and 1 map to class 0, everything else to class 1
def label_mapping(x):
    return 0 if x in (0, 1) else 1
train_dataset["label"] = train_dataset["orig_label"].apply(label_mapping)
val_dataset["label"] = val_dataset["orig_label"].apply(label_mapping)
test_dataset["label"] = test_dataset["orig_label"].apply(label_mapping)

# Handle missing values (assign back instead of using chained inplace=True,
# which pandas warns will stop working in pandas 3.0)
train_dataset["text"] = train_dataset["text"].fillna("")
val_dataset["text"] = val_dataset["text"].fillna("")
test_dataset["text"] = test_dataset["text"].fillna("")

In [3]:
# Extract TF-IDF-weighted bag-of-words features (vocabulary capped at 5000 terms)
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_dataset["text"])
X_val = vectorizer.transform(val_dataset["text"])
X_test = vectorizer.transform(test_dataset["text"])

y_train = train_dataset["label"]
y_val = val_dataset["label"]
y_test = test_dataset["label"]
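
To make the feature step concrete, here is a minimal pure-Python sketch of a plain bag-of-words count matrix. The toy corpus and the helper names `tokenize` and `bow_vector` are hypothetical and not part of this notebook. `TfidfVectorizer` goes further than this sketch: it reweights the counts by inverse document frequency and normalizes each row, so this only illustrates the underlying representation.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset
docs = ["the cat sat", "the cat sat on the mat"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # similar in spirit to scikit-learn's default tokenization
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token across the corpus, in sorted order
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def bow_vector(text):
    # One count per vocabulary term; terms absent from the text get 0
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]

bow = [bow_vector(doc) for doc in docs]
print(vocab)  # ['cat', 'mat', 'on', 'sat', 'the']
print(bow)    # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

As in the cell above, the vocabulary is built from the training texts only, and validation and test texts are mapped onto that fixed vocabulary.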

In [5]:
# Train a logistic regression classifier
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

In [6]:
# Evaluate on validation set
val_preds = clf.predict(X_val)
print("Validation Performance:")
print(classification_report(y_val, val_preds, digits=4))

# Evaluate on test set
test_preds = clf.predict(X_test)
print("Test Performance:")
print(classification_report(y_test, test_preds, digits=4))

# F1 for the positive class (label 1), which is the default for binary f1_score
val_f1 = f1_score(y_val, val_preds)
test_f1 = f1_score(y_test, test_preds)
print(f"Validation F1 Score: {val_f1:.4f}")
print(f"Test F1 Score: {test_f1:.4f}")

Validation Performance:
              precision    recall  f1-score   support

           0     0.9023    0.9993    0.9483      1506
           1     0.8571    0.0355    0.0682       169

    accuracy                         0.9021      1675
   macro avg     0.8797    0.5174    0.5083      1675
weighted avg     0.8977    0.9021    0.8595      1675

Test Performance:
              precision    recall  f1-score   support

           0     0.9066    0.9984    0.9503      1895
           1     0.5714    0.0201    0.0388       199

    accuracy                         0.9054      2094
   macro avg     0.7390    0.5093    0.4946      2094
weighted avg     0.8747    0.9054    0.8637      2094

Validation F1 Score: 0.0682
Test F1 Score: 0.0388
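
The gap between accuracy (about 0.90) and positive-class F1 (below 0.07) can be checked by hand. The sketch below recomputes F1 from the class-1 precision and recall printed in the reports above, and compares accuracy against a baseline that always predicts class 0, using the support counts from the same reports. The helper name `f1_from_pr` is hypothetical, and because the inputs are the rounded report values, the results agree with the printed scores only up to rounding.

```python
def f1_from_pr(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Class-1 precision and recall copied from the reports above
val_f1 = f1_from_pr(0.8571, 0.0355)
test_f1 = f1_from_pr(0.5714, 0.0201)

# Accuracy of always predicting class 0: class-0 support / total support
val_majority_acc = 1506 / 1675
test_majority_acc = 1895 / 2094

print(f"Validation F1 (recomputed): {val_f1:.4f}")
print(f"Test F1 (recomputed): {test_f1:.4f}")
print(f"Validation majority-class accuracy: {val_majority_acc:.4f}")
print(f"Test majority-class accuracy: {test_majority_acc:.4f}")
```

The majority-class accuracies come out at roughly 0.899 and 0.905, almost identical to the model's 0.9021 and 0.9054. The classifier is therefore predicting class 0 for nearly every example, which is why recall on class 1 is so low.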
