# 04 - Traditional ML Baselines

**Overview:** Build TF-IDF and engineered-feature baselines (Logistic Regression, SVM, XGBoost), then compare class-imbalance strategies and report baseline performance.

In [None]:
# Setup: traditional ML imports and data loader
from datasets import load_dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from src.utils import set_seed

set_seed(42)

# Load dataset
print("Loading dataset FutureMa/EvasionBench...")
ds = load_dataset("FutureMa/EvasionBench")
# load_dataset returns a DatasetDict when no split is requested; take the first split
if isinstance(ds, dict):
    ds = next(iter(ds.values()))
df = ds.to_pandas()
print("Dataset shape:", df.shape)

# Stratified 70/10/20 train/val/test split on the label column.
# 0.125 of the remaining 80% yields a validation set equal to 10% of the full data.
train, test = train_test_split(df, test_size=0.2, stratify=df['eva4b_label'], random_state=42)
train, val = train_test_split(train, test_size=0.125, stratify=train['eva4b_label'], random_state=42)
print("Splits:", train.shape, val.shape, test.shape)


**Objectives:**
- TF-IDF + Logistic Regression / XGBoost
- Engineered features and feature importance
- Class imbalance experiments
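
One common starting point for the class-imbalance experiments is inverse-frequency class weighting. The sketch below computes such weights with the formula `n_samples / (n_classes * count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The helper name `balanced_class_weights` and the toy label list are assumptions made for illustration; they are not part of this notebook or of EvasionBench.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for each distinct label."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy, hand-written labels (hypothetical; not drawn from the real dataset).
toy_labels = ["class_a"] * 6 + ["class_b"] * 3 + ["class_c"] * 1

weights = balanced_class_weights(toy_labels)
for label, weight in sorted(weights.items()):
    # Rarer classes receive proportionally larger weights.
    print(f"{label}: {weight:.3f}")
```

In this notebook the same helper could plausibly be applied to `train['eva4b_label'].tolist()`, and the resulting dictionary passed as the `class_weight` argument of `LogisticRegression` so that a weighted run can be compared against an unweighted one.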