# Final Ensemble Notebook

Goal: build a strong word + character TF‑IDF ensemble for the news classification task, evaluate it on a validation split, then train on all data and generate submission files for the competition.

High‑level steps:
- Construct a rich text field from source, title, and article.
- Define word‑level and character‑level TF‑IDF + Logistic Regression pipelines.
- Compare individual models and a weighted ensemble on a validation set.
- Retrain the best ensemble on all labeled data.
- Apply the ensemble to `evaluation.csv` and save submission CSVs.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

def build_text(df):
    src = df["source"].fillna("")
    title = df["title"].fillna("")
    art = df["article"].fillna("")
    return (src + " ") * 2 + title + " " + art   # <- source x2

train_df = pd.read_csv("development.csv")
train_df["text"] = build_text(train_df)

X = train_df["text"]
y = train_df["label"]

X_train_text, X_val_text, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train:", len(X_train_text), "Val:", len(X_val_text))

Train: 63997 Val: 16000


### Step 1 – Build text and create train/validation split

**Aim:** Combine `source`, `title`, and `article` into a single text field (with `source` repeated twice for extra weight), then create a stratified train/validation split.

- `build_text` defines how raw columns are merged into `text`.
- `train_df` loads `development.csv` and adds the `text` column.
- `train_test_split(..., stratify=y)` ensures label proportions are similar in train and validation.

**What the results show:**
- The printed line `Train: 63997 Val: 16000` shows that 79,997 labeled rows were loaded and split roughly 80/20, as requested by `test_size=0.2`.
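
To make the text recipe concrete, here is a small sketch that applies the same `build_text` to a made-up two-row frame. The rows are purely illustrative and are not taken from `development.csv`; the second row is there to show what happens when `source` and `article` are missing.

```python
import pandas as pd

def build_text(df):
    # Same recipe as the notebook cell: source twice, then title and article.
    src = df["source"].fillna("")
    title = df["title"].fillna("")
    art = df["article"].fillna("")
    return (src + " ") * 2 + title + " " + art

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "source": ["Reuters", None],
    "title": ["Rates rise", "Untitled"],
    "article": ["Central bank lifts rates", None],
})
toy_text = build_text(toy)

print(repr(toy_text.iloc[0]))
print(repr(toy_text.iloc[1]))
```

The first row starts with the source name twice. In the second row the missing `source` leaves only the two separator spaces at the front and the missing `article` leaves a trailing space, which the vectorizers' tokenization should tolerate.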

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

word_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 3),
        min_df=3,
        sublinear_tf=True,
        max_features=100000
    )),
    ("clf", LogisticRegression(
        C=2.0,
        max_iter=2000,
        class_weight="balanced"
    ))
])

### Step 2 – Define word‑level TF‑IDF + Logistic Regression pipeline

**Aim:** Create the main word‑based model that converts text into word n‑gram TF‑IDF features and trains a Logistic Regression classifier.

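- `TfidfVectorizer` uses word n‑grams `(1, 3)`, removes English stop words, drops terms that appear in fewer than 3 documents (`min_df=3`), applies sublinear term‑frequency scaling (`sublinear_tf=True`), and keeps at most 100,000 features.
- `LogisticRegression` uses `C=2.0` (the inverse of the regularization strength), `max_iter=2000` to give the solver room to converge on the high‑dimensional sparse input, and `class_weight='balanced'` to reweight classes inversely to their frequency.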

**What the results show:**
- No printed output yet; this cell just defines `word_pipeline` to be fitted later.
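
Two of the settings above are easy to state as formulas. The sketch below reimplements them in plain NumPy for illustration: sublinear scaling replaces a raw term count `tf` with `1 + log(tf)`, and the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The function names and toy inputs are assumptions made for this example; scikit-learn computes both internally.

```python
import numpy as np

def sublinear_tf(raw_counts):
    """Replace each non-zero term count tf with 1 + log(tf)."""
    counts = np.asarray(raw_counts, dtype=float)
    out = np.zeros_like(counts)
    nonzero = counts > 0
    out[nonzero] = 1.0 + np.log(counts[nonzero])
    return out

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# A term seen 100 times counts about 5.6x a term seen once, not 100x.
tf_scaled = sublinear_tf([0, 1, 10, 100])

# Toy labels: class 0 is three times as common as class 1.
toy_weights = balanced_class_weights([0, 0, 0, 1])

print(tf_scaled)
print(toy_weights)
```

In the toy labels the rare class receives weight 2.0 and the common class about 0.67, so each class contributes equally to the loss in total.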

In [8]:
char_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char_wb",
        ngram_range=(3, 5),
        min_df=3,
        sublinear_tf=True,
        max_features=200000
    )),
    ("clf", LogisticRegression(
        C=2.0,
        max_iter=2000,
        class_weight="balanced"
    ))
])

In [24]:
from sklearn.metrics import f1_score

# 4.a) Tune char ngram_range
char_ngram_grid = [(2, 4), (3, 5), (4, 6)]
best_ng = None
best_f1 = -1.0

for ng in char_ngram_grid:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            analyzer="char_wb",
            ngram_range=ng,
            min_df=3,
            sublinear_tf=True,
            max_features=200000,
        )),
        ("clf", LogisticRegression(
            C=2.0,
            max_iter=2000,
            class_weight="balanced",
        )),
    ])
    pipe.fit(X_train_text, y_train)
    pred = pipe.predict(X_val_text)
    f1 = f1_score(y_val, pred, average="macro")
    print(f"ngram_range={ng} -> macroF1={f1:.5f}")
    if f1 > best_f1:
        best_f1 = f1
        best_ng = ng

print("Best char ngram_range:", best_ng, "with macroF1=", round(best_f1, 5))

ngram_range=(2, 4) -> macroF1=0.70950
ngram_range=(3, 5) -> macroF1=0.71417
ngram_range=(4, 6) -> macroF1=0.71376
Best char ngram_range: (3, 5) with macroF1= 0.71417


In [25]:
# 4.b) Tune char min_df
char_min_df_grid = [2, 3, 4, 5]
best_min_df = None
best_f1_md = -1.0

for md in char_min_df_grid:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            analyzer="char_wb",
            ngram_range=(3, 5),
            min_df=md,
            sublinear_tf=True,
            max_features=200000,
        )),
        ("clf", LogisticRegression(
            C=2.0,
            max_iter=2000,
            class_weight="balanced",
        )),
    ])
    pipe.fit(X_train_text, y_train)
    pred = pipe.predict(X_val_text)
    f1 = f1_score(y_val, pred, average="macro")
    print(f"min_df={md} -> macroF1={f1:.5f}")
    if f1 > best_f1_md:
        best_f1_md = f1
        best_min_df = md

print("Best char min_df:", best_min_df, "with macroF1=", round(best_f1_md, 5))

min_df=2 -> macroF1=0.71335
min_df=3 -> macroF1=0.71417
min_df=4 -> macroF1=0.71394
min_df=5 -> macroF1=0.71315
Best char min_df: 3 with macroF1= 0.71417


In [26]:
# 4.c) Tune char max_features
char_max_feat_grid = [50000, 100000, 200000, None]
best_max_feat = None
best_f1_mf = -1.0

for mf in char_max_feat_grid:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            analyzer="char_wb",
            ngram_range=(3, 5),
            min_df=3,
            sublinear_tf=True,
            max_features=mf,
        )),
        ("clf", LogisticRegression(
            C=2.0,
            max_iter=2000,
            class_weight="balanced",
        )),
    ])
    pipe.fit(X_train_text, y_train)
    pred = pipe.predict(X_val_text)
    f1 = f1_score(y_val, pred, average="macro")
    print(f"max_features={mf} -> macroF1={f1:.5f}")
    if f1 > best_f1_mf:
        best_f1_mf = f1
        best_max_feat = mf

print("Best char max_features:", best_max_feat, "with macroF1=", round(best_f1_mf, 5))

max_features=50000 -> macroF1=0.71195
max_features=100000 -> macroF1=0.71326
max_features=200000 -> macroF1=0.71417
max_features=None -> macroF1=0.71374
Best char max_features: 200000 with macroF1= 0.71417


In [27]:
# 4.d) Tune char LogisticRegression C
char_C_grid = [0.5, 1.0, 2.0, 4.0]
best_C = None
best_f1_C = -1.0

for C_val in char_C_grid:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            analyzer="char_wb",
            ngram_range=(3, 5),
            min_df=3,
            sublinear_tf=True,
            max_features=200000,
        )),
        ("clf", LogisticRegression(
            C=C_val,
            max_iter=2000,
            class_weight="balanced",
        )),
    ])
    pipe.fit(X_train_text, y_train)
    pred = pipe.predict(X_val_text)
    f1 = f1_score(y_val, pred, average="macro")
    print(f"C={C_val} -> macroF1={f1:.5f}")
    if f1 > best_f1_C:
        best_f1_C = f1
        best_C = C_val

print("Best char C:", best_C, "with macroF1=", round(best_f1_C, 5))

C=0.5 -> macroF1=0.70617
C=1.0 -> macroF1=0.71001
C=2.0 -> macroF1=0.71417
C=4.0 -> macroF1=0.71046
Best char C: 2.0 with macroF1= 0.71417


### Step 3 – Define char‑level TF‑IDF + Logistic Regression pipeline

**Aim:** Build a complementary character‑level model that can capture subword patterns (e.g. prefixes, suffixes, typos) using character n‑grams.
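
As a rough illustration of what character n‑grams within word boundaries look like, the sketch below pads each whitespace‑separated word with a space on both sides and slides a window over it. This is a simplified approximation written for this note, not scikit-learn's own analyzer; in particular, words shorter than the window and preprocessing details may be handled differently there.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Approximate char_wb analysis: n-grams taken inside space-padded words."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the padded word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams

# Trigrams only, to keep the output short.
grams = char_wb_ngrams("Tax cuts", 3, 3)
print(grams)
```

Because no window spans two words, a token such as `"ts "` marks a word ending, which is how the char model can pick up suffixes and prefixes.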

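- `TfidfVectorizer` uses `analyzer='char_wb'` (character n‑grams built within word boundaries) with n‑grams `(3, 5)`, `min_df=3`, sublinear scaling, and up to 200k features.
- The same Logistic Regression setup (`C=2.0`, `max_iter=2000`, `class_weight='balanced'`) is used as for the word model.

**What the results show:**
- The cell that defines `char_pipeline` prints nothing. The tuning cells 4.a to 4.d vary one setting at a time on the validation split, and each one selects the value already used here: `ngram_range=(3, 5)`, `min_df=3`, `max_features=200000`, and `C=2.0`, all at macro F1 0.71417.
- The gaps between neighbouring settings are small (for example 0.71417 vs 0.71376 for `(3, 5)` vs `(4, 6)`), so this single split only weakly determines the choice.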
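
The tuning cells 4.a to 4.d all repeat the same loop: score each candidate value, print it, and keep the best. A minimal sketch of that pattern as one helper is shown below. The helper name `sweep` and the `score_fn` callback are assumptions introduced for this example; to keep it runnable without refitting anything, the scorer here simply looks up the macro F1 values that cell 4.a printed.

```python
def sweep(values, score_fn):
    """Score each candidate value and return (best_value, best_score)."""
    best_value, best_score = None, float("-inf")
    for value in values:
        score = score_fn(value)
        print(f"{value!r} -> macroF1={score:.5f}")
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score

# Stand-in scorer: the validation macro F1 values printed by cell 4.a.
recorded_ngram_f1 = {(2, 4): 0.70950, (3, 5): 0.71417, (4, 6): 0.71376}

best_ng, best_f1 = sweep(list(recorded_ngram_f1), recorded_ngram_f1.get)
print("Best char ngram_range:", best_ng, "with macroF1=", best_f1)
```

In the notebook itself, `score_fn` would instead fit a fresh copy of the char pipeline with the candidate value (for instance with `sklearn.base.clone` followed by `set_params(tfidf__ngram_range=value)`) and return the validation macro F1.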

In [9]:
from sklearn.metrics import f1_score
import numpy as np

word_pipeline.fit(X_train_text, y_train)
char_pipeline.fit(X_train_text, y_train)

word_proba = word_pipeline.predict_proba(X_val_text)
char_proba = char_pipeline.predict_proba(X_val_text)

# For reference: each model alone
pred_word = np.argmax(word_proba, axis=1)
pred_char = np.argmax(char_proba, axis=1)

print("Word-only macroF1:", round(f1_score(y_val, pred_word, average="macro"), 5))
print("Char-only macroF1:", round(f1_score(y_val, pred_char, average="macro"), 5))

Word-only macroF1: 0.71779
Char-only macroF1: 0.71417


In [23]:
import numpy as np
from sklearn.metrics import f1_score

# Try different weight combinations for word vs char
weight_grid = [(0.55, 0.45), (0.64, 0.36), (0.75, 0.25), (0.84, 0.16), (0.95, 0.05)]

best_f1 = -1.0
best_w = None

for w_word, w_char in weight_grid:
    final_proba = w_word * word_proba + w_char * char_proba
    pred_ens = np.argmax(final_proba, axis=1)
    f1 = f1_score(y_val, pred_ens, average="macro")
    print(f"w_word={w_word:.2f}, w_char={w_char:.2f} -> macroF1={f1:.5f}")
    if f1 > best_f1:
        best_f1 = f1
        best_w = (w_word, w_char)

print("Best weights:", best_w, "with macroF1=", round(best_f1, 5))

w_word=0.55, w_char=0.45 -> macroF1=0.72055
w_word=0.64, w_char=0.36 -> macroF1=0.72057
w_word=0.75, w_char=0.25 -> macroF1=0.72019
w_word=0.84, w_char=0.16 -> macroF1=0.72058
w_word=0.95, w_char=0.05 -> macroF1=0.71802
Best weights: (0.84, 0.16) with macroF1= 0.72058


### Step 4 – Fit both models and compare word vs char performance

**Aim:** Train the word and char pipelines on the training split and measure how each performs alone on the validation set using Macro F1.

- Both pipelines are fitted on `X_train_text, y_train`.
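- `predict_proba` is used to obtain class probabilities on `X_val_text`, and `argmax` converts them to predicted labels. This relies on the labels being the integers `0..K-1`, so that column positions in `classes_` coincide with the label values.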
- Macro F1 scores are computed and printed separately for word‑only and char‑only models.

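**What the results show:**
- `Word-only macroF1: 0.71779` is the validation score of the word‑level model.
- `Char-only macroF1: 0.71417` is the validation score of the char‑level model.
- The word model is slightly stronger on its own (by about 0.004 macro F1); these two scores are the baseline that the ensemble has to beat.
- The weight sweep in cell `In [23]` scores between 0.71802 and 0.72058 across the five weight pairs. The four pairs from (0.55, 0.45) to (0.84, 0.16) lie within 0.0004 of each other, so the ensemble is not sensitive to the exact weights in that range.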
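
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per‑class F1 scores, so rare classes count as much as common ones. The sketch below computes it by hand in NumPy on a toy example; it is meant to illustrate the definition and is not a replacement for `f1_score(..., average="macro")`.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples would score 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Class 0: TP=2, FP=0, FN=1 -> F1 = 0.8; class 1: TP=1, FP=1, FN=0 -> F1 = 2/3.
toy_score = macro_f1([0, 0, 0, 1], [0, 0, 1, 1])
print(round(toy_score, 5))
```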

In [10]:
import numpy as np
from sklearn.metrics import f1_score

# Predict probabilities with source x2 text
word_proba = word_pipeline.predict_proba(X_val_text)
char_proba = char_pipeline.predict_proba(X_val_text)

# Final ensemble (locked weights)
final_proba = 0.7 * word_proba + 0.3 * char_proba
pred_ens = np.argmax(final_proba, axis=1)

f1_ens = f1_score(y_val, pred_ens, average="macro")
print("FINAL Ensemble macroF1 (source x2):", round(f1_ens, 5))

FINAL Ensemble macroF1 (source x2): 0.72063


### Step 5 – Build a weighted word + char ensemble on validation

**Aim:** Combine the word and char models into a single ensemble using fixed weights (0.7 for word, 0.3 for char) and evaluate it on the validation set.

- Recomputes `predict_proba` for both models on `X_val_text` (the values are the same as in Step 4, since the models have not been refitted).
- Forms `final_proba = 0.7 * word_proba + 0.3 * char_proba` and takes `argmax` to get ensemble predictions.
- Computes Macro F1 for the ensemble and prints it.

**What the results show:**
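- `FINAL Ensemble macroF1 (source x2): 0.72063` is higher than both the word‑only (0.71779) and char‑only (0.71417) scores, a gain of about 0.003 over the better single model.
- The 0.7/0.3 weights were not part of the earlier weight grid but fall inside the flat region it revealed. Because the weights were evaluated and locked on the same validation split that reports the score, the gain is likely a mildly optimistic estimate.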
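
The blending step can be sketched independently of the fitted models. The helper below takes two probability matrices, forms the weighted average, and maps the winning column back to a label through an explicit `classes` array instead of assuming the labels are `0..K-1`. The function name, the toy probabilities, and the labels `10` and `20` are all made up for this example; in the notebook the `classes` argument would presumably be `word_pipeline.classes_`.

```python
import numpy as np

def blend_predict(word_proba, char_proba, classes, w_word=0.7):
    """Weighted average of two probability matrices, mapped back to labels."""
    word_proba = np.asarray(word_proba, dtype=float)
    char_proba = np.asarray(char_proba, dtype=float)
    final_proba = w_word * word_proba + (1.0 - w_word) * char_proba
    # argmax returns column positions; index into `classes` to get labels.
    return np.asarray(classes)[np.argmax(final_proba, axis=1)]

# Hypothetical probabilities for two samples and two classes.
toy_word = [[0.6, 0.4], [0.2, 0.8]]
toy_char = [[0.5, 0.5], [0.3, 0.7]]

toy_pred = blend_predict(toy_word, toy_char, classes=[10, 20])
print(toy_pred)
```

Both models must list their classes in the same order for the average to be meaningful; that holds when they are fitted on the same label vector, as they are here.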

In [11]:
# Rebuild full text with source x2
train_df["text"] = build_text(train_df)

X_full = train_df["text"]
y_full = train_df["label"]

# Train both models on ALL data
word_pipeline.fit(X_full, y_full)
char_pipeline.fit(X_full, y_full)

print("✅ Both models trained on full training data")

✅ Both models trained on full training data


### Step 6 – Retrain both models on all labeled data

**Aim:** After validating the ensemble, refit the word and char pipelines on the **full** training set so they can use every labeled example before generating submissions.

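- Rebuilds `text` for all rows using `build_text` (with `source` repeated). This repeats Step 1 and leaves the column unchanged; it only guards against the column having been modified in between.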
- Sets `X_full` and `y_full` and fits both pipelines on the entire dataset.

**What the results show:**
- The message `✅ Both models trained on full training data` confirms that training completed successfully on all labeled data.

In [12]:
eval_df = pd.read_csv("evaluation.csv")
eval_df["text"] = build_text(eval_df)

X_eval = eval_df["text"]

print("Eval samples:", len(X_eval))

Eval samples: 20000


### Step 7 – Prepare evaluation text

**Aim:** Load the competition evaluation set and build its `text` field in exactly the same way as for the training data, so the pipelines can be applied consistently.

- Reads `evaluation.csv` into `eval_df`.
- Applies `build_text` to create `eval_df["text"]`.
- Extracts `X_eval` as the text Series and prints the number of evaluation samples.

**What the results show:**
- `Eval samples: ...` confirms that the evaluation file was loaded correctly and that the number of rows matches expectations.

In [13]:
# Predict probabilities
word_proba_test = word_pipeline.predict_proba(X_eval)
char_proba_test = char_pipeline.predict_proba(X_eval)

# Ensemble
final_proba_test = 0.7 * word_proba_test + 0.3 * char_proba_test
preds_test = np.argmax(final_proba_test, axis=1)

# Submission
submission = pd.DataFrame({
    "Id": eval_df["Id"],
    "Predicted": preds_test
})

submission.to_csv("submission_ensemble_tuned.csv", index=False)

print("✅ submission_ensemble_tuned.csv created")
print(submission.head())

✅ submission_ensemble_tuned.csv created
   Id  Predicted
0   0          5
1   1          2
2   2          5
3   3          1
4   4          5


### Step 8 – Create tuned ensemble submission file

**Aim:** Apply the trained word + char ensemble to the evaluation set and save the main submission file `submission_ensemble_tuned.csv`.

- Computes `predict_proba` for both models on `X_eval`.
- Forms the same 0.7/0.3 ensemble and takes `argmax` to get final predictions.
- Builds a DataFrame with `Id` and `Predicted` and writes it to CSV.

**What the results show:**
- The message `✅ submission_ensemble_tuned.csv created` confirms the file was written.
- `submission.head()` lets you visually inspect the first few rows (Ids and predicted labels) to sanity‑check the format.
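
Beyond eyeballing `head()`, a few programmatic checks can catch format problems before uploading. The sketch below is one possible helper; the name `check_submission` and the specific checks are assumptions for illustration, and the competition's actual format requirements are not stated in this notebook.

```python
import pandas as pd

def check_submission(sub, expected_rows, valid_labels):
    """Return a list of problems found; an empty list means no issue was detected."""
    if list(sub.columns) != ["Id", "Predicted"]:
        return [f"unexpected columns: {list(sub.columns)}"]
    problems = []
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(sub)}")
    if sub["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if not sub["Predicted"].isin(list(valid_labels)).all():
        problems.append("predictions outside the known label set")
    return problems

# Hypothetical frames: one well-formed, one with three separate problems.
good = pd.DataFrame({"Id": [0, 1, 2], "Predicted": [5, 2, 5]})
bad = pd.DataFrame({"Id": [0, 0, 2], "Predicted": [5, 2, 9]})

good_report = check_submission(good, expected_rows=3, valid_labels=[1, 2, 5])
bad_report = check_submission(bad, expected_rows=4, valid_labels=[1, 2, 5])
print(good_report)
print(bad_report)
```

In the notebook this could be called as `check_submission(submission, len(eval_df), word_pipeline.classes_)`.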

In [14]:
import numpy as np
import pandas as pd

def build_text_src1(df):
    return df["source"].fillna("") + " " + df["title"].fillna("") + " " + df["article"].fillna("")

def build_text_src2(df):
    src = df["source"].fillna("")
    return (src + " ") * 2 + df["title"].fillna("") + " " + df["article"].fillna("")

def make_submission(text_builder, out_name):
    train_df = pd.read_csv("development.csv")
    eval_df  = pd.read_csv("evaluation.csv")

    X_full = text_builder(train_df)
    y_full = train_df["label"]
    X_eval = text_builder(eval_df)

    # fit full
    word_pipeline.fit(X_full, y_full)
    char_pipeline.fit(X_full, y_full)

    # predict
    wp = word_pipeline.predict_proba(X_eval)
    cp = char_pipeline.predict_proba(X_eval)
    preds = np.argmax(0.7*wp + 0.3*cp, axis=1)

    sub = pd.DataFrame({"Id": eval_df["Id"], "Predicted": preds})
    sub.to_csv(out_name, index=False)
    print("✅ wrote", out_name)

make_submission(build_text_src1, "submission_ensemble_src1.csv")
make_submission(build_text_src2, "submission_ensemble_src2.csv")

✅ wrote submission_ensemble_src1.csv
✅ wrote submission_ensemble_src2.csv


### Step 9 – Generate alternative submissions with different text recipes

**Aim:** Produce two extra submission files using slightly different ways of building the text field from `source`, `title`, and `article`, while keeping the same word + char ensemble.

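- `build_text_src1` uses `source` once; `build_text_src2` repeats `source` twice, which is the same recipe as `build_text` in Step 1.
- `make_submission` is a helper that, for a given text builder, fits both pipelines on all training data, predicts on evaluation text, ensembles with weights (0.7, 0.3), and writes a CSV. It refits the global `word_pipeline` and `char_pipeline` in place, so after this cell they hold the fit from the last call.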
- Two submissions are written: `submission_ensemble_src1.csv` and `submission_ensemble_src2.csv`.

**What the results show:**
- Lines like `✅ wrote submission_ensemble_src1.csv` confirm each file was created.
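- These variants let you compare or ensemble across slightly different input formulations if desired. Since `build_text_src2` matches `build_text`, `submission_ensemble_src2.csv` is expected to reproduce `submission_ensemble_tuned.csv`; only the `src1` file is a different variant.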
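
One simple way to compare two submissions is the fraction of rows on which they agree. The sketch below is a minimal version of that check; the function name and the toy predictions are made up for this example.

```python
import numpy as np

def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two prediction vectors are equal."""
    a = np.asarray(preds_a)
    b = np.asarray(preds_b)
    if a.shape != b.shape:
        raise ValueError("prediction vectors must have the same shape")
    return float(np.mean(a == b))

# Hypothetical predictions that differ on one of four rows.
rate = agreement_rate([5, 2, 5, 1], [5, 2, 4, 1])
print(rate)
```

With the files written above, the inputs would be the `Predicted` columns read back with `pd.read_csv`, after confirming that both files list the same `Id` values in the same order.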