# Semeval 2025 Task 10
### Subtask 2: Narrative Classification

Given a news article and a [two-level taxonomy of narrative labels](https://propaganda.math.unipd.it/semeval2025task10/NARRATIVE-TAXONOMIES.pdf) (where each narrative is subdivided into subnarratives) from a particular domain, assign to the article all the appropriate subnarrative labels. This is a multi-label multi-class document classification task.

In [1]:
random_state=None

In [2]:
import torch
import numpy as np
import random

if random_state:
    print('[WARNING] Setting random state')
    torch.manual_seed(random_state)
    np.random.seed(random_state) 
    random.seed(random_state)

## Ensemble Model Using Cross-Validation

As of now, we’ve been training on a combined dataset of all languages to handle limited data. Our final submission will focus on a single test set, but we want to leverage as much training data as possible.

One practical approach is to perform n-fold cross-validation on our combined set, which creates n different models—each learning slightly different patterns. 
* Then, at prediction time, we can load each fold model and average their outputs.
  By combining multiple models it is said that, we smooth out any biases or quirks of individual folds and often get more robust predictions.
* This ensemble method will help us get the most out of our data while still producing a single set of final predictions for submission.

In [3]:
import pickle
import pandas as pd
import os

root_dir = "../../"
base_save_folder_dir = '../saved/'
dataset_folder = os.path.join(base_save_folder_dir, 'Dataset')

with open(os.path.join(dataset_folder, 'dataset_train_cleaned.pkl'), 'rb') as f:
    dataset_train = pickle.load(f)

In [4]:
dataset_train.head()

Unnamed: 0,language,article_id,content,narratives,subnarratives,narratives_encoded,subnarratives_encoded,aggregated_subnarratives
0,RU,RU-URW-1161.txt,<PARA>в ближайшие два месяца сша будут стремит...,[URW: Blaming the war on others rather than th...,"[The West are the aggressors, Other, The West ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
1,RU,RU-URW-1175.txt,<PARA>в ес испугались последствий популярности...,"[URW: Discrediting the West, Diplomacy, URW: D...","[The West is weak, Other, The EU is divided]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
2,RU,RU-URW-1149.txt,<PARA>возможность признания аллы пугачевой ино...,[URW: Distrust towards Media],[Western media is an instrument of propaganda],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
3,RU,RU-URW-1015.txt,<PARA>азаров рассказал о смене риторики киева ...,"[URW: Discrediting Ukraine, URW: Discrediting ...","[Ukraine is a puppet of the West, Discrediting...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
4,RU,RU-URW-1001.txt,<PARA>в россиянах проснулась массовая любовь к...,[URW: Praise of Russia],[Russia is a guarantor of peace and prosperity],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."


In [5]:
misc_folder = os.path.join(base_save_folder_dir, 'Misc')

with open(os.path.join(misc_folder, 'narrative_to_subnarratives.pkl'), 'rb') as f:
    narrative_to_subnarratives = pickle.load(f)

In [6]:
with open(os.path.join(misc_folder, 'narrative_to_subnarratives_map.pkl'), 'rb') as f:
    narrative_to_sub_map = pickle.load(f)

In [7]:
with open(os.path.join(misc_folder, 'coarse_classes.pkl'), 'rb') as f:
    coarse_classes = pickle.load(f)

with open(os.path.join(misc_folder, 'fine_classes.pkl'), 'rb') as f:
    fine_classes = pickle.load(f)

with open(os.path.join(misc_folder, 'narrative_order.pkl'), 'rb') as f:
    narrative_order = pickle.load(f)

In [8]:
dataset_train.shape

(1781, 8)

In [9]:
narrative_to_subnarratives

{'URW: Discrediting Ukraine': ['Ukraine is a puppet of the West',
  'Rewriting Ukraine’s history',
  'Ukraine is a hub for criminal activities',
  'Discrediting Ukrainian nation and society',
  'Discrediting Ukrainian government and officials and policies',
  'Discrediting Ukrainian military',
  'Other',
  'Situation in Ukraine is hopeless',
  'Ukraine is associated with nazism'],
 'URW: Discrediting the West, Diplomacy': ['West is tired of Ukraine',
  'Diplomacy does/will not work',
  'The West is weak',
  'The EU is divided',
  'The West does not care about Ukraine, only about its interests',
  'Other',
  'The West is overreacting'],
 'URW: Praise of Russia': ['Russia has international support from a number of countries and people',
  'Praise of Russian military might',
  'Russia is a guarantor of peace and prosperity',
  'Other',
  'Praise of Russian President Vladimir Putin',
  'Russian invasion has strong national support'],
 'URW: Russia is the Victim': ['The West is russophobic'

In [10]:
label_encoder_folder = os.path.join(base_save_folder_dir, 'LabelEncoders')

with open(os.path.join(label_encoder_folder, 'mlb_narratives.pkl'), 'rb') as f:
    mlb_narratives = pickle.load(f)

with open(os.path.join(label_encoder_folder, 'mlb_subnarratives.pkl'), 'rb') as f:
    mlb_subnarratives = pickle.load(f)

In [11]:
import numpy as np

embeddings_folder = os.path.join(base_save_folder_dir, 'Embeddings/embeddings_train_stella.npy')

def load_embeddings(filename):
    return np.load(filename)

train_embeddings = load_embeddings(embeddings_folder)

In [12]:
train_embeddings.shape

(1781, 1024)

In [13]:
def filter_dataset_and_embeddings(dataset, embeddings, condition_fn):
    filtered_indices = dataset.index[dataset.apply(condition_fn, axis=1)].tolist()
    
    filtered_dataset = dataset.loc[filtered_indices]
    filtered_embeddings = embeddings[filtered_indices]

    return filtered_dataset, filtered_embeddings

In [14]:
with open(os.path.join(dataset_folder, 'dataset_val_cleaned.pkl'), 'rb') as f:
    dataset_val = pickle.load(f)

In [15]:
dataset_val.shape

(178, 8)

In [16]:
embeddings_folder = os.path.join(base_save_folder_dir, 'Embeddings/embeddings_dev_stella.npy')

val_embeddings = load_embeddings(embeddings_folder)

We keep as target language for final evaluation the English dataset

In [17]:
target_lang = "EN"

In [18]:
dataset_val_target, val_embeddings_target = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] == target_lang
)

In [19]:
dataset_val_non_target, val_embeddings_non_target = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] != target_lang
)

We combine both train and val datasets:

In [20]:
dataset_combined = pd.concat([dataset_train, dataset_val_non_target], ignore_index=True)

In [21]:
prefer_cpu=True

# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available() and not prefer_cpu
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


In [22]:
import torch

train_embeddings_tensor = torch.tensor(train_embeddings, dtype=torch.float32).to(device)
val_embeddings_tensor = torch.tensor(val_embeddings_non_target, dtype=torch.float32).to(device)

In [23]:
val_embeddings_target_tensor = torch.tensor(val_embeddings_target, dtype=torch.float32).to(device)

We also do the same for the embeddings:

In [24]:
embeddings_combined = torch.cat([train_embeddings_tensor, val_embeddings_tensor])

In [25]:
import numpy as np

def custom_shuffling(data, embeddings):
    shuffled_indices = np.arange(len(data))
    np.random.shuffle(shuffled_indices)
    
    data = data.iloc[shuffled_indices].reset_index(drop=True)
    embeddings = embeddings[shuffled_indices]

    return data, embeddings

In [26]:
dataset, embeddings = custom_shuffling(dataset_combined, embeddings_combined)

In [27]:
dataset.head()

Unnamed: 0,language,article_id,content,narratives,subnarratives,narratives_encoded,subnarratives_encoded,aggregated_subnarratives
0,EN,EN_UA_019640.txt,"<PARA>after North Korea’s Kim Jong Un, Putin a...","[URW: Praise of Russia, URW: Russia is the Vic...",[Russia has international support from a numbe...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
1,PT,PT_81.txt,<PARA>deputada do chega em cascais não poupa o...,"[URW: Discrediting Ukraine, URW: Discrediting ...","[Ukraine is associated with nazism, Ukraine is...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
2,EN,EN_UA_300050.txt,<PARA>is EU kowtowing to Germany's interests?<...,[Other],[Other],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
3,HI,HI_244.txt,<PARA>रूस ने यूक्रेन के ऊर्जा संयंत्रों को बना...,[URW: Speculating war outcomes],[Other],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
4,BG,BG_699.txt,<PARA>«държавите от НАТО създадоха за милиарди...,"[URW: Discrediting Ukraine, URW: Blaming the w...","[Ukraine is a puppet of the West, The West are...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."


In [28]:
y_sub_heads = dataset['aggregated_subnarratives'].to_numpy()

In [29]:
input_size = embeddings.shape[1]
print(input_size)

1024


In [30]:
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskClassifierMultiHead(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size=1024,
        num_narratives=len(mlb_narratives.classes_),
        narrative_to_sub_map=narrative_to_sub_map,
        dropout_rate=0.4
    ):
        super().__init__()
        self.shared_layer = nn.Sequential(
            nn.Linear(input_size, hidden_size * 2),
            nn.BatchNorm1d(hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        self.narrative_head = nn.Sequential(
            nn.Linear(hidden_size * 2, num_narratives),
            nn.Sigmoid()
        )

        self.subnarrative_heads = nn.ModuleDict()
        for narr_idx, sub_indices in narrative_to_sub_map.items():
            num_subs_for_this_narr = len(sub_indices)
            self.subnarrative_heads[str(narr_idx)] = nn.Sequential(
                nn.Linear(hidden_size * 2, num_subs_for_this_narr),
                nn.Sigmoid()
            )

    def forward(self, x):
        shared_out = self.shared_layer(x)
        narr_probs = self.narrative_head(shared_out)

        sub_probs_dict = {}
        for narr_idx, head in self.subnarrative_heads.items():
            sub_probs_dict[narr_idx] = head(shared_out)

        return narr_probs, sub_probs_dict

In [31]:
network_params = {
    'lr': 0.001,
    'hidden_size': 1024,
    'dropout': 0.4
}

In [32]:
y_nar = dataset['narratives_encoded'].tolist()

y_sub_nar = dataset['subnarratives_encoded'].tolist()

In [33]:
y_nar = torch.tensor(y_nar, dtype=torch.float32).to(device)
y_sub_nar = torch.tensor(y_sub_nar, dtype=torch.float32).to(device)

In [34]:
coarse_classes[:5]

['CC: Amplifying Climate Fears',
 'CC: Climate change is beneficial',
 'CC: Controversy about green technologies',
 'CC: Criticism of climate movement',
 'CC: Criticism of climate policies']

In [35]:
input_size = train_embeddings_tensor.shape[1]

In [36]:
fine_classes[:5]

['CC: Amplifying Climate Fears: Amplifying existing fears of global warming',
 'CC: Amplifying Climate Fears: Doomsday scenarios for humans',
 'CC: Amplifying Climate Fears: Earth will be uninhabitable soon',
 'CC: Amplifying Climate Fears: Other',
 'CC: Amplifying Climate Fears: Whatever we do it is already too late']

We define the same evaluator we use:

In [37]:
import os
from sklearn import metrics

class MultiHeadEvaluator:
    def __init__(
        self,
        classes_coarse=coarse_classes,
        classes_fine=fine_classes,
        narrative_to_sub_map=narrative_to_sub_map,
        narrative_order=narrative_order,
        narrative_classes=mlb_narratives.classes_,
        subnarrative_classes=mlb_subnarratives.classes_,
        device='cpu',
        output_dir='../../../submissions',
    ):
        self.narrative_to_sub_map = narrative_to_sub_map
        self.narrative_order = narrative_order
        self.narrative_classes = list(narrative_classes)
        self.subnarrative_classes = list(subnarrative_classes)
        
        self.classes_coarse = classes_coarse
        self.classes_fine = classes_fine
        
        self.device = torch.device(device)
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
    
    def evaluate(
        self,
        model,
        embeddings=val_embeddings_tensor,
        dataset=dataset_val_non_target,
        thresholds=None,
        save=False,
        std_weight=0.4,
        lower_thres=0.1,
        upper_thres=0.60
    ):
        if thresholds is None:
            thresholds = np.arange(lower_thres, upper_thres, 0.05)

        embeddings = embeddings.to(self.device)
        model.eval()

        best_results = {
            "best_coarse_f1": -1,
            "best_coarse_std": float("inf"),
            "best_fine_f1": -1,
            "best_fine_std": float("inf"),
            "narr_threshold": 0,
            "sub_threshold": 0,
            "predictions": None,
            "best_combined_score": -float("inf"),
        }

        with torch.no_grad():
            narr_probs, sub_probs_dict = model(embeddings)
            narr_probs = narr_probs.cpu().numpy()
            sub_probs_dict = {k: v.cpu().numpy() for k, v in sub_probs_dict.items()}

        for narr_threshold in thresholds:
            for sub_threshold in thresholds:
                predictions = []
                for sample_idx, row in dataset.iterrows():
                    pred = self._make_prediction(
                        row["article_id"],
                        sample_idx,
                        narr_probs,
                        sub_probs_dict,
                        narr_threshold,
                        sub_threshold,
                    )
                    predictions.append(pred)

                f1_coarse_mean, coarse_std, f1_fine_mean, fine_std = self._compute_metrics_coarse_fine(
                    predictions, dataset
                )
                combined_score = f1_fine_mean - (std_weight * coarse_std)

                if combined_score > best_results["best_combined_score"]:
                    best_results.update(
                        {
                            "best_coarse_f1": f1_coarse_mean,
                            "best_coarse_std": coarse_std,
                            "best_fine_f1": f1_fine_mean,
                            "best_fine_std": fine_std,
                            "narr_threshold": narr_threshold,
                            "sub_threshold": sub_threshold,
                            "predictions": predictions,
                            "best_combined_score": combined_score,
                        }
                    )

        print("\nBest thresholds found:")
        print(f"Narrative threshold: {best_results['narr_threshold']:.2f}")
        print(f"Subnarrative threshold: {best_results['sub_threshold']:.2f}")
        print("\nCompetition Values")
        print(f"Coarse-F1: {best_results['best_coarse_f1']:.3f}")
        print(f"F1 st. dev. coarse: {best_results['best_coarse_std']:.3f}")
        print(f"Fine-F1: {best_results['best_fine_f1']:.3f}")
        print(f"F1 st. dev. fine: {best_results['best_fine_std']:.3f}")

        if save:
            self._save_predictions(best_results, os.path.join(self.output_dir, "submission.txt"))

        return best_results

    def _make_prediction(self, article_id, sample_idx, narr_probs, sub_probs_dict, narr_threshold, sub_threshold):
        other_idx = self.narrative_classes.index("Other")
        active_narratives = [
            (n_idx, prob)
            for n_idx, prob in enumerate(narr_probs[sample_idx])
            if n_idx != other_idx and prob >= narr_threshold
        ]
        # Fallback, If no active narrartive, output "Other" for both
        # narrative and subnarratives.
        if not active_narratives:
            return {
                'article_id': article_id,
                'narratives': ["Other"],
                'pairs': ["Other"]
            }
        
        narratives = []
        pairs = []
        seen_pairs = set()
        
        active_narratives.sort(key=lambda x: x[1], reverse=True)
        for narr_idx, _ in active_narratives:
            narr_name = self.narrative_classes[narr_idx]
            
            sub_probs = sub_probs_dict[str(narr_idx)][sample_idx]
            # FInd active subnarratives based on the cur threshold
            active_subnarratives = [
                (local_idx, s_prob)
                for local_idx, s_prob in enumerate(sub_probs)
                if s_prob >= sub_threshold
            ]
            # If no active subnarrative, output the predicted Narrative, with Other
            # as a pair.
            active_subnarratives.sort(key=lambda x: x[1], reverse=True)
            if not active_subnarratives:
                pairs.append(f"{narr_name}: Other")
            else:
                for local_idx, _ in active_subnarratives:
                    global_sub_idx = self.narrative_to_sub_map[narr_idx][local_idx]
                    sub_name = self.subnarrative_classes[global_sub_idx]
                    pair = f"{narr_name}: {sub_name}"
                    if pair not in seen_pairs:
                        pairs.append(pair)
                        seen_pairs.add(pair)
            narratives.append(narr_name)
        
        return {
            'article_id': article_id,
            'narratives': narratives,
            'pairs': pairs
        }

    def _compute_metrics_coarse_fine(self, predictions, dataset):
        """
        Evaluates the problem predictions with the gold.
        Mimics the challenge evaluation function.
        """
        gold_coarse_all = []
        gold_fine_all = []
        pred_coarse_all = []
        pred_fine_all = []

        for pred, (_, row) in zip(predictions, dataset.iterrows()):
            gold_coarse = row['narratives']
            gold_subnarratives = row['subnarratives']
            
            pred_coarse = pred['narratives']
            pred_fine = []
            for p in pred['pairs']:
                if p == "Other":
                    pred_fine.append("Other")
                else:
                    pred_fine.append(p)

            gold_fine = []
            for gold_nar, gold_sub in zip(gold_coarse, gold_subnarratives):
                if gold_nar == "Other":
                    gold_fine.append("Other")
                else:
                    gold_fine.append(f"{gold_nar}: {gold_sub}")
            
            gold_coarse_all.append(gold_coarse)
            gold_fine_all.append(gold_fine)
            pred_coarse_all.append(pred_coarse)
            pred_fine_all.append(pred_fine)

        f1_coarse_mean, coarse_std = self._evaluate_multi_label(gold_coarse_all, pred_coarse_all, self.classes_coarse)
        f1_fine_mean, fine_std = self._evaluate_multi_label(gold_fine_all, pred_fine_all, self.classes_fine)
        
        gold_coarse_flat = []
        pred_coarse_flat = []
        for g_labels, p_labels in zip(gold_coarse_all, pred_coarse_all):
            g_onehot = np.zeros(len(self.classes_coarse), dtype=int)
            for lab in g_labels:
                if lab in self.classes_coarse:
                    g_onehot[self.classes_coarse.index(lab)] = 1
            p_onehot = np.zeros(len(self.classes_coarse), dtype=int)
            for lab in p_labels:
                if lab in self.classes_coarse:
                    p_onehot[self.classes_coarse.index(lab)] = 1
            gold_coarse_flat.append(g_onehot)
            pred_coarse_flat.append(p_onehot)
        gold_coarse_flat = np.array(gold_coarse_flat)
        pred_coarse_flat = np.array(pred_coarse_flat)
        report_coarse = metrics.classification_report(
            gold_coarse_flat, pred_coarse_flat, target_names=self.classes_coarse, zero_division=0
        )
        
        gold_fine_flat = []
        pred_fine_flat = []
        for g_labels, p_labels in zip(gold_fine_all, pred_fine_all):
            g_onehot = np.zeros(len(self.classes_fine), dtype=int)
            for lab in g_labels:
                if lab in self.classes_fine:
                    g_onehot[self.classes_fine.index(lab)] = 1
            p_onehot = np.zeros(len(self.classes_fine), dtype=int)
            for lab in p_labels:
                if lab in self.classes_fine:
                    p_onehot[self.classes_fine.index(lab)] = 1
            gold_fine_flat.append(g_onehot)
            pred_fine_flat.append(p_onehot)
        gold_fine_flat = np.array(gold_fine_flat)
        pred_fine_flat = np.array(pred_fine_flat)

        
        return f1_coarse_mean, coarse_std, f1_fine_mean, fine_std

    def _evaluate_multi_label(self, gold, predicted, class_list):
        """
        Evaluates the predicted, with the gold and returns the mean and std f1 scores.
        Mimics the challenge evaluation function.
        """
        f1_scores = []
        for g_labels, p_labels in zip(gold, predicted):
            g_onehot = np.zeros(len(class_list), dtype=int)
            for lab in g_labels:
                if lab in class_list:
                    g_onehot[class_list.index(lab)] = 1
                    
            p_onehot = np.zeros(len(class_list), dtype=int)
            for lab in p_labels:
                if lab in class_list:
                    p_onehot[class_list.index(lab)] = 1

            f1_doc = metrics.f1_score(g_onehot, p_onehot, zero_division=0)
            f1_scores.append(f1_doc)
        
        return float(np.mean(f1_scores)), float(np.std(f1_scores))

    def _save_predictions(self, best_results, filepath):
        predictions = best_results['predictions']
        if os.path.exists(filepath):
            os.remove(filepath)
        
        with open(filepath, 'w', encoding='utf-8') as f:
            for pred in predictions:
                line = (f"{pred['article_id']}\t"
                        f"{';'.join(pred['narratives'])}\t"
                        f"{';'.join(pred['pairs'])}\n")
                f.write(line)

In [38]:
def compute_class_weights(y_train):
    total_samples = y_train.shape[0]
    class_weights = []
    for label in range(y_train.shape[1]):
        pos_count = y_train[:, label].sum().item()
        neg_count = total_samples - pos_count
        pos_weight = total_samples / (2 * pos_count) if pos_count > 0 else 0
        neg_weight = total_samples / (2 * neg_count) if neg_count > 0 else 0
        class_weights.append((pos_weight, neg_weight))
    return class_weights

class WeightedBCELoss(nn.Module):
    def __init__(self, class_weights):
        super().__init__()
        self.class_weights = class_weights

    def forward(self, probs, targets):
        bce_loss = 0
        epsilon = 1e-7
        for i, (pos_weight, neg_weight) in enumerate(self.class_weights):
            prob = probs[:, i]
            bce = -pos_weight * targets[:, i] * torch.log(prob + epsilon) - \
                  neg_weight * (1 - targets[:, i]) * torch.log(1 - prob + epsilon)
            bce_loss += bce.mean()
        return bce_loss / len(self.class_weights)

We also define the same loss:

In [39]:
class MultiHeadLoss(nn.Module):
    def __init__(self, narrative_criterion, sub_criterion_dict, 
                 condition_weight=0.3, sub_weight=0.3):
        
        super().__init__()
        self.narrative_criterion = narrative_criterion
        self.sub_criterion_dict = sub_criterion_dict
        self.condition_weight = condition_weight
        self.sub_weight = sub_weight
        
    def forward(self, narr_probs, sub_probs_dict, y_narr, y_sub_heads):
        narr_loss = self.narrative_criterion(narr_probs, y_narr)
        sub_loss = 0.0
        condition_loss = 0.0
        
        for narr_idx_str, sub_probs in sub_probs_dict.items():
            narr_idx = int(narr_idx_str)
            y_sub = [row[narr_idx] for row in y_sub_heads]
            y_sub_tensor = torch.tensor(y_sub, dtype=torch.float32, device=sub_probs.device)
            
            sub_loss_func = self.sub_criterion_dict[narr_idx_str]
            sub_loss += sub_loss_func(sub_probs, y_sub_tensor)

            narr_pred = narr_probs[:, narr_idx].unsqueeze(1)
            condition_term = torch.mean(
                # Penalize high probs of sub, based on first level narr predictinos
                torch.abs(sub_probs * (1 - narr_pred)) + 
                # If a narrative is true, then the subnarrative predictions should match their actual true values.
                narr_pred * torch.abs(sub_probs - y_sub_tensor.unsqueeze(1))
            )
            condition_loss += condition_term
            
        sub_loss = sub_loss / len(sub_probs_dict)
        condition_loss = condition_loss / len(sub_probs_dict)
        
        total_loss = (1 - self.sub_weight) * narr_loss + \
                    self.sub_weight * sub_loss + \
                    self.condition_weight * condition_loss
        
        return total_loss

In [40]:
def train_with_multihead(
    model,
    optimizer,
    loss_fn,
    train_embeddings,
    y_train_nar,
    y_train_sub_heads,
    val_embeddings,
    y_val_nar,
    y_val_sub_heads,
    patience=10,
    num_epochs=100,
    scheduler=None,
    min_delta=0.001,
    show_progress=True
):
    best_val_loss = float('inf')
    best_model = None
    patience_counter = 0
    for epoch in range(num_epochs):
        model.train()
        train_narr_probs, train_sub_probs_dict = model(train_embeddings)
        train_loss = loss_fn(train_narr_probs, train_sub_probs_dict, y_train_nar, y_train_sub_heads)
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_narr_probs, val_sub_probs_dict = model(val_embeddings)
            val_loss = loss_fn(val_narr_probs, val_sub_probs_dict, y_val_nar, y_val_sub_heads)
            
        if show_progress:
            print(f"Epoch {epoch+1}/{num_epochs}, "
                  f"Training Loss: {train_loss.item():.4f}, "
                  f"Validation Loss: {val_loss.item():.4f}")

        if scheduler:
            scheduler.step(val_loss)
            current_lr = scheduler.optimizer.param_groups[0]['lr']
            print(f"Current Learning Rate: {current_lr:.6f}")

        if val_loss.item() < best_val_loss - min_delta:
            best_val_loss = val_loss.item()
            patience_counter = 0
            best_model = model.state_dict().copy()
        else:
            patience_counter += 1
            print(f"Validation loss did not significantly improve for {patience_counter} epoch(s).")

        if patience_counter >= patience:
            print("Early stopping triggered.")
            break

    if best_model:
        model.load_state_dict(best_model)
    return model

We first create a splitter, shuffle data and iterate each fold.
- Each fold gives us a train and val idx in order for us to access the train and val in that fold.
- We load our model, optimizer and class weights and then we train like usual.
- After that, we validate our model on the validation embeddings for that fold.
- Lastly, we save those models, their configs, and thresholds for later use during inference.
- We finally averages the performance across folds so that we can get an estimate of how the model does on different splits of our dataset.

In [41]:
import torch
from sklearn.model_selection import KFold
import numpy as np

class CrossValEnsembleTrainer:
    def __init__(
        self,
        model_class,
        embeddings=embeddings,
        dataset=dataset,
        y_nar=y_nar,
        y_sub_heads=y_sub_heads,
        input_size=input_size,
        hidden_size=1024,
        dropout_rate=0.4,
        target_dataset=dataset_val_target,
        target_embed=val_embeddings_target_tensor,
        lr=0.001,
        n_splits=5,
        patience=10,
        num_epochs=100,
    ):
        self.model_class = model_class
        self.embeddings = embeddings
        self.dataset = dataset
        self.y_nar = y_nar
        self.y_sub_heads = y_sub_heads
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate
        self.lr = lr
        self.n_splits = n_splits
        self.patience = patience
        self.num_epochs = num_epochs
        self.target_dataset = target_dataset
        self.target_embed=target_embed
        
        self.fold_models = []
        self.fold_results = []

    def fit(self, sub_weight=0.3, condition_weight=0.3):
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=42)
        
        for fold, (train_idx, val_idx) in enumerate(kf.split(self.embeddings.cpu(), self.dataset['language'])):
            print(f"\nTraining Fold {fold + 1}/{self.n_splits}")
            
            train_embeddings = self.embeddings[train_idx].to(device)
            val_embeddings_fold = self.embeddings[val_idx].to(device)
            
            train_nar = self.y_nar[train_idx].to(device)
            val_nar = self.y_nar[val_idx].to(device)
            
            train_sub = self.y_sub_heads[train_idx]
            val_sub = self.y_sub_heads[val_idx]
            
            model = self.model_class(
                input_size=self.input_size,
                hidden_size=self.hidden_size,
                dropout_rate=self.dropout_rate
            ).to(device)
            
            optimizer = torch.optim.AdamW(model.parameters(), lr=self.lr)

            class_weights_nar_fold = compute_class_weights(train_nar)
            narrative_criterion_fold = WeightedBCELoss(class_weights_nar_fold).to(device)
            
            sub_criterion_dict_fold = {}
            for narr_idx_str in model.subnarrative_heads.keys():
                narr_idx = int(narr_idx_str)
                y_sub_fold = torch.tensor([row[narr_idx] for row in train_sub]).to(device)
                class_weights_sub = compute_class_weights(y_sub_fold)
                sub_criterion_dict_fold[narr_idx_str] = WeightedBCELoss(class_weights_sub).to(device)

            loss_fn = MultiHeadLoss(
                narrative_criterion=narrative_criterion_fold,
                sub_criterion_dict=sub_criterion_dict_fold,
                sub_weight=sub_weight,
                condition_weight=condition_weight
            )
            
            trained_model = train_with_multihead(
                model=model,
                optimizer=optimizer,
                loss_fn=loss_fn,
                train_embeddings=train_embeddings,
                y_train_nar=train_nar,
                y_train_sub_heads=train_sub,
                val_embeddings=val_embeddings_fold,
                y_val_nar=val_nar,
                y_val_sub_heads=val_sub,
                patience=self.patience,
                num_epochs=self.num_epochs,
                show_progress=False
            )
            
            evaluator = MultiHeadEvaluator()
            fold_dataset = self.dataset.iloc[val_idx].reset_index(drop=True)
            metrics = evaluator.evaluate(
                trained_model, 
                val_embeddings_fold,
                dataset=fold_dataset
            )
            
            self.fold_models.append(trained_model.state_dict())
            self.fold_results.append({
                'fold': fold + 1,
                'metrics': metrics,
                'val_indices': val_idx
            })
            
            print(f"\nFold {fold + 1} Results:")
            print(f"Coarse-F1: {metrics['best_coarse_f1']:.3f}")
            print(f"Fine-F1: {metrics['best_fine_f1']:.3f}")

        self._display_cv_performance()
        return self.fold_models, self.fold_results


    def _display_cv_performance(self):
        print("\nCross-Validation Performance:")
        avg_coarse_f1 = np.mean([r['metrics']['best_coarse_f1'] for r in self.fold_results])
        std_coarse_f1 = np.std([r['metrics']['best_coarse_f1'] for r in self.fold_results])
        avg_fine_f1 = np.mean([r['metrics']['best_fine_f1'] for r in self.fold_results])
        std_fine_f1 = np.std([r['metrics']['best_fine_f1'] for r in self.fold_results])
        
        print(f"Average Coarse-F1: {avg_coarse_f1:.3f} ± {std_coarse_f1:.3f}")
        print(f"Average Fine-F1:   {avg_fine_f1:.3f} ± {std_fine_f1:.3f}")

In [42]:
simple_trainer = CrossValEnsembleTrainer(
    model_class=MultiTaskClassifierMultiHead,
    hidden_size=1024,
    dropout_rate=0.4,
    lr=0.001,
    n_splits=5,
    patience=10,
    num_epochs=100,
)
ensemble_simple_models, simple_fold_results = simple_trainer.fit(sub_weight=0.5)


Training Fold 1/5
Validation loss did not significantly improve for 1 epoch(s).
Validation loss did not significantly improve for 2 epoch(s).
Validation loss did not significantly improve for 3 epoch(s).
Validation loss did not significantly improve for 4 epoch(s).
Validation loss did not significantly improve for 5 epoch(s).
Validation loss did not significantly improve for 6 epoch(s).
Validation loss did not significantly improve for 7 epoch(s).
Validation loss did not significantly improve for 8 epoch(s).
Validation loss did not significantly improve for 9 epoch(s).
Validation loss did not significantly improve for 10 epoch(s).
Early stopping triggered.

Best thresholds found:
Narrative threshold: 0.55
Subnarrative threshold: 0.50

Competition Values
Coarse-F1: 0.576
F1 st. dev. coarse: 0.381
Fine-F1: 0.416
F1 st. dev. fine: 0.352

Fold 1 Results:
Coarse-F1: 0.576
Fine-F1: 0.416

Training Fold 2/5
Validation loss did not significantly improve for 1 epoch(s).
Validation loss did not

We can also see how our model did in a separated validation dataset that has articles targeted to the desired language we want to validate for. For that we need to create a wrapper with a forward method to calculate the predictions from embeddings.

In [43]:
class CVEnsembleWrapper:
    def __init__(self, fold_models, model_class, input_size, hidden_size, dropout_rate):
        self.fold_models = fold_models
        self.model_class = model_class
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate

    def __call__(self, embeddings):
        models = []
        for state_dict in self.fold_models:
            model = self.model_class(
                input_size=self.input_size,
                hidden_size=self.hidden_size,
                dropout_rate=self.dropout_rate
            ).to(device)
            model.load_state_dict(state_dict)
            model.eval()
            models.append(model)

        all_narr_probs = []
        all_sub_probs_dicts = []

        with torch.no_grad():
            for model in models:
                narr_probs, sub_probs_dict = model(embeddings)
                all_narr_probs.append(narr_probs)
                all_sub_probs_dicts.append(sub_probs_dict)

        avg_narr_probs = torch.mean(torch.stack(all_narr_probs), dim=0)

        avg_sub_probs_dict = {}
        for key in all_sub_probs_dicts[0].keys():
            sub_probs_stack = torch.stack([d[key] for d in all_sub_probs_dicts])
            avg_sub_probs_dict[key] = torch.mean(sub_probs_stack, dim=0)

        return avg_narr_probs, avg_sub_probs_dict

    def eval(self):
        return self

    def to(self, device):
        return self

In [44]:
cv_ensemble_simple = CVEnsembleWrapper(
    fold_models=ensemble_simple_models,
    model_class=MultiTaskClassifierMultiHead,
    input_size=input_size,
    hidden_size=1024,
    dropout_rate=0.4
)

In [45]:
evaluator = MultiHeadEvaluator()

We also create a evaluator wrapper for the ensemble:

In [46]:
def evaluate_ensemble(
    base_evaluator,
    ensemble_model,
    embeddings,
    dataset,
    save=False,
):
    embeddings = torch.tensor(embeddings, dtype=torch.float32).to(device)
    
    results = base_evaluator.evaluate(
        model=ensemble_model,
        embeddings=embeddings,
        dataset=dataset,
        save=save
    )
    
    return results

In [47]:
evaluator = MultiHeadEvaluator(device=device)
results = evaluate_ensemble(
    base_evaluator=evaluator,
    ensemble_model=cv_ensemble_simple,
    embeddings=val_embeddings_target,
    dataset=dataset_val_target.reset_index(),
)


Best thresholds found:
Narrative threshold: 0.55
Subnarrative threshold: 0.50

Competition Values
Coarse-F1: 0.466
F1 st. dev. coarse: 0.399
Fine-F1: 0.335
F1 st. dev. fine: 0.356


In [48]:
class MultiTaskClassifierMultiHeadConcat(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        num_narratives=len(mlb_narratives.classes_),
        narrative_to_sub_map=narrative_to_sub_map,
        dropout_rate=network_params['dropout']
    ):
        super().__init__()
        
        self.shared_layer = nn.Sequential(
            nn.Linear(input_size, hidden_size * 2),
            nn.BatchNorm1d(hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        self.narrative_head = nn.Sequential(
            nn.Linear(hidden_size * 2, num_narratives),
            nn.Sigmoid()
        )

        self.subnarrative_heads = nn.ModuleDict()
        for narr_idx, sub_indices in narrative_to_sub_map.items():
            num_subs_for_this_narr = len(sub_indices)
            self.subnarrative_heads[str(narr_idx)] = nn.Sequential(
                nn.Linear(hidden_size * 2 + 1, num_subs_for_this_narr),
                nn.Sigmoid()
            )

    def forward(self, x):
        shared_out = self.shared_layer(x)

        narr_probs = self.narrative_head(shared_out)

        sub_probs_dict = {}
        for narr_idx, head in self.subnarrative_heads.items():
            conditioned_input = torch.cat((shared_out, narr_probs[:, int(narr_idx)].unsqueeze(1)), dim=1)
            sub_probs_dict[narr_idx] = head(conditioned_input)

        return narr_probs, sub_probs_dict

In [49]:
trainer_concat = CrossValEnsembleTrainer(
    model_class=MultiTaskClassifierMultiHeadConcat,
    hidden_size=2048,
    dropout_rate=0.4,
    lr=0.001,
    n_splits=5,
    patience=10,
    num_epochs=100,
)
ensemble_models_concat, fold_results_concat = trainer_concat.fit()


Training Fold 1/5
Validation loss did not significantly improve for 1 epoch(s).
Validation loss did not significantly improve for 2 epoch(s).
Validation loss did not significantly improve for 3 epoch(s).
Validation loss did not significantly improve for 4 epoch(s).
Validation loss did not significantly improve for 5 epoch(s).
Validation loss did not significantly improve for 6 epoch(s).
Validation loss did not significantly improve for 7 epoch(s).
Validation loss did not significantly improve for 8 epoch(s).
Validation loss did not significantly improve for 9 epoch(s).
Validation loss did not significantly improve for 10 epoch(s).
Early stopping triggered.

Best thresholds found:
Narrative threshold: 0.55
Subnarrative threshold: 0.50

Competition Values
Coarse-F1: 0.581
F1 st. dev. coarse: 0.380
Fine-F1: 0.405
F1 st. dev. fine: 0.347

Fold 1 Results:
Coarse-F1: 0.581
Fine-F1: 0.405

Training Fold 2/5
Validation loss did not significantly improve for 1 epoch(s).
Validation loss did not

In [50]:
cv_ensemble_concat = CVEnsembleWrapper(
    fold_models=ensemble_models_concat,
    model_class=MultiTaskClassifierMultiHeadConcat,
    input_size=input_size,
    hidden_size=2048,
    dropout_rate=0.4
)

In [51]:
results = evaluate_ensemble(
    base_evaluator=evaluator,
    ensemble_model=cv_ensemble_concat,
    embeddings=val_embeddings_target,
    dataset=dataset_val_target.reset_index(),
)


Best thresholds found:
Narrative threshold: 0.55
Subnarrative threshold: 0.40

Competition Values
Coarse-F1: 0.478
F1 st. dev. coarse: 0.388
Fine-F1: 0.338
F1 st. dev. fine: 0.336
