# Semeval 2025 Task 10
### Subtask 2: Narrative Classification

Given a news article and a [two-level taxonomy of narrative labels](https://propaganda.math.unipd.it/semeval2025task10/NARRATIVE-TAXONOMIES.pdf) (where each narrative is subdivided into subnarratives) from a particular domain, assign to the article all the appropriate subnarrative labels. This is a multi-label multi-class document classification task.

In [1]:
random_state=None

In [2]:
import torch
import numpy as np
import random

if random_state:
    print('[WARNING] Setting random state')
    torch.manual_seed(random_state)
    np.random.seed(random_state) 
    random.seed(random_state)

## Loss Weighting by Language

As of now,  we trained a model in 5 different languages just so that we can face the problem of having limited data. Our final submission is going to be in one of those languages. 
This is our current target right now, to make our model somewhat focus on a specified language.

One way to account for that is to add an extra penalty in our current loss. We know the language of each training sample, so we can double or triple the loss for that sample, in a way to tell the model to pay more attention to it.

This can help improve performance in a target language, especially when having limited data.


We go ahead and do the boring stuff again by loading our pre-saved components.

In [3]:
import pickle
import os
import pandas as pd

root_dir = "../../"
base_save_folder_dir = '../saved/'
dataset_folder = os.path.join(base_save_folder_dir, 'Dataset')

with open(os.path.join(dataset_folder, 'dataset_train_cleaned.pkl'), 'rb') as f:
    dataset_train = pickle.load(f)

In [4]:
dataset_train.head()

Unnamed: 0,language,article_id,content,narratives,subnarratives,narratives_encoded,subnarratives_encoded,aggregated_subnarratives
0,RU,RU-URW-1161.txt,<PARA>в ближайшие два месяца сша будут стремит...,[URW: Blaming the war on others rather than th...,"[The West are the aggressors, Other, The West ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 1, 0, 0, 0], [1, 0, 0], [1, 0, 0, 0], [1,..."
1,RU,RU-URW-1175.txt,<PARA>в ес испугались последствий популярности...,"[URW: Discrediting the West, Diplomacy, URW: D...","[The West is weak, Other, The EU is divided]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 1, 0, 0, 0], [1, 0, 0], [1, 0, 0, 0], [1,..."
2,RU,RU-URW-1149.txt,<PARA>возможность признания аллы пугачевой ино...,[URW: Distrust towards Media],[Western media is an instrument of propaganda],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
3,RU,RU-URW-1015.txt,<PARA>азаров рассказал о смене риторики киева ...,"[URW: Discrediting Ukraine, URW: Discrediting ...","[Ukraine is a puppet of the West, Discrediting...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
4,RU,RU-URW-1001.txt,<PARA>в россиянах проснулась массовая любовь к...,[URW: Praise of Russia],[Russia is a guarantor of peace and prosperity],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."


In [5]:
dataset_train.shape

(1781, 8)

In [6]:
dataset_train.language.value_counts()

language
BG    401
PT    400
EN    399
HI    366
RU    215
Name: count, dtype: int64

In [7]:
misc_folder = os.path.join(base_save_folder_dir, 'Misc')

with open(os.path.join(misc_folder, 'narrative_to_subnarratives.pkl'), 'rb') as f:
    narrative_to_subnarratives = pickle.load(f)

In [8]:
with open(os.path.join(misc_folder, 'narrative_to_subnarratives_map.pkl'), 'rb') as f:
    narrative_to_sub_map = pickle.load(f)

In [9]:
with open(os.path.join(misc_folder, 'coarse_classes.pkl'), 'rb') as f:
    coarse_classes = pickle.load(f)

with open(os.path.join(misc_folder, 'fine_classes.pkl'), 'rb') as f:
    fine_classes = pickle.load(f)

with open(os.path.join(misc_folder, 'narrative_order.pkl'), 'rb') as f:
    narrative_order = pickle.load(f)

In [10]:
narrative_to_subnarratives

{'URW: Discrediting Ukraine': ['Discrediting Ukrainian government and officials and policies',
  'Discrediting Ukrainian nation and society',
  'Other',
  'Ukraine is associated with nazism',
  'Ukraine is a puppet of the West',
  'Rewriting Ukraine’s history',
  'Situation in Ukraine is hopeless',
  'Discrediting Ukrainian military',
  'Ukraine is a hub for criminal activities'],
 'URW: Discrediting the West, Diplomacy': ['Diplomacy does/will not work',
  'The EU is divided',
  'West is tired of Ukraine',
  'Other',
  'The West does not care about Ukraine, only about its interests',
  'The West is overreacting',
  'The West is weak'],
 'URW: Praise of Russia': ['Praise of Russian President Vladimir Putin',
  'Russia has international support from a number of countries and people',
  'Russia is a guarantor of peace and prosperity',
  'Other',
  'Russian invasion has strong national support',
  'Praise of Russian military might'],
 'URW: Russia is the Victim': ['Other',
  'The West is r

In [11]:
label_encoder_folder = os.path.join(base_save_folder_dir, 'LabelEncoders')

with open(os.path.join(label_encoder_folder, 'mlb_narratives.pkl'), 'rb') as f:
    mlb_narratives = pickle.load(f)

with open(os.path.join(label_encoder_folder, 'mlb_subnarratives.pkl'), 'rb') as f:
    mlb_subnarratives = pickle.load(f)

We will be using `Stella` embeddings, as they have proved quite better than `KaLM`.

In [12]:
import numpy as np

embeddings_folder = os.path.join(base_save_folder_dir, 'Embeddings/embeddings_train_kalm.npy')

def load_embeddings(filename):
    return np.load(filename)

train_embeddings = load_embeddings(embeddings_folder)

In [13]:
with open(os.path.join(dataset_folder, 'dataset_val_cleaned.pkl'), 'rb') as f:
    dataset_val = pickle.load(f)

In [14]:
embeddings_folder = os.path.join(base_save_folder_dir, 'Embeddings/embeddings_val_kalm.npy')

val_embeddings = load_embeddings(embeddings_folder)

In [15]:
def filter_dataset_and_embeddings(dataset, embeddings, condition_fn):
    filtered_indices = dataset.index[dataset.apply(condition_fn, axis=1)].tolist()
    
    filtered_dataset = dataset.loc[filtered_indices]
    filtered_embeddings = embeddings[filtered_indices]

    return filtered_dataset, filtered_embeddings

In [16]:
dataset_val, val_embeddings = filter_dataset_and_embeddings(
    dataset_val,
    val_embeddings, 
    lambda row: row["language"] == "EN"
)

In [17]:
import numpy as np

def custom_shuffling(data, embeddings):
    shuffled_indices = np.arange(len(data))
    np.random.shuffle(shuffled_indices)
    
    data = data.iloc[shuffled_indices].reset_index(drop=True)
    embeddings = embeddings[shuffled_indices]

    return data, embeddings

In [18]:
dataset_train, train_embeddings = custom_shuffling(dataset_train, train_embeddings)

In [19]:
dataset_val, val_embeddings = custom_shuffling(dataset_val, val_embeddings)

In [20]:
misc_folder = os.path.join(base_save_folder_dir, 'Misc')

In [21]:
y_train_sub_heads = dataset_train['aggregated_subnarratives'].to_numpy()
y_val_sub_heads = dataset_val['aggregated_subnarratives'].to_numpy()

In [22]:
prefer_cpu=True

# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available() and not prefer_cpu
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


In [23]:
import torch

train_embeddings_tensor = torch.tensor(train_embeddings, dtype=torch.float32).to(device)
val_embeddings_tensor = torch.tensor(val_embeddings, dtype=torch.float32).to(device)

In [24]:
input_size = train_embeddings_tensor.shape[1]
print(input_size)

896


In [25]:
network_params = {
    'lr': 0.001,
    'hidden_size': 1024,
    'dropout': 0.4
}

In [26]:
y_train_nar = dataset_train['narratives_encoded'].tolist()
y_val_nar = dataset_val['narratives_encoded'].tolist()

y_train_sub_nar = dataset_train['subnarratives_encoded'].tolist()
y_val_sub_nar = dataset_val['subnarratives_encoded'].tolist()

In [27]:
y_train_nar = torch.tensor(y_train_nar, dtype=torch.float32).to(device)
y_train_sub_nar = torch.tensor(y_train_sub_nar, dtype=torch.float32).to(device)

y_val_nar = torch.tensor(y_val_nar, dtype=torch.float32).to(device)
y_val_sub_nar = torch.tensor(y_val_sub_nar, dtype=torch.float32).to(device)

In [28]:
train_embeddings_tensor = torch.tensor(train_embeddings, dtype=torch.float32).to(device)
val_embeddings_tensor = torch.tensor(val_embeddings, dtype=torch.float32).to(device)

In [29]:
import torch
import torch.nn as nn

def compute_class_weights(y_train):
    total_samples = y_train.shape[0]
    class_weights = []
    for label in range(y_train.shape[1]):
        pos_count = y_train[:, label].sum().item()
        neg_count = total_samples - pos_count
        pos_weight = total_samples / (2 * pos_count) if pos_count > 0 else 0
        neg_weight = total_samples / (2 * neg_count) if neg_count > 0 else 0
        class_weights.append((pos_weight, neg_weight))
    return class_weights

class WeightedBCELoss(nn.Module):
    def __init__(self, class_weights):
        super().__init__()
        self.class_weights = class_weights

    def forward(self, probs, targets):
        bce_loss = 0
        epsilon = 1e-7
        for i, (pos_weight, neg_weight) in enumerate(self.class_weights):
            prob = probs[:, i]
            bce = -pos_weight * targets[:, i] * torch.log(prob + epsilon) - \
                  neg_weight * (1 - targets[:, i]) * torch.log(1 - prob + epsilon)
            bce_loss += bce.mean()
        return bce_loss / len(self.class_weights)

class_weights_sub_nar = compute_class_weights(y_val_sub_nar)
class_weights_nar = compute_class_weights(y_val_nar)
narrative_criterion = WeightedBCELoss(class_weights_nar)

In [30]:
sub_criterion_dict = {}

for narr_idx, sub_indices in narrative_to_sub_map.items():
    local_weights = [ class_weights_sub_nar[sub_i] for sub_i in sub_indices ]

    sub_criterion = WeightedBCELoss(local_weights)
    sub_criterion_dict[str(narr_idx)] = sub_criterion

We will also select the MultiHeadConcat Model, since this is the one appearing to do the best for our Fine-F1 score.

In [31]:
class MultiTaskClassifierMultiHeadConcat(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        num_narratives=len(mlb_narratives.classes_),
        narrative_to_sub_map=narrative_to_sub_map,
        dropout_rate=network_params['dropout']
    ):
        super().__init__()
        
        self.shared_layer = nn.Sequential(
            nn.Linear(input_size, hidden_size * 2),
            nn.BatchNorm1d(hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        self.narrative_head = nn.Sequential(
            nn.Linear(hidden_size * 2, num_narratives),
            nn.Sigmoid()
        )

        self.subnarrative_heads = nn.ModuleDict()
        for narr_idx, sub_indices in narrative_to_sub_map.items():
            num_subs_for_this_narr = len(sub_indices)
            self.subnarrative_heads[str(narr_idx)] = nn.Sequential(
                nn.Linear(hidden_size * 2 + 1, num_subs_for_this_narr),
                nn.Sigmoid()
            )

    def forward(self, x):
        shared_out = self.shared_layer(x)

        narr_probs = self.narrative_head(shared_out)

        sub_probs_dict = {}
        for narr_idx, head in self.subnarrative_heads.items():
            conditioned_input = torch.cat((shared_out, narr_probs[:, int(narr_idx)].unsqueeze(1)), dim=1)
            sub_probs_dict[narr_idx] = head(conditioned_input)

        return narr_probs, sub_probs_dict

In [32]:
model = MultiTaskClassifierMultiHeadConcat(
    input_size=input_size,
    hidden_size=2048
).to(device)

We add the extra penalty for english samples

* In the forwarding step, we check if the sample is an english one, and if it is we apply the extra weight upon the loss to essentially tell them model to pay more attention to those samples.

In [33]:
class LanguageAwareMultiHeadLoss(nn.Module):
    def __init__(self, narrative_criterion, sub_criterion_dict, 
                 condition_weight=0.3, sub_weight=0.3,
                 english_weight=3.0):
        
        super().__init__()
        self.narrative_criterion = narrative_criterion
        self.sub_criterion_dict = sub_criterion_dict
        self.condition_weight = condition_weight
        self.sub_weight = sub_weight
        self.english_weight = english_weight

    def forward(self, narr_probs, sub_probs_dict, y_narr, y_sub_heads, is_english):
        is_english = is_english.to(narr_probs.device)
        sample_weights = torch.where(is_english == 1, self.english_weight, 1.0)
        
        narr_loss = self.narrative_criterion(narr_probs, y_narr)
        narr_loss = (narr_loss * sample_weights.unsqueeze(1)).mean()

        sub_loss = 0.0
        condition_loss = 0.0

        for narr_idx_str, sub_probs in sub_probs_dict.items():
            narr_idx = int(narr_idx_str)
            y_sub = [row[narr_idx] for row in y_sub_heads]
            y_sub_tensor = torch.tensor(y_sub, dtype=torch.float32, device=sub_probs.device)
            sub_loss_func = self.sub_criterion_dict[narr_idx_str]
            sub_batch_loss = sub_loss_func(sub_probs, y_sub_tensor)
            sub_loss += (sub_batch_loss * sample_weights).mean()

            narr_pred = narr_probs[:, narr_idx].unsqueeze(1)
            condition_term = torch.abs(sub_probs * (1 - narr_pred)) + \
                             narr_pred * torch.abs(sub_probs - y_sub_tensor.unsqueeze(1))
            condition_term = (condition_term * sample_weights.unsqueeze(1)).mean()
            condition_loss += condition_term

        sub_loss = sub_loss / len(sub_probs_dict)
        condition_loss = condition_loss / len(sub_probs_dict)
        total_loss = (1 - self.sub_weight) * narr_loss + self.sub_weight * sub_loss + \
                     self.condition_weight * condition_loss
        return total_loss

We find the instances of our dataset that are English:

In [34]:
is_english_train = torch.tensor([1 if lang == 'EN' else 0 for lang in dataset_train['language']], 
                              dtype=torch.float32)

In [35]:
is_english_train

tensor([0., 0., 0.,  ..., 0., 1., 0.])

In [36]:
language_aware_loss = LanguageAwareMultiHeadLoss(
    narrative_criterion=narrative_criterion,
    sub_criterion_dict=sub_criterion_dict,
).to(device)

In [37]:
def train_with_multihead(
    model,
    optimizer,
    loss_fn=language_aware_loss,
    train_embeddings=train_embeddings_tensor,
    y_train_nar=y_train_nar,
    y_train_sub_heads=y_train_sub_heads,
    val_embeddings=val_embeddings_tensor,
    y_val_nar=y_val_nar,
    y_val_sub_heads=y_val_sub_heads,
    is_english_train=is_english_train, 
    patience=10,
    num_epochs=100,
    scheduler=None,
    min_delta=0.001
):
    best_val_loss = float('inf')
    best_model = None
    patience_counter = 0

    for epoch in range(num_epochs):
        model.train()
        train_narr_probs, train_sub_probs_dict = model(train_embeddings)

        train_loss = loss_fn(
            train_narr_probs, 
            train_sub_probs_dict, 
            y_train_nar, 
            y_train_sub_heads,
            is_english_train
        )

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_narr_probs, val_sub_probs_dict = model(val_embeddings)
            is_english_val = torch.ones(len(val_embeddings), device=val_embeddings.device)
            val_loss = loss_fn(
                val_narr_probs, 
                val_sub_probs_dict, 
                y_val_nar, 
                y_val_sub_heads,
                is_english_val
            )

        print(f"Epoch {epoch+1}/{num_epochs}, "
              f"Training Loss: {train_loss.item():.4f}, "
              f"Validation Loss: {val_loss.item():.4f}")

        if scheduler:
            scheduler.step(val_loss)
            current_lr = scheduler.optimizer.param_groups[0]['lr']
            print(f"Current Learning Rate: {current_lr:.6f}")

        if val_loss.item() < best_val_loss - min_delta:
            best_val_loss = val_loss.item()
            patience_counter = 0
            best_model = model.state_dict().copy()
        else:
            patience_counter += 1
            print(f"Validation loss did not significantly improve for {patience_counter} epoch(s).")

        if patience_counter >= patience:
            print("Early stopping triggered.")
            break

    if best_model:
        model.load_state_dict(best_model)
    return model

In [38]:
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

def initialize_and_train_model(
    model,
    num_epochs=100,
    lr=0.001,
    patience=10,
    use_scheduler=True,
    scheduler_patience=3,
    loss_fn=language_aware_loss,
    num_subnarratives=len(mlb_subnarratives.classes_),
    device=device
):
    optimizer = AdamW(model.parameters(), lr=lr)

    scheduler = None
    if use_scheduler:
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=scheduler_patience)

    trained_model = train_with_multihead(
                                    model=model,
                                    optimizer=optimizer,
                                    scheduler=scheduler,
                                    loss_fn=loss_fn,
                                    patience=patience
                                ).to(device)
    return trained_model, optimizer, scheduler

In [39]:
import torch.nn.functional as F

trained_model, _, _ = initialize_and_train_model(
    model,
    patience=10,
    device=device
)

Epoch 1/100, Training Loss: 1.1729, Validation Loss: 2.3293
Current Learning Rate: 0.001000
Epoch 2/100, Training Loss: 0.7802, Validation Loss: 2.2946
Current Learning Rate: 0.001000
Epoch 3/100, Training Loss: 0.6632, Validation Loss: 2.2670
Current Learning Rate: 0.001000
Epoch 4/100, Training Loss: 0.6027, Validation Loss: 2.2429
Current Learning Rate: 0.001000
Epoch 5/100, Training Loss: 0.5603, Validation Loss: 2.2185
Current Learning Rate: 0.001000
Epoch 6/100, Training Loss: 0.5223, Validation Loss: 2.1925
Current Learning Rate: 0.001000
Epoch 7/100, Training Loss: 0.4911, Validation Loss: 2.1644
Current Learning Rate: 0.001000
Epoch 8/100, Training Loss: 0.4688, Validation Loss: 2.1351
Current Learning Rate: 0.001000
Epoch 9/100, Training Loss: 0.4497, Validation Loss: 2.1059
Current Learning Rate: 0.001000
Epoch 10/100, Training Loss: 0.4376, Validation Loss: 2.0767
Current Learning Rate: 0.001000
Epoch 11/100, Training Loss: 0.4234, Validation Loss: 2.0476
Current Learning R

In [40]:
coarse_classes

['CC: Amplifying Climate Fears',
 'CC: Climate change is beneficial',
 'CC: Controversy about green technologies',
 'CC: Criticism of climate movement',
 'CC: Criticism of climate policies',
 'CC: Criticism of institutions and authorities',
 'CC: Downplaying climate change',
 'CC: Green policies are geopolitical instruments',
 'CC: Hidden plots by secret schemes of powerful groups',
 'CC: Questioning the measurements and science',
 'Other',
 'URW: Amplifying war-related fears',
 'URW: Blaming the war on others rather than the invader',
 'URW: Discrediting Ukraine',
 'URW: Discrediting the West, Diplomacy',
 'URW: Distrust towards Media',
 'URW: Hidden plots by secret schemes of powerful groups',
 'URW: Negative Consequences for the West',
 'URW: Overpraising the West',
 'URW: Praise of Russia',
 'URW: Russia is the Victim',
 'URW: Speculating war outcomes']

In [41]:
fine_classes[:15]

['CC: Amplifying Climate Fears: Amplifying existing fears of global warming',
 'CC: Amplifying Climate Fears: Doomsday scenarios for humans',
 'CC: Amplifying Climate Fears: Earth will be uninhabitable soon',
 'CC: Amplifying Climate Fears: Other',
 'CC: Amplifying Climate Fears: Whatever we do it is already too late',
 'CC: Climate change is beneficial: CO2 is beneficial',
 'CC: Climate change is beneficial: Other',
 'CC: Climate change is beneficial: Temperature increase is beneficial',
 'CC: Controversy about green technologies: Other',
 'CC: Controversy about green technologies: Renewable energy is costly',
 'CC: Controversy about green technologies: Renewable energy is dangerous',
 'CC: Controversy about green technologies: Renewable energy is unreliable',
 'CC: Criticism of climate movement: Ad hominem attacks on key activists',
 'CC: Criticism of climate movement: Climate movement is alarmist',
 'CC: Criticism of climate movement: Climate movement is corrupt']

In [42]:
import os
from sklearn import metrics

class MultiHeadEvaluator:
    def __init__(
        self,
        classes_coarse=coarse_classes,
        classes_fine=fine_classes,
        narrative_to_sub_map=narrative_to_sub_map,
        narrative_order=narrative_order,
        narrative_classes=mlb_narratives.classes_,
        subnarrative_classes=mlb_subnarratives.classes_,
        device='cpu',
        output_dir='../../../submissions',
    ):
        self.narrative_to_sub_map = narrative_to_sub_map
        self.narrative_order = narrative_order
        self.narrative_classes = list(narrative_classes)
        self.subnarrative_classes = list(subnarrative_classes)
        
        self.classes_coarse = classes_coarse
        self.classes_fine = classes_fine

        self.device = device
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
    
    def evaluate(
        self,
        model,
        embeddings=val_embeddings_tensor,
        dataset=dataset_val,
        thresholds=None,
        save=False,
        std_weight=0.4,
        lower_thres=0.1,
        upper_thres=0.60
    ):
        if thresholds is None:
            thresholds = np.arange(lower_thres, upper_thres, 0.05)    
        embeddings = embeddings.to(self.device)
    
        best_results = {
            'best_coarse_f1': -1,
            'best_coarse_std': float('inf'),
            'best_fine_f1': -1,
            'best_fine_std': float('inf'),
            'narr_threshold': 0,
            'sub_threshold': 0,
            'predictions': None,
            'best_combined_score': -float('inf'),
            'coarse_classification_report': None,
            'fine_precision': None,
            'fine_recall': None,
            'samples_f1_fine': None,
        }
    
        with torch.no_grad():
            narr_probs, sub_probs_dict = model(embeddings)
            narr_probs = narr_probs.cpu().numpy()
            sub_probs_dict = {k: v.cpu().numpy() for k, v in sub_probs_dict.items()}
    
        for narr_threshold in thresholds:
            for sub_threshold in thresholds:
                predictions = []
                for sample_idx, row in dataset.iterrows():
                    pred = self._make_prediction(
                        row['article_id'],
                        sample_idx,
                        narr_probs,
                        sub_probs_dict,
                        narr_threshold,
                        sub_threshold
                    )
                    predictions.append(pred)
                
                f1_coarse_mean, coarse_std, f1_fine_mean, fine_std, report_coarse, precision_fine, recall_fine, samples_f1_fine = self._compute_metrics_coarse_fine(predictions, dataset)
                
                combined_score = f1_fine_mean - (std_weight * coarse_std)
                
                if combined_score > best_results['best_combined_score']:
                    best_results.update({
                        'best_coarse_f1': f1_coarse_mean,
                        'best_coarse_std': coarse_std,
                        'best_fine_f1': f1_fine_mean,
                        'best_fine_std': fine_std,
                        'narr_threshold': narr_threshold,
                        'sub_threshold': sub_threshold,
                        'predictions': predictions,
                        'best_combined_score': combined_score,
                        'coarse_classification_report': report_coarse,
                        'fine_precision': precision_fine,
                        'fine_recall': recall_fine,
                        'samples_f1_fine': samples_f1_fine,
                    })
    
        print("\nBest thresholds found:")
        print(f"Narrative threshold: {best_results['narr_threshold']:.2f}")
        print(f"Subnarrative threshold: {best_results['sub_threshold']:.2f}")
        print('\nCompetition Values')
        print(f"Coarse-F1: {best_results['best_coarse_f1']:.3f}")
        print(f"F1 st. dev. coarse: {best_results['best_coarse_std']:.3f}")
        print(f"Fine-F1: {best_results['best_fine_f1']:.3f}")
        print(f"F1 st. dev. fine: {best_results['best_fine_std']:.3f}")
        print("\nCoarse Classification Report:")
        print(best_results['coarse_classification_report'])
        print("\nFine Metrics:")
        print("Precision: {:.3f}".format(best_results['fine_precision']))
        print("Recall: {:.3f}".format(best_results['fine_recall']))
        print("F1 Samples: {:.3f}".format(best_results['samples_f1_fine']))

        if save:
            self._save_predictions(best_results, os.path.join(self.output_dir, 'submission.txt'))
        
        return best_results

    def _make_prediction(self, article_id, sample_idx, narr_probs, sub_probs_dict, narr_threshold, sub_threshold):
        other_idx = self.narrative_classes.index("Other")
        active_narratives = [
            (n_idx, prob)
            for n_idx, prob in enumerate(narr_probs[sample_idx])
            if n_idx != other_idx and prob >= narr_threshold
        ]
        # Fallback, If no active narrartive, output "Other" for both
        # narrative and subnarratives.
        if not active_narratives:
            return {
                'article_id': article_id,
                'narratives': ["Other"],
                'pairs': ["Other"]
            }
        
        narratives = []
        pairs = []
        seen_pairs = set()
        
        active_narratives.sort(key=lambda x: x[1], reverse=True)
        for narr_idx, _ in active_narratives:
            narr_name = self.narrative_classes[narr_idx]
            
            sub_probs = sub_probs_dict[str(narr_idx)][sample_idx]
            # FInd active subnarratives based on the cur threshold
            active_subnarratives = [
                (local_idx, s_prob)
                for local_idx, s_prob in enumerate(sub_probs)
                if s_prob >= sub_threshold
            ]
            # If no active subnarrative, output the predicted Narrative, with Other
            # as a pair.
            active_subnarratives.sort(key=lambda x: x[1], reverse=True)
            if not active_subnarratives:
                pairs.append(f"{narr_name}: Other")
            else:
                for local_idx, _ in active_subnarratives:
                    global_sub_idx = self.narrative_to_sub_map[narr_idx][local_idx]
                    sub_name = self.subnarrative_classes[global_sub_idx]
                    pair = f"{narr_name}: {sub_name}"
                    if pair not in seen_pairs:
                        pairs.append(pair)
                        seen_pairs.add(pair)
            narratives.append(narr_name)
        
        return {
            'article_id': article_id,
            'narratives': narratives,
            'pairs': pairs
        }

    def _compute_metrics_coarse_fine(self, predictions, dataset):
        """
        Evaluates the problem predictions with the gold.
        Mimics the challenge evaluation function.
        """
        gold_coarse_all = []
        gold_fine_all = []
        pred_coarse_all = []
        pred_fine_all = []

        for pred, (_, row) in zip(predictions, dataset.iterrows()):
            gold_coarse = row['narratives']
            gold_subnarratives = row['subnarratives']
            
            pred_coarse = pred['narratives']
            pred_fine = []
            for p in pred['pairs']:
                if p == "Other":
                    pred_fine.append("Other")
                else:
                    pred_fine.append(p)

            gold_fine = []
            for gold_nar, gold_sub in zip(gold_coarse, gold_subnarratives):
                if gold_nar == "Other":
                    gold_fine.append("Other")
                else:
                    gold_fine.append(f"{gold_nar}: {gold_sub}")
            
            gold_coarse_all.append(gold_coarse)
            gold_fine_all.append(gold_fine)
            pred_coarse_all.append(pred_coarse)
            pred_fine_all.append(pred_fine)

        f1_coarse_mean, coarse_std = self._evaluate_multi_label(gold_coarse_all, pred_coarse_all, self.classes_coarse)
        f1_fine_mean, fine_std = self._evaluate_multi_label(gold_fine_all, pred_fine_all, self.classes_fine)
        
        gold_coarse_flat = []
        pred_coarse_flat = []
        for g_labels, p_labels in zip(gold_coarse_all, pred_coarse_all):
            g_onehot = np.zeros(len(self.classes_coarse), dtype=int)
            for lab in g_labels:
                if lab in self.classes_coarse:
                    g_onehot[self.classes_coarse.index(lab)] = 1
            p_onehot = np.zeros(len(self.classes_coarse), dtype=int)
            for lab in p_labels:
                if lab in self.classes_coarse:
                    p_onehot[self.classes_coarse.index(lab)] = 1
            gold_coarse_flat.append(g_onehot)
            pred_coarse_flat.append(p_onehot)
        gold_coarse_flat = np.array(gold_coarse_flat)
        pred_coarse_flat = np.array(pred_coarse_flat)
        report_coarse = metrics.classification_report(
            gold_coarse_flat, pred_coarse_flat, target_names=self.classes_coarse, zero_division=0
        )
        
        gold_fine_flat = []
        pred_fine_flat = []
        for g_labels, p_labels in zip(gold_fine_all, pred_fine_all):
            g_onehot = np.zeros(len(self.classes_fine), dtype=int)
            for lab in g_labels:
                if lab in self.classes_fine:
                    g_onehot[self.classes_fine.index(lab)] = 1
            p_onehot = np.zeros(len(self.classes_fine), dtype=int)
            for lab in p_labels:
                if lab in self.classes_fine:
                    p_onehot[self.classes_fine.index(lab)] = 1
            gold_fine_flat.append(g_onehot)
            pred_fine_flat.append(p_onehot)
        gold_fine_flat = np.array(gold_fine_flat)
        pred_fine_flat = np.array(pred_fine_flat)
        
        precision_fine = metrics.precision_score(gold_fine_flat, pred_fine_flat, average='macro', zero_division=0)
        recall_fine = metrics.recall_score(gold_fine_flat, pred_fine_flat, average='macro', zero_division=0)
        samples_f1_fine = metrics.f1_score(
            gold_fine_flat, 
            pred_fine_flat, 
            average='samples',
            zero_division=0
        )
        
        return f1_coarse_mean, coarse_std, f1_fine_mean, fine_std, report_coarse, precision_fine, recall_fine, samples_f1_fine

    def _evaluate_multi_label(self, gold, predicted, class_list):
        """
        Evaluates the predicted, with the gold and returns the mean and std f1 scores.
        Mimics the challenge evaluation function.
        """
        f1_scores = []
        for g_labels, p_labels in zip(gold, predicted):
            g_onehot = np.zeros(len(class_list), dtype=int)
            for lab in g_labels:
                if lab in class_list:
                    g_onehot[class_list.index(lab)] = 1
                    
            p_onehot = np.zeros(len(class_list), dtype=int)
            for lab in p_labels:
                if lab in class_list:
                    p_onehot[class_list.index(lab)] = 1

            f1_doc = metrics.f1_score(g_onehot, p_onehot, zero_division=0)
            f1_scores.append(f1_doc)
        
        return float(np.mean(f1_scores)), float(np.std(f1_scores))

    def _save_predictions(self, best_results, filepath):
        predictions = best_results['predictions']
        if os.path.exists(filepath):
            os.remove(filepath)
        
        with open(filepath, 'w', encoding='utf-8') as f:
            for pred in predictions:
                line = (f"{pred['article_id']}\t"
                        f"{';'.join(pred['narratives'])}\t"
                        f"{';'.join(pred['pairs'])}\n")
                f.write(line)

In [43]:
evaluator = MultiHeadEvaluator()

In [44]:
_ = evaluator.evaluate(
    model=trained_model,
)


Best thresholds found:
Narrative threshold: 0.45
Subnarrative threshold: 0.20

Competition Values
Coarse-F1: 0.459
F1 st. dev. coarse: 0.374
Fine-F1: 0.311
F1 st. dev. fine: 0.312

Coarse Classification Report:
                                                        precision    recall  f1-score   support

                          CC: Amplifying Climate Fears       0.00      0.00      0.00         0
                      CC: Climate change is beneficial       0.00      0.00      0.00         1
              CC: Controversy about green technologies       0.33      0.50      0.40         2
                     CC: Criticism of climate movement       0.60      0.75      0.67         8
                     CC: Criticism of climate policies       0.20      0.67      0.31         3
         CC: Criticism of institutions and authorities       0.41      0.88      0.56         8
                        CC: Downplaying climate change       0.00      0.00      0.00         2
       CC: Green po