# Semeval 2025 Task 10
### Subtask 2: Narrative Classification

Given a news article and a [two-level taxonomy of narrative labels](https://propaganda.math.unipd.it/semeval2025task10/NARRATIVE-TAXONOMIES.pdf) (where each narrative is subdivided into subnarratives) from a particular domain, assign to the article all the appropriate subnarrative labels. This is a multi-label multi-class document classification task.

In [1]:
random_state=None

In [2]:
import torch
import numpy as np
import random

if random_state:
    print('[WARNING] Setting random state')
    torch.manual_seed(random_state)
    np.random.seed(random_state) 
    random.seed(random_state)

## Continual Learning

As of current, we were using all multilingual training data (Russian, Bulgarian, Portuguese, Hindi, and English) at once, mixing it together during training, and then evaluating specifically on English validation data. However, since our final evaluation is language-target based we can leverage a sequential training of langauges (Russian -> Bulgarian -> Portuguese -> Hindi -> English) in order to aim for better results.

This is also similar to how we as humans might learn languages. We start with one then move to another one while maintaining knowledge of the past ones.

This way might help our learning on identifying different useful patterns per language that could later help a specific language classification. For example:
* Russian articles might help learn certain propaganda patterns.
* Bulgarian articles might contribute different narrative structures.
* Each language adds its own unique perspective to the model's understanding, the model get's this knoweledge sequentially.

In [3]:
import pickle
import os
import pandas as pd

root_dir = "../../"
base_save_folder_dir = '../saved/'
dataset_folder = os.path.join(base_save_folder_dir, 'Dataset')

with open(os.path.join(dataset_folder, 'dataset_train_cleaned.pkl'), 'rb') as f:
    dataset_train = pickle.load(f)

In [4]:
dataset_train.head()

Unnamed: 0,language,article_id,content,narratives,subnarratives,narratives_encoded,subnarratives_encoded,aggregated_subnarratives
0,RU,RU-URW-1161.txt,<PARA>в ближайшие два месяца сша будут стремит...,[URW: Blaming the war on others rather than th...,"[The West are the aggressors, Other, The West ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
1,RU,RU-URW-1175.txt,<PARA>в ес испугались последствий популярности...,"[URW: Discrediting the West, Diplomacy, URW: D...","[The West is weak, Other, The EU is divided]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
2,RU,RU-URW-1149.txt,<PARA>возможность признания аллы пугачевой ино...,[URW: Distrust towards Media],[Western media is an instrument of propaganda],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
3,RU,RU-URW-1015.txt,<PARA>азаров рассказал о смене риторики киева ...,"[URW: Discrediting Ukraine, URW: Discrediting ...","[Ukraine is a puppet of the West, Discrediting...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
4,RU,RU-URW-1001.txt,<PARA>в россиянах проснулась массовая любовь к...,[URW: Praise of Russia],[Russia is a guarantor of peace and prosperity],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."


In [5]:
misc_folder = os.path.join(base_save_folder_dir, 'Misc')

with open(os.path.join(misc_folder, 'narrative_to_subnarratives.pkl'), 'rb') as f:
    narrative_to_subnarratives = pickle.load(f)

In [6]:
with open(os.path.join(misc_folder, 'narrative_to_subnarratives_map.pkl'), 'rb') as f:
    narrative_to_sub_map = pickle.load(f)

In [7]:
with open(os.path.join(misc_folder, 'coarse_classes.pkl'), 'rb') as f:
    coarse_classes = pickle.load(f)

with open(os.path.join(misc_folder, 'fine_classes.pkl'), 'rb') as f:
    fine_classes = pickle.load(f)

with open(os.path.join(misc_folder, 'narrative_order.pkl'), 'rb') as f:
    narrative_order = pickle.load(f)

In [8]:
dataset_train.shape

(1781, 8)

In [9]:
label_encoder_folder = os.path.join(base_save_folder_dir, 'LabelEncoders')

with open(os.path.join(label_encoder_folder, 'mlb_narratives.pkl'), 'rb') as f:
    mlb_narratives = pickle.load(f)

with open(os.path.join(label_encoder_folder, 'mlb_subnarratives.pkl'), 'rb') as f:
    mlb_subnarratives = pickle.load(f)

In [10]:
import numpy as np

embeddings_folder = os.path.join(base_save_folder_dir, 'Embeddings/embeddings_train_stella.npy')

def load_embeddings(filename):
    return np.load(filename)

train_embeddings = load_embeddings(embeddings_folder)

In [11]:
train_embeddings.shape

(1781, 1024)

In [12]:
with open(os.path.join(dataset_folder, 'dataset_val_cleaned.pkl'), 'rb') as f:
    dataset_val = pickle.load(f)

In [13]:
dataset_val.shape

(178, 8)

In [14]:
dataset_val.head()

Unnamed: 0,language,article_id,content,narratives,subnarratives,narratives_encoded,subnarratives_encoded,aggregated_subnarratives
0,RU,RU-URW-1014.txt,<PARA>алаудинов: российские силы растянули и р...,[URW: Praise of Russia],[Praise of Russian military might],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
1,RU,RU-URW-1174.txt,<PARA>других сценариев нет. никаких переговоро...,"[URW: Speculating war outcomes, URW: Discredit...","[Ukrainian army is collapsing, Discrediting Uk...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
2,RU,RU-URW-1166.txt,<PARA>попытка запада изолировать путина провал...,"[URW: Praise of Russia, URW: Distrust towards ...","[Praise of Russian President Vladimir Putin, W...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."
3,RU,RU-URW-1170.txt,<PARA>часть территории украины войдет в состав...,"[URW: Discrediting Ukraine, URW: Speculating w...",[Discrediting Ukrainian government and officia...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,..."
4,RU,RU-URW-1004.txt,<PARA>зеленскому не очень понравилась идея о в...,"[URW: Discrediting Ukraine, URW: Discrediting ...",[Discrediting Ukrainian government and officia...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,..."


In [15]:
embeddings_folder = os.path.join(base_save_folder_dir, 'Embeddings/embeddings_dev_stella.npy')

val_embeddings = load_embeddings(embeddings_folder)

In [16]:
def filter_dataset_and_embeddings(dataset, embeddings, condition_fn):
    filtered_indices = dataset.index[dataset.apply(condition_fn, axis=1)].tolist()
    
    filtered_dataset = dataset.loc[filtered_indices]
    filtered_embeddings = embeddings[filtered_indices]

    return filtered_dataset, filtered_embeddings

In [17]:
dataset_val_en, val_embeddings_en = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] == "EN"
    )

In [18]:
dataset_val_en.shape

(41, 8)

In [19]:
val_embeddings_en.shape

(41, 1024)

In [20]:
dataset_train.shape

(1781, 8)

In [21]:
train_embeddings.shape

(1781, 1024)

In [22]:
dataset_train['aggregated_subnarratives']

0       [[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,...
1       [[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,...
2       [[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,...
3       [[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,...
4       [[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,...
                              ...                        
1776    [[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,...
1777    [[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,...
1778    [[0, 0, 0, 1, 0], [0, 1, 0], [0, 0, 0, 1], [0,...
1779    [[0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,...
1780    [[0, 0, 1, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0,...
Name: aggregated_subnarratives, Length: 1781, dtype: object

In [23]:
import torch

prefer_cpu=True

# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available() and not prefer_cpu
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


In [24]:
y_train_sub_heads = dataset_train['aggregated_subnarratives'].to_numpy()
y_val_sub_heads = dataset_val['aggregated_subnarratives'].to_numpy()

In [25]:
dataset_train['language'].unique()

array(['RU', 'PT', 'BG', 'HI', 'EN'], dtype=object)

In [26]:
narrative_order

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

In [27]:
y_train_sub_heads = dataset_train['aggregated_subnarratives'].to_numpy()
y_val_sub_heads = dataset_val['aggregated_subnarratives'].to_numpy()

In [28]:
dataset_train['language'].unique()

array(['RU', 'PT', 'BG', 'HI', 'EN'], dtype=object)

In [29]:
network_params = {
    'lr': 0.001,
    'hidden_size': 2048,
    'dropout': 0.4,
    'patience': 10
}

In [30]:
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskClassifierMultiHead(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size=1024,
        num_narratives=len(mlb_narratives.classes_),
        narrative_to_sub_map=narrative_to_sub_map,
        dropout_rate=0.4,
        model_name="MultiTaskClassifierMultiHead" 
    ):
        super().__init__()
        self.input_size = input_size
        self.model_name = model_name
        
        self.shared_layer = nn.Sequential(
            nn.Linear(input_size, hidden_size * 2),
            nn.BatchNorm1d(hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        self.narrative_head = nn.Sequential(
            nn.Linear(hidden_size * 2, num_narratives),
            nn.Sigmoid()
        )

        self.subnarrative_heads = nn.ModuleDict()
        for narr_idx, sub_indices in narrative_to_sub_map.items():
            num_subs_for_this_narr = len(sub_indices)
            self.subnarrative_heads[str(narr_idx)] = nn.Sequential(
                nn.Linear(hidden_size * 2, num_subs_for_this_narr),
                nn.Sigmoid()
            )

    def forward(self, x):
        shared_out = self.shared_layer(x)
        narr_probs = self.narrative_head(shared_out)
        
        sub_probs_dict = {}
        for narr_idx, head in self.subnarrative_heads.items():
            sub_probs_dict[narr_idx] = head(shared_out)
            
        return narr_probs, sub_probs_dict

In [31]:
class MultiTaskClassifierMultiHeadConcat(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        num_narratives=len(mlb_narratives.classes_),
        narrative_to_sub_map=narrative_to_sub_map,
        dropout_rate=network_params['dropout'],
    ):
        super().__init__()
        self.input_size = input_size        
        
        self.shared_layer = nn.Sequential(
            nn.Linear(input_size, hidden_size * 2),
            nn.BatchNorm1d(hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        self.narrative_head = nn.Sequential(
            nn.Linear(hidden_size * 2, num_narratives),
            nn.Sigmoid()
        )

        self.subnarrative_heads = nn.ModuleDict()
        for narr_idx, sub_indices in narrative_to_sub_map.items():
            num_subs_for_this_narr = len(sub_indices)
            # Here each head expects an additional 1-dimension input (the narrative probability for that head)
            self.subnarrative_heads[str(narr_idx)] = nn.Sequential(
                nn.Linear(hidden_size * 2 + 1, num_subs_for_this_narr),
                nn.Sigmoid()
            )

    def forward(self, x):
        shared_out = self.shared_layer(x)

        narr_probs = self.narrative_head(shared_out)

        sub_probs_dict = {}
        for narr_idx, head in self.subnarrative_heads.items():
            # Add a new dimension: get the probability for the narrative corresponding to narr_idx
            # Then concatenate it with shared layer's output.
            conditioned_input = torch.cat((shared_out, narr_probs[:, int(narr_idx)].unsqueeze(1)), dim=1)
            sub_probs_dict[narr_idx] = head(conditioned_input)

        return narr_probs, sub_probs_dict

In [32]:
y_train_nar = dataset_train['narratives_encoded'].tolist()

y_train_sub_nar = dataset_train['subnarratives_encoded'].tolist()


In [33]:
y_train_nar = torch.tensor(y_train_nar, dtype=torch.float32).to(device)
y_train_sub_nar = torch.tensor(y_train_sub_nar, dtype=torch.float32).to(device)

In [34]:
train_embeddings_tensor = torch.tensor(train_embeddings, dtype=torch.float32).to(device)

In [35]:
input_size = train_embeddings_tensor.shape[1]
print(input_size)

1024


In [36]:
model_multi_head = MultiTaskClassifierMultiHead(
    input_size=input_size,
    hidden_size=network_params['hidden_size'],
).to(device)

In [37]:
def compute_class_weights(y_train):
    total_samples = y_train.shape[0]
    class_weights = []
    for label in range(y_train.shape[1]):
        pos_count = y_train[:, label].sum().item()
        neg_count = total_samples - pos_count
        pos_weight = total_samples / (2 * pos_count) if pos_count > 0 else 0
        neg_weight = total_samples / (2 * neg_count) if neg_count > 0 else 0
        class_weights.append((pos_weight, neg_weight))
    return class_weights

class WeightedBCELoss(nn.Module):
    def __init__(self, class_weights):
        super().__init__()
        self.class_weights = class_weights

    def forward(self, probs, targets):
        bce_loss = 0
        epsilon = 1e-7
        for i, (pos_weight, neg_weight) in enumerate(self.class_weights):
            prob = probs[:, i]
            bce = -pos_weight * targets[:, i] * torch.log(prob + epsilon) - \
                  neg_weight * (1 - targets[:, i]) * torch.log(1 - prob + epsilon)
            bce_loss += bce.mean()
        return bce_loss / len(self.class_weights)

In [38]:
loss_params = {
    'sub_weight': 0.3,
    'condition_weight': 0.3,
    'target_weight': 2.0
}

In [39]:
class MultiHeadLoss(nn.Module):
    def __init__(self, narrative_criterion, sub_criterion_dict, 
                 condition_weight=loss_params['condition_weight'],
                 sub_weight=loss_params['sub_weight'],
                 target_weight=loss_params['target_weight'],
                 is_target=False):
        
        super().__init__()
        self.narrative_criterion = narrative_criterion
        self.sub_criterion_dict = sub_criterion_dict
        self.condition_weight = condition_weight
        self.sub_weight = sub_weight
        self.target_weight = target_weight
        self.is_target = is_target
        
    def forward(self, narr_probs, sub_probs_dict, y_narr, y_sub_heads):
        narr_loss = self.narrative_criterion(narr_probs, y_narr)
        sub_loss = 0.0
        condition_loss = 0.0
        
        for narr_idx_str, sub_probs in sub_probs_dict.items():
            narr_idx = int(narr_idx_str)
            y_sub = [row[narr_idx] for row in y_sub_heads]
            y_sub_tensor = torch.tensor(y_sub, dtype=torch.float32, device=sub_probs.device)
            
            sub_loss_func = self.sub_criterion_dict[narr_idx_str]
            sub_loss += sub_loss_func(sub_probs, y_sub_tensor)
            
            narr_pred = narr_probs[:, narr_idx].unsqueeze(1)
            condition_term = torch.mean(
                torch.abs(sub_probs * (1 - narr_pred)) + 
                narr_pred * torch.abs(sub_probs - y_sub_tensor.unsqueeze(1))
            )
            condition_loss += condition_term
            
        sub_loss = sub_loss / len(sub_probs_dict)
        condition_loss = condition_loss / len(sub_probs_dict)
        
        total_loss = (1 - self.sub_weight) * narr_loss + \
                    self.sub_weight * sub_loss + \
                    self.condition_weight * condition_loss
        
        if self.is_target:
            total_loss *= self.target_weight
        
        return total_loss

In [40]:
coarse_classes

['CC: Amplifying Climate Fears',
 'CC: Climate change is beneficial',
 'CC: Controversy about green technologies',
 'CC: Criticism of climate movement',
 'CC: Criticism of climate policies',
 'CC: Criticism of institutions and authorities',
 'CC: Downplaying climate change',
 'CC: Green policies are geopolitical instruments',
 'CC: Hidden plots by secret schemes of powerful groups',
 'CC: Questioning the measurements and science',
 'Other',
 'URW: Amplifying war-related fears',
 'URW: Blaming the war on others rather than the invader',
 'URW: Discrediting Ukraine',
 'URW: Discrediting the West, Diplomacy',
 'URW: Distrust towards Media',
 'URW: Hidden plots by secret schemes of powerful groups',
 'URW: Negative Consequences for the West',
 'URW: Overpraising the West',
 'URW: Praise of Russia',
 'URW: Russia is the Victim',
 'URW: Speculating war outcomes']

In [41]:
fine_classes[:15]

['CC: Amplifying Climate Fears: Amplifying existing fears of global warming',
 'CC: Amplifying Climate Fears: Doomsday scenarios for humans',
 'CC: Amplifying Climate Fears: Earth will be uninhabitable soon',
 'CC: Amplifying Climate Fears: Other',
 'CC: Amplifying Climate Fears: Whatever we do it is already too late',
 'CC: Climate change is beneficial: CO2 is beneficial',
 'CC: Climate change is beneficial: Other',
 'CC: Climate change is beneficial: Temperature increase is beneficial',
 'CC: Controversy about green technologies: Other',
 'CC: Controversy about green technologies: Renewable energy is costly',
 'CC: Controversy about green technologies: Renewable energy is dangerous',
 'CC: Controversy about green technologies: Renewable energy is unreliable',
 'CC: Criticism of climate movement: Ad hominem attacks on key activists',
 'CC: Criticism of climate movement: Climate movement is alarmist',
 'CC: Criticism of climate movement: Climate movement is corrupt']

In [42]:
narrative_order

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

In [43]:
from sklearn import metrics

class MultiHeadEvaluator:
    def __init__(
        self,
        classes_coarse=coarse_classes,
        classes_fine=fine_classes,
        narrative_to_sub_map=narrative_to_sub_map,
        narrative_order=narrative_order,
        narrative_classes=mlb_narratives.classes_,
        subnarrative_classes=mlb_subnarratives.classes_,
        device='cpu',
        output_dir='../../../submissions',
    ):
        self.narrative_to_sub_map = narrative_to_sub_map
        self.narrative_order = narrative_order
        self.narrative_classes = list(narrative_classes)
        self.subnarrative_classes = list(subnarrative_classes)
        
        self.classes_coarse = classes_coarse
        self.classes_fine = classes_fine

        self.device = device
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
    
    def evaluate(
        self,
        model,
        embeddings,
        dataset,
        thresholds=None,
        save=False,
        std_weight=0.6,
        lower_thres=0.1,
        upper_thres=0.6,
        show_results=True
    ):
        if thresholds is None:
            thresholds = np.arange(lower_thres, upper_thres, 0.05)
        
        dataset = dataset.reset_index(drop=True)
        embeddings = embeddings.to(self.device)
    
        best_results = {
            'best_coarse_f1': -1,
            'best_coarse_std': float('inf'),
            'best_fine_f1': -1,
            'best_fine_std': float('inf'),
            'narr_threshold': 0,
            'sub_threshold': 0,
            'predictions': None,
            'best_combined_score': -float('inf'),
            'coarse_classification_report': None,
            'fine_precision': None,
            'fine_recall': None,
            'samples_f1_fine': None,
        }
    
        with torch.no_grad():
            narr_probs, sub_probs_dict = model(embeddings)
            narr_probs = narr_probs.cpu().numpy()
            sub_probs_dict = {k: v.cpu().numpy() for k, v in sub_probs_dict.items()}
    
        for narr_threshold in thresholds:
            for sub_threshold in thresholds:
                predictions = []
                try:
                    for sample_idx, row in dataset.iterrows():
                        pred = self._make_prediction(
                            row['article_id'],
                            sample_idx,
                            narr_probs,
                            sub_probs_dict,
                            narr_threshold,
                            sub_threshold
                        )
                        predictions.append(pred)
                    
                    metrics_result = self._compute_metrics_coarse_fine(predictions, dataset)
                    f1_coarse_mean, coarse_std, f1_fine_mean, fine_std, report_coarse, precision_fine, recall_fine, \
                    samples_f1_fine = metrics_result
                    
                    combined_score = f1_fine_mean - (std_weight * coarse_std)
                    
                    if combined_score > best_results['best_combined_score']:
                        best_results.update({
                            'best_coarse_f1': f1_coarse_mean,
                            'best_coarse_std': coarse_std,
                            'best_fine_f1': f1_fine_mean,
                            'best_fine_std': fine_std,
                            'narr_threshold': narr_threshold,
                            'sub_threshold': sub_threshold,
                            'predictions': predictions,
                            'best_combined_score': combined_score,
                            'coarse_classification_report': report_coarse,
                            'fine_precision': precision_fine,
                            'fine_recall': recall_fine,
                            'samples_f1_fine': samples_f1_fine,
                        })
                except Exception as e:
                    print(f"Error during evaluation with thresholds {narr_threshold:.2f}, {sub_threshold:.2f}: {str(e)}")
                    continue
                    
        if show_results:
            print("\nBest thresholds found:")
            print(f"Narrative threshold: {best_results['narr_threshold']:.2f}")
            print(f"Subnarrative threshold: {best_results['sub_threshold']:.2f}")
            print('\nCompetition Values')
            print(f"Coarse-F1: {best_results['best_coarse_f1']:.3f}")
            print(f"F1 st. dev. coarse: {best_results['best_coarse_std']:.3f}")
            print(f"Fine-F1: {best_results['best_fine_f1']:.3f}")
            print(f"F1 st. dev. fine: {best_results['best_fine_std']:.3f}")
            print("\nFine Metrics:")
            print("Precision: {:.3f}".format(best_results['fine_precision']))
            print("Recall: {:.3f}".format(best_results['fine_recall']))
            print("F1 Samples: {:.3f}".format(best_results['samples_f1_fine']))

        if save:
            self._save_predictions(best_results, os.path.join(self.output_dir, 'submission.txt'))
        
        return best_results

    def _make_prediction(self, article_id, sample_idx, narr_probs, sub_probs_dict, narr_threshold, sub_threshold):       
        other_idx = self.narrative_classes.index("Other")  
        active_narratives = [
            (n_idx, prob)
            for n_idx, prob in enumerate(narr_probs[sample_idx])
            if n_idx != other_idx and prob >= narr_threshold
        ]

        if not active_narratives:
            return {
                'article_id': article_id,
                'narratives': ["Other"],
                'pairs': ["Other"]
            }
        
        narratives = []
        pairs = []
        seen_pairs = set()
        
        active_narratives.sort(key=lambda x: x[1], reverse=True)
        for narr_idx, _ in active_narratives:
            narr_name = self.narrative_classes[narr_idx]
                
            sub_probs = sub_probs_dict[str(narr_idx)][sample_idx]
            active_subnarratives = [
                (local_idx, s_prob)
                for local_idx, s_prob in enumerate(sub_probs)
                if s_prob >= sub_threshold
            ]
            
            active_subnarratives.sort(key=lambda x: x[1], reverse=True)
            if not active_subnarratives:
                pairs.append(f"{narr_name}: Other")
            else:
                for local_idx, _ in active_subnarratives:   
                    global_sub_idx = self.narrative_to_sub_map[narr_idx][local_idx]
                    sub_name = self.subnarrative_classes[global_sub_idx]
                    pair = f"{narr_name}: {sub_name}"
                    if pair not in seen_pairs:
                        pairs.append(pair)
                        seen_pairs.add(pair)
            narratives.append(narr_name)
        
        return {
            'article_id': article_id,
            'narratives': narratives,
            'pairs': pairs
        }

    def _compute_metrics_coarse_fine(self, predictions, dataset):
        gold_coarse_all = []
        gold_fine_all = []
        pred_coarse_all = []
        pred_fine_all = []

        for pred, (_, row) in zip(predictions, dataset.iterrows()):
            gold_coarse = row['narratives']
            gold_subnarratives = row['subnarratives']
            
            pred_coarse = pred['narratives']
            pred_fine = []
            for p in pred['pairs']:
                if p == "Other":
                    pred_fine.append("Other")
                else:
                    pred_fine.append(p)

            gold_fine = []
            for gold_nar, gold_sub in zip(gold_coarse, gold_subnarratives):
                if gold_nar == "Other":
                    gold_fine.append("Other")
                else:
                    gold_fine.append(f"{gold_nar}: {gold_sub}")
            
            gold_coarse_all.append(gold_coarse)
            gold_fine_all.append(gold_fine)
            pred_coarse_all.append(pred_coarse)
            pred_fine_all.append(pred_fine)

        f1_coarse_mean, coarse_std = self._evaluate_multi_label(gold_coarse_all, pred_coarse_all, self.classes_coarse)
        f1_fine_mean, fine_std = self._evaluate_multi_label(gold_fine_all, pred_fine_all, self.classes_fine)
        
        gold_coarse_flat = []
        pred_coarse_flat = []
        for g_labels, p_labels in zip(gold_coarse_all, pred_coarse_all):
            g_onehot = np.zeros(len(self.classes_coarse), dtype=int)
            p_onehot = np.zeros(len(self.classes_coarse), dtype=int)
            
            for lab in g_labels:
                if lab in self.classes_coarse:
                    g_onehot[self.classes_coarse.index(lab)] = 1
            for lab in p_labels:
                if lab in self.classes_coarse:
                    p_onehot[self.classes_coarse.index(lab)] = 1
                    
            gold_coarse_flat.append(g_onehot)
            pred_coarse_flat.append(p_onehot)
            
        gold_coarse_flat = np.array(gold_coarse_flat)
        pred_coarse_flat = np.array(pred_coarse_flat)
        
        report_coarse = metrics.classification_report(
                gold_coarse_flat, pred_coarse_flat, 
                target_names=self.classes_coarse, 
                zero_division=0
        )
        
        gold_fine_flat = []
        pred_fine_flat = []
        for g_labels, p_labels in zip(gold_fine_all, pred_fine_all):
            g_onehot = np.zeros(len(self.classes_fine), dtype=int)
            p_onehot = np.zeros(len(self.classes_fine), dtype=int)
            
            for lab in g_labels:
                if lab in self.classes_fine:
                    g_onehot[self.classes_fine.index(lab)] = 1
            for lab in p_labels:
                if lab in self.classes_fine:
                    p_onehot[self.classes_fine.index(lab)] = 1
                    
            gold_fine_flat.append(g_onehot)
            pred_fine_flat.append(p_onehot)
            
        gold_fine_flat = np.array(gold_fine_flat)
        pred_fine_flat = np.array(pred_fine_flat)
        
        precision_fine = metrics.precision_score(gold_fine_flat, pred_fine_flat, average='macro', zero_division=0)
        recall_fine = metrics.recall_score(gold_fine_flat, pred_fine_flat, average='macro', zero_division=0)
        samples_f1_fine = metrics.f1_score(gold_fine_flat, pred_fine_flat, average='samples', zero_division=0)
        
        return f1_coarse_mean, coarse_std, f1_fine_mean, fine_std, report_coarse, precision_fine, recall_fine, samples_f1_fine

    def _evaluate_multi_label(self, gold, predicted, class_list):
        f1_scores = []
        for g_labels, p_labels in zip(gold, predicted):
            g_onehot = np.zeros(len(class_list), dtype=int)
            p_onehot = np.zeros(len(class_list), dtype=int)
            
            for lab in g_labels:
                if lab in class_list:
                    g_onehot[class_list.index(lab)] = 1
            for lab in p_labels:
                if lab in class_list:
                    p_onehot[class_list.index(lab)] = 1
                    
            f1_doc = metrics.f1_score(g_onehot, p_onehot, zero_division=0)
            f1_scores.append(f1_doc)
            
        return float(np.mean(f1_scores)), float(np.std(f1_scores))

    def _save_predictions(self, best_results, filepath):
        predictions = best_results['predictions']
        if os.path.exists(filepath):
            os.remove(filepath)
        
        with open(filepath, 'w', encoding='utf-8') as f:
            for pred in predictions:
                line = (f"{pred['article_id']}\t"
                       f"{';'.join(pred['narratives'])}\t"
                       f"{';'.join(pred['pairs'])}\n")
                f.write(line)

As we train on multiple languages in sequence, the last language (our target) needs some kind of special care. 
Our goal is to make sure the model performs best on our target language, without losing what it learned from previous languages. When we reach the target language, we make two key changes:
- We increase the patience parameter to train the model more carefully on the target language.
- We also lower the learning rate, because the model has already learned some patterns from other languages, we want smaller, more precise updates for the target language.
- We monitor target language validation throughout all sequential training phases, allowing each language to train until no further improvements are seen on the target language metrics.

In [44]:
class ContinualLearningModel:
    def __init__(
        self,
        model_params,
        dataset_val,
        val_embeddings,
        model_class=MultiTaskClassifierMultiHeadConcat,
        dataset_train=dataset_train,
        train_embeddings=train_embeddings,
        language_order=['RU', 'BG', 'HI', 'PT', 'EN'],
        learning_rate=0.001,
        target="EN",
        device=device,
        show_progress=True
    ):
        self.model_class = model_class
        self.model_params = model_params
        self.dataset_train = dataset_train
        self.train_embeddings = train_embeddings
        self.dataset_val = dataset_val
        self.val_embeddings = val_embeddings
        self.language_order = language_order
        self.learning_rate = learning_rate
        self.device = device
        self.target = target
        self.y_val_nar = self.dataset_val['narratives_encoded'].tolist()
        self.y_val_sub_heads = self.dataset_val['aggregated_subnarratives'].tolist()
        self.show_progress = show_progress
        

    def _prepare_language_data(self, language, shuffle=False):
        language_mask = self.dataset_train["language"] == language
        train_data = self.dataset_train[language_mask].copy()
        train_emb = self.train_embeddings[language_mask]
        
        if shuffle:
            indices = torch.randperm(len(train_data))
            train_data = train_data.iloc[indices].reset_index(drop=True)
            train_emb = train_emb[indices]
        
        y_train_nar = torch.tensor(train_data['narratives_encoded'].tolist(), dtype=torch.float32).to(self.device)
        y_train_sub_heads = train_data['aggregated_subnarratives'].tolist()
        train_emb = torch.tensor(train_emb, dtype=torch.float32).to(self.device)
        return train_data, train_emb, y_train_nar, y_train_sub_heads

    def _setup_loss_function(self, y_train_nar, y_train_sub_heads, language):
        class_weights_nar = compute_class_weights(y_train_nar)
        narrative_criterion = WeightedBCELoss(class_weights_nar)
        
        sub_criterion_dict = {}
        for narr_idx, sub_indices in narrative_to_sub_map.items():
            local_weights = compute_class_weights(torch.tensor([h[narr_idx] for h in y_train_sub_heads]))
            sub_criterion = WeightedBCELoss(local_weights)
            sub_criterion_dict[str(narr_idx)] = sub_criterion
            
        if (language==self.target):
            print('Focusing on', self.target)
            
        return MultiHeadLoss(narrative_criterion, sub_criterion_dict, is_target=(language == self.target))

    def train(self, epochs_per_language=100, patience=10, shuffle=False):
        self.model = self.model_class(**self.model_params).to(self.device)

        for lang_idx, language in enumerate(self.language_order):
            print(f"\nTraining on {language} data...")
            
            if language == self.target:
                patience = patience * 2
                learning_rate = self.learning_rate * 0.2
                optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)
                scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                    optimizer, 
                    mode='min', 
                    factor=0.5,
                    patience=8,
                    min_lr=2e-5,
                    threshold=1e-4
                )
            else:
                optimizer = torch.optim.Adam(self.model.parameters(), lr=self.learning_rate)
                scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                    optimizer, mode='min', factor=0.5, patience=5
                )
            
            train_data, train_emb, y_train_nar, y_train_sub_heads = self._prepare_language_data(language, shuffle=shuffle)
            loss_fn = self._setup_loss_function(y_train_nar, y_train_sub_heads, language)
            val_emb_tensor = torch.tensor(self.val_embeddings, dtype=torch.float32).to(self.device)
            best_val_loss = float('inf')
            patience_counter = 0
            best_model_state = None

            for epoch in range(epochs_per_language):
                self.model.train()
                train_narr_probs, train_sub_probs_dict = self.model(train_emb)
                train_loss = loss_fn(
                    train_narr_probs,
                    train_sub_probs_dict,
                    y_train_nar,
                    y_train_sub_heads
                )

                optimizer.zero_grad()
                train_loss.backward()
                                    
                optimizer.step()

                self.model.eval()
                with torch.no_grad():
                    val_narr_probs, val_sub_probs_dict = self.model(val_emb_tensor)
                    val_loss = loss_fn(
                        val_narr_probs,
                        val_sub_probs_dict,
                        torch.tensor(self.y_val_nar, dtype=torch.float32).to(self.device),
                        self.y_val_sub_heads
                    )
                if self.show_progress:
                    print(f"Epoch {epoch+1}/{epochs_per_language}, "
                          f"Train Loss: {train_loss.item():.4f}, "
                          f"Val Loss: {val_loss.item():.4f}")

                if scheduler:
                    scheduler.step(val_loss)
                    current_lr = scheduler.optimizer.param_groups[0]['lr']
                    if self.show_progress: print(f"Current Learning Rate: {current_lr:.6f}")

                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    patience_counter = 0
                    best_model_state = self.model.state_dict().copy()
                else:
                    patience_counter += 1
                    
                if patience_counter >= patience:
                    if self.show_progress: print(f"Early stopping triggered for {language}")
                    break

            if best_model_state:
                self.model.load_state_dict(best_model_state)

        return self.model

    def evaluate_final(self, save_predictions=True):
        evaluator = MultiHeadEvaluator(device=self.device)
        val_emb_tensor = torch.tensor(self.val_embeddings, dtype=torch.float32).to(self.device)
        res = evaluator.evaluate(
            self.model,
            val_emb_tensor,
            self.dataset_val,
            save=save_predictions
        )
        return res

In [45]:
language_order=['RU', 'BG', 'PT', 'HI', 'EN']
target=language_order[-1]

In [46]:
model_params = {
    'input_size': train_embeddings.shape[1],
    'hidden_size': 2048,
    'dropout_rate': 0.4
}

In [47]:

en_cl_model = ContinualLearningModel(
    model_params=model_params,
    language_order=language_order,
    dataset_val=dataset_val_en,
    val_embeddings=val_embeddings_en,
    target=target
)

In [48]:
en_model_right = en_cl_model.train(epochs_per_language=100, patience=15)


Training on RU data...
Epoch 1/100, Train Loss: 0.7055, Val Loss: 0.8719
Current Learning Rate: 0.001000
Epoch 2/100, Train Loss: 0.4119, Val Loss: 0.8704
Current Learning Rate: 0.001000
Epoch 3/100, Train Loss: 0.2826, Val Loss: 0.8718
Current Learning Rate: 0.001000
Epoch 4/100, Train Loss: 0.2217, Val Loss: 0.8766
Current Learning Rate: 0.001000
Epoch 5/100, Train Loss: 0.1879, Val Loss: 0.8845
Current Learning Rate: 0.001000
Epoch 6/100, Train Loss: 0.1654, Val Loss: 0.8945
Current Learning Rate: 0.001000
Epoch 7/100, Train Loss: 0.1478, Val Loss: 0.9044
Current Learning Rate: 0.001000
Epoch 8/100, Train Loss: 0.1323, Val Loss: 0.9143
Current Learning Rate: 0.000500
Epoch 9/100, Train Loss: 0.1206, Val Loss: 0.9199
Current Learning Rate: 0.000500
Epoch 10/100, Train Loss: 0.1132, Val Loss: 0.9254
Current Learning Rate: 0.000500
Epoch 11/100, Train Loss: 0.1065, Val Loss: 0.9316
Current Learning Rate: 0.000500
Epoch 12/100, Train Loss: 0.1005, Val Loss: 0.9384
Current Learning Rate

In [49]:
results = en_cl_model.evaluate_final()


Best thresholds found:
Narrative threshold: 0.55
Subnarrative threshold: 0.40

Competition Values
Coarse-F1: 0.522
F1 st. dev. coarse: 0.351
Fine-F1: 0.365
F1 st. dev. fine: 0.352

Fine Metrics:
Precision: 0.136
Recall: 0.272
F1 Samples: 0.365


In [50]:
nar_thres = results['narr_threshold']
sub_thres = results['sub_threshold']

If we change the order of the languages being trained:

In [51]:
language_order=['RU', 'HI', 'PT', 'BG', 'EN']
target=language_order[-1]

In [52]:
cl_model_demo = ContinualLearningModel(
    model_params=model_params,
    language_order=language_order,
    dataset_val=dataset_val_en,
    val_embeddings=val_embeddings_en,
    target=target
)

In [53]:
model_demo = cl_model_demo.train(epochs_per_language=100, patience=15)


Training on RU data...
Epoch 1/100, Train Loss: 0.7184, Val Loss: 0.8714
Current Learning Rate: 0.001000
Epoch 2/100, Train Loss: 0.4198, Val Loss: 0.8683
Current Learning Rate: 0.001000
Epoch 3/100, Train Loss: 0.2887, Val Loss: 0.8702
Current Learning Rate: 0.001000
Epoch 4/100, Train Loss: 0.2258, Val Loss: 0.8766
Current Learning Rate: 0.001000
Epoch 5/100, Train Loss: 0.1892, Val Loss: 0.8859
Current Learning Rate: 0.001000
Epoch 6/100, Train Loss: 0.1681, Val Loss: 0.8966
Current Learning Rate: 0.001000
Epoch 7/100, Train Loss: 0.1488, Val Loss: 0.9076
Current Learning Rate: 0.001000
Epoch 8/100, Train Loss: 0.1340, Val Loss: 0.9182
Current Learning Rate: 0.000500
Epoch 9/100, Train Loss: 0.1207, Val Loss: 0.9237
Current Learning Rate: 0.000500
Epoch 10/100, Train Loss: 0.1149, Val Loss: 0.9290
Current Learning Rate: 0.000500
Epoch 11/100, Train Loss: 0.1077, Val Loss: 0.9346
Current Learning Rate: 0.000500
Epoch 12/100, Train Loss: 0.1021, Val Loss: 0.9409
Current Learning Rate

The results are poor compared to the first language order.

That could be because, the jump from Russian directly to Hindi could be too drastic as these languages have very different structures, and we might try to get very good at it without having the right "prerequisites".

- In the first instance, however, we have a somewhat more smooth transition, Russian and Bulgarian are both Slavic languages, then the model gets to know different patterns from Portuguese and maybe harder ones from Hindi before moving on to the final target language.

However, other factors such as Russian and Bulgarian having certain patterns that overly help on being learned at early stages could actually help on classifying English articles.

In [54]:
_ = cl_model_demo.evaluate_final()


Best thresholds found:
Narrative threshold: 0.55
Subnarrative threshold: 0.50

Competition Values
Coarse-F1: 0.474
F1 st. dev. coarse: 0.355
Fine-F1: 0.285
F1 st. dev. fine: 0.321

Fine Metrics:
Precision: 0.113
Recall: 0.285
F1 Samples: 0.285


While we are at it, notice that we cannot rely on a single language order per target language, as the validation set might have characteristics that work particularly well with that order or the validation set could favor some narratives/subnarratives, and the distribution for the test might (and will) be different.

We create an ensemble of models trained with different language orders:
* Each model follows some principles, like keeping linguistically similar languages close together.
* Models are trained with different but linguistically sensible variations of the order.
* The final prediction combines all models outputs, weighted by their validation performance (since clearly, some orders appear to do better than others).

This is also helpful since we get a view with continual learning of different languages that have specific narrative/subnarratives that are language-based and thus not in other languages.

In [55]:
class ContinualLearningEnsemble:
    def __init__(
        self,
        model_params,
        dataset_val,
        val_embeddings,
        language_orders,
        model_class=MultiTaskClassifierMultiHead,
        dataset_train=dataset_train,
        train_embeddings=train_embeddings,
        learning_rate=0.001,
        device=device
    ):
        self.model_class = model_class
        self.model_params = model_params
        self.dataset_train = dataset_train
        self.train_embeddings = train_embeddings
        self.dataset_val = dataset_val
        self.val_embeddings = val_embeddings
        self.language_orders = language_orders
        self.learning_rate = learning_rate
        self.device = device
        self.models = []  
        
    def train(self):
        for order in self.language_orders:
            print(f"\nTraining model with order: {order}")
            cur_target = order[-1]
            
            model = ContinualLearningModel(
                model_params=self.model_params,
                dataset_val=self.dataset_val,
                val_embeddings=self.val_embeddings,
                model_class=self.model_class,
                dataset_train=self.dataset_train,
                train_embeddings=self.train_embeddings,
                language_order=order,
                learning_rate=self.learning_rate,
                target=cur_target,
                device=self.device,
                show_progress=False
            )
            
            trained_model = model.train()
            evaluator = MultiHeadEvaluator(device=self.device)
            val_emb_tensor = torch.tensor(self.val_embeddings, dtype=torch.float32).to(self.device)
            results = evaluator.evaluate(trained_model, val_emb_tensor, self.dataset_val, show_results=False)
            
            self.models.append((trained_model, results['best_fine_f1']))
            
            print(f"Model with order {order} achieved Fine-F1: {results['best_fine_f1']:.3f}")
            
        self.models.sort(key=lambda x: x[1], reverse=True)
        return self
    
    def predict(self, embeddings):
        all_narr_probs = []
        all_sub_probs_dicts = []
        scores = []
        
        with torch.no_grad():
            for model, score in self.models:
                narr_probs, sub_probs_dict = model(embeddings)
                all_narr_probs.append(narr_probs)
                all_sub_probs_dicts.append(sub_probs_dict)
                scores.append(score)
        
        weights = torch.tensor(scores)
        weights = weights / weights.sum()
        
        weighted_narr_probs = torch.zeros_like(all_narr_probs[0])
        for w, probs in zip(weights, all_narr_probs):
            weighted_narr_probs += w * probs
        
        weighted_sub_probs_dict = {}
        for key in all_sub_probs_dicts[0].keys():
            sub_probs_stack = torch.stack([d[key] for d in all_sub_probs_dicts])
            weighted_sub_probs = torch.zeros_like(sub_probs_stack[0])
            for w, probs in zip(weights, sub_probs_stack):
                weighted_sub_probs += w * probs
            weighted_sub_probs_dict[key] = weighted_sub_probs
        
        return weighted_narr_probs, weighted_sub_probs_dict

We select the first order is the order that achieved the best validation performance, while the remaining orders are variations of the first one that also keep certain closely related languages together.

In [56]:
language_orders = [
    ['RU', 'BG', 'PT', 'HI', 'EN'],  # Best performing
    ['RU', 'BG', 'HI', 'PT', 'EN'],  # Second best
    ['BG', 'RU', 'PT', 'HI', 'EN'],  # Starting with slavic
    ['HI', 'PT', 'RU', 'BG', 'EN'],  # Variant
    ['PT', 'HI', 'RU', 'BG', 'EN'],  # Variant
]

en_cl_model_ensemble = ContinualLearningEnsemble(
    model_params=model_params,
    language_orders=language_orders,
    dataset_val=dataset_val_en,
    val_embeddings=val_embeddings_en,
)

In the best performing orders for english, both Russian and Bulgarian come together before they reach English.

```
['RU', 'BG', 'PT', 'HI', 'EN']: 0.382
['RU', 'BG', 'HI', 'PT', 'EN']: 0.356
```

Simply swapping the starting Slavic language shows impact:
```
['BG', 'RU', 'PT', 'HI', 'EN']: 0.314
```

This could mean that the starting language in a continual learning can shape how the model build it's representantions. For isntance Russian might have certain features that make it a a strong foundational model.

But, that might not mean that lingustuic patterns surely helped the model, other factors can also influence the models transfer knowledge.

Since RU and BG have proven to help benefit English in some way, learning them later seems to reduce the impact of those shared patterns on the final model. That could be because learning too less-related languages with not much significant patterns for English, whatever patterns that may be, can lead the model's params to a different direction making it hard to then learn from RU/BG.

```
['HI', 'PT', 'RU', 'BG', 'EN']: 0.302
['PT', 'HI', 'RU', 'BG', 'EN']: 0.300
```

And this make sense, because by the time we have reached RU and BG, the model has already adapted to two less-related languages, so integrating RU and BG, doesn't integrate as strongly as it would if they were learned as foundational step.

In [57]:
en_cl_model_ensemble.train()


Training model with order: ['RU', 'BG', 'PT', 'HI', 'EN']

Training on RU data...

Training on BG data...

Training on PT data...

Training on HI data...

Training on EN data...
Focusing on EN
Model with order ['RU', 'BG', 'PT', 'HI', 'EN'] achieved Fine-F1: 0.329

Training model with order: ['RU', 'BG', 'HI', 'PT', 'EN']

Training on RU data...

Training on BG data...

Training on HI data...

Training on PT data...

Training on EN data...
Focusing on EN
Model with order ['RU', 'BG', 'HI', 'PT', 'EN'] achieved Fine-F1: 0.359

Training model with order: ['BG', 'RU', 'PT', 'HI', 'EN']

Training on BG data...

Training on RU data...

Training on PT data...

Training on HI data...

Training on EN data...
Focusing on EN
Model with order ['BG', 'RU', 'PT', 'HI', 'EN'] achieved Fine-F1: 0.346

Training model with order: ['HI', 'PT', 'RU', 'BG', 'EN']

Training on HI data...

Training on PT data...

Training on RU data...

Training on BG data...

Training on EN data...
Focusing on EN
Model wi

<__main__.ContinualLearningEnsemble at 0x169fe80e0>

In [58]:
evaluator = MultiHeadEvaluator()

In [59]:
def evaluate_ensemble(
    ensemble_model,
    embeddings,
    dataset,
    base_evaluator=evaluator,
    save=False,
):
    def model_wrapper(x):
        return ensemble_model.predict(x)
    
    emb_tensor = torch.tensor(embeddings, dtype=torch.float32).to(device)    
    results = base_evaluator.evaluate(
        model=model_wrapper,
        embeddings=emb_tensor,
        dataset=dataset,
        save=save,
    )
    
    return results

In [60]:
results_ensemble_en = evaluate_ensemble(
    ensemble_model=en_cl_model_ensemble,
    embeddings=val_embeddings_en,
    dataset=dataset_val_en,
    save=False
)


Best thresholds found:
Narrative threshold: 0.50
Subnarrative threshold: 0.45

Competition Values
Coarse-F1: 0.517
F1 st. dev. coarse: 0.352
Fine-F1: 0.353
F1 st. dev. fine: 0.342

Fine Metrics:
Precision: 0.121
Recall: 0.288
F1 Samples: 0.353


## Training a Portuguese model

In [61]:
dataset_val_pt, val_embeddings_pt = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] == "PT"
)

In [62]:
dataset_val_pt.shape

(35, 8)

In [63]:
val_embeddings_pt.shape

(35, 1024)

In [64]:
language_order=['RU', 'BG', 'EN', 'HI', 'PT']
target=language_order[-1]

In [65]:
pt_cl_model = ContinualLearningModel(
    model_params=model_params,
    language_order=language_order,
    dataset_val=dataset_val_pt,
    val_embeddings=val_embeddings_pt,
    target=target,
)

In [66]:
pt_model = pt_cl_model.train(epochs_per_language=100, patience=10)


Training on RU data...
Epoch 1/100, Train Loss: 0.7144, Val Loss: 0.6333
Current Learning Rate: 0.001000
Epoch 2/100, Train Loss: 0.4091, Val Loss: 0.6123
Current Learning Rate: 0.001000
Epoch 3/100, Train Loss: 0.2868, Val Loss: 0.5957
Current Learning Rate: 0.001000
Epoch 4/100, Train Loss: 0.2248, Val Loss: 0.5823
Current Learning Rate: 0.001000
Epoch 5/100, Train Loss: 0.1874, Val Loss: 0.5711
Current Learning Rate: 0.001000
Epoch 6/100, Train Loss: 0.1644, Val Loss: 0.5622
Current Learning Rate: 0.001000
Epoch 7/100, Train Loss: 0.1479, Val Loss: 0.5549
Current Learning Rate: 0.001000
Epoch 8/100, Train Loss: 0.1316, Val Loss: 0.5476
Current Learning Rate: 0.001000
Epoch 9/100, Train Loss: 0.1180, Val Loss: 0.5399
Current Learning Rate: 0.001000
Epoch 10/100, Train Loss: 0.1084, Val Loss: 0.5329
Current Learning Rate: 0.001000
Epoch 11/100, Train Loss: 0.0981, Val Loss: 0.5268
Current Learning Rate: 0.001000
Epoch 12/100, Train Loss: 0.0876, Val Loss: 0.5215
Current Learning Rate

In [67]:
_ = pt_cl_model.evaluate_final()


Best thresholds found:
Narrative threshold: 0.15
Subnarrative threshold: 0.55

Competition Values
Coarse-F1: 0.574
F1 st. dev. coarse: 0.188
Fine-F1: 0.394
F1 st. dev. fine: 0.187

Fine Metrics:
Precision: 0.091
Recall: 0.171
F1 Samples: 0.394


We can also try shuffling the articles per language we are processing:

In [68]:
pt_cl_model_ord = ContinualLearningModel(
    model_params=model_params,
    language_order=language_order,
    dataset_val=dataset_val_pt,
    val_embeddings=val_embeddings_pt,
    target=target,
)

In [69]:
pt_model_ord = pt_cl_model_ord.train(epochs_per_language=100, patience=10, shuffle=True)


Training on RU data...
Epoch 1/100, Train Loss: 0.7089, Val Loss: 0.6359
Current Learning Rate: 0.001000
Epoch 2/100, Train Loss: 0.4127, Val Loss: 0.6149
Current Learning Rate: 0.001000
Epoch 3/100, Train Loss: 0.2814, Val Loss: 0.5959
Current Learning Rate: 0.001000
Epoch 4/100, Train Loss: 0.2195, Val Loss: 0.5811
Current Learning Rate: 0.001000
Epoch 5/100, Train Loss: 0.1879, Val Loss: 0.5704
Current Learning Rate: 0.001000
Epoch 6/100, Train Loss: 0.1663, Val Loss: 0.5621
Current Learning Rate: 0.001000
Epoch 7/100, Train Loss: 0.1481, Val Loss: 0.5545
Current Learning Rate: 0.001000
Epoch 8/100, Train Loss: 0.1322, Val Loss: 0.5468
Current Learning Rate: 0.001000
Epoch 9/100, Train Loss: 0.1182, Val Loss: 0.5390
Current Learning Rate: 0.001000
Epoch 10/100, Train Loss: 0.1075, Val Loss: 0.5318
Current Learning Rate: 0.001000
Epoch 11/100, Train Loss: 0.0983, Val Loss: 0.5258
Current Learning Rate: 0.001000
Epoch 12/100, Train Loss: 0.0883, Val Loss: 0.5211
Current Learning Rate

Which is not suprising, simply shuffling the examples within each language does not fundamentally change the sequence in which the model sees languages themselves.

In [70]:
_ = pt_cl_model_ord.evaluate_final()


Best thresholds found:
Narrative threshold: 0.10
Subnarrative threshold: 0.55

Competition Values
Coarse-F1: 0.584
F1 st. dev. coarse: 0.201
Fine-F1: 0.406
F1 st. dev. fine: 0.186

Fine Metrics:
Precision: 0.091
Recall: 0.188
F1 Samples: 0.406


The two highest scores come from sequences that begin with English:

```
['EN', 'BG', 'RU', 'HI', 'PT']: 0.420
['EN', 'RU', 'BG', 'HI', 'PT']: 0.409
```

Then we follow this with slavic languages together (BG, RU), meaning that getting English in early stages can help build a strong foundation for Portuguese.

When Russian and Bulgarian appear directly after English, performance tends to be better, that might mean that it's more beneficial getting English represantations established before moving to RU/BG.

A suprising sequence that places Hindi first, and doesn't necessarily follow language-family closeness, still does okay. This highiligthts that it's not purely about language's that belong to the same family being close together, but just that Hindi may just helped the Portuguese language.
*  So, other factors beyond simple language-family-similarity can also impact the model's performance from the language order, such as specific narrative and subnarratives present in each language, that is, if two langauges share ceratin kinds of narratives more than others, that can also help a lot in a continual learning setup.

In [71]:
language_orders = [
    ['RU', 'BG', 'EN', 'HI', 'PT'],  # Slavic first, then English for more pattern capturing
    ['RU', 'BG', 'HI', 'EN', 'PT'],  # Slavic first, then Hindi
    ['HI', 'BG', 'EN', 'RU', 'PT'],  # Hindi first
    ['EN', 'BG', 'RU', 'HI', 'PT'],  # English first then Slavic
    ['EN', 'RU', 'BG', 'HI', 'PT'],  # Variant
]

pt_cl_model_ensemble = ContinualLearningEnsemble(
    model_params=model_params,
    language_orders=language_orders,
    dataset_val=dataset_val_pt,
    val_embeddings=val_embeddings_pt,
).train()


Training model with order: ['RU', 'BG', 'EN', 'HI', 'PT']

Training on RU data...

Training on BG data...

Training on EN data...

Training on HI data...

Training on PT data...
Focusing on PT
Model with order ['RU', 'BG', 'EN', 'HI', 'PT'] achieved Fine-F1: 0.415

Training model with order: ['RU', 'BG', 'HI', 'EN', 'PT']

Training on RU data...

Training on BG data...

Training on HI data...

Training on EN data...

Training on PT data...
Focusing on PT
Model with order ['RU', 'BG', 'HI', 'EN', 'PT'] achieved Fine-F1: 0.411

Training model with order: ['HI', 'BG', 'EN', 'RU', 'PT']

Training on HI data...

Training on BG data...

Training on EN data...

Training on RU data...

Training on PT data...
Focusing on PT
Model with order ['HI', 'BG', 'EN', 'RU', 'PT'] achieved Fine-F1: 0.412

Training model with order: ['EN', 'BG', 'RU', 'HI', 'PT']

Training on EN data...

Training on BG data...

Training on RU data...

Training on HI data...

Training on PT data...
Focusing on PT
Model wi

In [72]:
results_ensemble_pt = evaluate_ensemble(
    ensemble_model=pt_cl_model_ensemble,
    embeddings=val_embeddings_pt,
    dataset=dataset_val_pt,
)


Best thresholds found:
Narrative threshold: 0.10
Subnarrative threshold: 0.55

Competition Values
Coarse-F1: 0.586
F1 st. dev. coarse: 0.200
Fine-F1: 0.420
F1 st. dev. fine: 0.218

Fine Metrics:
Precision: 0.093
Recall: 0.185
F1 Samples: 0.420


We can follow the same approach for the other languages we have for submission.

We will pick several orders, grouping related languages and select other variants based on those. Because we use an ensembler and a final voting scheme, this can ensure that we will capture both linguistic patterns if applicable, or just specific overlaps that improve performance in the continual learning setup.

## Training a Hindi model

In [73]:
dataset_val_hi, val_embeddings_hi = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] == "HI"
)

In [74]:
dataset_val_hi.shape

(35, 8)

In [75]:
val_embeddings_hi.shape

(35, 1024)

In [76]:
language_orders = [
    ['BG', 'RU', 'EN', 'PT', 'HI'],  # Best order
    ['RU', 'BG', 'EN', 'PT', 'HI'],  # Slavic languages together first
    ['BG', 'RU', 'PT', 'EN', 'HI'],  # Variant ^
    ['PT', 'EN', 'BG', 'RU', 'HI'],  # Slaving languages close together end
    ['EN', 'PT', 'RU', 'BG', 'HI']   # Alternative with Western first
]

hi_cl_model_ensemble = ContinualLearningEnsemble(
    model_params=model_params,
    language_orders=language_orders,
    dataset_val=dataset_val_hi,
    val_embeddings=val_embeddings_hi,
).train()


Training model with order: ['BG', 'RU', 'EN', 'PT', 'HI']

Training on BG data...

Training on RU data...

Training on EN data...

Training on PT data...

Training on HI data...
Focusing on HI
Model with order ['BG', 'RU', 'EN', 'PT', 'HI'] achieved Fine-F1: 0.334

Training model with order: ['RU', 'BG', 'EN', 'PT', 'HI']

Training on RU data...

Training on BG data...

Training on EN data...

Training on PT data...

Training on HI data...
Focusing on HI
Model with order ['RU', 'BG', 'EN', 'PT', 'HI'] achieved Fine-F1: 0.308

Training model with order: ['BG', 'RU', 'PT', 'EN', 'HI']

Training on BG data...

Training on RU data...

Training on PT data...

Training on EN data...

Training on HI data...
Focusing on HI
Model with order ['BG', 'RU', 'PT', 'EN', 'HI'] achieved Fine-F1: 0.314

Training model with order: ['PT', 'EN', 'BG', 'RU', 'HI']

Training on PT data...

Training on EN data...

Training on BG data...

Training on RU data...

Training on HI data...
Focusing on HI
Model wi

In [77]:
results_ensemble_hi = evaluate_ensemble(
    ensemble_model=hi_cl_model_ensemble,
    embeddings=val_embeddings_hi,
    dataset=dataset_val_hi,
)


Best thresholds found:
Narrative threshold: 0.45
Subnarrative threshold: 0.45

Competition Values
Coarse-F1: 0.492
F1 st. dev. coarse: 0.323
Fine-F1: 0.326
F1 st. dev. fine: 0.285

Fine Metrics:
Precision: 0.095
Recall: 0.149
F1 Samples: 0.326


## Training a Bulgarian model

In [78]:
dataset_val_bg, val_embeddings_bg = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] == "BG"
)

In [79]:
language_orders = [
    ['HI', 'PT', 'RU', 'EN', 'BG'],  # Best order
    ['HI', 'PT', 'EN', 'RU', 'BG'],  # RU closer to BG
    ['PT', 'HI', 'EN', 'RU', 'BG'],  # Small Variant
    ['EN', 'PT', 'HI', 'RU', 'BG'],  # English first
    ['PT', 'EN', 'HI', 'RU', 'BG']   # Another variant
]

bg_cl_model_ensemble = ContinualLearningEnsemble(
    model_params=model_params,
    language_orders=language_orders,
    dataset_val=dataset_val_bg,
    val_embeddings=val_embeddings_bg,
).train()


Training model with order: ['HI', 'PT', 'RU', 'EN', 'BG']

Training on HI data...

Training on PT data...

Training on RU data...

Training on EN data...

Training on BG data...
Focusing on BG
Model with order ['HI', 'PT', 'RU', 'EN', 'BG'] achieved Fine-F1: 0.365

Training model with order: ['HI', 'PT', 'EN', 'RU', 'BG']

Training on HI data...

Training on PT data...

Training on EN data...

Training on RU data...

Training on BG data...
Focusing on BG
Model with order ['HI', 'PT', 'EN', 'RU', 'BG'] achieved Fine-F1: 0.371

Training model with order: ['PT', 'HI', 'EN', 'RU', 'BG']

Training on PT data...

Training on HI data...

Training on EN data...

Training on RU data...

Training on BG data...
Focusing on BG
Model with order ['PT', 'HI', 'EN', 'RU', 'BG'] achieved Fine-F1: 0.399

Training model with order: ['EN', 'PT', 'HI', 'RU', 'BG']

Training on EN data...

Training on PT data...

Training on HI data...

Training on RU data...

Training on BG data...
Focusing on BG
Model wi

In [80]:
results_ensemble_bg = evaluate_ensemble(
    ensemble_model=bg_cl_model_ensemble,
    embeddings=val_embeddings_bg,
    dataset=dataset_val_bg,
)


Best thresholds found:
Narrative threshold: 0.50
Subnarrative threshold: 0.50

Competition Values
Coarse-F1: 0.596
F1 st. dev. coarse: 0.373
Fine-F1: 0.390
F1 st. dev. fine: 0.334

Fine Metrics:
Precision: 0.094
Recall: 0.174
F1 Samples: 0.390


## Training a Russian model

In [81]:
dataset_val_ru, val_embeddings_ru = filter_dataset_and_embeddings(
        dataset_val, val_embeddings, lambda row: row["language"] == "RU"
)

In [85]:
language_orders = [
    ['HI', 'PT', 'BG', 'EN', 'RU'],
    ['HI', 'EN', 'BG', 'PT', 'RU'],
    ['PT', 'HI', 'EN', 'BG', 'RU'],
    ['BG', 'EN', 'PT', 'HI', 'RU'],
    ['EN', 'PT', 'BG', 'HI', 'RU'],
    ['PT', 'EN', 'HI', 'BG', 'RU'] 
]

ru_cl_model_ensemble = ContinualLearningEnsemble(
    model_params=model_params,
    language_orders=language_orders,
    dataset_val=dataset_val_ru,
    val_embeddings=val_embeddings_ru,
).train()


Training model with order: ['HI', 'PT', 'BG', 'EN', 'RU']

Training on HI data...

Training on PT data...

Training on BG data...

Training on EN data...

Training on RU data...
Focusing on RU
Model with order ['HI', 'PT', 'BG', 'EN', 'RU'] achieved Fine-F1: 0.262

Training model with order: ['HI', 'EN', 'BG', 'PT', 'RU']

Training on HI data...

Training on EN data...

Training on BG data...

Training on PT data...

Training on RU data...
Focusing on RU
Model with order ['HI', 'EN', 'BG', 'PT', 'RU'] achieved Fine-F1: 0.273

Training model with order: ['PT', 'HI', 'EN', 'BG', 'RU']

Training on PT data...

Training on HI data...

Training on EN data...

Training on BG data...

Training on RU data...
Focusing on RU
Model with order ['PT', 'HI', 'EN', 'BG', 'RU'] achieved Fine-F1: 0.252

Training model with order: ['BG', 'EN', 'PT', 'HI', 'RU']

Training on BG data...

Training on EN data...

Training on PT data...

Training on HI data...

Training on RU data...
Focusing on RU
Model wi

In [86]:
results_ensemble_ru = evaluate_ensemble(
    ensemble_model=ru_cl_model_ensemble,
    embeddings=val_embeddings_ru,
    dataset=dataset_val_ru,
)


Best thresholds found:
Narrative threshold: 0.30
Subnarrative threshold: 0.40

Competition Values
Coarse-F1: 0.523
F1 st. dev. coarse: 0.272
Fine-F1: 0.306
F1 st. dev. fine: 0.251

Fine Metrics:
Precision: 0.080
Recall: 0.143
F1 Samples: 0.306
