# SemEval 2025 Task 10
### Subtask 2: Narrative Baseline Classification -- Multilingual

Given a news article and a [two-level taxonomy of narrative labels](https://propaganda.math.unipd.it/semeval2025task10/NARRATIVE-TAXONOMIES.pdf) (where each narrative is subdivided into subnarratives) from a particular domain, assign to the article all the appropriate subnarrative labels. This is a multi-label multi-class document classification task.
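
Because an article can carry several subnarratives at once, targets for this kind of task are commonly encoded as multi-hot vectors, with one column per label. The sketch below illustrates the idea on a made-up three-label vocabulary. The helpers `build_label_index` and `encode_labels` are hypothetical names introduced for illustration, and the label lists are examples rather than real annotations.

```python
def build_label_index(label_lists):
    """Map every distinct label to a column index (sorted for reproducibility)."""
    labels = sorted({label for labels in label_lists for label in labels})
    return {label: i for i, label in enumerate(labels)}


def encode_labels(labels, label_index):
    """Return a multi-hot vector: 1 where the article carries the label, else 0."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# Illustrative subnarrative lists for three articles (not real annotations).
articles = [
    ["The West is weak", "The EU is divided"],
    ["Western media is an instrument of propaganda"],
    ["The West is weak"],
]

label_index = build_label_index(articles)
targets = [encode_labels(labels, label_index) for labels in articles]
print(targets)
```

Each row of `targets` has one entry per label in the vocabulary, so an article with two subnarratives simply has two entries set to 1.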

## 1. Getting embeddings for the articles

We start by loading the cleaned training and validation dataframes. The `content` column of each article is the text we will embed:

In [4]:
root_dir = '../../'

In [5]:
import pickle
import os

base_save_folder_dir = '../saved/'
dataset_folder = os.path.join(base_save_folder_dir, 'Dataset')

with open(os.path.join(dataset_folder, 'dataset_train_cleaned.pkl'), 'rb') as f:
    dataset_train = pickle.load(f)

In [6]:
with open(os.path.join(dataset_folder, 'dataset_val_cleaned.pkl'), 'rb') as f:
    dataset_val = pickle.load(f)

A quick look at the first rows confirms that the data loaded correctly:

In [8]:
dataset_train.head()

| | language | article_id | content | narratives | subnarratives |
|---|---|---|---|---|---|
| 0 | RU | RU-URW-1161.txt | `<PARA>в ближайшие два месяца сша будут стремит...` | [URW: Blaming the war on others rather than th... | [The West are the aggressors, Other, The West ... |
| 1 | RU | RU-URW-1175.txt | `<PARA>в ес испугались последствий популярности...` | [URW: Discrediting the West, Diplomacy, URW: D... | [The West is weak, Other, The EU is divided] |
| 2 | RU | RU-URW-1149.txt | `<PARA>возможность признания аллы пугачевой ино...` | [URW: Distrust towards Media] | [Western media is an instrument of propaganda] |
| 3 | RU | RU-URW-1015.txt | `<PARA>азаров рассказал о смене риторики киева ...` | [URW: Discrediting Ukraine, URW: Discrediting ... | [Ukraine is a puppet of the West, Discrediting... |
| 4 | RU | RU-URW-1001.txt | `<PARA>в россиянах проснулась массовая любовь к...` | [URW: Praise of Russia] | [Russia is a guarantor of peace and prosperity] |


In [10]:
dataset_train.shape

(1781, 5)

In [11]:
dataset_val.shape

(178, 7)

We use the [KaLM-embedding-multilingual-mini-instruct-v1.5](https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5) model to generate embeddings.

Many embedding models are available, as listed on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Of the options we tested, this one performed best.

Note that we set the tokenizer's maximum sequence length to 512.
A longer sequence length would let us encode an entire article at once, but splitting the article into smaller chunks and encoding each chunk separately gave better results in our tests.

In [12]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch

kalm = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5"
kalm_model = SentenceTransformer(kalm)
kalm_max_length = 512 # recommended by the model
kalm_model.max_seq_length = kalm_max_length

Move to GPU if available:

In [13]:
device = "cuda" if torch.cuda.is_available() else "cpu"
kalm_model.to(device)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: Qwen2Model 
  (1): Pooling({'word_embedding_dimension': 896, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [14]:
texts = [
    "This is a news article about politics.",  # English
    "यह राजनीति के बारे में एक समाचार लेख है।",  # Hindi translation
    "Este é um artigo de notícias sobre política.",  # Portuguese translation
    "Това е новинарска статия за политика.",  # Bulgarian translation
    "Это новость о политике.",  # Russian translation
    "The weather was nice today."  # Unrelated sentence
]

embeddings = kalm_model.encode(texts)
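
The cell above computes the embeddings but does not inspect them. One common sanity check is to compare them with cosine similarity and confirm that the translations of the same sentence score higher against each other than against the unrelated sentence. The sketch below shows the computation on toy 3-dimensional vectors, which are assumptions standing in for real model outputs. The helper `cosine_similarity` is a hypothetical name, not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption, not model output).
politics_en = [0.9, 0.1, 0.0]
politics_pt = [0.8, 0.2, 0.1]
weather_en = [0.0, 0.2, 0.9]

same_topic = cosine_similarity(politics_en, politics_pt)
other_topic = cosine_similarity(politics_en, weather_en)
print(f"same topic: {same_topic:.3f}, other topic: {other_topic:.3f}")
```

With the real model, the same function could be applied to pairs of rows from `embeddings`, for example `cosine_similarity(embeddings[0], embeddings[2])`.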

In [15]:
row = 7
english_article = dataset_train[dataset_train['language'] == 'EN'].iloc[row].content
english_article



In [16]:
import re

def split_into_sections(content):
    parts = re.split(r'<PARA>|</PARA>', content)
    parts = [p.strip() for p in parts if p.strip()]

    if len(parts) == 1:
        return parts[0], "", ""
    elif len(parts) == 2:
        return parts[0], parts[1], ""
    else:
        header = parts[0]
        footer = parts[-1]
        body = " ".join(parts[1:-1])
        return header, body, footer
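
To make the mapping concrete, here is a small sketch of what `split_into_sections` returns for one, two, and several paragraphs. The function is repeated from the cell above only so that the snippet runs on its own, and the input strings are made up for illustration.

```python
import re


def split_into_sections(content):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    parts = re.split(r'<PARA>|</PARA>', content)
    parts = [p.strip() for p in parts if p.strip()]
    if len(parts) == 1:
        return parts[0], "", ""
    elif len(parts) == 2:
        return parts[0], parts[1], ""
    else:
        return parts[0], " ".join(parts[1:-1]), parts[-1]


# Made-up inputs: the first paragraph becomes the header, the last one the
# footer, and everything in between is joined into the body.
examples = [
    "<PARA>Title only",
    "<PARA>Title<PARA>Only paragraph",
    "<PARA>Title<PARA>First<PARA>Second<PARA>Last",
]
results = [split_into_sections(text) for text in examples]
for sections in results:
    print(sections)
```

Note that with exactly two paragraphs the second one is treated as the body, and the footer stays empty.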

In [17]:
header, body, footer = split_into_sections(english_article)
print("Header: ", header)
print("\n\n")
print("Body: ", body)
print("\n\n")
print("Footer: ", footer)

Header:  UN chief Warns of global economic crisis at world economic forum in Davos






Footer:  other tech firms, such as Amazon, Meta, Alphabet, Salesforce, and Twitter, have announced similar moves in recent weeks. Microsoft, based in Redmond, Washington, had 221,000 full-time employees as of june 30, 2022, according to government filings.


`EmbeddingUtils` helps us generate embeddings for paragraphs that might exceed the model's token limit. A paragraph that is too long is split into smaller chunks, and each chunk is encoded separately.

The chunk embeddings are then combined into a single vector that represents the entire paragraph. We use the `sum` aggregation strategy, which performed best in our tests. This lets us handle long texts while keeping most of their content in the final embedding.
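
The chunk-then-aggregate idea can be sketched without loading the model. In the snippet below, `count_tokens` and `toy_encode` are stand-ins (assumptions) for the real tokenizer and encoder, and `split_to_fit` is a hypothetical helper that halves the text on word boundaries until every chunk fits. This is a variant of the idea, not the notebook's exact splitting rule.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def toy_encode(text):
    # Stand-in encoder (assumption): a 2-d vector of [word count, char count].
    return [float(len(text.split())), float(len(text))]


def split_to_fit(text, max_tokens):
    """Recursively halve the text on word boundaries until every chunk fits."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if count_tokens(text) <= max_tokens:
        return [text]
    words = text.split()
    middle = len(words) // 2
    left = " ".join(words[:middle])
    right = " ".join(words[middle:])
    return split_to_fit(left, max_tokens) + split_to_fit(right, max_tokens)


def embed_long_text(text, max_tokens):
    """Encode each chunk separately, then sum the chunk vectors element-wise."""
    chunks = split_to_fit(text, max_tokens)
    vectors = [toy_encode(chunk) for chunk in chunks]
    summed = [sum(values) for values in zip(*vectors)]
    return summed, chunks


text = "one two three four five six seven eight nine ten"
vector, chunks = embed_long_text(text, max_tokens=4)
print(chunks)
print(vector)
```

Every chunk stays within the token budget, and the resulting vector has the same dimensionality as a single chunk embedding regardless of how many chunks were needed.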

In [18]:
from transformers import AutoTokenizer, AutoModel
import torch

class EmbeddingUtils:
    def __init__(self, model, tokenizer, max_length, device=None):
        self.model = model
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.device = device

    def _split_long_paragraph_and_get_embeddings(self, text, encode_fn, strategy="sum"):
        """
        Splits a long paragraph into chunks if it exceeds max_length in tokens,
        calls the provided encoding function on each chunk, then aggregates.
        """
        embeddings = []

        # Naive approach: count the tokens and, while the text is too long,
        # encode its first half and carry on with the remainder.
        while True:
            # Tokenize only to count tokens; the encoder tokenizes the text again.
            tokens = self.tokenizer(
                text, truncation=False, return_tensors="pt", add_special_tokens=False
            )
            num_tokens = tokens["input_ids"].shape[1]

            # If it fits in one chunk, just encode and break
            if num_tokens <= self.max_length:
                emb = encode_fn(text)
                embeddings.append(emb)
                break
            else:
                # Otherwise split roughly in half by characters. The first half
                # is encoded without re-checking its length, so for very long
                # texts it may still be truncated by the encoder.
                split_index = len(text) // 2
                chunk = text[:split_index]
                emb = encode_fn(chunk)
                embeddings.append(emb)

                text = text[split_index:].strip()

        aggregated_emb = self._aggregate_embeddings(embeddings, strategy=strategy)
        return aggregated_emb

    def _aggregate_embeddings(self, embedding_list, strategy="sum"):
        """
        Combine multiple chunk embeddings into a single vector.
        """
        if not embedding_list:
            print('[WARNING] Embedding list was empty')
            return None

        # Stack them on the same device
        stacked = torch.stack([emb.to(self.device) for emb in embedding_list], dim=0)

        if strategy == "mean":
            agg = stacked.mean(dim=0)
        elif strategy == "sum":
            agg = stacked.sum(dim=0)
        elif strategy == "concat":
            agg = torch.cat(embedding_list, dim=0)
        elif strategy == "rms":
            squares = stacked**2
            agg = torch.sqrt(squares.mean(dim=0))
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        return agg
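
To see how the four strategies differ, the following sketch applies them to two toy chunk vectors using plain Python lists. The `aggregate` function is a hypothetical stand-in that mirrors the branches of `_aggregate_embeddings` without depending on `torch`.

```python
import math


def aggregate(vectors, strategy="sum"):
    """Combine equal-length vectors element-wise according to `strategy`."""
    columns = list(zip(*vectors))
    if strategy == "sum":
        return [sum(col) for col in columns]
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "rms":
        return [math.sqrt(sum(x * x for x in col) / len(col)) for col in columns]
    if strategy == "concat":
        return [x for vec in vectors for x in vec]
    raise ValueError(f"Unknown strategy: {strategy}")


# Two toy chunk embeddings (assumption, not model output).
chunk_vectors = [[1.0, 2.0], [3.0, 4.0]]

summed = aggregate(chunk_vectors, "sum")
averaged = aggregate(chunk_vectors, "mean")
rms = aggregate(chunk_vectors, "rms")
concatenated = aggregate(chunk_vectors, "concat")
print(summed, averaged, rms, concatenated)
```

`sum` equals `mean` multiplied by the number of chunks, so both point in the same direction and differ only in magnitude, which grows with the number of chunks under `sum`. `concat` is the only strategy whose output length depends on the number of chunks, which makes it awkward as input to a fixed-size classifier.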

We also define the `KALMEmbeddingProcessor` class, which is responsible for generating embeddings for a given text. 

In [19]:
class KALMEmbeddingProcessor:
    def __init__(self, model, max_length=512, device='cpu'):
        self.model = model
        self.model.max_seq_length = max_length
        self.device = device
        self.instruction = "Produce an embedding useful for detecting relevant war- or climate-related narratives from a taxonomy."
        self.utils = EmbeddingUtils(self.model, self.model.tokenizer, self.model.max_seq_length, self.device)
        print(f"Max length is set to {self.model.max_seq_length}.")
        print("Using device", device)

    def _encode(self, sentence):
        text_to_encode = f"Instruct: {self.instruction}\nQuery: {sentence}"

        embedding = self.model.encode(
            text_to_encode,
            convert_to_tensor=True,
            normalize_embeddings=False,
            show_progress_bar=False,
            device=self.device
        )
        return embedding

    def get_embeddings(self, content, strategy="mean"):
        """
        Main method that splits into header, body, footer, applies chunking,
        and aggregates into a single doc embedding.
        """
        header, body, footer = split_into_sections(content)

        section_embs = []

        # 1) Header
        if header:
            emb = self.utils._split_long_paragraph_and_get_embeddings(header, self._encode, strategy="sum")
            if emb is not None:
                section_embs.append(emb)

        # 2) Body
        if body:
            emb = self.utils._split_long_paragraph_and_get_embeddings(body, self._encode, strategy="sum")
            if emb is not None:
                section_embs.append(emb)

        # 3) Footer
        if footer:
            emb = self.utils._split_long_paragraph_and_get_embeddings(footer, self._encode, strategy="sum")
            if emb is not None:
                section_embs.append(emb)

        if not header and not body and not footer:
            print("[WARNING] Empty article or no sections found")
            return None

        final_emb = self.utils._aggregate_embeddings(section_embs, strategy=strategy)
        if final_emb is None:
            print("[WARNING] Failed to aggregate embeddings")
            return None

        final_emb_np = final_emb.detach().cpu().numpy()
        return final_emb_np

In [20]:
processor_kalm = KALMEmbeddingProcessor(model=kalm_model, max_length=kalm_max_length, device=device)

Max length is set to 512.
Using device cpu


In [22]:
embeddings_dir = base_save_folder_dir
embedding_train_file_name = 'embeddings_train_kalm.npy'
embeddings_train_full_path = embeddings_dir + 'Embeddings/' + embedding_train_file_name

In [23]:
embeddings_train_full_path

'../saved/Embeddings/embeddings_train_kalm.npy'

In [24]:
embedding_val_file_name = 'embeddings_val_kalm.npy'
embeddings_val_full_path = embeddings_dir + 'Embeddings/' + embedding_val_file_name

In [25]:
embeddings_val_full_path

'../saved/Embeddings/embeddings_val_kalm.npy'

In [26]:
import os

def are_embeddings_saved(filepath):
    if os.path.exists(filepath):
        return True
    print("Embeddings not computed, computing..")
    return False

In [28]:
are_embeddings_saved(embeddings_val_full_path)

Embeddings not computed, computing..


False

Next, we compute the embeddings once and save them to disk:

In [29]:
dataset_train.shape

(1781, 5)

In [31]:
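import numpy as np

def precompute_embeddings(dataset, processor, file_path):
    """Embed every article in `dataset` with `processor` and save the result as .npy."""
    embeddings = []
    for _, row in dataset.iterrows():
        # Use the processor passed in rather than the global `processor_kalm`.
        embedding = processor.get_embeddings(row['content'], strategy="sum")
        embeddings.append(embedding)

    embeddings_array = np.array(embeddings)
    # np.save does not create missing directories, so make sure the target exists.
    os.makedirs(os.path.dirname(file_path) or '.', exist_ok=True)
    np.save(file_path, embeddings_array)
    print(f"Embeddings saved to {file_path}")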

def load_embeddings(filename):
    return np.load(filename)

if not are_embeddings_saved(embeddings_train_full_path): precompute_embeddings(dataset_train, processor_kalm, embeddings_train_full_path)

Embeddings not computed, computing..
Embeddings saved to ../saved/Embeddings/embeddings_train_kalm.npy


In [32]:
if not are_embeddings_saved(embeddings_val_full_path): precompute_embeddings(dataset_val, processor_kalm, embeddings_val_full_path)

Embeddings not computed, computing..
Embeddings saved to ../saved/Embeddings/embeddings_val_kalm.npy
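
The check-then-compute pattern used above can also be folded into a single helper. The sketch below is one possible way to do that. `load_or_compute` and `expensive_embeddings` are hypothetical names, and `pickle` stands in for `np.save`/`np.load` only to keep the snippet free of dependencies.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the cached result at `path`, computing and saving it on a miss."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    # Create the target directory if needed before writing the cache file.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_embeddings():
    # Stand-in for the slow embedding step; records how often it runs.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "Embeddings", "toy.pkl")
    first = load_or_compute(cache_path, expensive_embeddings)
    second = load_or_compute(cache_path, expensive_embeddings)

print(len(calls))
```

The second call reads the cached file instead of recomputing, so the expensive function runs only once.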
