<figure>
  <img src="https://raw.githubusercontent.com/shadowkshs/DimABSA2026/refs/heads/main/banner.png" width="100%">
</figure>

# XLM-RoBERTa
## Languages: Russian, Tatar, English, Chinese
## Note: only domain tags (not language tags) are added to the tokenizer input
# Subtask 1: Dimensional Aspect Sentiment Regression (DimASR)

-----

## Starter Notebook
Leveraging Pretrained Language Models for Dimensional Sentiment Regression


## Introduction:

You are welcome to participate in our SemEval Shared Task!

In this starter notebook, we will take you through the process of fine-tuning a pretrained language model on sample data to build a sentiment regressor. The notebook was adapted from a Hugging Face implementation for such tasks.

### Outline:

- Installation and importation of necessary libraries
- Setting up the project parameters
- Running training and evaluation

### Before you start:

- It is strongly advised that you use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then in the popup menu, choose "GPU" in the "Hardware accelerator" box.

### NB:

The code in this notebook is provided to help you become familiar with fine-tuning language models for sentiment regression. You may extend or modify it as appropriate to obtain competitive performance.

### Languages and Domains:
#### Track A: Subtask 1
- eng_restaurant
- eng_laptop
- jpn_hotel
- jpn_finance
- rus_restaurant
- tat_restaurant
- ukr_restaurant
- zho_restaurant
- zho_laptop
#### Track B: Subtask 1
- deu-stance
- eng-stance
- hau-stance
- kin-stance
- swa-stance
- twi-stance


### Model:
The original starter notebook uses the bert-base-multilingual-cased pretrained model, developed by Google. That model was trained with a masked language modeling (MLM) objective on the top 104 languages with the largest Wikipedia presence. You can find the model here: https://huggingface.co/google-bert/bert-base-multilingual-cased

This adapted version instead sets `model_name = "xlm-roberta-base"` in the model config cell.

If your target language is not included in the common set supported by this model, you can search for a more suitable model on Hugging Face: https://huggingface.co/models



In [42]:
import json
from typing import List, Dict
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel

from scipy.stats import pearsonr
from tqdm import tqdm
import math
import re
import requests


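def load_jsonl(filepath: str) -> List[Dict]:
    """Read a local JSONL file, skipping blank lines."""
    with open(filepath, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_jsonl_url(url: str) -> List[Dict]:
    """Download a JSONL file and parse it, skipping blank lines."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines() if line.strip()]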

### First, visit the [DimABSA2026](https://github.com/DimABSA/DimABSA2026) repository and check the task-dataset folder.

### Step 1: Load the competition data

- Read JSONL files (train/dev/predict) into Colab.  
- Train files contain Valence–Arousal (VA) labels.  
- Predict files have no VA labels.  
- This script:
  1. Loads the JSONL data.
  2. Splits 10% of train data as dev set.
  3. Converts JSONL into DataFrames (ID, Text, Aspect, Valence, Arousal).
  4. Prints the first few rows for checking.
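
As a rough illustration of step 3 above, the sketch below flattens one training record into per-aspect rows using only the standard library. The sample record is invented for illustration (its ID, text, categories, and scores are not taken from the dataset); the field names `Quadruplet`, `Aspect`, and `VA` and the `V#A` string format follow the loader code in this notebook.

```python
import json

# Hypothetical training line; all values are invented for illustration.
sample_line = json.dumps({
    "ID": "example_1",
    "Text": "the pizza was great but the service was slow",
    "Quadruplet": [
        {"Aspect": "pizza", "Category": "FOOD#QUALITY", "Opinion": "great", "VA": "7.50#7.00"},
        {"Aspect": "service", "Category": "SERVICE#GENERAL", "Opinion": "slow", "VA": "3.50#6.00"},
    ],
})

def flatten_record(record):
    """Turn one JSONL record into a list of per-aspect row dicts."""
    rows = []
    for item in record["Quadruplet"]:
        # "7.50#7.00" -> valence 7.5, arousal 7.0
        valence, arousal = (float(x) for x in item["VA"].split("#"))
        rows.append({
            "ID": record["ID"],
            "Text": record["Text"],
            "Aspect": item["Aspect"],
            "Valence": valence,
            "Arousal": arousal,
        })
    return rows

rows = flatten_record(json.loads(sample_line))
for row in rows:
    print(row["Aspect"], row["Valence"], row["Arousal"])
```

The notebook itself does the same job with `pd.json_normalize`, which also handles the `Triplet` and `Aspect` layouts.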


In [45]:
# task config
subtask = "subtask_1"  # don't change
task = "task1"  # don't change
langs = ["eng", "zho"]  # change to the languages you want to test
domains = ["restaurant", "laptop"]  # change to the domains you want to test

all_train = []
all_dev = []

for lang in langs:
    for domain in domains:
        train_url = f"https://raw.githubusercontent.com/DimABSA/DimABSA2026/refs/heads/main/task-dataset/track_a/{subtask}/{lang}/{lang}_{domain}_train_alltasks.jsonl"
        dev_url = f"https://raw.githubusercontent.com/DimABSA/DimABSA2026/refs/heads/main/task-dataset/track_a/{subtask}/{lang}/{lang}_{domain}_dev_{task}.jsonl"

        try:
            train_raw = load_jsonl_url(train_url)
            dev_raw = load_jsonl_url(dev_url)

            for r in train_raw:
                r["language"] = lang
                r["domain"] = domain

            for r in dev_raw:
                r["language"] = lang
                r["domain"] = domain


            all_train.extend(train_raw)
            all_dev.extend(dev_raw)
            print(f"Loaded: {lang}-{domain} ✅")

        except Exception as e:
            print(f"Skipped {lang}-{domain}: {e}")

# model config
model_name = "xlm-roberta-base"  # change to the transformer model you want
lr = 1e-5  # learning rate
epochs = 5

Loaded: eng-restaurant ✅
Loaded: eng-laptop ✅
Loaded: zho-restaurant ✅
Loaded: zho-laptop ✅


Other transformer models you can try (note that these three are English-only, so they are a poor fit for non-English data):
1. roberta-large
2. roberta-base
3. bert-base-uncased

For more models, visit [Hugging Face](https://huggingface.co/models).

In [48]:
#==== step 1 load the data ====
# you can change the env for your task.
# train data should have the VA labels, predict data has no VA labels

def jsonl_to_df(data):
    if 'Quadruplet' in data[0]:
        df = pd.json_normalize(data, 'Quadruplet', ['ID', 'Text', 'language', 'domain'])
        df[['Valence', 'Arousal']] = df['VA'].str.split('#', expand=True).astype(float)
        df = df.drop(columns=['VA', 'Category', 'Opinion'])  # drop unnecessary columns
        df = df.drop_duplicates(subset=['ID', 'Aspect'], keep='first')  # remove duplicate ID+Aspect

    elif 'Triplet' in data[0]:
        df = pd.json_normalize(data, 'Triplet', ['ID', 'Text', 'language', 'domain'])
        df[['Valence', 'Arousal']] = df['VA'].str.split('#', expand=True).astype(float)
        df = df.drop(columns=['VA', 'Opinion'])  # drop unnecessary columns
        df = df.drop_duplicates(subset=['ID', 'Aspect'], keep='first')  # remove duplicate ID+Aspect

    elif 'Aspect' in data[0]:
        df = pd.json_normalize(data, 'Aspect', ['ID', 'Text', 'language', 'domain'])
        df = df.rename(columns={df.columns[0]: "Aspect"})  # rename to Aspect
        df['Valence'] = 0  # default value
        df['Arousal'] = 0  # default value

    else:
        raise ValueError("Invalid format: must include 'Quadruplet' or 'Triplet' or 'Aspect'")

    return df

train_df = jsonl_to_df(all_train)
predict_df = jsonl_to_df(all_dev)

# split 10% for dev
train_df, dev_df = train_test_split(train_df, test_size=0.1, random_state=42)

### Display the dataframe

In [49]:
from IPython.display import display, Markdown

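# NOTE: `lang` and `domain` still hold the last values from the loading loop,
# so the headings below name only that pair even though the DataFrames
# contain every loaded language-domain combination.
display(Markdown(f"### {subtask}_{lang}_{domain} train_df"))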
display(train_df.head())

display(Markdown(f"### {subtask}_{lang}_{domain} dev_df"))
display(dev_df.head())

display(Markdown(f"### {subtask}_{lang}_{domain} predict_df"))
display(predict_df.head())

### subtask_1_zho_laptop train_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
2810,personal pans,rest16_quad_train_986,personal pans are the perfect size for those h...,eng,restaurant,7.5,7.0
13414,原味炒泡麵,R0947:S025,大推原味炒泡麵，不得不說這裡的炒泡麵味道還真是不錯。,zho,restaurant,7.25,6.75
21277,CPU,7057803:S011,最便宜的要小心不要買到電子垃圾CPU慢到不想用續航力也不強帶出門還要找插座,zho,laptop,3.5,6.0
12641,冰摩卡,R1174:S038,冰摩卡聞起來香。,zho,restaurant,6.33,5.0
13673,牛排,R1453:S008,牛排有嫩。,zho,restaurant,6.0,5.25


### subtask_1_zho_laptop dev_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
8680,flip,laptop_quad_train_2485,i loved the simplicity of chrome os while enjo...,eng,laptop,7.62,7.75
6311,,laptop_quad_train_756,"i am very much in the causal gamer category , ...",eng,laptop,6.38,6.38
1406,open kitchen,rest16_quad_train_137,admittedly some nights inside the restaurant w...,eng,restaurant,6.75,6.88
2223,pizzeria,rest16_quad_train_636,i stumbled upon this great pizzeria as i explo...,eng,restaurant,7.62,7.62
14423,石蚵煎蛋,R2968:S005,除了鍋貼外石蚵煎蛋也很不錯。,zho,restaurant,6.75,6.0


### subtask_1_zho_laptop predict_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
0,diner food,rest26_aspect_va_dev_1,Great diner food and breakfast is served all day,eng,restaurant,0,0
1,breakfast,rest26_aspect_va_dev_1,Great diner food and breakfast is served all day,eng,restaurant,0,0
2,food,rest26_aspect_va_dev_2,It got very crowded but we still received exce...,eng,restaurant,0,0
3,drinks,rest26_aspect_va_dev_2,It got very crowded but we still received exce...,eng,restaurant,0,0
4,service,rest26_aspect_va_dev_2,It got very crowded but we still received exce...,eng,restaurant,0,0


### Step 2: Build Dataset and DataLoader

- Define a custom `VADataset` class for PyTorch:
  - Joins the domain tag, Aspect, and Text into a single input string.
  - Uses the Hugging Face tokenizer for the chosen model to create `input_ids` and `attention_mask`.
  - Returns `[Valence, Arousal]` labels as float tensor.
- Convert the processed DataFrames into PyTorch `Dataset` objects.
- Wrap them with `DataLoader` for mini-batch training and evaluation.
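
To make the input format concrete, here is a minimal sketch of the string construction and fixed-length padding that `VADataset` relies on. It does not call a real tokenizer: `toy_encode` is a hypothetical whitespace tokenizer with made-up token IDs, used only to show how `input_ids` and `attention_mask` line up after truncation and padding.

```python
def build_input(domain, aspect, text):
    """Compose the model input as "[DOMAIN] aspect: text"."""
    return f"[{domain.upper()}] {aspect}: {text}"

def toy_encode(text, max_len, pad_id=0):
    """Hypothetical stand-in for a tokenizer: one ID per whitespace token.

    IDs start at 1 so that 0 can serve as the padding ID.
    """
    vocab = {}
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]
    ids = ids[:max_len]                        # truncation
    mask = [1] * len(ids)                      # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # pad to max_len

text = build_input("laptop", "keyboard", "The keyboard is good")
input_ids, attention_mask = toy_encode(text, max_len=8)
print(text)
print(input_ids)
print(attention_mask)
```

A real subword tokenizer will produce different IDs and usually more tokens per word, but the padded shape and the mask convention are the same.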

In [50]:
#==== Dataset ====
class VADataset(Dataset):
    '''
    A PyTorch Dataset for Valence–Arousal regression with domain context.

    - Combines domain, aspect, and text into a single input:
        e.g., "[LAPTOP] keyboard: The keyboard is good"
    - Tokenizes the input using a HuggingFace tokenizer.
    - Returns:
        * input_ids: token IDs, shape [max_len]
        * attention_mask: mask, shape [max_len]
        * labels: [Valence, Arousal], shape [2], float tensor

    Args:
        dataframe (pd.DataFrame): must contain columns
            "Text", "Aspect", "Valence", "Arousal", "language", "domain".
        tokenizer: HuggingFace tokenizer.
        max_len (int): max sequence length.
    '''
    def __init__(self, dataframe, tokenizer, max_len=128):
        self.sentences = dataframe["Text"].tolist()
        self.aspects = dataframe["Aspect"].tolist()
        self.labels = dataframe[["Valence", "Arousal"]].values.astype(float)
        self.languages = dataframe["language"].tolist()
        self.domains = dataframe["domain"].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # The language tag was removed on purpose; only the domain tag is kept.
        domain = self.domains[idx].upper()
        # Prefix the input with the domain tag, e.g. "[LAPTOP]", which is
        # registered with the tokenizer as an additional special token.
        text = f"[{domain}] {self.aspects[idx]}: {self.sentences[idx]}"


        encoded = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )

        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float)
        }



# convert to Dataset and Dataloader
tokenizer = AutoTokenizer.from_pretrained(model_name)
special_tokens_dict = {
    'additional_special_tokens': [ '[RESTAURANT]', '[LAPTOP]']
}
tokenizer.add_special_tokens(special_tokens_dict)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataset = VADataset(train_df, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

dev_dataset = VADataset(dev_df, tokenizer)
dev_loader = DataLoader(dev_dataset, batch_size=64, shuffle=False)  # no shuffling needed for evaluation

### Step 3: Build and Train TransformerVARegressor

- Define **`TransformerVARegressor`**:  
  - Uses pretrained Transformer (e.g. BERT) as backbone.  
  - Adds dropout and linear layer to predict **Valence** and **Arousal**.  

- Implement helper methods:  
  - `train_epoch`: one training pass with optimizer and loss.  
  - `eval_epoch`: validation pass without gradient updates.  

- Set training parameters:  
  - `lr = 1e-5`, `epochs = 5`, `loss_fn = MSELoss`.  

- Run training loop:  
  - For each epoch, print training and validation loss to monitor progress.
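
The loss used here is mean squared error over both outputs. As a small worked example with invented numbers, the sketch below reproduces in plain Python what `nn.MSELoss()` computes with its default mean reduction: the squared error averaged over every element of the `[batch, 2]` prediction matrix.

```python
def mse_loss(preds, labels):
    """Mean squared error averaged over all elements (batch x 2 outputs)."""
    squared = [
        (p - y) ** 2
        for pred_row, label_row in zip(preds, labels)
        for p, y in zip(pred_row, label_row)
    ]
    return sum(squared) / len(squared)

# Invented [Valence, Arousal] predictions and gold labels for two samples.
preds = [[6.0, 5.0], [4.0, 7.0]]
labels = [[7.0, 5.0], [4.0, 5.0]]

loss = mse_loss(preds, labels)
print(loss)  # (1 + 0 + 0 + 4) / 4 = 1.25
```

Because the average runs over both columns, Valence and Arousal errors contribute equally to the training signal.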


In [51]:
#====step 3 build your model ====
class TransformerVARegressor(nn.Module):
    '''
    A Transformer-based regressor for predicting Valence and Arousal scores.

    - Uses a pretrained Transformer backbone (loaded with AutoModel) to encode text.
    - Takes the first-token representation ([CLS] for BERT, <s> for XLM-R) as the sentence-level embedding.
    - Adds a dropout layer and a linear head to output 2 values: [Valence, Arousal].
    - Includes helper methods for one training epoch and one evaluation epoch.

    Args:
        model_name (str): HuggingFace model name; defaults to the global `model_name` set in the config cell.
        dropout (float): Dropout rate before the regression head.

    Methods:
        train_epoch(dataloader, optimizer, loss_fn, device):
            Train the model for one epoch.
            Returns average training loss.

        eval_epoch(dataloader, loss_fn, device):
            Evaluate the model for one epoch (no gradient).
            Returns average validation loss.
    '''
    def __init__(self, model_name=model_name, dropout=0.1):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.reg_head = nn.Linear(self.backbone.config.hidden_size, 2)  # Valence + Arousal

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0]  # [CLS] token
        x = self.dropout(cls_output)
        return self.reg_head(x)


    def train_epoch(self, dataloader, optimizer, loss_fn, device):
        self.train()
        total_loss = 0
        for batch in tqdm(dataloader):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            optimizer.zero_grad()
            outputs = self(input_ids, attention_mask)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        return total_loss / len(dataloader)

    def eval_epoch(self, dataloader, loss_fn, device):
        self.eval()
        total_loss = 0
        with torch.no_grad():
            for batch in dataloader:
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                outputs = self(input_ids, attention_mask)
                loss = loss_fn(outputs, labels)
                total_loss += loss.item()
        return total_loss / len(dataloader)

# Train the model on your data
model = TransformerVARegressor().to(device)
model.backbone.resize_token_embeddings(len(tokenizer))
lr = locals().get("lr", 1e-5)
epochs = locals().get("epochs", 5)

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
loss_fn = nn.MSELoss()

for epoch in range(epochs):
    train_loss = model.train_epoch(train_loader, optimizer, loss_fn, device)
    val_loss = model.eval_epoch(dev_loader, loss_fn, device)
    print(f"model:{model_name} Epoch:{epoch+1}: train={train_loss:.4f}, val={val_loss:.4f}")

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
100%|██████████| 302/302 [06:44<00:00,  1.34s/it]


model:xlm-roberta-base Epoch:1: train=3.6913, val=1.3867


100%|██████████| 302/302 [06:42<00:00,  1.33s/it]


model:xlm-roberta-base Epoch:2: train=0.8932, val=0.6388


100%|██████████| 302/302 [06:42<00:00,  1.33s/it]


model:xlm-roberta-base Epoch:3: train=0.6091, val=0.6852


100%|██████████| 302/302 [06:42<00:00,  1.33s/it]


model:xlm-roberta-base Epoch:4: train=0.4995, val=0.4543


100%|██████████| 302/302 [06:43<00:00,  1.34s/it]


model:xlm-roberta-base Epoch:5: train=0.4405, val=0.5605


### Step 4: Evaluate model performance on dev set

- Define helper function `get_prd`:
  - For **dev**: get both predictions and gold labels.
  - For **pred**: only get predictions (no gold labels).
- Define `evaluate_predictions_task1`:
  - Compute Pearson Correlation Coefficient (PCC) for Valence (V) and Arousal (A).
  - Compute RMSE on the summed V + A score (this implementation does not normalize it).
- Run evaluation on the combined dev set.
- Print metrics to check how well the models perform.
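
For readers who want to see the arithmetic, the following is a standard-library sketch of the same three quantities that `evaluate_predictions_task1` reports. The scores are invented, and the RMSE here mirrors this notebook's choice of comparing summed V + A values; it is not claimed to match the official task scorer.

```python
import math

def clip(values, low=1.0, high=9.0):
    """Clamp each prediction to the valid score range."""
    return [min(max(v, low), high) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse_sum(pred_v, pred_a, gold_v, gold_a):
    """RMSE between summed V + A scores, as in this notebook's evaluation."""
    diffs = [(gv + ga) - (pv + pa)
             for pv, pa, gv, ga in zip(pred_v, pred_a, gold_v, gold_a)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Invented scores; the 10.0 is out of range and gets clipped to 9.0.
gold_v, gold_a = [2.0, 5.0, 8.0], [5.0, 6.0, 7.0]
pred_v, pred_a = clip([3.0, 5.0, 7.0]), clip([5.0, 6.0, 10.0])

pcc_v = pearson(pred_v, gold_v)
pcc_a = pearson(pred_a, gold_a)
rmse_va = rmse_sum(pred_v, pred_a, gold_v, gold_a)
print(pcc_v, pcc_a, rmse_va)
```

Note that PCC only measures linear agreement: the Valence predictions above correlate perfectly with the gold scores even though two of the three values are off by one point, which is why an error metric is reported alongside it.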


In [52]:
import numpy as np
import math
from scipy.stats import pearsonr
from tqdm import tqdm # Make sure to import tqdm

#==== step 4 use dev data to check your model's performance ====

def get_prd(model, dataloader, type="dev"):
    """Run the model over a dataloader.

    Returns (pred_v, pred_a, gold_v, gold_a) for type="dev"
    and (pred_v, pred_a) for type="pred".
    """
    if type not in ("dev", "pred"):
        raise ValueError("type must be 'dev' or 'pred'")

    model.eval()  # disable dropout for inference

    all_preds, all_labels = [], []
    desc = "Evaluating" if type == "dev" else "Predicting"
    with torch.no_grad():
        for batch in tqdm(dataloader, desc=desc):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            all_preds.append(model(input_ids, attention_mask).cpu().numpy())
            if type == "dev":
                all_labels.append(batch["labels"].cpu().numpy())

    preds = np.vstack(all_preds)
    pred_v, pred_a = preds[:, 0], preds[:, 1]

    if type == "dev":
        labels = np.vstack(all_labels)
        return pred_v, pred_a, labels[:, 0], labels[:, 1]
    return pred_v, pred_a

def evaluate_predictions_task1(pred_a, pred_v, gold_a, gold_v):
    """Calculates PCC and RMSE, ensuring predictions are in the 1-9 range."""

    # CRITICAL: Clip predictions to the valid [1, 9] range before scoring.
    pred_v_clipped = np.clip(pred_v, 1, 9)
    pred_a_clipped = np.clip(pred_a, 1, 9)

    pcc_v = pearsonr(pred_v_clipped, gold_v)[0]
    pcc_a = pearsonr(pred_a_clipped, gold_a)[0]

    gold_va = gold_v + gold_a
    pred_va = pred_v_clipped + pred_a_clipped # Use clipped values

    # Use np.mean and np.sqrt for a cleaner RMSE calculation
    rmse_va = np.sqrt(np.mean((gold_va - pred_va) ** 2))

    return {
        'PCC_V': pcc_v,
        'PCC_A': pcc_a,
        'RMSE_VA': rmse_va,
    }

# --- Call the functions ---
# Make sure 'dev_loader' is the actual DataLoader object you created
pred_v, pred_a, gold_v, gold_a = get_prd(model, dev_loader, type="dev")
eval_score = evaluate_predictions_task1(pred_a, pred_v, gold_a, gold_v)
print(f"{model_name} dev_eval: {eval_score}")

Evaluating: 100%|██████████| 34/34 [00:14<00:00,  2.30it/s]

xlm-roberta-base dev_eval: {'PCC_V': np.float32(0.863335), 'PCC_A': np.float32(0.7615335), 'RMSE_VA': np.float32(1.2793247)}







### Step 5: Save and submit prediction results

- Define helper `df_to_jsonl`:
  - Sort by ID number.
  - Group rows by ID.
  - Save predictions in JSONL format (`ID`, `Aspect_VA`).
- Run the model on the predict sets (laptop & restaurant).
- Fill in predicted Valence/Arousal values.
- Export one JSONL file per language-domain pair, for example:
  - `pred_eng_laptop.jsonl`
  - `pred_eng_restaurant.jsonl`
  - `pred_zho_laptop.jsonl`
- These files can be uploaded as the final submission.
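
The grouping step can be sketched without pandas as follows. This is only an illustration of the record layout (`ID` plus an `Aspect_VA` list with `V#A` strings) and of the `pred_[lang_code]_[domain].jsonl` naming pattern; the helper names `pred_filename` and `to_records` and all row values are hypothetical.

```python
import json

def pred_filename(lang_code, domain):
    """File name for one language-domain pair."""
    return f"pred_{lang_code}_{domain}.jsonl"

def to_records(rows):
    """Group per-aspect rows by ID into the {"ID", "Aspect_VA"} layout."""
    grouped = {}  # dicts keep insertion order, so IDs stay in first-seen order
    for row in rows:
        grouped.setdefault(row["ID"], []).append({
            "Aspect": row["Aspect"],
            "VA": f"{row['Valence']:.2f}#{row['Arousal']:.2f}",
        })
    return [{"ID": gid, "Aspect_VA": items} for gid, items in grouped.items()]

# Invented predictions: two aspects of one sentence, one aspect of another.
rows = [
    {"ID": "example_1", "Aspect": "food", "Valence": 7.25, "Arousal": 6.5},
    {"ID": "example_1", "Aspect": "drinks", "Valence": 6.0, "Arousal": 5.25},
    {"ID": "example_2", "Aspect": "service", "Valence": 3.1, "Arousal": 6.0},
]

records = to_records(rows)
lines = [json.dumps(r, ensure_ascii=False) for r in records]
print(pred_filename("eng", "restaurant"))
print("\n".join(lines))
```

Each printed line is one JSONL record, so writing `lines` to the file named by `pred_filename` gives one submission file for that pair.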


### File Naming Guidelines
When submitting your predictions on the Codabench task page:

- Decide the target language(s) and domain(s). Each submission file corresponds to one language-domain combination.
- For each language-domain combination, name the file `pred_[lang_code]_[domain].jsonl`, where
  - `[lang_code]` represents a 3-letter language code, and
  - `[domain]` represents a domain.
- For example, Hausa predictions for the movie domain should be named `pred_hau_movie.jsonl`.
- If submitting for multiple languages or domains, submit one prediction file per language-domain combination. For example:
```plaintext
subtask_1
├── pred_eng_restaurant.jsonl
├── pred_eng_laptop.jsonl
└── pred_zho_laptop.jsonl
```

In [53]:
#==== step 5 save & submit your predict results ====
def extract_num(s):
    m = re.search(r"(\d+)$", str(s))
    return int(m.group(1)) if m else -1

def df_to_jsonl(df, out_path):
    df_sorted = df.sort_values(by="ID", key=lambda x: x.map(extract_num))
    grouped = df_sorted.groupby("ID", sort=False)

    with open(out_path, "w", encoding="utf-8") as f:
        for gid, gdf in grouped:
            record = {
                "ID": gid,
                "Aspect_VA": []
            }
            for _, row in gdf.iterrows():
                record["Aspect_VA"].append({
                    "Aspect": row["Aspect"],
                    "VA": f"{row['Valence']:.2f}#{row['Arousal']:.2f}"
                })
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

pred_dataset = VADataset(predict_df, tokenizer)
pred_loader = DataLoader(pred_dataset, batch_size=64, shuffle=False)
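pred_v, pred_a = get_prd(model, pred_loader, type="pred")

predict_df["Valence"] = np.clip(pred_v, 1, 9)
predict_df["Arousal"] = np.clip(pred_a, 1, 9)

# Write one file per language-domain pair, as the file naming guidelines require
for (lang_code, dom), part in predict_df.groupby(["language", "domain"], sort=False):
    df_to_jsonl(part, f"pred_{lang_code}_{dom}.jsonl")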

Predicting: 100%|██████████| 27/27 [00:11<00:00,  2.35it/s]


### Download the submit files

In [55]:
import os
import shutil
import zipfile
from google.colab import files

# Create the folder subtask if it does not exist
os.makedirs(subtask, exist_ok=True)

# Move every prediction file into the subtask folder
for fname in os.listdir("."):
    if fname.startswith("pred_") and fname.endswith(".jsonl"):
        shutil.move(fname, os.path.join(subtask, fname))

# Create a zip file named "<subtask>.zip" containing the subtask folder
with zipfile.ZipFile(f"{subtask}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files_in_dir in os.walk(subtask):
        for file in files_in_dir:
            path = os.path.join(root, file)
            # Keep folder structure inside the zip
            zf.write(path, os.path.relpath(path, "."))

# Download the created zip file to local machine
files.download(f"{subtask}.zip")


In [None]:
# from here my code

In [None]:
import torch, os
from transformers import AutoTokenizer

save_dir = "./transformer_va_regressor"
os.makedirs(save_dir, exist_ok=True)

# Save tokenizer
tokenizer.save_pretrained(save_dir)

# Save model backbone config (for reloading)
model.backbone.config.save_pretrained(save_dir)

# Save model weights
torch.save(model.state_dict(), os.path.join(save_dir, "pytorch_model.bin"))

print(f"✅ Model and tokenizer saved to {save_dir}")


✅ Model and tokenizer saved to ./transformer_va_regressor


In [None]:
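import os
from huggingface_hub import login

# Never hard-code access tokens in a notebook. This assumes the token is
# available in the HF_TOKEN environment variable.
login(token=os.environ["HF_TOKEN"])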

In [None]:


from huggingface_hub import HfApi

# ✅ Upload to Hugging Face
repo_id = "hassanshahzad2003/bert-base-multilingual-cased-task1.2"

api = HfApi()
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_folder(
    folder_path=save_dir,
    repo_id=repo_id,
    commit_message="Upload fine-tuned TransformerVARegressor"
)

print(f"✅ Uploaded to https://huggingface.co/{repo_id}")


✅ Uploaded to https://huggingface.co/hassanshahzad2003/bert-base-multilingual-cased-task1.2


### Conclusion

In this notebook, we walked through the full pipeline for **Dimensional Aspect Sentiment Regression (DimASR)**:

1. **Load data**: Import the competition JSONL files, split train/dev sets, and convert to DataFrames.  
2. **Build dataset & dataloaders**: Define a custom `VADataset` to tokenize text and prepare `[Valence, Arousal]` labels.  
3. **Train & evaluate**: Train a Transformer-based regressor and check model performance on the dev set using PCC and RMSE metrics.  
4. **Predict & submit**: Run the trained models on the prediction sets, generate VA scores, and save results as JSONL for submission.  

This pipeline ensures that your model is trained, validated, and ready for competition submission. You can further improve results by tuning hyperparameters, trying different pretrained models, or applying data augmentation strategies.


In [28]:
model.eval()

TransformerVARegressor(
  (backbone): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=Tr

In [56]:
# task config
subtask = "subtask_1"  # don't change
task = "task1"  # don't change
langs = ["eng", "zho"]  # change to the languages you want to test
domains = ["restaurant", "laptop"]  # change to the domains you want to test
predict_raw = {}
train_raw = {}

for lang in langs:
    for domain in domains:
        train_url = f"https://raw.githubusercontent.com/DimABSA/DimABSA2026/refs/heads/main/task-dataset/track_a/{subtask}/{lang}/{lang}_{domain}_train_alltasks.jsonl"
        predict_url = f"https://raw.githubusercontent.com/DimABSA/DimABSA2026/refs/heads/main/task-dataset/track_a/{subtask}/{lang}/{lang}_{domain}_dev_{task}.jsonl"

        train_raw[lang+"_"+domain] = load_jsonl_url(train_url)
        predict_raw[lang+"_"+domain] = load_jsonl_url(predict_url)

        for r in train_raw[lang+"_"+domain]:
            r["language"] = lang
            r["domain"] = domain

        for r in predict_raw[lang+"_"+domain]:
            r["language"] = lang
            r["domain"] = domain

In [57]:
#==== step 1 load the data ====
# you can change the env for your task.
# train data should have the VA labels, predict data has no VA labels


predict_df={}
train_df={}
dev_df={}
for lang in langs:
    for domain in domains:
        train_df[lang+"_"+domain] = jsonl_to_df(train_raw[lang+"_"+domain])
        predict_df[lang+"_"+domain] = jsonl_to_df(predict_raw[lang+"_"+domain])
        train_df[lang+"_"+domain], dev_df[lang+"_"+domain] = train_test_split(train_df[lang+"_"+domain], test_size=0.1, random_state=42)



In [58]:
for lang in langs:
    for domain in domains:
        display(Markdown(f"### {subtask}_{lang}_{domain} train_df"))
        display(train_df[lang+"_"+domain].head())

        display(Markdown(f"### {subtask}_{lang}_{domain} dev_df"))
        display(dev_df[lang+"_"+domain].head())

        display(Markdown(f"### {subtask}_{lang}_{domain} predict_df"))
        display(predict_df[lang+"_"+domain].head())

### subtask_1_eng_restaurant train_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
481,winelist,rest16_quad_test_127,seattle ' s best winelist,eng,restaurant,8.25,8.38
2520,veal,rest16_quad_train_814,"the restaurant has a family feel , not least w...",eng,restaurant,5.67,5.33
3067,bathroom,rest16_quad_train_1157,"service ok but unfriendly , filthy bathroom .",eng,restaurant,3.33,5.83
667,crab cakes,rest16_quad_test_222,best crab cakes in town,eng,restaurant,6.88,6.5
2462,,rest16_quad_train_781,they are not helpful in the least and will giv...,eng,restaurant,3.17,7.0


### subtask_1_eng_restaurant dev_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
172,,rest16_quad_dev_118,way below average,eng,restaurant,2.67,7.0
351,pizza place,rest16_quad_test_53,mama mia – i live in the neighborhood and feel...,eng,restaurant,7.88,8.0
3132,cocktail with citrus vodka and lemon and lime ...,rest16_quad_train_1191,the have a great cocktail with citrus vodka an...,eng,restaurant,7.88,8.12
1993,place,rest16_quad_train_499,not a great place for family or general dining .,eng,restaurant,3.0,6.75
366,food,rest16_quad_test_64,"the food is great , the bartenders go that ext...",eng,restaurant,7.67,7.5


### subtask_1_eng_restaurant predict_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
0,diner food,rest26_aspect_va_dev_1,Great diner food and breakfast is served all day,eng,restaurant,0,0
1,breakfast,rest26_aspect_va_dev_1,Great diner food and breakfast is served all day,eng,restaurant,0,0
2,food,rest26_aspect_va_dev_2,It got very crowded but we still received exce...,eng,restaurant,0,0
3,drinks,rest26_aspect_va_dev_2,It got very crowded but we still received exce...,eng,restaurant,0,0
4,service,rest26_aspect_va_dev_2,It got very crowded but we still received exce...,eng,restaurant,0,0


### subtask_1_eng_laptop train_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
251,computer,laptop_quad_dev_190,"if i had it to do over , i would not purchase ...",eng,laptop,3.1,6.3
4516,unit,laptop_quad_train_2141,after charging the unit for 2 hours i discover...,eng,laptop,4.75,5.25
335,,laptop_quad_dev_253,"freezes with red lines across it , froze five ...",eng,laptop,2.0,7.67
3286,device,laptop_quad_train_1230,a wonderful device with extremely clear display .,eng,laptop,8.0,7.83
753,screen,laptop_quad_test_236,the screen does look good .,eng,laptop,6.62,6.62


### subtask_1_eng_laptop dev_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
3628,,laptop_quad_train_1485,but it lost the coil whine roulette - - badly .,eng,laptop,3.12,6.12
3096,key board,laptop_quad_train_1095,the key board is one of the best i ' ve ever t...,eng,laptop,7.67,7.5
4814,sleep time,laptop_quad_train_2357,"- boot time , sleep time and wake time are cra...",eng,laptop,7.5,7.5
5443,track pad,laptop_quad_train_2729,please note that the track pad is way better t...,eng,laptop,7.12,7.0
197,retina screen,laptop_quad_dev_147,the retina screen is amazing .,eng,laptop,8.12,8.25


### subtask_1_eng_laptop predict_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
0,touchscreen,lap26_aspect_va_dev_1,The touchscreen works very well,eng,laptop,0,0
1,HP,lap26_aspect_va_dev_2,I am so disappointed in HP,eng,laptop,0,0
2,keyboard,lap26_aspect_va_dev_3,The keyboard is big enough to use for real typing,eng,laptop,0,0
3,screen size,lap26_aspect_va_dev_4,I like the screen size,eng,laptop,0,0
4,Lenovo,lap26_aspect_va_dev_5,Lenovo is my favorite brand of computer,eng,laptop,0,0


### subtask_1_zho_restaurant train_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
4481,店內氣氛,R1518:S001,店內氣氛寧靜閒適。,zho,restaurant,6.12,5.5
4736,肉,R2872:S003,他們家的肉似乎是比較大塊而且很紮實。,zho,restaurant,6.0,5.62
6180,鴨油拌青菜,R2614:S005,濃厚的鴨油拌青菜，怎麼可能不好吃。,zho,restaurant,6.0,5.75
1165,合點壽司,R0472:S049,合點壽司就很不錯風味濃郁又滑嫩。,zho,restaurant,6.25,6.5
575,位子,R0025:S005,因為客人很多，位子其實都蠻擁擠的。,zho,restaurant,4.25,4.62


### subtask_1_zho_restaurant dev_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
2514,菲力牛排的肉質,R0680:S011,菲力牛排的肉質比較瘦，但並不會柴。,zho,restaurant,4.83,4.83
3519,味道,R1360:S043,但這道的蝦有加入比較多薑，覺得味道有點太突兀。,zho,restaurant,3.88,4.62
1837,福勝亭服務品質,R0444:S045,福勝亭服務品質是好的。,zho,restaurant,6.0,5.25
1869,鮭魚握壽司,R0456:S032,鮭魚握壽司真的超讚。,zho,restaurant,6.83,6.83
943,炸鱈魚白子,R0093:S028,炸鱈魚白子與白子蒸蛋獲在場評價最高。,zho,restaurant,8.25,7.5


### subtask_1_zho_restaurant predict_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
0,黃悶魷魚,R6480:S001,經典黃悶魷魚好吃！酸湯的也很不錯，但對我來說有點太辣。,zho,restaurant,0,0
1,酸湯,R6480:S001,經典黃悶魷魚好吃！酸湯的也很不錯，但對我來說有點太辣。,zho,restaurant,0,0
2,賣像,R9340:S003,紅醬海鮮pizza $320這個pizza真的可以算是我看過賣像最差的pizza。,zho,restaurant,0,0
3,味道,R10003:S007,菜飯裡有青江菜和香腸，從白飯的顏色就可看出青菜的菜香和香腸的油脂都有煮進去，雖然味道較清淡，...,zho,restaurant,0,0
4,客家小炒,R10003:S007,菜飯裡有青江菜和香腸，從白飯的顏色就可看出青菜的菜香和香腸的油脂都有煮進去，雖然味道較清淡，...,zho,restaurant,0,0


### subtask_1_zho_laptop train_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
461,GeforceRTX4060效能,6758530:S013,GeforceRTX4060效能相當於GeforceRTX3070.,zho,laptop,5.0,5.0
1068,接口,6785647:S018,這台不錯，我四月底也在天貓入手對岸型號是戰66六代2023版，系統顯示HPProBook44...,zho,laptop,6.0,5.5
5393,每年萬代模具,7048589:S023,微星就是每年萬代模具一直用下去了無新意2024年還用FHD螢幕真爛,zho,laptop,3.38,5.0
5040,內建控制軟體,6990704:S009,acer的筆電耐用度還不錯，就是內建控制軟體不怎的！,zho,laptop,4.17,5.5
2026,ZENBOOK系列,6766932:S018,ZENBOOK系列是超輕薄價格也比較貴一點這樣看起來價格是還不錯奧但舊的型號買新的是比較好用比較久,zho,laptop,6.75,6.25


### subtask_1_zho_laptop dev_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
6037,金屬機身,6754615:S374,金屬機身，有質感指紋辨識很方便,zho,laptop,6.6,5.9
5080,電量,6922710:S009,"可考慮K6602,螢幕和電池都好上不少,電量更適合在外用上一天,螢幕沒有大下巴屏比更高,重量...",zho,laptop,6.38,5.62
2657,續航力,7021545:S006,續航力看起來還不錯！,zho,laptop,6.0,5.5
2501,顯卡4060,6818182:S006,會建議找筆電有經過認證，向技嘉的AERO系列會比較優一點，螢幕4K，顯卡4060處理器也是比較新。,zho,laptop,6.5,6.0
2482,GIGABYTEG5MF電池,6866078:S001,殘酷的事實是，買電競筆電就別太要求續航力了三萬左右顯卡4050電池最大的就ASUSTUFGA...,zho,laptop,4.25,4.62


### subtask_1_zho_laptop predict_df

Unnamed: 0,Aspect,ID,Text,language,domain,Valence,Arousal
0,ASUS筆電,6952865:S310,ASUS筆電總是不會讓人失望！AI的時代來了！,zho,laptop,0,0
1,技嘉NB,6949617:S106,技嘉NB之前用過一次覺得還不錯~CP值算高!,zho,laptop,0,0
2,CP值,6949617:S106,技嘉NB之前用過一次覺得還不錯~CP值算高!,zho,laptop,0,0
3,OLED筆電,6696296:S765,好奇OLED筆電但是真的貴,zho,laptop,0,0
4,筆電,6931799:S046,筆電跟MD都很完美，五分奉上~,zho,laptop,0,0


In [61]:
#==== Dataset ====
class VADataset(Dataset):
    '''
    A PyTorch Dataset for Valence–Arousal regression.

    - Prefixes each example with its upper-cased domain tag and combines aspect
      and text into a single input (e.g., "[LAPTOP] keyboard: The keyboard is good").
    - Tokenizes the input using a HuggingFace tokenizer.
    - Returns:
        * input_ids: token IDs, shape [max_len]
        * attention_mask: mask, shape [max_len]
        * labels: [Valence, Arousal], shape [2], float tensor

    Args:
        dataframe (pd.DataFrame): must contain "Text", "Aspect", "Valence",
            "Arousal", "language", and "domain".
        tokenizer: HuggingFace tokenizer.
        max_len (int): max sequence length.
    '''
    def __init__(self, dataframe, tokenizer, max_len=128):
        self.sentences = dataframe["Text"].tolist()
        self.aspects = dataframe["Aspect"].tolist()
        self.labels = dataframe[["Valence", "Arousal"]].values.astype(float)
        self.languages = dataframe["language"].tolist()
        self.domains = dataframe["domain"].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # Only the domain tag is prepended; the language is stored but
        # deliberately left out of the model input.
        domain = self.domains[idx].upper()
        text = f"[{domain}] {self.aspects[idx]}: {self.sentences[idx]}"

        encoded = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )

        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float)
        }


dev_dataset = {}
dev_loader = {}
for lang in langs:
    for domain in domains:
        key = f"{lang}_{domain}"
        dev_dataset[key] = VADataset(dev_df[key], tokenizer)
        dev_loader[key] = DataLoader(dev_dataset[key], batch_size=64, shuffle=False)
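
To make the input format concrete, the sketch below rebuilds the string that `__getitem__` passes to the tokenizer, without loading a model. `build_input_text` is a hypothetical helper written for this illustration and is not part of the notebook. It also shows what happens for rows whose `Aspect` cell is empty (as in some `eng_laptop` rows above), assuming pandas loaded that cell as `NaN`: the f-string renders it as the literal text `nan`.

```python
import math

def build_input_text(domain, aspect, sentence):
    """Hypothetical helper mirroring the f-string in VADataset.__getitem__."""
    return f"[{domain.upper()}] {aspect}: {sentence}"

# A regular row: the domain tag, then the aspect, then the sentence.
regular = build_input_text("laptop", "screen", "the screen does look good .")
print(regular)

# A row with a missing aspect, assuming pandas loaded the empty cell as NaN.
missing = build_input_text(
    "laptop", math.nan, "but it lost the coil whine roulette - - badly ."
)
print(missing)
```

If the literal `nan` is undesirable, one option is to replace missing aspects with a fixed placeholder before building the dataset, applied identically to the training, dev, and prediction frames.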


In [62]:
# Dev-set scores for every language/domain pair
for lang in langs:
    for domain in domains:
        key = f"{lang}_{domain}"
        pred_v, pred_a, gold_v, gold_a = get_prd(model, dev_loader[key], type="dev")
        eval_score = evaluate_predictions_task1(pred_a, pred_v, gold_a, gold_v)
        print(f"{key} dev_eval: {eval_score}")

Evaluating: 100%|██████████| 5/5 [00:02<00:00,  2.30it/s]


eng_restaurant dev_eval: {'PCC_V': np.float32(0.9167337), 'PCC_A': np.float32(0.7378213), 'RMSE_VA': np.float32(1.1724383)}


Evaluating: 100%|██████████| 8/8 [00:03<00:00,  2.35it/s]


eng_laptop dev_eval: {'PCC_V': np.float32(0.9157512), 'PCC_A': np.float32(0.751529), 'RMSE_VA': np.float32(1.32479)}


Evaluating: 100%|██████████| 12/12 [00:05<00:00,  2.32it/s]


zho_restaurant dev_eval: {'PCC_V': np.float32(0.9115498), 'PCC_A': np.float32(0.6899777), 'RMSE_VA': np.float32(1.0021284)}


Evaluating: 100%|██████████| 10/10 [00:04<00:00,  2.34it/s]

zho_laptop dev_eval: {'PCC_V': np.float32(0.8760761), 'PCC_A': np.float32(0.6418078), 'RMSE_VA': np.float32(1.0966659)}
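
`evaluate_predictions_task1` is defined earlier in the notebook. For readers who want to see how numbers of this kind behave on toy data, here is a standalone sketch. It assumes `PCC_V` and `PCC_A` are per-dimension Pearson correlations and that `RMSE_VA` is the root of the mean squared Euclidean error over the (valence, arousal) pair. That RMSE definition is an assumption, and the notebook's own function may differ in detail (for example in normalization). The helper names `pearson` and `rmse_va` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse_va(pred_v, pred_a, gold_v, gold_a):
    """Assumed definition: sqrt of the mean squared distance in the VA plane."""
    sq = [(pv - gv) ** 2 + (pa - ga) ** 2
          for pv, pa, gv, ga in zip(pred_v, pred_a, gold_v, gold_a)]
    return math.sqrt(sum(sq) / len(sq))

# Toy data: predictions are the gold scores shifted by a constant offset.
gold_v, gold_a = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
pred_v = [v + 3.0 for v in gold_v]
pred_a = [a + 4.0 for a in gold_a]

scores = {
    "PCC_V": pearson(pred_v, gold_v),
    "PCC_A": pearson(pred_a, gold_a),
    "RMSE_VA": rmse_va(pred_v, pred_a, gold_v, gold_a),
}
print(scores)
```

The toy case illustrates why both kinds of metric are reported: a constant offset leaves the correlations at 1 while the RMSE stays large.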





In [63]:
#==== Step 5: save and submit your prediction results ====
def extract_num(s):
    m = re.search(r"(\d+)$", str(s))
    return int(m.group(1)) if m else -1

def df_to_jsonl(df, out_path):
    df_sorted = df.sort_values(by="ID", key=lambda x: x.map(extract_num))
    grouped = df_sorted.groupby("ID", sort=False)

    with open(out_path, "w", encoding="utf-8") as f:
        for gid, gdf in grouped:
            record = {
                "ID": gid,
                "Aspect_VA": []
            }
            for _, row in gdf.iterrows():
                record["Aspect_VA"].append({
                    "Aspect": row["Aspect"],
                    "VA": f"{row['Valence']:.2f}#{row['Arousal']:.2f}"
                })
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Predict on each language/domain split and write one JSONL file per split
for lang in langs:
    for domain in domains:
        key = f"{lang}_{domain}"
        pred_dataset = VADataset(predict_df[key], tokenizer)
        # shuffle must stay False so predictions line up with the rows of predict_df
        pred_loader = DataLoader(pred_dataset, batch_size=64, shuffle=False)
        pred_v, pred_a = get_prd(model, pred_loader, type="pred")

        predict_df[key]["Valence"] = pred_v
        predict_df[key]["Arousal"] = pred_a

        df_to_jsonl(predict_df[key], f"pred_{lang}_{domain}.jsonl")

Predicting: 100%|██████████| 6/6 [00:02<00:00,  2.72it/s]
Predicting: 100%|██████████| 5/5 [00:01<00:00,  2.83it/s]
Predicting: 100%|██████████| 10/10 [00:03<00:00,  2.57it/s]
Predicting: 100%|██████████| 8/8 [00:03<00:00,  2.36it/s]
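
Each line of an output file is one JSON object per `ID`, with all aspects of that sentence collected under `Aspect_VA`. The sketch below reproduces that grouping with the standard library only, so the record shape can be inspected without running the model. `rows_to_records` is a hypothetical stand-in for `df_to_jsonl`, and the valence/arousal numbers are made up for illustration.

```python
import json
import re

def extract_num(s):
    """Trailing integer of an ID, or -1 if there is none."""
    m = re.search(r"(\d+)$", str(s))
    return int(m.group(1)) if m else -1

def rows_to_records(rows):
    """Hypothetical pandas-free mirror of df_to_jsonl: group rows by ID."""
    grouped = {}
    # sorted() is stable, so rows sharing an ID keep their original order.
    for row in sorted(rows, key=lambda r: extract_num(r["ID"])):
        grouped.setdefault(row["ID"], []).append({
            "Aspect": row["Aspect"],
            "VA": f"{row['Valence']:.2f}#{row['Arousal']:.2f}",
        })
    return [{"ID": gid, "Aspect_VA": items} for gid, items in grouped.items()]

# Made-up predictions for two sentences from the predict_df preview.
rows = [
    {"ID": "rest26_aspect_va_dev_2", "Aspect": "food", "Valence": 7.5, "Arousal": 7.0},
    {"ID": "rest26_aspect_va_dev_1", "Aspect": "diner food", "Valence": 7.25, "Arousal": 6.5},
    {"ID": "rest26_aspect_va_dev_1", "Aspect": "breakfast", "Valence": 6.0, "Arousal": 5.75},
]

records = rows_to_records(rows)
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

Passing `ensure_ascii=False` keeps non-ASCII aspects such as the Chinese ones readable in the file instead of escaping them.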


In [64]:
import os
import shutil
import zipfile
from google.colab import files

# Create the folder subtask if it does not exist
os.makedirs(subtask, exist_ok=True)

# Move each prediction file into the subtask folder
for lang in langs:
    for domain in domains:
        fname = f"pred_{lang}_{domain}.jsonl"
        if os.path.exists(fname):
            shutil.move(fname, os.path.join(subtask, fname))

# Create a zip file named "<subtask>.zip" containing the subtask folder
with zipfile.ZipFile(f"{subtask}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files_in_dir in os.walk(subtask):
        for file in files_in_dir:
            path = os.path.join(root, file)
            # Keep folder structure inside the zip
            zf.write(path, os.path.relpath(path, "."))

# Download the created zip file to local machine
files.download(f"{subtask}.zip")
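
Before uploading, it can help to confirm that the archive keeps the folder as its top-level entry. The sketch below repeats the zipping logic in a temporary directory and lists the archive contents. The folder name `subtask_1` and the file name are assumptions standing in for the notebook's `subtask` variable and the real prediction files, and `zip_folder` is a hypothetical helper.

```python
import os
import tempfile
import zipfile

def zip_folder(base_dir, folder):
    """Zip base_dir/folder into base_dir/folder.zip, keeping the folder prefix."""
    zip_path = os.path.join(base_dir, f"{folder}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files_in_dir in os.walk(os.path.join(base_dir, folder)):
            for file in files_in_dir:
                path = os.path.join(root, file)
                # Store paths relative to base_dir so entries start with the folder name.
                zf.write(path, os.path.relpath(path, base_dir))
    return zip_path

with tempfile.TemporaryDirectory() as tmp:
    folder = "subtask_1"  # assumed value of the notebook's `subtask` variable
    os.makedirs(os.path.join(tmp, folder))
    sample = os.path.join(tmp, folder, "pred_eng_laptop.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        f.write('{"ID": "example", "Aspect_VA": []}\n')

    # Read the entry names back while the temporary directory still exists.
    with zipfile.ZipFile(zip_folder(tmp, folder)) as zf:
        names = zf.namelist()

print(names)
```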
