<a href="https://colab.research.google.com/github/efandresena/SemEval/blob/main/starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT baseline for POLAR

In [None]:
from huggingface_hub import login
login()

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Introduction

This starter notebook walks through all three subtasks: polarization detection, polarization type classification, and manifestation identification.

## Subtask 1 - Polarization detection

This is a binary classification task: decide whether a post contains polarized content (Polarized or Not Polarized).

In [4]:
## Imports

In [3]:
import pandas as pd

from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

import torch

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from torch.utils.data import Dataset

In [2]:
import os
workdir = "/content/drive/MyDrive/NLP/SemEval"

In [5]:
import wandb

# Disable wandb logging for this script
wandb.init(mode="disabled")


## Data Import

The training data consists of short texts with binary labels.

The data is structured as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- polarization: 1 if the text is polarized, 0 if it is not

All three subtask folders contain the same texts; each folder's file carries only the labels for its own subtask.

In [8]:
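# Load the training and validation data for subtask 1
# NOTE: `val` is read from the same file as `train`, so the validation scores
# reported below are measured on training data and are optimistic.

train = pd.read_csv(os.path.join(workdir, 'dev_phase/subtask1/train/eng.csv'))
val = pd.read_csv(os.path.join(workdir, 'dev_phase/subtask1/train/eng.csv'))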

train.head()

Unnamed: 0,id,text,polarization
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0
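
Both `train` and `val` above are read from the same file, so scores on `val` measure fit to the training data. One way to get a held-out estimate is to split the labeled rows before training. The sketch below builds a stratified split with NumPy only; `stratified_split`, the 20% fraction, and the seed are illustrative assumptions and not part of the task's starter kit (scikit-learn's `train_test_split` with `stratify=` is a common alternative).

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly the same class ratio in both parts."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # row positions of this class
        rng.shuffle(idx)
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.sort(train_idx), np.sort(val_idx)

# Toy labels standing in for train['polarization'] (hypothetical data)
toy_labels = [0] * 70 + [1] * 30
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
```

With a DataFrame, the two index arrays could then be applied as `train.iloc[train_idx]` and `train.iloc[val_idx]`.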


## Dataset
- Create a PyTorch dataset class for handling the data
- Wrap the raw texts and labels in a format that Hugging Face's Trainer can use for training and evaluation

In [75]:
# Dataset that tokenizes one example at a time; padding is left to the data collator
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels  # may be None at inference time
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Drop the batch dimension added by return_tensors='pt'
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        # Add labels only if provided
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

Now, we'll tokenize the text data and create the datasets using `bert-base-uncased` as the tokenizer.

In [33]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Create datasets
train_dataset = PolarizationDataset(train['text'].tolist(), train['polarization'].tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val['polarization'].tolist(), tokenizer)
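
The dataset tokenizes with `padding=False`, so examples have different lengths, and `DataCollatorWithPadding` (passed to the `Trainer` later) pads each batch to its longest member. The sketch below mimics that idea in plain Python on made-up token ids; `pad_batch` and the ids are illustrative, and a pad id of 0 is assumed because that is `[PAD]` in the `bert-base-uncased` vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up ids: 101 and 102 are [CLS] and [SEP] in bert-base-uncased
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2460, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed `max_length` keeps short batches small, which usually saves compute on short social-media texts.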

Next, we'll load the pre-trained `bert-base-uncased` model for sequence classification. Since this is a binary classification task (Polarized/Not Polarized), we set `num_labels=2`.

In [34]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, we'll define the training arguments and the evaluation metric. We'll use macro F1 score for evaluation.

In [35]:
# Define metrics function
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

# Define training arguments
training_args = TrainingArguments(
        output_dir="./",
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=8,
        eval_strategy="epoch",
        save_strategy="no",
        logging_steps=100,
        disable_tqdm=False
    )
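
Macro F1 averages the per-class F1 scores with equal weight, so the minority class counts as much as the majority class. A small NumPy sketch of the computation on made-up labels follows; the helper name `macro_f1` is illustrative, and the notebook itself relies on `sklearn.metrics.f1_score`.

```python
import numpy as np

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_zero = [0] * 10  # a classifier that never predicts "polarized"

accuracy = float(np.mean(np.array(y_true) == np.array(always_zero)))
print(accuracy, macro_f1(y_true, always_zero))
```

On this toy input the accuracy is 0.8 while macro F1 is about 0.44, which is why macro F1 is the more informative score when classes are imbalanced.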


Finally, we'll initialize the `Trainer` and start training.

In [36]:
# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    data_collator=DataCollatorWithPadding(tokenizer) # Data collator for dynamic padding
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set: {eval_results['eval_f1_macro']}")

Epoch,Training Loss,Validation Loss,F1 Macro
1,No log,0.415057,0.788378
2,0.478700,0.320017,0.854775
3,0.478700,0.285305,0.876575


Macro F1 score on validation set: 0.8765745253232705
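
The `No log` entry for epoch 1 and the repeated training loss in epochs 2 and 3 are consistent with `logging_steps=100`: the training loss is recorded only every 100 optimizer steps, so if an epoch is shorter than that, nothing has been logged by the first evaluation. The sketch below works through the arithmetic for a hypothetical training set of 3,500 rows on a single device with no gradient accumulation; the real dataset size is not stated in this notebook.

```python
import math

def logged_steps_by_epoch(n_samples, batch_size, epochs, logging_steps):
    """For each epoch, list the logging steps that have occurred by its end."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    result = {}
    for epoch in range(1, epochs + 1):
        last_step = epoch * steps_per_epoch
        result[epoch] = list(range(logging_steps, last_step + 1, logging_steps))
    return steps_per_epoch, result

# n_samples=3500 is a made-up size; the other values mirror the TrainingArguments above
steps_per_epoch, logged = logged_steps_by_epoch(3500, 64, 3, 100)
print(steps_per_epoch)  # 55
print(logged)           # {1: [], 2: [100], 3: [100]}
```

Lowering `logging_steps` below the number of steps per epoch would give a training loss for every epoch.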


### Save the model to Hugging Face

In [37]:
model_folder = os.path.join(workdir, 'starter_model')

In [38]:
# Save model and tokenizer
model.save_pretrained(model_folder)
tokenizer.save_pretrained(model_folder)

('/content/drive/MyDrive/NLP/SemEval/starter_model/tokenizer_config.json',
 '/content/drive/MyDrive/NLP/SemEval/starter_model/special_tokens_map.json',
 '/content/drive/MyDrive/NLP/SemEval/starter_model/vocab.txt',
 '/content/drive/MyDrive/NLP/SemEval/starter_model/added_tokens.json',
 '/content/drive/MyDrive/NLP/SemEval/starter_model/tokenizer.json')

In [39]:
from huggingface_hub import HfApi, upload_folder

HF_MODEL_NAME = "mirindraf/bert-base-uncased-polarization"

api = HfApi()
api.create_repo(
    repo_id=HF_MODEL_NAME,
    repo_type="model",
    exist_ok=True  # don't fail if repo already exists
)

# Push the saved model folder to the Hub
upload_folder(
    folder_path=model_folder,
    repo_id=HF_MODEL_NAME,
    repo_type="model"
)

CommitInfo(commit_url='https://huggingface.co/mirindraf/bert-base-uncased-polarization/commit/af1ff3e23fcc8181d0c8480a1c3dbde8fe017306', commit_message='Upload folder using huggingface_hub', commit_description='', oid='af1ff3e23fcc8181d0c8480a1c3dbde8fe017306', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mirindraf/bert-base-uncased-polarization', endpoint='https://huggingface.co', repo_type='model', repo_id='mirindraf/bert-base-uncased-polarization'), pr_revision=None, pr_num=None)

### Inference

In [71]:
dev = pd.read_csv(os.path.join(workdir, 'dev_phase/subtask1/dev/eng.csv'))
dev['polarization']=0

#### Load the trained HF model and tokenizer

In [72]:


HF_MODEL_NAME = "mirindraf/bert-base-uncased-polarization"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)


In [76]:
dummy_labels = [0] * len(dev)

dev_dataset = PolarizationDataset(
    texts=dev['text'].tolist(),
    labels=dummy_labels,
    tokenizer=tokenizer
)
# Create a new Trainer (no training needed, just for inference)
# The data collator already holds the tokenizer, so the deprecated `tokenizer=` argument is omitted
trainer = Trainer(
    model=model,
    data_collator=DataCollatorWithPadding(tokenizer)
)

predictions = trainer.predict(dev_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)
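
`predictions.predictions` holds raw logits, and `argmax` picks the larger one per row. If a confidence score is also wanted, a softmax over each row turns the two logits into probabilities. A NumPy sketch on made-up logits follows; the values and the `softmax` helper are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for three posts: columns are [not polarized, polarized]
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-3.0, 3.0]])
probs = softmax(logits)
labels = probs.argmax(axis=1)  # same result as argmax on the logits
print(labels)
print(probs.max(axis=1).round(3))
```

Softmax preserves the ordering within each row, so the predicted labels do not change; only the scores become interpretable as probabilities.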


In [77]:
# Save results
result_starter = os.path.join(workdir, 'result_starter')
os.makedirs(result_starter, exist_ok=True)
dev["polarization"] = pred_labels
dev[["id", "polarization"]].to_csv(os.path.join(result_starter, "pred_eng.csv"), index=False)

print("Inference done. CSV saved as 'pred_eng.csv'")

Inference done. CSV saved as 'pred_eng.csv'


# Subtask 2: Polarization Type Classification
Multi-label classification to identify the target of polarization, where a post can carry any combination of the following categories: Gender/Sexual, Political, Religious, Racial/Ethnic, and Other.
For this task we will load the data for subtask 2.

In [49]:
train = pd.read_csv(os.path.join(workdir,'dev_phase/subtask2/train/eng.csv'))
# NOTE: `val` is read from the same file as `train`, so the validation scores below are optimistic
val = pd.read_csv(os.path.join(workdir,'dev_phase/subtask2/train/eng.csv'))
dev = pd.read_csv(os.path.join(workdir,'dev_phase/subtask2/dev/eng.csv'))
train.head()

Unnamed: 0,id,text,political,racial/ethnic,religious,gender/sexual,other
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0,0,0,0,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0,0,0,0,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0,0,0,0,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0,0,0,0,0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0,0,0,0,0


In [50]:
# Multi-label variant of the dataset: labels are float vectors, one entry per category
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Drop the batch dimension added by return_tensors='pt'
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        # Multi-label classification expects float targets (torch.float), not class indices (torch.long)
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item


In [51]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Create the train and validation datasets for multi-label classification
# (the dev dataset is built from `dev` with dummy labels further below)
train_dataset = PolarizationDataset(train['text'].tolist(), train[['gender/sexual','political','religious','racial/ethnic','other']].values.tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val[['gender/sexual','political','religious','racial/ethnic','other']].values.tolist(), tokenizer)


In [52]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5, problem_type="multi_label_classification") # 5 labels

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
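
With `problem_type="multi_label_classification"`, Transformers scores each of the five outputs independently and trains with binary cross-entropy on the logits (`BCEWithLogitsLoss`), which is why the labels are float vectors and not a single class index. A NumPy sketch of that loss on made-up values follows, using the mean reduction that PyTorch applies by default; the helper name is illustrative.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (example, label) pairs, computed from logits."""
    logits, targets = np.asarray(logits, float), np.asarray(targets, float)
    # Stable form of -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))]
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

# One post, five label logits (made up); the second and fourth labels are active
logits = [[-2.0, 3.0, -1.0, 0.5, -4.0]]
targets = [[0.0, 1.0, 0.0, 1.0, 0.0]]
print(round(bce_with_logits(logits, targets), 4))
```

Because each label has its own sigmoid, any number of labels (including none) can be active for one post, unlike the softmax used in Subtask 1.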


In [53]:
# Define metrics function for multi-label classification
def compute_metrics_multilabel(p):
    # Sigmoid the predictions to get probabilities
    probs = torch.sigmoid(torch.from_numpy(p.predictions))
    # Convert probabilities to predicted labels (0 or 1)
    preds = (probs > 0.5).int().numpy()
    # Compute macro F1 score
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

# Define training arguments
training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False
)
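
For multi-label targets, macro F1 is computed per label column and then averaged, and a label with no true positives scores 0 (scikit-learn's default `zero_division` handling also treats the undefined case as 0 and warns). While rare labels are still unlearned, this pulls the macro score down sharply, which may explain part of the low early-epoch scores. A NumPy sketch on made-up predictions, with an illustrative helper name:

```python
import numpy as np

def multilabel_macro_f1(y_true, y_pred):
    """Mean of per-column F1 scores; undefined columns count as 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return per_label, float(per_label.mean())

# Made-up targets and predictions for 4 posts and 3 labels
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]  # second and third labels never predicted
per_label, macro = multilabel_macro_f1(y_true, y_pred)
print(per_label, round(macro, 3))
```

Here the first label is predicted perfectly, yet the macro score is only one third because the two unpredicted labels each contribute 0.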

In [54]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_multilabel,  # Use the new metrics function
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set for Subtask 2: {eval_results['eval_f1_macro']}")

A ConfigError was raised whilst setting the number of model parameters in Weights & Biases config.


Epoch,Training Loss,Validation Loss,F1 Macro
1,0.2294,0.180748,0.181253
2,0.1738,0.136824,0.291544
3,0.1257,0.10845,0.425129


Macro F1 score on validation set for Subtask 2: 0.4251293418397548


In [55]:
starter_model2 = os.path.join(workdir, "starter_model2")
os.makedirs(starter_model2, exist_ok=True)
trainer.save_model(starter_model2)
tokenizer.save_pretrained(starter_model2)

('/content/drive/MyDrive/NLP/SemEval/starter_model2/tokenizer_config.json',
 '/content/drive/MyDrive/NLP/SemEval/starter_model2/special_tokens_map.json',
 '/content/drive/MyDrive/NLP/SemEval/starter_model2/vocab.txt',
 '/content/drive/MyDrive/NLP/SemEval/starter_model2/added_tokens.json',
 '/content/drive/MyDrive/NLP/SemEval/starter_model2/tokenizer.json')

In [56]:
dummy_labels = [[0,0,0,0,0]] * len(dev)  # 5 labels
dev_dataset = PolarizationDataset(
    texts=dev['text'].tolist(),
    labels=dummy_labels,
    tokenizer=tokenizer
)

In [65]:
result_starter2 = os.path.join(workdir, 'result_starter2')
os.makedirs(result_starter2, exist_ok=True)
output = os.path.join(result_starter2, "pred_eng.csv")

In [66]:
# Subtask 2: Multi-label prediction on dev set

dummy_labels = [[0]*5 for _ in range(len(dev))]
dev_dataset = PolarizationDataset(dev['text'].tolist(), dummy_labels, tokenizer, max_length=128)

# Make predictions (if you reload the model from disk, create a new Trainer first)
preds = trainer.predict(dev_dataset)

# Sigmoid to get probabilities
probs = torch.sigmoid(torch.tensor(preds.predictions)).numpy()

# Convert to binary labels using a fixed threshold (the validation metric above uses 0.5)
best_thresh = 0.45
binary_preds = (probs > best_thresh).astype(int)

# Create DataFrame
LABELS = ['gender/sexual','political','religious','racial/ethnic','other']
dev_result = pd.DataFrame(binary_preds, columns=LABELS)
dev_result.insert(0, 'id', dev['id'])



In [67]:

dev_result.to_csv(output, index=False)

print(f"\n✓ Subtask 2 predictions saved as '{output}'")
print(f"Distribution:\n{dev_result[LABELS].sum()}")



✓ Subtask 2 predictions saved as '/content/drive/MyDrive/NLP/SemEval/result_starter2/pred_eng.csv'
Distribution:
gender/sexual     0
political        46
religious         2
racial/ethnic    14
other             0
dtype: int64
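
The threshold of 0.45 used for the dev predictions is a fixed value, whereas `compute_metrics_multilabel` uses 0.5. One common alternative is to pick a threshold per label on held-out validation data by maximizing F1. The sketch below does a small grid search with NumPy; `best_thresholds`, the grid, and the toy probabilities are illustrative assumptions, and the search is only meaningful on data the model was not trained on.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one binary label column; 0 when undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_thresholds(probs, y_true, grid=np.arange(0.1, 0.91, 0.05)):
    """For each label column, return the first grid threshold with the highest F1."""
    probs, y_true = np.asarray(probs), np.asarray(y_true)
    chosen = []
    for j in range(probs.shape[1]):
        scores = [f1_binary(y_true[:, j], (probs[:, j] > t).astype(int)) for t in grid]
        chosen.append(float(grid[int(np.argmax(scores))]))
    return np.array(chosen)

# Toy validation probabilities and targets for 2 labels (made up)
val_probs = np.array([[0.9, 0.30], [0.6, 0.35], [0.42, 0.12], [0.22, 0.05]])
val_true = np.array([[1, 1], [1, 1], [0, 0], [0, 0]])
thresholds = best_thresholds(val_probs, val_true)
binary_preds = (val_probs > thresholds).astype(int)  # broadcasts one threshold per column
print(thresholds)
```

Labels whose probabilities stay low overall, like the second toy column, end up with a lower threshold than 0.5, which is one way to recover labels that a single fixed cut-off never predicts.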


# Subtask 3: Manifestation Identification
Multi-label classification to classify how polarization is expressed, with multiple possible labels including Vilification, Extreme Language, Stereotype, Invalidation, Lack of Empathy, and Dehumanization.



In [78]:
train = pd.read_csv(os.path.join(workdir,'dev_phase/subtask3/train/eng.csv'))
# NOTE: `val` is read from the same file as `train`, so the validation scores below are optimistic
val = pd.read_csv(os.path.join(workdir,'dev_phase/subtask3/train/eng.csv'))
dev = pd.read_csv(os.path.join(workdir,'dev_phase/subtask3/dev/eng.csv'))
train.head()

Unnamed: 0,id,text,stereotype,vilification,dehumanization,extreme_language,lack_of_empathy,invalidation
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0,0,0,0,0,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0,0,0,0,0,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0,0,0,0,0,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0,0,0,0,0,0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0,0,0,0,0,0


In [79]:
# Multi-label variant of the dataset: labels are float vectors, one entry per category
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Drop the batch dimension added by return_tensors='pt'
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        # Multi-label classification expects float targets (torch.float), not class indices (torch.long)
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item

In [80]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Create the train and validation datasets for multi-label classification
train_dataset = PolarizationDataset(train['text'].tolist(), train[['vilification','extreme_language','stereotype','invalidation','lack_of_empathy','dehumanization']].values.tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val[['vilification','extreme_language','stereotype','invalidation','lack_of_empathy','dehumanization']].values.tolist(), tokenizer)
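
The model's output columns follow the order of the column list used to build the training targets, so the same list has to name the columns at prediction time; any other order silently mislabels the predictions. A sketch of keeping the training order and reordering only at the end, on made-up predictions, follows. `TRAIN_LABELS`, `SUBMISSION_ORDER`, and `reorder_columns` are hypothetical names, and the submission order shown is simply the column order of the CSV header above, which may not be what the task requires.

```python
import numpy as np

# Order used to build the training targets (as in the cell above)
TRAIN_LABELS = ['vilification', 'extreme_language', 'stereotype',
                'invalidation', 'lack_of_empathy', 'dehumanization']
# Hypothetical output order, here the column order of the training CSV
SUBMISSION_ORDER = ['stereotype', 'vilification', 'dehumanization',
                    'extreme_language', 'lack_of_empathy', 'invalidation']

def reorder_columns(preds, source_order, target_order):
    """Reorder prediction columns from source_order to target_order by label name."""
    col = {name: i for i, name in enumerate(source_order)}
    return np.asarray(preds)[:, [col[name] for name in target_order]]

# One made-up prediction row: only 'vilification' (training column 0) is set
preds = np.array([[1, 0, 0, 0, 0, 0]])
reordered = reorder_columns(preds, TRAIN_LABELS, SUBMISSION_ORDER)
print(dict(zip(SUBMISSION_ORDER, reordered[0].tolist())))
```

Looking columns up by name instead of by position keeps the mapping correct even if one of the two lists is edited later.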

In [81]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6, problem_type="multi_label_classification") # use 6 labels

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [82]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False
)

# Define metrics function for multi-label classification
def compute_metrics_multilabel(p):
    # Sigmoid the predictions to get probabilities
    probs = torch.sigmoid(torch.from_numpy(p.predictions))
    # Convert probabilities to predicted labels (0 or 1)
    preds = (probs > 0.5).int().numpy()
    # Compute macro F1 score
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

In [83]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_multilabel,  # Use the new metrics function
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set for Subtask 3: {eval_results['eval_f1_macro']}")

A ConfigError was raised whilst setting the number of model parameters in Weights & Biases config.


Epoch,Training Loss,Validation Loss,F1 Macro
1,0.3918,0.357331,0.11103
2,0.3376,0.284456,0.294097
3,0.2872,0.254631,0.543195


Macro F1 score on validation set for Subtask 3: 0.5431954950373511


## Inference for Subtask 3

In [84]:
result_starter3 = os.path.join(workdir, 'result_starter3')
os.makedirs(result_starter3, exist_ok=True)
output = os.path.join(result_starter3, "pred_eng.csv")

In [85]:
# Must match the column order used to build the training targets above;
# the model's output columns follow that order
LABELS = ['vilification', 'extreme_language', 'stereotype', 'invalidation', 'lack_of_empathy', 'dehumanization']

dummy_labels = [[0]*len(LABELS) for _ in range(len(dev))]

dev_dataset = PolarizationDataset(
    texts=dev['text'].tolist(),
    labels=dummy_labels,
    tokenizer=tokenizer,
    max_length=128
)

# Predictions (if you reload the model from disk, create a new Trainer first)
preds = trainer.predict(dev_dataset)
probs = torch.sigmoid(torch.tensor(preds.predictions)).numpy()
binary_preds = (probs > 0.5).astype(int)

dev_result = pd.DataFrame(binary_preds, columns=LABELS)
dev_result.insert(0, 'id', dev['id'])


In [86]:
dev_result.to_csv(output, index=False)

print(f"\n✓ Subtask 3 predictions saved as '{output}'")
print(f"Distribution:\n{dev_result[LABELS].sum()}")



✓ Subtask 3 predictions saved as '/content/drive/MyDrive/NLP/SemEval/result_starter3/pred_eng.csv'
Distribution:
vilification        34
extreme_language    33
stereotype          14
invalidation        12
lack_of_empathy      0
dehumanization       6
dtype: int64
