
# Task 1
Given a parliamentary speech in one of several languages, identify the ideology of the speaker’s party. This is a binary classification task: predict whether the speaker’s party leans left (0) or right (1).


* Select only one country (excluding the UK) and use the parliamentary debates from that country to complete the assigned tasks.

* Fine-tune a **multilingual masked language model:**
  - multilingual BERT
  - XLMRoberta-base
  - language-specific models like Turkish BERT or German BERT.

* In addition, you are required to experiment with a multilingual causal language model:
  - Llama-3.1-8B


https://sidih.si/cdn/121/index.html

## Prepare the Data

Steps:
- Download
- Read with pandas
- Split


### Download and Read w/ Pandas

Download the archive from the link given in the homework PDF and unzip it.

In [1]:
!wget -O trainingset-ideology-power.zip "https://zenodo.org/records/10450641/files/trainingset-ideology-power.zip?download=1"

--2024-12-30 18:13:37--  https://zenodo.org/records/10450641/files/trainingset-ideology-power.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.185.48.194, 188.185.45.92, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 813899321 (776M) [application/octet-stream]
Saving to: ‘trainingset-ideology-power.zip’


2024-12-30 18:14:51 (10.6 MB/s) - ‘trainingset-ideology-power.zip’ saved [813899321/813899321]



In [2]:
!unzip trainingset-ideology-power.zip -d trainingset-ideology-power

Archive:  trainingset-ideology-power.zip
   creating: trainingset-ideology-power/orientation/
  inflating: trainingset-ideology-power/orientation/orientation-at-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-ba-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-be-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-bg-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-cz-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-dk-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-ee-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-es-ct-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-es-ga-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-es-train.tsv  
  inflating: trainingset-ideology-power/orientation/orientation-fi-train.tsv  
  inflating: trainingset-ideolo

Read with pandas and print the head

In [3]:
import pandas as pd
from tabulate import tabulate

data_df = pd.read_csv("./trainingset-ideology-power/orientation/orientation-tr-train.tsv", delimiter="\t")
print(data_df.head())

        id                           speaker sex  \
0  tr00000  ca2031caa4032c51980160359953d507   M   
1  tr00001  4cee0addb3c69f6866869b180f90d45f   M   
2  tr00002  b3d7f76d74ec268492f8190ca123a6b2   M   
3  tr00003  722efac7138c8197a9d1e97eed3a8b18   M   
4  tr00004  be82a4ade406ec6774a0a2e38f6957e3   M   

                                                text  \
0  Yeni yasama döneminin ülkemiz için, milletimiz...   
1  Sayın Başkan, değerli milletvekilleri; bugün, ...   
2  Sayın Başkanım, öncelikle yüce Meclisin Başkan...   
3  24’üncü Dönem Meclis Başkanlığına seçilmenizde...   
4  24’üncü Yasama Dönemimizin tüm milletvekilleri...   

                                             text_en  label  
0  Mr. President, dear lawmakers, I salute you, a...      1  
1  Mr. President, members of lawmakers, as I spea...      1  
2  Mr. President, I'm here to share with you the ...      1  
3  Mr. President, under the principles determined...      1  
4  Mr. President, dear lawmakers, I ask 

### Split

If you intend to participate in the shared task, you can evaluate your model on the test set.
Since the test dataset lacks labels, you must split your training data into **90% for training** and
**10% for testing**. Ensure the split is performed in a `stratified` manner to maintain the proportion
of labels in both subsets.

[Split guide](https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html) from the pdf.

[Stratify guide](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn) from the pdf.

In [7]:
import sklearn.model_selection

train_df, test_df = sklearn.model_selection.train_test_split(
    data_df, test_size=0.1, stratify=data_df["label"],
    random_state=42,  # arbitrary fixed seed so the split is reproducible across runs
)

print("Train data proportion of label: \n",train_df['label'].value_counts(normalize=True) * 100)
print("Test data proportion of label: \n",test_df['label'].value_counts(normalize=True) * 100)

Train data proportion of label: 
 label
1    58.18645
0    41.81355
Name: proportion, dtype: float64
Test data proportion of label: 
 label
1    58.178439
0    41.821561
Name: proportion, dtype: float64
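
To make the effect of `stratify` concrete, the following is a minimal pure-Python sketch of a stratified split: indices are grouped by label and each group is cut at the same ratio. It is an illustration only, not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) with roughly the same label mix in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each label group
        n_test = round(len(indices) * test_size)  # same ratio for every group
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 60/40 imbalance (made-up data, not the real corpus).
toy_labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(toy_labels, test_size=0.1)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```

Because every label group is cut at the same ratio, the share of label 1 in the test indices matches the share in the full list, which is the property the proportions printed above confirm for the real split.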


## Masked LM

Fine-tune the selected masked language model for each task: for one task use `text_en`, and
for the other task use `text` (original language).

In [None]:
!pip install transformers datasets evaluate accelerate



### Tokenize the data

In [None]:
from transformers import XLMRobertaTokenizer

roberta_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
def tokenize_it(text):
    return roberta_tokenizer(text, padding="max_length", truncation=True)

train_tokenized = tokenize_it(train_df["text"].tolist())
test_tokenized = tokenize_it(test_df["text"].tolist())

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

def data_loader(data,tokenized):
  # Convert tokenized data to tensors
  input_ids = torch.tensor(tokenized["input_ids"])
  attention_mask = torch.tensor(tokenized["attention_mask"])
  labels = torch.tensor(data["label"].tolist(), dtype=torch.long)

  # Create TensorDataset
  dataset = TensorDataset(input_ids, attention_mask, labels)

  # Create DataLoader
  data_loader_result = DataLoader(dataset, batch_size=32, shuffle=True)
  return data_loader_result

train_dataloader = data_loader(train_df,train_tokenized)
test_dataloader = data_loader(test_df,test_tokenized)

KeyError: 'text'

In [None]:
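# AdamW is imported from torch.optim; the copy that shipped with transformers is deprecated
from torch.optim import AdamW
from transformers import XLMRobertaForSequenceClassification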

model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

In [None]:
import torch
from accelerate.test_utils.testing import get_backend

device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
model.to(device)

XLMRobertaForSequenceClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        input_ids, attention_mask, labels = batch
        batch = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/1362 [00:00<?, ?it/s]
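
The total of 1362 in the progress bar is `num_epochs * len(train_dataloader)`, that is, 3 epochs of 454 batches each. With `name="linear"` and `num_warmup_steps=0`, the learning rate falls in a straight line from 5e-5 at step 0 to 0 at the final step. The function below is a minimal sketch of that formula under those settings; `linear_lr` is an illustrative name, not a library function.

```python
def linear_lr(step, base_lr=5e-5, total_steps=1362, warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr (unused here, warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batches_per_epoch = 1362 // 3  # 454, taken from the progress bar total
for step in (0, batches_per_epoch, 2 * batches_per_epoch, 1362):
    print(step, linear_lr(step))
```

The rate at the start of each epoch is therefore the full value, then two thirds, then one third of the base learning rate.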

In [None]:
import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in test_dataloader:
    input_ids, attention_mask, labels = batch
    batch = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
            }
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8655514250309789}

## Causal LM

For each task, perform inference using the selected causal language model twice: once using
`text_en` and once using `text` (original language).
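
One possible reading of "perform inference" is zero-shot prompting: the speech is wrapped in an instruction and the model's generated answer is mapped back to a label. The sketch below shows only the prompt construction and answer parsing, without loading a model. The prompt wording, the `max_chars` cut-off, and the names `build_prompt` and `parse_answer` are all assumptions made for illustration.

```python
import re

LABELS = {"left": 0, "right": 1}  # same encoding as in the task description

def build_prompt(speech, max_chars=2000):
    """Wrap a (possibly truncated) speech in a left/right classification prompt."""
    excerpt = speech[:max_chars]
    return (
        "Below is an excerpt from a parliamentary speech.\n"
        f"Speech: {excerpt}\n"
        "Does the speaker's party lean left or right? Answer with one word.\n"
        "Answer:"
    )

def parse_answer(generated):
    """Map the model's free-text continuation to 0/1, or None if it is unclear."""
    match = re.search(r"\b(left|right)\b", generated.lower())
    return LABELS[match.group(1)] if match else None

print(build_prompt("Sayın Başkan, değerli milletvekilleri; ..."))
print(parse_answer(" Right."), parse_answer("I cannot tell."))
```

The same two functions would work for either column: pass `text` for the original language or `text_en` for the English translation. Answers that parse to `None` need an explicit policy (for example, counting them as errors) before accuracy is computed.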

In [26]:
!pip install transformers datasets torch

ERROR: Operation cancelled by user

In [47]:
from transformers import AutoTokenizer

model_name = "distilgpt2"  # Replace with your chosen causal model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token
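
GPT-2 style tokenizers ship without a padding token, so the end-of-text token is reused here. Sequence-classification heads on such models read the hidden state at the last non-padding position, which is why the model config also needs a `pad_token_id`. The sketch below illustrates that lookup for a right-padded sequence in plain Python; it is not the library's code, and `last_token_index` and the example ids are illustrative.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last non-padding token in a right-padded sequence."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    return 0  # the sequence consists of padding only

PAD = 50256  # GPT-2's end-of-text id, reused as the padding id
example_ids = [15496, 995, PAD, PAD]  # two real tokens followed by padding
print(last_token_index(example_ids, PAD))
```

One consequence of reusing the end-of-text token is that a genuine end-of-text token at the end of a sequence cannot be told apart from padding.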


In [48]:
from transformers import AutoModelForSequenceClassification

num_labels = 2  # Number of unique labels in your dataset

def load_classifier():
    """Load a fresh copy of the causal model with a 2-way classification head."""
    clf = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    clf.config.pad_token_id = tokenizer.pad_token_id
    return clf

# Two copies to fine-tune (English / original text) and two left untrained
model_train_en = load_classifier()
model_en = load_classifier()
model_train_orig = load_classifier()
model_orig = load_classifier()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to b

In [49]:
import torch
from tqdm.auto import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_train_en.to(device)
model_en.to(device)
model_train_orig.to(device)
model_orig.to(device)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

In [52]:
def preprocess_text(data):
    return tokenizer(data["text"], truncation=True, padding="max_length", max_length=128)
def preprocess_text_en(data):
    return tokenizer(data["text_en"], truncation=True, padding="max_length", max_length=128)


tokenized_train_en = train_df.apply(preprocess_text_en, axis=1).values  # Get the values as a list of dictionaries
tokenized_test_en = test_df.apply(preprocess_text_en, axis=1).values

tokenized_train_orig = train_df.apply(preprocess_text, axis=1).values  # Get the values as a list of dictionaries
tokenized_test_orig = test_df.apply(preprocess_text, axis=1).values

In [59]:
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
import torch


# Define a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# train
train_dataset_en = CustomDataset({k: [d[k] for d in tokenized_train_en] for k in tokenized_train_en[0]}, train_df["label"].tolist())
train_dataloader_en = DataLoader(train_dataset_en, batch_size=16)

# test
test_dataset_en = CustomDataset({k: [d[k] for d in tokenized_test_en] for k in tokenized_test_en[0]}, test_df["label"].tolist())
test_dataloader_en = DataLoader(test_dataset_en, batch_size=16)


# train
train_dataset_orig = CustomDataset({k: [d[k] for d in tokenized_train_orig] for k in tokenized_train_orig[0]}, train_df["label"].tolist())
train_dataloader_orig = DataLoader(train_dataset_orig, batch_size=16)

# test
test_dataset_orig = CustomDataset({k: [d[k] for d in tokenized_test_orig] for k in tokenized_test_orig[0]}, test_df["label"].tolist())
test_dataloader_orig = DataLoader(test_dataset_orig, batch_size=16)


### First train then test



In [60]:
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import torch

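# Define optimizer and scheduler
# The optimizer must own the parameters of the model trained in this cell;
# `model` is the XLM-R classifier from the Masked LM section.
optimizer = AdamW(model_train_en.parameters(), lr=5e-5)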

# Total number of training steps
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader_en)

# Learning rate scheduler
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# No separate loss function is needed: the model returns the cross-entropy loss when `labels` are passed

# Training loop
for epoch in range(num_epochs):
    model_train_en.train()
    loop = tqdm(train_dataloader_en, desc=f"Epoch {epoch}")

    for batch in loop:
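        # Move batch to device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass (the attention mask marks which positions are padding)
        outputs = model_train_en(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss  # GPT-2 provides loss when `labels` are passed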

        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update learning rate scheduler
        lr_scheduler.step()

        # Update progress bar
        loop.set_postfix(loss=loss.item())


Epoch 0:   0%|          | 0/908 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/908 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/908 [00:00<?, ?it/s]

In [61]:
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import torch

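# Define optimizer and scheduler
# The optimizer must own the parameters of the model trained in this cell;
# `model` is the XLM-R classifier from the Masked LM section.
optimizer = AdamW(model_train_orig.parameters(), lr=5e-5)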

# Total number of training steps
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader_orig)

# Learning rate scheduler
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# No separate loss function is needed: the model returns the cross-entropy loss when `labels` are passed

# Training loop
for epoch in range(num_epochs):
    model_train_orig.train()
    loop = tqdm(train_dataloader_orig, desc=f"Epoch {epoch}")

    for batch in loop:
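        # Move batch to device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass (the attention mask marks which positions are padding)
        outputs = model_train_orig(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss  # GPT-2 provides loss when `labels` are passed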

        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update learning rate scheduler
        lr_scheduler.step()

        # Update progress bar
        loop.set_postfix(loss=loss.item())


Epoch 0:   0%|          | 0/908 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/908 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/908 [00:00<?, ?it/s]

* Original Language

In [62]:
from datasets import Dataset
from sklearn.metrics import accuracy_score
from transformers import pipeline

# Convert the test DataFrame to a Hugging Face Dataset
test_dataset = Dataset.from_pandas(test_df)

# Prepare the classifier pipeline on the device selected earlier (GPU if available)
classifier = pipeline("text-classification", model=model_train_orig, tokenizer=tokenizer, device=device)

# Define a function for prediction
def predict(batch):
    results = classifier(batch["text"], truncation=True, max_length=128)
    predicted_labels = [int(result["label"].split("_")[-1]) for result in results]
    return {"predicted_label": predicted_labels}

# Apply the pipeline to the dataset with batching
predictions = test_dataset.map(predict, batched=True, batch_size=32)

# Compute accuracy
true_labels = test_dataset["label"]
predicted_labels = predictions["predicted_label"]

accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")


Device set to use cuda:0


Map:   0%|          | 0/1614 [00:00<?, ? examples/s]

Accuracy: 58.12%
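
For context, the test split printed in the Split section is 58.18% label 1, so always predicting 1 would already score about 58.18%. The sketch below computes that majority-class baseline; `majority_baseline` and the toy labels are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels, made up to roughly mirror the 58/42 split.
toy_test_labels = [1] * 58 + [0] * 42
print(f"{majority_baseline(toy_test_labels):.2%}")
```

In the notebook this could be called as `majority_baseline(test_df["label"].tolist())`; an accuracy close to that value suggests the model has learned little beyond the label prior.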


* English

In [63]:
from datasets import Dataset
from sklearn.metrics import accuracy_score
from transformers import pipeline

# Convert the test DataFrame to a Hugging Face Dataset
test_dataset = Dataset.from_pandas(test_df)

# Prepare the classifier pipeline on the device selected earlier (GPU if available)
classifier = pipeline("text-classification", model=model_train_en, tokenizer=tokenizer, device=device)

# Define a function for prediction
def predict(batch):
    results = classifier(batch["text_en"], truncation=True, max_length=128)
    predicted_labels = [int(result["label"].split("_")[-1]) for result in results]
    return {"predicted_label": predicted_labels}

# Apply the pipeline to the dataset with batching
predictions = test_dataset.map(predict, batched=True, batch_size=32)

# Compute accuracy
true_labels = test_dataset["label"]
predicted_labels = predictions["predicted_label"]

accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")


Device set to use cuda:0


Map:   0%|          | 0/1614 [00:00<?, ? examples/s]

Accuracy: 44.24%


### Test without training

Here the models are used exactly as loaded, so the classification head (`score.weight`) is still randomly initialized and the accuracies below serve only as an untrained baseline.

* With original language

In [64]:
from datasets import Dataset
from sklearn.metrics import accuracy_score
from transformers import pipeline

# Convert the test DataFrame to a Hugging Face Dataset
test_dataset = Dataset.from_pandas(test_df)

# Prepare the classifier pipeline on the device selected earlier (GPU if available)
classifier = pipeline("text-classification", model=model_orig, tokenizer=tokenizer, device=device)

# Define a function for prediction
def predict(batch):
    results = classifier(batch["text"], truncation=True, max_length=128)
    predicted_labels = [int(result["label"].split("_")[-1]) for result in results]
    return {"predicted_label": predicted_labels}

# Apply the pipeline to the dataset with batching
predictions = test_dataset.map(predict, batched=True, batch_size=32)

# Compute accuracy
true_labels = test_dataset["label"]
predicted_labels = predictions["predicted_label"]

accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")


Device set to use cuda:0


Map:   0%|          | 0/1614 [00:00<?, ? examples/s]

Accuracy: 57.43%


* With English

In [65]:
from datasets import Dataset
from sklearn.metrics import accuracy_score
from transformers import pipeline

# Convert the test DataFrame to a Hugging Face Dataset
test_dataset = Dataset.from_pandas(test_df)

# Prepare the classifier pipeline on the device selected earlier (GPU if available)
classifier = pipeline("text-classification", model=model_en, tokenizer=tokenizer, device=device)

# Define a function for prediction
def predict(batch):
    results = classifier(batch["text_en"], truncation=True, max_length=128)
    predicted_labels = [int(result["label"].split("_")[-1]) for result in results]
    return {"predicted_label": predicted_labels}

# Apply the pipeline to the dataset with batching
predictions = test_dataset.map(predict, batched=True, batch_size=32)

# Compute accuracy
true_labels = test_dataset["label"]
predicted_labels = predictions["predicted_label"]

accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")


Device set to use cuda:0


Map:   0%|          | 0/1614 [00:00<?, ? examples/s]

Accuracy: 54.34%
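
Accuracy alone can hide a model that favours one class, which matters on a 58/42 label split. An earlier cell imports `precision_recall_fscore_support` without calling it; the sketch below shows, in plain Python, the arithmetic behind per-class precision, recall, and F1. `binary_prf` and the toy predictions are illustrative assumptions, not results from this notebook.

```python
def binary_prf(true_labels, predicted_labels, positive=1):
    """Precision, recall and F1 for the class treated as `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: a classifier that always predicts the majority class.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1]
for positive in (1, 0):
    print(positive, binary_prf(y_true, y_pred, positive))
```

In the toy example the accuracy is 60%, yet recall for class 0 is zero, which is the kind of failure that per-class metrics expose and a single accuracy figure does not.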
