# Table of contents

Use this table of contents to navigate to the respective sections.


- [0. Executive Summary](#0.-Executive-Summary)
- [1. Introduction](#1.-Introduction)
- [2. Data Import / Split Data](#2.-Data-Import-/-Split-Data)
- [3. Transformer](#3.-Transformer)
    - [3.1 Baseline Model](#3.1-Baseline-Model)
    - [3.2 LoRA Model](#3.2-LoRA-Model)
    - [3.3 Full Fine-tuned Model](#3.3-Full-Fine-tuned-Model)
- [4. Results](#4.-Results)

# 0. Executive Summary

This notebook develops and evaluates transformer-based classifiers for detecting affective polarization in Portuguese parliamentary speeches. A manually labelled dataset of 1,499 interventions is split into stratified training, validation, and test sets (70/15/15), with the training split balanced to a 50/50 ratio between polarized and non-polarized interventions. Three modelling stages are implemented: an unfine-tuned XLM-RoBERTa baseline, parameter-efficient LoRA adapters on top of several pretrained transformers, and full end-to-end fine-tuning of the same architectures. The baseline model performs poorly and essentially fails to distinguish classes, confirming the need for supervised training. LoRA fine-tuning substantially improves performance while updating less than 1% of parameters, with XLM-RoBERTa reaching around 0.68 macro-F1 on the test set. Full fine-tuning further increases accuracy and macro-F1, with the Portuguese BERT model achieving the best results (≈0.79 accuracy, ≈0.78 macro-F1). The notebook concludes that supervised transformers, especially fully fine-tuned Portuguese BERT, provide a strong foundation for measuring affective polarization in ParlaMint-PT.

# 1. Introduction

The goal of this notebook is to build a reproducible modelling pipeline for classifying parliamentary interventions as polarized or non-polarized using transformer language models. Starting from a manually annotated sample of Portuguese parliamentary speeches, the workflow first performs data cleaning and construction of stratified train–validation–test splits, balancing the training data to mitigate class imbalance while preserving the natural label distribution in the evaluation sets. On top of these splits, several transformer architectures are trained and compared under different fine-tuning regimes, ranging from a naive baseline without task-specific training to state-of-the-art supervised setups. Throughout, common evaluation metrics (accuracy, macro-F1 and detailed classification reports) are used to assess model quality and to identify the most suitable configuration for downstream analysis of affective polarization.

# 2. Data Import / Split Data

This section loads the manually labelled intervention dataset and prepares the train/validation/test splits for all later experiments. After reading the Excel/CSV file, it keeps the key metadata columns, drops rows without text or label, and casts the polarization label to a binary integer (0 = non-polarized, 1 = polarized). Using a fixed random seed, it then performs a stratified 70/15/15 split so that the overall class distribution is preserved in the validation and test sets, while the training set is additionally balanced to a 50/50 class ratio by downsampling the majority class. Finally, the three splits are saved to disk and a short summary of the class ratios in each split is printed for inspection.

In [3]:
# Balance Training Data / Real Distribution for Val & Test
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# config
INPUT_PATH = r"intervention_sample_for_manual_labeling_final.xlsx"  # or .csv
TEXT_COL   = "text"
LABEL_COL  = "intervention_polarized_label"  # 0/1
KEEP_COLS  = [
    "speech_id","intervention_id","party","speaker_name",
    "text","text_length","intervention_polarized_label","intervention_label"
]
OUT_DIR = "splits_balanced_train_70_15_15"
SEED = 42

os.makedirs(OUT_DIR, exist_ok=True)

# 1) Load
ext = os.path.splitext(INPUT_PATH)[1].lower()
if ext == ".xlsx":
    df = pd.read_excel(INPUT_PATH)
elif ext == ".csv":
    df = pd.read_csv(INPUT_PATH)
else:
    raise ValueError("Use .xlsx or .csv")

# 2) Select cols, clean
for c in KEEP_COLS:
    if c not in df.columns:
        df[c] = pd.NA
df = df[KEEP_COLS].dropna(subset=[TEXT_COL, LABEL_COL]).copy()

# ensure binary ints
df[LABEL_COL] = df[LABEL_COL].astype(int)

def show_counts(name, d):
    counts = d[LABEL_COL].value_counts().sort_index()
    n = len(d)
    pcts = (counts / n * 100).round(2)
    print(f"{name}: n={n} | 0={counts.get(0,0)} ({pcts.get(0,0.0)}%) | 1={counts.get(1,0)} ({pcts.get(1,0.0)}%)")

print("GLOBAL CLASS RATIO")
show_counts("ALL ", df)

# 3) Stratified split: 70% train_base, 15% val, 15% test
train_base, temp = train_test_split(
    df, test_size=0.30, stratify=df[LABEL_COL], random_state=SEED
)
val, test = train_test_split(
    temp, test_size=0.50, stratify=temp[LABEL_COL], random_state=SEED
)

# 4) Balance TRAIN to 50/50 by downsampling the majority
g0 = train_base[train_base[LABEL_COL] == 0]
g1 = train_base[train_base[LABEL_COL] == 1]
if len(g0) == 0 or len(g1) == 0:
    raise RuntimeError("Cannot balance train: one class has 0 samples.")

minority_n = min(len(g0), len(g1))
g0_bal = g0.sample(n=minority_n, random_state=SEED) if len(g0) > minority_n else g0
g1_bal = g1.sample(n=minority_n, random_state=SEED) if len(g1) > minority_n else g1
train = pd.concat([g0_bal, g1_bal], axis=0).sample(frac=1.0, random_state=SEED).reset_index(drop=True)

# 5) Save
train.to_csv(os.path.join(OUT_DIR, "train.csv"), index=False)
val.to_csv(os.path.join(OUT_DIR, "val.csv"), index=False)
test.to_csv(os.path.join(OUT_DIR, "test.csv"), index=False)

print("\nSPLIT SUMMARY")
show_counts("TRAIN (balanced)", train)
show_counts("VAL           ", val)   # keeps real ratio
show_counts("TEST          ", test)  # keeps real ratio
print(f"\n Files saved in: {os.path.abspath(OUT_DIR)}")


GLOBAL CLASS RATIO
ALL : n=1499 | 0=901 (60.11%) | 1=598 (39.89%)

SPLIT SUMMARY
TRAIN (balanced): n=836 | 0=418 (50.0%) | 1=418 (50.0%)
VAL           : n=225 | 0=135 (60.0%) | 1=90 (40.0%)
TEST          : n=225 | 0=135 (60.0%) | 1=90 (40.0%)

 Files saved in: C:\Users\Nomis\Desktop\splits_balanced_train_70_15_15


Overall, the labelled dataset contains 1,499 interventions, with a moderately imbalanced class ratio of about 60% non-polarized (0) and 40% polarized (1). After splitting, the training set was balanced by downsampling the majority class to 836 examples (418 per class), while the validation and test sets each contain 225 examples and preserve the original distribution (135 non-polarized, 90 polarized). All three splits are stored in the directory C:\Users\Nomis\Desktop\splits_balanced_train_70_15_15.
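The split sizes reported above can be reproduced with a few lines of arithmetic. The sketch below assumes that `train_test_split`, when given a fractional `test_size`, rounds the held-out side up and assigns the remainder to the other side. The helper name `split_sizes` is hypothetical and not part of the notebook; the per-class training count of 418 is taken from the printed split summary rather than derived.

```python
import math

def split_sizes(n_total, test_frac=0.30, holdout_frac=0.50):
    """Sizes produced by two chained splits with fractional test sizes."""
    # First split: the held-out 30% is rounded up, the rest becomes train_base.
    n_temp = math.ceil(test_frac * n_total)
    n_train = n_total - n_temp
    # Second split: half of the held-out part becomes test, the rest validation.
    n_test = math.ceil(holdout_frac * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1499)
print(n_train, n_val, n_test)  # 1049 225 225

# Balancing keeps the 418 polarized training examples (value from the log above)
# and an equal number of non-polarized ones.
minority_n = 418
n_balanced = 2 * minority_n          # 836 training examples after balancing
n_discarded = n_train - n_balanced   # 213 non-polarized examples are dropped
print(n_balanced, n_discarded)       # 836 213
```

Downsampling therefore removes 213 non-polarized training examples, roughly a fifth of the unbalanced training split.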

# 3. Transformer

This chapter presents the transformer-based models used for affective polarization classification. Section 3.1 introduces a simple baseline where XLM-RoBERTa is applied without any task-specific fine-tuning, providing a lower bound on performance. Section 3.2 then adds parameter-efficient LoRA adapters on top of three pretrained transformers, and Section 3.3 describes full end-to-end fine-tuning of the same architectures. Together, these experiments make it possible to compare different training strategies and quantify their impact on classification performance.

## 3.1 Baseline Model

As an initial baseline, an off-the-shelf XLM-RoBERTa model (xlm-roberta-base) is evaluated on the test split without any task-specific fine-tuning. The script loads the labeled test interventions, tokenizes the text up to 256 tokens, and feeds the resulting sequences through a randomly initialized binary classification head on top of the pretrained encoder. The predicted labels are then compared to the gold labels using accuracy, macro-F1, the confusion matrix, and a detailed classification report, providing a naive reference point against which subsequent fine-tuned models can be assessed.

In [1]:
import os
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

#CONFIG
OUT_DIR   = r"C:\Users\Nomis\Desktop\splits_balanced_train_70_15_15"
TEST_CSV  = os.path.join(OUT_DIR, "test.csv")

TEXT_COL  = "text"
LABEL_COL = "intervention_polarized_label"   # 0/1

MODEL_NAME = "xlm-roberta-base"
MAX_LEN    = 256

#LOAD TEST DATA
df_test = pd.read_csv(TEST_CSV)
print("Columns:", list(df_test.columns))
print("n_test :", len(df_test))

texts  = df_test[TEXT_COL].astype(str).tolist()
labels = df_test[LABEL_COL].astype(int).tolist()

#LOAD BASELINE XLM-R (NO FINE-TUNING)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

#TOKENIZE & RUN
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=MAX_LEN,
    return_tensors="pt"
)

input_ids      = encodings["input_ids"].to(device)
attention_mask = encodings["attention_mask"].to(device)

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

preds = torch.argmax(logits, dim=-1).cpu().numpy()

#METRICS
acc = accuracy_score(labels, preds)
f1  = f1_score(labels, preds, average="macro")
cm  = confusion_matrix(labels, preds)

print("\n=== XLM-R BASELINE (no fine-tuning) on TEST ===")
print(f"Accuracy : {acc:.4f}")
print(f"Macro F1 : {f1:.4f}")

print("\nConfusion matrix [rows=true, cols=pred]:")
print(cm)

print("\nClassification report:")
print(classification_report(labels, preds, digits=4))


Columns: ['speech_id', 'intervention_id', 'party', 'speaker_name', 'text', 'text_length', 'intervention_polarized_label', 'intervention_label']
n_test : 225


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



=== XLM-R BASELINE (no fine-tuning) on TEST ===
Accuracy : 0.4000
Macro F1 : 0.2857

Confusion matrix [rows=true, cols=pred]:
[[  0 135]
 [  0  90]]

Classification report:
              precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000       135
           1     0.4000    1.0000    0.5714        90

    accuracy                         0.4000       225
   macro avg     0.2000    0.5000    0.2857       225
weighted avg     0.1600    0.4000    0.2286       225





The unfine-tuned XLM-RoBERTa baseline performs poorly on the polarization task. On the 225 test interventions, it achieves an accuracy of 0.40 and a macro-F1 of 0.29, predicting exclusively the polarized class (1). This is reflected in the confusion matrix, where no non-polarized examples (0) are correctly identified, confirming that a randomly initialized classification head without task-specific training is not suitable for this problem.
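To see where these numbers come from, accuracy and macro-F1 can be recomputed directly from the confusion matrix. The following is a minimal sketch in plain Python; the helper names are hypothetical, and the second matrix is an illustrative majority-class predictor that was not run in the notebook.

```python
def accuracy_from_confusion(cm):
    """Accuracy for a square confusion matrix with rows=true, cols=predicted."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total

def macro_f1_from_confusion(cm):
    """Unweighted mean of per-class F1; a class with no TP, FP or FN scores 0."""
    f1_scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(len(cm))) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

all_polarized = [[0, 135], [0, 90]]      # confusion matrix printed above
all_non_polarized = [[135, 0], [90, 0]]  # hypothetical majority-class predictor

print(accuracy_from_confusion(all_polarized),
      round(macro_f1_from_confusion(all_polarized), 4))      # 0.4 0.2857
print(accuracy_from_confusion(all_non_polarized),
      round(macro_f1_from_confusion(all_non_polarized), 4))  # 0.6 0.375
```

A constant prediction of the majority class would already reach 0.60 accuracy and 0.375 macro-F1 on this test set, so those values are a more informative floor for the fine-tuned models than the 0.40 / 0.29 obtained here.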

## 3.2 LoRA Model

This script implements parameter-efficient fine-tuning with LoRA adapters for affective polarization classification of Portuguese parliamentary interventions. Starting from the manually labelled interventions, it performs a stratified 70/15/15 train–validation–test split and balances only the training set to a 50/50 class ratio, while preserving the natural label distribution in validation and test. The data is converted into Hugging Face Dataset objects, tokenized with a maximum sequence length of 256 tokens, and used to train three transformer architectures (neuralmind/bert-base-portuguese-cased, xlm-roberta-base, and bert-base-multilingual-cased) under a common configuration (5 epochs, learning rate 2e-5, weight decay 0.01, warmup ratio 0.06, batch sizes 16/32). For each model, LoRA adapters (rank 8, alpha 16, dropout 0.1) are attached to the attention query, key, and value projections, so that only the adapters and the classification head are trained while the pretrained encoder weights stay frozen. Training is carried out via the Trainer API, with accuracy and macro/weighted precision, recall, and F1 computed for both validation and test sets, complemented by a detailed classification report on the test set. For each model, the adapter weights and tokenizer are saved to disk together with a short reload snippet, and a consolidated summary table is written to allow direct comparison of model performance and training cost.

In [11]:
# LoRA Model — 70/15/15 split 
# Train: balanced (50/50), Val/Test: real distribution
import os, time, warnings, numpy as np, pandas as pd, torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, TrainingArguments, Trainer
)
from peft import LoraConfig, TaskType, get_peft_model

warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

# paths & columns
DATA_PATH  = r"C:\Users\Nomis\Downloads\intervention_sample_for_manual_labeling_new.xlsx"
TEXT_COL   = "text"
LABEL_COL  = "intervention_polarized_label"

# training config
SEED       = 42
EPOCHS     = 5
LR         = 2e-5
BSZ_TRAIN  = 16
BSZ_EVAL   = 32
MAX_LEN    = 256
WARMUP     = 0.06
WEIGHT_DEC = 0.01
OUT_DIR    = "./pt_sentiment_models_3split_lora"

# LoRA hyperparams
LORA_R         = 8
LORA_ALPHA     = 16
LORA_DROPOUT   = 0.1

MODEL_NAMES = [
    "neuralmind/bert-base-portuguese-cased",
    "xlm-roberta-base",
    "bert-base-multilingual-cased",
]

# load data
df = pd.read_excel(DATA_PATH)
df = df.dropna(subset=[TEXT_COL, LABEL_COL]).copy()
df[LABEL_COL] = df[LABEL_COL].astype(int)

print("Full (original) label counts:\n", df[LABEL_COL].value_counts())
print("Full (original) label %:\n", (df[LABEL_COL].value_counts(normalize=True) * 100).round(3))

# 70/15/15 (stratified)
train_df, temp_df = train_test_split(
    df, test_size=0.3, stratify=df[LABEL_COL], random_state=SEED
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df[LABEL_COL], random_state=SEED
)

print("\nSplit sizes:")
print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

# balance only TRAIN (50/50)
n0 = (train_df[LABEL_COL] == 0).sum()
n1 = (train_df[LABEL_COL] == 1).sum()
n_min = min(n0, n1)

train_balanced = pd.concat([
    train_df[train_df[LABEL_COL] == 0].sample(n_min, random_state=SEED),
    train_df[train_df[LABEL_COL] == 1].sample(n_min, random_state=SEED),
]).sample(frac=1, random_state=SEED).reset_index(drop=True)

print("\nTRAIN balanced counts:\n", train_balanced[LABEL_COL].value_counts())
print("VALID counts:\n", val_df[LABEL_COL].value_counts())
print("TEST counts:\n", test_df[LABEL_COL].value_counts())

# to HF datasets
train_ds = Dataset.from_pandas(train_balanced[[TEXT_COL, LABEL_COL]].reset_index(drop=True))
val_ds   = Dataset.from_pandas(val_df[[TEXT_COL, LABEL_COL]].reset_index(drop=True))
test_ds  = Dataset.from_pandas(test_df[[TEXT_COL, LABEL_COL]].reset_index(drop=True))

# device
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)

# helpers
def build_tokenizer(model_name):
    return AutoTokenizer.from_pretrained(model_name, use_fast=True)

def tokenize_fn(batch, tok):
    return tok(batch[TEXT_COL], truncation=True, padding=False, max_length=MAX_LEN)

def _rename_to_labels(ds):
    if LABEL_COL in ds.column_names:
        ds = ds.rename_column(LABEL_COL, "labels")
    return ds

def _set_torch_format(ds):
    cols = [c for c in ["input_ids","attention_mask","token_type_ids","labels"] if c in ds.column_names]
    ds.set_format("torch", columns=cols)
    return ds

def prepare_datasets(tok):
    tds = train_ds.map(lambda b: tokenize_fn(b, tok), batched=True)
    vds = val_ds.map(lambda b: tokenize_fn(b, tok), batched=True)
    tts = test_ds.map(lambda b: tokenize_fn(b, tok), batched=True)
    tds = _rename_to_labels(tds); vds = _rename_to_labels(vds); tts = _rename_to_labels(tts)
    tds = _set_torch_format(tds); vds = _set_torch_format(vds); tts = _set_torch_format(tts)
    return tds, vds, tts

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    pM, rM, f1M, _  = precision_recall_fscore_support(labels, preds, average="macro",    zero_division=0)
    pW, rW, f1W, _  = precision_recall_fscore_support(labels, preds, average="weighted", zero_division=0)
    return {
        "accuracy":acc, "precision_macro":pM, "recall_macro":rM, "f1_macro":f1M,
        "precision_weighted":pW, "recall_weighted":rW, "f1_weighted":f1W
    }

def make_args(out_dir):
    return TrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=BSZ_TRAIN,
        per_device_eval_batch_size=BSZ_EVAL,
        learning_rate=LR,
        num_train_epochs=EPOCHS,
        warmup_ratio=WARMUP,
        weight_decay=WEIGHT_DEC,
        logging_steps=50,
        seed=SEED,
        fp16=(device=="cuda"),
        dataloader_num_workers=0
    )

def target_modules_for(model):
    names = [n for n, _ in model.named_parameters()]
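    # Some architectures name their attention projections q_proj/k_proj/v_proj;
    # the BERT and XLM-R checkpoints used here expose query/key/value,
    # so the fallback at the end of this function applies to all three models.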
    if any(".q_proj." in n for n in names) or any(".k_proj." in n for n in names):
        return ["q_proj","k_proj","v_proj"]
    return ["query","key","value"]

def train_and_eval(model_name):
    print(f"\n===== {model_name} (LoRA) =====")
    t0 = time.time()

    tok = build_tokenizer(model_name)
    tds, vds, tts = prepare_datasets(tok)
    collator = DataCollatorWithPadding(tokenizer=tok)

    base = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2, problem_type="single_label_classification"
    )

    lcfg = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT,
        target_modules=target_modules_for(base)
    )
    model = get_peft_model(base, lcfg)

    # how many params train
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

    model_out_dir = os.path.join(OUT_DIR, model_name.replace("/", "_"))
    args = make_args(model_out_dir)
    trainer = Trainer(
        model=model, args=args, train_dataset=tds, eval_dataset=vds,
        tokenizer=tok, data_collator=collator, compute_metrics=compute_metrics
    )

    trainer.train()
    metrics_val  = trainer.evaluate(eval_dataset=vds)
    metrics_test = trainer.evaluate(eval_dataset=tts)

    # Detailed report on TEST
    pred_out = trainer.predict(tts)
    preds  = np.argmax(pred_out.predictions, axis=-1)
    labels = np.array(tts["labels"])
    report = classification_report(labels, preds, target_names=["0","1"], digits=4, zero_division=0)
    print("Test-set report:\n", report)

    # ---- save LoRA adapter + tokenizer ----
    # This saves only PEFT adapter weights (small) into model_out_dir
    trainer.save_model(model_out_dir)     # adapter_config + adapter_model
    tok.save_pretrained(model_out_dir)
    # also store a small README with reload snippet
    with open(os.path.join(model_out_dir, "LOAD_ADAPTER.txt"), "w", encoding="utf-8") as f:
        f.write(
            "Reload example:\n"
            "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n"
            "from peft import PeftModel\n\n"
            f'base = AutoModelForSequenceClassification.from_pretrained("{model_name}", num_labels=2)\n'
            f'tok  = AutoTokenizer.from_pretrained(r"{model_out_dir}")\n'
            f'model= PeftModel.from_pretrained(base, r"{model_out_dir}")\n'
            "model.eval()\n"
        )

    print(f"Saved LoRA adapter + tokenizer to: {model_out_dir}")
    print(f"Time: {time.time()-t0:.1f}s")

    return {
        "model": model_name,
        "mode": "lora",
        "epochs": EPOCHS,
        "val_accuracy": metrics_val.get("eval_accuracy", np.nan),
        "val_f1_macro": metrics_val.get("eval_f1_macro", np.nan),
        "test_accuracy": metrics_test.get("eval_accuracy", np.nan),
        "test_f1_macro": metrics_test.get("eval_f1_macro", np.nan),
        "params_trainable": int(trainable),
        "time_sec": round(time.time()-t0, 1),
        "saved_to": model_out_dir
    }, report

# run
os.makedirs(OUT_DIR, exist_ok=True)
rows, reports = [], {}

for m in MODEL_NAMES:
    try:
        row, rep = train_and_eval(m)
        rows.append(row); reports[m] = rep
    except Exception as e:
        print(f"[ERROR] {m}: {e}")
        rows.append({"model": m, "error": str(e)})

summary = pd.DataFrame(rows)
print("\n==== Summary (70/15/15 split — LoRA) ====")
print(summary.to_string(index=False))

summary.to_csv(os.path.join(OUT_DIR, "model_comparison_3split_lora.csv"), index=False)
with open(os.path.join(OUT_DIR, "per_class_reports_3split_lora.txt"), "w", encoding="utf-8") as f:
    for m, r in reports.items():
        f.write(f"===== {m} =====\n{r}\n\n")

print(f"\nSaved all outputs to: {OUT_DIR}")


Full (original) label counts:
 intervention_polarized_label
0    901
1    598
Name: count, dtype: int64
Full (original) label %:
 intervention_polarized_label
0    60.107
1    39.893
Name: proportion, dtype: float64

Split sizes:
Train: 1049, Val: 225, Test: 225

TRAIN balanced counts:
 intervention_polarized_label
1    418
0    418
Name: count, dtype: int64
VALID counts:
 intervention_polarized_label
0    135
1     90
Name: count, dtype: int64
TEST counts:
 intervention_polarized_label
0    135
1     90
Name: count, dtype: int64
Using device: cuda

===== neuralmind/bert-base-portuguese-cased (LoRA) =====



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 443,906 / 109,368,580 (0.41%)


| Step | Training Loss |
|------|---------------|
| 50   | 0.6952        |
| 100  | 0.6893        |
| 150  | 0.6788        |
| 200  | 0.6803        |
| 250  | 0.6703        |


Test-set report:
               precision    recall  f1-score   support

           0     0.8393    0.3481    0.4921       135
           1     0.4793    0.9000    0.6255        90

    accuracy                         0.5689       225
   macro avg     0.6593    0.6241    0.5588       225
weighted avg     0.6953    0.5689    0.5455       225

Saved LoRA adapter + tokenizer to: ./pt_sentiment_models_3split_lora\neuralmind_bert-base-portuguese-cased
Time: 151.5s

===== xlm-roberta-base (LoRA) =====



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,034,498 / 279,079,684 (0.37%)


| Step | Training Loss |
|------|---------------|
| 50   | 0.6882        |
| 100  | 0.6806        |
| 150  | 0.6621        |
| 200  | 0.6524        |
| 250  | 0.6599        |


Test-set report:
               precision    recall  f1-score   support

           0     0.8539    0.5630    0.6786       135
           1     0.5662    0.8556    0.6814        90

    accuracy                         0.6800       225
   macro avg     0.7101    0.7093    0.6800       225
weighted avg     0.7388    0.6800    0.6797       225

Saved LoRA adapter + tokenizer to: ./pt_sentiment_models_3split_lora\xlm-roberta-base
Time: 154.7s

===== bert-base-multilingual-cased (LoRA) =====



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 443,906 / 178,298,884 (0.25%)


| Step | Training Loss |
|------|---------------|
| 50   | 0.6895        |
| 100  | 0.6884        |
| 150  | 0.6857        |
| 200  | 0.6862        |
| 250  | 0.6794        |


Test-set report:
               precision    recall  f1-score   support

           0     0.7765    0.4889    0.6000       135
           1     0.5071    0.7889    0.6174        90

    accuracy                         0.6089       225
   macro avg     0.6418    0.6389    0.6087       225
weighted avg     0.6687    0.6089    0.6070       225

Saved LoRA adapter + tokenizer to: ./pt_sentiment_models_3split_lora\bert-base-multilingual-cased
Time: 154.3s

==== Summary (70/15/15 split — LoRA) ====
                                model mode  epochs  val_accuracy  val_f1_macro  test_accuracy  test_f1_macro  params_trainable  time_sec                                                                saved_to
neuralmind/bert-base-portuguese-cased lora       5      0.622222      0.618576       0.568889       0.558815            443906     151.5 ./pt_sentiment_models_3split_lora\neuralmind_bert-base-portuguese-cased
                     xlm-roberta-base lora       5      0.662222      0.662162     

The full labelled dataset comprises 1,499 interventions, of which 60% are non-polarized and 40% polarized. It is split into 1,049 training, 225 validation, and 225 test examples; the training set is subsequently balanced to 418 instances per class, while validation and test retain the natural class distribution (135 non-polarized, 90 polarized). All experiments are run on GPU (cuda) using LoRA fine-tuning, which updates only 0.25–0.41% of the model parameters. Among the three architectures, XLM-RoBERTa with LoRA achieves the strongest results, with a test accuracy of 0.68 and a macro-F1 of 0.68, while the Portuguese BERT and multilingual BERT models reach macro-F1 scores of 0.56 and 0.61, respectively. Training time is nearly identical across the three models (about 150 seconds each), so XLM-RoBERTa is the best-performing LoRA configuration at comparable cost. In all three runs the training loss falls only slightly over the five epochs (from about 0.69 to 0.66–0.68), and every model still recalls the non-polarized class poorly (0.35–0.56).
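The trainable-parameter counts printed in the logs can be reconstructed by hand. The sketch below assumes the standard base-model dimensions (hidden size 768, 12 encoder layers) and assumes that, for sequence classification, the classification head stays trainable alongside the adapters. The head layouts are inferred from the weight names in the initialization warnings above, and the helper names are hypothetical.

```python
def lora_adapter_params(hidden=768, layers=12, targets=3, r=8):
    """Parameters added by LoRA on `targets` square projections per layer."""
    # Each adapted hidden x hidden projection gains A (r x hidden) and B (hidden x r).
    return layers * targets * (2 * hidden * r)

def bert_head_params(hidden=768, num_labels=2):
    """BERT-style head: one linear layer (classifier.weight + classifier.bias)."""
    return hidden * num_labels + num_labels

def roberta_head_params(hidden=768, num_labels=2):
    """RoBERTa-style head: classifier.dense followed by classifier.out_proj."""
    return (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

adapters = lora_adapter_params()
print(adapters)                          # 442368
print(adapters + bert_head_params())     # 443906  (both BERT models)
print(adapters + roberta_head_params())  # 1034498 (XLM-RoBERTa)

# Share of trainable parameters, using the totals printed in the logs.
for trainable, total in [(443_906, 109_368_580),
                         (1_034_498, 279_079_684),
                         (443_906, 178_298_884)]:
    print(f"{100 * trainable / total:.2f}%")  # 0.41%, 0.37%, 0.25%
```

Under these assumptions the adapters account for 442,368 parameters in every model; the larger count for XLM-RoBERTa comes entirely from its two-layer classification head.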

## 3.3 Full Fine-tuned Model

This script implements the main supervised fine-tuning experiments on the full transformer models. Starting from the 1,499 manually labelled interventions, it performs a stratified 70/15/15 split and balances only the training set to a 50/50 ratio between polarized and non-polarized interventions, while validation and test retain the natural class distribution. The text is converted into Hugging Face Dataset objects, tokenized with a maximum length of 256 tokens, and used to fully fine-tune three architectures—neuralmind/bert-base-portuguese-cased, xlm-roberta-base, and bert-base-multilingual-cased—with a shared configuration of 5 epochs, learning rate 2×10⁻⁵, weight decay, warmup, and batch sizes of 16/32. In contrast to the LoRA setting, all model parameters (encoder and classification head) are updated. Training is carried out via the Trainer API, computing accuracy and macro/weighted precision, recall, and F1 on the validation and test sets, and generating a detailed classification report on the test data. For each model, the fine-tuned weights, tokenizer, and evaluation metrics are saved to disk, and a summary table is exported to compare performance and training cost across architectures.
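The shared configuration translates into a concrete number of optimizer steps. The following sketch assumes a single device, no gradient accumulation, and that the warmup length is obtained by rounding up the product of total steps and warmup ratio; the helper name `training_schedule` is hypothetical.

```python
import math

def training_schedule(n_train, batch_size, epochs, warmup_ratio):
    """Return (steps per epoch, total steps, warmup steps) for one training run."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

# 836 balanced training examples, batch size 16, 5 epochs, warmup ratio 0.06
print(training_schedule(836, 16, 5, 0.06))  # (53, 265, 16)
```

With only 265 updates in total, the learning rate would ramp up over roughly the first 16 steps and decay for the remainder. This is consistent with the LoRA training logs, which use the same settings and report losses at steps 50 to 250.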

In [10]:
# Full Fine Tuning — 70/15/15 split
import os, json, time, warnings, numpy as np, pandas as pd, torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, TrainingArguments, Trainer
)

warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

# paths & columns 
DATA_PATH  = r"C:\Users\Nomis\Downloads\intervention_sample_for_manual_labeling_new.xlsx"  # <-- new file
TEXT_COL   = "text"
LABEL_COL  = "intervention_polarized_label"

# ---- training config ----
SEED       = 42
EPOCHS     = 5
LR         = 2e-5
BSZ_TRAIN  = 16
BSZ_EVAL   = 32
MAX_LEN    = 256
WARMUP     = 0.06
WEIGHT_DEC = 0.01
OUT_DIR    = r".\pt_sentiment_models_3split_new"  # results root
TRAINING_MODE = "full"   

MODEL_NAMES = [
    "neuralmind/bert-base-portuguese-cased",
    "xlm-roberta-base",
    "bert-base-multilingual-cased",
]

# load & inspect data 
df = pd.read_excel(DATA_PATH)
df = df.dropna(subset=[TEXT_COL, LABEL_COL]).copy()
df[LABEL_COL] = df[LABEL_COL].astype(int)

print("Full (original) label counts:\n", df[LABEL_COL].value_counts())
print("Full (original) label %:\n", (df[LABEL_COL].value_counts(normalize=True) * 100).round(3))

# 70/15/15 (stratified)
train_df, temp_df = train_test_split(df, test_size=0.30, stratify=df[LABEL_COL], random_state=SEED)
val_df,   test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df[LABEL_COL], random_state=SEED)

print("\nSplit sizes:")
print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

# balance only TRAIN (50/50)
n0, n1   = (train_df[LABEL_COL] == 0).sum(), (train_df[LABEL_COL] == 1).sum()
n_min    = min(n0, n1)
train_balanced = pd.concat([
    train_df[train_df[LABEL_COL] == 0].sample(n_min, random_state=SEED),
    train_df[train_df[LABEL_COL] == 1].sample(n_min, random_state=SEED),
]).sample(frac=1, random_state=SEED).reset_index(drop=True)

print("\nTRAIN balanced counts:\n", train_balanced[LABEL_COL].value_counts())
print("VALID counts:\n", val_df[LABEL_COL].value_counts())
print("TEST counts:\n", test_df[LABEL_COL].value_counts())

# to HF datasets
train_ds = Dataset.from_pandas(train_balanced[[TEXT_COL, LABEL_COL]].reset_index(drop=True))
val_ds   = Dataset.from_pandas(val_df[[TEXT_COL, LABEL_COL]].reset_index(drop=True))
test_ds  = Dataset.from_pandas(test_df[[TEXT_COL, LABEL_COL]].reset_index(drop=True))

# device
device_str = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device_str)

# helpers 
def build_tokenizer(model_name):
    return AutoTokenizer.from_pretrained(model_name, use_fast=True)

def tokenize_fn(batch, tok):
    return tok(batch[TEXT_COL], truncation=True, padding=False, max_length=MAX_LEN)

def _rename_to_labels(ds):
    if LABEL_COL in ds.column_names:
        ds = ds.rename_column(LABEL_COL, "labels")
    return ds

def _set_torch_format(ds):
    cols = [c for c in ["input_ids","attention_mask","token_type_ids","labels"] if c in ds.column_names]
    ds.set_format("torch", columns=cols)
    return ds

def prepare_datasets(tok):
    tds = train_ds.map(lambda b: tokenize_fn(b, tok), batched=True)
    vds = val_ds.map(lambda b: tokenize_fn(b, tok), batched=True)
    tts = test_ds.map(lambda b: tokenize_fn(b, tok), batched=True)
    tds = _rename_to_labels(tds); vds = _rename_to_labels(vds); tts = _rename_to_labels(tts)
    tds = _set_torch_format(tds); vds = _set_torch_format(vds); tts = _set_torch_format(tts)
    return tds, vds, tts

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    pM, rM, f1M, _  = precision_recall_fscore_support(labels, preds, average="macro",    zero_division=0)
    pW, rW, f1W, _  = precision_recall_fscore_support(labels, preds, average="weighted", zero_division=0)
    return {
        "accuracy": acc,
        "precision_macro": pM, "recall_macro": rM, "f1_macro": f1M,
        "precision_weighted": pW, "recall_weighted": rW, "f1_weighted": f1W,
    }

def make_args(out_dir):
    return TrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=BSZ_TRAIN,
        per_device_eval_batch_size=BSZ_EVAL,
        learning_rate=LR,
        num_train_epochs=EPOCHS,
        warmup_ratio=WARMUP,
        weight_decay=WEIGHT_DEC,
        logging_steps=50,
        seed=SEED,
        fp16=(device_str=="cuda"),
        dataloader_num_workers=0,
        save_strategy="no",  # skip intermediate checkpoints; the final model is saved explicitly after training
    )

def set_train_mode(model):
    if TRAINING_MODE == "head":
        for name, p in model.named_parameters():
            if not name.startswith("classifier"):
                p.requires_grad = False
    elif TRAINING_MODE == "full":
        for p in model.parameters():
            p.requires_grad = True
    return model

def train_eval_save(model_name):
    print(f"\n===== {model_name} ({TRAINING_MODE}) =====")
    t0 = time.time()

    # per-model dirs
    run_dir   = os.path.join(OUT_DIR, model_name.replace("/", "_"))
    save_dir  = os.path.join(run_dir, "final_model")
    os.makedirs(run_dir, exist_ok=True); os.makedirs(save_dir, exist_ok=True)

    tok = build_tokenizer(model_name)
    tds, vds, tts = prepare_datasets(tok)
    collator = DataCollatorWithPadding(tokenizer=tok)

    base = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2, problem_type="single_label_classification"
    )

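    if TRAINING_MODE == "lora":
        if not peft_available:
            # fail loudly instead of silently falling back to full fine-tuning
            raise RuntimeError("TRAINING_MODE='lora' requires the `peft` package.")
        from peft import LoraConfig, TaskType, get_peft_model
        # attention projections are named q_proj/k_proj/v_proj or query/key/value depending on the architecture
        names = [n for n,_ in base.named_parameters()]
        targets = ["q_proj","k_proj","v_proj"] if any(".q_proj." in n for n in names) else ["query","key","value"]
        lcfg = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1, target_modules=targets)
        model = get_peft_model(base, lcfg)
    else:
        model = set_train_mode(base)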

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

    args = make_args(run_dir)
    trainer = Trainer(
        model=model, args=args, train_dataset=tds, eval_dataset=vds,
        tokenizer=tok, data_collator=collator, compute_metrics=compute_metrics
    )

    trainer.train()

    # Evaluate on val + test
    metrics_val  = trainer.evaluate(eval_dataset=vds)
    metrics_test = trainer.evaluate(eval_dataset=tts)

    # Detailed classification report (TEST)
    pred_out = trainer.predict(tts)
    preds  = np.argmax(pred_out.predictions, axis=-1)
    labels = pred_out.label_ids  # gold labels as returned by Trainer.predict
    report = classification_report(labels, preds, target_names=["0","1"], digits=4, zero_division=0)
    print("\nTest-set report:\n", report)

    # SAVE TRAINED MODEL + TOKENIZER
    trainer.save_model(save_dir)           # model weights, config
    tok.save_pretrained(save_dir)          # tokenizer files
    # also store metrics
    with open(os.path.join(run_dir, "val_metrics.json"), "w") as f: json.dump(metrics_val, f, indent=2)
    with open(os.path.join(run_dir, "test_metrics.json"), "w") as f: json.dump(metrics_test, f, indent=2)
    with open(os.path.join(run_dir, "test_report.txt"), "w", encoding="utf-8") as f: f.write(report)

    print(f"Saved trained model + tokenizer → {save_dir}")
    print(f"Time: {time.time()-t0:.1f}s")

    return {
        "model": model_name,
        "mode": TRAINING_MODE,
        "epochs": EPOCHS,
        "val_accuracy": metrics_val.get("eval_accuracy", np.nan),
        "val_f1_macro": metrics_val.get("eval_f1_macro", np.nan),
        "test_accuracy": metrics_test.get("eval_accuracy", np.nan),
        "test_f1_macro": metrics_test.get("eval_f1_macro", np.nan),
        "params_trainable": int(trainable),
        "save_dir": os.path.abspath(save_dir),
        "time_sec": round(time.time()-t0, 1),
    }

# run all
os.makedirs(OUT_DIR, exist_ok=True)
rows = []
for m in MODEL_NAMES:
    try:
        rows.append(train_eval_save(m))
    except Exception as e:
        print(f"[ERROR] {m}: {e}")
        rows.append({"model": m, "error": str(e)})

summary = pd.DataFrame(rows)
print("\n==== Summary (70/15/15 split, TRAIN balanced) ====")
print(summary.to_string(index=False))

summary.to_csv(os.path.join(OUT_DIR, "model_comparison_3split.csv"), index=False)
print(f"\nAll outputs saved under: {os.path.abspath(OUT_DIR)}")


Full (original) label counts:
 intervention_polarized_label
0    901
1    598
Name: count, dtype: int64
Full (original) label %:
 intervention_polarized_label
0    60.107
1    39.893
Name: proportion, dtype: float64

Split sizes:
Train: 1049, Val: 225, Test: 225

TRAIN balanced counts:
 intervention_polarized_label
1    418
0    418
Name: count, dtype: int64
VALID counts:
 intervention_polarized_label
0    135
1     90
Name: count, dtype: int64
TEST counts:
 intervention_polarized_label
0    135
1     90
Name: count, dtype: int64
Using device: cuda

===== neuralmind/bert-base-portuguese-cased (full) =====



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 108,924,674 / 108,924,674 (100.00%)


Step,Training Loss
50,0.6196
100,0.4741
150,0.3734
200,0.2701
250,0.2116



Test-set report:
               precision    recall  f1-score   support

           0     0.8595    0.7704    0.8125       135
           1     0.7019    0.8111    0.7526        90

    accuracy                         0.7867       225
   macro avg     0.7807    0.7907    0.7825       225
weighted avg     0.7965    0.7867    0.7885       225

Saved trained model + tokenizer → .\pt_sentiment_models_3split_new\neuralmind_bert-base-portuguese-cased\final_model
Time: 199.0s

===== xlm-roberta-base (full) =====



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 278,045,186 / 278,045,186 (100.00%)


Step,Training Loss
50,0.6582
100,0.573
150,0.5226
200,0.4719
250,0.4121



Test-set report:
               precision    recall  f1-score   support

           0     0.8942    0.6889    0.7782       135
           1     0.6529    0.8778    0.7488        90

    accuracy                         0.7644       225
   macro avg     0.7736    0.7833    0.7635       225
weighted avg     0.7977    0.7644    0.7665       225

Saved trained model + tokenizer → .\pt_sentiment_models_3split_new\xlm-roberta-base\final_model
Time: 352.0s

===== bert-base-multilingual-cased (full) =====



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 177,854,978 / 177,854,978 (100.00%)


Step,Training Loss
50,0.6332
100,0.5228
150,0.4507
200,0.3512
250,0.255



Test-set report:
               precision    recall  f1-score   support

           0     0.8482    0.7037    0.7692       135
           1     0.6460    0.8111    0.7192        90

    accuracy                         0.7467       225
   macro avg     0.7471    0.7574    0.7442       225
weighted avg     0.7673    0.7467    0.7492       225

Saved trained model + tokenizer → .\pt_sentiment_models_3split_new\bert-base-multilingual-cased\final_model
Time: 216.5s

==== Summary (70/15/15 split, TRAIN balanced) ====
                                model mode  epochs  val_accuracy  val_f1_macro  test_accuracy  test_f1_macro  params_trainable                                                                                                  save_dir  time_sec
neuralmind/bert-base-portuguese-cased full       5      0.760000      0.752566       0.786667       0.782539         108924674 C:\Users\Nomis\Downloads\pt_sentiment_models_3split_new\neuralmind_bert-base-portuguese-cased\final_model     1

The full dataset comprises 1,499 labelled interventions, split into 1,049 training, 225 validation, and 225 test examples. The training set is downsampled to 418 polarized and 418 non-polarized cases, while validation and test retain the original 60/40 class ratio (135 non-polarized vs. 90 polarized). Under full fine-tuning, all three transformer models clearly outperform the unfine-tuned baseline and their LoRA counterparts: neuralmind/bert-base-portuguese-cased performs best with a test accuracy of 0.79 and macro-F1 of 0.78, followed by xlm-roberta-base (accuracy 0.76, macro-F1 0.76) and bert-base-multilingual-cased (accuracy 0.75, macro-F1 0.74). All three models trade precision for recall on the polarized class (recall 0.81 to 0.88, precision 0.65 to 0.70), which is plausibly a consequence of training on a balanced split while evaluating on the natural 60/40 distribution. Per-class F1 scores nevertheless differ by only 0.03 to 0.06 within each model, indicating that full fine-tuning on the balanced training split yields reasonably balanced affective polarization classifiers for Portuguese parliamentary speeches.
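The per-class figures in the test report can be cross-checked by hand. The sketch below reconstructs a confusion matrix for neuralmind/bert-base-portuguese-cased from the reported recall and support; these counts are inferred from the printed report rather than taken from a logged matrix, and the helper name `f1_from_counts` is purely illustrative.

```python
# Confusion-matrix counts inferred from the test report (recall x support).
# They are reconstructed for illustration, not read from a logged matrix.
tn, fp = 104, 31   # class 0 (non-polarized): 104/135 = 0.7704 recall
fn, tp = 17, 73    # class 1 (polarized):      73/90  = 0.8111 recall

def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_class0 = f1_from_counts(tn, fn, fp)  # class 0 treated as the positive class
f1_class1 = f1_from_counts(tp, fp, fn)
macro_f1 = (f1_class0 + f1_class1) / 2  # unweighted mean of per-class F1

print(f"accuracy={accuracy:.4f}  f1_0={f1_class0:.4f}  "
      f"f1_1={f1_class1:.4f}  macro_f1={macro_f1:.4f}")
```

With these counts the recomputed values agree with the report to four decimals (accuracy 0.7867, macro-F1 0.7825), which also illustrates why macro-F1 sits slightly below accuracy here: it weights the smaller polarized class equally with the larger one.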

# 4. Results

The experiments confirm that task-specific fine-tuning is essential for reliable detection of affective polarization. The unfine-tuned XLM-RoBERTa baseline performs poorly, achieving only 0.40 accuracy and 0.29 macro-F1 on the test set and predicting almost exclusively the polarized class. LoRA adapters already bring substantial gains: XLM-RoBERTa with LoRA reaches 0.68 accuracy and 0.68 macro-F1, while the Portuguese and multilingual BERT models with LoRA achieve macro-F1 scores of 0.56 and 0.61, respectively. Full end-to-end fine-tuning improves performance further across all architectures. The best results are obtained with neuralmind/bert-base-portuguese-cased, which attains 0.79 test accuracy and 0.78 macro-F1, followed by xlm-roberta-base (0.76/0.76) and bert-base-multilingual-cased (0.75/0.74). Overall, the results indicate that fully fine-tuned Portuguese BERT is the most effective classifier for the binary polarization task on ParlaMint-PT, and also the one with the smallest gap between per-class recall (0.77 for non-polarized vs. 0.81 for polarized).
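The baseline figures are close to what a degenerate classifier would score on this test set. As a rough check, the sketch below assumes a hypothetical model that labels every one of the 225 test interventions as polarized; this is an idealized stand-in for the baseline's behaviour, not a reproduction of its actual predictions.

```python
# Test-set class supports from the notebook output.
n_non_polarized, n_polarized = 135, 90

# Hypothetical degenerate classifier: predicts "polarized" (1) for every example.
tp, fp = n_polarized, n_non_polarized
fn, tn = 0, 0

accuracy = (tp + tn) / (n_polarized + n_non_polarized)
f1_polarized = 2 * tp / (2 * tp + fp + fn)
# Class 0 is never predicted, so its F1 is 0 (the zero_division=0 convention).
f1_non_polarized = 0.0
macro_f1 = (f1_polarized + f1_non_polarized) / 2

print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1:.2f}")
```

The result, 0.40 accuracy and 0.29 macro-F1, matches the reported baseline scores, which supports reading the baseline as a near-constant predictor rather than a weak but informative classifier.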