## **Introduction - Sentiment Analysis Part II**
So far we have tried:
- a pre-trained RoBERTa model
- zero-shot learning
- VADER on translated tweets

Issue: none of them works well (poor classification and very imbalanced data). Potential solutions:
1. Fine-tune a model on domain-specific data: collect an Italian political sentiment dataset and fine-tune an Italian language model such as AlBERTo, UmBERTo, or BERTino on it. You can manually label a subset of your tweets (even ~1,000) and use active learning to scale up.
2. Use a weak supervision framework: if labeling is too expensive, use Snorkel or skweak to write heuristic rules (e.g., hashtags, emojis, known party slogans), combine the weak labels with a probabilistic label model, and train a classifier on this noisy but large-scale data.
3. Refine the zero-shot setup: if you use a zero-shot pipeline (e.g., Hugging Face `pipeline("zero-shot-classification")`), try:
    - Framing labels in political terms: instead of the generic "positive/negative/neutral", use "supportive", "opposing", and "neutral" toward the politician or their policies.
    - Adjusting the hypothesis template to improve performance, e.g. `candidate_labels = ["support", "oppose", "neutral"]` with `hypothesis_template = "This tweet expresses a {} opinion about politics."`
4. Use a sentiment + emotion + toxicity hybrid: political tweets are often more emotional than classically "positive" or "negative". Use an emotion classifier (anger, joy, fear, etc.), combine it with toxicity detection (e.g., HateBERT, ToxicBERT, or Italian-specific versions), and fuse the signals into a sentiment interpretation model.
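
Option 2 can be illustrated without any framework. The sketch below is a deliberate simplification: it combines rules by plain majority vote rather than the probabilistic label model that Snorkel or skweak would fit, and the rules themselves (`#forza`, "vergogna", the two emojis) are hypothetical placeholders, not rules derived from our tweets.

```python
from collections import Counter

ABSTAIN = None  # a rule returns ABSTAIN when it has no opinion

# Hypothetical rules: the hashtag, slogan and emojis are placeholders
def lf_support_hashtag(text):
    return "support" if "#forza" in text.lower() else ABSTAIN

def lf_oppose_slogan(text):
    return "oppose" if "vergogna" in text.lower() else ABSTAIN

def lf_positive_emoji(text):
    return "support" if any(e in text for e in ("👏", "💪")) else ABSTAIN

LABELING_FUNCTIONS = [lf_support_hashtag, lf_oppose_slogan, lf_positive_emoji]

def weak_label(text):
    """Return (label, vote share) by majority vote, or (ABSTAIN, 0.0) if no rule fires."""
    votes = [label for lf in LABELING_FUNCTIONS if (label := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Made-up example tweets
tweets = [
    "Grande giornata in piazza! #Forza 💪",
    "Vergogna! Un governo senza idee.",
    "Oggi alle 18 conferenza stampa.",
]
weak_labels = [weak_label(t) for t in tweets]
```

Tweets on which every rule abstains stay unlabeled, which is one reason weak labels usually complement a small manually labeled set rather than replace it.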

As a first step, we will work only on a filtered subset of the data: the tweets that should be the most expressive.

In [3]:
import torch
import pandas as pd
import numpy as np

## **Load the Data**

In [4]:
politicians_cleaned = pd.read_csv('politicians_data/politicians_classified.csv')

In [5]:
data = politicians_cleaned.drop(columns=['Unnamed: 0', 'classification'])

In [6]:
labeled_data = pd.read_csv('labeled_tweets_with_tone.csv')

## **Active Learning**
Active learning is a technique that minimizes labeling effort by selectively choosing which examples a human should label next.

How it works:
- Start with a small labeled set (e.g., 200 tweets).
- Train an initial model (using transfer learning).
- Use that model to predict on the unlabeled tweets.
- Select the most uncertain predictions (e.g., a small confidence margin between the top two classes).
- Manually label these tweets (they are the most informative).
- Retrain the model with the new labeled data.
- Repeat.

Goal: reach high model performance with far fewer labels than random sampling would require.
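
The selection step depends on how "uncertain" is measured. As a minimal sketch (pure Python, with made-up probabilities), here are three common scores and a margin-based selection; the function names are illustrative and not part of any library.

```python
import math

def least_confidence(probs):
    """1 - max probability: high when the predicted class is weak."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes: LOW means uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy in nats: high when the distribution is flat."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(prob_rows, k):
    """Indices of the k rows with the smallest margin, most uncertain first."""
    return sorted(range(len(prob_rows)), key=lambda i: margin(prob_rows[i]))[:k]

# Toy predictions over four classes (made-up numbers)
predictions = [
    [0.90, 0.05, 0.03, 0.02],  # confident
    [0.40, 0.38, 0.12, 0.10],  # two classes nearly tied
    [0.25, 0.25, 0.25, 0.25],  # completely flat
]
to_label = select_most_uncertain(predictions, k=2)
```

Here `to_label` picks the flat row and the nearly tied row and skips the confident one. Margin and least confidence can rank examples differently when there are more than two classes, so it is worth fixing one measure for all rounds.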

SET OF LABELS:
1. Neutral / Informational: Factual updates, announcements, or descriptive statements with no clear emotional tone or stance.
2. Supportive / Affirmative / Celebratory: Positive endorsement or praise — for allies, institutions, values, events. Celebrating some achievement.
3. Attacking / Critical / Angry / Outraged: Expressing indignation, moral outrage, or rhetorical escalation. Direct attack, blame, or disapproval — usually aimed at political opponents.
4. Call to Action / Propaganda
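
The notebook builds its label-to-id mapping from `unique()`, so the ids depend on the order in which labels happen to appear, and any spelling variant or empty label silently becomes an extra class. One possible safeguard, sketched here with hypothetical names (`TONE_LABELS`, `encode_labels`), is a fixed mapping that uses the label strings from the annotation dropdown and reports anything outside the set.

```python
# Label strings as they appear in the annotation widget's dropdown
TONE_LABELS = [
    "Neutral / Informational",
    "Supportive / Affirmative / Celebratory",
    "Critical / Angry",
    "Call to Action / Propaganda",
]
LABEL2ID = {label: idx for idx, label in enumerate(TONE_LABELS)}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

def encode_labels(raw_labels):
    """Map label strings to ids; collect anything outside the label set."""
    ids, unknown = [], set()
    for raw in raw_labels:
        label = raw.strip() if isinstance(raw, str) else ""
        if label in LABEL2ID:
            ids.append(LABEL2ID[label])
        else:
            ids.append(None)   # caller decides whether to drop or relabel
            unknown.add(raw)
    return ids, unknown

ids, unknown = encode_labels([
    "Critical / Angry",
    "Supportive / Positive / Celebratory",  # spelling variant, not in the set
    "",                                     # dropdown left empty
])
```

Rows that come back as `None` can then be dropped or relabeled before training, so the number of classes stays at four across rounds.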

We start with 200 manually labeled tweets stored in `labeled_data`.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split


# Build 'unlabeled_data' by removing the tweets that are already labeled (matched on their text)
unlabeled_data = data[~data['Content'].isin(labeled_data['Content'])]


### **Fine Tune**

In [14]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

# Load tokenizer and model (Italian BERT model)
model_name = "dbmdz/bert-base-italian-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [16]:
# Prepare dataset for Trainer
# Ensure all unique labels
label_list = labeled_data['tone_label'].unique()
label2id = {label: idx for idx, label in enumerate(label_list)}
id2label = {idx: label for label, idx in label2id.items()}

# Apply mapping
labeled_data['label_id'] = labeled_data['tone_label'].map(label2id)

from datasets import Dataset

train_dataset = Dataset.from_pandas(labeled_data)

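# Tokenization
# Tweets are short, so cap sequences at 128 tokens (an assumed limit) rather than
# padding every example to the model maximum of 512 tokens
def tokenize_function(examples):
    return tokenizer(examples['Content'], padding="max_length", truncation=True, max_length=128)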

# Tokenize and assign correct label column
train_dataset = train_dataset.map(tokenize_function, batched=True)
train_dataset = train_dataset.rename_column("label_id", "labels")

# Remove other unnecessary columns
columns_to_remove = [col for col in train_dataset.column_names if col not in ['input_ids', 'attention_mask', 'labels']]
train_dataset = train_dataset.remove_columns(columns_to_remove)

# Set format
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_steps=10,
    save_steps=100,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

print("Starting fine-tuning on labeled data...")
trainer.train()




Starting fine-tuning on labeled data...


Step,Training Loss
10,1.6101
20,1.5998
30,1.4506
40,1.3782
50,1.3904
60,1.2276
70,1.1318


TrainOutput(global_step=75, training_loss=1.3748049672444662, metrics={'train_runtime': 650.573, 'train_samples_per_second': 0.922, 'train_steps_per_second': 0.115, 'total_flos': 157870885478400.0, 'train_loss': 1.3748049672444662, 'epoch': 3.0})

In [25]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import scipy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

# SAMPLE SUBSET FOR FASTER INFERENCE
subset_size = 1000
sampled_unlabeled_data = unlabeled_data.sample(n=subset_size, random_state=42).reset_index(drop=True)

# Prepare inference dataset
inference_dataset = Dataset.from_pandas(sampled_unlabeled_data)
inference_dataset = inference_dataset.map(tokenize_function, batched=True)
inference_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

inference_loader = DataLoader(inference_dataset, batch_size=16)

# Inference on subset
model.eval()
all_probs = []

print(f"Performing inference on {subset_size} sampled tweets...")

with torch.no_grad():
    for batch in tqdm(inference_loader):
        # Move the batch to the same device as the model
        outputs = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
        )
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        max_probs, _ = probs.max(dim=-1)
        all_probs.extend(max_probs.cpu().numpy())

# Least-confidence uncertainty: 1 - probability of the predicted class
uncertainties = 1 - np.array(all_probs)
uncertain_indices = np.argsort(uncertainties)[-50:][::-1]  # 50 most uncertain, most uncertain first

# The indices refer to rows of the sampled subset, not of the full unlabeled_data
uncertain_samples = sampled_unlabeled_data.iloc[uncertain_indices]

print("\nTop 50 most uncertain tweets for labeling:")
print(uncertain_samples['Content'])

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Performing inference on 1000 sampled tweets...


  3%|▎         | 2/63 [00:31<16:05, 15.83s/it]


KeyboardInterrupt: 

In [None]:
average_uncertainty = np.mean(uncertainties)
print(f"\nAverage uncertainty over unlabeled data: {average_uncertainty:.4f}")


### **Part 1 - Label 100 random tweets**

In [None]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

# Draw 100 random tweets to label
df = data.sample(n=100, random_state=42).reset_index(drop=True)

# Define tone labels
tone_labels = [
    "Neutral / Informational",
    "Supportive / Affirmative / Celebratory",  # same spelling as in the label set and later batches
    "Critical / Angry",
    "Call to Action / Propaganda",
]

# Store dropdowns and layout
dropdowns = []
for i, row in df.iterrows():
    tweet_display = widgets.HTML(value=f"<b>Tweet {i+1}</b>: {row['Content']}")
    dropdown = widgets.Dropdown(
        options=[""] + tone_labels,
        description='Tone:',
        layout=widgets.Layout(width='70%')
    )
    dropdowns.append(dropdown)
    display(tweet_display, dropdown)

# Save button logic
def save_labels(_):
    for i, dropdown in enumerate(dropdowns):
        df.at[i, 'tone_label'] = dropdown.value
    df.to_csv("labeled_tweets_with_tone.csv", index=False)
    print("✅ Labels saved to labeled_tweets_with_tone.csv")

save_button = widgets.Button(description="💾 Save Labels", button_style='success')
save_button.on_click(save_labels)
display(save_button)


### **Part 2 - Fine Tune Model on labeled tweets**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

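# LOAD AND PREPROCESS LABELED DATA
df = pd.read_csv("labeled_tweets_with_tone.csv")

# Drop rows whose dropdown was left empty (saved as "" and read back as NaN),
# otherwise NaN would be encoded as an extra class
df = df.dropna(subset=["tone_label"]).reset_index(drop=True)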

# Label encoding
labels = df["tone_label"].unique().tolist()
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}
df["label_id"] = df["tone_label"].map(label2id)

# Split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["Content"].tolist(), df["label_id"].tolist(), test_size=0.2, random_state=42
)


In [None]:
df.shape # check that the number of labeled tweets has increased

In [None]:
# TOKENIZE
from transformers import AutoTokenizer

model_name = "Musixmatch/umberto-commoncrawl-cased-v1"  # You can swap for AlBERTo if preferred
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

In [None]:
import torch

# CONVERT TO PYTORCH
class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)


In [None]:
import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

model_dir = "./saved_mo"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

# Prepare data loaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
for epoch in range(3):  
    print(f"Epoch {epoch + 1}")
    model.train()
    total_loss = 0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    print(f"Train loss: {total_loss / len(train_loader):.4f}")

    # Evaluation loop
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            predictions = torch.argmax(outputs.logits, dim=-1)
            correct += (predictions == batch["labels"]).sum().item()
            total += batch["labels"].size(0)
    print(f"Validation accuracy: {correct / total:.2%}")

# Save the model
print(f"Saving model to {model_dir}")
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)



## 200 labels
Epoch 1
Train loss: 1.5477
Validation accuracy: 30.00%

Epoch 2
Train loss: 1.4712
Validation accuracy: 25.00%

Epoch 3
Train loss: 1.3206
Validation accuracy: 35.00%
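
With a 20% split of 200 labeled tweets, the validation set has 40 tweets, so these accuracies correspond to 12, 10, and 14 correct predictions. Because the data is imbalanced, accuracy is easier to interpret next to a majority-class baseline and per-class recall. The sketch below shows one way to compute both in plain Python; the function names and the label ids are made up for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, val_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return sum(1 for y in val_labels if y == majority_label) / len(val_labels)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the model recovers."""
    totals, hits = Counter(y_true), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Made-up label ids, only to show the calls
train_labels = [0, 0, 0, 0, 1, 2, 2, 3]
val_labels = [0, 0, 1, 2, 3]
val_preds = [0, 0, 0, 2, 0]

baseline = majority_baseline_accuracy(train_labels, val_labels)
recall = per_class_recall(val_labels, val_preds)
```

A model whose accuracy does not clearly exceed `baseline`, or whose recall is zero for some classes, has mostly learned the class frequencies.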


### **Part 3 - Select the most uncertain tweets**

In [None]:
all_df = data

# Exclude tweets that are already labeled (matched on their text)
unlabeled_df = all_df[~all_df["Content"].isin(df["Content"])].reset_index(drop=True)

# Predict on all remaining tweets (sample a subset here if the dataset is huge)
texts_to_predict = unlabeled_df["Content"].tolist()


In [None]:
from torch.nn.functional import softmax

device = torch.device("cpu")
model.to(device)

batch_size = 32
all_probs = []

# Tokenize and predict batch by batch so memory use stays bounded
model.eval()
with torch.no_grad():
    for i in range(0, len(texts_to_predict), batch_size):
        batch_texts = texts_to_predict[i:i+batch_size]
        batch_enc = tokenizer(batch_texts, truncation=True, padding=True, return_tensors="pt")
        # Keep the inputs on the same device as the model
        batch_enc = {k: v.to(device) for k, v in batch_enc.items()}
        outputs = model(**batch_enc)
        batch_probs = softmax(outputs.logits, dim=1).cpu().numpy()
        all_probs.append(batch_probs)

import numpy as np
probs = np.vstack(all_probs)  # shape: (number of tweets, number of classes)

In [None]:
import numpy as np

# Margin = difference between top two probabilities
sorted_probs = np.sort(probs, axis=1)
margins = sorted_probs[:, -1] - sorted_probs[:, -2]

# Lower margin = higher uncertainty
uncertainty_indices = np.argsort(margins)  # ascending = most uncertain first


In [None]:
# get average certainty
top_probs = np.max(probs, axis=1)

average_certainty = np.mean(top_probs)

print(f"Average certainty: {average_certainty:.4f}")

In [None]:
# visualize certainty
import matplotlib.pyplot as plt

plt.hist(top_probs, bins=20)
plt.title("Certainty (Max probability) Distribution")
plt.xlabel("Certainty")
plt.ylabel("Number of samples")
plt.show()

## TODO: record the average certainty

In [None]:
K = 100  # or any number you want to label next
uncertain_samples = unlabeled_df.iloc[uncertainty_indices[:K]].copy()

# Save for annotation
uncertain_samples.to_csv("next_batch_to_label.csv", index=False)

In [None]:
# Get predicted label indices
predicted_ids = np.argmax(probs, axis=1)

# Get prediction confidences (max prob per row)
confidences = np.max(probs, axis=1)

# Map back to label names
id2label = {v: k for k, v in label2id.items()}
predicted_labels = [id2label[i] for i in predicted_ids]

# Attach to DataFrame
unlabeled_df["model_prediction"] = predicted_labels
unlabeled_df["model_confidence"] = confidences

## **Label New Batch**

In [None]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
import os

# 📥 Load the next batch to label
df = pd.read_csv("next_batch_to_label.csv").reset_index(drop=True)

# 📦 Define tone labels
tone_labels = [
    "Neutral / Informational",
    "Supportive / Affirmative / Celebratory",
    "Critical / Angry",
    "Call to Action / Propaganda",
]

# 🧠 Interactive widgets
dropdowns = []
for i, row in df.iterrows():
    tweet_display = widgets.HTML(value=f"<b>Tweet {i+1}</b>: {row['Content']}")
    dropdown = widgets.Dropdown(
        options=[""] + tone_labels,
        description='Tone:',
        layout=widgets.Layout(width='70%')
    )
    dropdowns.append(dropdown)
    display(tweet_display, dropdown)

# 💾 Save button logic — append to existing file
def save_labels(_):
    for i, dropdown in enumerate(dropdowns):
        df.at[i, 'tone_label'] = dropdown.value

    labeled_path = "labeled_tweets_with_tone.csv"

    if os.path.exists(labeled_path):
        existing = pd.read_csv(labeled_path)
        combined = pd.concat([existing, df], ignore_index=True).drop_duplicates(subset=["Content"])
    else:
        combined = df

    combined.to_csv(labeled_path, index=False)
    print(f"✅ Appended {len(df)} tweets. Total labeled: {len(combined)}")

# 🖱️ Save button UI
save_button = widgets.Button(description="💾 Save Labels", button_style='success')
save_button.on_click(save_labels)
display(save_button)
