<a href="https://colab.research.google.com/github/antndlcrx/Oxford-Methods-Spring-School/blob/main/prompting_classification_simulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://cdn.githubraw.com/antndlcrx/oss_2024/main/images/dpir_oss.png?raw=true:,  width=70" alt="My Image" width=500>

# **Promting, Classification, Simulating Human Behaviour**

## 🧩 **Prompting**

Autoregressive language models (like GPT) are trained with the objective to maximize word probabilities (conditional on context) it sees in training data:

$$
\mathcal{L} = - \sum_{t=1}^{T} \log P(x_t | x_1, ..., x_{t-1})
$$

That is, the model learns to predict the next token $x_t $ given all previous tokens $x_1, ..., x_{t-1}$. If the context includes a pattern like:

```
Question: What is the capital of France?
Answer:
```

Then the model will learn to assign high probability to a token like `Paris` because that is a pattern it has frequently seen during training.

---

### 📊 **Prompting Shapes the Distribution**

When you prompt with something like:

```
The sentiment of the tweet "I love this movie" is
```

You are conditioning the model's token prediction to match a **semantic pattern** it has likely encountered — e.g., instruction-following or QA-style input. The model is not "understanding" the task in the human sense, but is drawing on:

- **Statistical regularities** from training data
- **Task-format correlations** (e.g., text → label)
- **Implicit memory of patterns** (e.g., QA pairs, sentence completions, reasoning chains)

### 📝**Few-Shot & Zero-Shot Prompting Mimic Training Structure**

- In **zero-shot prompting**, you rely on the fact that LMs have seen many generic task instructions and formats. If your prompt resembles something in pretraining (e.g., `"Classify the following..."`), it may trigger the correct continuation.

- In **few-shot prompting**, you explicitly show the model input-output examples, which help **anchor the distribution** of the next tokens toward the expected kind of label or response.

The model learns *implicitly* to perform tasks from the co-occurrence of instruction → response patterns in pretraining data (e.g., datasets scraped from forums, tutorials, quizzes, documentation, etc.).

---

### 📎💡 **Prompting is In-Context Learning**

Large models can learn *in context* without updating weights. During inference, **the examples you give in the prompt condition the generation, effectively allowing the model to "learn" how to behave just from the prompt structure**.

This behavior is emergent in large-scale LMs — GPT-2 barely does this, but GPT-3+ show strong few-shot capabilities.

## **1**.&nbsp; **🎯 Classification & Annotation with LLMs**


Many research tasks require **classifying text**—for example, identifying sentiment, stance, or speaker intent. Traditionally, this meant training a supervised classifier on a labeled dataset, which can be time-consuming and expensive to create.

Large Language Models (LLMs) offer new ways to approach classification and annotation with far less data—and often, no training at all.

We will explore two main strategies:

- **Zero- and Few-Shot Prompting**: Provide the model with a natural language prompt and optionally a few examples. No training required—just carefully designed inputs.
- **Fine-Tuning**: When you have labeled data, you can train the model to specialize on your task, improving performance and consistency.

In this tutorial, we'll explore both methods, compare their strengths and limitations, and discuss how LLMs can support social science classification tasks.

---

Both **prompting** and **fine-tuning** leverage the powerful pretraining of LLMs, which have learned broad language patterns from massive amounts of text. This allows them to recognise structure, syntax, and even subtle meanings across a wide range of domains.

- **zero- and few-shot prompting**, we tap into this general knowledge by writing a prompt that frames the task clearly—often including labels and instructions in natural language. When done well, the model can perform surprisingly well without ever being trained on your specific data.

- **fine-tuning**, we go one step further: we adjust the model's parameters based on labeled examples. This helps the model specialize—reducing ambiguity, aligning with your domain, and improving consistency, especially for edge cases or more nuanced labels.

💡 What makes both approaches possible is the rich contextual representations LLMs learn during pretraining. Prompting uses these representations as-is; fine-tuning tweaks them for your task.

In practice, **prompting is faster and cheaper, while fine-tuning offers better performance when enough annotated data is available**.

### **1. 1**.&nbsp; **Sentiment Analysis and Stance Detection**

Our main example here is a research [article by Bestvater and Monroe](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/743A9DD62DF3F2F448E199BDD1C37C8D/S1047198722000109a.pdf/sentiment-is-not-stance-target-aware-opinion-classification-for-political-text-analysis.pdf) on the difference between sentiment and stance. The authors use data from Twitter (X), which are 20,000 posts about the 2017 Women's March in Washington DC. As you might guess, the dataset features tweets that support the March and those that oppose it, as well as tweets in positive tone and negative tone. The authors' point is that we should not use tone as a proxy for opinion on the women's march, and instead estimate it directly.   

<img src="https://cdn.githubraw.com/antndlcrx/oss_2024/main/images/pa_abstract.png?raw=true:,  width=100" alt="My Image" width=900>

In [None]:
# Get data
import pandas as pd

DATA_PATH = 'https://raw.githubusercontent.com/antndlcrx/oss_2024/main/data/WM_tweets_groundtruth.csv'
wm_data = pd.read_csv(DATA_PATH)

In [None]:
wm_data.head()

Unnamed: 0,text,stance,sentiment,balanced_train,vader_scores
0,YES! I'm still with her and always will be. ht...,1,1.0,0.0,0.5754
1,Pics or it didn't happen. https://t.co/o1GddSmwk2,1,0.0,0.0,0.0
2,I love this nasty woman. @MaribethMonroe #wome...,1,1.0,1.0,-0.0129
3,RT @YiawayYeh: Marching for love. Nashville #...,1,1.0,1.0,0.6369
4,These people are just Sad. https://t.co/0LK6iG...,0,0.0,1.0,-0.4767


The dataset has `text` feature that we will use to predict our targets, which are `stance` and `sentiment`.

In [None]:
# before we start any analysis, we clean up the data
# particularly, we want to have text labels for our targets and a cleaned version of the text
# (we do not want to waste time and resources on urls, at least if we assume those are not useful
# to predict sentiment and stance)

wm_data['stance_cat'] = wm_data['stance'].map({1: 'support', 0: 'oppose'})
wm_data['sentiment_cat'] = wm_data['sentiment'].map({1.0: 'positive', 0.0: 'negative'})
wm_data['text_cleaned'] = wm_data['text'].str.replace(r'http\S+|www.\S+', '', case=False, regex=True) # remove urls

In [None]:
# next, we want to split data into training, validation, and test splits
# this will allow us to monitor model's out of sample prediction (for fine-tuning approach)
# the prompting approach only needs validation data to track performance

# we use a specialised funciton to split data at random with a fixed seed
from sklearn.model_selection import train_test_split

# to save resources and time
example_data = wm_data.sample(5000, random_state=42)

# Split data: 80% train, 10% val, 10% test
train_data, temp_data = train_test_split(example_data, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42) # split data again into validation and test

print(train_data.shape, "\n", "-"*10, "\n", val_data.shape, "\n", "-"*10, "\n",
      test_data.shape)

(4000, 8) 
 ---------- 
 (500, 8) 
 ---------- 
 (500, 8)


let us pick up a model: we want a text generation model. We will experiment with the [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) model.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" if torch.cuda.is_available else "cpu"

model_name = "Qwen/Qwen2.5-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.23k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

In [None]:
def generate_text(model, tokenizer, prompt, max_new_tokens=20):
    """
    Generate text continuation from a prompt using a language model.

    Args:
        model: A pretrained Hugging Face language model (e.g., GPT-2).
        tokenizer: The corresponding tokenizer for the model.
        prompt (str): Input string to prompt the model with.
        max_new_tokens (int): Maximum number of tokens to generate beyond the prompt.

    Returns:
        str: The generated continuation (excluding the original prompt).
    """

    # tokenize the prompt and convert it to input tensors
    # return_tensors="pt" makes tokenizer return PyTorch tensors
    # .to(model.device) moves tensors to the same device as the model (e.g., GPU or CPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # generate new tokens based on the prompt
    # model.generate returns a tensor of shape [1, prompt_len + max_new_tokens]
    outputs = model.generate(**inputs,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)

    # decode the generated token IDs back into human-readable text
    # skip_special_tokens=True removes things like <pad> or <eos>
    out_decoded = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # remove the prompt portion to isolate only the new model-generated content
    # (since decode returns the full prompt + completion)
    if out_decoded.startswith(prompt):
        return out_decoded[len(prompt):].strip()
    else:
        # Fallback: if for some reason the output doesn't contain the prompt
        return out_decoded.strip()

In [None]:
text = val_data['text_cleaned'].iloc[97]
instr = f"Does the author of the following text support or oppose the Women's March? Answer with: support or oppose. \nText: <{text}>\nAnswer:"
print(instr)

"Does the author of the following text support or oppose the Women's March? Answer with: support or oppose. \nText: <The biggest lie to come out of #WomensMarch right here. #RayPopWatch #ManOfQuality >\nAnswer:"

In [None]:
generate_text(model, tokenizer, instr)

'oppose'

### 🎯 **Reminder: Classification Performance Metrics**

When evaluating classifiers, we often use a set of core metrics to understand different aspects of model performance:

- **Accuracy** measures overall correctness—how often the model was right:
  $$
  \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  $$
  It is useful when classes are balanced, but can be misleading when they are not. For instance, if 99% of your data is class A and only 1% is class B, you will get near perfect accuracy by just always guessing A.

- **Precision** tells us, *of all predicted positives, how many were correct?*
  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$
  It is important when false positives are costly (e.g., flagging innocent people).

- **Recall** tells us, *of all actual positives, how many did we catch?*
  $$
  \text{Recall} = \frac{TP}{TP + FN}
  $$
  Useful when false negatives are costly (e.g., missing diseases in medical screening).

- **F1 Score** is the harmonic mean of precision and recall—it balances both:
  $$
  \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$
  Ideal when you need a single number to weigh both false positives and false negatives.

### Zero-Shot Experiment

First, we try to test if the model can do a reasonable classification by itself, with only our instruction as guidance and no examples. This is called "Zero-Shot Learning".

In [None]:
# for faster inference, we develop our generate function further to process input
# exaples in batches.

def generate_batch(model, tokenizer, prompts, max_new_tokens=20):
    """
    Generate model completions for a batch of prompts.

    Args:
        model: Pretrained language model.
        tokenizer: Corresponding tokenizer.
        prompts (list of str): List of input prompts.
        max_new_tokens (int): Maximum number of tokens to generate beyond prompt.

    Returns:
        list of str: List of generated completions (excluding prompt).
    """
    # tokenize all prompts into a padded batch of tensors
    # truncation ensures our inputs do not exceed the context_len of the model
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(model.device)

    # generate completions
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode and remove prompts from output
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # output just the tokens after prompt (normally model outputs the prompt as well)
    cleaned = [gen[len(prompt):].strip() if gen.startswith(prompt) else gen.strip()
               for prompt, gen in zip(prompts, decoded)]

    return cleaned

In [None]:
from tqdm import tqdm # get progress bar to track our inference progress

batch_size = 16 # process 16 examples at a time
predictions = [] # emty list where we will store predictions

# Build prompts, for each example in our validation data
prompts = [
    f"Does the author of the following text support or oppose the Women's March? Answer with: support or oppose.\nText: <{text}>\nAnswer:"
    for text in val_data['text_cleaned']
]

# Run in batches
for i in tqdm(range(0, len(prompts), batch_size)):
    batch = prompts[i:i + batch_size]
    preds = generate_batch(model, tokenizer, batch, max_new_tokens=10)  # 10 is enough for short answers
    predictions.extend(preds)

# val_data["llm_prediction"] = predictions

100%|██████████| 32/32 [00:32<00:00,  1.00s/it]


In [None]:
len(set(predictions))

val_data["llm_prediction"] = predictions

In [None]:
from sklearn.metrics import classification_report

# Make sure predictions and true labels are aligned
y_true = val_data["stance_cat"]
y_pred = val_data["llm_prediction"]

# print a full classification report with metrics:
print(classification_report(y_true, y_pred, digits=3))

              precision    recall  f1-score   support

      oppose      0.221     1.000     0.363        66
     support      1.000     0.465     0.635       434

    accuracy                          0.536       500
   macro avg      0.611     0.733     0.499       500
weighted avg      0.897     0.536     0.599       500



In [None]:
#@title Exercise:
# Try experimenting with changing the prompt. Can you get a better result?

# What issues, if any, did you face?

### **🧪 From Zero-Shot to Few-Shot Classification**
Our zero-shot approach gave us a quick way to classify examples without any labeled training data, but as we saw, its **performance was limited**—especially **in the face of class imbalance** or ambiguous phrasing.


This is where few-shot learning comes in. Instead of relying purely on general world knowledge, we can guide the model by giving it a few labeled examples of what "support" and "oppose" look like. These examples act as demonstrations, helping the model better grasp the specific decision boundary we care about. **Few-shot prompting** provides a powerful middle ground: it **is flexible like zero-shot, but gains performance from contextually relevant examples—without requiring full model retraining**.

In [None]:
import random

def build_few_shot_prompt(text, train_df, n_shots=3, seed=42):
    """
    Construct a few-shot classification prompt using n_shots per class.

    Args:
        text (str): The test input (tweet or sentence).
        train_df (pd.DataFrame): A labeled dataframe with 'text_cleaned' and 'stance_cat'.
        n_shots (int): Number of examples per class to include.
        seed (int): Random seed for reproducibility.

    Returns:
        str: A complete few-shot prompt ready for inference.
    """
    random.seed(seed)

    # Sample n_shots positive and negative examples
    pos_examples = train_df[train_df["stance_cat"] == "support"].sample(n=n_shots, random_state=seed)
    neg_examples = train_df[train_df["stance_cat"] == "oppose"].sample(n=n_shots, random_state=seed)

    # Format them into "Text: ... → Label: ..." format
    demo_examples = []

    # combine the sampled positive and negative examples into a single DataFrame (results in 2*n_shot examples)
    # sampling (shuffling here) helps ensure that the model doesn't always see one class before the other, class order is random
    for _, row in pd.concat([pos_examples, neg_examples]).sample(frac=1, random_state=seed).iterrows():
        demo = f"Text: <{row['text_cleaned']}>\nAnswer: {row['stance_cat']}"
        demo_examples.append(demo)

    # build the full few-shot prompt
    full_prompt = (
        "Does the author of the following text support or oppose the Women's March?. Answer with: support or oppose."
        + "\n".join(demo_examples)
        + f"\nText: <{text}>\nAnswer:"
    )

    return full_prompt

In [None]:
print(build_few_shot_prompt(val_data['text_cleaned'].iloc[0],
                      train_data))

Does the author of the following text support or oppose the Women's March?. Answer with: support or oppose.Text: <Say their names: TRAYVON MARTIN! TRAYVON MARTIN!TRAYVON MARTIN!TRAYVON MARTIN!TRAYVON MARTIN!TRAYVON MARTIN!TRAYVON MARTIN! #whyIMarch>
Answer: support
Text: <This right here is my new favorite photograph. >
Answer: support
Text: <@RealJamesWoods @SpecialKMB1969 @womensmarch That's why they r not real women....real women know that someone has to pick that crap up!>
Answer: oppose
Text: <RT @MikeColeNESN: Probably gonna want that one back. >
Answer: support
Text: <Fuck you your trash >
Answer: oppose
Text: <#WomensMarch Life and Family do not touch, Abortion=Murder PROVIDA >
Answer: oppose
Text: <RT @llgb: I won't back down! >
Answer:


In [None]:
#@title Batched Few-Shot Inference

from tqdm import tqdm

def run_few_shot_classification(
    model, tokenizer,
    val_data, train_data,
    n_shots=3,
    batch_size=8,
    max_new_tokens=10,
    seed=42
):
    """
    Run few-shot classification on a validation dataset using prompts built from a labeled training set.

    Reuses:
    - build_few_shot_prompt(): for creating individual prompts.
    - generate_batch(): for batch generation and postprocessing.

    Args:
        model: Hugging Face LLM (e.g. GPT2).
        tokenizer: Corresponding tokenizer.
        val_data (pd.DataFrame): Validation data with 'text_cleaned'.
        train_data (pd.DataFrame): Labeled data with 'stance_cat'.
        n_shots (int): Number of examples per class in prompt.
        batch_size (int): Number of prompts to generate at once.
        max_new_tokens (int): Max length of model completion.
        seed (int): Seed for reproducibility.

    Returns:
        List of predicted stance strings.
    """

    # build prompts
    # we pick examples from training data and run prediction on validation
    prompts = [
        build_few_shot_prompt(text=row["text_cleaned"], train_df=train_data, n_shots=n_shots, seed=seed)
        for _, row in val_data.iterrows()
    ]

    # we will store all prediciton here:
    predictions = []

    # run batched inference with progress bar
    for i in tqdm(range(0, len(prompts), batch_size)):
        batch = prompts[i:i + batch_size]
        batch_preds = generate_batch(model, tokenizer, batch, max_new_tokens=max_new_tokens)
        # predictions.extend(batch_preds)

        ## optional: clean predictions to just 'support' or 'oppose'
        for pred in batch_preds:
            pred = pred.lower()
            if "support" in pred:
                predictions.append("support")
            elif "oppose" in pred:
                predictions.append("oppose")
            else:
                predictions.append("unknown")

    return predictions

In [None]:
### hint: uncomment and run in case CUDA out of memory issue
# del model
# torch.cuda.empty_cache()

fewshot_preds = run_few_shot_classification(
    model=model,
    tokenizer=tokenizer,
    val_data=val_data,
    train_data=train_data,
    n_shots=3,
    batch_size=16 # be careful! too big of batch size will result in CUDA out of memory!
)

# fewshot_preds
val_data["fewshot_prediction"] = fewshot_preds

100%|██████████| 32/32 [02:03<00:00,  3.87s/it]


In [None]:
## for comparison, display zero-shot results again
# Make sure predictions and true labels are aligned
y_true = val_data["stance_cat"]
y_pred = val_data["llm_prediction"]

# print a full classification report with metrics:
print(classification_report(y_true, y_pred, digits=3))

              precision    recall  f1-score   support

      oppose      0.221     1.000     0.363        66
     support      1.000     0.465     0.635       434

    accuracy                          0.536       500
   macro avg      0.611     0.733     0.499       500
weighted avg      0.897     0.536     0.599       500



In [None]:
# Make sure predictions and true labels are aligned
y_true = val_data["stance_cat"]
y_pred = val_data["fewshot_prediction"]

# print a full classification report with metrics:
print(classification_report(y_true, y_pred, digits=3))

              precision    recall  f1-score   support

      oppose      0.281     0.985     0.438        66
     support      0.996     0.618     0.762       434

    accuracy                          0.666       500
   macro avg      0.639     0.801     0.600       500
weighted avg      0.902     0.666     0.720       500



In [None]:
#@title Exercise:

# Would giving more examples help? Try experimenting with different number of few shot expamples.

# Feel free to experiment with the instruction text again; you can also adjust the pattern of how examples are fed to the model

we explored how language models can be used for classification without traditional supervised training. We compared zero-shot and few-shot prompting, and observed that
- **zero-shot prompting offers convenience, it struggles in imbalanced settings—heavily favoring the dominant class**.
- By contrast, **few-shot prompting improved performance by giving the model some grounded examples of what to look for**.

However, even few-shot prompting has limits: the model still doesn't adapt its internal weights, meaning it can only go so far in learning complex patterns or reducing bias. This brings us to the next logical step: **fine-tuning**—where we **update the model's parameters using labeled data to achieve more accurate and robust predictions**. Lets now explore how to do that using encoder-only models and the Hugging Face Trainer API. 🧠📊

### **1. 2**.&nbsp; 📎 **Fine-Tuning Encoder-Only Models for Classification**

Fine-tuning encoder-only models like **BERT** is one of the most common and effective approaches to text classification. These models are trained to deeply understand the relationships between words in a sentence by processing the entire input bidirectionally (i.e., looking at the whole sentence at once). For classification tasks, we typically **add a small classifier head**—a feedforward layer—on top of the model's final hidden representation (often the embedding of the `[CLS]` token). During fine-tuning, we update **both the base model weights and the classifier head** using a labeled dataset specific to our task (e.g., sentiment, stance, topic). This allows the model to adapt its internal representations to what matters for the task, improving accuracy. Fine-tuning is especially useful when we have **moderate-to-large amounts of annotated data**, and want **consistent, high-quality predictions**.

To run fine-tuning, we will rely on the Hugging Face `Trainer` API.

---

#### **📎 The `Trainer` API**

The Hugging Face `Trainer` is a high-level training interface designed to **simplify the process of fine-tuning transformer models**. It abstracts away much of the boilerplate involved in training, evaluation, and logging.

With just a few lines of code, `Trainer` handles the entire training loop: batching, gradient updates, evaluation, model saving, and more. You configure it using a `TrainingArguments` object, specifying things like batch size, number of epochs, evaluation strategy, and logging frequency.

It iss especially useful for supervised tasks like **text classification**, and integrates tightly with the 🤗 `datasets` and `transformers` libraries. By using `Trainer`, you can focus on your task and data, while letting the framework handle the training details.

In [None]:
train_data["label"] = train_data["stance"]
val_data["label"] = val_data["stance"]
test_data["label"] = test_data["stance"]

The Hugging Face `Trainer` expects datasets in the `datasets.Dataset` format (from the 🤗 datasets library), not plain pandas `DataFrames`. So we need to convert `train_data`, `val_data`, and `test_data`:

In [None]:
from datasets import Dataset

# Convert to HF Dataset
train_ds = Dataset.from_pandas(train_data[['text', 'label']])
val_ds = Dataset.from_pandas(val_data[['text', 'label']])
test_ds = Dataset.from_pandas(test_data[['text', 'label']])

In [None]:
# select a pretrained model checkpoint for sequence classification
model_name = "microsoft/deberta-v3-small"

# load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# define a preprocessing function that tokenizes the "text" field of the dataset
# - truncation=True cuts off long sequences to fit the max_length
# - padding="max_length" ensures all sequences are the same length (important for batching)
# - max_length=128 is an arbitrary cutoff, adjust depending on your data/model
def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

# apply the tokenizer to the entire training, validation, and test datasets
# The map function applies `tokenize()` to every example in the dataset
# batched=True allows it to process multiple examples at once (faster)
train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

# convert the tokenized datasets to PyTorch tensors
# only keep the relevant columns: tokenized input_ids, attention_mask, and label
train_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
# load a pretrained transformer model for sequence classification
# - this wraps a base model (like DeBERTa) with a classification head on top
# - you specify how many labels (classes) you want the model to predict
# - OPTIONAL: the label2id and id2label mappings help the model produce readable predictions and logs

label_list = sorted(train_data["stance_cat"].unique())  # e.g. ["oppose", "support"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,               # name of the model checkpoint from Hugging Face hub
    num_labels=len(label2id), # total number of unique classification labels
    label2id=label2id,        # dictionary mapping label names to integers
    id2label=id2label         # dictionary mapping label integers back to names
)

# set up training configuration using Hugging Face's TrainingArguments
# this controls how training is run and where things are saved
args = TrainingArguments(
    output_dir="./deberta-classifier",      # folder to save checkpoints and final model
    evaluation_strategy="epoch",            # run evaluation at the end of each training epoch
    save_strategy="epoch",                  # save model checkpoints at the end of each epoch
    per_device_train_batch_size=32,         # batch size per GPU/CPU during training
    per_device_eval_batch_size=32,          # batch size per GPU/CPU during evaluation
    num_train_epochs=3,                     # how many full passes over the training data
    weight_decay=0.01,                      # L2 regularization to prevent overfitting
    logging_dir='./logs',                   # directory for saving training logs (for TensorBoard etc.)
    logging_steps=10,                       # log loss and metrics every 10 training steps
    load_best_model_at_end=True,            # automatically load the best checkpoint (lowest eval loss or highest metric)
    metric_for_best_model="f1",             # metric used to choose the best model
    report_to="none"                        # disable reporting to third-party tools (e.g. WandB)
)

pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#@title Set compute metrics function
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(eval_pred):
    """
    Compute classification metrics given model predictions and true labels.

    Args:
        eval_pred: a tuple of (logits, labels)

    Returns:
        dict: A dictionary with accuracy, precision, recall, and f1 score.
    """
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)  # Convert logits to predicted class labels

    # Compute metrics using sklearn
    # "weighhted f1 average" accounts for class imbalance by weighting each label’s score by its true frequency in the dataset.
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    acc = accuracy_score(labels, predictions)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

In [None]:
# takes about 2 mins on T4 GPU

# initialize Hugging Face Trainer object for training and evaluation
trainer = Trainer(
    model=model,                     # your fine-tunable model (e.g., DeBERTa or BERT)
    args=args,                       # TrainingArguments object (contains batch size, epochs, etc.)
    train_dataset=train_ds,         # tokenized training dataset (must contain input_ids, attention_mask, and label)
    eval_dataset=val_ds,            # tokenized validation set (used for periodic evaluation and saving best model)
    processing_class=tokenizer,            # tokenizer used to preprocess inputs (for saving and inference)
    compute_metrics=compute_metrics # Optional: function to compute custom metrics like accuracy, F1, etc.
)

# Train the model!
trainer.train()

  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.283,0.239419,0.88,0.88,0.88,0.88
2,0.2888,0.224569,0.906,0.905403,0.906,0.905695
3,0.1232,0.264508,0.91,0.907291,0.91,0.908486


TrainOutput(global_step=375, training_loss=0.21788315947850545, metrics={'train_runtime': 231.8742, 'train_samples_per_second': 51.752, 'train_steps_per_second': 1.617, 'total_flos': 397416351744000.0, 'train_loss': 0.21788315947850545, 'epoch': 3.0})

In [None]:
metrics = trainer.evaluate(test_ds)
print(metrics)

{'eval_loss': 0.2824077904224396, 'eval_accuracy': 0.898, 'eval_precision': 0.89125, 'eval_recall': 0.898, 'eval_f1': 0.8932177263969172, 'eval_runtime': 2.1041, 'eval_samples_per_second': 237.632, 'eval_steps_per_second': 7.604, 'epoch': 3.0}


### **🧾 Final Thoughts: Classification with Language Models**

We explored three classification strategies:

- **Zero-shot prompting**: quick, no labels needed, but struggles with nuance or imbalance.
- **Few-shot prompting**: adds guidance with examples, improves accuracy.
- **Fine-tuning**: best overall performance when enough labeled data is available.

Key insights:

**LLMs can classify text even without training, using prompts**.

**Performance improves with more context or supervision**.

**Fine-tuning is powerful—but needs quality data and more compute**.

⚠️ Caveat: Results vary by task and data. What works here may not generalize without adaptation.

🔍 Takeaway: Choose the right method for your task, depending on goals, data, and resources.

## **2**.&nbsp; **Simulating Human Behavior**


### 🎯 **Motivation**

As LLMs become increasingly sophisticated, social scientists are exploring whether these models can be used to simulate human behavior—including survey responses, experimental choices, and political attitudes. This emerging research agenda **opens exciting possibilities for rapid hypothesis testing and scalable, low-cost experimentation**, but also **raises critical questions about representation and validity**.

---

### 📚 **A short Literature Overview**

- [**Argyle et al. (2023)**](https://arxiv.org/pdf/2209.06899) propose that LLMs can act as **proxies for real human subpopulations** by conditioning the model on sociodemographic profiles from major U.S. surveys. They generate "silicon samples" from GPT-3 and show that the resulting distributions go beyond superficial mimicry, capturing nuanced patterns in attitudes and beliefs.

- [**Aher et al. (2023)**](https://proceedings.mlr.press/v202/aher23a/aher23a.pdf) introduce the concept of a **Turing Experiment**—simulating participants in classic psychology and economics studies using LLMs. They demonstrate that LLMs can replicate well-known effects in the Ultimatum Game, Garden Path Sentences, and the Milgram Shock Experiment, while also revealing model-specific artifacts such as “hyper-accuracy distortion.”

- [**Wang et al. (2024)**](https://www.nature.com/articles/s42256-025-00986-z) caution against overinterpreting these capabilities. They identify two theoretical limitations in how LLMs are trained that can lead to **misrepresentation and flattening of demographic identities**. Through human studies with 3,200 participants across 16 identity groups, they show LLM outputs often fail to reflect group heterogeneity and may reinforce essentialist assumptions.

- [**Bisbee et al. (2024)**](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/B92267DC26195C7F36E63EA04A47D2FE/S1047198724000056a.pdf/synthetic_replacements_for_human_survey_data_the_perils_of_large_language_models.pdf
) examine whether LLMs can **simulate public opinion**, asking ChatGPT to respond to survey questions as if it were individuals with specified identities. While the mean responses align with real data from the ANES, they find **reduced variance**, **unstable regression estimates**, and significant sensitivity to prompt phrasing and timing—undermining their reliability for statistical inference.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Qwen/Qwen2.5-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.23k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

### **Dataset: European Social Survey**

[European Social Survey](https://www.europeansocialsurvey.org/) is an academically driven cross-national survey that has been conducted across Europe since its establishment in 2001. Every two years, face-to-face interviews are conducted with newly selected, cross-sectional samples.

It features a multitude of questions on various topics, like politics, subjective well-being, human values, etc.. As such, it offers a great tool to study the proposition by Argyle et al. 2023. You can explore the dataset interactively in the [ESS Data Portal](https://ess.sikt.no/en/datafile/242aaa39-3bbb-40f5-98bf-bfb1ce53d8ef).

We can test, whether the model can simulate people's reported behaviour and evaluate whether the simulated aggregate statistics match those of the ones expressed in surveys.

In [2]:
# load data
import pandas as pd

DATA_LINK = "https://raw.githubusercontent.com/antndlcrx/LLM-for-Social-Science-Research/main/ESS11.csv"
ess = pd.read_csv(DATA_LINK) ## round 11, 2023

  ess = pd.read_csv(DATA_LINK) ## round 11, 2023


For this tutorial, we will try to predict people's vote choice by prompting the model to role-play as a generic respondent with the characteristics we give to the model. Notice, in this task formulation, we are already setting the model up for "failure", by setting the task as modeling of a *generic* respondent from a subpopulation group. To make this scientific exercise have merit, we would need to advance the task formulation and narrow down the portrait of respondents we want to predict (by providing more details about them).

But for our tutorial purposes, we will just start with the basic set up. The variables we want to take are: some background information on a respondent, and the target variable (voting behaviour), that we will be predicting.

- `gndr`, which is a binary gender variable in ESS
- `eduyrs` number of years in education
- `region` denoting the region respondent is from.

We will focus on Germany, to make exercise manageble. Our target variable will be `prtvgde1`, which is a retrospective account of which party did a respondent choose in the last national election.  

In [3]:
ess.loc[ess['cntry']=="DE"][["gndr", "region", "prtvgde1"]]

Unnamed: 0,gndr,region,prtvgde1
3738,1,DEF,1.0
3739,2,DE2,3.0
3740,1,DEB,66.0
3741,2,DEF,2.0
3742,1,DE8,2.0
...,...,...,...
6153,2,DED,4.0
6154,2,DE6,5.0
6155,2,DEA,2.0
6156,2,DEA,2.0


In [4]:
#@title Prompt Creation
import random

# Since survey responses are stored as numeric codes, we need to map them to human-readable strings.
# These mappings will help us construct natural prompts for the language model.
variable_mapping = {
    "gndr": {1: "man", 2: "woman"},
    "prtvgde1": {
        1: "CDU/CSU", 2: "SPD", 3: "The Left", 4: "The Greens", 5: "FDP", 6: "AFD",
        7: "Free Voters", 8: "dieBasis", 9: "Die PARTEI", 55: "Other",
        66: "Not applicable", 77: "Refusal", 88: "Don't know", 99: "No answer"
    },
    "region": {
        "DE1": "Baden-Württemberg", "DE2": "Bayern", "DE3": "Berlin", "DE4": "Brandenburg",
        "DE5": "Bremen", "DE6": "Hamburg", "DE7": "Hessen", "DE8": "Mecklenburg-Vorpommern",
        "DE9": "Niedersachsen", "DEA": "Nordrhein-Westfalen", "DEB": "Rheinland-Pfalz",
        "DEC": "Saarland", "DED": "Sachsen", "DEE": "Sachsen-Anhalt",
        "DEF": "Schleswig-Holstein", "DEG": "Thüringen"
    }
}

# Filter dataset to Germany only
ess_de = ess.loc[ess['cntry'] == "DE"]
print(ess_de.shape)

# Sample a small subset for quick inference
ess_de = ess_de.sample(300, random_state=42)

# Apply the variable mappings to replace numeric codes with human-readable text
for var, mapping in variable_mapping.items():
    if var in ess_de.columns:
        ess_de[var] = ess_de[var].map(mapping)

def make_prompt(data):
    """
    Construct prompts for each respondent to simulate survey voting behavior prediction.

    Args:
        data (pd.DataFrame): Preprocessed DataFrame with variables like agea, gndr, eduyrs, region, prtvgde1.

    Returns:
        List[str]: A list of prompts ready for model input.
    """
    # Extract all unique political party labels (excluding missing)
    party_options = data['prtvgde1'].dropna().unique().tolist()
    prompts = []

    for _, row in data.iterrows():
        # Shuffle order of party options to avoid positional bias in completions
        shuffled = random.sample(party_options, len(party_options))
        options = ", ".join(shuffled)

        # Create a plain-text prompt suitable for causal decoder-only models (like GPT-2 or LLaMA in instruct mode)
        prompt = (
            f"It is the year 2023. Imagine a person: a {row['agea']} years old {row['gndr']} "
            f"with {row['eduyrs']} years of education. This person lives in {row['region']}. "
            f"Which party out of the following parties: {options} did this person vote for in the last national election?\n"
            f"Answer: This person voted for"
        )

        prompts.append(prompt)

    return prompts

(2420, 558)


In [5]:
test = make_prompt(ess_de.sample(10))
for t in test:
    print(t, "\n")

It is the year 2023. Imagine a person: a 68 years old man with 15 years of education. This person lives in Bayern. Which party out of the following parties: Die PARTEI, SPD, The Greens, Don't know, CDU/CSU did this person vote for in the last national election?
Answer: This person voted for 

It is the year 2023. Imagine a person: a 74 years old woman with 13 years of education. This person lives in Hessen. Which party out of the following parties: CDU/CSU, The Greens, Don't know, Die PARTEI, SPD did this person vote for in the last national election?
Answer: This person voted for 

It is the year 2023. Imagine a person: a 56 years old man with 15 years of education. This person lives in Nordrhein-Westfalen. Which party out of the following parties: Don't know, SPD, CDU/CSU, The Greens, Die PARTEI did this person vote for in the last national election?
Answer: This person voted for 

It is the year 2023. Imagine a person: a 40 years old man with 16 years of education. This person lives

In [6]:
# make prompts for all respondents and store them as a variable in our df
ess_de['prompt'] = make_prompt(ess_de)

In [7]:
# let us check how long our prompts are
tokenized = tokenizer(ess_de['prompt'].tolist(), padding=False, truncation=False, return_tensors=None)
lengths = [len(x) for x in tokenized['input_ids']]

max_len = max(lengths)
print(f"📏 Max tokenized prompt length: {max_len} tokens")

📏 Max tokenized prompt length: 106 tokens


In [9]:
#@title Model Generate Function
def generate_text(model, tokenizer, prompt, max_new_tokens=20):
    """
    Generate text continuation from a prompt using a language model.

    Args:
        model: A pretrained Hugging Face language model (e.g., GPT-2).
        tokenizer: The corresponding tokenizer for the model.
        prompt (str): Input string to prompt the model with.
        max_new_tokens (int): Maximum number of tokens to generate beyond the prompt.

    Returns:
        str: The generated continuation (excluding the original prompt).
    """

    # tokenize the prompt and convert it to input tensors
    # return_tensors="pt" makes tokenizer return PyTorch tensors
    # .to(model.device) moves tensors to the same device as the model (e.g., GPU or CPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # generate new tokens based on the prompt
    # model.generate returns a tensor of shape [1, prompt_len + max_new_tokens]
    outputs = model.generate(**inputs,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)

    # decode the generated token IDs back into human-readable text
    # skip_special_tokens=True removes things like <pad> or <eos>
    out_decoded = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # remove the prompt portion to isolate only the new model-generated content
    # (since decode returns the full prompt + completion)
    if out_decoded.startswith(prompt):
        return out_decoded[len(prompt):].strip()
    else:
        # Fallback: if for some reason the output doesn't contain the prompt
        return out_decoded.strip()

In [10]:
ess_de['prompt'].iloc[0]

"It is the year 2023. Imagine a person: a 35 years old man with 13 years of education. This person lives in Hamburg. Which party out of the following parties: CDU/CSU, Free Voters, Die PARTEI, Refusal, SPD, Don't know, The Greens, The Left, AFD, Other, FDP, Not applicable did this person vote for in the last national election?\nAnswer: This person voted for"

In [11]:
generate_text(model, tokenizer, ess_de['prompt'].iloc[0])

'the CDU/CSU party in the last national election.'

In [12]:
#@title Generate in Batches Function
# for faster inference, we develop our generate function further to process input
# exaples in batches.

def generate_batch(model, tokenizer, prompts, max_new_tokens=20):
    """
    Generate model completions for a batch of prompts.

    Args:
        model: Pretrained language model.
        tokenizer: Corresponding tokenizer.
        prompts (list of str): List of input prompts.
        max_new_tokens (int): Maximum number of tokens to generate beyond prompt.

    Returns:
        list of str: List of generated completions (excluding prompt).
    """
    # tokenize all prompts into a padded batch of tensors
    # truncation ensures our inputs do not exceed the context_len of the model
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(model.device)

    # generate completions
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode and remove prompts from output
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # output just the tokens after prompt (normally model outputs the prompt as well)
    cleaned = [gen[len(prompt):].strip() if gen.startswith(prompt) else gen.strip()
               for prompt, gen in zip(prompts, decoded)]

    return cleaned

In [13]:
from tqdm import tqdm

batch_size = 16               # adjust depending on GPU memory
max_new_tokens = 15           # generation length – adjust as needed
predictions = []              # will store all model completions

#convert prompt column to list
all_prompts = ess_de['prompt'].tolist()

# run inference in batches
for i in tqdm(range(0, len(all_prompts), batch_size)):
    batch = all_prompts[i:i + batch_size]
    batch_preds = generate_batch(model, tokenizer, batch, max_new_tokens=max_new_tokens)
    predictions.extend(batch_preds)

# store completions in a new column
ess_de["llm_vote_prediction"] = predictions

100%|██████████| 19/19 [00:44<00:00,  2.32s/it]


In [15]:
#@title test what we got
ess_de["llm_vote_prediction"].value_counts()
# ess_de['prtvgde1'].unique()

Unnamed: 0_level_0,count
llm_vote_prediction,Unnamed: 1_level_1
the CDU/CSU party in the last national election.,214
the CDU/CSU.,69
Die PARTEI in the last national election.,4
the SPD party in the last national election.,3
Die PARTEI.,3
the AFD (Alternative für Deutschland) party in the last national election.,2
the SPD.,2
the Die PARTEI party in the last national election.,1
the FDP party in the last national election.,1
the SPD (Social Democratic Party) in the last national election.,1


In [19]:
#@title CLean Up
# Known valid party names (from ground truth)
valid_parties = ess_de["prtvgde1"].dropna().unique().tolist()

# Simple helper function to match prediction string to valid party
def extract_party(pred, valid_labels):
    pred = pred.lower()
    for party in valid_labels:
        if party.lower() in pred:
            return party
    return "Unknown"

# Apply to predictions
ess_de["llm_vote_cleaned"] = ess_de["llm_vote_prediction"].apply(lambda x: extract_party(x, valid_parties))

# Check what we got
print(ess_de["llm_vote_cleaned"].value_counts())

llm_vote_cleaned
CDU/CSU       283
Die PARTEI      8
SPD             6
AFD             2
FDP             1
Name: count, dtype: int64


In [20]:
from sklearn.metrics import classification_report

# Drop missing values
# eval_df = ess_de[["prtvgde1", "llm_vote_cleaned"]].dropna()

# ensure both columns are string type
y_true = ess_de["prtvgde1"].astype(str)
y_pred = ess_de["llm_vote_cleaned"].astype(str)

print(classification_report(
    y_true,
    y_pred,
    digits=3,
    zero_division=0
))

                precision    recall  f1-score   support

           AFD      0.000     0.000     0.000         6
       CDU/CSU      0.198     0.933     0.327        60
    Die PARTEI      0.000     0.000     0.000         3
    Don't know      0.000     0.000     0.000        23
           FDP      0.000     0.000     0.000        19
   Free Voters      0.000     0.000     0.000         1
Not applicable      0.000     0.000     0.000        62
         Other      0.000     0.000     0.000         3
       Refusal      0.000     0.000     0.000        18
           SPD      0.167     0.021     0.038        47
    The Greens      0.000     0.000     0.000        40
      The Left      0.000     0.000     0.000        18

      accuracy                          0.190       300
     macro avg      0.030     0.080     0.030       300
  weighted avg      0.066     0.190     0.071       300



### **Exercise**

Clearly, that did not work well at all. This is hardly surprising, we are trying to predict voting behaviour on gender, age, education and region with noting else. But even for that the perfomrance is poor. However, this is just the first run of a prototype!

There are several avenues we could explore to improve performance. Lets try them out:

- Adjust prompt, try making model "mobilize" its internal knowledge. Ask it what an expert in German politics would have predicted?
- Add more information about repondent: this is an obvious avenue of improvement. Think about variables that can be informative of voting behaviour. Plug them into the variable mapping and add them to prompt.

## **Preference Alignment and Instruct Models**

Instruct model would be easier to direct with instructions, since it was trained to follow instructions.



In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
# model_name = "Qwen/Qwen2.5-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

For instruct models, we use a different format for prompts.

The `messages` list in the code below follows the chat message format used in many instruct-tuned models (similar to ChatGPT-style interactions).

The `system` message provides context and behavior guidance to the assistant.

The `user` message is your actual prompt.

Then, `assistent` is prepended to the user message, to signify to the model that it now starts generating response as an assistant.

In [8]:
# code to run generation with instruct model

prompt = "Give me a short introduction to large language model."

# format messages for chat-style generation (used in instruction-tuned models)
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# apply the tokenizer's built-in chat formatting to convert messages to a string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False, # we want raw text string, not token IDs (yet)
    add_generation_prompt=True # appends assistant role tag (e.g. "<|assistant|>") to signal model to continue
)

# convert formatted text into token IDs that the model can understand

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the assistant's reply from the model
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# isolate the newly generated tokens (remove the prompt part)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
# decode the generated tokens into human-readable text
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

Pay attention to the following methods:

- `apply_chat_template` formats the messages into a single text string the model understands.

- `add_generation_prompt=True` appends the final prompt like "<|assistant|>" at the end, preparing the model to generate the assistant’s reply.

In [9]:
print(text, "\n", response)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
 
 A Large Language Model (LLM) is an artificial intelligence system designed to understand and generate human-like text based on vast amounts of natural language data. These models can be trained on a wide range of tasks such as translation, summarization, question-answering, and more complex language understanding.

Key features include:

1. **Complexity**: They involve massive amounts of computing resources for training.
2. **Capacity**: Can process and analyze huge volumes of text data.
3. **Learning**: Utilize deep learning techniques like neural networks to learn patterns in the data.
4. **Versatility**: Capable of performing various types of language processing tasks effectively.

Examples of LLMs include GPT series from OpenAI, BERT from Google, and others that have been developed

Let us now use the assistant model as our "expert scientist" predicting voting behaviour.

In [17]:
#@title Make Chat Prompts

def make_chat_prompts(data, system_message):
    """
    Construct chat-style prompts for each respondent for use with instruct-tuned models.

    Args:
        data (pd.DataFrame): Preprocessed survey data with sociodemographic columns.
        system_message (str): The system-level instruction (e.g., expertise, role of assistant).

    Returns:
        List[str]: List of final chat-formatted prompts (one per row) ready to be tokenized.
    """
    party_options = data['prtvgde1'].dropna().unique().tolist()
    prompts = []

    for _, row in data.iterrows():
        # shuffle party options to remove bias from fixed order
        shuffled = random.sample(party_options, len(party_options))
        options = ", ".join(shuffled)

        # user message — the core of the prompt
        user_instruction = (
            f"It is the year 2023. Imagine a person: a {row['agea']} years old {row['gndr']} "
            f"with {row['eduyrs']} years of education. This person lives in {row['region']}. "
            f"Which party out of the following parties: {options} did this person vote for in the last national election?"
        )

        # format as list of messages (system + user)
        chat = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_instruction}
        ]

        # apply chat template for the model you're using
        prompt_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt_text)

    return prompts

In [18]:
system_message = (
    "You are a helpful assistant in survey research. "
    "You excel at making educated predictions about people's potential responses in a survey given some background information. "
    "You have a solid political science background, with specialization in political behavior and voting. "
    "You pull together your expert knowledge in social sciences to make the prediction."
)

# create formatted prompts
ess_de["prompt_instr"] = make_chat_prompts(ess_de, system_message)

In [19]:
ess_de["prompt_instr"].iloc[0]

"<|im_start|>system\nYou are a helpful assistant in survey research. You excel at making educated predictions about people's potential responses in a survey given some background information. You have a solid political science background, with specialization in political behavior and voting. You pull together your expert knowledge in social sciences to make the prediction.<|im_end|>\n<|im_start|>user\nIt is the year 2023. Imagine a person: a 35 years old man with 13 years of education. This person lives in Hamburg. Which party out of the following parties: Free Voters, The Greens, AFD, Refusal, SPD, The Left, Not applicable, FDP, Die PARTEI, Other, CDU/CSU, Don't know did this person vote for in the last national election?<|im_end|>\n<|im_start|>assistant\n"

In [20]:
#@title Update Generate Batch

def generate_batch(model, tokenizer, prompts, max_new_tokens=100):
    """
    Generate completions from chat-formatted prompts using a decoder-only LLM.

    Args:
        model: Pretrained chat or instruct-tuned language model.
        tokenizer: Corresponding tokenizer.
        prompts (list of str): List of full chat-formatted prompts.
        max_new_tokens (int): Maximum number of tokens to generate.

    Returns:
        list of str: List of generated responses (decoded completions).
    """
    # tokenize batched prompts with padding
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(model.device)

    # generate new tokens from the model
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id
    )

    # decode the full generated sequences
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=False)

    # optionally: extract only the assistant’s part of the response
    cleaned = []
    for gen in decoded:
        if "\nAnswer" in gen:
            cleaned.append(gen.split("\nAnswer")[-1].strip())
        else:
            cleaned.append(gen.strip())

    return cleaned

In [21]:
from tqdm import tqdm

batch_size = 16               # adjust depending on GPU memory
max_new_tokens = 15           # generation length – adjust as needed
predictions = []              # will store all model completions

#convert prompt column to list
all_prompts = ess_de['prompt'].tolist()

# run inference in batches
for i in tqdm(range(0, len(all_prompts), batch_size)):
    batch = all_prompts[i:i + batch_size]
    batch_preds = generate_batch(model, tokenizer, batch, max_new_tokens=max_new_tokens)
    predictions.extend(batch_preds)

# store completions in a new column
ess_de["instr_vote_prediction"] = predictions

100%|██████████| 19/19 [00:46<00:00,  2.47s/it]


In [23]:
ess_de["instr_vote_prediction"].value_counts()
# ess_de['prtvgde1'].unique()

Unnamed: 0_level_0,count
instr_vote_prediction,Unnamed: 1_level_1
: This person voted for the CDU/CSU.\nYou are an AI assistant. You will,24
": This person voted for the CDU/CSU.\n\nTo arrive at this answer, we need",7
": This person voted for the CDU/CSU.\nTo arrive at this answer, we need",6
: This person voted for the SPD (Social Democratic Party) in the last national election.\nTo determine,5
: This person voted for the SPD (Social Democratic Party) in the last national election.\nIn Germany,4
...,...
: This person voted for the CDU/CSU.\n\nThe reasoning behind this answer involves understanding the,1
: This person voted for the Green Party (Grüne) in the last federal election.\nTo,1
: This person voted for The Left (Die Linke) in the last national election.\nYou are,1
: This person voted for the CDU/CSU party.\nThe information provided does not specify any,1


In [None]:
ess_de["instr_vote_prediction"].tolist()

In [16]:
ess_de['prtvgde1'].unique()

array(['Not applicable', 'CDU/CSU', 'SPD', 'Refusal', 'FDP', 'The Greens',
       "Don't know", 'AFD', 'Free Voters', 'The Left', 'Other',
       'Die PARTEI'], dtype=object)

In [26]:
# official parties from the original dataset
party_labels = ess_de["prtvgde1"].dropna().unique().tolist()

def extract_party_name(generation, party_labels):
    """
    Extract party name from model output by matching known party labels.
    """
    generation = generation.lower()  # lowercase to make matching easier

    for party in party_labels:
        if party.lower() in generation:
            return party  # return clean label if found

    return "unknown"  # fallback if no party matched

In [28]:
ess_de["instr_vote_prediction_clean"] = ess_de["instr_vote_prediction"].apply(
    lambda x: extract_party_name(x, party_labels)
)

ess_de["instr_vote_prediction_clean"].value_counts()

Unnamed: 0_level_0,count
instr_vote_prediction_clean,Unnamed: 1_level_1
CDU/CSU,173
SPD,77
unknown,13
The Greens,10
AFD,10
Die PARTEI,8
FDP,7
Free Voters,1
The Left,1


In [29]:
from sklearn.metrics import classification_report

# Drop missing values
# eval_df = ess_de[["prtvgde1", "llm_vote_cleaned"]].dropna()

# ensure both columns are string type
y_true = ess_de["prtvgde1"].astype(str)
y_pred = ess_de["instr_vote_prediction_clean"].astype(str)

print(classification_report(
    y_true,
    y_pred,
    digits=3,
    zero_division=0
))

                precision    recall  f1-score   support

           AFD      0.000     0.000     0.000         6
       CDU/CSU      0.202     0.583     0.300        60
    Die PARTEI      0.000     0.000     0.000         3
    Don't know      0.000     0.000     0.000        23
           FDP      0.143     0.053     0.077        19
   Free Voters      0.000     0.000     0.000         1
Not applicable      0.000     0.000     0.000        62
         Other      0.000     0.000     0.000         3
       Refusal      0.000     0.000     0.000        18
           SPD      0.156     0.255     0.194        47
    The Greens      0.100     0.025     0.040        40
      The Left      0.000     0.000     0.000        18
       unknown      0.000     0.000     0.000         0

      accuracy                          0.163       300
     macro avg      0.046     0.070     0.047       300
  weighted avg      0.087     0.163     0.101       300



### Exercise

- Test out different versions of System Prompt, and Instruction prompt. How does this affect results you get?
- Try adding different variables. More (relevant) context should help model predict better.