[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/10.llms/HW9_LLM_Inference.ipynb)

# HW9: LLM Inference

In this homework, you will experiment with different ways of improving LLM classification performance.

In [None]:
import torch

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# use the 4B model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

## Loading data

For the majority of this homework, we will be using data from *Who Feels What and Why? Annotation of a Literature Corpus with Semantic Roles of Emotions* [(Kim and Klinger, 2018)](https://aclanthology.org/C18-1114.pdf).

In [None]:
!wget https://raw.githubusercontent.com/bamman-group/ca-classification-data/refs/heads/main/data/emotion/train.jsonl
!wget https://raw.githubusercontent.com/bamman-group/ca-classification-data/refs/heads/main/data/emotion/test.jsonl

In [None]:
import json

def load_data(filepath):
    with open(filepath, "r") as f:
        data = [
            json.loads(line) for line in f
        ]
    return data

In [None]:
train_data = load_data("train.jsonl")
test_data = load_data("test.jsonl")

### Question 1

Take a look through the paper, as well as the actual dataset. What are the classification labels? **Fill them in below.**

In [None]:
train_data[0]

In [None]:
# FILL ME IN
labels = [
]

## Setting up the LLM

For greater consistency, we set the temperature to a low value (0.01) by default. You can override this by passing a `generation_config` dictionary to `call_llm`.
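
As a rough sketch of what an override might look like: a dictionary passed as `generation_config` replaces the default dictionary entirely, so `max_new_tokens` has to be restated. The names `sampling_config` and `long_output_config` and all of the values below are illustrative assumptions, not recommended settings.

```python
# Hypothetical overrides; the values are illustrative and untuned.

# Higher temperature with sampling enabled, e.g. for drawing several varied answers.
sampling_config = {
    "max_new_tokens": 10,
    "do_sample": True,
    "temperature": 0.7,
}

# Larger token budget, e.g. when the model is asked to write out its reasoning.
long_output_config = {
    "max_new_tokens": 256,
    "temperature": 0.01,
}

# Usage, assuming the call_llm function defined in this notebook:
# call_llm("Your prompt here", generation_config=sampling_config)
```

Because the default limit is 10 new tokens, prompts that ask for longer outputs will likely be cut off unless `max_new_tokens` is raised.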

In [None]:
from textwrap import dedent
import itertools
import inspect

def call_llm(prompt, system_prompt="You are a helpful assistant.", generation_config=None):  
    if generation_config is None:
        generation_config = {
            "max_new_tokens": 10,
            "temperature": 0.01
        }
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # conduct text completion
    generated = model.generate(
        **model_inputs,
        **generation_config
    )

    # let's break this down:
    #                      | we take the element of the batch (our batch size is 1)
    #                      |  |-----------------------------| skip our original input
    output_ids = generated[0][len(model_inputs.input_ids[0]):].tolist()

    # decode the generated token IDs back into text
    return tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

## Classification

In [None]:
def evaluate(classifier):
    predictions = classifier(train_data, test_data)
    return sum(pred == target["label"] for pred, target in zip(predictions, test_data)) / len(test_data)

In [None]:
from collections import Counter
from textwrap import dedent
import random


def classify_majority_label(train_data, test_data):
    """Majority label baseline"""

    # the most frequent label in the training data
    majority_class = Counter([d["label"] for d in train_data]).most_common(1)[0][0]

    # predict that label for every test example
    test_predictions = []
    for datum in tqdm(test_data):
        test_predictions.append(majority_class)

    return test_predictions

In [None]:
evaluate(classify_majority_label)

Fill the rest of these in!

### Question 2

We've implemented a majority label baseline for you.

Implement a zero-shot prompting classifier. Try at least 3 versions of the prompt to compare their outputs.

**In a few sentences,** describe how different prompting choices result in different outputs.
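
Raw model output will not always be a bare label (it may add punctuation, capitalization, or extra words), so the predictions usually need to be normalized before `evaluate` compares them to the gold labels. Below is a minimal sketch of one way to do that. The helper `extract_label` is hypothetical, and `placeholder_labels` are stand-ins rather than this dataset's labels.

```python
import re


def extract_label(raw_output, labels, default=None):
    """Return the label that appears earliest in raw_output, or default if none appears."""
    text = raw_output.lower()
    best, best_pos = default, len(text) + 1
    for label in labels:
        # whole-word, case-insensitive match for this label
        match = re.search(r"\b" + re.escape(label.lower()) + r"\b", text)
        if match and match.start() < best_pos:
            best, best_pos = label, match.start()
    return best


# Placeholder labels for illustration only (not the labels of this dataset).
placeholder_labels = ["positive", "negative"]

print(extract_label("Negative.", placeholder_labels))
print(extract_label("The answer is: positive", placeholder_labels))
print(extract_label("I am not sure", placeholder_labels))
```

Taking the earliest match is a design choice that suits short answers. For outputs that reason before answering, the last mention may be the better signal.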

In [None]:
def classify_zero_shot(train_data, test_data):
    """Classification with zero-shot prompting."""

### Question 3

Implement the following:

1. Few-shot (k=3) classification
2. Zero-shot with chain-of-thought
3. Few-shot (k=3) with chain-of-thought (you will need to write reasoning chains)
4. Zero-shot with self-consistency (use `generation_config` to change the temperature)

For each of these, print out the raw LLM output for the first 5 data points in the test data.

Use the `evaluate` function to measure the accuracy of your method. **Write a few sentences comparing the performance of the different prompting methods (including the above and zero-shot from Q2).**
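
Self-consistency samples several answers for the same input and keeps the most common one. The aggregation step might look like the sketch below. The helper `majority_vote` is a hypothetical name, and it assumes each sampled output has already been mapped to a label (or to `None` when no label could be parsed).

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-None answer; ties go to the answer seen first."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        # no sample produced a usable answer
        return None
    return votes.most_common(1)[0][0]


# Toy example with made-up answers from five hypothetical samples.
print(majority_vote(["a", "b", "a", None, "a"]))
```

Sampling only produces varied answers if the temperature is high enough, so this step is typically paired with a higher temperature than the default.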

In [None]:
def classify_few_shot(train_data, test_data):
    """Classification with 3-shots."""

In [None]:
def classify_zero_shot_cot(train_data, test_data):
    """Classification with zero-shot chain-of-thought."""

In [None]:
def classify_few_shot_cot(train_data, test_data):
    """Classification with 3-shot chain-of-thought."""

In [None]:
def classify_zero_shot_self_consistency(train_data, test_data):
    """Implement self-consistency for zero-shot prompting."""

In [None]:
for name, fn in [
    ("majority", classify_majority_label),
    ("zero-shot", classify_zero_shot),
    ("few-shot", classify_few_shot),
    ("zero-shot-cot", classify_zero_shot_cot),
    ("few-shot-cot", classify_few_shot_cot),
    ("self-consistency", classify_zero_shot_self_consistency)
]:
    score = evaluate(fn)
    print(f"{name}\t{score}")