# Homework 2: Prompting & Generation with LMs (50 points)

The second homework focuses on two skills: gaining a deeper understanding of different state-of-the-art prompting techniques, and training your critical conceptual thinking about research on LMs.

### Logistics

* submission deadline: June 2nd, 23:59 German time, via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW2.ipynb** containing your solutions to the homework.
* please solve and submit the homework **individually**! 
* if you use Colab, you can speed up the execution of the code by using the available GPU (if Colab resources allow). To do so, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.


## Exercise 1: Advanced prompting strategies (16 points)

The lecture discussed various sophisticated ways of prompting language models for generating texts. Please answer the following questions about prompting techniques in the context of different models, and write down your answers, briefly explaining them (max. 3 sentences). Feel free to actually implement some of the prompting strategies to play around with them and build your intuitions.

> Consider the following language models: 
> * GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
>  
> Consider the following prompting / generation strategies: 
> * beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.
> 
> For each model, which strategies do you think work well, and why? Do you think there are particular tasks or contexts in which they work better than in others?

### My Answer
1. GPT-2: Few-shot prompting works well for GPT-2: it lacks advanced reasoning capabilities, so it benefits from examples whose patterns it can mimic.
2. GPT-4: As the most advanced of the four models, GPT-4 works well with few-shot CoT prompting, which is excellent for complex, multi-step tasks such as math problems or logical reasoning.
3. Vicuna: Few-shot prompting and zero-shot CoT prompting are well-suited to Vicuna, which benefits from instruction-tuning to follow prompts effectively.
4. Llama-2-7b-base: Few-shot prompting is effective because this is a base model (not instruction-tuned) and needs examples for guidance.
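
To make the difference between the strategies named above concrete, here is a minimal sketch of how the three prompt types could be assembled as plain strings. The helper names, the `Q:`/`A:` layout and the arithmetic example are assumptions made for illustration; they are not taken from the lecture or from any library.

```python
# Hypothetical helpers for illustration only; names and layout are assumptions.

def few_shot_prompt(examples, query):
    """Plain few-shot: (question, answer) demonstrations followed by the query."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {query}\nA:"


def zero_shot_cot_prompt(query):
    """Zero-shot CoT: no demonstrations, only a phrase that triggers reasoning."""
    return f"Q: {query}\nA: Let's think step by step."


def few_shot_cot_prompt(examples, query):
    """Few-shot CoT: each demonstration spells out its reasoning before the answer."""
    demos = "\n".join(
        f"Q: {q}\nA: {reasoning} The answer is {a}." for q, reasoning, a in examples
    )
    return f"{demos}\nQ: {query}\nA:"


# Made-up demonstrations and query.
examples = [("What is 2 + 3?", "5")]
cot_examples = [("What is 2 + 3?", "Adding 3 to 2 gives 5.", "5")]
query = "What is 4 + 7?"

print(few_shot_prompt(examples, query))
print(zero_shot_cot_prompt(query))
print(few_shot_cot_prompt(cot_examples, query))
```

The only difference between few-shot and few-shot CoT in this sketch is whether the demonstrations contain intermediate reasoning, which is the part a model is expected to imitate.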

## Exercise 2: Prompting for NLI & Multiple-choice QA (14 points)

In this exercise, you can let your creativity flow -- your task is to come up with prompts for language models such that they achieve maximal accuracy on the following example tasks. Feel free to take inspiration from the in-class examples of the sentiment classification task. Also feel free to play around with the decoding scheme and see how it interacts with the different prompts.

**TASK:**
> Use the code that was introduced in the Intro to HF sheet to load the model and generate predictions from it with your sample prompts.
> 
> * Please provide your code.
> * Please report the best prompt that you found for each model and task (i.e., NLI and multiple choice QA), and the decoding scheme parameters that you used. 
> * Please write a brief summary of your explorations, stating what you tried, what worked (better), why you think that is.

* Models: Pythia-410m, Pythia-1.4b
* Tasks: please **test** the model on the following sentences and report the accuracy of the model with your best prompt and decoding configurations.
  * Natural language inference: the task is to classify whether two sentences form a "contradiction" or an "entailment", or the relation is "neutral". The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
    * A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
    * Children smiling and waving at camera. There are children present. entailment
    * A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
    * An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
    * High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction
  * Multiple choice QA: the task is to predict the correct answer option for the question, given the question and the options (like in the task of Ex. 3 of homework 1). The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
    * To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
    * The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
    * What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
    * Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- get smart
    * Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality

In [16]:
# Import necessary libraries
import torch 
import transformers 
import pandas as pd
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

In [17]:
# define computational device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")

Device: cpu


In [22]:
# load pretrained tokeniser
tokenizer_1_4b = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")

# Load model
model_1_4b = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    # torch_dtype=torch.float32, # always default
).to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [23]:
# load another model and pretrained tokeniser

model_410m = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-410m",
    # trust_remote_code=True,
    # torch_dtype=torch.float32, # always default
).to(device)

tokenizer_410m = AutoTokenizer.from_pretrained("EleutherAI/Pythia-410m")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### TASK 1. Natural Language Inference

In [24]:
# Define test sentences and labels
statements = [
    "A person on a horse jumps over a broken down airplane. A person is training his horse for a competition.",
    "A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse.",
    "Children smiling and waving at camera. There are children present.",
    "A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk.",
    "An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work.",
    "High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear.",
    ]

labels = ["neutral", "entailment", "entailment", "contradiction", "neutral", "contradiction"]

In [28]:
# NLI: zero-shot

sentence_idx = 0
zero_shot_full_prompt = "Input: " + statements[sentence_idx] + " Output: "
# tokenize input
input_ids_1_4b = tokenizer_1_4b(zero_shot_full_prompt, return_tensors="pt").input_ids.to(device)
# generate predictions, model 1.4b
zero_shot_prediction_1_4b = model_1_4b.generate(
    input_ids_1_4b,
    max_new_tokens=10,
    do_sample=True,
    temperature=0.4,
)
# model 410m
input_ids_410m = tokenizer_410m(zero_shot_full_prompt, return_tensors="pt").input_ids.to(device)
zero_shot_prediction_410m = model_410m.generate(
    input_ids_410m,
    max_new_tokens=10,
    do_sample=True,
    temperature=0.4,
)

# print
print(tokenizer_410m.decode(zero_shot_prediction_410m[0], skip_special_tokens=False))
print("Expected labels: " + labels[sentence_idx])
print(tokenizer_1_4b.decode(zero_shot_prediction_1_4b[0], skip_special_tokens=False))
print("Expected labels: " + labels[sentence_idx])
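
The zero-shot cell samples with `temperature=0.4`. As a rough illustration of what that parameter does, the sketch below applies temperature scaling to a softmax over made-up logits (the three scores are invented, not taken from Pythia). It assumes the usual formulation in which logits are divided by the temperature before the softmax; `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.4):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability mass on the highest-scoring token, so sampling at 0.4 behaves more like greedy decoding than sampling at 1.0 does.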

In [None]:
# NLI: few-shot
few_shot_prompt = """
Input: A young girl practices ballet in a studio with mirrored walls. The girl is learning to dance. Output: entailment
Input: A scientist conducts experiments in a laboratory filled with beakers and equipment. The scientist is studying chemical reactions. Output: entailment
Input: A woman wearing a snorkel swims gracefully among colorful coral reefs. The woman is afraid of the ocean. Output: contradiction
Input: A family enjoys a picnic in a sun-drenched meadow, surrounded by wildflowers. The family is indoors. Output: contradiction
Input: A mother pushes her baby in a stroller through a bustling city street. The mother is caring for her child. Output: neutral
Input: A student sits at a desk, surrounded by textbooks and notes, studying for an upcoming exam. The student is focused on academics. Output: neutral
"""

few_shot_full_prompt = few_shot_prompt + "Input: " + statements[sentence_idx] + " Output: "

input_ids_1_4b = tokenizer_1_4b(few_shot_full_prompt, return_tensors="pt").input_ids.to(device)

few_shot_prediction_1_4b = model_1_4b.generate(
    input_ids_1_4b, 
    max_new_tokens=30, 
    do_sample=True,
    temperature=0.4,
)

print(tokenizer_1_4b.decode(few_shot_prediction_1_4b[0], skip_special_tokens=False))
print("Expected labels: " + labels[sentence_idx])

In [None]:
# Evaluation on accuracy
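
One possible way to fill in this evaluation step is sketched below: take the label that appears first in each generated continuation and compare it with the gold label. The helper names `extract_label` and `accuracy` are hypothetical, and the continuations are made-up stand-ins for decoded model outputs, so the printed accuracy says nothing about Pythia.

```python
NLI_LABELS = ("entailment", "contradiction", "neutral")


def extract_label(continuation, labels=NLI_LABELS):
    """Return the label that occurs earliest in the continuation, or None if none occurs."""
    text = continuation.lower()
    found = {label: text.find(label) for label in labels if label in text}
    if not found:
        return None
    return min(found, key=found.get)


def accuracy(predictions, gold):
    """Fraction of predictions that equal the gold label (None counts as wrong)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up continuations (prompt already removed), one per test sentence.
fake_continuations = [
    " neutral\nInput: ...",
    " entailment",
    " contradiction",
    "contradiction Input",
    " I think",
    " Neutral",
]
gold = ["neutral", "entailment", "entailment", "contradiction", "neutral", "contradiction"]

predicted = [extract_label(c) for c in fake_continuations]
print(predicted)
print("accuracy:", accuracy(predicted, gold))
```

Note that the cells above decode the full sequence, prompt included. With a few-shot prompt the demonstrations already contain labels, so the prompt tokens would need to be sliced off (for example `output[0][input_ids.shape[1]:]`) before passing the text to `extract_label`.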


#### DISCUSSION

### TASK 2. Multiple-choice QA

In [86]:
# import additional libraries
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as HFDataset  # aliased so it does not shadow torch's Dataset
from tqdm import tqdm
import matplotlib.pyplot as plt

In [87]:
# set device again
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [88]:
# define questions, options, and correct answers
q_a_list = [
    {
        "question": "The only baggage the woman checked was a drawstring bag, where was she heading with it?",
        "options": ["garbage can", "military", "jewelry store", "safe", "airport"],
        "correct answer": "airport"
    },
    {
        "question": "To prevent any glare during the big football game he made sure to clean the dust of his what?",
        "options": ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"],
        "correct answer": "television"
    },
    {
        "question": "The president is the leader of what institution?",
        "options": ["walmart", "white house", "country", "corporation", "government"],
        "correct answer": "country"
    },
    {
        "question": "What kind of driving leads to accidents?",
        "options": ["stressful", "dangerous", "fun", "illegal", "deadly"],
        "correct answer": "dangerous"
    },
    {
        "question": "Can you name a good reason for attending school?",
        "options": ["get smart", "boredom", "colds and flu", "taking tests", "spend time"],
        "correct answer": "get smart"
    },
    {
        "question": "Stanley had a dream that was very vivid and scary. He had trouble telling it from what?",
        "options": ["imagination", "reality", "dreamworker", "nightmare", "awake"],
        "correct answer": "reality"
    }]

# design prompts (the gold answer is left out so that it is not leaked to the model)
prompts = []
for qa in q_a_list:
    question = qa["question"]
    options = qa["options"]
    prompt = f"Question: {question} Options: {', '.join(options)} Answer:"
    prompts.append(prompt)
    print(prompt)

Question: The only baggage the woman checked was a drawstring bag, where was she heading with it? Options: garbage can, military, jewelry store, safe, airport Answer:
Question: To prevent any glare during the big football game he made sure to clean the dust of his what? Options: television, attic, corner, they cannot clean corner and library during football match they cannot need that, ground Answer:
Question: The president is the leader of what institution? Options: walmart, white house, country, corporation, government Answer:
Question: What kind of driving leads to accidents? Options: stressful, dangerous, fun, illegal, deadly Answer:
Question: Can you name a good reason for attending school? Options: get smart, boredom, colds and flu, taking tests, spend time Answer:
Question: Stanley had a dream that was very vivid and scary. He had trouble telling it from what? Options: imagination, reality, dreamworker, nightmare, awake Answer:

In [95]:
# define a function that takes a model, its tokenizer and prompts, and returns predictions
def get_predictions(model, tokenizer, prompts, max_new_tokens=10):
    predictions = []
    for prompt in prompts:
        # use tokeniser to encode the prompt and move it to the model's device
        inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(model.device)
        # use model to predict; max_new_tokens avoids the ValueError that is raised
        # when the prompt is longer than the default max_length of 20
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        # decode only the newly generated tokens to text
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        predicted_answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
        predictions.append(predicted_answer)

    return predictions

model_predictions = get_predictions(model_1_4b, tokenizer_1_4b, prompts)
print(model_predictions)
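
A prompt variant that could be tried for this task is to label the options with letters and ask for a letter, which makes the generated answer easier to map back to an option. The sketch below only builds the prompt and parses a continuation. `build_mc_prompt` and `letter_to_option` are hypothetical helper names, and whether this format actually improves accuracy for the Pythia models is not established here.

```python
import re
import string


def build_mc_prompt(question, options):
    """Format a question with lettered options (A, B, C, ...) and an open answer slot."""
    lines = [f"Question: {question}"]
    lines += [f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def letter_to_option(continuation, options):
    """Map a leading stand-alone letter in the continuation to its option, else None."""
    match = re.match(r"\s*([A-Za-z])\b", continuation)
    if match is None:
        return None
    idx = string.ascii_uppercase.find(match.group(1).upper())
    return options[idx] if idx < len(options) else None


question = "What kind of driving leads to accidents?"
options = ["stressful", "dangerous", "fun", "illegal", "deadly"]

prompt = build_mc_prompt(question, options)
print(prompt)
# Made-up continuation standing in for a decoded model output.
print(letter_to_option(" B. dangerous", options))
```

The word-boundary check means a continuation such as " airport" is not misread as the letter A; such answers return `None` and would need to be matched against the option texts instead.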

In [74]:
# Tokenize data
max_length = 128  # Adjust as needed
dataset = QuestionAnswerDataset(questions, options, answers, tokenizer_1_4b, max_length)

In [76]:
# Split data into train and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

In [77]:
# Create data loaders
batch_size = 8  # Adjust as needed
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

In [78]:
# Initialize model
model_1_4b = AutoModelForCausalLM.from_pretrained("EleutherAI/Pythia-1.4b").to(device)

# Define optimizer and loss function
from torch.optim import AdamW  # this import was missing (NameError: name 'AdamW' is not defined)
optimizer = AdamW(model_1_4b.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()

# Train model
train_model(model_1_4b, train_loader, optimizer, criterion)

#### DISCUSSION

## Exercise 3: First neural LM (20 points)

Next to reading and understanding package documentation, a key skill for NLP researchers and practitioners is reading and critically assessing NLP literature. The density, but also the style, of NLP literature has undergone a significant shift in recent years with the increasing acceleration of progress. Your task in this exercise is to read a paper about one of the first successful neural language models, understand its key architectural components, and compare how these key components have evolved in modern systems that were discussed in the lecture.

> Specifically, please read this paper and answer the following questions: [Bengio et al. (2003)](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
>
> * How were words / tokens represented? What is the difference / similarity to modern LLMs?
> * How was the context represented? What is the difference / similarity to modern LLMs?
> * What is the curse of dimensionality? Give a concrete example in the context of language modeling.
> * Which training data was used? What is the difference / similarity to modern LLMs?
> * Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
> 
> * Please formulate one question about the paper (not the same as the questions above) and post it to the dedicated **Forum** space, and **answer 1 other question** about the paper.

Furthermore, your task is to carefully dissect the paper by Bengio et al. (2003) and analyse its structure and style in comparison to another more recent paper:  [Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

**TASK:**

> For each section of the Bengio et al. (2003) paper, what are the key differences in the way it is written and in the included contents compared to the BERT paper (Devlin et al., 2019)? What are the key similarities? Write max. 2 sentences per section.
