# Homework 2: Prompting & Generation with LMs (50 points)

The second homework focuses on two skills: gaining a deeper understanding of different state-of-the-art prompting techniques, and training your critical conceptual thinking about research on LMs.

### Logistics

* submission deadline: June 2nd, 23:59 German time, via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW2.ipynb** containing your solutions to the homework.
* please solve and submit the homework **individually**! 
* if you use Colab, you can speed up the execution of your code by using the available GPU (if Colab resources allow). To do so, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.


## Exercise 1: Advanced prompting strategies (16 points)

The lecture discussed various sophisticated ways of prompting language models to generate text. Please answer the following questions about prompting techniques in the context of different models, and write down your answers, briefly explaining them (max. 3 sentences). Feel free to actually implement some of the prompting strategies to play around with them and build your intuitions.

> Consider the following language models: 
> * GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
>  
> Consider the following prompting / generation strategies: 
> * beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.
> 
> For each model, which strategies do you think work well, and why? Do you think there are particular tasks or contexts in which they work better than in others?

### My Answer
1. GPT-2: Few-shot prompting works well for GPT-2: it lacks advanced reasoning capabilities, so it performs best when it can mimic the examples and patterns given in the prompt.
2. GPT-4: Since GPT-4 is the most advanced of these models, few-shot CoT prompting works well and is particularly suited to complex, multi-step tasks such as math problems or logical reasoning.
3. Vicuna: Few-shot prompting and zero-shot CoT prompting are well-suited to Vicuna, whose instruction-tuning helps it follow prompts effectively.
4. Llama-2-7b-base: Few-shot prompting is effective, as the model is a plain base model and needs guidance from examples.
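To make the difference between the strategies named above concrete, here is a minimal sketch of how few-shot, zero-shot CoT and few-shot CoT prompts could be assembled as plain strings. The demonstration data, the `Q:`/`A:` template and all function names are illustrative assumptions, not part of the lecture material; the zero-shot CoT trigger phrase is the commonly used "Let's think step by step."

```python
# Hypothetical demonstrations; swap in examples that match the target task.
DEMOS = [
    {
        "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "reasoning": "He starts with 3 apples and adds 2, so 3 + 2 = 5.",
        "answer": "5",
    },
    {
        "question": "A book costs 4 euros. How much do 3 books cost?",
        "reasoning": "Each book costs 4 euros, so 3 books cost 3 * 4 = 12 euros.",
        "answer": "12",
    },
]


def few_shot_prompt(demos, question):
    """Few-shot: show question/answer pairs only, then the new question."""
    blocks = [f"Q: {d['question']}\nA: {d['answer']}" for d in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question):
    """Zero-shot CoT: no demonstrations, only a trigger phrase for reasoning."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(demos, question):
    """Few-shot CoT: each demonstration spells out its reasoning before the answer."""
    blocks = [
        f"Q: {d['question']}\nA: {d['reasoning']} The answer is {d['answer']}."
        for d in demos
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(few_shot_cot_prompt(DEMOS, "Anna has 7 pens and gives away 2. How many are left?"))
```

The resulting strings can be passed to any of the models above; base models such as GPT-2 or Llama-2-7b-base simply continue the pattern, which is why the demonstration format matters more for them than for instruction-tuned models.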

## Exercise 2: Prompting for NLI & Multiple-choice QA (14 points)

In this exercise, you can let your creativity flow -- your task is to come up with prompts for language models such that they achieve maximal accuracy on the following example tasks. Feel free to take inspiration from the in-class examples of the sentiment classification task. Also feel free to play around with the decoding scheme and see how it interacts with the different prompts.

**TASK:**
> Use the code that was introduced in the Intro to HF sheet to load the model and generate predictions from it with your sample prompts.
> 
> * Please provide your code.
> * Please report the best prompt that you found for each model and task (i.e., NLI and multiple choice QA), and the decoding scheme parameters that you used. 
> * Please write a brief summary of your explorations, stating what you tried, what worked (better), and why you think that is.

* Models: Pythia-410m, Pythia-1.4b
* Tasks: please **test** the model on the following sentences and report the accuracy of the model with your best prompt and decoding configurations.
  * Natural language inference: the task is to classify whether two sentences form a "contradiction" or an "entailment", or the relation is "neutral". The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
    * A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
    * Children smiling and waving at camera. There are children present. entailment
    * A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
    * An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
    * High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction
  * Multiple choice QA: the task is to predict the correct answer option for the question, given the question and the options (like in the task of Ex. 3 of homework 1). The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
    * To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
    * The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
    * What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
    * Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- get smart
    * Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality

In [4]:
# TASK 1. Natural Language Inference

# Import necessary libraries
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the models and tokenizer
model_name_410m = "EleutherAI/pythia-410m"
model_name_1_4b = "EleutherAI/pythia-1.4b"

tokenizer_410m = AutoTokenizer.from_pretrained(model_name_410m)
model_410m = AutoModelForSequenceClassification.from_pretrained(model_name_410m, num_labels=3)

tokenizer_1_4b = AutoTokenizer.from_pretrained(model_name_1_4b)
model_1_4b = AutoModelForSequenceClassification.from_pretrained(model_name_1_4b, num_labels=3)

# Set the padding token
if tokenizer_410m.pad_token is None:
    tokenizer_410m.add_special_tokens({'pad_token': '[PAD]'})
    model_410m.resize_token_embeddings(len(tokenizer_410m))

if tokenizer_1_4b.pad_token is None:
    tokenizer_1_4b.add_special_tokens({'pad_token': '[PAD]'})
    model_1_4b.resize_token_embeddings(len(tokenizer_1_4b))

# Define the test sentences and labels
sentences = [
    ("A person on a horse jumps over a broken down airplane.", "A person is training his horse for a competition.", "neutral"),
    ("A person on a horse jumps over a broken down airplane.", "A person is outdoors, on a horse.", "entailment"),
    ("Children smiling and waving at camera.", "There are children present.", "entailment"),
    ("A boy is jumping on skateboard in the middle of a red bridge.", "The boy skates down the sidewalk.", "contradiction"),
    ("An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background.", "An older man drinks his juice as he waits for his daughter to get off work.", "neutral"),
    ("High fashion ladies wait outside a tram beside a crowd of people in the city.", "The women do not care what clothes they wear.", "contradiction")
]

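# Define a function to classify sentence pairs
# NOTE: the classification head (`score.weight`) is newly initialised for these
# checkpoints (see the warning in the output), so without fine-tuning the
# predictions are effectively random and the label order below is arbitrary.
def classify_nli(model, tokenizer, premise, hypothesis, max_length=128):
    inputs = tokenizer(premise, hypothesis, return_tensors='pt', truncation=True, padding=True, max_length=max_length)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_id = logits.argmax(dim=-1).item()
    return ["entailment", "neutral", "contradiction"][predicted_class_id]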

# Test the models and compute accuracy
def evaluate_model(model, tokenizer):
    correct = 0
    for premise, hypothesis, gold_label in sentences:
        prediction = classify_nli(model, tokenizer, premise, hypothesis)
        if prediction == gold_label:
            correct += 1
    return correct / len(sentences)

# Evaluate both models
accuracy_410m = evaluate_model(model_410m, tokenizer_410m)
accuracy_1_4b = evaluate_model(model_1_4b, tokenizer_1_4b)

print(f"Accuracy of Pythia-410m: {accuracy_410m:.2f}")
print(f"Accuracy of Pythia-1.4b: {accuracy_1_4b:.2f}")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of GPTNeoXForSequenceClassification were not initialized from the model checkpoint at EleutherAI/pythia-410m and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of GPTNeoXForSequenceClassification were not initialized from the model checkpoint at EleutherAI/pythia-1.4b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided an

Accuracy of Pythia-410m: 0.33
Accuracy of Pythia-1.4b: 0.33
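Since the task asks for prompting rather than a classification head, one alternative is to treat NLI as a continuation-scoring problem: build a prompt, append each candidate label, and pick the label the causal LM finds most likely. The following is a minimal sketch under that assumption; the prompt template, the function names and the length-normalised scoring are illustrative choices, not the reference solution, and no accuracy is claimed for it.

```python
LABELS = ["entailment", "neutral", "contradiction"]


def build_nli_prompt(premise, hypothesis):
    """Hypothetical zero-shot template; few-shot demonstrations could be prepended."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Relation (entailment, neutral or contradiction):"
    )


def pick_label(prompt, score_fn, labels=LABELS):
    """Return the label whose continuation gets the highest score from score_fn."""
    scores = {label: score_fn(prompt, " " + label) for label in labels}
    return max(scores, key=scores.get)


def make_lm_scorer(model_name="EleutherAI/pythia-410m"):
    """Build score_fn(prompt, continuation) -> mean log-probability of the continuation.

    Imports are kept local so the prompt helpers above work without torch installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def score(prompt, continuation):
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # The logits at position t are the prediction for the token at position t + 1.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = input_ids[0, 1:]
        token_log_probs = log_probs[torch.arange(targets.shape[0]), targets]
        # Average over the continuation tokens only, so that labels split into
        # more tokens are not penalised for their length.
        return token_log_probs[-cont_ids.shape[1]:].mean().item()

    return score


if __name__ == "__main__":
    scorer = make_lm_scorer()
    prompt = build_nli_prompt("Children smiling and waving at camera.",
                              "There are children present.")
    print(pick_label(prompt, scorer))
```

Scoring a fixed label set avoids having to parse free-form generations, which small models such as Pythia-410m often fill with off-task text.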


In [None]:
# TASK 2. Multiple-choice QA

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# generate() needs a language-modelling head, so load the checkpoints as causal LMs
# instead of reusing the sequence-classification models from Task 1
qa_tokenizer_410m = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
qa_model_410m = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")
qa_tokenizer_1_4b = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
qa_model_1_4b = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")

# Define the test questions and answer options
questions = [
    ("The only baggage the woman checked was a drawstring bag, where was she heading with it?", ["garbage can", "military", "jewelry store", "safe", "airport"], "airport"),
    ("To prevent any glare during the big football game he made sure to clean the dust of his what?", ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"], "television"),
    ("The president is the leader of what institution?", ["walmart", "white house", "country", "corporation", "government"], "country"),
    ("What kind of driving leads to accidents?", ["stressful", "dangerous", "fun", "illegal", "deadly"], "dangerous"),
    ("Can you name a good reason for attending school?", ["get smart", "boredom", "colds and flu", "taking tests", "spend time"], "get smart"),
    ("Stanley had a dream that was very vivid and scary. He had trouble telling it from what?", ["imagination", "reality", "dreamworker", "nightmare", "awake"], "reality")
]

# Build a numbered-options prompt and read the chosen option number from the generation
def predict_answer(model, tokenizer, question, options):
    prompt = f"Question: {question}\nOptions:\n"
    for i, option in enumerate(options):
        prompt += f"{i + 1}. {option}\n"
    prompt += "Answer:"

    inputs = tokenizer(prompt, return_tensors='pt')
    prompt_length = inputs['input_ids'].shape[1]
    with torch.no_grad():
        # Greedy decoding; a few new tokens are enough for an option number
        outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False, pad_token_id=tokenizer.eos_token_id)

    # Decode only the newly generated tokens, not the prompt
    completion = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    tokens = completion.strip().split()
    answer = tokens[0].rstrip(".") if tokens else ""  # empty generations count as wrong

    if answer.isdigit() and 1 <= int(answer) <= len(options):
        return options[int(answer) - 1]
    return None

# Test the models and compute accuracy
def evaluate_qa(model, tokenizer):
    correct = 0
    for question, options, gold_label in questions:
        prediction = predict_answer(model, tokenizer, question, options)
        if prediction == gold_label:
            correct += 1
    return correct / len(questions)

# Evaluate both models
accuracy_410m = evaluate_qa(qa_model_410m, qa_tokenizer_410m)
accuracy_1_4b = evaluate_qa(qa_model_1_4b, qa_tokenizer_1_4b)

print(f"Accuracy of Pythia-410m: {accuracy_410m:.2f}")
print(f"Accuracy of Pythia-1.4b: {accuracy_1_4b:.2f}")
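The exercise also asks how the decoding scheme interacts with the prompt. A minimal sketch of one way to compare schemes is to keep the prompt fixed and vary only the keyword arguments passed to `generate`. The configuration names and the specific values (4 beams, `top_k=50`, `top_p=0.9`, temperature 0.7) are illustrative assumptions, not tuned settings; `model` and `tokenizer` are expected to be a Hugging Face causal LM and its tokenizer.

```python
# Hypothetical decoding configurations; the values are starting points, not tuned results.
DECODING_CONFIGS = {
    "greedy": {"do_sample": False},
    "beam_search": {"do_sample": False, "num_beams": 4},
    "top_k_sampling": {"do_sample": True, "top_k": 50, "temperature": 0.7},
    "nucleus_sampling": {"do_sample": True, "top_p": 0.9, "temperature": 0.7},
}


def complete(model, tokenizer, prompt, decoding_kwargs, max_new_tokens=3):
    """Generate a short completion for `prompt` under one decoding configuration."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        **decoding_kwargs,
    )
    # Keep only the tokens produced after the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def compare_decoding(model, tokenizer, prompt, configs=DECODING_CONFIGS):
    """Return {config name: completion} so the schemes can be inspected side by side."""
    return {name: complete(model, tokenizer, prompt, kwargs) for name, kwargs in configs.items()}
```

For classification-style prompts with a single expected token, greedy decoding and beam search are deterministic, whereas the sampling configurations can return different answers across runs, so accuracy under sampling should be averaged over several seeds.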

## Exercise 3: First neural LM (20 points)

Next to reading and understanding package documentation, a key skill for NLP researchers and practitioners is reading and critically assessing NLP literature. The density, but also the style, of NLP literature has undergone a significant shift in recent years as progress has accelerated. Your task in this exercise is to read a paper about one of the first successful neural language models, understand its key architectural components and compare how these key components have evolved in the modern systems that were discussed in the lecture.

> Specifically, please read this paper and answer the following questions: [Bengio et al. (2003)](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
>
> * How were words / tokens represented? What is the difference / similarity to modern LLMs?
> * How was the context represented? What is the difference / similarity to modern LLMs?
> * What is the curse of dimensionality? Give a concrete example in the context of language modeling.
> * Which training data was used? What is the difference / similarity to modern LLMs?
> * Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
> 
> * Please formulate one question about the paper (not the same as the questions above) and post it to the dedicated **Forum** space, and **answer 1 other question** about the paper.
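As a starting point for the curse-of-dimensionality question, the paper's own motivating example (modelling the joint distribution of 10 consecutive words over a vocabulary of 100,000 words) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the embedding dimension of 60 is an illustrative value used to contrast the table size with a learned feature matrix.

```python
def ngram_free_parameters(vocab_size, n):
    """Free parameters of a full joint probability table over n consecutive words.

    There is one probability per possible word sequence, minus one because
    the probabilities have to sum to 1.
    """
    return vocab_size ** n - 1


def embedding_parameters(vocab_size, embedding_dim):
    """Size of a word feature matrix with one real-valued vector per word."""
    return vocab_size * embedding_dim


V = 100_000  # vocabulary size from the paper's example

table = ngram_free_parameters(V, 10)
embeddings = embedding_parameters(V, 60)  # 60 features per word: illustrative value

print(f"full joint table over 10 words: {table:.3e} free parameters")
print(f"word feature matrix:            {embeddings:,} parameters")
```

The table grows exponentially with the context length, while the feature matrix grows only linearly with the vocabulary size, which is the core of the argument for distributed word representations.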

Furthermore, your task is to carefully dissect the paper by Bengio et al. (2003) and analyse its structure and style in comparison to a more recent paper: [Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

**TASK:**

> For each section of the Bengio et al. (2003) paper, what are the key differences to the BERT paper (Devlin et al., 2019) in the way it is written and in the contents it includes? What are the key similarities? Write max. 2 sentences per section.
