# INFO 3350/6350

## Lecture 19: Zero-shot learning

## To do

* Week 14: No new reading
* PS5 in flight
  * Due Thursday, 11/30 (the week after Thanksgiving), at 11:59pm
* Final exam/project to be released by Thanksgiving
  * Due Saturday, December 9, at **noon** per Registrar

# Zero-shot learning

Our goal is to learn what **zero-shot** (and few-shot) **learning** is, how to do it, and how to evaluate different prompting strategies.

## What is zero-shot learning and how is it different from fine-tuning?

In lecture 16, we saw how to **fine-tune** an existing pretrained language model by changing its weights in response to a new task. In contrast, the **zero-shot** paradigm leaves model weights untouched. This makes it much faster than fine-tuning, though zero-shot accuracy is sometimes lower.

In the zero-shot paradigm, the main idea is to construct an input to the model and then compare which label is most likely, all without changing model parameters. And since there's no training (i.e. changing model parameters), there's no train/test split. All data is treated as evaluation data here.

## Example task: predicting genre of a reviewed book

Let's consider the book review genre classification task.  In this book review task, an *example* consists of a review's text, and its *label* is the genre of the book. Here's a sample example in the dataset:
```
This series is quite seriously a joke
in the realm of vampire novels. Even
though it's targeted for a young
audience, there's really no excuse for
this poorly done series...
```
Its corresponding label is `fantasy`.

With zero-shot learning, we can't just feed examples into an LM the way we could with fine-tuning. We need to add additional text to the beginning and/or end of an example because the model's parameters are not being changed. The **prompt** (also called a **template**) is all the extra text that we add around an example in order to get a useful output from an LLM. We'll see several different prompts and prompting strategies below.

With the current generation of powerful generative language models, this paradigm will often be the first thing to try when evaluating a new task.

## Setup

### Setup: choose a runtime with a GPU

You need a GPU to run the code below. If you don't have one, you will get an error like: `RuntimeError: Found no NVIDIA driver on your system.`



### Setup: install and import packages

In [1]:
# Install packages that we need
!pip install sentencepiece
!pip install transformers



In [2]:
# Import packages

# For downloading data
import gdown
# For working with JSON files
import json
# For working with LMs
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import numpy as np
import pandas as pd
import random
# For status bars
from tqdm.notebook import tqdm
# To display markdown
from IPython.display import display, Markdown

### Load the model

We'll use **FLAN-T5 Large** for our genre-prediction task. FLAN-T5 is a generative encoder-decoder model. If you are interested in other models, you can download others from Hugging Face, such as [Falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-Instruct) or [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf).

In [3]:
# Load the model
device = "cuda"
model_id = "google/flan-t5-large"
model_filename_string = 'flan-t5-large'
model_string = 'FLAN-T5 Large'
model = T5ForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = T5Tokenizer.from_pretrained(model_id)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Language models predict the next word

Before we get to how to set up zero-shot learning, let's use a simplified example to see how it works.

Pretrained language models take as input some text (a sentence, or perhaps much more) and assign a probability to each possible next token. The highest-probability token is the model's top prediction.

We can see a simple example when we ask the language model for the most likely token that follows the phrase: `"A cerulean warbler is a kind of "`. The token "bird" is the [factually correct answer](https://www.allaboutbirds.org/guide/Cerulean_Warbler/overview), so we can hope that the token "bird" will be the most likely.

The following code takes as input a string of text and returns the top 4 predicted next tokens. The important part is that it takes in a sentence and outputs which tokens are most likely according to the model.

In [4]:
# Give the model text as input, and get the top k predictions for the next token
def get_top_k_token_predictions(text, k=4):
  inputs = tokenizer(text, return_tensors="pt")
  input_ids = inputs.input_ids.to(device)
  decoder_inputs = tokenizer("<pad>", return_tensors="pt")
  decoder_input_ids = decoder_inputs.input_ids.to(device)
  with torch.no_grad():
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

  # Get the probability distribution over next tokens
  #print('Logits layer shape:', outputs.logits.shape) # print shape of output logits
  logits = outputs.logits[0, -1, :].cpu()
  probabilities = logits.softmax(dim=0).numpy()
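  # Rank token IDs from most to least probable; note that the slice [1:k+1]
  # drops the first-ranked entry and keeps the next k
  top_k_token_ids = np.argsort(probabilities)[::-1][1:k+1]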
  print('Top four predictions for next token:')
  for i, token_id in enumerate(top_k_token_ids):
    print(f'  {probabilities[token_id]:.4f} probability: {tokenizer.decode(token_id)} (token ID {token_id})')

In [5]:
get_top_k_token_predictions("A cerulean warbler is a kind of ")

Top four predictions for next token:
  0.1145 probability: bird (token ID 5963)
  0.0910 probability: fly (token ID 3971)
  0.0776 probability: fish (token ID 2495)
  0.0466 probability: snake (token ID 17599)


The top prediction is the word "bird", as we hoped! The other top words are all nouns.
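The step from raw model scores (logits) to a ranked list of likely tokens can be sketched without a model at all. The example below is a minimal illustration in plain Python: the five-word vocabulary and its scores are made up, and `softmax` and `top_k` are hypothetical helper names rather than part of the lecture code.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(vocabulary, logits, k):
    # Pair each token with its probability and keep the k most likely
    probabilities = softmax(logits)
    ranked = sorted(zip(probabilities, vocabulary), reverse=True)
    return ranked[:k]

# Made-up five-word vocabulary and scores (not FLAN-T5 outputs)
vocabulary = ['bird', 'fish', 'snake', 'car', 'idea']
logits = [2.0, 1.5, 1.0, -1.0, -2.0]
for probability, token in top_k(vocabulary, logits, k=3):
    print(f'  {probability:.4f} probability: {token}')
```

The softmax turns arbitrary scores into probabilities that sum to one, so a higher logit always means a higher probability.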

But what happens when we have a question that isn't immediately answered by the model's next token? For example, say we want to know what color a cerulean warbler is, and we give the model as input `"A cerulean warbler is "`:



In [6]:
get_top_k_token_predictions("A cerulean warbler is ")

Top four predictions for next token:
  0.1209 probability: weave (token ID 21938)
  0.0636 probability: bird (token ID 5963)
  0.0570 probability: </s> (token ID 1)
  0.0550 probability: war (token ID 615)


Not one of the top four predictions is even a color! This input text is less specific than the first one we tried--it's not clear what kind of word should follow the text here. We can get around this by comparing the likelihoods of specific words. Run the following code to get the likelihoods of the two options "red" and "blue":

In [7]:
def get_probability_of_specific_words(text, words):
  inputs = tokenizer(text, return_tensors="pt")
  input_ids = inputs.input_ids.to(device)
  decoder_inputs = tokenizer("<pad>", return_tensors="pt")
  decoder_input_ids = decoder_inputs.input_ids.to(device)
  with torch.no_grad():
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
  logits = outputs.logits[0, -1, :].cpu()
  probabilities = logits.softmax(dim=0).numpy()
  for word in words:
    token_id = tokenizer.encode(word)[0]
    print(f'  {probabilities[token_id]:.8f} probability: {word} (token ID {token_id})')

In [8]:
choices = ['red', 'blue']
get_probability_of_specific_words("A cerulean warbler is ", choices)

  0.00002273 probability: red (token ID 1131)
  0.00249429 probability: blue (token ID 1692)


Even though neither "blue" nor "red" is among the most likely next tokens, the model assigns more probability to "blue" than to "red". In fact, the model places more than 100 times as much probability on "blue" as on "red"! So in this more limited evaluation, we can say that the model chose the correct answer.
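The comparison can be reproduced by hand from the two printed probabilities. The sketch below is illustrative only: it hard-codes the values from the output above rather than querying the model, and `renormalize` is a hypothetical helper that rescales the probabilities so they sum to one over just the candidate words.

```python
# Probabilities copied from the printed output above (not recomputed from the model)
probabilities = {'red': 0.00002273, 'blue': 0.00249429}

def renormalize(probs):
    # Rescale so the probabilities sum to 1 over only the candidate words
    total = sum(probs.values())
    return {word: p / total for word, p in probs.items()}

ratio = probabilities['blue'] / probabilities['red']
restricted = renormalize(probabilities)
best_word = max(restricted, key=restricted.get)

print(f'blue is {ratio:.1f} times as likely as red')
for word, p in restricted.items():
    print(f'  {p:.4f} probability among the choices: {word}')
print(f'Prediction: {best_word}')
```

Restricting the comparison to the candidate words does not change which one wins; it only makes the relative preference easier to read.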

We can combine this process of comparing likelihoods of specific answers with changing the text that we use as input to the model.

## How to set up a zero-shot evaluation

Let's consider the book review task that was introduced in lecture 16.

In [9]:
# Download prepared book review data
texts_url = 'https://drive.google.com/uc?id=1qEZ3k9fZa_KITSQtFlhHq7zImUbBYCY2'
labels_url = 'https://drive.google.com/uc?id=1d-6abYwcwKfbYYdVH7mys7IeF4j9BMXP'
texts_filename = 'book_review_texts.json'
labels_filename = 'book_review_labels.json'
gdown.download(texts_url, texts_filename, quiet=True)
gdown.download(labels_url, labels_filename, quiet=True)
# And now load the data
with open(texts_filename, 'r') as f:
  all_texts = json.load(f)
with open(labels_filename, 'r') as f:
  all_labels = json.load(f)

As introduced at the start of this lecture, each **example** (also called a document) in this task is a review's text, and its **label** is the genre of the reviewed book. Because zero-shot learning leaves the model's parameters unchanged, we can't just feed examples into the model the way we could with fine-tuning. Instead, we wrap each example in a **prompt**: the extra text added before and/or after the example in order to get a useful output from an LLM. We'll see several different prompts throughout this part of the tutorial.

### Construct a dataset and task: `history/biography` vs. `poetry`

**Constructing the dataset and task**

For this task, we'll limit ourselves just to reviews of books in two genres: `history/biography` and `poetry`. We'll try to tell them apart.

First, we'll select just 100 documents that are labeled with `history_biography` or `poetry` ...

In [10]:
# This function keeps only the first n examples that are
# labeled with either label_1 or label_2 (with 50% from each label)
def subsample_two_classes(all_texts, all_labels, label_1, label_2, n):
  # Convert to numpy array for easier indexing
  all_texts = np.array(all_texts)
  all_labels = np.array(all_labels)
  # Take the first n/2 examples from each class in order to have a balanced task
  idxs_label_1 = np.where(all_labels == label_1)[0].tolist()
  idxs_label_2 = np.where(all_labels == label_2)[0].tolist()
  n_each_class = int(n/2)
  idxs_label_1 = idxs_label_1[:n_each_class]
  idxs_label_2 = idxs_label_2[:n_each_class]
  subset_idxs = idxs_label_1 + idxs_label_2
  # Shuffle the order of examples
  random.shuffle(subset_idxs)
  # Actually select the indexes and return the subset
  subset_texts = list(all_texts[subset_idxs])
  subset_labels = list(all_labels[subset_idxs])
  return subset_texts, subset_labels

In [11]:
# Now take only book reviews for books in these two genres
task_texts, task_labels = subsample_two_classes(all_texts, all_labels,
                                                'history_biography', 'poetry',
                                                n=100)

The labels are currently called `history_biography` and `poetry`. The actual names of the labels matter in this zero-shot setting because we are evaluating the likelihood of each (unlike in the fine-tuning paradigm, where the names do not matter). Let's replace the unwieldy underscore with a slash so that `history_biography` becomes `history/biography`.

In [12]:
# Map the original label to a new name for that label
# We will evaluate with the new name
original_label_to_new_name = {
    'history_biography': 'history/biography',
    'poetry': 'poetry',
}
# This is the list of choices the model will evaluate
#possible_choices = ['history/biography', 'poetry']
possible_choices = list(original_label_to_new_name.values())

# This function renames the labels
def rename_labels(labels, label_dict):
  return [label_dict[l] for l in labels]

# Now call the function to rename history_biography to history/biography
task_labels = rename_labels(task_labels, original_label_to_new_name)

### How to set up evaluation for a single example



Let's start evaluating the model's performance on the task of telling book review genres apart, and let's do so by evaluating a single example.

An important choice here is the **prompt**. A prompt is a kind of template that maps an example to a specific text input for the language model. We'll shortly see some examples.

There are several steps we need to implement for evaluation:

1. Choose a prompt
2. Choose an example
3. Create a prompted example that we will give as input to the model
4. Get the model's *loss* for each possible choice of continuation that we are evaluating. Loss measures how unlikely a continuation is: the lower the loss, the more likely the continuation.
5. Determine if the correct answer has the *lowest* loss. If it does, then we say the model classified the example correctly. If it doesn't, then we say the model classified the example incorrectly.
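The five steps can be sketched end to end with a stand-in for the model. Everything named in this block (`TOY_CUE_WORDS`, `toy_loss`, `toy_prompt`, `zero_shot_predict`) is hypothetical, and the keyword heuristic is only a placeholder for a real language model's loss; the point is the shape of the procedure, not the scores.

```python
# Stand-in for a language model: NOT a real model, just a keyword heuristic
# that returns a lower "loss" for a choice whose cue words appear in the input
TOY_CUE_WORDS = {
    'history/biography': ['research', 'war', 'century'],
    'poetry': ['poem', 'verse', 'stanza'],
}

def toy_loss(prompted_text, choice):
    lowered = prompted_text.lower()
    matches = sum(word in lowered for word in TOY_CUE_WORDS[choice])
    return 1.0 / (1 + matches)

def toy_prompt(text, possible_choices):
    # Step 1: a simple illustrative prompt
    return f'Review: {text}\nGenre:'

def zero_shot_predict(text, possible_choices, apply_prompt, loss_fn):
    # Step 3: build the prompted example
    prompted_text = apply_prompt(text, possible_choices)
    # Step 4: get one loss per possible continuation
    losses = {choice: loss_fn(prompted_text, choice) for choice in possible_choices}
    # Step 5: the prediction is the choice with the lowest loss
    return min(losses, key=losses.get), losses

# Step 2: choose an example
text = 'I am doing some preliminary research and decided to start with Shakespeare.'
label = 'history/biography'
prediction, losses = zero_shot_predict(
    text, ['history/biography', 'poetry'], toy_prompt, toy_loss)
print(f'Prediction: {prediction} (correct: {prediction == label})')
```

Note that no parameters are updated anywhere in this loop: the only things that vary are the prompted input and the candidate continuation being scored.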

***1. Choose a prompt***

First, let's try a prompt that explicitly states the question. A prompt is also called a **template** because it provides a formula for taking an example's text and the possible choices and producing an input to the model. It is essentially a pattern with gaps, carefully structured so that likely continuations will be relevant. A prompt is necessary because we don't expect the next word after an example to be particularly meaningful.

Here's an example prompt for the book review genre prediction task:

```
Which genre of book is the following review about?
Review: <text>
Choices: <choice 1> or <choice 2>
Answer:
```

The following function takes as input an example's text and the allowable choices that we are evaluating. It outputs the **prompted example**, which is what we will call the combination of a template filled in with an example's text.

In [13]:
def apply_prompt_1(text, possible_choices):
  return f'Which genre of book is the following review about?\nReview: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'
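As a hedged aside, the same template can also be written as a plain string with named gaps, which makes it easier to see which parts are fixed and which are filled in. The names `PROMPT_1_TEMPLATE` and `fill_template` are illustrative and not part of the lecture code; joining more than two choices with ` or ` is likewise an assumption.

```python
# Illustrative template with two named gaps: {text} and {choices}
PROMPT_1_TEMPLATE = (
    'Which genre of book is the following review about?\n'
    'Review: {text}\n'
    'Choices: {choices}\n'
    'Answer:'
)

def fill_template(template, text, possible_choices):
    # Join the allowable choices with " or " and fill both gaps
    choices = ' or '.join(possible_choices)
    return template.format(text=text, choices=choices)

example = fill_template(
    PROMPT_1_TEMPLATE,
    'I am doing some preliminary research and decided to start with Shakespeare.',
    ['history/biography', 'poetry'],
)
print(example)
```

For two choices this should produce the same string as `apply_prompt_1`; keeping the template as data makes it straightforward to swap in a different template without writing a new function.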

***2. Choose an example***

Let's use the following example in this section:

In [14]:
text = "I am doing some preliminary research and decided to start with Shakespeare."
label = "history/biography"

***3. Apply the prompt***

Here's the same example after the prompt has been applied:

```
Which genre of book is the following review about?
Review: I am doing some preliminary research and decided to start with Shakespeare.
Choices: history/biography or poetry
Answer:
```

In [15]:
# Apply the prompt to the example
prompted_text = apply_prompt_1(text, possible_choices)
print(prompted_text)

Which genre of book is the following review about?
Review: I am doing some preliminary research and decided to start with Shakespeare.
Choices: history/biography or poetry
Answer:


***4. Evaluate the language model on the prompted input***

First, we need to tokenize the prompted example and put the result on the GPU:

In [16]:
# Tokenize the input
input = tokenizer(prompted_text, return_tensors='pt', truncation=True)
# Put input tensors on the GPU
input_ids = input.input_ids.to(device)

Our goal is to determine which possible label the model places more probability on. Hopefully, it places more probability on the correct choice, which in this case is `history/biography`.

The key to this step is that we are actually going to query the model several times: once for each possible label choice. For a given choice, we will get the model's **loss** with that choice as the proposed continuation of the prompted example. A choice with a lower loss is *more* likely. In the cerulean warbler example above, we took the model's prediction to be the word with the highest probability. Here we will equivalently define the model's prediction to be the choice with the *lowest* loss.

Technical note: The Hugging Face FLAN-T5 model reports loss as the [cross-entropy loss](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html) (the standard loss function in language modeling) averaged over the tokens in the proposed label. This means that longer labels will not be penalized unfairly.
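To make the effect of per-token averaging concrete, here is a small worked sketch. The per-token probabilities are made up for illustration (they are not model outputs), and `total_loss` and `average_loss` are hypothetical helpers that contrast a summed negative log-probability with its per-token mean.

```python
import math

def total_loss(token_probabilities):
    # Summed negative log-probability: grows with the number of tokens
    return -sum(math.log(p) for p in token_probabilities)

def average_loss(token_probabilities):
    # Mean negative log-probability over the tokens of one candidate label
    return total_loss(token_probabilities) / len(token_probabilities)

# Made-up per-token probabilities for a one-token and a three-token label
short_label = [0.6]
long_label = [0.7, 0.7, 0.7]

for name, probs in [('short label', short_label), ('long label', long_label)]:
    print(f'{name}: total {total_loss(probs):.4f}, average {average_loss(probs):.4f}')
```

Under the summed loss, the three-token label looks worse only because it has more tokens. Under the average, it looks better, because each of its tokens is individually more likely.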

In [17]:
losses_and_targets = []
for target_pretokenized in possible_choices:
    # Tokenize the current label choice
    target = tokenizer(target_pretokenized, return_tensors='pt', truncation=True)
    # Put target tensor on GPU
    target_ids = target.input_ids.to(device)
    with torch.no_grad():
        # Run the prompted example through the model and get the loss of the
        # current possible choice
        outputs = model(input_ids, labels=target_ids)
    loss = outputs.loss.item()
    losses_and_targets.append((loss, target_pretokenized))

Now we can see the model's loss for each answer:

In [18]:
for loss, target in losses_and_targets:
  print(f'{target}: {loss:.4f}')

history/biography: 0.1607
poetry: 0.3800


***5. Does the correct label have the lowest loss?***

The last step is to programmatically rank these targets and determine if the correct label has the lowest loss (i.e. is the most likely continuation among the choices). If it does, then the model gave a correct prediction for this prompted example. If the other label has a lower loss, then the model gave an incorrect prediction.

In [19]:
losses_and_targets.sort()
lowest_loss, best_choice = losses_and_targets[0]
correct_prediction = (best_choice == label)
print(f'The model made a correct prediction: {correct_prediction}')

The model made a correct prediction: True


***Putting it all together into an evaluation function***

Let's put it all together into one function that takes in a prompted example with its label and determines whether the language model places the most probability on that correct answer as the text's continuation. We'll be calling this function a lot shortly.

In [20]:
# This function classifies one example, determining if the model places more
# probability on the right answer.
# Input: an example text that has already been prompted,
#        the corresponding label,
#        a list of possible choices to evaluate
#        a flag for whether to print additional info
# Output: whether the prediction is correct (True if correct, False if incorrect)
def classify_example(text, label, possible_choices, verbose):
    # Print the example and label if we're in verbose mode
    if verbose:
      # Format the text with indents so it's easier to read when printed
      indented_text = text.replace("\n", "\n\t")
      print(f'Input text to the model:\n\t{indented_text}')
      print(f'Label: {label}')
    # Tokenize the input
    input = tokenizer(text, return_tensors='pt', truncation=True)
    # Put input tensors on the GPU
    input_ids = input.input_ids.to(device)
    # Compare the scores of possible targets
    losses_and_targets = []
    for target_pretokenized in possible_choices:
        target = tokenizer(target_pretokenized, return_tensors='pt', truncation=True)
        # put target tensor on GPU
        target_ids = target.input_ids.to(device)
        with torch.no_grad():
            # Run the prompted example through the model
            outputs = model(input_ids, labels=target_ids)
        loss = outputs.loss.item()
        losses_and_targets.append((loss, target_pretokenized))
    # This example was classified correctly if the correct choice has the
    # highest log-likelihood per token
    # (we normalize by number of tokens so that longer answers aren't penalized)
    losses_and_targets.sort()
    _, best_choice = losses_and_targets[0]
    if best_choice == label:
        correct_prediction = True
        is_correct_text = 'Correct'
    else:
        correct_prediction = False
        is_correct_text = 'Wrong'
    if verbose:
        print(f'{is_correct_text} prediction: {best_choice}\n')
    # Return True if the prediction is correct, False otherwise
    return correct_prediction

When we run the above code on the same example, we see that we get the same answer we had before we put all the code together into a function:

In [21]:
correct_prediction = classify_example(prompted_text, label, possible_choices, verbose=True)

Input text to the model:
	Which genre of book is the following review about?
	Review: I am doing some preliminary research and decided to start with Shakespeare.
	Choices: history/biography or poetry
	Answer:
Label: history/biography
Correct prediction: history/biography



### Evaluating a whole dataset



We can now use the evaluation function we just built to evaluate an entire dataset instead of just a single example.

We will evaluate how well the model performs on this dataset with this prompt by using simple **accuracy**, which is an appropriate metric for a two-class problem with evenly balanced classes. If we had imbalanced classes or multiple labels, we would want to use weighted average F1 or similar.
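To see why accuracy can mislead when classes are imbalanced, here is a minimal plain-Python sketch of both metrics. The labels and predictions are made up, and `accuracy` and `weighted_f1` are illustrative helpers (the weighted F1 follows the usual definition: per-class F1 averaged with weights proportional to each class's number of true examples).

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def weighted_f1(true_labels, predicted_labels):
    # Per-class F1, averaged with weights proportional to each class's support
    support = Counter(true_labels)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(true_labels, predicted_labels))
        n_pred = sum(p == cls for p in predicted_labels)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / len(true_labels)
    return score

# Made-up imbalanced example: 9 'poetry' reviews, 1 'history/biography' review,
# and a classifier that always predicts 'poetry'
true_labels = ['poetry'] * 9 + ['history/biography']
predicted_labels = ['poetry'] * 10
print(f'accuracy:    {accuracy(true_labels, predicted_labels):.3f}')
print(f'weighted F1: {weighted_f1(true_labels, predicted_labels):.3f}')
```

A classifier that never predicts the minority class still scores 90% accuracy here, while the weighted F1 is lower because that class contributes an F1 of zero. With the balanced two-class datasets used in this lecture, plain accuracy does not have this problem.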



In [22]:
# This function takes in a whole dataset of prompted examples with labels
# And returns the accuracy
def classify_dataset(prompted_examples, labels, possible_choices, verbose=False):
    num_examples = len(prompted_examples)
    correct_predictions = [] # 0 = incorrect, 1 = correct
    for i in tqdm(range(num_examples)):
        prompted_example = prompted_examples[i]
        label = labels[i]
        # Print the first five examples: this will be true if we are at
        # the first five examples and the verbose argument was already set to true
        verbose_example = (i < 5) and verbose
        correct_prediction = classify_example(prompted_example, label,
                                              possible_choices, verbose_example)
        # Convert true/false into an integer
        # (so we can easily get the percentage that are true)
        correct_predictions.append(int(correct_prediction))
    accuracy = sum(correct_predictions) / len(correct_predictions)
    return accuracy

Now we can run the function on the whole dataset. Remember that the (unprompted) dataset is stored in the variables named `task_texts` and `task_labels`.

In [23]:
# First prompt the examples
task_texts_prompt_1 = [apply_prompt_1(t, possible_choices) for t in task_texts]
display(Markdown('**Prompt 1:**'))
# Then evaluate
accuracy = classify_dataset(task_texts_prompt_1, task_labels,
                                       possible_choices, verbose=True)
# Print the accuracy
display(Markdown(f'**Prompt 1 accuracy: {accuracy*100:.2f}%**'))

**Prompt 1:**

  0%|          | 0/100 [00:00<?, ?it/s]

Input text to the model:
	Which genre of book is the following review about?
	Review: I am doing some preliminary research and decided to start with Shakespeare.
	Choices: history/biography or poetry
	Answer:
Label: history/biography
Correct prediction: history/biography

Input text to the model:
	Which genre of book is the following review about?
	Review: Beautiful prose, intriguing character, immersive historical (WWII) backdrop. A masterpiece!
	Choices: history/biography or poetry
	Answer:
Label: history/biography
Correct prediction: history/biography

Input text to the model:
	Which genre of book is the following review about?
	Review: During the latter part of Nero's reign, a number of prominent Romans were implicated in a conspiracy to kill Nero and install Gaius Calpurnius Piso as emperor. Among these were Lucan, a famous poet, and Seneca, a philosopher and Nero's tutor when he was young. This novel is a collection of communications mostly between Tigellinus, the head of the Pra

**Prompt 1 accuracy: 76.00%**

**The language model achieves 76% accuracy on this task with this prompt.**

**Random baseline:** How well could we expect to do on this task if we guessed labels uniformly at random (i.e., if we guessed `poetry` half of the time and `history/biography` the other half of the time)? Since the dataset's labels are split evenly between the two classes, this random baseline would achieve 50% accuracy on average. So the language model is well above random chance performance with this prompt.
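The 50% figure can be checked with a quick simulation. This is a sketch under stated assumptions: the balanced label list below stands in for the real dataset, `random_baseline_accuracy` is a hypothetical helper, and the seed and the number of runs are arbitrary choices made for repeatability.

```python
import random

def random_baseline_accuracy(labels, possible_choices, rng):
    # Guess uniformly at random for every example and score the guesses
    guesses = [rng.choice(possible_choices) for _ in labels]
    return sum(g == l for g, l in zip(guesses, labels)) / len(labels)

# A balanced toy label list standing in for the 100-review dataset
choices = ['history/biography', 'poetry']
labels = ['history/biography'] * 50 + ['poetry'] * 50

rng = random.Random(0)  # fixed seed (arbitrary choice) for repeatability
accuracies = [random_baseline_accuracy(labels, choices, rng) for _ in range(1000)]
mean_accuracy = sum(accuracies) / len(accuracies)
print(f'Mean accuracy over 1000 random runs: {mean_accuracy:.3f}')
print(f'Lowest and highest single run: {min(accuracies):.2f}, {max(accuracies):.2f}')
```

Individual runs scatter around 50% because there are only 100 examples, which is worth keeping in mind when comparing prompts whose accuracies differ by only a few points.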

We're now in a position to easily change the prompt and see how performance changes.

### Changing the prompt impacts task performance

The ingredients we have for an evaluation so far are:

- the model (FLAN-T5 Large)
- the dataset (Goodreads book reviews)
- the task (classifying history/biography reviews against poetry reviews)
- the label names (`history/biography` vs. `poetry`)
- the prompt (`Which genre of book is the following review about?`<br/>`Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:`)

We're now going to change *just* the prompt and see whether accuracy rises above or falls below the 76% that this prompt achieves. Everything else (the model, dataset, task, and label names) will be kept the same. This is easy and quick in the zero-shot classification setting because there is no re-training or re-fine-tuning step. We just have to change the prompted examples that we give as input to the model.

We might have just gotten lucky with the first prompt that we chose. Let's try four new prompt variants.

Here is the example input under the different prompts:

**Prompt 2:**
```
Review: I am doing some preliminary research and decided to start with Shakespeare.
Choices: history/biography or poetry
Genre:
```

**Prompt 3:**
```
Review: I am doing some preliminary research and decided to start with Shakespeare.
Genre:
```

**Prompt 4:**
```
Review: I am doing some preliminary research and decided to start with Shakespeare.
Which genre of book is the review about?
```

**Prompt 5:**
```
Review: I am doing some preliminary research and decided to start with Shakespeare.
Choices: history/biography or poetry
Answer:
```

Run the following code to get the results for these prompts. It will take a minute or two to run.


In [24]:
%%time
# First define the prompt template functions

def apply_prompt_2(text, possible_choices):
  return f'Review: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nGenre:'

def apply_prompt_3(text, possible_choices):
  return f'Review: {text}\nGenre:'

def apply_prompt_4(text, possible_choices):
  return f'\nReview: {text}\nWhich genre of book is the review about?'

def apply_prompt_5(text, possible_choices):
  return f'Review: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

# Now evaluate all four of the new prompts
for prompt_num, prompt_fn in zip([2, 3, 4, 5],
                                 [apply_prompt_2, apply_prompt_3, apply_prompt_4, apply_prompt_5]):
  display(Markdown(f'**Prompt {prompt_num}:**'))
  task_texts_prompt = [prompt_fn(t, possible_choices) for t in task_texts]
  accuracy = classify_dataset(task_texts_prompt, task_labels,
                              possible_choices, verbose=False)
  # Print the accuracy, followed by an empty line break
  display(Markdown(f'**Prompt {prompt_num} accuracy: {accuracy*100:.2f}%**'))
  print()

**Prompt 2:**

  0%|          | 0/100 [00:00<?, ?it/s]

**Prompt 2 accuracy: 83.00%**




**Prompt 3:**

  0%|          | 0/100 [00:00<?, ?it/s]

**Prompt 3 accuracy: 75.00%**




**Prompt 4:**

  0%|          | 0/100 [00:00<?, ?it/s]

**Prompt 4 accuracy: 71.00%**




**Prompt 5:**

  0%|          | 0/100 [00:00<?, ?it/s]

**Prompt 5 accuracy: 80.00%**


CPU times: user 1min 4s, sys: 6.34 s, total: 1min 11s
Wall time: 1min 13s


Prompt 2 achieves the highest accuracy, 83%, though all five prompts do reasonably well. Prompt 2 is identical to prompt 5 except that the final word in the prompted examples is `Genre:` rather than `Answer:`, and that one-word change moves accuracy from 80% to 83%. Even small changes like this can impact zero-shot performance! That's why it's important to try several different prompt templates.

**How much does prompting actually help?**

We've seen that prompting achieves relatively high accuracy. How well would the model perform without any prompting at all? What if we just input the raw example text to the model, in the same way that we could if we were fine-tuning the model?

Here, the original example will be unchanged. Our running example becomes simply:
`I am doing some preliminary research and decided to start with Shakespeare.` It is then input to the model with no additional text.

Let's find out how well this works!

In [25]:
# Evaluate without prompting the examples
display(Markdown('**Original text:**'))
accuracy = classify_dataset(task_texts, task_labels,
                            possible_choices, verbose=False)
display(Markdown(f'**Original text accuracy: {accuracy*100:.2f}%**'))

**Original text:**

  0%|          | 0/100 [00:00<?, ?it/s]

**Original text accuracy: 59.00%**

**Final results:** Using the original text without a prompt only achieves 59% accuracy. All prompts do better than random guessing (50% accuracy) but prompt 2 still achieves the highest accuracy.


| Prompt number | Prompt template | Accuracy |
|:------:|:--------|:----:|
| 1 | `Which genre of book is the following review about?`<br/>`Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:` |76%|
| 2 | `Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Genre:` | 83%|
| 3 | `Review: <text>`<br/>`Genre:` | 75%|
| 4 | `Review: <text>`<br/>`Which genre of book is the review about?`|71%|
| 5 | `Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:`| 80%|
| Original text | `<text>` | 59% |
| Random baseline | n/a | 50% |

**Takeaway:** When you are doing zero-shot classification, the choice of prompt can make a **big** difference in performance. We haven't tried all possible prompts, so it may be possible to do even better at this task--experiment and find out!



### Changing the dataset/task: `fantasy_paranormal` vs. `romance`

Remember that the ingredients we have for an evaluation so far are:

- the model
- the dataset
- the task
- the label names
- the prompt

Now let's do the same process of trying out different prompts with a different dataset and task combination. Say that we're interested in the differences between fantasy and romance books instead of the differences between history/biography and poetry. We will do the following:

1. Pick new book reviews that are labeled as fantasy and romance
2. Rerun the evaluation code on the new dataset/task with the same prompts as before.


We'll try to classify reviews as either `fantasy` or `romance`. We'll use the same prompts as above on just the reviews of books in these two genres. We can even reuse all our code!

In [26]:
# First take 100 book reviews of fantasy and romance books
task_texts, task_labels = subsample_two_classes(all_texts, all_labels,
                                                'fantasy_paranormal', 'romance',
                                                n=100)
# Convert labels
original_label_to_new_name = {
    'fantasy_paranormal': 'fantasy',
    'romance': 'romance',
}
# This is the list of choices the model will evaluate
possible_choices = list(original_label_to_new_name.values())
# Now call the function to rename fantasy_paranormal to fantasy
task_labels = rename_labels(task_labels, original_label_to_new_name)

In [27]:
%%time
# Now evaluate the new dataset/task with the same prompts as before
for prompt_num, prompt_fn in zip([1, 2, 3, 4, 5],
                                 [apply_prompt_1, apply_prompt_2, apply_prompt_3,
                                  apply_prompt_4, apply_prompt_5]):
  display(Markdown(f'**Prompt {prompt_num}:**'))
  task_texts_prompt = [prompt_fn(t, possible_choices) for t in task_texts]
  accuracy = classify_dataset(task_texts_prompt, task_labels,
                              possible_choices, verbose=False)
  # Print the accuracy, followed by an empty line break
  display(Markdown(f'**Prompt {prompt_num} accuracy: {accuracy*100:.2f}%**'))
  print()
# Also evaluate the original text
display(Markdown('**Original text:**'))
accuracy = classify_dataset(task_texts, task_labels,
                            possible_choices, verbose=False)
display(Markdown(f'**Original text accuracy: {accuracy*100:.2f}%**'))

**Prompt 1:**

**Prompt 1 accuracy: 72.00%**

**Prompt 2:**

**Prompt 2 accuracy: 65.00%**

**Prompt 3:**

**Prompt 3 accuracy: 62.00%**

**Prompt 4:**

**Prompt 4 accuracy: 70.00%**

**Prompt 5:**

**Prompt 5 accuracy: 63.00%**

**Original text:**

**Original text accuracy: 63.00%**

CPU times: user 1min 37s, sys: 10.8 s, total: 1min 47s
Wall time: 1min 50s


**Results:** Telling these two genres apart is more difficult than the previous task. Prompt 1 achieves the highest accuracy.


| Prompt number | Prompt template | Accuracy |
|:------:|:--------|:----:|
| 1 | `Which genre of book is the following review about?`<br/>`Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:` |72%|
| 2 | `Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Genre:` | 65%|
| 3 | `Review: <text>`<br/>`Genre:` | 62%|
| 4 | `Review: <text>`<br/>`Which genre of book is the review about?`|70%|
| 5 | `Review: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:`| 63%|
| Original text | `<text>` |  63%|
| Random baseline | n/a | 50% |

**Takeaway:** A different prompt works best for this task than for the previous one. Don't rely on a single prompt, even when the tasks are very similar: seemingly small differences in wording can change performance.

## Predicting narrative vs. non-narrative

Let's try designing prompts for another task. This time, we'll use a dataset of texts that are labeled as either narrative or non-narrative. The [original data](https://doi.org/10.6084/m9.figshare.21656780.v1) accompanied the article "Towards a Data-Driven Theory of Narrativity" by Andrew Piper and Sunyam Bagga in New Literary History, 2023. We saw this dataset in lecture 18.

For our evaluation ingredients, we'll change everything except the language model (the ingredients that change are in bold):

- the model
- **the dataset**
- **the task**
- **the label names**
- **the prompt**

### Load prepared narrativity data

In [28]:
# First download the file to the Colab runtime
url = 'https://drive.google.com/uc?id=1gmVTLcfdzjjpr9P0x7OPjoS1hpu76t0-'
filename = 'narrativity.csv'
gdown.download(url, filename, quiet=True)
# Load the data
narrativity_df = pd.read_csv(filename)
# Store the data in lists
all_texts = narrativity_df['text'].tolist()
all_labels = narrativity_df['label'].tolist()
# Keep only the first 100 documents
all_texts = all_texts[:100]
all_labels = all_labels[:100]

### Evaluate different prompts on this dataset

Just like with book review genre classification, we can try different prompts. Let's use slight variations on the prompts we used for the book review tasks. Here is a single example prompted in the different ways. Its label is `narrative`.

**Original text: raw example without additional text**

```
The year before, in 1835, it had been the right
of states to tell postmasters to suppress
abolitionist mailings. Now, in the last full
year of the Jackson presidency...
```

**Prompt 1:**

```
Which genre does the following text belong to?
Text: The year before, in 1835, it had been the right of states to tell postmasters to suppress abolitionist mailings. Now, in the last full year of the Jackson presidency...
Choices: narrative or non-narrative
Answer:
```

**Prompt 2:**

```
Text: The year before, in 1835, it had been the right of states to tell postmasters to suppress abolitionist mailings. Now, in the last full year of the Jackson presidency...
Choices: narrative or non-narrative
Answer:
```

**Prompt 3:**

```
Text: The year before, in 1835, it had been the right of states to tell postmasters to suppress abolitionist mailings. Now, in the last full year of the Jackson presidency...
Answer:
```

**Prompt 4:**

```
Text: The year before, in 1835, it had been the right of states to tell postmasters to suppress abolitionist mailings. Now, in the last full year of the Jackson presidency...
Which genre does this text belong to?
```

We can implement these prompts just like we did before.
Run the following code to try all these different prompts:

In [29]:
%%time
possible_choices = ['narrative', 'non-narrative']

def apply_prompt_1(text, possible_choices):
  return f'Which genre does the following text belong to?\nText: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

def apply_prompt_2(text, possible_choices):
  return f'Text: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

def apply_prompt_3(text, possible_choices):
  return f'Text: {text}\nAnswer:'

def apply_prompt_4(text, possible_choices):
  return f'Text: {text}\nWhich genre does this text belong to?'

for prompt_num, prompt_fn in zip([1, 2, 3, 4],
                                 [apply_prompt_1, apply_prompt_2,
                                  apply_prompt_3, apply_prompt_4]):
  display(Markdown(f'**Prompt {prompt_num}:**'))
  all_texts_prompt = [prompt_fn(t, possible_choices) for t in all_texts]
  accuracy = classify_dataset(all_texts_prompt, all_labels,
                              possible_choices, verbose=False)
  # Print the accuracy, followed by an empty line break
  display(Markdown(f'**Prompt {prompt_num} accuracy: {accuracy*100:.2f}%**'))
  print()

# Also evaluate the original text
display(Markdown('**Original text:**'))
accuracy = classify_dataset(all_texts, all_labels,
                            possible_choices, verbose=False)
display(Markdown(f'**Original text accuracy: {accuracy*100:.2f}%**'))

**Prompt 1:**

**Prompt 1 accuracy: 49.00%**

**Prompt 2:**

**Prompt 2 accuracy: 75.00%**

**Prompt 3:**

**Prompt 3 accuracy: 50.00%**

**Prompt 4:**

**Prompt 4 accuracy: 63.00%**

**Original text:**

**Original text accuracy: 50.00%**

CPU times: user 1min 19s, sys: 6.14 s, total: 1min 25s
Wall time: 1min 27s


**Results:** Prompt 2 performs the best.

| Prompt number | Prompt text | Accuracy |
|:-------------:|:------------|:--------:|
| 1 | `Which genre does the following text belong to?`<br/>`Text: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:` | 49% |
| 2 | `Text: <text>`<br/>`Choices: <choice 1> or <choice 2>`<br/>`Answer:` | 75% |
| 3 | `Text: <text>`<br/>`Answer:` | 50% |
| 4 | `Text: <text>`<br/>`Which genre does this text belong to?` | 63% |
| Original text | `<text>` | 50% |
| Random baseline | n/a | 50% |

There's even more variation in performance among the prompts for this task. While most prompts performed similarly for book review genre classification, only two of them did better than random chance on this narrative detection task.
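
To make the comparison concrete, the following sketch computes the spread between the best and worst setting and lists the settings that beat the random baseline. The accuracies are copied from the fantasy vs. romance table and the narrative table in this section; the helper name `summarize` is made up for illustration.

```python
# Accuracies (in %) copied from the two results tables in this section.
fantasy_vs_romance = {'Prompt 1': 72, 'Prompt 2': 65, 'Prompt 3': 62,
                      'Prompt 4': 70, 'Prompt 5': 63, 'Original text': 63}
narrative = {'Prompt 1': 49, 'Prompt 2': 75, 'Prompt 3': 50,
             'Prompt 4': 63, 'Original text': 50}
RANDOM_BASELINE = 50  # two balanced classes

def summarize(accuracies, baseline=RANDOM_BASELINE):
    """Return (best minus worst accuracy, settings strictly above the baseline)."""
    spread = max(accuracies.values()) - min(accuracies.values())
    above_baseline = [name for name, acc in accuracies.items() if acc > baseline]
    return spread, above_baseline

for task_name, accuracies in [('fantasy vs. romance', fantasy_vs_romance),
                              ('narrative vs. non-narrative', narrative)]:
    spread, above_baseline = summarize(accuracies)
    print(f'{task_name}: spread = {spread} points, '
          f'above baseline = {above_baseline}')
```

On these numbers the spread is 10 points for fantasy vs. romance and 26 points for the narrative task, where only Prompt 2 and Prompt 4 exceed 50%.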

### Changing the names of labels can impact performance, too

Instead of directly comparing the probabilities of the words `narrative` and `non-narrative`, we can turn this task into a Yes or No question. Of the ingredients for an evaluation, we'll be changing the label names here, which also means slightly changing the prompts to fit the new labels:

- the model
- the dataset
- the task
- **the label names**
- **the prompt**
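
As a reminder of why label names matter, here is a minimal sketch of the comparison step: score each candidate label as a continuation of the prompted text and pick the highest-scoring one. The names `pick_label` and `toy_score` are hypothetical, and `toy_score` returns made-up numbers in place of a real language model's log-probabilities, so this only illustrates the mechanics, not the behavior of the model used in this lecture.

```python
def pick_label(prompted_text, possible_choices, score_fn):
    """Return the highest-scoring choice and the score of every choice.

    score_fn(prompted_text, choice) should return how likely `choice` is as a
    continuation of `prompted_text`, e.g. a log-probability from an LM.
    """
    scores = {choice: score_fn(prompted_text, choice)
              for choice in possible_choices}
    best_choice = max(scores, key=scores.get)
    return best_choice, scores

# MADE-UP scores standing in for a language model (assumption for illustration).
TOY_LOG_PROBS = {'narrative': -2.0, 'non-narrative': -5.0,
                 'Yes': -1.5, 'No': -1.0}

def toy_score(prompted_text, choice):
    # A real scorer would condition on prompted_text; this toy one ignores it.
    return TOY_LOG_PROBS[choice]

text = 'Text: The year before, in 1835, ...\nAnswer:'
print(pick_label(text, ['narrative', 'non-narrative'], toy_score)[0])
print(pick_label(text, ['Yes', 'No'], toy_score)[0])
```

Because the prediction is just an argmax over the candidate label strings, swapping the strings changes which quantities are compared, even though the text and the underlying task stay the same.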

The following code converts the `narrative` label to `Yes` and the `non-narrative` label to `No`:

In [30]:
original_label_to_new_name = {
    'narrative': 'Yes',
    'non-narrative': 'No',
}
# This is the list of choices the model will evaluate
possible_choices = list(original_label_to_new_name.values())
# Now call the function to rename narrative to Yes and non-narrative to No
all_labels = rename_labels(all_labels, original_label_to_new_name)

Now we can adapt the previous prompts to be a yes or no question, and evaluate the new prompts:

In [31]:
%%time
def apply_prompt_1(text, possible_choices):
  return f'Is the following text narrative?\nText: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

def apply_prompt_2(text, possible_choices):
  return f'Text: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

def apply_prompt_3(text, possible_choices):
  return f'Text: {text}\nAnswer:'

def apply_prompt_4(text, possible_choices):
  return f'Text: {text}\nIs the preceding text narrative?'

for prompt_num, prompt_fn in zip([1, 2, 3, 4],
                                 [apply_prompt_1, apply_prompt_2,
                                  apply_prompt_3, apply_prompt_4]):
  display(Markdown(f'**Prompt {prompt_num}:**'))
  all_texts_prompt = [prompt_fn(t, possible_choices) for t in all_texts]
  accuracy = classify_dataset(all_texts_prompt, all_labels,
                              possible_choices, verbose=False)
  # Print the accuracy, followed by an empty line break
  display(Markdown(f'**Prompt {prompt_num} accuracy: {accuracy*100:.2f}%**'))
  print()

# Also evaluate the original text
display(Markdown('**Original text:**'))
accuracy = classify_dataset(all_texts, all_labels,
                            possible_choices, verbose=False)
display(Markdown(f'**Original text accuracy: {accuracy*100:.2f}%**'))

**Prompt 1:**

**Prompt 1 accuracy: 54.00%**

**Prompt 2:**

**Prompt 2 accuracy: 47.00%**

**Prompt 3:**

**Prompt 3 accuracy: 49.00%**

**Prompt 4:**

**Prompt 4 accuracy: 50.00%**

**Original text:**

**Original text accuracy: 30.00%**

CPU times: user 1min 18s, sys: 5.24 s, total: 1min 23s
Wall time: 1min 25s


**Results:**

| Prompt number | Prompt text | Accuracy |
|:-------------:|:------------|:--------:|
| 1 | `Is the following text narrative?`<br/>`Text: <text>`<br/>`Choices: Yes or No`<br/>`Answer:` | 54% |
| 2 | `Text: <text>`<br/>`Choices: Yes or No`<br/>`Answer:` | 47% |
| 3 | `Text: <text>`<br/>`Answer:` | 49% |
| 4 | `Text: <text>`<br/>`Is the preceding text narrative?` | 50% |
| Original text | `<text>` | 30% |
| Random baseline | n/a | 50% |

Changing the labels dramatically changed the performance on this task. Only one prompt is able to (barely) achieve better than random performance.
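
To put "barely" in perspective, here is a rough back-of-the-envelope sketch. It assumes the 100 examples are independent and treats accuracy as a binomial proportion; that statistical model is an assumption added here, not something the evaluation code computes.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

n = 100              # number of evaluated documents
chance = 0.50        # random baseline for two balanced classes
best_yes_no = 0.54   # Prompt 1 accuracy with Yes/No labels

se_chance = accuracy_standard_error(chance, n)
z = (best_yes_no - chance) / se_chance
print(f'standard error at chance: {se_chance:.3f}')
print(f'gap in standard errors: {z:.2f}')
```

With 100 examples the standard error around 50% is 5 points, so the 4-point gap for Prompt 1 is less than one standard error above the baseline.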

**Takeaway:** If your task seems difficult, experiment with alternate labels and phrasings.
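
One way to organize that experimentation is to list each combination of label names and prompt as a setting, evaluate every setting the same way, and rank the results. The sketch below is hypothetical: `rank_settings`, `genre_prompt`, `yes_no_prompt`, and `stub_evaluate` are names invented here, and the stub evaluator simply returns accuracies reported in the tables above instead of running the model. In the notebook, the evaluator would presumably wrap `rename_labels` and `classify_dataset`.

```python
def rank_settings(settings, evaluate_fn):
    """Evaluate each (name, label mapping, prompt function) setting; best first."""
    scored = [(evaluate_fn(label_map, prompt_fn), name)
              for name, label_map, prompt_fn in settings]
    return sorted(scored, reverse=True)

def genre_prompt(text, choices):
    return f'Text: {text}\nChoices: {choices[0]} or {choices[1]}\nAnswer:'

def yes_no_prompt(text, choices):
    return (f'Is the following text narrative?\nText: {text}\n'
            f'Choices: {choices[0]} or {choices[1]}\nAnswer:')

settings = [
    ('narrative/non-narrative labels, Prompt 2',
     {'narrative': 'narrative', 'non-narrative': 'non-narrative'},
     genre_prompt),
    ('Yes/No labels, Prompt 1',
     {'narrative': 'Yes', 'non-narrative': 'No'},
     yes_no_prompt),
]

# Stub evaluator (assumption): looks up accuracies reported in the tables above
# rather than calling the language model.
REPORTED_ACCURACY = {genre_prompt: 0.75, yes_no_prompt: 0.54}

def stub_evaluate(label_map, prompt_fn):
    return REPORTED_ACCURACY[prompt_fn]

ranking = rank_settings(settings, stub_evaluate)
for accuracy, name in ranking:
    print(f'{accuracy:.0%}  {name}')
```

Keeping the label mapping and the prompt together in one setting makes it harder to accidentally pair a Yes/No prompt with the original label names.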
