<a href="https://colab.research.google.com/github/gaussalgo/L2L_MLPrague23/blob/main/notebooks/ICL_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to In-context learning
1. **🤖 What is in-context learning (ICL)**
2. **🎨 Prompt design**
3. **🦮 Zero-shot vs. Few-shot ICL**
4. **✋ Hands-on: Transforming a dataset into a few-shot prompt-label dataset and evaluating existing models**


In [None]:
import sys
!{sys.executable} -m pip install git+https://github.com/fewshot-goes-multilingual/promptsource transformers[sentencepiece]==4.19.1

# 1. 🤖 In-context learning (ICL)

In-context learning is a behavior of generative models in which the model performs a task it has never seen before, given only a description of the task as part of its input.

This behavior is exhibited mainly by **large language models**. Exactly why it occurs is still not well understood, but it may be related to the latent concepts that the LM acquires during pretraining on large amounts of data.

Here, "learning" does not mean training: the model's weights stay unchanged. Instead, it means "understanding" the task solely from the user's input, also known as the prompt.
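
To make this concrete, the minimal sketch below shows that a prompt is just a string that combines a task description with the input text. Changing the description changes the task, while the model itself stays the same. The helper `build_prompt`, the prompt layout, and the example instructions are assumptions made for illustration and do not come from any library.

```python
def build_prompt(instruction: str, text: str) -> str:
    """Combine a task description and the input text into one model input."""
    return f"{instruction}\n\nText: {text}\nAnswer:"


review = "The film was slow, but the ending made up for it."

# The same text paired with two different instructions yields two different tasks.
sentiment_prompt = build_prompt("Is the sentiment of the text positive or negative?", review)
summary_prompt = build_prompt("Summarize the text in one sentence.", review)

print(sentiment_prompt)
print("---")
print(summary_prompt)
```

Either string could be passed to a tokenizer and `model.generate` in the same way as the prompt in the next cell.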



In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 
model_path = "gaussalgo/mt5-base-priming-QA_en-cs"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)



In [None]:
long_text = """
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. 
I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial"
I really had to see this for myself. The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life.
In particular she wants to focus her attentions to making some sort of documentary on what the
average Swede thought about certain political issues such as the Vietnam War and race issues in the United States.
"""

# The instruction is ambiguous: it could be asking about the sentiment, the meaning, or something else
prompt = "What is meant by: {}".format(long_text)

inputs = tokenizer([prompt], return_tensors="pt", padding=True)
outputs = model.generate(**inputs.to(model.device))
outputs_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# predictions:
outputs_str

['The plot is centered around a young Swedish drama student named Lena']

## 2. 🎨 What should the prompts look like?
To train a custom in-context learner, we need text pairs consisting of a prompt and a label. The example above shows how difficult it can be to write a good prompt; finding the wording that works best with a given model is something of an art. Below we present the [promptsource](https://github.com/bigscience-workshop/promptsource) library, which contains over 2,000 prompts for 180 different English datasets.
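
The idea behind a prompt template can be sketched without the library: a template turns one dataset row into an (input, target) text pair. The example below is a simplified stand-in, assuming a `|||` separator between input and target and a list of answer choices indexed by the label. The template wording, `TEMPLATE`, `ANSWER_CHOICES`, and `apply_template` are hypothetical and do not reproduce promptsource's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-template: text before "|||" is the model input, text after it is the target.
TEMPLATE = "{passage}\nQuestion: {question}? True or False? ||| {answer}"
ANSWER_CHOICES: List[str] = ["False", "True"]  # indexed by the integer label


def apply_template(example: Dict) -> Tuple[str, str]:
    """Fill the template with one dataset row and return (model_input, target)."""
    # Split the template first, so a "|||" inside the passage cannot break the pair.
    input_template, target_template = TEMPLATE.split("|||")
    model_input = input_template.format(
        passage=example["passage"], question=example["question"]
    ).strip()
    target = target_template.format(answer=ANSWER_CHOICES[example["label"]]).strip()
    return model_input, target


example = {
    "passage": "Prague is the capital of the Czech Republic.",
    "question": "is prague a capital city",
    "label": 1,
}
model_input, target = apply_template(example)
print(model_input)
print(target)
```

Keeping the answer choices next to the template means the label text ("True" or "False" here) always matches the wording the prompt asks for.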

In [None]:
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

dataset = load_dataset('super_glue', 'boolq', split="validation[:10%]")

prompts = DatasetTemplates("super_glue/boolq")
print(prompts.all_template_names) # Here you can see all available prompts for the given dataset
prompt = prompts['after_reading']


['GPT-3 Style', 'I wonder…', 'after_reading', 'based on the following passage', 'based on the previous passage', 'could you tell me…', 'exam', 'exercise', 'valid_binary', 'yes_no_question']


### 2.1 Evaluation
Let's evaluate our model on BoolQ, a dataset of yes/no questions about a passage, with the model inputs built by the promptsource library. (The model was not trained on this dataset.)

In [None]:
import torch

model = model.to("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
from tqdm import tqdm

predictions = []
references = [x==1 for x in dataset["label"]]

# Get predictions
for item in tqdm(dataset):
    # apply() returns the pair [input, target]; only the input is fed to the model
    model_input_string = prompt.apply(item)[0]
    inputs = tokenizer(model_input_string, padding=True, truncation=True, return_tensors="pt")
    outputs = model.generate(**inputs.to(model.device))
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    predictions.append(response_text)


100%|██████████| 327/327 [00:51<00:00,  6.39it/s]


In [None]:
# Accuracy
correct_predictions = sum([pred == str(true) for pred, true in zip(predictions, references)])
incorrect_predictions = sum([pred != str(true) for pred, true in zip(predictions, references)])

accuracy = correct_predictions / (correct_predictions+incorrect_predictions)
print("Prediction using '%s' classifier; accuracy: %s" % (model.config.model_type, accuracy))  

Prediction using 'mt5' classifier; accuracy: 0.23547400611620795
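
Because the accuracy cell compares strings exactly, any generation that is not literally one of the expected label strings counts as wrong. One way to see whether that is happening is to count the distinct outputs. The sketch below uses a made-up `predictions` list as a stand-in for the one collected above; the values and the `expected_labels` set are assumptions for illustration.

```python
from collections import Counter

# Made-up generations standing in for the `predictions` list collected above.
predictions = ["True", "Yes", "False", "True", "the passage", "Yes"]
expected_labels = {"True", "False"}

# How often does each distinct output occur?
counts = Counter(predictions)

# How many generations fall outside the label set and can never match exactly?
out_of_label = sum(n for text, n in counts.items() if text not in expected_labels)

print(counts.most_common())
print(f"{out_of_label}/{len(predictions)} generations are not an expected label")
```

If many outputs fall outside the label set, a different template or a more lenient matching rule may be worth trying.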


## 3. 🦮 Zero-shot vs few-shot in-context learning
The prompts we talked about above were all "zero-shot" prompts, which means the model had to work out the task from the instruction and the text alone, without any demonstrations of what the expected output should look like.

"Few-shot" prompting means showing the model several demonstrations as part of the input prompt. These input-output examples can considerably improve the model's performance on tasks it has never seen before, because the demonstrations give it a lead on what the task at hand is.


In [None]:
input_zero_shot = """
Question: What is the sentiment of the context: positive or negative? 
Context: I am very happy to be here today.
Answer:""
"""
input_few_shot_not_heterogenic = """
Question: What is the sentiment of the context: positive or negative? 
Context: He said, that the consert was very dull.
Answer:"negative"
Question: What is the sentiment of the context: positive or negative? 
Context: She came from school sad and lonely.
Answer:"negative"
Question: What is the sentiment of the context: positive or negative? 
Context: I am very happy to be here today.
Answer:""
"""
input_few_shot = """
Question: What is the sentiment of the context: positive or negative? 
Context: He said, that the consert was very dull.
Answer:"negative"
Question: What is the sentiment of the context: positive or negative? 
Context: She came from school smiling and singing.
Answer:"positive"
Question: What is the sentiment of the context: positive or negative? 
Context: I am very happy to be here today.
Answer:""
"""
inputs = tokenizer([input_zero_shot,input_few_shot_not_heterogenic,  input_few_shot], return_tensors="pt", padding=True)
outputs = model.generate(**inputs.to(model.device))
outputs_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# predictions:
outputs_str

['happy', 'positive or negative', 'positive']
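
Prompts like the ones above do not have to be written by hand. The sketch below assembles roughly the same Question/Context/Answer layout from a list of demonstrations. The helpers `format_example` and `build_few_shot_prompt` are hypothetical names introduced for illustration, and passing an empty demonstration list gives a zero-shot prompt.

```python
from typing import List, Tuple

QUESTION = "What is the sentiment of the context: positive or negative?"


def format_example(context: str, answer: str = "") -> str:
    """Render one block; an empty answer leaves the slot open for the model."""
    return f'Question: {QUESTION}\nContext: {context}\nAnswer:"{answer}"'


def build_few_shot_prompt(demonstrations: List[Tuple[str, str]], query: str) -> str:
    """Concatenate the solved demonstrations, followed by the unsolved query."""
    blocks = [format_example(context, answer) for context, answer in demonstrations]
    blocks.append(format_example(query))
    return "\n".join(blocks)


demonstrations = [
    ("He said that the concert was very dull.", "negative"),
    ("She came from school smiling and singing.", "positive"),
]
few_shot_prompt = build_few_shot_prompt(demonstrations, "I am very happy to be here today.")
print(few_shot_prompt)
```

Including at least one demonstration per label, as in `input_few_shot` above, shows the model every answer it is allowed to give.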

# 4. ✋ Hands-on: Creating an evaluation dataset

* Download an existing dataset and transform it into prompt input - label pairs (either by creating your own prompt or by using the promptsource library).
  * Text classification (https://huggingface.co/datasets/imdb)
  * Named Entity Recognition (https://huggingface.co/datasets/polyglot_ner/viewer/en/train)
  * Question Answering (https://huggingface.co/datasets/squad_v2)
  * or other

* Adjust the evaluation script accordingly

* Create a function that generates few-shot prompt-label pairs (the prompt will include a few demonstrations of the same task).

* Evaluate the FLAN-T5 base model on your dataset (try both zero-shot and few-shot prompts):

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 
model_path = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [None]:
def pick_random_demonstrations():
    # From your custom dataset pick random demonstrations (prompt-label pairs)
    pass

def create_few_shot_prompt():
    # With the pick_random_demonstrations() function create a new prompt
    pass

# Get models predictions

# Evaluate (depending on your dataset you may need to change the evaluation script) 
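
How the evaluation script needs to change depends on the task. For question answering, where a correct answer can be phrased in more than one way, exact match and token-level F1 are common choices. The sketch below is a simplified version of that idea and not the official SQuAD scorer; the function names and the normalization steps are assumptions for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris, France", "Paris"))
```

For classification tasks such as imdb, plain accuracy over the label strings is usually enough.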

## ✋ Hands-on solution: Evaluating a few-shot model on our own dataset

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 
model_path = "allenai/tk-instruct-large-def-pos"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)


In [None]:
import torch

model = model.to("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Download a chosen dataset and choose a prompt from the promptsource library

from datasets import load_dataset
from promptsource.templates import DatasetTemplates

dataset = load_dataset("imdb", split="test").select(range(1000))  # pick first 1000 samples for test

prompts = DatasetTemplates("imdb")
print(prompts.all_template_names) # Here you can see all available prompts for the given dataset
prompt = prompts['Reviewer Enjoyment Yes No']




['Movie Expressed Sentiment', 'Movie Expressed Sentiment 2', 'Negation template for positive and negative', 'Reviewer Enjoyment', 'Reviewer Enjoyment Yes No', 'Reviewer Expressed Sentiment', 'Reviewer Opinion bad good choices', 'Reviewer Sentiment Feeling', 'Sentiment with choices ', 'Text Expressed Sentiment', 'Writer Expressed Sentiment']


In [None]:
# Create a dataframe for easier handling

from pandas import DataFrame
prompts = [prompt.apply(item) for item in dataset]
inputs = [item[0] for item in prompts]
labels = [item[1] for item in prompts]

df1 = DataFrame({"input":inputs, "label": labels})
df1



| | input | label |
|---|---|---|
| 0 | I love sci-fi and am willing to put up with a ... | No |
| 1 | Worth the entertainment value of a rental, esp... | No |
| 2 | its a totally average film with a few semi-alr... | No |
| 3 | STAR RATING: ***** Saturday Night **** Friday ... | No |
| 4 | First off let me say, If you haven't enjoyed a... | No |
| ... | ... | ... |
| 995 | When they announced this movie for TNT I was e... | No |
| 996 | As a recent convert to Curb Your Enthusiasm, w... | No |
| 997 | Great ensemble cast but unfortunately a bunch ... | No |
| 998 | How i deserved to watch this crap??? Worst eve... | No |


In [None]:
from typing import List, Tuple

all_ratings = df1.label.unique()

def pick_random_demonstrations(input_text: str) -> List[Tuple[str, str]]:
    # Picks one random (review, label) demonstration for each class,
    # never reusing the input that is being classified

    picked_demonstrations = []

    for rating in all_ratings:
        new_demonstration_row = df1[(df1.label == rating) & (df1.input != input_text)].sample(n=1)
        review = new_demonstration_row.input.iloc[0]
        label_string = new_demonstration_row.label.iloc[0]
        picked_demonstrations.append((review, label_string))

    return picked_demonstrations

In [None]:
def create_few_shot_prompt(input_text:str) -> str:
    # Creates a prompt with the examples

    all_labels_demonstrations = pick_random_demonstrations(input_text)
    input_output_format = "Question: %s Answer: %s "
    formatted_demonstrations = [input_output_format % (review, label) for review, label in all_labels_demonstrations]
    formatted_demonstrations_str = "\n".join(formatted_demonstrations)
    formatted_final_input = input_output_format % (input_text, "")
    return "".join([formatted_demonstrations_str,"\n", formatted_final_input])

print(create_few_shot_prompt("This is a Question"))

Question: OUR GANG got one chance at a feature film in its 22 year history, and this was the best that could be done? It's boring, forced and pointless, and I must respectfully disagree with the other poster on this film; the 1994 LITTLE RASCALS remake was better than this. Almost anything is. The kids are subordinate to the Civil War proceedings; it doesn't feel like an OUR GANG film at all, but like a humorless second-rate Shirley Temple clone. Did the reviewer enjoy the movie? Answer: No 
Question: This is a Question Answer:  
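
Because demonstrations are sampled per label, the prompt can only show labels that actually occur in the data frame; the example prompt printed above contains a single demonstration. The sketch below, using only the standard library and made-up rows, counts the labels and draws one demonstration per label with a fixed seed so that runs are repeatable. The names `rows` and `pick_one_per_label` are assumptions for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple

# Made-up (input, label) rows standing in for the data frame.
rows: List[Tuple[str, str]] = [
    ("A dull, forced film. Did the reviewer enjoy the movie?", "No"),
    ("Worst movie of the year. Did the reviewer enjoy the movie?", "No"),
    ("A charming, funny story. Did the reviewer enjoy the movie?", "Yes"),
    ("I loved every minute. Did the reviewer enjoy the movie?", "Yes"),
]


def pick_one_per_label(
    rows: List[Tuple[str, str]], query: str, rng: random.Random
) -> List[Tuple[str, str]]:
    """Draw one demonstration per label, skipping the query itself."""
    picked = []
    for label in sorted({label for _, label in rows}):
        candidates = [(text, lab) for text, lab in rows if lab == label and text != query]
        if candidates:
            picked.append(rng.choice(candidates))
    return picked


# Checking the label counts first shows how many distinct demonstrations are possible.
print(Counter(label for _, label in rows))

demos = pick_one_per_label(rows, query=rows[0][0], rng=random.Random(0))
for text, label in demos:
    print(label, "<-", text)
```

If the counts show only one label, shuffling the dataset before selecting a subset is one possible way to get both classes into the sample.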


### Zero-shot prediction

In [None]:
from tqdm import tqdm

predictions_zero_shot = []

# Get zero-shot predictions
for index, item in tqdm(df1.iterrows(), total=df1.shape[0]):
    model_input_string = item.input
    inputs = tokenizer(model_input_string,padding=True, truncation=True, return_tensors="pt")
    outputs = model.generate(**inputs.to(model.device))
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    predictions_zero_shot.append(response_text)



100%|██████████| 1000/1000 [08:01<00:00,  2.08it/s]


### Few-shot prediction

In [None]:
from tqdm import tqdm

predictions_few_shot = []

# Get few-shot predictions
for index, item in tqdm(df1.iterrows(), total=df1.shape[0]):
    model_input_string = create_few_shot_prompt(item.input)
    inputs = tokenizer(model_input_string,padding=True, truncation=True, return_tensors="pt")
    outputs = model.generate(**inputs.to(model.device))
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    predictions_few_shot.append(response_text)



100%|██████████| 1000/1000 [05:19<00:00,  3.13it/s]


In [None]:
# References for evaluation

references = df1.label.tolist()

In [None]:
# Accuracy zero-shot
# A prediction counts as correct when the reference label appears in the generated text
correct_predictions_zero_shot = sum([true.lower() in pred.lower() for pred, true in zip(predictions_zero_shot, references)])
incorrect_predictions_zero_shot = sum([true.lower() not in pred.lower() for pred, true in zip(predictions_zero_shot, references)])

accuracy_zero_shot = correct_predictions_zero_shot / (correct_predictions_zero_shot+incorrect_predictions_zero_shot)
print("Zero-shot")  
print("Prediction using '%s' classifier; accuracy: %s" % (model.config.model_type, accuracy_zero_shot))  


In [None]:
# Accuracy few-shot
correct_predictions_few_shot = sum([true.lower() in pred.lower() for pred, true in zip(predictions_few_shot, references)])
incorrect_predictions_few_shot = sum([true.lower() not in pred.lower() for pred, true in zip(predictions_few_shot, references)])

accuracy_few_shot = correct_predictions_few_shot / (correct_predictions_few_shot+incorrect_predictions_few_shot)
print("Few-shot")  
print("Prediction using '%s' classifier; accuracy: %s" % (model.config.model_type, accuracy_few_shot))  

Few-shot
Prediction using 't5' classifier; accuracy: 0.732


With a demonstration in the prompt, the instruction-tuned tk-instruct model reaches an accuracy of 0.732. Comparing this figure with the zero-shot accuracy shows how much such a model can benefit from demonstrations.
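
One way to quantify such a comparison is the relative error reduction, that is, the share of the zero-shot error that disappears once demonstrations are added. The sketch below computes it on made-up predictions; the lists and both helper functions are assumptions for illustration and are not the notebook's results.

```python
from typing import List


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of predictions that contain the reference label (case-insensitive)."""
    hits = sum(ref.lower() in pred.lower() for pred, ref in zip(predictions, references))
    return hits / len(references)


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original error removed: (err_before - err_after) / err_before."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    if err_before == 0:
        return 0.0  # nothing left to improve
    return (err_before - err_after) / err_before


# Made-up example data, not the results of the runs above.
references = ["No", "No", "Yes", "Yes"]
zero_shot = ["Yes", "no", "positive", "maybe"]
few_shot = ["No", "No", "no", "Yes"]

acc_zero = accuracy(zero_shot, references)
acc_few = accuracy(few_shot, references)
print(acc_zero, acc_few, relative_error_reduction(acc_zero, acc_few))
```

Note that substring matching is lenient: a generation such as "not sure" would also count as containing "no", so stricter matching may be preferable for short labels.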