# Introduction to In-context learning
1. **🤖 What is in-context learning (ICL)**
2. **🦮 Zero-shot vs. Few-shot ICL**
3. **🎨 Prompt design**
4. **✋ Hands-on: Transforming a dataset into a few-shot prompt-label dataset and evaluating existing models**


In [1]:
import sys
!{sys.executable} -m pip install git+https://github.com/fewshot-goes-multilingual/promptsource

Collecting git+https://github.com/fewshot-goes-multilingual/promptsource
  Cloning https://github.com/fewshot-goes-multilingual/promptsource to /tmp/pip-req-build-9supn9vs
  Running command git clone --filter=blob:none --quiet https://github.com/fewshot-goes-multilingual/promptsource /tmp/pip-req-build-9supn9vs
  Resolved https://github.com/fewshot-goes-multilingual/promptsource to commit eb6d175b2c397cb7ee2aa46c334c17e3238a49cc
  Preparing metadata (setup.py) ... [?25ldone


# 1. 🤖 In-context learning (ICL)

In context learning is a use of a generative model, where the description of a desired task is a part of the input. 

While pre-training, the model is trained on an objective of "guessing" the right word in context. This is achieved by objectives like Masked Language Modeling (MLM) or Causal Language Modeling (CLM). During these objectives, the model aquires an inherent understanding of the language. 

After pre-training, traditionaly, we would then fine-tune the model through Supervised ML for a specific task for which we need:
* Training data (input and label pairs)
* Adding a specific layer ("head") to the model relevant to our desired task
The resulting model is fit for that one specific task.

When we talk about In-context learning, we mean a property of a model to solve tasks it was not fine-tuned for with only instructions provided in the user's input, aka a prompt. In-context learning is a property usually seen in Large Language Models


In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 
model_path = "gaussalgo/mt5-base-priming-QA_en-cs"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)


  from .autonotebook import tqdm as notebook_tqdm


In [3]:
prompt = """
What is meant by: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. 
I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial"
I really had to see this for myself. The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life.
In particular she wants to focus her attentions to making some sort of documentary on what the
average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. 
"""

inputs = tokenizer([prompt], return_tensors="pt", padding=True)
outputs = model.generate(**inputs.to(model.device))
outputs_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# predictions:
outputs_str

['The plot is centered around a young Swedish drama student named Lena']

## 2. 🦮 Zero-shot vs few-shot in-context learning
The model might understand the task from the input, but it does not know how do we expect it to respond. Therefore, if we have the model adjusted for such use, we can show it the format of the task from a few input-output examples and see if it comprehends.

This approach is called in-context few-shot learning: In addition to the description of the task, we give the model a few input-output examples (demonstrations). Given these, the model has much easier time understanding the format of the interaction that we expect from it. The demonstration are the only lead the model has to understand the task at hand. We can see, that if we pick only examples with a negative sentiment, the model is unable to pick the correct label. 

In this setting, we need to standardize the format of prediction, so that the model can rely on it.


In [4]:
input_zero_shot = """
Question: What is the sentiment of the context: positive or negative? 
Context: I am very happy to be here today.
Answer:""
"""
input_few_shot_not_heterogenic = """
Question: What is the sentiment of the context: positive or negative? 
Context: He said, that the consert was very dull.
Answer:"negative"
Question: What is the sentiment of the context: positive or negative? 
Context: She came from school sad and lonely.
Answer:"negative"
Question: What is the sentiment of the context: positive or negative? 
Context: I am very happy to be here today.
Answer:""
"""
input_few_shot = """
Question: What is the sentiment of the context: positive or negative? 
Context: He said, that the consert was very dull.
Answer:"negative"
Question: What is the sentiment of the context: positive or negative? 
Context: She came from school smiling and singing.
Answer:"positive"
Question: What is the sentiment of the context: positive or negative? 
Context: I am very happy to be here today.
Answer:""
"""
inputs = tokenizer([input_zero_shot,input_few_shot_not_heterogenic,  input_few_shot], return_tensors="pt", padding=True)
outputs = model.generate(**inputs.to(model.device))
outputs_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# predictions:
outputs_str

['happy', 'positive or negative', 'positive']

## 3. 🎨 What should the prompts look like?
For training a custom in-context learner we need text pairs of a prompt and label. While in the above example we see a unified prompt, in training it is beneficial to create multiple prompts for one task, as we want to support the models capability to understand the task by its description, not by identifying a task by a template. This diversivication should yield a benefit of having the model understanding never before seen tasks better.

In [5]:
#TODO write your own prompt (format string)

In [12]:
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

dataset = load_dataset('super_glue', 'boolq', split="validation[:10%]")

prompts = DatasetTemplates("super_glue/boolq")
print(prompts.all_template_names)
prompt = prompts['GPT-3 Style']


Found cached dataset super_glue (/home/nikola/.cache/huggingface/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed)


['GPT-3 Style', 'I wonder…', 'after_reading', 'based on the following passage', 'based on the previous passage', 'could you tell me…', 'exam', 'exercise', 'valid_binary', 'yes_no_question']


### 3.1 Evaluation
Let's evaluate our model on a dataset created using the promptsource library and a dataset about if the answer to a question is in the context. (The model was not trained on this dataset)

In [11]:
from tqdm import tqdm

predictions = []
references = [x==1 for x in dataset["label"]]

# Get predictions
for item in tqdm(dataset):
    model_input_string = prompt.apply(item)
    inputs = tokenizer(model_input_string,padding=True, truncation=True, return_tensors="pt")
    outputs = model.generate(**inputs.to(model.device))
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    predictions.append(response_text)


  0%|          | 0/327 [00:00<?, ?it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Ethanol fuel -- All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that productio

  0%|          | 1/327 [00:00<05:03,  1.08it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Property tax -- Property tax or 'house tax' is a local tax on buildings, along with appurtenant land. It is and imposed on the Possessor (not the custodian of property as per 1978, 44th amendment of constitution). It resembles the US-type wealth tax and differs from the excise-type UK rate. The tax power is vested in the states and is delegated to local bodies, specifying the valuation method, rate band, and collection procedures. The tax base is the annual rental value (ARV) or area-based rating. Owner-occupied and other properties not producing rent are assessed on cost and then converted into ARV by applying a percentage of cost, usually four percent. Vacant land is generally exempt. Central government properties are exempt. Instead a 'service charge' is permissible under executive order. Properties of foreign missions also enjoy tax exemption without requiring reciprocity. The tax is usually accompanied by service taxes, e.g., water tax,

  1%|          | 2/327 [00:02<06:26,  1.19s/it]

['EXAM\n1. Answer by yes or no.\n\nDocument: Phantom pain -- Phantom pain sensations are described as perceptions that an individual experiences relating to a limb or an organ that is not physically part of the body. Limb loss is a result of either removal by amputation or congenital limb deficiency. However, phantom limb sensations can also occur following nerve avulsion or spinal cord injury.\nQuestion: is pain experienced in a missing body part or paralyzed area?', 'Yes']


  1%|          | 3/327 [00:03<06:16,  1.16s/it]

['EXAM\n1. Answer by yes or no.\n\nDocument: Harry Potter and the Escape from Gringotts -- Harry Potter and the Escape from Gringotts is an indoor steel roller coaster at Universal Studios Florida, a theme park located within the Universal Orlando Resort. Similar to dark rides, the roller coaster utilizes special effects in a controlled-lighting environment and also employs motion-based 3-D projection of both animation and live-action sequences to enhance the experience. The ride, which is themed to the Gringotts Wizarding Bank, became the flagship attraction for the expanded Wizarding World of Harry Potter when it opened on July 8, 2014.\nQuestion: is harry potter and the escape from gringotts a roller coaster ride?', 'Yes']


  1%|          | 4/327 [00:04<05:28,  1.02s/it]

["EXAM\n1. Answer by yes or no.\n\nDocument: Hydroxyzine -- Hydroxyzine preparations require a doctor's prescription. The drug is available in two formulations, the pamoate and the dihydrochloride or hydrochloride salts. Vistaril, Equipose, Masmoran, and Paxistil are preparations of the pamoate salt, while Atarax, Alamon, Aterax, Durrax, Tran-Q, Orgatrax, Quiess, and Tranquizine are of the hydrochloride salt.\nQuestion: is there a difference between hydroxyzine hcl and hydroxyzine pam?", 'Yes']


  2%|▏         | 5/327 [00:05<06:25,  1.20s/it]

["EXAM\n1. Answer by yes or no.\n\nDocument: Barq's -- Barq's /ˈbɑːrks/ is an American soft drink. Its brand of root beer is notable for having caffeine. Barq's, created by Edward Barq and bottled since the turn of the 20th century, is owned by the Barq family but bottled by the Coca-Cola Company. It was known as Barq's Famous Olde Tyme Root Beer until 2012.\nQuestion: is barq's root beer a pepsi product?", 'No']


  2%|▏         | 6/327 [00:06<05:20,  1.00it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Parity (mathematics) -- In mathematics, parity is the property of an integer's inclusion in one of two categories: even or odd. An integer is even if it is evenly divisible by two and odd if it is not even. For example, 6 is even because there is no remainder when dividing it by 2. By contrast, 3, 5, 7, 21 leave a remainder of 1 when divided by 2. Examples of even numbers include −4, 0, 82 and 178. In particular, zero is an even number. Some examples of odd numbers are −5, 3, 29, and 73.\nQuestion: can an odd number be divided by an even number?", 'Yes']


  2%|▏         | 7/327 [00:07<05:37,  1.05s/it]

['EXAM\n1. Answer by yes or no.\n\nDocument: List of English words containing Q not followed by U -- Of the 71 words in this list, 67 are nouns, and most would generally be considered loanwords; the only modern-English words that contain Q not followed by U and are not borrowed from another language are qiana, qwerty, and tranq. However, all of the loanwords on this list are considered to be naturalised in English according to at least one major dictionary (see References), often because they refer to concepts or societal roles that do not have an accurate equivalent in English. For words to appear here, they must appear in their own entry in a dictionary; words which occur only as part of a longer phrase are not included.\nQuestion: is there a word with q without u?', 'Yes']


  2%|▏         | 8/327 [00:08<05:14,  1.02it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: American entry into Canada by land -- Persons driving into Canada must have their vehicle's registration document and proof of insurance.\nQuestion: can u drive in canada with us license?", 'Yes']


  3%|▎         | 9/327 [00:08<04:17,  1.24it/s]

['EXAM\n1. Answer by yes or no.\n\nDocument: 2018 FIFA World Cup knockout stage -- The knockout stage of the 2018 FIFA World Cup was the second and final stage of the competition, following the group stage. It began on 30 June with the round of 16 and ended on 15 July with the final match, held at the Luzhniki Stadium in Moscow. The top two teams from each group (16 in total) advanced to the knockout stage to compete in a single-elimination style tournament. A third place play-off was also played between the two losing teams of the semi-finals.\nQuestion: is there a play off for third place in the world cup?', 'Yes']


  3%|▎         | 10/327 [00:09<04:16,  1.24it/s]

['EXAM\n1. Answer by yes or no.\n\nDocument: Alcohol laws of New York -- In response to the National Minimum Drinking Age Act in 1984, which reduced by up to 10% the federal highway funding of any state which did not have a minimum purchasing age of 21, the New York Legislature raised the drinking age from 19 to 21, effective December 1, 1985. (The drinking age had been 18 for many years before the first raise on December 4th, 1982, to 19.) Persons under 21 are prohibited from purchasing alcohol or possessing alcohol with the intent to consume, unless the alcohol was given to that person by their parent or legal guardian. There is no law prohibiting where people under 21 may possess or consume alcohol that was given to them by their parents. Persons under 21 are prohibited from having a blood alcohol level of 0.02% or higher while driving.\nQuestion: can minors drink with parents in new york?', 'Yes']


  3%|▎         | 11/327 [00:10<03:48,  1.39it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Bloodline (TV series) -- Bloodline was announced in October 2014 as part of a partnership between Netflix and Sony Pictures Television, representing Netflix's first major deal with a major film studio for a television series. The series was created and executive produced by Todd A. Kessler, Glenn Kessler, and Daniel Zelman, who previously created the FX series Damages. According to its official synopsis released by Netflix, Bloodline ``centers on a close-knit family of four adult siblings whose secrets and scars are revealed when their black sheep brother returns home.''\nQuestion: is the show bloodline based on a true story?", 'No']


  4%|▎         | 12/327 [00:10<03:39,  1.44it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Shower gel -- Shower gels for men may contain the ingredient menthol, which gives a cooling and stimulating sensation on the skin, and some men's shower gels are also designed specifically for use on hair and body. Shower gels contain milder surfactant bases than shampoos, and some also contain gentle conditioning agents in the formula. This means that shower gels can also double as an effective and perfectly acceptable substitute to shampoo, even if they are not labelled as a hair and body wash. Washing hair with shower gel should give approximately the same result as using a moisturising shampoo.\nQuestion: is it bad to wash your hair with shower gel?", 'Yes']


  4%|▍         | 13/327 [00:11<04:15,  1.23it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Excretory system -- The liver detoxifies and breaks down chemicals, poisons and other toxins that enter the body. For example, the liver transforms ammonia (which is poisonous) into urea in fish, amphibians and mammals, and into uric acid in birds and reptiles. Urea is filtered by the kidney into urine or through the gills in fish and tadpoles. Uric acid is paste-like and expelled as a semi-solid waste (the ``white'' in bird excrements). The liver also produces bile, and the body uses bile to break down fats into usable fats and unusable waste.\nQuestion: is the liver part of the excretory system?", 'Yes']


  4%|▍         | 14/327 [00:12<04:10,  1.25it/s]

['EXAM\n1. Answer by yes or no.\n\nDocument: Fantastic Beasts and Where to Find Them (film) -- Fantastic Beasts and Where to Find Them is a 2016 fantasy film directed by David Yates. A joint British and American production, it is a spin-off and prequel to the Harry Potter film series, and it was produced and written by J.K. Rowling in her screenwriting debut, and inspired by her 2001 book of the same name. The film stars Eddie Redmayne as Newt Scamander, with Katherine Waterston, Dan Fogler, Alison Sudol, Ezra Miller, Samantha Morton, Jon Voight, Carmen Ejogo, Ron Perlman, Colin Farrell and Johnny Depp in supporting roles. It is the first installment in the Fantastic Beasts film series, and ninth overall in the Wizarding World franchise, that began with the Harry Potter films.\nQuestion: is fantastic beasts and where to find them a prequel?', 'Yes']


  5%|▍         | 15/327 [00:13<04:09,  1.25it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: The Vampire Diaries (season 8) -- The Vampire Diaries, an American supernatural drama, was renewed for an eighth season by The CW on March 11, 2016. On July 23, 2016, the CW announced that the upcoming season would be the series' last and would consist of 16 episodes. The season premiered on October 21, 2016 and concluded on March 10, 2017.\nQuestion: will there be a season 8 of vampire diaries?", 'Yes']


  5%|▍         | 16/327 [00:14<03:54,  1.33it/s]

['EXAM\n1. Answer by yes or no.\n\nDocument: The Strangers (2008 film) -- The Strangers is a 2008 American slasher film written and directed by Bryan Bertino. Kristen (Liv Tyler) and James (Scott Speedman) are expecting a relaxing weekend at a family vacation home, but their stay turns out to be anything but peaceful as three masked torturers leave Kristen and James struggling for survival. Writer-director Bertino was inspired by real-life events: the Manson family Tate murders, a multiple homicide; the Keddie Cabin Murders, that occurred in California in 1981; and a series of break-ins that occurred in his own neighborhood as a child. Made on a budget of $9 million, the film was shot on location in rural South Carolina in the fall of 2006.\nQuestion: was the movie strangers based on a true story?', 'Yes']


  5%|▌         | 17/327 [00:14<03:52,  1.33it/s]

['EXAM\n1. Answer by yes or no.\n\nDocument: Russell Group -- In March 2012 it was announced that four universities -- Durham, Exeter, Queen Mary University of London; and York -- would become members of the Russell Group in August of the same year. All of the new members had previously been members of the 1994 Group of British universities.\nQuestion: is durham university part of the russell group?', 'Yes']


  6%|▌         | 18/327 [00:15<04:18,  1.19it/s]

['EXAM\n1. Answer by yes or no.\n\nDocument: The Resident (TV series) -- The Resident is an American medical drama television series aired by Fox Broadcasting Company that premiered on January 21, 2018, as a mid-season replacement entry in the 2017--18 television season. The fictional series focuses on the lives and duties of staff members at Chastain Park Memorial Hospital, while delving into the bureaucratic practices of the hospital industry. Formerly called The City, the show was purchased by Fox from Showtime in 2017. It was created by created by Amy Holden Jones, Hayley Schore, and Roshan Sethi. On May 10, 2017, Fox ordered a full 14-episode season and renewed the series for a second season on May 7, 2018. The first season officially concluded on May 14, 2018.\nQuestion: is the tv show the resident over for the season?', 'Yes']


  6%|▌         | 19/327 [00:16<03:59,  1.29it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Magnesium citrate -- Magnesium citrate is a magnesium preparation in salt form with citric acid in a 1:1 ratio (1 magnesium atom per citrate molecule). The name ``magnesium citrate'' is ambiguous and sometimes may refer to other salts such as trimagnesium citrate which has a magnesium:citrate ratio of 3:2.\nQuestion: does magnesium citrate have citric acid in it?", 'Yes']


  6%|▌         | 20/327 [00:16<03:34,  1.43it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Post-office box -- Street Addressing will have the same street address of the post office, plus a ``unit number'' that matches the P.O. Box number. As an example, in El Centro, California, the post office is located at 1598 Main Street. Therefore, for P.O. Box 9975 (fictitious), the Street Addressing would be: 1598 Main Street Unit 9975, El Centro, CA. Nationally, the first five digits of the zip code may or may not be the same as the P.O. Box address, and the last four digits (Zip + 4) are virtually always different. Except for a few of the largest post offices in the U.S., the 'Street Addressing' (not the P.O. Box address) nine digit Zip + 4 is the same for all boxes at a given location.\nQuestion: does p o box come before street address?", 'No']


  6%|▋         | 21/327 [00:17<03:49,  1.33it/s]

["EXAM\n1. Answer by yes or no.\n\nDocument: Spark plug -- A spark plug (sometimes, in British English, a sparking plug, and, colloquially, a plug) is a device for delivering electric current from an ignition system to the combustion chamber of a spark-ignition engine to ignite the compressed fuel/air mixture by an electric spark, while containing combustion pressure within the engine. A spark plug has a metal threaded shell, electrically isolated from a central electrode by a porcelain insulator. The central electrode, which may contain a resistor, is connected by a heavily insulated wire to the output terminal of an ignition coil or magneto. The spark plug's metal shell is screwed into the engine's cylinder head and thus electrically grounded. The central electrode protrudes through the porcelain insulator into the combustion chamber, forming one or more spark gaps between the inner end of the central electrode and usually one or more protuberances or structures attached to the inner

  6%|▋         | 21/327 [00:19<04:38,  1.10it/s]


KeyboardInterrupt: 

In [10]:
# Accuracy
correct_predictions = sum([pred == str(true) for pred, true in zip(predictions, references)])
incorrect_predictions = sum([pred != str(true) for pred, true in zip(predictions, references)])

accuracy = correct_predictions / (correct_predictions+incorrect_predictions)
print("Prediction using '%s' classifier; accuracy: %s" % (model.config.model_type, accuracy))  

Prediction using 'mt5' classifier; accuracy: 0.0


# 4. ✋ Hands on: Creation of an evaluation dataset 

* Download an existing dataset and transform it into a prompt input - label pair (either by creating your own prompt or by using the promtsource library).
  * Text classification (https://huggingface.co/datasets/imdb)
  * Named Entity Recognition (https://huggingface.co/datasets/polyglot_ner/viewer/en/train)
  * Question Answering (https://huggingface.co/datasets/squad_v2)
  * or other

* Adjust the evaluation script accordingly

* Create a function which will generate a few-shot (the prompt will include few demonstrations of the same task) prompt and label pairs.

* evaluate some existing ICL models on your dataset (try both zero-shot and few-shot prompts):

  * https://huggingface.co/google/flan-t5-large
  * https://huggingface.co/allenai/mtk-instruct-3b-def-pos

In [9]:
def pick_random_demonstrations():
    # From your custom dataset pick random demostrations (prompt-label pairs)
    pass

def create_few_shot_prompt():
    # With the pick_random_demonstrations() function create a new prompt
    pass

# Get models predictions

# Evaluate (depending on your dataset you may need to change the evaluation script) 