# Introduction to Data Science 2025

# Week 4

In this week's exercise, we look at prompting and zero- and few-shot task settings. Below is a text generation example from https://github.com/TurkuNLP/intro-to-nlp/blob/master/text_generation_pipeline_example.ipynb demonstrating how to load a text generation pipeline with a pre-trained model and generate text with a given prompt. Your task is to load a similar pre-trained generative model and assess whether the model succeeds at a set of tasks in zero-shot, one-shot, and two-shot settings.

**Note: Downloading and running the pre-trained model locally may take some time. Alternatively, you can open and run this notebook on [Google Colab](https://colab.research.google.com/), as assumed in the following example.**

## Text generation example

This is a brief example of how to run text generation with a causal language model and `pipeline`.

Install the [transformers](https://huggingface.co/docs/transformers/index) Python package. It is used to load the model and tokenizer and to run generation.

In [1]:
!pip install --quiet transformers

Import the `AutoTokenizer`, `AutoModelForCausalLM`, and `pipeline` classes. The first two support loading tokenizers and generative models from the [Hugging Face repository](https://huggingface.co/models), and the last wraps a tokenizer and a model for convenience.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline



Load a generative model and its tokenizer. You can substitute any other generative model name here (e.g. [other TurkuNLP GPT-3 models](https://huggingface.co/models?sort=downloads&search=turkunlp%2Fgpt3)), but note that Colab may have issues running larger models. 

In [3]:
MODEL_NAME = 'TurkuNLP/gpt3-finnish-large'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

Instantiate a text generation pipeline using the tokenizer and model.

In [4]:
pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    device=model.device
)

Device set to use cpu


We can now call the pipeline with a text prompt; it will take care of tokenizing, encoding, generation, and decoding:

In [5]:
output = pipe('Terve, miten menee?', max_new_tokens=25)
print(output)

[{'generated_text': 'Terve, miten menee? Tullaan huomenna, jos kaikki menee hyvin.”\n”Nuku hyvin, kulta.”\n”Rakastan sinua.”'}]


Print only the generated text:

In [6]:
print(output[0]['generated_text'])

Terve, miten menee? Tullaan huomenna, jos kaikki menee hyvin.”
”Nuku hyvin, kulta.”
”Rakastan sinua.”


We can also call the pipeline with any arguments that the model `generate` function supports. For details on text generation using `transformers`, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

Example with sampling and a high `temperature` parameter to generate more chaotic output:

In [7]:
output = pipe(
    'Terve, miten menee?',
    do_sample=True,
    temperature=10.0,
    max_new_tokens=25
)
print(output[0]['generated_text'])

Terve, miten menee? Entä millainen olit koulussa nuorena tai opiskelijana?
Minulla siis kaikki sujunut ok vaikka pieni pelko hiipinyt rintaan..
Millainen historia oli lukioni aikana
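
To see why a high `temperature` gives more chaotic output, recall that sampling divides the model's next-token scores (logits) by the temperature before the softmax turns them into probabilities. The sketch below is only an illustration in plain Python: the three logits are made up, and `softmax_with_temperature` is a hypothetical helper, not part of `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores, for illustration only.
toy_logits = [4.0, 2.0, 1.0]

for t in (0.5, 1.0, 10.0):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 10 the three probabilities move much closer to uniform, so unlikely tokens are sampled almost as often as likely ones; at a low temperature the top-scoring token dominates.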


## Exercise 1

Your task is to assess whether a generative model succeeds in the following tasks in zero-shot, one-shot, and two-shot settings:

- binary sentiment classification (positive / negative)

- person name recognition

- two-digit addition (e.g. 11 + 22 = 33)

For example, for assessing whether a generative model can name capital cities, we could use the following prompts:

- zero-shot:
	>"""\
	>Identify the capital cities of countries.
	>
	>Question: What is the capital of Finland?\
	>Answer:\
	>"""
- one-shot:
	>"""\
	>Identify the capital cities of countries.
	>
	>Question: What is the capital of Sweden?\
	>Answer: Stockholm
	>
	>Question: What is the capital of Finland?\
	>Answer:\
	>"""
- two-shot:
	>"""\
	>Identify the capital cities of countries.
	>
	>Question: What is the capital of Sweden?\
	>Answer: Stockholm
	>
	>Question: What is the capital of Denmark?\
	>Answer: Copenhagen
	>
	>Question: What is the capital of Finland?\
	>Answer:\
	>"""
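
The three prompts above differ only in how many solved examples precede the final question, so they can be assembled programmatically. The following is a minimal sketch; `build_prompt` and its parameters are hypothetical names introduced for illustration, and the layout simply mirrors the Question/Answer format shown above.

```python
def build_prompt(instruction, examples, question, k=0):
    """Build a k-shot prompt in the Question/Answer layout.

    `examples` is a list of (question, answer) pairs; the first k are used.
    """
    blocks = [instruction]
    for q, a in examples[:k]:
        blocks.append(f"Question: {q}\nAnswer: {a}")
    # The final question is left unanswered for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

examples = [
    ("What is the capital of Sweden?", "Stockholm"),
    ("What is the capital of Denmark?", "Copenhagen"),
]
instruction = "Identify the capital cities of countries."

for k in range(3):
    print(f"--- {k}-shot ---")
    print(build_prompt(instruction, examples, "What is the capital of Finland?", k=k))
```

The resulting string can be passed to a text generation pipeline as the prompt.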

You can do the tasks either in English or Finnish and use a generative model of your choice from the Hugging Face models repository, for example the following models:

- English: `gpt2-large`
- Finnish: `TurkuNLP/gpt3-finnish-large`

You can either come up with your own instructions for the tasks or use the following:

- English:
	- binary sentiment classification: "Do the following texts express a positive or negative sentiment?"
	- person name recognition: "List the person names occurring in the following texts."
	- two-digit addition: "This is a first grade math exam."
- Finnish:
	- binary sentiment classification: "Ilmaisevatko seuraavat tekstit positiivista vai negatiivista tunnetta?"
	- person name recognition: "Listaa seuraavissa teksteissä mainitut henkilönnimet."
	- two-digit addition: "Tämä on ensimmäisen luokan matematiikan koe."

Write at least two test cases for each of the three tasks, along with your own one- and two-shot examples.

In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import re

# 'gpt2' is the smallest GPT-2 checkpoint; swap in 'gpt2-large' if resources allow.
model_name_en = 'gpt2'
tok_en = AutoTokenizer.from_pretrained(model_name_en)
mdl_en = AutoModelForCausalLM.from_pretrained(model_name_en)
pipe_en = pipeline('text-generation', model=mdl_en, tokenizer=tok_en, device=mdl_en.device)

def generate_en(prompt, max_new_tokens=40):
    """Greedy (deterministic) generation; the returned text includes the prompt."""
    out = pipe_en(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    return out[0]['generated_text']

sentiment_instruction = (
    "Say if the text is positive or negative. Answer with 'positive' or 'negative' only.\n"
)

sentiment_one_shot_example = (
    "Text: The day was amazing and I felt great.\n"
    "Answer: positive\n"
)

sentiment_two_shot_example = (
    "Text: The food was cold and the service was slow.\n"
    "Answer: negative\n"
)

sentiment_tests = [
    {"text": "I love this new laptop, the battery lasts long and the screen is beautiful.", "expected": "positive"},
    {"text": "The app keeps crashing and it wastes my time. I hate it.", "expected": "negative"},
]

def make_sentiment_prompt(setting, text):
    prompt = sentiment_instruction
    if setting == 'one-shot':
        prompt += sentiment_one_shot_example
    if setting == 'two-shot':
        prompt += sentiment_one_shot_example + sentiment_two_shot_example
    prompt += f"Text: {text}\nAnswer:"
    return prompt

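def check_sentiment(answer_text, expected_label):
    """Return (is_correct, predicted_label).

    Only the text after the last 'Answer:' is inspected, so if the model goes on
    to write further 'Answer:' lines, the last one is what gets scored.
    """
    ans = answer_text.split('Answer:')[-1].strip().lower()
    if 'positive' in ans:
        pred = 'positive'
    elif 'negative' in ans:
        pred = 'negative'
    else:
        pred = 'unknown'
    return pred == expected_label, pred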

ner_instruction = (
    "List the person names in the text. Return only the names separated by commas.\n"
)
ner_one_shot_example = (
    "Text: Alice met Bob at the library.\n"
    "Answer: Alice, Bob\n"
)
ner_two_shot_example = (
    "Text: John and Mary studied together.\n"
    "Answer: John, Mary\n"
)
ner_tests = [
    {"text": "Alice and Bob went to the park.", "expected": ["Alice", "Bob"]},
    {"text": "Barack Obama met Joe Biden in Washington.", "expected": ["Barack Obama", "Joe Biden"]},
]

def make_ner_prompt(setting, text):
    prompt = ner_instruction
    if setting == 'one-shot':
        prompt += ner_one_shot_example
    if setting == 'two-shot':
        prompt += ner_one_shot_example + ner_two_shot_example
    prompt += f"Text: {text}\nAnswer:"
    return prompt

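def check_ner(answer_text, expected_names):
    """Return (is_correct, raw_answer).

    Lenient check: every word of every expected name must occur somewhere in
    the text after the last 'Answer:'. Extra text or extra names are not penalised.
    """
    ans = answer_text.split('Answer:')[-1]
    ans_low = ans.lower()
    ok = all(
        part.lower() in ans_low
        for gold in expected_names
        for part in gold.split()
    )
    return ok, ans.strip()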

math_instruction = (
    "This is a first grade math quiz. Solve the sum. Answer with just the number.\n"
)
math_one_shot_example = (
    "Question: 12 + 7 =\n"
    "Answer: 19\n"
)
math_two_shot_example = (
    "Question: 23 + 11 =\n"
    "Answer: 34\n"
)
math_tests = [
    {"a": 11, "b": 22},
    {"a": 47, "b": 18},
]

def make_math_prompt(setting, a, b):
    prompt = math_instruction
    if setting == 'one-shot':
        prompt += math_one_shot_example
    if setting == 'two-shot':
        prompt += math_one_shot_example + math_two_shot_example
    prompt += f"Question: {a} + {b} =\nAnswer:"
    return prompt

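def check_math(answer_text, a, b):
    """Return (is_correct, (predicted, expected, raw_answer)).

    The first integer found after the last 'Answer:' is taken as the prediction.
    """
    ans = answer_text.split('Answer:')[-1]
    m = re.search(r"[-+]?\d+", ans)
    pred = int(m.group(0)) if m else None
    correct_value = a + b
    return (pred == correct_value), (pred, correct_value, ans.strip())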

all_results = []
settings = ['zero-shot', 'one-shot', 'two-shot']

for setting in settings:
    # Sentiment
    for i, case in enumerate(sentiment_tests, start=1):
        prompt = make_sentiment_prompt(setting, case['text'])
        gen = generate_en(prompt, max_new_tokens=24)
        ok, pred = check_sentiment(gen, case['expected'])
        all_results.append(['sentiment', setting, i, case['expected'], pred, ok])

    # NER
    for i, case in enumerate(ner_tests, start=1):
        prompt = make_ner_prompt(setting, case['text'])
        gen = generate_en(prompt, max_new_tokens=32)
        ok, ans = check_ner(gen, case['expected'])
        all_results.append(['name_recognition', setting, i, ', '.join(case['expected']), ans, ok])

    # Math
    for i, case in enumerate(math_tests, start=1):
        prompt = make_math_prompt(setting, case['a'], case['b'])
        gen = generate_en(prompt, max_new_tokens=10)
        ok, info = check_math(gen, case['a'], case['b'])
        pred, gold, raw = info
        all_results.append(['two_digit_addition', setting, i, str(gold), str(pred), ok])

print("Task | Setting | Case | Expected | Predicted | OK")
for row in all_results:
    print(" | ".join([str(x) for x in row]))

# Count correct and total cases per (task, setting) pair.
acc = {}
for task, setting, case, exp, pred, ok in all_results:
    good_total = acc.setdefault((task, setting), [0, 0])
    good_total[1] += 1
    if ok:
        good_total[0] += 1

print("\nAccuracy by task/setting:")
for key, (good, total) in acc.items():
    pct = 100.0 * good / total if total else 0.0
    print(f"{key[0]} / {key[1]}: {good}/{total} = {pct:.1f}%")

print("\nNotes:")
print("- GPT-2 is small and not instruction-tuned, so zero-shot can be weak.")
print("- One- and two-shot usually help by showing the format.")
print("- NER check only verifies that expected names appear in the model's answer.")


Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Task | Setting | Case | Expected | Predicted | OK
sentiment | zero-shot | 1 | positive | unknown | False
sentiment | zero-shot | 2 | negative | unknown | False
name_recognition | zero-shot | 1 | Alice, Bob | Alice and Bob went to the park.
Text: Alice and Bob went to the park.
Text: Alice and Bob went to the park.
Text | True
name_recognition | zero-shot | 2 | Barack Obama, Joe Biden | Joe Biden met Barack Obama in Washington.
Text: Barack Obama met Joe Biden in Washington.
Text: Barack Obama met Joe Biden in Washington.
Text | True
two_digit_addition | zero-shot | 1 | 33 | 11 | False
two_digit_addition | zero-shot | 2 | 65 | 47 | False
sentiment | one-shot | 1 | positive | negative | False
sentiment | one-shot | 2 | negative | negative | True
name_recognition | one-shot | 1 | Alice, Bob | Alice and Bob went to the park.
Text: Alice and Bob went to the park.
Text: Alice and Bob went to the park.
Text | True
name_recognition | one-shot | 2 | Barack Obama, Joe Biden | Barack Obama, Joe B
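
The zero-shot name-recognition rows above are marked `True` even though the model only repeated the input sentence, because the check merely tests whether each expected name occurs somewhere after the last `Answer:`. One possible stricter scoring, sketched below, drops the prompt, keeps only the first line of the continuation, and compares the comma-separated names exactly. `extract_answer` and `names_match` are hypothetical helpers, and the strings are made up for illustration rather than taken from actual model output.

```python
def extract_answer(generated_text, prompt):
    """Return the first non-empty line the model produced after the prompt."""
    # The pipeline output used above starts with the prompt itself, so strip it.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    for line in continuation.splitlines():
        if line.strip():
            return line.strip()
    return ""

def names_match(answer_line, expected_names):
    """Exact, order-insensitive comparison of comma-separated names."""
    predicted = {name.strip().lower() for name in answer_line.split(",") if name.strip()}
    return predicted == {name.lower() for name in expected_names}

# Made-up strings for illustration; not actual model output.
prompt = "Text: Alice and Bob went to the park.\nAnswer:"
echoed = prompt + " Alice and Bob went to the park.\nText: Alice and Bob went to the park."
listed = prompt + " Alice, Bob\nText: something else"

print(names_match(extract_answer(echoed, prompt), ["Alice", "Bob"]))  # False
print(names_match(extract_answer(listed, prompt), ["Alice", "Bob"]))  # True
```

With this scoring an echoed sentence no longer counts as a correct answer, at the cost of rejecting answers that are right but formatted differently.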

**Submit this exercise by posting your code and your answers to the above questions as comments on the MOOC platform. You can return this Jupyter notebook (.ipynb) or a .py, .R, etc. file, depending on your programming preferences.**