# Generative AI Use Case: Summarize Dialogue

Welcome to the practical side of this course. In this lab you will perform a dialogue summarization task using generative AI. You will explore how the input text affects the model's output and use prompt engineering to steer the model toward the task you need. By comparing zero shot, one shot, and few shot inference, you will take a first step into prompt engineering and see how it can improve the generative output of Large Language Models.

# Table of Contents

- [ 1 - Set up Kernel and Required Dependencies](#1)
- [ 2 - Summarize Dialogue without Prompt Engineering](#2)
- [ 3 - Summarize Dialogue with an Instruction Prompt](#3)
  - [ 3.1 - Zero Shot Inference with an Instruction Prompt](#3.1)
  - [ 3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5](#3.2)
- [ 4 - Summarize Dialogue with One Shot and Few Shot Inference](#4)
  - [ 4.1 - One Shot Inference](#4.1)
  - [ 4.2 - Few Shot Inference](#4.2)
- [ 5 - Generative Configuration Parameters for Inference](#5)


<a name='1'></a>
## 1 - Set up Kernel and Required Dependencies

First, check that the correct kernel is chosen.

<img src="images/kernel_set_up.png" width="300"/>

The following code checks the compute instance type to ensure there are enough compute resources to run this lab.


In [None]:
def verify_m5_2xlarge():
    import subprocess, psutil
    vcpus = int(subprocess.getoutput("nproc"))
    mem_gib = psutil.virtual_memory().total / 1024**3
    assert vcpus == 8 and abs(mem_gib - 32) < 2, (
        "ERROR: Wrong instance type. Select ml.m5.2xlarge (8 vCPUs, 32 GiB). "
        f"Current: {vcpus} vCPUs, {mem_gib:.1f} GiB\n"
        "Fix: File->Shut Down, then select ml.m5.2xlarge"
    )
    print("Instance verified ✓")
verify_m5_2xlarge()

Instance verified ✓


Now install the required packages to use PyTorch and Hugging Face transformers and datasets.



In [None]:
# First upgrade pip
%pip install --upgrade pip

# Install tensorflow and keras first
%pip install tensorflow==2.18.0 keras==3.9.0

# Install torch and torchdata
%pip install --no-deps torch==2.5.1 torchdata==0.6.0 --quiet

# Then install other packages except TRL
%pip install -U \
    datasets==2.17.0 \
    transformers==4.38.2 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.2 requires nvidia-ml-py3==7.352.0, which is not installed.
jupyter-ai 2.30.0 requires faiss-cpu!=1.8.0.post0,<2.0.0,>=1.8.0, which is not installed.
autogluon-multimodal 1.2 requires jsonschema<4.22,>=4.18, but you have jsonschema 4.23.0 which is incompatible.
autogluon-multimodal 1.2 requires nltk<3.9,>=3.4.5, but you have nltk 3.9.1 which is incompatible.
autogluon-multimodal 1.2 requires omegaconf<2.3.0,>=2.1.1, but you have omegaconf 2.3.0 which is incompatible.
pathos 0.3.3 requires dill>=0.3.9, but you have dill 0.3.8 which is incompatible.
pathos 0.3.3 requires multiprocess>=0.70.17, b



Load the datasets, Large Language Model (LLM), tokenizer, and configurator. Do not worry if you do not yet understand all of these components; they will be described and discussed later in the notebook.

In [None]:
# ────────────────────────────────────────────────────────────────
# DATA LOADER
# ────────────────────────────────────────────────────────────────
from datasets import load_dataset       # Hugging Face helper that grabs a
                                        # dataset (local file, URL, or HF Hub)
                                        # and turns it into a fast, memory-
                                        # mapped object you can iterate over
                                        # in mini-batches.  One-liner for
                                        # reading MIMIC-IV CSV, JSON, Parquet,
                                        # etc.

# ────────────────────────────────────────────────────────────────
# MODEL CLASS: sequence-to-sequence (T5, FLAN-T5, BART, …)
# ────────────────────────────────────────────────────────────────
from transformers import AutoModelForSeq2SeqLM   # “Auto” picks the right
                                                 # concrete class once you
                                                 # pass model_name (e.g.,
                                                 # "google/flan-t5-base").
                                                 # All encoder-decoder models
                                                 # that generate **new text**
                                                 # (summaries, translations)
                                                 # fall under Seq2SeqLM.

# ────────────────────────────────────────────────────────────────
# TOKENIZER
# ────────────────────────────────────────────────────────────────
from transformers import AutoTokenizer          # Converts raw strings ↔︎ lists
                                                # of integer IDs.  Must be the
                                                # *exact* tokenizer that matches
                                                # the model checkpoint; the
                                                # Auto* helper ensures that.

# ────────────────────────────────────────────────────────────────
# GENERATION SETTINGS
# ────────────────────────────────────────────────────────────────
from transformers import GenerationConfig       # A tidy container for all the
                                                # decoding knobs: max_new_tokens,
                                                # temperature, top_k / top_p,
                                                # penalty_alpha (contrastive
                                                # search), etc.  You can save
                                                # a config once and reuse it
                                                # across multiple generate()
                                                # calls.


<a name='2'></a>
## 2 - Summarize Dialogue without Prompt Engineering

In this use case, you will be generating a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging Face. The list of available models in the Hugging Face `transformers` package can be found [here](https://huggingface.co/docs/transformers/index).

Let's load some simple dialogues from the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. This dataset contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [None]:
# ------------------------------------------------------------------
# 1) Pick a dataset that already lives on the Hugging Face Hub.
#    "knkarthick/dialogsum" is a public repo containing a benchmark
#    of short human-written dialogues + reference summaries.
#    (Perfect sandbox before you jump to the larger MIMIC-IV notes.)
# ------------------------------------------------------------------
huggingface_dataset_name = "knkarthick/dialogsum"   # <hub_owner>/<repo_name>

# ------------------------------------------------------------------
# 2) Download (or read from cache) and return a DatasetDict object.
#    • If this is your first run, the files are pulled over HTTP,
#      decompressed, then stored under ~/.cache/huggingface/datasets.
#    • Next time, load_dataset() will grab them locally—no re-download.
#
#    Result structure:
#      dataset["train"]   →  the largest split (10,000+ examples)
#      dataset["validation"]
#      dataset["test"]
#
#    Each split behaves like a Python list of dicts, but with vectorized
#    methods such as .map(), .shuffle(), .select(), .train_test_split().
# ------------------------------------------------------------------
dataset = load_dataset(huggingface_dataset_name)


Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]


Print a couple of dialogues with their baseline summaries.

In [None]:
# ------------------------------------------------------------
# 1) Pick which rows you want to print from the test set.
#    • Each row in dataset["test"] is one dialogue-summary pair.
#    • Row 0 = the very first example, row 1 = the next, etc.
#    • Here we choose rows 40 and 200 just to show variety.
# ------------------------------------------------------------
example_indices = [40, 200]  # <- change to [0, 1] to see the first two rows
                             #    or [5] to see only row 5, etc.

# ──────────────────────────────────────────────────────────
# 2) Build a long dashed line "----------…"
#    • The generator ('' for x in range(100)) yields 100 empty
#      strings.  Joining them with '-' puts a dash between each
#      empty string, giving you 99 dashes total.
#    • A simpler equivalent would be '-' * 99, but the lab author
#      chose .join(); we keep it and explain the trick.
# ──────────────────────────────────────────────────────────
dash_line = '-'.join('' for _ in range(100))   # visual separator

# ──────────────────────────────────────────────────────────
# 3) Loop over our chosen indices and print a nicely-formatted
#    preview so you can eyeball what the raw dialogue looks like
#    and compare it to the human-written reference summary.
# ──────────────────────────────────────────────────────────
for i, idx in enumerate(example_indices):

    # Header line
    print(dash_line)
    print(f'Example {i + 1}')        # friendly counter: 1, 2, …

    # Another separator for readability
    print(dash_line)

    # ---- Show the full dialogue --------------------------------
    print('INPUT DIALOGUE:')
    print(dataset['test'][idx]['dialogue'])    # raw multi-turn convo
    print(dash_line)

    # ---- Show the gold-standard summary ------------------------
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][idx]['summary'])     # written by annotators
    print(dash_line)

    # Blank line between examples
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Exa
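
The separator used in the cell above relies on how `str.join` treats empty strings. The following is a minimal sketch to confirm its length; the name `same_line` is introduced here only for comparison and is not part of the lab code.

```python
# Joining N empty strings with '-' places one dash in each of the N-1 gaps.
dash_line = '-'.join('' for _ in range(100))

# Direct construction of the same separator (comparison only).
same_line = '-' * 99

print(len(dash_line))          # number of dashes in the separator
print(dash_line == same_line)  # whether the two constructions agree
```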

Load the [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5), creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method.

In [None]:
# ------------------------------------------------------------
# 1) Pick which checkpoint (i.e., “pre-trained model file”) you
#    want from the Hugging Face Hub.
#    • "google/flan-t5-base" = ~250 M-parameter FLAN-T5, already
#      fine-tuned by Google to follow instructions in plain English.
#      (Good balance of quality vs. GPU/CPU memory.)
# ------------------------------------------------------------
model_name = "google/flan-t5-base"   # try "google/flan-t5-large" for higher quality,
                                     # or "google/flan-t5-small" if you have very little RAM.


# ------------------------------------------------------------
# 2) Download the weights + configuration (or read them from
#    ~/.cache if you’ve done it before) and build a PyTorch model
#    object that’s ready for inference.
#    • AutoModelForSeq2SeqLM   → “Auto” figures out the exact class
#      (here, T5ForConditionalGeneration) based on the checkpoint.
#    • .from_pretrained(...)   → handles:
#         – fetching files from the Hub
#         – reading config.json (d_model, n_layers, etc.)
#         – loading the weight tensors into the architecture
# ------------------------------------------------------------
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# (Optional) move to GPU if available
# model.to("cuda")




config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

To perform encoding and decoding, you need to work with text in tokenized form. **Tokenization** is the process of splitting text into smaller units that the LLM can process.

Download the tokenizer for the FLAN-T5 model using the `AutoTokenizer.from_pretrained()` method. The `use_fast` parameter switches on the fast tokenizer. There is no need to go into the details at this stage, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).
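
To make the text-to-IDs idea concrete before using the real tokenizer, here is a toy sketch. It assumes a made-up whitespace vocabulary (`toy_vocab`) and hypothetical helpers (`toy_encode`, `toy_decode`); FLAN-T5's actual tokenizer uses SentencePiece sub-word units, so only the encode/decode round trip is meant to carry over.

```python
# Toy illustration only: a whitespace "tokenizer" with a made-up vocabulary.
# This is NOT how the FLAN-T5 tokenizer splits text.
toy_vocab = {"</s>": 1, "What": 2, "time": 3, "is": 4, "it?": 5}
id_to_token = {i: t for t, i in toy_vocab.items()}

def toy_encode(text):
    """Map each whitespace-separated piece to its ID, then append the end-of-sequence ID."""
    return [toy_vocab[piece] for piece in text.split()] + [toy_vocab["</s>"]]

def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to text, optionally dropping the end-of-sequence marker."""
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t != "</s>"]
    return " ".join(tokens)

ids = toy_encode("What time is it?")
print(ids)              # integer IDs, ending with the end-of-sequence ID
print(toy_decode(ids))  # back to the original text
```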

In [None]:
# ------------------------------------------------------------
# TOKENIZER: turns raw text  ←→  integer IDs that the model expects
#
# • AutoTokenizer            → picks the correct tokenizer class
#                              (T5Tokenizer, LlamaTokenizer, etc.)
#   based solely on the same
#   `model_name` we used for the weights.
#
# • .from_pretrained(...)    → downloads (or caches) the tokenizer
#                              files: spiece.model, tokenizer.json,
#                              special tokens map, tokenizer config, etc.
#
# • use_fast=True            → prefer the “fast” Rust-based tokenizer
#                              (tokenizers library) which is:
#                                – 5-10× quicker than the pure-Python version
#                                – functionally identical for most models
#                              Falls back to the slow version if a fast
#                              implementation doesn’t exist for this checkpoint.
# ------------------------------------------------------------
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Test the tokenizer encoding and decoding a simple sentence:

In [None]:
# ------------------------------------------------------------
# 1) Raw text we want to feed into (or recover from) the model.
#    Feel free to change this string to anything you like.
# ------------------------------------------------------------
sentence = "What time is it, Tom?"


# ------------------------------------------------------------
# 2) ENCODE  (text → integer IDs)
#    • tokenizer(sentence, return_tensors='pt')
#        – Splits the text into sub-word pieces
#          (e.g., "▁What", "▁time", "▁is", …).
#        – Maps each piece to its numeric ID from the model’s
#          vocabulary file.
#        – Adds special tokens if needed (here, the end-of-sequence
#          token </s> with ID 1).
#        – return_tensors='pt'  → wrap the result in PyTorch
#          tensors so it can go straight into the model.
#
#    The output is a dict-like object such as:
#        {
#          "input_ids":      tensor([[ 363,   97,   19,   34,    6, 3059,   58,    1]]),
#          "attention_mask": tensor([[   1,    1,    1,    1,    1,    1,    1,    1]])
#        }
# ------------------------------------------------------------
sentence_encoded = tokenizer(sentence, return_tensors='pt')


# ------------------------------------------------------------
# 3) DECODE  (IDs → text, mainly for sanity-checking)
#    • tokenizer.decode(ids, skip_special_tokens=True)
#        – Converts the list of IDs back to readable tokens.
#        – skip_special_tokens=True  drops things like <pad> or </s>
#          so you see only the original words.
#
#    We grab the first (and only) row with [0] because
#    sentence_encoded["input_ids"] has shape [batch, seq_len].
# ------------------------------------------------------------
sentence_decoded = tokenizer.decode(
    sentence_encoded["input_ids"][0],
    skip_special_tokens=True
)


# ------------------------------------------------------------
# 4) Print both versions so you can verify round-tripping works.
# ------------------------------------------------------------
print('ENCODED SENTENCE (token IDs):')
print(sentence_encoded["input_ids"][0])

print('\nDECODED SENTENCE (back to text):')
print(sentence_decoded)


ENCODED SENTENCE:
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

DECODED SENTENCE:
What time is it, Tom?


Now it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. **Prompt engineering** is the practice of changing the **prompt** (input) to improve the model's response for a given task.

In [None]:
# ------------------------------------------------------------------
# ITERATE THROUGH the sample indices and have the model
# produce a summary for each dialogue.
#
# ─ Variables used ────────────────────────────────────────────────
# example_indices : list of row numbers we picked earlier
# dataset         : Hugging Face DatasetDict already loaded
# tokenizer       : converts text ↔ token IDs
# model           : FLAN-T5 network we loaded
# dash_line       : a long "------" string for nicer console output
# ------------------------------------------------------------------
for i, index in enumerate(example_indices):

    # --------------------------------------------------------------
    # 1) FETCH raw data from the test split
    #    • Each item is a dict with keys: "dialogue", "summary", …
    # --------------------------------------------------------------
    dialogue = dataset['test'][index]['dialogue'] # full conversation
    summary = dataset['test'][index]['summary'] # human summary

    # --------------------------------------------------------------
    # 2) ENCODE the dialogue so the model can understand it
    #    return_tensors='pt'  → get PyTorch tensors (not lists)
    # --------------------------------------------------------------
    inputs = tokenizer(dialogue, return_tensors='pt')

    # --------------------------------------------------------------
    # 3) GENERATE a summary with the model
    #    • model.generate(ids, max_new_tokens=50)
    #        – takes the encoded input IDs
    #        – autoregressively produces up to 50 new tokens
    #    • The output is a tensor of token IDs → decode back to text
    # --------------------------------------------------------------
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], # prompt IDs
            max_new_tokens=50, # cap length of generated summary
        )[0],  # [0] because batch-size is 1
        skip_special_tokens=True # drop <pad>, </s>, etc.
    )

    # --------------------------------------------------------------
    # 4) PRINT everything in a readable block
    # --------------------------------------------------------------
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION - WITHOUT PROMPT ENGINEERING:
Person1: It's ten to nine.

-------------------------------

You can see that the model's output makes some sense, but the model does not seem sure what task it is supposed to accomplish. It appears to simply make up the next sentence in the dialogue. Prompt engineering can help here.

<a name='3'></a>
## 3 - Summarize Dialogue with an Instruction Prompt

Prompt engineering is an important concept in using foundation models for text generation. You can check out [this blog](https://www.amazon.science/blog/emnlp-prompt-engineering-is-the-new-feature-engineering) from Amazon Science for a quick introduction to prompt engineering.

<a name='3.1'></a>
### 3.1 - Zero Shot Inference with an Instruction Prompt

To instruct the model to perform a task (here, summarizing a dialogue), you can wrap the dialogue in an instruction prompt. This is often called **zero shot inference** because the prompt contains no worked examples, only the task description. You can check out [this blog from AWS](https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/) for a quick description of what zero shot learning is and why it is an important concept for LLMs.

Wrap the dialogue in a descriptive instruction and see how the generated text will change:

In [None]:
# ------------------------------------------------------------------
# ZERO-SHOT PROMPT ENGINEERING EXAMPLE
#   • Same idea as the previous cell, *but* we wrap the dialogue
#     inside an explicit instruction:  “Summarize the following…”.
#   • That single natural-language instruction often makes FLAN-T5
#     produce much cleaner summaries (called “zero-shot” because we
#     give *no* example summaries—just the task description).
# ------------------------------------------------------------------


for i, index in enumerate(example_indices):
    # --------------------------------------------------------------
    # 1) Pull the raw dialogue + its human summary from the test set
    # --------------------------------------------------------------
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    # --------------------------------------------------------------
    # 2) Build an instruction-style prompt
    #    • f""" … """  is a multi-line f-string: we can insert the
    #      dialogue variable right inside the string.
    #    • The trailing word “Summary:” nudges the model to start
    #      writing immediately after that label.
    # --------------------------------------------------------------
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
    """

    # --------------------------------------------------------------
    # 3) ENCODE the prompt and GENERATE up to 50 new tokens
    # --------------------------------------------------------------
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], # encoded prompt
            max_new_tokens=50, # stop after 50 generated tokens
        )[0],  # batch size = 1 → grab row 0
        skip_special_tokens=True # remove <pad>, </s>, etc.
    )

    # --------------------------------------------------------------
    # 4) Pretty-print everything side by side for inspection
    # --------------------------------------------------------------
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

Summary:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
The train is about to

This is much better! But the model still does not pick up on the nuance of the conversation.

**Exercise:**

- Experiment with the `prompt` text and see how the inferences change. Do the inferences change if you end the prompt with an empty string instead of `Summary: `?
- Try rephrasing the beginning of the `prompt` text from `Summarize the following conversation.` to something different, and see how that influences the generated output.
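
One way to organize this exercise is to generate the prompt variants programmatically. The sketch below is only a starting point: `build_zero_shot_prompt` is a hypothetical helper that is not part of the lab code, its exact whitespace differs slightly from the prompt in the cell above, and the rephrased instruction is an arbitrary example.

```python
# Hypothetical helper for the exercise; not part of the original lab code.
def build_zero_shot_prompt(dialogue,
                           instruction="Summarize the following conversation.",
                           ending="Summary:"):
    """Return instruction + dialogue, optionally followed by an ending label."""
    parts = [instruction, dialogue]
    if ending:  # an empty string means "no trailing label"
        parts.append(ending)
    return "\n\n".join(parts) + "\n"

# Short excerpt of the dialogue shown earlier, used as stand-in input.
sample_dialogue = (
    "#Person1#: What time is it, Tom?\n"
    "#Person2#: Just a minute. It's ten to nine by my watch."
)

prompt_variants = {
    "with 'Summary:' label": build_zero_shot_prompt(sample_dialogue),
    "no trailing label": build_zero_shot_prompt(sample_dialogue, ending=""),
    "rephrased instruction": build_zero_shot_prompt(
        sample_dialogue,
        instruction="Briefly describe what happens in this dialogue.",
    ),
}

for name, prompt in prompt_variants.items():
    print(f"--- {name} ---")
    print(prompt)
```

Each string can then be passed through `tokenizer` and `model.generate()` in the same way as in the cell above, so that the generated summaries can be compared side by side.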

<a name='3.2'></a>
### 3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5

Let's use a slightly different prompt. FLAN-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/tree/main/flan/v2). In the following code, you will use one of the [pre-built FLAN-T5 prompts](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py):

In [None]:
# ------------------------------------------------------------------
# ZERO-SHOT, *CONVERSATIONAL* PROMPT
#   • Same goal (get the model to summarise) but phrased as a casual
#     question:  “What was going on?”  Sometimes open-ended wording
#     elicits a more narrative answer than the explicit “Summarize…”.
# ------------------------------------------------------------------
for i, index in enumerate(example_indices):

    # --------------------------------------------------------------
    # 1) Fetch the dialogue text + its human-written summary
    # --------------------------------------------------------------
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    # --------------------------------------------------------------
    # 2) Build an *informal* prompt
    #    • Section label “Dialogue:” keeps the raw text readable.
    #    • Followed by an open question: “What was going on?”
    # --------------------------------------------------------------
    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    # --------------------------------------------------------------
    # 3) Tokenise the prompt and generate up to 50 tokens
    # --------------------------------------------------------------
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Tom is late for the train.

--------------

Notice that this prompt from FLAN-T5 did help a bit, but the model still struggles to pick up on the nuance of the conversation. This is what you will try to solve with few shot inference.

<a name='4'></a>
## 4 - Summarize Dialogue with One Shot and Few Shot Inference

**One shot and few shot inference** are the practices of providing an LLM with one or several full examples of prompt-response pairs that match your task, placed before the actual prompt that you want completed. This is called "in-context learning": the examples condition the model on your specific task.  You can read more about it in [this blog from HuggingFace](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api).

<a name='4.1'></a>
### 4.1 - One Shot Inference

Let's build a function that takes a list of `example_indices_full`, generates a prompt with full examples, then at the end appends the prompt which you want the model to complete (`example_index_to_summarize`).  You will use the same FLAN-T5 prompt template from section [3.2](#3.2).

In [None]:
# ------------------------------------------------------------------
# FEW-SHOT PROMPT CONSTRUCTOR
#   • Purpose: build *one big string* that contains N “demo” pairs
#     (dialogue + correct summary) **plus** one final dialogue the
#     model must summarise itself.
#   • This is called *few-shot prompting*: we show the model examples
#     of the task so it can mimic the pattern.
# ------------------------------------------------------------------

def make_prompt(example_indices_full, example_index_to_summarize):
    """
    Parameters
    ----------
    example_indices_full : list[int]
        Dataset rows we will use as *demonstrations*.  Each one adds:
            Dialogue
            + question “What was going on?”
            + the *answer* (human summary)

    example_index_to_summarize : int
        Row that will *not* include an answer.  The model will have to
        generate the summary for this final dialogue.

    Returns
    -------
    prompt : str
        A single, multi-line prompt ready to feed into the tokenizer.
        Format:
            Dialogue
            Question
            Human summary
            <blank line>
        … repeated for each demo, and finally
            Dialogue
            Question
            ←-- model continues here
    """

    prompt = '' # start with an empty string

    # --------------------------------------------------------------
    # 1) Add *demonstration* examples (dialogue + known summary)
    # --------------------------------------------------------------
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The triple line-break after {summary} acts as a *stop
        # sequence* for FLAN-T5: it separates one QA pair from the
        # next so the model learns the boundary. Other models may
        # have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""
    # --------------------------------------------------------------
    # 2) Append the *new* dialogue that needs summarising
    #    (no answer provided — model must fill this in)
    # --------------------------------------------------------------
    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:

{dialogue}

What was going on?
""" # <-- we stop after the question; model will continue here

    return prompt

Construct the prompt to perform one shot inference:

In [None]:
# ------------------------------------------------------------
# EXAMPLE: BUILD A **ONE-SHOT** PROMPT
#   • “One-shot” = give the model ONE demo pair (dialogue + summary)
#     before asking it to summarise a new dialogue by itself.
# ------------------------------------------------------------

# Row numbers for our *demonstration* set.
# Here we choose row 40 (any single row works).
example_indices_full = [40]           # ← 1 demo dialogue + its summary

# Row number the model must now summarise (no answer provided).
example_index_to_summarize = 200      # ← target dialogue

# ------------------------------------------------------------
# Call the helper we wrote earlier to assemble the big prompt:
#   [Dialogue 40 + Q + Summary]
#   <blank line>
#   [Dialogue 200 + Q]    ← model will continue from here
# ------------------------------------------------------------
one_shot_prompt = make_prompt(example_indices_full,  # demo rows
                              example_index_to_summarize # unseen row
                              )

# ------------------------------------------------------------
# Inspect the resulting prompt text so you can verify the format
# before sending it through the model.
# ------------------------------------------------------------
print(one_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also ne

Now pass this prompt to perform the one shot inference:

In [None]:
# ------------------------------------------------------------
# 1) GRAB THE GOLD-STANDARD SUMMARY
#    • We already picked example_index_to_summarize = 200.
#    • Pull the human-written summary for that row so we can
#      compare the model’s one-shot output to the reference.
# ------------------------------------------------------------
summary = dataset['test'][example_index_to_summarize]['summary']

# ------------------------------------------------------------
# 2) TOKENISE THE ONE-SHOT PROMPT AND GENERATE A SUMMARY
#    • one_shot_prompt  was built in the previous cell:
#         [demo dialogue + answer]  +  target dialogue + question
#    • model.generate(...)
#         – reads the full prompt (demo + new dialogue)
#         – continues writing after the final question
#    • max_new_tokens=50  caps the length of the model’s answer
# ------------------------------------------------------------
inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

# ------------------------------------------------------------
# 3) SHOW BASELINE VS. MODEL OUTPUT SIDE-BY-SIDE
# ------------------------------------------------------------
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to add a CD-ROM drive.


<a name='4.2'></a>
### 4.2 - Few Shot Inference

Let's explore few shot inference by adding two more full dialogue-summary pairs to your prompt.

In [None]:
# ------------------------------------------------------------
# EXAMPLE: BUILD A **FEW-SHOT** PROMPT (3 demos + 1 target)
#   • “Few-shot” = give the model SEVERAL demonstration pairs
#     (dialogue + correct summary) before asking it to summarise
#     a new dialogue by itself.
# ------------------------------------------------------------

# Row numbers for *demonstration* examples (3 in this case).
# Feel free to swap in any other valid row indices.
example_indices_full = [40, 80, 120]      # 3 demo dialogues + summaries

# Row number that the model must summarise (no answer provided).
example_index_to_summarize = 200          # 1 target dialogue

# ------------------------------------------------------------
# 1) Call make_prompt() to assemble the big prompt:
#      [Dialogue 40 + Q + Summary 40]
#      [Dialogue 80 + Q + Summary 80]
#      [Dialogue 120 + Q + Summary 120]
#      [Dialogue 200 + Q]   <-- model will continue here
# ------------------------------------------------------------
few_shot_prompt = make_prompt(
    example_indices_full,                 # demo rows
    example_index_to_summarize            # target row
)

# ------------------------------------------------------------
# 2) Print the resulting prompt so you can verify the format
#    before passing it through the model.  You should see three
#    complete dialogue–summary pairs, followed by a *fourth*
#    dialogue ending right after the question “What was going on?”
# ------------------------------------------------------------
print(few_shot_prompt)



Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: May, do you mind helping me prepare for the picnic?
#Person2#: Sure. Have you checked the weather report?
#Person1#: Yes. It says it will be sunny all day. No sign of rain at all. This is your father's favorite sausage. Sandwiches for you and Daniel.
#Person2#: No, thanks Mom. I'd like some toast and chicken wings.
#Person1#: Okay. Please take some fruit salad and crackers for me.
#Person2#: Done. Oh, don't forget to take napkins disposable plates, cups and picnic blanket.
#Person1#: All set. 

Now pass this prompt to perform a few shot inference:

In [None]:
# ------------------------------------------------------------
# EVALUATE THE MODEL WITH **FEW-SHOT** PROMPTING
#   • few_shot_prompt   already contains 3 demo Q-A pairs
#     plus the target dialogue (row 200) with no answer.
#   • We’ll run the model, then compare its output to the
#     ground-truth human summary for row 200.
# ------------------------------------------------------------

# ------------------------------------------------------------
# 1) Fetch the *reference* (gold-standard) summary for row 200
#    so we have something to compare the model’s answer against.
# ------------------------------------------------------------
summary = dataset['test'][example_index_to_summarize]['summary']


# ------------------------------------------------------------
# 2) TOKENISE the few-shot prompt and let the model GENERATE
#    up to 50 new tokens (its predicted summary).
# ------------------------------------------------------------
inputs = tokenizer(few_shot_prompt, return_tensors='pt') # text → IDs → tensors
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"], # encoded prompt (demo + target dialogue)
        max_new_tokens=50, # cut off after 50 generated tokens
    )[0],
    skip_special_tokens=True
)

# ------------------------------------------------------------
# 3) Print baseline vs. model output side-by-side for inspection
# ------------------------------------------------------------
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (819 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to upgrade his hardware.


In this case, few shot inference did not improve much on one shot inference, and going beyond 5 or 6 shots typically does not help much either. You also need to make sure that the prompt does not exceed the model's input context length, which in our case is 512 tokens. Anything beyond the context length will be ignored. Note the tokenizer warning in the output above: the three-shot prompt is 819 tokens long, so it already exceeds this limit.

However, you can see that feeding in at least one full example (one shot) provides the model with more information and qualitatively improves the summary overall.
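The context-length check mentioned above can be automated. The following is a minimal sketch, assuming the demonstrations are already formatted as strings; `fit_demos_to_context` and `whitespace_count` are hypothetical names, and whitespace splitting is only a stand-in for a real tokenizer, so the counts are approximate.

```python
def fit_demos_to_context(demos, target, count_tokens, max_tokens=512):
    """Drop trailing demonstrations until the assembled prompt fits the budget.

    demos        : list[str], already formatted demonstration blocks
    target       : str, the final unsolved block (always kept)
    count_tokens : callable mapping a string to its token count
    max_tokens   : int, the input budget (512 is the figure quoted in this lab)
    """
    kept = list(demos)
    while kept and count_tokens("".join(kept) + target) > max_tokens:
        kept.pop()  # remove the last demonstration and try again
    prompt = "".join(kept) + target
    return prompt, len(kept), count_tokens(prompt)


# Stand-in tokenizer (assumption): whitespace splitting only approximates
# the token counts a real subword tokenizer would report.
def whitespace_count(text):
    return len(text.split())


demo_blocks = [("word " * 200) + "\n\n\n" for _ in range(3)]  # three 200-word demos
target_block = "word " * 100                                   # one 100-word target

prompt, n_kept, n_tokens = fit_demos_to_context(
    demo_blocks, target_block, whitespace_count, max_tokens=512
)
print(f"kept {n_kept} demo(s), {n_tokens} tokens")
```

In the notebook you could pass `lambda text: len(tokenizer(text)["input_ids"])` as `count_tokens` to measure the prompt with the lab's own tokenizer instead of the whitespace approximation.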

**Exercise:**

Experiment with few shot inference.
- Choose different dialogues: change the indices in the `example_indices_full` list and the `example_index_to_summarize` value.
- Change the number of shots. Be sure to stay within the model's 512-token context length, however.

How well does few shot inferencing work with other examples?
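One rough way to go beyond eyeballing the outputs is to count how many distinct words of the human summary reappear in the model's summary. The `unigram_recall` helper below is a hypothetical, deliberately crude measure (it is not ROUGE and ignores word order); the three strings are copied from the outputs printed earlier in this lab.

```python
import re


def words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_recall(candidate, reference):
    """Fraction of distinct reference words that also appear in the candidate."""
    ref_words = words(reference)
    if not ref_words:
        return 0.0
    return len(ref_words & words(candidate)) / len(ref_words)


# Strings copied from the outputs printed earlier in this lab.
baseline = ("#Person1# teaches #Person2# how to upgrade software and hardware "
            "in #Person2#'s system.")
one_shot_output = ("#Person1 wants to upgrade his system. #Person2 wants to add a painting "
                   "program to his software. #Person1 wants to add a CD-ROM drive.")
few_shot_output = ("#Person1 wants to upgrade his system. #Person2 wants to add a painting "
                   "program to his software. #Person1 wants to upgrade his hardware.")

one_shot_score = unigram_recall(one_shot_output, baseline)
few_shot_score = unigram_recall(few_shot_output, baseline)
print(f"one shot recall: {one_shot_score:.2f}")
print(f"few shot recall: {few_shot_score:.2f}")
```

A score like this is only a sanity check to accompany your own reading of the summaries; it rewards shared vocabulary, not correctness.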

<a name='5'></a>
## 5 - Generative Configuration Parameters for Inference

You can change the configuration parameters of the `generate()` method to see a different output from the LLM. So far, the only parameter you have set is `max_new_tokens=50`, which defines the maximum number of tokens to generate. A full list of available parameters can be found in the [Hugging Face Generation documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig).

A convenient way of organizing the configuration parameters is to use the `GenerationConfig` class.

**Exercise:**

Change the configuration parameters to investigate their influence on the output.

Setting `do_sample=True` switches generation from greedy decoding to sampling: the next token is drawn from the probability distribution over the entire vocabulary instead of always being the most likely one. You can then shape that distribution with `temperature` and other parameters such as `top_k` and `top_p`.
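The effect of these parameters can be illustrated on a toy four-word vocabulary. This is a conceptual sketch in plain Python, not the Hugging Face implementation; the `softmax` and `filter_top_k_top_p` helpers, the vocabulary, and the logits are all made up for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top-k tokens and the top-p nucleus, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]                # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:          # smallest set whose mass reaches top_p
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}


vocab = ["train", "bus", "plane", "boat"]
logits = [2.0, 1.0, 0.5, -1.0]               # toy scores for the next token

# Greedy decoding: always take the highest-scoring token.
greedy = vocab[max(range(len(logits)), key=lambda i: logits[i])]
print("greedy pick:", greedy)

# Lower temperatures concentrate probability on the top token.
for t in (0.1, 0.5, 1.0):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: P(top token) = {probs[0]:.3f}")

# Sampling: draw from the filtered, renormalised distribution.
rng = random.Random(0)                       # fixed seed so the run is repeatable
candidates = filter_top_k_top_p(softmax(logits, 1.0), top_k=3, top_p=0.9)
ids, weights = zip(*candidates.items())
sampled = vocab[rng.choices(ids, weights=weights, k=1)[0]]
print("candidates:", [vocab[i] for i in ids], "sampled:", sampled)
```

At a low temperature nearly all the probability mass sits on the top token, so sampling behaves almost like greedy decoding; at higher temperatures the other tokens become realistic candidates.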

Uncomment one line at a time in the cell below and rerun the code, then analyze the results. Some comments on the parameter choices follow the cell.

In [None]:
# ------------------------------------------------------------------
# PLAYING WITH GENERATION SETTINGS
#
# Hugging Face provides a handy “GenerationConfig” object that bundles
# all decoding knobs (max tokens, sampling vs. greedy, temperature…).
# You create the config once and pass it into model.generate().
#
# Below are **five** example configs:
#   1) 50-token cap, greedy decoding          (deterministic)
#   2) 10-token cap, greedy (short summary)   (commented out)
#   3) 50-token cap, sampling  T=0.1  (very cautious / low diversity)
#   4) 50-token cap, sampling  T=0.5  (balanced creative/factual)
#   5) 50-token cap, sampling  T=1.0  (high diversity, risk of drift)
#
# Uncomment the one you want to test.
# ------------------------------------------------------------------
from transformers import GenerationConfig

generation_config = GenerationConfig(max_new_tokens=50)
# generation_config = GenerationConfig(max_new_tokens=10)                       # short & terse
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)


# ------------------------------------------------------------------
# 1) Encode the few-shot prompt (demo Q-A pairs + target dialogue)
# ------------------------------------------------------------------
inputs = tokenizer(few_shot_prompt, return_tensors='pt')

# ------------------------------------------------------------------
# 2) Generate the model’s summary using the chosen config
#    • model.generate(..., generation_config=…) reads every field
#      from the config (max tokens, sampling flags, temperature, etc.)
# ------------------------------------------------------------------
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

# ------------------------------------------------------------------
# 3) Display the model’s few-shot summary and the gold reference
# ------------------------------------------------------------------
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to upgrade his hardware.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.



Comments on the parameter choices in the code cell above:
- Choosing `max_new_tokens=10` makes the output too short, so the dialogue summary is cut off.
- Setting `do_sample=True` and varying the `temperature` value gives you more variety in the output: low values stay close to the greedy result, while higher values produce more diverse text.

As you can see, prompt engineering can take you a long way for this use case, but it has limitations. Next, you will start to explore how you can use fine-tuning to help your LLM understand a particular use case in greater depth!