# Text Generation

In this notebook we will look at 3 different examples:

1. Building a Seq2Seq model for machine translation using RNNs with and without Attention

2. Playing with T5 for summarization and translation

3. Exercise with prompts and language generation using the various models. Note that in this section you must stop and disconnect the notebook, then restart it, before running each specific model. The T4 GPU has 15 gigabytes of memory, and models as large as Qwen 3 cannot be loaded alongside others.

The sequence to sequence architecture is inspired by the Keras Tutorial https://keras.io/examples/nlp/lstm_seq2seq/.


<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Seq2Seq Model](#encoderDecoder)
    * 2.1 [Data Acquisition](#dataAcquisition)
    * 2.2 [Seq2Seq without Attention](#s2sNoAttention)
    * 2.3 [Seq2Seq with Attention](#s2sAttention)
  * 3. [T5](#t5Example)
    * 3.1 [Tokenization](#tokenization)
    * 3.2 [Model Structure & Output](#modelOutput)
  * 4. [Prompt Engineering and Generative Large Language Models](#prompts)
    * 4.1 [Cloze Prompts](#clozeExample)
    * 4.2 [Prefix Prompts](#prefixExample)
    * 4.3 [Instruction Tuned Prompts](#llama3)
    * 4.4 [ChatGPT](#chatgpt)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup

We first need to do the usual setup. We will also use some nltk and sklearn components in order to tokenize the text.

This notebook requires the tensorflow dataset and other prerequisites that you must download and then store locally.

In [1]:
#@title Installs

!pip install pydot --quiet
!pip install transformers --quiet
!pip install sentencepiece --quiet
!pip install nltk --quiet

In [2]:
#@title Imports

import numpy as np

import tensorflow as tf
from tensorflow import keras

import tensorflow_datasets as tfds

import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer

import os
import nltk

import matplotlib.pyplot as plt

import re
import textwrap

from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration
from transformers import GPT2Tokenizer, TFOPTForCausalLM

In [3]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

[Return to Top](#returnToTop)  
<a id = 'encoderDecoder'></a>


## 2. Building a Seq2Seq model for Translation using RNNs with and without Attention

### 2.1 Downloading and pre-processing Data


Let's get the data. Just like the Keras tutorial, we will use http://www.manythings.org as the source for the parallel corpus, but we will use German.  Machine translation requires sentence pairs for training, that is individual sentences in German and the corresponding sentence in English.

In [4]:
!!curl -O http://www.manythings.org/anki/deu-eng.zip
!!unzip deu-eng.zip

['Archive:  deu-eng.zip',
 '  inflating: deu.txt                 ',
 '  inflating: _about.txt              ']

Next, we need to set a few parameters.  Note these numbers are much smaller than we would set in a real world system.  For example, vocabulary sizes of 2000 and 3000 are unrealistic unless we were dealing with a highly specialized domain.

In [5]:
embed_dim = 100  # Embedding dimensions for vectors and LSTMs.
num_samples = 10000  # Number of examples to consider.

# Path to the data txt file on disk.
data_path = "deu.txt"

# Vocabulary sizes that we'll use:
english_vocab_size = 2000
german_vocab_size = 3000

Next, we need to format the input. In particular we would like to use nltk to help with the tokenization. We will then use sklearn's CountVectorizer to create a vocabulary from the most frequent words in each language.

(Before, we used pre-trained word embeddings from Word2Vec that came with a defined vocabulary. This time, we'll start from scratch, and need to extract the vocabulary from the training text.)

In [6]:
# Initialize empty lists to hold preprocessed text pairs
input_texts = []   # English sentences
target_texts = []  # German sentences

# Initialize max length trackers for both input and output sequences
max_input_length = -1
max_output_length = -1

# Open and read the dataset file (tab-separated, with 3 columns per line)
with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")  # Split the file into individual lines

# Iterate through each line, up to the specified number of samples
for line in lines[: min(num_samples, len(lines) - 1)]:
    # Each line has the format: English_text \t German_text \t metadata
    input_text, target_text, _ = line.split("\t")

    # Tokenize the English and German sentences into word tokens
    tokenized_source_text = nltk.word_tokenize(input_text, language='english')
    tokenized_target_text = nltk.word_tokenize(target_text, language='german')

    # Track the longest tokenized sequence (for padding or model input sizing)
    if len(tokenized_source_text) > max_input_length:
        max_input_length = len(tokenized_source_text)

    if len(tokenized_target_text) > max_output_length:
        max_output_length = len(tokenized_target_text)

    # Reconstruct tokenized text into lowercase, space-separated strings
    source_text = (' '.join(tokenized_source_text)).lower()
    target_text = (' '.join(tokenized_target_text)).lower()

    # Add processed text pairs to the lists
    input_texts.append(source_text)
    target_texts.append(target_text)

# Create English vocabulary using CountVectorizer (Bag-of-Words style)
vectorizer_english = CountVectorizer(max_features=english_vocab_size)
vectorizer_english.fit(input_texts)
vocab_english = vectorizer_english.get_feature_names_out()  # List of top English words

# Create German vocabulary using CountVectorizer
vectorizer_german = CountVectorizer(max_features=german_vocab_size)
vectorizer_german.fit(target_texts)
vocab_german = vectorizer_german.get_feature_names_out()  # List of top German words

# Output useful statistics for later model setup
print('Maximum source input length: ', max_input_length)
print('Maximum target output length: ', max_output_length)


Maximum source input length:  6
Maximum target output length:  10


### What `max_features` does here

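* `CountVectorizer.fit(...)` scans the corpus and ranks tokens by **overall frequency** after applying its own tokenization rules. With the default token pattern it only keeps tokens of two or more word characters, so punctuation and single-character tokens never enter the vocabulary.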
* `max_features=N` keeps only the **top N most frequent tokens** from that corpus.

  * So English keeps the top **2,000** tokens → `vocab_english` has length 2,000.
  * German keeps the top **3,000** tokens → `vocab_german` has length 3,000.
* Everything **not** in those top-N lists becomes **out-of-vocabulary (OOV)** and will map to your `<unk>` later.

### Why this matters downstream

* These top-N lists are the base vocabularies you map to IDs; then you **append specials**:

  * Source: size → `2000 + 2` (`<unk>`, `<pad>`)
  * Target: size → `3000 + 4` (`<unk>`, `<start>`, `<end>`, `<pad>`)
* The limits control:

  * **Model size** (embedding matrices, softmax head).
  * **Coverage vs. OOV rate** (smaller vocab → more `<unk>`).
  * **Training speed/memory**.

Setting English=2k and German=3k makes the vectorizers keep only the top-frequency 2,000/3,000 tokens for each side. That defines your base vocab sizes before you append special tokens, directly controlling memory/compute vs. OOV trade-offs.
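To make the effect of `max_features` concrete, here is a minimal sketch that mimics the top-N selection with `collections.Counter` rather than `CountVectorizer` itself. The toy corpus and the helper name `top_n_vocab` are illustrative assumptions, and the tie-breaking rule is chosen for determinism; it is not claimed to match scikit-learn's.

```python
from collections import Counter

def top_n_vocab(texts, n):
    """Return the n most frequent whitespace tokens, sorted alphabetically."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending count, then alphabetically, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(token for token, _ in ranked[:n])

toy_texts = ["go now", "go home", "go home now", "stay here"]
toy_vocab = top_n_vocab(toy_texts, n=3)

# Tokens outside the top-n list fall back to <unk>, which gets the next free ID.
token_to_id = {token: i for i, token in enumerate(toy_vocab)}
token_to_id["<unk>"] = len(toy_vocab)

encoded = [token_to_id.get(tok, token_to_id["<unk>"]) for tok in "go stay home".split()]
print(toy_vocab)  # ['go', 'home', 'now']
print(encoded)    # [0, 3, 1]
```

Here "stay" occurs only once, misses the top three, and is therefore encoded as `<unk>`. The same thing happens to every rare word once the 2,000/3,000 caps are applied.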


In [7]:
input_texts[:2]

['go .', 'hi .']

In [8]:
target_texts[:2]

['geh .', 'hallo !']

Looks simple but correct.

So the source and target sequences have the maximum lengths reported above (`max_input_length` and `max_output_length`). Because we will add start and end tokens (`<start>` and `<end>`) on the decoder side, we set the respective max lengths to:

In [9]:
max_encoder_seq_length = max_input_length                 # from your scan, = 6
max_decoder_seq_length = max_output_length + 2            # room for <start> and <end>, 10 + 2

Next, we create the dictionaries translating between integer ids and tokens for both source (English) and target (German).

In [60]:
# ---- Source (English) vocab <-> id mappings ---------------------------------

source_id_vocab_dict = {}  # maps integer IDs -> tokens
source_vocab_id_dict = {}  # maps tokens -> integer IDs

# Assign IDs to each English vocab item (0..english_vocab_size-1)
for sid, svocab in enumerate(vocab_english):
    source_id_vocab_dict[sid] = svocab
    source_vocab_id_dict[svocab] = sid

# Reserve extra IDs for special tokens (placed AFTER the base vocab range)
source_id_vocab_dict[english_vocab_size] = "<unk>"  # unknown token for OOV words
source_id_vocab_dict[english_vocab_size + 1] = "<pad>"  # padding token

# And the reverse mapping for source language
source_vocab_id_dict["<unk>"] = english_vocab_size
source_vocab_id_dict["<pad>"] = english_vocab_size + 1


# ---- Target (German) vocab <-> id mappings ----------------------------------

target_id_vocab_dict = {}  # maps integer IDs -> tokens
target_vocab_id_dict = {}  # maps tokens -> integer IDs

# Assign IDs to each German vocab item (0..german_vocab_size-1)
for tid, tvocab in enumerate(vocab_german):
    target_id_vocab_dict[tid] = tvocab
    target_vocab_id_dict[tvocab] = tid

# Add special tokens for the target side:
# <unk>  : unknown token (OOV handling)
# <start>: marks decoder start-of-sequence
# <end>  : marks decoder end-of-sequence
# <pad>  : padding for batching/sequencing alignment
target_id_vocab_dict[german_vocab_size]     = "<unk>"
target_id_vocab_dict[german_vocab_size + 1] = "<start>"
target_id_vocab_dict[german_vocab_size + 2] = "<end>"
target_id_vocab_dict[german_vocab_size + 3] = "<pad>"

# Reverse mapping for target language
target_vocab_id_dict["<unk>"]   = german_vocab_size
target_vocab_id_dict["<start>"] = german_vocab_size + 1
target_vocab_id_dict["<end>"]   = german_vocab_size + 2
target_vocab_id_dict["<pad>"]   = german_vocab_size + 3


### Creating Train/Test Data for the Seq2Seq Models

We need a small helper function to:

* Split data into **train** and **test** sets.
* **Pad** sequences to fixed lengths.
* Add **`<start>`/`<end>`** tokens on the **decoder** side.

This function will produce **three sequences of vocab IDs**:

1. **Encoder inputs (source language)**

   * What the encoder reads.
   * Composition: `source_tokens + <pad>...` (no start/end needed).

2. **Decoder inputs (target language, teacher forcing)**

   * What the decoder *receives* at each time step.
   * Composition: `<start> + target_tokens + <end> + <pad>...`

3. **Decoder labels (targets to predict)**

   * What the decoder should *predict* at each time step.
   * Composition: `target_tokens + <end> + <pad>...`
   * This is **shifted by one** relative to the decoder inputs so the model learns “given previous tokens, predict the next token.”
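As a small illustration of the three sequences, the sketch below builds them for one toy sentence pair. The tiny ID scheme and the helper `build_decoder_sequences` are made up for this example and are not the notebook's own helper.

```python
def build_decoder_sequences(target_ids, start_id, end_id, pad_id, max_length):
    """Return (decoder_input, decoder_labels), each padded or truncated to max_length."""
    pads = [pad_id] * max_length
    decoder_input = ([start_id] + target_ids + [end_id] + pads)[:max_length]
    decoder_labels = (target_ids + [end_id] + pads)[:max_length]
    return decoder_input, decoder_labels

# Hypothetical target IDs: base tokens 0..2, then <unk>=3, <start>=4, <end>=5, <pad>=6.
START, END, PAD = 4, 5, 6
dec_in, dec_out = build_decoder_sequences([0, 2], START, END, PAD, max_length=6)

# Hypothetical source side: two tokens, padded with a source <pad> id of 9.
SRC_PAD = 9
enc_in = ([7, 8] + [SRC_PAD] * 4)[:4]

print(enc_in)   # [7, 8, 9, 9]
print(dec_in)   # [4, 0, 2, 5, 6, 6]
print(dec_out)  # [0, 2, 5, 6, 6, 6]
```

Reading `dec_in` and `dec_out` column by column shows the shift: when the decoder receives `<start>` it should predict the first token, and when it receives the last token it should predict `<end>`.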



In [11]:
def convert_text_to_data(texts,
                         vocab_id_dict,
                         max_length=20,
                         type=None,
                         train_test_vector=None,
                         samples=100000):
    """
    Convert tokenized (space-delimited) texts into fixed-length ID sequences
    with task-specific special tokens (start/end/pad) and an optional
    train/test split mask.

    Args:
        texts: iterable of strings, already tokenized as space-separated tokens.
        vocab_id_dict: dict mapping token -> id. Must include "<unk>" and:
            - For source: "<pad>"
            - For target: "<start>", "<end>", "<pad>"
        max_length: output sequence length (after padding/truncation).
        type: one of {'input_source', 'input_target', 'output_target'}.
        train_test_vector: optional boolean mask (True=train, False=test)
                           aligned to the first len(texts[:samples]) items.
        samples: max number of examples to process.

    Returns:
        (np.array(train_data), np.array(test_data))
    """

    # Require an explicit type to avoid accidental misuse
    if type is None:
        raise ValueError("'type' is not defined. Please choose from: input_source, input_target, output_target.")

    train_data = []
    test_data = []

    # Process up to `samples` texts
    for text_num, text in enumerate(texts[:samples]):

        sentence_ids = []

        # Convert each token to its id; use <unk> for OOV
        unk_id = vocab_id_dict["<unk>"]
        for token in text.split():
            sentence_ids.append(vocab_id_dict.get(token, unk_id))

        # We rely on specials being appended to the end of the vocab.
        # That is:
        #   SOURCE: ... , <unk>=V, <pad>=V+1   => vocab_size = V+2; pad = vocab_size-1
        #   TARGET: ... , <unk>=V, <start>=V+1, <end>=V+2, <pad>=V+3
        #           => vocab_size = V+4; start=vocab_size-3, end=vocab_size-2, pad=vocab_size-1
        vocab_size = len(vocab_id_dict.keys())

        # Build final id sequence per task type, then pad/truncate to `max_length`.
        if type == 'input_source':
            # Encoder input: raw source tokens + PADs
            ids = (sentence_ids + [vocab_size - 1] * max_length)[:max_length]

        elif type == 'input_target':
            # Decoder input (teacher forcing): <START> + target tokens + <END> + PADs
            ids = ([vocab_size - 3] + sentence_ids + [vocab_size - 2] + [vocab_size - 1] * max_length)[:max_length]

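        elif type == 'output_target':
            # Decoder output (shifted target): target tokens + <END> + PADs
            ids = (sentence_ids + [vocab_size - 2] + [vocab_size - 1] * max_length)[:max_length]

        else:
            # Fail loudly on an unrecognized type instead of hitting an undefined `ids` below
            raise ValueError(f"Unknown type '{type}'. Please choose from: input_source, input_target, output_target.")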

        # If a split mask is provided and this item is marked False -> test set
        if train_test_vector is not None and not train_test_vector[text_num]:
            test_data.append(ids)
        else:
            train_data.append(ids)

    return np.array(train_data), np.array(test_data)


# Random 80/20 split mask for (up to) 10,000 items, any num > 0.2 will return "True"
train_test_split_vector = (np.random.uniform(size=10000) > 0.2)

# Build encoder inputs (source side)
train_source_input_data, test_source_input_data = convert_text_to_data(
    input_texts,
    source_vocab_id_dict,
    type='input_source',
    max_length=max_encoder_seq_length,
    train_test_vector=train_test_split_vector
)

# Build decoder inputs (target side, with <START> ... <END>)
train_target_input_data, test_target_input_data = convert_text_to_data(
    target_texts,
    target_vocab_id_dict,
    type='input_target',
    max_length=max_decoder_seq_length,
    train_test_vector=train_test_split_vector
)

# Build decoder outputs (target side, shifted, ends with <END>)
train_target_output_data, test_target_output_data = convert_text_to_data(
    target_texts,
    target_vocab_id_dict,
    type='output_target',
    max_length=max_decoder_seq_length,
    train_test_vector=train_test_split_vector
)


Because you **appended special tokens at the end of the vocab**, the `<pad>` token ends up with the **last** ID, which equals `vocab_size - 1`.

Quick recap from your earlier construction:

* **Source (English) side**

  * Base vocab IDs: `0 ... english_vocab_size-1`
  * You then added:

    * `<unk>` at `english_vocab_size`
    * `<pad>` at `english_vocab_size + 1`
  * So total `vocab_size = english_vocab_size + 2`
  * Therefore `<pad>` ID = `english_vocab_size + 1 = vocab_size - 1`

* **Target (German) side**

  * Base vocab IDs: `0 ... german_vocab_size-1`
  * You then added:

    * `<unk>` at `german_vocab_size`
    * `<start>` at `german_vocab_size + 1`
    * `<end>` at `german_vocab_size + 2`
    * `<pad>` at `german_vocab_size + 3`
  * So total `vocab_size = german_vocab_size + 4`
  * Therefore `<pad>` ID = `german_vocab_size + 3 = vocab_size - 1`

So, in both cases, `<pad>` is the last entry ⇒ its ID is `vocab_size - 1`.
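Because `convert_text_to_data` derives the special-token IDs from `vocab_size`, it silently depends on the specials being the final, consecutive entries. A minimal sanity check, sketched here on toy dictionaries (the helper name `check_specials_last` is hypothetical), makes that assumption explicit:

```python
def check_specials_last(vocab_id_dict, specials):
    """Raise if `specials` are not the final, consecutive IDs of the vocabulary."""
    vocab_size = len(vocab_id_dict)
    expected_first = vocab_size - len(specials)
    for offset, token in enumerate(specials):
        actual = vocab_id_dict[token]
        if actual != expected_first + offset:
            raise ValueError(f"{token} has id {actual}, expected {expected_first + offset}")
    return True

toy_target_vocab = {"geh": 0, "hallo": 1, "<unk>": 2, "<start>": 3, "<end>": 4, "<pad>": 5}
layout_ok = check_specials_last(toy_target_vocab, ["<unk>", "<start>", "<end>", "<pad>"])
pad_id = toy_target_vocab["<pad>"]
print(layout_ok, pad_id == len(toy_target_vocab) - 1)  # True True

# A vocabulary with a gap before the specials breaks the `vocab_size - 1` shortcut.
gappy_vocab = {"geh": 0, "<unk>": 5, "<pad>": 6}
try:
    check_specials_last(gappy_vocab, ["<unk>", "<pad>"])
    gap_detected = False
except ValueError:
    gap_detected = True
print(gap_detected)  # True
```

The second case is what would happen if the vectorizer found fewer distinct tokens than `max_features`: the base vocabulary would be shorter than the configured size while the specials still sit at the configured offsets.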



Let us look at a few examples. They appear correct.

In [12]:
train_source_input_data[:2]

array([[ 765, 2000, 2001, 2001, 2001, 2001],
       [ 828, 2000, 2001, 2001, 2001, 2001]])

In [13]:
train_target_input_data[:2]

array([[3001,  928, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [3001, 1118, 1098, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])

In [14]:
train_target_output_data[:2]

array([[ 928, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [1118, 1098, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])

They’re matching exactly what our data pipeline promises:

### ID scheme

* **English (source):** base 0–1999, `<unk>=2000`, `<pad>=2001`.
* **German (target):** base 0–2999, `<unk>=3000`, `<start>=3001`, `<end>=3002`, `<pad>=3003`.

### What we expect each tensor to look like

1. **`train_source_input_data`** (encoder input)
   `tokens + <pad>…`
   👉 Example starts with a real token (`765`), then `2000=<unk>` (most likely the sentence-final punctuation, which `CountVectorizer` does not keep in its vocabulary), then a tail of `2001=<pad>` → correct.

2. **`train_target_input_data`** (decoder input, teacher forcing)
   `<start> + tokens + <end> + <pad>…`
   👉 Example begins with `3001=<start>`, then a token (`928`), later `3002=<end>`, then lots of `3003=<pad>` → correct.

3. **`train_target_output_data`** (decoder labels)
   `tokens + <end> + <pad>…` (one-step **shift** vs decoder input)
   👉 Example begins with the first token (`928`), includes `3002=<end>`, and pads with `3003`—**no `<start>`** at the front → correct.

### Alignment check (first row, conceptually)

```
input_target : [<start>, 928, ..., 3002, <pad>, <pad>, ...]
output_target: [928,      ..., 3002, <pad>, <pad>, ...]
```

At each time step, the label is the **next token** of the decoder input—exactly the intended shift for next-token prediction.

So:

* Source rows end with `2001` pads ✔️
* Decoder-input rows start with `3001` and contain a `3002` end before `3003` pads ✔️
* Decoder-output rows omit `3001`, include `3002`, then `3003` pads ✔️

All IDs are where they should be, and the sequences have the right structure and shift.
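The shift can also be checked programmatically rather than by eye. The sketch below re-types the two rows printed above as NumPy arrays (an illustration only; in the notebook you would use `train_target_input_data` and `train_target_output_data` directly) and compares them position by position.

```python
import numpy as np

PAD = 3003
# First two rows shown above, written out in full (13 positions each).
decoder_input = np.array([
    [3001,  928, 3000, 3002] + [PAD] * 9,
    [3001, 1118, 1098, 3000, 3002] + [PAD] * 8,
])
decoder_labels = np.array([
    [928, 3000, 3002] + [PAD] * 10,
    [1118, 1098, 3000, 3002] + [PAD] * 9,
])

# Dropping <start> from the inputs should give the labels minus their last position.
shift_ok = np.array_equal(decoder_input[:, 1:], decoder_labels[:, :-1])
print(decoder_input.shape, shift_ok)  # (2, 13) True
```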


[Return to Top](#returnToTop)  
<a id = 's2sNoAttention'></a>

### 2.2 The Seq2seq model without Attention

We need to build both the encoder and the decoder and we'll use LSTMs.  We'll set up the system first without an attention layer between the encoder and decoder.

In [61]:
def create_translation_model_no_att(encode_vocab_size, decode_vocab_size, embed_dim):
    """
    Build a basic seq2seq (Encoder-Decoder) LSTM translation model WITHOUT attention.
    Uses teacher forcing during training: decoder sees <start> + tokens and predicts next tokens.

    Args:
        encode_vocab_size: size of source (encoder) vocab including specials
        decode_vocab_size: size of target (decoder) vocab including specials
        embed_dim: dimension for both embeddings and LSTM hidden size (for simplicity)

    Returns:
        A compiled tf.keras.Model that maps:
            [source_input_ids, target_input_ids] -> decoder token distributions over decode_vocab_size
    """

    # ----- Inputs (padded ID sequences) -----
    # Encoder input shape: (batch, max_encoder_seq_length)
    source_input_no_att = tf.keras.layers.Input(
        shape=(max_encoder_seq_length,), dtype='int64', name='source_input_no_att'
    )
    # Decoder input shape: (batch, max_decoder_seq_length)
    # Contains <start> ... tokens ... <end> + PADs
    target_input_no_att = tf.keras.layers.Input(
        shape=(max_decoder_seq_length,), dtype='int64', name='target_input_no_att'
    )

    # ----- Embedding layers (token IDs -> dense vectors) -----
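    # NOTE: mask_zero=True masks ID 0, but <pad> is the LAST ID in these vocabularies.
    # Enabling it as-is would mask a real token; <pad> would first have to be remapped to 0.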
    source_embedding_layer_no_att = tf.keras.layers.Embedding(
        input_dim=encode_vocab_size, output_dim=embed_dim, name='source_embedding_layer_no_att'
        # , mask_zero=True
    )
    target_embedding_layer_no_att = tf.keras.layers.Embedding(
        input_dim=decode_vocab_size, output_dim=embed_dim, name='target_embedding_layer_no_att'
        # , mask_zero=True
    )

    # Embedded sequences:
    # source_embeddings: (batch, max_enc_len, embed_dim)
    source_embeddings_no_att = source_embedding_layer_no_att(source_input_no_att)
    # target_embeddings: (batch, max_dec_len, embed_dim)
    target_embeddings_no_att = target_embedding_layer_no_att(target_input_no_att)

    # ----- Encoder LSTM -----
    # return_sequences=True -> provide hidden state at each time step (not used further here)
    # return_state=True -> also return the final (h, c) states to initialize the decoder
    encoder_lstm_layer_no_att = tf.keras.layers.LSTM(
        embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_no_att'
    )
    encoder_out_no_att, encoder_state_h_no_att, encoder_state_c_no_att = \
        encoder_lstm_layer_no_att(source_embeddings_no_att)
    # Shapes:
    #  encoder_out_no_att: (batch, max_enc_len, embed_dim)  [unused downstream]
    #  encoder_state_h_no_att: (batch, embed_dim)
    #  encoder_state_c_no_att: (batch, embed_dim)

    # ----- Decoder LSTM (teacher forcing) -----
    # Initialized with encoder final states so it conditions on the source sentence.
    decoder_lstm_layer_no_att = tf.keras.layers.LSTM(
        embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_no_att'
    )
    # decoder_lstm_out: (batch, max_dec_len, embed_dim)
    decoder_lstm_out_no_att = decoder_lstm_layer_no_att(
        target_embeddings_no_att, initial_state=[encoder_state_h_no_att, encoder_state_c_no_att]
    )

    # ----- Token classification over target vocab at each time step -----
    # Dense time-distributed by broadcasting over the sequence dimension:
    # output: (batch, max_dec_len, decode_vocab_size), softmax over vocab axis
    target_classification_no_att = tf.keras.layers.Dense(
        decode_vocab_size, activation='softmax', name='classification_no_att'
    )(decoder_lstm_out_no_att)

    # ----- Assemble & compile -----
    translation_model_no_att = tf.keras.models.Model(
        inputs=[source_input_no_att, target_input_no_att],
        outputs=[target_classification_no_att]
    )

    # Using SparseCategoricalCrossentropy so labels are integer IDs (no one-hot needed).
    # Expect labels shaped (batch, max_dec_len) or (batch, max_dec_len, 1).
    translation_model_no_att.compile(
        optimizer="Adam",
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return translation_model_no_att


Now we can call the function we created to instantiate that model and confirm that it is set up the way we like using `model.summary()`.

In [16]:
encode_vocab_size = len(source_id_vocab_dict.keys())
decode_vocab_size = len(target_id_vocab_dict.keys())

translation_model_no_att = create_translation_model_no_att(encode_vocab_size, decode_vocab_size, embed_dim) # embed_dim = 100

translation_model_no_att.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 source_input_no_att (Input  [(None, 6)]                  0         []                            
 Layer)                                                                                           
                                                                                                  
 target_input_no_att (Input  [(None, 13)]                 0         []                            
 Layer)                                                                                           
                                                                                                  
 source_embedding_layer_no_  (None, 6, 100)               200200    ['source_input_no_att[0][0]'] 
 att (Embedding)                                                                              
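The parameter counts in the summary can be reproduced by hand, which is a quick way to see how vocabulary size drives model size. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The variable names are illustrative, and only the source-embedding figure (200,200) can be checked against the summary excerpt shown here.

```python
embed_dim = 100
encode_vocab_size = 2000 + 2   # base vocab + <unk>, <pad>
decode_vocab_size = 3000 + 4   # base vocab + <unk>, <start>, <end>, <pad>

# Embedding: one embed_dim-sized vector per vocabulary entry.
source_embedding_params = encode_vocab_size * embed_dim
target_embedding_params = decode_vocab_size * embed_dim

# LSTM with `embed_dim` units on `embed_dim`-sized inputs:
# 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (embed_dim * (embed_dim + embed_dim) + embed_dim)

# Dense softmax head over the target vocabulary: weights plus biases.
dense_params = embed_dim * decode_vocab_size + decode_vocab_size

print(source_embedding_params)  # 200200
print(target_embedding_params)  # 300400
print(lstm_params)              # 80400
print(dense_params)             # 303404
```

The two embedding matrices and the softmax head all scale linearly with vocabulary size, while the LSTM layers do not depend on it at all.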

It never hurts to look at the shapes of the outputs.

In [17]:
translation_model_no_att.predict(x=[train_source_input_data, train_target_input_data]).shape



(8005, 13, 3004)

Now that everything checks out, we can train our model.

In [18]:
translation_model_no_att.fit(x=[train_source_input_data, train_target_input_data],
                             y=train_target_output_data,
                             validation_data=([test_source_input_data, test_target_input_data],
                                              test_target_output_data),
                             epochs=40)

Epoch 1/40
...
Epoch 40/40


<tf_keras.src.callbacks.History at 0x7a1e3b0629c0>

[Return to Top](#returnToTop)  
<a id = 's2sAttention'></a>

### 2.3 The Seq2seq model with Attention

All we need to do is add an attention layer that creates a context vector for each decoder position. We can use the attention layer provided by Keras in *tf.keras.layers.Attention()*. We then concatenate each context vector with the LSTM output at the same decoder position and use the result to predict the translation tokens one by one.
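To see what the attention layer computes, here is a minimal NumPy sketch of unscaled dot-product attention followed by the concatenation step. It is an illustration under assumptions, not the Keras implementation: the function name `dot_product_attention` is hypothetical, the inputs are random stand-ins for the LSTM outputs, and no padding mask is applied.

```python
import numpy as np

def dot_product_attention(query, value):
    """query: (B, T_dec, D), value: (B, T_enc, D) -> (context (B, T_dec, D), weights)."""
    # Similarity of every decoder step with every encoder step: (B, T_dec, T_enc)
    scores = query @ value.transpose(0, 2, 1)
    # Softmax over encoder positions (subtract the max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of encoder outputs, one context vector per decoder step.
    return weights @ value, weights

rng = np.random.default_rng(0)
decoder_out = rng.normal(size=(2, 13, 100))  # stand-in for decoder LSTM outputs
encoder_out = rng.normal(size=(2, 6, 100))   # stand-in for encoder LSTM outputs

context, weights = dot_product_attention(decoder_out, encoder_out)
concat = np.concatenate([decoder_out, context], axis=-1)
print(context.shape, concat.shape)  # (2, 13, 100) (2, 13, 200)
```

Each row of `weights` sums to 1 over the six encoder positions, so every context vector is a weighted average of the encoder outputs. The concatenated tensor is what the final dense layer would see.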

In [63]:
def create_translation_model_with_att(encode_vocab_size, decode_vocab_size, embed_dim):
    """
    Seq2Seq (Encoder-Decoder) with Luong-style dot-product attention via the Keras Attention layer.
    Teacher forcing during training: decoder sees <start>...<end>, predicts next tokens.

    Args:
        encode_vocab_size: source vocab size (incl. specials)
        decode_vocab_size: target vocab size (incl. specials)
        embed_dim: embedding size and LSTM hidden size

    Returns:
        Compiled tf.keras.Model:
          inputs  -> [source_ids (B, T_enc), target_ids (B, T_dec)]
          outputs -> token distributions over target vocab (B, T_dec, V_tgt)
    """

    # ----- Inputs -----
    # Source (encoder) ids: (batch, max_encoder_seq_length)
    source_input_with_att = tf.keras.layers.Input(
        shape=(max_encoder_seq_length,), dtype='int64', name='source_input_with_att'
    )
    # Target (decoder) ids: (batch, max_decoder_seq_length)
    target_input_with_att = tf.keras.layers.Input(
        shape=(max_decoder_seq_length,), dtype='int64', name='target_input_with_att'
    )

    # ----- Embeddings (IDs -> dense vectors) -----
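    # TIP: mask_zero=True masks ID 0, but <pad> is the LAST ID in these vocabularies.
    # Enabling it as-is would mask a real token; <pad> would first have to be remapped to 0.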
    source_embedding_layer_with_att = tf.keras.layers.Embedding(
        input_dim=encode_vocab_size, output_dim=embed_dim, name='source_embedding_layer_with_att'
        # , mask_zero=True
    )
    target_embedding_layer_with_att = tf.keras.layers.Embedding(
        input_dim=decode_vocab_size, output_dim=embed_dim, name='target_embedding_layer_with_att'
        # , mask_zero=True
    )

    # Embedded sequences:
    # source_embeddings: (B, T_enc, D)
    source_embeddings_with_att = source_embedding_layer_with_att(source_input_with_att)
    # target_embeddings: (B, T_dec, D)
    target_embeddings_with_att = target_embedding_layer_with_att(target_input_with_att)

    # ----- Encoder LSTM -----
    # return_sequences=True: keep hidden states at every encoder step for attention keys/values
    # return_state=True    : also return final (h, c) to init the decoder
    encoder_lstm_layer_with_att = tf.keras.layers.LSTM(
        embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_with_att'
    )
    # encoder_out: (B, T_enc, D), encoder_state_h: (B, D), encoder_state_c: (B, D)
    encoder_out_with_att, encoder_state_h_with_att, encoder_state_c_with_att = \
        encoder_lstm_layer_with_att(source_embeddings_with_att)

    # ----- Decoder LSTM (teacher forcing) -----
    # Initialized with encoder final states so decoding is conditioned on the source
    decoder_lstm_layer_with_att = tf.keras.layers.LSTM(
        embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_with_att'
    )
    # decoder_lstm_out: (B, T_dec, D)  -> these are the "queries" for attention
    decoder_lstm_out_with_att = decoder_lstm_layer_with_att(
        target_embeddings_with_att, initial_state=[encoder_state_h_with_att, encoder_state_c_with_att]
    )

    # ----- Attention -----
    # Keras Attention: Attention([query, value]) where key=value by default.
    # query  : decoder time steps  (B, T_dec, D)
    # value  : encoder time steps  (B, T_enc, D)
    # output : context vectors aligned to each decoder step (B, T_dec, D)
    attention_context_vectors = tf.keras.layers.Attention(name='attention_layer')(
        [decoder_lstm_out_with_att, encoder_out_with_att]
    )

    # Concatenate decoder hidden state with its attention context per step
    # concat: (B, T_dec, 2D)
    concat_decode_out_with_att = tf.keras.layers.Concatenate(axis=-1, name='concat_layer_with_att')(
        [decoder_lstm_out_with_att, attention_context_vectors]
    )

    # ----- Time-distributed classification over target vocab -----
    # Dense applies to last dim -> (B, T_dec, V_tgt)
    target_classification_with_att = tf.keras.layers.Dense(
        decode_vocab_size, activation='softmax', name='classification_with_att'
    )(concat_decode_out_with_att)

    # ----- Assemble & compile -----
    translation_model_with_att = tf.keras.models.Model(
        inputs=[source_input_with_att, target_input_with_att],
        outputs=[target_classification_with_att]
    )

    translation_model_with_att.compile(
        optimizer="Adam",
        loss='sparse_categorical_crossentropy',  # labels are integer IDs shaped (B, T_dec)
        metrics=['accuracy']
    )

    return translation_model_with_att


In [20]:
translation_model_with_att = create_translation_model_with_att(encode_vocab_size, decode_vocab_size, embed_dim)

translation_model_with_att.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 source_input_with_att (Inp  [(None, 6)]                  0         []                            
 utLayer)                                                                                         
                                                                                                  
 target_input_with_att (Inp  [(None, 13)]                 0         []                            
 utLayer)                                                                                         
                                                                                                  
 source_embedding_layer_wit  (None, 6, 100)               200200    ['source_input_with_att[0][0]'
 h_att (Embedding)                                                  ]                       

In [21]:
translation_model_with_att.fit(x=[train_source_input_data, train_target_input_data],
                               y=train_target_output_data,
                               validation_data=([test_source_input_data, test_target_input_data],
                                                test_target_output_data),
                               epochs=40)

Epoch 1/40
...
Epoch 40/40


<tf_keras.src.callbacks.History at 0x7a1e3aa4cd10>

Validation accuracy is about one percentage point better.

**Question 1:** Why do you think the benefit of adding an attention layer is not larger?

[Return to Top](#returnToTop)  
<a id = 't5Example'></a>


Short version: The setup doesn’t give attention much room to shine.

Here’s why a ~+1% bump is plausible:

* **Sentences are short.** Your max lengths are 6 (source) / 13 (target), as the input shapes in the model summary show. On short sequences an encoder LSTM already carries most information; the encoder “bottleneck” isn’t severe, so attention adds little.

* **Teacher forcing + token-accuracy metric.** Validation “accuracy” here is per-token under teacher forcing—not sequence quality (BLEU/chrF). Attention mainly helps when the model must keep long-range dependencies during **generation**; teacher forcing makes the task easier for both models, reducing the gap.

* **Padding not masked.** In your code, `Embedding(..., mask_zero=True)` isn’t set and masks aren’t passed to `Attention`. If PAD positions aren’t masked, the attention layer may attend to padding (noise), and PAD tokens dominate the loss/accuracy—diluting any benefit.

* **Vocabulary/OOV pressure.** You cap vocab at 2k/3k and map the rest to `<unk>`. When many tokens are `<unk>`, alignment cues get blurred, limiting what attention can exploit (especially for reordering like EN→DE).

* **Simple attention & capacity.** `tf.keras.layers.Attention` (dot-product, no projection) + a single LSTM with `units=embed_dim` is modest capacity. No bidirectional encoder, no multi-head, no projection of [h_t ; context_t]. Capacity limits can compress the gain.

* **Optimization & variance.** With few epochs/small data/suboptimal LR, the models may not separate much. ±1% can fall within run-to-run noise unless averaged across seeds.
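
To make the padding point above concrete, here is a minimal pure-Python sketch of token accuracy computed with and without a padding mask. The token ids, the two sets of predictions, and the assumption that id 0 is the padding token are all made up for illustration; this is not the metric code used by the notebook's Keras models.

```python
# Illustrative sketch: token-level accuracy with and without a padding mask.
PAD_ID = 0  # assumption: id 0 is the padding token


def token_accuracy(y_true, y_pred, ignore_id=None):
    """Fraction of positions where the prediction equals the label.

    If ignore_id is given, positions whose true label equals ignore_id
    (e.g. padding) are left out of both numerator and denominator.
    """
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if ignore_id is not None and t == ignore_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


# Two toy target sequences padded to length 8 (made-up ids).
y_true = [[5, 9, 2, 0, 0, 0, 0, 0],
          [7, 3, 4, 2, 0, 0, 0, 0]]
# Both hypothetical models predict the padding correctly;
# model B gets one more real token right than model A.
pred_a = [[5, 1, 2, 0, 0, 0, 0, 0],
          [7, 1, 1, 2, 0, 0, 0, 0]]
pred_b = [[5, 9, 2, 0, 0, 0, 0, 0],
          [7, 1, 1, 2, 0, 0, 0, 0]]

print("unmasked:", token_accuracy(y_true, pred_a), token_accuracy(y_true, pred_b))
print("masked:  ", token_accuracy(y_true, pred_a, PAD_ID), token_accuracy(y_true, pred_b, PAD_ID))
```

Counting padding, the two models score 13/16 and 14/16; ignoring padding, they score 4/7 and 5/7. The same one-token improvement looks more than twice as large once padding is masked out.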



### 3. T5 (Text-to-Text Transfer Transformer)

* **What it is:** A transformer that **uses both an encoder and a decoder** and treats **every NLP task as text-in → text-out**.
* **Paper:** *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* ([arXiv:1910.10683](https://arxiv.org/pdf/1910.10683.pdf)).
* **Key idea:** Prefix the input with a **task indicator** (e.g., `summarize:`, `translate English to German:`, `question: ... context: ...`) so the same model handles many tasks.
* **Tasks supported:** Summarization, QA, translation, classification, and more.

  * See **Appendix D** of the paper for the complete task set used for T5 **base** and **large** checkpoints and their training data.

**What we’ll do next:** Try out Hugging Face’s **T5-Large** to generate text from prompts using the task-prefix format.


In [65]:
!pip install -U -q transformers accelerate

In [66]:
!pip install -q torchinfo

In [67]:
#import torch
from torchinfo import summary

In [68]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-large")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-large")

Let's look at the makeup of this model.

In [69]:
summary(t5_model)

Layer (type:depth-idx)                                  Param #
T5ForConditionalGeneration                              --
├─Embedding: 1-1                                        32,899,072
├─T5Stack: 1-2                                          32,899,072
│    └─Embedding: 2-1                                   (recursive)
│    └─ModuleList: 2-2                                  --
│    │    └─T5Block: 3-1                                12,585,472
│    │    └─T5Block: 3-2                                12,584,960
│    │    └─T5Block: 3-3                                12,584,960
│    │    └─T5Block: 3-4                                12,584,960
│    │    └─T5Block: 3-5                                12,584,960
│    │    └─T5Block: 3-6                                12,584,960
│    │    └─T5Block: 3-7                                12,584,960
│    │    └─T5Block: 3-8                                12,584,960
│    │    └─T5Block: 3-9                                12,584,960
│    │    └─T

The summary reports about 836 million trainable parameters. Quite a lot.

Let's create a short text to use as an example.

In [70]:
ARTICLE = ("Oh boy, what a lengthy and cumbersome excercise this was. " \
           "I had to look into every detail, check everything twice, " \
           " and then compare to prior results. Because of this tediousness " \
           " and extra work my homework was 2 days late.")

Next, we need to specify the task we want T5 to perform by adding a task prefix to the beginning of the input text.  Because we are summarizing, we add the word *summarize:* to the beginning of our input.

In [76]:
# Prefix the task so T5 knows what to do. T5 treats everything as "text → text".
# Common prefixes: "summarize:", "translate English to German:", "question:", etc.
t5_input_text = "summarize: " + ARTICLE

# Tokenize to model-ready tensors.
# - Wrapping in a list creates a batch of size 1.
# - return_tensors='pt' returns PyTorch tensors (use 'tf' for TensorFlow).
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='pt')


First, we will generate a summary using the default output options.

In [77]:
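# Generate summary token IDs from the input_ids.
# With no extra arguments, .generate() uses the model's default generation settings:
# typically *greedy decoding* (no sampling, no beam search) and a short default
# max_length (20 tokens unless the generation config overrides it),
# which is why the output below stops mid-sentence.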
t5_summary_ids = t5_model.generate(t5_inputs["input_ids"])

# Decode each generated sequence of IDs back to text.
# - skip_special_tokens=True removes tokens like <pad>, </s>
# - clean_up_tokenization_spaces=False preserves spacing exactly as produced
#   (set True if you want tokenizer to normalize spaces/punctuation)
print([
    t5_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    for g in t5_summary_ids
])


['homework was a lengthy and cumbersome excercise . because of this tediousness']


Not great. But let's get more sophisticated and prescribe a minimum length and use beam search to generate multiple outputs.  We also indicate the maximum length the output should be.  Finally, in order to reduce repetitive output we tell the model to avoid output that repeats trigrams (three word groupings).
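
The trigram rule can be sketched in a few lines of plain Python. This is only an illustration of the idea behind `no_repeat_ngram_size`, not the library's implementation: the helper name `banned_next_tokens` is hypothetical, and the real generation code works on token ids (subword pieces), not on whole words as in this toy example.

```python
def banned_next_tokens(generated, n):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if n <= 0 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram we are about to complete.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


so_far = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(so_far, 3))
```

Here the sequence already contains the trigram "the cat sat" and currently ends in "the cat", so "sat" is the one token that may not come next. At each decoding step the banned tokens would have their scores set to negative infinity before the next token is chosen.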

In [78]:
# Beam search generation with basic anti-repetition + length controls.
t5_summary_ids = t5_model.generate(
    t5_inputs['input_ids'],
    num_beams=3,            # Use 3 beams: explore top-3 continuations at each step
    no_repeat_ngram_size=3, # Forbid repeating any sequence of 3 tokens (helps reduce loops)
    min_length=20,          # Do not stop before 20 tokens are generated
    max_length=40           # Hard cap on total generated length (incl. special tokens)
)

# Decode IDs -> strings.
print([
    t5_tokenizer.decode(
        g,
        skip_special_tokens=True,        # drop <pad>, </s>, etc.
        clean_up_tokenization_spaces=False  # keep raw spacing exactly
    )
    for g in t5_summary_ids
])


['i had to look into every detail, check everything twice, and then compare to prior results . because of this tediousness and extra work my homework was 2 days late .']


That is a bit better, thanks to the generation parameters we set.

Lastly, can T5 perform machine translation? Yes, in some limited instances.  We need to specify the input and output languages. Keep in mind that the model has only been trained to translate in particular directions, e.g., English to Romanian but NOT Romanian to English.


In [79]:
t5_input_text = "translate English to German: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='pt')

In [80]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=10,
                                   max_length=40)

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Ich habe es nicht geschafft, meinen ersten Test zu schreiben, da ich nicht genügend Zeit hatte, um meinen Test zu bearbeiten.']


Hmm... the output is very fluent German. But paste it into translate.google.com to see what it means.

- Not great:
- ['I failed to take my first test because I didn't have enough time to complete my test.']


Is it anything like the English input? This hallucination might be mitigated by changing some of the generation parameters, such as num_beams.

Is a shorter example more accurate?  Maybe.

In [81]:
t5_input_text = "translate English to German: That was really not very good today; it was too difficult to solve."
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='pt')

In [82]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=10,
                                   max_length=40)

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Das war heute wirklich nicht sehr gut; es war zu schwierig zu lösen.']


That is not bad, though there are still some mistakes. Translating it back to English gives:
 - Not bad
 - ['That really wasn't very good today; it was too difficult to solve.']

[Return to Top](#returnToTop) <a id="prompts"></a>

# 4. Prompt Engineering & Generative Large Language Models

**Why it matters:** Very large language models (e.g., [GPT-3](https://arxiv.org/pdf/2005.14165.pdf)) enable **zero-shot** and **few-shot** learning: you provide a prompt (and optionally a few examples), and the model produces the answer—often without any task-specific fine-tuning.

**Key idea:** “**Pre-train, prompt, and predict**” ([Liu et al.](https://arxiv.org/pdf/2107.13586.pdf)). A single pre-trained model can serve many users/tasks; success depends largely on **how you phrase the prompt** and **what context** you include.

---

## Core Concepts

* **Zero-shot:** No exemplars—just instructions/context → model answers.
* **Few-shot:** Include a handful of input→output examples to steer behavior.
* **Prompt design:** Clear instructions, role/context, constraints, and format improve reliability.

---

## Two Prompting Families

### 1) Cloze (Mask) Prompts

* **What:** Fill-in-the-blank style prompts that mirror **masked language modeling**.
* **Who:** Pre-trained encoder/encoder-decoder models like **BERT** and **T5**.
* **How:** Hide tokens/spans and ask the model to predict the masked content.
* **Use cases:** Classification via pattern-verbalizer pairs, slot filling, factual recall.

### 2) Prefix (Next-Token) Prompts

* **What:** Provide a **prefix** and ask the model to continue it via **next-token prediction**.
* **Who:** **Decoder-only** models (GPT family).
* **How:** “Instruction + context + (optional) few-shot examples →” model completes the sequence.
* **Use cases:** Generation (QA, code, writing), summarization, translation, reasoning.

---

## Zero-Shot Instruction-Tuned Models

* **Example:** *Llama 3.1 8B* (instruction-tuned) treats tasks as **text-in → text-out** without fine-tuning.
* **Benefit:** Strong zero-/few-shot performance by following natural-language instructions.
* **Practice:** State the task, specify constraints (tone, length, format), and include evaluation criteria when possible.

---

## Practical Implications

* **No fine-tuning required:** Broadly shareable models; focus on **prompt crafting**.
* **Try structured prompts:** Role, goals, constraints, examples, and desired output schema.
* **Iterate & test:** Small prompt edits often yield large quality gains.

---


[Return to Top](#returnToTop)  
<a id = 'clozePrompts'></a>

### 4.1 Cloze Prompts

Cloze prompts take advantage of the masked language modeling task, in which an individual word or span of words anywhere in the input is masked and the language model learns to predict it.

In [83]:
#Delete the old model so we are managing memory
del t5_tokenizer
del t5_model

In [84]:
# Get a new model with a new checkpoint
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")

### T5 Sentinel Tokens (Masked Word Modeling)

* **What is `<extra_id_0>`?**
  A **sentinel token** used by T5 to trigger its **span-masking** (fill-in-the-blank) capability.

  * T5 provides **100** such tokens: `<extra_id_0>`, `<extra_id_1>`, …, `<extra_id_99>`.

* **Why it matters:**
  You can craft prompts like cloze tests to **probe the model’s pretrained knowledge**—T5 will try to fill in the masked span.

* **Example behavior:**
  Try a prompt where the missing word is a dog breed; T5 often handles **“poodle”** correctly, but substituting **“beagle”** can confuse it—revealing what the model learned during pretraining.

---

### Example Prompt Pattern

```text
"The <extra_id_0> is a small, intelligent dog breed known for its curly coat."
```

Model is expected to complete `<extra_id_0>` with something like “poodle”.
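
To see how sentinel tokens pair an input with its target, here is a small sketch of the span-masking format T5 was pretrained on. The helper `mask_spans` and the choice of which spans to hide are hypothetical, and the sketch works on whitespace-separated words rather than on the real SentencePiece tokens.

```python
def mask_spans(tokens, spans):
    """Replace each [start, end) span with a sentinel and build the matching target.

    `spans` must be sorted and non-overlapping.
    """
    model_input, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        model_input.extend(tokens[cursor:start])  # copy the visible text
        model_input.append(sentinel)              # hide the span behind a sentinel
        target.append(sentinel)                   # the target repeats the sentinel...
        target.extend(tokens[start:end])          # ...followed by the hidden words
        cursor = end
    model_input.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")     # a final sentinel closes the target
    return " ".join(model_input), " ".join(target)


tokens = "Thank you for inviting me to your party last week .".split()
model_input, target = mask_spans(tokens, [(2, 4), (8, 9)])
print(model_input)
print(target)
```

The input becomes `Thank you <extra_id_0> me to your party <extra_id_1> week .` and the target becomes `<extra_id_0> for inviting <extra_id_1> last <extra_id_2>`. At inference time we write only the input side and let the model generate the target side, which is why a cloze prompt with a single `<extra_id_0>` works.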

---

### Decoding Strategy (Beam Search)

* **Beam search** explores multiple likely continuations at each step.
* Here, we **return the top 3 candidates** instead of just the single best output to see alternative fills.

  * This helps diagnose ambiguity and understand the model’s confidence spread.

> Tip: Combine beam search with constraints (e.g., `no_repeat_ngram_size`) if you see repetitive outputs.
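
The following toy example sketches how beam search keeps several hypotheses alive and returns the best few. Everything in it is an assumption made for illustration: `TOY_MODEL` is a hand-written table of next-token probabilities that depends only on the previous token, and the search omits refinements such as length penalties that real implementations apply.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token only.
TOY_MODEL = {
    "<s>":      {"working": 0.40, "Shepherd": 0.35, "cattle": 0.25},
    "working":  {"</s>": 0.60, "dog": 0.40},
    "Shepherd": {"</s>": 0.90, "dog": 0.10},
    "cattle":   {"dog": 0.80, "</s>": 0.20},
    "dog":      {"</s>": 1.00},
}


def beam_search(num_beams, num_return_sequences, max_steps=4):
    """Return up to num_return_sequences (text, probability) pairs, best first."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_steps):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, seq in beams:
            for tok, p in TOY_MODEL[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Keep only the num_beams most probable expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:num_beams]:
            if seq[-1] == "</s>":
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return [(" ".join(seq[1:-1]), math.exp(score))
            for score, seq in finished[:num_return_sequences]]


print(beam_search(num_beams=1, num_return_sequences=1))  # greedy decoding
print(beam_search(num_beams=3, num_return_sequences=3))  # top-3 finished beams
```

With a single beam the search commits to "working", the most probable first token, and finishes there. With three beams it also explores "Shepherd", whose complete sequence turns out to be more probable overall, and returns "Shepherd", "working" and "cattle dog" in that order.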


In [90]:
# Cloze-style prompt: T5 will try to fill the masked span <extra_id_0>.
PROMPT_SENTENCE = ("An Australian <extra_id_0> is a type of working dog .")

# Prepare a batch of size 1 for the model (PyTorch tensors).
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='pt')

# Generate candidates with beam search.
t5_summary_ids = t5_model.generate(
    t5_inputs['input_ids'],
    num_beams=10,            # explore top-10 continuations at each step
    # temperature=0.8,       # only used if do_sample=True (sampling); ignored with pure beams
    no_repeat_ngram_size=2,  # avoid repeating any 2-gram (helps reduce loops)
    num_return_sequences=3,  # return the top-3 finished beams (≤ num_beams)
    min_length=1,            # ensure at least 1 token is generated
    max_length=3             # cap total generated tokens very short (good for a single word)
)

# Decode each candidate. Skipping special tokens removes <extra_id_x> markers in outputs.
print([
    t5_tokenizer.decode(
        g,
        skip_special_tokens=True,        # drop <pad>, </s>, <extra_id_x>, etc.
        clean_up_tokenization_spaces=False
    )
    for g in t5_summary_ids
])


['Shepherd', 'working', 'Working']


In [91]:
#Keep our memory free of old models
del t5_tokenizer
del t5_model

[Return to Top](#returnToTop) <a id="prefixPrompt"></a>

# 4.2 Prefix Prompts

**What are they?**
Prefix prompts are inputs for **next-token prediction** models (decoder-only transformers) where you provide a **single sequence** that contains instructions (and optionally examples). The model then **continues** that sequence.

**Why they work:**
Decoder-only models (e.g., GPT-3) learn to predict the **next word given a long context window**. If your prefix contains the right instructions, constraints, and examples, the continuation often does what you want.

---

## How to Use Prefix Prompts

* **Zero-shot:** Just ask the question with clear instructions.
* **Few-shot:** Provide a couple of input→output examples, then your new input.
* **Good practice:** Specify format, style, and constraints (length, tone, JSON schema, etc.).

---

## Example Model: OPT (Decoder-Only)

* **Paper:** [OPT](https://arxiv.org/pdf/2205.01068.pdf)
* **What it is:** A family of GPT-3-like models (125M → 175B parameters).
* **Key property:** **Single sequence input** (the prefix); there’s no separate “encoder input” and “decoder target.”
* **What we’ll use:** The **350M** parameter checkpoint for quick experimentation.

**Workflow:**

1. Build a prefix: instructions + (optional) examples + your query.
2. Feed to the model as one sequence.
3. Generate a continuation (control length, temperature/top-p, repetition penalties).
4. Parse/validate the output (especially if you asked for structured formats).

**Tip:** Few-shot examples that closely match your desired task improve reliability and reduce prompt sensitivity.
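
Step 1 of the workflow can be sketched as a small helper that assembles the prefix. The function `build_prefix`, the `Input:`/`Output:` labels, and the example reviews are all hypothetical choices for illustration; any consistent layout works, as long as the prefix ends exactly where the model should start writing.

```python
def build_prefix(instruction, examples, query, input_label="Input", output_label="Output"):
    """Assemble instruction + optional few-shot examples + the new query into one string."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"{input_label}: {example_input}")
        parts.append(f"{output_label}: {example_output}")
        parts.append("")
    parts.append(f"{input_label}: {query}")
    parts.append(f"{output_label}:")  # the model continues from here
    return "\n".join(parts)


# Few-shot: two worked examples steer the format of the continuation.
few_shot = build_prefix(
    "Classify the sentiment of each review as positive or negative.",
    [("The cookies were delicious.", "positive"),
     ("The bread was stale.", "negative")],
    "The cake was wonderful.",
)
print(few_shot)

# Zero-shot: the same helper with no examples.
zero_shot = build_prefix("Translate to German.", [], "Good morning.")
```

The resulting string is what gets tokenized and passed to the model as a single sequence.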


As with T5, we'll just try out the pre-trained model and see what text it generates for a new starting sequence.

In [92]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

We'll give it a prompt now and see what it generates next.

In [93]:
prefix_prompt = 'Yesterday, I went to the store to buy '
input_ids = tokenizer.encode(prefix_prompt, return_tensors='pt')

### Parameter cheat sheet
* **`do_sample=True`** → enables *stochastic* decoding (uses `temperature`, `top_k`, `top_p`).
* **`top_k` & `top_p` together** → apply *both* filters: logits are limited to top-k, then truncated by nucleus.
* **`temperature`** → scales randomness. Try 0.7–1.0 for balanced quality/diversity.
* **`repetition_penalty`** → discourages repeats; 1.1–1.8 are common. Too high can harm fluency.
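* **`num_return_sequences`** → with beam search, must be ≤ `num_beams`; with sampling (`do_sample=True`), each returned sequence is an independent sample, so you’ll get diverse outputs.
* **`max_length`** → cap on *total* length; for decoder-only models this counts the prompt plus the generated tokens. Prefer: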

  ```python
  max_new_tokens=35  # clearer: limits *generated* tokens only
  ```
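
To make the interaction of `temperature`, `top_k` and `top_p` concrete, here is a pure-Python sketch that applies the three steps to a made-up set of logits. The function `filtered_distribution` and the toy vocabulary are assumptions for illustration; the library performs the equivalent operations on tensors of token ids.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # 1) Temperature: divide logits before the softmax (<1 sharpens, >1 flattens).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)
    # 2) Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Softmax over the surviving tokens.
    peak = ranked[0][1]
    weights = [(tok, math.exp(value - peak)) for tok, value in ranked]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # 3) Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        total = sum(p for _, p in kept)
        probs = [(tok, p / total) for tok, p in kept]
    return dict(probs)


# Made-up logits for the token after "Yesterday, I went to the store to buy".
logits = {"milk": 2.0, "bread": 1.0, "eggs": 0.5, "a": 0.0, "socks": -1.0}
dist = filtered_distribution(logits, temperature=1.0, top_k=3, top_p=0.8)
print(dist)

# Sampling draws from whatever survived the filters.
rng = random.Random(42)
print(rng.choices(list(dist), weights=list(dist.values()), k=5))
```

With these numbers, top-k first narrows five candidates to three, and the nucleus step then keeps only "milk" and "bread", because together they already cover more than 80% of the remaining probability mass.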

### Practical tips

* **Match device:** `model.to(device); input_ids = input_ids.to(device)` to avoid CPU/GPU mismatches.
* **Control stop:** If your tokenizer defines `eos_token_id`, the model will stop on EOS; otherwise consider `stopping_criteria`.
* **Reproducibility:** Set a seed for deterministic sampling:

  ```python
  import torch
  torch.manual_seed(42)
  ```
* **Prompt matters:** Decoder-only models **continue** the prefix. Include clear instructions/examples and, if needed, an explicit delimiter before where generation should start.
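
The `repetition_penalty` entry in the cheat sheet can be sketched as follows. This is a simplified illustration of the commonly used rule (divide positive logits by the penalty, multiply negative ones by it); the helper name and the toy logits are hypothetical, and the library's own implementation operates on tensors of token ids and may differ in detail.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Make tokens that already appear in `generated` less likely.

    Positive logits are divided by `penalty` and negative logits are multiplied
    by it, so with penalty > 1 the score moves down in both cases.
    """
    adjusted = dict(logits)
    for tok in set(generated):
        if tok in adjusted:
            score = adjusted[tok]
            adjusted[tok] = score * penalty if score < 0 else score / penalty
    return adjusted


# Made-up logits; "bike" and "the" were already generated, "book" was not.
logits = {"bike": 3.0, "book": 1.5, "the": -0.5}
print(apply_repetition_penalty(logits, ["bike", "the"], penalty=1.5))
```

In this example "bike" drops from 3.0 to 2.0 and "the" drops from -0.5 to -0.75, while "book" is untouched. A penalty of 1.0 leaves every score unchanged, which is why values only slightly above 1 are usually enough.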


In [100]:
# Stochastic text generation from a decoder-only LM (e.g., OPT/GPT-like)
generated_text_outputs = model.generate(
    input_ids,               # tokenized prefix prompt (batch or single sequence)
    max_length=35,           # TOTAL length cap (prefix+new tokens) for some models
                             # (Prefer max_new_tokens=... in recent HF)
    num_return_sequences=3,  # return 3 different samples (requires do_sample=True)
    repetition_penalty=1.5,  # >1.0 penalizes repeated tokens/phrases (helps reduce loops)
    top_p=0.92,              # nucleus sampling: keep smallest set with cum prob ≥ 0.92
    temperature=0.95,         # soften/sharpen logits (<1.0 = conservative, >1.0 = creative)
    do_sample=True,          # enable sampling (vs default greedy/beam search)
    top_k=125,               # truncate to top-125 tokens before sampling (used with top_p)
    early_stopping=True      # mainly relevant for beam search; with sampling it has little effect
)

# Print each generated continuation
for i, seq in enumerate(generated_text_outputs):
    print()
    print(f"{i}: {tokenizer.decode(seq, skip_special_tokens=True, clean_up_tokenization_spaces=True)}")



0: Yesterday, I went to the store to buy  a new bike. I was really excited because it didn't have anything else listed on Amazon but instead there were two bikes

1: Yesterday, I went to the store to buy ~~a new~~ used book. As a note-to kindle fanboy that has always liked using text on my

2: Yesterday, I went to the store to buy  - 1 pair of jeans (the length and width are different) - a pair of chinos from my normal shopping list


Now let's try a long prompt to give the model a lot of context to work with and see how well it performs.  We'll ask it to generate a recipe and see how well it follows instructions.  We'll try it with several smaller models available on HuggingFace. Finally, we'll include the output for that same prompt from ChatGPT for comparison purposes.

In order to do so without exceeding the memory, you should **STOP the notebook and Reconnect it**.

In [1]:
!pip install pydot --quiet
!pip install transformers --quiet
!pip install sentencepiece --quiet

We'll also use the HuggingFace AutoTokenizer and AutoModelXXX classes as they'll let us just specify new checkpoints rather than having to find the specific model type.  

In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM
from transformers import AutoTokenizer

In [3]:
from pprint import pprint

Now let's try the Facebook OPT model, which is designed to be an open-source equivalent to GPT-3.  We have limited compute resources, so we'll use the 1.3 billion parameter model.  For comparison, the full GPT-3 model has 175 billion parameters.

In [4]:
checkpoint_string = "facebook/opt-1.3b"

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(checkpoint_string)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_string)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
inputs = tokenizer("You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies. Only give us the ingredients and instructions.", return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, min_length=100, max_length=300, temperature=0.97, repetition_penalty=1.2
            )
outputs.shape

torch.Size([1, 113])

In [6]:
pprint(tokenizer.batch_decode(outputs, skip_special_tokens=True),compact=True)

['You are a world renowned James Beard award winning pastry chef. Give us the '
 'recipe for your specialty, chocolate chip cookies. Only give us the '
 'ingredients and instructions.\n'
 'I don’t know why I never thought of this before! We just started going to a '
 'lot of weddings as a team so we can try out ideas like these that seem '
 'cool/worth working on!  I love how versatile cake frosting is, it has such '
 'great properties on its own too! Thanks for the suggestions! Will update '
 'when I get some good baking stuff together :)']


Now let's try a different model -- a T5 model that is fine-tuned on the [FLAN instruction data](https://arxiv.org/pdf/2109.01652.pdf).  We would expect a better result because it has been fine-tuned to follow instructions.

In [7]:
del tokenizer
del model

checkpoint_string = "google/flan-t5-large"


model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_string)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_string)




In [8]:
inputs = tokenizer("You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies.   Only give us the ingredients and instructions.", return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, min_length=100, max_length=300, temperature=0.97, repetition_penalty=1.2
            )
outputs.shape

torch.Size([1, 114])

Now let's print out the results.  Note that we're using the PyTorch version of T5, so our output is a PyTorch tensor (`torch.Tensor`) rather than a TensorFlow tensor. You can look at the ingredients and decide how good this recipe would be.

In [9]:
pprint(tokenizer.batch_decode(outputs, skip_special_tokens=True),compact=True)

['the entire world of chocolate chip cookies is yours. all the ingredients '
 'listed for the recipe are included. just give us the recipe in text form. we '
 'will make you a new batch of cookies. all you need to do is add the '
 'remaining ingredients. we will add them to your cookies while we wait for '
 'the cookies. and then when they are ready, you will put them on the cookie '
 'tray. enjoy and please take pictures if you want to share your cookies and '
 'recipe. our contest ends on 7 December at 10:00 PM (GMT +3)']


Let's try a different model -- one that is both designed to run with a significantly smaller number of parameters and has also been fine-tuned on a large set of instruction data.  The [Alpaca model](https://crfm.stanford.edu/2023/03/13/alpaca.html) was released in 2023 for research purposes.

In [5]:
del tokenizer
del model

checkpoint_string = "declare-lab/flan-alpaca-large"


model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_string)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_string)


In [6]:
inputs = tokenizer("You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies.  Only give us the ingredients and instructions.", return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, min_length=100, max_length=300, temperature=0.97, repetition_penalty=1.2
            )
outputs.shape

torch.Size([1, 119])

In [7]:
pprint(tokenizer.batch_decode(outputs, skip_special_tokens=True),compact=True) # This model is fine-tuned to follow the instructions - not bad!

['Ingredients: - 2 large eggs Beat 3 cups milk, 1 teaspoon vanilla extract - '
 '1/4 cup sugar - 3/4 cup butter Instructions: 1. Heat the milk in a large '
 'saucepan over medium heat. 2. Add the egg and stir to mix until combined. 3. '
 'Stir in the flour, baking soda and salt. 4. Slowly add the wet ingredients '
 'to the dry and beat together. 5. Drop spoonfuls of dough onto pre-greased '
 'cookie sheets. 6. Bake for 8-12 minutes or until edges are golden brown. 7. '
 'Allow cookies to cool before serving.']


[Return to Top](#returnToTop) <a id="llama3"></a>

# 4.3 Instruction-Tuned Reasoning Prompts — Qwen 3 (4B vs 14B)

**What is Qwen 3?**
Qwen 3 is Alibaba Cloud’s latest **reasoning** LLM (released April 2025). By default it “**thinks**” before it “**answers**”: the internal reasoning is wrapped in explicit `<think></think>` tags, and the final reply follows the closing `</think>` tag without any literal `<answer>` tags. See the [model card](https://huggingface.co/Qwen/Qwen3-14B) for details. It’s open-source under **Apache 2.0**.

---

## Why it’s notable

* **Reasoning-first decoding:** Encourages structured intermediate steps, improving complex task performance.
* **Instruction-tuned:** Follows natural-language instructions out of the box.
* **Open licensing (Apache 2.0):** Suitable for research and many production use cases.

---

## Model sizes & practical trade-offs

### Qwen 3–4B (quantized)

* **Footprint:** ~8 GB of weights to download (16-bit checkpoint); 4-bit quantization at load time shrinks the in-memory footprint considerably.
* **Hardware fit:** Runs on a **T4 GPU**; downloads/loads relatively quickly.
* **Use case:** Great for demos, prototyping, and cost-sensitive deployments.

### Qwen 3–14B

* **Footprint:** **30+ GB** of weights.
* **Startup time:** ~**8 minutes** to download and fully load (varies by network/storage).
* **Use case:** Better accuracy/reasoning depth than 4B; needs beefier hardware (more VRAM/RAM).
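
A back-of-the-envelope estimate helps explain these footprints: weight storage is roughly the number of parameters times the bytes per parameter. The sketch below uses nominal parameter counts (4 and 14 billion) as an assumption; the real checkpoints differ slightly, quantized models keep some layers in higher precision, and activations and the KV cache need memory on top of the weights.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 10**9 bytes); weights only."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal sizes, assumed for illustration.
for name, num_params in [("Qwen3-4B", 4e9), ("Qwen3-14B", 14e9)]:
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name:10s} {label:7s} ~{weight_memory_gb(num_params, bits):.0f} GB")
```

Under these assumptions the 4B model needs about 8 GB at 16 bits and about 2 GB at 4 bits, while the 14B model needs about 28 GB at 16 bits and about 7 GB at 4 bits.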

---

## Tips for running within memory limits

* Prefer the **4B quantized** checkpoint on smaller GPUs (e.g., T4).
* If switching from 14B → 4B (or vice versa) in the same notebook:

  1. **Stop** the notebook/runtime.
  2. **Disconnect & Delete** the runtime (clears GPU memory).
  3. **Reconnect** and re-load the chosen model.
* Consider:

  * **Quantization** (4-bit/8-bit) to reduce memory.
  * **CPU offloading** / **paged attention** if supported by your runtime.
  * Lower **max sequence length** to curb memory at inference time.

---

## Quick setup checklist

* Pick model size: **4B (quantized)** for T4; **14B** for larger GPUs.
* Ensure environment has compatible transformer libs and backend.
* Set generation limits (e.g., `max_new_tokens`) to keep memory in check.
* Log/inspect outputs to observe `<think>` vs final answers for debugging and evaluation.
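
For the last checklist item, a small helper can separate the reasoning from the final answer in a generated string. This is a sketch under the assumption that the reasoning is wrapped in `<think>...</think>` and the answer follows it; the function name `split_reasoning` and the sample text are made up for illustration.

```python
import re


def split_reasoning(text):
    """Return (reasoning, answer) from text that may contain a <think>...</think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is not None:
        return match.group(1).strip(), text[match.end():].strip()
    if "<think>" in text:
        # Generation stopped (e.g. at max_new_tokens) before the reasoning was closed.
        return text.split("<think>", 1)[1].strip(), ""
    return "", text.strip()


# Made-up sample in the same shape as the model output.
sample = "<think>\nThe user wants one sentence.\n</think>\n\nLLMs store knowledge in their weights."
reasoning, answer = split_reasoning(sample)
print("REASONING:", reasoning)
print("ANSWER:", answer)
```

An empty answer is a useful signal that `max_new_tokens` was too small for the model to finish thinking.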


In [1]:
!pip install -q -U transformers  #>=4.43.0
!pip install -q einops
!pip install -q -U accelerate  #>=0.31.0
!pip install -q -U bitsandbytes
!pip install -q -U  flash_attn



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [3]:
from pprint import pprint

In [4]:
#In case we want to know our installed transformers library version
!pip list | grep transformers
!pip list | grep accelerate
!pip list | grep flash_attn

sentence-transformers                    5.1.1
transformers                             4.57.0
accelerate                               1.10.1
flash_attn                               2.8.3


In [5]:
#Quantization shrinks the memory footprint of the LLM
# allowing us to load it on a smaller GPU
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

In [6]:

model_id = "Qwen/Qwen3-4B"   # Try this first: it downloads and loads faster
#model_id = "Qwen/Qwen3-14B" # Much larger version to compare against; more parameters usually means better performance

pipeline = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"dtype": torch.bfloat16, "quantization_config": quantization_config},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a science communicator who makes technology accessible to everyone!"},
    {"role": "user", "content": "Please write a five sentence explanation of how LLMs do knowledge representation."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

pprint(outputs[0]["generated_text"][-1], compact=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


{'content': '<think>\n'
            'Okay, the user wants a five-sentence explanation of how LLMs do '
            'knowledge representation. Let me start by recalling what I know '
            'about LLMs and knowledge representation.\n'
            '\n'
            'First, I need to explain that LLMs use large-scale data to learn '
            'patterns. Then, mention how they represent knowledge through '
            'training on diverse texts. Maybe talk about embeddings as a key '
            'part. Also, include how they use context to understand '
            'relationships. Finally, tie it to the ability to answer questions '
            'and generate text based on that representation.\n'
            '\n'
            'Wait, should I mention specific techniques like transformers or '
            'attention mechanisms? The user might not need the technical '
            'jargon, so maybe keep it high-level. Also, ensure each sentence '
            'is a separate point. Let me che

Let's run some of the same prompts that we ran above to see how well this model performs.  Note that it takes a lot longer to generate answers because this model "thinks" before it generates its final answer.
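
Because the reasoning and the final answer arrive in one string, it can be handy to separate them before comparing models. The following is a minimal sketch, assuming the reasoning is wrapped in `<think>` and `</think>` tags (only the opening tag is visible in the truncated output above; the closing tag is an assumption). `split_thinking` is a hypothetical helper, not part of `transformers`.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, final_answer).

    If the closing tag is missing, for example because generation stopped at
    max_new_tokens, everything after <think> is treated as reasoning.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    if start == -1:
        return "", text.strip()          # no reasoning block at all
    body = text[start + len(open_tag):]
    reasoning, found, answer = body.partition(close_tag)
    if not found:
        return reasoning.strip(), ""     # truncated before the answer began
    return reasoning.strip(), answer.strip()

reasoning, answer = split_thinking("<think>\nSubtract 3, then divide by 2.\n</think>\n\nx = 2")
print("reasoning:", reasoning)
print("answer:", answer)
```

An empty answer is a useful signal that `max_new_tokens` was too small for the model to finish thinking.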

How well do the outputs from Qwen3 compare with the outputs from earlier models? How can we measure their performance? How can we compare the two models quantitatively?
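
One simple starting point for a quantitative comparison is word overlap with a reference answer. The sketch below computes a unigram F1 score, similar in spirit to ROUGE-1; it is only one possible metric, `unigram_f1` is a hypothetical helper, and the example strings are made up for illustration rather than taken from model runs.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercased, whitespace-split tokens shared by candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # multiset intersection of token counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative strings only (not real model outputs).
reference = "subtract 3 from both sides then divide both sides by 2"
print(unigram_f1("subtract 3 from both sides then divide by 2", reference))
print(unigram_f1("guess values of x until the equation holds", reference))
```

Overlap metrics reward matching wording rather than correctness, so they are best paired with task-specific checks such as verifying the final numeric answer.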

In [7]:
messages = [
    {"role": "user", "content": "What are the steps required for solving an 2x + 3 = 7 equation?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)

# Let's set some sampling parameters to have more control over the output
outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

('<think>\n'
 'Okay, so I need to solve the equation 2x + 3 = 7. Let me think about how to '
 'approach this. I remember from algebra that the goal is to isolate the '
 'variable x on one side of the equation. Let me try to recall the steps.\n'
 '\n'
 "First, maybe I should get rid of that 3 that's being added to 2x. To do "
 'that, I can subtract 3 from both sides of the equation. That way, I keep the '
 'equation balanced. Let me try that:\n'
 '\n'
 'Original equation: 2x + 3 = 7\n'
 '\n'
 'Subtract 3 from both sides: 2x + 3 - 3 = 7 - 3\n'
 '\n'
 'Simplifying that: 2x = 4\n'
 '\n'
 'Okay, so now the equation is 2x = 4. Now I need to get x by itself. Since 2 '
 'is multiplied by x, I should divide both sides by 2 to undo the '
 'multiplication. Let me do that:\n'
 '\n'
 'Divide both sides by 2: (2x)/2 = 4/2\n'
 '\n'
 'Simplifying that: x = 2\n'
 '\n'
 'So, x equals 2. Let me check my answer to make sure. If I plug x = 2 back '
 'into the original equation:\n'
 '\n'
 'Left side: 2*(2) 
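
The cell above first turns the message list into one prompt string and then slices that prompt off the front of the generated text. The sketch below imitates both steps without loading a model. It assumes a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers; the real template is defined by the tokenizer, so print `prompt` in the notebook to see the exact string. `to_chatml` and the `demo_*` names are hypothetical.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Format {"role", "content"} dicts in an assumed ChatML-style layout."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

demo_messages = [{"role": "user", "content": "What is 2 + 2?"}]
demo_prompt = to_chatml(demo_messages)

# Given a plain string, a text-generation pipeline returns prompt + completion
# by default, so dropping the first len(prompt) characters leaves the new text.
demo_full_text = demo_prompt + "<think>\n...\n</think>\n\n4"
demo_completion = demo_full_text[len(demo_prompt):]
print(repr(demo_prompt))
print(repr(demo_completion))
```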

Now let's try our chocolate chip cookie recipe request.

In [9]:
messages = [
    {"role": "user", "content": "You are a world renowned baker with many awards and Michelin stars.  Give us your world famous recipe for chocolate chip cookies."},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True # Switches between thinking and non-thinking modes. Default is True; output quality was very good with thinking enabled.
)


# Let's set some sampling parameters to have more control over the output
outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

('<think>\n'
 'Okay, the user wants a world-famous chocolate chip cookie recipe from a '
 'renowned baker with Michelin stars. Let me think about how to approach '
 'this.\n'
 '\n'
 'First, I need to make sure the recipe is not just a regular one but '
 'something that would earn Michelin stars. That means it should have '
 'exceptional ingredients, precise measurements, and maybe some unique '
 'techniques. \n'
 '\n'
 'I should start by recalling the key elements of a great chocolate chip '
 'cookie. The texture is crucial—crispy edges and a gooey center. The '
 'chocolate should be high quality, maybe using a specific type like semisweet '
 'or a blend. \n'
 '\n'
 'The ingredients list needs to be detailed. Maybe include things like '
 'buttermilk for tenderness, or a touch of vanilla for flavor. Also, the dough '
 'should be perfectly balanced between flaky and chewy. \n'
 '\n'
 'Wait, the user mentioned "world famous," so maybe there\'s a signature '
 'element. Perhaps a unique ing

Now let's try the 14B model!
 - With the thinking parameter turned on, it produces a long reasoning trace.
 - The reasoning is noticeably higher in quality than the 4B model's.

In [6]:

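# model_id = "Qwen/Qwen3-4B"   # Try this first; it downloads and loads faster
model_id = "Qwen/Qwen3-14B" # Much larger version to compare against. More parameters usually means better output quality, at the cost of memory and speed

# Call transformers.pipeline explicitly so that re-running this cell still works
# after the name `pipeline` has been rebound to the pipeline object below.
import transformers
pipeline = transformers.pipeline(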
    "text-generation",
    model=model_id,
    model_kwargs={"dtype": torch.bfloat16, "quantization_config": quantization_config},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a science communicator who makes technology accessible to everyone!"},
    {"role": "user", "content": "Please write a five sentence explanation of how LLMs do knowledge representation."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

pprint(outputs[0]["generated_text"][-1], compact=True)

Device set to use cuda:0


{'content': '<think>\n'
            'Okay, the user wants a five-sentence explanation of how LLMs do '
            'knowledge representation. Let me start by recalling what I know '
            'about LLMs. They use neural networks, right? So the first point '
            'should be about the architecture, like transformer models with '
            'attention mechanisms.\n'
            '\n'
            "Next, knowledge representation in LLMs isn't explicit like "
            'databases. Instead, they encode information in the weights of the '
            "model. That's important because it's different from traditional "
            'methods.\n'
            '\n'
            'Then, I should mention how training on vast text data allows them '
            'to capture patterns and relationships between concepts. This '
            'helps in understanding context and generating coherent '
            'responses.\n'
            '\n'
            'Also, the distributed nature of the embeddings

In [7]:
messages = [
    {"role": "user", "content": "What are the steps required for solving an 2x + 3 = 7 equation?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)

# Let's set some sampling parameters to have more control over the output
outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

('<think>\n'
 'Okay, so I need to solve the equation 2x + 3 = 7. Hmm, let me think. I '
 'remember from algebra that the goal is to isolate the variable, which in '
 "this case is x. So, I need to get rid of the 3 that's being added to 2x, and "
 "then deal with the coefficient 2 that's multiplied by x. \n"
 '\n'
 'First step, maybe I should subtract 3 from both sides of the equation to '
 'undo the addition. Let me try that. If I subtract 3 from both sides, it '
 'would look like:\n'
 '\n'
 '2x + 3 - 3 = 7 - 3\n'
 '\n'
 'Simplifying both sides, the +3 and -3 on the left side cancel each other '
 'out, leaving just 2x. On the right side, 7 minus 3 is 4. So now the equation '
 'is 2x = 4. That seems right.\n'
 '\n'
 'Now, the next step is to get x by itself. Since 2 is multiplied by x, I need '
 'to do the opposite operation, which is division. So, I should divide both '
 'sides of the equation by 2. Let me write that down:\n'
 '\n'
 '2x / 2 = 4 / 2\n'
 '\n'
 'On the left side, the 2 in

In [8]:
messages = [
    {"role": "user", "content": "You are a world renowned baker with many awards and Michelin stars.  Give us your world famous recipe for chocolate chip cookies."},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True # Switches between thinking and non-thinking modes. Default is True; output quality was very good with thinking enabled.
)


# Let's set some sampling parameters for more control over the output.
# In our runs, Qwen3 14B produced better results than the 4B model and than GPT-3.5.
outputs = pipeline(        # the text-generation pipeline constructed above
    prompt,                # the chat-templated prompt string built above
    max_new_tokens=2048,   # cap on *newly generated* tokens (doesn't count prompt tokens)
    do_sample=True,        # enable sampling (vs. greedy/beam); needed for temperature/top_p
    temperature=0.6,       # lower = more focused/deterministic, higher = more creative
    top_p=0.9,             # nucleus sampling: sample from smallest token set with cum prob ≥ 0.9
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

('<think>\n'
 'Okay, the user wants my world-famous chocolate chip cookie recipe. Let me '
 'start by recalling the key elements that make a great cookie. First, the '
 'dough needs to be rich and buttery. I should use high-quality unsalted '
 'butter, maybe even European-style for better flavor. Brown sugar adds '
 'moisture and a deeper molasses flavor compared to white sugar. The ratio of '
 'sugars is important—maybe 2:1 brown to white?\n'
 '\n'
 'Next, the flour. I prefer all-purpose flour, but maybe a touch of bread '
 'flour for a bit more structure without making them tough. Baking soda is '
 'essential for leavening, and a pinch of salt to balance the sweetness. \n'
 '\n'
 'For the chocolate, using high-quality semi-sweet chocolate chips is a must. '
 'Maybe add some chopped chocolate for texture variation. The key is to melt '
 "the chocolate just enough so it's glossy but not oily. \n"
 '\n'
 'The dough needs to rest in the fridge. This helps the gluten relax, leading '
 'to
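
The `temperature` and `top_p` settings used in the calls above can be illustrated on a toy distribution. This is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering, not the `transformers` implementation; the four logits are made-up values and both function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p.

    Returns {token_index: renormalized_probability} for the kept tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for four candidate tokens
for temperature in (1.0, 0.6):
    probs = softmax_with_temperature(toy_logits, temperature)
    nucleus = top_p_filter(probs, top_p=0.9)
    print(f"temperature={temperature}: kept token indices {sorted(nucleus)}")
```

With these toy logits, lowering the temperature from 1.0 to 0.6 sharpens the distribution enough that the 0.9 nucleus shrinks from three tokens to two, which is why lower temperatures give more focused output.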

[Return to Top](#returnToTop)  
<a id = 'chatgpt'></a>
### 4.4 Chat GPT output

Finally, for comparison purposes, let's see what the full ChatGPT 3.5 model produces. How well does ChatGPT follow directions, given its training with reinforcement learning from human feedback (RLHF)?

Here's the output for the same prompt we've been using:


---


As a world-renowned James Beard award-winning pastry chef, I'm delighted to share with you my special recipe for chocolate chip cookies. These cookies are the epitome of decadence, with a perfect balance of flavors and textures. Without further ado, here are the ingredients and instructions:

Ingredients:

    1 cup (226g) unsalted butter, softened
    1 cup (200g) granulated sugar
    1 cup (220g) packed brown sugar
    2 large eggs
    1 teaspoon vanilla extract
    3 cups (384g) all-purpose flour
    1 teaspoon baking soda
    1/2 teaspoon salt
    2 cups (340g) semisweet chocolate chips

Instructions:

    Preheat your oven to 350°F (175°C). Line baking sheets with parchment paper or silicone baking mats.

    In a large mixing bowl, cream together the softened butter, granulated sugar, and brown sugar until light and fluffy.

    Add the eggs one at a time, beating well after each addition. Stir in the vanilla extract.

    In a separate bowl, whisk together the flour, baking soda, and salt.

    Gradually add the dry ingredient mixture to the wet ingredients, mixing until just combined. Be careful not to overmix, as this can result in tough cookies.

    Fold in the semisweet chocolate chips until evenly distributed throughout the dough.

    Scoop rounded tablespoons of dough and drop them onto the prepared baking sheets, spacing them about 2 inches apart. You can also use a cookie scoop for more uniform cookies.

    Bake the cookies in the preheated oven for 10-12 minutes or until they are golden brown around the edges but still slightly soft in the center.

    Remove the baking sheets from the oven and allow the cookies to cool on the sheets for 5 minutes. Then, transfer them to wire racks to cool completely.

    Once the cookies have cooled, store them in an airtight container to maintain their freshness.

Now, you have the recipe for my award-winning chocolate chip cookies. Enjoy baking these delightful treats and savor every bite!



---

