# Lesson Notebook 5: Text Generation

In this notebook we will look at 3 different examples:

1. Building a Seq2Seq model for machine translation using RNNs with and without Attention

2. Playing with T5 for summarization and translation

3. Exercise with prompts and language generation using the OPT model

The sequence to sequence architecture is inspired by the Keras Tutorial https://keras.io/examples/nlp/lstm_seq2seq/.


<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Seq2Seq Model](#encoderDecoder)
      * 2.1 [Data Acquisition](#dataAcquisition)
      * 2.2 [Seq2Seq without Attention](#s2sNoAttention)
      * 2.3 [Seq2Seq with Attention](#s2sAttention)
  * 3. [T5](#t5Example)
    * 3.1 [Tokenization](#tokenization)
    * 3.2 [Model Structure & Output](#modelOutput)
  * 4. [Prompt Engineering and Generative Large Language Models](#prompts)
    * 4.1 [Cloze Prompts](#clozePrompts)
    * 4.2 [Prefix Prompts](#prefixPrompt)
    * 4.3 [Class Exercise](#classExercise)
  * 5. [Answers](#answers)      




  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-summer-main/blob/master/materials/lesson_notebooks/lesson_5_Text_Generation.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup

We first need to do the usual setup. We will also use some nltk and sklearn components in order to tokenize the text.

This notebook requires the tensorflow dataset and other prerequisites that you must download and then store locally. This can also be done on Colab.

In [1]:
#@title Installs

!pip install pydot --quiet
!pip install transformers --quiet
!pip install sentencepiece --quiet
!pip install nltk --quiet


In [2]:
#@title Imports

import numpy as np

import tensorflow as tf
from tensorflow import keras

import tensorflow_datasets as tfds

import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer

import os
import nltk

import matplotlib.pyplot as plt

import re
import textwrap

from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration
from transformers import GPT2Tokenizer, TFOPTForCausalLM

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

[Return to Top](#returnToTop)  
<a id = 'encoderDecoder'></a>


## 2. Building a Seq2Seq model for Translation using RNNs with and without Attention

### 2.1 Downloading and pre-processing Data


Let's get the data. Just like the Keras tutorial, we will use http://www.manythings.org as the source for the parallel corpus, but we will use German.  Machine translation requires sentence pairs for training, that is, individual sentences in English and the corresponding sentences in German.

In [4]:
!!curl -O http://www.manythings.org/anki/deu-eng.zip
!!unzip deu-eng.zip

['Archive:  deu-eng.zip',
 '  inflating: deu.txt                 ',
 '  inflating: _about.txt              ']

Next, we need to set a few parameters.  Note these numbers are much smaller than we would set in a real world system.  For example, vocabulary sizes of 2000 and 3000 are unrealistic unless we were dealing with a highly specialized domain.

In [5]:
embed_dim = 100  # Embedding dimensions for vectors and LSTMs.
num_samples = 10000  # Number of examples to consider.

# Path to the data txt file on disk.
data_path = "deu.txt"

# Vocabulary sizes that we'll use:
english_vocab_size = 2000
german_vocab_size = 3000

Next, we need to format the input. In particular we would like to use nltk to help with the tokenization. We will then use sklearn's CountVectorizer to create a vocabulary from the most frequent words in each language.

(Before, we used pre-trained word embeddings from Word2Vec that came with a defined vocabulary. This time, we'll start from scratch, and need to extract the vocabulary from the training text.)
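
To make the vocabulary step concrete, here is a minimal pure-Python sketch of what a frequency-capped vocabulary looks like. It assumes scikit-learn's default `CountVectorizer` behavior (lowercasing, and a token pattern that keeps only tokens of two or more word characters), which is why punctuation such as "." and one-letter words never enter the vocabulary. The helper `build_vocab` and the toy sentences are illustrative only, and tie-breaking among equally frequent words may differ from scikit-learn.

```python
import re
from collections import Counter


def build_vocab(texts, max_features):
    """Keep the max_features most frequent tokens, returned in sorted order."""
    # Same default pattern CountVectorizer uses: tokens of 2+ word characters.
    token_pattern = re.compile(r"(?u)\b\w\w+\b")
    counts = Counter()
    for text in texts:
        counts.update(token_pattern.findall(text.lower()))
    most_common = [token for token, _ in counts.most_common(max_features)]
    return sorted(most_common)


toy_texts = ["go .", "hi .", "go now .", "i see ."]
toy_vocab = build_vocab(toy_texts, max_features=3)
print(toy_vocab)
```

Any token that does not make it into the vocabulary, including the punctuation produced by the nltk tokenizer, will later be mapped to the unknown token.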

In [6]:
input_texts = []
target_texts = []

max_input_length = -1
max_output_length = -1


with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text, _ = line.split("\t")

    tokenized_source_text = nltk.word_tokenize(input_text, language='english')
    tokenized_target_text = nltk.word_tokenize(target_text, language='german')

    if len(tokenized_source_text) > max_input_length:
      max_input_length = len(tokenized_source_text)

    if len(tokenized_target_text) > max_output_length:
      max_output_length = len(tokenized_target_text)


    source_text = (' '.join(tokenized_source_text)).lower()
    target_text = (' '.join(tokenized_target_text)).lower()

    input_texts.append(source_text)
    target_texts.append(target_text)

vectorizer_english = CountVectorizer(max_features=english_vocab_size)
vectorizer_english.fit(input_texts)
vocab_english = vectorizer_english.get_feature_names_out()

vectorizer_german = CountVectorizer(max_features=german_vocab_size)
vectorizer_german.fit(target_texts)
vocab_german = vectorizer_german.get_feature_names_out()

print('Maximum source input length: ', max_input_length)
print('Maximum target output length: ', max_output_length)

Maximum source input length:  6
Maximum target output length:  10


In [7]:
input_texts[:2]

['go .', 'hi .']

In [8]:
target_texts[:2]

['geh .', 'hallo !']

Looks simple but correct.

The cell above reports maximum lengths of 6 (source) and 10 (target) tokens. The next cell budgets 11 tokens for the target side and, because we will add start and end tokens (\<start> and \<end>) on the decoder side, sets the maximum decoder length to 13:

In [9]:
max_encoder_seq_length = 6
max_decoder_seq_length = 13 #11 + start + end

Next, we create the dictionaries translating between integer ids and tokens for both source (English) and target (German).

In [10]:
source_id_vocab_dict = {}
source_vocab_id_dict = {}

for sid, svocab in enumerate(vocab_english):
  source_id_vocab_dict[sid] = svocab
  source_vocab_id_dict[svocab] = sid

source_id_vocab_dict[english_vocab_size] = "<unk>"
source_id_vocab_dict[english_vocab_size + 1] = "<pad>"

source_vocab_id_dict["<unk>"] = english_vocab_size
source_vocab_id_dict["<pad>"] = english_vocab_size + 1

target_id_vocab_dict = {}
target_vocab_id_dict = {}

for tid, tvocab in enumerate(vocab_german):
  target_id_vocab_dict[tid] = tvocab
  target_vocab_id_dict[tvocab] = tid

# Add unknown token plus start and end tokens to target language

target_id_vocab_dict[german_vocab_size] = "<unk>"
target_id_vocab_dict[german_vocab_size + 1] = "<start>"
target_id_vocab_dict[german_vocab_size + 2] = "<end>"
target_id_vocab_dict[german_vocab_size + 3] = "<pad>"

target_vocab_id_dict["<unk>"] = german_vocab_size
target_vocab_id_dict["<start>"] = german_vocab_size + 1
target_vocab_id_dict["<end>"] = german_vocab_size + 2
target_vocab_id_dict["<pad>"] = german_vocab_size + 3

Lastly, we need to create the training and test data that will feed into our two models. It is convenient to define a small function for this that also takes care of padding and adding start/end tokens on the decoder side.

Notice that we need to create three sequences of vocab ids: inputs to the encoder (starting language), inputs to the decoder (output language, for the preceding tokens in the output sequence) and labels for the decoder (the correct next word to predict at each time step in the output, which is shifted one over from the inputs to the decoder).
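
The relationship between the decoder inputs and the decoder labels can be sketched on a single toy sentence. The helper `make_decoder_sequences` is hypothetical (it is not the notebook's function), and the ids simply mirror the special-token layout used in this notebook (`<start>` = 3001, `<end>` = 3002, `<pad>` = 3003).

```python
def make_decoder_sequences(token_ids, start_id, end_id, pad_id, max_length):
    """Build the decoder input and the shifted-by-one decoder labels."""
    # Decoder input: <start> + sentence + <end>, then padding, cut to max_length.
    decoder_input = ([start_id] + token_ids + [end_id] + [pad_id] * max_length)[:max_length]
    # Decoder labels: sentence + <end>, then padding, cut to max_length.
    decoder_labels = (token_ids + [end_id] + [pad_id] * max_length)[:max_length]
    return decoder_input, decoder_labels


decoder_input, decoder_labels = make_decoder_sequences(
    [1078, 3000], start_id=3001, end_id=3002, pad_id=3003, max_length=6)

print(decoder_input)   # what the decoder sees at each time step
print(decoder_labels)  # what it should predict at that time step
```

At every position the label is the token that follows the corresponding decoder input, so the labels are the inputs shifted one step to the left.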

In [11]:
def convert_text_to_data(texts, 
                         vocab_id_dict, 
                         max_length=20, 
                         type=None,
                         train_test_vector=None,
                         samples=100000):
  
  if type is None:
    raise ValueError('\'type\' is not defined. Please choose from: input_source, input_target, output_target.')
  
  train_data = []
  test_data = []

  for text_num, text in enumerate(texts[:samples]):

    sentence_ids = []

    for token in text.split():

      if token in vocab_id_dict.keys():
        sentence_ids.append(vocab_id_dict[token])
      else:
        sentence_ids.append(vocab_id_dict["<unk>"])
    
    # Look up the special-token ids by name rather than by position in the vocab.
    pad_id = vocab_id_dict["<pad>"]
    
    # Depending on encoder/decoder and input/output, add start/end tokens.
    # Then add padding and truncate to max_length.
    
    if type == 'input_source':
      ids = (sentence_ids + [pad_id] * max_length)[:max_length]

    elif type == 'input_target':
      ids = ([vocab_id_dict["<start>"]] + sentence_ids + [vocab_id_dict["<end>"]] + [pad_id] * max_length)[:max_length]

    elif type == 'output_target':
      ids = (sentence_ids + [vocab_id_dict["<end>"]] + [pad_id] * max_length)[:max_length]

    else:
      raise ValueError('Unknown \'type\': ' + str(type))

    if train_test_vector is not None and not train_test_vector[text_num]:
      test_data.append(ids)
    else:
      train_data.append(ids)


  return np.array(train_data), np.array(test_data)


# Draw one train/test flag per sentence pair (roughly 80% train, 20% test).
train_test_split_vector = (np.random.uniform(size=len(input_texts)) > 0.2)

train_source_input_data, test_source_input_data = convert_text_to_data(input_texts, 
                                                                       source_vocab_id_dict,
                                                                       type='input_source',
                                                                       max_length=max_encoder_seq_length,
                                                                       train_test_vector=train_test_split_vector)

train_target_input_data, test_target_input_data = convert_text_to_data(target_texts,
                                                                       target_vocab_id_dict,
                                                                       type='input_target',
                                                                       max_length=max_decoder_seq_length,
                                                                       train_test_vector=train_test_split_vector)

train_target_output_data, test_target_output_data = convert_text_to_data(target_texts,
                                                                         target_vocab_id_dict,
                                                                         type='output_target',
                                                                         max_length=max_decoder_seq_length,
                                                                         train_test_vector=train_test_split_vector)




Let us look at a few examples. They appear correct.

In [12]:
train_source_input_data[:2]

array([[ 745, 2000, 2001, 2001, 2001, 2001],
       [ 829, 2000, 2001, 2001, 2001, 2001]])

In [13]:
train_target_input_data[:2]

array([[3001, 1078, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [3001, 1246, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])

In [14]:
train_target_output_data[:2]

array([[1078, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [1246, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])

[Return to Top](#returnToTop)  
<a id = 's2sNoAttention'></a>

### 2.2 The Seq2seq model without Attention

We need to build both the encoder and the decoder and we'll use LSTMs.  We'll set up the system first without an attention layer between the encoder and decoder.

In [15]:
def create_translation_model_no_att(encode_vocab_size, decode_vocab_size, embed_dim):

    source_input_no_att = tf.keras.layers.Input(shape=(max_encoder_seq_length,),
                                                dtype='int64',
                                                name='source_input_no_att')
    target_input_no_att = tf.keras.layers.Input(shape=(max_decoder_seq_length,),
                                                dtype='int64',
                                                name='target_input_no_att')

    source_embedding_layer_no_att = tf.keras.layers.Embedding(input_dim=encode_vocab_size,
                                                              output_dim=embed_dim,
                                                              name='source_embedding_layer_no_att')

    target_embedding_layer_no_att  = tf.keras.layers.Embedding(input_dim=decode_vocab_size,
                                                               output_dim=embed_dim,
                                                               name='target_embedding_layer_no_att')

    source_embeddings_no_att = source_embedding_layer_no_att(source_input_no_att)
    target_embeddings_no_att = target_embedding_layer_no_att(target_input_no_att)

    encoder_lstm_layer_no_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_no_att')
    encoder_out_no_att, encoder_state_h_no_att, encoder_state_c_no_att = encoder_lstm_layer_no_att(source_embeddings_no_att)

    decoder_lstm_layer_no_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_no_att')
    decoder_lstm_out_no_att = decoder_lstm_layer_no_att(target_embeddings_no_att, [encoder_state_h_no_att, encoder_state_c_no_att])

    target_classification_no_att = tf.keras.layers.Dense(decode_vocab_size,
                                                         activation='softmax',
                                                         name='classification_no_att')(decoder_lstm_out_no_att)

    translation_model_no_att = tf.keras.models.Model(inputs=[source_input_no_att, target_input_no_att], outputs=[target_classification_no_att])

    translation_model_no_att.compile(optimizer="Adam",
                                     loss='sparse_categorical_crossentropy',
                                     metrics=['accuracy'])
    
    return translation_model_no_att


Now we can call the function we created to instantiate that model and confirm that it is set up the way we like using model.summary().

In [16]:
encode_vocab_size = len(source_id_vocab_dict.keys())
decode_vocab_size = len(target_id_vocab_dict.keys())

translation_model_no_att = create_translation_model_no_att(encode_vocab_size, decode_vocab_size, embed_dim)

translation_model_no_att.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 source_input_no_att (InputLaye  [(None, 6)]         0           []                               
 r)                                                                                               
                                                                                                  
 target_input_no_att (InputLaye  [(None, 13)]        0           []                               
 r)                                                                                               
                                                                                                  
 source_embedding_layer_no_att   (None, 6, 100)      200200      ['source_input_no_att[0][0]']    
 (Embedding)                                                                                  

It never hurts to look at the shapes of the outputs.

In [17]:
translation_model_no_att.predict(x=[train_source_input_data, train_target_input_data]).shape



(7990, 13, 3004)
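
The last axis of that output holds one probability per target vocabulary entry for each decoder position. A minimal sketch of turning such an array back into tokens follows; it uses a toy six-entry vocabulary and a hypothetical helper `decode_predictions`, neither of which is part of the notebook.

```python
import numpy as np

# Toy id-to-token dictionary standing in for target_id_vocab_dict.
toy_id_vocab = {0: "geh", 1: "hallo", 2: "<unk>", 3: "<start>", 4: "<end>", 5: "<pad>"}


def decode_predictions(probabilities, id_vocab, end_token="<end>"):
    """Pick the most likely token at each position and stop at the end token."""
    token_ids = np.argmax(probabilities, axis=-1)  # shape: (batch, positions)
    sentences = []
    for row in token_ids:
        tokens = []
        for token_id in row:
            token = id_vocab[int(token_id)]
            if token == end_token:
                break
            tokens.append(token)
        sentences.append(" ".join(tokens))
    return sentences


# One fake "sentence" with 4 positions over a 6-entry vocabulary.
toy_probs = np.zeros((1, 4, 6))
toy_probs[0, 0, 1] = 1.0  # hallo
toy_probs[0, 1, 2] = 1.0  # <unk>
toy_probs[0, 2, 4] = 1.0  # <end>
toy_probs[0, 3, 5] = 1.0  # <pad>

decoded = decode_predictions(toy_probs, toy_id_vocab)
print(decoded)
```

Keep in mind that the `predict` call above was given the reference decoder inputs, so decoding its output this way shows teacher-forced predictions rather than a free-running translation.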

Now that everything checks out, we can train our model.

In [18]:
translation_model_no_att.fit(x=[train_source_input_data, train_target_input_data],
                             y=train_target_output_data,
                             validation_data=([test_source_input_data, test_target_input_data],
                                              test_target_output_data),
                             epochs=40)

Epoch 1/40
...
Epoch 40/40


<keras.callbacks.History at 0x7f00b110f760>

[Return to Top](#returnToTop)  
<a id = 's2sAttention'></a>

### 2.3 The Seq2seq model with Attention

All we need to do is add an attention layer that creates a context vector for each decoder position. We can use the attention layer provided by Keras in *tf.keras.layers.Attention()*.  We will then simply concatenate these context vectors with the corresponding outputs of the decoder LSTM layer in order to predict the translation tokens one by one.
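
To see what the attention step computes, here is a NumPy sketch of plain dot-product attention for a single sentence pair, which is how *tf.keras.layers.Attention()* behaves with its default settings when it is given a query and a value (the value doubles as the key). The function name and the random stand-in states are assumptions for illustration; the shapes follow this notebook (13 decoder positions, 6 encoder positions, 100 dimensions).

```python
import numpy as np


def dot_product_attention(query, value):
    """Return one context vector per query position, plus the attention weights."""
    scores = query @ value.T                              # (T_dec, T_enc)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over encoder positions
    context = weights @ value                             # (T_dec, dim)
    return context, weights


rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(13, 100))  # stand-in for the decoder LSTM outputs
encoder_states = rng.normal(size=(6, 100))   # stand-in for the encoder LSTM outputs

context, weights = dot_product_attention(decoder_states, encoder_states)

# Concatenate each decoder state with its context vector before classification.
combined = np.concatenate([decoder_states, context], axis=-1)
print(context.shape, weights.shape, combined.shape)
```

Each row of `weights` sums to one and says how much each encoder position contributes to the context vector for that decoder position.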

In [19]:
def create_translation_model_with_att(encode_vocab_size, decode_vocab_size, embed_dim):

    source_input_with_att = tf.keras.layers.Input(shape=(max_encoder_seq_length,), 
                                                  dtype='int64',
                                                  name='source_input_with_att')
    target_input_with_att = tf.keras.layers.Input(shape=(max_decoder_seq_length,), 
                                                  dtype='int64',
                                                  name='target_input_with_att')

    source_embedding_layer_with_att = tf.keras.layers.Embedding(input_dim=encode_vocab_size,
                                                                output_dim=embed_dim,
                                                                name='source_embedding_layer_with_att')

    target_embedding_layer_with_att  = tf.keras.layers.Embedding(input_dim=decode_vocab_size,
                                                                 output_dim=embed_dim,
                                                                 name='target_embedding_layer_with_att')

    source_embeddings_with_att = source_embedding_layer_with_att(source_input_with_att)
    target_embeddings_with_att = target_embedding_layer_with_att(target_input_with_att)

    encoder_lstm_layer_with_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_with_att')
    encoder_out_with_att, encoder_state_h_with_att, encoder_state_c_with_att = encoder_lstm_layer_with_att(source_embeddings_with_att)

    decoder_lstm_layer_with_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_with_att')
    decoder_lstm_out_with_att = decoder_lstm_layer_with_att(target_embeddings_with_att, [encoder_state_h_with_att, encoder_state_c_with_att])

    attention_context_vectors = tf.keras.layers.Attention(name='attention_layer')([decoder_lstm_out_with_att, encoder_out_with_att])

    concat_decode_out_with_att = tf.keras.layers.Concatenate(axis=-1, name='concat_layer_with_att')([decoder_lstm_out_with_att, attention_context_vectors])

    target_classification_with_att = tf.keras.layers.Dense(decode_vocab_size,
                                                           activation='softmax',
                                                           name='classification_with_att')(concat_decode_out_with_att)

    translation_model_with_att = tf.keras.models.Model(inputs=[source_input_with_att, target_input_with_att], outputs=[target_classification_with_att])

    translation_model_with_att.compile(optimizer="Adam",
                                       loss='sparse_categorical_crossentropy',
                                       metrics=['accuracy'])

    return translation_model_with_att


In [20]:
translation_model_with_att = create_translation_model_with_att(encode_vocab_size, decode_vocab_size, embed_dim)

translation_model_with_att.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 source_input_with_att (InputLa  [(None, 6)]         0           []                               
 yer)                                                                                             
                                                                                                  
 target_input_with_att (InputLa  [(None, 13)]        0           []                               
 yer)                                                                                             
                                                                                                  
 source_embedding_layer_with_at  (None, 6, 100)      200200      ['source_input_with_att[0][0]']  
 t (Embedding)                                                                              

In [21]:
translation_model_with_att.fit(x=[train_source_input_data, train_target_input_data],
                               y=train_target_output_data,
                               validation_data=([test_source_input_data, test_target_input_data],
                                                test_target_output_data),
                               epochs=40)

Epoch 1/40
...
Epoch 40/40


<keras.callbacks.History at 0x7f006468fc70>

Validation accuracy is about one percentage point better.
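
One caveat when reading these accuracy figures: with `metrics=['accuracy']` and no masking, every one of the 13 decoder positions counts, including the `<pad>` positions, which are easy to predict. The sketch below contrasts plain accuracy with an accuracy that ignores padded positions. It uses NumPy only, and `masked_accuracy` is a hypothetical helper rather than a Keras metric.

```python
import numpy as np


def masked_accuracy(labels, predicted_ids, pad_id):
    """Fraction of non-padding positions where the prediction matches the label."""
    labels = np.asarray(labels)
    predicted_ids = np.asarray(predicted_ids)
    mask = labels != pad_id                    # True for real tokens (including <end>)
    correct = (labels == predicted_ids) & mask
    return correct.sum() / mask.sum()


# One toy sentence: two real tokens, the end token, then padding (pad id 3003).
toy_labels = np.array([[1078, 3000, 3002, 3003, 3003, 3003]])
toy_predictions = np.array([[1078, 5, 3002, 3003, 3003, 3003]])

plain = (toy_labels == toy_predictions).mean()
masked = masked_accuracy(toy_labels, toy_predictions, pad_id=3003)
print(plain, masked)
```

In this toy case a single wrong token costs far more under the masked measure than under plain accuracy, because the correctly predicted padding no longer inflates the score.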

**Question 1:** Why do you think the benefit of adding an attention layer is not larger?

[Return to Top](#returnToTop)  
<a id = 't5Example'></a>

## 3. T5

Now we turn to text generation with transformers. The T5 system was introduced [here](https://arxiv.org/pdf/1910.10683.pdf).  This model uses both the encoder and the decoder configurations of transformers and connects them together.  A big difference with this model is that it is designed to accept text as an input and produce text as an output for a number of different tasks ranging from summarization and question answering to classification.  The system needs to be told which task to perform as the first part of the input text.  Be sure to look in *Appendix D* of the paper to see a complete set of the tasks that T5 base and large checkpoints can perform right out of the box and the data used to train them.

Let's play a bit with Hugging Face's (Large) implementation of T5.

In [22]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-large')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large')

t5_model.summary()

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  32899072  
                                                                 
 encoder (TFT5MainLayer)     multiple                  334939648 
                                                                 
 decoder (TFT5MainLayer)     multiple                  435627520 
                                                                 
Total params: 737,668,096
Trainable params: 737,668,096
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


About 738 million trainable parameters. Quite a lot. 

Let's create a short text to use as an example.

In [23]:
ARTICLE = ("Oh boy, what a lengthy and cumbersome excercise this was. " \
           "I had to look into every detail, check everything twice, " \
           " and then compare to prior results. Because of this tediousness " \
           " and extra work my homework was 2 days late.")

Next, we need to specify the task we want T5 to perform by adding a task prompt to the beginning of the input text.  Because we are summarizing, we add the word *summarize:* to the beginning of our input.

In [24]:
t5_input_text = "summarize: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

First, we will generate a summary using the default output options.

In [25]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'])

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False)
       for g in t5_summary_ids])



['homework was a lengthy and cumbersome excercise . because of this tedious']


Not great. But let's get more sophisticated: we prescribe a minimum and a maximum output length (both counted in tokens) and use beam search with three beams to explore several candidate outputs before returning the best one.  Finally, in order to reduce repetitive output, we tell the model to avoid output that repeats trigrams (sequences of three tokens).
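
The idea behind blocking repeated n-grams can be sketched in a few lines of plain Python. This is a conceptual illustration under simplifying assumptions (whole words stand in for tokens, and `banned_next_tokens` is a hypothetical helper), not the Hugging Face implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, ban its last token.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated_so_far = ["the", "cat", "sat", "on", "the", "cat"]
banned = banned_next_tokens(generated_so_far, ngram_size=3)
print(banned)
```

During generation, the probability of every banned token would be pushed to zero before the next token is chosen, so "the cat sat" cannot appear a second time.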

In [26]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=20,
                                   max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['i had to look into every detail, check everything twice, and then compare to prior results . because of this tediousness and extra work my homework was 2 days late .']


That is a bit better thanks to these generation settings. 

Lastly, can T5 perform machine translation? Yes, in some limited instances.  We need to specify the input and output languages. Keep in mind that the model has only been trained to translate in particular directions, e.g., English to Romanian but NOT Romanian to English.


In [27]:
t5_input_text = "translate English to German: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

In [28]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=10,
                                   max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Ich habe es nicht geschafft, meinen ersten Test zu schreiben, da ich nicht genügend Zeit hatte, um meinen Test zu bearbeiten.']


Hmm... output language fluency is very good. But take the German output, feed it into translate.google.com, and see what it means. Is it anything like its English input? This hallucination might be mitigated by changing some of the generation parameters, such as num_beams.

Is a shorter example more accurate?  Maybe.

In [29]:
t5_input_text = "translate English to German: That was really not very good today; it was too difficult to solve."
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

In [30]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=10,
                                   max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Das war heute wirklich nicht sehr gut; es war zu schwierig zu lösen.']


That is not bad, though some mistakes are there.

[Return to Top](#returnToTop)  
<a id = 'prompts'></a>

## 4. Prompt Engineering and Generative Large Language Models

The development of very large language models such as [GPT3](https://arxiv.org/pdf/2005.14165.pdf) has led to increased interest in few-shot and zero-shot approaches to tasks.  These generative language models allow a user to provide a prompt with several examples followed by a question the model must answer (the few-shot setting).  GPT3, especially its 175 billion parameter model, demonstrates the feasibility of the zero-shot setting, where the model is simply presented with the prompt, without examples, and in many instances provides the correct answer.  

The implication of this zero-shot capability is that a very large generative language model can be pre-trained and then shared by a large group of people because it requires no fine-tuning or parameter manipulation. Instead, the users work on the wording of their prompt and on providing enough context that the model can perform the task correctly. [Liu et al.](https://arxiv.org/pdf/2107.13586.pdf) characterize this as "pre-train, prompt, and predict."

There are multiple approaches to pre-train, prompt, and predict.  Here we explore two of them.  First we look at cloze prompts.  These leverage the masked language model approach used in BERT and T5, where individual words or spans are masked and during pre-training the model learns to predict the masked tokens. Second we look at prefix prompts.  These leverage the next word prediction capability of decoder-only models in the GPT family. 

[Return to Top](#returnToTop)  
<a id = 'clozePrompts'></a>

### 4.1 Cloze Prompts

Cloze prompts take advantage of the masked language model task, where an individual word or span of words anywhere in the input is masked and the language model learns to predict it. 

In [31]:
#Delete the old model so we are managing memory
del t5_tokenizer
del t5_model

#get a new model with a new checkpoint
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/892M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


"\<extra_id_0\>" is the special token (called a sentinel token) we can use with T5 to invoke its masked word modeling ability. The vocabulary holds 100 of these tokens, numbered 0 through 99. This means we can construct sentences, like a fill-in-the-blank test, that allow us to probe the knowledge embedded in the model based on its pre-training.  Here's an example that works well.  After you've run it, try substituting beagle for poodle and you'll see the model gets confused.

Notice too that we are using beam search and returning the top three sequences rather than just the first one.

In [32]:
PROMPT_SENTENCE = ( "An Australian <extra_id_0> is a type of working dog .")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=10,
                                   #temperature=0.8,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Shepherd', 'working', 'Working']
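The relationship between `num_beams` and `num_return_sequences` can be illustrated with a toy beam search. This is only a sketch under stated assumptions: `TOY_MODEL` is a made-up table of next-token probabilities, and `beam_search` is a hypothetical helper, not the `generate` implementation (which also handles length penalties, n-gram blocking, and batching).

```python
import math

# Made-up next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"Shepherd": 0.5, "working": 0.3, "Cattle": 0.2},
    ("Shepherd",): {"<eos>": 0.9, "dog": 0.1},
    ("working",): {"dog": 0.6, "<eos>": 0.4},
    ("Cattle",): {"Dog": 0.7, "<eos>": 0.3},
}


def beam_search(num_beams, num_return_sequences, max_steps=2):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beams carry over
                continue
            # Unknown prefixes are assumed to end immediately.
            next_probs = TOY_MODEL.get(seq, {"<eos>": 1.0})
            for token, prob in next_probs.items():
                candidates.append((seq + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[:num_return_sequences]


for seq, score in beam_search(num_beams=3, num_return_sequences=3):
    print(seq, round(math.exp(score), 2))
```

Because only `num_beams` hypotheses survive each step, `num_return_sequences` can never exceed `num_beams`; with `num_beams=1` the search collapses to greedy decoding.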


In [33]:
#Keep our memory free of old models
del t5_tokenizer
del t5_model

[Return to Top](#returnToTop)  
<a id = 'prefixPrompt'></a>

### 4.2 Prefix Prompts

Prefix prompts are used with models that predict the next word given a large context window.  If you fill that window with the right information, you can get the model to generate the output you want.  GPT-3 relies on this approach to perform well.  You can either include a couple of examples of what you want the model to do and then ask your question (few-shot), or you can just ask your question (zero-shot).

Let's take a look at a decoder-only generative pretrained text generation model: [OPT](https://arxiv.org/pdf/2205.01068.pdf). This model doesn't have separate input and output sequences. Instead, we will feed in one sequence (the prefix prompt) and ask the model to continue generating text to complete that same sequence.  The OPT model is intended to replicate the functionality of the GPT-3 model and comes in several sizes, from 125 million parameters to 175 billion parameters.  We'll work with the 350 million parameter model.
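As a rough illustration of the two options, the sketch below assembles a prefix prompt with or without worked examples. The function `build_prefix_prompt` and the "Q:/A:" layout are assumptions made for this example; any consistent format that ends where the model should continue would serve the same purpose.

```python
def build_prefix_prompt(question, examples=()):
    """Join optional (question, answer) demonstrations and the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The prompt ends with an open "A:" so the model's continuation is the answer.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# No demonstrations: the model only sees the question.
zero_shot = build_prefix_prompt("What is the capital of France?")

# Two demonstrations show the model the pattern before the real question.
few_shot = build_prefix_prompt(
    "What is the capital of France?",
    examples=[
        ("What is the capital of Italy?", "Rome"),
        ("What is the capital of Japan?", "Tokyo"),
    ],
)

print(zero_shot)
print("-----")
print(few_shot)
```

Either string can be passed to the tokenizer in place of a hand-written prompt.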

As with T5, we'll just try out the pre-trained model and see what text it generates for a new starting sequence.

In [34]:
from transformers import GPT2Tokenizer, TFOPTForCausalLM


In [35]:
tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m")
model = TFOPTForCausalLM.from_pretrained("facebook/opt-350m")

All model checkpoint layers were used when initializing TFOPTForCausalLM.

All the layers of TFOPTForCausalLM were initialized from the model checkpoint at facebook/opt-350m.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFOPTForCausalLM for predictions without further training.


We'll give it a prompt now and see what it generates next.

In [37]:
prefix_prompt = 'Yesterday, I went to the store to buy '
input_ids = tokenizer.encode(prefix_prompt, return_tensors='tf')

In [39]:
generated_text_outputs = model.generate(
    input_ids, 
    max_length=35,
    num_return_sequences=3,
    repetition_penalty=1.5,
    top_p=0.92,
    temperature=.85,
    do_sample=True,
    top_k=125,
    early_stopping=True
)

#Print output for each sequence generated above
for i, beam in enumerate(generated_text_outputs):
  print()
  print("{}: {}".format(i, tokenizer.decode(beam, skip_special_tokens=True, clean_up_tokenization_spaces=True)))



0: Yesterday, I went to the store to buy  two new shoes.
Well... maybe you could tell me about what was in those shoes? [](/b11

1: Yesterday, I went to the store to buy  my starter. I ordered some pre-owned sneakers and they were going for $20 but there's no where on the

2: Yesterday, I went to the store to buy iced tea and this person comes in with a mug of water.
The tea is cold enough that it has no liquid
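The sampling arguments used in the `generate` call (`temperature`, `top_k`, `top_p`) each reshape the next-token distribution before a token is drawn. The sketch below shows the idea on a tiny hand-made vocabulary. The function `filter_distribution` and the logit values are assumptions for illustration; the library applies equivalent steps to the full logit tensor rather than to a Python dictionary.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized {token: probability} left after the three filters."""
    # Temperature: divide the logits before the softmax (values < 1 sharpen it).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most probable tokens, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / kept_mass) for tok, prob in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"milk": 2.0, "bread": 1.0, "eggs": 0.0, "shoes": -1.0}
print(filter_distribution(toy_logits))                        # all four tokens remain
print(filter_distribution(toy_logits, top_k=3, top_p=0.9))    # only the likeliest survive
print(filter_distribution(toy_logits, temperature=0.5))       # sharper distribution
```

The next token is then sampled from whatever distribution is left, which is why tighter settings give safer but less varied continuations.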


Now let's try a long prompt to give the model a lot of context to work with and see how well it performs.  We'll ask it to generate a recipe and see how well it follows instructions.  We'll try it with several smaller models available on HuggingFace. Finally, we'll include the output for that same prompt from ChatGPT for comparison purposes.

In order to do so without exceeding the available memory, you should **STOP the notebook and Reconnect it**. 

In [1]:
!pip install pydot --quiet
!pip install transformers --quiet
!pip install sentencepiece --quiet

We'll also use the HuggingFace AutoTokenizer and AutoModelXXX classes as they'll let us just specify new checkpoints rather than having to find the specific model type.  

In [2]:

from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM
from transformers import AutoTokenizer

Now let's try the Facebook OPT model, which is designed to be an open-source equivalent to GPT-3.  We have limited compute resources, so we'll use the 1.3 billion parameter model.  For comparison, the full GPT-3 model has 175 billion parameters.

In [3]:
checkpoint_string = "facebook/opt-1.3b"

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(checkpoint_string)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_string)

In [4]:
inputs = tokenizer("You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies. Only give us the ingredients and instructions.", return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, min_length=100, max_length=300, temperature=0.97, repetition_penalty=1.2
            )
outputs.shape

torch.Size([1, 209])

In [5]:
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies. Only give us the ingredients and instructions. You can share this with us at our website or through email to: [email protected] I have made the same cookies many times over the years and it\'s always become popular with my clients as well as food bloggers that use me in their recipes..\nThank you so much! :) -I am not very technical nor do I like making things up on the fly :D So I don\'t think I\'ll be willing to go into detail about everything involved... But maybe some tips would help? haha\nYour oven control software should allow multiple temperature zones using "secret menus."  The secret menu should show heat levels from low-medium (if you only have one oven) to high-very hot (which is what I\'ve found is necessary). If the oven goes above these levels you\'ll lose some baking time -- maybe an hour total if you bake them until done.']


Now let's try a different model -- a T5 model that is fine-tuned on the [FLAN instruction data](https://arxiv.org/pdf/2109.01652.pdf).  We would expect a better result because it has been fine-tuned to follow instructions.

In [6]:
del tokenizer
del model

checkpoint_string = "google/flan-t5-large"


model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_string)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_string)



In [7]:
inputs = tokenizer("You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies.   Only give us the ingredients and instructions.", return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, min_length=100, max_length=300, temperature=0.97, repetition_penalty=1.2
            )
outputs.shape

torch.Size([1, 116])

Now let's print out the results.  Note that we're using a PyTorch version of T5, so our output is a PyTorch tensor rather than a TensorFlow tensor. You can look at the ingredients and decide how good this recipe would be.

In [8]:
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['you are famous for scrumptious cookies. let your imagination run wild! flour and baking powder, brown sugar and vanilla extract. and to make this recipe, you will need: chocolate chips 1/2 cup non-dairy chocolate chips melted, sugar, flour and a pinch of salt. milk only. And a chocolate drizzle. and a pity tear. and your mouth a little tighter. a little less. and add a teaspoon of cinnamon and powdered sugar. or use all-purpose flour.']


Let's try a different model -- one that is both designed to run with a significantly smaller number of parameters and fine-tuned on a large set of instruction data.  The [Alpaca model](https://crfm.stanford.edu/2023/03/13/alpaca.html) was recently released for research purposes.

In [9]:
del tokenizer
del model

checkpoint_string = "declare-lab/flan-alpaca-large"


model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_string)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_string)

In [10]:
inputs = tokenizer("You are a world renowned James Beard award winning pastry chef. Give us the recipe for your specialty, chocolate chip cookies.  Only give us the ingredients and instructions.", return_tensors="pt")
outputs = model.generate(**inputs,
                         do_sample=True, min_length=100, max_length=300, temperature=0.97, repetition_penalty=1.2
            )
outputs.shape

torch.Size([1, 144])

In [11]:
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['Ingredients: 1 cup all-purpose flour, 1/2 cup cocoa powder, 3/4 teaspoon baking soda, 1/4 teaspoon baking soda 2 cups white sugar, 1 cup butter Instructions: Preheat the oven to 350 degrees Fahrenheit (180 degrees Celsius). Butter a 9x13 inch baking pan; line the bottom of the prepared baking pan with parchment paper. In a medium bowl, cream together butter and sugar until light and fluffy. Slowly stir in egg. Beat in vanilla extract. Add dry ingredients to butter mixture, stirring just until combined. Stir in chocolate chips. Drop tablespoonfuls by ladle onto each lined cookie sheet. Bake for 20 minutes or until edges are golden brown.']


Finally, for comparison purposes, let's see what the full ChatGPT model produces.  How well does ChatGPT follow directions, given its reinforcement learning?

Here's the output for the same prompt we've been using:


---


As a world-renowned James Beard award-winning pastry chef, I'm delighted to share with you my special recipe for chocolate chip cookies. These cookies are the epitome of decadence, with a perfect balance of flavors and textures. Without further ado, here are the ingredients and instructions:

Ingredients:

    1 cup (226g) unsalted butter, softened
    1 cup (200g) granulated sugar
    1 cup (220g) packed brown sugar
    2 large eggs
    1 teaspoon vanilla extract
    3 cups (384g) all-purpose flour
    1 teaspoon baking soda
    1/2 teaspoon salt
    2 cups (340g) semisweet chocolate chips

Instructions:

    Preheat your oven to 350°F (175°C). Line baking sheets with parchment paper or silicone baking mats.

    In a large mixing bowl, cream together the softened butter, granulated sugar, and brown sugar until light and fluffy.

    Add the eggs one at a time, beating well after each addition. Stir in the vanilla extract.

    In a separate bowl, whisk together the flour, baking soda, and salt.

    Gradually add the dry ingredient mixture to the wet ingredients, mixing until just combined. Be careful not to overmix, as this can result in tough cookies.

    Fold in the semisweet chocolate chips until evenly distributed throughout the dough.

    Scoop rounded tablespoons of dough and drop them onto the prepared baking sheets, spacing them about 2 inches apart. You can also use a cookie scoop for more uniform cookies.

    Bake the cookies in the preheated oven for 10-12 minutes or until they are golden brown around the edges but still slightly soft in the center.

    Remove the baking sheets from the oven and allow the cookies to cool on the sheets for 5 minutes. Then, transfer them to wire racks to cool completely.

    Once the cookies have cooled, store them in an airtight container to maintain their freshness.

Now, you have the recipe for my award-winning chocolate chip cookies. Enjoy baking these delightful treats and savor every bite!



---




[Return to Top](#returnToTop)  
<a id = 'classExercise'></a>

### 4.3 In-Class Exercise (or on your own):
- Try changing the prefix_prompt input text to see how OPT completes different types of starting sentences (prefix prompts). (If time, we can brainstorm some sentences to try in groups or collect in the chat during the live session.)
- You can alter num_return_sequences to return a larger or smaller number of output options (these are sampled sequences here, because do_sample=True).
- You might want to play with the value of repetition_penalty to see how it affects the model's output.
- You might also want to see what happens when you increase max_length, and how that relates to the repetition constraints. As the text gets longer, it will be more challenging for the model to avoid repeating itself. So stricter constraints against repetition might make the model get more creative or wander farther from the input sequence.
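When experimenting with `repetition_penalty`, it may help to see what the setting does to the scores. The sketch below follows the usual formulation, in which the logit of every token that has already been generated is pushed down by the penalty factor. The helper `apply_repetition_penalty` and the logit values are assumptions for illustration, not the library's own code.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    """Down-weight every token that already appears in `generated_tokens`."""
    adjusted = dict(logits)
    for tok in set(generated_tokens):
        if tok in adjusted:
            score = adjusted[tok]
            # Dividing a positive logit or multiplying a negative one both lower it.
            adjusted[tok] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = {"tea": 3.0, "water": 1.5, "mug": -0.5}
penalized = apply_repetition_penalty(toy_logits, ["tea", "mug"], penalty=1.5)
print(penalized)
```

A penalty of 1.0 leaves the scores unchanged, and larger values make repeated tokens progressively less likely, which is one reason long generations with a strict penalty can drift away from the prompt.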

[Return to Top](#returnToTop)  
<a id = 'answers'></a>

## 5. Answers

**Question 1:** Why do you think the benefit of adding an attention layer is not larger?

      Answer:   The nature of our training and test sets and the artificial size of the inputs (6 words) and outputs (11 words) means that the gains we might see on long sentences aren't a part of this test.