## Lesson Notebook 5: Text Generation

In this notebook we will look at 2 components:

1. Building a Seq2Seq model for Translation using RNNs with and without Attention

2. Playing with T5

Part 1 is inspired by the Keras Tutorial https://keras.io/examples/nlp/lstm_seq2seq/.

We first need to do the usual setup. We will also use some nltk and sklearn components in order to tokenize the text.


  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-summer-main/blob/master/materials/lesson_notebooks/lesson_5_Text_Generation.ipynb)

In [1]:
#@title Installs

!pip install pydot --quiet
!pip install transformers --quiet
!pip install sentencepiece --quiet
!pip install nltk --quiet

In [2]:
#@title Imports

import numpy as np
import tensorflow as tf
from tensorflow import keras

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration


import sklearn as sk
import os
import nltk

import matplotlib.pyplot as plt

import re


from sklearn.feature_extraction.text import CountVectorizer

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 1. Building a Seq2Seq model for Translation using RNNs with and without Attention

#### 1.a Downloading and pre-processing Data


Let's get the data. Just like the Keras tutorial, we will use http://www.manythings.org as the source for the parallel corpus, but we will use German: 

In [4]:
!!curl -O http://www.manythings.org/anki/deu-eng.zip
!!unzip deu-eng.zip

['Archive:  deu-eng.zip',
 'replace deu.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y',
 '  inflating: deu.txt                 ',
 'replace _about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y',
 '  inflating: _about.txt              ']

Next, we need to set a few parameters:

In [5]:
embed_dim = 100  # Embedding dimensions for vectors and LSTMs.
num_samples = 10000  # Number of examples to consider.
# Path to the data txt file on disk.
data_path = "deu.txt"

# Vocabulary sizes that we consider:
english_vocab_size = 2000
german_vocab_size = 3000

Next, we need to format the input. In particular we would like to use nltk to help with the tokenization. We will then use sklearn to perform the counting.

In [6]:
input_texts = []
target_texts = []

max_input_length = -1
max_output_length = -1


with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text, _ = line.split("\t")

    tokenized_source_text = nltk.word_tokenize(input_text, language='english')
    tokenized_target_text = nltk.word_tokenize(target_text, language='german')

    if len(tokenized_source_text) > max_input_length:
      max_input_length = len(tokenized_source_text)

    if len(tokenized_target_text) > max_output_length:
      max_output_length = len(tokenized_target_text)


    source_text = (' '.join(tokenized_source_text)).lower()
    target_text = (' '.join(tokenized_target_text)).lower()

    input_texts.append(source_text)
    target_texts.append(target_text)

  
vect_english = CountVectorizer(max_features=english_vocab_size)
vect_german = CountVectorizer(max_features=german_vocab_size)

vectorized_english_input = vect_english.fit_transform(input_texts)
vectorized_german_target = vect_german.fit_transform(target_texts)

print('Maximum source input length: ', max_input_length)
print('Maximum target output length: ', max_output_length)

Maximum source input length:  6
Maximum target output length:  11


In [7]:
input_texts[:2]

['go .', 'hi .']

In [8]:
target_texts[:2]

['geh .', 'hallo !']

Looks simple but correct.

So the source and target sequences have maximum lengths of 6 and 11 tokens, respectively. Because we add a start and an end token on the decoder side (11 + 2 = 13), we set the maximum lengths to:

In [9]:
max_encoder_seq_length = 6
max_decoder_seq_length = 13

Next, we create the dictionaries translating between ids and tokens for both source (English) and target (German).

In [10]:
sid_svocab_dict = {}
svocab_sid_dict = {}

for sid, svocab in enumerate(vect_english.get_feature_names_out()):  # get_feature_names() was removed in scikit-learn 1.2
  sid_svocab_dict[sid] = svocab
  svocab_sid_dict[svocab] = sid

sid_svocab_dict[english_vocab_size] = "<unk>"
sid_svocab_dict[english_vocab_size + 1] = "<pad>"

svocab_sid_dict["<unk>"] = english_vocab_size
svocab_sid_dict["<pad>"] = english_vocab_size + 1

tid_tvocab_dict = {}
tvocab_tid_dict = {}

for tid, tvocab in enumerate(vect_german.get_feature_names_out()):  # get_feature_names() was removed in scikit-learn 1.2
  tid_tvocab_dict[tid] = tvocab
  tvocab_tid_dict[tvocab] = tid

# Add unknown token plus start and end tokens to target language

tid_tvocab_dict[german_vocab_size] = "<unk>"
tid_tvocab_dict[german_vocab_size + 1] = "<start>"
tid_tvocab_dict[german_vocab_size + 2] = "<end>"
tid_tvocab_dict[german_vocab_size + 3] = "<pad>"

tvocab_tid_dict["<unk>"] = german_vocab_size
tvocab_tid_dict["<start>"] = german_vocab_size + 1
tvocab_tid_dict["<end>"] = german_vocab_size + 2
tvocab_tid_dict["<pad>"] = german_vocab_size + 3
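To make the vocabulary step concrete, here is a minimal pure-Python sketch of roughly what `CountVectorizer(max_features=...)` and the dictionaries above do together. It assumes scikit-learn's default token pattern (words of two or more characters), which is why punctuation such as `.` never enters the vocabulary and is later mapped to `<unk>`. The names `build_vocab` and `encode` and the toy sentences are illustrative only, and tie-breaking at the frequency cutoff may differ from scikit-learn.

```python
import re
from collections import Counter

# Default token pattern of scikit-learn's CountVectorizer:
# only "words" of two or more word characters are kept.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocab(texts, max_features):
    """Map the `max_features` most frequent tokens to ids in alphabetical order."""
    counts = Counter(
        tok for text in texts for tok in TOKEN_PATTERN.findall(text.lower())
    )
    kept = sorted(tok for tok, _ in counts.most_common(max_features))
    vocab = {tok: idx for idx, tok in enumerate(kept)}
    # Special tokens are appended after the regular vocabulary.
    vocab["<unk>"] = len(vocab)
    vocab["<pad>"] = len(vocab)
    return vocab


def encode(text, vocab):
    """Whitespace-split the text and fall back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]


toy_texts = ["go .", "hi .", "i ran ."]  # made-up examples
toy_vocab = build_vocab(toy_texts, max_features=10)
print(toy_vocab)                  # {'go': 0, 'hi': 1, 'ran': 2, '<unk>': 3, '<pad>': 4}
print(encode("go .", toy_vocab))  # [0, 3]  ('.' is not in the vocabulary)
```

The same effect shows up in the real data: any token the vectorizer did not keep, including punctuation and single-character words, receives the `<unk>` id.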





Lastly, we need to create the training and test data that will feed into our two models. It is convenient to define a small function for this that also takes care of padding and of adding start/end tokens on the decoder side:

In [11]:
def convert_text_to_date(texts, 
                         vocab_id_dict, 
                         max_length=20, 
                         type=None,
                         train_test_vector=None,
                         samples=100000):

  
  if type is None:
    raise ValueError("'type' is not defined. Please choose from: input_source, input_target, output_target.")
  
  train_data = []
  test_data = []

  for text_num, text in enumerate(texts[:samples]):

    sentence_ids = []

    for token in text.split():

      if token in vocab_id_dict.keys():
        sentence_ids.append(vocab_id_dict[token])
      else:
        sentence_ids.append(vocab_id_dict["<unk>"])
    
    vocab_size = len(vocab_id_dict.keys())
    
    # Depending on encoder/decoder and input/output, add start/end tokens.
    # Then add padding.
    
    if type == 'input_source':
      ids = (sentence_ids + [vocab_size - 1] * max_length)[:max_length]

    elif type == 'input_target':
      ids = ([vocab_size -3] + sentence_ids + [vocab_size - 2] + [vocab_size - 1] * max_length)[:max_length]

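    elif type == 'output_target':
      ids = (sentence_ids + [vocab_size - 2] + [vocab_size -1] * max_length)[:max_length]

    else:
      # Fail early instead of hitting an undefined `ids` further down.
      raise ValueError(f"Unknown type '{type}'. Please choose from: input_source, input_target, output_target.")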

    if train_test_vector is not None and not train_test_vector[text_num]:
      test_data.append(ids)
    else:
      train_data.append(ids)


  return np.array(train_data), np.array(test_data)


# True marks a training example (about 80%), False a test example.
train_test_split_vector = (np.random.uniform(size=len(input_texts)) > 0.2)

train_source_input_data, test_source_input_data = convert_text_to_date(input_texts, 
                                         svocab_sid_dict, 
                                         type='input_source',
                                         max_length=max_encoder_seq_length,
                                         train_test_vector=train_test_split_vector,
                                         #samples=2
                                         )

train_target_input_data, test_target_input_data = convert_text_to_date(target_texts, 
                                         tvocab_tid_dict, 
                                         type='input_target',
                                         max_length=max_decoder_seq_length,
                                         train_test_vector=train_test_split_vector,
                                         #samples=2
                                         )

train_target_output_data, test_target_output_data = convert_text_to_date(target_texts, 
                                         tvocab_tid_dict, 
                                         type='output_target',
                                         max_length=max_decoder_seq_length,
                                        train_test_vector=train_test_split_vector,
                                         #samples=2
                                          )




Let us look at a few examples. They appear correct.

In [12]:
train_source_input_data[:2]

array([[ 848, 2000, 2001, 2001, 2001, 2001],
       [ 848, 2000, 2001, 2001, 2001, 2001]])

In [13]:
train_target_input_data[:2]

array([[3001, 1244, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [3001, 3000, 1218, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])

In [14]:
train_target_output_data[:2]

array([[1244, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [3000, 1218, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])
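The two target arrays above differ only by a one-position shift: the decoder input starts with `<start>`, and the decoder output is what the model should predict at each step (teacher forcing). A minimal sketch of that relationship, using the ids 3001 (`<start>`), 3002 (`<end>`) and 3003 (`<pad>`) from this notebook; the helper name `make_decoder_pair` and the shortened `max_length` are illustrative assumptions.

```python
def make_decoder_pair(token_ids, start_id, end_id, pad_id, max_length):
    """Build the (decoder input, decoder target) pair for one sentence."""
    decoder_input = ([start_id] + token_ids + [end_id] + [pad_id] * max_length)[:max_length]
    decoder_target = (token_ids + [end_id] + [pad_id] * max_length)[:max_length]
    return decoder_input, decoder_target


dec_in, dec_out = make_decoder_pair(
    [1244, 3000], start_id=3001, end_id=3002, pad_id=3003, max_length=6
)
print(dec_in)   # [3001, 1244, 3000, 3002, 3003, 3003]
print(dec_out)  # [1244, 3000, 3002, 3003, 3003, 3003]

# At step t the decoder sees dec_in[t] and is trained to emit dec_out[t],
# which is simply the next token of the input sequence.
assert dec_in[1:] == dec_out[:-1]
```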

#### 1.b The Seq2seq model without Attention

This is straightforward to build:

In [15]:
encode_vocab_size = len(sid_svocab_dict.keys())
decode_vocab_size = len(tid_tvocab_dict.keys())

### Translation Model ###

source_input_no_att = tf.keras.layers.Input(shape=(max_encoder_seq_length,), 
                                     dtype='int64',
                                     name='source_input_no_att')
target_input_no_att = tf.keras.layers.Input(shape=(max_decoder_seq_length,), 
                                     dtype='int64',
                                     name='target_input_no_att')

source_embedding_layer_no_att = tf.keras.layers.Embedding(input_dim=encode_vocab_size,
                                              output_dim=embed_dim, 
                                              name='source_embedding_layer_no_att')

target_embeddings_layer_no_att  = tf.keras.layers.Embedding(input_dim=decode_vocab_size,
                                              output_dim=embed_dim, 
                                              name='target_embedding_layer_no_att')


source_embeddings_no_att = source_embedding_layer_no_att(source_input_no_att)
target_embeddings_no_att = target_embeddings_layer_no_att(target_input_no_att)

encoder_lstm_layer_no_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_no_att')
encoder_out_no_att, encoder_state_h_no_att, encoder_state_c_no_att = encoder_lstm_layer_no_att(source_embeddings_no_att)


decoder_lstm_layer_no_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_no_att')
decoder_lstm_out_no_att = decoder_lstm_layer_no_att(target_embeddings_no_att, [encoder_state_h_no_att, encoder_state_c_no_att])


target_classification_no_att = tf.keras.layers.Dense(decode_vocab_size, 
                                              activation='softmax', 
                                              name='classification_no_att')(decoder_lstm_out_no_att)


translation_model_no_att = tf.keras.models.Model(inputs=[source_input_no_att, target_input_no_att], outputs=[target_classification_no_att])


translation_model_no_att.compile(optimizer="Adam",
                          loss='sparse_categorical_crossentropy', 
                          metrics=['accuracy'])


In [16]:
translation_model_no_att.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 source_input_no_att (InputLaye  [(None, 6)]         0           []                               
 r)                                                                                               
                                                                                                  
 target_input_no_att (InputLaye  [(None, 13)]        0           []                               
 r)                                                                                               
                                                                                                  
 source_embedding_layer_no_att   (None, 6, 100)      200200      ['source_input_no_att[0][0]']    
 (Embedding)                                                                                  
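The summary output above is cut off after the source embedding, but the parameter counts can be reproduced by hand. The sketch below uses the standard formulas for Keras `Embedding`, `LSTM` (with bias) and `Dense` layers; only the 200200 figure is confirmed by the output shown, the rest are computed under the assumption that the vocabularies have exactly 2002 and 3004 entries. The helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units


embed_dim = 100
encode_vocab_size = 2000 + 2  # vocabulary + <unk>, <pad>
decode_vocab_size = 3000 + 4  # vocabulary + <unk>, <start>, <end>, <pad>

source_embedding = embedding_params(encode_vocab_size, embed_dim)
target_embedding = embedding_params(decode_vocab_size, embed_dim)
encoder_lstm = lstm_params(embed_dim, embed_dim)
decoder_lstm = lstm_params(embed_dim, embed_dim)
classifier_no_att = dense_params(embed_dim, decode_vocab_size)
# With attention, the classifier sees LSTM output and context vector concatenated.
classifier_with_att = dense_params(2 * embed_dim, decode_vocab_size)

print(source_embedding)  # 200200, matching the summary line above
print(target_embedding, encoder_lstm, decoder_lstm, classifier_no_att, classifier_with_att)
```

Most of the parameters sit in the embeddings and the output layer, not in the LSTMs.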

It never hurts to look at the shapes of the outputs.

In [17]:
translation_model_no_att.predict(x=[train_source_input_data, train_target_input_data]).shape

(7998, 13, 3004)
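The last axis of this shape holds one softmax distribution over the 3004 target ids, and there is one such distribution for each of the 13 decoder positions, padding included. Because the embedding layers above do not set `mask_zero` and no sample weights are passed to `fit`, the `accuracy` metric is averaged over the padded positions as well. The sketch below is plain Python with a hypothetical helper `token_accuracy` and made-up predictions; it only illustrates how much padding can inflate the number.

```python
def token_accuracy(y_true, y_pred, ignore_id=None):
    """Fraction of positions where prediction equals target, optionally skipping one id."""
    correct = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for true_id, pred_id in zip(true_row, pred_row):
            if ignore_id is not None and true_id == ignore_id:
                continue
            total += 1
            correct += int(true_id == pred_id)
    return correct / total


PAD = 3003
y_true = [[1244, 3000, 3002, PAD, PAD, PAD]]
y_pred = [[1244, 17, 3002, PAD, PAD, PAD]]  # one real token is wrong

print(token_accuracy(y_true, y_pred))                 # 5 of 6 positions correct
print(token_accuracy(y_true, y_pred, ignore_id=PAD))  # 2 of 3 real tokens correct
```

With short sentences padded to length 13, most positions are `<pad>`, so reported accuracies are best compared between models, not read as absolute translation quality.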

In [18]:
translation_model_no_att.fit(x=[train_source_input_data, train_target_input_data],
                      y=train_target_output_data,
                      validation_data=([test_source_input_data, test_target_input_data], 
                                       test_target_output_data
                                       ),
                      epochs=50
                      )

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f1daca8d750>

#### 1.c The Seq2seq model with Attention

All we need to do is add an attention layer that creates a context vector for each decoder position. We then concatenate each context vector with the decoder LSTM output at the same position and predict the translation tokens one by one.

In [19]:
### Translation Model ###

source_input_with_att = tf.keras.layers.Input(shape=(max_encoder_seq_length,), 
                                     dtype='int64',
                                     name='source_input_with_att')
target_input_with_att = tf.keras.layers.Input(shape=(max_decoder_seq_length,), 
                                     dtype='int64',
                                     name='target_input_with_att')

source_embedding_layer_with_att = tf.keras.layers.Embedding(input_dim=encode_vocab_size,
                                              output_dim=embed_dim, 
                                              name='source_embedding_layer_with_att')

target_embeddings_layer_with_att  = tf.keras.layers.Embedding(input_dim=decode_vocab_size,
                                              output_dim=embed_dim, 
                                              name='target_embedding_layer_with_att')


source_embeddings_with_att = source_embedding_layer_with_att(source_input_with_att)
target_embeddings_with_att = target_embeddings_layer_with_att(target_input_with_att)

encoder_lstm_layer_with_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_with_att')
encoder_out_with_att, encoder_state_h_with_att, encoder_state_c_with_att = encoder_lstm_layer_with_att(source_embeddings_with_att)


decoder_lstm_layer_with_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_with_att')
decoder_lstm_out_with_att = decoder_lstm_layer_with_att(target_embeddings_with_att, [encoder_state_h_with_att, encoder_state_c_with_att])


attention_context_vectors = tf.keras.layers.Attention(name='attention_layer')([decoder_lstm_out_with_att, encoder_out_with_att])


concat_decode_out_with_att = tf.keras.layers.Concatenate(axis=-1, name='concat_layer_with_att')([decoder_lstm_out_with_att, attention_context_vectors])

target_classification_with_att = tf.keras.layers.Dense(decode_vocab_size, 
                                              activation='softmax', 
                                              name='classification_with_att')(concat_decode_out_with_att)


translation_model_with_att = tf.keras.models.Model(inputs=[source_input_with_att, target_input_with_att], outputs=[target_classification_with_att])


translation_model_with_att.compile(optimizer="Adam",
                          loss='sparse_categorical_crossentropy', 
                          metrics=['accuracy'])
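To see what the attention and concatenation layers compute, here is a small pure-Python sketch of dot-product attention, the default scoring used by `tf.keras.layers.Attention` when called with `[query, value]` (keys default to the values, no scaling). The function names and the tiny two-dimensional "states" are made up for illustration; the real layer works on batched tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_product_attention(queries, values):
    """Return one context vector per query; keys are the values themselves."""
    contexts = []
    for query in queries:
        weights = softmax([dot(query, value) for value in values])
        context = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        contexts.append(context)
    return contexts


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[2.0, 0.0], [0.0, 2.0]]              # 2 target positions

contexts = dot_product_attention(decoder_states, encoder_states)
# Mirror of the Concatenate layer: decoder state followed by its context vector.
concatenated = [state + context for state, context in zip(decoder_states, contexts)]
print(concatenated)
```

Each decoder position gets its own weighted average of the encoder outputs, so the classifier input doubles in width from `embed_dim` to `2 * embed_dim`.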


In [20]:
translation_model_with_att.fit(x=[train_source_input_data, train_target_input_data],
                      y=train_target_output_data,
                      validation_data=([test_source_input_data, test_target_input_data], 
                                       test_target_output_data
                                       ),
                      epochs=50
                      )

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f1dc5e76b10>

With attention, validation accuracy is about three quarters of a percentage point higher than without. Nice.

**Question:** Why do you think the benefit is not larger?

### 2. T5

Now we turn to text generation with transformers. 

Let's play a bit with Hugging Face's implementation of T5, using the `t5-large` checkpoint.

In [21]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-large')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large')

t5_model.summary()

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 32899072  
                                                                 
 encoder (TFT5MainLayer)     multiple                  302040576 
                                                                 
 decoder (TFT5MainLayer)     multiple                  402728448 
                                                                 
Total params: 737,668,096
Trainable params: 737,668,096
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


That is roughly 738 million trainable parameters. Quite a lot.

Let's create a short text to use as an example.

In [22]:
ARTICLE = ( "Oh boy, what a lengthy and cumbersome excercise this was. I had to look into every detail, check everything twice,\
         and then compare to prior results. Because of this tediousness and extra work my homework was 2 days late.")

Next, for T5 to work we need to specify the task and include it in the input text.

In [23]:
t5_input_text = "summarize: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

Now we will first generate a summary without any further specifications.

In [24]:
# Generate Summary
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'])

print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['homework was a lengthy and cumbersome excercise . because of this tedious']


Oh boy. Not great. But let's get more sophisticated, prescribe a minimum length, and use beam search:

In [25]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=1,
                                    min_length=20,
                                    max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['i had to look into every detail, check everything twice and compare with prior results. because of this tediousness my homework was 2 days late!']


That is a bit better! 
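Why can beam search help? Greedy decoding commits to the most likely token at every step, while beam search keeps several partial hypotheses alive and can recover a sequence that is more probable overall. The sketch below is a toy illustration only: `NEXT_TOKEN_PROBS` is a made-up table that depends just on the previous token, and the function omits features of the Hugging Face implementation such as length penalties and minimum length.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<start>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.4, "y": 0.3, "<end>": 0.3},
    "b": {"<end>": 0.9, "x": 0.1},
    "x": {"<end>": 1.0},
    "y": {"<end>": 1.0},
}


def beam_search(next_probs, num_beams, max_length=10):
    """Return the best (tokens, probability) found with `num_beams` hypotheses."""
    beams = [(0.0, ["<start>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live hypothesis by every possible next token.
        candidates = [
            (score + math.log(prob), tokens + [token])
            for score, tokens in beams
            for token, prob in next_probs[tokens[-1]].items()
        ]
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            if cand[1][-1] == "<end>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    best_score, best_tokens = max(finished or beams, key=lambda cand: cand[0])
    return best_tokens, math.exp(best_score)


greedy_tokens, greedy_prob = beam_search(NEXT_TOKEN_PROBS, num_beams=1)
beam_tokens, beam_prob = beam_search(NEXT_TOKEN_PROBS, num_beams=2)
print(greedy_tokens, round(greedy_prob, 2))  # ['<start>', 'a', 'x', '<end>'] 0.24
print(beam_tokens, round(beam_prob, 2))      # ['<start>', 'b', '<end>'] 0.36
```

A side note on the other settings used above: `no_repeat_ngram_size=1` forbids any token from appearing twice in the generated output, which plausibly explains why the summary ends in `!` instead of a second `.`.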

Lastly, can it translate?


In [26]:
t5_input_text = "translate English to German: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

In [27]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=1,
                                    min_length=20,
                                    max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Ich habe es mir sehr schwer gemacht, diese Aufgabe zu bewältigen.']


Hmm... language fluency is very good. But the system shortened things a lot. A shorter example maybe?

In [28]:
t5_input_text = "translate English to German: That was really not very good today; it was too difficult to solve."
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

In [29]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=1,
                                    min_length=20,
                                    max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Das war heute wirklich nicht sehr gut; es ist zu schwierig, die Sache aufzulösen.']


That is not bad, though there are some mistakes: for example, "es ist" is in the present tense, while the source says "it was".