# T-725 Natural Language Processing: Lab 5
In today's lab, we will be working with neural networks, using GRUs and Transformers for text generation.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* **Select `"Runtime" > "Change runtime type"`, and make sure that you have "Hardware accelerator" set to "GPU"**
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [9]:
import os
import warnings

# Silence TensorFlow's C++ log messages (must be set before importing tensorflow)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Generating text with neural networks
Let's create a neural language model and use it to generate some text. This time, we will use character embeddings rather than word embeddings. They are created in exactly the same way, and are often used together in neural network-based models. One benefit of using character embeddings is that we can generate words that our model has never seen before.

The model takes as input a sequence of characters and predicts which character is most likely to follow. We will generate text by repeatedly predicting and appending the next character to a string. First, however, we need some text to train it on.
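The predict-and-append loop can be illustrated without a neural network. The following is only a minimal sketch: simple character bigram counts stand in for the trained model, and the names `train_bigram_counts`, `generate` and `toy_text` are made up for this illustration.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts

def generate(counts, start_string, num_generate, rng):
    """Repeatedly sample the next character and append it to the string."""
    result = start_string
    for _ in range(num_generate):
        followers = counts.get(result[-1])
        if not followers:  # No known continuation for this character
            break
        chars = list(followers.keys())
        weights = list(followers.values())
        result += rng.choices(chars, weights=weights, k=1)[0]
    return result

toy_text = "hello hello help held"  # Hypothetical toy training text
toy_counts = train_bigram_counts(toy_text)
sample = generate(toy_counts, "he", 10, random.Random(0))
print(sample)
```

The neural model in this lab follows the same loop, but it conditions on the whole preceding sequence (through the GRU's hidden state) rather than on the last character only.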


In [2]:
# Based on the following tutorial:
# https://www.tensorflow.org/tutorials/text/text_generation

import tensorflow as tf
import numpy as np
import os
import time

# Let's download some text by Shakespeare to train our model
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = tf.keras.utils.get_file('shakespeare.txt', url)

with open(path_to_file, encoding='utf-8') as f:
  shakespeare = f.read()

print("First 250 characters:")
print(shakespeare[:250])

print("Length of text: {:,} characters".format(len(shakespeare)))

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
First 250 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

Length of text: 1,115,394 characters


Now we can create training examples for our model. Each example will be a pair of strings: one input string containing 100 characters, and a target string that is one character ahead. For example, the first pair we create is:

**Input string**:  `'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'`

**Target string**: `'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '`
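As a small illustration of how such a pair is formed, the sketch below shifts a short chunk by one character. It uses a sequence length of 10 instead of 100 to keep the output readable, and `shift_by_one` is a hypothetical helper name used only here.

```python
def shift_by_one(chunk):
    """Split a chunk into (input, target), where the target is one character ahead."""
    return chunk[:-1], chunk[1:]

text = "First Citizen:\nBefore we proceed"
seq_length = 10  # The lab uses 100; 10 keeps this example short

# Each chunk holds seq_length + 1 characters so that both halves have seq_length
chunk = text[:seq_length + 1]
input_text, target_text = shift_by_one(chunk)

print(repr(input_text))   # 'First Citi'
print(repr(target_text))  # 'irst Citiz'
```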

However, before we can start training, we need to convert our text into a list of integers, where each integer represents a different character. For example, "First Citizen" becomes:

```
Character:   F   i   r   s   t      C   i   t   i   z   e   n
Integer:   [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52]
```
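The conversion can be sketched in plain Python as follows. Note that this example builds its vocabulary from the short string alone, so the integers differ from the ones shown above, which come from the vocabulary of the full Shakespeare text.

```python
text = "First Citizen"

# The vocabulary is the sorted set of unique characters
vocab = sorted(set(text))

# Map each character to an integer, and keep the list for the reverse lookup
char_to_index = {char: index for index, char in enumerate(vocab)}
index_to_char = vocab

encoded = [char_to_index[c] for c in text]
decoded = "".join(index_to_char[i] for i in encoded)

print(encoded)  # [2, 4, 6, 7, 8, 0, 1, 4, 8, 4, 9, 3, 5]
print(decoded)  # First Citizen
```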

In [3]:
BATCH_SIZE = 64  # Batch size
BUFFER_SIZE = 10000  # Buffer size to shuffle the dataset

def split_input_target(chunk):
  # Create (input_string, output_string) pairs
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

def prepare_text(text):
  # The unique characters in the file
  vocab = sorted(set(text))
  print('{} unique characters'.format(len(vocab)))

  # Creating a mapping from unique characters to indices
  char_map = {
      'char_to_index': {char: index for index, char in enumerate(vocab)},
      'index_to_char': np.array(vocab)
  }

  text_as_int = np.array([char_map['char_to_index'][c] for c in text])

  # The maximum length sentence we want for a single input in characters
  seq_length = 100
  examples_per_epoch = len(text) // (seq_length+1)

  # Create training examples / targets
  char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
  sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
  dataset = sequences.map(split_input_target)

  # (TF data is designed to work with possibly infinite sequences,
  # so it doesn't attempt to shuffle the entire sequence in memory. Instead,
  # it maintains a buffer in which it shuffles elements).
  dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

  return dataset, vocab, examples_per_epoch, char_map

Now we can create and train the neural network.

In [29]:
import os

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)


def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size,
                                embedding_dim,
                                batch_input_shape=[batch_size, None]),
      tf.keras.layers.GRU(rnn_units,
                          return_sequences=True,
                          recurrent_initializer='glorot_uniform',
                          stateful=True),
      tf.keras.layers.Dense(vocab_size)
  ])

  return model


def create_model(text, epochs=3, embedding_dim=256, rnn_units=1024):
  dataset, vocab, examples_per_epoch, char_map = prepare_text(text)

  vocab_size = len(vocab)  # Length of the vocabulary in chars
  # embedding_dim = 256  # The embedding dimension
  # rnn_units = 1024  # Number of RNN units

  model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

  # Compile the model
  model.compile(optimizer='adam', loss=loss)

  # Save a checkpoint of the model weights at the end of each epoch
  checkpoint_dir = './training_checkpoints'
  checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
  checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
      filepath=checkpoint_prefix,
      save_weights_only=True)

  # Train the model
  history = model.fit(
      dataset,
      epochs=epochs,
      callbacks=[checkpoint_callback])

  # Rebuild the model with a batch size of 1 for text generation,
  # and restore the trained weights from the latest checkpoint
  model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
  model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
  model.build(tf.TensorShape([1, None]))

  return model, char_map

In [5]:
shake_model, shake_chars = create_model(shakespeare)

65 unique characters
Epoch 1/3
Epoch 2/3
Epoch 3/3


Now that we've trained our model, we can finally use it to generate some text. The following function takes a model and a start string as input, and repeatedly predicts the next character and appends it to the string until 1,000 new characters have been generated.

In [6]:
def generate_text(model, char_map, start_string, temperature=1.0):
  # Evaluation step (generating text using the learned model)
  # Lower temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  if not start_string:
    print("start_string can't be empty")
    return ""

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char_map['char_to_index'][s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(char_map['index_to_char'][predicted_id])

  return (start_string + ''.join(text_generated))

Let's generate some text!

In [7]:
print(generate_text(shake_model, shake_chars, "ROMEO: ", temperature=1.0))

ROMEO: Indersers?

MENCUTET:
Go, sirch my bods; that se? Go the call couse, I have looks at with dier:
Afoing Herrow!-
LEONTES:
Prothers; kill should so if, and risparedr nume!
I have 'twers live,
Arw'd lenge of yor trought thou art threazen.

HESRY BOLINGBROKE:
Hese chowery her, mut give
With hold in.
Dusting and mare the admigies of dares,
Loves, or bid the nobarry pilie, har great what an well by the ne
That him thee, Gives years, I do ass.

ASTOLLO:
Pray thou though sourte come dog.
Or IAll this mickly parpor to enve!
No, if a great and prince him littas thy vise!

TONCESTER:
Athis such ase med, my daughty be flemble wothim.

PETRUCHIO:
Wey, it gut shoulows from mornel!
And resormy: but you be as? For The think or hen lime
To a roward: thy shall Monty in theerook, which you art,
Benest thou fless as we'll keeply parrnieve
The devisemend. of you with you,
stand in, morsert.
They say that so seee, shall good deptives itseen
With upcordleing call do as thy bose of Montious, and in the

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 on Monday morning, October 2nd. Remember to save your file before uploading it.

## Question 1
The `temperature` parameter of `generate_text()`, defined earlier in the notebook, controls how predictable the generated text will be. The lower the temperature, the more the function will tend to append the most likely character (according to the model's prediction). A higher temperature introduces some randomness, leading to more unpredictable text.
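To see why this happens, recall that `generate_text()` divides the model's output scores (logits) by the temperature before sampling. The sketch below applies the same scaling to three made-up logits and converts them to probabilities with a softmax; `softmax_with_temperature` and `example_logits` are names invented for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 1.0, 0.0]  # Hypothetical scores for three characters

for t in (0.2, 0.8, 1.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.2, nearly all of the probability mass lands on the highest-scoring character, so sampling behaves almost like always picking the most likely character. At 1.0 the distribution is left unchanged.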

The text we generated above used a temperature of 1.0. Try generating more text using the Shakespeare model, once using a temperature of 0.2 and again using a temperature of 0.8.

In [19]:
# Your solution here
print("\n\n-------------- TEMPERATURE 0.2 --------------"
      "\n"+generate_text(shake_model, shake_chars, "JULIET: ", temperature=0.2))
print("\n\n-------------- TEMPERATURE 0.8 --------------"
      "\n"+generate_text(shake_model, shake_chars, "MERCUTIO: ", temperature=0.8))



-------------- TEMPERATURE 0.2 --------------
JULIET: I will be so see the seath of the seal.

KING RICHARD III:
A marred me so see the common of your such a manter.

KING RICHARD II:
And the seath the rest of the send the seems and the sears of my lord.

KING RICHARD II:
And there is the seems of the seave the marred of the see the seal the world be soul the dead of the seems of the rest of the servent of the dead and the common of the commons of the proud that shall not the dead and the seave the sees of the marreal of the seal.

LEONTES:
Go, sir, she shall be sone the seal the seal of the seath.

KING RICHARD II:
And there is the see the send the dear of the seed of the marred me to the prould the send the send the dead with the see the seather and the dead is a man and the dead of the seath of the world the sender of the see the dead the procest the see the dead of this with the seals of the first of the comes me the seather her for the sears the marred the seest the marred with 

## Question 2
NLTK's `names` corpus contains a list of approximately 8,000 English names. Train a new model on `names_raw` for at least 20 epochs using the `create_model(text, epochs=n)` function defined earlier. Use the trained model to generate a list of names (with the `generate_text` function defined earlier), starting with your own first name. Your name should not contain any non-English characters, and should end with an `\n`.

Print out the names that do not appear in the training data. Do you get any actual names (or at least names that sound plausible)?
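One possible way to do the filtering is sketched below. It assumes the generated text contains one name per line; `novel_names` is a hypothetical helper, and the example data is made up.

```python
def novel_names(generated_text, known_names):
    """Return generated names that are not in known_names, without duplicates."""
    seen = set()
    result = []
    for name in generated_text.splitlines():
        name = name.strip()
        if name and name not in known_names and name not in seen:
            seen.add(name)
            result.append(name)
    return result

example_known = {"Joan", "Marie"}
example_output = "Joan\nZovley\nMarie\nZovley\nQuin\n"
print(novel_names(example_output, example_known))  # ['Zovley', 'Quin']
```

In this notebook, the set of training names is available as `names_unique`, so the call would look like `novel_names(generated, names_unique)`, where `generated` holds the string returned by `generate_text`. Keep in mind that the first line is your own prompt and that the last line may be cut off mid-name, since generation stops after a fixed number of characters.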

In [9]:
# Don't modify this code cell
import nltk
from nltk.corpus import names
nltk.download('names')

# Print out a few examples
names_raw = names.raw()
names_unique = set(names_raw.split())
names_raw = "\n".join(names_unique)
print(names_raw.splitlines()[:5])

['Matilda', 'Patel', 'Cayla', 'Marinna', 'Barret']


[nltk_data] Downloading package names to
[nltk_data]     C:\Users\pasqu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\names.zip.


In [21]:
# Your solution here
names_model, names_chars = create_model(names_raw, epochs=20)
print(generate_text(names_model, names_chars, "DAMIANO\n", temperature=1.0))

# ANSWER: Most of the generated names are not real names, but many of them sound plausible. A few real names (e.g. Joan, Marie) also appear in the output.

55 unique characters
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
DAMIANO
Zovley
Quin
Heanaula
Diratta
Aimia
Aleare
Dorvara
Hettina
Chrissan
Caldrinah
Frely
Caricorna
Deo
Commel
Sil
Joke
Inerig
Foli
Emilaet
Keejia
Vrinita
Roty
Jelie
Coronanche
Mior
Runy
Chamery
Moratt
Orsa
Dara
Hineldaddi
Jerheld
Terili
Derghibae
Marny
Reah
Carlita
Guencie
Netal
Eveline
Juedor
Labia
Varite
Kaunn
Taquel
Frabecton
Radiand
Orama
Joan
Mergiell
Ehholina
Janabrie
Ehlin
Kitran
Gallol
Marie
Pethaus
Shinth
Gylyc
Gantidrle
Vid
Jolia
Gigra
Shpid
Lejaine
Tigoma
Lellen
Oleltins
Rachay
Edrisse
Genndetta
Cardeol
Gayddin
Lansaw
Shestoel
Haullanda
Chichichy
Zhnete
Ringa
Rogh
Edrie
Romorde
Yad
Terelloe
Marryn
Celyn
Koria
Dlon
Walmy
Sadia
Griedris
Doway
Donky
Roza
Merilin
Hamberco
Cledstia
Orrid
Zadrie
Avianna
Selande
Marria
Gordi
Evwem
P

## Question 3
The size of the model can make a difference when it comes to performance. Create a new model that has twice the number of hidden units as the previous model and double the size of the embeddings. How does the performance change? What happens if you decrease these parameters?
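To get a feel for how quickly the model grows, the following sketch estimates the number of trainable parameters in the Embedding, GRU and Dense layers. It assumes the Keras GRU layer uses its default `reset_after=True` setting (two bias vectors per gate); `count_parameters` is a hypothetical helper, and the result is best checked against `model.summary()`.

```python
def count_parameters(vocab_size, embedding_dim, rnn_units):
    """Estimate the trainable parameters of an Embedding -> GRU -> Dense model."""
    embedding = vocab_size * embedding_dim
    # Three gates, each with an input kernel, a recurrent kernel and two bias vectors
    # (assumes reset_after=True, the Keras default in TensorFlow 2)
    gru = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)
    dense = rnn_units * vocab_size + vocab_size
    return embedding + gru + dense

vocab_size = 55  # Number of unique characters reported for the names corpus
for embedding_dim, rnn_units in [(64, 256), (256, 1024), (512, 2048)]:
    total = count_parameters(vocab_size, embedding_dim, rnn_units)
    print("embedding_dim={}, rnn_units={}: {:,} parameters".format(
        embedding_dim, rnn_units, total))
```

Because the recurrent kernel grows with the square of `rnn_units`, doubling both settings roughly quadruples the size of the model, which is a lot of capacity for a corpus of about 8,000 short names.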

In [30]:
# Your solution here
names_model_2, names_chars_2 = create_model(names_raw, epochs=20, embedding_dim=512, rnn_units=2048)
print(generate_text(names_model_2, names_chars_2, "Damiano\n", temperature=1.0))
names_model_3, names_chars_3 = create_model(names_raw, epochs=20, embedding_dim=64, rnn_units=256)
print(generate_text(names_model_3, names_chars_3, "Damiano\n", temperature=1.0))

55 unique characters
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Damiano
Cahesqueeaie-auamamaayyneeeaeaiatanaapeata-Eierearayabalieaazeaatanaaynanaleranatana
Nantananalatataaarenatana
Miaia
Baniaa
Agkaneangareaanataa
Adaroaelaraha
Esslyayneatardaiananadeyata
Latta-Caranffhadaneetenaarauttatalaraamaanadaialaba
Haudfana-Watadatatanaeeadada
Nestajayaneethautt
Dafapia
Osahefffaynasatty
Camarienkaataasaeaahafatte
Tistaralahaylarealana
Wiakanagana
Estonanarananaroneetanasanattasa
Tingeana
Kamfaina
Whathahana
Batadinananarauenemy
Naendara
Adarayakda
Savanabcealliak
Liadanaaaeeiceneta
Maydaratanamadanaphaaia
Evanara
Ramafetada
Mariellia-Leraana
Buisa
Heardasqueladrameaop
Caanarana
Ignelaa
Tatana-Janadfarstara
Banedaaedaia
DaMaugunasataa
Damandatteataa
Qudawa
Saveg
Caradialgaiatacsadae
Mauga
Cathathaeana
Emaay


## Question 4
Transformer-based large language models can also generate text. The following code imports a pretrained GPT-2 model from Hugging Face's Transformers library. This model can then be used directly to generate text, given a prompt as context. Alter the prompt so that the transformer model (GPT-2) generates an engaging story opening from one of the following story starters:


*   It was the day the moon fell.
*   Am I in heaven?  What happened to me?
*   Wandering through the graveyard it felt like something was watching me.
*   Three of us.  We were the only ones left, the only ones to make it to the island.

There are several different methods to choose from to generate the text (as seen in the commented-out lines below). Try out the different methods and play with the parameters. This [blog post](https://huggingface.co/blog/how-to-generate) explains their differences.

Which method has the best performance?

Can GPT-2 generate Shakespeare?
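As a rough illustration of what the top-k and top-p methods do, the sketch below filters a made-up probability distribution over four tokens. It is a simplified version of the idea: the library itself works on logits over the full vocabulary, and `top_k_filter`, `top_p_filter` and `toy_probs` are names invented for this example.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # Hypothetical next-token probabilities

print(top_k_filter(toy_probs, 2))    # Only tokens 0 and 1 remain
print(top_p_filter(toy_probs, 0.9))  # Tokens 0, 1 and 2 remain
```

The next token is then sampled from the filtered distribution. Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the model is uncertain and fewer when it is confident.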

In [12]:
# Install the transformers library (comment out this line if it is already installed)
!pip install transformers



In [1]:
# Do not modify this code
# https://huggingface.co/docs/transformers/main_classes/text_generation

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Today I believe we can finally"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=100) # Greedy search
#outputs = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=3, early_stopping=True) # Beam search
#outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.7) # Sampling
#outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=50) # Top-k
#outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=50, top_p=0.92) # Top-p

tokenizer.batch_decode(outputs, skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n\nI believe that we can make a difference in the lives of the people of the United States of America.\n\nI believe that we can make a difference in the lives of the people of the United States of America.\n\nI believe that we can make a difference in the lives of the people of the United States of America.\n\n']

In [14]:
# Your solution here

prompts = [
  "It was the day the moon fell.",
  "Am I in heaven?  What happened to me?",
  "Wandering through the graveyard it felt like something was watching me.",
  "Three of us.  We were the only ones left, the only ones to make it to the island."
]

def run_gpt2(prompt, method="greedy"):
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  if method == "greedy":
    outputs = model.generate(input_ids, max_length=100)
  elif method == "beam":
    outputs = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=3, early_stopping=True)
  elif method == "sampling":
    outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.7)
  elif method == "topk":
    outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=50)
  elif method == "topp":
    outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=50, top_p=0.92)
  else:
    print("Invalid method")
    return
  print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print("\n\n-------------- Greedy search --------------")
run_gpt2(prompts[3], method="greedy")
print("\n\n-------------- Beam search --------------")
run_gpt2(prompts[3], method="beam")
print("\n\n-------------- Sampling --------------")
run_gpt2(prompts[3], method="sampling")
print("\n\n-------------- Top-k --------------")
run_gpt2(prompts[3], method="topk")
print("\n\n-------------- Top-p --------------")
run_gpt2(prompts[3], method="topp")

# ANSWER: Top-k sampling seems to produce the best text, and plain sampling with a temperature of 0.7 also performs well. GPT-2 did not generate Shakespeare.




-------------- Greedy search --------------



['Three of us.  We were the only ones left, the only ones to make it to the island.  We were the only ones to make it to the island.  We were the only ones to make it to the island.  We were the only ones to make it to the island.  We were the only ones to make it to the island.  We were the only ones to make it to the island.  We were the only ones to make it to the island.']


-------------- Beam search --------------



['Three of us.  We were the only ones left, the only ones to make it to the island.  It was the only place we could go.  And we were the ones who made it.  The only ones who could make it.\n\n"I\'m sorry.  I didn\'t mean to hurt you.  But I don\'t know how you feel about me.  You don\'t deserve to be here.  That\'s why you\'re here."\n\nI was']


-------------- Sampling --------------



["Three of us.  We were the only ones left, the only ones to make it to the island.  We were not going to let those small villages die. We were not going to give them to anyone. We were going to take things we knew could not be taken.  That's what we did. That's what we do: We take things, and we go with them.  That's what we do: We move forward, and we start over.  That's what"]


-------------- Top-k --------------



["Three of us.  We were the only ones left, the only ones to make it to the island.  It seems like our days without any of them came later and we'll always remember this part of the book.  I mean our lives, not only because of them, but also to have some of the stories that came afterwards, which had a sense of the story that was not lost.  It's so amazing to me to read with such an effect and see the effect so very"]


-------------- Top-p --------------
["Three of us.  We were the only ones left, the only ones to make it to the island.  We made a mistake, and it was too late. And now it is too late again.  \xa0But our people know what I will tell them.  We will call it the Baha'i Faith. We will say that Jesus has made it possible for you to see in all the lives of your children a God who is a true, loving, and true God."]
