<a href="https://colab.research.google.com/github/giorgiosld/Natural-Language-Processing/blob/main/labs/lab5/T_725_Lab05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T-725 Natural Language Processing: Lab 5
In today's lab, we will be working with neural networks, using GRUs and Transformers for text generation.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* **Select `"Runtime" > "Change runtime type"`, and make sure that you have "Hardware accelerator" set to "GPU"**
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [None]:
import os
import warnings

# Suppress TensorFlow's C++ log messages (must be set before TensorFlow is imported)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Generating text with neural networks
Let's create a neural language model and use it to generate some text. This time, we will use character embeddings rather than word embeddings. Character embeddings are learned in exactly the same way as word embeddings, and the two are often used together in neural network-based models. One benefit of character embeddings is that the model can generate words that it has never seen before.

The model takes as input a sequence of characters and predicts which character is most likely to follow. We will generate text by repeatedly predicting and appending the next character to a string. First, however, we need some text to train it on.
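
The predict-and-append loop can be illustrated without a neural network. The sketch below is only a toy stand-in: it replaces the GRU model with simple bigram counts, and the names `train_bigram_counts` and `generate` are hypothetical helpers that do not appear elsewhere in this notebook.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(text):
    # Count how often each character is followed by each other character
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts

def generate(counts, start_string, num_generate, seed=0):
    # Repeatedly sample a next character and append it to the string
    rng = random.Random(seed)
    generated = start_string
    for _ in range(num_generate):
        followers = counts.get(generated[-1])
        if not followers:
            break  # the last character was never followed by anything
        chars = list(followers.keys())
        weights = list(followers.values())
        generated += rng.choices(chars, weights=weights, k=1)[0]
    return generated

toy_counts = train_bigram_counts("to be or not to be that is the question")
sample = generate(toy_counts, "t", 20)
print(sample)
```

The neural model used in this lab follows the same loop, but it conditions on far more context than the single preceding character.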


In [None]:
# Based on the following tutorial:
# https://www.tensorflow.org/tutorials/text/text_generation

import tensorflow as tf
import numpy as np
import os
import time

# Let's download some text by Shakespeare to train our model
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = tf.keras.utils.get_file('shakespeare.txt', url)

with open(path_to_file, encoding='utf-8') as f:
  shakespeare = f.read()

print("First 250 characters:")
print(shakespeare[:250])

print("Length of text: {:,} characters".format(len(shakespeare)))

Now we can create training examples for our model. Each example will be a pair of strings: one input string containing 100 characters, and a target string that is one character ahead. For example, the first pair we create is:

**Input string**:  `'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'`

**Target string**: `'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '`
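
As a rough plain-Python sketch of this pairing, the hypothetical helper `make_examples` below cuts a text into chunks of `seq_length + 1` characters, drops any incomplete chunk at the end, and splits each chunk into an input and a target shifted by one character. It uses a sequence length of 5 instead of 100 to keep the output readable.

```python
def make_examples(text, seq_length):
    # Each chunk holds seq_length + 1 characters so that input and target
    # can both be seq_length long while being offset by one character
    chunk_size = seq_length + 1
    num_chunks = len(text) // chunk_size  # incomplete final chunk is dropped
    examples = []
    for n in range(num_chunks):
        chunk = text[n * chunk_size:(n + 1) * chunk_size]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

examples = make_examples("First Citizen", seq_length=5)
for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
```

At every position, the target character is the one the model should predict after reading the input up to that position.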

However, before we can start training, we need to convert our text into a list of integers, where each integer represents a different character. For example, "First Citizen" becomes:

```
Character:   F   i   r   s   t      C   i   t   i   z   e   n
Integer:   [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52]
```
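
A minimal sketch of this conversion is shown below. Note the assumption: the vocabulary here is built from the short sample string only, so the integers differ from those in the table above, which come from the vocabulary of the full Shakespeare text.

```python
sample = "First Citizen"

# Sorted list of the unique characters in the sample
vocab = sorted(set(sample))

# Map each character to its position in the vocabulary, and back again
char_to_index = {char: index for index, char in enumerate(vocab)}
index_to_char = vocab

encoded = [char_to_index[c] for c in sample]
decoded = "".join(index_to_char[i] for i in encoded)

print(vocab)
print(encoded)
print(decoded)
```

Decoding the integers with the reverse mapping recovers the original string, which is how generated integers are turned back into text later on.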

In [None]:
# Hyper-parameters:

BATCH_SIZE = 64  # Batch size
BUFFER_SIZE = 10000  # Buffer size to shuffle the dataset
SEQUENCE_LENGTH = 100  # Length of input sequence
EMBEDDING_DIMENSION = 65  # Embedding dimension
RNN_UNITS = 1024  # Number of RNN units

In [None]:
def split_input_target(chunk):
  # Create (input_string, output_string) pairs
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

def prepare_text(text):
  # The unique characters in the file
  vocab = sorted(set(text))
  print('{} unique characters'.format(len(vocab)))

  # Creating a mapping from unique characters to indices
  char_map = {
      'char_to_index': {char: index for index, char in enumerate(vocab)},
      'index_to_char': np.array(vocab)
  }

  text_as_int = np.array([char_map['char_to_index'][c] for c in text])

  # The maximum length sentence we want for a single input in characters
  seq_length = SEQUENCE_LENGTH
  examples_per_epoch = len(text) // (seq_length+1)

  # Create training examples / targets
  char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
  sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
  dataset = sequences.map(split_input_target)

  # (TF data is designed to work with possibly infinite sequences,
  # so it doesn't attempt to shuffle the entire sequence in memory. Instead,
  # it maintains a buffer in which it shuffles elements).
  dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

  return dataset, vocab, examples_per_epoch, char_map

Now we can create and train the neural network.

In [None]:
import os

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)


def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size,
                                embedding_dim),
      tf.keras.layers.GRU(rnn_units,
                          return_sequences=True,
                          recurrent_initializer='glorot_uniform',
                          stateful=True),
      tf.keras.layers.Dense(vocab_size)
  ])

  return model

In [None]:
def create_model(text, epochs=3):
  dataset, vocab, examples_per_epoch, char_map = prepare_text(text)

  train_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, BATCH_SIZE)
  train_model.compile(optimizer='adam', loss=loss)

  train_model.fit(dataset, epochs=epochs)

  pred_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, batch_size=1)
  pred_model.build(input_shape=(1, SEQUENCE_LENGTH))
  pred_model.set_weights(train_model.get_weights())

  return pred_model, char_map

In [None]:
shakes_model, shakes_chars = create_model(shakespeare, epochs=3)

Now that we've trained our model, we can finally use it to generate some text. The following function takes a model and a start string as input, and repeatedly predicts and appends the next character until it has generated 1,000 new characters.

In [None]:
def generate_text(model, char_map, start_string, temperature=1.0):
  # Evaluation step (generating text using the learned model)
  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  if not start_string:
    print("start_string can't be empty")
    return ""

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char_map['char_to_index'][s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Here batch size == 1
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(char_map['index_to_char'][predicted_id])

  return (start_string + ''.join(text_generated))

Let's generate some text!

In [None]:
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=1.0))

# Assignment
Answer the following questions and hand in your solution in Canvas before 23:59 on Friday, September 27th. Remember to save your file before uploading it.

## Question 1
The `temperature` parameter of `generate_text()`, defined earlier in the notebook, controls how predictable the generated text will be. The lower the temperature, the more the function will tend to append the most likely character (according to the model's prediction). A higher temperature introduces more randomness, leading to less predictable text.
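
To see why this happens, recall that `generate_text()` divides the model's logits by the temperature before sampling. The sketch below uses made-up logits for three characters and a hypothetical helper, `softmax_with_temperature`, to show how the resulting probabilities change.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply a softmax
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.2, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures push almost all of the probability onto the highest-scoring character, while high temperatures flatten the distribution.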

The text we generated above used a temperature of 1.0. Try generating more text using the Shakespeare model:

(a) once using a temperature of 0.2 and

(b) again using a temperature of 0.8

and describe the difference.

In [None]:
# Your solution here


## Question 2
NLTK's `names` corpus contains a list of approximately 8,000 English names. Train a new model on `names_raw` (defined in the code cell below) for at least 20 epochs using the `create_model(text, epochs=n)` function defined earlier. Use the trained model to generate a list of names (with the `generate_text` function defined earlier), starting with your own first name. Your name should not contain any non-English characters, and should end with a newline (`\n`).

Print out the names that do not appear in the training data.

(a) Do you get any actual names (or at least names that sound plausible)?

In [None]:
# Don't modify this code cell
import nltk
from nltk.corpus import names
nltk.download('names')

# Print out a few examples
names_raw = names.raw()
names_unique = set(names_raw.split())
names_raw = "\n".join(names_unique)
print(names_raw.splitlines()[:5])

In [None]:
# Your solution here


## Question 3
The size of the model can make a difference when it comes to performance. Create a new model that has twice the number of hidden units as the previous model and double the size of the embeddings.

(a) How does the performance change?

(b) What happens if you decrease these parameters?
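
To get a feeling for how quickly the model grows, the number of trainable parameters can be estimated by hand. The sketch below rests on two assumptions: that the GRU layer uses Keras's default `reset_after=True` bias layout, and that the vocabulary has 65 characters (`prepare_text` prints the actual number). The helper `count_parameters` is hypothetical; `model.summary()` reports the real counts.

```python
def count_parameters(vocab_size, embedding_dim, rnn_units):
    # Embedding: one vector of size embedding_dim per character
    embedding = vocab_size * embedding_dim
    # GRU: three weight blocks (update gate, reset gate, candidate state),
    # each with an input kernel, a recurrent kernel and two bias vectors
    # (assumes reset_after=True)
    gru = 3 * rnn_units * (embedding_dim + rnn_units + 2)
    # Dense output layer: weights plus one bias per character
    dense = rnn_units * vocab_size + vocab_size
    return embedding + gru + dense

base = count_parameters(vocab_size=65, embedding_dim=65, rnn_units=1024)
doubled = count_parameters(vocab_size=65, embedding_dim=130, rnn_units=2048)
print("Base model:    {:,} parameters".format(base))
print("Doubled model: {:,} parameters".format(doubled))
```

Because the recurrent kernel grows with the square of the number of units, doubling the hidden size roughly quadruples the parameter count.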

In [None]:
# Your solution here


## Question 4
Transformer-based large language models can also generate text. The following code imports a pretrained GPT-2 model from Hugging Face's Transformers library. This model can then be used directly to generate text, given a prompt as context. Alter the prompt so that the transformer model (GPT-2) generates the beginning of an engaging story from one of the following story starters:


*   It was the day the moon fell.
*   Am I in heaven? What happened to me?
*   Wandering through the graveyard it felt like something was watching me.
*   Three of us. We were the only ones left, the only ones to make it to the island.

There are several different methods to choose from to generate the text (as seen in the commented-out lines below). Try out the different methods and play with the parameters. This [blog post](https://huggingface.co/blog/how-to-generate) explains their differences.

(a) Which method has the best performance?

(b) Can GPT-2 generate Shakespeare?
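
The idea behind the top-k and top-p options can be sketched on a toy distribution. The code below is a simplified illustration, not the Transformers library's implementation; `top_k_filter` and `top_p_filter` are hypothetical helpers, and the probabilities are made up.

```python
def top_k_filter(probs, k):
    # Keep only the k most likely tokens and renormalise their probabilities
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches p, then renormalise
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print(top_k_filter(toy_probs, k=2))
print(top_p_filter(toy_probs, p=0.9))
```

In both cases, the next token is then sampled from the filtered distribution. Top-k always keeps a fixed number of candidates, whereas top-p keeps fewer candidates when the model is confident and more when it is not.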

In [None]:
# Install the transformers library (skip this cell if it is already installed)
!pip install transformers

In [None]:
# Do not modify this code
# https://huggingface.co/docs/transformers/main_classes/text_generation

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

In [None]:
# Do not modify this code

prompt = "Today I believe we can finally"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = gpt2_model.generate(input_ids, max_length=100) # Greedy search
#outputs = gpt2_model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=3, early_stopping=True) # Beam search
#outputs = gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.7) # Sampling
#outputs = gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=50) # Top-k
#outputs = gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=50, top_p=0.92) # Top-p

tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [None]:
# Your answer here
