# **GR5242 HW03 Problem 2: Shakespeare with LSTM networks**

**Instructions**: This problem is an individual assignment -- you are to complete this problem on your own, without conferring with your classmates.  You should submit a completed and published notebook to Courseworks; no other files will be accepted.

## Description:

This homework exercise has 3 primary goals:
 * Introduce some basic concepts from natural language processing
 * Get some practice training recurrent neural networks, specifically on text data
 * Generate fake text data in the style of your favorite author!

By the end of this exercise, you will have a basic, but decent, computer program which can simulate the writing patterns of any author of your choice.

Here is an outline of the rest of the exercise.
 1. Data loading
     - Downloading a text from Project Gutenberg that we will try to model
     - Data preprocessing and numerical encoding
     - Making a training `Dataset` object
 2. Learn to generate text with a neural network
     - Defining the recurrent network
     - Training
     - Predicting and sampling text from the model

There are 12 questions (70 points) in total, which include coding and written questions. You may only modify the code and text within \### YOUR CODE HERE ### and/or \### YOUR ANSWER HERE ###.


In [1]:
import numpy as np
import tensorflow as tf
import re

## Character-level language modeling

Our goal here is to build a model of language letter-by-letter. Since we may also allow numbers, spaces, and punctuation, it's better to say character-by-character. We will start by fixing an "alphabet": the set of allowed characters.

In math notation, let's call the alphabet $A$. In code,

In [2]:
alphabet = " abcdefghijklmnopqrstuvwxyz1234567890.,!?:;ABCDEFGHIJKLMNOPQRSTUVWXYZ\n"

# Section 1: Data loading and preprocessing

We will start by downloading training data from Project Gutenberg: https://www.gutenberg.org/. Project Gutenberg is a free repository of public domain books. Find any book you like, and download it in Plain Text UTF-8 format.

For example, we will use Shakespeare's complete works: https://www.gutenberg.org/ebooks/100. There is a link on that page to the Plain Text format data.  Download the pg100.txt file, and then upload it from your computer to colab (click at left on the File icon, then click the upload icon).  

*Important*: whichever work you choose, make sure you have enough data! The size of your plain text file should be at least 2MB.

In [3]:
# after uploading your file to Colab, set this variable to its path:
txt_path = "pg100.txt"

Let's load the text and see what it says:

In [6]:
# the Project Gutenberg file is UTF-8, so state the encoding explicitly
with open(txt_path, encoding="utf-8") as txt_file:
  text = txt_file.read()

print("text is", len(text), "characters long.")
print()
print("A sample from the middle:")
print()
print(text[len(text) // 2 : len(text) // 2 + 100])

text is 5546921 characters long.

A sample from the middle:

all bear the guilt
Of our great quell?

MACBETH.
Bring forth men-children only;
For thy undaunted me


### Data standardization

Now, we will clean the data by replacing every character that is not in our alphabet with a space. Because the alphabet contains uppercase letters and the newline character, case and line breaks are preserved.

In [7]:
# remove extra characters by replacing them with spaces
text = re.sub(rf"[^{alphabet}]", " ", text)
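
To see what this substitution does, here is a minimal sketch on a short made-up string (the `sample` text is hypothetical, not taken from the book). It also wraps the alphabet in `re.escape`, which is an extra precaution assumed here rather than something the cell above does; it only matters if the alphabet ever contains characters that are special inside `[...]`, such as `-` or `]`.

```python
import re

# Same alphabet as defined earlier in the notebook
alphabet = " abcdefghijklmnopqrstuvwxyz1234567890.,!?:;ABCDEFGHIJKLMNOPQRSTUVWXYZ\n"

# Hypothetical sample containing characters outside the alphabet: - [ ] %
sample = "Men-children only; [Exit] 100%\n"

# re.escape guards against characters that are special inside [...]
pattern = rf"[^{re.escape(alphabet)}]"
cleaned = re.sub(pattern, " ", sample)

# Each disallowed character becomes one space, so the length is unchanged
print(repr(cleaned))
print(len(sample) == len(cleaned))
```

Note that the replacement is one-for-one, so runs of disallowed characters leave runs of spaces behind.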

Let's see how it looks again:

In [8]:
print(text[len(text) // 2 : len(text) // 2 + 100])

all bear the guilt
Of our great quell?

MACBETH.
Bring forth men children only;
For thy undaunted me


### Numerical encoding

Unfortunately, neural networks don't understand text. So, we need to convert our characters to numerical values. Here are some helper functions for doing this.

In [9]:
# let's build a dictionary mapping characters to integers
char2int = {c: i for i, c in enumerate(alphabet)}
alphabet_array = np.array([c for c in alphabet])

# this function will turn a string into a numpy array of integers
def int_encode(string):
  if any(c not in char2int for c in string):
    raise ValueError(
        "Found a character which was not in the alphabet in the input "
        f"to int_encode. Valid alphabet characters: {alphabet}"
      )
  return np.array([char2int[c] for c in string])

# this function will decode a numpy array of integers back to a string
def int_decode(int_array):
  return ''.join(alphabet_array[int_array])

(Question 1a: 4 points) Test out `int_encode` by passing `test_string` in and printing the result.

In [10]:
# Let's test these out!
### YOUR CODE HERE ###
test_string=text[len(text) // 2 : len(text) // 2 + 100] 
int_encode(test_string)

array([ 1, 12, 12,  0,  2,  5,  1, 18,  0, 20,  8,  5,  0,  7, 21,  9, 12,
       20, 69, 57,  6,  0, 15, 21, 18,  0,  7, 18,  5,  1, 20,  0, 17, 21,
        5, 12, 12, 40, 69, 69, 55, 43, 45, 44, 47, 62, 50, 37, 69, 44, 18,
        9, 14,  7,  0,  6, 15, 18, 20,  8,  0, 13,  5, 14,  0,  3,  8,  9,
       12,  4, 18,  5, 14,  0, 15, 14, 12, 25, 42, 69, 48, 15, 18,  0, 20,
        8, 25,  0, 21, 14,  4,  1, 21, 14, 20,  5,  4,  0, 13,  5])

(Question 1b: 4 points) Decode the result from the last cell using `int_decode` to make sure it is the same as `test_string`

In [11]:
### YOUR CODE HERE ###
int_decode(int_encode(test_string))

'all bear the guilt\nOf our great quell?\n\nMACBETH.\nBring forth men children only;\nFor thy undaunted me'

Is the decoding the same as `test_string`? It should be; if not, there is a bug above.

### Make a training dataset

First, we make a numerical encoded version of the entire dataset:

In [12]:
enctext = int_encode(text)

Use `tf.convert_to_tensor` to make it into a TensorFlow tensor:

In [13]:
enctext = tf.convert_to_tensor(enctext)

Now, convert to a tensorflow `Dataset` using `tf.data.Dataset.from_tensor_slices`:

In [14]:
text_dataset = tf.data.Dataset.from_tensor_slices(enctext)
text_dataset

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

# Section 2: Training a NN

Our model will work as follows:
 - One-hot encoded input gets passed into a linear embedding layer. These two operations are combined with the `Embedding` layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
 - LSTM cell
 - Linear decoder layer
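
The claim that a one-hot encoding followed by a linear layer collapses into a single embedding lookup can be checked with a small NumPy sketch. The sizes and the random matrix `W` below are toy stand-ins chosen for illustration, not the weights of the Keras layer.

```python
import numpy as np

vocab_size, hidden_dim = 5, 3            # toy sizes, not the notebook's values
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, hidden_dim))  # stand-in for embedding weights

ids = np.array([2, 0, 4])                # int-encoded characters
one_hot = np.eye(vocab_size)[ids]        # shape (3, vocab_size)

via_matmul = one_hot @ W                 # one-hot followed by a linear map
via_lookup = W[ids]                      # row lookup, as an embedding does

print(np.allclose(via_matmul, via_lookup))
```

The lookup avoids building the one-hot matrix at all, which is why integer-encoded characters can be fed to the model directly.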

TensorFlow/Keras has two main ways of interfacing with recurrent networks. In the case of LSTMs, those are:
 - the LSTM layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
 - the LSTMCell layer https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTMCell

Both models are sequential: the goal is to process a batch of sequences of input features and produce a batch of sequences of output features. The `LSTM` class makes this simple and easy, and the `LSTMCell` class gives more control by allowing you to process the sequences one element at a time. We will use the `LSTM` layer to keep things simple, but keep in mind that some of what we do could be made more efficient with `LSTMCell`.

The inputs and outputs of recurrent layers in Keras have shape `(batch_dimension, sequence_dimension, feature_dimension)`. In this case, the model takes integer-encoded characters of shape `(batch_dimension, sequence_dimension)`, and the feature dimension of its output is `len(alphabet)`.

Something to keep in mind: the outputs are offset by one step relative to the inputs. For each sequence in the batch, the `k`th output along the sequence dimension holds the logits for predicting the `(k+1)`th input character of that sequence.

In [15]:
# We will use this constant below
HIDDEN_DIM = 128

(Question 2a: 10 points) Model definition: make a Sequential model with an Embedding layer with input dimension `len(alphabet)` and output dimension `HIDDEN_DIM`, followed by an LSTM layer with `HIDDEN_DIM` features, followed by a Dense layer with `len(alphabet)` features

In [16]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(len(alphabet), HIDDEN_DIM))
model.add(tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True))
model.add(tf.keras.layers.Dense(len(alphabet))) # YOUR CODE HERE

# Build with (batch, sequence) integer inputs so that summary() can report shapes
model.build(input_shape=(None, None))
model.summary()

(Question 2b: 8 points) If we want to use the output of the model as logits for predicting a character (which we can think of as a class), what loss should we use?

In [17]:
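# The Dense layer has no softmax, so the model outputs logits: set from_logits=True
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) # YOUR CODE HERE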
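
As a rough illustration of what a sparse categorical cross-entropy computes when it is given logits, here is a NumPy sketch. The helper name `sparse_ce_from_logits` is hypothetical and this is not the Keras implementation; in Keras, the `from_logits` argument of `SparseCategoricalCrossentropy` is what indicates that the softmax has not yet been applied.

```python
import numpy as np

def sparse_ce_from_logits(logits, targets):
    """Mean cross-entropy for integer targets, given unnormalized logits."""
    # Subtract the row-wise max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target class
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return -picked.mean()

vocab_size = 70                          # len(alphabet) in this notebook
logits = np.zeros((4, vocab_size))       # all-zero logits: uniform predictions
targets = np.array([0, 5, 12, 69])

# For uniform predictions the loss equals log(vocab_size)
print(sparse_ce_from_logits(logits, targets), np.log(vocab_size))
```

A loss near `log(len(alphabet))` at the start of training is therefore what an uninformed model would give, and it is a useful baseline when checking that the loss goes down.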

In [18]:
model.compile(optimizer="adam", loss=loss)

In [19]:
# Defining some parameters about data batching, explained in the next section
# Note: after you get the entire assignment working, you can make these bigger and train for longer, to get better performance
SEQUENCE_LENGTH = 32
BATCH_SIZE = 16

### Making the dataset of (input, target) pairs

To train the model, we need to make a `tf.data.Dataset` containing input and target sequences. Our input sequences will be sequences of length `SEQUENCE_LENGTH` containing int-encoded characters from the input. Our target sequences will be the "next characters" corresponding to the input sequence: so, if the input sequence is the 10th, 11th, ... characters, then the target sequence is the 11th, 12th, ... characters.

We will walk through using `tf.data.Dataset` methods to create these.
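
Before turning to `tf.data`, the relationship between input and target sequences can be sketched with plain NumPy. The array `enc` and the small `SEQUENCE_LENGTH` below are toy stand-ins for the encoded text and the notebook's constant, and this sketch drops any incomplete final window, which is an assumption made here for simplicity.

```python
import numpy as np

SEQUENCE_LENGTH = 4                      # toy value; the notebook uses 32
enc = np.arange(10)                      # stand-in for the int-encoded text

# Inputs start at position 0, targets at position 1 (shifted by one character)
n_seqs = (len(enc) - 1) // SEQUENCE_LENGTH   # drop any incomplete final window
inputs = enc[: n_seqs * SEQUENCE_LENGTH].reshape(n_seqs, SEQUENCE_LENGTH)
targets = enc[1 : n_seqs * SEQUENCE_LENGTH + 1].reshape(n_seqs, SEQUENCE_LENGTH)

print(inputs)
print(targets)
```

With `tf.data`, passing `drop_remainder=True` to `batch` similarly discards a short final window, which keeps every input and target sequence the same length.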

(Question 2c: 8 points) Use our friend the `batch` method of `text_dataset`, which we defined above, to make sequences of length `SEQUENCE_LENGTH`.

In [20]:
input_seqs = text_dataset.batch(SEQUENCE_LENGTH) # YOUR CODE HERE
input_seqs

<BatchDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>

(Question 2d: 8 points) Now, use batch again to create target sequences from the following version of the dataset which has been offset by 1 element:

In [21]:
target_text_dataset = text_dataset.skip(1)
target_seqs = target_text_dataset.batch(SEQUENCE_LENGTH) # YOUR CODE HERE

(Question 2e: 6 points) Now, use the function `tf.data.Dataset.zip` to create a dataset of (input, target) pairs:

In [22]:
pairs = tf.data.Dataset.zip((input_seqs,target_seqs)) # YOUR CODE HERE
pairs

<ZipDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.int64, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

(Question 2f: 4 points) Finally, call `.batch` again to make batches of pairs of length `BATCH_SIZE`:

In [23]:
pairs_batched = pairs.batch(BATCH_SIZE) # YOUR CODE HERE
pairs_batched

<BatchDataset element_spec=(TensorSpec(shape=(None, None), dtype=tf.int64, name=None), TensorSpec(shape=(None, None), dtype=tf.int64, name=None))>

In [24]:
# This fixes the size of the training data
# 8000 batches is a reasonable starting number.
train = pairs_batched.shuffle(1000).take(8000) 

(Question 2g: 2 points) Train the model! 
Note:


1.   Given the above parameters, this training should take about 5 minutes.  
2.   Performance will improve throughout training but will not get very good (as judged by the samples).  For this exercise, that is adequate.  
3. However, to really see what this model can do, you can (should!) set a larger number of batches above and a larger hidden state so that you can take many epochs of the data.  If you train for ~10hrs on a model with 256 hidden states (as we show in class), performance can be quite good.  This step is not required for the assignment, but please do try it on your own.



In [25]:
### YOUR CODE HERE ###
# `train` is already batched by the Dataset, so no batch_size is passed here
model.fit(train, epochs=1)



<keras.callbacks.History at 0x7fe3dfa8fed0>

Here, make sure the loss goes down as it trains.

# Section 3: Did it work? Let's see what the model learned

Here, we'll write some functions to see how well the model has learned to predict text and to draw samples from the model.

First, we'll give you a function to "seed" the model with some input text and then predict the most likely future text. It will be your job to create a variation on this function in the question below, so make sure you understand how it works.

In [26]:
def predict(seed_string, sample_length=50):
  # Convert seed_string to int
  current_text_ints = list(int_encode(seed_string))

  for i in range(sample_length):
    # Add an empty batch dimension and convert to tensor
    text_arr = np.array(current_text_ints).reshape(1, -1)
    text_arr = tf.convert_to_tensor(text_arr)

    # Get the full sequence of predictions and remove the batch dim
    logits = model(text_arr)[0]

    # Keep only the logits at the final position of the sequence
    final_logits = logits[-1]

    # Get the prediction using tf.argmax
    pred = tf.argmax(final_logits)

    # Append this to `current_text_ints`
    current_text_ints.append(pred.numpy())
  
  return int_decode(np.array(current_text_ints))

In [27]:
test_seed = "to be, or "
predict(test_seed, 50)

'to be, or the the the the the the the the the the the the th'

In [None]:
# feel free to try your own seed!  

It seems like maybe the model learned something, but the output is a little boring. Let's make it more interesting with *randomness*!

Right now, the function always picks the most likely next letter. Instead, let's sample the next letter from the model's predicted probability distribution.
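
The difference between the two strategies can be sketched on a toy set of logits, independent of the model. The values below are made up for illustration, and the generator is seeded only so that the demo is repeatable.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])       # toy logits over 3 "characters"

# Softmax: exponentiate (after subtracting the max for stability) and normalize
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

rng = np.random.default_rng(0)           # seeded only to make the demo repeatable
greedy = [int(np.argmax(probs)) for _ in range(5)]
sampled = [int(rng.choice(len(probs), p=probs)) for _ in range(5)]

print(greedy)    # always the index of the largest probability
print(sampled)   # indices drawn in proportion to probs
```

Greedy selection returns the same index every time, while sampling can return any index that has nonzero probability.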

(Question 3a: 8 points) Fill in the blanks in the function below.

In [39]:
def generate(seed_string, sample_length=50):
  # Convert seed_string to int
  current_text_ints = list(int_encode(seed_string))

  for i in range(sample_length):
    # Add an empty batch dimension and convert to tensor
    text_arr = np.array(current_text_ints).reshape(1, -1)
    text_arr = tf.convert_to_tensor(text_arr)

    # Get the full sequence of predictions and remove the batch dim
    logits = model(text_arr)[0]

    # Keep only the logits at the final position of the sequence
    final_logits = logits[-1]

    # Normalize the final_logits to a probability distribution
    probs = tf.nn.softmax(final_logits)  # YOUR CODE HERE

    # Call .numpy so we can use a numpy function; cast to float64 and
    # renormalize so np.random.choice accepts the probabilities
    probs = probs.numpy().astype(np.float64)
    probs /= probs.sum()

    # Sample from the probability distribution using
    # the function np.random.choice
    sample = int(np.random.choice(len(probs), p=probs))  # YOUR CODE HERE

    # Append this to `current_text_ints`
    current_text_ints.append(sample)
  
  return int_decode(np.array(current_text_ints))

(Question 3b: 6 points) Test this function `generate`. Is its output different from `predict` when given the same seed? How does it differ, and why?

In [44]:
# YOUR CODE HERE
generate(test_seed, 50)

'to be, or BGX; 8ouf!BT;qGev? m6P8UC8BK.F\nbjtt6buUs8ct5dsEH?q'

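Yes, the output is different. `predict` always takes the most likely next character (the argmax), so it is deterministic, and here it falls into repeating "the". `generate` instead draws the next character at random from the model's softmax distribution, so characters other than the most likely one can appear. The sample above suggests that, for this briefly trained model, the predicted distribution is still spread over many characters, which is why the generated text reads as nonsense.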

(Question 3c: 2 points) Try running `generate` a few times with the same seed. Are the results the same or different? Why?

 '### Your Answer Here ###'<br>
 The results are not the same. `np.random.choice` draws each next character at random from the categorical distribution given by `probs` (not a normal distribution), and every sampled character is fed back in as input for the next step, so different draws produce different text on each run.
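
As a small illustration of where the run-to-run variation comes from, the sketch below samples from a toy distribution with a seeded generator. The `draw` helper and the probabilities are hypothetical; the notebook's `generate` uses NumPy's global random state instead, for which `np.random.seed` plays the analogous role.

```python
import numpy as np

probs = np.array([0.5, 0.3, 0.2])        # toy distribution over 3 characters

def draw(seed, n=10):
    # A fresh generator with a fixed seed produces the same draws every time
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n, p=probs).tolist()

same_seed = draw(0) == draw(0)
print(same_seed)
```

Without a fixed seed, each call starts from a different random state, so repeated runs of `generate` give different text.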