# **GR5242 HW04 Problem 1: Shakespeare with LSTM networks**

**Instructions**: This problem is an individual assignment -- you are to complete this problem on your own, without conferring with your classmates.  You should submit a completed and published notebook to Courseworks; no other files will be accepted.

## Description:

This homework exercise has 3 primary goals:
 * Introduce some basic concepts from natural language processing
 * Get some practice training recurrent neural networks, specifically on text data
 * Be able to generate fake text data from your favorite author!   

By the end of this exercise, you will have a basic, but decent, computer program which can simulate the writing patterns of any author of your choice.

Here is an outline of the rest of the exercise.
 1. Data loading
     - We will start by downloading a text from Project Gutenberg that we will try to model
     - Data preprocessing and numerical encoding
     - Making training `Dataset` and `DataLoader` objects
 2. Learn to generate text with a neural network
     - Defining the recurrent network
     - Training
     - Predicting and sampling text from the model

There are 12 questions (70 points) in total, including both coding and written questions. You may only modify the code and text within \### YOUR CODE HERE ### and/or \### YOUR ANSWER HERE ###.


In [1]:
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import re

## Character-level language modeling

Our goal here is to build a model of language letter-by-letter. Since we may also allow numbers, spaces, and punctuation, it's better to say character-by-character. We will start by fixing an "alphabet": the set of allowed characters.

In math notation, let's call the alphabet $A$. In code,

In [2]:
alphabet = " ""'abcdefghijklmnopqrstuvwxyz1234567890.,!?:;ABCDEFGHIJKLMNOPQRSTUVWXYZ\n"
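As a quick sanity check on the alphabet, the sketch below re-creates the string from the cell above and inspects it. Note that Python concatenates adjacent string literals, so the leading `" "` and the literal that follows it are joined: the alphabet starts with a space followed by an apostrophe, and the double-quote character itself is not part of it.

```python
# Same alphabet string as in the cell above
alphabet = " ""'abcdefghijklmnopqrstuvwxyz1234567890.,!?:;ABCDEFGHIJKLMNOPQRSTUVWXYZ\n"

# Adjacent string literals are joined into one string by Python
print("size of alphabet:", len(alphabet))
print("all characters unique:", len(set(alphabet)) == len(alphabet))
print("double quote included:", '"' in alphabet)
print("first two characters:", repr(alphabet[:2]))
```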

# Section 1: Data loading and preprocessing

We will start by downloading training data from Project Gutenberg: https://www.gutenberg.org/. Project Gutenberg is a free repository of public domain books. Find any book you like, and download it in Plain Text UTF-8 format.

For example, we will use Shakespeare's complete works: https://www.gutenberg.org/ebooks/100. There is a link on that page to the Plain Text format data.  Download the pg100.txt file, and then upload it from your computer to colab (click at left on the File icon, then click the upload icon).  

*Important*: whichever work you choose, make sure you have enough data! The size of your plain text file should be at least 2MB.

In [None]:
import requests

# URL of the book's plain text file on Project Gutenberg
url = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"  # Moby Dick; Or, The Whale

# Fetch and read the book
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
text = response.text

# Display the first 1000 characters to verify
print(text[:1000])

The Project Gutenberg eBook of Moby Dick; Or, The Whale
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Moby Dick; Or, The Whale

Author: Herman Melville

Release date: July 1, 2001 [eBook #2701]
                Most recently updated: August 18, 2021

Language: English

Credits: Daniel Lazarus, Jonesey, and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***




MOBY-DICK;

or, THE WHALE.

By Herman Melville



CONTENTS

ETYMOLOGY.

EXTRACTS (Supplied by a Sub-Sub-Librarian).

CHAPTER 1. Loomings.

CHAPTER 2. 


Let's check how long the text is and look at a sample:

In [4]:

print("text is", len(text), "characters long.")
print()
print("A sample from the middle:")
print()
print(text[len(text) // 2 : len(text) // 2 + 100])

text is 1260542 characters long.

A sample from the middle:

whole boat in its complicated coils,
twisting and writhing around it in almost every direction. All


### Data standardization

Now we will clean the data by replacing every character that is not in our alphabet with a space. Because the alphabet includes uppercase letters and the newline character, case and line breaks are preserved.

In [5]:
# remove extra characters by replacing them with spaces
text = re.sub(rf"[^{alphabet}]", " ", text)
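To see what this substitution does, here is a minimal sketch on a toy string. The names `toy_alphabet`, `raw`, and `cleaned` are made up for illustration, and `re.escape` is an optional safeguard that is not in the cell above; it would matter only if the alphabet contained characters such as `-`, `]`, or `^`, which are special inside a character class.

```python
import re

# Hypothetical toy alphabet: lowercase letters, space, and newline only
toy_alphabet = "abcdefghijklmnopqrstuvwxyz \n"
raw = "Hello, World!\n#1"

# [^...] matches any single character NOT listed in the class
pattern = rf"[^{re.escape(toy_alphabet)}]"
cleaned = re.sub(pattern, " ", raw)

print(repr(raw))
print(repr(cleaned))
# Each disallowed character becomes exactly one space, so the length is unchanged
print(len(raw), len(cleaned))
```

Because the replacement is one-for-one, character offsets into the text remain valid after cleaning.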

Let's see how it looks again:

In [6]:
a = 110042
b = a+131
x_prompt = text[a:b]
print(x_prompt)


But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 


### Numerical encoding

Unfortunately, neural networks don't understand text. So, we need to convert our characters to numerical values. Here are some helper functions for doing this.

In [7]:
# let's build a dictionary mapping characters to integers
char2int = {c: i for i, c in enumerate(alphabet)}
alphabet_array = np.array([c for c in alphabet])

# this function will turn a string into a numpy array of integers
def int_encode(string):
  if any(c not in char2int for c in string):
    raise ValueError(
        "Found a character which was not in the alphabet in the input "
        f"to int_encode. Valid alphabet characters: {alphabet}"
      )
  return np.array([char2int[c] for c in string])

# this function will decode a numpy array of integers back to a string
def int_decode(int_array):
  return ''.join(alphabet_array[int_array])

(Question 1a: 4 points) Test out `int_encode` by passing in a test string (the cell below uses `x_prompt`) and printing the result.

In [8]:
# Let's test these out!
### YOUR CODE HERE ###
print(int_encode(x_prompt))

[45 22 21  0  9 10  8  9  0  2  3 16 23  6  0 21  9  6  0  7 13 26 10 15
  8  0 20  4 22  5  0  2 15  5  0  5  2 19 12  0 19 16 13 13 10 15  8  0
  4 13 16 22  5 20 39  0 21  9  6 19  6  0 70  7 13 16  2 21  6  5  0  2
  0 13 10 21 21 13  6  0 10 20 13  6  0 16  7  0 20 22 15 13 10  8  9 21
 39  0  7 19 16 14  0 24  9 10  4  9  0  3  6  2 14  6  5  0  7 16 19 21
  9  0  2 15  0  2 15  8  6 13  0]


(Question 1b: 4 points) Decode the result from the last cell using `int_decode` to make sure it is the same as the original string.

In [9]:
### YOUR CODE HERE ###
print(int_decode(int_encode(x_prompt)))

But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 


Is the decoded text the same as the original string? It should be; if it is not, there is a bug above.

### Make a training dataset

First, we make a numerical encoded version of the entire dataset:

In [10]:
enctext = int_encode(text)

Use `torch.tensor` to make it into a PyTorch tensor:

In [11]:
enctext = torch.tensor(enctext)
print(enctext)

tensor([ 0, 63,  9,  ..., 70,  0, 70])


# Section 2: Training a NN

Our model will work as follows:
 - One-hot encoded input gets passed into a linear embedding layer. These two operations are combined with the `Embedding` layer: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
 - LSTM cell
 - Linear decoder layer

Torch has two main ways of interfacing with recurrent networks. In the case of LSTMs, those are:
 - the LSTM layer https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM
 - the LSTMCell layer https://pytorch.org/docs/stable/generated/torch.nn.LSTMCell.html

Both models are sequential: the goal is to process a batch of sequences of input features and produce a batch of sequences of output features. The `LSTM` class makes this simple and easy, and the `LSTMCell` class gives more control by allowing you to process the sequences one element at a time. We will use the `LSTM` layer to keep things simple, but keep in mind that some of what we do could be made more efficient with `LSTMCell`.

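With `batch_first=True`, as used in the model below, the inputs and outputs of recurrent layers in Torch have shape `(batch_dimension, sequence_dimension, feature_dimension)`; the PyTorch default puts the sequence dimension first instead. In this case, the feature dimension of the one-hot input and of the output logits is `len(alphabet)`. In practice, the `Embedding` layer takes integer indices of shape `(batch_dimension, sequence_dimension)` directly.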

Something to keep in mind: the outputs of this network are aligned with its inputs along the sequence dimension. For each sequence in a batch, the `k`th output along the sequence dimension holds the logits for predicting the `k+1`th input character of that sequence.
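The shapes described above can be traced with a small sketch. The layer sizes here are hypothetical and chosen only to keep the example fast; the layers are called one by one rather than through a `Sequential` container, which is a simplification for illustration.

```python
import torch
from torch import nn

# Hypothetical small sizes, for illustration only
vocab_size, hidden_dim = 71, 16
batch_size, seq_len = 4, 10

embed = nn.Embedding(vocab_size, hidden_dim)
lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
decode = nn.Linear(hidden_dim, vocab_size)

# Integer-encoded characters: (batch, sequence)
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

embedded = embed(tokens)               # (batch, sequence, hidden)
lstm_out, (h_n, c_n) = lstm(embedded)  # lstm_out: (batch, sequence, hidden)
logits = decode(lstm_out)              # (batch, sequence, vocab)

print(tokens.shape, embedded.shape, lstm_out.shape, logits.shape)
# Final hidden and cell states: (num_layers, batch, hidden)
print(h_n.shape, c_n.shape)
```

`logits[b, k]` is then the vector of scores over the alphabet for the character that follows position `k` in sequence `b`.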

In [12]:
# We will use this constant below
HIDDEN_DIM = 128

In [13]:
# Defining some parameters about data batching, explained in the next section
# Note: after you get the entire assignment working, you can make these
# bigger and train for longer, to get better performance
SEQUENCE_LENGTH = 128
BATCH_SIZE = 64

### Making the dataset of (input, target) pairs

To train the model, we need to make a `torch.utils.data.Dataset` containing input and target sequences. Our input sequences will be sequences of length `SEQUENCE_LENGTH` containing int-encoded characters from the input. Our target sequences will be the "next characters" corresponding to the input sequence: so, if the input sequence is the 10th, 11th, ... characters, then the target sequence is the 11th, 12th, ... characters.

We will walk through using `torch.utils.data.Dataset` methods to create these.
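The relationship between input and target sequences can be sketched on a toy string with plain Python lists. The names `toy_text` and `seq_len` are hypothetical stand-ins for the encoded text and `SEQUENCE_LENGTH`; here the input side is trimmed by one character so that both sides have the same length.

```python
# Toy stand-ins for the encoded text and SEQUENCE_LENGTH (hypothetical values)
toy_text = "hello world"
seq_len = 4

# Inputs drop the last character; targets drop the first
input_source = toy_text[:-1]
target_source = toy_text[1:]

# Keep only complete sequences
n_seqs = len(input_source) // seq_len
inputs = [input_source[i * seq_len:(i + 1) * seq_len] for i in range(n_seqs)]
targets = [target_source[i * seq_len:(i + 1) * seq_len] for i in range(n_seqs)]

# Each target is its input shifted one character to the left
for x, y in zip(inputs, targets):
    print(repr(x), "->", repr(y))
```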

(Question 2c: 8 points) Write a `batch` function that splits a `torch` tensor, such as the `enctext` tensor defined above, into disjoint, consecutive sequences of `SEQUENCE_LENGTH` characters each. `torch.split()` and `torch.vstack()` may be useful. Remember to be careful of the edge case that arises when `len(enctext) % SEQUENCE_LENGTH != 0`.

In [14]:
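def batch(enctext, SEQUENCE_LENGTH):

    # YOUR CODE HERE
    # Number of complete sequences that fit in the text
    num_complete_seqs = len(enctext) // SEQUENCE_LENGTH

    # Drop the trailing remainder so the length is an exact multiple
    usable_length = num_complete_seqs * SEQUENCE_LENGTH
    text_subset = enctext[:usable_length]
    # One row per sequence: shape (num_complete_seqs, SEQUENCE_LENGTH)
    batched = text_subset.view(-1, SEQUENCE_LENGTH)

    return batched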

(Question 2d: 8 points) Now, use batch to create target sequences from the following version of the dataset which has been offset by 1 element:

In [15]:
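# Drop the last character from the inputs so that inputs and targets always
# contain the same number of sequences, even when len(enctext) is an exact
# multiple of SEQUENCE_LENGTH
input_seqs  = batch(enctext[:-1], SEQUENCE_LENGTH)
target_seqs = batch(enctext[1:], SEQUENCE_LENGTH)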
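One edge case worth checking is the number of rows on each side, since `TensorDataset` requires its tensors to have the same first dimension. The sketch below only counts complete sequences for a few hypothetical text lengths; `num_complete_sequences` is an illustrative helper, not part of the assignment.

```python
def num_complete_sequences(n_chars, seq_len):
    """Number of full sequences of length seq_len that fit in n_chars."""
    return n_chars // seq_len

seq_len = 4
for n_chars in (8, 9, 11):
    rows_full = num_complete_sequences(n_chars, seq_len)         # batch(x)
    rows_trimmed = num_complete_sequences(n_chars - 1, seq_len)  # batch(x[:-1])
    rows_targets = num_complete_sequences(n_chars - 1, seq_len)  # batch(x[1:])
    print(f"n_chars={n_chars}: full={rows_full}, "
          f"trimmed={rows_trimmed}, targets={rows_targets}")
```

When the text length is an exact multiple of the sequence length (8 in this toy case), batching the full text yields one more row than batching the shifted text. Trimming the last character from the input side keeps the two counts equal in every case.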

(Question 2e: 6 points) Now, use the `torch` builtin class `torch.utils.data.TensorDataset` to create a dataset of (input, target) pairs:

In [16]:
pairs = TensorDataset(input_seqs, target_seqs)
pairs

<torch.utils.data.dataset.TensorDataset at 0x7f8cf031aa60>

(Question 2f: 4 points) Finally, define a `torch.utils.data.DataLoader` object to generate batches of pairs of length `BATCH_SIZE`:

In [17]:
train_loader = DataLoader(pairs, batch_size=BATCH_SIZE, shuffle=True)
train_loader

<torch.utils.data.dataloader.DataLoader at 0x7f8cf031d910>

You may uncomment the below cell if you would like to understand the structure of the `train_loader`.

In [19]:
# for i, (x, y) in enumerate(train_loader):
#    print(x, y)
#    if i > 1:
#        break

(Question 2a: 10 points) Model definition: make a Sequential model with an Embedding layer with input dimension `len(alphabet)` and output dimension `HIDDEN_DIM`, followed by an LSTM layer with `HIDDEN_DIM` features, followed by a Linear layer with `len(alphabet)` features. A helper class is provided to extract the output tensor from the tuple returned by the LSTM layer, so that it can be passed to the final linear layer. Its use in the Sequential container would look something like `('extract', extract_tensor(return_sequences=return_sequences))`

In [21]:
from collections import OrderedDict

return_sequences = True

# nn.LSTM returns a tuple: (output tensor, (hidden state, cell state))
class extract_tensor(nn.Module):
    def __init__(self, return_sequences=False):
        super(extract_tensor, self).__init__()
        self.return_sequences = return_sequences

    def forward(self, x):
        # LSTM output has shape (batch, sequence, hidden); discard the state
        tensor, _ = x
        # Keep only the last time step: shape (batch, hidden)
        if not self.return_sequences:
            tensor = tensor[:, -1, :]
        return tensor

model = nn.Sequential(OrderedDict([
    ('embed', nn.Embedding(len(alphabet), HIDDEN_DIM)),
    ('lstm', nn.LSTM(HIDDEN_DIM, HIDDEN_DIM, batch_first=True)),
    ('extract', extract_tensor(return_sequences=return_sequences)),
    ('decode', nn.Linear(HIDDEN_DIM, len(alphabet)))
]))



(Question 2b: 8 points) If we want to use the output of the model as logits for predicting a character (which we can think of as a class), what loss should we use? Name this `criterion`. Additionally, define an optimizer to use in training. As per usual, we will recommend the use of `optim.Adam`.

In [22]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
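To build some intuition for the loss values, here is a minimal pure-Python sketch of cross-entropy for a single prediction. `nn.CrossEntropyLoss` takes raw logits and integer class targets and, by default, averages this quantity over all elements; the `cross_entropy` helper and the example logits below are illustrative, not part of the assignment.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

n_classes = 71  # len(alphabet) in this notebook

# A model that knows nothing assigns equal logits to every character
uniform = [0.0] * n_classes
uniform_loss = cross_entropy(uniform, target=3)

# A model that strongly favors class 3
confident = [0.0] * n_classes
confident[3] = 5.0
right_loss = cross_entropy(confident, target=3)  # favored class is correct
wrong_loss = cross_entropy(confident, target=4)  # favored class is wrong

print(f"uniform guess:   {uniform_loss:.4f}")
print(f"confident+right: {right_loss:.4f}")
print(f"confident+wrong: {wrong_loss:.4f}")
```

The uniform-guess value, `log(71)`, is a useful baseline: a trained model should reach an average loss well below it.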

(Question 2g: 2 points) Train the model!


Write a training loop in PyTorch for a model that processes batched input data. Use `NUM_EPOCHS = 40` as the number of training epochs, and ensure that:
1. Each epoch consists of iterating over batches from a `train_loader`.
2. For each batch, the model's gradients are zeroed, a forward pass is made, and a loss is calculated using a provided criterion.
3. After each batch's loss is calculated, perform backpropagation and optimizer steps.
4. Track and print the average loss at the end of each epoch.



In [24]:
NUM_EPOCHS = 40

for epoch in range(NUM_EPOCHS):
    total_loss = 0
    num_batches = 0
    
    for batch_inputs, batch_targets in train_loader:
        optimizer.zero_grad()
    
        outputs = model(batch_inputs)
        outputs = outputs.view(-1, len(alphabet))
        batch_targets = batch_targets.view(-1)
        
        loss = criterion(outputs, batch_targets)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        num_batches += 1
    
    avg_loss = total_loss / num_batches
    if (epoch + 1) % 5 == 0: 
        print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Average Loss: {avg_loss:.4f}')

Epoch [5/40], Average Loss: 1.8243
Epoch [10/40], Average Loss: 1.6756
Epoch [15/40], Average Loss: 1.6017
Epoch [20/40], Average Loss: 1.5555
Epoch [25/40], Average Loss: 1.5239
Epoch [30/40], Average Loss: 1.5000
Epoch [35/40], Average Loss: 1.4821
Epoch [40/40], Average Loss: 1.4672


Here, make sure the loss goes down as it trains.
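One way to read these numbers is to convert them to perplexity, the exponential of the cross-entropy (which `nn.CrossEntropyLoss` measures in nats). The sketch below uses the losses printed above; a uniform guess over the 71-character alphabet would correspond to a perplexity of 71.

```python
import math

# Average losses printed by the training run above (epochs 5, 10, ..., 40)
logged_losses = [1.8243, 1.6756, 1.6017, 1.5555, 1.5239, 1.5000, 1.4821, 1.4672]

# The loss should decrease from one logged epoch to the next
is_decreasing = all(a > b for a, b in zip(logged_losses, logged_losses[1:]))
print("loss decreasing:", is_decreasing)

# Perplexity = exp(cross-entropy in nats)
for epoch, loss in zip(range(5, 45, 5), logged_losses):
    print(f"epoch {epoch:2d}: loss {loss:.4f}, perplexity {math.exp(loss):.2f}")
```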

# Section 3: Did it work? Let's see what the model learned

Here, we'll write some functions to see how well the model has learned to predict text and to draw samples from the model.

First, we'll give you a function to "seed" the model with some input text and then predict the most likely future text. It will be your job to create a variation on this function in the question below, so make sure you understand how it works.

In [25]:
def predict(seed_string, sample_length=50):
  # Convert seed_string to a list of integer codes
  current_text_ints = list(int_encode(seed_string))

  for i in range(sample_length):
    # Add a batch dimension of size 1 and convert to tensor
    text_arr = np.array(current_text_ints).reshape(1, -1)
    text_arr = torch.tensor(text_arr)

    # Set our model to return only the last output instead of the sequence
    model.extract.return_sequences = False

    # Get the logits for the next character, shape (1, len(alphabet))
    logits = model(text_arr)

    # Remove the batch dimension to get the final logits
    final_logits = logits[-1]

    # Get the most likely next character using torch.argmax
    pred = torch.argmax(final_logits)

    # Append this to `current_text_ints`
    current_text_ints.append(pred.item())

  return int_decode(np.array(current_text_ints))

In [None]:
pred_length = 150

# Print the initial prompt (x_prompt) used for prediction
print("Initial prompt (x_prompt) used for prediction:")
print(x_prompt)  # Assuming x_prompt is already defined

a = 110042
bb = a + 131 + pred_length
x_prompt_plus = text[a:bb]  # Extracting substring from text

# Print the extracted substring from 'text' within the specified range
print("\nExtracted substring (x_prompt_plus) from 'text' starting at index", a, "up to index", bb, ":")
print(x_prompt_plus)

# Print the predicted text based on x_prompt with specified prediction length
print("\nPredicted text based on x_prompt with a prediction length of", pred_length, "characters:")
print(predict(x_prompt, pred_length))

Initial prompt (x_prompt) used for prediction:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 

Extracted substring (x_prompt_plus) from 'text' starting at index 110042 up to index 110323 :
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel s 
face; and this bright face shed a distinct spot of radiance upon the 
ship s tossed deck, something like that silver plate now inserted into 
the V

Predicted text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel and the ship s 
content to the ship s and strange the ship s 
content to the ship s and strange the ship s 
content to the ship s and strange the ship


In [None]:
# feel free to try your own seed!

It seems like maybe the model learned something, but the output is a little boring. Let's make it more interesting with *randomness*!

Right now, the function always picks the most likely next letter. Instead, let's sample the next letter from the model's predicted probability distribution.

(Question 3a: 8 points) Fill in the blanks in the function below.

In [29]:
def generate(seed_string, sample_length=50, temperature=0.7):
    current_text_ints = list(int_encode(seed_string))

    for i in range(sample_length):
        text_arr = np.array(current_text_ints).reshape(1, -1)
        text_arr = torch.tensor(text_arr)

        model.extract.return_sequences = False
        logits = model(text_arr)
        final_logits = logits[-1]

        # Apply temperature scaling
        logits_temp = final_logits / temperature
        probs = F.softmax(logits_temp, dim=0)
        probs = probs.detach().numpy()

        # Sample from the distribution
        sample = np.random.choice(len(alphabet), p=probs)
        current_text_ints.append(sample)

    return int_decode(np.array(current_text_ints))
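The `temperature` argument divides the logits before the softmax. A minimal pure-Python sketch of its effect follows; the three logits are hypothetical, and `softmax_with_temperature` is an illustrative helper rather than part of the assignment.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for a 3-character alphabet

# Low temperature sharpens the distribution; high temperature flattens it
for t in (0.2, 0.7, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")

# Sampling draws an index in proportion to its probability
rng = random.Random(0)
probs = softmax_with_temperature(logits, 0.7)
samples = rng.choices(range(len(logits)), weights=probs, k=10)
print("samples:", samples)
```

As the temperature approaches zero, sampling behaves more and more like the `argmax` in `predict`; larger temperatures give more varied but less coherent text.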

(Question 3b: 6 points) Test this function `generate`. Is its output different from `predict`? How does it differ, and why?

In [30]:
# YOUR CODE HERE

# Print the initial prompt (x_prompt) used for prediction
print("Initial prompt (x_prompt) used for prediction:")
print(x_prompt)  # Assuming x_prompt is already defined

a = 110042
bb = a + 131 + pred_length
x_prompt_plus = text[a:bb]  # Extracting substring from text

# Print the extracted substring from 'text' within the specified range
print("\nExtracted substring (x_prompt_plus) from 'text' starting at index", a, "up to index", bb, ":")
print(x_prompt_plus)

# Print the predicted text based on x_prompt with specified prediction length
print("\nPredicted text based on x_prompt with a prediction length of", pred_length, "characters:")
print(predict(x_prompt, pred_length))

# Print text sampled from the model with `generate`
print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
print(generate(x_prompt, pred_length))


# Sample a second time to compare against the first draw
print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
print(generate(x_prompt, pred_length))


Initial prompt (x_prompt) used for prediction:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 

Extracted substring (x_prompt_plus) from 'text' starting at index 110042 up to index 110323 :
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel s 
face; and this bright face shed a distinct spot of radiance upon the 
ship s tossed deck, something like that silver plate now inserted into 
the V

Predicted text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel and the ship s 
content to the ship s and strange the ship s 
content to the ship s and strange the ship s 
content to the ship s and strange the ship

Generated text based on x_prompt with a prediction length of 150 characters:
But hi

(Question 3c: 2 points) Try running `generate` a few times. Are the results the same or different? Why?

In [31]:
# Draw three independent samples from the same prompt
for _ in range(3):
    print("\nGenerated text based on x_prompt with a prediction length of", pred_length, "characters:")
    print(generate(x_prompt, pred_length))



Generated text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel be seem them in his and being and with the strongly and containing over the near the sea of the fear were in its round your 
no contain peared in the 

Generated text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel 
suddenly a 
same every sezzled a hundred with age at one of the silentering remains, and sailors, in one and some man of the whale s way on the grand

Generated text based on x_prompt with a prediction length of 150 characters:
But high above the flying scud and dark rolling clouds, there 
floated a little isle of sunlight, from which beamed forth an angel to the shee 
whales, but seemed transped you see him and renom can subs

The output will be different each time you run the generate function because it's sampling from a probability distribution rather than always taking the most likely next character.
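If repeatable samples are needed, for example when debugging, one option is to fix NumPy's random seed before sampling, since `generate` draws from `np.random.choice`. The sketch below illustrates this on a hypothetical three-outcome distribution; `draw` is an illustrative helper, not part of the assignment.

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.7])  # hypothetical next-character distribution

def draw(seed, n=10):
    """Draw n samples after resetting NumPy's global random state."""
    np.random.seed(seed)
    return [int(np.random.choice(len(probs), p=probs)) for _ in range(n)]

first = draw(seed=0)
second = draw(seed=0)

print(first)
print(second)
print("identical with the same seed:", first == second)
```

Without a fixed seed, each call continues from wherever the random state was left, which is why repeated calls to `generate` give different text.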