# Lab 9: Natural Language Processing
COSC 410: Applied Machine Learning\
Colgate University\
*Prof. Apthorpe*

This lab is due to Gradescope by the beginning of lab next week (2:45p on 4/7). You may work with a partner on this lab – if you do, submit only one solution as a “group” on Gradescope. 

## Introduction

In this lab, you will implement a recurrent neural network to perform text generation.  The network you will create will perform **character-level forecasting**. Given a sequence of characters, the model will predict the next character in the sequence. When applied iteratively, this allows the model to generate new sequences of text. Note that the model will never be given specific instruction about English spelling, grammar, or other conventions. It will try to learn all of these things from the training input. 
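
The "predict one character, append it, repeat" loop can be sketched without a neural network. The block below is only an illustration: it swaps in simple bigram counts (how often each character follows another) for the RNN, and the names `fit_bigram_counts`, `generate`, and `toy_text` are hypothetical and not part of the lab code.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_counts(text):
    """Count, for each character, which characters follow it in the text."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, seed, num_chars, rng):
    """Repeatedly predict one character and append it to the running text."""
    out = list(seed)
    for _ in range(num_chars):
        followers = counts.get(out[-1])
        if not followers:
            # Unseen context: fall back to any character with known followers
            out.append(rng.choice(sorted(counts)))
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        # Sample the next character in proportion to how often it was seen
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

# Hypothetical toy training text, standing in for the fairytale
toy_text = "the wolf met the girl in the wood "
toy_counts = fit_bigram_counts(toy_text)
sample = generate(toy_counts, "t", 20, random.Random(0))
print(sample)
```

The RNN in this lab plays the role of `counts`, except that it conditions on the whole preceding sequence rather than on the last character alone.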

We will be using plain text files as training data, starting with the Brothers Grimm fairytale "Little Red-Cap" (known in America as "Little Red Riding Hood").  This text is on the short end of the amount of training input needed to train a text generation model and may result in generated text that mimics entire passages of the input. However, a smaller input text dramatically reduces training time while still showing how the process works -- perfect for this lab exercise.

## Provided Files
 * `Lab9.ipynb`: This file
 * `red_riding_hood.txt`: plaintext version of the Brothers Grimm fairytale "Little Red-Cap" 
 
## Part 1: Data Import and Preprocessing

In [5]:
import random
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras as ks

Complete the `load_input` function, which should 
  1) load a `.txt` file into one (long) string
  2) replace all '\n' characters with ' ' (space) characters
  3) convert all characters to lowercase
  4) return the result

In [1]:
def load_input(filename):
    # Read the whole file; the with-block closes it even if reading fails
    with open(filename, "r", encoding="utf8") as f:
        fileStr = f.read()
    fileStr = fileStr.replace('\n', ' ')
    fileStr = fileStr.lower()
    return fileStr

RNNs can't operate on strings directly, so we need to convert the characters into integers.

Complete the following functions to compute the **vocabulary** of the text (a list containing all the **unique** characters in the text), encode string texts into integer lists, and decode integer lists back to string texts.

In [2]:
def get_vocab(text):
   # Unique characters, kept in order of first appearance
   uniqChars = []
   for ch in text:
      if ch not in uniqChars:
         uniqChars.append(ch)
   return uniqChars

def encode(text, vocab):
   # Map each character to its index in the vocabulary
   return [vocab.index(ch) for ch in text]

def decode(tokens, vocab):
   # Map each index back to its character and join into one string
   return "".join(vocab[t] for t in tokens)
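
As a quick illustration of the encode/decode round trip, here is a minimal sketch on a toy string. The names `toy_text`, `toy_vocab`, and `char_to_id` are hypothetical; the dictionary lookup is one possible alternative to `list.index`, not a requirement of the lab.

```python
# Hypothetical toy text; the names below are illustrative only
toy_text = "hello"

# Vocabulary in order of first appearance: ['h', 'e', 'l', 'o']
toy_vocab = list(dict.fromkeys(toy_text))

# A dictionary gives constant-time lookups, unlike list.index
char_to_id = {ch: i for i, ch in enumerate(toy_vocab)}

toy_tokens = [char_to_id[ch] for ch in toy_text]
print(toy_tokens)   # [0, 1, 2, 2, 3]

# Decoding indexes back into the vocabulary list
round_trip = "".join(toy_vocab[t] for t in toy_tokens)
print(round_trip)   # hello
```

Both occurrences of `l` map to the same integer, which is why the vocabulary must contain each character only once.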


Next we need to create training examples and training labels for our model. The goal of the model is to take a sequence of characters and predict what character should come next. Complete the following function to divide the text into overlapping *subsequences* of characters (training examples) and a list of the characters immediately after each subsequence (training labels). 

In [3]:
def generate_sequences(tokens, seq_length):
   """Divides tokens (list of integers) into overlapping subsequences of length seq_length.
       Returns these subsequences as a list of lists, also returns a list with the 
       integer value immediately following each subsequence
    
       Example:
          generate_sequences([0, 1, 2, 2, 3, 4, 5, 6, 3, 7, 2, 8], 4) -->
              [[0, 1, 2, 2],
               [1, 2, 2, 3],
               [2, 2, 3, 4],
               [2, 3, 4, 5],
               [3, 4, 5, 6],
               [4, 5, 6, 3],
               [5, 6, 3, 7], 
               [6, 3, 7, 2]]  (1st return value)
               
             [3, 4, 5, 6, 3, 7, 2, 8]  (2nd return value)
       
       The reference implementation is 6 LoC."""
   sequences = []
   next_tokens = []  # renamed from `next` to avoid shadowing the builtin
   for i in range(len(tokens) - seq_length):
      # Window of seq_length tokens starting at i, plus the token right after it
      sequences.append(list(tokens[i:i + seq_length]))
      next_tokens.append(tokens[i + seq_length])
   return (sequences, next_tokens)
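
For comparison, the same windows can be produced without an explicit loop. This is a sketch of one vectorized alternative, assuming NumPy 1.20 or later (for `sliding_window_view`); the name `generate_sequences_np` is hypothetical, and the variant returns NumPy arrays rather than lists.

```python
import numpy as np

def generate_sequences_np(tokens, seq_length):
    """Vectorized variant: returns (windows, labels) as NumPy arrays."""
    tokens = np.asarray(tokens)
    # All windows of length seq_length; the final one has no following token
    windows = np.lib.stride_tricks.sliding_window_view(tokens, seq_length)[:-1]
    # The label for window i is the token at position i + seq_length
    labels = tokens[seq_length:]
    return windows, labels

x_toy, y_toy = generate_sequences_np([0, 1, 2, 2, 3, 4, 5, 6, 3, 7, 2, 8], 4)
print(x_toy.shape, y_toy.shape)   # (8, 4) (8,)
print(y_toy.tolist())             # [3, 4, 5, 6, 3, 7, 2, 8]
```

The input is the docstring's example, so the printed labels can be checked against its second return value.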


If you have programmed the previous functions correctly, the following cell will run with no errors and produce the following output:
```
Length of input text (in characters): 7376
Vocab size: 36
Training examples shape: (7326, 50)
Training labels shape: (7326,)
```


In [7]:
text = load_input("red_riding_hood.txt")
vocab = get_vocab(text)
tokens = encode(text, vocab)
assert(decode(tokens, vocab) == text)

seq_length = 50
x, y = generate_sequences(tokens, seq_length)
x, y = np.array(x), np.array(y)

print(f"Length of input text (in characters): {len(text)}")
print(f"Vocab size: {len(vocab)}")
print(f"Training examples shape: {x.shape}")
print(f"Training labels shape: {y.shape}")

Length of input text (in characters): 7376
Vocab size: 36
Training examples shape: (7326, 50)
Training labels shape: (7326,)
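
The shapes printed above follow directly from the text length and the sequence length. A small arithmetic check, using only the numbers from the output (`text_length` and `num_examples` are illustrative names):

```python
text_length = 7376   # characters in the input text, from the output above
seq_length = 50

# Every window needs one following character as its label, so window
# start indices run from 0 up to text_length - seq_length - 1 inclusive.
num_examples = text_length - seq_length

print((num_examples, seq_length))   # (7326, 50)
print((num_examples,))              # (7326,)
```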


## Part 2: RNN Creation & Training

Complete the following function that creates and compiles an LSTM model for character prediction.

In [8]:
def create_model(vocab_size, embedding_dim, rnn_units):
   """Creates, compiles, and returns an LSTM model for character prediction. The model should have 
       at least 3 layers: an Embedding layer, an LSTM layer, and a Dense layer. 
       The model should produce 1 prediction per input sequence (i.e. the next character following the sequence),
       NOT 1 prediction per step of the sequence.
       
       Arguments:
          vocab_size: number of unique characters across all training examples, also the input size of the Embedding layer
          embedding_dim: output size of Embedding layer
          rnn_units: number of units in LSTM layer
          
       Use the "adam" optimizer for best performance.
       
       The reference implementation is 7 LoC using the Keras Sequential API
    """
   model = ks.Sequential([
      ks.Input(shape=(None,)),                            # variable-length sequences of token ids
      ks.layers.Embedding(vocab_size, embedding_dim),
      ks.layers.LSTM(rnn_units),                          # return_sequences defaults to False: one output per sequence
      ks.layers.Dense(vocab_size, activation="softmax"),  # probability for each character in the vocabulary
   ])
   model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
   return model
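
To make the `sparse_categorical_crossentropy` loss passed to `model.compile` less opaque, here is a minimal pure-Python sketch of what it computes for a single example: the negative log of the probability the model assigned to the correct integer label. It ignores any numerical safeguards and batch averaging that the library applies, and the probabilities are made up for illustration.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Negative log of the probability assigned to the correct class index."""
    return -math.log(probs[label])

# Hypothetical softmax output over a 4-character vocabulary
probs = [0.1, 0.7, 0.1, 0.1]

loss_if_label_is_1 = sparse_categorical_crossentropy(probs, 1)
loss_if_label_is_0 = sparse_categorical_crossentropy(probs, 0)
print(round(loss_if_label_is_1, 3))   # 0.357
print(round(loss_if_label_is_0, 3))   # 2.303
```

The "sparse" part means the label is an integer index (as produced by `encode`) rather than a one-hot vector, which is why `y` can be passed to `model.fit` without further conversion.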

Complete the following function that takes a trained model and uses it to generate new text:

In [9]:
def generate_text(model, seed, num_chars, vocab):
   """Iteratively runs model.predict() to generate successive characters starting from the characters in seed. 
       Each generated character is appended to the input of the following model.predict() call. 
       
       Returns the generated text decoded back into a string.
       
       Remember that model.predict will return a probability distribution, not a single integer. 
       You will need to convert these probabilities into an integer by RANDOMLY SAMPLING an index
       based on the distribution weights, NOT by using np.argmax (which can lead to repetitions in generated text)
       
       You will have to be careful with your array shapes. You will want to include print statements to inspect
           the shapes of intermediate values to help with debugging.
       
       Arguments:
          model: trained model
          seed: string with "starter" seed for text generation. This will need to be encoded before it is used in model.predict
          num_chars: the number of characters that should be generated
          vocab: list of unique characters in all training examples
       
       The reference implementation is 7 LoC
    """
   tokens = encode(seed, vocab)
   for _ in range(num_chars):
      # model.predict expects a batch, so wrap the sequence in an array of shape (1, len(tokens))
      probs = model.predict(np.array([tokens]), verbose=0)[0]
      # Renormalize in float64 so np.random.choice accepts float32 softmax output
      probs = probs.astype("float64")
      probs /= probs.sum()
      # Sample a token index directly instead of sampling a character and re-encoding it
      tokens.append(int(np.random.choice(len(vocab), p=probs)))
   return decode(tokens, vocab)

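The docstring's advice to sample rather than take `np.argmax` can be seen on a toy distribution. The sketch below uses a made-up three-character distribution and a seeded generator (only so the run is repeatable); neither is part of the lab code.

```python
import numpy as np

# Hypothetical next-character distribution over a 3-character vocabulary
probs = np.array([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable

# Greedy decoding: the same distribution always yields the same index
greedy = [int(np.argmax(probs)) for _ in range(1000)]

# Sampling: each index is drawn in proportion to its probability
sampled = rng.choice(len(probs), size=1000, p=probs).tolist()

print(set(greedy))            # {0}
print(sorted(set(sampled)))   # every index shows up over enough draws
```

With greedy decoding, a model that returns to a similar context tends to emit the same continuation again, which is one way loops of repeated text arise; sampling lets lower-probability characters through.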
   

To test the `create_model` and `generate_text` functions, the following cell creates a model and uses it to generate 20 characters *untrained*. This will produce gibberish, but it will let you know whether there are runtime errors you need to fix before training.

In [10]:
embedding_dim = 256
rnn_units = 512
seed = "a"
num_chars_to_generate = 20

model = create_model(len(vocab), embedding_dim, rnn_units)

generated_text = generate_text(model, seed, num_chars_to_generate, vocab)
print(generated_text)

apveh’;dwrpy.ftngl!yu


Once you have the previous cell working, it is time to train! The following two cells create and train a model, printing some example generated text after each epoch. You can stop and resume the training at any point by interrupting the kernel and then re-running the cell that calls `model.fit`. As the training progresses, you will hopefully see the generated text looking more and more like English.

In [12]:
embedding_dim = 256
rnn_units = 512
batch_size = 128
epochs = 30
seed = "a"
num_chars_to_generate = 100

generate_text_callback = ks.callbacks.LambdaCallback(on_epoch_end=lambda epoch, log: print(generate_text(model, seed, num_chars_to_generate, vocab)))
model = create_model(len(vocab), embedding_dim, rnn_units)

In [17]:
model.fit(x, y, batch_size=batch_size, epochs=epochs, callbacks=[generate_text_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1c1145a7d60>

Finally, experiment with the trained model in the following cell to see how different seeds affect the generated text.

In [16]:
# seed = "little-red"
# seed = "hood"
# seed = "wolf"
seed = "tired"
num_chars_to_generate = 200

generated_text = generate_text(model, seed, num_chars_to_generate, vocab)
print(generated_text)

tired-wher winegesat heanthed anith oredgon ‘bertolke, ther yofr, thily, and rtrer’ ind. sther rethocmen stheo  fouug dourerd ‘unicuthe, indind pint ther sf lid;r’ whed ‘nand thermas theund ‘the betry sit 


## Part 3: Questions

**Question 1:** This model performs *character-level* forecasting. Another approach would be to perform *word-level* forecasting, where the model takes a sequence of words and predicts the next word in the sequence. In the following cell, discuss the pros and cons of character-level vs. word-level text generation. What are two reasons why character-level forecasting might be preferable? What are two reasons why word-level forecasting might be preferable?

**Question 2:** The model you created was not given any specific instruction about English words, English grammar, or anything else related to the language other than the sequence of characters in the example text. What elements of proper English do you see emerging in the text generated after each training epoch? How many epochs does it take for these to appear? What does the model still struggle with?