# Natural Language Processing & RNNs





Proprietary material - Under Creative Commons 4.0 licence CC-BY-NC-ND https://creativecommons.org/licenses/by-nc-nd/4.0/



In the lecture, we learned about the use of Recurrent Neural Networks (RNNs) in NLP, how word embeddings work, and the many potential applications of NLP, a lot of which we already use in our daily lives!

There are a few contexts in which RNNs are generally applied:

- Character-based: Predict the next character given a sequence of previous characters (e.g. it might predict "e" given "Hello ther").


- Word-based: Predict the next word given a sequence of previous words (e.g. it might predict "summer" given "School is out for").

- Learning word embeddings: Learn to convert a word into a numerical/vector representation.
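
To make the first two settings concrete, here is a minimal sketch that builds (context, target) pairs at the character level and at the word level. The helper name `next_item_pairs` and the sample sentence are made up for illustration and are not part of the tutorial's code.

```python
def next_item_pairs(items, context_size):
    """Return (context, target) pairs where each target follows its context."""
    pairs = []
    for i in range(len(items) - context_size):
        context = items[i:i + context_size]
        target = items[i + context_size]
        pairs.append((context, target))
    return pairs

example = "school is out for summer"

# Character-based: the sequence elements are single characters.
char_pairs = next_item_pairs(list(example), context_size=10)

# Word-based: the sequence elements are whole words.
word_pairs = next_item_pairs(example.split(), context_size=4)

print(char_pairs[0])
print(word_pairs[0])
```

The only difference between the two settings is what counts as one element of the sequence.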

In this tutorial we will mess around with word embeddings and then we'll train a character-based model!

# Setup

We first need the standard imports to get this NLP party started!

In [None]:
import numpy as np
import keras

# Word Embeddings

A word embedding takes a normal word like 'hello' and converts it to a vector.
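
As a rough illustration of why vectors are useful, the sketch below compares words by the angle between their vectors (cosine similarity). The three-dimensional vectors are hand-picked assumptions for illustration only; real embeddings are learned from data and typically have many more dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "hello": np.array([0.9, 0.1, 0.0]),
    "hi":    np.array([0.8, 0.2, 0.1]),
    "house": np.array([0.0, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_greetings = cosine_similarity(toy_embeddings["hello"], toy_embeddings["hi"])
sim_unrelated = cosine_similarity(toy_embeddings["hello"], toy_embeddings["house"])
print(sim_greetings, sim_unrelated)
```

In this toy setup the two greetings point in nearly the same direction, so their similarity is much higher than that of "hello" and "house".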

To practice working with word embeddings, we will download the text8 corpus and train some word embeddings on it, then play around with the embeddings and see what we can discover.

The text8 corpus is quite small compared to the corpora people generally use to train word embeddings, but in the interest of time, we will use text8. Our embeddings will be fairly noisy and unreliable as a result, though.

gensim is a library that makes it super easy to train and work with word embeddings. All you need to do to train embeddings using Word2Vec is call Word2Vec(corpus).

In [None]:
from gensim.models.word2vec import Word2Vec
from gensim import downloader

In [None]:
# Download the text8 corpus (a fairly small corpus of text)
corpus = downloader.load('text8')

# Train the Word2Vec model on the corpus
model = Word2Vec(corpus)

# Save the vectors learned by the model in wv
wv = model.wv



This section is very open-ended. The main objective is for you to get comfortable working with continuous vector representations of words, so feel free to try out a variety of things!

Here are some ideas:
 - Find the nearest neighbours of a few words. What patterns can you find?
 - Are some words semantically isolated? In other words, are words that are very different in meaning very 'far away' in space? Look for some words which are quite far away from other words in the embedding space.
 - Think of some analogy (i.e. A is to B as X is to ?) and find similar relationships.
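
The analogy idea can be sketched with plain vector arithmetic: to answer "A is to B as X is to ?", look for the word whose vector is closest to B - A + X. The two-dimensional vectors and the `analogy` helper below are assumptions made up for illustration, not the trained text8 embeddings.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(a, b, x, vectors):
    """Solve 'a is to b as x is to ?' via the word closest to b - a + x."""
    target = vectors[b] - vectors[a] + vectors[x]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, x):
            continue  # skip the query words themselves
        score = np.dot(vec, target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

answer = analogy("man", "king", "woman", toy)
print(answer)
```

With the trained model, a comparable query is `wv.most_similar(positive=['king', 'woman'], negative=['man'])`. Since text8 is small, expect the results to be noisy.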

In [None]:
# See the vector for "hello"
print(wv['hello'])

In [None]:
# Find the nearest neighbours of "house"
print(wv.most_similar('house', topn=10))

In [None]:
# Find the nearest neighbours of "silly"
print(wv.most_similar('silly', topn=10))

Now, play around with these word embeddings. You might want to look into the "most_similar" function in particular. Documentation is [here](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py).

This section is very open-ended. Feel free to try whatever you'd like, but here are some ideas:
- Pick some analogy, such as "Ottawa is to Canada as _ is to _" and see what the model fills in
- Pick an odd word and look at the words that are the most similar
- Download various 


In [None]:
# TODO - Get comfortable with word embeddings!


# Setting up and training an LSTM

A Long Short-Term Memory (LSTM) model is a specialized RNN which was built to have a better memory. Thankfully we can treat it the same as a regular RNN model on the surface - it just tends to perform better on tasks that involve long-range dependencies.

We will be training a character-based LSTM model to predict the next character given previous characters. We will train this model on the collected works of Friedrich Nietzsche.

This basic idea of training a model that predicts future elements of a sequence given previous elements is fundamental and underpins many models in NLP, such as [GPT-3](https://github.com/openai/gpt-3).

In the interest of time we are working with a much smaller corpus than we would use in an industry application of NLP.

First, we will read the corpus. We will then assign a numeric index to each character and create dictionaries mapping from characters to indices and vice versa.

In [None]:
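# get_file is exposed directly from keras.utils; the older
# keras.utils.data_utils path is not available in recent Keras releases
from keras.utils import get_file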

# Download the file from the internet
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

# Read the file and make it all lowercase
with open(path, encoding='utf-8') as f:
    text = f.read().lower()

# Print number of characters
print('corpus length:', len(text))

# Get a sorted list of unique characters in text
chars = sorted(set(text))
print('total chars:', len(chars))

# Create dicts for mapping characters to indices and vice versa
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}
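
As a quick check of what these two dictionaries do, the sketch below builds them for a short toy string (an assumption standing in for the full corpus) and round-trips the string through indices and back. All `toy_` names are illustrative.

```python
# Toy stand-in for the corpus; the dictionaries are built the same way as above.
toy_text = "hello there"
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as a list of integer indices, then decode it again.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```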

Next, we split the corpus into smaller sequences for training. Note that if you use these models in your own projects, you would want the number of sequences to be in the millions (here it is in the hundreds of thousands).

In [None]:
# Cut the text into smaller sequences of <maxlen> characters
# shifting over the window <step> characters at a time.
maxlen = 40
step = 3

# These will be our model's inputs after some preprocessing
sentences = []

# These will be what our model will need to predict.
next_chars = []

# Traverse the text to create the sequences above.
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('# sequences:', len(sentences))
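
To see what the sliding window produces, here is the same loop applied to a short toy string with a smaller window. The `toy_` names and the window length of 5 are assumptions chosen so the output is easy to read.

```python
toy_text = "hello there"
toy_maxlen = 5   # window length (the tutorial uses maxlen = 40)
toy_step = 3     # stride between consecutive windows

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next_chars.append(toy_text[i + toy_maxlen])

# Each window is paired with the single character that follows it.
for window, target in zip(toy_sentences, toy_next_chars):
    print(repr(window), '->', repr(target))
```

Consecutive windows overlap whenever the step is smaller than the window length, which is how a fairly small corpus yields many training sequences.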

Next, we one-hot encode the characters to get an n\*m\*v shaped tensor (tensors are just higher-dimensional matrices/vectors) for our training features x and an n\*v shaped tensor for our training labels y.

Here, n is the number of sequences, m is the length of each sequence (40 in our case) and v is the size of the vocabulary (57 in our case).


In [None]:
# Use the built-in bool: the np.bool alias was removed from recent NumPy versions
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        # Each row is the sentence's 40 characters where we have a one-hot
        # encoded vector for each character.
        x[i, t, char_indices[char]] = 1
    # Also one-hot encode the output variables
    y[i, char_indices[next_chars[i]]] = 1
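
As a sanity check on the shapes, the sketch below repeats the encoding on two toy sequences and confirms that every character position and every label has exactly one "hot" entry. The `toy_` names and data are illustrative assumptions.

```python
import numpy as np

toy_sentences = ["hello", "lo th"]       # n = 2 sequences of length m = 5
toy_next_chars = [" ", "e"]              # the character following each sequence
toy_chars = sorted(set("hello there"))   # v = 7 distinct characters
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

n, m, v = len(toy_sentences), len(toy_sentences[0]), len(toy_chars)
toy_x = np.zeros((n, m, v), dtype=bool)
toy_y = np.zeros((n, v), dtype=bool)
for i, seq in enumerate(toy_sentences):
    for t, char in enumerate(seq):
        toy_x[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = True

print(toy_x.shape, toy_y.shape)
print(toy_x.sum(axis=-1))   # one hot entry per character position
print(toy_y.sum(axis=-1))   # one hot entry per label
```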

Great! We've prepared the data! Now, we have to define our model. Let's import the necessary functions.

In [None]:
import random
import tensorflow as tf
#tf.enable_eager_execution()
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

We will define a model with one LSTM layer with 128 neurons, followed by a softmax layer with one neuron per character. This second layer can be interpreted as the probabilities of each character coming next.

In [None]:
model = Sequential([
    LSTM(128, input_shape=(maxlen, len(chars))),
    Dense(len(chars), activation='softmax'),
])

# use the RMSprop optimizer with a learning rate of 0.01
# (the argument is named learning_rate; the older lr alias is deprecated)
optimizer = RMSprop(learning_rate=0.01)
# compile the model using categorical cross-entropy loss
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
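
To get a feel for the size of this model, the sketch below counts its trainable parameters by hand. It assumes a standard LSTM layer with four gate/candidate blocks and biases, and a vocabulary of 57 characters. The helper names are made up for illustration.

```python
def lstm_param_count(input_dim, units):
    # 4 gate/candidate blocks, each with input weights,
    # recurrent weights and one bias per unit.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

vocab_size = 57      # v, assuming the corpus has 57 distinct characters
hidden_units = 128   # size of the LSTM layer

lstm_params = lstm_param_count(vocab_size, hidden_units)
dense_params = dense_param_count(hidden_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

If these assumptions hold, the numbers should agree with what `model.summary()` reports. Note that the sequence length does not appear in either formula.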

Next, we define a function that prints some text output by our model based on the training it has gone through so far.

In [None]:
# Here we're just defining what should be done when an epoch ends.
# Don't worry if some of the syntax isn't super clear, it's just for making
# the output nice and making some actual predictions.
def on_epoch_end(epoch, _):
  # Function invoked at end of each epoch. Prints generated text.
  print(f'[Epoch {epoch}]: ', end='')

  # Select a random starting index and get maxlen characters from there
  start_index = random.randint(0, len(text) - maxlen - 1)
  sentence = text[start_index: start_index + maxlen]
  print(sentence, end='<END_SEED>')
  
  # Print 400 additional characters
  for i in range(400):
    # One-hot encode the input sentence
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.
    
    # Predict the next character based on the most probable char
    pred = np.argmax(model.predict(x_pred, verbose=0)[0])
    next_char = indices_char[pred]

    # Add the currently predicted character to the sentence for our next prediction
    sentence = sentence[1:] + next_char
    print(next_char, end='')
  
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

Finally, let's fit the model to the data and see how it does!

In [None]:
model.fit(x, y, batch_size=128, epochs=5, callbacks=[print_callback])

Now, we can generate text using the fitted model!

**The importance of diversity**

Our model returns probabilities for how likely each letter is to come next in the sequence, which is what we want! However, if we try to generate a sentence we'll usually get repetitive text, for example, "the stand the stand the stand..." or "of the strength of the strength...". This stems from the fact that we're always taking the most probable character, so we may not always want to do that.

Diversity is how 'diverse' we want the text to be. As it approaches 0, sampling becomes like taking the most probable character (argmax). If it's 1, we sample characters according to the probabilities the model predicted, and larger values push the choice closer to picking a character at random. So, when generating a sentence, you have to play with the diversity value and find the right balance!
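
The effect of the diversity value can be sketched on a small, made-up distribution: the log-probabilities are divided by the diversity and then renormalized. The function name `apply_diversity` and the three-character distribution are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np

def apply_diversity(probs, diversity):
    """Rescale a probability distribution; diversity must be > 0."""
    logits = np.log(np.asarray(probs, dtype='float64')) / diversity
    exp_logits = np.exp(logits - np.max(logits))  # subtract max for stability
    return exp_logits / np.sum(exp_logits)

# Hypothetical model output over a 3-character vocabulary.
probs = [0.6, 0.3, 0.1]

for d in (0.2, 1.0, 2.0):
    print(d, np.round(apply_diversity(probs, d), 3))
```

A small diversity concentrates almost all of the probability on the most likely character, a diversity of 1 leaves the distribution unchanged, and a larger diversity flattens it.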

In [None]:
# TODO - See what the outputs below look like when using_diversity is False
# TODO - Set using_diversity to True and play with the values of diversity
# Note: diversity must be greater than 0 (it is used as a divisor)
using_diversity = False
diversity = 0.5

In [None]:
# Predict a single character for the current (global) `sentence`
def sample_character():
  # One-hot encode the input sentence
  x_pred = np.zeros((1, maxlen, len(chars)))
  for t, char in enumerate(sentence):
      x_pred[0, t, char_indices[char]] = 1.

  # Probability of each character coming next
  pred = model.predict(x_pred, verbose=0)[0]

  if not using_diversity:
    # Greedy choice: always take the most probable character
    return np.argmax(pred)

  # Rescale the probabilities by the diversity (must be > 0), then sample
  pred = np.asarray(pred).astype('float64')
  pred = np.log(pred) / diversity
  exp_pred = np.exp(pred)
  pred = exp_pred / np.sum(exp_pred)
  probas = np.random.multinomial(1, pred, 1)
  return np.argmax(probas)

Predict a single letter (re-run for different sentences):

In [None]:
# Select a random starting index and get maxlen characters from there
start_index = random.randint(0, len(text) - maxlen - 1)
sentence = text[start_index: start_index + maxlen]
print(sentence.replace("\n", " "), end=' --- Predicted: ')

pred = sample_character()
next_char = indices_char[pred]
print(next_char, end='')

Predict the rest of a sentence!

In [None]:
# Select a random starting index and get maxlen characters from there
start_index = random.randint(0, len(text) - maxlen - 1)
sentence = text[start_index: start_index + maxlen]
print(sentence.replace("\n", " "), end=' --- Predicted: ')

# You can change the range of this loop to generate more or less characters
for i in range(80):
  pred = sample_character()
  next_char = indices_char[pred]

  # Add the currently predicted character to the sentence for our next prediction
  sentence = sentence[1:] + next_char
  print(next_char, end='')

Is the sentence just repeating itself? Try switching 'using_diversity' to True and play with the diversity value (greater than 0, up to about 1)!

## Further Experimentation
It's time to play with the model itself! Just remember that there's a cost to every change. Here are a few things you should try:

Training Time vs Performance:
- Try using a subset of the total dataset for faster training (Smaller dataset = Faster training but worse performance).
- Try changing the learning rate (a larger learning rate often means faster training but less stable training and worse final performance, and vice versa).

Complexity vs Performance:
- Change the model architecture by adding more units or layers.
- Change maxlen to use more or fewer characters. Changing the amount of context can greatly affect performance on generation tasks. However, a larger maxlen means more time steps to process per sequence, which means more memory and longer training time (the number of model parameters stays the same).
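
To estimate the cost of such changes before retraining, the sketch below computes how many training sequences the windowing loop would produce and how many entries the one-hot tensor x would hold. The helper name, the 600,000-character corpus length and the 57-character vocabulary are assumptions for illustration. Substitute `len(text)` and `len(chars)` from your own run.

```python
def training_set_size(corpus_length, window, stride, vocab_size):
    """Number of sequences and number of entries in the one-hot tensor x."""
    n_sequences = len(range(0, corpus_length - window, stride))
    x_entries = n_sequences * window * vocab_size
    return n_sequences, x_entries

# Assumed values: a 600,000-character corpus with 57 distinct characters.
for window, stride in [(40, 3), (80, 3), (40, 1)]:
    n_seq, entries = training_set_size(600_000, window, stride, 57)
    print(f'maxlen={window}, step={stride}: {n_seq} sequences, {entries} entries in x')
```

Doubling the window roughly doubles the size of x, while reducing the step from 3 to 1 roughly triples the number of sequences.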

We understand that training can take a while, so you might want to consider doing some experimentation outside the designated tutorial time. You can just make a quick change and leave to do something else for a while. The ML training process involves a lot of that!

# References
Lab developed for the LearnAI 2020-2021 academic year cohort (University of Toronto AI Club)