<a href="https://colab.research.google.com/github/adammoss/MLiS2/blob/master/workshops/workshop5/rnn_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In lectures we hand-coded the BPTT algorithm to train an RNN language model to predict the next word in a sentence.

Using the same training corpus, train a many-to-many LSTM model using TF2 to perform the same task, and compare your results against a vanilla RNN.

In this example we concatenate all the sentences into a single vector to make it easier to feed into the TF2 dataset API. We therefore only have a single stop word between sentences.

The downside of this approach is that sentences in different reviews will have different context, so ideally we would treat different reviews separately and pad inputs where necessary. 
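A minimal sketch of the padding alternative mentioned above, in plain Python so it runs without TensorFlow. The helper `pad_sequences_with_mask` and the padding id `PAD_ID` are hypothetical names introduced here for illustration. The vocabulary built later in this notebook does not reserve a padding index (index 0 is the most frequent word), so a real implementation would need to set one aside.

```python
# Hypothetical helper, not part of the workshop code: pad each review separately.
PAD_ID = 0  # assumed padding index; this notebook's vocabulary reserves no such id


def pad_sequences_with_mask(sequences, pad_id=PAD_ID):
    """Right-pad integer sequences to a common length and return a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored in the loss
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


toy_reviews = [[5, 8, 2], [7, 1], [3, 9, 4, 6]]  # toy word indices
toy_padded, toy_mask = pad_sequences_with_mask(toy_reviews)
```

The mask would let the loss skip padded positions, so that the model is not rewarded for predicting padding.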

TensorFlow has a tutorial for a similar RNN that works at the character level (rather than the word level), which may help: https://www.tensorflow.org/tutorials/text/text_generation

**NOTE: We do not attempt to implement regularisation here, so overfitting is likely an issue when training for a large number of epochs**

**NOTE 2: You can decrease training time significantly by switching to a GPU instance on Colab**

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
import os

import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
print(tf.__version__)

2.4.1


Download NLTK data

In [4]:
%%capture
nltk.download("book")

Upload imdb_sentences.txt file (or another file containing a list of sentences if you wish)

In [5]:
if not os.path.isfile('imdb_sentences.txt'):
  from google.colab import files
  uploaded = files.upload()

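Prepend a single sentence boundary token to each sentence (it marks both the end of the previous sentence and the start of the next), convert to lower case, and strip leading whitespace, the trailing full stop and the newline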

In [6]:
sentence_start_token = "SENTENCE_STOP"

In [7]:
with open('imdb_sentences.txt', 'r') as f:
  sentences = f.readlines()
sentences = ["%s %s" % (sentence_start_token, x.lstrip().rstrip('.\n').lower()) for x in sentences]

In [8]:
print("Parsed %d sentences." % (len(sentences)))
for i in range(0, 10):
  print("Example: %s" % sentences[i])

Parsed 12188 sentences.
Example: SENTENCE_STOP story of a man who has unnatural feelings for a pig
Example: SENTENCE_STOP starts out with a opening scene that is a terrific example of absurd comedy
Example: SENTENCE_STOP a formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers
Example: SENTENCE_STOP unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting
Example: SENTENCE_STOP even those from the era should be turned off
Example: SENTENCE_STOP the cryptic dialogue would make shakespeare seem easy to a third grader
Example: SENTENCE_STOP on a technical level it's better than you might think with some good cinematography by future great vilmos zsigmond
Example: SENTENCE_STOP future stars sally kirkland and frederic forrest can be seen briefly
Example: SENTENCE_STOP airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich business

Tokenize the sentences into words

In [9]:
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

In [10]:
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique word tokens." % len(word_freq))

Found 18153 unique word tokens.


In [11]:
vocab_size = 1000
unknown_token = 'UNKNOWN_TOKEN'

In [12]:
vocab = word_freq.most_common(vocab_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
index_to_word = np.array(index_to_word)
word_to_index = dict([(w,i) for i, w in enumerate(index_to_word)])

Replace all words not in our vocabulary with the unknown token and discard sentences with fewer than the minimum number of tokens (the count includes the SENTENCE_STOP token; no maximum length is applied)

In [13]:
min_sentence_length = 5

In [14]:
purged_sentences = []
for sent in tokenized_sentences:
  if len(sent) >= min_sentence_length:
    purged_sentences.append([w if w in word_to_index else unknown_token for w in sent])
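The vocabulary and unknown-token steps above can be traced on a toy corpus. This is a sketch using only the standard library (`collections.Counter` standing in for `nltk.FreqDist`); the `toy_` names and the tiny vocabulary size are assumptions chosen for illustration.

```python
from collections import Counter

toy_sentences = [
    ["SENTENCE_STOP", "the", "film", "was", "good"],
    ["SENTENCE_STOP", "the", "plot", "was", "thin"],
]
toy_vocab_size = 4  # the 3 most frequent words plus the unknown token
toy_unknown_token = "UNKNOWN_TOKEN"

# Count word frequencies over the whole corpus
toy_counts = Counter(w for sent in toy_sentences for w in sent)

# Keep the most frequent words and reserve the last index for unknown words
toy_index_to_word = [w for w, _ in toy_counts.most_common(toy_vocab_size - 1)]
toy_index_to_word.append(toy_unknown_token)
toy_word_to_index = {w: i for i, w in enumerate(toy_index_to_word)}

# Map every out-of-vocabulary word to the unknown index
toy_unknown_index = toy_word_to_index[toy_unknown_token]
toy_encoded = [[toy_word_to_index.get(w, toy_unknown_index) for w in sent]
               for sent in toy_sentences]
```

Here "film", "good", "plot" and "thin" each occur once, so they fall outside the vocabulary and all share the unknown index.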


Flatten sentences

In [15]:
text = [word for sent in purged_sentences for word in sent]
    

Convert to integer representations

In [16]:
text_as_int = np.array([word_to_index[w] for w in text])

Set the maximum length, in words, of a single input sequence

In [17]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

Create the dataset

In [18]:
dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [19]:
sequences = dataset.batch(seq_length+1, drop_remainder=True)

In [20]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
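To make the chunk-and-shift step concrete, here is a plain-Python sketch of what batching into chunks of `seq_length + 1` with the remainder dropped, followed by the input/target split, produces. The function `make_input_target_pairs` and the toy ids are hypothetical and only mirror the logic of the cells above.

```python
def make_input_target_pairs(token_ids, seq_length):
    """Cut a flat list of ids into chunks of seq_length + 1 (dropping the
    remainder) and return (input, target) pairs, where each target is the
    input shifted one position to the left."""
    chunk_len = seq_length + 1
    n_chunks = len(token_ids) // chunk_len
    pairs = []
    for k in range(n_chunks):
        chunk = token_ids[k * chunk_len:(k + 1) * chunk_len]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


toy_ids = list(range(11))  # stand-in for text_as_int
toy_pairs = make_input_target_pairs(toy_ids, seq_length=4)
```

With 11 ids and `seq_length=4` this gives two pairs, and the last id is dropped because it does not fill a complete chunk. Consecutive chunks do not overlap, so the first word of each chunk is never used as a target.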

In [21]:
for input_example, target_example in dataset.take(1):
  print('Input data: ', repr(' '.join(index_to_word[input_example.numpy()])))
  print('Target data:', repr(' '.join(index_to_word[target_example.numpy()])))

Input data:  "SENTENCE_STOP story of a man who has UNKNOWN_TOKEN UNKNOWN_TOKEN for a UNKNOWN_TOKEN SENTENCE_STOP starts out with a opening scene that is a UNKNOWN_TOKEN example of absurd comedy SENTENCE_STOP a UNKNOWN_TOKEN UNKNOWN_TOKEN audience is turned into an UNKNOWN_TOKEN , UNKNOWN_TOKEN UNKNOWN_TOKEN by the crazy UNKNOWN_TOKEN of it 's UNKNOWN_TOKEN SENTENCE_STOP unfortunately it UNKNOWN_TOKEN absurd the whole time with no general UNKNOWN_TOKEN eventually making it just too off UNKNOWN_TOKEN SENTENCE_STOP even those from the UNKNOWN_TOKEN should be turned off SENTENCE_STOP the UNKNOWN_TOKEN dialogue would make UNKNOWN_TOKEN seem UNKNOWN_TOKEN to a third UNKNOWN_TOKEN SENTENCE_STOP on a UNKNOWN_TOKEN level it 's better than you"
Target data: "story of a man who has UNKNOWN_TOKEN UNKNOWN_TOKEN for a UNKNOWN_TOKEN SENTENCE_STOP starts out with a opening scene that is a UNKNOWN_TOKEN example of absurd comedy SENTENCE_STOP a UNKNOWN_TOKEN UNKNOWN_TOKEN audience is turned into an UNKN

In [22]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
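The buffer-based shuffling described in the comments above can be illustrated with a simplified generator. This is a sketch of the general idea only, not TensorFlow's actual implementation; `buffered_shuffle` and its `seed` argument are hypothetical names.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer


toy_shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
```

An element can only be emitted once it has entered the buffer, so a small buffer gives a local shuffle. If the buffer is at least as large as the dataset, the result is a full shuffle.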

**Now code and train your RNN...**

In [23]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [24]:
# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [25]:
model = build_model(
  vocab_size = vocab_size,
  embedding_dim = embedding_dim,
  rnn_units = rnn_units,
  batch_size = BATCH_SIZE)

In [26]:
def loss(labels, logits):
  return keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
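For intuition about the loss above, the per-word value is the negative log-probability that the softmax of the logits assigns to the correct word index. The plain-Python function below is an illustrative sketch for a single time step (the name `token_crossentropy_from_logits` is hypothetical), not the Keras implementation.

```python
import math


def token_crossentropy_from_logits(label, logits):
    """Negative log-probability of the integer `label` under softmax(logits)."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# An untrained model that spreads probability evenly over 1000 words
toy_uniform_loss = token_crossentropy_from_logits(0, [0.0] * 1000)

# A model that is confident in the correct word
toy_confident_loss = token_crossentropy_from_logits(0, [10.0, 0.0, 0.0])
```

A uniform prediction over a 1000-word vocabulary gives a loss of ln(1000), roughly 6.9, which is a useful reference point when reading the training loss in the first epoch.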

In [27]:
model.compile(optimizer='adam', loss=loss)

In [28]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [29]:
EPOCHS = 200

In [30]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

In [31]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

In [32]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f729805f310>

In [33]:
model.build(tf.TensorShape([1, None]))

In [34]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            256000    
_________________________________________________________________
lstm_1 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_1 (Dense)              (1, None, 1000)           1025000   
Total params: 6,527,976
Trainable params: 6,527,976
Non-trainable params: 0
_________________________________________________________________
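The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias, and that the Embedding layer has no bias. The helper function names are hypothetical.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry, no bias
    return vocab * dim


def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim


toy_param_counts = {
    "embedding": embedding_params(1000, 256),
    "lstm": lstm_params(256, 1024),
    "dense": dense_params(1024, 1000),
}
toy_total_params = sum(toy_param_counts.values())
```

These evaluate to 256,000, 5,246,976 and 1,025,000 parameters, for a total of 6,527,976, matching the summary. The LSTM accounts for most of the model's parameters.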


In [35]:
unknown_index = word_to_index['UNKNOWN_TOKEN']

In [36]:
def generate_text(model, start_word):
  # Evaluation step (generating text using the learned model)

  # Number of sampling steps; UNKNOWN_TOKEN samples are skipped,
  # so fewer words than this may be returned
  num_generate = 30

  # Convert the start word to its integer index and add a batch dimension
  input_eval = [word_to_index[start_word]]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated words
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):

    predictions = model(input_eval)
    # Remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # Scale the logits by the temperature and sample the next word
    # from the resulting categorical distribution
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    if predicted_id != unknown_index:

      # Pass the predicted word as the next input to the model;
      # the previous hidden state is kept because the LSTM is stateful
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(index_to_word[predicted_id])

  sentence = ' '.join(text_generated)
  sentence = sentence.replace('SENTENCE_STOP', '. ')

  return sentence
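The effect of the `temperature` setting can be seen in isolation with a small plain-Python sketch: dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is below 1 and flattens it when above 1. The function names and toy logits are hypothetical, and `random.choices` stands in for `tf.random.categorical`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]
toy_cold = softmax_with_temperature(toy_logits, temperature=0.5)
toy_hot = softmax_with_temperature(toy_logits, temperature=2.0)
toy_sample = sample_id(toy_logits, temperature=1.0, rng=random.Random(0))
```

At temperature 0.5 the most likely word takes a larger share of the probability than at temperature 2.0, which is why low temperatures give more repetitive but more predictable sentences.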

In [37]:
for i in range(5):
  print(generate_text(model, start_word=u"SENTENCE_STOP"))

anyway major plot had little interest in the movie even sure because it was shown .  an opportunity wasted by no one to be a good
end your minute of any kind of lame have very weird for great , but i 'd be so not see that tom and
not worth fun and it just goes on their family .  they could n't get back along in the movie and what i 'm .  with ,
otherwise played by the way it certainly is worth is because of some sort of that time , effort and is
not worth am enough that they could n't even tell ends them , because it was a disappointment .  the human race , especially where
