<a href="https://colab.research.google.com/github/adammoss/MLiS2/blob/master/workshops/workshop5/rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In lectures we hand-coded the BPTT algorithm to train an RNN language model to predict the next word in a sentence.

Using the same training corpus (although you are encouraged to use more training examples and longer sentences), train a many-to-many LSTM model using TF2 to perform the same task, and compare your results against a vanilla RNN.


In [0]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

In [0]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
import os

import matplotlib.pyplot as plt
%matplotlib inline

In [19]:
print(tf.__version__)

2.1.0


Download NLTK data

In [0]:
%%capture
nltk.download("book")

Upload the imdb_sentences.txt file (or another file containing a list of sentences, if you wish)

In [21]:
if not os.path.isfile('imdb_sentences.txt'):
  from google.colab import files
  uploaded = files.upload()

Saving imdb_sentences.txt to imdb_sentences.txt


Add sentence start and end tags, convert to lower case, and strip leading whitespace, trailing full stops and newlines

In [0]:
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

In [0]:
with open('imdb_sentences.txt', 'r') as f:
  sentences = f.readlines()
sentences = ["%s %s %s" % (sentence_start_token, x.lstrip().rstrip('.\n').lower(), sentence_end_token) for x in sentences]

In [24]:
print("Parsed %d sentences." % (len(sentences)))
for i in range(0, 10):
  print("Example: %s" % sentences[i])

Parsed 12188 sentences.
Example: SENTENCE_START story of a man who has unnatural feelings for a pig SENTENCE_END
Example: SENTENCE_START starts out with a opening scene that is a terrific example of absurd comedy SENTENCE_END
Example: SENTENCE_START a formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers SENTENCE_END
Example: SENTENCE_START unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting SENTENCE_END
Example: SENTENCE_START even those from the era should be turned off SENTENCE_END
Example: SENTENCE_START the cryptic dialogue would make shakespeare seem easy to a third grader SENTENCE_END
Example: SENTENCE_START on a technical level it's better than you might think with some good cinematography by future great vilmos zsigmond SENTENCE_END
Example: SENTENCE_START future stars sally kirkland and frederic forrest can be seen briefly SENTENCE_END
Example: SENTENCE_START airport 
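
As a minimal sketch of the tagging step above, the snippet below applies the same string operations to two made-up lines. The sample lines and the helper name `tag_sentence` are assumptions for illustration and are not part of the workshop data. Note that `rstrip('.\n')` treats its argument as a set of characters, so it removes any trailing run of full stops and newlines, not just one of each.

```python
# Hypothetical sample lines standing in for imdb_sentences.txt.
raw_lines = ["  Story of a man.\n", "The plot is thin...\n"]

sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

def tag_sentence(line):
    # Drop leading whitespace, strip trailing '.' and newline characters,
    # lower-case the text, then wrap it in the start and end tags.
    body = line.lstrip().rstrip(".\n").lower()
    return "%s %s %s" % (sentence_start_token, body, sentence_end_token)

tagged = [tag_sentence(line) for line in raw_lines]
for t in tagged:
    print(t)
```

Full stops inside a line are left untouched, so a line holding several sentences is still treated as a single training sentence.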

Tokenize the sentences into words

In [0]:
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

In [26]:
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique word tokens." % len(word_freq.items()))

Found 18154 unique word tokens.


In [0]:
vocab_size = 1000
unknown_token = 'UNKNOWN_TOKEN'

In [0]:
vocab = word_freq.most_common(vocab_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i, w in enumerate(index_to_word)])
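
The vocabulary construction can be illustrated on a toy corpus. This sketch uses `collections.Counter` from the standard library in place of `nltk.FreqDist` so that it runs without NLTK. The corpus and the name `toy_vocab_size` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook uses the tokenized IMDB sentences.
toy_sentences = [
    ["the", "film", "is", "good"],
    ["the", "film", "is", "bad"],
    ["the", "plot", "is", "thin"],
    ["the", "end"],
]
toy_vocab_size = 4
unknown_token = "UNKNOWN_TOKEN"

counts = Counter(w for sent in toy_sentences for w in sent)

# Keep the (toy_vocab_size - 1) most frequent words and reserve the
# final index for the unknown token, as in the cell above.
index_to_word = [w for w, _ in counts.most_common(toy_vocab_size - 1)]
index_to_word.append(unknown_token)
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Words outside the vocabulary fall back to the unknown-token index.
unk = word_to_index[unknown_token]
encoded = [word_to_index.get(w, unk) for w in ["the", "plot", "is", "good"]]
print(index_to_word)
print(encoded)
```

With `vocab_size = 1000` in the notebook, the same scheme places the unknown token at index 999, which matches the indices printed in the dataset example.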

Replace all words not in our vocabulary with the unknown token, discard sentences shorter than the minimum length, and truncate the remaining sentences to the maximum length

In [0]:
min_sentence_length = 5
truncate_sentence_length = 10

In [0]:
purged_sentences = []
for sent in tokenized_sentences:
  # Skip sentences that are too short; truncate the rest and map
  # out-of-vocabulary words to the unknown token.
  if len(sent) >= min_sentence_length:
    purged_sentences.append([w if w in word_to_index else unknown_token for w in sent[0:truncate_sentence_length]])

In [31]:
print("Kept %d sentences." % (len(purged_sentences)))
for i in range(0, 10):
  print("Example: %s" % purged_sentences[i])

Kept 11872 sentences.
Example: ['SENTENCE_START', 'story', 'of', 'a', 'man', 'who', 'has', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', 'for']
Example: ['SENTENCE_START', 'starts', 'out', 'with', 'a', 'opening', 'scene', 'that', 'is', 'a']
Example: ['SENTENCE_START', 'a', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', 'audience', 'is', 'turned', 'into', 'an', 'UNKNOWN_TOKEN']
Example: ['SENTENCE_START', 'unfortunately', 'it', 'UNKNOWN_TOKEN', 'absurd', 'the', 'whole', 'time', 'with', 'no']
Example: ['SENTENCE_START', 'even', 'those', 'from', 'the', 'UNKNOWN_TOKEN', 'should', 'be', 'turned', 'off']
Example: ['SENTENCE_START', 'the', 'UNKNOWN_TOKEN', 'dialogue', 'would', 'make', 'UNKNOWN_TOKEN', 'seem', 'UNKNOWN_TOKEN', 'to']
Example: ['SENTENCE_START', 'on', 'a', 'UNKNOWN_TOKEN', 'level', 'it', "'s", 'better', 'than', 'you']
Example: ['SENTENCE_START', 'future', 'stars', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', 'and', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', 'can', 'be']
Example: ['SENTENCE_START', 'airport', 'UNKNOWN_TOK
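
The filtering and truncation logic can be sketched on a few hand-written token lists. The thresholds, the `known` vocabulary and the sentences below are all hypothetical values chosen to keep the example small.

```python
# Hypothetical toy settings; the notebook uses 5 and 10.
min_len, max_len = 3, 5
known = {"SENTENCE_START", "SENTENCE_END", "a", "good", "film"}
unknown_token = "UNKNOWN_TOKEN"

toy = [
    ["SENTENCE_START", "SENTENCE_END"],                       # too short
    ["SENTENCE_START", "a", "good", "film", "SENTENCE_END"],  # fits exactly
    ["SENTENCE_START", "a", "really", "good", "film", "indeed", "SENTENCE_END"],
]

# Keep sentences with at least min_len tokens, cut each to max_len tokens,
# and replace out-of-vocabulary words with the unknown token.
kept = [
    [w if w in known else unknown_token for w in sent[:max_len]]
    for sent in toy
    if len(sent) >= min_len
]

for sent in kept:
    print(sent)
```

In this sketch the longest sentence loses its `SENTENCE_END` tag when it is cut to `max_len` tokens. The same effect is visible in the notebook output above, where the truncated examples do not end with the end tag.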

Create the dataset

In [0]:
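# Sentences have different lengths, so store each row as a Python list in an
# object array (recent NumPy versions raise an error on ragged input otherwise).
X = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in purged_sentences], dtype=object)
Y = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in purged_sentences], dtype=object)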

In [36]:
print("Example: ", X[2], Y[2])

Example:  [1, 4, 999, 999, 278, 8, 465, 99, 43] [4, 999, 999, 278, 8, 465, 99, 43, 999]
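
The relationship between `X` and `Y` is a one-step shift: at each position the target is the word that follows the input word. A minimal sketch with a made-up sentence and a hypothetical toy `word_to_index` mapping:

```python
# Hypothetical sentence and vocabulary, for illustration only.
sent = ["SENTENCE_START", "the", "film", "is", "good", "SENTENCE_END"]
word_to_index = {w: i for i, w in enumerate(sent)}

ids = [word_to_index[w] for w in sent]
x = ids[:-1]  # every token except the last is an input
y = ids[1:]   # every token except the first is a target

# Each (input, target) pair is one next-word prediction.
for xi, yi in zip(x, y):
    print("%s -> %s" % (sent[xi], sent[yi]))
```

This is what makes the task many-to-many: the model emits one prediction per input position, so `x` and `y` always have the same length.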
