## NEXT WORD PREDICTOR

Let us train an RNN that predicts the next word given a sequence of words. We use a collection of Shakespeare's writing as our dataset.

In [1]:
import tensorflow as tf
import keras
import re
import numpy as np

Using TensorFlow backend.


In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 
                                       'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
# Read the file inside a context manager so the handle is closed afterwards
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')

In [21]:
len(text)

1115394

The full text is over a million characters long, which is more than we need here, so we'll work with only the first 100,000 characters.

In [22]:
text = text[:100000]

### Data Cleaning 

Our preprocessing uses a `Tokenizer` to convert the text from a sequence of words (strings) into a sequence of integers, after converting to lower case and removing most punctuation. The apostrophe is not in the filter list, so contractions such as `know't` are kept as single words.

In [23]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True)
tokenizer.fit_on_texts([text])
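
To make the tokenization step concrete, here is a rough pure-Python sketch of the kind of mapping the `Tokenizer` above produces: lower-case the text, replace the filtered characters with spaces, split on whitespace, and give more frequent words smaller integer indices, starting at 1. The helper names `split_words`, `simple_word_index` and `to_sequence` are made up for this illustration, the sample string is a short stand-in for the real text, and the code only approximates the behaviour; it is not the Keras implementation.

```python
from collections import Counter

# Same filter characters as the Tokenizer call above (the apostrophe is not filtered).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lower-case, blank out filtered characters, and split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def simple_word_index(text):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(split_words(text))
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index):
    """Replace every known word by its integer index."""
    return [word_index[w] for w in split_words(text) if w in word_index]


# Short stand-in sample, not the real dataset.
sample = "First Citizen: Before we proceed any further, hear me speak.\nAll: Speak, speak."
sample_word_index = simple_word_index(sample)
sample_sequence = to_sequence(sample, sample_word_index)
print(sample_word_index['speak'], sample_sequence)
```

Index 0 is never assigned to a word in this scheme, which is in line with `num_words = len(word_idx) + 1` being used later on.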

In [39]:
sequence = tokenizer.texts_to_sequences([text])[0]
# Decode the integers back into words and show the first 1000 characters
' '.join(tokenizer.index_word[i] for i in sequence)[:1000]

"first citizen before we proceed any further hear me speak all speak speak first citizen you are all resolved rather to die than to famish all resolved resolved first citizen first you know caius marcius is chief enemy to the people all we know't we know't first citizen let us kill him and we'll have corn at our own price is't a verdict all no more talking on't let it be done away away second citizen one word good citizens first citizen we are accounted poor citizens the patricians good what authority surfeits on would relieve us if they would yield us but the superfluity while it were wholesome we might guess they relieved us humanely but they think we are too dear the leanness that afflicts us the object of our misery is as an inventory to particularise their abundance our sufferance is a gain to them let us revenge this with our pikes ere we become rakes for the gods know i speak this in hunger for bread not in thirst for revenge second citizen would you proceed especially against c

### Features and Labels

To train the model, we feed a fixed number of consecutive words into the network as features and use the word that follows them as the label. For example, if we set `training_length = 50`, the model takes 50 words as features and the 51st word as the label.

We can make multiple training examples from one text by slicing at different points. We use the first 50 words as features with the 51st as the label, then the 2nd through 51st words as features with the 52nd as the label, then the 3rd through 52nd with the 53rd as the label, and so on. This gives us far more examples to train on, and more training data generally leads to a better model.

Complete the function to achieve this.

In [25]:
TRAINING_LENGTH = 50

def make_sequences(text, training_length=50):
    """Turn a set of texts into sequences of integers"""

    # Create the tokenizer object and train on texts
    tokenizer = ...  # TODO: create a Tokenizer and fit it on [text]

    # Creating look-up dictionaries and reverse look-ups
    word_idx = tokenizer.word_index
    idx_word = tokenizer.index_word
    num_words = len(word_idx) + 1
    word_counts = tokenizer.word_counts

    print(f'There are {num_words} unique words.')

    # Convert text to sequence of integers
    sequence = ...  # TODO: convert the text with the fitted tokenizer

    training_seq = []
    labels = []

    # Create multiple training examples from each sequence
    

    print(f'There are {len(training_seq)} training sequences.')

    # Return everything needed for setting up the model
    return word_idx, idx_word, num_words, word_counts, training_seq, labels

In [26]:
word_idx, idx_word, num_words, word_counts, features, labels = make_sequences(text, TRAINING_LENGTH)

There are 3105 unique words.
There are 17975 training sequences.
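
One way to build the overlapping examples described above is a sliding window over the integer sequence. The sketch below assumes that approach; `sliding_windows` is a name made up here, and a short list of integers stands in for the tokenized text. With this scheme a sequence of `n` integers yields `n - training_length` examples, so the 17975 sequences reported above would correspond to 18025 tokens if the function is completed this way.

```python
def sliding_windows(sequence, training_length):
    """Pair every run of `training_length` integers with the integer that follows it."""
    features, labels = [], []
    for i in range(training_length, len(sequence)):
        features.append(sequence[i - training_length:i])  # the preceding words
        labels.append(sequence[i])                        # the next word
    return features, labels


# Stand-in for a tokenized text: ten word indices.
toy_sequence = list(range(10, 20))
toy_features, toy_labels = sliding_windows(toy_sequence, training_length=3)

print(len(toy_features))               # number of examples: 10 - 3
print(toy_features[0], toy_labels[0])  # first window and its label
```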


The text is now represented as a sequence of integers. Let's look at one training example and its label. The label is the word that follows the 50 feature words. We'll look at the sequence at index 3, which is the fourth one generated because indexing starts at 0.

In [27]:
# First 10 word indices of the sequence at index 3
features[3][:10]

[25, 358, 184, 163, 94, 30, 81, 35, 81, 81]

Complete this function, which takes a sequence index and prints the features and the corresponding label as words.

In [28]:
def find_answer(index):
    """Find label corresponding to features for index in training data"""
    

In [29]:
find_answer(3)

Features: we proceed any further hear me speak all speak speak first citizen you are all resolved rather to die than to famish all resolved resolved first citizen first you know caius marcius is chief enemy to the people all we know't we know't first citizen let us kill him and

Label:  we'll
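
Turning a training example back into words only needs the reverse look-up dictionary. The sketch below shows the idea on toy data; `decode_example` and the `toy_*` values are invented for illustration, and the function returns the strings instead of printing them so that they are easy to check.

```python
def decode_example(features, labels, idx_word, index):
    """Return the feature words and the label word for one training example."""
    feature_words = ' '.join(idx_word[i] for i in features[index])
    label_word = idx_word[labels[index]]
    return feature_words, label_word


# Toy look-up table and examples, not taken from the real dataset.
toy_idx_word = {1: 'speak', 2: 'first', 3: 'citizen', 4: 'all'}
toy_examples = [[2, 3, 4], [3, 4, 1]]
toy_targets = [1, 1]

feature_words, label_word = decode_example(toy_examples, toy_targets, toy_idx_word, 1)
print('Features:', feature_words)
print('Label: ', label_word)
```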


### Training Data

Next we take the features and labels and convert them into training and validation data. The function below shuffles the examples first, because they were generated in sequential order, and then splits them according to the specified `train_fraction`. All inputs are converted into NumPy arrays, the format a Keras model expects. One important step is to one-hot encode the labels, because the network is trained with `categorical_crossentropy` and outputs a probability for each word in the vocabulary.

Complete the function below to achieve the same.

In [30]:
from sklearn.utils import shuffle
TRAIN_FRACTION = 0.8

def create_train_valid(features,
                       labels,
                       num_words,
                       train_fraction=TRAIN_FRACTION):
    """Create training and validation features and labels after shuffling generated set."""

    

In [31]:
X_train, X_valid, y_train, y_valid = create_train_valid(
    features, labels, num_words)
print(X_train.shape)
print(y_train.shape)

(14380, 50)
(14380, 3105)
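
The shuffle, one-hot and split steps can be sketched with NumPy alone. This is an illustration under assumptions, not the notebook's solution: it uses `numpy.random.default_rng(...).permutation` in place of `sklearn.utils.shuffle`, and `split_and_one_hot`, `seed` and the `toy_*` data are names introduced here. Cutting at `int(train_fraction * n)` gives 14380 training rows for 17975 examples, consistent with the shapes printed above.

```python
import numpy as np


def split_and_one_hot(features, labels, num_words, train_fraction=0.8, seed=0):
    """Shuffle the examples, one-hot encode the labels, and split train/validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))  # one permutation keeps X and y aligned

    X = np.asarray(features)[order]
    label_idx = np.asarray(labels)[order]

    # One row per example, one column per word; a single 1 marks the label.
    y = np.zeros((len(label_idx), num_words), dtype=np.int8)
    y[np.arange(len(label_idx)), label_idx] = 1

    train_end = int(train_fraction * len(X))
    return X[:train_end], X[train_end:], y[:train_end], y[train_end:]


# Toy data: ten examples of length 2 over a four-word vocabulary.
toy_X = [[i, i + 1] for i in range(10)]
toy_y = [i % 4 for i in range(10)]
X_tr, X_va, y_tr, y_va = split_and_one_hot(toy_X, toy_y, num_words=4)
print(X_tr.shape, y_tr.shape, X_va.shape, y_va.shape)
```

The one-hot label matrix has one column per vocabulary word, so for long texts with large vocabularies it can dominate memory use; a small integer dtype such as `int8` keeps that in check.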


### Build Model

With the data encoded as integers, we're ready to build the recurrent neural network. The model is relatively simple and uses an LSTM layer as the heart of the network. After converting the words into embeddings, we pass them through a single LSTM layer, then into a fully connected layer with `relu` activation followed by dropout, before the final output layer with a `softmax` activation. The final layer produces a probability for every word in the vocabulary.

When training, these predictions are compared to the actual label using the `categorical_crossentropy` to calculate a loss. The parameters (weights) in the network are then updated using the Adam optimizer (a variant on Stochastic Gradient Descent) with gradients calculated through backpropagation.

We'll use an embedding dimension of 100. Complete the function below, which builds the model.

In [32]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam

In [33]:
EMBEDDING_SHAPE = 100
LSTM_CELLS = 64

def make_word_level_model(num_words,
                          embedding_shape,
                          lstm_cells=64):
    """Make a word level recurrent neural network."""

    model = Sequential()

    # Map words to an embedding
    
    # Add LSTM layer
    
    # Output layer
    
    # Compile the model
    
    
    return model


model = make_word_level_model(
    num_words,
    embedding_shape=EMBEDDING_SHAPE,
    lstm_cells=LSTM_CELLS)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 100)         310500    
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 3105)              400545    
Total params: 761,605
Trainable params: 761,605
Non-trainable params: 0
_________________________________________________________________
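
The parameter counts in the summary can be reproduced by hand from the layer sizes, which is a useful check that the architecture is what we expect. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are made up here, and the sizes (3105 words, 100-dimensional embeddings, 64 LSTM units, 128 dense units) are read off the summary above.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per word."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units


vocab_size, embedding_dim, lstm_units, dense_units = 3105, 100, 64, 128

param_counts = {
    'embedding': embedding_params(vocab_size, embedding_dim),
    'lstm': lstm_params(embedding_dim, lstm_units),
    'dense': dense_params(lstm_units, dense_units),
    'dropout': 0,  # dropout has no trainable weights
    'output': dense_params(dense_units, vocab_size),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

More than half of the weights sit in the output layer, because it needs one column per vocabulary word.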


The model needs a loss to minimize (`categorical_crossentropy`) as well as a method for updating the weights using the gradients (`Adam`). We will also monitor accuracy, which is unsuitable as a loss because it is not differentiable, but which gives a more interpretable measure of model performance.
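
A small worked example may help to show how the two quantities differ. With a one-hot label, categorical cross-entropy reduces to the negative log of the probability assigned to the true word, while accuracy only asks whether that word has the highest probability. The function names and the two toy probability vectors below are invented for illustration.

```python
import math


def cross_entropy(probs, true_index):
    """Categorical cross-entropy for a single one-hot label."""
    return -math.log(probs[true_index])


def is_correct(probs, true_index):
    """Accuracy for a single example: is the true word the most probable one?"""
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return predicted == true_index


true_index = 2
confident = [0.1, 0.2, 0.7]  # toy prediction that strongly favours the true word
hesitant = [0.3, 0.3, 0.4]   # toy prediction that barely favours it

loss_confident = cross_entropy(confident, true_index)
loss_hesitant = cross_entropy(hesitant, true_index)
print(is_correct(confident, true_index), is_correct(hesitant, true_index))
print(round(loss_confident, 4), round(loss_hesitant, 4))
```

Both predictions count as correct, yet the loss still separates them, which gives the optimizer a signal to follow even when accuracy does not change.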

### Train Model

We can now train the model on our training examples. 

In [34]:
history = model.fit(
    X_train,
    y_train,
    epochs=5,
    validation_data=(X_valid, y_valid))

Train on 14380 samples, validate on 3595 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Both training and validation accuracy increase over the epochs while the loss decreases, which indicates that the model is improving with training.

### Generating Output

Now for the fun part: we get to use our model to generate new Shakespearean verse. To do this, we feed the network a seed sequence, have it make a prediction, add the predicted word to the sequence, and make another prediction for the next word. We continue this for as many words as we want. We then compare the generated output to the actual continuation of the text to see if we can tell the difference!

Complete the function below to do this.

In [43]:
import random


def generate_output(model,
                    sequence,
                    new_words=50,
                    training_length=TRAINING_LENGTH):
    """Generate `new_words` words of output from a trained model."""



In [51]:
generate_output(model, sequence, 10)

Original Sequence :
 how lies their battle know you on which side they have placed their men of trust cominius as i guess marcius their bands i' the vaward are the antiates of their best trust o'er them aufidius their very heart of hope marcius i do beseech you by all the battles
Generated Sequence :
 < --- > the people and the people and the people and the
Actual Sequence :
 < --- > wherein we have fought by the blood we have shed
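
The generation loop described above can be sketched without a trained network by passing in any function that returns one probability per word. Here `predict_fn` is a hypothetical stand-in for a call to the trained model on a single window, `toy_predict` is a made-up rule used only to make the example runnable, and the next word is chosen greedily (highest probability); the function in this notebook may select words differently.

```python
def generate_greedy(predict_fn, seed, new_words):
    """Repeatedly predict the next word index and slide the window forward."""
    window = list(seed)
    generated = []
    for _ in range(new_words):
        probs = predict_fn(window)
        next_idx = max(range(len(probs)), key=lambda i: probs[i])  # greedy choice
        generated.append(next_idx)
        # Drop the oldest word and append the prediction so the length stays fixed.
        window = window[1:] + [next_idx]
    return generated


def toy_predict(window, vocab_size=5):
    """Made-up 'model': favours the index that follows the last word."""
    probs = [0.1] * vocab_size
    probs[(window[-1] + 1) % vocab_size] = 0.6
    return probs


toy_generated = generate_greedy(toy_predict, seed=[0, 1, 2], new_words=4)
print(toy_generated)
```

Greedy selection is deterministic, so once the window returns to a state it has seen before, the output repeats. That is one possible reason for loops like "the people and the people and" in the generated sequence above; sampling from the predicted probabilities is a common way to add variety.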
