<a href="https://colab.research.google.com/github/dimtr/PyDataEHV_workshop/blob/master/Shakespeare_PyData.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Text Generation with Recurrent Neural Networks
In this workshop, we see how recurrent neural networks (RNNs) can be used as generative models. They learn the sequential structure of a dataset and can then generate entirely new, plausible sequences from the same problem domain.

We will build a simple text generation model in Python with [Keras](https://keras.io/) that generates text word by word. We will work with a dataset of Shakespeare's writing (from..).

Given a sequence of words, the model trained on our dataset predicts the most probable next word. We call the model repeatedly to generate longer sequences.



##Setup

####Mounting Google Drive

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


####Import Keras and other libraries

In [0]:
# Import everything from tensorflow.keras; mixing `keras` and `tensorflow.keras` objects in one model can fail
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, SimpleRNN, LSTM, Embedding
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.utils import shuffle

import numpy as np


####Reading the data

We load the Shakespeare text file and take a look at a part of the data

In [0]:
with open('/content/drive/My Drive/datasets/shakespeare.txt', encoding='utf-8') as f:
   story = f.readlines()
   
print(''.join(story[:20]))

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.



##Process the Text
Before training the model, we need to process the text in a form that is interpretable by the model.  

#### Story in words

First, we convert the story data into chunks of words, or tokens. Since we want our model to treat punctuation marks as words too, we use [replace()](https://docs.python.org/2/library/string.html#string.replace) to add white space around them and then use [split()](https://docs.python.org/2/library/stdtypes.html#str.split) to split the story into word chunks.
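
To make the effect of this step concrete, here is a minimal sketch on a single line of the play, using only the period replacement. The variable names are illustrative and not part of the notebook. Note that `split(' ')` keeps empty strings and the trailing newline as separate tokens, whereas `split()` without arguments would discard them.

```python
# Toy illustration: pad punctuation with spaces, then split on single spaces.
line = "Resolved. resolved.\n"

padded = line.lower().replace('.', ' . ')
tokens = padded.split(' ')        # what the notebook does
compact = padded.split()          # alternative that drops '' and '\n'

print(tokens)   # ['resolved', '.', '', 'resolved', '.', '\n']
print(compact)  # ['resolved', '.', 'resolved', '.']
```

Because the notebook uses `split(' ')`, the empty string and the newline character are counted as tokens in the word statistics printed by the next cell.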

In [0]:
story_in_words = []
for i, line in enumerate(story):
  story[i] = line.lower().replace('.', ' . ').\
                          replace(',', ' , ').\
                          replace('?', ' ? ').\
                          replace('"', ' " ').\
                          replace('!', ' ! ').\
                          replace(':', ' : ').\
                          replace(';', ' ; ').\
                          replace('--', ' ').\
                          replace('-', ' ').\
                          replace(',', ' , ')
  story_in_words.extend(story[i].split(' '))

print("Total number of words in story: %d" % len(story_in_words))
print("Unique words in story: %d" %len(set(story_in_words)))

Total number of words in story: 343653
Unique words in story: 14833


####Creating sequences

The next step is to split the entire text into sequences of a fixed length. We specify this length with the *hyperparameter* SEQ_LEN: the number of words that the generative model takes as input to predict the next word.

We slide over the story one word at a time and take SEQ_LEN + 1 words at each step: SEQ_LEN input words plus the word that follows them.



NOTE: You can try different values of SEQ_LEN to investigate how it affects the performance of your model. Trying out different hyperparameter values in order to find an optimal set for a learning algorithm is called *hyperparameter tuning*.

In [0]:
SEQ_LEN = 10
step = 1
sentences = []
for i in range(0, len(story_in_words) - SEQ_LEN, step):
  sentences.append(story_in_words[i : i + SEQ_LEN + 1])
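
As a small illustration of the sliding window, the sketch below applies the same loop to a six-word list with a toy sequence length of 3 (the names `words`, `seq_len` and `windows` are made up for this example).

```python
# Toy sliding window: each window holds seq_len input words plus one target word.
words = ["before", "we", "proceed", "any", "further", ","]
seq_len = 3  # toy value; the notebook uses SEQ_LEN = 10

windows = [words[i : i + seq_len + 1] for i in range(0, len(words) - seq_len)]

for w in windows:
    print(w[:-1], "->", w[-1])
# ['before', 'we', 'proceed'] -> any
# ['we', 'proceed', 'any'] -> further
# ['proceed', 'any', 'further'] -> ,
```

A list of N words yields N - seq_len windows, so consecutive windows overlap in all but one word.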


####Tokenizer

Now we need to map the strings to a numeric representation, with each unique word represented by a unique integer. We use the Keras [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class for this task. Its `fit_on_texts` method fits the tokenizer on our sequences of text and builds an internal vocabulary.

Note: 0 is a reserved index in this class that won't be assigned to any word.
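
The idea behind the vocabulary can be sketched in plain Python. This is a simplified imitation, not the Keras implementation: `build_word_index` is a hypothetical helper that numbers words from 1 in order of decreasing frequency, leaving 0 unused.

```python
from collections import Counter

def build_word_index(token_lists):
    """Hypothetical helper: map each word to an integer >= 1, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy = [["to", "die", "than", "to", "famish"], ["to", "die"]]
word_index = build_word_index(toy)
vocab_len = len(word_index) + 1   # +1 because index 0 is reserved

print(word_index)  # {'to': 1, 'die': 2, 'than': 3, 'famish': 4}
print(vocab_len)   # 5
```

The `+ 1` mirrors how VOCAB_LEN is computed in the notebook: the embedding and output layers need one slot per index, including the reserved 0.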

In [0]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)  # transforms each text sequence into a sequence of integer, where each integer represents a unique word
sequences = np.asarray(sequences)                    # converting a list to a numpy array
vocab = tokenizer.word_counts                        # Dict object of vocabulary with frequency count

VOCAB_LEN = len(tokenizer.word_counts) + 1
total = sequences.shape[0]


We take an example sentence and see how the Tokenizer class transforms it into a sequence of integers.

In [0]:
temp = 'You are all resolved rather to die than to famish ?'

tokens = tokenizer.texts_to_sequences([temp.split()])
print("Numeric representation of the sentence: 'You are all resolved rather to die than to famish ?' is --> ", tokens)

Numeric representation of the sentence: 'You are all resolved rather to die than to famish ?' is -->  [[13, 48, 41, 1535, 380, 8, 198, 71, 8, 3461, 16]]


####Training and Test data split

The data is shuffled and split into training and test sets, with 15% of the sequences in the test set.

The first *SEQ_LEN* words of each sequence form the input, and the last word is the target: the next word that the model should predict.

In [0]:
sequences = shuffle(sequences)

train_input = sequences[int(total * 0.15):, :-1]
train_output = sequences[int(total * 0.15):, -1]

test_input = sequences[:int(total * 0.15), :-1]
test_output = sequences[:int(total * 0.15), -1]


print("Input Training Data Shape:", train_input.shape)
print("Target Training Data Shape:", train_output.shape)

print("Input Test Data Shape:", test_input.shape)
print("Target Test Data Shape:", test_output.shape)

Input Training Data Shape: (292097, 10)
Target Training Data Shape: (292097,)
Input Test Data Shape: (51546, 10)
Target Test Data Shape: (51546,)
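
The same split can be traced on a toy dataset. The sketch below uses plain lists instead of NumPy arrays and made-up names; it only illustrates the slicing pattern, assuming windows of length 4 (three inputs plus one target).

```python
import random

# 20 toy windows of consecutive integers, e.g. [5, 6, 7, 8]
toy_sequences = [[i, i + 1, i + 2, i + 3] for i in range(20)]
random.Random(0).shuffle(toy_sequences)       # fixed seed so the split is repeatable

n_test = int(len(toy_sequences) * 0.15)       # 15% of 20 -> 3 windows
test, train = toy_sequences[:n_test], toy_sequences[n_test:]

train_x = [s[:-1] for s in train]             # all but the last item: model input
train_y = [s[-1] for s in train]              # the last item: target word

print(len(train_x), len(test))                # 17 3
```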


####Build the RNN Model

We use the [Keras Sequential Model](https://keras.io/getting-started/sequential-model-guide/) to define the model. To build our simple RNN text generation model, we use four kinds of Keras layers:


*   [keras.layers.Embedding](https://keras.io/layers/embeddings/): used to train a dense representation of words and their relative meanings.
*   [keras.layers.SimpleRNN](https://keras.io/layers/recurrent/#simplernn): fully connected RNN layer where the output is fed back to the input.
*   [keras.layers.Dropout](https://keras.io/layers/core/#dropout): applies a regularization technique where randomly selected neurons are ignored or "dropped out" during training.
*   [keras.layers.Dense](https://keras.io/layers/core/#dense): regular densely connected neural network layer with output size equal to the vocabulary size (number of unique words). This layer is added at the end and uses the softmax activation to output the probabilities (which add up to one) for each word.



In [0]:
model = Sequential()
model.add(Embedding(VOCAB_LEN, 256, input_length=SEQ_LEN))
model.add(SimpleRNN(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(SimpleRNN(64))
model.add(Dropout(0.2))
model.add(Dense(VOCAB_LEN, activation='softmax'))

We compile the model to configure it for training and use categorical crossentropy as our loss function. 

In [0]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()



Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 10, 256)           3797504   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 10, 128)           49280     
_________________________________________________________________
dropout_3 (Dropout)          (None, 10, 128)           0         
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 64)                12352     
_________________________________________________________________
dropout_4 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 14834)             964210    
Total params: 4,823,346
Trainable params: 4,823,346
Non-trainable params: 0
__________________________________________
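
The parameter counts in the summary can be reproduced by hand. The sketch below is plain arithmetic, assuming the standard layouts: an embedding table of vocabulary size by embedding size, a SimpleRNN with input weights, recurrent weights and a bias per unit, and a dense layer with one weight per input plus a bias per output.

```python
vocab_len, embed_dim = 14834, 256      # VOCAB_LEN and the embedding size used above
rnn1_units, rnn2_units = 128, 64

def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + one bias, for every unit
    return units * (input_dim + units + 1)

embedding = vocab_len * embed_dim
rnn1 = simple_rnn_params(embed_dim, rnn1_units)
rnn2 = simple_rnn_params(rnn1_units, rnn2_units)
dense = rnn2_units * vocab_len + vocab_len

total = embedding + rnn1 + rnn2 + dense
print(embedding, rnn1, rnn2, dense, total)
# 3797504 49280 12352 964210 4823346
```

Most of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size.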

###Helper Functions

We create a generator function that generates data batch-by-batch. The generator is run in parallel to the model, for efficiency. 

In [0]:
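def generator(sent, word, batch_size):
  # Yields (inputs, one-hot targets) batches forever, wrapping around the data.
  # `index` is local so the training and validation generators do not share state.
  index = 0
  while True:
    x = np.zeros((batch_size, SEQ_LEN), dtype=int)     # np.int was removed in NumPy 1.24
    y = np.zeros((batch_size, VOCAB_LEN), dtype=bool)  # np.bool was removed in NumPy 1.24

    for i in range(batch_size):
      x[i] = sent[index % len(sent)]
      y[i] = to_categorical(word[index % len(word)], num_classes=VOCAB_LEN) #convert integers to one-hot encoded vectors
      index = index + 1
    yield x,y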

This function samples an index from a softmax probability array based on the temperature. This technique is called temperature sampling and is used to improve the quality of samples from language models.

Note: The high temperature sample displays greater linguistic variety, but the low temperature sample is more grammatically correct. Lowering the temperature allows you to focus on higher probability output sequences and smooth over deficiencies of the model.

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.flip(np.argsort(probas))[0]
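
To see what the temperature does to a distribution before any sampling happens, the sketch below applies the same rescaling (log, divide by temperature, exponentiate, renormalize) to a three-word toy distribution. `apply_temperature` is a name made up for this illustration and uses only the standard library.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability list: p_i ** (1 / T), renormalized to sum to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]
cold = apply_temperature(probs, 0.5)   # sharper: mass moves to the top word
same = apply_temperature(probs, 1.0)   # unchanged
hot = apply_temperature(probs, 2.0)    # flatter: closer to uniform

print([round(p, 3) for p in cold])  # [0.658, 0.237, 0.105]
print([round(p, 3) for p in hot])   # [0.415, 0.322, 0.263]
```

A temperature below 1 makes the most likely word even more likely, and a temperature above 1 gives rarer words a better chance.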

We make a simple display function that prints the stories.

In [0]:
def display(story):
  full = ''
  for word in story:
    if word == '|newline|':
      full = full + '\n'
    elif word in ',.;:?!':
      full = full + word
    else:
      full = full + ' ' + word
  
  print(full)


####Configure Checkpoints

We use two callbacks for our model:


*   [ModelCheckpoint](https://keras.io/callbacks/#modelcheckpoint): saves checkpoints during training by monitoring a quantity (validation accuracy in this case).
*   [EarlyStopping](https://keras.io/callbacks/#EarlyStopping): stops training when the monitored quantity (validation accuracy) has not improved for a certain number of epochs. This threshold is set by the *patience* argument.
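
The patience rule can be sketched in a few lines of plain Python. This is a simplified imitation of the idea, not the Keras implementation: `epochs_run` is a hypothetical helper that reports after how many epochs training would stop for a given list of validation accuracies.

```python
def epochs_run(val_acc_history, patience):
    """Hypothetical helper: number of epochs completed before early stopping."""
    best = float("-inf")
    wait = 0                              # epochs since the last improvement
    for epoch, acc in enumerate(val_acc_history, start=1):
        if acc > best:
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch              # stop after this epoch
    return len(val_acc_history)           # never triggered

history = [0.30, 0.33, 0.35, 0.34, 0.35, 0.34, 0.36]
print(epochs_run(history, patience=2))    # 5
print(epochs_run(history, patience=3))    # 6
print(epochs_run(history, patience=20))   # 7
```

With a small patience the run stops at epoch 5 and never sees the later improvement to 0.36, which is why the patience value is worth choosing with the total number of epochs in mind.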



In [0]:
BATCH_SIZE =256
file_path = "/content/drive/My Drive/checkpoints/Shakespeare-RNN-epoch{epoch:03d}-words%d-sequence%d-batchsize%d-" \
            "loss{loss:.4f}-acc{acc:.4f}-val_loss{val_loss:.4f}-val_acc{val_acc:.4f}.hdf5" % \
            (VOCAB_LEN, SEQ_LEN, BATCH_SIZE)

checkpoint = ModelCheckpoint(file_path, monitor='val_acc', save_best_only=True) # write a checkpoint only when val_acc improves on the best value seen so far
early_stopping = EarlyStopping(monitor='val_acc', patience=20)
callbacks_list = [checkpoint, early_stopping]

####Train the Model

The model is trained for 30 epochs, but training could stop early due to the callback. 

In [0]:
model.fit_generator(generator(train_input, train_output, BATCH_SIZE),
                    steps_per_epoch=int(len(train_input)/BATCH_SIZE) + 1,
                    epochs=30,
                    callbacks=callbacks_list,
                    validation_data=generator(test_input, test_output, BATCH_SIZE),
                    validation_steps=int(len(test_input)/BATCH_SIZE) + 1)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Epoch 1/100





Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100


<keras.callbacks.History at 0x7faa370977f0>
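
The `steps_per_epoch` and `validation_steps` expressions in the training call can be checked by hand from the dataset shapes printed after the split. The sketch below is plain arithmetic, with `math.ceil` shown as an alternative for comparison.

```python
import math

BATCH_SIZE = 256
n_train, n_test = 292097, 51546        # row counts printed after the split

steps_per_epoch = int(n_train / BATCH_SIZE) + 1
validation_steps = int(n_test / BATCH_SIZE) + 1
print(steps_per_epoch, validation_steps)          # 1142 202

# Rounding up gives the same result here ...
print(math.ceil(n_train / BATCH_SIZE))            # 1142
# ... but differs when the data size is an exact multiple of the batch size
print(int(512 / BATCH_SIZE) + 1, math.ceil(512 / BATCH_SIZE))   # 3 2
```

Because the generator wraps around the data, the extra step in the exact-multiple case only repeats a few samples rather than causing an error.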

###Generate Text

####Restore the latest checkpoint


In [0]:
model.load_weights('/content/drive/My Drive/Colab Notebooks/Shakespeare-epoch001-words14834-sequence10-loss3.9755-acc0.3642-val_loss4.1913-val_acc0.3557.hdf5')

####Conditional Samples

Here, the story is generated word by word from a prompt you provide. If the prompt has more than SEQ_LEN words, it is truncated from the beginning to fit the sequence length. If it has fewer than SEQ_LEN words, it is padded with zeros at the beginning. The whole story is generated by predicting the next probable word in a loop.

Note: Since 0 is used to pad sequences, it is important that the Tokenizer does not use 0 as an index. 
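
The padding and truncation rule can be illustrated without Keras. `pad_pre` below is a hypothetical stand-in that mimics what `pad_sequences` does for a single sequence with `padding='pre'` and `truncating='pre'`; the token values are arbitrary.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: keep the last `maxlen` items, left-pad with `value`."""
    seq = list(seq)[-maxlen:]                       # truncate from the beginning
    return [value] * (maxlen - len(seq)) + seq      # pad at the beginning

print(pad_pre([13, 48, 41], maxlen=5))              # [0, 0, 13, 48, 41]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))     # [3, 4, 5, 6, 7]

# During generation the window then slides: drop the oldest token, append the new one
window = pad_pre([13, 48, 41], maxlen=5)
window = window[1:] + [8]
print(window)                                       # [0, 13, 48, 41, 8]
```

Truncating from the beginning keeps the most recent words of a long prompt, which are the ones that matter most for predicting what comes next.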



In [0]:
def generate_cond_samples(no_of_words, temp):
    
    print('Enter prompt: \n')
    seed = input()
    sentence = tokenizer.texts_to_sequences([seed])[0]

    sentence = list(pad_sequences([sentence], maxlen=SEQ_LEN, padding='pre', truncating='pre')[0])


    gen_story = []
    gen_story.extend(seed.split())

    for i in range(no_of_words):
        x_pred = np.expand_dims(sentence, axis=0) 
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_indices = sample(preds, temp)

        for ix in next_indices:
          if ix == 0:
            continue
          else:
            next_word = tokenizer.index_word[ix]
            sentence = sentence[1:]
            sentence.append(ix)
            break

        gen_story.append(next_word)
        
    display(gen_story)
        
       
    

In [0]:
generate_cond_samples(no_of_words = 500, temp=0.5)

####Unconditional Samples

Here, the story is generated word by word starting from a random seed. SEQ_LEN integers are randomly sampled from the word index, and the story is generated by predicting the next probable word in a loop.

In [0]:
def generate_uncond_samples(no_of_words, temp):
    np.random.seed(0)
    seed = np.random.randint(1, VOCAB_LEN, size=SEQ_LEN)

    sentence = seed

    sentence = list(pad_sequences([sentence], maxlen=SEQ_LEN, padding='pre')[0])

    gen_story = []

    gen_story.extend(tokenizer.index_word[w] for w in seed)
    end_flag = 0

    for i in range(no_of_words):
        x_pred = np.expand_dims(sentence, axis=0)

        preds = model.predict(x_pred, verbose=0)[0]
        next_indices = sample(preds, temp)

        # '|endofstory|' exists only if the corpus contains it; .get() returns None otherwise
        end_ix = tokenizer.word_index.get('|endofstory|')

        for ix in next_indices:
            if ix == 0:
                continue  # 0 is the padding index, not a word
            if ix == end_ix:
                end_flag = 1
                break
            next_word = tokenizer.index_word[ix]
            sentence = sentence[1:]
            sentence.append(ix)
            break

        if end_flag == 1:
            break


        gen_story.append(next_word)

    display(gen_story)

In [0]:
generate_uncond_samples(no_of_words = 500, temp=0.5)

 conclude idles compromise
 watery edge
 greets presumes majestical fidiused magistrate thrive, i will not be fast. 
 
 duke vincentio: 
 i have no more than all a cup of wine. 
 
 leontes: 
 what, sir, i am so, my lord, and i am dead. 
 
 duke vincentio: 
 i have been a thousand: but, and my life, and not i turn. 
 
 angelo: 
 ay, my lord, and let her kill my heart. 
 
 petruchio: 
 o, i will not be so. 
 
 sicinius: 
 i am not so. 
 
 nurse: 
 no, what you say, how i am, is gone. 
 
 romeo: 
 i am the duke of york, but i will see't to chide him; 
 and i will be so. 
 
 sebastian: 
 i have no less than a word of. 
 what is't, a king, i'll lay it from him, and the settled son and sanctimonious journey, 
 let me be not in this heads. 
 
 hastings: 
 what, thou art not so. 
 
 first watchman: 
 he was a man in a holy tale: 
 and let the use of the complexion of the vessel
 of its scraping back, and i come to me. 
 
 katharina: 
 nay, sir, i have not a fool, 
 and the king's father will b

Looking at the generated text, you'll see that the model knows when to use punctuation, breaks the text into paragraphs, and imitates a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.



###Build an LSTM Model

We keep the structure of the model similar but use the Keras LSTM layer instead of the SimpleRNN layer. We again use the [Keras Sequential Model](https://keras.io/getting-started/sequential-model-guide/) to define the LSTM-based generative model with four kinds of Keras layers:


*   [keras.layers.Embedding](https://keras.io/layers/embeddings/): used to train a dense representation of words and their relative meanings.
*   [keras.layers.LSTM](https://keras.io/layers/recurrent/#lstm): long short-term memory layer composed of a *cell*, an *input gate*, an *output gate* and a *forget gate*.
*   [keras.layers.Dropout](https://keras.io/layers/core/#dropout): applies a regularization technique where randomly selected neurons are ignored or "dropped out" during training.
*   [keras.layers.Dense](https://keras.io/layers/core/#dense): regular densely connected neural network layer with output size equal to the vocabulary size (number of unique words). This layer is added at the end and uses the softmax activation to output the probabilities (which add up to one) for each word.



In [0]:
lstm_model = Sequential()
lstm_model.add(Embedding(VOCAB_LEN, 256, input_length=SEQ_LEN))
lstm_model.add(LSTM(128, return_sequences=True))
lstm_model.add(Dropout(0.2))
lstm_model.add(LSTM(64))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(VOCAB_LEN, activation='softmax'))

We compile the model to configure it for training and use categorical crossentropy as our loss function. 

In [0]:
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 10, 256)           5127936   
_________________________________________________________________
lstm_3 (LSTM)                (None, 10, 128)           197120    
_________________________________________________________________
dropout_5 (Dropout)          (None, 10, 128)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dropout_6 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 20031)             1302015   
Total params: 6,676,479
Trainable params: 6,676,479
Non-trainable params: 0
____________________________________________
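
The LSTM rows of the summary follow from the same arithmetic as a simple recurrent layer, multiplied by four: one set of input weights, recurrent weights and biases for each of the three gates and for the cell candidate. The sketch below assumes this standard layout and uses made-up helper names.

```python
def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + one bias, for every unit
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    # input gate, forget gate, output gate and cell candidate: four weight sets
    return 4 * simple_rnn_params(input_dim, units)

lstm1 = lstm_params(256, 128)   # first LSTM layer, fed by 256-dim embeddings
lstm2 = lstm_params(128, 64)    # second LSTM layer, fed by the 128-unit layer

print(lstm1, lstm2)             # 197120 49408
```

These two counts do not depend on the vocabulary size, so they are the same for any corpus with this architecture.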

### Configure Checkpoints

In [0]:
BATCH_SIZE =256
file_path = "/content/drive/My Drive/checkpoints/Shakespeare-LSTM-epoch{epoch:03d}-words%d-sequence%d-batchsize%d-" \
            "loss{loss:.4f}-acc{acc:.4f}-val_loss{val_loss:.4f}-val_acc{val_acc:.4f}.hdf5" % \
            (VOCAB_LEN, SEQ_LEN, BATCH_SIZE)

checkpoint = ModelCheckpoint(file_path, monitor='val_acc', save_best_only=True)
early_stopping = EarlyStopping(monitor='val_acc', patience=10)
callbacks_list = [checkpoint, early_stopping]

###Train the LSTM Model

The model is trained for 30 epochs, but training could stop early due to the callback. 

In [0]:
lstm_model.fit_generator(generator(train_input, train_output, BATCH_SIZE),
                    steps_per_epoch=int(len(train_input)/BATCH_SIZE) + 1,
                    epochs=30,
                    callbacks=callbacks_list,
                    validation_data=generator(test_input, test_output, BATCH_SIZE),
                    validation_steps=int(len(test_input)/BATCH_SIZE) + 1)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100

###Generate Text 

####Restore the last checkpoint

In [0]:
lstm_model.load_weights('/content/drive/My Drive/Colab Notebooks/Sherlock-epoch001-words14834-sequence10-loss3.9755-acc0.3642-val_loss4.1913-val_acc0.3557.hdf5')
model = lstm_model  # the generation helpers read the global `model`, so point it at the LSTM

####Unconditional Text Generation

Starting from a random seed.

In [0]:
generate_uncond_samples(no_of_words = 500, temp=0.5)

####Conditional Text Generation

Enter a starting prompt.

In [0]:
generate_cond_samples(no_of_words = 500, temp=0.5)