<a href="https://colab.research.google.com/github/dimtr/PyDataEHV_workshop/blob/master/TextGeneration/Shakespeare_PyData.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Text Generation with Recurrent Neural Networks
In this workshop, we see how Recurrent Neural Networks (RNNs) can be used as generative models. They can learn the sequences of a problem domain and then generate entirely new, plausible sequences for it.

We will discover how to create a simple text generation model in Python with [Keras](https://keras.io/) that generates text word by word. We will work with the dataset of Shakespeare's writing from Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

Given a sequence of words, the model trained on our dataset will predict the next most probable word. We will call the model repeatedly to generate longer sequences.  



## Setup

#### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Import Keras and other libraries

In [2]:
# All Keras imports come from the same package to avoid mixing keras and tensorflow.keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, SimpleRNN, LSTM, Embedding
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.utils import shuffle

import numpy as np


Using TensorFlow backend.


#### Clone GitHub Repository

In [3]:
!git clone https://github.com/dimtr/PyDataEHV_workshop/

Cloning into 'PyDataEHV_workshop'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (134/134), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 250 (delta 72), reused 89 (delta 32), pack-reused 116
Receiving objects: 100% (250/250), 297.65 MiB | 14.37 MiB/s, done.
Resolving deltas: 100% (95/95), done.
Checking out files: 100% (77/77), done.


#### Reading the data

We load the Shakespeare text file and take a look at part of the data.

In [4]:
with open('/content/PyDataEHV_workshop/TextGeneration/datasets/shakespeare/shakespeare.txt', encoding='utf-8') as f:
   story = f.readlines()
   
print(''.join(story[:20]))

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.



## Process the Text
Before training the model, we need to convert the text into a form that the model can interpret.

#### Story in words

First, we convert the story data into chunks of words, or tokens. Since we want our model to also recognize punctuation marks as words, we use [replace()](https://docs.python.org/2/library/string.html#string.replace) to add white space around them and then use [split()](https://docs.python.org/2/library/stdtypes.html#str.split) to split the story into word chunks.

In [6]:
story_in_words = []
for i, line in enumerate(story):
  story[i] = line.lower().replace('.', ' . ').\
                          replace(',', ' , ').\
                          replace('?', ' ? ').\
                          replace('"', ' " ').\
                          replace('!', ' ! ').\
                          replace(':', ' : ').\
                          replace(';', ' ; ').\
                          replace('--', ' ').\
                          replace('-', ' ').\
                          replace(',', ' , ')
  story_in_words.extend(story[i].split(' '))

print("Total number of tokens in story: %d" % len(story_in_words))
print("Unique words in tokens: %d" %len(set(story_in_words)))

Total number of tokens in story: 475963
Unique words in tokens: 14833
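
To see what this preprocessing does to a single line, here is a minimal sketch that applies the same kind of `lower()`, `replace()` and `split(' ')` steps to one sentence from the text. The helper name `split_line` is hypothetical, and the sketch is a simplified version of the cell above (it pads each punctuation mark only once). Note that `split(' ')` keeps an empty string wherever two spaces meet and keeps the trailing newline as its own token, so the token count reported above likely includes such entries as well as words.

```python
def split_line(line):
    # Pad each punctuation mark with spaces so it becomes a separate token.
    for mark in ['.', ',', '?', '"', '!', ':', ';']:
        line = line.replace(mark, ' ' + mark + ' ')
    # Lower-case and turn dashes into spaces, then split on single spaces.
    line = line.lower().replace('--', ' ').replace('-', ' ')
    return line.split(' ')

tokens = split_line("We know't, we know't.\n")
print(tokens)
```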


#### Creating sequences

The next step is to split the entire text into sequences of a fixed length. We specify this length with the *hyperparameter* SEQ_LEN: the number of words that the generative model takes as input to predict the next word.

We go over the story, shifting by one word at each step, and take SEQ_LEN + 1 words at a time: SEQ_LEN input words plus the word to predict.

NOTE: You can play around with different values of SEQ_LEN to investigate how it affects the performance of your model. Trying out different hyperparameter values in order to find an optimal set for a learning algorithm is called *hyperparameter tuning*.

In [0]:
SEQ_LEN = 10
step = 1
sentences = []
for i in range(0, len(story_in_words) - SEQ_LEN, step):
  sentences.append(story_in_words[i : i + SEQ_LEN + 1])
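
The sliding window is easier to see on a toy example. The sketch below uses a made-up seven-token list and a window of 3 input words instead of 10. Each window holds the input words followed by the word the model should predict.

```python
words = ['let', 'us', 'kill', 'him', ',', 'and', "we'll"]
seq_len = 3  # toy value; the notebook uses SEQ_LEN = 10

# Same loop shape as above: shift by one word, take seq_len + 1 words each time.
windows = [words[i : i + seq_len + 1] for i in range(0, len(words) - seq_len, 1)]

for w in windows:
    print(w[:-1], '->', w[-1])  # input words -> target word
```

With 7 tokens and a window of 3 inputs, this yields 7 - 3 = 4 windows, which mirrors how the full story yields `len(story_in_words) - SEQ_LEN` sequences.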


#### Tokenizer

Now we need to map the strings to a numeric representation, with each unique word represented by a unique integer. We use the Keras [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class for this task. Its `fit_on_texts` method builds an internal vocabulary from our sequences of text, and `texts_to_sequences` then replaces each word with its integer index.

Note: 0 is a reserved index in this class that won't be assigned to any word.

In [0]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)  # transforms each text sequence into a sequence of integer, where each integer represents a unique word
sequences = np.asarray(sequences)                    # converting a list to a numpy array
vocab = tokenizer.word_counts                        # Dict object of vocabulary with frequency count

VOCAB_LEN = len(tokenizer.word_counts) + 1
total = sequences.shape[0]


We take an example sentence and see how the Tokenizer class transforms it into a sequence of integers.

In [12]:
temp = 'You are all resolved rather to die than to famish ?'

tokens = tokenizer.texts_to_sequences([temp.split()])
print("Numeric representation of the sentence: 'You are all resolved rather to die than to famish ?' is --> ", tokens)

Numeric representation of the sentence: 'You are all resolved rather to die than to famish ?' is -->  [[13, 48, 41, 1535, 378, 8, 198, 71, 8, 3461, 16]]
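
As a rough stand-in for what the Tokenizer does, the sketch below builds a word index in plain Python: words are ranked by frequency and numbered from 1, so 0 stays free. The helper `build_word_index` is hypothetical and is not the library's implementation. It only illustrates why frequent words such as "to" receive small numbers in the output above.

```python
from collections import Counter

def build_word_index(token_lists):
    # Rank words by how often they occur and number them from 1,
    # leaving 0 unused so it can later serve as a padding value.
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy = [['to', 'die', 'than', 'to', 'famish'], ['to', 'die']]
word_index = build_word_index(toy)
encoded = [[word_index[t] for t in toks] for toks in toy]

print(word_index)
print(encoded)
```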


#### Training and Test data split

Data is shuffled and split into training and test dataset with 15% of the data in the test split. 

The first *SEQ_LEN* words of the sequence make the input data and the remaining last word is the target data or the next probable word that the model should output.

In [13]:
sequences = shuffle(sequences)

train_input = sequences[int(total * 0.15):, :-1]
train_output = sequences[int(total * 0.15):, -1]

test_input = sequences[:int(total * 0.15), :-1]
test_output = sequences[:int(total * 0.15), -1]


print("Input Training Data Shape:", train_input.shape)
print("Target Training Data Shape:", train_output.shape)

print("Input Test Data Shape:", test_input.shape)
print("Target Test Data Shape:", test_output.shape)

Input Training Data Shape: (404561, 10)
Target Training Data Shape: (404561,)
Input Test Data Shape: (71392, 10)
Target Test Data Shape: (71392,)
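
The slicing can be followed on a small made-up dataset. This sketch uses plain lists instead of NumPy arrays and 20 toy rows of 3 inputs plus 1 target. The row contents and the seed are assumptions chosen only for illustration.

```python
import random

rows = [[i, i + 1, i + 2, i + 3] for i in range(20)]  # 3 input values + 1 target per row
random.seed(0)
random.shuffle(rows)

cut = int(len(rows) * 0.15)  # number of rows that go to the test split
test_rows, train_rows = rows[:cut], rows[cut:]

# Everything except the last column is input; the last column is the target.
train_x = [r[:-1] for r in train_rows]
train_y = [r[-1] for r in train_rows]

print(len(train_rows), len(test_rows))
```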


#### Build the RNN Model

We use the [Keras Sequential Model](https://keras.io/getting-started/sequential-model-guide/) to define the model. To build our simple RNN text generation model, we use four keras layers:


*   [keras.layers.Embedding](https://keras.io/layers/embeddings/): used to train a dense representation of words and their relative meanings.
*   [keras.layers.SimpleRNN](https://keras.io/layers/recurrent/#simplernn): fully connected RNN layer where the output is fed back to the input.
*   [keras.layers.Dropout](https://keras.io/layers/core/#dropout): applies a regularization technique where randomly selected neurons are ignored or "dropped-out" during training.
*   [keras.layers.Dense](https://keras.io/layers/core/#dense): regular densely connected neural network layer with output size equal to the vocabulary size (number of unique words). This layer is added at the end and uses the softmax activation to output the probabilities (that add up to one) for each word.



In [14]:
model = Sequential()
model.add(Embedding(VOCAB_LEN, 256, input_length=SEQ_LEN))
model.add(SimpleRNN(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(SimpleRNN(64))
model.add(Dropout(0.2))
model.add(Dense(VOCAB_LEN, activation='softmax'))






We compile the model to configure it for training and use categorical crossentropy as our loss function. 

In [15]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()



Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 10, 256)           3797504   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 10, 128)           49280     
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 128)           0         
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 64)                12352     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 14834)             964210    
Total params: 4,823,346
Trainable params: 4,823,346
Non-trainable params: 0
__________________________________________
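
The `Param #` column can be checked by hand. The sketch below recomputes each count from the layer sizes, assuming the usual layout of one input weight matrix, one recurrent weight matrix and one bias vector per SimpleRNN layer. The helper name is hypothetical.

```python
vocab_len, embed_dim = 14834, 256

def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + one bias per unit
    return units * (input_dim + units + 1)

embedding = vocab_len * embed_dim          # one 256-d vector per vocabulary entry
rnn_1 = simple_rnn_params(embed_dim, 128)
rnn_2 = simple_rnn_params(128, 64)
dense = 64 * vocab_len + vocab_len         # weights + biases of the softmax layer
total = embedding + rnn_1 + rnn_2 + dense

print(embedding, rnn_1, rnn_2, dense, total)
```

Most of the parameters sit in the embedding and the final dense layer, both of which scale with the vocabulary size.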

### Helper Functions

We create a generator function that generates data batch-by-batch. The generator is run in parallel to the model, for efficiency. 

In [0]:
def generator(sent, word, batch_size):
  # Yields (inputs, one-hot targets) batches forever, cycling through the data.
  # The position counter is local, so the training and validation generators
  # do not overwrite each other's position.
  index = 0
  while True:
    x = np.zeros((batch_size, SEQ_LEN), dtype=int)
    y = np.zeros((batch_size, VOCAB_LEN), dtype=bool)

    for i in range(batch_size):
      x[i] = sent[index % len(sent)]
      y[i] = to_categorical(word[index % len(word)], num_classes=VOCAB_LEN)  # convert the integer to a one-hot encoded vector
      index = index + 1
    yield x, y

This function samples an index from a softmax probability array based on the temperature. This technique is called temperature sampling and is used to improve the quality of samples from language models.

Note: The high temperature sample displays greater linguistic variety, but the low temperature sample is more grammatically correct. Lowering the temperature allows you to focus on higher probability output sequences and smooth over deficiencies of the model.

In [0]:
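def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature          # divide the log-probabilities by the temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)        # renormalize so the probabilities sum to one
    probas = np.random.multinomial(1, preds, 1)  # one draw, returned as a one-hot row
    # Indices ordered so that the drawn index comes first; the rest are fallbacks for the caller.
    return np.flip(np.argsort(probas))[0]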
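
To make the effect of the temperature concrete, the following sketch applies the same rescaling to a made-up three-word distribution, using only the standard library. The function name `apply_temperature` and the probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    # Divide the log-probabilities by the temperature, then renormalize.
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    norm = sum(scaled)
    return [s / norm for s in scaled]

probs = [0.7, 0.2, 0.1]
cold = apply_temperature(probs, 0.5)  # sharper: the likeliest word gains probability
hot = apply_temperature(probs, 2.0)   # flatter: unlikely words gain probability

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

A temperature of 1.0 leaves the distribution unchanged, which is why it is the default in `sample`.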

We make a simple display function that prints the stories.

In [0]:
def display(story):
  full = ''
  for word in story:
    if word == '|newline|':
      full = full + '\n'
    elif word in ',.;:?!':
      full = full + word
    else:
      full = full + ' ' + word
  
  print(full)


#### Configure Checkpoints

We use two callbacks for our model:


*   [ModelCheckpoint](https://keras.io/callbacks/#modelcheckpoint): saves checkpoints during training by monitoring a quantity (validation accuracy in this case).
*   [EarlyStopping](https://keras.io/callbacks/#EarlyStopping): stops training when the monitored quantity (validation accuracy) has not improved for a certain number of epochs. This threshold is set by the *patience* argument.
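
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation: the helper `stopping_epoch` is hypothetical, the validation accuracies are made up, and details such as minimum improvement thresholds are ignored.

```python
def stopping_epoch(val_scores, patience):
    # Return the 1-based epoch at which training would stop,
    # or None if the patience is never exhausted. Higher scores are better.
    best, waited = float('-inf'), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, waited = score, 0   # improvement: reset the counter
        else:
            waited += 1               # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

scores = [0.30, 0.34, 0.36, 0.35, 0.36, 0.35]  # made-up validation accuracies
print(stopping_epoch(scores, patience=3))
```

Here the best score is reached at epoch 3, and three epochs without a strict improvement follow, so training would stop after epoch 6.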



In [0]:
BATCH_SIZE =256
file_path = "/content/PyDataEHV_workshop/TextGeneration/checkpoints/Shakespeare/Shakespeare-RNN-epoch{epoch:03d}-words%d-sequence%d-batchsize%d-" \
            "loss{loss:.4f}-acc{acc:.4f}-val_loss{val_loss:.4f}-val_acc{val_acc:.4f}.hdf5" % \
            (VOCAB_LEN, SEQ_LEN, BATCH_SIZE)

checkpoint = ModelCheckpoint(file_path, monitor='val_acc', save_best_only=True) # latest best model according to the val_acc monitored will not be overwritten
early_stopping = EarlyStopping(monitor='val_acc', patience=10)
callbacks_list = [checkpoint, early_stopping]
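
The file path above mixes two formatting styles: the `%d` placeholders are filled immediately, while the `{...}` fields are left in the string for the checkpoint callback to fill in with the epoch number and metric values when it saves. The sketch below shows both stages on a shortened template. The epoch and accuracy values are taken from the checkpoint file name used later in this notebook.

```python
# Stage 1: %-formatting fills in the fixed settings and leaves the {...} fields untouched.
template = "Shakespeare-RNN-epoch{epoch:03d}-words%d-sequence%d-batchsize%d-val_acc{val_acc:.4f}.hdf5" % (14834, 10, 256)
print(template)

# Stage 2: str.format fills in the per-epoch values, as happens at save time.
name = template.format(epoch=8, val_acc=0.3578)
print(name)
```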

Uncomment if you do not want to train the model from scratch.

In [0]:
#model.load_weights('/content/PyDataEHV_workshop/TextGeneration/checkpoints/Shakespeare/Shakespeare-RNN-epoch008-words14834-sequence10-batchsize256-loss3.6602-acc0.3875-val_loss4.1272-val_acc0.3578.hdf5')

#### Train the Model

The model is trained for 30 epochs, but training could stop early due to the callback. 
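
Because the data comes from a generator, Keras needs to be told how many batches make up one epoch. The arithmetic used for `steps_per_epoch` and `validation_steps` can be checked with the data shapes printed earlier:

```python
train_rows, test_rows, batch_size = 404561, 71392, 256

steps_per_epoch = int(train_rows / batch_size) + 1
validation_steps = int(test_rows / batch_size) + 1
print(steps_per_epoch, validation_steps)

# The last batch is filled by wrapping around to the start of the data.
extra = steps_per_epoch * batch_size - train_rows
print(extra)
```

The extra rows come from the modulo indexing in `generator`, which wraps around to the beginning once the end of the data is reached.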

In [0]:
model.fit_generator(generator(train_input, train_output, BATCH_SIZE),
                    steps_per_epoch=int(len(train_input)/BATCH_SIZE) + 1,
                    epochs=30,
                    callbacks=callbacks_list,
                    validation_data=generator(test_input, test_output, BATCH_SIZE),
                    validation_steps=int(len(test_input)/BATCH_SIZE) + 1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30


<keras.callbacks.History at 0x7f3e303273c8>

### Generate Text

#### Restore the latest checkpoint


In [16]:
model.load_weights('/content/PyDataEHV_workshop/TextGeneration/checkpoints/Shakespeare/Shakespeare-RNN-epoch008-words14834-sequence10-batchsize256-loss3.6602-acc0.3875-val_loss4.1272-val_acc0.3578.hdf5')









#### Conditional Samples

Here, the story is generated word by word from a prompt that you provide. If the number of words in the prompt exceeds SEQ_LEN, the prompt is truncated from the beginning to fit the sequence length. If the prompt is shorter than SEQ_LEN, it is padded with zeros at the beginning. The whole story is generated by predicting the next probable word in a loop.

Note: Since 0 is used to pad sequences, it is important that the Tokenizer does not use 0 as an index. 
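
The truncation and padding rules can be sketched without Keras. The helper `fit_to_length` below is hypothetical and mirrors only the "pre" padding and "pre" truncating behaviour described above. The word indices are borrowed from the Tokenizer example earlier.

```python
def fit_to_length(ids, seq_len):
    # Keep only the last seq_len ids (truncate from the beginning),
    # then pad with zeros on the left if the result is too short.
    ids = ids[-seq_len:]
    return [0] * (seq_len - len(ids)) + ids

short_prompt = fit_to_length([13, 48, 41], 5)
long_prompt = fit_to_length([13, 48, 41, 1535, 378, 8, 198], 5)

print(short_prompt)
print(long_prompt)
```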



In [0]:
def generate_cond_samples(no_of_words, temp):
    
    print('Enter prompt: \n')
    seed = input()
    sentence = tokenizer.texts_to_sequences([seed])[0]

    sentence = list(pad_sequences([sentence], maxlen=SEQ_LEN, padding='pre', truncating='pre')[0])


    gen_story = []
    gen_story.extend(seed.split())

    for i in range(no_of_words):
        x_pred = np.expand_dims(sentence, axis=0) 
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_indices = sample(preds, temp)

        for ix in next_indices:
          if ix == 0:
            continue
          else:
            next_word = tokenizer.index_word[ix]
            sentence = sentence[1:]  ## drops the oldest word from the input window before the next prediction
            sentence.append(ix)  ## adds the predicted word in the input for the next prediction
            break

        gen_story.append(next_word)
        
    display(gen_story)
        
       
    

In [0]:
generate_cond_samples(no_of_words = 500, temp=0.5)

Enter prompt: 

King: what are you doing here Sybil?
 King: what are you doing here Sybil?. 
 
 first citizen: 
 if i do not, he should be at home. 
 
 gloucester: 
 what, this is all to die. 
 
 gloucester: 
 i will not enter, and take a fire; 
 and i will have a woman's proud rebel, 
 and therefore i mean, and thou shalt not be so. 
 
 duke vincentio: 
 o, i am going to save those that have his government: 
 the time of the father of his life, 
 that would be gone, my lord, sir, i am sure, and not a word. 
 
 sicinius: 
 nay, father, i will not be so. 
 
 biondello: 
 if you do yield. 
 
 king henry vi: 
 good night, and go, now, my lord, my lords, i am not to your love. 
 
 king richard ii: 
 my lord, my lord, sir? 
 
 duke vincentio: 
 what is a man, my lord, i know not not a word: 
 to all the world, by your fortune! 
 
 second servingman: 
 what, sir, i have heard you speak. 
 
 romeo: 
 o, to say it, i see your leave. 
 
 gloucester: 
 my lord, if thou art not. 
 
 leontes: 
 o,

#### Unconditional Samples

Here, the story is generated word by word, starting from a random seed. SEQ_LEN integers are randomly sampled from the word index, and the story is generated by predicting the next probable word in a loop.

In [0]:
def generate_uncond_samples(no_of_words, temp):
    np.random.seed(0)
    seed = np.random.randint(1, VOCAB_LEN, size=SEQ_LEN)

    sentence = seed

    sentence = list(pad_sequences([sentence], maxlen=SEQ_LEN, padding='pre')[0])

    gen_story = []

    gen_story.extend(tokenizer.index_word[w] for w in seed)

    for i in range(no_of_words):
        x_pred = np.expand_dims(sentence, axis=0)

        preds = model.predict(x_pred, verbose=0)[0]
        next_indices = sample(preds, temp)

        for ix in next_indices:
            if ix == 0:
                continue

            else:
                next_word = tokenizer.index_word[ix]
                sentence = sentence[1:]
                sentence.append(ix)
                break


        gen_story.append(next_word)

    display(gen_story)

In [0]:
generate_uncond_samples(no_of_words = 500, temp=0.5)

 conclude idles compromise
 watery edge
 greets presumes majestical fidiused magistrate. 
 where is the law, and bid his life. 
 
 gloucester: 
 i will not well the king, 
 and with the man of all a company. 
 
 first murderer: 
 what, as i see, though i should die: 
 i do not hear, my son, the measure of a noble. 
 
 king richard iii: 
 i have been a suitor to the king, 
 and not not not to my daughter? 
 
 miranda: 
 i am a king, and each to be a thief, to the one of the king: 
 but i will be a man? 
 
 menenius: 
 i am a man, and all her well? 
 
 first citizen: 
 i am so? 
 
 second servingman: 
 my lord, i do not be a woman. 
 
 king richard iii: 
 why, sir, my lord! 
 
 leontes: 
 i pray you, sir! 
 
 volumnia: 
 i have a poor man's life. who shall be so? 
 
 juliet: 
 how now, my lord, my lord, sir, the lord. 
 
 leontes: 
 i am a good gentleman. 
 
 aufidius: 
 o, sir? 
 
 leontes: 
 o, sir? 
 
 gonzalo: 
 i have no more more than the fool. 
 that i am all to have a language. 


Looking at the generated text, you'll see that the model knows when to use punctuation, breaks the text into paragraphs and imitates a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.



### Build an LSTM Model

We keep the structure of the model similar but use the Keras LSTM layer instead of the SimpleRNN layer. We again use the [Keras Sequential Model](https://keras.io/getting-started/sequential-model-guide/) to define the LSTM-based generative model with four Keras layers:


*   [keras.layers.Embedding](https://keras.io/layers/embeddings/): used to train a dense representation of words and their relative meanings.
*   [keras.layers.LSTM](https://keras.io/layers/recurrent/#lstm): long short-term memory layer composed of a *cell*, an *input gate*, an *output gate* and a *forget gate*.
*   [keras.layers.Dropout](https://keras.io/layers/core/#dropout): applies a regularization technique where randomly selected neurons are ignored or "dropped-out" during training.
*   [keras.layers.Dense](https://keras.io/layers/core/#dense): regular densely connected neural network layer with output size equal to the vocabulary size (number of unique words). This layer is added at the end and uses the softmax activation to output the probabilities (that add up to one) for each word.



In [0]:
lstm_model = Sequential()
lstm_model.add(Embedding(VOCAB_LEN, 256, input_length=SEQ_LEN))
lstm_model.add(LSTM(128, return_sequences=True))
lstm_model.add(Dropout(0.2))
lstm_model.add(LSTM(64))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(VOCAB_LEN, activation='softmax'))

We compile the model to configure it for training and use categorical crossentropy as our loss function. 

In [18]:
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 10, 256)           3797504   
_________________________________________________________________
lstm_1 (LSTM)                (None, 10, 128)           197120    
_________________________________________________________________
dropout_3 (Dropout)          (None, 10, 128)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dropout_4 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 14834)             964210    
Total params: 5,008,242
Trainable params: 5,008,242
Non-trainable params: 0
____________________________________________
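
The LSTM parameter counts follow the same pattern as a plain recurrent layer, multiplied by four: one set of weights for each of the three gates plus one for the cell candidate. The sketch below recomputes the numbers in the summary under that assumption. The helper name is hypothetical.

```python
def lstm_params(input_dim, units):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and a bias per unit.
    return 4 * units * (input_dim + units + 1)

lstm_1 = lstm_params(256, 128)
lstm_2 = lstm_params(128, 64)
total = 14834 * 256 + lstm_1 + lstm_2 + (64 * 14834 + 14834)

print(lstm_1, lstm_2, total)
```

The embedding and dense layers are unchanged from the SimpleRNN model, so the whole difference in size comes from the two recurrent layers.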

### Configure Checkpoints

In [0]:
BATCH_SIZE =256
file_path = "/content/PyDataEHV_workshop/TextGeneration/checkpoints/Shakespeare/Shakespeare-LSTM-epoch{epoch:03d}-words%d-sequence%d-batchsize%d-" \
            "loss{loss:.4f}-acc{acc:.4f}-val_loss{val_loss:.4f}-val_acc{val_acc:.4f}.hdf5" % \
            (VOCAB_LEN, SEQ_LEN, BATCH_SIZE)

checkpoint = ModelCheckpoint(file_path, monitor='val_acc', save_best_only=True)
early_stopping = EarlyStopping(monitor='val_acc', patience=10)
callbacks_list = [checkpoint, early_stopping]

Uncomment if you do not want to train the model from scratch.

In [0]:
#lstm_model.load_weights('/content/PyDataEHV_workshop/TextGeneration/checkpoints/Shakespeare/Shakespeare-LSTM-epoch011-words14834-sequence10-batchsize256-loss3.6790-acc0.3917-val_loss4.1451-val_acc0.3607.hdf5')

### Train the LSTM Model

The model is trained for 30 epochs, but training could stop early due to the callback. 

In [0]:
lstm_model.fit_generator(generator(train_input, train_output, BATCH_SIZE),
                    steps_per_epoch=int(len(train_input)/BATCH_SIZE) + 1,
                    epochs=30,
                    callbacks=callbacks_list,
                    validation_data=generator(test_input, test_output, BATCH_SIZE),
                    validation_steps=int(len(test_input)/BATCH_SIZE) + 1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30


<keras.callbacks.History at 0x7f3e478b7898>

### Generate Text

#### Restore the last checkpoint

In [0]:
lstm_model.load_weights('/content/PyDataEHV_workshop/TextGeneration/checkpoints/Shakespeare/Shakespeare-LSTM-epoch011-words14834-sequence10-batchsize256-loss3.6790-acc0.3917-val_loss4.1451-val_acc0.3607.hdf5')

#### Unconditional Text Generation

Starting from a random seed.

In [0]:
model = lstm_model  # generate_uncond_samples reads the global `model`, so point it at the LSTM
generate_uncond_samples(no_of_words = 500, temp=0.5)

#### Conditional Text Generation

Enter a starting prompt.

In [0]:
model = lstm_model  # generate_cond_samples reads the global `model`, so point it at the LSTM
generate_cond_samples(no_of_words = 500, temp=0.5)