# Text generation using RNN - Word Level

To generate text using an RNN, we first need to convert the raw text into a supervised learning problem.

Take, for example, the following corpus:

"Her brother shook his head incredulously. He was not aware of the situation at all."

First we need to arrange the data in a tabular format containing input (X) and output (y) sequences. For a word-level model, X and y will look like this:

|      X                |  Y      |
|-----------------------|---------|
|    < word1 >< word2 > | < word3 > |
|    Her brother        |  shook  |
|    brother shook      |  his    |
|    shook his          |  head   |
|    his head           | incredulously |
|    head incredulously |    .    |
|    ..                 |    .    |
|    situation at       |  all    |
|    at all             |    .    |

Note that in the above problem, the sequence length of **X is two words** and that of **y is one word**. Hence, this is a many-to-one architecture. We can, however, change the number of input words to any number depending on the problem.
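
The table above can be produced with a sliding window over the list of tokens. The following is a minimal sketch, assuming plain whitespace tokenisation with the full stops already split off; `make_pairs` and `n_input` are names chosen here for illustration and are not part of the notebook.

```python
def make_pairs(tokens, n_input=2):
    """Slide a window over tokens: n_input words form X, the next word is y."""
    pairs = []
    for i in range(n_input, len(tokens)):
        pairs.append((tokens[i - n_input:i], tokens[i]))
    return pairs

corpus = "Her brother shook his head incredulously . He was not aware of the situation at all ."
tokens = corpus.split()

pairs = make_pairs(tokens, n_input=2)
for x, y in pairs[:3]:
    print(" ".join(x), "->", y)
```

Changing `n_input` changes the number of input words per sample; the notebook later uses five.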

A model is trained on such data. To generate text, we give the model any two words, from which it predicts the next word. The predicted word is then appended to the right end of the input sequence and the leftmost word is discarded. The model predicts again from the new sequence, and the cycle continues for a fixed number of iterations. An example is shown below:

Seed text: "Did I"

|      X                                            |  Y                       |
|---------------------------------------------------|--------------------------|
|                        Did I                      |    < predicted word 1 >  |
|               I < predicted word 1 >              |    < predicted word 2 >  |
|       < predicted word 1 > < predicted word 2 >   |    < predicted word 3 >  |
|       < predicted word 2 > < predicted word 3 >   |    < predicted word 4 >  |
|                      ...                          |            ...           | 
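
The generation loop in the table can be sketched without a trained network. Here `predict_next` is a hypothetical stand-in for the model (it simply looks up the next word in a fixed dictionary), so only the append-and-discard mechanics are shown.

```python
def generate(seed_words, predict_next, n_words):
    """Repeatedly predict a word, append it, and drop the oldest word from the window."""
    window = list(seed_words)
    generated = list(seed_words)
    for _ in range(n_words):
        next_word = predict_next(window)
        generated.append(next_word)
        # slide the window: discard the leftmost word, add the prediction on the right
        window = window[1:] + [next_word]
    return generated

# hypothetical "model": maps the last word of the window to a next word
toy_table = {"I": "say", "say": "that", "that": "I"}

def predict_next(window):
    return toy_table[window[-1]]

print(" ".join(generate(["Did", "I"], predict_next, 4)))
```

The window always holds exactly two words, matching the input length the model was trained on.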

# Notebook Overview
1. Preprocess data
2. Build LSTM model
3. Generate text

In [1]:
# import libraries
import warnings
warnings.filterwarnings("ignore")

import re
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import requests

from gensim.models import KeyedVectors

from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# 1. Preprocess data

In [2]:
# download ebook
url = "https://www.gutenberg.org/files/24869/24869-0.txt"
book = requests.get(url)
data = book.text

In [3]:
# let's look at the text
print(data[:500])

ï»¿The Project Gutenberg EBook of The Ramayana



This eBook is for the use of anyone anywhere at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it under
the terms of the Project Gutenberg License included with this eBook or
online at http://www.gutenberg.org/license



Title: The Ramayana



Release Date: March 18, 2008 [Ebook #24869]

Language: English

Character set encoding: UTF-8


***START OF THE PROJECT GUTENBERG EBOOK THE


In [4]:
# subset the book from the first chapter, that is, INVOCATION - everything before the first chapter is irrelevant data
start_index = re.search(r"invocation\.\(1\)", data, re.I)
print(start_index.start())

19177


In [5]:
# keep only the text from the first chapter onwards
data = data[start_index.start():]

In [6]:
# let's look at the text
print(data[:500])

INVOCATION.(1)


Praise to VÃ¡lmÃ­ki,(2)bird of charming song,(3)
  Who mounts on Poesyâs sublimest spray,
And sweetly sings with accent clear and strong
  RÃ¡ma, aye RÃ¡ma, in his deathless lay.

Where breathes the man can listen to the strain
  That flows in music from VÃ¡lmÃ­kiâs tongue,
Nor feel his feet the path of bliss attain
  When RÃ¡maâs glory by the saint is sung!

The stream RÃ¡mÃ¡yan leaves its sacred fount
  The whole wide world from sin and stain to free.(4)
T
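
The accented characters in the output above are garbled: `RÃ¡ma` should read `Ráma`, and the `ï»¿` at the start of the earlier preview is a UTF-8 byte order mark. This is the pattern produced when UTF-8 bytes are decoded as Latin-1, which `requests` falls back to for `text/*` responses that declare no charset. The sketch below reproduces the effect on a short string and reverses it; `fix_mojibake` is an illustrative helper and not part of the notebook.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1."""
    return text.encode("latin-1").decode("utf-8")

original = "Ráma"
garbled = original.encode("utf-8").decode("latin-1")

print(garbled)                # RÃ¡ma
print(fix_mojibake(garbled))  # Ráma
```

For the download itself, setting `book.encoding = "utf-8-sig"` before reading `book.text` should avoid the problem at the source. The rest of this notebook was run on the garbled text, so the word and vocabulary counts reported below would differ after such a change.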


In [7]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Clean text

In [6]:
# define a function to clean text data
def clean_document(document, char_filter = r"[^\w]"):
    '''
    input:
    document          :  string
    char_filter       :  regex pattern - removes those characters from the text that match the pattern

    output: clean document
    '''
    
    # convert words to lower case
    document = document.lower()

    # tokenise words
    words = word_tokenize(document)

    # strip whitespace from all words
    words = [word.strip() for word in words]

    # join back words to get document
    document = " ".join(words)

    # remove unwanted characters
    document = re.sub(char_filter, " ", document)

    # replace multiple whitespaces with single whitespace
    document = re.sub(r"\s+", " ", document)

    # strip whitespace from document
    document = document.strip()

    return document

data = clean_document(data)

In [7]:
# length of text
words = word_tokenize(data)
print("Number of words in document: {}".format(len(words)))

Number of words in document: 427239


## Convert words to integers

In [8]:
# use the Keras Tokenizer class to encode text as integers
word_tokeniser = Tokenizer()
word_tokeniser.fit_on_texts([data])
encoded_words = word_tokeniser.texts_to_sequences([data])[0]

In [9]:
# check the size of the vocabulary
VOCABULARY_SIZE = len(word_tokeniser.word_index) + 1
print('Vocabulary Size: {}'.format(VOCABULARY_SIZE))

Vocabulary Size: 18454


In [10]:
print('Word index: {}'.format(word_tokeniser.word_index))



## Divide data in X and y

### Create sequences

In each training sample, X will be a sequence of five words and y will be the sixth word. In other words, we use the previous five words of a sequence to predict the next word.

In [11]:
sequences = []
MAX_SEQ_LENGTH = 5  # X will have five words, y will have the sixth word

for i in range(MAX_SEQ_LENGTH, len(encoded_words)):
    sequence = encoded_words[i-MAX_SEQ_LENGTH:i+1]
    sequences.append(sequence)
sequences = np.array(sequences)

In [13]:
print('Total number of training samples: {}'.format(len(sequences)))
print('\nSample sequences: \n{}'.format(sequences[0:3]))

Total number of training samples: 427068

Sample sequences: 
[[7130  841  886    4   54 1096]
 [ 841  886    4   54 1096 1354]
 [ 886    4   54 1096 1354 2000]]


In [12]:
# divide the sequence into X and y
sequences = np.array(sequences)

X = sequences[:,:-1]  # assign all but last words of a sequence to X
y = sequences[:,-1]   # assign last word of each sequence to y

In [28]:
# Look at the first training example
print("Input of the first data point:", X[0], "\n")
print("Output of the first data point: [", y[0], "]")

Input of the first data point: [7130  841  886    4   54] 

Output of the first data point: [ 1096 ]


### One-hot encode y

In [13]:
y.shape

(427068,)

In [14]:
y = to_categorical(y, num_classes=VOCABULARY_SIZE)

In [15]:
print("Output of the first data point: [", y[0], "]")

Output of the first data point: [ [0. 0. 0. ... 0. 0. 0.] ]


In [16]:
print(X.shape)
print(y.shape)

(427068, 5)
(427068, 18454)


There are 427068 sequences (data points) in total.

Remember that an RNN expects input data of the shape (#samples, #timesteps, #features).

In X, the third dimension (the number of features) is missing because we are going to use the Keras Embedding layer, which maps each word index to a dense vector. Hence we don't need to reshape the data explicitly to add the third dimension; the Embedding layer adds it.

In y, the second dimension (the number of timesteps) is missing because y is not a sequence; it is just a single word. Its features are represented by a one-hot encoded vector whose length is VOCABULARY_SIZE.
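
To make the shapes concrete, here is a small NumPy sketch of what an embedding lookup and one-hot encoding do to the array shapes. The sizes are toy values chosen for illustration, and the random matrix stands in for the weights that the Keras `Embedding` layer would learn.

```python
import numpy as np

n_samples, timesteps = 4, 5
vocab_size, embedding_size = 10, 3   # toy sizes, not the notebook's values

rng = np.random.default_rng(0)
X_toy = rng.integers(1, vocab_size, size=(n_samples, timesteps))  # word ids
y_toy = rng.integers(1, vocab_size, size=n_samples)               # next-word ids

# embedding lookup: each word id is replaced by its row in the matrix
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))
X_embedded = embedding_matrix[X_toy]

# one-hot encoding: each id becomes a vector of length vocab_size
y_onehot = np.eye(vocab_size)[y_toy]

print(X_toy.shape)       # (4, 5)
print(X_embedded.shape)  # (4, 5, 3)
print(y_onehot.shape)    # (4, 10)
```

The lookup is what supplies the missing #features dimension for X, while y gains only a feature dimension and no timestep dimension.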

### Pad sequences

In [17]:
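# every row of X already has MAX_SEQ_LENGTH words, so this call leaves X unchanged;
# it is kept to show how shorter inputs are left-padded (the same call is used at generation time)
X = pad_sequences(X, maxlen=MAX_SEQ_LENGTH, padding='pre')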
print('Input sequence length: {}'.format(MAX_SEQ_LENGTH))

Input sequence length: 5


# 2. Build LSTM model

In [18]:
# create model architecture

EMBEDDING_SIZE = 100


model = Sequential()

# embedding layer
model.add(Embedding(VOCABULARY_SIZE, EMBEDDING_SIZE, input_length = MAX_SEQ_LENGTH))

# lstm layer 1
model.add(LSTM(128, return_sequences=True))

# lstm layer 2
model.add(LSTM(128))

# output layer
model.add(Dense(VOCABULARY_SIZE, activation='softmax'))

In [19]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  
# summarize defined model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 5, 100)            1845400   
                                                                 
 lstm (LSTM)                 (None, 5, 128)            117248    
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dense (Dense)               (None, 18454)             2380566   
                                                                 
Total params: 4,474,798
Trainable params: 4,474,798
Non-trainable params: 0
_________________________________________________________________


In [20]:
# fit network
model.fit(X, y, epochs=10, verbose=2, batch_size=256)

Epoch 1/10
1669/1669 - 111s - loss: 6.8509 - accuracy: 0.0749 - 111s/epoch - 67ms/step
Epoch 2/10
1669/1669 - 111s - loss: 6.1106 - accuracy: 0.1109 - 111s/epoch - 67ms/step
Epoch 3/10
1669/1669 - 111s - loss: 5.7634 - accuracy: 0.1274 - 111s/epoch - 66ms/step
Epoch 4/10
1669/1669 - 111s - loss: 5.5293 - accuracy: 0.1385 - 111s/epoch - 67ms/step
Epoch 5/10
1669/1669 - 112s - loss: 5.3439 - accuracy: 0.1477 - 112s/epoch - 67ms/step
Epoch 6/10
1669/1669 - 112s - loss: 5.1816 - accuracy: 0.1559 - 112s/epoch - 67ms/step
Epoch 7/10
1669/1669 - 111s - loss: 5.0330 - accuracy: 0.1637 - 111s/epoch - 66ms/step
Epoch 8/10
1669/1669 - 112s - loss: 4.8963 - accuracy: 0.1715 - 112s/epoch - 67ms/step
Epoch 9/10
1669/1669 - 113s - loss: 4.7672 - accuracy: 0.1796 - 113s/epoch - 68ms/step
Epoch 10/10
1669/1669 - 112s - loss: 4.6455 - accuracy: 0.1883 - 112s/epoch - 67ms/step


<keras.callbacks.History at 0x7f8456e7e950>

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
!cp /content/drive/MyDrive/ML/CNN/DataSet/GoogleNews-vectors-negative300.bin.gz .

In [34]:
!gunzip GoogleNews-vectors-negative300.bin.gz


### Load word embeddings to represent the input words

In [35]:
# word2vec download link (Size ~ 1.5GB): https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

path = 'GoogleNews-vectors-negative300.bin'

# load word2vec using the following function present in the gensim library
word2vec = KeyedVectors.load_word2vec_format(path, binary=True)

In [36]:
# assign word vectors from word2vec model

EMBEDDING_SIZE  = 300  # each word in word2vec model is represented using a 300 dimensional vector
VOCABULARY_SIZE = len(word_tokeniser.word_index) + 1

# create an empty embedding matrix
embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))

# create a word to index dictionary mapping
word2id = word_tokeniser.word_index

# copy vectors from word2vec model to the words present in corpus
for word, index in word2id.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass

In [37]:
# check embedding dimension
print("Embeddings shape: {}".format(embedding_weights.shape))

Embeddings shape: (18454, 300)
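
Words that are missing from word2vec keep an all-zero row in `embedding_weights`, so it can be worth checking how much of the vocabulary was actually covered. A minimal sketch with toy stand-ins follows: `toy_vectors` and `toy_word2id` are hypothetical, and in the notebook their roles would be played by `word2vec` and `word2id`.

```python
import numpy as np

# hypothetical pretrained vectors and vocabulary
toy_vectors = {"king": np.ones(4), "world": np.full(4, 2.0)}
toy_word2id = {"king": 1, "world": 2, "rã": 3}

weights = np.zeros((len(toy_word2id) + 1, 4))  # row 0 is reserved for padding
missing = []
for word, index in toy_word2id.items():
    if word in toy_vectors:
        weights[index] = toy_vectors[word]
    else:
        missing.append(word)   # row stays all zeros

coverage = 1 - len(missing) / len(toy_word2id)
print("coverage: {:.2f}, missing: {}".format(coverage, missing))
```

Because the embedding layer is created with `trainable=True`, the zero rows can still be updated during training.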


In [38]:
# create model architecture

model_wv = Sequential()

# embedding layer
model_wv.add(Embedding(VOCABULARY_SIZE, EMBEDDING_SIZE, input_length = MAX_SEQ_LENGTH,
                    weights = [embedding_weights], trainable=True))

# lstm layer 1
model_wv.add(LSTM(128, return_sequences=True))

# lstm layer 2
# when stacking LSTM layers, return_sequences must be True on the previous layer
# because this layer expects a sequential input rather than a single vector
model_wv.add(LSTM(128))

# output layer
model_wv.add(Dense(VOCABULARY_SIZE, activation='softmax'))

In [39]:
# compile network
model_wv.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  
# summarize defined model
model_wv.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 5, 300)            5536200   
                                                                 
 lstm_2 (LSTM)               (None, 5, 128)            219648    
                                                                 
 lstm_3 (LSTM)               (None, 128)               131584    
                                                                 
 dense_1 (Dense)             (None, 18454)             2380566   
                                                                 
Total params: 8,267,998
Trainable params: 8,267,998
Non-trainable params: 0
_________________________________________________________________


In [None]:
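# y was already one-hot encoded above; encode it only if it is still a vector of integer ids,
# since applying to_categorical twice would add an unwanted extra dimension
if y.ndim == 1:
    y = to_categorical(y, num_classes=VOCABULARY_SIZE)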

In [40]:
# fit network
model_wv.fit(X, y, epochs=5, batch_size=256)

Epoch 1/5


ValueError: ignored

# 3. Generate text

In [25]:
# generate a sequence from a language model
def generate_words(model, word_tokeniser, max_seq_length, seed, n_words):
    '''
    Greedily extend `seed` by `n_words` words, feeding the last
    `max_seq_length` words back into the model at every step.
    '''
    text = seed

    # generate n_words
    for _ in range(n_words):

        # encode text as integers
        encoded_words = word_tokeniser.texts_to_sequences([text])[0]

        # keep only the last max_seq_length words (shorter seeds are left-padded with zeros)
        padded_words = pad_sequences([encoded_words], maxlen=max_seq_length, padding='pre')

        # predict next word: index of the highest softmax probability
        # (used instead of model.predict_classes, which newer Keras versions no longer provide)
        probabilities = model.predict(padded_words, verbose=0)
        prediction = int(np.argmax(probabilities, axis=-1)[0])

        # convert predicted index to its word (index 0 is padding and maps to no word)
        next_word = word_tokeniser.index_word.get(prediction, "")

        # append predicted word to text
        text += " " + next_word

    return text

### Let's look at some text generations

In [26]:
# text generation using first model - embeddings learned from scratch
seed_text = "rama never told anyone about"
num_words = 100
print(generate_words(model, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

rama never told anyone about the world in the wild worlds that i will be the cause of all the world to day the king of men and all the gods he passed and to the king he said â o rã ma to the fight he bent and thus the king of men he spoke and thus the king addressed â o rã ma to the fight he bent and thus the king of men he spoke and thus the king addressed â o rã ma to the fight he bent and thus the king of men he spoke and thus the king addressed


In [None]:
# text generation using second model - pretrained word2vec embeddings
seed_text = "rama never told anyone about"
num_words = 100
print(generate_words(model_wv, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

In [27]:
# text generation using first model - embeddings learned from scratch
seed_text = "how are you doing"
num_words = 100
print(generate_words(model, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))

how are you doing to be understood to the rã mã yan is the same to the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the same is the


In [None]:
# text generation using second model - pretrained word2vec embeddings
seed_text = "how are you doing"
num_words = 100
print(generate_words(model_wv, word_tokeniser, MAX_SEQ_LENGTH, seed_text, num_words))