<a href="https://colab.research.google.com/github/chaitanya-kh/text_generation_lstm/blob/master/Text_generation_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with LSTM

## Introduction

The following is covered in this notebook:


1.   Getting the text data, which is the novel 'The Adventures of Tom Sawyer'
2.   Tokenizing the text data at character level using Keras preprocessing
3.   Preparing the data for training
4.   Building model with Embed, LSTM and Dense layers
5.   Training the model and checking the result



In [0]:
#Imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

## Getting the text data

In [2]:
#Uploading the novel 'The Adventures of Tom Sawyer' from https://www.gutenberg.org/ebooks/74
from google.colab import files
uploaded = files.upload()

In [3]:
#Read the whole novel into a single string
with open('tom.txt', 'r') as f:
  text = f.read()
print ('Length of text: ',len(text))
#Draw 4 random start offsets for the excerpts
off = np.random.randint(0, len(text)-5000, size=(4,))

#Prepare dataset as a list with the entire text as the first entry, followed by
#four random 2000-character excerpts that serve as seed input for text generation
dataset = [text, text[off[0]:off[0]+2000], text[off[1]:off[1]+2000], text[off[2]:off[2]+2000], text[off[3]:off[3]+2000]]

Length of text:  386669
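
The excerpt sampling above can also be made reproducible. The following is a minimal sketch, assuming a hypothetical helper `sample_excerpts` and a fixed seed (neither is part of the original notebook). It draws each start offset so that a full-length excerpt always fits, and uses a synthetic string in place of the novel.

```python
import random

def sample_excerpts(text, n_excerpts=4, length=2000, seed=0):
    """Return n_excerpts random slices of `text`, each exactly `length` characters long.

    Hypothetical helper for illustration; the notebook itself uses np.random.randint.
    """
    if len(text) < length:
        raise ValueError("text is shorter than the requested excerpt length")
    rng = random.Random(seed)        # fixed seed so the same excerpts are drawn on every run
    last_start = len(text) - length  # largest offset that still leaves a full excerpt
    starts = [rng.randint(0, last_start) for _ in range(n_excerpts)]
    return [text[s:s + length] for s in starts]

# Toy usage with a synthetic string standing in for the novel
toy_text = "abcdefghij" * 1000
toy_excerpts = sample_excerpts(toy_text, n_excerpts=4, length=2000, seed=42)
toy_dataset = [toy_text] + toy_excerpts
print(len(toy_dataset), [len(e) for e in toy_excerpts])
```
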


## Tokenizing the text data at character level using keras pre-processing

In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [0]:
#Vectorize the entire text
t = Tokenizer(lower=True, char_level=True)
t.fit_on_texts(dataset)

In [6]:
#Checkout the character encoding
t.word_index

{'\n': 14,
 ' ': 1,
 '!': 31,
 '&': 48,
 "'": 26,
 '(': 42,
 ')': 43,
 '*': 45,
 ',': 22,
 '-': 28,
 '.': 24,
 '2': 44,
 ':': 37,
 ';': 33,
 '?': 34,
 '[': 40,
 ']': 41,
 '_': 35,
 'a': 4,
 'b': 21,
 'c': 19,
 'd': 11,
 'e': 2,
 'f': 20,
 'g': 18,
 'h': 7,
 'i': 8,
 'j': 32,
 'k': 25,
 'l': 12,
 'm': 16,
 'n': 6,
 'o': 5,
 'p': 23,
 'q': 38,
 'r': 10,
 's': 9,
 't': 3,
 'u': 13,
 'v': 27,
 'w': 15,
 'x': 36,
 'y': 17,
 'z': 39,
 '\xa0': 46,
 '“': 29,
 '”': 30,
 '\ufeff': 47}

In [7]:
vocab_size = len(t.word_index)+1 #0 is not used in word index
print (vocab_size)

49
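
To make the character encoding less of a black box, here is a minimal pure-Python sketch of frequency-ranked character indexing, similar in spirit to what the tokenizer output above shows (the most frequent character gets index 1, and index 0 is never assigned). The helpers `build_char_index`, `encode` and `decode` are hypothetical names for illustration, not Keras APIs, and the sketch is not guaranteed to match the Keras `Tokenizer` in every detail.

```python
from collections import Counter

def build_char_index(texts, lower=True):
    """Map each character to an integer, most frequent first, starting at 1.

    A simplified stand-in for a character-level tokenizer; index 0 is left unused.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower() if lower else text)
    # most_common() sorts by descending count; ties keep first-seen order
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def encode(text, char_index, lower=True):
    """Convert a string to a list of integer tokens, skipping unknown characters."""
    if lower:
        text = text.lower()
    return [char_index[ch] for ch in text if ch in char_index]

def decode(tokens, char_index):
    """Convert integer tokens back to a string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[tok] for tok in tokens)

toy_index = build_char_index(["Hello world"])
toy_tokens = encode("Hello world", toy_index)
toy_vocab_size = len(toy_index) + 1  # +1 because index 0 is never assigned
print(toy_index)
print(toy_tokens, toy_vocab_size)
print(decode(toy_tokens, toy_index))
```
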


In [0]:
#Convert dataset to tokens
tokens = t.texts_to_sequences(dataset)

In [9]:
dataset_len = len(tokens[0])
print (dataset_len)

386669


## Preparing the data for training

In [10]:
#Define batch size to be passed in .fit()
batch_size=32

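#Define samples per batch (this is the length, in characters, of each input sequence)
samp_per_batch = 128

#Calculate the number of training sequences (named num_batches here): one window per start
#position, trimmed to a multiple of batch_size so that every batch passed to the model is full
num_batches = dataset_len - samp_per_batch - 1
num_batches = num_batches - (num_batches%batch_size)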
#max_l = samp_per_batch * num_batches

print ('Text length: ', dataset_len)
print ('Samples per batch: ', samp_per_batch)
print ('Number of batches: ', num_batches)
#print ('Text length considered for training: ', max_l)

Text length:  386669
Samples per batch:  128
Number of batches:  386528
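
The printed count can be verified by hand. The short sketch below repeats the arithmetic with the values shown above; `windows` and `remainder` are illustrative names that do not appear in the notebook.

```python
dataset_len = 386669   # text length printed above
samp_per_batch = 128   # characters per input window
batch_size = 32

windows = dataset_len - samp_per_batch - 1   # candidate start positions
remainder = windows % batch_size             # windows that would not fill a batch
num_batches = windows - remainder            # trimmed to a multiple of batch_size

assert num_batches == 386528                 # matches the value printed above
print(windows, remainder, num_batches, num_batches // batch_size)
```

Note that despite its name, `num_batches` counts individual training sequences; the number of gradient steps per epoch is `num_batches // batch_size`.
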


In [11]:
#Prepare input and output according to the samples per batch and total number of batches

x_inp = []
y_op = []

for i in range(num_batches):
  x_inp.append(tokens[0][i:i+samp_per_batch])
  #The window covers indices i..i+samp_per_batch-1, so the next character sits at i+samp_per_batch
  y_op.append(tokens[0][i+samp_per_batch])

#One hot encode the y_op
y_op = np.array(keras.utils.to_categorical(y_op, vocab_size))

#Reshape x_inp to (number of sequences, sequence length) so that it is compatible with the Embedding layer
x_inp = np.array(x_inp).reshape(num_batches, samp_per_batch) 

#Feature scaling
#x_inp = x_inp/vocab_size

print ('Input shape: ', x_inp.shape)
print ('Output shape: ', y_op.shape)

Input shape:  (386528, 128)
Output shape:  (386528, 49)
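
The windowing step is easier to follow on a toy sequence. The following is a minimal sketch using plain Python lists; `make_windows` and `one_hot` are hypothetical helpers rather than Keras functions. In this sketch the target is the token at index `i + window`, which is the position directly after each window.

```python
def make_windows(tokens, window):
    """Return (inputs, targets): each input is `window` consecutive tokens and
    its target is the token immediately following that window."""
    inputs, targets = [], []
    for i in range(len(tokens) - window):
        inputs.append(tokens[i:i + window])   # indices i .. i+window-1
        targets.append(tokens[i + window])    # the next token after the window
    return inputs, targets

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_seq = [3, 1, 4, 1, 5, 9, 2, 6]
toy_x, toy_y = make_windows(toy_seq, window=3)
toy_y_onehot = [one_hot(tok, num_classes=10) for tok in toy_y]
print(toy_x[0], toy_y[0])
print(toy_y_onehot[0])
```
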


## Building model with Embed, LSTM and Dense layers

In [0]:
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense, CuDNNGRU, TimeDistributed, CuDNNLSTM, Dropout

In [13]:
tf.reset_default_graph()

#Keras Sequential model
model = Sequential()

#Add an Embedding layer that maps each character index to a trainable 8-dimensional vector.
#During training, characters that behave similarly tend to end up closer to each other in this space
model.add(Embedding(vocab_size, 8, input_length=samp_per_batch, batch_input_shape=(batch_size, samp_per_batch)))

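#Add a GPU-based (CuDNN) GRU layer with 256 units (a GRU is used here in place of an LSTM)
#stateful is set to True so that the layer carries its state over from one batch to the next
#return_sequences is set to True so that the next recurrent layer receives the full sequence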
model.add(CuDNNGRU(units=256, return_sequences=True, stateful=True))
model.add(Dropout(0.2))

#Add a second GPU-based (CuDNN) GRU layer with 256 units
model.add(CuDNNGRU(units=256, return_sequences=True, stateful=True))
model.add(Dropout(0.2))

#Add a third GPU-based (CuDNN) GRU layer with 256 units; only its final output is passed on
model.add(CuDNNGRU(units=256, stateful=True))
model.add(Dropout(0.2))

model.add(Dense(units=vocab_size, activation='softmax', kernel_initializer='he_normal'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

W0717 21:55:57.559721 139893707749248 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0717 21:55:57.577175 139893707749248 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0717 21:55:57.579714 139893707749248 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0717 21:55:58.472845 139893707749248 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (32, 128, 8)              392       
_________________________________________________________________
cu_dnngru_1 (CuDNNGRU)       (32, 128, 256)            204288    
_________________________________________________________________
dropout_1 (Dropout)          (32, 128, 256)            0         
_________________________________________________________________
cu_dnngru_2 (CuDNNGRU)       (32, 128, 256)            394752    
_________________________________________________________________
dropout_2 (Dropout)          (32, 128, 256)            0         
_________________________________________________________________
cu_dnngru_3 (CuDNNGRU)       (32, 256)                 394752    
_________________________________________________________________
dropout_3 (Dropout)          (32, 256)                 0         
__________
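
The parameter counts in the rows shown above can be cross-checked by hand. This is a small sketch, assuming the usual CuDNN GRU layout of three gates, each with an input kernel, a recurrent kernel and two bias vectors; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def cudnn_gru_params(input_dim, units):
    """3 gates, each with an input kernel, a recurrent kernel and two bias vectors."""
    return 3 * (input_dim * units + units * units + 2 * units)

vocab_size, embed_dim, units = 49, 8, 256
emb = embedding_params(vocab_size, embed_dim)   # embedding_1
gru_first = cudnn_gru_params(embed_dim, units)  # cu_dnngru_1 (input is the 8-dim embedding)
gru_next = cudnn_gru_params(units, units)       # cu_dnngru_2 and cu_dnngru_3 (input is 256-dim)
print(emb, gru_first, gru_next)
```
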

## Training the model and checking the result

In [14]:
#Fit the model on the training inputs and their next-character targets
history = model.fit(x_inp, y_op, batch_size=batch_size, epochs=10)

W0717 21:55:59.117868 139893707749248 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [0]:
#Helper function to generate text from the model, seeded with the first samp_per_batch tokens of inp_arr
def generate_text(pred_text_len=500, inp_arr=()):
  pred_text = np.array([])
  inp = np.array(inp_arr[:samp_per_batch]).reshape(1, samp_per_batch)
  #The model expects a fixed batch size, so pad the batch with all-zero rows; only row 0 is used
  stuff = np.zeros(shape=(batch_size-1, samp_per_batch))  
  inp = np.append(inp, stuff, axis=0)
  
  for i in range(pred_text_len):
    pred = model.predict(inp)
    #Greedy decoding: append the most likely next character and drop the oldest one
    inp[0] = np.append(inp[0], np.argmax(pred[0]))[1:].reshape(1, samp_per_batch) 
    pred_text = np.append(pred_text, inp[0][-1]) 
    
    
  tx1 = t.sequences_to_texts(np.array(inp_arr[:samp_per_batch]).reshape(-1, 1))
  tx1 = ''.join(tx1)

  tx = t.sequences_to_texts(pred_text.reshape(-1,1))
  tx = ''.join(tx)

  tx2 = t.sequences_to_texts(np.array(inp_arr[samp_per_batch:samp_per_batch+pred_text_len]).reshape(-1,1))
  tx2 = ''.join(tx2)

  print ('----- Input text below -----  ', '\n', tx1, '\n ----- Input text end -----\n\n')
  print ('----- Expected text below -----  ', '\n', tx2, '\n ----- Expected text end -----\n\n')
  print ('----- Predicted text below ----- ', '\n', tx, '\n ----- Predicted text end -----\n\n')
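
The generation loop in `generate_text` is a greedy, rolling-window decoder. The sketch below isolates that idea without Keras: `predict_next` is a hypothetical callable standing in for `model.predict`, and `toy_predictor` is a made-up model used only to make the example runnable.

```python
def greedy_generate(predict_next, seed_tokens, n_steps):
    """Generate n_steps tokens by repeatedly predicting from a fixed-size window.

    predict_next: hypothetical callable taking a list of tokens and returning a
    list of scores, one per vocabulary entry (stands in for model.predict).
    """
    window = list(seed_tokens)
    generated = []
    for _ in range(n_steps):
        scores = predict_next(window)
        next_token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(next_token)
        window = window[1:] + [next_token]  # drop the oldest token, append the new one
    return generated

def toy_predictor(window, vocab_size=5):
    """Toy stand-in model: always favours (last token + 1) modulo vocab_size."""
    scores = [0.0] * vocab_size
    scores[(window[-1] + 1) % vocab_size] = 1.0
    return scores

toy_generated = greedy_generate(toy_predictor, seed_tokens=[0, 1, 2], n_steps=6)
print(toy_generated)
```

Because greedy decoding always picks the single most likely token, it is deterministic for a given seed and can settle into repeating patterns once the window starts to contain its own output.
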

In [16]:
generate_text(500, np.array(tokens[1]))

----- Input text below -----   
 us charm. he did not venture again until he
had found it, and by that time the other boys were tired and ready to
rest. they gra 
 ----- Input text end -----


----- Expected text below -----   
 dually wandered apart, dropped into the “dumps,” and
fell to gazing longingly across the wide river to where the village lay
drowsing in the sun. tom found himself writing “becky” in the sand with
his big toe; he scratched it out, and was angry with himself for his
weakness. but he wrote it again, nevertheless; he could not help it. he
erased it once more and then took himself out of temptation by driving
the other boys together and joining them.

but joe's spirits had gone down almost beyond re 
 ----- Expected text end -----


----- Predicted text below -----  
  adan o eee o eee o eee o eee o eee o eee o eee o eee o eeon o o o o o o o o o o o o o o o o o o o o o o o o o o o
o oe oe hmr o eee o eee o eee o eee o eee o eee o eee o eee o eee o eee o eee o eee o

In [17]:
generate_text(500, np.array(tokens[2]))

----- Input text below -----   
 s between ache the harder.

becky thatcher was gone to her constantinople home to stay with her
parents during vacation--so ther 
 ----- Input text end -----


----- Expected text below -----   
 e was no bright side to life anywhere.

the dreadful secret of the murder was a chronic misery. it was a very
cancer for permanency and pain.

then came the measles.

during two long weeks tom lay a prisoner, dead to the world and its
happenings. he was very ill, he was interested in nothing. when he got
upon his feet at last and moved feebly downtown, a melancholy change had
come over everything and every creature. there had been a “revival,” and
everybody had “got religion,” not only the adult 
 ----- Expected text end -----


----- Predicted text below -----  
  ao hsr oe o o o eeodtn o o o o o o o o o o o o oe oe oe oe oe oe oe oe oe oe oe oe oe oe oe oe
hset o o o oe hmr o eee o eee o eee o eee o eee o eee o eee o eee o eee o eee o eeon o o o o o o o o o o

In [18]:
generate_text(500, np.array(tokens[3]))

----- Input text below -----   
 d wretchedness rise superior to fears in the long run.
another tedious wait at the spring and another long sleep brought
changes 
 ----- Input text end -----


----- Expected text below -----   
 . the children awoke tortured with a raging hunger. tom believed
that it must be wednesday or thursday or even friday or saturday, now,
and that the search had been given over. he proposed to explore another
passage. he felt willing to risk injun joe and all other terrors. but
becky was very weak. she had sunk into a dreary apathy and would not be
roused. she said she would wait, now, where she was, and die--it would
not be long. she told tom to go with the kite-line and explore if he
chose; but 
 ----- Expected text end -----


----- Predicted text below -----  
  frm hupe ntee o eeon o o o eee o eee o eee o eee o eee o eee o eee o eee o eee o eeon o o o o o o o o o o o o o o o o o o
o oe oe hmr o eee o eee o eee o eee o eee o eee o eee o eee o eee o eee o eee

In [19]:
generate_text(500, np.array(tokens[4]))

----- Input text below -----   
 s of his great adventure, he
noticed that they seemed curiously subdued and far away--somewhat as if
they had happened in anothe 
 ----- Input text end -----


----- Expected text below -----   
 r world, or in a time long gone by. then it
occurred to him that the great adventure itself must be a dream! there
was one very strong argument in favor of this idea--namely, that the
quantity of coin he had seen was too vast to be real. he had never seen
as much as fifty dollars in one mass before, and he was like all boys of
his age and station in life, in that he imagined that all references to
“hundreds” and “thousands” were mere fanciful forms of speech, and that
no such sums really existed 
 ----- Expected text end -----


----- Predicted text below -----  
  ao hs o oe o eee o eeon o o o o o o o eee o eee o eee o eee o eee o
eeon o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o oe oe hmr o eee o eee o eee o eeon o o o o o o