# Text generation using an RNN

Given a sequence of words from this data, train a model to predict the next word in the sequence. Longer sequences of text can be generated by calling the model repeatedly.

**Mount your Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import Keras and other libraries

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras

## Download data
Reference: The data is collected from http://www.gutenberg.org

For this lab, you can load the dataset provided by Great Learning.

### Load the Oscar Wilde dataset

Store all the ".txt" file names in a list

In [3]:
import os
files = []
for i in os.listdir('/content/drive/My Drive/Colab Notebooks/R9/data'):
    if i.endswith('.txt'):
        files.append(i)
files

['For Love of the King.txt',
 'Salomé A tragedy in one act.txt',
 'Impressions of America.txt',
 'The Canterville Ghost.txt',
 'A House of Pomegranates.txt',
 'Miscellaneous Aphorisms_ The Soul of Man.txt',
 'A Woman of No Importance a play.txt',
 'Essays and Lectures.txt',
 'The Happy Prince and other tales.txt',
 'Rose Leaf and Apple Leaf.txt',
 'Vera or, The Nihilists.txt',
 'Lord Arthur Savile_s Crime.txt',
 'Poems with the Ballad of Reading Gaol.txt',
 'Selected poems of oscar wilde including The Ballad of Reading Gaol.txt',
 'Charmides and Other Poems.txt',
 'An Ideal Husband.txt',
 'The Duchess of Padua.txt',
 'Oscar Wilde Miscellaneous.txt',
 'Shorter Prose Pieces.txt',
 'The Ballad of Reading Gaol.txt',
 'Children in Prison and Other Cruelties of Prison Life.txt',
 'Reviews.txt',
 'De Profundis.txt',
 'A Critic in Pall Mall.txt',
 'Miscellanies.txt',
 'The Importance of Being Earnest.txt',
 'Selected prose of oscar wilde with a Preface by Robert Ross.txt',
 'The Soul of Man.tx

### Read the data

Read contents of every file from the list and append the text in a new list

In [4]:
new_list = []
# Use a separate loop variable so the `files` list built above is not overwritten
for fname in os.listdir('/content/drive/My Drive/Colab Notebooks/R9/data'):
    if fname.endswith('.txt'):
        # Read the whole file as a single string
        with open(os.path.join('/content/drive/My Drive/Colab Notebooks/R9/data', fname), 'r') as f:
            text = f.read()
            new_list.append(text)

In [5]:
new_list = new_list[0:5]
new_list

['\ufeffThe Project Gutenberg eBook, For Love of the King, by Oscar Wilde\n\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\n\n\n\nTitle: For Love of the King\n       a Burmese Masque\n\n\nAuthor: Oscar Wilde\n\n\n\nRelease Date: October 28, 2007  [eBook #23229]\n\nLanguage: English\n\nCharacter set encoding: ISO-646-US (US-ASCII)\n\n\n***START OF THE PROJECT GUTENBERG EBOOK FOR LOVE OF THE KING***\n\n\n\n\nTranscribed from the [1922] Methuen and Co./Jarrold and Sons edition by\nDavid Price, email ccx074@pglaf.org\n\n\n\n\n\nFOR\nLOVE OF THE KING\n\n\nA BURMESE MASQUE\n\nBY\nOSCAR WILDE\n\nMETHUEN & CO. LTD.\n36 ESSEX STREET W.C.\nLONDON\n\n_First Published by Methuen & Co. Ltd. in 1922_\n\n_This Edition on handmade paper is limited to 1000 copies_\n\n\n\n\nINTRODUCTORY NOTE\n\n
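The raw text above begins with the Project Gutenberg license header, so the tokenizer will also learn words such as "project" and "gutenberg". As an optional step that is not part of the original notebook, the header and footer could be cut away before tokenizing. The sketch below is a minimal illustration: the function name is hypothetical, the start marker is the one visible in the output above, and the end marker is an assumption about how the files close.

```python
def strip_gutenberg_boilerplate(text,
                                start_marker="***START OF THE PROJECT GUTENBERG EBOOK",
                                end_marker="***END OF THE PROJECT GUTENBERG EBOOK"):
    """Keep only the text between the start and end marker lines.

    The end marker is an assumed format; if a marker is missing,
    that side of the text is left untouched.
    """
    start = text.find(start_marker)
    if start != -1:
        # Drop everything up to and including the marker line
        newline = text.find("\n", start)
        text = text[newline + 1:] if newline != -1 else ""
    end = text.find(end_marker)
    if end != -1:
        text = text[:end]
    return text.strip()

# Small made-up sample in the same layout as the file shown above
sample = ("License header\n"
          "***START OF THE PROJECT GUTENBERG EBOOK FOR LOVE OF THE KING***\n"
          "\nFOR\nLOVE OF THE KING\n\n"
          "***END OF THE PROJECT GUTENBERG EBOOK FOR LOVE OF THE KING***\n"
          "License footer")
cleaned = strip_gutenberg_boilerplate(sample)
print(repr(cleaned))
```

In the notebook this could be applied as `new_list = [strip_gutenberg_boilerplate(t) for t in new_list]` before fitting the tokenizer. Check the markers against the actual files first.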

## Process the text
Initialize and fit the tokenizer

In [6]:
token = tf.keras.preprocessing.text.Tokenizer()

token.fit_on_texts(new_list)

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping words to numbers, and another for numbers to words.

Convert text to sequence of numbers

In [7]:
text_to_number = token.texts_to_sequences(new_list)
number_to_text = dict((i,c) for c, i in token.word_index.items())
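To make the two lookup tables concrete, here is a minimal pure-Python sketch on a made-up two-line corpus. It does not use the Keras `Tokenizer`; the names `word_to_id` and `id_to_word` are illustrative only. Ids start at 1, as they do in `token.word_index`, where index 0 is reserved.

```python
# Toy corpus (hypothetical); the notebook uses the Oscar Wilde texts instead.
corpus = ["the happy prince", "the selfish giant"]

# Give each distinct word an integer id starting at 1, in order of first
# appearance. (The Keras Tokenizer orders ids by word frequency instead.)
word_to_id = {}
for line in corpus:
    for word in line.lower().split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1

# Reverse table: integer id -> word
id_to_word = {i: w for w, i in word_to_id.items()}

# Text -> numbers, and numbers -> text again
encoded = [[word_to_id[w] for w in line.lower().split()] for line in corpus]
decoded = [" ".join(id_to_word[i] for i in seq) for seq in encoded]

print(encoded)
print(decoded)
```

Encoding followed by decoding returns the original lines, which is a quick sanity check that the two tables are inverses of each other.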

Get the count of each word, and also the number of distinct words in the vocabulary.

In [8]:
print('Word Count for each word:','\n')
token.word_counts

Word Count for each word: 



OrderedDict([('\ufeffthe', 5),
             ('project', 435),
             ('gutenberg', 487),
             ('ebook', 54),
             ('for', 617),
             ('love', 61),
             ('of', 2832),
             ('the', 5629),
             ('king', 111),
             ('by', 359),
             ('oscar', 33),
             ('wilde', 60),
             ('this', 451),
             ('is', 874),
             ('use', 77),
             ('anyone', 25),
             ('anywhere', 16),
             ('at', 525),
             ('no', 257),
             ('cost', 16),
             ('and', 3365),
             ('with', 902),
             ('almost', 26),
             ('restrictions', 10),
             ('whatsoever', 16),
             ('you', 682),
             ('may', 146),
             ('copy', 61),
             ('it', 843),
             ('give', 92),
             ('away', 93),
             ('or', 473),
             ('re', 10),
             ('under', 59),
             ('terms', 107),
             ('li

In [9]:
print('Total Number of words:','\n')
len(token.word_counts)

Total Number of words: 



7651

In [10]:
vocab_size = len(token.word_index)
print('Number of unique words: ', vocab_size)

Number of unique words:  7651


### Generate Features and Labels

In [11]:
sequence_len = 40
x_seq = []
y_seq = []
for i in range(0,len(text_to_number)):
  for j in range(0, len(text_to_number[i])-sequence_len):
    x = text_to_number[i][j: j + sequence_len]
    y = text_to_number[i][j+sequence_len]
    x_seq.append(x)
    y_seq.append(y)  
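The loop above slides a fixed-size window over each book: every window of `sequence_len` word ids becomes one feature row, and the word id right after the window becomes its label. The same idea on a short made-up sequence with a window of 3 (the helper name `make_windows` is illustrative):

```python
def make_windows(tokens, window):
    """Return (inputs, labels): each input is `window` consecutive tokens,
    and its label is the token that immediately follows."""
    inputs, labels = [], []
    for j in range(len(tokens) - window):
        inputs.append(tokens[j:j + window])
        labels.append(tokens[j + window])
    return inputs, labels

tokens = [10, 11, 12, 13, 14, 15]   # hypothetical encoded text
inputs, labels = make_windows(tokens, window=3)
for x, y in zip(inputs, labels):
    print(x, "->", y)
```

A sequence of n tokens yields n - window pairs, and neighbouring windows overlap in all but one position.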

In [12]:
print(len(x_seq))
print(x_seq[0])
print(len(y_seq))
print(y_seq[0])

83821
[1582, 26, 21, 207, 17, 174, 3, 1, 88, 36, 319, 179, 24, 207, 11, 17, 1, 132, 3, 396, 606, 19, 42, 607, 2, 9, 385, 42, 928, 608, 15, 72, 175, 12, 106, 12, 102, 23, 929, 132]
83821
12


### The prediction task

Given a sequence of words, what is the most probable next word? This is the task we're training the model to perform. The input to the model is a sequence of `sequence_len` words, and the model is trained to predict the single word that follows it.

An RNN maintains an internal state that depends on the elements it has already seen, so the question at each step becomes: given all the words processed so far, what is the next word?

In [13]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(x_seq,y_seq, test_size=0.3, random_state=10)

In [14]:
print("Training Data:") # Served to the model in batches during each epoch
print(len(xtrain))
print(len(ytrain))
print("Test Data:") # Used later for model evaluation
print(len(xtest))
print(len(ytest))

Training Data:
58674
58674
Test Data:
25147
25147


In [15]:
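def get_test_sequence_fixedWidth(fixedLength):
  # Pick a random book; randint's upper bound is exclusive, so use the full length
  any_rand_book = np.random.randint(0,high=len(text_to_number))
  start_pos = np.random.randint(0, high=(len(text_to_number[any_rand_book]) - fixedLength))
  # Take the seed from that book so the returned book index and start position describe it
  test_seq_local = text_to_number[any_rand_book][start_pos : start_pos+fixedLength]
  return test_seq_local,any_rand_book,start_pos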

In [16]:
test_seq,book_index,starting_position = get_test_sequence_fixedWidth(sequence_len)
print(test_seq)

[44, 1671, 508, 80, 1393, 1, 104, 26, 21, 40, 127, 25, 979, 6, 206, 48, 83, 48, 48, 83, 553, 53, 16, 321, 5, 271, 17, 218, 4, 1672, 505, 674, 474, 23, 303, 44, 26, 21, 40, 67]


In [17]:
def get_text_from_sequence(sequenceSample):
  test_seq_text = ''
  for w in sequenceSample:
    test_seq_text += number_to_text[w]
    test_seq_text += " "
  return test_seq_text
print(get_text_from_sequence(test_seq))

any alternate format must include the full project gutenberg tm license as specified in paragraph 1 e 1 1 e 7 do not charge a fee for access to viewing displaying performing copying or distributing any project gutenberg tm works 


In [18]:
def predict_seq(epoch, logs):
    #Initialize predicted output
    predicted_output = ''
    #Predict the next 10 words
    current_seq = np.copy(test_seq)
    for i in range(10):
        current_seq_one_hot = tf.keras.utils.to_categorical(current_seq, num_classes=vocab_size+1)
        data_input = np.reshape(current_seq_one_hot,(1,
                                                     current_seq_one_hot.shape[0],
                                                     current_seq_one_hot.shape[1]))
        #Get the word int with maximum probability
        predicted_word_int = np.argmax(model.predict(data_input)[0])
        #Add to the predicted out, convert int to word
        predicted_output = predicted_output + " " + number_to_text[predicted_word_int]
        #Update seq with new value at the end
        current_seq = np.roll(current_seq, -1)
        current_seq[current_seq.shape[0]-1] = predicted_word_int
    print('\nOutput sequence is: ')
    print(predicted_output)
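The key step in `predict_seq` is the window update: after each prediction the oldest word id is dropped and the predicted id is appended, so the input keeps a constant length. A small NumPy sketch of just that step, with made-up ids standing in for the model's predictions (`slide_window` is an illustrative name):

```python
import numpy as np

def slide_window(window, new_token):
    """Drop the oldest token and append new_token, keeping the length fixed."""
    window = np.roll(window, -1)   # np.roll returns a shifted copy
    window[-1] = new_token         # overwrite the wrapped-around first element
    return window

seed = np.array([5, 8, 2, 9])
step1 = slide_window(seed, 7)      # pretend the model predicted 7
step2 = slide_window(step1, 3)     # then predicted 3
print(step1)
print(step2)
```

Because `np.roll` returns a new array, the original seed is left unchanged and can be reused at the end of every epoch.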

### Generate training and testing data

In [19]:
def batch_generator(batch_size=1000):
    #Index of the first training record in the current batch
    record_offset = 0
    while True:
      #Collect batch_size input sequences and their next-word labels
      input_data = []
      output_data = []
      for i in range(batch_size):
        input_data.append(xtrain[record_offset + i])
        output_data.append(ytrain[record_offset + i])
      record_offset = record_offset + batch_size # starting point for the next yield
      #Start over from the beginning when a full batch no longer fits
      if((record_offset + batch_size) >= len(xtrain)):
          record_offset = 0
      #Input data one hot encoding
      input_data = tf.keras.utils.to_categorical(input_data,num_classes=vocab_size+1)

      #Output data one hot encoding
      output_data = tf.keras.utils.to_categorical(output_data,num_classes=vocab_size+1)

      #Reshape input data into 3 dimensional numpy array
      #batch_size x sequence_length x vocab_size+1
      input_data = np.reshape(input_data,
                                (len(input_data),
                                 sequence_len,
                                 vocab_size+1))
      yield input_data, output_data
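One-hot encoding is the reason a generator is used here: each word id becomes a vector of length `vocab_size+1` (the extra slot is index 0, which the tokenizer never assigns). The NumPy sketch below mimics that encoding on toy sizes and then estimates the memory needed for one input batch at the notebook's sizes, assuming 4-byte floats. The `one_hot` helper is a stand-in for illustration, not the Keras function.

```python
import numpy as np

def one_hot(ids, num_classes):
    """One-hot encode an integer array; adds a trailing axis of size num_classes."""
    ids = np.asarray(ids)
    return np.eye(num_classes, dtype=np.float32)[ids]

# Toy sizes: 2 sequences of 3 word ids, with ids in 1..4 (0 is unused)
batch = [[1, 2, 3], [2, 4, 1]]
toy_vocab_size = 4
encoded = one_hot(batch, num_classes=toy_vocab_size + 1)
print(encoded.shape)   # batch x sequence length x (vocab size + 1)
print(encoded[0, 0])   # word id 1 -> a single 1.0 at position 1

# Rough size of one input batch at the notebook's settings, assuming float32
batch_size, sequence_len, vocab_size = 1000, 40, 7651
n_bytes = batch_size * sequence_len * (vocab_size + 1) * 4
print(round(n_bytes / 1e9, 2), "GB per input batch")
```

At more than a gigabyte per batch of 1000 sequences, encoding all training sequences at once would not fit in typical Colab memory, so batches are encoded only when they are needed.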

## Build The Model

Use `keras.Sequential` to define the model. For this simple example three layers are used to define our model:

* `keras.layers.LSTM`: The input layer. A type of RNN with 256 units that reads the one-hot encoded word sequence of shape `(sequence_len, vocab_size+1)` (You can also use a GRU layer here.)
* `keras.layers.Dropout`: Randomly zeroes 20% of the LSTM outputs during training to reduce overfitting.
* `keras.layers.Dense`: The output layer, with `vocab_size+1` softmax outputs, one per word index.

In [20]:
model = tf.keras.models.Sequential()

#LSTM
model.add(tf.keras.layers.LSTM(256, input_shape=(sequence_len,vocab_size+1)))

model.add(tf.keras.layers.Dropout(0.2))

model.add(tf.keras.layers.Dense(vocab_size+1, activation='softmax'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy')

For each input sequence the LSTM reads the one-hot word vectors one timestep at a time, and the dense layer applies a softmax to the final LSTM output to produce, for every word in the vocabulary, the probability that it is the next word.
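A quick sanity check on the loss reported during training: before the model has learned anything, its output is close to a uniform distribution over the `vocab_size+1` classes. Categorical cross-entropy is then roughly the natural log of the number of classes. The sketch below works this out with NumPy only. It is not the Keras loss implementation, and the label index is arbitrary.

```python
import numpy as np

def categorical_crossentropy(probs, true_index):
    """Cross-entropy for one example: -log of the probability given to the true class."""
    return -np.log(probs[true_index])

num_classes = 7651 + 1   # vocab_size + 1, the size of the model's output layer
uniform = np.full(num_classes, 1.0 / num_classes)

# With a uniform prediction the loss is the same for any true label
loss = categorical_crossentropy(uniform, true_index=12)
print(round(float(loss), 2))   # about 8.94
```

The first losses reported in the training log are around 8.94, which matches an untrained model guessing uniformly. Anything below that value means the model has started to pick up word frequencies.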

## Train the model

In [21]:
lambda_checkpoint = tf.keras.callbacks.LambdaCallback(on_epoch_end=predict_seq)
model_checkpoint = tf.keras.callbacks.ModelCheckpoint('text_rnn.h5',monitor='loss',save_best_only=True)
batch_size = 1000
train_generator = batch_generator(batch_size=batch_size)
model.fit(train_generator,epochs=5,steps_per_epoch = (len(xtrain)- sequence_len)// batch_size,callbacks=[model_checkpoint, lambda_checkpoint])

Epoch 1/5
 2/58 [>.............................] - ETA: 20s - loss: 8.9419
 3/58 [>.............................] - ETA: 40s - loss: 8.9410
 4/58 [=>............................] - ETA: 39s - loss: 8.9399
 5/58 [=>............................] - ETA: 44s - loss: 8.9386
 6/58 [==>...........................] - ETA: 42s - loss: 8.9367
 7/58 [==>...........................] - ETA: 44s - loss: 8.9330
 8/58 [===>..........................] - ETA: 44s - loss: 8.9260
 9/58 [===>..........................] - ETA: 44s - loss: 8.9060
10/58 [====>.........................] - ETA: 44s - loss: 8.8644
11/58 [====>.........................] - ETA: 43s - loss: 8.8080
12/58 [=====>........................] - ETA: 43s - loss: 8.7420
13/58 [=====>........................] - ETA: 42s - loss: 8.6722
Output sequence is: 
 the the the the the the the the the the
Epoch 2/5
 1/58 [..............................] - ETA: 0s - loss: 6.6859
 2/58 [>............................

<tensorflow.python.keras.callbacks.History at 0x7fdf3e93a0b8>

In [22]:
def get_actual_text(start,bookIndex,length):
  actual_seq_text = ''
  actual_seq_num =  text_to_number[bookIndex][start: start + length]
  for w in actual_seq_num:
    actual_seq_text += number_to_text[w] + " "
  return actual_seq_text
get_actual_text(starting_position + sequence_len,book_index,10)

'6 a french writer m joseph renaud recently described wilde’s '

### Save Model

In [23]:
saved = "text_pred_model.h5"
model.save(saved)

## If you have already trained the model and saved it, you can load a pretrained model

In [24]:
model_saved = tf.keras.models.load_model('text_pred_model.h5')

### Note: After loading the model, run model.fit() to continue training from there, if required.

In [25]:
# Model.fit accepts generators directly; fit_generator is deprecated
model_saved.fit(train_generator,
                epochs=5,
                steps_per_epoch = (len(xtrain)- sequence_len)// batch_size,
                callbacks=[model_checkpoint, lambda_checkpoint])

Epoch 1/5
 1/58 [..............................] - ETA: 0s - loss: 6.5134
 2/58 [>.............................] - ETA: 53s - loss: 6.4521
 3/58 [>.............................] - ETA: 54s - loss: 6.4527
 4/58 [=>............................] - ETA: 54s - loss: 6.5005
 5/58 [=>............................] - ETA: 53s - loss: 6.5236
 6/58 [==>...........................] - ETA: 50s - loss: 6.5076
 7/58 [==>...........................] - ETA: 50s - loss: 6.5287
 8/58 [===>..........................] - ETA: 48s - loss: 6.5083
 9/58 [===>..........................] - ETA: 47s - loss: 6.4992
10/58 [====>.........................] - ETA: 46s - loss: 6.4958
11/58 [====>.........................] - ETA: 46s - loss: 6.4907
12/58 [=====>........................] - ETA: 44s - loss: 6.4941
13/58 [=====>........................] - ETA: 43s - loss: 6.4892
Output sequence is: 
 the the the the 

<tensorflow.python.keras.callbacks.History at 0x7fded80a45f8>

## Evaluation

In [26]:
from sklearn.metrics import accuracy_score
def predict_from_model(mod,testFeature):
  pred_list =[]
  for j in range(len(testFeature)):
    test_seq_one_hot = tf.keras.utils.to_categorical(testFeature[j], num_classes=vocab_size+1)
    test_data_input = np.reshape(test_seq_one_hot,(1,test_seq_one_hot.shape[0],test_seq_one_hot.shape[1]))
    #Get the word int with maximum probability
    predicted_test_word_int = np.argmax(mod.predict(test_data_input))
    pred_list.append(predicted_test_word_int)
  return pred_list

pred_list2 = predict_from_model(model_saved,xtest[0:5000])
score2 = accuracy_score(ytest[0:5000],pred_list2)
print(score2)

0.0984
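An accuracy of about 0.098 is hard to judge in isolation. One possible reference point, which is not part of the original notebook, is a baseline that always predicts the most frequent label in the training set. The sketch below computes that baseline on made-up labels. The function name is illustrative, and in the notebook one would pass `ytrain` and `ytest[0:5000]` instead of the toy lists.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == most_common_label)
    return most_common_label, hits / len(test_labels)

# Hypothetical labels, for illustration only
train_labels = [1, 1, 2, 3, 1, 2]
test_labels = [1, 2, 1, 4]
label, acc = majority_baseline_accuracy(train_labels, test_labels)
print(label, acc)
```

If the model's accuracy is close to this baseline, it has mostly learned to output the most frequent word. That would be consistent with the repeated "the" in the generated sequences.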


## Generate text

In [27]:
gen_input_text = xtest[100]

def generate_text(input_text,output_length):
    
    #Initialize predicted output
    predicted_output = ''
 
    
    #lets predict next <output_length> words
    current_seq = np.copy(input_text)
    for i in range(output_length):
        current_seq_one_hot = tf.keras.utils.to_categorical(current_seq, num_classes=vocab_size+1)
        data_input = np.reshape(current_seq_one_hot,(1,
                                                     current_seq_one_hot.shape[0],
                                                     current_seq_one_hot.shape[1]))
        #Get the word int with maximum probability
        predicted_word_int = np.argmax(model_saved.predict(data_input))
        #Add to the predicted out, convert int to word
        predicted_output = predicted_output + " " + number_to_text[predicted_word_int]
        #Update seq with new value at the end
        current_seq = np.roll(current_seq, -1)
        current_seq[current_seq.shape[0]-1] = predicted_word_int
    return predicted_output

In [28]:
print(generate_text(gen_input_text,5))

 the and the the and
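`generate_text` always takes the `argmax`, so an under-trained model keeps emitting its most frequent words. A common alternative, swapped in here purely as an illustration and not used in the notebook, is to sample the next word from the predicted distribution, with a temperature parameter that controls how adventurous the sampling is. The sketch uses NumPy only, and the three-word probability vector is made up.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from probs after rescaling by temperature.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more random).
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()          # renormalize to a valid distribution
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.5, 0.3, 0.2]             # hypothetical model output for 3 words
rng = np.random.default_rng(0)
draws = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts / 1000)                # roughly follows probs

# A very low temperature behaves like argmax
greedy_like = sample_with_temperature(probs, temperature=1e-3, rng=rng)
print(greedy_like)
```

To try this in `generate_text`, the `np.argmax(...)` call could be replaced by `sample_with_temperature(model_saved.predict(data_input)[0], temperature=...)`. The result still depends on how well the model is trained.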


##### Copyright 2019 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [29]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.




