
# Generate Text With LSTM

**Objective:**
  * First, learn the sequence of characters from the book - *Alice's Adventures in Wonderland*.
  * Second, generate new text based on the learning.

## Develop model to learn sequence of characters

Use the sequence of characters from the book - *Alice's Adventures in Wonderland*. The text version of the book can be downloaded from **[Project Gutenberg](http://www.gutenberg.org/cache/epub/11/pg11.txt)**.



### Get Data

In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [0]:
 !cp '/content/gdrive/My Drive/App/EIP_Phase2/AliceInWonderland.txt' /content

In [0]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [0]:
filename = "/content/AliceInWonderland.txt"

In [0]:
!ls -l $filename

-rw------- 1 root root 147739 Jul 25 06:56 /content/AliceInWonderland.txt


In [0]:
# load ascii text and convert to lowercase
raw_text = open(filename).read()
raw_text = raw_text.lower()

In [0]:
raw_text[0:500]

"alice's adventures in wonderland\n\nlewis carroll\n\nthe millennium fulcrum edition 3.0\n\nchapter i. down the rabbit-hole\n\nalice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought alice 'without pictures or\nconversations?'\n\nso she was considering in her own mind (as well as she could, for the\nhot day "

### Remove Punctuation

In [0]:
import string

In [0]:
#?raw_text.translate
?str.maketrans

In [0]:
# remove all punctuation from the source text
# maketrans - builds a mapping table used by str.translate
#   when the third arg is present in maketrans(), those characters are mapped to
#   None, i.e. they are deleted from the result.
trans_text = raw_text.translate(str.maketrans('', '', string.punctuation))
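
As a small illustration of the cell above, the three-argument form of `str.maketrans` deletes every character listed in its third argument. The sample sentence is made up for this sketch and is not taken from the book.

```python
import string

# Hypothetical sample text, used only for this illustration.
sample = "it's a rabbit-hole, isn't it?"

# With three arguments, every character in the third one maps to None,
# so str.translate drops it from the result.
table = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(table)
print(cleaned)
```

Note that apostrophes and hyphens are removed too, which is why the notebook's text shows `alices` and `rabbithole`.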

In [0]:
trans_text[0:500]

'alices adventures in wonderland\n\nlewis carroll\n\nthe millennium fulcrum edition 30\n\nchapter i down the rabbithole\n\nalice was beginning to get very tired of sitting by her sister on the\nbank and of having nothing to do once or twice she had peeped into the\nbook her sister was reading but it had no pictures or conversations in\nit and what is the use of a book thought alice without pictures or\nconversations\n\nso she was considering in her own mind as well as she could for the\nhot day made her feel ve'

### Build Dictionaries for mapping input characters to integers

In [0]:
chars = sorted(list(set(trans_text)))  # get unique chars, so use set
print(chars)
print(len(chars))

['\n', ' ', '0', '3', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
30


In [0]:
# create mapping of unique chars to integers
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict( (i, c) for i, c in enumerate(chars))

In [0]:
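def get_string(arr_int):
  # map each integer code back to its character and join them into one string
  # (the result is named `text` so the built-in `str` is not shadowed)
  text = ''.join([int_to_char[c] for c in arr_int])
  return text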
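
The two dictionaries are inverses of each other, so encoding and then decoding returns the original text. A minimal sketch on a toy vocabulary (the `toy_*` names are hypothetical and only stand in for `chars`, `char_to_int` and `int_to_char`):

```python
# Toy vocabulary standing in for the notebook's `chars` list.
toy_chars = sorted(set("down the rabbithole"))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

# Encode a word to integer codes, then decode it again.
encoded = [toy_char_to_int[c] for c in "rabbit"]
decoded = ''.join(toy_int_to_char[i] for i in encoded)
print(encoded, decoded)
```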

In [0]:
# summarize the loaded data
n_chars = len(trans_text)
n_vocab = len(chars)
print ("Total Characters: ", n_chars)
print ("Total Vocab: ", n_vocab)

Total Characters:  136085
Total Vocab:  30


### Build dataset

To build the dataset:
  * To mimic variable-length input, use a **sliding window of variable length**.
  * Because the windows differ in length, pad them to a common length with `pad_sequences`.
  * Padding can go at the front or the back (`pre`/`post`); experiment with both.
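
To make the difference between front and back padding concrete, here is a pure-NumPy sketch. `pad_to_length` is a hypothetical helper written for this illustration; it only approximates what Keras `pad_sequences` does with `padding='pre'` and `padding='post'`, and it is not the Keras implementation.

```python
import numpy as np

def pad_to_length(seqs, maxlen, padding='pre', value=0):
    """Hypothetical helper: pad integer sequences to a common length."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for row, seq in enumerate(seqs):
        seq = list(seq)[-maxlen:]          # keep the most recent items if too long
        if not seq:
            continue
        if padding == 'pre':
            out[row, -len(seq):] = seq     # pad values in front, data at the end
        else:
            out[row, :len(seq)] = seq      # data first, pad values at the end
    return out

windows = [[5, 6, 7], [8, 9]]
pre = pad_to_length(windows, 4, padding='pre')
post = pad_to_length(windows, 4, padding='post')
print(pre)
print(post)
```

With front padding the last real character always sits right before the character to be predicted, which is one reason to try `pre` first for next-character prediction.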

In [0]:
# let's start with fixed length
seq_length = 100
dataX = []
dataY = []

In [0]:
''' 
for i in range(0, n_chars - seq_length, 1):
  # get the input and expected output (next char)
  seq_in = trans_text[i:i+seq_length]
  seq_out = trans_text[i+seq_length]
  
  # gen. dataset
  # note instead of chars, it has to be integers, use populated dictionary.
  dataX.append([char_to_int[c] for c in seq_in])
  dataY.append(char_to_int[seq_out])
  
'''

### Use variable sliding window to mimic variable length input

In [0]:
# use variable sliding window to mimic variable length input
total = 0
cnt = 0
max_len = n_chars - seq_length
while (cnt < max_len):
  # pick a random window length from 80 thru 99 (randint's upper bound is exclusive)
  num = numpy.random.randint(80,100)
  #print('num:', num)
  if ( cnt + num > max_len ):
    break
    
  # create the training input and output
  seq_in = trans_text[cnt:cnt+num]
  seq_out = trans_text[cnt+num]
  
  #gen dataset
  # note: input should be int instead of chars, use dictionary
  dataX.append([char_to_int[c] for c in seq_in])
  dataY.append(char_to_int[seq_out])
  
  total += num
  #print(total)
  cnt += 1
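
The same idea on a toy scale: each window starts one character after the previous one, has a random length, and is paired with the character that follows it. The sentence, the length range and the fixed random seed below are assumptions chosen only to keep the sketch short and repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
toy_text = "alice was beginning to get very tired"
min_len, max_len_excl = 3, 6     # window lengths 3, 4 or 5 (upper bound exclusive)

pairs = []
for start in range(len(toy_text) - max_len_excl):
    n = rng.randint(min_len, max_len_excl)
    # input window and the single character that follows it
    pairs.append((toy_text[start:start + n], toy_text[start + n]))

for window, target in pairs[:3]:
    print(repr(window), '->', repr(target))
```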
  

In [60]:
n_patterns = len(dataX)
print(n_patterns)
len(dataX) == len(dataY)

135896


True

In [61]:
print(get_string(dataX[5]))

s adventures in wonderland

lewis carroll

the millennium fulcrum edition 30

chapter


In [62]:
print(get_string(dataX[6]))

 adventures in wonderland

lewis carroll

the millennium fulcrum edition 30

chapter i d


In [63]:
print(int_to_char[dataY[5]])

 


In [64]:
print(get_string(dataX[100]))

e rabbithole

alice was beginning to get very tired of sitting by her sister on the



In [65]:
print(int_to_char[dataY[100]])

b


### Handle variable length input using pad_sequences

In [0]:
from keras.preprocessing.sequence import pad_sequences

In [0]:
?pad_sequences

In [0]:
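# pad at the front (pad_sequences' default, padding='pre') because we predict the next character,
# so the most recent characters should sit right before the target stored in dataY
# note: the pad value 0 is also the integer code of '\n' in char_to_int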
dataX_pad = pad_sequences(dataX, maxlen=100)

### Prepare the data for LSTM

* LSTM expects the data in [samples, time_steps, features]
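
A minimal sketch of that layout on a toy array (the values are made up): two padded windows of four integer codes each become a tensor with two samples, four time steps and one feature per step.

```python
import numpy as np

# Toy batch: 2 padded windows of 4 integer codes each.
padded = np.array([[0, 5, 6, 7],
                   [0, 0, 8, 9]])

# One feature (the character code) per time step.
x = padded.reshape((padded.shape[0], padded.shape[1], 1))
print(x.shape)
```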

In [0]:
?numpy.reshape

In [0]:
# arguments: data, new shape = (num_samples = len(dataX_pad), time_steps = maxlen = 100, num_features = 1)
trainX = numpy.reshape(dataX_pad, (len(dataX_pad), 100, 1))

In [69]:
trainX.shape

(135896, 100, 1)

In [0]:
# normalize the data: scale the integer codes into the range 0 to 1, which suits the LSTM's saturating (sigmoid/tanh) activations
trainX = trainX / float(n_vocab)

In [0]:
#trainX[6]

In [0]:
# one hot encode the output variable
trainY = np_utils.to_categorical(dataY)
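
For intuition, one-hot encoding can be reproduced with plain NumPy by indexing an identity matrix. This is only a sketch of the idea, not how `to_categorical` is implemented; `n_classes` and `labels` are made-up values.

```python
import numpy as np

n_classes = 5                      # hypothetical vocabulary size
labels = np.array([2, 0, 4])       # integer codes of three "next characters"

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes, dtype='float32')[labels]
print(one_hot)
```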

In [72]:
print(len(trainY))
print(len(trainY[0]))
print(trainY[0])

135896
30
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]


In [73]:
trainX.shape

(135896, 100, 1)

In [74]:
trainY.shape

(135896, 30)

### Build Model using LSTM

In [0]:
?LSTM

In [75]:
# define the LSTM model
model = Sequential()
# dropout 10 percent of inputs to LSTM
model.add(LSTM(256, input_shape=(trainX.shape[1], trainX.shape[2]), return_sequences=True, dropout=0.1))
#model.add(Dropout(0.1))
model.add(LSTM(256))
#model.add(Dropout(0.2))
model.add(Dense(trainY.shape[1], activation='softmax'))


In [76]:
model.compile(loss='categorical_crossentropy', optimizer='adam')




In [77]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 100, 256)          264192    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_1 (Dense)              (None, 30)                7710      
=================================================================
Total params: 797,214
Trainable params: 797,214
Non-trainable params: 0
_________________________________________________________________
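
The parameter counts in the summary can be checked by hand. An LSTM layer has four weight sets (input gate, forget gate, cell candidate, output gate), each with an input kernel, a recurrent kernel and a bias. The helper names below are hypothetical and exist only for this check.

```python
def lstm_params(input_dim, units):
    # 4 weight sets, each: input kernel + recurrent kernel + bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix + bias
    return input_dim * units + units

counts = [lstm_params(1, 256), lstm_params(256, 256), dense_params(256, 30)]
print(counts, sum(counts))
```

The three values match the `Param #` column above, and their sum matches `Total params`.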


In [0]:
# define the checkpoint
filepath="best_weights_vl_v4.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [79]:
model.fit(trainX, trainY, epochs=20, batch_size=512, callbacks=callbacks_list)



Epoch 1/20

Epoch 00001: loss improved from inf to 2.88584, saving model to best_weights_vl_v4.hdf5
Epoch 2/20

Epoch 00002: loss improved from 2.88584 to 2.76416, saving model to best_weights_vl_v4.hdf5
Epoch 3/20

Epoch 00003: loss improved from 2.76416 to 2.55015, saving model to best_weights_vl_v4.hdf5
Epoch 4/20

Epoch 00004: loss improved from 2.55015 to 2.38302, saving model to best_weights_vl_v4.hdf5
Epoch 5/20

Epoch 00005: loss improved from 2.38302 to 2.25508, saving model to best_weights_vl_v4.hdf5
Epoch 6/20

Epoch 00006: loss improved from 2.25508 to 2.14821, saving model to best_weights_vl_v4.hdf5
Epoch 7/20

Epoch 00007: loss improved from 2.14821 to 2.05899, saving model to best_weights_vl_v4.hdf5
Epoch 8/20

Epoch 00008: loss improved from 2.05899 to 1.97967, saving model to best_weights_vl_v4.hdf5
Epoch 9/20

Epoch 00009: loss improved from 1.97967 to 1.91061, saving model to best_weights_vl_v4.hdf5
Epoch 10/20

Epoch 00010: loss improved from 1.91061 to 1.84838, sav

<keras.callbacks.History at 0x7f20a29a77f0>

In [0]:
!cp /content/best_weights_vl_v4.hdf5 '/content/gdrive/My Drive/App/EIP_Phase2/lstm_vl_v4.hdf5'

### Train 30 more epochs

In [0]:
# define the checkpoint
filepath2="/content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5"
checkpoint2 = ModelCheckpoint(filepath2, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list2 = [checkpoint2]

In [0]:
?model.fit

In [92]:
H2 = model.fit(trainX, trainY, epochs=50, initial_epoch = 20, batch_size=512, callbacks=callbacks_list2)

Epoch 21/50

Epoch 00021: loss improved from inf to 1.17898, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5
Epoch 22/50

Epoch 00022: loss improved from 1.17898 to 1.13363, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5
Epoch 23/50

Epoch 00023: loss improved from 1.13363 to 1.08777, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5
Epoch 24/50

Epoch 00024: loss improved from 1.08777 to 1.04423, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5
Epoch 25/50

Epoch 00025: loss improved from 1.04423 to 1.00513, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5
Epoch 26/50

Epoch 00026: loss improved from 1.00513 to 0.96385, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t2_plus.hdf5
Epoch 27/50

Epoch 00027: loss improved from 0.96385 to 0.92561, saving model to /cont

### Train some more - Epochs 51 thru 80

In [0]:
# define the checkpoint
filepath3="/content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5"
checkpoint3 = ModelCheckpoint(filepath3, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list3 = [checkpoint3]

In [99]:
H3 = model.fit(trainX, trainY, epochs=80, initial_epoch=50, batch_size=512, callbacks=callbacks_list3)

Epoch 51/80

Epoch 00051: loss improved from inf to 0.37426, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5
Epoch 52/80

Epoch 00052: loss improved from 0.37426 to 0.36476, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5
Epoch 53/80

Epoch 00053: loss did not improve from 0.36476
Epoch 54/80

Epoch 00054: loss improved from 0.36476 to 0.34272, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5
Epoch 55/80

Epoch 00055: loss improved from 0.34272 to 0.33860, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5
Epoch 56/80

Epoch 00056: loss improved from 0.33860 to 0.33079, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5
Epoch 57/80

Epoch 00057: loss improved from 0.33079 to 0.32803, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t3.hdf5
Epoch 58/80

Epoch 00058: loss improved from 0.32803 to 

### Train some more - Epochs 81 thru 100

In [0]:
# define the checkpoint
filepath4="/content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t4.hdf5"
checkpoint4 = ModelCheckpoint(filepath4, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list4 = [checkpoint4]

In [101]:
H4 = model.fit(trainX, trainY, epochs=100, initial_epoch=80, batch_size=512, callbacks=callbacks_list4)

Epoch 81/100

Epoch 00081: loss improved from inf to 0.20215, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t4.hdf5
Epoch 82/100

Epoch 00082: loss did not improve from 0.20215
Epoch 83/100

Epoch 00083: loss did not improve from 0.20215
Epoch 84/100

Epoch 00084: loss did not improve from 0.20215
Epoch 85/100

Epoch 00085: loss did not improve from 0.20215
Epoch 86/100

Epoch 00086: loss did not improve from 0.20215
Epoch 87/100

Epoch 00087: loss did not improve from 0.20215
Epoch 88/100

Epoch 00088: loss improved from 0.20215 to 0.19148, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t4.hdf5
Epoch 89/100

Epoch 00089: loss did not improve from 0.19148
Epoch 90/100

Epoch 00090: loss did not improve from 0.19148
Epoch 91/100

Epoch 00091: loss improved from 0.19148 to 0.19119, saving model to /content/gdrive/My Drive/App/EIP_Phase2/lstm_weights_vl_v4_t4.hdf5
Epoch 92/100

Epoch 00092: loss did not improve from 0.19119
Epoch 93

## Generate new sequence of characters from the model

In [0]:
# load weights file
#filename = "best_weights.hdf5"
#model.load_weights(filename)
#model.compile(loss='categorical_crossentropy', optimizer='adam')

In [102]:
print(trainX.shape)
print(len(trainX))

(135896, 100, 1)
135896


In [0]:
import sys

In [0]:
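def sample_prediction(prediction):
  # probabilities for the single sample in the batch, converted to float64
  probs = numpy.asarray(prediction[0], dtype='float64')
  # renormalize: float32 softmax output may not sum to 1 closely enough for numpy.random.choice
  probs = probs / probs.sum()
  # draw one character index at random, weighted by the predicted probabilities
  return numpy.random.choice(len(probs), p=probs)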
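
The difference between greedy decoding (`argmax`) and sampling can be seen on a made-up probability vector. The numbers and the fixed seed are assumptions for this sketch only.

```python
import numpy as np

rng = np.random.RandomState(0)      # fixed seed so the sketch is repeatable
probs = np.array([0.1, 0.6, 0.3])   # made-up softmax output over 3 characters

# Greedy decoding always picks the most likely index.
greedy = int(np.argmax(probs))

# Sampling picks each index with a frequency close to its probability.
draws = rng.choice(len(probs), size=1000, p=probs)
freq = np.bincount(draws, minlength=len(probs)) / 1000.0
print(greedy, freq)
```

Greedy decoding is deterministic and can get stuck repeating itself, while sampling trades some accuracy for variety in the generated text.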

In [0]:
def get_next_500_chars_with_sample_prediction(seed_str):
  # seed_str is a sequence of integer character codes; copy it so the caller's data is untouched
  pattern = list(seed_str)
  result_str = ''
  for i in range(0,500):
    # for predict, seed the initial pattern
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
  
    #predict new char
    pred = model.predict(x, verbose=0)
  
    #index = numpy.argmax(pred)
    index = sample_prediction(pred)
    new_char = int_to_char[index]
  
    #output the new char
    sys.stdout.write(new_char)
    result_str += new_char
  
    #update the input sequence to the model by eliminating the first character and
    #appending the new char.
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
  
  return result_str


In [0]:
# generate next 500 characters
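def get_next_500_chars(seed_str):
  # seed_str is a sequence of integer character codes; copy it so that
  # pattern.append() below does not mutate the caller's list
  pattern = list(seed_str)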
  result_str = ''
  for i in range(0,500):
    # for predict, seed the initial pattern
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
  
    #predict new char
    pred = model.predict(x, verbose=0)
  
    index = numpy.argmax(pred)
    new_char = int_to_char[index]
  
    #output the new char
    sys.stdout.write(new_char)
    result_str += new_char
  
    #update the input sequence to the model by eliminating the first character and
    #appending the new char.
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
  
  return result_str
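
Both functions share the same window update: predict one code, append it, and drop the oldest code so the input length stays fixed. A toy sketch of that loop, where `fake_predict` is a hypothetical stand-in for `model.predict` followed by index selection:

```python
def fake_predict(window):
    # Stand-in for the model: "predicts" (last code + 1) modulo 5.
    return (window[-1] + 1) % 5

pattern = [0, 1, 2]          # seed window of integer codes
generated = []
for _ in range(4):
    nxt = fake_predict(pattern)
    generated.append(nxt)
    pattern = pattern[1:] + [nxt]   # drop the oldest code, append the newest
print(generated, pattern)
```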


In [103]:
# to predict, we need to seed the model with past history, in our case, text pattern.
# pick one text pattern at random from the corpus
seed = numpy.random.randint(0, len(trainX)-1)
pattern = list(dataX_pad[seed])

print(len(pattern))
print('Seed:', get_string(pattern))

100
Seed: 



hroom said the caterpillar just as if she had asked it
aloud and in another moment it was out of


In [104]:
# predict
res_str = get_next_500_chars(pattern)

 thg way 
teryargf of uhe whore celtr sas re like

htude herele of her on
the way alice teptle hir hear
ii i vaid tig dackeed the wo blice 
vole so the soor the hou tomet of the helert wert inr sarder it whe sighcoddeytryle 
shersesp of the hear of oert ard
the ltrker thall her in thaml tie whole pifelng sut it wurn thment forbidcr
the waid to her erceped to tuch where when anl herching to hersenf io ins bitaid bnice pood

whe qock turtle noatdh and nrsking
ang then a great murtyed tng fonv bprt

In [105]:
print('Seed:', get_string(dataX_pad[750]))

Seed: 







 think it so
very much out of the way to hear the rabbit say to itself oh dear
oh dear i sha


In [106]:
pattern2 = list(dataX_pad[750])
get_next_500_chars(pattern2)

ll oevs
alice repelbered sogecsry all hu anloe so oertonnn toeerhe
autty alice tone
uhme tog waid to herchnch
sror the wuoue dowl dowahes to lerpen thal and a little het oe whem oistle rebnly thamln oersronn anice hed ltog het wonce sff lntd
anice vepled puerened out 
theye waid th pef and hor soment alice cuppeutiog togrieed it arxious fownd thapled at lepprr and taid to the helt brneuuing

fagf of tame gld to the beardeslt far fow tome
th puce sffncod anice
and ooe fiit waid the whisg wases el

'll oevs\nalice repelbered sogecsry all hu anloe so oertonnn toeerhe\nautty alice tone\nuhme tog waid to herchnch\nsror the wuoue dowl dowahes to lerpen thal and a little het oe whem oistle rebnly thamln oersronn anice hed ltog het wonce sff lntd\nanice vepled puerened out \ntheye waid th pef and hor soment alice cuppeutiog togrieed it arxious fownd thapled at lepprr and taid to the helt brneuuing\n\nfagf of tame gld to the beardeslt far fow tome\nth puce sffncod anice\nand ooe fiit waid the whisg wases el'

In [110]:
pattern2 = list(dataX_pad[750])
get_next_500_chars_with_sample_prediction(pattern2)

ll oevs
alice reptle his waid the whilgd tuiee hardlpg togeply wald whe ooment
soease when tuch a heml oftsock cnd jeel io surtess

whats 
io rometwine tolest
sh soeacle
au again to peaveny 
goot gitier io a   wogacing onth tiami b good a cuckn someth cack ol the ltscfoet ehggiedt clice qut whe whi boom tuat ttt atd toment soead
cewserfrl tmment atd fombhd ruize eolking oo uhe sasty sww at sruse stam anice comsaler eottldirsates apd mersonns
anice qextoef dowi cnntercrlamistlng to her hftter thn

'll oevs\nalice reptle his waid the whilgd tuiee hardlpg togeply wald whe ooment\nsoease when tuch a heml oftsock cnd jeel io surtess\n\nwhats \nio rometwine tolest\nsh soeacle\nau again to peaveny \ngoot gitier io a   wogacing onth tiami b good a cuckn someth cack ol the ltscfoet ehggiedt clice qut whe whi boom tuat ttt atd toment soead\ncewserfrl tmment atd fombhd ruize eolking oo uhe sasty sww at sruse stam anice comsaler eottldirsates apd mersonns\nanice qextoef dowi cnntercrlamistlng to her hftter thn'