# Generating Dr. Seuss-style text with the 'gigantic' model

### Data:
Dr. Seuss books and poems

Char-Sequence Length = 100

### Model: 
3-layer LSTM, 700 hidden units per layer, dropout ratio = 0.2
The weights were initialised to those obtained from training on Alice in Wonderland for 20 epochs.

### Training:
15 epochs in total, batch size of 128

---

## An example of generated text:
The seed is provided in square brackets:
[today is your day.
you're off to great places!
you're off and away!

you have brains in your head.
you have feet in your shoes.]
you can learn about trees...
and bees...
and knees.
and knees on trees!
and bees on threes!

you can read about anchors.
and all about ants.
you can go by cow.

----

## Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import RNN
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
import os

Using TensorFlow backend.


## Loading of Data

In [2]:
total = ""
data = ""

In [3]:
directory = "seuss_texts/individual/"
files = []
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(directory + filename, "r") as myfile:
            data = myfile.read().lower()                   #replace('\n', ' ')
        files.append(filename)
        # append inside the if, so a non-.txt file cannot re-add the previous file's text
        total = total + data
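
The loading loop could also be written with `pathlib`. The following is only a sketch, not the notebook's code: `load_corpus` is a hypothetical helper, and both the sorted file order and the UTF-8 encoding are assumptions added to make the result reproducible.

```python
from pathlib import Path


def load_corpus(directory):
    """Concatenate every .txt file in `directory`, lower-cased (hypothetical helper)."""
    parts = []
    # sorted() fixes the file order; directory listings are otherwise unordered
    for path in sorted(Path(directory).glob("*.txt")):
        # encoding="utf-8" is an assumption about how the files were saved
        parts.append(path.read_text(encoding="utf-8").lower())
    return "".join(parts)
```

Calling `load_corpus("seuss_texts/individual/")` would then play the role of `total` above.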

In [4]:
text = total

In [6]:
print("Downloaded Dr Seuss data with {} characters.".format(len(text)))
print("FIRST 100 CHARACTERS: ")
print(text[:100])

Downloaded Dr Seuss data with 84917 characters.
FIRST 100 CHARACTERS: 
the sun did not shine.
it was too wet to play.
so we sat in the house
all that cold, cold, wet day.



## Pre-processing

### Remove any unwanted characters in Dr Seuss text that are not present in Alice in Wonderland text

In [7]:
orig_chars = sorted(list(set(text)))
orig_vocab_size = len(orig_chars)
print('Number of unique characters: ', orig_vocab_size)
print(orig_chars)

Number of unique characters:  60
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '8', '9', ':', ';', '?', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xad', ';', '—', '‘', '’', '“', '”', '…', '\ufeff', '�']


In [10]:
# characters present in Alice in Wonderland text
alice_chars = ['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
print(len(alice_chars))

42


In [11]:
unwanted_seuss_chars = []
for element in orig_chars:
    if element not in alice_chars:
        unwanted_seuss_chars.append(element)

print(len(unwanted_seuss_chars))
print(unwanted_seuss_chars)

20
['/', '0', '1', '2', '3', '4', '5', '6', '8', '9', '\xad', ';', '—', '‘', '’', '“', '”', '…', '\ufeff', '�']
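
The same check can be expressed as a set difference. This is a small sketch on toy data; `chars_not_in`, `allowed` and `sample` are illustrative names standing in for the notebook's variables.

```python
def chars_not_in(text, allowed):
    """Return the sorted characters of `text` that are missing from `allowed`."""
    return sorted(set(text) - set(allowed))


allowed = list("abc ")      # toy stand-in for alice_chars
sample = "a cab … 42"       # toy stand-in for the Seuss text
unwanted = chars_not_in(sample, allowed)
print(unwanted)
```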


In [12]:
# first replace some obvious ones - characters that already exist in the Alice set in a different form
# '\u037e' is the Greek question mark, which looks identical to ';'
new_seuss_text = text.replace('\u037e', ';')
new_seuss_text = new_seuss_text.replace('—', '-')
new_seuss_text = new_seuss_text.replace('‘', "'")
new_seuss_text = new_seuss_text.replace('’', "'")
new_seuss_text = new_seuss_text.replace('“', '"')
new_seuss_text = new_seuss_text.replace('”', '"')
new_seuss_text = new_seuss_text.replace('…', '...')
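
A chain of `replace` calls like this can also be collapsed into one pass with `str.translate`. The sketch below assumes the same seven substitutions. `ASCII_EQUIVALENTS` and `normalise_punctuation` are hypothetical names, and the characters are written as escapes so that look-alikes stay distinguishable.

```python
# Map each typographic character to its plain ASCII counterpart.
ASCII_EQUIVALENTS = str.maketrans({
    "\u037e": ";",    # Greek question mark, renders like a semicolon
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
})


def normalise_punctuation(text):
    """Apply all replacements in a single pass over the text."""
    return text.translate(ASCII_EQUIVALENTS)
```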

In [13]:
unwanted_seuss_chars2 = []
for element in sorted(list(set(new_seuss_text))):
    if element not in alice_chars:
        unwanted_seuss_chars2.append(element)

print(len(unwanted_seuss_chars2))
print(unwanted_seuss_chars2)

13
['/', '0', '1', '2', '3', '4', '5', '6', '8', '9', '\xad', '\ufeff', '�']


In [11]:
# replace remaining unwanted characters with "" using this function

def replaceMultiple(mainString, toBeReplaced, newString):
    # Replace every occurrence of each string in toBeReplaced with newString
    for elem in toBeReplaced:
        mainString = mainString.replace(elem, newString)
    return mainString

In [12]:
# Replace all the occurrences of characters not in alice characters by ""

new_seuss_text2 = replaceMultiple(new_seuss_text, unwanted_seuss_chars2 , "")

In [13]:
new_seuss_chars = sorted(list(set(new_seuss_text2)))
print(len(new_seuss_chars))

40


In [14]:
missing_seuss_chars = []
for element in alice_chars:
    if element not in new_seuss_chars:
        missing_seuss_chars.append(element)
print(missing_seuss_chars)
# the two square brackets don't occur in the Dr Seuss text - they stay in the character list because the Alice mapping is reused

['[', ']']


## Create character mappings consistent with Alice in Wonderland

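Create a mapping from each unique character to an integer, and the reverse mapping. The Alice in Wonderland character list is reused so that the indices stay consistent with the pre-trained weights.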

In [17]:
characters = alice_chars

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}
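
As a quick illustration of the two mappings, the sketch below rebuilds the 42-character list and round-trips a short string. `encode` and `decode` are hypothetical helpers that do not appear in the notebook.

```python
import string

# The 16 punctuation/whitespace characters followed by a-z, in the Alice order
characters = list("\n !\"'(),-.:;?[]_") + list(string.ascii_lowercase)

n_to_char = dict(enumerate(characters))
char_to_n = {char: n for n, char in n_to_char.items()}


def encode(text):
    """Turn a string into a list of integer indices."""
    return [char_to_n[char] for char in text]


def decode(indices):
    """Turn a list of integer indices back into a string."""
    return "".join(n_to_char[n] for n in indices)


encoded = encode("the cat")
print(encoded)
print(decode(encoded))
```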

Summarise the loaded data:

In [18]:
vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  42
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## Data pre-processing

In [65]:
text = new_seuss_text2

In [20]:
X = []   # extracted sequences
Y = []   # the target - the follow up character
length = len(text)
seq_length = 100

In [21]:
for i in range(0, length - seq_length, 1):
    sequence = text[i:i + seq_length]
    label = text[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))

Number of extracted sequences: 88822


Here, X holds the training sequences and Y holds the targets.

seq_length is the number of characters the model sees before predicting the next one.

The for loop slides a window over the entire text, storing each sequence in X and the character that follows it (its true value) in Y. An example makes this concrete:

For a sequence length of 4 and the text “hello india”, X and Y (shown as characters rather than numbers, for ease of understanding) would be:

|       X      |  Y  |
|:------------:|:---:|
| [h, e, l, l] | [o] |
| [e, l, l, o] | [ ] |
| [l, l, o,  ] | [i] |
| [l, o,  , i] | [n] |
| ...          | ... |
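
The table can be reproduced with a few lines of Python. This is a sketch of the same sliding window, kept as characters instead of integers; `make_windows` is a hypothetical helper name.

```python
def make_windows(text, seq_length):
    """Return every window of `seq_length` characters and the character after it."""
    X, Y = [], []
    for i in range(len(text) - seq_length):
        X.append(list(text[i:i + seq_length]))
        Y.append(text[i + seq_length])
    return X, Y


X_demo, Y_demo = make_windows("hello india", 4)
for window, target in zip(X_demo, Y_demo):
    print(window, "->", repr(target))
```

An 11-character text with a window of 4 yields 11 - 4 = 7 pairs, just as the loop above yields `length - seq_length` sequences.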


Keras LSTM layers expect input of shape (number_of_sequences, length_of_sequence, number_of_features), which is not the current format of X. We also need to transform the array Y into a one-hot encoded format.

In [22]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
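# num_classes keeps the width at len(characters) even if some character never occurs as a target
Y_modified = np_utils.to_categorical(Y, num_classes=len(characters))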

We first reshape the array X into the required dimensions. We then scale the values of X_modified into the range [0, 1), which usually helps the network train faster and makes it less likely to get stuck in a poor local minimum. Y_modified is one-hot encoded to remove any ordinal relationship that the character-to-integer mapping might suggest: ‘a’ is assigned a lower number than ‘z’, but that doesn’t signify any relationship between the two.
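
To make the shapes concrete, here is a sketch of the same three steps on two tiny sequences, using only NumPy. The names `X_small` and `Y_small` and the use of `np.eye` for one-hot encoding are illustrative choices, not the notebook's code.

```python
import numpy as np

vocab_size = 42
X_small = [[35, 23, 20], [23, 20, 1]]   # two sequences of three character indices
Y_small = [1, 35]                       # index of the character following each sequence

# Reshape to (number_of_sequences, length_of_sequence, number_of_features)
# and scale the indices into [0, 1)
X_arr = np.reshape(X_small, (len(X_small), 3, 1)) / float(vocab_size)

# One-hot encode the targets: row i has a single 1 in column Y_small[i]
Y_arr = np.eye(vocab_size)[Y_small]

print(X_arr.shape, Y_arr.shape)
```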

In [23]:
X_modified.shape, Y_modified.shape

((88822, 100, 1), (88822, 42))

Let's take a look at the first example:

In [24]:
print("X[0].shape = {}, Y[0].shape = {}".format(X_modified[0].shape, Y_modified[0].shape))
print("X[0]: ", X_modified[0])
print("Y[0]: ", Y_modified[0])

X[0].shape = (100, 1), Y[0].shape = (42,)
X[0]:  [[0.85714286]
 [0.73809524]
 [0.02380952]
 [0.73809524]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.73809524]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.57142857]
 [0.80952381]
 [0.02380952]
 [0.85714286]
 [0.73809524]
 [0.21428571]
 [0.        ]
 [0.42857143]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.73809524]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.73809524]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.57142857]
 [0.69047619]
 [0.02380952]
 [0.42857143]
 [0.85714286]
 [0.73809524]
 [0.21428571]
 [0.        ]
 [0.73809524]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.42857143]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.42857143]
 [0.85714286]
 [0.73809524]
 [0.02380952]
 [0.71428571]
 [0.69047619]
 [0.02380952]
 [0.73809524]
 [0.85714286]
 [0.73809524]
 [0.21428571]
 [0.        ]
 [0.66666667]
 [0.71428571]
 [0.85714286]
 [0.80952381]
 [0.47619048]
 [0.02380952]
 [0.54761905]
 [0.71428571]
 [0.85714286]
 [0.80952381]

## The 'gigantic' model

In [25]:
# all three LSTM layers changed from 400 to 700 units
model = Sequential()
model.add(LSTM(700, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(700, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(700))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))
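
To get a feel for how large this network is, the number of trainable weights can be estimated by hand. The sketch below assumes the standard LSTM formulation with bias terms (four gates, each with input, recurrent and bias weights) and a dense layer with bias; the helper names are hypothetical. Dropout layers add no weights.

```python
def lstm_params(units, input_dim):
    """Weights in one LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    """Weights in a fully connected layer with bias."""
    return inputs * outputs + outputs


layer_params = [
    lstm_params(700, 1),     # first LSTM sees one feature per time step
    lstm_params(700, 700),   # second LSTM sees the 700 outputs of the first
    lstm_params(700, 700),   # third LSTM
    dense_params(700, 42),   # softmax over the 42 characters
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Under these assumptions the total comes to just under ten million weights.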

Load model weights before compiling:

In [95]:
# load the network weights
filename = "model_weights/seuss-gigantic-improvement-ctd06-10-0.4032.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

Define the model checkpoint:

In [96]:
filepath="model_weights/seuss-gigantic-improvement-ctd06-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
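
Because the filepath pattern writes the epoch and the loss into each filename, the checkpoint with the lowest loss can be picked out later by parsing the names. This is only a sketch: `best_checkpoint` and `CHECKPOINT_PATTERN` are hypothetical, and the regular expression assumes the naming pattern shown above.

```python
import re

# Matches the trailing "-<epoch>-<loss>.hdf5" part of a checkpoint filename
CHECKPOINT_PATTERN = re.compile(r"-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the filename with the lowest loss in its name, or None if none match."""
    scored = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None
```

In practice the list of names could come from `os.listdir("model_weights")`.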

In [97]:
model.fit(X_modified, Y_modified, epochs=10, batch_size=128, callbacks = callbacks_list)

Epoch 1/10

Epoch 00001: loss improved from inf to 0.97901, saving model to model_weights/seuss-gigantic-improvement-ctd06-01-0.9790.hdf5
Epoch 2/10

Epoch 00002: loss improved from 0.97901 to 0.85824, saving model to model_weights/seuss-gigantic-improvement-ctd06-02-0.8582.hdf5
Epoch 3/10

Epoch 00003: loss improved from 0.85824 to 0.76934, saving model to model_weights/seuss-gigantic-improvement-ctd06-03-0.7693.hdf5
Epoch 4/10

Epoch 00004: loss improved from 0.76934 to 0.69781, saving model to model_weights/seuss-gigantic-improvement-ctd06-04-0.6978.hdf5
Epoch 5/10

Epoch 00005: loss improved from 0.69781 to 0.62485, saving model to model_weights/seuss-gigantic-improvement-ctd06-05-0.6249.hdf5
Epoch 6/10

Epoch 00006: loss improved from 0.62485 to 0.56581, saving model to model_weights/seuss-gigantic-improvement-ctd06-06-0.5658.hdf5
Epoch 7/10

Epoch 00007: loss improved from 0.56581 to 0.51999, saving model to model_weights/seuss-gigantic-improvement-ctd06-07-0.5200.hdf5
Epoch 8/10

<keras.callbacks.History at 0x7f7f48742390>

## Generating Text

In [115]:
#start = 70823 # 'the more you read...'
#start = 18057 # 'today is your day...'
start = np.random.randint(0, len(X)) # or generate random start (upper bound is exclusive)

string_mapped = list(X[start])

full_string = [n_to_char[value] for value in string_mapped]

print("Seed:")
print("\"", ''.join(full_string), "\"")

Seed:
"  night without stop
making gluppity-glupp. also schloppity-schlopp.
and what do you do with this lef "


In [116]:
# generating characters
for i in range(400):
    # scale the current window the same way as the training data
    x = np.reshape(string_mapped, (1, len(string_mapped), 1))
    x = x / float(len(characters))

    # greedy decoding: always take the most probable next character
    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])

    # slide the window forward by one character
    string_mapped.append(pred_index)
    string_mapped = string_mapped[1:]
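
The window mechanics of this loop can be tried without a trained network. In the sketch below, `toy_predict` is a made-up stand-in for `model.predict` over a 5-symbol vocabulary, and `generate` is a hypothetical wrapper around the same greedy loop.

```python
import numpy as np


def generate(predict, seed, n_chars, vocab_size):
    """Greedily extend `seed` (a list of indices) by `n_chars` predictions."""
    window = list(seed)
    generated = []
    for _ in range(n_chars):
        # same reshaping and scaling as for the training data
        x = np.reshape(window, (1, len(window), 1)) / float(vocab_size)
        next_index = int(np.argmax(predict(x)))
        generated.append(next_index)
        # drop the oldest index and append the new one
        window = window[1:] + [next_index]
    return generated


def toy_predict(x):
    """Stand-in for model.predict: always favours (last index + 1) mod 5."""
    last = int(round(x[0, -1, 0] * 5))
    probs = np.zeros((1, 5))
    probs[0, (last + 1) % 5] = 1.0
    return probs


result = generate(toy_predict, [0, 1, 2], 4, vocab_size=5)
print(result)
```

Since the most probable character is always chosen, the same seed always produces the same continuation.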

In [117]:
# combining text
txt = "".join(full_string)

In [118]:
print(start)
print(txt)

82613
 night without stop
making gluppity-glupp. also schloppity-schlopp.
and what do you do with this leftover goo...

when the star-belly sneetches had frankfurter roasts

or picnics or parties or marshmallow toasts,

they never invited the plain-belly sneetches.

they left them out cold, in the dark of the beaches.

they kept paying money. they kad she onast band those things
yete wacky things more!

then i looked up. and i saw three.

i went down the hall
and i said "hey!"
thre the grinch thought 
