# The baseline model to generate text from Alice in Wonderland

### Data:
Alice in Wonderland (full book)

Character sequence length = 100

### Model:
2-layer LSTM, 400 hidden units per layer, dropout rate = 0.2
Weights randomly initialised

### Training:
28 epochs in total, batch size of 128

---

## Importing Dependencies

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


## Loading of Data

In [2]:
# read the whole book and lower-case it to keep the vocabulary small
with open("wonderland.txt") as f:
    text = f.read().lower()

In [3]:
print("Loaded Alice in Wonderland data with {} characters.".format(len(text)))
print("FIRST 1000 CHARACTERS: ")
print(text[:1000])

Loaded Alice in Wonderland data with 143552 characters.
FIRST 1000 CHARACTERS: 
alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought alice 'without pictures or
conversations?'

so she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a white rabbit with pink eyes ran
close by her.

there was nothing so very remarkable in that; nor did alice think it so
very much out of the way to hear the rabbit say to itself, 'oh dear!
oh dear! i shall be late!' (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when 

## Create character mappings and data pre-processing

Create mapping of unique chars to integers, and a reverse mapping:

In [4]:
characters = sorted(list(set(text)))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}
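
A quick way to check that the two mappings are inverses of each other is a round trip on a toy string. This is a minimal sketch; the `toy_*` names, `encoded`, and `decoded` are illustrative and not part of the notebook:

```python
# Toy text standing in for the book; the mapping logic mirrors the cell above.
toy_text = "hello india"
toy_characters = sorted(set(toy_text))

toy_n_to_char = {n: char for n, char in enumerate(toy_characters)}
toy_char_to_n = {char: n for n, char in enumerate(toy_characters)}

# Encode to integers, then decode back to characters.
encoded = [toy_char_to_n[c] for c in toy_text]
decoded = "".join(toy_n_to_char[n] for n in encoded)

print(toy_characters)
print(encoded)
print(decoded)
```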

Summarise the loaded data:

In [5]:
vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  42
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Next, convert the text into the input/target format that the Keras model expects:

In [7]:
X = []   # extracted sequences
Y = []   # the target - the follow up character
length = len(text)
seq_length = 100   #number of characters to consider before predicting the following character

In [9]:
for i in range(0, length - seq_length, 1):
    sequence = text[i:i + seq_length]
    label = text[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))

Number of extracted sequences: 286904


Here, X holds the training sequences and Y holds the targets. Note that the recorded count of 286904 is exactly twice length - seq_length (143552 - 100 = 143452), which suggests the loop cell was run twice without re-initialising X and Y, so every sequence appears twice. Re-running the initialisation cell before the loop avoids this.

seq_length is the number of preceding characters the model sees before predicting the next one.

The for loop slides a window over the whole text, storing each window in X and the character that immediately follows it in Y (the "true value" the model should predict). An example makes this concrete.

For a sequence length of 4 and the text “hello india”, X and Y (shown as characters rather than integers for readability) would be:

|       X      |  Y  |
|:------------:|:---:|
| [h, e, l, l] | [o] |
| [e, l, l, o] | [ ] |
| [l, l, o,  ] | [i] |
| [l, o,  , i] | [n] |
| ...          | ... |
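
The table above can be reproduced with the same sliding-window loop applied to the toy text. This is a minimal sketch; the `toy_*` names are illustrative, and the characters are left unencoded as in the table:

```python
toy_text = "hello india"
toy_seq_length = 4

toy_X, toy_Y = [], []
for i in range(len(toy_text) - toy_seq_length):
    toy_X.append(list(toy_text[i:i + toy_seq_length]))   # the input window
    toy_Y.append(toy_text[i + toy_seq_length])            # the character that follows it

for window, target in zip(toy_X, toy_Y):
    print(window, "->", repr(target))
```

The loop yields `len(toy_text) - toy_seq_length` pairs, which is 7 for this toy text.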


Keras LSTM layers expect input of shape (number_of_sequences, length_of_sequence, number_of_features), which X does not yet have. The array Y also needs to be one-hot encoded.

In [10]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)

We first reshape X into the required dimensions. We then scale X_modified to the range [0, 1) by dividing by the vocabulary size, which generally helps gradient-based training converge. Y_modified is one-hot encoded so that the integer codes do not imply an ordering between characters: ‘a’ is assigned a lower number than ‘z’, but that doesn’t signify any relationship between the two.
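
The same three steps can be tried on a tiny hand-made example. This is a sketch under stated assumptions: the vocabulary size of 5 and the `toy_*` arrays are made up, and `np.eye` is used as a stand-in for `np_utils.to_categorical` so that the snippet depends only on NumPy:

```python
import numpy as np

toy_vocab_size = 5                  # hypothetical vocabulary size
toy_X = [[0, 1, 2], [1, 2, 3]]      # two integer-encoded sequences of length 3
toy_Y = [3, 4]                      # the character following each sequence

# Reshape to (number_of_sequences, length_of_sequence, number_of_features)
# and scale into [0, 1) by dividing by the vocabulary size.
toy_X_modified = np.reshape(toy_X, (len(toy_X), 3, 1)) / float(toy_vocab_size)

# One-hot encode the targets: row i of the identity matrix is the code for class i.
toy_Y_modified = np.eye(toy_vocab_size)[toy_Y]

print(toy_X_modified.shape, toy_Y_modified.shape)
print(toy_Y_modified[0])
```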

In [15]:
X_modified.shape, Y_modified.shape

((286904, 100, 1), (286904, 42))

Let's take a look at the first example:

In [16]:
print("X[0].shape = {}, Y[0].shape = {}".format(X_modified[0].shape, Y_modified[0].shape))
print("X[0]: ", X_modified[0])
print("Y[0]: ", Y_modified[0])

X[0].shape = (100, 1), Y[0].shape = (42,)
X[0]:  [[0.38095238]
 [0.64285714]
 [0.57142857]
 [0.42857143]
 [0.47619048]
 [0.02380952]
 [0.9047619 ]
 [0.38095238]
 [0.80952381]
 [0.02380952]
 [0.4047619 ]
 [0.47619048]
 [0.52380952]
 [0.57142857]
 [0.69047619]
 [0.69047619]
 [0.57142857]
 [0.69047619]
 [0.52380952]
 [0.02380952]
 [0.83333333]
 [0.71428571]
 [0.02380952]
 [0.52380952]
 [0.47619048]
 [0.83333333]
 [0.02380952]
 [0.88095238]
 [0.47619048]
 [0.78571429]
 [0.95238095]
 [0.02380952]
 [0.83333333]
 [0.57142857]
 [0.78571429]
 [0.47619048]
 [0.45238095]
 [0.02380952]
 [0.71428571]
 [0.5       ]
 [0.02380952]
 [0.80952381]
 [0.57142857]
 [0.83333333]
 [0.83333333]
 [0.57142857]
 [0.69047619]
 [0.52380952]
 [0.02380952]
 [0.4047619 ]
 [0.95238095]
 [0.02380952]
 [0.54761905]
 [0.47619048]
 [0.78571429]
 [0.02380952]
 [0.80952381]
 [0.57142857]
 [0.80952381]
 [0.83333333]
 [0.47619048]
 [0.78571429]
 [0.02380952]
 [0.71428571]
 [0.69047619]
 [0.02380952]
 [0.83333333]
 [0.54761905]

## A baseline model

In [17]:
model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))
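
As a rough size check of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; `lstm_params` is a helper introduced here for illustration only:

```python
def lstm_params(units, input_dim):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (units * input_dim + units * units + units)

layer1 = lstm_params(400, 1)      # first LSTM sees 1 feature per time step
layer2 = lstm_params(400, 400)    # second LSTM sees the 400 outputs of the first
dense = 400 * 42 + 42             # weights and biases of the softmax layer

total = layer1 + layer2 + dense   # Dropout layers add no parameters
print(layer1, layer2, dense, total)
```

If the assumption holds, the total should agree with the figure reported by `model.summary()`.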

Load the weights saved by a previous training run, then compile the model:

In [38]:
# load the network weights
filename = "model_weights/baseline-improvement-06-0.9927.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

Define the model checkpoint:

In [42]:
filepath="model_weights/baseline-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
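
The placeholders in `filepath` are filled in with ordinary Python string formatting, using the epoch number and the monitored metrics. A minimal sketch, using the loss value 0.99274 reported for epoch 6 in this notebook's training log (`saved_name` is an illustrative variable):

```python
filepath = "model_weights/baseline-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# {epoch:02d} zero-pads the epoch to two digits; {loss:.4f} keeps four decimals.
saved_name = filepath.format(epoch=6, loss=0.99274)
print(saved_name)
```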

In [43]:
model.fit(X_modified, Y_modified, epochs=6, batch_size=128, callbacks=callbacks_list)

Epoch 1/6

Epoch 00001: loss improved from inf to 1.23540, saving model to model_weights/baseline-improvement-01-1.2354.hdf5
Epoch 2/6

Epoch 00002: loss improved from 1.23540 to 1.16853, saving model to model_weights/baseline-improvement-02-1.1685.hdf5
Epoch 3/6

Epoch 00003: loss improved from 1.16853 to 1.11433, saving model to model_weights/baseline-improvement-03-1.1143.hdf5
Epoch 4/6

Epoch 00004: loss improved from 1.11433 to 1.06508, saving model to model_weights/baseline-improvement-04-1.0651.hdf5
Epoch 5/6

Epoch 00005: loss did not improve from 1.06508
Epoch 6/6

Epoch 00006: loss improved from 1.06508 to 0.99274, saving model to model_weights/baseline-improvement-06-0.9927.hdf5


<keras.callbacks.History at 0x7fe2a12337f0>

## Generating Text

In [None]:
start = np.random.randint(0, len(X))  # index of a random seed sequence (upper bound is exclusive)

string_mapped = list(X[start])

full_string = [n_to_char[value] for value in string_mapped]

print("Seed:")
print("\"", ''.join(full_string), "\"")

In [126]:
# generating characters
for i in range(400):
    # shape the current window as (1, seq_length, 1) and scale it as in training
    x = np.reshape(string_mapped, (1, len(string_mapped), 1))
    x = x / float(len(characters))

    # greedy decoding: pick the most probable next character
    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])

    string_mapped.append(pred_index)    # add the predicted character to the end
    string_mapped = string_mapped[1:]   # drop position 0 to slide the window forward
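
The window-shifting logic of the loop can be followed without a trained network. In the sketch below, `toy_predict` is a hypothetical stand-in for `model.predict` that always puts all probability on "last character + 1"; the vocabulary size and seed window are made up for illustration:

```python
import numpy as np

toy_vocab = 5          # hypothetical vocabulary size
window = [0, 1, 2]     # current seed window of integer-encoded characters
generated = []

def toy_predict(x):
    # Stand-in for model.predict; x has shape (1, len(window), 1).
    last = int(round(x[0, -1, 0] * toy_vocab))   # undo the scaling of the last entry
    probs = np.zeros((1, toy_vocab))
    probs[0, (last + 1) % toy_vocab] = 1.0
    return probs

for _ in range(4):
    x = np.reshape(window, (1, len(window), 1)) / float(toy_vocab)
    pred_index = int(np.argmax(toy_predict(x)))   # greedy choice
    generated.append(pred_index)
    window = window[1:] + [pred_index]            # slide the window by one

print(generated, window)
```

Because `np.argmax` always takes the single most probable character, the output is fully determined by the seed, which may explain why generated text tends to fall into repeated phrases.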

In [127]:
# combining text
txt = ''.join(full_string)
txt

"d such things that make children\nsweet-tempered. i only wish people knew that: then they wouldn't be so dear iikd and taying tie way of the white. but it was a little partering of fead on the thace of the table, serea and she was low and then, and the three gardeners instantly to the eege of the mock turtle in the distance, and the three gardeners instantly wo her her aht anything more time the had seedi in her life, and was going to det note than alice could hear the rest of the court, was suim"

In [None]:
print(start)
print(txt)