# Generating Text with Python and Keras
## "Alice's Adventures in Wonderland" by Lewis Carroll

### What Is an LSTM?

From https://en.wikipedia.org/wiki/Long_short-term_memory

> Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). A RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. Each of the three gates can be thought of as a "conventional" artificial neuron, as in a multi-layer (or feedforward) neural network: that is, they compute an activation (using an activation function) of a weighted sum. Intuitively, they can be thought as regulators of the flow of values that goes through the connections of the LSTM; hence the denotation "gate". There are connections between these gates and the cell.

![LSTM Diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Peephole_Long_Short-Term_Memory.svg/300px-Peephole_Long_Short-Term_Memory.svg.png "LSTM Diagram")
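
To make the gate description above more concrete, here is a minimal NumPy sketch of a single LSTM time step. It assumes the standard LSTM formulation without the peephole connections shown in the diagram, and the names `lstm_step`, `W`, `U` and `b` are made up for this illustration. It is not how Keras implements the layer internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard (non-peephole) LSTM cell.

    W: (4 * hidden, input_dim), U: (4 * hidden, hidden), b: (4 * hidden,)
    Rows are stacked in the order: input gate, forget gate, candidate, output gate.
    """
    hidden = h_prev.shape[0]
    # Every gate is a weighted sum of the input and the previous output
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:hidden])                 # input gate: how much new information to store
    f = sigmoid(z[hidden:2 * hidden])        # forget gate: how much old memory to keep
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values for the cell
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: how much of the cell to expose
    c = f * c_prev + i * g                   # updated cell state (the "memory")
    h = o * np.tanh(c)                       # output of the unit at this time step
    return h, c


# Illustrative sizes and random weights (assumptions for this sketch only)
rng = np.random.RandomState(0)
input_dim, hidden = 3, 4
W = rng.normal(size=(4 * hidden, input_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x in np.eye(input_dim):   # feed three one-hot inputs, one per time step
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state `c` is carried from step to step, which is what lets the unit "remember" values over many time steps.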



### Generating Text With LSTMs

Below we write a script that uses LSTMs to generate text. The first step is importing the required libraries. You will need Python 3.x (I'm using 3.6.2) with Keras (2.2.0) and NumPy (1.14.4) installed.

 - https://www.python.org/
 - https://www.tensorflow.org/install/

After installing the required dependencies, install Keras with pip:

`pip install keras`


In [22]:
# Import all the required libraries
import urllib.request
import keras
from keras import Input
from keras.callbacks import LambdaCallback
from keras.models import Model
from keras.layers import Dense, Dropout, LSTM
import numpy as np
import random
import string

path = 'https://raw.githubusercontent.com/dakrone/corpus/master/data/alice-in-wonderland.txt'
with urllib.request.urlopen(path) as response:
    text = response.read().decode("utf-8").lower()
len(text)

167497

### Tokenize the Text
We have written a quick tokenizer that splits the text into words and treats punctuation marks as separate tokens. The implementation details of the class are not important for the rest of the walkthrough.


In [17]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


class OneHotScaler:
    """
        Converts the corpus into "word" vectors. One Hot encoding

        Note that text should have proper UTF quotation marks.
        Otherwise you may need to change this class a bit (untested, see specials below)
    """
    def __init__(self):
        self.char_indices = None
        self.indices_char = None
        self.allowed = ' abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",.“”‘’!?…-;/():'
        self.specials = '",.“”‘’!?…\';():'  # These get separated into their own token

    def fit_transform(self, text):
        self.fit(text)
        return self.transform(text)

    def transform(self, text):
        # pre-filter text
        # replace newlines with spaces
        text = text.replace('\n', ' ')
        # drop every character that is not in the allowed set
        token_text = ''.join([c for c in text if c in self.allowed])
        # split the text into words;
        # first, pad punctuation marks with spaces so they become separate words
        for special in self.specials:
            token_text = token_text.replace(special, ' {} '.format(special))
        words = [w for w in token_text.split(' ') if w != '']
        idx_arr = np.array([self.char_indices[w] for w in words], dtype='int64')
        one_hot = np.eye(len(self.char_indices))[idx_arr]
        return one_hot

    def fit(self, text):
        # pre-filter text
        # replace newlines with spaces
        text = text.replace('\n', ' ')
        # drop every character that is not in the allowed set
        token_text = ''.join([c for c in text if c in self.allowed])
        # split the text into words;
        # first, pad punctuation marks with spaces so they become separate words
        for special in self.specials:
            token_text = token_text.replace(special, ' {} '.format(special))
        words = [w for w in token_text.split(' ') if w != '']
        chars = sorted(list(set(words)))
        self.char_indices = dict((c, i) for i, c in enumerate(chars))
        self.indices_char = dict((i, c) for i, c in enumerate(chars))

    def inverse_transform(self, vec_list, temperature=1.0):
        l1 = [self.indices_char[sample(vec, temperature)] for vec in vec_list]
        reconstr = "".join([" {}".format(i) if i not in string.punctuation else i for i in l1]).strip()
        return reconstr.replace(' ’ ', '’').replace('“ ', '\n\n“').replace(' ”', '”')
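
To see what the tokenizer does to a piece of text, here is a small standalone sketch of the same splitting idea. The function name `split_tokens` and the shortened `ALLOWED` and `SPECIALS` strings are assumptions made for this example. The class above uses longer character sets.

```python
# Shortened character sets for this illustration only
ALLOWED = ' abcdefghijklmnopqrstuvwxyz",.!?-;():'
SPECIALS = '",.!?;():'


def split_tokens(text):
    # Replace newlines with spaces and drop characters outside the allowed set
    text = text.replace('\n', ' ')
    kept = ''.join(c for c in text if c in ALLOWED)
    # Pad punctuation with spaces so each mark becomes its own token
    for special in SPECIALS:
        kept = kept.replace(special, ' {} '.format(special))
    return [w for w in kept.split(' ') if w != '']


tokens = split_tokens("oh dear!\ni don't know, said alice.")
print(tokens)
# ['oh', 'dear', '!', 'i', 'dont', 'know', ',', 'said', 'alice', '.']
```

Note that the apostrophe is not in the allowed set, so "don't" becomes "dont". The same thing happens in the generated text further down.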

### Formatting the Data
We use a "one-hot" method to encode words as vectors. The dimension of each vector equals the size of the dictionary, and each vector has magnitude 1.0 because exactly one entry is 1.0 and the rest are 0.0.

As a small example, suppose the dictionary looks like this:

```
dictionary = ["The", "owl", "ate"]
```

The word "The" is then converted to the following vector:

```
wordvec_the = [1.0, 0.0, 0.0]
```
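
The same lookup can be done for a whole list of words at once by indexing into an identity matrix. This is the trick the tokenizer class uses with `np.eye`. The snippet below is a minimal sketch that keeps the toy dictionary in the order shown above, whereas the class sorts its vocabulary first. The `word_to_index` name is made up for the example.

```python
import numpy as np

dictionary = ["The", "owl", "ate"]
word_to_index = {word: i for i, word in enumerate(dictionary)}

# Row i of the identity matrix is the one-hot vector for word i
indices = [word_to_index[w] for w in ["The", "owl", "ate", "The"]]
one_hot = np.eye(len(dictionary))[indices]

print(one_hot[0])                       # vector for "The"
print(np.linalg.norm(one_hot, axis=1))  # every vector has magnitude 1.0
```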


In [18]:
# cut the text into semi-redundant sequences of maxlen words
maxlen = 5
# stride between the starts of consecutive sequences. 1 = use every possible sequence
step = 1
# how many words to generate after each epoch of training
generate = 250

b = OneHotScaler()
# Convert the text into an array of vectors. Each vector represents a word
one_hot = b.fit_transform(text)

# Convert the array above into overlapping input sequences (x) and the word that follows each one (y).
x = np.array([one_hot[i: i + maxlen] for i in range(0, len(one_hot) - maxlen, step)])
y = np.array([one_hot[i + maxlen] for i in range(0, len(one_hot) - maxlen, step)])

# Let's take a look at the data. x is an array of input sequences; y is the word we train the model to predict
x[0], y[0]

(array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]]),
 array([ 0.,  0.,  0., ...,  0.,  0.,  0.]))
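
Because the one-hot arrays are mostly zeros, the printout above does not reveal much about how the windows line up. The following sketch applies the same slicing to a short array of integers, which stand in for encoded words (an assumption made purely for readability).

```python
import numpy as np

tokens = np.arange(8)   # stand-in for eight encoded words: 0, 1, ..., 7
maxlen, step = 5, 1

# Each input is a window of maxlen tokens; the target is the token right after it
x = np.array([tokens[i: i + maxlen] for i in range(0, len(tokens) - maxlen, step)])
y = np.array([tokens[i + maxlen] for i in range(0, len(tokens) - maxlen, step)])

print(x)
# [[0 1 2 3 4]
#  [1 2 3 4 5]
#  [2 3 4 5 6]]
print(y)
# [5 6 7]
```

With `step = 1`, consecutive windows overlap in all but one position, which is why the sequences are called semi-redundant.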

### Building the Keras Model
We build a simple, one-layer LSTM network. The input shape comes from the variables defined above: `maxlen` time steps, each a one-hot vector the size of the vocabulary.
The LSTM takes that same shape and uses 128 hidden units. `return_sequences` must be set to `False` so that the layer returns only its final output vector, not one vector per time step.

In [19]:
# Build a Keras model
inputs = Input(shape=(maxlen, x.shape[2]))
lstm_long = LSTM(128, input_shape=(maxlen, x.shape[2]), return_sequences=False)(inputs)
lstm_long = Dropout(0.2)(lstm_long)

# And finally we add the softmax output layer (one unit per word in the vocabulary)
main_output = Dense(y.shape[1], activation='softmax', name='main_output')(lstm_long)

# Using the Adam optimizer
optimizer = keras.optimizers.Adam()

model = Model(inputs=inputs, outputs=main_output)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['acc', ])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 5, 3278)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               1744384   
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
main_output (Dense)          (None, 3278)              422862    
Total params: 2,167,246
Trainable params: 2,167,246
Non-trainable params: 0
_________________________________________________________________
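
As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual LSTM parameterization: four weight blocks (three gates plus the candidate), each with input weights, recurrent weights and a bias.

```python
vocab_size = 3278     # size of each one-hot vector, from the summary above
hidden_units = 128    # LSTM units

# 4 blocks, each with (input + recurrent) weights per unit plus one bias per unit
lstm_params = 4 * (hidden_units * (vocab_size + hidden_units) + hidden_units)

# Dense softmax layer: one weight per (hidden unit, word) pair plus one bias per word
dense_params = hidden_units * vocab_size + vocab_size

print(lstm_params, dense_params, lstm_params + dense_params)
# 1744384 422862 2167246
```

Almost all of the parameters come from the vocabulary size, which is one reason a denser word representation would make the model smaller.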


### Function Which Runs on Each "Epoch"
At the end of each epoch, we run a callback function. The logic is probably a bit over-engineered, but it keeps the function general, so it can be adapted to other ways of representing the input data. For now, we are just using the word tokenization method described above.

The function selects a random run of `maxlen` words from the corpus and uses it as a "seed" phrase to generate more text, letting the model's imagination run wild. On each iteration, the generated word is appended to the end of the seed, the first word is dropped, and a new prediction is made. This repeats `generate` times, where `generate` is the variable set above.
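
The rolling-window idea is easier to see without the one-hot bookkeeping. Here is a minimal sketch that works on plain word lists. `generate_tokens` and `toy_predict` are hypothetical names, and `toy_predict` simply picks a random word where the real callback would call the trained model.

```python
import random


def generate_tokens(seed, predict_next, n_tokens):
    """Extend a seed by repeatedly predicting from the last len(seed) tokens."""
    maxlen = len(seed)
    sentence = list(seed)
    for i in range(n_tokens):
        # The window slides forward by one token on every iteration
        window = sentence[i: i + maxlen]
        sentence.append(predict_next(window))
    return sentence


# Hypothetical stand-in for the model: choose any word from a tiny vocabulary
vocab = ["alice", "said", "the", "queen", "."]
rng = random.Random(0)


def toy_predict(window):
    return rng.choice(vocab)


out = generate_tokens(["the", "queen", "said"], toy_predict, 5)
print(" ".join(out))
```

The seed stays at the front of the result, and each new word only ever depends on the `maxlen` words immediately before it.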

In [20]:
def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, one_hot.shape[0] - maxlen - 1)
    for diversity in [1.0, ]:
        print('----- diversity:', diversity)

        sentence = np.zeros((generate+maxlen, y.shape[1]))
        for n in range(maxlen):
            sentence[n, :] = np.copy(one_hot[start_index+n, :])

        print('----- Generating with seed: "{}"'.format(b.inverse_transform(sentence[0:maxlen], diversity)))

        for i in range(generate):
            model_input = np.zeros((1, maxlen, y.shape[1]))
            for n in range(maxlen):
                model_input[0, n, :] = np.copy(sentence[n+i, :])

            preds = model.predict(model_input, verbose=0)[0]
            char = b.inverse_transform(np.array([preds, ]), diversity)
            sentence[i+maxlen] = np.array(b.transform(char)[0])

        print(b.inverse_transform(np.array(sentence), diversity))

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
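
The `diversity` value used above is passed through to `sample` as the temperature. The sketch below repeats only the rescaling part of `sample`, without the random draw, to show how temperature reshapes a probability distribution. The function name `apply_temperature` and the example probabilities are made up for this illustration.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    # Same rescaling as in sample(): divide the log-probabilities by the
    # temperature, then renormalize so the values sum to 1 again
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_preds = np.exp(logits)
    return exp_preds / np.sum(exp_preds)


probs = [0.6, 0.3, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

A temperature below 1.0 sharpens the distribution toward the most likely word, a temperature of 1.0 leaves it unchanged, and a temperature above 1.0 flattens it, which gives more surprising (and more error-prone) text.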

### Generate Some Text

Now we just call the `fit` function and generate some text!

In [21]:
model.fit(x, y, batch_size=128, epochs=1, callbacks=[print_callback, ], shuffle=True)

Epoch 1/1
----- diversity: 1.0
----- Generating with seed: "so it was indeed:"


so it was indeed: did out state some is, in that ill looked) lobsters rather to alice rose said set at things queens of at lieu with deny head, of she put said cheshire from she king and at said seemed and" to party which grin you? to back king said she into you shes. went who bird other, these meekly the. ill distributing her frog all obliged middle visit her end lobsters this nurse jurymen thinking whisper all there, here once butter snail. centre that alice your queen dogs then--i that invented at offended silence know the hatter two cant up you watch paragraph the of as outside, his three and know--and used. this ground been dog she works said cut she no got and a will nothing poor to the and way queen know a manage simply: much my for put the look the its lobsters who into this curious to lory" but a alice after the she with down to like the -- out elegant clubs last to; the their the more. i, this half shower said! eyes jury master of back great has look their remarked which drin

<keras.callbacks.History at 0x1202e0b8>

### How Can We Improve This?

Many things can be done to improve the results:

- Train for more epochs. I have only run 1; you should run at least 40.
- Edit the corpus to remove identifiers and unnecessary text such as credits.
- Use a different word representation, like Word2Vec
- Modify the network to use more layers and/or more hidden units
- Add some more features to the input. One idea I had was an "in_quote" feature: 1.0 inside a quote and 0.0 outside
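
As a rough sketch of the last idea, the "in_quote" feature could be computed from the token list like this. This is only one possible interpretation: the function name `in_quote_feature` is hypothetical, it assumes the corpus uses the curly quotation marks “ and ”, and it counts the quotation marks themselves as being inside the quote.

```python
def in_quote_feature(tokens, open_quote='“', close_quote='”'):
    """Return 1.0 for tokens inside a quotation and 0.0 for tokens outside."""
    flags = []
    inside = False
    for tok in tokens:
        if tok == open_quote:
            inside = True
            flags.append(1.0)
        elif tok == close_quote:
            flags.append(1.0)
            inside = False
        else:
            flags.append(1.0 if inside else 0.0)
    return flags


tokens = ['“', 'oh', 'dear', '”', 'said', 'alice', '.']
print(in_quote_feature(tokens))
# [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

The resulting column could then be appended to each one-hot vector as one extra input dimension, which would also require widening the model's input shape by one.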


### Example after 20 Epochs

We can see that, for the most part, the grammar is getting pretty good! Try it out yourself and let me know how you do!


> your verdict, he said to the jury, in a low, trembling voice, like majesty said: the hatter. i hurried forgotten it, a white moment day like of deep exactly grown she did, she was a right mind i had then half it? the pieces: interrupted the dormouse, been, and hold and there to see, it would did your mostly she had going to them to herself, but she did not dropped to she began off. but when so serpents down, while being could pinched never afraid i must be managing; thought she said quite eyes; but she thought he, but she began them at last in sea case of first gloves dropped cats! i dont dont her him the whiting shouldnt came about, said alice. dr cats down to, said alice. over only first opened a just out a think! ive went! in the politely tone, she said to herself. the gardeners so found, my her without--maybe morning! we give set much to day. it had read are in a undertone down: the came man and two about, and would not hardly maintaining it must rustled. so it was a very french to some of the last time, and nothing suddenly running up going too. becoming a taken zigzag in soup, that she had considering with as looking twist at her,( three-legged neck; but second

Thank you!