# Text Generation

Using our movie review text data, we trained two new models: one on the text of the first 500 positive reviews and another on the text of the first 500 negative reviews. With these models we generate text character by character to produce computer-generated movie reviews. Most of the code is adapted from http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/.

These models differ from the ones we used earlier in the lab because they predict text one character at a time. We slide a 40-character window through the text with a step size of 3, and after 5 epochs both models reach an accuracy of roughly 50%. We then generate text using several diversity rates (the higher the rate, the less repetitive the generated text). At a low diversity the output is very repetitive, while at a high diversity it becomes less readable and closer to gibberish. The ideal diversity rate is therefore somewhere in the middle, around 0.5.
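To make the effect of the diversity rate concrete, here is a minimal sketch of the rescaling that the `sample` helper later in this notebook applies before drawing a character. The three-element distribution is made up for illustration, `rescale` is a hypothetical name, and the sketch uses only the standard library rather than NumPy.

```python
import math

def rescale(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-character distribution, for illustration only.
probs = [0.6, 0.3, 0.1]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    scaled = rescale(probs, temperature)
    print(temperature, [round(p, 3) for p in scaled])
```

A diversity below 1 pushes probability mass toward the most likely character, which is why low settings produce repetitive text. A diversity above 1 flattens the distribution, so unlikely characters are drawn more often.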

In [2]:
# Model adapted from http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
'''Character-level LSTM text generation.
The notes below come from the example script this code is based on,
which generates text from Nietzsche's writings.
At least 20 epochs are required before the generated text
starts sounding coherent.
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.
If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

In [4]:
ourdata = np.load('data/ourdata.npy')
text = ' '.join(ourdata[0][:500])
PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-,.;: ()&?!" 
text = "".join(c for c in text if c in PERMITTED_CHARS)
text = text.lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

corpus length: 677165
total chars: 47
nb sequences: 225709
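As a small illustration of the windowing in the cell above, the sketch below applies the same loop to a short made-up string with a 5-character window, then checks how the number of sequences follows from the corpus length, `maxlen`, and `step`. The names `make_windows` and `count_windows` are hypothetical helpers, not part of the original code.

```python
def make_windows(text, maxlen, step):
    """Return (input windows, next characters) using the same loop as above."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i: i + maxlen])
        next_chars.append(text[i + maxlen])
    return sentences, next_chars

# Toy example with a short window so the result is easy to read.
sentences, next_chars = make_windows("the movie was good", maxlen=5, step=3)
for s, c in zip(sentences, next_chars):
    print(repr(s), '->', repr(c))

def count_windows(corpus_length, maxlen=40, step=3):
    """Number of windows depends only on corpus length, maxlen and step."""
    return len(range(0, corpus_length - maxlen, step))

print(count_windows(677165))
```

For the positive corpus of 677165 characters, `count_windows` gives 225709, which matches the `nb sequences` value printed above.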


In [56]:
print('Vectorization...')
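# one-hot encode inputs and targets; the builtin bool replaces np.bool,
# which is deprecated or removed in several NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)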
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
pos_model = Sequential()
pos_model.add(LSTM(128, input_shape=(maxlen, len(chars))))
pos_model.add(Dense(len(chars)))
pos_model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
pos_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

pos_model.fit(X, y, batch_size=128, epochs=5)

Vectorization...
Build model...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ff4b07d4550>

In [57]:
pos_model.save('data/pos_model.h5')
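The vectorization step in the training cell above turns every character into a one-hot row, so `X` has shape `(len(sentences), maxlen, len(chars))`. The sketch below shows the same encoding on a made-up string. It uses plain lists instead of NumPy arrays to stay dependency-free, and `one_hot` is a hypothetical helper.

```python
text = "good movie"
chars = sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}

def one_hot(window):
    """Encode a string as a list of 0/1 rows, one row per character."""
    rows = []
    for char in window:
        row = [0] * len(chars)
        row[char_indices[char]] = 1
        rows.append(row)
    return rows

encoded = one_hot("good")
for char, row in zip("good", encoded):
    print(repr(char), row)
```

Each row contains a single 1 at the index of its character, and repeated characters (the two `o`s here) produce identical rows.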

In [49]:
ourdata = np.load('data/ourdata.npy')
text = ' '.join(ourdata[1][:500])
PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-,.;: ()&?!" 
text = "".join(c for c in text if c in PERMITTED_CHARS)
text = text.lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

corpus length: 642048
total chars: 47
nb sequences: 214003


In [50]:
print('Vectorization...')
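# one-hot encode inputs and targets; the builtin bool replaces np.bool,
# which is deprecated or removed in several NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)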
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

model.fit(X, y, batch_size=128, epochs=5)


Vectorization...
Build model...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ff4b33c8160>

In [51]:
model.save('data/neg_model.h5')

## Results

Here we generate reviews of 500 characters for both the positive review model and the negative review model. Both models receive the same seed text, so we can compare what each one generates from it. Interestingly, it is often possible to tell which review came from the positive model and which came from the negative model. Enjoy!
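One caveat when comparing the two models: the generation cell below relies on `chars`, `char_indices`, and `indices_char` from whichever corpus was processed last, while each model was trained with the mapping built from its own corpus. Both corpora report 47 distinct characters, but the outputs do not show whether the two character sets are identical. One way to rule out a mismatch, which is not what this notebook does, would be to derive a fixed vocabulary from the lowercased `PERMITTED_CHARS` before training either model. In this minimal sketch, `FIXED_CHARS` is a hypothetical name.

```python
PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-,.;: ()&?!"

# Hypothetical fixed vocabulary: every permitted character after lowercasing,
# independent of which characters happen to occur in a given corpus.
FIXED_CHARS = sorted(set(PERMITTED_CHARS.lower()))
char_indices = {c: i for i, c in enumerate(FIXED_CHARS)}
indices_char = {i: c for i, c in enumerate(FIXED_CHARS)}

print('total chars:', len(FIXED_CHARS))
```

With such a mapping both models would share the same input and output indices, at the cost of retraining them with 48 character classes instead of 47.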

In [10]:
from keras.models import load_model
neg_model = load_model('data/neg_model.h5')
pos_model = load_model('data/pos_model.h5')

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

start_index = random.randint(0, len(text) - maxlen - 1)
review_length=500

for diversity in [0.2, 0.5, 1.0, 1.2]:
    print()
    print('----- diversity:', diversity)

    # both models start from the same seed text
    seed = text[start_index: start_index + maxlen]
    print('----- Generating with seed: "' + seed + '"')

    for label, review_model in [('Positive', pos_model), ('Negative', neg_model)]:
        print('----- ' + label + ' Review Model')
        sentence = seed
        generated = sentence
        sys.stdout.write(generated)

        for i in range(review_length):
            # one-hot encode the current window of maxlen characters
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            # sample the next character from the predicted distribution
            preds = review_model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            # slide the window forward by one character
            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


----- diversity: 0.2
----- Generating with seed: " is unable to adequately express his lov"
----- Positive Review Model
 is unable to adequately express his lover the movie is a starts and stars to the stars the story of the screen of the film is a stars and the stars and the story and the story is a solver and the story is a stars of the stars of the story and spart with the movie is a consider and the story of the part of the fine was a stars and the film is a good and who is the stars of the way the stars and the story and the story and the story and the stars and the story of the story and the story of the film and the story and the story of the st
----- Negative Review Model
 is unable to adequately express his love and the movie is a movie is a comment of the movie was a comments and the camera and the film is a lot to a comments and the bad that the film is a film and the can a seems to make the way are the camera that the film is a movie all the film is a seems to see the film