# Generative Models for Text

In [2]:
import numpy
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.models import Sequential
import sys

Using TensorFlow backend.


### (a) In this problem, we are trying to build a generative model to mimic the writing
### style of the prominent British mathematician, philosopher, prolific writer, and
### political activist Bertrand Russell.
### (b) Download the following books from Project Gutenberg
### http://www.gutenberg.org/ebooks/author/355 in text format:
### i. The Problems of Philosophy
### ii. The Analysis of Mind
### iii. Mysticism and Logic and Other Essays
### iv. Our Knowledge of the External World as a Field for Scientific Method in
### Philosophy
### Project Gutenberg adds a standard header and footer to each book and this is
### not part of the original text. Open the file in a text editor and delete the header
### and footer.
### The header is obvious and ends with the text:
### *** START OF THIS PROJECT GUTENBERG EBOOK AN INQUIRY INTO
### MEANING AND TRUTH ***
### The footer is all of the text after the line of text that says:
### THE END
### To have a better model, it is strongly recommended that you download the following
### books from the Internet Archive https://archive.org and convert
### them to text files:
### i. The History of Western Philosophy
### https://archive.org/details/westernphilosophy4
### ii. The Analysis of Matter
### https://archive.org/details/in.ernet.dli.2015.221533
### iii. An Inquiry into Meaning and Truth
### https://archive.org/details/BertrandRussell-AnInquaryIntoMeaningAndTruth
### Try to use only the text of the books and throw away unwanted text before and
### after it, although in a large corpus such leftovers count as noise and should
### not cause major problems.
### (c) LSTM: Train an LSTM to mimic Russell’s style and thoughts:
### i. Concatenate your text files to create a corpus of Russell’s writings.

In [4]:
# Read the concatenated corpus and lower-case it to shrink the character set.
with open('corpus_raw.txt') as corpus_file:
    text_data = corpus_file.read().lower()

### ii. Use a character-level representation for this model by using extended ASCII
### that has N = 256 characters. Each character will be encoded into an integer
### using its ASCII code. Rescale the integers to the range [0, 1], because LSTM
### uses a sigmoid activation function. LSTM will receive the rescaled integers
### as its input.

In [5]:
characters = sorted(list(set(text_data)))
int_char = dict((c, i) for i, c in enumerate(characters))
characters_number = len(text_data)
vocab_number = len(characters)
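To make the encoding concrete, here is a minimal sketch on a toy string rather than the real corpus. The names `sample_text`, `sample_chars`, `sample_int_char`, `encoded`, `scaled`, and `ascii_scaled` are illustrative and do not appear elsewhere in the notebook. It shows both the index-based mapping built in the cell above and the ASCII-code variant described in step ii.

```python
# Toy stand-in for the corpus; the notebook reads 'corpus_raw.txt' instead.
sample_text = "abracadabra"

# Same construction as the cell above: sorted unique characters -> index.
sample_chars = sorted(set(sample_text))
sample_int_char = {c: i for i, c in enumerate(sample_chars)}

# Encode each character, then rescale by the vocabulary size so that
# every value falls in [0, 1).
encoded = [sample_int_char[c] for c in sample_text]
scaled = [i / float(len(sample_chars)) for i in encoded]

# Variant described in step ii: raw character codes divided by 255.
ascii_scaled = [ord(c) / 255.0 for c in sample_text]

print(sample_int_char)
print(encoded)
```

Indexing only the characters that actually occur keeps the softmax output smaller than 256 classes, which is what the "(or less)" in the later steps allows.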

### iii. Choose a window size, e.g., W = 100.
### iv. Inputs to the network will be the first W −1 = 99 characters of each sequence,
### and the output of the network will be the Wth character of the sequence.
### Basically, we are training the network to predict each character using the 99
### characters that precede it. Slide the window in strides of S = 1 on the text.
### For example, if W = 5 and S = 1 and we want to train the network with the
### sequence ABRACADABRA, the first input to the network will be ABRA
### and the corresponding output will be C. The second input will be BRAC and
### the second output will be A, etc.
### v. Note that the output has to be encoded using a one-hot encoding scheme with
### N = 256 (or fewer) elements. This means that the network reads integers, but
### outputs a vector of N = 256 (or fewer) elements.
### vi. Use a single hidden layer for the LSTM with N = 256 (or fewer) memory units.
### vii. Use a Softmax output layer to yield a probability prediction for each of the
### characters between 0 and 1. This is actually a character classification problem
### with N classes. Choose log loss (cross entropy) as the objective function for
### the network (research what it means).
### viii. We do not use a test dataset. We are using the whole training dataset to
### learn the probability of each character in a sequence. We are not seeking
### a very accurate model. Instead, we are interested in a generalization of the
### dataset that can mimic the gist of the text.
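The ABRACADABRA example from step iv can be reproduced with a small helper. This is only a sketch: `make_windows` is a hypothetical name, not a function used by the notebook.

```python
def make_windows(text, window, stride=1):
    """Return (input, target) pairs: the first window-1 characters and the last one."""
    pairs = []
    for start in range(0, len(text) - window + 1, stride):
        chunk = text[start:start + window]
        pairs.append((chunk[:-1], chunk[-1]))
    return pairs

# W = 5 and S = 1, as in the assignment text.
pairs = make_windows("ABRACADABRA", window=5, stride=1)
for seq_in, seq_out in pairs[:2]:
    print(seq_in, "->", seq_out)
```

Note that the notebook cell that follows feeds 100 input characters and predicts the next one, so its effective window is 101 characters rather than 100.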

In [188]:
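sequence_window = 100
X = []
Y = []
# Slide over the text one character at a time: each input is 100 consecutive
# characters and the target is the character that follows them.
for i in range(0, characters_number - sequence_window, 1):
    seq_in = text_data[i:i + sequence_window]
    seq_out = text_data[i + sequence_window]
    X.append([int_char[char] for char in seq_in])
    Y.append(int_char[seq_out])
number_of_patterns = len(X)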

In [18]:
y = np_utils.to_categorical(Y)
X = numpy.reshape(X, (number_of_patterns, sequence_window, 1))
X = X / float(vocab_number)
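Steps v and vii mention one-hot targets and the log loss. The sketch below spells both out in plain NumPy so the numbers can be checked by hand. `one_hot` and `cross_entropy` are illustrative helpers written for this note, not Keras functions.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One row per index, with a single 1.0 in the column of that index."""
    out = np.zeros((len(indices), num_classes))
    out[np.arange(len(indices)), indices] = 1.0
    return out

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross entropy (natural log) over the rows."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

num_classes = 256  # N from the assignment text
targets = one_hot([3, 200], num_classes)

# A model that knows nothing assigns probability 1/N to every character.
uniform = np.full((2, num_classes), 1.0 / num_classes)
uniform_loss = cross_entropy(targets, uniform)
print(uniform_loss)  # equals ln(256)
```

The uniform case gives a useful reference point: a loss close to the logarithm of the number of classes means the network has learned essentially nothing yet.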

### ix. Choose a reasonable number of epochs for training, considering your computational
### power (e.g., 30, although the network will need more epochs to yield
### a better model).
### x. Use model checkpointing to save the network weights each time
### an improvement in loss is observed at the end of an epoch. Find the best set
### of weights in terms of loss.
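Step x asks for the best set of weights in terms of loss. Because the training cell in this notebook writes the loss into each checkpoint file name (`weights-improvement-{epoch:02d}-{loss:.4f}.hdf5`), the best file can be picked by parsing the names. The following is a rough sketch of that idea: `best_checkpoint` and `CHECKPOINT_PATTERN` are hypothetical names, and the list holds the first three file names reported in the training log.

```python
import re

# File names as reported in the training log (epochs 1 to 3).
checkpoint_files = [
    "weights-improvement-01-3.0946.hdf5",
    "weights-improvement-02-2.9827.hdf5",
    "weights-improvement-03-2.9044.hdf5",
]

# Captures the epoch number and the loss embedded in the file name.
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the file name with the lowest loss encoded in its name."""
    scored = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    if not scored:
        raise ValueError("no checkpoint file names matched the pattern")
    return min(scored)[1]

print(best_checkpoint(checkpoint_files))
```

In practice the list would probably come from something like `glob.glob("weights-improvement-*.hdf5")` in the working directory, and the returned name would be passed to `model.load_weights`.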

In [21]:
# Single LSTM layer with 256 memory units, followed by dropout and a softmax
# over the character vocabulary.
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Save the weights whenever the training loss improves at the end of an epoch.
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
model.fit(X, y, epochs=10, batch_size=4096, callbacks=callbacks_list)

Epoch 1/10

Epoch 00001: loss improved from inf to 3.09463, saving model to weights-improvement-01-3.0946.hdf5
Epoch 2/10

Epoch 00002: loss improved from 3.09463 to 2.98274, saving model to weights-improvement-02-2.9827.hdf5
Epoch 3/10

Epoch 00003: loss improved from 2.98274 to 2.90442, saving model to weights-improvement-03-2.9044.hdf5
Epoch 4/10

Epoch 00004: loss improved from 2.90442 to 2.86781, saving model to weights-improvement-04-2.8678.hdf5
Epoch 5/10

Epoch 00005: loss improved from 2.86781 to 2.83347, saving model to weights-improvement-05-2.8335.hdf5
Epoch 6/10

Epoch 00006: loss improved from 2.83347 to 2.80833, saving model to weights-improvement-06-2.8083.hdf5
Epoch 7/10

Epoch 00007: loss improved from 2.80833 to 2.78792, saving model to weights-improvement-07-2.7879.hdf5
Epoch 8/10

Epoch 00008: loss improved from 2.78792 to 2.76769, saving model to weights-improvement-08-2.7677.hdf5
Epoch 9/10

Epoch 00009: loss improved from 2.76769 to 2.75118, saving model to weig

<keras.callbacks.History at 0xb22eb3588>

### xi. Use the network with the best weights to generate 1000 characters, using the
### following text as initialization of the network:
### There are those who take mental phenomena naively, just as they
### would physical phenomena. This school of psychologists tends not to
### emphasize the object.

In [23]:
# load the network weights
filename = "weights-improvement-10-2.7349.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
int_to_char = dict((i, c) for i, c in enumerate(characters))

In [12]:
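# Seed text from the assignment, lower-cased to match the training corpus.
p = "There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.".lower()
# Encode the seed with the character-to-index mapping built earlier (int_char)
# and keep the first 100 characters, the input length the network expects.
pattern = [int_char[char] for char in p]
pattern = pattern[0:100]
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# Generate 1000 characters, one at a time.
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_number)
    pred = model.predict(x, verbose=0)
    # Greedy decoding: take the most probable next character.
    index = numpy.argmax(pred)
    result = int_to_char[index]
    sys.stdout.write(result)
    # Slide the input window forward by one character.
    pattern.append(index)
    pattern = pattern[1:]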

" there are those who take mental phenomena naively, just as they would physical phenomena. this schoo "
little tare gereen to be a gentle of the there are those who take mental phenomena naively, just as they would physical phenomena. this schoo tabdit  soenee the gad  ouw ie the tay a tirt of toiet at the was a little  anonersen, and thiu had been woite io a lott of tueh a tiie  and taede bot her aeain  she cere thth the bene tith the tere bane to tee toaete to tee the harter was a little tire the same oare cade an anl ano the garee and the was so seat the was a little gareen and the sabdit, and the white rabbit wese tilel an the caoe and the sabbit se teeteer, and the white rabbit wese tilel an the cade in a lonk tfne the sabdi ano aroing to tea the was sf teet whitg the was a little tane oo thete the sabeit  she was a little tartig to the tar there are those who take mental phenomena naively, just as they would physical phenomena. this schoo tf tee the tame of the cagd, and the whi
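The generated text above keeps returning to the same phrases, which is a common side effect of always taking the `argmax` of the softmax output. One widely used alternative, which is not part of this notebook, is to sample the next character from the predicted distribution, optionally reshaped by a temperature. The sketch below assumes a probability vector like one row of `model.predict`; `sample_index` and `toy_probs` are hypothetical names.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw a class index from a probability vector reshaped by temperature."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64).ravel()
    # Dividing the log-probabilities by the temperature sharpens the
    # distribution when temperature < 1 and flattens it when temperature > 1.
    logits = np.log(np.clip(probs, 1e-12, 1.0)) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    weights = np.exp(logits)
    weights /= weights.sum()
    return int(rng.choice(len(weights), p=weights))

toy_probs = [0.1, 0.6, 0.3]  # stand-in for one row of model.predict output
rng = np.random.default_rng(0)
draws = [sample_index(toy_probs, temperature=1.0, rng=rng) for _ in range(1000)]
```

In the generation loop, `index = sample_index(pred, temperature=0.5)` would replace the `numpy.argmax(pred)` call. Whether this improves the output for this particular model is something to check empirically.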