Author: Shiyue Cao

USCID: 8583755038

Email: shiyuec@usc.edu

# 1. Generative Models for Text

# (a) In this problem, we are trying to build a generative model to mimic the writing style of the prominent British mathematician, philosopher, prolific writer, and political activist Bertrand Russell.

# (b) Download the following books from Project Gutenberg http://www.gutenberg.org/ebooks/author/355 in text format

# (c) LSTM: Train an LSTM to mimic Russell’s style and thoughts:

## i. Concatenate your text files to create a corpus of Russell’s writings

In [None]:
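import os

# Concatenate every book in ./data into a single corpus file.
# Mode "w" (not "a") so that re-running the cell does not duplicate the text.
with open("./src/corpus", mode="w", encoding="ascii") as corpus:
    for filename in sorted(os.listdir("./data")):
        path = os.path.join("./data", filename)
        # Drop any non-ASCII bytes while reading.
        with open(path, encoding="ascii", errors="ignore") as book:
            for line in book:
                corpus.write(line)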


## ii. Use a character-level representation for this model by using extended ASCII that has N = 256 characters. Each character will be encoded into an integer using its ASCII code. Rescale the integers to the range [0, 1], because LSTM uses a sigmoid activation function. LSTM will receive the rescaled integers as its input.

In [None]:
# Build the vocabulary from the lowercased corpus.
with open('./src/corpus') as f:
    char_set = sorted(set(f.read().lower()))

# Each character's index in the sorted vocabulary, rescaled to [0, 1)
char_2_float = {c: i / len(char_set) for i, c in enumerate(char_set)}
# Integer class labels and the inverse lookup
char_2_int = {c: i for i, c in enumerate(char_set)}
int_2_char = {i: c for i, c in enumerate(char_set)}


print("encode char to float in [0,1]" )
print(char_2_float)
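
The cell above rescales each character's position in the sorted character set, whereas the task statement describes rescaling the ASCII code itself. A minimal sketch of that literal variant, assuming extended ASCII with N = 256 so that codes are divided by 255 (the helper name `ascii_to_unit` is hypothetical and not part of the notebook):

```python
def ascii_to_unit(text, n=256):
    """Map each character to ord(c) / (n - 1), which lies in [0, 1]."""
    return [ord(c) / (n - 1) for c in text]

encoded = ascii_to_unit("abc")
print(encoded)
```

Either encoding gives the LSTM one float per character; the index-based version spreads the characters that actually occur evenly over the interval, while the ASCII version leaves gaps for unused codes.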



## iii. Choose a window size, e.g., W = 100
## iv. Inputs to the network will be the first W − 1 = 99 characters of each sequence, and the output of the network will be the Wth character of the sequence. Basically, we are training the network to predict each character using the 99 characters that precede it. Slide the window in strides of S = 1 on the text. For example, if W = 5 and S = 1 and we want to train the network with the sequence ABRACADABRA, the first input to the network will be ABRA and the corresponding output will be C. The second input will be BRAC and the second output will be A, etc.
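
The ABRACADABRA example can be reproduced with a short sketch. The function name `sliding_windows` is an assumption made for illustration; it splits each window of `w` characters into `w - 1` inputs and one target.

```python
def sliding_windows(text, w, s=1):
    """Return (input, target) pairs: w - 1 input characters, then the w-th as target."""
    pairs = []
    for start in range(0, len(text) - w + 1, s):
        window = text[start:start + w]
        pairs.append((window[:-1], window[-1]))
    return pairs

pairs = sliding_windows("ABRACADABRA", w=5, s=1)
print(pairs[:2])  # [('ABRA', 'C'), ('BRAC', 'A')]
```

With W = 100 the same function would yield 99 input characters and one target per window.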

## v. Note that the output has to be encoded using a one-hot encoding scheme with N = 256 (or less) elements. This means that the network reads integers, but outputs a vector of N = 256 (or less) elements.
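
As a rough illustration of one-hot encoding, the sketch below builds the target matrix by hand with NumPy. The helper name `one_hot` and the toy class indices are assumptions; they are not taken from the corpus.

```python
import numpy as np

def one_hot(indices, n_classes):
    """Return an array of shape (len(indices), n_classes) with a single 1 per row."""
    encoded = np.zeros((len(indices), n_classes), dtype=np.float32)
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded

# Toy example: three targets drawn from a vocabulary of four characters.
targets = one_hot([2, 0, 1], n_classes=4)
print(targets)
```

Each row has exactly one nonzero entry, at the column of the character's integer label.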

## vi. Use a single hidden layer for the LSTM with N = 256 (or less) memory units.
## vii. Use a Softmax output layer to yield a predicted probability between 0 and 1 for each of the characters. This is actually a character classification problem with N classes. Choose log loss (cross entropy) as the objective function for the network (research what it means).
## viii. We do not use a test dataset. We are using the whole training dataset to learn the probability of each character in a sequence. We are not seeking a very accurate model; instead, we are interested in a generalization of the dataset that can mimic the gist of the text.
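
To make the softmax output and log loss of step vii concrete, here is a minimal NumPy sketch for a single sample. The function names and the example scores are assumptions chosen for illustration; Keras computes the same quantities internally for `categorical_crossentropy`, averaged over a batch.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, true_index):
    """Log loss for one sample: -log of the probability given to the true class."""
    return -np.log(probs[true_index])

# Made-up scores for a three-character vocabulary; the true class is index 0.
probs = softmax(np.array([2.0, 1.0, 0.1]))
loss = cross_entropy(probs, true_index=0)
print(probs, loss)
```

The loss is zero only when the network puts probability 1 on the correct character, and it grows as that probability shrinks.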

## ix. Choose a reasonable number of epochs for training, considering your computational power (e.g., 30, although the network will need more epochs to yield a better model).


In [None]:
import numpy as np
from keras.utils import to_categorical
from keras.layers import Dense, LSTM
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint

# Read the lowercased corpus once and build the vocabulary from it.
with open('./src/corpus') as f:
    chars = f.read().lower()
total_chars = len(chars)
char_set = sorted(set(chars))

# Same encodings as in part ii.
char_2_float = {c: i / len(char_set) for i, c in enumerate(char_set)}
char_2_int = {c: i for i, c in enumerate(char_set)}
int_2_char = {i: c for i, c in enumerate(char_set)}

# Here W is the input length (99 characters); the following character is
# the target, so each window covers 100 characters, with stride S = 1.
W = 99
train_data = []
train_target = []
for i in range(0, total_chars - W):
    input_char = chars[i:i + W]
    output_char = chars[i + W]
    train_data.append([char_2_float[c] for c in input_char])
    train_target.append(char_2_int[output_char])

# LSTM input shape: (samples, time steps, features)
train_data = np.reshape(train_data, (len(train_data), W, 1))
# One-hot encode the target characters
train_target = to_categorical(train_target)

print(train_data.shape)
print(train_target.shape)

# Single LSTM layer with 256 memory units, followed by a softmax over the vocabulary
model = Sequential()
model.add(LSTM(256, input_shape=(train_data.shape[1], train_data.shape[2])))
model.add(Dense(train_target.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

# Save a checkpoint whenever the training loss improves at the end of an epoch
checkpointer = ModelCheckpoint(
    filepath='./checkpoint/{epoch:02d}-{loss:.2f}.hdf5', monitor='loss',
    save_best_only=True, mode='min', verbose=0)
model.fit(train_data, train_target, batch_size=512,
          epochs=30, verbose=1, callbacks=[checkpointer])

## x. Use model checkpointing to save the network weights each time an improvement in loss is observed at the end of an epoch. Find the best set of weights in terms of loss.
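
The training cell saves checkpoints under `./checkpoint/` with names of the form `{epoch:02d}-{loss:.2f}.hdf5`, so the best weights can be found by parsing the loss out of each filename. The sketch below assumes that naming pattern; the helper `best_checkpoint` and the example filenames are hypothetical and are not actual training results.

```python
import re

def best_checkpoint(filenames):
    """Return the filename with the lowest loss, or None if nothing matches.

    Assumes names of the form '<epoch>-<loss>.hdf5', e.g. '07-1.85.hdf5'.
    """
    pattern = re.compile(r"^(\d+)-(\d+\.\d+)\.hdf5$")
    best_name, best_key = None, None
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue
        # Lowest loss wins; ties (from rounding to two decimals) go to the later epoch.
        key = (float(match.group(2)), -int(match.group(1)))
        if best_key is None or key < best_key:
            best_name, best_key = name, key
    return best_name

# Made-up filenames for illustration only.
example = ["01-2.91.hdf5", "02-2.45.hdf5", "03-2.10.hdf5", "notes.txt"]
print(best_checkpoint(example))  # 03-2.10.hdf5
```

In the notebook, the list would come from `os.listdir('./checkpoint')`. Because `save_best_only=True` writes a file only when the loss improves, the most recently saved checkpoint should also be the best one.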



## xi. Use the network with the best weights to generate 1000 characters, using the following text as initialization of the network: