### USC ID: 7184-0277-30
### Name: Xiyue Wang

## Introduction
In this problem, we build a generative model that mimics the writing style of Bertrand Russell, the prominent British mathematician, philosopher, prolific writer, and political activist.

The following texts are used:
- The Problems of Philosophy
- The Analysis of Mind
- Mysticism and Logic and Other Essays
- Our Knowledge of the External World as a Field for Scientific Method in Philosophy
- The Analysis of Matter

In [2]:
# import packages
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from keras.utils import to_categorical
import keras
import os
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import save_model
import matplotlib.pyplot as plt
import pickle
import seaborn as sns
import requests
import re

Using TensorFlow backend.


## (b) Download the books

In [2]:
# request txt files
filepaths = ['http://www.gutenberg.org/cache/epub/5827/pg5827.txt', 'http://www.gutenberg.org/cache/epub/2529/pg2529.txt', 
             'http://www.gutenberg.org/files/25447/25447-0.txt', 'http://www.gutenberg.org/files/37090/37090-0.txt', 
             'https://archive.org/stream/in.ernet.dli.2015.221533/2015.221533.The-Analysis_djvu.txt']
# open the file once in write mode so that re-running the cell does not append duplicate copies
with open('data.txt', 'w') as file:
    for filepath in filepaths:
        r = requests.get(filepath)
        file.write(r.text)

In [3]:
# read the text 
with open('data.txt', 'r') as file:
    text = file.read()

In [4]:
print('The length of the five books is: {} characters.'.format(len(text)))

The length of the five books is: 2635771 characters.


## (c) LSTM: Train an LSTM to mimic Russell’s style and thoughts:

### i. Concatenate your text files to create a corpus of Russell’s writings.

In [5]:
# remove punctuation and change to lowercase
text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower().strip()

### ii. Use a character-level representation for this model by using extended ASCII that has N = 256 characters.

In [6]:
# get the unique characters in the original text
unique_char = set(text)
encoded_char = {}
for i, char in enumerate(unique_char):
    encoded_char[char] = i
print('The text has {} unique characters.\n'.format(len(unique_char)))

# normalize the encoders between 0 and 1
scaled_char = {}
scaler = MinMaxScaler()
norm_value = scaler.fit_transform(np.array(list(encoded_char.values())).reshape(-1, 1))
for i in range(len(norm_value)):
    scaled_char[list(encoded_char.keys())[i]] = norm_value[i][0]
print('The scaled encoder is: {}'.format(scaled_char))

The text has 37 unique characters.

The scaled encoder is: {'x': 0.0, '1': 0.027777777777777776, 'f': 0.05555555555555555, '2': 0.08333333333333333, 'o': 0.1111111111111111, 'h': 0.1388888888888889, 'e': 0.16666666666666666, '8': 0.19444444444444442, 'p': 0.2222222222222222, 'y': 0.25, 'i': 0.2777777777777778, 'c': 0.3055555555555555, '0': 0.3333333333333333, 'u': 0.3611111111111111, 'a': 0.38888888888888884, '9': 0.41666666666666663, 't': 0.4444444444444444, 'r': 0.4722222222222222, 'n': 0.5, 'b': 0.5277777777777778, 'v': 0.5555555555555556, '3': 0.5833333333333333, 's': 0.611111111111111, 'm': 0.6388888888888888, 'z': 0.6666666666666666, 'l': 0.6944444444444444, '5': 0.7222222222222222, 'd': 0.75, '7': 0.7777777777777777, 'g': 0.8055555555555555, 'q': 0.8333333333333333, 'j': 0.861111111111111, 'k': 0.8888888888888888, 'w': 0.9166666666666666, '6': 0.9444444444444444, '4': 0.9722222222222222, ' ': 1.0}
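
Because `set(text)` has no fixed iteration order, the mapping printed above can change from one Python session to the next, which matters when weights are reloaded later. The following is a minimal sketch of a reproducible alternative, assuming the vocabulary is simply sorted; `build_vocab`, `scale_indices`, and the sample string are hypothetical names used only for illustration.

```python
def build_vocab(text):
    """Map each distinct character to an integer index and back."""
    chars = sorted(set(text))  # sorting makes the mapping reproducible across runs
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def scale_indices(char_to_idx):
    """Min-max scale the integer codes to [0, 1] (same result as MinMaxScaler on one column)."""
    top = max(len(char_to_idx) - 1, 1)  # guard against a one-character vocabulary
    return {c: i / top for c, i in char_to_idx.items()}


# illustrative sample only, not the Russell corpus
sample = "abracadabra 123"
char_to_idx, idx_to_char = build_vocab(sample)
scaled = scale_indices(char_to_idx)
```

Keeping the inverse dictionary `idx_to_char` also avoids rebuilding `list(encoded_char.keys())` every time a predicted index is decoded.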


### iii. Choose a window size, e.g., W = 100.
### iv. Inputs to the network will be the first W −1 = 99 characters of each sequence, and the output of the network will be the Wth character of the sequence. Basically, we are training the network to predict each character using the 99 characters that precede it. Slide the window in strides of S = 1 on the text. For example, if W = 5 and S = 1 and we want to train the network with the sequence ABRACADABRA, the first input to the network will be ABRA and the corresponding output will be C. The second input will be BRAC and the second output will be A, etc.


In [7]:
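## choose a window size of 100
w = 100
X = []
y = []
for i in range(0, len(text)-w):
    temp = text[i:i+w-1]        # the first W-1 = 99 characters of the window
    temp_label = text[i+w-1]    # the W-th character, i.e. the one right after the inputs
    X.append([scaled_char[char] for char in temp])
    y.append(encoded_char[temp_label])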
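
The ABRACADABRA example from the task statement can be checked with a small sketch. `make_windows` is a hypothetical helper, not part of the notebook; it also keeps the last full window, so it yields one more pair than a loop over `range(0, len(text)-w)`.

```python
def make_windows(text, w, stride=1):
    """Split text into (input, target) pairs: the first w-1 characters and the w-th one."""
    inputs, targets = [], []
    for start in range(0, len(text) - w + 1, stride):
        window = text[start:start + w]
        inputs.append(window[:-1])   # first w-1 characters
        targets.append(window[-1])   # the w-th character
    return inputs, targets


inputs, targets = make_windows("ABRACADABRA", w=5)
for x, t in zip(inputs, targets):
    print(x, "->", t)
```

The first two pairs are `ABRA -> C` and `BRAC -> A`, matching the description in step iv.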

In [8]:
# reshape the dataset and convert the labels to one-hot vectors
n = len(X)
X_trans = np.reshape(X, (n, w-1, 1))
y_trans = to_categorical(y) # categorical_crossentropy expects one-hot targets, so encode the integer labels first
X_trans.shape, y_trans.shape

((2635664, 99, 1), (2635664, 37))
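
For reference, the label encoding can be reproduced in plain NumPy. This is only a sketch of the idea behind one-hot targets; `one_hot` and the toy labels are hypothetical and stand in for the real `y` list.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label position."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


# toy labels for illustration only
y_demo = one_hot([2, 0, 3], num_classes=4)
print(y_demo)
```

Each row sums to 1, and taking `argmax` along the last axis recovers the original integer label.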

In [9]:
# store the X and y data
with open('X.pickle', 'wb') as file:
    pickle.dump(X, file)
with open('y.pickle', 'wb') as file:
    pickle.dump(y, file)

### v. Note that the output has to be encoded using a one-hot encoding scheme with N = 256 (or less) elements. This means that the network reads integers, but outputs a vector of N = 256 (or less) elements.
### vi. Use a single hidden layer for the LSTM with N = 256 (or less) memory units.
### vii. Use a Softmax output layer to yield a probability prediction for each of the characters between 0 and 1. This is actually a character classification problem with N classes. Choose log loss (cross entropy) as the objective function for the network.
### viii. We do not use a test dataset. We are using the whole training dataset to learn the probability of each character in a sequence. We are not seeking a very accurate model. Instead, we are interested in a generalization of the dataset that can mimic the gist of the text.

In [29]:
# build the model
memory_units = 256

model = Sequential()
model.add(LSTM(units=memory_units, input_shape=(X_trans.shape[1], X_trans.shape[2])))
model.add(Dense(y_trans.shape[1], activation='softmax'))

# compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
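
To make the softmax output and the log-loss objective concrete, here is a small worked sketch in plain Python. The scores are made-up numbers, and `softmax` and `cross_entropy` are illustrative helpers, not the Keras implementations.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Log loss for a single example whose correct class is true_index."""
    return -math.log(probs[true_index])


# made-up scores for a three-class toy problem
probs = softmax([2.0, 1.0, 0.1])
loss_right = cross_entropy(probs, 0)  # the most probable class is the true one
loss_wrong = cross_entropy(probs, 2)  # the least probable class is the true one
```

A model that guesses uniformly over the 37 characters would have a loss of ln 37 ≈ 3.61 per character, which is a useful baseline when reading the training log.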

### ix. Choose a reasonable number of epochs for training, considering your computational power (e.g., 30, although the network will need more epochs to yield a better model).
### x. Use model checkpointing to save the network weights each time an improvement in loss is observed at the end of an epoch. Find the best set of weights in terms of loss.

In [30]:
# adapted from https://machinelearningmastery.com/check-point-deep-learning-models-keras/
output_dir = "./checkpoints"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
    
checkpoint_filepath = os.path.join(output_dir, 'ck_{epoch:02d}.hdf5')
checkpoint = ModelCheckpoint(
    filepath = checkpoint_filepath,
    monitor='loss',
    save_weights_only= True,
    mode='min'
)
# no validation data is passed to fit(), so monitor the training loss
early_stopping = EarlyStopping(monitor='loss', patience=5, min_delta=1e-4)

callbacks_list = [checkpoint, early_stopping]

# train the model
check_points = model.fit(X_trans, y_trans, epochs=30, batch_size= 500, callbacks=callbacks_list, verbose=1)

Epoch 1/30
...
Epoch 30/30


### x. (continued) Find the best set of weights in terms of loss.

In [35]:
best_model = output_dir +'/ck_30.hdf5'
model.load_weights(best_model)
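
Loading `ck_30.hdf5` assumes that the last epoch had the lowest loss. A hedged alternative is to pick the epoch from the recorded loss history. The sketch below uses made-up loss values; `best_checkpoint` is a hypothetical helper, and with the real run the list would presumably come from `check_points.history['loss']`.

```python
def best_checkpoint(losses, template="ck_{epoch:02d}.hdf5"):
    """Return the 1-based epoch with the lowest loss and its checkpoint file name."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    epoch = best_index + 1  # checkpoint file names count epochs from 1
    return epoch, template.format(epoch=epoch)


# made-up loss values for illustration only
demo_losses = [2.91, 2.40, 2.12, 2.15, 2.05, 2.07]
epoch, filename = best_checkpoint(demo_losses)
print(epoch, filename)
```

The returned file name can then be joined with `output_dir` and passed to `model.load_weights`.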

### xi. Use the network with the best weights to generate 1000 characters, using the following text as initialization of the network:

In [91]:
text = 'There are those who take mental phenomena naively, just as they would physical phenomena.This school of psychologists tends not to emphasize the object.'

In [92]:
# remove punctuation and change to lowercase
text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower().strip()

In [94]:
# predict the characters
while len(text) <= 1000:
    # convert every 99-character window of the current text into scaled floats
    X_test = []
    for i in range(0, len(text)-99):
        temp = text[i:i+99]
        X_test.append([scaled_char[char] for char in temp])
    X_test_trans = np.reshape(X_test, (len(X_test), 99, 1))
    pred = model.predict(X_test_trans)
    # append one predicted character per window
    for j in range(len(X_test)):
        pred_char = list(encoded_char.keys())[pred[j, :].argmax()] # get the corresponding char
        text += pred_char

In [97]:
print(text)

there are those who take mental phenomena naively  just as they would physical phenomena this school of psychologists tends not to emphasize the object     a  a   a    a  h  a  he  a  het  a  hete  a  heter  a  heterl  a  heterlo  a  heterlog  a  heterlogy  a  heterlogy   a  heterlogy a  a  heterlogy a   a  heterlogy a    a  heterlogy a     a  heterlogy a   h  a  heterlogy a   he  a  heterlogy a   he   a  heterlogy a   he    a  heterlogy a   he     a  heterlogy a   he      a  heterlogy a   he    a  a  heterlogy a   he    a   a  heterlogy a   he    a    a  heterlogy a   he    a     a  heterlogy a   he    a      a  heterlogy a   he    a       a  heterlogy a   he    a        a  heterlogy a   he    a      h  a  heterlogy a   he    a      h   a  heterlogy a   he    a      h e  a  heterlogy a   he    a      h e   a  heterlogy a   he    a      h e h  a  heterlogy a   he    a      h e hc  a  heterlogy a   he    a      h e hc   a  heterlogy a   he    a      h e hc h  a  heterlogy a   he    a   
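
The loop above scores every window of the current text and appends all of the predictions in one pass. A more common way to generate text, sketched here under stated assumptions, is to feed only the most recent 99 characters, append the single predicted character, and repeat. `generate`, `predict_fn`, and `cyclic_predict` are hypothetical names; with the trained network, `predict_fn` would wrap `model.predict` on an array of shape `(1, 99, 1)` built from the scaled codes.

```python
def generate(seed, n_chars, predict_fn, char_to_idx, idx_to_char, window=99):
    """Greedy generation: repeatedly predict the next character from the last `window` characters."""
    if len(seed) < window:
        raise ValueError("seed must contain at least `window` characters")
    text = seed
    for _ in range(n_chars):
        context = text[-window:]                       # most recent characters only
        codes = [char_to_idx[c] for c in context]
        probs = predict_fn(codes)                      # one probability per vocabulary entry
        best = max(range(len(probs)), key=probs.__getitem__)
        text += idx_to_char[best]                      # append exactly one character per step
    return text


def cyclic_predict(codes, vocab_size=3):
    """Stand-in for a trained model: always predicts the next symbol in the alphabet."""
    probs = [0.0] * vocab_size
    probs[(codes[-1] + 1) % vocab_size] = 1.0
    return probs


char_to_idx = {'a': 0, 'b': 1, 'c': 2}
idx_to_char = {i: c for c, i in char_to_idx.items()}
generated = generate('abc', 4, cyclic_predict, char_to_idx, idx_to_char, window=3)
print(generated)
```

With this scheme each new character depends on the characters generated just before it, so the output length grows by exactly one per model call.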

### xii. Extra Practice: Use one-hot encoding for the input sequence. Use a large number of epochs, e.g., 150. Add dropout to the network, and use a deeper LSTM (e.g. with 3 or more layers). Generate 3000 characters using the above initialization and report if you get more meaningful text.

In [3]:
# use a single book to reduce the computational cost
filepath = 'http://www.gutenberg.org/cache/epub/5827/pg5827.txt'
r = requests.get(filepath)
with open('phi.text', 'w') as file:
    file.write(r.text)
with open('phi.text', 'r') as file:
    text = file.read()

In [4]:
# remove punctuation and change to lowercase
text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower().strip()
# get the unique characters in the original text
unique_char = set(text)
encoded_char = {}
for i, char in enumerate(unique_char):
    encoded_char[char] = i
print('The text has {} unique characters.'.format(len(unique_char)))

The text has 37 unique characters.


In [5]:
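## choose a window size of 100
w = 100
X = []
y = []
for i in range(0, len(text)-w):
    temp = text[i:i+w-1]        # the first W-1 = 99 characters of the window
    temp_label = text[i+w-1]    # the W-th character, i.e. the one right after the inputs
    X.append([encoded_char[char] for char in temp])  # inputs are raw integer codes here, not one-hot vectors
    y.append(encoded_char[temp_label])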

In [6]:
# reshape the dataset and convert the labels to one-hot vectors
n = len(X)
X_trans = np.reshape(X, (n, w-1, 1))
y_trans = to_categorical(y) # categorical_crossentropy expects one-hot targets, so encode the integer labels first
X_trans.shape, y_trans.shape

((264982, 99, 1), (264982, 37))
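
The input above has a single integer feature per time step. The task statement asks for one-hot inputs instead, which could be sketched as follows. `one_hot_windows` and the toy windows are hypothetical; the real call would take the list `X` and the vocabulary size 37, and the LSTM `input_shape` would then be `(99, 37)`.

```python
import numpy as np


def one_hot_windows(windows, vocab_size):
    """Encode integer-coded windows of shape (n, steps) as an array of shape (n, steps, vocab_size)."""
    windows = np.asarray(windows, dtype=int)
    n, steps = windows.shape
    encoded = np.zeros((n, steps, vocab_size), dtype=np.float32)
    rows = np.arange(n)[:, None]       # sample index, broadcast over time steps
    cols = np.arange(steps)[None, :]   # time-step index, broadcast over samples
    encoded[rows, cols, windows] = 1.0
    return encoded


# toy windows for illustration only
X_demo = one_hot_windows([[0, 2, 1], [1, 1, 0]], vocab_size=3)
print(X_demo.shape)
```

As a rough size estimate, 264,982 windows of 99 steps and 37 classes stored as `float32` take about 3.9 GB, so a smaller dtype or batch-wise encoding may be needed on a modest machine.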

In [7]:
# build the model with 3 LSTM layers
memory_units = 256

model2 = Sequential()
model2.add(LSTM(units=memory_units, input_shape=(X_trans.shape[1], X_trans.shape[2]), return_sequences=True))
model2.add(LSTM(units=memory_units, return_sequences=True))
model2.add(LSTM(units=memory_units, return_sequences=False))
model2.add(Dense(y_trans.shape[1], activation='softmax'))

# compile the model
model2.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='categorical_crossentropy', metrics=['accuracy'])

In [8]:
# fit the model with 100 epochs
# adapted from https://machinelearningmastery.com/check-point-deep-learning-models-keras/
output_dir = "./checkpoints2"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
    
checkpoint_filepath = os.path.join(output_dir, 'ck_{epoch:02d}.hdf5')
checkpoint = ModelCheckpoint(
    filepath = checkpoint_filepath,
    monitor='loss',
    save_weights_only= True,
    mode='min'
)
# no validation data is passed to fit(), so monitor the training loss
early_stopping = EarlyStopping(monitor='loss', patience=5, min_delta=1e-4)

callbacks_list = [checkpoint, early_stopping]

# train the model
check_points = model2.fit(X_trans, y_trans, epochs=100, batch_size=2400, callbacks=callbacks_list, verbose=1)

Epoch 1/100
...
Epoch 79/100
Epoch 

In [9]:
# load the checkpoint weights
best_model = output_dir +'/ck_100.hdf5'
model2.load_weights(best_model)

In [36]:
# predict the characters
text = 'There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.'
# remove punctuation and change to lowercase
text = re.sub(r'[^a-zA-Z0-9]', ' ', text).lower().strip()

In [38]:
while len(text) <= 3000:
    # convert every 99-character window of the current text into integer codes
    X_test = []
    for i in range(0, len(text)-99):
        temp = text[i:i+99]
        X_test.append([encoded_char[char] for char in temp])
    X_test_trans = np.reshape(X_test, (len(X_test), 99, 1))
    pred = model2.predict(X_test_trans)
    # append one predicted character per window
    for j in range(len(X_test)):
        pred_char = list(encoded_char.keys())[pred[j, :].argmax()] # get the corresponding char
        text += pred_char

In [39]:
print(text)

there are those who take mental phenomena naively  just as they would physical phenomena  this school of psychologists tends not to emphasize the object         a   a    a t   a th   a thi   a thit   a thitr   a thitr    a thitr     a thitr      a thitr       a thitr        a thitr         a thitr      a   a thitr      a    a thitr      a     a thitr      a      a thitr      a   f   a thitr      a   fe   a thitr      a   fes   a thitr      a   fese   a thitr      a   fese    a thitr      a   fese a   a thitr      a   fese ao   a thitr      a   fese ao    a thitr      a   fese ao     a thitr      a   fese ao  b   a thitr      a   fese ao  be   a thitr      a   fese ao  bee   a thitr      a   fese ao  beet   a thitr      a   fese ao  beete   a thitr      a   fese ao  beeteo   a thitr      a   fese ao  beeteot   a thitr      a   fese ao  beeteotr   a thitr      a   fese ao  beeteotr    a thitr      a   fese ao  beeteotr     a thitr      a   fese ao  beeteotr  b   a thitr      a   fese ao 