1. Generative Models for Text

(a) In this problem, we are trying to build a generative model to mimic the writing style of the prominent British mathematician, philosopher, prolific writer, and
political activist Bertrand Russell.

(b) Download the following books from Project Gutenberg (http://www.gutenberg.org/ebooks/author/355) in text format:

# They are already in data/books

(c) LSTM: Train an LSTM to mimic Russell’s style and thoughts:

i. Concatenate your text files to create a corpus of Russell’s writings.

In [1]:
import os
def createCorpus(data_dir):
    corpus = []
    for root, dirs, files in os.walk(data_dir):
        for f in files:
            with open(os.path.join(root, f), encoding='ascii', errors='ignore') as book:
                # read the whole text file as one lowercase string
                cur_corpus = book.read().lower()
                corpus.append(cur_corpus)
            print('Read book: {}, string length: {}'.format(f, len(corpus[-1])))
    return corpus

concatCorpus = createCorpus('../data/books/')

Read book: TPP.txt, string length: 244306
Read book: MLOE.txt, string length: 412226
Read book: AIIMAT.txt, string length: 746219
Read book: THWP.txt, string length: 2005566
Read book: TAMatter.txt, string length: 766542
Read book: TAM.txt, string length: 514652
Read book: OKEWFSMP.txt, string length: 405741


ii. Use a character-level representation for this model by using extended ASCII
that has N = 256 characters. Each character will be encoded into an integer
using its ASCII code. Rescale the integers to the range [0, 1], because LSTM uses a sigmoid activation function. LSTM will receive the rescaled integers
as its input.

In [2]:
def charRepresent(corpus):
    chars = set()
    for book in corpus:
        cur = set(book)
        # print(cur)
        chars.update(cur)
        # print(chars)
    chars = sorted(list(chars))
    # Rescale the integers to the range [0, 1]
    n = len(chars)-1
    char2int = dict((c, i) for i, c in enumerate(chars))
    scaled_char2int = dict((c, i/n) for i, c in enumerate(chars))
    int2char = dict((i, c) for i, c in enumerate(chars))
    return char2int, scaled_char2int, int2char
char2int, scaled_char2int, int2char = charRepresent(concatCorpus)
print(char2int, scaled_char2int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, '#': 4, '$': 5, '%': 6, '&': 7, "'": 8, '(': 9, ')': 10, '*': 11, '+': 12, ',': 13, '-': 14, '.': 15, '/': 16, '0': 17, '1': 18, '2': 19, '3': 20, '4': 21, '5': 22, '6': 23, '7': 24, '8': 25, '9': 26, ':': 27, ';': 28, '<': 29, '=': 30, '>': 31, '?': 32, '[': 33, '\\': 34, ']': 35, '^': 36, '_': 37, 'a': 38, 'b': 39, 'c': 40, 'd': 41, 'e': 42, 'f': 43, 'g': 44, 'h': 45, 'i': 46, 'j': 47, 'k': 48, 'l': 49, 'm': 50, 'n': 51, 'o': 52, 'p': 53, 'q': 54, 'r': 55, 's': 56, 't': 57, 'u': 58, 'v': 59, 'w': 60, 'x': 61, 'y': 62, 'z': 63, '{': 64, '|': 65, '}': 66, '~': 67} {'\n': 0.0, ' ': 0.014925373134328358, '!': 0.029850746268656716, '"': 0.04477611940298507, '#': 0.05970149253731343, '$': 0.07462686567164178, '%': 0.08955223880597014, '&': 0.1044776119402985, "'": 0.11940298507462686, '(': 0.13432835820895522, ')': 0.14925373134328357, '*': 0.16417910447761194, '+': 0.1791044776119403, ',': 0.19402985074626866, '-': 0.208955223880597, '.': 0.22388059701492
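Note that the mapping printed above indexes only the 68 characters that actually occur in the corpus, rather than using raw ASCII codes. The following is a minimal sketch of the same idea on a toy string, showing that the rescaling to [0, 1] is invertible. The names `build_vocab` and `decode_scaled` are hypothetical helpers for illustration and are not part of the notebook.

```python
def build_vocab(text):
    """Map each distinct character to an index and to a value in [0, 1]."""
    chars = sorted(set(text))
    n = max(len(chars) - 1, 1)  # guard against a one-character vocabulary
    char2int = {c: i for i, c in enumerate(chars)}
    scaled = {c: i / n for c, i in char2int.items()}
    int2char = {i: c for c, i in char2int.items()}
    return char2int, scaled, int2char


def decode_scaled(value, int2char):
    """Invert the rescaling: multiply by (vocab_size - 1) and round to an index."""
    n = max(len(int2char) - 1, 1)
    return int2char[round(value * n)]


# Toy vocabulary: a, b, c, d, r -> indices 0..4 -> scaled 0.0, 0.25, 0.5, 0.75, 1.0
char2int_demo, scaled_demo, int2char_demo = build_vocab("abracadabra")
encoded = [scaled_demo[c] for c in "cab"]
decoded = "".join(decode_scaled(v, int2char_demo) for v in encoded)
print(encoded, decoded)
```

The generation step later in the notebook relies on the same relationship when it feeds a predicted index back into the network as `predictIdx/(len(char2int)-1)`.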

iii. Choose a window size, e.g., W = 100.

In [3]:
from tqdm import tqdm
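def windowCorpus(corpus, win_size):
    # Slide a window of win_size characters over the text with stride 1:
    # the first win_size - 1 characters are the input, the last one is the target.
    inputs = []
    outputs = []
    for w in range(len(corpus) - win_size + 1):
        seqIn = corpus[w : w + win_size - 1]
        seqOut = corpus[w + win_size - 1]
        inputs.append([scaled_char2int[c] for c in seqIn])  # rescaled values in [0, 1]
        outputs.append(char2int[seqOut])                     # integer class index
    return inputs, outputs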

def dataGenerate(corpus):
  win_size = 100
  inSeq, outChar = [], []
  for book in tqdm(corpus):
    cur_in, cur_out = windowCorpus(book, win_size)
    inSeq.extend(cur_in)
    outChar.extend(cur_out)
  return inSeq, outChar

inSeq, outChar = dataGenerate(concatCorpus)

100%|██████████| 7/7 [00:40<00:00,  5.80s/it]


iv. Inputs to the network will be the first W − 1 = 99 characters of each sequence,
and the output of the network will be the Wth character of the sequence.
Basically, we are training the network to predict each character using the 99
characters that precede it. Slide the window in strides of S = 1 on the text.
For example, if W = 5 and S = 1 and we want to train the network with the
sequence ABRACADABRA, the first input to the network will be ABRA
and the corresponding output will be C. The second input will be BRAC and
the second output will be A, etc.
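The ABRACADABRA example can be reproduced with a few lines of plain Python. This is only an illustrative sketch; `make_windows` and `count_windows` are hypothetical names, and the functions work on raw characters rather than on the rescaled integers the notebook uses.

```python
def make_windows(text, window, stride=1):
    """Return (input, target) pairs: window - 1 characters and the character after them."""
    pairs = []
    for start in range(0, len(text) - window + 1, stride):
        chunk = text[start:start + window]
        pairs.append((chunk[:-1], chunk[-1]))
    return pairs


def count_windows(length, window, stride=1):
    """Number of full windows that fit into a text of the given length."""
    if length < window:
        return 0
    return (length - window) // stride + 1


pairs = make_windows("ABRACADABRA", window=5, stride=1)
for seq_in, target in pairs:
    print(seq_in, "->", target)

# String lengths of the seven books as printed when the corpus was read.
book_lengths = [244306, 412226, 746219, 2005566, 766542, 514652, 405741]
total_windows = sum(count_windows(n, window=100) for n in book_lengths)
print(total_windows)
```

With W = 100 and S = 1, each book of length n contributes n − 99 windows, so the seven books together give 5,094,559 training sequences, which agrees with the first dimension of the arrays the notebook prints.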

v. Note that the output has to be encoded using a one-hot encoding scheme with
N = 256 (or less) elements. This means that the network reads integers, but
outputs a vector of N = 256 (or less) elements.

In [4]:
import numpy as np
lstm_input = np.reshape(inSeq, (len(inSeq), 99, 1))
lstm_output = np.eye(len(char2int))[outChar]

In [5]:
print(lstm_input.shape, lstm_output.shape)

(5094559, 99, 1) (5094559, 68)
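As a small illustration of the one-hot step, the sketch below builds the same kind of matrix on a toy example and checks it against the `np.eye(...)[indices]` idiom used above. The helper name `one_hot` and the choice of `float32` are assumptions made for this example, not something the notebook does.

```python
import numpy as np


def one_hot(indices, num_classes, dtype=np.float32):
    """Return an array of shape (len(indices), num_classes) with a single 1 per row."""
    indices = np.asarray(indices, dtype=np.int64)
    encoded = np.zeros((indices.size, num_classes), dtype=dtype)
    encoded[np.arange(indices.size), indices] = 1
    return encoded


targets = [2, 0, 3]
encoded = one_hot(targets, num_classes=4)
print(encoded)

# Same values as the np.eye indexing idiom, which defaults to float64.
same_as_eye = np.array_equal(encoded, np.eye(4)[targets])
print(same_as_eye)
```

One reason to consider a smaller dtype: `np.eye` produces `float64` by default, so a 5,094,559 × 68 target matrix takes roughly 2.8 GB, and `float32` would halve that.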


vi. Use a single hidden layer for the LSTM with N = 256 (or less) memory units.

vii. Use a Softmax output layer to yield a probability prediction between 0 and 1 for each of the characters. This is actually a character classification problem with N classes. Choose log loss (cross entropy) as the objective function for the network (research what it means).

viii. We do not use a test dataset. We are using the whole training dataset to
learn the probability of each character in a sequence. We are not seeking
a very accurate model. Instead we are interested in a generalization of the
dataset that can mimic the gist of the text.

ix. Choose a reasonable number of epochs for training, considering your computational power (e.g., 30, although the network will need more epochs to yield
a better model).

x. Use model checkpointing to keep the network weights each time an improvement in loss is observed at the end of the epoch. Find the best set of weights in terms of loss.

# Cross entropy measures the difference between two probability distributions. In information theory, a low-probability event carries more information, and the information content of an event can be calculated from its probability. In our LSTM model, softmax turns the network outputs into probabilities, and the cross-entropy loss then measures the distance between the predicted distribution and the true (one-hot) label.
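To make the definition concrete, here is a small NumPy sketch of softmax followed by cross entropy for a single one-hot target. The logits are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(probs, true_index):
    """With a one-hot target, cross entropy is -log of the probability of the true class."""
    return -np.log(probs[true_index])


logits = np.array([2.0, 1.0, 0.1])       # made-up scores for three classes
probs = softmax(logits)
loss_if_likely = cross_entropy(probs, 0)    # true class is the most probable one
loss_if_unlikely = cross_entropy(probs, 2)  # true class is the least probable one
print(probs, loss_if_likely, loss_if_unlikely)

# Baseline: a model that spreads probability evenly over 68 characters.
uniform_loss = cross_entropy(np.full(68, 1 / 68), 0)
print(uniform_loss)
```

The uniform baseline equals ln 68 ≈ 4.22, which gives a reference point for reading the training log: a loss of 2.77 after the first epoch is already well below what uniform guessing over the 68 characters would give.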

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,LSTM
from tensorflow.keras.callbacks import ModelCheckpoint

# build LSTM model
LSTMmodel = Sequential()
LSTMmodel.add(LSTM(68, input_shape=(99, 1)))
LSTMmodel.add(Dropout(0.2))
LSTMmodel.add(Dense(68, activation='softmax'))
print(LSTMmodel.summary())
LSTMmodel.compile(loss='categorical_crossentropy', optimizer='adam')

# save checkpoint
filepath = "./checkpoints/LSTM-best-weights-{epoch:02d}-{loss:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# train model
LSTMmodel.fit(lstm_input, lstm_output, epochs=30, batch_size=512, callbacks=callbacks_list)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 68)                19040     
                                                                 
 dropout (Dropout)           (None, 68)                0         
                                                                 
 dense (Dense)               (None, 68)                4692      
                                                                 
Total params: 23,732
Trainable params: 23,732
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/30
Epoch 00001: loss improved from inf to 2.77486, saving model to ./checkpoints/LSTM-best-weights-01-2.77.hdf5
Epoch 2/30
Epoch 00002: loss improved from 2.77486 to 2.62287, saving model to ./checkpoints/LSTM-best-weights-02-2.62.hdf5
Epoch 3/30
Epoch 00003: loss improved from 2.62287 to 2.55835, savi

<keras.callbacks.History at 0x7fd4b83944c0>
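The parameter counts in the model summary can be checked by hand. The sketch below assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights, and a bias, and a dense layer with a bias term. The function names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


lstm_params = lstm_param_count(input_dim=1, units=68)
dense_params = dense_param_count(input_dim=68, units=68)
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)  # 19040 4692 23732
```

These match the 19,040, 4,692, and 23,732 reported in the summary. The dropout layer contributes no parameters.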

xi. Use the network with the best weights to generate 1000 characters, using the
following text as initialization of the network:

There are those who take mental phenomena naively, just as they
would physical phenomena. This school of psychologists tends not to
emphasize the object.

In [9]:
LSTMmodel.load_weights('./checkpoints/LSTM-best-weights-15-2.32.hdf5')
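Instead of typing the checkpoint name by hand, the best file could be picked from the names themselves, since the checkpoint path template embeds the epoch and the loss. The following is a sketch under that assumption; `best_checkpoint` is a hypothetical helper, and the example names are the ones that appear in the training log and in the `load_weights` call above.

```python
import re

# Matches names produced by "LSTM-best-weights-{epoch:02d}-{loss:.2f}.hdf5".
CHECKPOINT_PATTERN = re.compile(r"LSTM-best-weights-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the name with the lowest recorded loss, preferring the later epoch on ties."""
    best = None
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip files that are not checkpoints
        epoch, loss = int(match.group(1)), float(match.group(2))
        key = (loss, -epoch)
        if best is None or key < best[0]:
            best = (key, name)
    return None if best is None else best[1]


names = [
    "LSTM-best-weights-01-2.77.hdf5",
    "LSTM-best-weights-02-2.62.hdf5",
    "LSTM-best-weights-15-2.32.hdf5",
    "notes.txt",
]
chosen = best_checkpoint(names)
print(chosen)
```

In the notebook this could be called as `best_checkpoint(os.listdir('./checkpoints'))`. Because the loss in the file name is rounded to two decimals, ties are broken in favor of the later epoch, which is the better one when `save_best_only=True` only writes a file on improvement.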

In [24]:
from tqdm import tqdm

txt = 'There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object.'

LSTM_input_seq = [scaled_char2int[c] for c in txt[-99:].lower()]
for i in tqdm(range(1000)):
  seq = np.reshape(LSTM_input_seq, (1, len(LSTM_input_seq), 1))
  # predict the next character
  predictChar = LSTMmodel.predict(seq)
  predictIdx = np.argmax(predictChar) 
  txt += int2char[predictIdx]
  # make new input sequence
  LSTM_input_seq.append(predictIdx/(len(char2int)-1))
  LSTM_input_seq = LSTM_input_seq[1:len(LSTM_input_seq)]

print(txt)

100%|██████████| 1000/1000 [00:36<00:00, 27.17it/s]
There are those who take mental phenomena naively, just as they would physical phenomena. This school of psychologists tends not to emphasize the object. 
the soace of the soace of the soace of the soace of the soace of the 
aeloe the soace of the porpent of the porpint of the soace of the soace of the soace of the 
aelos thete oe the porpint of the porpent of the soace of the soace of the soace of the soace 
the soace of the porpent of the porpent of the porpint of the soace of the soace of the soace of 
[The rest of the 1000 generated characters are almost all newlines, with a few stray symbols and digits, for example #, {, }, \, (, ) and 1.]
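The generated text repeats a few phrases and then degenerates. The loop above always takes the `argmax` of the predicted distribution, and greedy decoding of this kind tends to get stuck in cycles. One common alternative, which the notebook does not use, is to sample the next character from the softmax output, optionally with a temperature. The sketch below shows that idea on a made-up distribution; `sample_index` and its parameters are hypothetical.

```python
import numpy as np


def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from probs after rescaling the log-probabilities by 1/temperature."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature  # small constant avoids log(0)
    logits -= logits.max()                        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    return int(rng.choice(len(weights), p=weights))


probs = np.array([0.5, 0.3, 0.2])  # made-up distribution over three characters
rng = np.random.default_rng(0)
draws = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(2000)]
counts = np.bincount(draws, minlength=3)
print(counts)

# A very low temperature behaves almost like argmax.
near_greedy = sample_index(probs, temperature=0.01, rng=rng)
print(near_greedy)
```

In the generation loop, this would mean replacing `np.argmax(predictChar)` with something like `sample_index(predictChar[0], temperature=0.8)`. The value 0.8 is only a placeholder, and whether sampling actually improves the text depends on how well the model has been trained.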
