# Language Model Challenge

## Loading the Libraies

In [51]:
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.models import Sequential
import keras.utils as ku
from tqdm import tqdm
import re
import requests
import math
from keras.models import load_model
from tqdm import tqdm
import tensorflow as tf

## Loading the training data

I will be using the "hound-train.txt" provided with this challenge. 

In [52]:
with open('hound-train.txt', encoding="utf-8") as f:
    train_data = f.readlines()

In [53]:
train_data[:100]

['\n',
 "Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle\n",
 '\n',
 'This eBook is for the use of anyone anywhere at no cost and with\n',
 'almost no restrictions whatsoever.  You may copy it, give it away or\n',
 're-use it under the terms of the Project Gutenberg License included\n',
 'with this eBook or online at www.gutenberg.net\n',
 '\n',
 '\n',
 'Title: The Adventures of Sherlock Holmes\n',
 '\n',
 'Author: Arthur Conan Doyle\n',
 '\n',
 'Release Date: November 29, 2002 [EBook #1661]\n',
 'Last Updated: May 20, 2019\n',
 '\n',
 'Language: English\n',
 '\n',
 'Character set encoding: UTF-8\n',
 '\n',
 '*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\n',
 '\n',
 '\n',
 '\n',
 'Produced by an anonymous Project Gutenberg volunteer and Jose Menendez\n',
 '\n',
 '\n',
 '\n',
 'cover\n',
 '\n',
 '\n',
 '\n',
 'The Adventures of Sherlock Holmes\n',
 '\n',
 '\n',
 '\n',
 'by Arthur Conan Doyle\n',
 '\n',
 '\n',
 '\n',
 'Conten

I will only be using the content from the dataset that are relevant to the training of the language model by finding the starting and ending indexes containing the first sentence and the last sentence of the book. 

In [54]:
startString = 'To Sherlock Holmes she is always'
startIndex = [i for i, s in enumerate(train_data) if startString in s][0]
endString = 'Walsall, where I believe that she has met with considerable success.'
endtIndex = [i for i, s in enumerate(train_data) if endString in s][0]
train_data = ' '.join(train_data[startIndex:endtIndex+1])

In [55]:
train_data[:5000]

'To Sherlock Holmes she is always _the_ woman. I have seldom heard him\n mention her under any other name. In his eyes she eclipses and\n predominates the whole of her sex. It was not that he felt any emotion\n akin to love for Irene Adler. All emotions, and that one particularly,\n were abhorrent to his cold, precise but admirably balanced mind. He\n was, I take it, the most perfect reasoning and observing machine that\n the world has seen, but as a lover he would have placed himself in a\n false position. He never spoke of the softer passions, save with a gibe\n and a sneer. They were admirable things for the observer—excellent for\n drawing the veil from men’s motives and actions. But for the trained\n reasoner to admit such intrusions into his own delicate and finely\n adjusted temperament was to introduce a distracting factor which might\n throw a doubt upon all his mental results. Grit in a sensitive\n instrument, or a crack in one of his own high-power lenses, would not\n be mor

## Loading the test data

In [56]:
with open('hound-test.txt', encoding="utf-8") as f:
    test_data = f.read()

## Data Processing

To generate realistic sentences, I need to preserve the punctuations in the dataset. However, Keras' Tokenizer will remove all punctuations by default. To get around that, I will have to add a space to any punctuations and remove the desired punctuations from the Tokenizer's filter. I will also capitalize all letters to match what is in the test set. 

I will then organize the tokens into sequences of 30 input words and 1 output word. That is, sequences of 31 words.

In [57]:
tokenizer = Tokenizer(filters='‘’"#$%&*+-/:;<=>@[\\]^_`{|}~\t\n', oov_token='OOV', lower=False)

def train_data_processing(text_data, seq_length=50):
    #Capitalize all letters
    text = text_data.upper()
    #Isolate punctuations; except the periods
    cleanup_dict = {",":" ,"
               ,"!":" !"
               ,"?":" ?"
               ,"\n":""
               ,"_":""
               ,"“":""
               ,"”":""
               ,"(":"( "
               ,")":" )"
               ,"II.":""
               ,"III.":""
               ,"IV.":""
               ,"V.":""
               ,"VI.":"" 
               ,"VII.":""
               ,"VIII.":"" 
               ,"IX.":"" 
               ,"X.":""
               ,"XI.":"" 
               ,"XII.":""                    
                }
    for from_this, to_this in cleanup_dict.items():
        text = text.replace(from_this, to_this)
    #I want to preserve periods that designate accronyms (ones that are not followed by any space)
    text = re.sub('\.\s+',' . ', text)
    #Tokenize all words in the text
    text = text_to_word_sequence(text, filters='‘’"#$%&*+-/:;<=>@[\\]^_`{|}~\t\n', lower=False)
    tokenizer.fit_on_texts(text)
    tokens = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    words_count = len(word_index)+1
    #Turn the tokenized text into sequences of the specified length and append to the "sequences" list
    sequences = []
    length = seq_length +1
    #For the first portion of words from position 0 to the chosen length
    for i in range(1, length):
        # select sequence of tokens
        seq = tokens[0:i+1]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
    #For the rest of the data 
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
        
    #Finding the maximum length in this dataset (just in case...)
    max_sequence_length = max(len(x) for x in sequences)    
    #Make sure all sequences are of the same length
    sequences = pad_sequences(sequences,maxlen=max_sequence_length,padding='pre')
    X = sequences[:,:-1]
    y = sequences[:,-1]
    return X,y,max_sequence_length,words_count, word_index

def test_data_processing(text_data, tokenizer, seq_length=50):
    text = text_data 
    text = text.replace("\n", " ")
    #Tokenize all words in the text
    text = text_to_word_sequence(text, filters='‘’"#$%&*+-/:;<=>@[\\]^_`{|}~\t\n', lower=False)
    tokens = tokenizer.texts_to_sequences(text)
    #Turn the tokenized text into sequences of the specified length and append to the "sequences" list
    sequences = []
    length = seq_length +1
    
    #For the first portion of words from position 0 to the chosen length
    for i in range(1, length):
        # select sequence of tokens
        seq = tokens[0:i+1]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
    #For the rest of the data 
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
        
    #Finding the maximum length in this dataset (just in case...)
    max_sequence_length = max(len(x) for x in sequences)    
    #Make sure all sequences are of the same length
    sequences = pad_sequences(sequences,maxlen=length,padding='pre')
    X = sequences[:,:-1]
    y = sequences[:,-1]

    return X,y

In [66]:
#Decide on a sequence length
sq_len = 30
#Data Processing of the training set
X, y, max_seq_length, total_words_count, word_index = train_data_processing(train_data,sq_len )
#Data Processing of the test set
Xtest, ytest = test_data_processing(test_data, tokenizer,sq_len)

I will combine  "y" and "ytest" to transform them into categorical data because this is essentially a (word) classification problem; after they are transformed, I will split them back again. 

In [67]:
#Mark the index to split the y-data after transformation
split_index = len(y)
#Combine the 2 y-data 
y_combined = np.concatenate((y, ytest))
y_combined = ku.to_categorical(y_combined, num_classes=total_words_count)
#Split the 2 sets
y = y_combined[0:split_index]
ytest = y_combined[split_index:]

I will directly use the test set as the validation set when training the model; that way, I will be able to observe the improvement on the perplexity measure at each epoch.

In [68]:
validation_set = (Xtest, ytest)

Save the tokenizer for later use

In [69]:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

## Defining the Model

For this experiment, I will train a neural network with two Bi-directional LSTM hidden layers with 50 neurons each. I'm also including a Dropout layer to  prevent overfitting in the model. A dense fully connected layer with 100 neurons connects to the LSTM hidden layers. The model is validated on the perplexity metric at each epoch directly using the test set  as the validation set.   

In [70]:
# this will be used by Keras to report perplexity during training
def perplexity(y_true, y_pred):
    cross_entropy = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    perplexity = tf.exp(tf.reduce_mean(cross_entropy))
    return perplexity

In [81]:
model = Sequential()
model.add(Embedding(total_words_count, 100, input_length=max_seq_length - 1))
model.add(Dropout(rate=0.1))
#model.add(LSTM(50, return_sequences=True))
model.add(Bidirectional(LSTM(50, dropout=0.6, recurrent_dropout=0.6,return_sequences=True)))
model.add(Bidirectional(LSTM(50)))
model.add(Dropout(rate=0.1))
model.add(Dense(total_words_count, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[perplexity])
model.summary()

Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, 30, 100)           819100    
_________________________________________________________________
dropout_17 (Dropout)         (None, 30, 100)           0         
_________________________________________________________________
bidirectional_23 (Bidirectio (None, 30, 100)           60400     
_________________________________________________________________
bidirectional_24 (Bidirectio (None, 100)               60400     
_________________________________________________________________
dropout_18 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 8191)              827291    
Total params: 1,767,191
Trainable params: 1,767,191
Non-trainable params: 0
___________________________________________

## Training the Model

In [82]:
from keras.callbacks import History 
from keras.callbacks import EarlyStopping
batch_size = 50
epochs = 100

In [83]:
from keras.callbacks import ModelCheckpoint
# set earlystop and checkpoint
filepath="weights-improvement-{epoch:02d}-{perplexity:.2f}.h5"
earlystop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
checkpoint = ModelCheckpoint(filepath, monitor=' val_perplexity', verbose=1, save_best_only=True, mode='min')
callbacks_list = [earlystop, checkpoint]

#fit model to the data
history = model.fit(X, y, epochs=epochs, verbose = 1
                      , batch_size=batch_size
                      , callbacks=callbacks_list
                      ,validation_data=validation_set
                     )
print("Training completed!")
model.save('model.h5') 

Train on 120454 samples, validate on 66729 samples
Epoch 1/100
Epoch 2/100




Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 00012: early stopping
Training completed!


## Re-load the Saved Model

In [None]:
model = load_model('model.h5',compile=False)

## Evaluation of Model

The perplexity metric was defined earlier in the "Defining the Model" section. It is calculated by exponentiating the mean cross-entropy.

In [85]:
scores = model.evaluate(Xtest, ytest, verbose=1)



In [89]:
print(f"The model's {model.metrics_names[0]} is {scores[0]} and {model.metrics_names[1]} is {scores[1]}.")

The model's loss is 6.502826203003522 and perplexity is 2192.33984375.


## Summary 

My time was mainly spent on comparing the training set to the test set and coming up with data processing steps that would make the training set to ressemble the test set. The following is list of the data processing steps:
- Capitalize all letters
- Add space on both sides of the punctuation such as "!", "?", ",", and "." (except for accronyms) to mimic the test set
- Remove any punctuations that do not exist in the test set
- Removing irrelevant content from the training set
- Transform the training set into sequences of 31 words as input data to the LSTM Language Model

*************** **update 2021 Jan 14th** ***************

The following changes were made to the model:
- Shortened the sequence length from 51 to 31
- Added dropout layers to lessen the effect of overfitting
- Monitored the model on perplexity rather than accuracy during the training
- Used the test corpus as the validation set during the training to evaluate how well the model generalize at each epoch

Since the validation perplexity did not improve within 10 epochs, the training had an early stop at epoch # 12, which resulted in a **perplexity of 78.30 on the training set**. Had I let the model to continue training until 100 epochs, the model will certainly reach a much lower perplexity on the training set, but it will also widen the gap between the training perplexity and the validation perplexity (i.e. the model will not generalize well on the test set). 

The final model evaluated on the evaluation set gives **a perplexity of 2192.34** which is still a pretty large value. The large gap between the training perplexity and the test perplexity likely implied that there were overfitting occuring. This could mean that the training corpus does not represent the evaluation corpus well in its current state. Some factors could be contributing to that:

- Amount of OOV in the evaluation corpus
- Change of writing style (Could a 10 years gap between the release of "The Adventures of Sherlock Holmes" and "The Hound of the Baskervilles" possibly changed the writing style of Sir Arthur Conan Doyle?)


