# Language Model Challenge

## Loading the Libraries

In [68]:
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.models import Sequential
import keras.utils as ku
from tqdm import tqdm
import re
import requests
import math
from keras.models import load_model

## Loading the data

I will be using the "hound-train.txt" provided with this challenge. 

In [3]:
with open('hound-train.txt', encoding="utf-8") as f:
    data = f.readlines()

In [111]:
data[:100]

['\n',
 "Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle\n",
 '\n',
 'This eBook is for the use of anyone anywhere at no cost and with\n',
 'almost no restrictions whatsoever.  You may copy it, give it away or\n',
 're-use it under the terms of the Project Gutenberg License included\n',
 'with this eBook or online at www.gutenberg.net\n',
 '\n',
 '\n',
 'Title: The Adventures of Sherlock Holmes\n',
 '\n',
 'Author: Arthur Conan Doyle\n',
 '\n',
 'Release Date: November 29, 2002 [EBook #1661]\n',
 'Last Updated: May 20, 2019\n',
 '\n',
 'Language: English\n',
 '\n',
 'Character set encoding: UTF-8\n',
 '\n',
 '*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\n',
 '\n',
 '\n',
 '\n',
 'Produced by an anonymous Project Gutenberg volunteer and Jose Menendez\n',
 '\n',
 '\n',
 '\n',
 'cover\n',
 '\n',
 '\n',
 '\n',
 'The Adventures of Sherlock Holmes\n',
 '\n',
 '\n',
 '\n',
 'by Arthur Conan Doyle\n',
 '\n',
 '\n',
 '\n',
 'Conten

I will only use the part of the dataset that is relevant for training the language model. To do that, I find the indexes of the lines containing the first and the last sentence of the book and keep everything in between.

In [5]:
startString = 'To Sherlock Holmes she is always'
startIndex = [i for i, s in enumerate(data) if startString in s][0]
endString = 'Walsall, where I believe that she has met with considerable success.'
endIndex = [i for i, s in enumerate(data) if endString in s][0]
text_data = ' '.join(data[startIndex:endIndex+1])

In [107]:
text_data[:5000]

'To Sherlock Holmes she is always _the_ woman. I have seldom heard him\n mention her under any other name. In his eyes she eclipses and\n predominates the whole of her sex. It was not that he felt any emotion\n akin to love for Irene Adler. All emotions, and that one particularly,\n were abhorrent to his cold, precise but admirably balanced mind. He\n was, I take it, the most perfect reasoning and observing machine that\n the world has seen, but as a lover he would have placed himself in a\n false position. He never spoke of the softer passions, save with a gibe\n and a sneer. They were admirable things for the observer—excellent for\n drawing the veil from men’s motives and actions. But for the trained\n reasoner to admit such intrusions into his own delicate and finely\n adjusted temperament was to introduce a distracting factor which might\n throw a doubt upon all his mental results. Grit in a sensitive\n instrument, or a crack in one of his own high-power lenses, would not\n be mor

## Data Processing

To generate realistic sentences, I need to preserve the punctuation in the dataset. However, Keras' Tokenizer removes all punctuation by default. To get around that, I add a space around the punctuation marks I want to keep, so that each becomes its own token, and remove those marks from the Tokenizer's filter. I also capitalize all letters to match the test set.
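The punctuation handling described above can be sketched with the standard library alone. This is only an illustration and not the notebook's exact cleanup: the helper name `isolate_punctuation` and the sample sentence are assumptions, and unlike the notebook's regex it also splits off a period at the very end of the text.

```python
import re

def isolate_punctuation(text):
    """Upper-case the text and turn sentence punctuation into separate tokens.

    Hypothetical helper for illustration. A period is only split off when it
    is followed by whitespace (or ends the text), so acronym-style periods
    such as the ones in "U.S.A" stay attached to their word.
    """
    text = text.upper()
    # Put a space in front of commas, exclamation marks and question marks.
    text = re.sub(r"([,!?])", r" \1", text)
    # Split off periods that are followed by whitespace or the end of the text.
    text = re.sub(r"\.(\s+|$)", " . ", text)
    # Splitting on whitespace also collapses any doubled spaces.
    return text.split()

tokens = isolate_punctuation("Well, Watson! The U.S.A is far. Is it not?")
print(tokens)
```

Each kept punctuation mark ends up as its own token, which is what lets the model learn where sentences and clauses end.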

I will then organize the tokens into sequences of 50 input words and 1 output word. That is, sequences of 51 words.
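To make the windowing concrete, here is a small sketch of how a list of token ids can be turned into left-padded input windows and next-word targets. It is a simplified stand-in for the loop in the cell below, not a copy of it: the function name `make_windows`, the toy ids, and the window length of 3 are assumptions chosen to keep the output readable.

```python
def make_windows(token_ids, seq_length=50, pad_id=0):
    """Return (inputs, targets) for next-word prediction.

    Each target is a single token id. Its input is the seq_length ids that
    precede it, padded on the left with pad_id when fewer are available.
    """
    inputs, targets = [], []
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - seq_length):i]
        padding = [pad_id] * (seq_length - len(context))
        inputs.append(padding + context)
        targets.append(token_ids[i])
    return inputs, targets

# Toy example: five token ids, windows of three input tokens.
X_demo, y_demo = make_windows([5, 8, 2, 9, 4], seq_length=3)
for x, y in zip(X_demo, y_demo):
    print(x, "->", y)
```

Padding on the left mirrors `padding='pre'` in `pad_sequences`: the most recent words always sit at the end of the input, next to the word being predicted.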

In [7]:
tokenizer = Tokenizer(filters='‘’"#$%&*+-/:;<=>@[\\]^_`{|}~\t\n', oov_token='OOV', lower=False)

def data_processing(text_data, seq_length=50):
    #Capitalize all letters
    text = text_data.upper()
    #Isolate punctuations; except the periods
    cleanup_dict = {",":" ,"
               ,"!":" !"
               ,"?":" ?"
               ,"\n":""
               ,"_":""
               ,"“":""
               ,"”":""
               ,"(":"( "
               ,")":" )"
               ,"II.":""
               ,"III.":""
               ,"IV.":""
               ,"V.":""
               ,"VI.":"" 
               ,"VII.":""
               ,"VIII.":"" 
               ,"IX.":"" 
               ,"X.":""
               ,"XI.":"" 
               ,"XII.":""                    
                }
    for from_this, to_this in cleanup_dict.items():
        text = text.replace(from_this, to_this)
    #I want to preserve periods that designate acronyms (ones that are not followed by any space)
    text = re.sub(r'\.\s+',' . ', text)
    #Tokenize all words in the text
    text = text_to_word_sequence(text, filters='‘’"#$%&*+-/:;<=>@[\\]^_`{|}~\t\n', lower=False)
    tokenizer.fit_on_texts(text)
    tokens = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    words_count = len(word_index)+1
    #Turn the tokenized text into sequences of the specified length and append to the "sequences" list
    sequences = []
    length = seq_length +1
    #For the first 50 words
    for i in range(1, length):
        # select sequence of tokens
        seq = tokens[0:i+1]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
    #For the rest of the data 
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
        
    #Finding the maximum length in this dataset (just in case...)
    max_sequence_length = max(len(x) for x in sequences)    
    #Make sure all sequences are of the same length
    sequences = pad_sequences(sequences,maxlen=max_sequence_length,padding='pre')
    X = sequences[:,:-1]
    y = sequences[:,-1]
    y = ku.to_categorical(y, num_classes=words_count)

    return X,y,max_sequence_length,words_count, word_index

In [8]:
X, y, max_seq_length, total_words_count, word_index = data_processing(text_data)

Save the tokenizer, since it will be needed to process the test data.

In [9]:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

## Defining the Model

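For this experiment, I will train a neural network with a 10-dimensional embedding layer followed by two stacked LSTM layers of 100 units each. The second LSTM layer feeds a fully connected layer of 100 ReLU units, and a softmax output layer assigns a probability to every word in the vocabulary.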

In [18]:
model = Sequential()
model.add(Embedding(total_words_count, 10, input_length=max_seq_length - 1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(total_words_count, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 50, 10)            81910     
_________________________________________________________________
lstm_8 (LSTM)                (None, 50, 100)           44400     
_________________________________________________________________
lstm_9 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_5 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_6 (Dense)              (None, 8191)              827291    
Total params: 1,044,101
Trainable params: 1,044,101
Non-trainable params: 0
_________________________________________________________________
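The parameter counts in the summary can be checked by hand. The sketch below assumes standard layers with bias terms (the Keras defaults) and uses the vocabulary size of 8191 shown in the summary; the helper function names are made up for this illustration.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # A weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size = 8191
counts = [
    embedding_params(vocab_size, 10),   # embedding
    lstm_params(10, 100),               # first LSTM
    lstm_params(100, 100),              # second LSTM
    dense_params(100, 100),             # hidden dense layer
    dense_params(100, vocab_size),      # softmax output layer
]
total = sum(counts)
print(counts, total)
```

About 80% of the parameters sit in the output layer, because it needs one weight per hidden unit for every word in the vocabulary.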


## Training the Model

In [19]:
from keras.callbacks import History 
from keras.callbacks import EarlyStopping
batch_size = 128
epochs = 50

In [20]:
from keras.callbacks import ModelCheckpoint
# checkpoint
filepath="weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

history = model.fit(X, y, epochs=epochs, verbose = 1
                      , batch_size=batch_size
                      , callbacks=callbacks_list
                      ,validation_split=0.2
                      #,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)]
                     )
print("Training completed!")
model.save('model.h5') 

Train on 96363 samples, validate on 24091 samples
Epoch 1/50
...
Epoch 50/50
Training completed!


## Re-load the Saved Model

In [22]:
model = load_model('model.h5',compile=False)

## Evaluation of Model

Load the evaluation set. 

In [23]:
with open('hound-test.txt', encoding="utf-8") as f:
    test_data = f.read()

Define a function that processes the test set into the format that the model requires.

In [30]:
def test_data_processing(text_data, tokenizer, seq_length=50):
    text = text_data 
    text = text.replace("\n", " ")
    #Tokenize all words in the text
    text = text_to_word_sequence(text, filters='‘’"#$%&*+-/:;<=>@[\\]^_`{|}~\t\n', lower=False)
    tokens = tokenizer.texts_to_sequences(text)
    #Turn the tokenized text into sequences of the specified length and append to the "sequences" list
    sequences = []
    length = seq_length +1
    
    #For the first 50 words
    for i in range(1, length):
        # select sequence of tokens
        seq = tokens[0:i+1]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
    #For the rest of the data 
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        flattened_seq = [val for sublist in seq for val in sublist]
        sequences.append(flattened_seq)
        
    #Make sure all sequences are of the same length (seq_length inputs + 1 output)
    sequences = pad_sequences(sequences,maxlen=length,padding='pre')
    X = sequences[:,:-1]
    y = sequences[:,-1]

    return X,y

In [37]:
Xtest, ytest = test_data_processing(test_data, tokenizer)

In [38]:
# make predictions from the test data
ypred=model.predict(Xtest)

In [71]:
# collect the probability that the model assigned to the correct next word of each test sequence
def model_perplexity(ytest,ypred, dictionary, verbose=False):
    # invert the word -> index mapping once so that verbose lookups are cheap
    index_to_word = {index: word for word, index in dictionary.items()}
    y_test_proba = []
    
    for i in tqdm(range(ytest.shape[0])):
        if verbose:
            print(f"Probability of the word {index_to_word.get(ytest[i], 'OOV')} is {ypred[i][ytest[i]]}.")
        y_test_proba.append(ypred[i][ytest[i]])
    return y_test_proba

In [103]:
ytest_probability =  model_perplexity(ytest,ypred, word_index)
ytest_probability = np.asarray(ytest_probability)[np.nonzero(np.asarray(ytest_probability))]

100%|█████████████████████████████████████████████████████████████████████████| 66729/66729 [00:02<00:00, 30438.45it/s]


Perplexity measures how well a probability model predicts a test set. It is the exponential of the entropy, computed here as the average negative log-probability that the model assigns to the correct next word.
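As a quick sanity check on the definition, the sketch below computes perplexity for two made-up lists of probabilities. The function name `perplexity` and both inputs are assumptions for illustration; the only value taken from this notebook is the vocabulary size of 8191.

```python
import math

def perplexity(probabilities):
    """Exponential of the mean negative log-probability of the correct words."""
    entropy = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(entropy)

vocab = 8191
# A model that spreads its probability evenly over the whole vocabulary.
uniform = perplexity([1.0 / vocab] * 1000)
# A model that always gives the correct word a probability of one half.
confident = perplexity([0.5] * 1000)
print(uniform, confident)
```

A uniform guess scores a perplexity equal to the vocabulary size, and a model that always puts probability 0.5 on the right word scores 2, so lower is better and the vocabulary size is a natural reference point.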

In [104]:
# get the entropy and perplexity
entropy=np.mean(-np.log(ytest_probability))
print(math.exp(entropy))

47472.87709264595


## Summary 

My time was mainly spent on comparing the training set to the test set and coming up with data processing steps that make the training set resemble the test set. The following is a list of the data processing steps:
- Capitalize all letters
- Separate punctuation such as "!", "?", ",", and "." (except for acronyms) from the adjacent words with spaces to mimic the test set
- Remove any punctuation that does not exist in the test set
- Remove irrelevant content from the training set
- Transform the training set into sequences of 51 words as input data to the LSTM Language Model

The calculated perplexity of this model is 47472.88. This LSTM model, trained for only 50 epochs, is clearly a poor language model: its perplexity is **extremely high**, well above the 8,191 that a uniform guess over the vocabulary would score.

Given more time, I would improve the results by doing the following: 
- Start with a simpler n-gram model with proper smoothing to be used as a benchmark 
- Train the same architecture for more epochs
- Increase the complexity of the model
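
The n-gram benchmark from the first item could look roughly like the sketch below: a bigram model with add-one (Laplace) smoothing, scored with the same perplexity measure. This was not run as part of this challenge. The function names and the toy sentences are assumptions, and a real benchmark would use the processed training and test tokens and would also need a policy for words unseen in training.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count how often each word is a context and how often each bigram occurs."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

def bigram_perplexity(tokens, contexts, bigrams, vocab_size):
    """Exponential of the mean negative log-probability over all bigrams."""
    log_sum = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        log_sum += math.log(bigram_prob(prev, word, contexts, bigrams, vocab_size))
    return math.exp(-log_sum / (len(tokens) - 1))

# Toy data, already capitalized and with punctuation as separate tokens.
train = "THE DOG SAW THE CAT . THE CAT SAW THE DOG .".split()
test = "THE DOG SAW THE DOG .".split()

contexts, bigrams = train_bigram(train)
vocab = sorted(set(train))
ppl = bigram_perplexity(test, contexts, bigrams, len(vocab))
print(ppl)
```

Because smoothing gives every bigram a non-zero probability, no test word has to be dropped before taking the logarithm, which makes the resulting number a clean reference for the LSTM.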