# Language Modeling: Word level
___
#### Description: 
In this notebook I train a recurrent neural network (RNN) on Moby Dick and use the model to generate new text. As in the character-level model, I use a many-to-one RNN: the input is a sequence of words and the output is the word that follows it.
___
#### Dataset: 
The dataset is the text of Moby Dick.
___
#### Reference: 

This notebook was completed using the following resources as a guide:
https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

In [1]:
# Read in the text
with open('Datasets/mobydick.txt', 'r') as f:
    text = f.read()
    
print(text[:500])


MOBY-DICK 

CHAPTER I 

LOOMINGS 

CALL me Ishmael. Some years ago never mind how 
long precisely having little or no money in my purse, 
and nothing particular to interest me on shore, I thought 
I would sail about a little and see the watery part of the 
world. It is a way I have of driving off the spleen, and 
regulating the circulation. Whenever I find myself 
growing grim about the mouth ; whenever it is a damp, 
drizzly November in my soul ; whenever I find myself 
involuntarily pausing b


In [2]:
# Cleaning the text
"""
By skimming over the text we can see what needs to be cleaned. For example, the text includes 
headers/footers, page numbers, volume numbers, chapter titles, etc. These should be removed.
As for punctuation, I will remove all instances of "- \n" since they usually indicate a word
has been split to a new line. I will also remove punctuation except for: '.', ',', ';', '?', '!' 
which will be replaced with: <eos>, <comma>, <semicolon>, <question>, <exclamation>. 
Numbers will also be removed and words will all be made lowercase.

"""


In [2]:
# Function to clean text
import re

def clean_and_tokenize(text):
    text = re.sub(r'(\n[0-9A-Z -]+.{0,12} \n)', '', text) # Remove headers/footers
    text = text.replace('Mr.', 'Mr') # Remove '.' from Mr. 
    text = text.replace('Mrs.', 'Mrs') # Remove '.' from Mrs.
    text = text.replace(';', ' <semicolon> ') # Replace ';' with <semicolon>
    text = text.replace('.', ' <eos> ') # Replace '.' with <eos>
    text = text.replace(',', ' <comma> ') # Replace ',' with <comma>
    text = text.replace('?', ' <question> ') # Replace '?' with <question>
    text = text.replace('!', ' <exclamation> ') # Replace '!' with <exclamation>
    text = text.replace('- \n', '') # Fix divided words
    text = re.sub(r'[^a-zA-Z<>\s]', '', text) # Remove remaining punctuation (raw string avoids an invalid-escape warning)
    text = text.lower() # Make all text lowercase
    text = text.split() # split text to a list of words
    
    return text

In [3]:
# Clean and tokenize the text
text = clean_and_tokenize(text)
print(text[:100])

['call', 'me', 'ishmael', '<eos>', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', '<comma>', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', '<comma>', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '<eos>', 'it', 'is', 'a', 'way', 'i', 'have', 'of', 'driving', 'off', 'the', 'spleen', '<comma>', 'and', 'regulating', 'the', 'circulation', '<eos>', 'whenever', 'i', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', '<semicolon>', 'whenever', 'it', 'is', 'a', 'damp', '<comma>', 'drizzly', 'november', 'in', 'my', 'soul', '<semicolon>', 'whenever', 'i', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', '<comma>', 'and', 'bringing', 'up', 'the']
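
To make the cleaning steps easier to follow, here is a minimal sketch that applies the same replacements, in the same order, to a single made-up sentence. The sample string and the `mark`/`token` loop are my own illustration, not part of the notebook's function.

```python
import re

# Made-up sample: a title, a word split across a line break, and some punctuation
sample = "Mr. Starbuck said no- \nthing; then, at last, he spoke. Why?"

sample = sample.replace('Mr.', 'Mr')  # keep the '.' in "Mr." from becoming <eos>
for mark, token in [(';', '<semicolon>'), ('.', '<eos>'), (',', '<comma>'),
                    ('?', '<question>'), ('!', '<exclamation>')]:
    sample = sample.replace(mark, f' {token} ')
sample = sample.replace('- \n', '')            # rejoin "no- \nthing" into "nothing"
sample = re.sub(r'[^a-zA-Z<>\s]', '', sample)  # drop any remaining punctuation
tokens = sample.lower().split()
print(tokens)
```

The title has to be handled before the general `'.'` replacement, otherwise "Mr." would produce a spurious `<eos>` token in the middle of a sentence.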


In [4]:
# Extract all unique words from text
words = sorted(list(set(text)))
print('Unique words:', len(words))

Unique words: 12503


In [5]:
# Create dictionaries that map words to indices and vice versa
# (indices start at 1; index 0 is reserved for padding)
word_indices = dict((w, i+1) for i, w in enumerate(words))
indices_word = dict((i+1, w) for i, w in enumerate(words))

In [6]:
# Convert each word to an integer using word_indices
word_ind = [word_indices[word] for word in text]

In [7]:
# First 100 words in the text as integers
print(word_ind[:100])

[1507, 6657, 5784, 2, 10034, 12466, 242, 7154, 6815, 5219, 6359, 8182, 4924, 6312, 7440, 7193, 6894, 5439, 7042, 8492, 1, 387, 7243, 7650, 11128, 5686, 6657, 7405, 9681, 1, 5300, 11008, 5300, 12408, 9202, 44, 11, 6312, 387, 9448, 10947, 12046, 7644, 7352, 10947, 12387, 2, 5800, 5782, 11, 12055, 5300, 4921, 7352, 3242, 7353, 10947, 10169, 1, 387, 8775, 10947, 1861, 2, 12185, 5300, 4020, 7043, 4732, 4701, 44, 10947, 6971, 5, 12185, 5800, 5782, 11, 2648, 1, 3243, 7255, 5439, 7042, 10068, 5, 12185, 5300, 4020, 7043, 5758, 7701, 917, 2006, 11996, 1, 387, 1348, 11707, 10947]
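
The mapping can be checked on a toy example. The sketch below uses a made-up six-token list and hypothetical names (`word_to_index`, `index_to_word`) to show that encoding followed by decoding returns the original tokens, and that index 0 stays unused.

```python
# Toy token list standing in for the cleaned text
tokens = ['call', 'me', 'ishmael', '<eos>', 'call', 'me']

# Sorted vocabulary; '<' sorts before lowercase letters, so special tokens come first
vocab = sorted(set(tokens))

# Indices start at 1 so that 0 can be used for padding
word_to_index = {w: i + 1 for i, w in enumerate(vocab)}
index_to_word = {i: w for w, i in word_to_index.items()}

encoded = [word_to_index[w] for w in tokens]
decoded = [index_to_word[i] for i in encoded]
print(encoded)
print(decoded == tokens)
```

This also explains why `<comma>` and `<eos>` received the indices 1 and 2 in the output above: the special tokens sort ahead of every ordinary word.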


In [8]:
# Length of each sequence (50 input words + 1 output word)
maxlen = 51

In [9]:
# Create sequences
sequences = []
for i in range(0, len(word_ind)-maxlen):
    sequence = word_ind[i:maxlen+i]
    sequences.append(sequence)

In [10]:
# Convert list to numpy array
import numpy as np

sequences = np.array(sequences)
print('(#Sequences, Sequence length) ->', sequences.shape)

(#Sequences, Sequence length) -> (120245, 51)
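
The sliding-window step is easier to see on a small scale. This is a sketch with made-up numbers (`ids` and `window` stand in for `word_ind` and `maxlen`); it uses the same loop bound as the cell above and then splits each window into inputs and a target, as is done a few cells later.

```python
import numpy as np

ids = list(range(1, 9))  # stand-in for word_ind: 8 made-up word indices
window = 4               # stand-in for maxlen: 3 input words + 1 target word

# One window per starting position, each shifted by a single word
windows = np.array([ids[i:i + window] for i in range(len(ids) - window)])

# The last column is the word to predict; everything before it is the input
X_toy, y_toy = windows[:, :-1], windows[:, -1]
print(windows)
print(y_toy)
```

Note that with this loop bound the final possible window (here `[5, 6, 7, 8]`) is not generated, so the result has `len(ids) - window` rows rather than `len(ids) - window + 1`. For a corpus of this size the difference is a single training example.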


In [11]:
# Number of unique words + 1
vocab_size = len(words) + 1 # including 0th index
print('Vocab size:', vocab_size)

Vocab size: 12504


In [13]:
# Split sequences into inputs and outputs
from keras.utils import to_categorical

X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size) # convert y to one-hot vector

print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (120245, 50)
y shape: (120245, 12504)
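
As a rough illustration of what the one-hot conversion produces, the sketch below builds the same kind of matrix with plain NumPy for three made-up targets, and then estimates the memory needed for the full `y`. The 4-bytes-per-value figure is an assumption (32-bit floats); the actual dtype depends on the Keras version.

```python
import numpy as np

# One-hot encoding by indexing into an identity matrix (toy sizes)
toy_vocab_size = 5
y_ids = np.array([3, 1, 4])
one_hot = np.eye(toy_vocab_size, dtype='float32')[y_ids]
print(one_hot)

# Rough memory estimate for the real y, assuming 4 bytes per value
n_sequences, vocab_size = 120245, 12504
bytes_needed = n_sequences * vocab_size * 4
print(round(bytes_needed / 1e9, 2), 'GB')
```

Under that assumption the one-hot targets take about 6 GB. If memory is tight, one option is to keep `y` as integer indices and train with Keras's `sparse_categorical_crossentropy` loss, which accepts integer targets directly.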


In [14]:
# Build an RNN
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(vocab_size, 50, input_length=X.shape[1]),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(256, activation='relu'),
    Dense(vocab_size, activation='softmax')]
)

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            625200    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 256)               25856     
_________________________________________________________________
dense_2 (Dense)              (None, 12504)             3213528   
Total params: 4,005,384
Trainable params: 4,005,384
Non-trainable params: 0
_________________________________________________________________
None
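
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names `lstm_params` and `dense_params` are my own.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embed_dim = 12504, 50

counts = {
    'embedding_1': vocab_size * embed_dim,       # one 50-d vector per index
    'lstm_1': lstm_params(embed_dim, 100),
    'lstm_2': lstm_params(100, 100),
    'dense_1': dense_params(100, 256),
    'dense_2': dense_params(256, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f'{name}: {n:,}')
print(f'Total: {total:,}')
```

About 80% of the parameters sit in the final softmax layer, which is a direct consequence of predicting over the whole 12,504-word vocabulary.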


In [15]:
# Compile and fit the model to X and y
from keras.callbacks import ModelCheckpoint

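# Checkpoint weights after every epoch (optional)
# Note: the '{acc}' placeholder must match the metric name Keras logs;
# newer Keras versions log 'accuracy' rather than 'acc'
checkpoint = ModelCheckpoint('weights-{epoch:02d}-{acc:.2f}.hdf5')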

# Compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit
model.fit(X, y, epochs=100, callbacks=[checkpoint])

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1c3cbef0>

In [18]:
# Generate text
import sys
from keras.preprocessing.sequence import pad_sequences

# Map punctuation tokens back to the characters they replaced
punctuation = {'<eos>': '.', '<comma>': ',', '<semicolon>': ';',
               '<question>': '?', '<exclamation>': '!'}

# Prompt user input (lowercased to match the cleaned vocabulary)
text = input('Insert word(s) from vocab: ').lower()

print('======== Generated text ========')
sys.stdout.write(text)

# Convert inputted text to a list of word indices
text = text.split()
X_user = [word_indices[word] for word in text]

# Pad X_user to have the same length as our training examples
X_user = pad_sequences([X_user], maxlen=X.shape[1], padding='pre')

# Repeatedly generate a new word and update X_user to include it
for i in range(800):
    # Prediction on X_user
    y_pred = model.predict(X_user)

    # Sample from y_pred; cast to float64 and renormalize so that
    # np.random.choice accepts the probabilities
    p = y_pred.ravel().astype('float64')
    p[0] = 0  # index 0 is padding and has no entry in indices_word
    p /= p.sum()
    sample = np.random.choice(vocab_size, 1, p=p)
    next_word = indices_word[sample[0]]

    # Update X_user to include new word
    X_user = np.append(X_user[0][1:], sample[0]).reshape(1, -1)

    # Print sampled word (punctuation attaches to the previous word)
    if next_word in punctuation:
        sys.stdout.write(punctuation[next_word])
    else:
        sys.stdout.write(' ' + next_word)

Insert word(s) from vocab: the whale was
the whale was tossing away at a cane into its bed, superstitiously form out your bill, god bless ye, he s a very look in swimming out the ship of the tinkling glasses within us like a anchor a sort of badgerhaired as stubb again; but upon the hands were ready to sit her noble dismal night, and lumbered the eyes of which districts were gallantly below by the castle to pacific, and not only cruising ahead and hesitatingly ere such never fancied his ship can possibly have to break his white stool in the open air, and standing in the remotest degree ahab, but never go as if they burst this sleep; but they have left completely made your old wooden whalebone, you, having one look at one tithe of their heart. now he procured down in the transom, and was there he will answer him to see from any wise old is us yet had just thrown after daggoo. more considered his eyes like a whale till altogether business of such lost a second land, he darted down the mi
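
The generation loop combines three ideas: left-padding the seed, sampling the next index from a probability vector, and sliding the input window forward by one word. Here is a NumPy-only sketch of that mechanism with toy sizes. `fake_predict` is a hypothetical stand-in for `model.predict` and simply returns a uniform distribution over the non-padding indices.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, window = 6, 5  # made-up sizes; index 0 is the padding slot

# Left-pad the seed so the most recent words sit at the end of the window
seed_ids = [3, 1]
x = np.zeros((1, window), dtype=int)
x[0, -len(seed_ids):] = seed_ids

def fake_predict(x):
    """Hypothetical stand-in for model.predict: uniform over real words."""
    p = np.ones(toy_vocab_size)
    p[0] = 0.0  # never predict the padding index
    return (p / p.sum()).reshape(1, -1)

generated = []
for _ in range(4):
    p = fake_predict(x).ravel()
    next_id = rng.choice(toy_vocab_size, p=p)        # sample the next index
    generated.append(int(next_id))
    x = np.append(x[0, 1:], next_id).reshape(1, -1)  # drop oldest, append newest

print(generated)
print(x)
```

Sampling from the distribution, rather than always taking the most likely word, is what keeps the generated text from looping on the same few high-probability words.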

In [19]:
# Save the model and the summary for it
model.save('mobydick_text_generator.h5')

with open('mobydick_text_generator.txt', 'w+') as f:
    model.summary(print_fn=lambda x: f.write(x + '\n'))

#### Things that can be done to improve the model:
- Use pre-trained word embeddings.
- Tune the hyperparameters.
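
For the first idea, the usual preparation step is to build a matrix whose row `i` holds the pre-trained vector for the word with index `i`. The sketch below shows that step only, with a made-up three-word vocabulary and made-up 4-dimensional vectors (`pretrained` is hypothetical; real vectors would come from a published embedding file).

```python
import numpy as np

embed_dim = 4

# Hypothetical pre-trained vectors; 'ahab' is deliberately missing
pretrained = {
    'whale': np.array([0.1, 0.2, 0.3, 0.4]),
    'ship': np.array([0.5, 0.6, 0.7, 0.8]),
}

# Toy word-to-index mapping in the same style as word_indices (0 = padding)
toy_word_indices = {'ahab': 1, 'ship': 2, 'whale': 3}

# Row 0 (padding) and rows for unknown words stay at zero
embedding_matrix = np.zeros((len(toy_word_indices) + 1, embed_dim))
for word, idx in toy_word_indices.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector

print(embedding_matrix)
```

A matrix like this would then be supplied to the `Embedding` layer as its initial weights; the exact argument for doing so differs between Keras versions, so I have left that part out.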
