# Overview

A language model can predict the probability of the next word in the sequence, based on the words already observed in the sequence. Neural network models are a preferred method for developing statistical language models because they can use a distributed representation where different words with similar meanings have similar representation and because they can use a large context of recently observed words when making predictions. In this tutorial, you will discover how to develop a statistical language model using deep learning in Python.


## Data Preparation

### Data Overview: The Republic by Plato

The Republic is the classical Greek philosopher Plato’s most famous work. It is structured as a dialog (i.e. a conversation) on the topic of order and justice within a city state. The entire text is available for free in the public domain.

### Data Cleaning

What do you see that we will need to handle in preparing the data? Here’s what I see from a quick look:

- Book/Chapter headings (e.g. BOOK I.).
- Lots of punctuation (e.g. -, ;-, ?-, and more).
- Strange names (e.g. Polemarchus).
- Some long monologues that go on for hundreds of lines.
- Some quoted dialog (e.g. ‘...’).

These observations, and more, suggest ways that we may wish to prepare the text data. The specific way we prepare the data really depends on how we intend to model it, which in turn depends on how we intend to use it.

### Language Model Design

In this tutorial, we will develop a model of the text that we can then use to generate new sequences of text. The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will be fed in as input to in turn generate the next word. A key design decision is how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict. This input length will also define the length of seed text used to generate new sequences when we use the model.

There is no correct answer. With enough time and resources, we could explore the ability of the model to learn with differently sized input sequences. Instead, we will pick a length of 50 words for the length of the input sequences, somewhat arbitrarily. We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.

Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text. 
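The idea of feeding each predicted word back in as input can be sketched without a neural network. In this minimal illustration, the hypothetical `toy_next_word` function stands in for the trained model (it simply returns the most frequent follower of the last word), and `window` plays the role of the 50-word input length. None of these names come from the tutorial code; they are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_toy_model(tokens):
    """Count which word follows each word (a stand-in for a trained model)."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers

def toy_next_word(followers, context):
    """Return the most frequent follower of the last word in the context."""
    candidates = followers.get(context[-1])
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

def generate(followers, seed, n_words, window):
    """Generate words one at a time, feeding each prediction back in."""
    context = list(seed)[-window:]
    generated = []
    for _ in range(n_words):
        word = toy_next_word(followers, context)
        if word is None:
            break
        generated.append(word)
        # Slide the window: append the prediction, keep only the newest words
        context = (context + [word])[-window:]
    return generated

corpus = ("the just man is happy and the just man is wise "
          "because the unjust man is happy").split()
followers = build_toy_model(corpus)
print(generate(followers, ["the"], 4, window=3))
```

The real model conditions on the whole window rather than just the last word, but the loop structure (predict, append, slide) is the same.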

In [2]:
# Data preparation 

# 1. Read in file
with open("../data/republic_clean.txt", "r") as file:
    text = file.read()

print(text[:200])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


### Clean Text

We need to transform the raw text into a sequence of tokens or words that we can use as a source to train the model. Based on reviewing the raw text (above), below are some specific operations we will perform to clean the text.

- Replace ‘--’ with a white space so we can split words better.
- Split words based on white space.
- Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
- Remove all words that are not alphabetic to remove standalone punctuation tokens.
- Normalize all words to lowercase to reduce the vocabulary size.

Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.
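To make the vocabulary point concrete, here is a small sketch that applies the cleaning steps listed above to a made-up sentence and compares the number of unique tokens before and after. The `sample` string is invented for illustration and is not taken from the book.

```python
import re
import string

sample = "What? What is justice--and WHAT is injustice? Justice, he said."

# Naive tokenization: split on white space only
raw_tokens = sample.split()

# The cleaning steps from the list above, applied to the toy sample
punct = re.compile("[%s]" % re.escape(string.punctuation))
clean_tokens = sample.replace("--", " ").split()
clean_tokens = [punct.sub("", w) for w in clean_tokens]
clean_tokens = [w.lower() for w in clean_tokens if w.isalpha()]

print(len(set(raw_tokens)), "unique raw tokens")
print(len(set(clean_tokens)), "unique clean tokens")
```

Variants such as `What?`, `What`, and `WHAT` collapse into a single token, which is where the reduction comes from.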

In [4]:
import re 
import string

# turn doc into clean tokens
def clean_doc(doc):
    doc = doc.replace('--', ' ')
    tokens = doc.split()
    
    # strip punctuations
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    
    tokens = [word.lower() for word in tokens if word.isalpha()]
    return tokens

tokens = clean_doc(text)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',

### Save clean text as a set of 50 word sequences

We can organize the long list of tokens into sequences of 50 input words and 1 output word. That is, sequences of 51 words.

In [12]:
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    seq = tokens[i-length:i]
    line = ' '.join(seq)
    sequences.append(line)
    
print('Total Sequences: %d' % len(sequences))
print('Sequence example:\n', sequences[:1])

# Save file (the with-block closes the file automatically)
with open("../data/republic_sequences.txt", "w") as file:
    data = '\n'.join(sequences)
    file.write(data)

Total Sequences: 118633
Sequence example:
 ['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was']
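The windowing step is easier to follow on a handful of tokens. The sketch below uses a hypothetical helper, `make_windows`, with 3 input words and 1 output word instead of 50 and 1.

```python
def make_windows(tokens, length):
    """Return every contiguous run of `length` tokens."""
    return [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

toy_tokens = "i went down yesterday to the piraeus".split()
windows = make_windows(toy_tokens, length=4)  # 3 input words + 1 output word

for w in windows:
    print(w[:-1], "->", w[-1])
```

Note that this sketch iterates up to `len(tokens) + 1` so that the final window is included. The loop in the cell above stops at `len(tokens)`, so it yields one fewer sequence, which makes no practical difference on a corpus of this size.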


## Train Language Model

We can now train a statistical language model from the prepared data. The model we will train is a neural language model. It has a few unique characteristics:

- It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
- It learns the representation at the same time as learning the model.
- It learns to predict the probability for the next word using the context of the last 50 words.

Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context. Let’s start by loading our training data.

In [16]:
# Load sequences
with open("../data/republic_sequences.txt", "r") as file:
    data_seq = file.read().split('\n')

print(data_seq[0])

book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was


### Encode Sequences

The word embedding layer expects input sequences to be comprised of integers. We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping. To do this encoding, we will use the Tokenizer class in the Keras API.

Words are assigned values from 1 to the total number of words (e.g. 7,409). The Embedding layer needs to allocate a vector representation for each word in this vocabulary, from index 1 to the largest index. Because array indexing is zero-based, the word at the end of the vocabulary has index 7,409, which means the array must be 7,409 + 1 in length. Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1 larger than the actual vocabulary.
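The following plain-Python sketch mimics the kind of word-to-integer mapping described above, to show why the vocabulary size is one larger than the number of words. It is an illustration only, not the Keras implementation; `build_word_index` is a hypothetical helper that numbers words from 1, most frequent first.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer starting at 1, most frequent first."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_lines = ["the just man", "the unjust man", "the city"]
toy_word_index = build_word_index(toy_lines)
toy_encoded = [[toy_word_index[w] for w in line.split()] for line in toy_lines]

# Index 0 is never assigned to a word, so a lookup table needs one extra row
toy_vocab_size = len(toy_word_index) + 1

print(toy_word_index)
print(toy_encoded)
print(toy_vocab_size)
```

With 5 distinct words the largest index is 5, so a zero-based table must have 6 rows to hold it.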

In [19]:
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# Tokenize sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data_seq)
sequences = tokenizer.texts_to_sequences(data_seq)

# vocab size
vocab_size = len(tokenizer.word_index) + 1

print("Total number of sequences: ", len(sequences))
print("Vocab size: ", vocab_size)

Total number of sequences:  118633
Vocab size:  7410


In [22]:
import numpy as np

# Separate into X and y
sequences = np.array(sequences)
X,y = sequences[:,:-1], sequences[:,-1]

# One hot encode y based on size of vocab
# We want the probability distribution over the entire vocab space
y = to_categorical(y, num_classes=vocab_size)

# Finally, we need to specify to the Embedding layer how long input sequences are.
# We know that there are 50 words because we designed the model, but a good generic way
# to specify that is to use the second dimension (number of columns) of the input data’s shape.
# That way, if you change the length of sequences when preparing data, you do not need to change
# this data loading code; it is generic.

print("X shape", X.shape)
seq_length = X.shape[1]

X shape (118633, 50)
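What the split and one-hot encoding do can be seen on a tiny example. This is a plain-Python sketch with a hypothetical helper, `split_and_one_hot`, and invented integer sequences; the cell above does the same thing with NumPy slicing and `to_categorical`.

```python
def split_and_one_hot(sequences, vocab_size):
    """Split each sequence into inputs and a one-hot encoded final word."""
    inputs = [seq[:-1] for seq in sequences]   # every word except the last
    targets = [seq[-1] for seq in sequences]   # the last word is the label
    one_hot = []
    for t in targets:
        row = [0] * vocab_size
        row[t] = 1                             # mark the target word's index
        one_hot.append(row)
    return inputs, one_hot

toy_sequences = [[1, 3, 2, 4], [3, 2, 4, 1]]
X_toy, y_toy = split_and_one_hot(toy_sequences, vocab_size=5)

print(X_toy)
print(y_toy)
```

Each row of `y_toy` has exactly one 1, at the index of the word the model should predict.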


### Build and Fit Model

The learned embedding needs to know the size of the vocabulary and the length of input sequences as previously discussed. It also has a parameter to specify how many dimensions will be used to represent each word. That is, the size of the embedding vector space.

Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or larger values. We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary with a probability for each word in the vocabulary. A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.

In [24]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add( Embedding(vocab_size, 50, input_length=seq_length) )
    model.add( LSTM(100, return_sequences=True) )
    model.add( LSTM(100) )
    model.add( Dense(100, activation='relu') )
    model.add( Dense(vocab_size, activation='softmax') )
    
    # Compile
    model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
    
    return model

model = define_model(vocab_size, seq_length)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 50)            370500    
_________________________________________________________________
lstm_3 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_3 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_4 (Dense)              (None, 7410)              748410    
Total params: 1,269,810
Trainable params: 1,269,810
Non-trainable params: 0
_________________________________________________________________
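The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are made up for this example.

```python
def embedding_params(n_vocab, dim):
    # One vector of length `dim` per vocabulary index
    return n_vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

n_vocab, embed_dim, n_units = 7410, 50, 100
layer_counts = [
    embedding_params(n_vocab, embed_dim),  # embedding
    lstm_params(embed_dim, n_units),       # first LSTM
    lstm_params(n_units, n_units),         # second LSTM
    dense_params(n_units, 100),            # hidden dense layer
    dense_params(100, n_vocab),            # softmax output layer
]
print(layer_counts, sum(layer_counts))
```

More than half of the parameters sit in the output layer, which is one reason a smaller vocabulary gives a smaller model.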


In [25]:
model.fit(X, y, batch_size=128, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x182d3f1ef0>

The model is compiled with the categorical cross-entropy loss. Technically, the model is learning a multiclass classification, and this is a suitable loss function for that type of problem. The Adam variant of mini-batch gradient descent is used as the optimizer, and accuracy is reported as a metric. The intended configuration fits the model for 100 training epochs with a modest batch size of 128; the cell above runs only 1 epoch for demonstration. A full run may take a few hours on modern hardware without GPUs. You can speed it up with a larger batch size and/or fewer training epochs.

During training, you will see a summary of performance, including the loss and accuracy evaluated from the training data at the end of each batch update. You will get different results, but with a full training run you might see an accuracy of just over 50% at predicting the next word in the sequence, which is not bad. We are not aiming for 100% accuracy (e.g. a model that memorized the text), but rather a model that captures the essence of the text.
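For readers who want to see what the softmax output and the categorical cross-entropy loss compute, here is a simplified sketch in plain Python. It omits details a real framework would include (such as clipping probabilities to avoid `log(0)`), and the scores are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)

toy_scores = [2.0, 1.0, 0.1]               # raw outputs for a 3-word vocabulary
toy_probs = softmax(toy_scores)
toy_target = [1, 0, 0]                     # the true next word is index 0
toy_loss = categorical_crossentropy(toy_target, toy_probs)

print(toy_probs)
print(toy_loss)
```

The loss shrinks toward 0 as the probability assigned to the true next word approaches 1.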

In [27]:
from pickle import dump
# save the model to file
model.save('../models/model.h5')
# save the tokenizer
dump(tokenizer, open('../models/tokenizer.pkl', 'wb'))

## Using Language Model

We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text. The model will require 50 words as input. Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.

In [31]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# Load sequences for sample seed text
with open("../data/republic_sequences.txt", "r") as file:
    data_seq = file.read().split('\n')

    print(data_seq[0])
seq_length = len(data_seq[0].split()) - 1
print("Seq length: ", seq_length)

# Load model
model = load_model('../models/model.h5')

# load the tokenizer
tokenizer = load(open('../models/tokenizer.pkl', 'rb'))

book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was
Seq length:  50


In [39]:
from random import randint

# Seed text
# randint is inclusive at both ends, so the upper bound must be len - 1
seed_text = data_seq[randint(0, len(data_seq) - 1)]
print(seed_text + '\n')

# Encode it
encoded = tokenizer.texts_to_sequences([seed_text])[0]
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

# predict next word
pred = model.predict_classes(encoded, verbose=0)

out_word = ''
for word, index in tokenizer.word_index.items():
    if index == pred:
        out_word = word
        break
print("Predicted next word: ", out_word )

man will be most likely to care about that which he loves to be sure and he will be most likely to love that which he regards as having the same interests with himself and that of which the good or evil fortune is supposed by him at any time most

Predicted next word:  the
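The loop above scans the whole vocabulary to find the word for a predicted index. An alternative, sketched here on a toy dictionary standing in for `tokenizer.word_index`, is to invert the mapping once and then look words up directly. The names `toy_index` and `index_to_word` are assumptions for this example.

```python
# Toy stand-in for tokenizer.word_index (word -> integer)
toy_index = {"the": 1, "man": 2, "just": 3}

# Invert the mapping once; each later lookup is a single dictionary access
index_to_word = {index: word for word, index in toy_index.items()}

predicted_index = 2  # stand-in for the class index predicted by the model
print(index_to_word.get(predicted_index, ""))
```

Using `.get` with a default of `''` keeps the same fallback as the loop when an index has no word (for example, index 0).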


In [43]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    
    # generate a fixed number of words
    for _ in range(n_words):
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)
        
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word) 
    return ' '.join(result)

# Generate next 10 words
new_seq = generate_seq( model, tokenizer, seq_length, seed_text, 10 )
print(new_seq)

the the the the the the the the the the
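Because `in_text` grows by one word on every iteration, the encoded input must be cut back to the expected length each time. The sketch below is a plain-Python imitation of that pre-truncation and pre-padding behaviour on a single list, assuming the default padding side and a padding value of 0; `pad_or_truncate_pre` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad with `value` if too short."""
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

print(pad_or_truncate_pre([5, 6, 7, 8, 9], maxlen=3))  # too long: oldest items dropped
print(pad_or_truncate_pre([5, 6], maxlen=3))           # too short: zeros added on the left
```

Dropping items from the front means the model always sees the most recent words, including the ones it has just generated.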


Note: the model was trained for only 1 epoch for demonstration purposes. Retrain the model with more epochs for better results. Training may take several hours on a GPU.