# Overview

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence. Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. During training, sequences of words are provided as input and processed one word at a time, and a prediction is made and learned for each input sequence. Similarly, when making predictions, the process can be seeded with one or a few words; the predicted words are then gathered and presented as input on subsequent predictions in order to build up a generated output sequence.
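
To make the seed-and-extend idea concrete before any neural network is involved, here is a minimal sketch of a count-based bigram model. It is only an illustration, not one of the Keras models developed in this tutorial, and the helper names `train_bigrams`, `next_word_probs` and `generate` are hypothetical.

```python
from collections import Counter, defaultdict

text = ("Jack and Jill went up the hill "
        "To fetch a pail of water "
        "Jack fell down and broke his crown "
        "And Jill came tumbling after")

def train_bigrams(words):
    """Count how often each word follows each other word."""
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def next_word_probs(follows, word):
    """Turn the raw counts for one context word into probabilities."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate(follows, seed, n_words):
    """Greedy generation: feed each predicted word back in as the next input."""
    result = [seed]
    for _ in range(n_words):
        candidates = follows[result[-1]]
        if not candidates:  # no known continuation for this word
            break
        result.append(candidates.most_common(1)[0][0])
    return " ".join(result)

words = text.lower().split()
follows = train_bigrams(words)
print(next_word_probs(follows, "and"))  # probability of each word seen after "and"
print(generate(follows, "to", 5))       # seed with one word, extend by five
```

The loop in `generate` has the same shape as the generation code used later with the neural models: predict, append, and feed the result back in.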

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words. There are many ways to frame the sequences from a source text for language modeling. In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library. There is no single best approach, just different framings that may suit different applications.

## Model 1: One word in, one word out sequences

Given one word as the input, the model will learn to predict the next word in the sequence.


1. The first step is to encode the text as integers. Each lowercase word in the source text is assigned a unique integer, and we can convert the sequences of words to sequences of integers. Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the `texts_to_sequences()` function.

2. We will need to know the size of the vocabulary later, both for defining the word embedding layer in the model and for encoding output words using a one hot encoding. The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the `word_index` attribute. We add one because the integer for the largest encoded word must be usable as an array index, e.g. words encoded 1 to 21 need array indices 0 to 21, or 22 positions.

3. Next, we need to create sequences of words to fit the model with one word as input and one word as output.

In [17]:
import numpy as np
from keras.preprocessing.text import Tokenizer

# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """

# 1. Tokenize Text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
print( "Encoded data: ", encoded)

# 2. Determine vocab size
vocab_size = len(tokenizer.word_index) + 1
print("Vocab size: ", vocab_size)

# 3. Create word -> sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    print("Adding sequences:", sequence)
    sequences.append(sequence)
    
print("total sequences: ", len(sequences))

Encoded data:  [2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 2, 14, 15, 1, 16, 17, 18, 1, 3, 19, 20, 21]
Vocab size:  22
Adding sequences: [2, 1]
Adding sequences: [1, 3]
Adding sequences: [3, 4]
Adding sequences: [4, 5]
Adding sequences: [5, 6]
Adding sequences: [6, 7]
Adding sequences: [7, 8]
Adding sequences: [8, 9]
Adding sequences: [9, 10]
Adding sequences: [10, 11]
Adding sequences: [11, 12]
Adding sequences: [12, 13]
Adding sequences: [13, 2]
Adding sequences: [2, 14]
Adding sequences: [14, 15]
Adding sequences: [15, 1]
Adding sequences: [1, 16]
Adding sequences: [16, 17]
Adding sequences: [17, 18]
Adding sequences: [18, 1]
Adding sequences: [1, 3]
Adding sequences: [3, 19]
Adding sequences: [19, 20]
Adding sequences: [20, 21]
total sequences:  24
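
To see what the encoding amounts to, the sketch below reproduces the integers printed above in plain Python. It assumes that words are lowercased, split on whitespace, and numbered from 1 in order of descending frequency, with ties broken by first appearance; that assumption is consistent with the output above (for example, "and" occurs three times and is encoded as 1), but it is a simplification of what the Tokenizer does.

```python
from collections import Counter

text = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# 1. Lowercase and split into words
words = text.lower().split()

# Rank words by descending count; sorted() is stable, so ties keep first-seen order
counts = Counter(words)
ranked = sorted(counts.items(), key=lambda item: -item[1])
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
encoded = [word_index[w] for w in words]

# 2. Index 0 is never assigned to a word, hence the extra position
vocab_size = len(word_index) + 1

# 3. Slide a window of two words over the encoded text
pairs = [encoded[i - 1:i + 1] for i in range(1, len(encoded))]

print(encoded)
print(vocab_size, len(pairs))
```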


In [29]:
# Split out into X and y
sequences = np.array(sequences)
X,y = sequences[:,0], sequences[:,1]
print("X: ", X)
print("y:", y)

X:  [ 2  1  3  4  5  6  7  8  9 10 11 12 13  2 14 15  1 16 17 18  1  3 19 20]
y: [ 1  3  4  5  6  7  8  9 10 11 12 13  2 14 15  1 16 17 18  1  3 19 20 21]


**Convert output to one hot encoding**

We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding, with a 0 for every word in the vocabulary and a 1 at the position of the actual next word. This gives the network a ground truth to aim for, from which we can calculate error and update the model. Keras provides the `to_categorical()` function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.

In [30]:
from keras.utils import to_categorical
y = to_categorical(y, num_classes=vocab_size)
print(y)

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
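
As a rough illustration of what the one hot conversion produces, the following sketch builds the same kind of rows by hand. The helper `one_hot` is hypothetical and is not how `to_categorical()` is implemented.

```python
def one_hot(index, num_classes):
    """Return a row of zeros with a single 1.0 at the given index."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

vocab_size = 22
y = [1, 3, 4]  # first three target words from the example above
encoded_y = [one_hot(i, vocab_size) for i in y]

for row in encoded_y:
    print(row)
```

Each row has `vocab_size` entries, and position 0 is never set because no word is encoded as 0.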

### Define and Build Model

We are now ready to define the neural network model. The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore `input_length=1`. The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.

We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer. We can compile and fit the network on the encoded text data. Technically, we are modeling a multiclass classification problem (predict the word in the vocabulary), so we use the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again perhaps more than is needed. The network configuration was not tuned for this and later experiments; an over-specified configuration was chosen so that we could focus on the framing of the language model.
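
To make the softmax output and the cross entropy loss more tangible, here is a small plain-Python sketch. The functions `softmax` and `categorical_crossentropy` below are hand-written stand-ins for illustration, not the Keras implementations.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

vocab_size = 22

# A model with no preference gives every word the same score
uniform = softmax([0.0] * vocab_size)

target = [0.0] * vocab_size
target[3] = 1.0  # pretend the true next word is encoded as 3

loss = categorical_crossentropy(target, uniform)
print(round(loss, 4))  # equals ln(22) for a uniform guess
```

A model that guesses uniformly over 22 words has a loss of ln(22), roughly 3.09, which is a useful reference point when reading the first epochs of a training log.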

In [31]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# define the model
def define_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    
    #compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = define_model(vocab_size)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1, 10)             220       
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_2 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
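
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are illustrative.

```python
vocab_size, embed_dim, lstm_units = 22, 10, 50

# Embedding: one vector of length embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per (LSTM unit, output word) pair plus one bias per output
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Note that the input length does not appear in any of these formulas, which is why changing only the input length leaves the parameter count unchanged.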


After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack’ by encoding it and calling `model.predict_classes()` to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.

In [33]:
# Fit the model
model.fit(X,y,epochs=500, verbose=2)

Epoch 1/500
 - 0s - loss: 0.2408 - acc: 0.8750
Epoch 2/500
 - 0s - loss: 0.2405 - acc: 0.8750
Epoch 3/500
 - 0s - loss: 0.2401 - acc: 0.8750
Epoch 4/500
 - 0s - loss: 0.2398 - acc: 0.8750
Epoch 5/500
 - 0s - loss: 0.2395 - acc: 0.8750
Epoch 6/500
 - 0s - loss: 0.2392 - acc: 0.8750
Epoch 7/500
 - 0s - loss: 0.2389 - acc: 0.8750
Epoch 8/500
 - 0s - loss: 0.2386 - acc: 0.8750
Epoch 9/500
 - 0s - loss: 0.2383 - acc: 0.8750
Epoch 10/500
 - 0s - loss: 0.2380 - acc: 0.8750
Epoch 11/500
 - 0s - loss: 0.2377 - acc: 0.8750
Epoch 12/500
 - 0s - loss: 0.2374 - acc: 0.8750
Epoch 13/500
 - 0s - loss: 0.2371 - acc: 0.8750
Epoch 14/500
 - 0s - loss: 0.2368 - acc: 0.8750
Epoch 15/500
 - 0s - loss: 0.2365 - acc: 0.8750
Epoch 16/500
 - 0s - loss: 0.2362 - acc: 0.8750
Epoch 17/500
 - 0s - loss: 0.2360 - acc: 0.8750
Epoch 18/500
 - 0s - loss: 0.2357 - acc: 0.8750
Epoch 19/500
 - 0s - loss: 0.2354 - acc: 0.8750
Epoch 20/500
 - 0s - loss: 0.2351 - acc: 0.8750
Epoch 21/500
 - 0s - loss: 0.2349 - acc: 0.8750
E

Epoch 171/500
 - 0s - loss: 0.2129 - acc: 0.8750
Epoch 172/500
 - 0s - loss: 0.2128 - acc: 0.8750
Epoch 173/500
 - 0s - loss: 0.2128 - acc: 0.8750
Epoch 174/500
 - 0s - loss: 0.2127 - acc: 0.8750
Epoch 175/500
 - 0s - loss: 0.2126 - acc: 0.8750
Epoch 176/500
 - 0s - loss: 0.2125 - acc: 0.8750
Epoch 177/500
 - 0s - loss: 0.2125 - acc: 0.8750
Epoch 178/500
 - 0s - loss: 0.2124 - acc: 0.8750
Epoch 179/500
 - 0s - loss: 0.2123 - acc: 0.8750
Epoch 180/500
 - 0s - loss: 0.2122 - acc: 0.8750
Epoch 181/500
 - 0s - loss: 0.2122 - acc: 0.8750
Epoch 182/500
 - 0s - loss: 0.2121 - acc: 0.8750
Epoch 183/500
 - 0s - loss: 0.2120 - acc: 0.8750
Epoch 184/500
 - 0s - loss: 0.2119 - acc: 0.8750
Epoch 185/500
 - 0s - loss: 0.2119 - acc: 0.8750
Epoch 186/500
 - 0s - loss: 0.2118 - acc: 0.8750
Epoch 187/500
 - 0s - loss: 0.2117 - acc: 0.8750
Epoch 188/500
 - 0s - loss: 0.2117 - acc: 0.8750
Epoch 189/500
 - 0s - loss: 0.2116 - acc: 0.8750
Epoch 190/500
 - 0s - loss: 0.2115 - acc: 0.8750
Epoch 191/500
 - 0s 

 - 0s - loss: 0.2047 - acc: 0.8750
Epoch 339/500
 - 0s - loss: 0.2047 - acc: 0.8750
Epoch 340/500
 - 0s - loss: 0.2046 - acc: 0.8750
Epoch 341/500
 - 0s - loss: 0.2046 - acc: 0.8750
Epoch 342/500
 - 0s - loss: 0.2046 - acc: 0.8750
Epoch 343/500
 - 0s - loss: 0.2046 - acc: 0.8750
Epoch 344/500
 - 0s - loss: 0.2045 - acc: 0.8750
Epoch 345/500
 - 0s - loss: 0.2045 - acc: 0.8750
Epoch 346/500
 - 0s - loss: 0.2045 - acc: 0.8750
Epoch 347/500
 - 0s - loss: 0.2044 - acc: 0.8750
Epoch 348/500
 - 0s - loss: 0.2044 - acc: 0.8750
Epoch 349/500
 - 0s - loss: 0.2044 - acc: 0.8750
Epoch 350/500
 - 0s - loss: 0.2044 - acc: 0.8750
Epoch 351/500
 - 0s - loss: 0.2043 - acc: 0.8750
Epoch 352/500
 - 0s - loss: 0.2043 - acc: 0.8750
Epoch 353/500
 - 0s - loss: 0.2043 - acc: 0.8750
Epoch 354/500
 - 0s - loss: 0.2042 - acc: 0.8750
Epoch 355/500
 - 0s - loss: 0.2042 - acc: 0.8750
Epoch 356/500
 - 0s - loss: 0.2042 - acc: 0.8750
Epoch 357/500
 - 0s - loss: 0.2042 - acc: 0.8750
Epoch 358/500
 - 0s - loss: 0.2041

<keras.callbacks.History at 0x181e11b2b0>

In [36]:
# Generate a prediction
input_ = 'Jack'
print("input: ", input_)

encoded = tokenizer.texts_to_sequences([input_])[0]
encoded = np.array(encoded)
pred = model.predict_classes(encoded, verbose=0)

for word, index in tokenizer.word_index.items():
    if index == pred:
        print(word)

input:  Jack
and
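
The lookup step can also be written with a reversed dictionary instead of a loop over `word_index`. The sketch below uses a made-up probability vector and a truncated, hypothetical `word_index` purely for illustration. It also shows the argmax step explicitly: in newer Keras releases `predict_classes()` is not available, and the usual substitute is to take the argmax of the output of `model.predict()`, for example `np.argmax(model.predict(x), axis=-1)`. Whether that applies depends on the installed version.

```python
# Truncated, illustrative mapping in the same format as tokenizer.word_index
word_index = {"and": 1, "jack": 2, "jill": 3, "went": 4}

# Reverse the mapping once so each lookup is a single dictionary access
index_word = {index: word for word, index in word_index.items()}

# Made-up softmax output for one input word; position 0 is the unused index
probs = [0.01, 0.90, 0.02, 0.04, 0.03]

# The predicted class is the index with the highest probability
predicted = max(range(len(probs)), key=probs.__getitem__)

print(predicted, index_word[predicted])
```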


This process could then be repeated a few times to build up a generated sequence of words. To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.

In [40]:
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    
    # Generate a fixed number of words
    for _ in range(n_words):
        # 1. Encode input text 
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = np.array(encoded)
        
        # 2. Predict a word in provided vocabulary
        pred = model.predict_classes(encoded, verbose=0)
        
        # 3. Map Predicted word index to actual word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == pred:
                out_word = word
                break
        in_text, result = out_word, result  + ' ' + out_word
    return result
    
print(generate_seq(model, tokenizer, seed_text="Jill", n_words = 10))

Jill came tumbling after hill to fetch a pail of water


This is a good first-cut language model, but it does not take full advantage of the LSTM’s ability to handle sequences of input, which could disambiguate some of the ambiguous pairwise sequences by using a broader context.
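
The ambiguity can be quantified directly from the encoded text. The sketch below, written in plain Python with illustrative variable names, finds the words that are followed by more than one distinct word and computes the best training accuracy any model could reach when it sees only one word of context.

```python
from collections import Counter, defaultdict

# Encoded source text, as printed earlier
encoded = [2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
           2, 14, 15, 1, 16, 17, 18, 1, 3, 19, 20, 21]

# For each input word, count which words follow it
next_words = defaultdict(Counter)
for current, nxt in zip(encoded, encoded[1:]):
    next_words[current][nxt] += 1

# Inputs that map to more than one possible output
ambiguous = {w: dict(c) for w, c in next_words.items() if len(c) > 1}

# Best case: always predict the most frequent follower of each input word
best_correct = sum(max(c.values()) for c in next_words.values())
total_pairs = len(encoded) - 1
ceiling = best_correct / total_pairs

print(ambiguous)
print(best_correct, total_pairs, ceiling)
```

Three input words (the integers for "and", "jack" and "jill") have more than one follower, so at most 21 of the 24 pairs can be predicted correctly. That ceiling of 0.875 matches the accuracy the training log settles on.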

## Model 2: Line by Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up. For example:

![seq_words_ex](images/seq_words_ex.jpeg)

This approach may allow the model to use the context of each line in those cases where a simple one-word-in, one-word-out model is ambiguous. The cost is that the model no longer learns to predict words across lines, which may be fine if we are only interested in modeling and generating lines of text. Note that this representation requires padding the sequences so that every input has the same fixed length. This is a requirement when using Keras. First, we create the sequences of integers line by line, using a Tokenizer fit on the source text.


In [46]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """

# 1. Tokenize Text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

vocab_size = len(tokenizer.word_index) + 1
print("Vocab size: ", vocab_size)

# 2. Create line-based sequences
sequences = list()

for line in data.split("\n"):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print("total sequences: ", len(sequences))

# 3. Pad sequences, so that they are all the same length
max_length = max([ len(seq) for seq in sequences ])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print("max sequence length", max_length)

print("padded seq ex: ", sequences[0])

Vocab size:  22
total sequences:  21
max sequence length 7
padded seq ex:  [0 0 0 0 0 2 1]
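
The following plain-Python sketch shows what the prefix building and pre-padding amount to, starting from the already encoded lines. The helpers `prefix_sequences` and `pad_pre` are hypothetical stand-ins; `pad_pre` mimics only the `padding='pre'` behavior used above and is not the Keras implementation.

```python
def prefix_sequences(encoded_line):
    """All prefixes of a line that contain at least two words."""
    return [encoded_line[:i + 1] for i in range(1, len(encoded_line))]

def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value` up to maxlen, keeping the last maxlen items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# The four lines of the rhyme, encoded as integers
lines = [
    [2, 1, 3, 4, 5, 6, 7],
    [8, 9, 10, 11, 12, 13],
    [2, 14, 15, 1, 16, 17, 18],
    [1, 3, 19, 20, 21],
]

sequences = [p for line in lines for p in prefix_sequences(line)]
max_length = max(len(s) for s in sequences)
padded = [pad_pre(s, max_length) for s in sequences]

# The last item of each padded row is the word to predict
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(len(sequences), max_length)
print(padded[0], X[0], y[0])
```

Padding on the left keeps the target word in the last column of every row, which is what makes the simple column split into inputs and outputs possible.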


In [47]:
import numpy as np
from keras.utils import to_categorical

# Split sequences into X and y
sequences = np.array(sequences)
X,y = sequences[:, :-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

In [50]:
# Define Model
# Input is now the length of the padded_seq.

def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=seq_length))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return model

# seq_length is the max length of the sequences, which we used as the padding length
# Note we subtract 1, because the last item of each sequence is the y value
model = define_model(vocab_size, max_length-1)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 6, 10)             220       
_________________________________________________________________
lstm_4 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_4 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________


In [59]:
# Fit model
model.fit(X,y,epochs=500, verbose=2)

Epoch 1/500
 - 0s - loss: 3.0917 - acc: 0.0476
Epoch 2/500
 - 0s - loss: 3.0906 - acc: 0.0476
Epoch 3/500
 - 0s - loss: 3.0893 - acc: 0.0476
Epoch 4/500
 - 0s - loss: 3.0878 - acc: 0.0000e+00
Epoch 5/500
 - 0s - loss: 3.0863 - acc: 0.0476
Epoch 6/500
 - 0s - loss: 3.0848 - acc: 0.0952
Epoch 7/500
 - 0s - loss: 3.0832 - acc: 0.0952
Epoch 8/500
 - 0s - loss: 3.0817 - acc: 0.0952
Epoch 9/500
 - 0s - loss: 3.0801 - acc: 0.0952
Epoch 10/500
 - 0s - loss: 3.0784 - acc: 0.0952
Epoch 11/500
 - 0s - loss: 3.0767 - acc: 0.0952
Epoch 12/500
 - 0s - loss: 3.0750 - acc: 0.0952
Epoch 13/500
 - 0s - loss: 3.0731 - acc: 0.0952
Epoch 14/500
 - 0s - loss: 3.0712 - acc: 0.0952
Epoch 15/500
 - 0s - loss: 3.0692 - acc: 0.0952
Epoch 16/500
 - 0s - loss: 3.0671 - acc: 0.0952
Epoch 17/500
 - 0s - loss: 3.0649 - acc: 0.0952
Epoch 18/500
 - 0s - loss: 3.0625 - acc: 0.0952
Epoch 19/500
 - 0s - loss: 3.0601 - acc: 0.0952
Epoch 20/500
 - 0s - loss: 3.0574 - acc: 0.0952
Epoch 21/500
 - 0s - loss: 3.0547 - acc: 0.09

Epoch 171/500
 - 0s - loss: 0.6478 - acc: 0.8571
Epoch 172/500
 - 0s - loss: 0.6417 - acc: 0.8571
Epoch 173/500
 - 0s - loss: 0.6373 - acc: 0.8571
Epoch 174/500
 - 0s - loss: 0.6304 - acc: 0.8571
Epoch 175/500
 - 0s - loss: 0.6258 - acc: 0.8571
Epoch 176/500
 - 0s - loss: 0.6204 - acc: 0.8571
Epoch 177/500
 - 0s - loss: 0.6145 - acc: 0.8571
Epoch 178/500
 - 0s - loss: 0.6101 - acc: 0.8571
Epoch 179/500
 - 0s - loss: 0.6042 - acc: 0.8571
Epoch 180/500
 - 0s - loss: 0.5994 - acc: 0.8571
Epoch 181/500
 - 0s - loss: 0.5945 - acc: 0.8571
Epoch 182/500
 - 0s - loss: 0.5891 - acc: 0.8571
Epoch 183/500
 - 0s - loss: 0.5847 - acc: 0.8571
Epoch 184/500
 - 0s - loss: 0.5797 - acc: 0.8571
Epoch 185/500
 - 0s - loss: 0.5750 - acc: 0.8571
Epoch 186/500
 - 0s - loss: 0.5705 - acc: 0.8571
Epoch 187/500
 - 0s - loss: 0.5657 - acc: 0.8571
Epoch 188/500
 - 0s - loss: 0.5612 - acc: 0.8571
Epoch 189/500
 - 0s - loss: 0.5567 - acc: 0.8571
Epoch 190/500
 - 0s - loss: 0.5521 - acc: 0.8571
Epoch 191/500
 - 0s 

 - 0s - loss: 0.1910 - acc: 0.9524
Epoch 339/500
 - 0s - loss: 0.1898 - acc: 0.9524
Epoch 340/500
 - 0s - loss: 0.1886 - acc: 0.9524
Epoch 341/500
 - 0s - loss: 0.1875 - acc: 0.9524
Epoch 342/500
 - 0s - loss: 0.1865 - acc: 0.9524
Epoch 343/500
 - 0s - loss: 0.1853 - acc: 0.9524
Epoch 344/500
 - 0s - loss: 0.1843 - acc: 0.9524
Epoch 345/500
 - 0s - loss: 0.1832 - acc: 0.9524
Epoch 346/500
 - 0s - loss: 0.1822 - acc: 0.9524
Epoch 347/500
 - 0s - loss: 0.1812 - acc: 0.9524
Epoch 348/500
 - 0s - loss: 0.1802 - acc: 0.9524
Epoch 349/500
 - 0s - loss: 0.1792 - acc: 0.9524
Epoch 350/500
 - 0s - loss: 0.1781 - acc: 0.9524
Epoch 351/500
 - 0s - loss: 0.1771 - acc: 0.9524
Epoch 352/500
 - 0s - loss: 0.1762 - acc: 0.9524
Epoch 353/500
 - 0s - loss: 0.1752 - acc: 0.9524
Epoch 354/500
 - 0s - loss: 0.1742 - acc: 0.9524
Epoch 355/500
 - 0s - loss: 0.1732 - acc: 0.9524
Epoch 356/500
 - 0s - loss: 0.1723 - acc: 0.9524
Epoch 357/500
 - 0s - loss: 0.1714 - acc: 0.9524
Epoch 358/500
 - 0s - loss: 0.1705

<keras.callbacks.History at 0x1820c68710>

In [58]:
# Generate sequences

def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    in_text = seed_text
    
    # Generate a fixed number of words
    for _ in range(n_words):
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, padding="pre")
        
        # Make prediction
        pred = model.predict_classes(encoded, verbose=0)
        
        out_word=''
        for word, index in tokenizer.word_index.items():
            if index == pred:
                out_word = word
                break
        in_text += ' ' + out_word

    return in_text

In [60]:
# Evaluate the model by generating text from two seed words
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4)) 
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))

Jack fell down and broke
Jill jill came tumbling after
