- Language modeling involves predicting the next word in a sequence given the sequence of words already present. A language model is a key element in many natural language processing models such as machine translation and speech recognition.

# Model 1: One - Word - In, One - Word - Out Sequences

We can start with a very simple model. Given one word as input, the model will learn to predict
the next word in the sequence. For example:
- X, y
- Jack, and
- and, Jill
- Jill, went

In [1]:
# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """

- The first step is to encode the text as integers. Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers. Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.

In [2]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

Using TensorFlow backend.


In [3]:
[data]

[' Jack and Jill went up the hill\n\nTo fetch a pail of water\n\nJack fell down and broke his crown\n\nAnd Jill came tumbling after\n ']

In [4]:
[data][0]

' Jack and Jill went up the hill\n\nTo fetch a pail of water\n\nJack fell down and broke his crown\n\nAnd Jill came tumbling after\n '

In [5]:
from keras.preprocessing.text import Tokenizer
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
encoded

[2,
 1,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 2,
 14,
 15,
 1,
 16,
 17,
 18,
 1,
 3,
 19,
 20,
 21]

In [6]:
len(encoded)

25

In [7]:
# determine the vocabulary size (unique words)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 22


- Running above example, we can see that the size of the vocabulary is 21 words. We add one, because we will need to specify the integer for the largest encoded word as an array index, e.g. words encoded 1 to 21 with array indicies 0 to 21 or 22 positions. Next, we need to create sequences of words to fit the model with one word as input and one word as output.

In [8]:
# create word -> word sequences
sequences = []
for i in range(1,len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
print(sequences)

Total Sequences: 24
[[2, 1], [1, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 11], [11, 12], [12, 13], [13, 2], [2, 14], [14, 15], [15, 1], [1, 16], [16, 17], [17, 18], [18, 1], [1, 3], [3, 19], [19, 20], [20, 21]]


In [9]:
from numpy import array
# split into X and y elements
sequences = array(sequences)
sequences

array([[ 2,  1],
       [ 1,  3],
       [ 3,  4],
       [ 4,  5],
       [ 5,  6],
       [ 6,  7],
       [ 7,  8],
       [ 8,  9],
       [ 9, 10],
       [10, 11],
       [11, 12],
       [12, 13],
       [13,  2],
       [ 2, 14],
       [14, 15],
       [15,  1],
       [ 1, 16],
       [16, 17],
       [17, 18],
       [18,  1],
       [ 1,  3],
       [ 3, 19],
       [19, 20],
       [20, 21]])

In [10]:
X = sequences[:,0]
X

array([ 2,  1,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,  2, 14, 15,  1,
       16, 17, 18,  1,  3, 19, 20])

In [11]:
y = sequences[:,1]
y

array([ 1,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,  2, 14, 15,  1, 16,
       17, 18,  1,  3, 19, 20, 21])

- We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding with a 0 for every word in the vocabulary and a 1 for the actual word that the value.This gives the network a ground truth to aim for from which we can calculate error and update the model. Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.

In [12]:
from keras.utils import to_categorical
# one hot encode outputs
y = to_categorical(y,num_classes = vocab_size)
y

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,

In [13]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# define the model
def define_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size,10,input_length = 1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size,activation = 'softmax'))
    # compile network
    model.compile(loss = 'categorical_crossentropy',optimizer = 'adam', metrics = ['accuracy'])
    model.summary()
    return model
# define model
model = define_model(vocab_size)
# fit network
model.fit(X,y,epochs = 500, verbose = 0)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 10)             220       
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________






<keras.callbacks.callbacks.History at 0x1dbeaf00bc8>

In [14]:
# evaluate 
in_text = 'Jack'
print(in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded,verbose = 0)
for word,index in tokenizer.word_index.items():
    if index == yhat:
        print(word)

Jack
and


In [19]:
yhat

array([1], dtype=int64)

In [18]:
tokenizer.word_index.items()

dict_items([('and', 1), ('jack', 2), ('jill', 3), ('went', 4), ('up', 5), ('the', 6), ('hill', 7), ('to', 8), ('fetch', 9), ('a', 10), ('pail', 11), ('of', 12), ('water', 13), ('fell', 14), ('down', 15), ('broke', 16), ('his', 17), ('crown', 18), ('came', 19), ('tumbling', 20), ('after', 21)])

In [22]:
# generate a sequence from the model
def generate_seq(model,tokenizer,seed_text,n_words):
    in_text,result = seed_text,seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded,verbose = 0)
        # map predicted word index to word
        out_word = ''
        for word,index in tokenizer.word_index.items():
            if index == yhat :
                out_word = word
                break
        # append to input
        in_text,result = out_word, result + ' ' + out_word
        print(in_text)
    return result

In [23]:
print(generate_seq(model,tokenizer,'Jack',6))

and
jill
came
tumbling
after
hill
Jack and jill came tumbling after hill


In [17]:
print(generate_seq(model,tokenizer,'came',2))

came tumbling after


# Model 2: Line-by-Line Sequence

In [None]:
# X,                                     y
# _, _, _, _, _, Jack,                   and
# _, _, _, _, Jack, and,                 Jill
# _, _, _, Jack, and, Jill,              went
# _, _, Jack, and, Jill, went,           up
# _, Jack, and, Jill, went, up,          the
# Jack, and, Jill, went, up, the,        hill

- This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity. In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text. Note that in this representation, we will require a padding of sequences to ensure they meet a fixed length input. This is a requirement when using Keras. First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.

In [28]:
data

' Jack and Jill went up the hill\n\nTo fetch a pail of water\n\nJack fell down and broke his crown\n\nAnd Jill came tumbling after\n '

In [29]:
[data]

[' Jack and Jill went up the hill\n\nTo fetch a pail of water\n\nJack fell down and broke his crown\n\nAnd Jill came tumbling after\n ']

In [30]:
[data][0]

' Jack and Jill went up the hill\n\nTo fetch a pail of water\n\nJack fell down and broke his crown\n\nAnd Jill came tumbling after\n '

In [54]:
# create line-based sequence
sequences = list()
for line in data.split('\n'):
    print(line)
    encoded = tokenizer.texts_to_sequences([line])[0]
    print(encoded)
    for i in range(1,len(encoded)):
        sequence = encoded[:i+1]
        print(sequence)
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

 Jack and Jill went up the hill
[2, 1, 3, 4, 5, 6, 7]
[2, 1]
[2, 1, 3]
[2, 1, 3, 4]
[2, 1, 3, 4, 5]
[2, 1, 3, 4, 5, 6]
[2, 1, 3, 4, 5, 6, 7]

[]
To fetch a pail of water
[8, 9, 10, 11, 12, 13]
[8, 9]
[8, 9, 10]
[8, 9, 10, 11]
[8, 9, 10, 11, 12]
[8, 9, 10, 11, 12, 13]

[]
Jack fell down and broke his crown
[2, 14, 15, 1, 16, 17, 18]
[2, 14]
[2, 14, 15]
[2, 14, 15, 1]
[2, 14, 15, 1, 16]
[2, 14, 15, 1, 16, 17]
[2, 14, 15, 1, 16, 17, 18]

[]
And Jill came tumbling after
[1, 3, 19, 20, 21]
[1, 3]
[1, 3, 19]
[1, 3, 19, 20]
[1, 3, 19, 20, 21]
 
[]
Total Sequences: 21


In [55]:
sequences

[[2, 1],
 [2, 1, 3],
 [2, 1, 3, 4],
 [2, 1, 3, 4, 5],
 [2, 1, 3, 4, 5, 6],
 [2, 1, 3, 4, 5, 6, 7],
 [8, 9],
 [8, 9, 10],
 [8, 9, 10, 11],
 [8, 9, 10, 11, 12],
 [8, 9, 10, 11, 12, 13],
 [2, 14],
 [2, 14, 15],
 [2, 14, 15, 1],
 [2, 14, 15, 1, 16],
 [2, 14, 15, 1, 16, 17],
 [2, 14, 15, 1, 16, 17, 18],
 [1, 3],
 [1, 3, 19],
 [1, 3, 19, 20],
 [1, 3, 19, 20, 21]]

In [41]:
# pad input sequences
from keras.preprocessing.sequence import pad_sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences,maxlen = max_length,padding = 'pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 7


In [42]:
# split into input and output elements
sequences = array(sequences)
sequences

array([[ 0,  0,  0,  0,  0,  2,  1],
       [ 0,  0,  0,  0,  2,  1,  3],
       [ 0,  0,  0,  2,  1,  3,  4],
       [ 0,  0,  2,  1,  3,  4,  5],
       [ 0,  2,  1,  3,  4,  5,  6],
       [ 2,  1,  3,  4,  5,  6,  7],
       [ 0,  0,  0,  0,  0,  8,  9],
       [ 0,  0,  0,  0,  8,  9, 10],
       [ 0,  0,  0,  8,  9, 10, 11],
       [ 0,  0,  8,  9, 10, 11, 12],
       [ 0,  8,  9, 10, 11, 12, 13],
       [ 0,  0,  0,  0,  0,  2, 14],
       [ 0,  0,  0,  0,  2, 14, 15],
       [ 0,  0,  0,  2, 14, 15,  1],
       [ 0,  0,  2, 14, 15,  1, 16],
       [ 0,  2, 14, 15,  1, 16, 17],
       [ 2, 14, 15,  1, 16, 17, 18],
       [ 0,  0,  0,  0,  0,  1,  3],
       [ 0,  0,  0,  0,  1,  3, 19],
       [ 0,  0,  0,  1,  3, 19, 20],
       [ 0,  0,  1,  3, 19, 20, 21]])

In [43]:
X,y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y,num_classes = vocab_size)

In [58]:
y

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

In [44]:
# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    #plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [48]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
    # append to input
    in_text += ' ' + out_word
    return in_text

In [50]:
# define model
model = define_model(vocab_size, max_length)
# fit network
model.fit(X, y, epochs=500, verbose=0)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 6, 10)             220       
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_3 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________


<keras.callbacks.callbacks.History at 0x1dbf01f5a48>

In [51]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))

Jack fell
Jill jill
