- A language model can predict the probability of the next word in a sequence, based on the words already observed in that sequence. Neural network models are a preferred method for developing statistical language models because they can use a distributed representation, where different words with similar meanings have similar representations, and because they can use a large context of recently observed words when making predictions. In this tutorial, you will discover how to develop a statistical language model using deep learning in Python.

## 1. Data Preparation 

### 1.1 Load Text

In [1]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename,'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


### 1.2 Clean Text

In [10]:
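from keras.preprocessing.text import text_to_word_sequence
# quick check: the helper lowercases the text and strips punctuation
a = text_to_word_sequence('hi Hello! how are you')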
a

['hi', 'hello', 'how', 'are', 'you']

In [15]:
# turn a doc into clean tokens
from keras.preprocessing.text import text_to_word_sequence
def clean_doc(doc):
    doc = text_to_word_sequence(doc) 
    tokens = [word for word in doc if word.isalpha()]
    return tokens
tokens = clean_doc(doc)
print(tokens[:50])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i']
Total Tokens: 118650
Unique Tokens: 7275


### 1.3 Save Clean Text
We can organize the long list of tokens into sequences of 50 input words and 1 output word.
That is, sequences of 51 words. We can do this by iterating over the list of tokens from token 51
onwards and taking the prior 50 tokens as a sequence, then repeating this process to the end of
the list of tokens. We will transform the tokens into space-separated strings for later storage
in a file. The code to split the list of clean tokens into sequences with a length of 51 tokens is
listed below.

In [20]:
# organize into sequence of tokens
length = 50 + 1
sequences = []
for i in range(length,len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))
print(sequences[:1])

Total Sequences: 118599
['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was']
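To make the windowing concrete, here is a minimal sketch on a toy token list, using 2 input words plus 1 output word instead of 50 + 1. The helper name `make_sequences` and the toy tokens are illustrative assumptions, not part of the tutorial code. Because the loop runs over `range(length, len(tokens))`, it produces `len(tokens) - length` sequences, which matches the 118650 - 51 = 118599 reported above; the window ending on the very last token is not emitted.

```python
def make_sequences(tokens, length):
    """Return space-joined windows of `length` consecutive tokens."""
    sequences = []
    for i in range(length, len(tokens)):
        # the window covers tokens[i-length] .. tokens[i-1]
        sequences.append(' '.join(tokens[i - length:i]))
    return sequences

# toy data (hypothetical): 2 input words + 1 output word per line
toy_tokens = ['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus']
toy_sequences = make_sequences(toy_tokens, length=2 + 1)
for line in toy_sequences:
    print(line)
```

Each printed line shares all but one word with its neighbour, which is why the saved file is roughly 51 times larger than the cleaned text.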


- Next, we can save the sequences to a new file for later loading. We can define a new function for saving lines of text to a file. This new function is called save_doc() and is listed below. It takes as input a list of lines and a filename. The lines are written, one per line, in ASCII format.

In [27]:
# save sequences to file, one sequence per line
def save_doc(lines,filename):
    data = '\n'.join(lines)
    file = open(filename,'w')
    file.write(data)
    file.close()

# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences,out_filename)

# sample output of republic_sequences.txt
#
# book i i ... catch sight of
# i i went ... sight of us
# i went down ... of us from
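As a quick sanity check of the save-and-reload round trip, the sketch below writes a few toy lines to a temporary file and reads them back by splitting on newlines. The names `save_lines` and `load_lines`, the toy data, and the explicit `utf-8` encoding are assumptions made for this illustration; the tutorial's own `save_doc()` uses the platform default encoding.

```python
import os
import tempfile

def save_lines(lines, filename):
    # write one sequence per line; the context manager closes the file
    with open(filename, 'w', encoding='utf-8') as file:
        file.write('\n'.join(lines))

def load_lines(filename):
    # read the whole file and split it back into one sequence per entry
    with open(filename, 'r', encoding='utf-8') as file:
        return file.read().split('\n')

# toy data (hypothetical), written to a throwaway directory
toy_lines = ['book i i', 'i i went', 'i went down']
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_sequences.txt')
    save_lines(toy_lines, path)
    restored = load_lines(path)

print(restored == toy_lines)
```

Joining with `'\n'` rather than appending a newline to every line means the file has no trailing empty line, so splitting on `'\n'` recovers exactly the original list.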

- Now we have training data stored in the file republic_sequences.txt in the current working directory. Next, let's look at how to fit a language model to this data.

## 2. Train Language Model
We can now train a statistical language model from the prepared data. The model we will train is a neural language model. It has a few unique characteristics:

- It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
- It learns the representation at the same time as learning the model.
- It learns to predict the probability for the next word using the context of the last 50 words.

Specifically, we will use an Embedding Layer to learn the representation of words, and a
Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on
their context. Let's start by loading our training data.

### 2.1 Load Sequences

In [28]:
# load doc into memory
def load_doc(filename):
    file = open(filename,'r')
    text = file.read()
    file.close()
    return text
# load 
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
print(lines[0])

book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was


### 2.2 Encode Sequences
- The word embedding layer expects input sequences to be comprised of integers. We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping. To do this encoding, we will use the Tokenizer class in the Keras API.
- First, the Tokenizer must be fitted on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer. We can then use the fitted Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.

In [30]:
# integer encode sequences of words
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
print(sequences[0])

[1045, 11, 11, 1044, 329, 7275, 4, 1, 2875, 35, 215, 1, 260, 3, 2252, 9, 11, 180, 819, 123, 92, 2874, 4, 1, 2250, 7274, 1, 7273, 7272, 2, 75, 120, 11, 1271, 4, 110, 6, 30, 169, 16, 49, 7271, 1, 1611, 13, 57, 8, 535, 151, 11, 57]


In [32]:
# vocabulary size
vocabulary = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size %d' % vocab_size)
print('Vocabulary', vocabulary)

Vocabulary Size 7276
Vocabulary {'the': 1, 'and': 2, 'of': 3, 'to': 4, 'is': 5, 'in': 6, 'he': 7, 'a': 8, 'that': 9, 'be': 10, 'i': 11, 'not': 12, 'which': 13, 'are': 14, 'you': 15, 'they': 16, 'or': 17, 'will': 18, 'said': 19, 'as': 20, 'we': 21, 'but': 22, 'have': 23, 'them': 24, 'his': 25, 'for': 26, 'by': 27, 'who': 28, 'their': 29, 'what': 30, 'then': 31, 'this': 32, 'one': 33, 'if': 34, 'with': 35, 'there': 36, 'all': 37, 'true': 38, 'at': 39, 'when': 40, 'do': 41, 'other': 42, 'has': 43, 'yes': 44, 'any': 45, 'him': 46, 'good': 47, 'no': 48, 'would': 49, 'may': 50, 'state': 51, 'from': 52, 'man': 53, 'say': 54, 'our': 55, 'only': 56, 'was': 57, 'an': 58, 'so': 59, 'must': 60, 'should': 61, 'more': 62, 'us': 63, 'on': 64, 'can': 65, 'were': 66, 'very': 67, 'now': 68, 'like': 69, 'such': 70, 'replied': 71, 'just': 72, 'certainly': 73, 'than': 74, 'also': 75, 'these': 76, 'same': 77, 'men': 78, 'another': 79, 'about': 80, 'being': 81, 'justice': 82, 'own': 83, 'how': 84, 'soul': 85
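The mapping above gives the most frequent word the index 1 and never uses index 0, which is why `vocab_size` is `len(tokenizer.word_index) + 1` (7275 unique words, size 7276). The sketch below imitates that idea in plain Python on toy data. It is only an approximation of the Keras `Tokenizer`: the function names `build_word_index` and `encode` are hypothetical, and the tie-breaking by first appearance is a choice of this sketch rather than a documented guarantee.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to a rank, 1 = most frequent (0 is left unused)."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() orders by descending count; equal counts keep
    # first-seen order in this sketch
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(line, word_index):
    """Replace every word in a line with its integer index."""
    return [word_index[word] for word in line.split()]

# toy data (hypothetical)
toy_lines = ['the state and the soul', 'the soul and justice']
toy_index = build_word_index(toy_lines)
# largest index equals len(toy_index), so index 0..len needs len + 1 slots
toy_vocab_size = len(toy_index) + 1

print(toy_index)
print(encode(toy_lines[1], toy_index))
print(toy_vocab_size)
```

Any array indexed by word id, such as the embedding matrix or a one-hot output vector, needs one more slot than the number of words so that the largest index is still valid.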

### 2.3 Sequence Inputs and Outputs

In [34]:
from numpy import array
# separate into input and output
sequences = array(sequences)
print(sequences)

[[1045   11   11 ...  151   11   57]
 [  11   11 1044 ...   11   57 1148]
 [  11 1044  329 ...   57 1148   35]
 ...
 [ 384  466    4 ...  416   13   21]
 [ 466    4   33 ...   13   21   23]
 [   4   33   79 ...   21   23   86]]


In [36]:
from keras.utils import to_categorical
X = sequences[:,:-1]
y = sequences[:,-1]
y = to_categorical(y,num_classes = vocab_size)
seq_length = X.shape[1]
print('Sequence Length:', seq_length)
print(y)

Sequence Length: 50
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
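The printed `y` looks like it contains only zeros because each row holds a single 1 among 7276 columns and the display elides the middle. A minimal plain-Python sketch of the same split and one-hot step, on toy data, may make this easier to see. The helper name `split_and_one_hot` and the toy values are assumptions for illustration; the tutorial itself uses NumPy slicing and `to_categorical`.

```python
def split_and_one_hot(sequences, vocab_size):
    """Split each sequence into inputs (all but last) and a one-hot target."""
    X = [seq[:-1] for seq in sequences]
    y = []
    for seq in sequences:
        row = [0.0] * vocab_size
        row[seq[-1]] = 1.0  # mark the index of the output word
        y.append(row)
    return X, y

# toy data (hypothetical): 2 input words + 1 output word, vocabulary size 6
toy_sequences = [[1, 3, 2], [3, 2, 5]]
toy_X, toy_y = split_and_one_hot(toy_sequences, vocab_size=6)

print(toy_X)
print(toy_y)
```

Each row of `toy_y` sums to 1, and the position of the 1 is the integer id of the word the model should predict.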


### 2.4 Fit Model 

In [38]:
from pickle import dump
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# define the model
def define_model(vocab_size,seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size,50,input_length = seq_length))
    model.add(LSTM(100, return_sequences = True))
    model.add(LSTM(100))
    model.add(Dense(vocab_size, activation = 'softmax'))
    # compile network
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    # summarize defined model
    model.summary()
    return model

# define model
model = define_model(vocab_size,seq_length)
# fit model
model.fit(X,y, batch_size = 128, epochs = 100)
# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl','wb'))

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            363800    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 7276)              734876    
Total params: 1,239,476
Trainable params: 1,239,476
Non-trainable params: 0
_________________________________________________________________
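The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what we intended. The sketch below uses the standard formulas for an embedding, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The function names are hypothetical helpers written for this illustration.

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return (input_dim + 1) * units

vocab_size, embed_dim, lstm_units = 7276, 50, 100
counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, lstm_units),
    lstm_params(lstm_units, lstm_units),
    dense_params(lstm_units, vocab_size),
]
total = sum(counts)

print(counts)
print(total)
```

Most of the parameters sit in the embedding and the output layer, both of which scale with the vocabulary size.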






Epoch 1/100
...
Epoch 100/100


## 3. Use Language Model

### 3.1 Load Data

In [39]:
# load doc into memory
def load_doc(filename):
    file = open(filename,'r')
    text = file.read()
    file.close()
    return text

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

- We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text. The model will require 50 words as input. Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.

In [57]:
seq_length = len(lines[0].split()) - 1
seq_length

50

### 3.2 Load Model

In [58]:
from random import randint 
from pickle import load 
from keras.models import load_model 
from keras.preprocessing.sequence import pad_sequences

# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl','rb'))

### 3.3 Generate Text

In [59]:
# select a seed text
# randint is inclusive at both ends, so the upper bound is len(lines) - 1
seed_text = lines[randint(0,len(lines) - 1)]
print(seed_text + '\n')

was a dream and the education and training which they received from us an appearance only in reality during all that time they were being formed and fed in the womb of the earth where they themselves and their arms and appurtenances were manufactured when they were completed the earth their
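Generation relies on two small steps: keeping only the last `seq_length` integers of the encoded text, and mapping a predicted index back to a word. The sketch below shows both in plain Python. `fit_to_length` mimics, for a single list, what `pad_sequences` does with its default front padding and front truncation; `invert_index` builds a reverse dictionary so the lookup does not need a loop over `word_index`. Both helper names are hypothetical, and the tiny `word_index` reuses three ids that appear in this notebook's output.

```python
def fit_to_length(encoded, seq_length):
    """Keep the last `seq_length` items, zero-padding at the front if short."""
    encoded = encoded[-seq_length:]
    return [0] * (seq_length - len(encoded)) + encoded

def invert_index(word_index):
    """Build an index -> word dictionary from a word -> index dictionary."""
    return {index: word for word, index in word_index.items()}

# toy data (hypothetical), with a window of 3 instead of 50
too_long = fit_to_length([57, 8, 870, 2, 1], seq_length=3)
too_short = fit_to_length([8], seq_length=3)

index_to_word = invert_index({'the': 1, 'earth': 411, 'mother': 679})
predicted_word = index_to_word[679]

print(too_long, too_short, predicted_word)
```

Dropping items from the front is what lets the seed grow by one word per step while the model always sees a fixed-length input.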



In [64]:
# generate a sequence from a language model
def generate_seq(model, tokenizer,seq_length,seed_text,n_words):
    result = []
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        print('encoded text:','\n',encoded)
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded],maxlen = seq_length,truncating = 'pre')
        print('encoded pad sequences:', '\n',encoded)
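        # predict the index of the most likely next word
        # (predict_classes was removed in TensorFlow 2.6; on newer versions,
        # take the argmax of model.predict(encoded) instead)
        yhat = model.predict_classes(encoded,verbose = 0)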
        print('yhat:',yhat)
        # map predicted word index to word
        out_word = ''
        for word,index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text +=  ' ' + out_word
        print('input text:', '\n',in_text)
        result.append(out_word)
    return ' '.join(result)

# generate new text
generated = generate_seq(model,tokenizer,seq_length, seed_text,2)
print(generated)

encoded text: 
 [57, 8, 870, 2, 1, 244, 2, 603, 13, 16, 689, 52, 63, 58, 707, 56, 6, 513, 1874, 37, 9, 146, 16, 66, 81, 2012, 2, 1805, 6, 1, 5122, 3, 1, 411, 265, 16, 175, 2, 29, 891, 2, 5123, 66, 5124, 40, 16, 66, 1217, 1, 411, 29]
encoded pad sequences: 
 [[   8  870    2    1  244    2  603   13   16  689   52   63   58  707
    56    6  513 1874   37    9  146   16   66   81 2012    2 1805    6
     1 5122    3    1  411  265   16  175    2   29  891    2 5123   66
  5124   40   16   66 1217    1  411   29]]
yhat: [679]
input text: 
 was a dream and the education and training which they received from us an appearance only in reality during all that time they were being formed and fed in the womb of the earth where they themselves and their arms and appurtenances were manufactured when they were completed the earth their mother
encoded text: 
 [57, 8, 870, 2, 1, 244, 2, 603, 13, 16, 689, 52, 63, 58, 707, 56, 6, 513, 1874, 37, 9, 146, 16, 66, 81, 2012, 2, 1805, 6, 1, 5122, 3, 1, 411,