# Language Modelling and Text Generation using LSTMs

Link: https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275

Import the required libraries.

In [43]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow.keras.utils as ku

Let's use a popular nursery rhyme, "Cat and Her Kittens", as our corpus. A corpus is a collection of text documents.

In [2]:
data = """The cat and her kittens
They put on their mittens,
To eat a Christmas pie.
The poor little kittens
They lost their mittens,
And then they began to cry.
O mother dear, we sadly fear
We cannot go to-day,
For we have lost our mittens."
"If it be so, ye shall not go,
For ye are naughty kittens."""

The code has three main parts: dataset preparation, model training, and text generation. The boilerplate for this structure is as follows:

In [3]:
def dataset_preparation():
    pass 
def create_model():
    pass
def generate_text():
    pass

In the dataset preparation step, we first perform tokenization. Tokenization is the process of extracting tokens (terms / words) from a corpus. Keras has a built-in Tokenizer class that can be used to obtain the tokens and their index in the corpus.

Next, we need to convert the corpus into a flat dataset of sentence sequences.
 
fit_on_texts: Updates the internal vocabulary based on a list of texts. This method builds the vocabulary index by word frequency. Given something like "The cat sat on the mat.", it creates a word -> index dictionary such that word_index["the"] = 1 and word_index["cat"] = 2, so every word gets a unique integer value. Index 0 is reserved for padding and is never assigned to a word. A lower integer means a more frequent word, so the first few entries are often stop words because they appear a lot.

texts_to_sequences: Transforms each text in texts into a sequence of integers. It takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Words that are not in word_index are skipped unless the Tokenizer was created with an oov_token. Nothing more, nothing less, certainly no magic involved.
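
To make the two methods concrete, here is a minimal pure-Python sketch of the idea behind them. It is an illustration only, not the Keras implementation: the helper names `tokenize`, `build_word_index` and `to_sequence` are hypothetical, and the text cleaning is simplified to lowercasing and replacing all punctuation with spaces.

```python
from collections import Counter
import string

def tokenize(text):
    # Simplified cleaning (assumption): lowercase, turn punctuation into spaces, split on whitespace
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(texts):
    # Count every word across all texts
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Most frequent word gets index 1; index 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    # Replace each known word with its integer; unknown words are dropped
    return [word_index[w] for w in tokenize(text) if w in word_index]

texts = ["The cat sat on the mat.", "The cat and her kittens"]
word_index = build_word_index(texts)
print(word_index["the"], word_index["cat"])        # 1 2
print(to_sequence("the kittens sat", word_index))  # [1, 8, 3]
```

Here "the" appears three times and "cat" twice, so they receive the two lowest indices, mirroring the frequency ordering described above.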

In [41]:
def dataset_preparation(data):
    tokenizer = Tokenizer()
    # Treat each line of the rhyme as one document
    corpus = data.lower().split("\n")
    tokenizer.fit_on_texts(corpus)
    # +1 because index 0 is reserved for padding
    total_words = len(tokenizer.word_index) + 1

    input_sequences = []

    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        # Build every prefix of length >= 2 so each word after the first becomes a label once
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    # Pad the sequences on the left so they all have the same length (Keras pad_sequences)
    max_sequence_len = max(len(x) for x in input_sequences)
    input_sequences = pad_sequences(input_sequences, maxlen = max_sequence_len, padding = 'pre')

    # The last token of each row is the label; everything before it is the predictor
    predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
    # One-hot encode the labels for categorical cross-entropy
    label = ku.to_categorical(label, num_classes = total_words)

    return predictors, label, max_sequence_len, total_words, tokenizer
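
To see what the prefix loop and the padding step produce, here is a small sketch that uses plain Python lists in place of `pad_sequences`. The token ids are made up for illustration, and `prefix_sequences` and `pre_pad` are hypothetical helpers, not Keras functions.

```python
def prefix_sequences(token_list):
    # Every prefix of length >= 2: the last token is the label, the rest is the context
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def pre_pad(sequences, maxlen):
    # Left-pad with 0 (the reserved padding index) so all rows have length maxlen
    return [[0] * (maxlen - len(seq)) + seq for seq in sequences]

token_list = [1, 7, 4, 8, 2]  # made-up ids for one five-word line
sequences = prefix_sequences(token_list)
max_len = max(len(s) for s in sequences)
padded = pre_pad(sequences, max_len)

# Split each padded row into context (predictors) and next word (label)
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
for x, y in zip(predictors, labels):
    print(x, "->", y)
# [0, 0, 0, 1] -> 7
# [0, 0, 1, 7] -> 4
# [0, 1, 7, 4] -> 8
# [1, 7, 4, 8] -> 2
```

A line of n tokens therefore yields n - 1 training rows, which is why a short rhyme still produces dozens of samples.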

Perfect, now we have the input vectors X and the label vectors Y, which can be used for training. Recurrent neural networks have performed well in sequence-to-sequence learning and other text applications. The model below uses an Embedding layer, a single LSTM layer with 150 units, a Dropout layer, and a Dense softmax layer that outputs a probability for every word in the vocabulary.

In [44]:
def create_model(predictors, label, max_seq_len, total_words):
    # Each training row holds max_seq_len - 1 context tokens; the last token was split off as the label
    input_len = max_seq_len - 1
    model = Sequential()
    # Map each word index to a 10-dimensional dense vector
    model.add(Embedding(total_words, 10, input_length = input_len))
    model.add(LSTM(150))
    model.add(Dropout(0.1))
    # One softmax output per vocabulary entry
    model.add(Dense(total_words, activation = "softmax"))
    model.compile(loss = "categorical_crossentropy", optimizer = 'adam')
    model.fit(predictors, label, epochs = 100, verbose = 2)
    return model
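
To get a feel for the size of this network, the sketch below counts trainable parameters per layer using the standard formulas for Embedding, LSTM and Dense layers. It assumes total_words is 43 for this rhyme (42 distinct words plus the reserved padding index); the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words = 43  # assumed vocabulary size for this corpus, including padding index 0

counts = {
    "embedding": embedding_params(total_words, 10),
    "lstm": lstm_params(10, 150),
    "dense": dense_params(150, total_words),
}
total = sum(counts.values())
print(counts)  # {'embedding': 430, 'lstm': 96600, 'dense': 6493}
print(total)   # 103523
```

Almost all of the parameters sit in the LSTM layer, which is a lot of capacity for 48 training samples, so the model can be expected to memorise the rhyme rather than generalise.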

Great, our model architecture is now ready and we can train it on our data. Next, let's write the function that predicts the next word from the input words (the seed text). We first tokenize the seed text, pad the sequence, and pass it to the trained model to get the predicted word. Repeating this and appending each predicted word to the seed text yields the generated sequence.

In [46]:
def generate_text(seed_text, next_words, max_sequence_len, model, tokenizer):
    predicted = None
    for _ in range(next_words):
        # Encode the text generated so far and pre-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen = max_sequence_len - 1, padding = 'pre')
        # predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output instead
        probabilities = model.predict(token_list, verbose = 0)
        predicted = int(np.argmax(probabilities, axis = -1)[0])
        # Map the predicted index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text, predicted
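
The decoding step inside this loop can be sketched without a trained model. Below, `fake_probs` is a made-up softmax output and the tiny `word_index` is hypothetical; the point is only how the index of the largest probability maps back to a word through an inverted word -> index dictionary.

```python
# Hypothetical three-word vocabulary; index 0 is the padding slot
word_index = {"the": 1, "kittens": 2, "mittens": 3}
# Invert the dictionary once so lookups by index are direct
index_word = {index: word for word, index in word_index.items()}

# Made-up model output: one probability per vocabulary index, including padding
fake_probs = [0.05, 0.10, 0.60, 0.25]

# Greedy choice: the index with the highest probability
predicted = max(range(len(fake_probs)), key=fake_probs.__getitem__)
output_word = index_word.get(predicted, "")
print(predicted, output_word)  # 2 kittens
```

Because the choice is always the single most likely word, generation is deterministic for a given model and seed text, and it can get stuck repeating a word.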

In [45]:
#Fitting the model
X, Y, max_len, total_words, tokenizer =  dataset_preparation(data)
model = create_model(X, Y, max_len, total_words)

Train on 48 samples
Epoch 1/100
48/48 - 1s - loss: 3.7626
Epoch 2/100
48/48 - 0s - loss: 3.7588
Epoch 3/100
48/48 - 0s - loss: 3.7557
Epoch 4/100
48/48 - 0s - loss: 3.7522
Epoch 5/100
48/48 - 0s - loss: 3.7491
Epoch 6/100
48/48 - 0s - loss: 3.7451
Epoch 7/100
48/48 - 0s - loss: 3.7405
Epoch 8/100
48/48 - 0s - loss: 3.7350
Epoch 9/100
48/48 - 0s - loss: 3.7289
Epoch 10/100
48/48 - 0s - loss: 3.7212
Epoch 11/100
48/48 - 0s - loss: 3.7135
Epoch 12/100
48/48 - 0s - loss: 3.7028
Epoch 13/100
48/48 - 0s - loss: 3.6827
Epoch 14/100
48/48 - 0s - loss: 3.6595
Epoch 15/100
48/48 - 0s - loss: 3.6504
Epoch 16/100
48/48 - 0s - loss: 3.6077
Epoch 17/100
48/48 - 0s - loss: 3.5931
Epoch 18/100
48/48 - 0s - loss: 3.5695
Epoch 19/100
48/48 - 0s - loss: 3.5932
Epoch 20/100
48/48 - 0s - loss: 3.6032
Epoch 21/100
48/48 - 0s - loss: 3.5586
Epoch 22/100
48/48 - 0s - loss: 3.5671
Epoch 23/100
48/48 - 0s - loss: 3.5667
Epoch 24/100
48/48 - 0s - loss: 3.5524
Epoch 25/100
48/48 - 0s - loss: 3.5604
Epoch 26/100
4
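
A quick sanity check on the first-epoch loss: a softmax model that spreads its probability evenly over V classes has a categorical cross-entropy of ln V. The sketch below computes that value, assuming total_words is 43 for this rhyme (42 distinct words plus the reserved padding index); `categorical_crossentropy` here is a hand-written helper, not the Keras loss.

```python
import math

def categorical_crossentropy(one_hot, probs):
    # Only the true class contributes: -log(p_true)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

vocab_size = 43  # assumed vocabulary size for this corpus, including padding index 0

# An untrained model guessing uniformly over the vocabulary
uniform = [1.0 / vocab_size] * vocab_size

# One-hot label for an arbitrary true class
one_hot = [0.0] * vocab_size
one_hot[5] = 1.0

uniform_loss = categorical_crossentropy(one_hot, uniform)
print(round(uniform_loss, 4))  # 3.7612
```

The logged loss of 3.7626 after the first epoch is close to this value, which suggests the model starts out roughly at chance and improves slowly from there.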

In [49]:
#Generating Text
text, predicted = generate_text("cat and", 3, max_len, model, tokenizer)
print(text)

cat and and her kittens


In [35]:
text, predicted = generate_text("cat and", 3, max_len, model, tokenizer)
print(text)

cat and her kittens kittens


In [50]:
text, predicted = generate_text("we naughty", 3, max_len, model, tokenizer)
print(text)

we naughty lost to day
