# Next Word Prediction
----------

- Most smartphone keyboards offer next word prediction, and Google uses a similar feature that draws on our browsing history.
- To predict the next word accurately, keyboard apps also keep preloaded language data on the phone. In this post, I'll use Python to train a deep learning model for next word prediction, using the TensorFlow and Keras libraries.

`Task`: Using the TensorFlow and Keras libraries, train an RNN to predict the next word.

`Dataset Link:` https://drive.google.com/file/d/1GeUzNVqiixXHnTl8oNiQ2W3CynX_lsu2/view

By Yashraj Mishra 

# Importing Required Libraries 

In [1]:
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
import pickle 
import heapq
# NLTK tokenizer for splitting the text into words
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop

In [2]:
path = '1661-0.txt'
text = open(path, encoding="utf8").read().lower()
print('corpus length:', len(text))

corpus length: 581888


In [3]:
# Split the text into words, in order, dropping punctuation and other non-word characters

tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
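As a quick illustration of what the `\w+` pattern keeps and drops, here is a small sketch using Python's built-in `re` module instead of NLTK. For this particular pattern, `re.findall` should produce the same tokens as `RegexpTokenizer(r'\w+')`. The sample string is chosen only for illustration.

```python
import re

# Short sample string chosen for illustration
sample_text = "Project Gutenberg's The Adventures of".lower()

# \w+ matches runs of letters, digits and underscores, so apostrophes,
# commas and spaces act as separators and are dropped
tokens = re.findall(r"\w+", sample_text)
print(tokens)
```

The apostrophe splits "gutenberg's" into `gutenberg` and `s`, which is why a lone `s` appears in the first window printed further down.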

# Feature Engineering 
---
- The model needs a dictionary that maps each unique word in the data (the key) to its position in the list of unique words (the value).
- I define `WORD_LENGTH`, the number of previous words used to predict the next word.
- `prev_words` holds each run of five consecutive words, and `next_words` holds the word that follows each run.

- Unique List 

In [4]:
unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
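`np.unique` returns the distinct words in sorted order, so each word's index is its rank in that sorted list. The sketch below mirrors this in plain Python on a made-up token list, and shows the reverse lookup (index back to word) that the prediction step relies on. `toy_words`, `toy_unique` and `toy_index` are hypothetical names.

```python
# Hypothetical toy token list
toy_words = ["the", "cat", "sat", "on", "the", "mat"]

# For a list of strings, sorted(set(...)) should give the same order as np.unique
toy_unique = sorted(set(toy_words))
toy_index = {word: i for i, word in enumerate(toy_unique)}

print(toy_unique)
print(toy_index["the"])

# Going back from an index to a word is a plain list lookup
print(toy_unique[toy_index["sat"]])
```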

In [5]:
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])

['project', 'gutenberg', 's', 'the', 'adventures']
of
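The same windowing can be checked on a toy sentence. This is a minimal sketch with a hypothetical helper `make_windows` and a window of 3 to keep the output short.

```python
def make_windows(tokens, window):
    """Return (contexts, targets): each context is `window` consecutive
    tokens and its target is the token that follows it."""
    contexts, targets = [], []
    for i in range(len(tokens) - window):
        contexts.append(tokens[i:i + window])
        targets.append(tokens[i + window])
    return contexts, targets

# Hypothetical toy sentence
toy = "the quick brown fox jumps over".split()
contexts, targets = make_windows(toy, window=3)
for ctx, tgt in zip(contexts, targets):
    print(ctx, "->", tgt)
```

A sequence of n tokens yields n - window pairs, and neighbouring windows overlap in all but one word.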


# Building Model using RNN (Recurrent Neural Network)

In [6]:
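# One-hot inputs: X[i, j, k] is True when the j-th word of window i is vocabulary word k
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
# One-hot targets: Y[i, k] is True when the word following window i is vocabulary word k
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1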

In [7]:
print(X[0][0])

[False False False ... False False False]
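One-hot encoding is simple but memory-hungry, because every word in every window becomes a vector as long as the vocabulary. Here is a rough size estimate, assuming one byte per cell (which is what NumPy uses for `dtype=bool`). The sample and vocabulary counts are purely hypothetical, and `one_hot_bytes` is a helper invented for this sketch.

```python
def one_hot_bytes(n_samples, window, vocab_size, bytes_per_cell=1):
    """Approximate size in bytes of an (n_samples, window, vocab_size) array."""
    return n_samples * window * vocab_size * bytes_per_cell

# Hypothetical figures for illustration only, not measured from this dataset
n_samples, window, vocab_size = 100_000, 5, 8_000

size_gb = one_hot_bytes(n_samples, window, vocab_size) / 1e9
print(f"X would need about {size_gb:.1f} GB")
```

The size grows linearly in each of the three factors, so a longer window or a larger corpus raises memory use proportionally.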


In [8]:
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))



# Model Training

In [9]:
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=20, shuffle=True).history

Epoch 1/20
811/811 ━━━━━━━━━━━━━━━━━━━━ 87s 104ms/step - accuracy: 0.0553 - loss: 6.6239 - val_accuracy: 0.0718 - val_loss: 6.8953
Epoch 2/20
811/811 ━━━━━━━━━━━━━━━━━━━━ 84s 103ms/step - accuracy: 0.1105 - loss: 5.7989 - val_accuracy: 0.1012 - val_loss: 6.7243
Epoch 3/20
811/811 ━━━━━━━━━━━━━━━━━━━━ 83s 103ms/step - accuracy: 0.1360 - loss: 5.4740 - val_accuracy: 0.1000 - val_loss: 6.6494
Epoch 4/20
811/811 ━━━━━━━━━━━━━━━━━━━━ 84s 103ms/step - accuracy: 0.1537 - loss: 5.1902 - val_accuracy: 0.0990 - val_loss: 6.5822
Epoch 5/20
811/811 ━━━━━━━━━━━━━━━━━━━━ 85s 105ms/step - accuracy: 0.1747 - loss: 4.9524 - val_accuracy: 0.0987 - val_loss: 6.6934
Epoch 6/20
811/811 ━━━━━━━━━━━━━━━━━━━━ 83s 102ms/step - accuracy: 0.2004 - loss: 4.6798 - val_accuracy: 0.0978 - val_loss: 6.6902
Epoch 7/20
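In the log above, the training loss keeps falling while `val_loss` stops improving after epoch 4, which usually points to overfitting. Here is a small sketch for locating the best epoch, assuming a `history` dict shaped like the one returned by `model.fit(...).history`. The values are copied from the six completed epochs in the log, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history, key="val_loss"):
    """Return (1-based epoch number, value) where `key` is lowest."""
    values = history[key]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# loss and val_loss copied from the training log (epochs 1 to 6)
logged = {
    "loss": [6.6239, 5.7989, 5.4740, 5.1902, 4.9524, 4.6798],
    "val_loss": [6.8953, 6.7243, 6.6494, 6.5822, 6.6934, 6.6902],
}

epoch, value = best_epoch(logged)
print(f"lowest val_loss {value} at epoch {epoch}")
```

Stopping near that epoch, for example with Keras's `EarlyStopping` callback monitoring `val_loss`, is one common way to act on this.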

# Saving Model

In [10]:
model.save('keras_next_word_model.h5')
with open("history.p", "wb") as f:
    pickle.dump(history, f)
# Reload both to confirm they can be restored from disk
model = load_model('keras_next_word_model.h5')
with open("history.p", "rb") as f:
    history = pickle.load(f)
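The saved model file does not include the vocabulary, yet the one-hot positions depend on it, so running predictions in a fresh session would also need the word list restored. Here is a minimal sketch of persisting it with `pickle`, using a hypothetical file name and a toy list in place of `unique_words`.

```python
import os
import pickle
import tempfile

# Toy stand-in for unique_words; the real list comes from np.unique(words)
vocab = ["adventures", "gutenberg", "project", "s", "the"]

# Hypothetical file name, written to a temp directory for this example
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.p")

with open(vocab_path, "wb") as f:
    pickle.dump(vocab, f)

with open(vocab_path, "rb") as f:
    restored = pickle.load(f)

# Rebuild the word -> index mapping from the restored list
restored_index = {word: i for i, word in enumerate(restored)}
print(restored == vocab, restored_index["project"])
```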



# Predictions

In [12]:
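def prepare_input(text):
    # One-hot encode up to WORD_LENGTH space-separated words; unused rows stay all-zero
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()):
        print(word)  # echo each word as it is encoded
        x[0, t, unique_word_index[word]] = 1
    return x
prepare_input("How are you ".lower())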

how
are
you


array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
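As written, `prepare_input` raises a `KeyError` for a word that is not in the vocabulary and an `IndexError` if more than `WORD_LENGTH` words are passed. One possible variant is sketched below, assuming it is acceptable to leave unknown words as all-zero rows and to keep only the last few words. `prepare_input_safe` and the toy vocabulary are hypothetical.

```python
import numpy as np

def prepare_input_safe(text, word_index, window):
    """One-hot encode the last `window` words of `text`.
    Words missing from `word_index` leave their row all-zero."""
    words = text.lower().split()[-window:]
    x = np.zeros((1, window, len(word_index)))
    for t, word in enumerate(words):
        idx = word_index.get(word)
        if idx is not None:
            x[0, t, idx] = 1
    return x

# Toy vocabulary standing in for unique_word_index
toy_index = {"are": 0, "how": 1, "you": 2}

x = prepare_input_safe("How are you today", toy_index, window=3)
print(x[0])
```

Here only `are you today` is encoded, and the row for the unknown word `today` stays all-zero. Predictions for inputs with many unknown words may be unreliable.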

In [13]:
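def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    # log followed by exp and renormalization; with no temperature applied,
    # this leaves the probabilities essentially unchanged
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # Indices of the top_n most probable words, most probable first
    return heapq.nlargest(top_n, range(len(preds)), preds.take)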
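The log/exp round trip in `sample` is where a temperature is usually applied, although this version does not use one. Here is a sketch of what a temperature parameter could look like, written in plain Python. `top_n_with_temperature` and the toy probabilities are hypothetical, and the sketch assumes every probability is strictly positive.

```python
import heapq
import math

def top_n_with_temperature(probs, top_n=3, temperature=1.0):
    """Rescale probabilities by a temperature; return (top indices, rescaled probs)."""
    # Dividing log-probabilities by the temperature sharpens (<1)
    # or flattens (>1) the distribution
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    rescaled = [s / total for s in scaled]
    top = heapq.nlargest(top_n, range(len(rescaled)), key=rescaled.__getitem__)
    return top, rescaled

toy_probs = [0.1, 0.6, 0.05, 0.25]
indices, flat = top_n_with_temperature(toy_probs, top_n=2, temperature=2.0)
print(indices)
print([round(p, 3) for p in flat])
```

Because the rescaling is monotonic, the top-n indices do not change with temperature. The temperature only matters if the next word is drawn at random from the rescaled distribution.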

In [14]:
def predict_completions(text, n=3):
    # Nothing to predict from an empty prompt
    if text == "":
        return []
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [unique_words[idx] for idx in next_indices]

# Testing the Results

In [15]:
q =  "Do your work by your own instead of depending on someone"
print("correct sentence: ",q)
seq = " ".join(tokenizer.tokenize(q.lower())[0:5])
print("Sequence: ",seq)
print("next possible words: ", predict_completions(seq, 5))

correct sentence:  Do your work by your own instead of depending on someone
Sequence:  do your work by your
do
your
work
by
your
next possible words:  ['do', 'your', 'get', 'you', 'come']


