<a href="https://colab.research.google.com/github/hussain0048/Projects-/blob/master/Next_Word_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1-Introduction**

Most of the keyboards in smartphones give next word prediction features; google also uses next word prediction based on our browsing history. So a preloaded data is also stored in the keyboard function of our smartphones to predict the next word correctly. In this article, I will train a Deep Learning model for next word prediction using Python. I will use the Tensorflow and Keras library in Python for next word prediction model[1].
For making a Next Word Prediction model, I will train a Recurrent Neural Network (RNN). So let’s start with this task now without wasting any time.[1]

#**2- Import Library**
To start with our next word prediction model, let’s import some all the libraries we need for this task:



In [11]:
import numpy as np
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.core import Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq

# **3- Dataset**

As I told earlier, Google uses our browsing history to make next word predictions, smartphones, and all the keyboards that are trained to predict the next word are trained using some data. So I will also use a dataset. You can download the dataset from [here](https://drive.google.com/file/d/1GeUzNVqiixXHnTl8oNiQ2W3CynX_lsu2/view)

In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [18]:
path = "/content/drive/My Drive/Datasets/Next Word Prediction/1661-0.txt"
text = open(path).read().lower()
print('corpus length:', len(text))

corpus length: 581888


Now I will split the dataset into each word in order but without the presence of some special characters.[1]



In [19]:
tokenizer = RegexpTokenizer(r'w+')
words = tokenizer.tokenize(text)

Now the next process will be performing the feature engineering in our data. For this purpose, we will require a dictionary with each word in the data within the list of unique words as the key, and it’s significant portions as value.

In [20]:
unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))

# **4-Feature Engineering**

Feature Engineering means taking whatever information we have about our problem and turning it into numbers that we can use to build our feature matrix. If you want a detailed tutorial of feature engineering, you can learn it from here.[1]

Here I will define a Word length which will represent the number of previous words that will determine our next word. I will define prev words to keep five previous words and their corresponding next words in the list of next words[1]




In [21]:
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])

['w', 'w', 'w', 'w', 'w']
www


Now I will create two numpy arrays x for storing the features and y for storing its corresponding label. I will iterate x and y if the word is available so that the corresponding position becomes 1.

In [22]:
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1

Now before moving forward, have a look at a single sequence of words:



In [23]:
print(X[0][0])


[ True False]


#**5-Building the Recurrent Neural network**
As I stated earlier, I will use the Recurrent Neural networks for next word prediction model. Here I will use the LSTM model, which is a very powerful RNN.[1]


In [24]:
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))

#**6-Training the Next Word Prediction Model**
I will be training the next word prediction model with 20 epochs[1]

In [None]:
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=2, shuffle=True).history

Now we have successfully trained our model, before moving forward to evaluating our model, it will be better to save this model for our future use[1]

In [26]:
model.save('keras_next_word_model.h5')
pickle.dump(history, open("history.p", "wb"))
model = load_model('keras_next_word_model.h5')
history = pickle.load(open("history.p", "rb"))

In [29]:
print(history)

{'loss': [0.009844760410487652, 0.002820393769070506], 'accuracy': [0.9999086856842041, 0.9999086856842041], 'val_loss': [0.2027292400598526, 0.23149938881397247], 'val_accuracy': [0.984402060508728, 0.984402060508728]}


#**7-Evaluating the Next Word Prediction Model**

Now let’s have a quick look at how our model is going to behave based on its accuracy and loss changes while training:



In [None]:
plt.plot(history['accuracy'])
plt.plot(history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')

In [None]:
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')

#**8-Testing Next Word Prediction Model**

Now let’s build a python program to predict the next word using our trained model. For this, I will define some essential functions that will be used in the process.

In [35]:
def prepare_input(text):
    x = np.zeros((1, SEQUENCE_LENGTH, len(chars)))
    for t, char in enumerate(text):
        x[0, t, char_indices[char]] = 1.
        
    return x

Now before moving forward, let’s test the function, make sure you use a lower() function while giving input :



In [37]:
prepare_input("This is an example of input for our LSTM".lower())


NameError: ignored

# **References**
[Next Word Prediction Model](https://thecleverprogrammer.com/2020/07/20/next-word-prediction-model/)