## Task 6: Next Word Prediction Using LSTM

- Author: **Rahul Rathour**
- Data Science intern at LetsGrowMore
- Link to the text file: https://www.gutenberg.org/cache/epub/1513/pg1513.txt
- Reference: https://ishwargautam.blogspot.com/2021/07/next-word-prediction-using-lstm.html

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM,Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import pickle
import numpy as np
import os

In [2]:
from google.colab import files
uploaded = files.upload()

Saving romeo.txt to romeo.txt


In [3]:
file=open('romeo.txt','r',encoding='utf8')

In [4]:
# Read every line of the text, then close the file handle
lines = file.readlines()
file.close()

In [5]:
# Join all lines once; looping over the lines only repeated the same join
data = " ".join(lines)

In [6]:
data=data.replace('\n','').replace('\r','').replace('\ufeff','').replace('“','').replace('”','')

In [7]:
data=data.split()
data=' '.join(data)
data[:500]


'The Project Gutenberg eBook of Romeo and Juliet, by William Shakespeare This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before usi'

In [8]:
tokenizer=Tokenizer()
tokenizer.fit_on_texts([data])

In [9]:
# Save the tokenizer so the prediction function can reuse the same word index
with open('token.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

In [10]:
sequence_data = tokenizer.texts_to_sequences([data])[0]
sequence_data[:15]

[1, 53, 49, 306, 6, 12, 2, 22, 32, 967, 783, 16, 306, 8, 18]
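The integer IDs above come from the tokenizer's word index, which ranks words by frequency: ID 1 goes to the most frequent word, and ID 0 is never assigned to a word. Below is a minimal pure-Python sketch of that idea. It assumes plain lowercasing and whitespace splitting and leaves out the punctuation filtering that the Keras `Tokenizer` also applies; `build_word_index` and `to_sequence` are hypothetical helper names, not part of Keras.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    words = text.lower().split()
    counts = Counter(words)
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_sequence(text, word_index):
    """Convert text to IDs, silently skipping words that are not in the index."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

toy_index = build_word_index("the cat and the dog")
toy_ids = to_sequence("the dog and the bird", toy_index)
print(toy_index)
print(toy_ids)
```

Words that were never seen during fitting ("bird" in this toy example) simply disappear from the sequence, which matters later when the prediction function expects exactly three IDs.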

In [11]:
vocab_size=len(tokenizer.word_index) + 1
print(vocab_size)

4282


In [12]:
sequences = []
for i in range(3,len(sequence_data)):
    words=sequence_data[i-3:i+1]
    sequences.append(words)

In [13]:
print('The length of sequences',len(sequences))

The length of sequences 29349


In [14]:
sequences=np.array(sequences)
sequences[:10]

array([[  1,  53,  49, 306],
       [ 53,  49, 306,   6],
       [ 49, 306,   6,  12],
       [306,   6,  12,   2],
       [  6,  12,   2,  22],
       [ 12,   2,  22,  32],
       [  2,  22,  32, 967],
       [ 22,  32, 967, 783],
       [ 32, 967, 783,  16],
       [967, 783,  16, 306]])

In [15]:
X=[]
y=[]

In [16]:
for i in sequences:
    X.append(i[0:3])
    y.append(i[3])

In [17]:
X=np.array(X)
y=np.array(y)

In [18]:
print('Data: ',X[:10])
print('Response: ',y[:10])

Data:  [[  1  53  49]
 [ 53  49 306]
 [ 49 306   6]
 [306   6  12]
 [  6  12   2]
 [ 12   2  22]
 [  2  22  32]
 [ 22  32 967]
 [ 32 967 783]
 [967 783  16]]
Response:  [306   6  12   2  22  32 967 783  16 306]
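The windowing and the input/response split above can be sketched in a single pass. This is a small illustration on the first six token IDs from the notebook output; `make_windows` is a hypothetical helper name, and plain lists stand in for the NumPy arrays.

```python
def make_windows(token_ids, context=3):
    """Return (inputs, targets): each input holds `context` IDs, each target is the next ID."""
    inputs, targets = [], []
    for i in range(context, len(token_ids)):
        inputs.append(token_ids[i - context:i])
        targets.append(token_ids[i])
    return inputs, targets

ids = [1, 53, 49, 306, 6, 12]
window_inputs, window_targets = make_windows(ids)
print(window_inputs)
print(window_targets)
```

A sequence of `n` tokens yields `n - context` training pairs, since the first `context` tokens have no complete history.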


In [19]:
y=to_categorical(y,num_classes=vocab_size)
y[:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
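Each row of this array is a vector of length `vocab_size` with a single 1 at the position of the target ID. The fourth row shows a 1 at position 2 because the fourth response value is 2. A minimal sketch of that encoding, plus a rough estimate of how much memory the full matrix takes, assuming the sample count and vocabulary size printed earlier; `one_hot` is a hypothetical helper name.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

vocab_size = 4282      # from the notebook output
num_samples = 29349    # from the notebook output

encoded = one_hot(2, vocab_size)   # target ID 2, as in the fourth response

# Size of the full float32 one-hot matrix in bytes (4 bytes per value)
one_hot_bytes = num_samples * vocab_size * 4
print(one_hot_bytes)
```

At roughly 500 MB the matrix still fits in memory here. For larger vocabularies, one option is to keep the integer targets and use Keras's `sparse_categorical_crossentropy` loss instead.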

In [20]:
model=Sequential()
model.add(Embedding(vocab_size,10,input_length=3))
model.add(LSTM(1000,return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000,activation='relu'))
model.add(Dense(vocab_size,activation='softmax'))

In [21]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 10)             42820     
                                                                 
 lstm (LSTM)                 (None, 3, 1000)           4044000   
                                                                 
 lstm_1 (LSTM)               (None, 1000)              8004000   
                                                                 
 dense (Dense)               (None, 1000)              1001000   
                                                                 
 dense_1 (Dense)             (None, 4282)              4286282   
                                                                 
Total params: 17,378,102
Trainable params: 17,378,102
Non-trainable params: 0
_________________________________________________________________
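The parameter counts in the summary can be checked by hand with the standard formulas for embedding, LSTM, and dense layers. The function names in this sketch are hypothetical; the layer sizes are the ones used in the model definition.

```python
def embedding_params(vocab, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 4282
layer_params = [
    embedding_params(vocab_size, 10),   # embedding
    lstm_params(10, 1000),              # lstm
    lstm_params(1000, 1000),            # lstm_1
    dense_params(1000, 1000),           # dense
    dense_params(1000, vocab_size),     # dense_1
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

The two LSTM layers account for about 12 million of the 17.4 million parameters, so the number of LSTM units is the main driver of model size.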


In [22]:

from tensorflow.keras.callbacks import ModelCheckpoint
# Save the model whenever the training loss improves (no validation data is used here)
checkpoint = ModelCheckpoint('next_words.h5', monitor='loss', verbose=1, save_best_only=True)
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))
model.fit(X, y, epochs=100, batch_size=64, callbacks=[checkpoint])

Epoch 1/100
Epoch 1: loss improved from inf to 6.81847, saving model to next_words.h5
Epoch 2/100
Epoch 2: loss improved from 6.81847 to 6.45490, saving model to next_words.h5
Epoch 3/100
Epoch 3: loss improved from 6.45490 to 6.19038, saving model to next_words.h5
Epoch 4/100
Epoch 4: loss improved from 6.19038 to 5.93268, saving model to next_words.h5
Epoch 5/100
Epoch 5: loss improved from 5.93268 to 5.66360, saving model to next_words.h5
Epoch 6/100
Epoch 6: loss improved from 5.66360 to 5.40580, saving model to next_words.h5
Epoch 7/100
Epoch 7: loss improved from 5.40580 to 5.16285, saving model to next_words.h5
Epoch 8/100
Epoch 8: loss improved from 5.16285 to 4.92012, saving model to next_words.h5
Epoch 9/100
Epoch 9: loss improved from 4.92012 to 4.66583, saving model to next_words.h5
Epoch 10/100
Epoch 10: loss improved from 4.66583 to 4.39757, saving model to next_words.h5
Epoch 11/100
Epoch 11: loss improved from 4.39757 to 4.10078, saving model to next_words.h5
Epoch 12/1

<keras.callbacks.History at 0x7f653c078460>

In [23]:
from tensorflow.keras.models import load_model
import numpy as np
import pickle

# Load the best checkpoint and the tokenizer saved earlier
model = load_model('next_words.h5')
with open('token.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

def Predict_Next_Words(model, tokenizer, text):
  # text is a list of words; words missing from the vocabulary are dropped by the tokenizer
  sequence = tokenizer.texts_to_sequences([text])
  sequence = np.array(sequence)
  # Index of the most probable word in the softmax output
  preds = int(np.argmax(model.predict(sequence)))
  # index_word is the reverse of word_index; fall back to "" if the index has no word
  predicted_word = tokenizer.index_word.get(preds, "")

  print(predicted_word)
  return predicted_word
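`Predict_Next_Words` returns only the single most likely word. A small variation, sketched here on toy data, would return the top few candidates with their probabilities. `top_k_words` is a hypothetical helper name, and the probability list and index-to-word mapping are made-up stand-ins for the model's softmax output and the tokenizer's reverse index.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    # Skip indices with no word, such as the reserved index 0
    pairs = [(index_word[i], probabilities[i]) for i in ranked if i in index_word]
    return pairs[:k]

toy_index_word = {1: "the", 2: "and", 3: "romeo"}   # made-up mapping
toy_probs = [0.05, 0.5, 0.15, 0.3]                  # made-up softmax output
suggestions = top_k_words(toy_probs, toy_index_word, k=2)
print(suggestions)
```

With the notebook's objects, the inputs would presumably be `model.predict(sequence)[0]` and `tokenizer.index_word`.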

In [25]:
while True:
  text = input("Enter your line: ")

  if text == "0":
      print("Execution completed.....")
      break

  else:
      try:
          # Keep only the last three words, matching the model's input length
          text = text.split()
          text = text[-3:]
          print(text)

          Predict_Next_Words(model, tokenizer, text)

      except Exception as e:
          print("Error occurred: ", e)
          continue

KeyboardInterrupt: ignored