## Next Word Prediction Using LSTM
#### Project Overview:

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use a jokes dataset from Hugging Face.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, two LSTM layers, and a dense output layer with a softmax activation function to predict the probability of the next word.

4- Model Training: The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss and stops training when the loss stops improving.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real time.

In [1]:
!pip install transformers



In [2]:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

with open('/content/top_100_jokes.txt','r') as file:
    text=file.read().lower()

tokenizer=Tokenizer()
tokenizer.fit_on_texts([text])
total_words=len(tokenizer.word_index)+1
total_words

2506
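To make the tokenizer step concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index like the one above can be built. It is an approximation for illustration only, not the Keras implementation: the helper name `build_word_index` is hypothetical, and the tokenization is simplified to lowercasing and whitespace splitting.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter()
    for text in texts:
        # Simplified tokenization (an assumption): lowercase, split on whitespace.
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["the cat sat on the mat", "the dog sat"]
word_index = build_word_index(toy_texts)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(word_index)
print(vocab_size)
```

The `+ 1` mirrors `total_words=len(tokenizer.word_index)+1` in the cell above: the extra slot accounts for index 0, which no word uses.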

In [3]:
tokenizer.word_index

{'a': 1,
 'the': 2,
 'to': 3,
 'i': 4,
 'you': 5,
 'in': 6,
 'of': 7,
 'my': 8,
 'and': 9,
 'what': 10,
 'it': 11,
 'is': 12,
 'do': 13,
 'me': 14,
 'for': 15,
 'that': 16,
 'he': 17,
 'have': 18,
 'when': 19,
 'was': 20,
 'on': 21,
 'they': 22,
 'like': 23,
 'why': 24,
 'how': 25,
 'are': 26,
 'did': 27,
 'be': 28,
 'just': 29,
 'with': 30,
 'an': 31,
 'but': 32,
 'if': 33,
 'at': 34,
 'call': 35,
 "it's": 36,
 'because': 37,
 'his': 38,
 "don't": 39,
 'her': 40,
 'one': 41,
 'about': 42,
 'say': 43,
 'so': 44,
 'this': 45,
 'your': 46,
 "i'm": 47,
 'out': 48,
 'up': 49,
 'get': 50,
 "what's": 51,
 'not': 52,
 'two': 53,
 'who': 54,
 'can': 55,
 'make': 56,
 'people': 57,
 'know': 58,
 'into': 59,
 'all': 60,
 'from': 61,
 'joke': 62,
 'does': 63,
 'never': 64,
 'we': 65,
 'them': 66,
 'there': 67,
 'would': 68,
 'had': 69,
 'take': 70,
 'really': 71,
 'guy': 72,
 'got': 73,
 'has': 74,
 'think': 75,
 'no': 76,
 'says': 77,
 "can't": 78,
 'could': 79,
 'she': 80,
 'bar': 81,
 'now': 8

In [4]:
input_sequences=[]
for line in text.split('\n'):
    token_list=tokenizer.texts_to_sequences([line])[0]
    for i in range(1,len(token_list)):
        n_gram_sequence=token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [6]:
'''
The first line of the dataset is:
[me narrating a documentary about narrators] "I can't hear what they're saying cuz I'm talking"
After this step, each line is expanded into multiple sequences: every sequence extends the previous one by one more word of the sentence.
'''
input_sequences

[[14, 875],
 [14, 875, 1],
 [14, 875, 1, 498],
 [14, 875, 1, 498, 42],
 [14, 875, 1, 498, 42, 876],
 [14, 875, 1, 498, 42, 876, 4],
 [14, 875, 1, 498, 42, 876, 4, 78],
 [14, 875, 1, 498, 42, 876, 4, 78, 92],
 [14, 875, 1, 498, 42, 876, 4, 78, 92, 10],
 [14, 875, 1, 498, 42, 876, 4, 78, 92, 10, 111],
 [14, 875, 1, 498, 42, 876, 4, 78, 92, 10, 111, 189],
 [14, 875, 1, 498, 42, 876, 4, 78, 92, 10, 111, 189, 877],
 [14, 875, 1, 498, 42, 876, 4, 78, 92, 10, 111, 189, 877, 47],
 [14, 875, 1, 498, 42, 876, 4, 78, 92, 10, 111, 189, 877, 47, 268],
 [345, 8],
 [345, 8, 878],
 [345, 8, 878, 879],
 [345, 8, 878, 879, 12],
 [345, 8, 878, 879, 12, 93],
 [345, 8, 878, 879, 12, 93, 15],
 [345, 8, 878, 879, 12, 93, 15, 5],
 [345, 8, 878, 879, 12, 93, 15, 5, 93],
 [345, 8, 878, 879, 12, 93, 15, 5, 93, 880],
 [345, 8, 878, 879, 12, 93, 15, 5, 93, 880, 881],
 [345, 8, 878, 879, 12, 93, 15, 5, 93, 880, 881, 9],
 [345, 8, 878, 879, 12, 93, 15, 5, 93, 880, 881, 9, 882],
 [345, 8, 878, 879, 12, 93, 15, 5, 93,
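The expansion shown in the output above can be reproduced on a toy sentence. This is a small sketch that uses words instead of integer ids purely for readability; the function name `prefix_sequences` is an assumption and does not appear in the notebook.

```python
def prefix_sequences(tokens):
    """Return every prefix of `tokens` that contains at least two elements."""
    return [tokens[: i + 1] for i in range(1, len(tokens))]

# Toy example: words stand in for the integer ids produced by the tokenizer.
sentence = "my name is xyz".split()
prefixes = prefix_sequences(sentence)
for seq in prefixes:
    print(seq)
```

A sentence with n tokens yields n - 1 training sequences, so a line with a single token contributes nothing.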

In [7]:
## Pad Sequences
max_sequence_len=max([len(x) for x in input_sequences])
max_sequence_len

39

In [8]:
input_sequences=np.array(pad_sequences(input_sequences,maxlen=max_sequence_len,padding='pre'))
input_sequences

array([[   0,    0,    0, ...,    0,   14,  875],
       [   0,    0,    0, ...,   14,  875,    1],
       [   0,    0,    0, ...,  875,    1,  498],
       ...,
       [   0,    0,    0, ...,   37,   22,   26],
       [   0,    0,    0, ...,   22,   26, 2504],
       [   0,    0,    0, ...,   26, 2504, 2505]], dtype=int32)

In [10]:
import tensorflow as tf
# x is every token in a sequence except the last one
# y is the last token of the sequence (the word to predict)
"""
In the input-sequence step, each sentence was expanded into growing prefixes, e.g.
Sentence = "My name is XYZ"
becomes
"My name"
"My name is"
"My name is XYZ"
For each prefix, the last word is the label (y) and all the words before it are the input features (x).
"""
x,y=input_sequences[:,:-1],input_sequences[:,-1]

In [11]:
x

array([[   0,    0,    0, ...,    0,    0,   14],
       [   0,    0,    0, ...,    0,   14,  875],
       [   0,    0,    0, ...,   14,  875,    1],
       ...,
       [   0,    0,    0, ..., 2503,   37,   22],
       [   0,    0,    0, ...,   37,   22,   26],
       [   0,    0,    0, ...,   22,   26, 2504]], dtype=int32)

In [12]:
y

array([ 875,    1,  498, ...,   26, 2504, 2505], dtype=int32)
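As a rough illustration of the padding and the x/y split above, the following sketch applies the same two steps to three short sequences in plain Python. The helper `pad_pre` is hypothetical; it imitates left-padding with zeros and is not the Keras `pad_sequences` function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value` up to `maxlen`, keeping the last `maxlen` items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

sequences = [[14, 875], [14, 875, 1], [14, 875, 1, 498]]
maxlen = max(len(s) for s in sequences)
padded = [pad_pre(s, maxlen) for s in sequences]

# Features are all columns except the last; the label is the last column.
features = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
print(features)
print(labels)
```

Padding on the left keeps the most recent words next to the label column, which is why the last column can be sliced off as the target.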

In [13]:
y=tf.keras.utils.to_categorical(y,num_classes=total_words)
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
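The one-hot step can be pictured with a small plain-Python sketch. The helper `one_hot` is an illustrative assumption and only mimics what `to_categorical` does for integer labels.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

labels = [3, 1, 4]
num_classes = 6
encoded = [one_hot(i, num_classes) for i in labels]
for row in encoded:
    print(row)
```

With the 2506-word vocabulary used here, every label becomes a vector of 2506 floats. Keras also offers a `sparse_categorical_crossentropy` loss that accepts integer labels directly, which may be worth considering if memory becomes a concern.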

In [14]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [15]:
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
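The patience logic behind early stopping can be sketched in a few lines of plain Python. This is a simplified model of the idea, assuming a plain "lower is better" comparison and ignoring options such as a minimum improvement threshold; the function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training ends at the last epoch.
    return len(val_losses) - 1, best_epoch

losses = [6.0, 5.5, 5.2, 5.3, 5.4, 5.25, 5.1]  # hypothetical validation losses
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In this toy run, training stops after three epochs without improvement, and restoring the best weights would roll the model back to `best_epoch`. The check depends on a validation loss being computed during training, which requires passing validation data to `fit`.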

In [37]:
#Normal LSTM RNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout

model = Sequential()
model.add(Input(shape=(max_sequence_len-1,)))
# The Input layer already fixes the sequence length, so Embedding needs no input_length
model.add(Embedding(input_dim=total_words, output_dim=100))
model.add(LSTM(150, return_sequences=True))  # return the full sequence for the next LSTM layer
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
model.summary()


In [38]:
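# GRU variant of the same architecture. Note that this reassigns `model`, so if the
# cells are run in order, the training cell below trains this GRU model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Dense, Dropout, GRU

model = Sequential()
model.add(Input(shape=(max_sequence_len-1,)))
# The Input layer already fixes the sequence length, so Embedding needs no input_length
model.add(Embedding(input_dim=total_words, output_dim=100))
model.add(GRU(150, return_sequences=True))
model.add(Dropout(0.2))
model.add(GRU(100))
model.add(Dense(total_words, activation="softmax"))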

model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
model.summary()
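Since the `model.summary()` output is not shown, the sizes of the two architectures above can be estimated by hand. The sketch below uses the standard parameter-count formulas for each layer type with the vocabulary size of 2506 reported earlier. The GRU formula assumes the Keras default `reset_after=True` (two bias vectors per gate), and all helper names are hypothetical. These are computed values, not output copied from the notebook.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector.
    return 4 * ((input_dim + units) * units + units)

def gru_params(input_dim, units):
    # 3 gates; assumes reset_after=True, which keeps two bias vectors per gate.
    return 3 * ((input_dim + units) * units + 2 * units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, embed_dim = 2506, 100

lstm_total = (embedding_params(vocab_size, embed_dim)
              + lstm_params(embed_dim, 150)
              + lstm_params(150, 100)
              + dense_params(100, vocab_size))

gru_total = (embedding_params(vocab_size, embed_dim)
             + gru_params(embed_dim, 150)
             + gru_params(150, 100)
             + dense_params(100, vocab_size))

print("LSTM model parameters:", lstm_total)
print("GRU model parameters:", gru_total)
```

Under these assumptions, the embedding and the output layer account for most of the parameters in both models, so swapping LSTM for GRU changes the total only modestly.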

In [44]:
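## Train the model
# Note: no validation data is passed here, so 'val_loss' is never computed and the
# EarlyStopping callback cannot trigger. To enable it, train on x_train/y_train
# and pass validation_data=(x_test, y_test).
history = model.fit(x, y, epochs=50, verbose=1, callbacks=[early_stopping])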

Epoch 1/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 28s 109ms/step - accuracy: 0.0441 - loss: 6.6619
Epoch 2/50
  current = self.get_monitor_value(logs)
260/260 ━━━━━━━━━━━━━━━━━━━━ 41s 109ms/step - accuracy: 0.0463 - loss: 6.5902
Epoch 3/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 42s 114ms/step - accuracy: 0.0456 - loss: 6.3375
Epoch 4/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 40s 109ms/step - accuracy: 0.0551 - loss: 6.0541
Epoch 5/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 41s 109ms/step - accuracy: 0.0609 - loss: 5.8978
Epoch 6/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 42s 114ms/step - accuracy: 0.0745 - loss: 5.6558
Epoch 7/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 40s 110ms/step - accuracy: 0.0824 - loss: 5.4352
Epoch 8/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 41s 110ms/step - accuracy: 0.0896 - loss: 5.2096
Epoch 9/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 42s 113ms/step - accuracy: 0.1044 - loss: 4.9755
Epoch 10/50
260/260

In [45]:
# Function to predict the next word
def predict_next_word(model, tokenizer, text, max_sequence_len):
    token_list = tokenizer.texts_to_sequences([text])[0]
    # Keep only the most recent max_sequence_len-1 tokens so the input matches the model's input length
    token_list = token_list[-(max_sequence_len-1):]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    # argmax over the vocabulary axis; [0] extracts the scalar index for the single input
    predicted_word_index = int(np.argmax(predicted, axis=1)[0])
    # index_word is the reverse of word_index; index 0 (padding) has no word and yields None
    return tokenizer.index_word.get(predicted_word_index)

In [46]:
input_text="what to do"
print(f"Input text:{input_text}")
max_sequence_len=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,input_text,max_sequence_len)
print(f"Next Word Prediction:{next_word}")

Input text:what to do
Next Word Prediction:you
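The prediction above returns only the single most likely word. As a possible extension, the same probability vector can be ranked to list several candidates. The sketch below works on a plain list of probabilities and a toy index-to-word mapping; the function name `top_k_words` and the numbers are assumptions for illustration, not part of the notebook.

```python
def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs.

    Ids without a word (for example the padding id 0) are skipped.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    result = []
    for i in ranked:
        if i in index_word:
            result.append((index_word[i], probs[i]))
        if len(result) == k:
            break
    return result

# Toy vocabulary and a made-up probability vector over ids 0..4.
index_word = {1: "a", 2: "the", 3: "you", 4: "it"}
probs = [0.05, 0.10, 0.15, 0.50, 0.20]
candidates = top_k_words(probs, index_word, k=2)
print(candidates)
```

With a real model, `probs` would be one row of `model.predict(...)` and `index_word` would come from the fitted tokenizer.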


In [47]:
model.save("next_word_lstm.h5")
print("Model Saved Successfully ✅")



Model Saved Successfully ✅


In [48]:
input_text="He was a real gentlemen"
print(f"Input text:{input_text}")
max_sequence_len=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,input_text,max_sequence_len)
print(f"Next Word Prediction:{next_word}")

Input text:He was a real gentlemen
Next Word Prediction:and


In [51]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, next_words=20):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)

        predicted_word_index = np.argmax(predicted_probs)
        output_word = None

        for word, index in tokenizer.word_index.items():
            if index == predicted_word_index:
                output_word = word
                break

        if output_word is None:
            break
        print(output_word)
        seed_text += " " + output_word

    return seed_text

seed_sentence = "He was a real gentlemen"
generated_text = generate_text(seed_sentence, next_words=10)
print("Generated Text:", generated_text)

and
always
opened
the
fridge
door
for
me
by
me
Generated Text: He was a real gentlemen and always opened the fridge door for me by me



**Bidirectional LSTM (BiLSTM)**

A standard LSTM processes the text in one direction (left to right).

*   A BiLSTM processes it both forward and backward, so the representation of each word draws on the context on both sides.
*   This helps the model capture the relationships between words.
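The idea of reading a sequence in both directions can be shown with a toy recurrence in place of a real LSTM cell. In this sketch, `running_state` is a hypothetical stand-in (a decayed running sum), chosen only to make the forward and backward passes easy to follow by hand.

```python
def running_state(xs, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t, standing in for an LSTM cell."""
    h, states = 0.0, []
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states

def bidirectional(xs):
    """Pair each position's forward state with its backward state."""
    forward = running_state(xs)
    # Run the same recurrence on the reversed input, then re-align to the original order.
    backward = running_state(xs[::-1])[::-1]
    return list(zip(forward, backward))

inputs = [1.0, 2.0, 4.0]
states = bidirectional(inputs)
for position, (fwd, bwd) in enumerate(states):
    print(position, fwd, bwd)
```

At each position the forward value summarizes everything to the left and the backward value everything to the right, which is the pairing that a `Bidirectional` wrapper concatenates.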




**Attention Mechanism**



*   Instead of just using the last LSTM hidden state, attention assigns different importance (weights) to different words in the sequence.
*   This helps the model focus on the most relevant words while predicting the next word.
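A minimal sketch of attention-weighted pooling, in plain Python, may help make the two points above concrete. The scores here are hypothetical inputs supplied by hand; in a trained model they would be computed from the hidden states. The function names are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores):
    """Weighted sum of hidden states, using softmax(scores) as the weights."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights

# Three time steps with 2-dimensional hidden states; the last step gets the highest score.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]
pooled, weights = attention_pool(hidden_states, scores)
print(weights)
print(pooled)
```

The weights sum to one, and the time step with the largest score contributes most to the pooled vector, in contrast to simply taking the last hidden state.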





In [None]:
#Bi LSTM with attention

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, Input, Layer

# Custom Attention Layer that returns weights
class AttentionLayer(Layer):
    def __init__(self, **kwargs):
        # Forward standard layer arguments (e.g. name) to the base class
        super().__init__(**kwargs)

    def call(self, lstm_output):
        """
        lstm_output: The output from the BiLSTM layer (batch_size, seq_length, hidden_dim)
        Returns the weighted sum over time (batch_size, hidden_dim) and the attention weights.
        """
        # Parameter-free attention: softmax over the time axis, computed separately for each hidden unit
        attention_scores = tf.nn.softmax(lstm_output, axis=1)
        attention_output = tf.reduce_sum(lstm_output * attention_scores, axis=1)
        return attention_output, attention_scores



inputs = Input(shape=(max_sequence_len - 1,))
# The line above defines an input layer but does NOT yet connect it to anything.
# The Input layer already fixes the sequence length, so Embedding needs no input_length
embedding = Embedding(total_words, 100)(inputs)

bilstm = Bidirectional(LSTM(150, return_sequences=True))(embedding)

attention_output, attention_weights = AttentionLayer()(bilstm)

dropout = Dropout(0.2)(attention_output)

outputs = Dense(total_words, activation="softmax")(dropout)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.summary()


In [24]:
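# As in the earlier training cell, no validation data is passed, so early stopping on 'val_loss' cannot trigger
history = model.fit(x, y, epochs=50, verbose=1, callbacks=[early_stopping])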

Epoch 1/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 51s 160ms/step - accuracy: 0.0391 - loss: 7.0560
Epoch 2/50
  current = self.get_monitor_value(logs)
260/260 ━━━━━━━━━━━━━━━━━━━━ 78s 144ms/step - accuracy: 0.0439 - loss: 6.5449
Epoch 3/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 35s 136ms/step - accuracy: 0.0492 - loss: 6.3701
Epoch 4/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 40s 131ms/step - accuracy: 0.0447 - loss: 6.2901
Epoch 5/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 46s 149ms/step - accuracy: 0.0494 - loss: 6.1586
Epoch 6/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 38s 138ms/step - accuracy: 0.0548 - loss: 6.0358
Epoch 7/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 35s 135ms/step - accuracy: 0.0540 - loss: 5.9424
Epoch 8/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 39s 152ms/step - accuracy: 0.0551 - loss: 5.8673
Epoch 9/50
260/260 ━━━━━━━━━━━━━━━━━━━━ 40s 150ms/step - accuracy: 0.0548 - loss: 5.7669
Epoch 10/50
260/260

In [34]:
def generate_text(seed_text, next_words=20):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)

        predicted_word_index = np.argmax(predicted_probs)
        output_word = None

        for word, index in tokenizer.word_index.items():
            if index == predicted_word_index:
                output_word = word
                break

        if output_word is None:
            break

        seed_text += " " + output_word

    return seed_text

seed_sentence = "what to do when"
generated_text = generate_text(seed_sentence, next_words=50)
print("Generated Text:", generated_text)

Generated Text: what to do when problems problems problems problems problems problems revere's revere's grandparents grandparents grandparents native native dream dream dream fall fall chain chain chain days days days days days midget midget midget away away away away away visibly becoming 111 111 moon island google steve steve steve bacon bacon bacon cheese cheese comedies


In [27]:
model.save("Bi_lstm.h5")
## Save the tokenizer
import pickle
with open('tokenizer.pickle','wb') as handle:
    pickle.dump(tokenizer,handle,protocol=pickle.HIGHEST_PROTOCOL)
print("Model & Tokenizer Saved Successfully ✅")



Model & Tokenizer Saved Successfully ✅


In [49]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, next_words=20):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)

        predicted_word_index = np.argmax(predicted_probs)
        output_word = None

        for word, index in tokenizer.word_index.items():
            if index == predicted_word_index:
                output_word = word
                break

        if output_word is None:
            break

        seed_text += " " + output_word

    return seed_text

seed_sentence = "He was a real gentlemen"
generated_text = generate_text(seed_sentence, next_words=10)
print("Generated Text:", generated_text)

Generated Text: He was a real gentlemen and always opened the fridge door for me by me
