Project Introduction

# Next Word Prediction using LSTM (Language Modeling)

This project builds a **Next Word Prediction system**
using an LSTM-based language model trained on Shakespeare's text.

Key learning objectives:
- Language modeling from first principles
- Sliding window sequence generation
- Softmax over large vocabularies
- Many-to-one sequence modeling (a window of context words in, one next word out)
- Foundation for text generation & autocomplete


Imports & Configuration

In [1]:
import numpy as np
import tensorflow as tf
import pickle


In [2]:
# Reproducibility
tf.random.set_seed(42)
np.random.seed(42)


In [3]:
# CONFIG (LOCKED)
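VOCAB_SIZE = 5000   # cap on token IDs (includes padding ID 0 and the OOV ID)
CONTEXT_LEN = 5     # number of preceding words fed to the model
EMBED_DIM = 100     # size of each word embedding vector
LSTM_UNITS = 150    # size of the LSTM hidden state
BATCH_SIZE = 128
EPOCHS = 20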


Download Dataset

In [4]:
url = "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
path_to_file = tf.keras.utils.get_file("shakespeare.txt", url)

# Read inside a context manager so the file handle is closed afterwards
with open(path_to_file, "rb") as f:
    text = f.read().decode("utf-8")
text = text.lower()

print("Total characters:", len(text))


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Total characters: 1115394


Tokenization & Vocabulary

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [6]:
tokenizer = Tokenizer(
    num_words=VOCAB_SIZE,
    oov_token="<oov>",
)

tokenizer.fit_on_texts([text])

In [7]:
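# word_index holds every distinct word seen in the corpus; num_words only
# caps the IDs emitted by texts_to_sequences, so this count exceeds VOCAB_SIZE.
total_words = len(tokenizer.word_index) + 1
print("Total unique tokens:", total_words)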


Total unique tokens: 12634
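
The count above (12634) is larger than `VOCAB_SIZE` because the cap is applied when text is encoded, not when the vocabulary is built. The sketch below imitates that behaviour in plain Python on a toy sentence. It is a simplified stand-in, not the Keras implementation: `build_vocab` and `encode` are hypothetical helper names, and the rule assumed here is that ID 1 is reserved for the OOV token and any word whose ID is `num_words` or higher is replaced by it.

```python
from collections import Counter

def build_vocab(words, oov_token="<oov>"):
    """Rank words by frequency; ID 1 is reserved for the OOV token."""
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(Counter(words).most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(words, word_index, num_words, oov_token="<oov>"):
    """Map words to IDs, sending unknown or too-rare words to the OOV ID."""
    oov_id = word_index[oov_token]
    ids = []
    for word in words:
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

words = "to be or not to be that is the question".split()
word_index = build_vocab(words)                      # full vocabulary, uncapped
encoded = encode(words, word_index, num_words=4)     # only IDs 1..3 survive

print("vocabulary entries:", len(word_index))
print("encoded:", encoded)
```

With `num_words=4`, only the two most frequent words keep their own IDs, and every other word collapses to ID 1. The same reasoning suggests that with `VOCAB_SIZE = 5000` the model sees IDs below 5000 only, which is why the embedding and output layers can be sized by `VOCAB_SIZE` rather than by the full vocabulary count.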


Generate Input–Target Sequences (Sliding Window)

In [12]:
sequences = []

tokens = tokenizer.texts_to_sequences([text])[0]

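# Slide a window of CONTEXT_LEN + 1 tokens over the corpus: the first
# CONTEXT_LEN IDs are the input context, the last ID is the target word.
for i in range(CONTEXT_LEN, len(tokens)):
    seq = tokens[i-CONTEXT_LEN:i+1]
    sequences.append(seq)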

sequences = np.array(sequences)
print("Total sequences:", sequences.shape)

print(sequences[0])


Total sequences: (204084, 6)
[ 89 270 140  36 970 144]
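
To make the windowing concrete, here is a minimal sketch of the same loop on a toy list of token IDs, followed by the input/target split. `make_windows` and the toy IDs are illustrative names and values, not part of the notebook. The last lines check the arithmetic against the shapes reported in this notebook: 204084 windows in batches of 128.

```python
import math

def make_windows(token_ids, context_len):
    """Return every run of context_len + 1 consecutive token IDs."""
    windows = []
    for i in range(context_len, len(token_ids)):
        windows.append(token_ids[i - context_len : i + 1])
    return windows

toy_tokens = [10, 11, 12, 13, 14, 15, 16, 17]
windows = make_windows(toy_tokens, context_len=5)

# Same split as X = sequences[:, :-1] and y = sequences[:, -1]
inputs = [w[:-1] for w in windows]
targets = [w[-1] for w in windows]

for context, target in zip(inputs, targets):
    print(context, "->", target)

# A corpus of N tokens yields N - context_len windows.
assert len(windows) == len(toy_tokens) - 5

# Batches per epoch for the real data: ceil(204084 / 128)
steps_per_epoch = math.ceil(204084 / 128)
print("steps per epoch:", steps_per_epoch)
```

The step count works out to 1595, which agrees with the `1595/1595` shown in the training log.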


Split Input & Target

In [13]:
X = sequences[:, :-1]  # first CONTEXT_LEN token IDs of each window
y = sequences[:, -1]   # ID of the word that follows them


Build LSTM Language Model

In [14]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense


In [15]:
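model = Sequential([
    # Explicit input shape; Embedding's input_length argument is deprecated in Keras 3
    tf.keras.Input(shape=(CONTEXT_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),
    LSTM(LSTM_UNITS),  # returns only the final hidden state (many-to-one)
    Dense(VOCAB_SIZE, activation="softmax")  # one probability per token ID
])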




In [18]:
model.summary()
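
The summary output is not reproduced here, but the parameter counts can be worked out by hand from the configuration. The sketch below assumes Keras defaults (bias terms enabled, a standard LSTM with four gates); the variable names are illustrative.

```python
VOCAB_SIZE, EMBED_DIM, LSTM_UNITS = 5000, 100, 150

# Embedding: one EMBED_DIM-sized vector per token ID
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)

# Dense: weight matrix from hidden state to vocabulary, plus one bias per word
dense_params = LSTM_UNITS * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + lstm_params + dense_params

print("Embedding:", embedding_params)
print("LSTM:     ", lstm_params)
print("Dense:    ", dense_params)
print("Total:    ", total_params)
```

Under these assumptions the softmax output layer is the largest component, larger than the embedding table and the LSTM combined, which is the practical cost of a softmax over a large vocabulary.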

Compile Model (LOSS MATTERS)

In [19]:
model.compile(
    # "sparse" because y holds integer word IDs, not one-hot vectors
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)
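
Why the loss choice matters: with integer targets, sparse categorical cross-entropy is just the negative log of the probability assigned to the correct word, and it equals the one-hot version without building a `VOCAB_SIZE`-wide target vector per example. The sketch below shows this on a four-word toy vocabulary in plain Python. The helper names and logits are made up for illustration. The last two lines put the notebook's numbers in context: the loss of a uniform guess over 5000 words, and the perplexity implied by the epoch 8 training loss of 4.8820 from the training log.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probs, target_id):
    """Loss for one example when the target is an integer class ID."""
    return -math.log(probs[target_id])

probs = softmax([2.0, 1.0, 0.1, -1.0])   # toy scores for a 4-word vocabulary
loss_sparse = sparse_cross_entropy(probs, target_id=0)

# Equivalent one-hot formulation: only the target term is non-zero
one_hot = [1.0, 0.0, 0.0, 0.0]
loss_one_hot = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

print("sparse loss: ", round(loss_sparse, 4))
print("one-hot loss:", round(loss_one_hot, 4))

# Reference points for this notebook's configuration
uniform_loss = math.log(5000)        # loss of guessing uniformly over VOCAB_SIZE
perplexity = math.exp(4.8820)        # perplexity at the epoch 8 training loss
print("uniform baseline loss:", round(uniform_loss, 2))
print("perplexity at loss 4.8820:", round(perplexity, 1))
```

Perplexity here is simply `exp(loss)`, so a training loss near 4.88 corresponds to the model being about as uncertain as a uniform choice among roughly 130 words, compared with 5000 for an untrained uniform guess. These are training-set figures; the notebook does not hold out a validation split.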


Train Model

In [20]:
history = model.fit(
    X,
    y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS
)


Epoch 1/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 15s 6ms/step - accuracy: 0.0494 - loss: 6.5947
Epoch 2/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.0780 - loss: 5.9583
Epoch 3/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.1017 - loss: 5.6794
Epoch 4/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.1109 - loss: 5.4787
Epoch 5/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.1189 - loss: 5.3112
Epoch 6/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.1251 - loss: 5.1578
Epoch 7/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.1304 - loss: 5.0155
Epoch 8/20
1595/1595 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.1372 - loss: 4.8820
Epoch 9/20
[remaining output truncated]

Save Model & Tokenizer

In [21]:
model.save("next_word_lstm.h5")

with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)




Next Word Prediction Function

In [28]:
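def predict_next_word(seed_text):
    """Return the most likely next word for seed_text ("" if none is found)."""
    seed_text = seed_text.lower()
    seq = tokenizer.texts_to_sequences([seed_text])[0]
    seq = seq[-CONTEXT_LEN:]  # keep only the most recent context words
    seq = pad_sequences([seq], maxlen=CONTEXT_LEN, padding="pre")

    preds = model.predict(seq, verbose=0)[0]

    # Block the padding ID (0) and the OOV token. The lookup key must match the
    # oov_token given to the Tokenizer ("<oov>"), so read it from the tokenizer;
    # looking up "<OOV>" returns None and silently skips the block.
    preds[0] = 0
    oov_index = tokenizer.word_index.get(tokenizer.oov_token)
    if oov_index is not None:
        preds[oov_index] = 0

    predicted_id = int(np.argmax(preds))

    return tokenizer.index_word.get(predicted_id, "")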
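
For autocomplete, several ranked suggestions are often more useful than a single argmax. The sketch below shows one way to pick the top `k` words from a probability vector while skipping blocked IDs. It is a standalone illustration in plain Python: `top_k_words`, the toy probabilities, and the toy `index_word` mapping are all hypothetical, and the blocked IDs assume 0 is padding and 1 is the OOV token, as in this notebook's tokenizer.

```python
def top_k_words(probs, index_word, k=3, blocked_ids=(0, 1)):
    """Return the k most probable (word, probability) pairs, skipping blocked IDs."""
    candidates = [
        (token_id, p) for token_id, p in enumerate(probs)
        if token_id not in blocked_ids
    ]
    # Highest probability first
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [(index_word.get(token_id, ""), p) for token_id, p in candidates[:k]]

# Toy stand-ins for model.predict(...)[0] and tokenizer.index_word
toy_probs = [0.05, 0.40, 0.30, 0.15, 0.10]
toy_index_word = {1: "<oov>", 2: "the", 3: "and", 4: "to"}

suggestions = top_k_words(toy_probs, toy_index_word, k=2)
print(suggestions)
```

In the toy data the OOV token has the highest raw probability, yet it never appears in the suggestions because its ID is blocked before ranking.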


Test the Model

In [47]:
seed = "it is your "
print("Next word:", predict_next_word(seed))

Next word: worship


In [46]:
seed_text = "we are chosen"
print(predict_next_word(seed_text))

in


In [45]:
seed_text = "if things go"
print(predict_next_word(seed_text))

with


In [43]:
seed_text = "i am glad to"
print(predict_next_word(seed_text))

have
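
Single-word prediction extends to text generation by feeding each predicted word back into the context. The loop below is a minimal sketch of that idea. `generate_text` is a hypothetical helper, and `toy_predict` is a lookup-table stand-in for `predict_next_word` so the example runs without a trained model.

```python
def generate_text(seed_text, n_words, predict_fn):
    """Append up to n_words predicted words to seed_text, one at a time."""
    words = seed_text.lower().split()
    for _ in range(n_words):
        next_word = predict_fn(" ".join(words))
        if not next_word:          # stop if the predictor has nothing to offer
            break
        words.append(next_word)
    return " ".join(words)

# Toy predictor: maps the last word to a fixed continuation
toy_table = {"i": "am", "am": "glad", "glad": "to", "to": "have"}

def toy_predict(text):
    return toy_table.get(text.split()[-1], "")

generated = generate_text("I", 10, toy_predict)
print(generated)
```

With the real model, passing `predict_next_word` as `predict_fn` should work the same way, since it already truncates its input to the last `CONTEXT_LEN` tokens. Greedy argmax decoding tends to fall into repetitive loops, so sampling from the predicted distribution is a common alternative.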
