
This project is "Next Word Prediction".

We will build a model that can complete your sentences. This is the core technology behind features like "Smart Compose" in Gmail or the predictive text on your phone.

We will use a Long Short-Term Memory (LSTM) network, a recurrent architecture designed to carry context across long sequences of text. We will train it on "The Adventures of Sherlock Holmes" so it learns to speak like a 19th-century detective.



Cell 1: Import Libraries & Load Dataset
We will download the book text directly from Project Gutenberg.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# 1. Download the Dataset (Sherlock Holmes)
path = tf.keras.utils.get_file(
    'sherlock_holmes.txt',
    origin='https://www.gutenberg.org/files/1661/1661-0.txt'
)

# 2. Read and lowercase the text (the context manager closes the file)
with open(path, 'r', encoding='utf-8') as f:
    text = f.read().lower()

print(f"✅ Text Loaded. Character count: {len(text)}")
print("--- Sample Text ---")
print(text[3000:3500])

Downloading data from https://www.gutenberg.org/files/1661/1661-0.txt
[1m607504/607504[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2us/step
✅ Text Loaded. Character count: 581425
--- Sample Text ---
gs in baker street, buried among his old
books, and alternating from week to week between cocaine and ambition,
the drowsiness of the drug, and the fierce energy of his own keen
nature. he was still, as ever, deeply attracted by the study of crime,
and occupied his immense faculties and extraordinary powers of
observation in following out those clues, and clearing up those
mysteries which had been abandoned as hopeless by the official police.
from time to time i heard some vague account of his d


Cell 2: Tokenization & Sequence Creation
Deep learning models don't operate on words; they operate on numbers. We use a Tokenizer to assign a unique integer to every word. Then, for each line of text, we build growing prefixes (often called "N-gram sequences") in which the last word is the target, e.g., "the cat" -> "sat", to teach the model what comes next.

In [2]:
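# 1. Tokenize (keep only the most frequent words for speed)
# num_words=2000 caps the indices that texts_to_sequences() emits; rarer
# words become <OOV>. word_index still lists every word seen, so
# total_words below is the full vocabulary size, not 2000.
tokenizer = Tokenizer(num_words=2000, oov_token="<OOV>")
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1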

print(f"Dictionary size: {total_words} words")

# 2. Create Input Sequences
# We slide a window over the text to create training samples
# Example: "The cat sat" -> [The, cat], [The, cat, sat]
input_sequences = []
# We'll just use the first 2000 lines to keep training fast for this demo
# (Remove [:2000] to train on the whole book if you have a GPU)
split_text = text.split('\n')[:2000]

for line in split_text:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

print(f"Total sequences created: {len(input_sequences)}")

Dictionary size: 8923 words
Total sequences created: 15419
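
To make the prefix construction concrete, here is a minimal pure-Python sketch that mimics the loop above on one toy sentence. The `word_index` dictionary and the helper names `to_sequence` and `prefix_sequences` are hypothetical stand-ins written for illustration; they are not the Keras Tokenizer's actual output or API.

```python
# Hypothetical toy vocabulary; index 0 is reserved for padding, 1 for <OOV>.
word_index = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4}

def to_sequence(line):
    """Map each lowercased word to its index, falling back to <OOV>."""
    oov = word_index["<OOV>"]
    return [word_index.get(word, oov) for word in line.lower().split()]

def prefix_sequences(token_list):
    """Return every prefix of length >= 2, as the training loop does."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

tokens = to_sequence("The cat sat down")   # "down" is not in the vocabulary
sequences = prefix_sequences(tokens)

print(tokens)     # [2, 3, 4, 1]
print(sequences)  # [[2, 3], [2, 3, 4], [2, 3, 4, 1]]
```

A line of n tokens yields n - 1 training samples, which is why 2000 short lines produce roughly fifteen thousand sequences.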


Cell 3: Padding & Data Split
Our sequences have different lengths, but the model expects fixed-size inputs. We "pad" the shorter sequences with zeros on the left (padding='pre') to match the longest sequence, so the word we want to predict always sits in the last position.

In [3]:
# 1. Pad Sequences (pad_sequences already returns a NumPy array)
max_sequence_len = max(len(x) for x in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# 2. Split into Features (X) and Label (y)
# X = All words EXCEPT the last one
# y = The LAST word (which we want to predict)
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# 3. One-Hot Encode labels
y = to_categorical(y, num_classes=total_words)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (15419, 19)
y shape: (15419, 8923)
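
The padding and split steps can be sketched in plain Python on three toy sequences. The helper `pad_pre` is a hypothetical name that imitates what `pad_sequences(..., padding='pre')` does with its default left truncation; it is an illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a sequence to exactly maxlen items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

sequences = [[2, 3], [2, 3, 4], [2, 3, 4, 1]]
maxlen = max(len(s) for s in sequences)
padded = [pad_pre(s, maxlen) for s in sequences]

# Features are everything except the last token; the label is the last token.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(padded)  # [[0, 0, 2, 3], [0, 2, 3, 4], [2, 3, 4, 1]]
print(X)       # [[0, 0, 2], [0, 2, 3], [2, 3, 4]]
print(y)       # [3, 4, 1]
```

Padding on the left is what makes the `[:, -1]` slice work: with right padding, the last column would mostly hold zeros instead of the target word.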


Cell 4: Build the LSTM Model
Embedding Layer: Converts word numbers into dense vectors (captures meaning).

LSTM Layer: The "memory" layer that understands the sequence context.

Dense Layer: Outputs a probability score for every possible next word in our dictionary.

In [4]:
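model = Sequential()
# Declare the input shape explicitly; Embedding's `input_length` argument is deprecated in Keras 3
model.add(tf.keras.Input(shape=(max_sequence_len - 1,)))
model.add(Embedding(total_words, 64))  # 64-dimensional word vectors
model.add(LSTM(100))  # 100 hidden units
model.add(Dense(total_words, activation='softmax'))  # One probability per vocabulary index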

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
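
The summary output is not shown here, but the parameter counts can be worked out by hand from the layer sizes. The sketch below assumes the vocabulary size of 8923 printed in Cell 2 and the standard LSTM layout with bias terms (four gates, each with input weights, recurrent weights, and a bias).

```python
vocab_size = 8923   # "Dictionary size" printed in Cell 2
embed_dim = 64      # Embedding output dimension
lstm_units = 100    # LSTM hidden units

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: a weight per (hidden unit, word) pair plus one bias per word.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 571072
print(lstm_params)       # 66000
print(dense_params)      # 901223
print(total_params)      # 1538295
```

Most of the weights sit in the Embedding and Dense layers, both of which scale with the vocabulary size, so shrinking the output vocabulary is the most direct way to make this model smaller.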



Cell 5: Train the Model
We train for 5 epochs. Since we used a small slice of the book (2000 lines), this will be quick.

In [6]:
print("Training model... (This may take 1-2 minutes)")
history = model.fit(X, y, epochs=5, verbose=1)

Training model... (This may take 1-2 minutes)
Epoch 1/5
482/482 ━━━━━━━━━━━━━━━━━━━━ 21s 44ms/step - accuracy: 0.1010 - loss: 5.5016
Epoch 2/5
482/482 ━━━━━━━━━━━━━━━━━━━━ 22s 45ms/step - accuracy: 0.1138 - loss: 5.2956
Epoch 3/5
482/482 ━━━━━━━━━━━━━━━━━━━━ 22s 45ms/step - accuracy: 0.1236 - loss: 5.1560
Epoch 4/5
482/482 ━━━━━━━━━━━━━━━━━━━━ 20s 41ms/step - accuracy: 0.1309 - loss: 5.0275
Epoch 5/5
482/482 ━━━━━━━━━━━━━━━━━━━━ 22s 45ms/step - accuracy: 0.1453 - loss: 4.9087
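
A few lines of arithmetic help read these numbers. The sketch below assumes the Keras default batch size of 32 (the `fit` call passes none) and uses the fact that categorical cross-entropy is measured with natural logarithms, so `exp(loss)` gives the perplexity.

```python
import math

vocab_size = 8923      # output classes, from Cell 2
num_samples = 15419    # training sequences, from Cell 2
batch_size = 32        # Keras default when fit() gets no batch_size
final_loss = 4.9087    # training loss reported for epoch 5

# Batches per epoch: matches the 482/482 counter in the log.
steps_per_epoch = math.ceil(num_samples / batch_size)

# Loss of a model that guesses uniformly over the whole vocabulary.
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely next words.
perplexity = math.exp(final_loss)

print(steps_per_epoch)
print(round(uniform_loss, 2))
print(round(perplexity, 1))
```

The loss is well below the uniform baseline of about 9.1, so the model has learned something, but a perplexity above 100 and a training accuracy under 15% mean its predictions are still very uncertain after 5 epochs.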


Cell 6: Test Prediction (Generate Text)
Now for the fun part! We give it a "seed text" (e.g., "Sherlock Holmes"), and it predicts the next words one by one, feeding each prediction back in as input.

In [7]:
def predict_next_words(seed_text, next_words):
    """Greedily append `next_words` predicted words to `seed_text`."""
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

        # Predict the next word index (argmax over the softmax output)
        probs = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(probs, axis=-1)[0])

        # Convert index back to word; index 0 (padding) has no word
        output_word = tokenizer.index_word.get(predicted, "")

        seed_text += " " + output_word
    return seed_text

# Test it out!
print(predict_next_words("Sherlock Holmes", 10))
print(predict_next_words("The case was", 10))
print(predict_next_words("I am", 5))

Sherlock Holmes <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV>
The case was <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV>
I am <OOV> <OOV> <OOV> <OOV> <OOV>
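
Every generated word is `<OOV>`. A plausible explanation, stated here as an assumption rather than something verified on this run: because `num_words=2000` maps every rarer word to the single `<OOV>` index, that index is probably the most frequent label in the training data, and a model trained for only 5 epochs falls back on it. One common workaround is to exclude the padding and `<OOV>` indices before taking the argmax. The helper `pick_next_index` below is a hypothetical name, shown on a hand-written probability list rather than real model output.

```python
def pick_next_index(probs, banned=(0, 1)):
    """Return the index with the highest probability, skipping banned indices.

    Indices 0 (padding) and 1 (<OOV>) are banned by default, matching the
    Keras Tokenizer convention when oov_token is set.
    """
    candidates = [(p, i) for i, p in enumerate(probs) if i not in banned]
    return max(candidates)[1]

# Toy softmax output over a 5-word vocabulary; <OOV> (index 1) dominates.
probs = [0.01, 0.60, 0.05, 0.30, 0.04]

print(pick_next_index(probs))             # 3
print(pick_next_index(probs, banned=()))  # 1
```

Inside `predict_next_words`, this would stand in for the `np.argmax` call, applied to the first row of `model.predict(...)`. It only hides the symptom, though: training for more epochs or on more of the book is what would actually improve the predictions.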
