# Next Word Prediction (NWP)

**The Next Word Prediction (NWP) project is focused on building a deep learning model that predicts the most likely next word in a sequence of text. This is achieved by training a Long Short-Term Memory (LSTM) network on a corpus of text data, where the model learns patterns, context, and relationships between words. The workflow involves tokenizing the text, creating input-output sequences, padding sequences to a uniform length, and converting labels to one-hot vectors.**

**The model uses an embedding layer to convert words into dense vectors, followed by stacked LSTM layers to capture temporal dependencies. A dense output layer with softmax activation predicts the next word from the vocabulary. The trained model can generate meaningful text by predicting words iteratively, demonstrating an understanding of linguistic context. This project is applicable to text generation, autocomplete features, and conversational AI systems.**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Import libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

**Load the dataset**

In [2]:
df=pd.read_csv("medium_data.csv")
df.head()

Unnamed: 0,id,url,title,subtitle,image,claps,responses,reading_time,publication,date
0,1,https://towardsdatascience.com/a-beginners-gui...,A Beginner’s Guide to Word Embedding with Gens...,,1.png,850,8,8,Towards Data Science,2019-05-30
1,2,https://towardsdatascience.com/hands-on-graph-...,Hands-on Graph Neural Networks with PyTorch & ...,,2.png,1100,11,9,Towards Data Science,2019-05-30
2,3,https://towardsdatascience.com/how-to-use-ggpl...,How to Use ggplot2 in Python,A Grammar of Graphics for Python,3.png,767,1,5,Towards Data Science,2019-05-30
3,4,https://towardsdatascience.com/databricks-how-...,Databricks: How to Save Files in CSV on Your L...,When I work on Python projects dealing…,4.jpeg,354,0,4,Towards Data Science,2019-05-30
4,5,https://towardsdatascience.com/a-step-by-step-...,A Step-by-Step Implementation of Gradient Desc...,One example of building neural…,5.jpeg,211,3,4,Towards Data Science,2019-05-30


* For next word prediction, the model must learn from running text (sentences, paragraphs).
* That is why we keep a single text column, the article titles; feeding non-text columns such as claps or dates to the tokenizer would only add noise to the vocabulary.

In [3]:
titles=df['title']

* Convert all titles to strings so that numeric or missing entries do not break tokenization. Note that astype('str') turns a missing value (NaN) into the literal string 'nan'.
* Then turn the pandas Series into a Python list for easier processing.

In [4]:
texts=titles.astype('str').to_list()
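
The raw titles still contain non-breaking spaces (`\xa0`) and, in some rows, leftover HTML such as `<em class=...>`. The Keras tokenizer splits on ordinary spaces by default, which is a likely reason entries such as 'markup', 'h3', or 'the\xa0paper' show up in the vocabulary further down. A minimal cleanup sketch, assuming a hypothetical helper `clean_title` that is not part of the original notebook:

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches anything that looks like an HTML tag


def clean_title(title: str) -> str:
    """Strip HTML tags, decode entities, and normalize whitespace."""
    without_tags = TAG_RE.sub(" ", title)
    decoded = html.unescape(without_tags)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including \xa0 (non-breaking space) and \u200a (hair space).
    return " ".join(decoded.split())


# Hypothetical usage on the list built above:
# texts = [clean_title(t) for t in texts]
sample = 'How to Use ggplot2 in\xa0Python'
print(clean_title(sample))
```

Applying such a step before fitting the tokenizer would change the vocabulary and therefore every index shown in the outputs of this notebook, so it is offered only as an optional variation.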

In [5]:
texts


['A Beginner’s Guide to Word Embedding with Gensim Word2Vec\xa0Model',
 'Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric',
 'How to Use ggplot2 in\xa0Python',
 'Databricks: How to Save Files in CSV on Your Local\xa0Computer',
 'A Step-by-Step Implementation of Gradient Descent and Backpropagation',
 'An Easy Introduction to SQL for Data Scientists',
 'Hypothesis testing visualized',
 'Introduction to Latent Matrix Factorization Recommender Systems',
 'Which 2020 Candidate is the Best at\xa0Twitter?',
 'What if AI model understanding were\xa0easy?',
 '<em class="markup--em markup--h3-em">What I Learned from (Two-time) Kaggle Grandmaster Abhishek\xa0Thakur</em>',
 'Making a DotA2 Bot Using\xa0ML',
 'Building A ‘Serverless’ Chrome Extension',
 'How to Teach\xa0Code',
 'Reinventing Personalization For Customer Experience',
 'How to Automate Hyperparameter Optimization',
 'Ideas: Design Methodologies for Data\xa0Sprints',
 'RoboSomm Chapter 3: Wine Embeddings and a Wine Reco

* Initialize a Keras tokenizer to build a vocabulary from the text data.
* Fit the tokenizer on our text list so it learns all unique words and assigns each a number.
* Convert each text into a sequence of integers representing the words.

In [6]:
tokenizer=tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
seq= tokenizer.texts_to_sequences(texts)
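
To make the idea behind the tokenizer concrete, here is a simplified pure-Python illustration of frequency-ranked indexing. It is not the Keras implementation: the helper names `build_word_index` and `encode` are hypothetical, and punctuation filtering is left out.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices 1, 2, 3, ... to words in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; index 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index):
    """Map each known word to its index; unknown words are skipped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


toy_texts = ["how to use python", "how to teach code"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(encode("how to teach python", toy_index))
```

Frequent words receive small indices, which matches the pattern in the real vocabulary where 'to' and 'the' sit at the top.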

* Display the first 10 tokenized sequences.
* Useful for quickly checking how the text was converted to integers.
* Helps verify that the tokenizer has correctly mapped words to indices.

In [7]:
seq[:10]

[[4, 565, 60, 1, 434, 1309, 14, 3507, 3508],
 [3509, 21, 782, 111, 157, 14, 477, 477, 1650],
 [5, 1, 62, 3510, 192],
 [3511, 5, 1, 231, 1073, 10, 2216, 21, 9, 3512],
 [4, 169, 63, 169, 398, 6, 3513, 2217, 7, 1310],
 [23, 241, 100, 1, 399, 11, 18, 265],
 [1311, 293, 3514],
 [100, 1, 3515, 783, 2218, 627, 294],
 [346, 435, 1651, 13, 2, 71, 3516],
 [20, 68, 46, 108, 131, 3517]]

* Access the dictionary mapping each word in the texts to a unique integer index.
* Shows how the tokenizer has encoded the vocabulary.
* Useful for understanding or debugging the tokenization process.

In [8]:
tokenizer.word_index

{'to': 1,
 'the': 2,
 'strong': 3,
 'a': 4,
 'how': 5,
 'of': 6,
 'and': 7,
 'markup': 8,
 'your': 9,
 'in': 10,
 'for': 11,
 'you': 12,
 'is': 13,
 'with': 14,
 'class': 15,
 'h3': 16,
 'why': 17,
 'data': 18,
 'i': 19,
 'what': 20,
 'on': 21,
 'from': 22,
 'an': 23,
 'learning': 24,
 'can': 25,
 'are': 26,
 'my': 27,
 'be': 28,
 'using': 29,
 'do': 30,
 'ux': 31,
 'design': 32,
 'not': 33,
 'when': 34,
 'writing': 35,
 'that': 36,
 'we': 37,
 'about': 38,
 '5': 39,
 'machine': 40,
 'make': 41,
 'it': 42,
 'should': 43,
 'as': 44,
 'need': 45,
 'ai': 46,
 '3': 47,
 'more': 48,
 'don’t': 49,
 'life': 50,
 'marketing': 51,
 'or': 52,
 'will': 53,
 'have': 54,
 'ways': 55,
 'get': 56,
 'time': 57,
 'at': 58,
 'up': 59,
 'guide': 60,
 'science': 61,
 'use': 62,
 'by': 63,
 'write': 64,
 'business': 65,
 'new': 66,
 'python': 67,
 'if': 68,
 'deep': 69,
 'self': 70,
 'best': 71,
 'first': 72,
 'into': 73,
 'top': 74,
 'tips': 75,
 'things': 76,
 'stop': 77,
 'analysis': 78,
 'intelligence'

In [None]:
X = []  # List to store input sequences
y = []  # List to store corresponding next-word labels
total_words_dropped = 0  # Counter for sequences too short to use

for i in seq:
    if len(i) > 1:
        for index in range(1, len(i)):
            X.append(i[:index])  # Add sub-sequence up to current index as input
            y.append(i[index])   # Add the next word as the label
    else:
        total_words_dropped += 1  # Increment counter if sequence has only one word

print("Total Single Words Dropped are:", total_words_dropped)  # Show number of dropped sequences


Total Single Words Dropped are: 14


In [None]:
X[:10]  # Display the first 10 input sequences (sub-sequences) to verify the data preparation.

[[4],
 [4, 565],
 [4, 565, 60],
 [4, 565, 60, 1],
 [4, 565, 60, 1, 434],
 [4, 565, 60, 1, 434, 1309],
 [4, 565, 60, 1, 434, 1309, 14],
 [4, 565, 60, 1, 434, 1309, 14, 3507],
 [3509],
 [3509, 21]]

In [None]:
y[:10] # Display the first 10 target words corresponding to the input sequences in X.

[565, 60, 1, 434, 1309, 14, 3507, 3508, 21, 782]

* Pad all input sequences in X to the same length for uniformity.
* Shorter sequences are padded with zeros at the beginning by default.
* Necessary for feeding the data into a neural network.

In [22]:
X = tf.keras.preprocessing.sequence.pad_sequences(X)
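
As a rough picture of what pre-padding does, the sketch below left-pads plain Python lists with zeros. The function `pad_pre` is a hypothetical stand-in written for illustration; the notebook itself relies on the Keras utility.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) lists of ints to one length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = s[-maxlen:]  # keep the most recent tokens if the sequence is too long
        padded.append([value] * (maxlen - len(s)) + list(s))
    return padded


toy = [[4], [4, 565], [4, 565, 60]]
for row in pad_pre(toy):
    print(row)
```

Padding on the left keeps the most recent words at the end of every row, right next to the position whose successor the model has to predict.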

In [23]:
X

array([[  0,   0,   0, ...,   0,   0,   4],
       [  0,   0,   0, ...,   0,   4, 565],
       [  0,   0,   0, ...,   4, 565,  60],
       ...,
       [  0,   0,   0, ...,   1,  64,   4],
       [  0,   0,   0, ...,  64,   4, 104],
       [  0,   0,   0, ...,   4, 104,  65]], dtype=int32)

In [24]:
X.shape

(43439, 37)

* Convert the target labels y into one-hot encoded vectors.
* This is required for categorical prediction with a softmax output layer.

In [25]:
y = tf.keras.utils.to_categorical(y)

In [26]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [27]:
y.shape

(43439, 10970)
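
A dense one-hot matrix of this shape is large. The sketch below estimates its size for two possible element types (which one applies depends on the Keras version, so both are shown) and checks on a toy example that cross-entropy against a one-hot target equals the negative log-probability of the true class. That equivalence is why integer labels combined with Keras's `sparse_categorical_crossentropy` loss are a possible lower-memory alternative; the notebook does not use that option.

```python
import math

n_samples, n_classes = 43439, 10970   # shape of y reported above
n_elements = n_samples * n_classes

for dtype_name, bytes_per_value in [("float32", 4), ("float64", 8)]:
    gib = n_elements * bytes_per_value / 1024**3
    print(f"{dtype_name}: {gib:.2f} GiB")

# Cross-entropy against a one-hot target reduces to -log(p[label])
probs = [0.1, 0.7, 0.2]   # toy predicted distribution over 3 classes
label = 1                 # toy integer label
one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]
loss_one_hot = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
loss_sparse = -math.log(probs[label])
print(loss_one_hot, loss_sparse)
```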

* Calculate the vocabulary size (total unique words) for the embedding layer.
* Add 1 to account for the padding token (index 0).

In [28]:
vocab_size = len(tokenizer.word_index) + 1

In [29]:
vocab_size

10970

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 14),           # Map words to 14-dimensional vectors
    tf.keras.layers.LSTM(100, return_sequences=True),    # First LSTM layer, returns full sequences
    tf.keras.layers.LSTM(100),                           # Second LSTM layer, returns final state
    tf.keras.layers.Dense(100, activation='relu'),       # Dense layer with 100 neurons and ReLU activation
    tf.keras.layers.Dense(vocab_size, activation='softmax'),  # Output layer predicting next word
])

* Display a summary of the model architecture.
* Shows each layer, output shapes, and the number of trainable parameters.

In [34]:
model.summary()
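
The summary output is not reproduced here, but the parameter counts can be derived by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical and the vocabulary size is copied from the value printed above.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary index
    return vocab * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


vocab_size_assumed = 10970  # value of vocab_size reported above
layers = {
    "embedding": embedding_params(vocab_size_assumed, 14),
    "lstm_1": lstm_params(14, 100),
    "lstm_2": lstm_params(100, 100),
    "dense": dense_params(100, 100),
    "output": dense_params(100, vocab_size_assumed),
}
for name, count in layers.items():
    print(f"{name:10s} {count:>10,}")
print(f"{'total':10s} {sum(layers.values()):>10,}")
```

Most of the parameters sit in the output layer, because it needs one weight vector per vocabulary entry.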

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # Use Adam optimizer with learning rate 0.001
    loss='categorical_crossentropy',                            # Categorical crossentropy for multi-class prediction
    metrics=['accuracy']                                        # Track accuracy during training
)

* Train the model on the prepared data.
* X = input sequences, y = one-hot encoded next-word labels.
* Train for 250 epochs to let the model learn sequence patterns.
* Training on CPU is slow, especially for 250 epochs with LSTM layers.
* Google Colab can provide a GPU, but the run logged below still takes roughly 14 to 21 seconds per epoch, so 250 epochs come to about an hour or more.
* To shorten training, consider fewer epochs, a smaller maximum sequence length, or a GPU/TPU runtime.
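
Two quick expectations can be worked out before training, as a sanity check on the log. The sketch assumes the default batch size of 32 that Keras uses when `model.fit` is called without `batch_size`, and copies the sample and class counts from the shapes printed earlier.

```python
import math

n_samples = 43439     # rows in X
n_classes = 10970     # vocab_size
batch_size = 32       # Keras default for model.fit when none is given

steps_per_epoch = math.ceil(n_samples / batch_size)
uniform_loss = math.log(n_classes)   # cross-entropy of a uniform guess

print("steps per epoch:", steps_per_epoch)
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

The step count should agree with the 1358 steps per epoch reported in the training log, and a loss well below the uniform-guess value (about 9.3) after the first epoch indicates the model is already doing better than chance.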


In [33]:
model.fit(X, y, epochs=250)

Epoch 1/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 20s 10ms/step - accuracy: 0.0401 - loss: 7.7623
Epoch 2/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 14s 10ms/step - accuracy: 0.0802 - loss: 7.0263
Epoch 3/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 20s 10ms/step - accuracy: 0.1084 - loss: 6.7489
Epoch 4/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 20s 10ms/step - accuracy: 0.1197 - loss: 6.5253
Epoch 5/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 21s 10ms/step - accuracy: 0.1253 - loss: 6.3308
Epoch 6/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 14s 10ms/step - accuracy: 0.1300 - loss: 6.1723
Epoch 7/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 21s 10ms/step - accuracy: 0.1339 - loss: 6.0174
Epoch 8/250
1358/1358 ━━━━━━━━━━━━━━━━━━━━ 20s 10ms/step - accuracy: 0.1402 - loss: 5.8531


<keras.src.callbacks.history.History at 0x7f3be91f8aa0>

* Save the trained model to a file named 'nwp.h5'.
* This allows reloading the model later without retraining.

In [35]:
model.save('nwp.h5')



In [None]:
import os

# Print the current working directory
# Useful to check where files (like 'nwp.h5') will be saved or loaded from

print("Current working directory:", os.getcwd())

Current working directory: /content


In [None]:
# Create a NumPy array of all words in the tokenizer's vocabulary.
# Converts the dictionary keys (words) into an array for easy indexing or sampling.

vocab_array = np.array(list(tokenizer.word_index.keys()))

In [38]:
vocab_array

array(['to', 'the', 'strong', ..., 'hits', 'the\xa0paper', 'blog\xa0post'],
      dtype='<U28')
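
Because the tokenizer's indices start at 1 while array positions start at 0, a predicted index `i` corresponds to position `i - 1` in this array. The toy sketch below shows that offset next to an explicit reverse mapping, which avoids the arithmetic; the Keras tokenizer also exposes such a mapping as `tokenizer.index_word`. All `toy_` names are made up for the example.

```python
toy_word_index = {"to": 1, "the": 2, "strong": 3}  # same layout as tokenizer.word_index

# List of words in insertion order: position 0 holds the word with index 1
toy_vocab = list(toy_word_index.keys())

# Explicit reverse mapping: no off-by-one arithmetic needed
toy_index_word = {index: word for word, index in toy_word_index.items()}

predicted_index = 2
print(toy_vocab[predicted_index - 1], toy_index_word[predicted_index])

# Caveat: index 0 (padding) wraps around to the LAST word with the list approach
print(toy_vocab[0 - 1])
```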

In [None]:
def make_prediction(text, n_words):
    # Predict the next 'n_words' words, one at a time, starting from the seed 'text'
    for i in range(n_words):
        # Convert the current text to a sequence of integer word indices
        text_tokenize = tokenizer.texts_to_sequences([text])

        # Left-pad (or truncate) the sequence to 14 tokens.
        # Note: the training inputs were padded to length 37 (X.shape[1]),
        # so maxlen=X.shape[1] would match the training setup more closely.
        text_padded = tf.keras.preprocessing.sequence.pad_sequences(text_tokenize, maxlen=14)

        # The model outputs a probability for every vocabulary index; take the argmax
        prediction = np.squeeze(np.argmax(model.predict(text_padded), axis=-1))

        # Map the predicted index back to its word
        # (word_index starts at 1, vocab_array positions start at 0)
        prediction = str(vocab_array[prediction - 1])

        # Debug print: words ordered from least to most probable. argsort is ascending,
        # so [:-3] drops the three MOST likely words; [-3:] would show the top three.
        print(vocab_array[np.argsort(model.predict(text_padded)) - 1].ravel()[:-3])

        # Append the predicted word to the text for the next iteration
        text += " " + prediction

    return text  # Seed text followed by the n_words predicted words
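
The debug print in `make_prediction` relies on an ascending `argsort`, so the most probable words end up at the end of the array. A small standalone sketch of top-k decoding is given below; `top_k_words` and the toy inputs are hypothetical and only illustrate the indexing, they are not taken from the notebook.

```python
import numpy as np


def top_k_words(probs, index_to_word, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    probs = np.asarray(probs, dtype=float).ravel()
    best = np.argsort(probs)[::-1][:k]   # argsort is ascending, so reverse it
    return [(index_to_word.get(int(i), "<pad>"), float(probs[i])) for i in best]


toy_index_to_word = {1: "to", 2: "the", 3: "a"}   # index 0 is padding
toy_probs = [0.05, 0.10, 0.60, 0.25]
for word, p in top_k_words(toy_probs, toy_index_to_word):
    print(f"{word}: {p:.2f}")
```

With the real model, `probs` would be the output of a single `model.predict` call and `index_to_word` could be `tokenizer.index_word`, which also avoids running the prediction twice per step.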


In [None]:
# Generate 10 new words following the seed word 'child' using the trained model.
# The function appends each predicted word to the input text iteratively.

make_prediction('child',10)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
['more\xa0ram' 'and\xa0cats' 'dogs' ... 'designs\xa0online'
 'in\xa0evidence' 'david']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
['using\xa0it' 'of\xa0voice' 'black\xa0moms' ... 'rasa' 'more' 'an']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
['special' '\xa0for\xa0the\xa0modern' 'his\xa0head' ... 'management' 'new'
 'or']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
['a\xa0mammoth' 'tech\xa0debt' 'women’s' ... 'application' 'development'
 'action']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━

'child your sweary routine approach i work impact span will is'

In [41]:
make_prediction("hello",10)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
['nlp\xa0' 'first\xa0time' '‘faq’' ... 'who' 'negotiating' 'application']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
['months\u200a—\u200aand' '‘social' 'doesn’t\xa0work' ... 'how'
 'in\xa02019' 'your']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step
['average”' 'year\xa0yet' 'cloud\u200a—\u200ait’s\xa0time' ...
 'artificial' 'bank' 'simple']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 59ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
['natalie\xa0goldberg' 'entropy' 'who’s' ... 'time' 'how' 'programming']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 53ms/step
1/1 ━━━━━━━━

'hello tips a new year’s page their\xa0homework strong it\xa0anyway the near\xa0you'

In [42]:
make_prediction("Bye",5)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
['8pt' 'so\xa0annoying' 'realized' ... 'go' 'you' 'and']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
['gravity\u200a—\u200aa' 'europe' 'writing\xa0stink' ... 'will' 'all'
 'like\xa0it']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
['the\xa0creek' 'scammers' 'masquerading' ... 'how' 'the' 'happy']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
['science\u200a—\u200aanscombe’s' 'the\xa02000s' 'films' ... 'build'
 'possessions' 'save\xa0money']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━

'Bye we don’t a basic\xa0income strong'

In [43]:
make_prediction("hey",5)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
['hexagonal' 'think\xa0big' 'flourishes' ... 'let’s' 'the' 'do']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
['combined' 'alerting\xa0bot' 'cloud\xa0costs' ... 'design' 'an' 'the']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
['vaping' 'thank\xa0you' 'design\xa0story' ... 'months' 'the' 'to']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
['logging' '“consultant”' 'the\xa0creek' ... 'big' 'ux' 'copy']
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
['confess'

'hey it not our key action'

In [None]:
import pickle

# Save the fitted tokenizer so the same word-to-index mapping can be reused at inference time
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer,f)