### **NEXT WORD PREDICTION USING DEEP LEARNING**

---


### **Project By: Alaissa Shaikh**
### **Data Scientist**

---



This project applies deep learning to next word prediction. An LSTM network, an architecture designed for sequential data, models the relationships between words in a text sequence. The model is trained on a plain-text collection of Sherlock Holmes stories and is then used to extend a user-supplied seed phrase one word at a time. Over the first eight training epochs shown in the log, training accuracy rises from about 6% to about 26%. The project illustrates how a small recurrent model can serve as a starting point for language-based applications.

---



In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

These libraries provide the tools for data preprocessing (`Tokenizer`, `pad_sequences`), model construction (`Sequential`), and layer definitions (`Embedding`, `LSTM`, `Dense`).

---



In [2]:
# Step 1: Read the dataset
with open('/content/sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [3]:
# Step 2: Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

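Fitting the tokenizer builds `word_index`, a mapping from each distinct word to a positive integer. `total_words` adds 1 because index 0 is reserved and never assigned to a word (it is used for padding later). This mapping converts the text into numerical sequences, and `total_words` sets the vocabulary size for the Embedding layer and the output layer.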
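To make the indexing concrete, here is a minimal plain-Python sketch of the idea. It is not the Keras implementation: the helper `build_word_index` is hypothetical, it only lowercases and splits on whitespace, and it assumes words are ranked by frequency with ties kept in order of first appearance.

```python
from collections import Counter

def build_word_index(text):
    """Assign 1-based indices to words, most frequent first (illustrative only)."""
    counts = Counter(text.lower().split())
    # sorted() is stable, so equally frequent words keep first-appearance order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index("the cat sat on the mat")
total_words = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index)
print(total_words)
```

In this toy sentence "the" occurs twice, so it receives index 1, and the five distinct words give a vocabulary size of 6 once the reserved index is counted.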

---



In [4]:
# Step 3: Create input-output sequences
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

This cell iterates over each line of the text and, for every line, emits all prefixes that contain at least two tokens. In each prefix the last token is the word to predict and the tokens before it are the context.
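A small sketch of the same prefix expansion on a single tokenized line may help. The helper name `prefix_sequences` and the token values are made up for illustration.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Hypothetical token ids for one four-word line
sequences = prefix_sequences([4, 7, 2, 9])
print(sequences)
```

A line with four tokens yields three training sequences, and a line with a single token yields none, because there is no context to predict from.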

---



In [5]:
# Step 4: Pad the sequences
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

Pads every sequence to the length of the longest one by adding zeros at the beginning (`padding='pre'`). Neural networks are trained on batches of fixed-shape tensors, so all inputs must have the same length. Padding at the front also keeps the word to predict in the last position of every sequence.
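The effect of pre-padding can be sketched in plain Python. The function `pre_pad` below is a hypothetical stand-in, not `pad_sequences` itself; it assumes that sequences longer than `maxlen` keep their last `maxlen` tokens, which matches the default truncation behaviour of `pad_sequences`.

```python
def pre_pad(sequences, maxlen=None):
    """Left-pad each sequence with zeros to a common length (illustrative only)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

padded = pre_pad([[4, 7], [4, 7, 2], [4, 7, 2, 9]])
print(padded)
```

Every row now has the same length, and the final column always holds a real token.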

---



In [6]:
# Step 5: Split data into features (X) and labels (y)
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

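The last token of each padded sequence becomes the label `y`, and the tokens before it form the features `X`. The model learns to predict the next word (label) given the preceding words (features).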
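On a toy padded matrix (values invented for illustration), the same split looks like this when written with plain lists instead of NumPy slicing:

```python
# Hypothetical padded sequences; the last column is the word to predict
padded = [
    [0, 0, 4, 7],
    [0, 4, 7, 2],
    [4, 7, 2, 9],
]

X = [row[:-1] for row in padded]  # everything except the last token
y = [row[-1] for row in padded]   # the last token of each row

print(X)
print(y)
```

Each row of `X` is one token shorter than the padded sequence, which is why the model input length is `max_sequence_len - 1`.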

---



In [7]:
# Convert labels to one-hot encoding
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

Converts each integer label into a one-hot vector of length `total_words`, which is the target format expected by the `categorical_crossentropy` loss used when the model is compiled. Each target then has the same shape as the model's softmax output, a probability distribution over the vocabulary.
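The following sketch shows what a one-hot target does inside the cross-entropy loss. The helpers `one_hot` and `categorical_crossentropy` are simplified, hypothetical versions written for illustration, and the probabilities are invented.

```python
import math

def one_hot(label, num_classes):
    """Vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def categorical_crossentropy(target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

probs = [0.1, 0.2, 0.6, 0.1]       # made-up softmax output over 4 words
target = one_hot(2, num_classes=4)  # the true next word has index 2
loss = categorical_crossentropy(target, probs)

print(target)
print(loss)
```

Because only one entry of the target is non-zero, the loss reduces to the negative log of the probability assigned to the correct word. For large vocabularies the one-hot matrix can be memory-hungry; Keras also offers a `sparse_categorical_crossentropy` loss that accepts integer labels directly, though this notebook does not use it.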

---



In [11]:
# Step 6: Build the model
model = Sequential()
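model.add(Embedding(total_words, 100))  # Map each word index to a 100-dimensional vector
model.add(LSTM(150))  # 150 hidden units summarize the input sequence
model.add(Dense(total_words, activation='softmax'))  # One probability per vocabulary word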

Defines a simple but effective neural network architecture for the next word prediction task. The model takes input sequences of words, embeds them into dense vectors, processes them with an LSTM layer to capture sequential information, and finally outputs a probability distribution over the vocabulary, indicating the likelihood of each word being the next word in the sequence.

---



In [12]:
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [15]:
# Build the model by declaring its input shape so that summary() can report parameter counts
model.build(input_shape=(None, max_sequence_len - 1))

# Display model summary
model.summary()

*  Total Parameters: The model has a total of 2,208,800 trainable parameters. Most of them sit in the Embedding and Dense layers, whose sizes grow with the vocabulary; the LSTM layer's size depends only on the embedding dimension and the number of hidden units.
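The reported total can be checked by hand from the layer definitions. The sketch below uses the standard parameter formulas for these layer types with biases enabled. The vocabulary size of 8,200 is an assumption: the notebook never prints `total_words`, and 8,200 is simply the value for which the formulas reproduce the reported total.

```python
def count_parameters(vocab_size, embed_dim=100, lstm_units=150):
    """Parameter counts for Embedding -> LSTM -> Dense(softmax), biases included."""
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    dense = lstm_units * vocab_size + vocab_size
    return embedding, lstm, dense

# vocab_size=8200 is inferred from the reported total, not stated in the notebook
embedding, lstm, dense = count_parameters(vocab_size=8200)
total = embedding + lstm + dense

print(embedding, lstm, dense, total)
```

Under that assumption the LSTM accounts for 150,600 parameters, while the two vocabulary-dependent layers hold the remaining 2,058,200.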


---


*  In summary: This model uses an embedding layer to represent words as vectors, an LSTM layer to capture sequential information, and a dense softmax layer to generate a probability distribution over the vocabulary.

---



In [16]:
# Step 7: Train the model
model.fit(X, y, epochs=20, verbose=1)  # Reduce epochs for quicker training during testing

Epoch 1/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 25s 7ms/step - accuracy: 0.0616 - loss: 6.5603
Epoch 2/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 39s 7ms/step - accuracy: 0.1177 - loss: 5.5789
Epoch 3/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 39s 7ms/step - accuracy: 0.1449 - loss: 5.1349
Epoch 4/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 21s 7ms/step - accuracy: 0.1650 - loss: 4.7807
Epoch 5/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 40s 7ms/step - accuracy: 0.1829 - loss: 4.4613
Epoch 6/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 21s 7ms/step - accuracy: 0.2056 - loss: 4.1656
Epoch 7/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 21s 7ms/step - accuracy: 0.2319 - loss: 3.8865
Epoch 8/20
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 40s 7ms/step - accuracy: 0.2643 - loss: 3.6165
Epoch 9/20

<keras.src.callbacks.history.History at 0x7da947541b90>

In [17]:
# Function to generate predictions
def predict_next_words(seed_text, next_words, max_sequence_len, model, tokenizer):
    for _ in range(next_words):
        # Encode the current text and pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Greedy choice: index of the most probable next word
        predicted = np.argmax(model.predict(token_list), axis=-1)[0]
        # Map the predicted index back to its word
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text

This function takes a seed text, the number of words to generate, the training sequence length, the trained model, and the fitted tokenizer. On each iteration it tokenizes and pads the current text, picks the most probable next word (a greedy `argmax`), appends that word to the text, and repeats.
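The generation loop can be exercised without a trained network. In the sketch below, `toy_model`, `pre_pad`, `generate`, and the three-word vocabulary are all hypothetical stand-ins: the toy model always puts full probability on the word whose index follows the last input token, so the output is predictable. The sketch also builds a reverse dictionary once instead of scanning `word_index` on every step.

```python
def pre_pad(tokens, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros."""
    tokens = tokens[-maxlen:]
    return [0] * (maxlen - len(tokens)) + tokens

def generate(seed_text, next_words, maxlen, predict_probs, word_index):
    """Greedy generation: repeatedly append the most probable next word."""
    index_word = {index: word for word, index in word_index.items()}
    for _ in range(next_words):
        tokens = [word_index[w] for w in seed_text.lower().split() if w in word_index]
        probs = predict_probs(pre_pad(tokens, maxlen))
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = index_word.get(best, "")
        if not word:  # index 0 is padding, so there is nothing to append
            break
        seed_text += " " + word
    return seed_text

word_index = {"the": 1, "dog": 2, "barks": 3}  # made-up vocabulary

def toy_model(padded):
    """Stand-in for model.predict: favours the index after the last token."""
    next_index = padded[-1] % 3 + 1
    probs = [0.0] * 4
    probs[next_index] = 1.0
    return probs

result = generate("the", 4, 5, toy_model, word_index)
print(result)
```

Greedy decoding always takes the single most likely word, which keeps the loop simple but tends to produce repetitive text; sampling from the predicted distribution is a common alternative.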



---



In [22]:
# User input for seed text and number of words
seed_text = input("Enter a seed text: ")
next_words = int(input("Enter the number of words to predict: "))

Enter a seed text: sherlock holmes felt fine
Enter the number of words to predict: 10


In [23]:
# Generate predictions
output_text = predict_next_words(seed_text, next_words, max_sequence_len, model, tokenizer)
print(f"Generated Text: {output_text}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
Generated Text: sherlock holmes felt fine that i had a wild free of the victim back


Through this project, I gained valuable experience in implementing deep learning models for natural language processing tasks. I learned the importance of data preprocessing, model architecture design, and hyperparameter tuning. I also gained insights into the challenges and limitations of current language models. This project has provided a strong foundation for further exploration in the field of natural language processing and has inspired me to continue exploring the exciting possibilities of deep learning.

---




