# Next Word Prediction Using GRU

### Project Overview:

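This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The title refers to Gated Recurrent Unit (GRU) networks, but the code in this notebook builds the model with a Long Short-Term Memory (LSTM) layer; both are gated recurrent layers that are well-suited for sequence prediction tasks.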

The project includes the following steps:

1- Data Collection: We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, an LSTM layer, and a dense output layer with a softmax activation function to predict the probability of the next word.

4- Model Training: The model is trained using the prepared sequences.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import tensorflow
from tensorflow import keras

## Data Collection

In [2]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

data=gutenberg.raw('shakespeare-hamlet.txt')

# Save data as a file:
with open ('hamlet.txt','w') as file:
    file.write(data)

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


## Data Preprocessing

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Load Data

In [4]:
with open ('hamlet.txt','r') as file:
    data=file.read().lower()

### Tokenization

In [5]:
tokenizer=Tokenizer()
tokenizer.fit_on_texts([data])

# The tokenizer creates an internal word index (a dictionary) where each unique word is assigned an integer index.
# Indices are assigned by frequency: the most frequent word gets index 1, the next most frequent gets index 2, and so on.
# Suppose, texts = ["a cat sat on the mat", "the cat is big", "a big cat"]
# then, after tokenizer.fit_on_texts(texts), tokenizer.word_index = {'cat': 1, 'a': 2, 'the': 3, 'big': 4, 'sat': 5, 'on': 6, 'mat': 7, 'is': 8}
# Here, "cat" appears 3 times; "a", "the" and "big" appear 2 times each; "sat", "on", "mat" and "is" appear once each.

total_words=len(tokenizer.word_index)+1
print(f'Total Words: {total_words}') # Vocabulary size: number of unique words + 1, because index 0 is reserved for padding.

Total Words: 4818
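
The frequency-based indexing described in the comments above can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: `build_word_index` is a hypothetical helper, and it splits on whitespace only, whereas the real `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Assign 1-based indices by descending frequency; ties keep first-seen order."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable and Counter keeps insertion order, so equally
    # frequent words stay in the order in which they were first seen.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

texts = ["a cat sat on the mat", "the cat is big", "a big cat"]
word_index = build_word_index(texts)
print(word_index)
# {'cat': 1, 'a': 2, 'the': 3, 'big': 4, 'sat': 5, 'on': 6, 'mat': 7, 'is': 8}

vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
print(vocab_size)  # 9
```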


### Indices for each word in data

In [6]:
# print(tokenizer.word_index)
print(list(tokenizer.word_index.items())[:20])

[('the', 1), ('and', 2), ('to', 3), ('of', 4), ('i', 5), ('you', 6), ('a', 7), ('my', 8), ('it', 9), ('in', 10), ('that', 11), ('ham', 12), ('is', 13), ('not', 14), ('his', 15), ('this', 16), ('with', 17), ('your', 18), ('but', 19), ('for', 20)]


### Create Input sequence of data

In [7]:
input_sequence = []
for line in data.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequence.append(n_gram_sequence)
input_sequence[:10]

[[1, 687],
 [1, 687, 4],
 [1, 687, 4, 45],
 [1, 687, 4, 45, 41],
 [1, 687, 4, 45, 41, 1886],
 [1, 687, 4, 45, 41, 1886, 1887],
 [1, 687, 4, 45, 41, 1886, 1887, 1888],
 [1180, 1889],
 [1180, 1889, 1890],
 [1180, 1889, 1890, 1891]]

In [8]:
len(input_sequence)

25732

In [9]:
'''
tokenizer.texts_to_sequences(["a cat sat on the mat"])
This will give output: [[2, 1, 5, 6, 3, 7]]
This is because: {'cat': 1, 'a': 2, 'the': 3, 'big': 4, 'sat': 5, 'on': 6, 'mat': 7, 'is': 8}
This is how we create the input sequences for the entire data, line by line.

Example: If a line in data is "I love coding", and the tokenizer converts it to [4, 7, 15], token_list will hold this list of word indices.

for i in range(1, len(token_list)):
This loop runs i from 1 to len(token_list) - 1. It starts at 1 because every sequence needs at least two tokens: one or more context words plus the word to predict.
For example, if token_list is [4, 7, 15], i will take the values 1 and 2 (so it will construct n-grams of length 2 and 3).

n_gram_sequence = token_list[:i+1]:
Creates an n-gram by slicing token_list from the beginning up to and including index i.
This builds n-gram sequences incrementally, starting from a 2-gram and growing by one token with each iteration.

Example:
If token_list = [4, 7, 15], then:
When i = 1: n_gram_sequence = [4, 7]
When i = 2: n_gram_sequence = [4, 7, 15]
'''

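
The prefix-building loop explained above can be isolated into a small function. A minimal sketch, where `prefix_sequences` is a hypothetical name and the token indices are the illustrative ones from the explanation, not real vocabulary entries:

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

line_tokens = [4, 7, 15]  # illustrative indices for "I love coding"
sequences = prefix_sequences(line_tokens)
print(sequences)  # [[4, 7], [4, 7, 15]]

# A line with fewer than two tokens yields no training sequences.
print(prefix_sequences([4]))  # []
```

A line with n tokens therefore contributes n - 1 sequences, which is why short and empty lines add little or nothing to the training set.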

### Create Padding for fixed input size

In [10]:
max_length=max([len(x) for x in input_sequence])
max_length


14

In [11]:
input_sequences=np.array(pad_sequences(input_sequence,maxlen=max_length,padding='pre'))
input_sequences

array([[   0,    0,    0, ...,    0,    1,  687],
       [   0,    0,    0, ...,    1,  687,    4],
       [   0,    0,    0, ...,  687,    4,   45],
       ...,
       [   0,    0,    0, ...,    4,   45, 1047],
       [   0,    0,    0, ...,   45, 1047,    4],
       [   0,    0,    0, ..., 1047,    4,  193]])
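
To make the effect of `padding='pre'` concrete, here is a rough pure-Python equivalent. `pad_pre` is a hypothetical helper that assumes `maxlen >= 1`; like the default behaviour of `pad_sequences`, it fills with zeros on the left and drops tokens from the front of any sequence that is too long.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence to maxlen, truncating from the front if needed."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[1, 687], [1, 687, 4], [1, 687, 4, 45, 41]], maxlen=4)
print(rows)
# [[0, 0, 1, 687], [0, 1, 687, 4], [687, 4, 45, 41]]
```

Padding on the left keeps the most recent words, and the word to predict, at the end of every row, which is what the next step relies on when it slices off the last column.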

## Define X and y

In [12]:
X,y=input_sequences[:,:-1],input_sequences[:,-1]
print(X.shape)
print(y.shape)
X

(25732, 13)
(25732,)


array([[   0,    0,    0, ...,    0,    0,    1],
       [   0,    0,    0, ...,    0,    1,  687],
       [   0,    0,    0, ...,    1,  687,    4],
       ...,
       [   0,    0,    0, ...,  687,    4,   45],
       [   0,    0,    0, ...,    4,   45, 1047],
       [   0,    0,    0, ...,   45, 1047,    4]])

In [13]:
y

array([ 687,    4,   45, ..., 1047,    4,  193])

### Convert y into categorical feature for next word prediction

In [14]:
y=keras.utils.to_categorical(y,num_classes=total_words)
print(y.shape)
y

(25732, 4818)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
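
The two steps above (slicing off the last column as the label, then one-hot encoding it) can be illustrated on a single row. This is a sketch with hypothetical helper names and a toy vocabulary size of 9 instead of the notebook's 4818:

```python
def split_features_label(padded_row):
    """All tokens except the last are the input; the last token is the label."""
    return padded_row[:-1], padded_row[-1]

def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

row = [0, 0, 2, 1, 5]  # a pre-padded sequence of word indices
features, label = split_features_label(row)
print(features, label)                # [0, 0, 2, 1] 5
print(one_hot(label, num_classes=9))  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Each one-hot row has exactly one non-zero entry, which is why the printed `y` array above appears to contain only zeros: the single 1 in each row falls in the elided middle columns.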

## Train-Test Split

In [15]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(20585, 13)
(5147, 13)
(20585, 4818)
(5147, 4818)
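
The printed shapes can be checked with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up and that the training set takes the remainder, and that `model.fit` is later called with Keras' default batch size of 32 (the training cell does not set one); under those assumptions it reproduces both the split sizes and the 644 steps per epoch shown in the training log.

```python
import math

n_samples = 25732
test_size = 0.2
batch_size = 32  # assumption: Keras' default, since fit() is called without batch_size

n_test = math.ceil(test_size * n_samples)   # fractional test size rounded up
n_train = n_samples - n_test                # training set takes the remainder
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test, steps_per_epoch)  # 20585 5147 644
```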


## LSTM Model

In [16]:
'''
from tensorflow.keras.layers import Embedding,LSTM,Dense,Dropout
from tensorflow.keras.models import Sequential

def create_model(lstm_units=64, lstm_units2=128, embedding_dim=50, optimizer='adam'):
    model = Sequential()
    model.add(Embedding(total_words, embedding_dim, input_shape=(max_length-1,)))
    
    model.add(LSTM(lstm_units, return_sequences=True))
    
    model.add(Dropout(0.3))  # Dropout can be made tunable

    model.add(LSTM(lstm_units2))  # Final LSTM layer
    
    model.add(Dense(total_words, activation='softmax'))
    
    # Compile with passed optimizer
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])
    
    # Print model summary
    print("Model Summary:")
    model.summary()

    # Explicitly set the compiled attribute for compatibility with SciKeras
    model.compiled = True

    return model

# No optimizer tuning for now
from scikeras.wrappers import KerasClassifier
model = KerasClassifier(model=create_model, verbose=1, lstm_units=128, lstm_units2=64, embedding_dim=50)

param_grid = {
    'lstm_units': [128, 256],
    'lstm_units2': [64, 128],
    'embedding_dim': [50, 100],
    'epochs': [100]
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, verbose=2)
grid_result = grid.fit(X, y, batch_size=16)

# Output best result
print(f"Best score: {grid_result.best_score_} using {grid_result.best_params_}")
'''


In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input,Embedding,LSTM,Dense

# Define Model
model=Sequential()
model.add(Input(shape=(max_length-1,))) # Declare the input shape with an explicit Input layer
model.add(Embedding(total_words, 50))
model.add(LSTM(128))
model.add(Dense(total_words,activation='softmax'))

#Compile Model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


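
Since the output of `model.summary()` is not shown, the size of this architecture can be estimated by hand. The sketch below uses the standard parameter-count formulas for embedding, LSTM and dense layers with biases; the helper names are hypothetical, and the totals are a calculation, not output copied from the notebook.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words, embedding_dim, lstm_units = 4818, 50, 128

emb = embedding_params(total_words, embedding_dim)
lstm = lstm_params(embedding_dim, lstm_units)
dense = dense_params(lstm_units, total_words)

print(emb, lstm, dense, emb + lstm + dense)  # 240900 91648 621522 954070
```

Under these formulas, most of the parameters sit in the output layer, because it needs one weight vector for every word in the vocabulary.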


## Train Model

In [18]:
history=model.fit(X_train,y_train, validation_data=(X_test,y_test), epochs=100,verbose=1)

Epoch 1/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 36s 28ms/step - accuracy: 0.0289 - loss: 7.1340 - val_accuracy: 0.0404 - val_loss: 6.7031
Epoch 2/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 16s 25ms/step - accuracy: 0.0415 - loss: 6.4706 - val_accuracy: 0.0412 - val_loss: 6.7605
Epoch 3/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 22s 27ms/step - accuracy: 0.0439 - loss: 6.3172 - val_accuracy: 0.0490 - val_loss: 6.7819
Epoch 4/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 24s 37ms/step - accuracy: 0.0550 - loss: 6.0775 - val_accuracy: 0.0513 - val_loss: 6.8025
Epoch 5/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 17s 26ms/step - accuracy: 0.0611 - loss: 5.9063 - val_accuracy: 0.0622 - val_loss: 6.8439
Epoch 6/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 20s 25ms/step - accuracy: 0.0697 - loss: 5.7198 - val_accuracy: 0.0674 - val_loss: 6.9117
Epoch 7/10
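
In the epochs logged above, the training loss falls while `val_loss` rises after the first epoch. One common response is patience-based early stopping, which Keras offers as the `EarlyStopping` callback. The notebook does not use it; the sketch below only illustrates the stopping rule in plain Python on the logged `val_loss` values, with a hypothetical helper name and an arbitrary patience of 3.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_loss values from the first six logged epochs
val_losses = [6.7031, 6.7605, 6.7819, 6.8025, 6.8439, 6.9117]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 4 1
```

With these values, training would stop after epoch 4 and the best validation loss would be the one from epoch 1.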

## Save model and Tokenizer

In [19]:
model.save('model.h5')

import pickle
with open('tokenizer.pickle','wb') as handle:
    pickle.dump(tokenizer,handle,protocol=pickle.HIGHEST_PROTOCOL)



## Prediction

In [20]:
def predict_next_word(model,tokenizer,text,max_length):
    token_list = tokenizer.texts_to_sequences([text])[0]
    if len(token_list) >= max_length:
        token_list = token_list[-(max_length-1):] # Keep only the last max_length-1 tokens so the input matches the model's input length
    token_list = pad_sequences([token_list], maxlen=max_length-1, padding='pre')
    prediction = model.predict(token_list, verbose=0)
    predicted_index = int(np.argmax(prediction, axis=1)[0]) # Scalar index of the most probable word
    # index_word is the tokenizer's reverse mapping (index -> word); index 0 (padding) has no entry
    return tokenizer.index_word.get(predicted_index)

### Sample Prediction

In [21]:
sample="The Tragedie of Hamlet by William"
print(f'Sample Text: {sample}')
max_length=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,sample,max_length)
print(f'Next Word: {next_word}')

Sample Text: The Tragedie of Hamlet by William
Next Word: shakespeare
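
The prediction step returns only the single most probable word. To inspect several candidates, the same softmax output can be ranked and mapped back to words. This is a sketch in plain Python: `top_k_words` is a hypothetical helper, and the probabilities and three-word vocabulary are made up for illustration.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return the k most probable (word, probability) pairs.

    Only indices present in index_word are ranked, so index 0 (padding) is skipped.
    """
    ranked = sorted(index_word, key=lambda i: probabilities[i], reverse=True)
    return [(index_word[i], probabilities[i]) for i in ranked[:k]]

index_word = {1: "the", 2: "and", 3: "to"}   # toy reverse mapping (index -> word)
probabilities = [0.05, 0.60, 0.10, 0.25]     # toy softmax output; position 0 is padding

candidates = top_k_words(probabilities, index_word, k=2)
print(candidates)  # [('the', 0.6), ('to', 0.25)]
```

With the real model, `probabilities` would be one row of `model.predict(...)` and `index_word` would be `tokenizer.index_word`.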
