
In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical


Sample Text Data:

In [None]:
text = "This is a simple example for next word prediction using LSTM model. A Long Short-Term Memory (LSTM) model is a type of deep neural network that can process and analyze sequential data, such as time series, text, and speech. LSTMs are used in many applications, including speech recognition, language translation, and sentiment analysis."

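Tokenization: We create a Tokenizer instance and fit it to our text. This lowercases the text, strips punctuation, and assigns each unique word an integer index starting at 1, with more frequent words receiving lower indices. total_words is the number of unique words plus one, because index 0 is reserved for padding and is never assigned to a word.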

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1


In [None]:
total_words

47
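To make the indexing concrete, here is a minimal pure-Python sketch of frequency-ranked word indexing. It approximates what Tokenizer does with its default settings (lowercasing, punctuation filtering, whitespace splitting) and is not the library's implementation. The names simple_tokenize and build_word_index and the sample sentence are illustrative assumptions.

```python
from collections import Counter

# Punctuation assumed to match the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by descending count; ties keep first-seen order."""
    counts = Counter()
    for t in texts:
        counts.update(simple_tokenize(t))
    # sorted() is stable, so equally frequent words stay in first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    # Indices start at 1; index 0 is left free for padding.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sample = "A model is a type of network. The model is simple."
word_index = build_word_index([sample])
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(vocab_size)
```

Applied to the notebook's text, the same counting gives 46 distinct words, which is why total_words is 47.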

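Create Input Sequences: We split the text into sentences on periods and convert each sentence into a sequence of integers. For each sentence, we build every prefix of two or more words (the first two words, the first three, and so on) and append it to input_sequences. These prefixes are the n-gram sequences: the last word of each one will become the label, and the words before it will become the input.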

In [None]:
input_sequences = []
for line in text.split('.'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    print(token_list)
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        print(n_gram_sequence)
        input_sequences.append(n_gram_sequence)


[7, 3, 1, 8, 9, 10, 11, 12, 13, 14, 4, 5]
[7, 3]
[7, 3, 1]
[7, 3, 1, 8]
[7, 3, 1, 8, 9]
[7, 3, 1, 8, 9, 10]
[7, 3, 1, 8, 9, 10, 11]
[7, 3, 1, 8, 9, 10, 11, 12]
[7, 3, 1, 8, 9, 10, 11, 12, 13]
[7, 3, 1, 8, 9, 10, 11, 12, 13, 14]
[7, 3, 1, 8, 9, 10, 11, 12, 13, 14, 4]
[7, 3, 1, 8, 9, 10, 11, 12, 13, 14, 4, 5]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21, 22, 23, 24, 25, 26, 2, 27, 28, 29, 30, 31, 32, 33, 34, 2, 6]
[1, 15]
[1, 15, 16]
[1, 15, 16, 17]
[1, 15, 16, 17, 18]
[1, 15, 16, 17, 18, 4]
[1, 15, 16, 17, 18, 4, 5]
[1, 15, 16, 17, 18, 4, 5, 3]
[1, 15, 16, 17, 18, 4, 5, 3, 1]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21, 22]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21, 22, 23]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21, 22, 23, 24]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21, 22, 23, 24, 25]
[1, 15, 16, 17, 18, 4, 5, 3, 1, 19, 20, 21, 22, 23, 24, 25, 26]
...
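The prefix construction can be isolated into a small helper. The sketch below uses a hypothetical function name, prefix_sequences, and takes the first four token ids from the first sentence printed above. The three sentence lengths are an assumption: 12 and 28 come from the printed token lists, and 14 was counted by hand from the third sentence of the sample text.

```python
def prefix_sequences(token_ids):
    """Return every prefix of token_ids that contains at least two tokens."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]


# First four token ids of the first sentence.
sentence_ids = [7, 3, 1, 8]
prefixes = prefix_sequences(sentence_ids)
print(prefixes)  # [[7, 3], [7, 3, 1], [7, 3, 1, 8]]

# A sentence of n tokens yields n - 1 training sequences.
sentence_lengths = [12, 28, 14]
total_sequences = sum(n - 1 for n in sentence_lengths)
print(total_sequences)  # 51
```

The empty string left after the final period produces an empty token list and therefore contributes no sequences.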

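Pad Sequences: We determine the maximum sequence length and pad all sequences to that length so they can be stacked into a single array. Padding is done on the "pre" side, meaning zeros are added at the start, which keeps the last word of every sequence (the future label) in the final column.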

In [None]:
max_sequence_length = max(len(x) for x in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')


In [None]:
input_sequences

array([[ 0,  0,  0, ...,  0,  7,  3],
       [ 0,  0,  0, ...,  7,  3,  1],
       [ 0,  0,  0, ...,  3,  1,  8],
       ...,
       [ 0,  0,  0, ..., 43, 44,  2],
       [ 0,  0,  0, ..., 44,  2, 45],
       [ 0,  0,  0, ...,  2, 45, 46]], dtype=int32)
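As a rough illustration of pre-padding, the following pure-Python sketch pads a few short sequences by hand. The helper pad_pre is a hypothetical stand-in written for this example, not the pad_sequences implementation, and it returns nested lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` so all have length `maxlen`."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # Keep the most recent tokens when a sequence is too long.
            s = s[len(s) - maxlen:]
        padded.append([value] * (maxlen - len(s)) + s)
    return padded


padded = pad_pre([[7, 3], [7, 3, 1], [7, 3, 1, 8]])
for row in padded:
    print(row)
```

Each row ends with the original tokens, so the last column always holds the word to be predicted.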

Create Predictors and Labels: X contains all but the last word of each sequence (the input), and y contains the last word (the label). We one-hot encode y with to_categorical so that it matches the softmax output used for multi-class classification.


In [None]:
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = to_categorical(y, num_classes=total_words)


In [None]:
y.shape

(51, 47)
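The split and the one-hot encoding can be traced by hand on a tiny example. This sketch uses plain lists, a hypothetical one_hot helper, and an assumed toy vocabulary size of 9. It only illustrates the idea and does not reproduce to_categorical itself.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row


# Three pre-padded sequences (toy data).
padded = [[0, 0, 7, 3], [0, 7, 3, 1], [7, 3, 1, 8]]

X = [row[:-1] for row in padded]   # everything except the last token
labels = [row[-1] for row in padded]  # the last token of each row

num_classes = 9  # assumed toy vocabulary size, including padding index 0
y = [one_hot(label, num_classes) for label in labels]

print(X)
print(labels)
print(len(y), len(y[0]))
```

With the notebook's 51 sequences and 47 classes, the same construction gives the shape (51, 47) reported above.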

Build the LSTM Model

Embedding Layer: Maps each integer word index to a dense vector of fixed size (100 in this case), so a sequence of indices becomes a sequence of vectors.

LSTM Layer: A recurrent layer that processes the sequences. Here, we use 110 units.

Dense Layer: The output layer with a softmax activation function for multi-class classification, which predicts the next word from the vocabulary.

In [None]:
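model = Sequential()
# Declare the input shape explicitly; Keras 3 deprecates Embedding's input_length argument.
model.add(tf.keras.Input(shape=(max_sequence_length - 1,)))
model.add(Embedding(total_words, 100))
model.add(LSTM(110))
model.add(Dense(total_words, activation='softmax'))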
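It can help to follow the tensor shapes through the three layers. The sketch below is plain bookkeeping with a hypothetical helper, trace_shapes. It assumes an input length of 27 (the longest sentence has 28 tokens, minus the label) and that the LSTM returns only its final hidden state, which is the Keras default.

```python
def trace_shapes(batch_size, seq_len, embed_dim, lstm_units, vocab_size):
    """List the output shape after each stage of the model."""
    return [
        ("input", (batch_size, seq_len)),                 # integer word ids
        ("embedding", (batch_size, seq_len, embed_dim)),  # one vector per id
        ("lstm", (batch_size, lstm_units)),               # final hidden state only
        ("dense_softmax", (batch_size, vocab_size)),      # one probability per word
    ]


shapes = trace_shapes(batch_size=51, seq_len=27, embed_dim=100,
                      lstm_units=110, vocab_size=47)
for name, shape in shapes:
    print(f"{name:14s} {shape}")
```

The final shape matches y, which is what lets the softmax output be compared with the one-hot labels during training.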




Compile the Model

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
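A quick sanity check on the loss: before any learning, a model that spreads probability evenly over all 47 classes should score a categorical cross-entropy of ln(47). The sketch below computes that baseline in pure Python. The function is a simplified single-example version written for illustration, not the Keras implementation.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy for one example: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


vocab_size = 47
uniform = [1.0 / vocab_size] * vocab_size  # an untrained, uninformed guess

target = [0.0] * vocab_size
target[3] = 1.0  # any single correct class gives the same result

baseline_loss = categorical_crossentropy(target, uniform)
print(round(baseline_loss, 4))  # 3.8501
```

This is close to the loss of about 3.85 reported for the first epoch in the training log, which suggests the model starts from near-uniform predictions.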


Train the Model

In [None]:
model.fit(X, y, epochs=100, verbose=1)


Epoch 1/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 3s 50ms/step - accuracy: 0.0235 - loss: 3.8496
Epoch 2/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step - accuracy: 0.1097 - loss: 3.8385
Epoch 3/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step - accuracy: 0.1671 - loss: 3.8263
Epoch 4/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step - accuracy: 0.0862 - loss: 3.8127
Epoch 5/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - accuracy: 0.1097 - loss: 3.7933
Epoch 6/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - accuracy: 0.0940 - loss: 3.7609
Epoch 7/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - accuracy: 0.0835 - loss: 3.7169
Epoch 8/100
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - accuracy: 0.0627 - loss: 3.6757
Epoch 9/100
...

<keras.src.callbacks.history.History at 0x1246bbff450>
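The "2/2" in each epoch line is the number of batches processed. As a small worked check, assuming the default batch size of 32 that model.fit uses when batch_size is not given, 51 samples split into two batches per epoch:

```python
import math

num_samples = 51   # rows in X
batch_size = 32    # Keras default when batch_size is not passed to fit()
epochs = 100

steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 2
print(total_updates)    # 200
```

The second batch holds only the remaining 19 samples, so the two weight updates in each epoch are based on batches of different sizes.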

In [None]:
model.summary()
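The parameter counts that the summary reports can be cross-checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM with four gates, and a dense layer, assuming the default settings with bias terms enabled. The helper names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary index."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


emb = embedding_params(47, 100)
lstm = lstm_params(100, 110)
dense = dense_params(110, 47)

print(emb, lstm, dense, emb + lstm + dense)
```

Most of the parameters sit in the LSTM layer, which is large relative to a 51-example training set, so the model is likely to memorize the text rather than generalize.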

Predict the Next Word


In [None]:
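def predict_next_word(model, tokenizer, input_text):
    # Encode the prompt and left-pad it to the input length used in training.
    input_seq = tokenizer.texts_to_sequences([input_text])[0]
    input_seq = pad_sequences([input_seq], maxlen=max_sequence_length - 1, padding='pre')
    # The softmax output has shape (1, total_words); take the most probable index.
    predicted = model.predict(input_seq, verbose=0)
    predicted_index = int(np.argmax(predicted, axis=-1)[0])
    # Index 0 is the padding slot and has no entry in index_word.
    return tokenizer.index_word.get(predicted_index, "")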


In [None]:
input_text = "LSTM"
next_word = predict_next_word(model, tokenizer, input_text)
print(f"The next word is: {next_word}")


The next word is: is
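Predicting one word at a time extends naturally to generating several words by feeding each prediction back into the prompt. The following is a minimal sketch of that greedy loop. The function generate_text and the lookup table toy_bigrams are hypothetical: the table stands in for the trained model so the example runs on its own.

```python
def generate_text(seed_text, next_word_fn, num_words):
    """Greedily append up to num_words predicted words to seed_text."""
    words = seed_text.split()
    for _ in range(num_words):
        next_word = next_word_fn(" ".join(words))
        if not next_word:
            break  # stop when the predictor has nothing to offer
        words.append(next_word)
    return " ".join(words)


# Hypothetical stand-in for the trained model: a lookup on the last word only.
toy_bigrams = {"lstm": "model", "model": "is", "is": "a", "a": "type"}


def toy_next_word(text):
    return toy_bigrams.get(text.lower().split()[-1], "")


generated = generate_text("LSTM", toy_next_word, 3)
print(generated)  # LSTM model is a
```

With the notebook's objects, the predictor could be passed as lambda t: predict_next_word(model, tokenizer, t), although the words it produces depend on the trained weights.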
