# **LSTM Text Generation: Beyond Auto-Completion**

> *This notebook uses an LSTM (Long Short-Term Memory) neural network for predictive text generation. Trained on a dataset of text sequences, the model learns to predict the next word in a sequence, which lets it generate coherent, contextually relevant text one word at a time.*

## Library Imports and Directory Traversal

This code snippet imports necessary libraries for working with data and building a neural network model using TensorFlow and Keras. Here's a breakdown of what each part of the code does:

1. `numpy` and `pandas` are imported for handling numerical operations and data manipulation, respectively.
2. `tensorflow` is imported as `tf` for deep learning tasks.
3. Specific modules from `tensorflow.keras.preprocessing.text` and `tensorflow.keras.preprocessing.sequence` are imported for text preprocessing tasks like tokenization and padding.
4. `to_categorical` from `tensorflow.keras.utils` is imported for one-hot encoding.
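5. The `Sequential` model, the layers, and the `l2` regularizer are imported from `keras`.
6. The `time` module is imported for adding delays and measuring execution time.
7. The `pickle` module is imported for serializing and deserializing Python objects.
8. The `os` module is imported for file-system access, and `warnings` is used to suppress warning messages.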

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, BatchNormalization
from keras.regularizers import l2
import time
import pickle
import os

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")



/kaggle/input/next-word-predictor-text-generator-dataset/next_word_predictor.txt


## Reading Text File

This code snippet opens a text file located at 'next_word_predictor.txt' in read mode with UTF-8 encoding. It then reads the contents of the file and assigns it to the variable `data`.

In [None]:
# Open the text file in read mode with UTF-8 encoding
with open('next_word_predictor.txt', 'r', encoding='utf-8') as file:
    # Read the contents of the file and assign it to the variable 'data'
    data = file.read()

## Function Description: Separate Punctuation

This function takes a string of text (`doc_text`) as input, splits it on single spaces, and returns a list of lowercased tokens. Tokens that consist only of punctuation or whitespace (or are empty) are dropped. Punctuation attached to a word, as in `fine.`, is kept, because the filter compares whole tokens rather than individual characters.

#### Parameters:
- `doc_text`: A string representing the input text document.

#### Returns:
- A list of lowercased tokens with standalone punctuation tokens removed.

In [None]:
def separate_punc(doc_text):
    """
    Drop standalone punctuation tokens and convert the rest to lowercase.
    
    Args:
    doc_text (str): Input text document.
    
    Returns:
    list: Lowercased tokens; punctuation attached to a word is kept.
    """
    return [token.lower() for token in doc_text.split(" ") if token not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
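A quick check of what the membership test does in practice may help. The sketch below copies the function and runs it on a made-up sentence (the sample string is an assumption, not taken from the dataset). Because `token not in '...'` is a substring test on whole space-separated tokens, only tokens made up entirely of punctuation, or empty tokens, are dropped.

```python
# Same punctuation string as in the notebook's separate_punc function.
PUNCT = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '


def separate_punc(doc_text):
    """Drop standalone punctuation tokens and lowercase the rest."""
    return [token.lower() for token in doc_text.split(" ") if token not in PUNCT]


# Hypothetical sample text, used only for illustration.
sample = "Hello , World ! It's fine."
tokens = separate_punc(sample)
print(tokens)  # ['hello', 'world', "it's", 'fine.']
```

The standalone `,` and `!` disappear, while the full stop in `fine.` stays attached to its word.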

## Cleaning Text Data

This code snippet applies the previously defined `separate_punc` function to the text stored in `data`, dropping standalone punctuation tokens and lowercasing the rest. Note that `data` is rebound from the raw string to the resulting token list. The tokens are then joined back into a single string and stored in the variable `cleaned_data`.

In [None]:
# Clean the text data by dropping standalone punctuation tokens and lowercasing the rest
data = separate_punc(data)

# Join the cleaned tokens back into a single string
cleaned_data = " ".join(data)

## Tokenization

This code segment initializes a tokenizer object using the `Tokenizer` class from TensorFlow Keras and fits it on the cleaned text data. Fitting assigns each distinct word a unique integer index, starting at 1 for the most frequent word. The tokenizer's default filters also strip most punctuation that is still attached to words.

#### Observations from the result:

- The `tokenizer` object is initialized and fitted on the cleaned text data.
- The `word_index` attribute of the tokenizer object contains a dictionary mapping words to their respective indices, which can be used for further processing in natural language processing tasks.

In [None]:
# Initialize a tokenizer object
tokenizer = Tokenizer(num_words=None, char_level=False)

# Fit the tokenizer on the cleaned text data
tokenizer.fit_on_texts([cleaned_data])

# Retrieve the word indices from the tokenizer
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'i': 6,
 'you': 7,
 'in': 8,
 'is': 9,
 'monica': 10,
 'it': 11,
 'with': 12,
 'ross': 13,
 'that': 14,
 'rachel': 15,
 'for': 16,
 'chandler': 17,
 'this': 18,
 'on': 19,
 'joey': 20,
 'was': 21,
 'oh': 22,
 'phoebe': 23,
 'are': 24,
 'all': 25,
 'as': 26,
 'what': 27,
 'be': 28,
 'like': 29,
 'no': 30,
 "it's": 31,
 "i'm": 32,
 'her': 33,
 'they': 34,
 'just': 35,
 'from': 36,
 'okay': 37,
 'not': 38,
 'so': 39,
 'my': 40,
 'have': 41,
 'me': 42,
 'where': 43,
 'know': 44,
 'she': 45,
 'we': 46,
 'out': 47,
 'well': 48,
 'their': 49,
 'can': 50,
 'at': 51,
 'he': 52,
 'yeah': 53,
 'your': 54,
 'about': 55,
 'but': 56,
 'its': 57,
 'up': 58,
 "don't": 59,
 'text': 60,
 'scene': 61,
 'by': 62,
 'do': 63,
 'an': 64,
 'or': 65,
 'were': 66,
 'there': 67,
 'if': 68,
 'uh': 69,
 'look': 70,
 'life': 71,
 'through': 72,
 'into': 73,
 'him': 74,
 'his': 75,
 "you're": 76,
 'hey': 77,
 'how': 78,
 'right': 79,
 'think': 80,
 'time': 81,
 'no
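The frequency-ranked numbering seen in the output above can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: it assumes whitespace splitting and ignores the tokenizer's punctuation filters, and `build_word_index` and the sample sentence are hypothetical names chosen for illustration.

```python
from collections import Counter


def build_word_index(text):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(text.lower().split())
    # most_common() sorts by count, descending; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


toy_index = build_word_index("the cat and the dog and the bird")
print(toy_index)  # {'the': 1, 'and': 2, 'cat': 3, 'dog': 4, 'bird': 5}
```

Index 0 is never assigned to a word, which is why it can later serve as the padding value.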

## Word Index Length

This code snippet calculates the length of the word index generated by the tokenizer. The word index is a dictionary that maps words to their respective indices in the text data.

#### Observations from the result:

- The length of the word index is 4993, indicating that there are 4993 unique words in the text data after tokenization.

In [None]:
# Calculate the length of the word index generated by the tokenizer
word_index_length = len(tokenizer.word_index)

# Print the length of the word index with a descriptive message
print(f"The length of the word index is: {word_index_length}")

The length of the word index is: 4993


## Generating Input Sequences

This code snippet generates input sequences from the cleaned data using the fitted tokenizer. It iterates over each line of the cleaned data (split on `\n`), converts the line to a list of token indices, and builds one sequence for every prefix of at least two tokens. The resulting input sequences are stored in a list.

#### Observations from the result:

- The `input_sequences` list contains sequences of tokens generated from the cleaned data.
- Each sequence consists of a variable number of tokens, representing the context and target words for training a language model.
- The printed output shows the first few input sequences as a demonstration.

In [None]:
# Initialize an empty list to store input sequences
input_sequences = []

# Iterate over each sentence in the cleaned data
for sentence in cleaned_data.split('\n'):
    # Tokenize the sentence
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    
    # Iterate over the tokenized sentence to create input sequences
    for i in range(1, len(tokenized_sentence)):
        # Append the input sequence to the list
        input_sequences.append(tokenized_sentence[:i+1])

# Print the first few input sequences for demonstration
print(input_sequences[:20])

[[1, 155], [1, 155, 21], [1, 155, 21, 2368], [1, 155, 21, 2368, 1549], [1, 155, 21, 2368, 1549, 8], [1, 155, 21, 2368, 1549, 8, 1], [1, 155, 21, 2368, 1549, 8, 1, 422], [1, 155, 21, 2368, 1549, 8, 1, 422, 692], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550, 2370], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550, 2370, 1], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550, 2370, 1, 423], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550, 2370, 1, 423, 4], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550, 2370, 1, 423, 4, 1], [1, 155, 21, 2368, 1549, 8, 1, 422, 692, 215, 2, 3, 2369, 1550, 2370, 1, 423, 4, 1, 1142], [1, 155, 21, 2368, 1549, 8, 1, 42
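The prefix expansion is easier to see on a short list. The sketch below isolates the inner loop in a helper (`prefix_sequences` is a hypothetical name) and feeds it the first four token indices from the output above.

```python
def prefix_sequences(token_ids):
    """Return every prefix of token_ids that has at least two tokens."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]


# First four token indices from the notebook's output.
sequences = prefix_sequences([1, 155, 21, 2368])
print(sequences)  # [[1, 155], [1, 155, 21], [1, 155, 21, 2368]]
```

A line with `n` tokens therefore contributes `n - 1` training sequences, and a line with fewer than two tokens contributes none.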

## Finding Maximum Sequence Length

This code snippet calculates the maximum sequence length among all input sequences generated from the cleaned data.

In [None]:
# Calculate the maximum sequence length among all input sequences
max_len = max([len(x) for x in input_sequences])

# Print the maximum sequence length
print(max_len)

325


## Padding Input Sequences

This code snippet pads the input sequences to ensure uniform length. It uses the `pad_sequences` function from TensorFlow Keras to pad the sequences with zeros (pre-padding) up to the maximum sequence length (`max_len`). After padding, the input sequences are split into features (`X`) and labels (`y`) where the last token in each sequence is considered the label.

#### Observations from the result:

- The `X` variable contains the padded input sequences excluding the last token.
- The `y` variable contains the last token (label) of each padded input sequence.
- The shapes of `X` and `y` are `(26383, 324)` and `(26383,)`, respectively.

In [None]:
# Pad the input sequences to ensure uniform length
padded_input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')

# Extract features (X) and labels (y)
X = padded_input_sequences[:, :-1]
y = padded_input_sequences[:, -1]

# Print the shapes of X and y
print(X.shape)
print(y.shape)

(26383, 324)
(26383,)
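To make the pre-padding and the feature/label split concrete, here is a minimal pure-Python sketch on three short sequences. `pad_pre` is a hypothetical helper that mimics what `pad_sequences(..., padding='pre')` does for this case; it is not the Keras function itself.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


sequences = [[1, 155], [1, 155, 21], [1, 155, 21, 2368]]
padded = pad_pre(sequences, maxlen=4)

# The last column is the label; everything before it is the context.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(padded)  # [[0, 0, 1, 155], [0, 1, 155, 21], [1, 155, 21, 2368]]
print(X)       # [[0, 0, 1], [0, 1, 155], [1, 155, 21]]
print(y)       # [155, 21, 2368]
```

Padding on the left keeps the target word in the final column of every row, so a single slice separates features from labels.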


## One-Hot Encoding Labels

This code snippet performs one-hot encoding on the labels (`y`) to convert them into categorical format. It utilizes the `to_categorical` function from TensorFlow Keras to encode the labels into binary vectors with a dimension equal to the vocabulary size (`len(tokenizer.word_index) + 1`). Each label is represented as a binary vector where the index corresponding to the word index is set to 1, and all other indices are set to 0.

#### Observations from the result:

- The labels (`y`) are one-hot encoded into a categorical format, resulting in a shape of `(26383, 4994)`, where 4994 is the vocabulary size plus one, because index 0 is reserved for padding.

In [None]:
# Perform one-hot encoding on the labels (y)
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)

# Print the shape of y after one-hot encoding
print(y.shape)

(26383, 4994)
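One-hot encoding can be illustrated on a toy vocabulary. The sketch below is a plain-Python stand-in for `to_categorical`, with a hypothetical three-word `word_index`; it only shows why the number of classes is the vocabulary size plus one.

```python
def one_hot(labels, num_classes):
    """Encode each integer label as a vector with a single 1.0."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# Hypothetical three-word vocabulary; index 0 is left for padding.
word_index = {"the": 1, "and": 2, "a": 3}
num_classes = len(word_index) + 1

encoded = one_hot([2, 3, 1], num_classes)
print(encoded[0])  # [0.0, 0.0, 1.0, 0.0]
```

Without the extra slot, the label with the highest word index would fall outside the vector.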


## LSTM Language Model Architecture

This code defines an LSTM-based language model using the Keras Sequential API. The model architecture consists of three layers:

1. **Embedding Layer**: This layer converts integer indices into dense vectors of fixed size. It takes input sequences of length 324 (padded input sequences) and outputs dense vectors of size 100 for each word in the input sequence. The total number of parameters in this layer is 4994 * 100 = 499400.

2. **LSTM Layer**: This layer consists of 150 LSTM units. It takes the output of the embedding layer as input and processes the sequential data, capturing dependencies among words in the input sequence. Each of the four gates has weights for the input and the recurrent state plus a bias, so the total number of parameters in this layer is 4 * ((100 (input size) + 150 (recurrent size)) * 150 + 150 (biases)) = 150600.

3. **Dense Layer**: This layer is a fully connected (dense) layer with softmax activation function. It takes the output of the LSTM layer and predicts the probability distribution over all words in the vocabulary (4994 classes). The total number of parameters in this layer is 150 (input size) * 4994 + 4994 (biases) = 754094.

The model is compiled with the binary cross-entropy loss, the Adam optimizer, and accuracy as the evaluation metric. For a softmax output trained on one-hot labels, categorical cross-entropy is the conventional loss.

#### Observations from the model summary:

- The total number of trainable parameters in the model is 1404094.
- The model summary provides detailed information about each layer's type, output shape, and the number of parameters.

In [None]:
# Define the LSTM language model architecture
model = Sequential()
model.add(Embedding(4994, 100, input_length=324))
model.add(LSTM(150))
model.add(Dense(4994, activation="softmax"))

# Compile the model
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Print the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 324, 100)          499400    
                                                                 
 lstm (LSTM)                 (None, 150)               150600    
                                                                 
 dense (Dense)               (None, 4994)              754094    
                                                                 
Total params: 1404094 (5.36 MB)
Trainable params: 1404094 (5.36 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
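The parameter counts in the summary can be reproduced with a few lines of arithmetic. The sketch below uses the sizes from this notebook and the standard formulas for embedding, LSTM, and dense layers; the variable names are chosen here for illustration.

```python
vocab_size = 4994   # len(tokenizer.word_index) + 1
embed_dim = 100     # embedding vector size
lstm_units = 150    # number of LSTM units

# Embedding: one vector of size embed_dim per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per (input, class) pair plus one bias per class.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 499400 150600 754094 1404094
```

The four numbers match the `Param #` column and the total reported by `model.summary()`.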


## Training the LSTM Language Model

This code snippet trains the defined LSTM language model (`model`) using the input features (`X`) and one-hot encoded labels (`y`) for a specified number of epochs (150). During training, the model learns to predict the next word in a sequence based on the input context provided by the input sequences (`X`).

#### Observations:
- The model is trained for 150 epochs to learn the patterns and relationships within the input sequences.
- The training process updates the model parameters (weights) using the Adam optimizer and minimizes the binary cross-entropy loss between the predicted and actual labels.

In [None]:
# Train the LSTM language model
model.fit(X, y, epochs=150)

Epoch 1/150
  1/825 [..............................] - ETA: 41:24 - loss: 0.6932 - accuracy: 0.0000e+00



Epoch 2/150
...

<keras.src.callbacks.History at 0x7a76abe64820>
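The `825` in the progress bar is the number of batches per epoch. A short calculation reproduces it, assuming Keras's default `batch_size` of 32 (the `fit` call above does not set one).

```python
import math

num_samples = 26383  # rows in X
batch_size = 32      # assumed Keras default; not set explicitly in fit()

steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)  # 825
print(last_batch_size)  # 15
```

Over 150 epochs this amounts to 150 * 825 = 123750 weight updates.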

## Generating Text using Trained Language Model

This code segment generates text using the trained LSTM language model. It starts with an initial text ("Please let me know") and iteratively predicts the next word in the sequence using the trained model. The process is repeated for 15 iterations, appending one word per iteration for a total of 15 generated words.

#### Observations:
- The text generation process involves tokenizing the input text, padding it to match the model's input size, and then using the model to predict the next word.
- The predicted word is appended to the input text, and the process is repeated iteratively to generate a sequence of words.
- A delay of 1 second is added between each iteration to simulate a typing effect.

In [None]:
text = "Please let me know"

for i in range(15):
    # Tokenize the input text
    token_text = tokenizer.texts_to_sequences([text])[0]
    # Pad the tokenized text
    padded_token_text = pad_sequences([token_text], maxlen=324, padding='pre')
    # Predict the index of the most probable next word
    pos = int(np.argmax(model.predict(padded_token_text)))

    # Look up the word for the predicted index (index 0 is padding and has no word)
    word = tokenizer.index_word.get(pos)
    if word is not None:
        # Append the predicted word to the input text
        text = text + " " + word
        print(text)
        # Simulate typing effect with a delay of 1 second
        time.sleep(1)

Please let me know if
Please let me know if you
Please let me know if you have
Please let me know if you have any
Please let me know if you have any other
Please let me know if you have any other requests
Please let me know if you have any other requests or
Please let me know if you have any other requests or would
Please let me know if you have any other requests or would like
Please let me know if you have any other requests or would like me
Please let me know if you have any other requests or would like me to
Please let me know if you have any other requests or would like me to generate
Please let me know if you have any other requests or would like me to generate datasets
Please let me know if you have any other requests or would like me to generate datasets for
Please let me know if you have any other requests or would like me to generate datasets for different
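The generation loop is a greedy decoder: at each step it keeps only the single most probable word. The sketch below shows the same loop without TensorFlow so it can run anywhere. Everything in it is a stand-in chosen for illustration: `toy_predict` replaces `model.predict` with a fixed rule, and the four-word `word_index` is hypothetical.

```python
def generate(seed, predict_next, word_index, steps, maxlen):
    """Greedy decoding: repeatedly append the most probable next word."""
    index_word = {index: word for word, index in word_index.items()}
    text = seed
    for _ in range(steps):
        # Tokenize, skipping unknown words, then pre-pad with zeros.
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]
        padded = [0] * (maxlen - len(ids)) + ids
        probs = predict_next(padded)
        pos = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = index_word.get(pos)
        if word is None:  # index 0 is padding and maps to no word
            break
        text = text + " " + word
    return text


def toy_predict(padded):
    """Stand-in for model.predict: always picks the 'next' index in a cycle."""
    vocab = 4
    probs = [0.0] * (vocab + 1)
    probs[padded[-1] % vocab + 1] = 1.0
    return probs


word_index = {"please": 1, "let": 2, "me": 3, "know": 4}
result = generate("please let me", toy_predict, word_index, steps=2, maxlen=5)
print(result)  # please let me know please
```

Greedy decoding is deterministic, so the same seed always yields the same continuation.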


## Saving Model and Tokenizer

This code snippet saves the trained LSTM language model (`model`) and the tokenizer (`tokenizer`) to separate files. The trained model is saved in HDF5 format with the filename 'my_model.h5', while the tokenizer is saved using Python's pickle module with the filename 'tokenizer.pickle'.

#### Observations:
- The `model.save()` function saves the trained model to an HDF5 file, which can be loaded later for inference or further training.
- The tokenizer is saved to a file using the `pickle.dump()` function, which serializes the tokenizer object and writes it to a file in binary format.


In [None]:
# Save the trained model to an HDF5 file
model.save('my_model.h5')

# Save the tokenizer to a file using pickle
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
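Loading mirrors saving. The sketch below shows a pickle round trip using a plain dictionary as a stand-in for the tokenizer, so that it runs without TensorFlow; the temporary directory and the stand-in object are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer (hypothetical contents).
stand_in_tokenizer = {"word_index": {"the": 1, "and": 2}}

# Write to a temporary directory to avoid touching the working directory.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.pickle")

with open(path, "wb") as handle:
    pickle.dump(stand_in_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read the object back in binary mode.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == stand_in_tokenizer)  # True
```

With TensorFlow installed, the saved model can be restored in the same session with `tf.keras.models.load_model('my_model.h5')`. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.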