# Predicting the Next Token in Tweets Using LSTM and TensorFlow

## Introduction

Predicting the next word or token in a sequence of text underpins many natural language processing (NLP) applications, including text completion, chatbots, and machine translation. This project builds a machine learning model that predicts the next token in tweets. We use Long Short-Term Memory (LSTM) networks, a type of recurrent neural network (RNN) well suited to learning order dependence in sequence prediction problems.

Our dataset comprises preprocessed tweets, where each tweet has been cleaned and tokenized. Using TensorFlow, we build an LSTM model to learn these sequences and predict the next token based on the context provided by the previous tokens.

The project is divided into several key sections:

1. **Data Preparation:** Loading the preprocessed tweets, tokenizing the text, and preparing the sequences for training.
2. **Model Building:** Constructing an LSTM model using TensorFlow to predict the next token.
3. **Model Training:** Training our model on the prepared tweet sequences.
4. **Evaluation and Testing:** Assessing the model's performance and conducting tests with custom inputs.
5. **Model Saving and Deployment Preparation:** Saving the trained model and preparing it for deployment by containerizing it with Docker.
6. **AWS Deployment Preparation:** Steps to upload the containerized model to Amazon Elastic Container Registry (ECR) for deployment on AWS SageMaker.

By the end of this notebook, we will have a trained LSTM model ready for deployment, capable of predicting the next token in a sequence of tokens derived from tweets. This model serves as a foundation for more complex NLP tasks and demonstrates the process of moving from model training to deployment in a cloud environment.

Let's get started by importing the necessary libraries and loading our data.



In [2]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import pandas as pd


2025-03-12 18:56:00.857291: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-12 18:56:00.860049: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-12 18:56:00.905334: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## 2. Load and Prepare Data

Next, we load the preprocessed tweets and prepare them for training. The CSV already stores each text as a padded sequence of token ids in the `padded` column, but as a string, so this step mainly parses those strings into a numeric array that can be fed to our LSTM model.


In [12]:
# Load preprocessed data
df = pd.read_csv('../data/lstm_dataset.csv')

df

| Unnamed: 0 | headline | padded |
|---|---|---|
| 0 | over million americans roll up sleeves for om... | [ 49 200 214 1960 43 19066 6 368... |
| 1 | american airlines flyer charged banned for lif... | [ 138 1087 9100 977 2082 6 63 27 8034 ... |
| 2 | of the funniest tweets about cats and dogs th... | [ 3 1 1798 418 21 1701 7 663 19 ... |
| 3 | the funniest tweets from parents this week sept | [ 1 1798 418 18 157 19 91 6910 0 ... |
| 4 | woman who called cops on black birdwatcher los... | [ 126 50 857 481 9 82 28655 14... |
| ... | ... | ... |
| 207991 | rim ceo thorsten heins significant plans for b... | [11939 710 66968 66969 5191 735 6 132... |
| 207992 | maria sharapova stunned by victoria azarenka i... | [ 3706 13065 7235 37 3148 27076 5 20... |
| 207993 | giants over patriots jets over colts among mo... | [ 3528 49 4115 4664 49 8152 966 ... |
| 207994 | aldon smith arrested ers linebacker busted for... | [23500 984 515 5644 12332 4859 6 63... |


In [13]:
df['padded'].head(10).apply(type)


0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
5    <class 'str'>
6    <class 'str'>
7    <class 'str'>
8    <class 'str'>
9    <class 'str'>
Name: padded, dtype: object

In [15]:
import numpy as np

# Convert NumPy-style string lists to actual lists
df['padded'] = df['padded'].apply(lambda x: np.fromstring(x.strip("[]"), sep=" ").astype(int).tolist() if isinstance(x, str) else x)

# Convert into a proper NumPy 2D array
input_sequences = np.array(df['padded'].tolist(), dtype=np.int32)

# Verify the shape
print("Shape of input_sequences:", input_sequences.shape)

Shape of input_sequences: (207996, 43)


In [20]:
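# Separate predictors (X) and labels (y): every column but the last is the input, the last column is the label.
# Caution: the rows displayed above are zero-padded on the right, so for texts shorter than
# 43 tokens this last column holds the padding id 0 rather than a real next token.
X, y = input_sequences[:,:-1], input_sequences[:,-1]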
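
Because the rows appear to be padded with zeros on the right, slicing off the last column yields the padding id as the label for most rows. One common alternative is to expand each row into several (prefix, next token) pairs. The following is a minimal sketch of that idea, not the notebook's method; the helper name `make_prefix_examples` is hypothetical, and it assumes token id 0 is reserved for padding.

```python
import numpy as np

def make_prefix_examples(padded_rows, max_len):
    """Turn right-padded token rows into (left-padded prefix, next-token) pairs.

    Assumes token id 0 is reserved for padding (an assumption about the dataset).
    """
    prefixes, labels = [], []
    for row in padded_rows:
        tokens = [int(t) for t in row if t != 0]  # drop padding ids
        for i in range(1, len(tokens)):
            prefix = tokens[:i][-max_len:]  # keep at most max_len context tokens
            padded = [0] * (max_len - len(prefix)) + prefix  # pad on the left
            prefixes.append(padded)
            labels.append(tokens[i])  # the token that follows the prefix
    return np.array(prefixes, dtype=np.int32), np.array(labels, dtype=np.int32)

# Toy rows standing in for input_sequences
demo = np.array([[5, 7, 9, 0, 0],
                 [3, 4, 0, 0, 0]])
X_demo, y_demo = make_prefix_examples(demo, max_len=4)
print(X_demo)
print(y_demo)
```

Each row of length n contributes n - 1 training examples, so on the full dataset this multiplies the number of rows several times over and memory use grows accordingly.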

In [22]:
# Load the GloVe embeddings file
glove_path = "../data/glove.6B.100d.txt"  # Download from https://nlp.stanford.edu/projects/glove/
embedding_dim = 100

In [27]:
# Create a dictionary mapping words to their embedding vectors
embeddings_index = {}

with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]  # First item is the word
        coefs = np.asarray(values[1:], dtype="float32")  # The rest are embedding values
        embeddings_index[word] = coefs

print(f"Loaded {len(embeddings_index)} word vectors from GloVe.")

Loaded 400000 word vectors from GloVe.
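
The GloVe dictionary is typically turned into an embedding matrix whose row index is the token id. The sketch below shows one way this could look. It assumes a `word_index` mapping (word to integer id, ids starting at 1) such as the one a fitted Keras `Tokenizer` exposes; that mapping is not created in this notebook, so toy stand-ins are used, and `build_embedding_matrix` is a hypothetical helper name.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, embedding_dim):
    """Rows are indexed by token id; words missing from GloVe stay all-zero."""
    # One extra row because id 0 is assumed to be the padding id
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
    for word, idx in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Toy stand-ins; in the notebook these would be the tokenizer's word_index
# and the embeddings_index dictionary loaded from GloVe
toy_word_index = {"the": 1, "cat": 2, "zzzunknown": 3}
toy_embeddings = {"the": np.array([0.1, 0.2], dtype="float32"),
                  "cat": np.array([0.3, 0.4], dtype="float32")}
embedding_matrix = build_embedding_matrix(toy_word_index, toy_embeddings, embedding_dim=2)
print(embedding_matrix.shape)
```

Such a matrix can be handed to a Keras `Embedding` layer through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)`, optionally with `trainable=False` to keep the pretrained vectors fixed.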


## 3. Build the LSTM Model

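With our data prepared, we can now build the LSTM model. An Embedding layer learns token embeddings from scratch (the GloVe vectors loaded above are not wired into this model), followed by two stacked LSTM layers and a Dense softmax layer that outputs a probability for every token in the vocabulary.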


In [None]:
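# Vocabulary size: largest token id plus one (id 0 is the padding value)
total_words = int(input_sequences.max()) + 1
# Length of each padded row (43 for this dataset)
max_sequence_len = input_sequences.shape[1]

model = Sequential()
model.add(tf.keras.Input(shape=(max_sequence_len - 1,)))  # X drops the last column
model.add(Embedding(total_words, 100))
model.add(LSTM(150, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

model.summary()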


## 4. Compile the Model

We compile the model with the 'adam' optimizer. Because the labels in `y` are integer token ids rather than one-hot vectors, the loss is 'sparse_categorical_crossentropy', the integer-label form of the cross-entropy loss used for multi-class classification.


In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
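
Keras offers two cross-entropy losses for multi-class problems that differ only in the label format they expect: `categorical_crossentropy` takes one-hot vectors, while `sparse_categorical_crossentropy` takes integer class ids such as the token ids in `y`. The NumPy sketch below is an illustration of the math, not Keras's implementation, and shows that the two give the same value when the labels describe the same classes.

```python
import numpy as np

def categorical_crossentropy(one_hot, probs):
    # Labels are one-hot rows; only the true class contributes to the sum
    return -np.sum(one_hot * np.log(probs), axis=-1)

def sparse_categorical_crossentropy(labels, probs):
    # Labels are integer class ids; pick the true-class probability directly
    return -np.log(probs[np.arange(len(labels)), labels])

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])  # toy softmax outputs for two examples
labels = np.array([0, 2])            # integer class ids
one_hot = np.eye(3)[labels]          # the same labels in one-hot form

loss_sparse = sparse_categorical_crossentropy(labels, probs)
loss_dense = categorical_crossentropy(one_hot, probs)
print(loss_sparse, loss_dense)
```

With a vocabulary of tens of thousands of tokens, the integer form also avoids materializing a one-hot label matrix with one column per token.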


## 5. Train the Model

It's time to train our model. Note that this process can be time-consuming, depending on the size of your data and the complexity of the model.


In [None]:
history = model.fit(X, y, epochs=100, verbose=1)


## 6. Evaluate the Model

After training, we plot the training history to visualize the learning process. These curves show accuracy and loss on the training data only; no held-out data is evaluated here.


In [None]:
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'])
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

# Plot loss
plt.plot(history.history['loss'])
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()
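
For language models, cross-entropy loss is often reported as perplexity, which is the exponential of the mean per-token loss when the loss uses natural logarithms. The short sketch below shows the conversion; the loss values are made up for illustration and are not results from this model.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(cross_entropy_loss)

# Illustrative loss values only, not measured results
example_losses = [6.9, 5.2, 4.1]
for epoch, loss in enumerate(example_losses, start=1):
    print(f"epoch {epoch}: loss={loss:.2f}, perplexity={perplexity(loss):.1f}")
```

In the notebook, the same function could be applied to each entry of `history.history['loss']`.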


## 7. Test the Model

Next, let's test our model with a custom input to predict the next token in a sequence.


In [None]:
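def predict_next_token(seed_text):
    # `tokenizer` must be the fitted Tokenizer that produced the `padded` column; it is not created in this notebook
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output instead
    probabilities = model.predict(token_list, verbose=0)
    predicted = int(np.argmax(probabilities, axis=-1)[0])
    # Index 0 is padding and has no entry in index_word
    return tokenizer.index_word.get(predicted, '<pad>')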

# Test with a custom input
seed_text = "I feel"
next_token = predict_next_token(seed_text)
print(f"Next token after '{seed_text}': {next_token}")
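
Taking only the single most likely token hides how confident the model is. A small sketch of listing the top few candidates from one softmax output follows; `top_k_tokens` is a hypothetical helper, and the probability vector and vocabulary are toy values rather than model output.

```python
import numpy as np

def top_k_tokens(probs, index_word, k=3):
    """Return the k most probable (token, probability) pairs, best first."""
    top_ids = np.argsort(probs)[::-1][:k]
    # Ids without a vocabulary entry (such as padding id 0) are shown as "<unk>"
    return [(index_word.get(int(i), "<unk>"), float(probs[i])) for i in top_ids]

toy_probs = np.array([0.05, 0.50, 0.15, 0.30])    # stand-in for one row of model.predict output
toy_index_word = {1: "good", 2: "bad", 3: "fine"}  # stand-in for tokenizer.index_word
top_two = top_k_tokens(toy_probs, toy_index_word, k=2)
print(top_two)
```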


## 8. Save the Model

To deploy the model, we first need to save it. TensorFlow Serving loads models in the SavedModel format, so we export the trained model to a SavedModel directory.


In [1]:
model_save_path = 'saved_model/next_token_predictor'
# Keras 3 (bundled with TensorFlow 2.16+) writes a SavedModel with model.export();
# on older TensorFlow 2.x releases, model.save(model_save_path) produces the same format.
model.export(model_save_path)

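
A served model returns token ids, so whatever client calls it also needs the vocabulary to turn text into ids and ids back into words. One simple option, sketched here under the assumption that the vocabulary is available as a plain word-to-id dictionary, is to store it as JSON next to the model. The helper names and the toy vocabulary are hypothetical.

```python
import json
import os
import tempfile

def save_vocabulary(word_index, path):
    """Write the word -> id mapping as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word_index, f)

def load_vocabulary(path):
    """Read the mapping back and also build the reverse id -> word lookup."""
    with open(path, "r", encoding="utf-8") as f:
        word_index = json.load(f)
    index_word = {idx: word for word, idx in word_index.items()}
    return word_index, index_word

toy_word_index = {"the": 1, "cat": 2}  # stand-in for the real vocabulary
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.json")
save_vocabulary(toy_word_index, vocab_path)
restored_word_index, restored_index_word = load_vocabulary(vocab_path)
print(restored_index_word)
```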

## 9. Containerize the Model Using Docker

To prepare our model for deployment on AWS SageMaker, we'll containerize it using Docker. This process involves creating a Dockerfile, building a Docker image, and testing it locally.


In [2]:
%%writefile Dockerfile
# Dockerfile for serving the model with TensorFlow Serving
FROM tensorflow/serving

# Copy the SavedModel into the container; TensorFlow Serving expects a numeric version directory
COPY saved_model/next_token_predictor /models/next_token_predictor/1

# Set environment variables to serve the model
ENV MODEL_NAME=next_token_predictor


In [None]:
# Build the Docker image
!docker build -t next-token-predictor:latest .


In [None]:
# Run the Docker container locally to test (-d runs it in the background so the notebook cell returns)
!docker run -d -p 8501:8501 --name=my_model_container next-token-predictor:latest
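
Once the container is running, TensorFlow Serving exposes a REST endpoint on port 8501 at `/v1/models/<model name>:predict`, which accepts a JSON body with an `instances` list. The sketch below only builds such a request so that it runs without a live container; `build_predict_request` is a hypothetical helper and the token ids are placeholders.

```python
import json
import urllib.request

def build_predict_request(token_ids, host="localhost", port=8501,
                          model_name="next_token_predictor"):
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [token_ids]}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

# Placeholder ids; a real request needs a padded sequence of the length the model was trained on
request = build_predict_request([0, 0, 12, 45])
print(request.full_url)

# With the container running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())["predictions"]
```

The response holds one probability vector per instance, so the predicted token id is the index of the largest value in `predictions[0]`.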


## 10. Upload the Model to Amazon ECR

To deploy our model with AWS SageMaker, we need to push our Docker image to Amazon Elastic Container Registry (ECR). This section outlines the steps to create a repository in ECR, authenticate Docker with ECR, and push the image.


In [None]:
import boto3

# Set your AWS region
aws_region = 'us-west-2'
ecr_repository_name = 'next-token-predictor'

# Create ECR client
ecr_client = boto3.client('ecr', region_name=aws_region)

# Create an ECR repository
response = ecr_client.create_repository(repositoryName=ecr_repository_name)
repository_uri = response['repository']['repositoryUri']

print(f"Repository URI: {repository_uri}")
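
Re-running the cell above fails once the repository exists, because `create_repository` raises `RepositoryAlreadyExistsException`. A hedged sketch of a re-runnable variant follows. `get_or_create_repository_uri` is a hypothetical helper, and the `_FakeEcrClient` class with its made-up account id exists only so the example runs without AWS credentials; in the notebook, the boto3 `ecr_client` would be passed instead.

```python
def get_or_create_repository_uri(ecr_client, repository_name):
    """Return the repository URI, creating the repository only if it is missing."""
    try:
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]


class _FakeEcrClient:
    """Stand-in for boto3's ECR client so the sketch runs offline."""

    class exceptions:
        class RepositoryAlreadyExistsException(Exception):
            pass

    def __init__(self):
        self._repos = {}

    def create_repository(self, repositoryName):
        if repositoryName in self._repos:
            raise self.exceptions.RepositoryAlreadyExistsException(repositoryName)
        # Made-up account id, for illustration only
        uri = f"123456789012.dkr.ecr.us-west-2.amazonaws.com/{repositoryName}"
        self._repos[repositoryName] = uri
        return {"repository": {"repositoryUri": uri}}

    def describe_repositories(self, repositoryNames):
        return {"repositories": [{"repositoryUri": self._repos[n]} for n in repositoryNames]}


client = _FakeEcrClient()
first = get_or_create_repository_uri(client, "next-token-predictor")
second = get_or_create_repository_uri(client, "next-token-predictor")  # takes the "already exists" path
print(first == second)
```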


In [None]:
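# Authenticate Docker to push to ECR (log in to the registry host, the part of the URI before the first "/")
registry_host = repository_uri.split('/')[0]
!aws ecr get-login-password --region {aws_region} | docker login --username AWS --password-stdin {registry_host}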


In [None]:
# Tag your Docker image with the ECR repository URI
!docker tag next-token-predictor:latest {repository_uri}:latest


In [None]:
# Push the Docker image to ECR
!docker push {repository_uri}:latest
