# Fine-tuning GPT-2 for Tweet Token Prediction with TensorFlow

In this notebook we fine-tune GPT-2, a pre-trained transformer language model, to predict the next token in sequences of tweet data. We use TensorFlow and the Hugging Face Transformers library, and train on our dataset of preprocessed tweets.

Our goal? To fine-tune a pre-trained GPT-2 model so that it predicts the next token in a tweet, building on what it learned during pre-training. This model will serve as a counterpart to our LSTM model, so the two can be compared in an A/B test to determine which better suits our specific NLP task.

Let's get started.



## Setup and Dependencies
First, ensure that the Hugging Face `transformers` library is installed, along with TensorFlow and pandas. If not, you can install them using pip, for example `pip install transformers tensorflow pandas`.

In [None]:
# Import necessary libraries
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
import tensorflow as tf
import pandas as pd
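
Before going further, it can help to confirm which versions of these packages are installed, since the TensorFlow model classes depend on the installed Transformers release. A minimal sketch using only the standard library; the helper name `installed_versions` is ours and not part of any library:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Note: some platforms ship TensorFlow under a different
            # distribution name, which would also show up as None here.
            versions[name] = None
    return versions

versions = installed_versions(["transformers", "tensorflow", "pandas"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```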

## Dataset Preparation

Our first step is to prepare our dataset of preprocessed tweets for the GPT-2 model. This involves loading the dataset, ensuring it's in the correct format for tokenization, and then creating TensorFlow datasets for training.


In [None]:
# Load the dataset
dataset_path = 'preprocessed_tweets.csv'
tweets_df = pd.read_csv(dataset_path)

# For simplicity, we concatenate all tweets into a single text corpus.
# Drop missing values first so that join() only sees strings.
tweets_text = ' '.join(tweets_df['cleaned_text'].dropna().astype(str).tolist())
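
Joining tweets with a space erases where one tweet ends and the next begins. One common alternative, sketched here as an assumption rather than what this notebook does, is to end each tweet with GPT-2's end-of-text marker `<|endoftext|>` so the model can learn tweet boundaries. The helper `build_corpus` and the sample tweets are hypothetical:

```python
EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text marker

def build_corpus(texts, separator=EOS_TOKEN):
    """Join non-empty tweet strings, ending each one with the separator."""
    # Skip non-string entries (such as missing values) and blank strings.
    cleaned = [t.strip() for t in texts if isinstance(t, str) and t.strip()]
    return "".join(tweet + separator for tweet in cleaned)

sample_tweets = ["good morning", "  ", None, "coffee time"]  # made-up examples
corpus = build_corpus(sample_tweets)
print(corpus)
```

With the real data, `build_corpus(tweets_df['cleaned_text'].tolist())` would take the place of the `' '.join(...)` call.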


## Tokenization and TensorFlow Dataset

Tokenizing our text data is crucial for transforming it into a format that GPT-2 can understand. We'll then split the tokenized data into training examples and create a TensorFlow dataset.


In [None]:
# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

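# Tokenize the tweets. The tokenizer may warn that the sequence is longer than
# GPT-2's 1024-token limit; that is expected, since we split it into short chunks below.
tokens = tokenizer.encode(tweets_text, return_tensors='tf')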

# Organize our data into TensorFlow datasets
# Here, we're making a simple sequence dataset for demonstration purposes
SEQ_LENGTH = 128  # Sequence length to train on
BUFFER_SIZE = 10000
BATCH_SIZE = 16

# Split each chunk of SEQ_LENGTH+1 tokens into model inputs and next-token targets
def map_func(input_ids):
    return {'input_ids': input_ids[:-1]}, input_ids[1:]

# Create the TensorFlow dataset: chunk the token stream, build (input, target) pairs,
# then shuffle, batch, and prefetch
dataset = tf.data.Dataset.from_tensor_slices(tokens[0])
sequences = dataset.batch(SEQ_LENGTH+1, drop_remainder=True)
dataset = sequences.map(map_func)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
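
To make the chunk-and-shift step concrete, here is a plain-Python sketch of the same idea on a toy list of integers standing in for token ids. It mirrors `batch(SEQ_LENGTH+1, drop_remainder=True)` followed by `map_func`, but the function name `make_examples` is ours and the code is only an illustration:

```python
def make_examples(token_ids, seq_length):
    """Split a flat list of token ids into (input, target) pairs shifted by one."""
    chunk = seq_length + 1  # one extra token so inputs and targets both have seq_length items
    examples = []
    # Step through non-overlapping chunks and drop any incomplete remainder.
    for start in range(0, len(token_ids) - chunk + 1, chunk):
        window = token_ids[start:start + chunk]
        examples.append((window[:-1], window[1:]))
    return examples

toy_ids = list(range(10))  # stand-in for real token ids
examples = make_examples(toy_ids, seq_length=4)
for inputs, targets in examples:
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [5, 6, 7, 8] -> [6, 7, 8, 9]
```

Each target sequence is the input sequence moved one position to the right, so at every position the model is trained to predict the token that follows.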


## Model Initialization and Fine-tuning

Now, we'll load the pre-trained GPT-2 model and prepare it for fine-tuning with our tweet data.


In [None]:
# Load the pre-trained GPT-2 model
model = TFGPT2LMHeadModel.from_pretrained('gpt2')

# Prepare the model for training
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)

# Fine-tune the model
EPOCHS = 4

model.fit(dataset, epochs=EPOCHS)
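
The loss that `model.fit` reports is a mean cross-entropy per token, and exponentiating it gives perplexity, a common way to read language-model loss. A small sketch; the loss values below are placeholders for illustration, not results from this notebook:

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

placeholder_losses = [4.0, 3.5, 3.0]  # illustrative values only
for epoch, loss_value in enumerate(placeholder_losses, start=1):
    print(f"epoch {epoch}: loss={loss_value:.2f} perplexity={perplexity(loss_value):.1f}")
```

In the notebook, the per-epoch losses would come from `history = model.fit(...)` via `history.history['loss']`. Note that per-token perplexities are only directly comparable between models that share the same tokenizer.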


## Saving the Fine-tuned Model

After fine-tuning, it's essential to save our model for future use, whether for further training, evaluation, or deployment.


In [None]:
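# Save the fine-tuned model and tokenizer in Hugging Face format.
# saved_model=True additionally exports a TensorFlow SavedModel to
# fine_tuned_gpt2/saved_model/1, the format that TensorFlow Serving loads.
model.save_pretrained("fine_tuned_gpt2", saved_model=True)
tokenizer.save_pretrained("fine_tuned_gpt2")


## Containerizing the Fine-tuned GPT-2 Model

To deploy our fine-tuned GPT-2 model, we'll containerize it using Docker. This involves creating a Dockerfile that specifies the environment and dependencies needed to run our model, building a Docker image based on this Dockerfile, and testing it locally to ensure everything is set up correctly. The cell below shows the contents of a file named `Dockerfile`; save it next to the `fine_tuned_gpt2` directory rather than running it as Python.


In [1]:
# Use TensorFlow Serving image as the base image
FROM tensorflow/serving

# Copy the exported SavedModel to the container as version 1 of the model "gpt2"
COPY ./fine_tuned_gpt2/saved_model/1 /models/gpt2/1

# Set environment variables
ENV MODEL_NAME=gpt2
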


In [None]:
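# Run these commands in a terminal (or prefix each with `!` in a notebook cell)

# Build the Docker image from the directory containing the Dockerfile
docker build -t gpt2-tweet-predictor .

# Run the Docker container locally, exposing TensorFlow Serving's REST API on port 8501
docker run -p 8501:8501 --name=gpt2_model_container gpt2-tweet-predictor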
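
Once the container is running, TensorFlow Serving exposes a REST endpoint of the form `/v1/models/<model name>:predict` on port 8501. The sketch below builds such a request with the standard library. The input names `input_ids` and `attention_mask` are assumptions about the exported serving signature and should be checked against the actual model (for example with `saved_model_cli show`); the token ids are made up.

```python
import json
import urllib.request

def build_predict_request(token_ids, host="localhost", port=8501, model_name="gpt2"):
    """Build (url, JSON body) for a TensorFlow Serving REST predict call."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    payload = {
        "instances": [
            {
                "input_ids": token_ids,                  # assumed input name
                "attention_mask": [1] * len(token_ids),  # assumed input name
            }
        ]
    }
    return url, json.dumps(payload).encode("utf-8")

def send_predict_request(url, body, timeout=10):
    """POST the request body and return the decoded JSON response."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

url, body = build_predict_request([15496, 995])  # made-up token ids
print(url)
# send_predict_request(url, body)  # uncomment once the container is running
```

The model name and port match `MODEL_NAME=gpt2` and the `-p 8501:8501` mapping used above; real token ids would come from the saved tokenizer.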


## Uploading the Docker Image to Amazon ECR

After testing the Docker container locally, the next step is to upload our Docker image to Amazon Elastic Container Registry (ECR) so that it can be deployed on AWS SageMaker.


In [None]:
import boto3

# Replace 'your-region' with your AWS region, e.g., 'us-west-2'
aws_region = 'your-region'
ecr_repository_name = 'gpt2-tweet-predictor'

# Create ECR client
ecr_client = boto3.client('ecr', region_name=aws_region)

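# Create the ECR repository, or look it up if it already exists
try:
    response = ecr_client.create_repository(repositoryName=ecr_repository_name)
    repository_uri = response['repository']['repositoryUri']
except ecr_client.exceptions.RepositoryAlreadyExistsException:
    response = ecr_client.describe_repositories(repositoryNames=[ecr_repository_name])
    repository_uri = response['repositories'][0]['repositoryUri']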

print(f"Repository URI: {repository_uri}")


In [None]:
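# Run these commands in a terminal (or prefix each with `!` in a notebook cell).
# Replace your-region and your-account-id with your own values.

# Login to ECR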
aws ecr get-login-password --region your-region | docker login --username AWS --password-stdin your-account-id.dkr.ecr.your-region.amazonaws.com

# Tag your Docker image with the ECR repository URI
docker tag gpt2-tweet-predictor:latest your-account-id.dkr.ecr.your-region.amazonaws.com/gpt2-tweet-predictor:latest

# Push the Docker image to ECR
docker push your-account-id.dkr.ecr.your-region.amazonaws.com/gpt2-tweet-predictor:latest
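
The registry host and image name in the commands above follow a fixed pattern: `<account id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. A small helper can assemble it once so that the login, tag, and push steps stay consistent. The function name and the sample account id are hypothetical:

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return (registry host, full image URI) for an ECR repository."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return registry, f"{registry}/{repository}:{tag}"

# Placeholder account id and region, for illustration only.
registry, image_uri = ecr_image_uri("123456789012", "us-west-2", "gpt2-tweet-predictor")
print(registry)
print(image_uri)
```

The first value is what `docker login` expects, and the second is the target for `docker tag` and `docker push`.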


## Conclusion

We've now fine-tuned a GPT-2 model for tweet token prediction, containerized it using Docker, and pushed the image to Amazon ECR. From there, the image can be deployed on AWS SageMaker, setting the stage for an A/B test against our LSTM model.
