# Fine-tuning a HuggingFace FLAN-T5 Model on Amazon SageMaker with TensorBoard Integration

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

---

**Author**: Hubert Gabryel

**Date**: 2023-10-05

## Table of Contents

1.	[Introduction](#1-introduction)

	1.1 [Background](#11-background)

	1.2 [Objective](#12-objective)

2.	[Setup](#2-setup)

	2.1 [Import Libraries](#21-import-libraries)

	2.2 [Initialize SageMaker Session and Role](#22-initialize-sagemaker-session-and-role)

	2.3 [Model Configuration](#23-model-configuration)

3.	[Data Preparation](#3-data-preparation)

	3.1 [Download and Prepare the Dataset](#31-download-and-prepare-the-dataset)

	3.2 [Load and Preprocess the Data](#32-load-and-preprocess-the-data)

	3.3 [Prepare the Data for Training](#33-prepare-the-data-for-training)

	3.4 [Upload Data to S3](#34-upload-data-to-s3)

4.	[Training Script Modification](#4-training-script-modification)

	4.1 [Download the Training Script](#41-download-the-training-script)

	4.2 [Modify the Training Script for TensorBoard Integration](#42-modify-the-training-script-for-tensorboard-integration)

5.	[Model Training with TensorBoard Integration](#5-model-training-with-tensorboard-integration)

	5.1 [Set Up TensorBoard Output Configuration](#51-set-up-tensorboard-output-configuration)

	5.2 [Define Hyperparameters](#52-define-hyperparameters)

	5.3 [Create and Fit the Estimator](#53-create-and-fit-the-estimator)

6.	[TensorBoard Visualization](#6-tensorboard-visualization)

	6.1 [Start TensorBoard from the SageMaker Console](#61-start-tensorboard-from-the-sagemaker-console)
	
7.	[Conclusion](#7-conclusion)

8.	[References](#8-references)


## 1. Introduction

### 1.1 Background

In this notebook, we demonstrate how to fine-tune a HuggingFace FLAN-T5 model from Amazon SageMaker JumpStart with TensorBoard integration. This integration lets us monitor and visualize training metrics such as loss as training progresses, which makes it easier to judge how well the model is learning.

### 1.2 Objective

Our goal is to fine-tune the FLAN-T5 small model on a subset of the Tiny Shakespeare dataset and visualize the training metrics using TensorBoard. We will:

- Set up the SageMaker environment and import necessary libraries.
- Prepare the dataset for training.
- Modify the training script to include TensorBoard logging.
- Train the model with TensorBoard integration.
- Visualize the training metrics using TensorBoard.

## 2. Setup

### 2.1 Import Libraries

In [None]:
# Install or upgrade the SageMaker Python SDK
!pip install -U sagemaker --quiet

In [None]:
# Import necessary libraries
import os
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import boto3
import sagemaker
from sagemaker import get_execution_role, script_uris
from sagemaker.s3 import S3Uploader, S3Downloader
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker.debugger import TensorBoardOutputConfig

# Set random seeds for reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

### 2.2 Initialize SageMaker Session and Role

In [None]:
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Verify S3 access
try:
    s3_client = sagemaker_session.boto_session.client("s3")
    s3_client.head_bucket(Bucket=sagemaker_session.default_bucket())
    print("S3 access confirmed.")
except Exception as e:
    print(f"Unable to access S3 bucket: {e}")

### 2.3 Model Configuration

In [4]:
# Model configuration
MODEL_ID = "huggingface-text2text-flan-t5-small"  # Small model to keep training cost low
MODEL_VERSION = "2.1.2"  # Latest model version at the time of writing

## 3. Data Preparation

### 3.1 Download and Prepare the Dataset

We will use the Tiny Shakespeare dataset for this example.

In [None]:
# Download the Tiny Shakespeare dataset
!wget -O input.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

### 3.2 Load and Preprocess the Data

In [None]:
# Read the data
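with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Limit the data to the first MAX_DATA_LENGTH characters
MAX_DATA_LENGTH = 10000
data = data[:MAX_DATA_LENGTH]

# Split the text into contiguous training and validation parts.
# train_test_split would treat the string as a sequence of single characters
# and shuffle them, so we slice at a fixed index to keep the text intact.
TEST_SIZE = 0.2
split_index = int(len(data) * (1 - TEST_SIZE))
train_text, val_text = data[:split_index], data[split_index:]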

print(f"Training data length: {len(train_text)}")
print(f"Validation data length: {len(val_text)}")

### 3.3 Prepare the Data for Training

We need to format the data into prompts and completions.

In [None]:
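def prepare_data(text, sequence_length=256, prompt_length=128):
    """Slice text into prompt/completion pairs.

    Each window is sequence_length characters long: the first prompt_length
    characters form the prompt and the remainder forms the completion.
    Windows start every prompt_length characters, so consecutive windows overlap.
    """
    data = []
    max_index = len(text) - sequence_length + 1
    for i in range(0, max_index, prompt_length):
        prompt = text[i : i + prompt_length]
        completion = text[i + prompt_length : i + sequence_length]
        if len(completion) == (sequence_length - prompt_length):
            data.append({"prompt": prompt, "completion": completion})
    return data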


# Prepare the training and validation data
train_data = prepare_data(train_text)
val_data = prepare_data(val_text)

print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(val_data)}")
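
To make the windowing concrete, here is a small sketch that runs the same slicing logic on a toy string. The function is restated in compact form so the snippet runs on its own; the toy input and the small window sizes are illustrative assumptions, not values used for training.

```python
def prepare_data(text, sequence_length=256, prompt_length=128):
    # Same slicing logic as the cell above, restated so this snippet is standalone
    samples = []
    for i in range(0, len(text) - sequence_length + 1, prompt_length):
        samples.append(
            {
                "prompt": text[i : i + prompt_length],
                "completion": text[i + prompt_length : i + sequence_length],
            }
        )
    return samples


# Toy input: 12 characters, windows of 8 characters split into 4 + 4
toy_samples = prepare_data("abcdefghijkl", sequence_length=8, prompt_length=4)
for sample in toy_samples:
    print(sample)
```

Because the window advances by `prompt_length`, the completion of one sample becomes the prompt of the next. Text shorter than `sequence_length` yields no samples at all.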

### 3.4 Upload Data to S3

In [None]:
# Define S3 bucket and prefix
bucket = sagemaker_session.default_bucket()
data_prefix = "jumpstart-example-data"

# Save the data to local files
pd.DataFrame(train_data).to_json("train.jsonl", orient="records", lines=True)
pd.DataFrame(val_data).to_json("val.jsonl", orient="records", lines=True)

# Upload training data (upload_data appends the file name to key_prefix)
train_s3_uri = sagemaker_session.upload_data(
    path="train.jsonl", bucket=bucket, key_prefix=f"{data_prefix}/train"
)

# Upload validation data
val_s3_uri = sagemaker_session.upload_data(
    path="val.jsonl", bucket=bucket, key_prefix=f"{data_prefix}/validation"
)

print(f"Training data uploaded to: {train_s3_uri}")
print(f"Validation data uploaded to: {val_s3_uri}")
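
Before uploading, it can help to confirm that each JSON Lines file parses and that every record carries the `prompt` and `completion` keys produced above. The following is a minimal sketch; `validate_jsonl` is a hypothetical helper written for this notebook, not part of the SageMaker SDK, and the demo file is a throwaway created only for illustration.

```python
import json
import os
import tempfile


def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Return the number of records; raise ValueError on a malformed record."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises a ValueError subclass on bad JSON
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"Line {line_number} is missing keys: {missing}")
            count += 1
    return count


# Demo on a throwaway file; in the notebook, pass "train.jsonl" or "val.jsonl"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": "To be, or not", "completion": " to be"}) + "\n")
    f.write(json.dumps({"prompt": "All the world's", "completion": " a stage"}) + "\n")

demo_count = validate_jsonl(demo_path)
print(f"{demo_count} valid records")
```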

## 4. Training Script Modification

### 4.1 Download the Training Script

We need to obtain the default training script provided by the JumpStart model and modify it to integrate TensorBoard.

In [15]:
import tarfile

# Retrieve the training script URI
train_script_uri = script_uris.retrieve(
    model_id=MODEL_ID, model_version=MODEL_VERSION, script_scope="training"
)

# Download the training script (S3Downloader was imported in section 2.1)
S3Downloader.download(train_script_uri, ".")

# Unpack the training script
with tarfile.open("./sourcedir.tar.gz") as tar:
    tar.extractall("./training_script")
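
The layout of the extracted archive can differ between model versions, so it may be worth locating the file that actually constructs the trainer before editing anything. Here is a rough sketch; `find_files_containing` is a hypothetical helper, and the demo directory stands in for `training_script`.

```python
import tempfile
from pathlib import Path


def find_files_containing(root, needle, pattern="*.py"):
    """Return sorted relative paths of files under root whose text contains needle."""
    root = Path(root)
    matches = []
    for path in root.rglob(pattern):
        if path.is_file() and needle in path.read_text(encoding="utf-8", errors="ignore"):
            matches.append(str(path.relative_to(root)))
    return sorted(matches)


# Demo on a throwaway directory; in the notebook, pass "training_script" as root
demo_root = Path(tempfile.mkdtemp())
(demo_root / "a.py").write_text("trainer = Seq2SeqTrainer(model=None)\n", encoding="utf-8")
(demo_root / "b.py").write_text("print('no trainer here')\n", encoding="utf-8")

demo_matches = find_files_containing(demo_root, "Seq2SeqTrainer(")
print(demo_matches)
```

Running the same search for `Seq2SeqTrainingArguments(` shows where the logging arguments would need to be added.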

### 4.2 Modify the Training Script for TensorBoard Integration

We need to modify the train.py script to include TensorBoard logging.

- Import the TensorBoardCallback:
    In training_script/train.py, add:

```python
from transformers.integrations import TensorBoardCallback
```

- Modify the Seq2SeqTrainingArguments to include TensorBoard parameters:

```python
training_args = Seq2SeqTrainingArguments(
    # ... other arguments ...
    logging_dir="/opt/ml/output/tensorboard",
    report_to=["tensorboard"],
    # ... other arguments ...
)
```

- Add the TensorBoardCallback to the trainer:

```python
if callbacks is None:                    # Added line
    callbacks = []                       # Added line
callbacks.append(TensorBoardCallback())  # Added line

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset[constants.TRAIN],
    eval_dataset=dataset[constants.VALIDATION],
    data_collator=data_collator,
    callbacks=callbacks,
)
```

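Note: The `logging_dir` path in the training script (`/opt/ml/output/tensorboard`) must match the `container_local_output_path` passed to `TensorBoardOutputConfig`. SageMaker uploads TensorBoard logs from that container path, so logs written elsewhere will not reach S3.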
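
One way to guard against a path mismatch is to check the edited script's text before launching the job. The sketch below is only an illustration: `logging_dir_matches` is a hypothetical helper, not part of the SageMaker SDK or transformers, and it only recognizes `logging_dir` given as a string literal.

```python
import re

# Path shared by the training script and TensorBoardOutputConfig
TENSORBOARD_DIR = "/opt/ml/output/tensorboard"


def logging_dir_matches(script_text, expected_dir):
    """Return True if the script sets logging_dir and every literal value equals expected_dir."""
    found = re.findall(r"logging_dir\s*=\s*[\"']([^\"']+)[\"']", script_text)
    return bool(found) and all(value == expected_dir for value in found)


# Demo on an inline snippet; in the notebook, read the edited script file instead
demo_script = '''
training_args = Seq2SeqTrainingArguments(
    logging_dir="/opt/ml/output/tensorboard",
    report_to=["tensorboard"],
)
'''
demo_ok = logging_dir_matches(demo_script, TENSORBOARD_DIR)
print(demo_ok)
```

The function returns `False` both when the path differs and when no `logging_dir` literal is found, so a script that was never edited is flagged as well.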

## 5. Model Training with TensorBoard Integration

### 5.1 Set Up TensorBoard Output Configuration

In [None]:
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://{bucket}/tensorboard-output",
    container_local_output_path="/opt/ml/output/tensorboard",  # Must match logging_dir in the training script
)

print(f"TensorBoard logs will be saved to: s3://{bucket}/tensorboard-output")

### 5.2 Define Hyperparameters

In [17]:
hyperparameters = {
    "epochs": "5",
    "batch_size": "4",
    "learning_rate": "5e-5",
    "logging_strategy": "steps",
    "logging_steps": "5",
    "evaluation_strategy": "steps",
    "save_strategy": "steps",
    "eval_steps": "25",
    "save_steps": "25",
    "gradient_accumulation_steps": "1",
    "fp16": "true",
    "bf16": "false",
}
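
With step-based logging, the number of points that appear in TensorBoard depends on the dataset size as well as on `logging_steps` and `eval_steps`. The sketch below estimates those counts for a single-GPU job. The sample count of 60 is a made-up figure for illustration, and the arithmetic is an approximation of how the trainer counts optimizer steps, not an exact reproduction.

```python
import math


def expected_log_points(num_samples, epochs, batch_size, grad_accum_steps,
                        logging_steps, eval_steps):
    """Estimate total optimizer steps and how many train/eval points get logged."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One optimizer step per grad_accum_steps batches, and at least one per epoch
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "total_steps": total_steps,
        "train_log_points": total_steps // logging_steps,
        "eval_points": total_steps // eval_steps,
    }


# Hypothetical dataset of 60 samples combined with the hyperparameters above
stats = expected_log_points(
    num_samples=60, epochs=5, batch_size=4, grad_accum_steps=1,
    logging_steps=5, eval_steps=25,
)
print(stats)
```

If the estimate shows only a handful of evaluation points, lowering `eval_steps` gives a smoother validation curve at the cost of more evaluation time.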

### 5.3 Create and Fit the Estimator

In [None]:
estimator = JumpStartEstimator(
    model_id=MODEL_ID,
    model_version=MODEL_VERSION,
    instance_type="ml.g5.xlarge",
    hyperparameters=hyperparameters,
    entry_point="transfer_learning.py",  # Name of main script
    source_dir="training_script",  # Directory containing your scripts
    tensorboard_output_config=tensorboard_output_config,
)

# Start the training job
estimator.fit({"train": train_s3_uri, "validation": val_s3_uri})

## 6. TensorBoard Visualization

### 6.1 Start TensorBoard from the SageMaker Console

1. Navigate to the SageMaker Console:
    - Go to the Amazon SageMaker Console.
2. Access TensorBoard:
    - In the left-hand navigation pane, click on Applications and IDEs.
    - Select TensorBoard.
3. Open TensorBoard:
    - Click on Open TensorBoard to launch the TensorBoard landing page.
4. Add Your Training Job:
    - On the TensorBoard page, click on Add job.
    - Select your most recent completed training job from the list.
5. View Training Metrics:
    - After the data loads, navigate to the Scalars tab.
    - Here, you can see charts and graphs of your training metrics.


## 7. Conclusion

In this notebook, we demonstrated how to fine-tune a HuggingFace FLAN-T5 model using Amazon SageMaker with TensorBoard integration. We prepared a subset of the Tiny Shakespeare dataset, modified the training script to include TensorBoard logging, and visualized the training metrics.

Next Steps:

- Experiment with Hyperparameters: Adjust learning rates, batch sizes, and other hyperparameters to improve model performance.
- Use a Larger Dataset: Try using a larger dataset for better results.
- Deploy the Model: After training, deploy the model using SageMaker’s deployment capabilities for inference.

## 8. References

- [Amazon SageMaker Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-htb-prepare-training-job.html)
- [TensorBoard Documentation](https://www.tensorflow.org/tensorboard/get_started)
- [Tiny Shakespeare Dataset](https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare)
- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments)
- [SageMaker JumpStart Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html)



## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/      build_and_train_models|sm-jumpstart_tensorflow_finetune_flan|sm-jumpstart_tensorflow_finetune_flan.ipynb)

