# Build an Object Detection Model in TensorFlow: Model Training and Evaluation




## Background

This notebook is part of a sequence of notebooks that show how to use SageMaker features to build, train, and test an object detection model, including data pre-processing steps such as ingestion and cleaning. The demo has two parts:

1. Overview and Data Preparation: you will preprocess the data, then create a JSON file from the cleaned data. By the end of part 1, you will have a complete dataset that contains all the features used for object detection, ready to be ingested by a data loader in *[TensorFlow](https://github.com/tensorflow/tensorflow)* using 'TFRecords'.
1. Data Loader Creation and Model Training (current notebook): you will use the dataset built in part 1 to create a data loader for TensorFlow using *[Keras CV](https://github.com/keras-team/keras-cv)*, train the model, and then test the model's predictions on the test data.


## Content
* [Overview](#Overview)
* [Using TensorFlow Data loaders with 'TFRecords'](#Using-TensorFlow-Data-loaders-with-'TFRecords')
* [Model Selection](#Model-Selection)
* [Training Object Detection Models with SageMaker and TensorFlow](#Training-Object-Detection-Models-with-SageMaker-and-TensorFlow)

## Overview

### What is Object Detection, and why is it important?

Object detection refers to detecting instances of objects from certain classes in images or videos. It allows for multiple objects to be detected and localized in an image. Object detection is commonly used in applications such as self-driving cars, face detection, video surveillance, etc.  

### Use Cases for Object Detection

Some common use cases of object detection include:

- Self-driving cars - detect pedestrians, cars, traffic signs, etc.
- Face detection - detect faces in images and videos for applications like security and tagging people in images.
- Video surveillance - detect suspicious activities or objects.
- Medical imaging - detect anomalies, tumors, etc. in medical scans.
- Retail - detect objects on shelves for inventory management.

### Define the Machine Learning Problem  

Object detection can be formulated as a supervised machine learning problem:

- Given a set of labelled images containing objects from certain classes, train a model to detect the presence and location of those objects in new images.

- The model needs to identify the class of objects present and draw bounding boxes around them indicating their locations.

### Data Requirements

- Large dataset of images with object annotation - Object locations are annotated using bounding boxes around them.

- Variety of images - Objects captured under different conditions of illumination, scales, occlusion, viewpoints etc. 

### Challenges

- Data annotation - Time-consuming and expensive process.

- Class imbalance - Models tend to perform better for classes with more examples.

- Viewpoint variation - Objects look different from different angles and viewpoints. 

- Background clutter - Objects may blend with their surroundings.

- Small objects - Harder to detect smaller objects.

- Occlusion - Objects hidden behind other things are tougher to detect.

In [None]:
# Tensorflow for CPU
#!pip install keras-cv tensorflow --upgrade
!pip install --upgrade pip --quiet
# Tensorflow for GPU
#!pip install keras-cv tensorflow[and-cuda] --upgrade --quiet
#!pip install keras-cv --upgrade --quiet
# Tensorflow without GPU
!pip uninstall -y tensorflow  # -y skips the confirmation prompt, which a notebook cell cannot answer
!pip install keras-cv tensorflow~=2.15.0 --ignore-installed --upgrade --quiet
!pip install sagemaker botocore boto3 awscli --upgrade --quiet
!pip install tensorrt --quiet
#!pip install Pillow==9.5.0
#!pip install pickleshare --quiet

### Parameters 
The following cells define the configurable parameters that are used throughout the notebook.

In [None]:
import sagemaker
import json
import pandas as pd
import glob
import boto3

In [None]:
# Create a SageMaker session
sagemaker_session = sagemaker.Session()
# Get the default S3 bucket associated with your SageMaker session
bucket = sagemaker_session.default_bucket()  # replace with your own bucket name if you have one
# Create an S3 resource client
s3 = boto3.resource("s3")
# Get the AWS region name
region = boto3.Session().region_name
# Get the execution role for SageMaker
role = sagemaker.get_execution_role()
# Create a SageMaker client
smclient = boto3.Session().client("sagemaker")
# Set a prefix for your S3
prefix = "object-detection-tensorflow"

## Using TensorFlow Data loaders with 'TFRecords'

### Loading data from S3

As a first step, we need to load the 'TFRecord' files that were generated during the preprocessing job and saved to S3. TensorFlow IO functions can stream data directly from S3, but in this notebook we download the files to local storage and read them from there.



The code starts by loading variables from the previous notebook using the '%store -r' magic command:

In [None]:
# read variables from previous notebook
# The output URI of the previous data processing job. This will be used to load the preprocessed data.
processing_job_output_uri = f"s3://{bucket}/{prefix}/data/processing/output"
# This is a mapping between the class labels and their numerical representations, created during data preprocessing.
%store -r class_mapping
# The size of the training dataset, usually determined during data splitting.
%store -r training_dataset_size
# The size of the validation dataset, usually determined during data splitting.
%store -r validation_dataset_size

The next cell generates a class_mapping dictionary that maps class IDs to their corresponding class names, replacing the value restored above. This mapping will be used later for visualization purposes.

In [None]:
# Define a list of class names
class_ids = [
    "aeroplane",
    "bicycle",
    "bird",
    "boat",
    "bottle",
    "bus",
    "car",
    "cat",
    "chair",
    "cow",
    "diningtable",
    "dog",
    "horse",
    "motorbike",
    "person",
    "pottedplant",
    "sheep",
    "sofa",
    "train",
    "tvmonitor",
]
# Create a dictionary that maps class indices (keys) to their corresponding class names (values)
class_mapping = dict(zip(range(len(class_ids)), class_ids))

This code cell imports TensorFlow and enables memory growth on any available GPUs. It then creates a SageMaker session, gets the default S3 bucket associated with the session, and instantiates an S3 client to interact with the S3 API. Finally, it defines the S3 prefixes where the preprocessed training and validation data are stored in 'TFRecord' format.

In [None]:
# Import necessary libraries
import boto3
import sagemaker

# import tensorflow_io
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

sagemaker_session = (
    sagemaker.Session()
)  # Create SageMaker session to interact with SageMaker resources
bucket = sagemaker_session.default_bucket()  # Get the default S3 bucket for this SageMaker session
s3_client = boto3.client("s3")  # Create an S3 client to interact with S3 buckets and objects

# S3 prefix where preprocessed data is stored for training/validation
prefix_train = "object-detection-tensorflow/data/processing/output/tfrecords/train/"  # S3 prefix for training data
prefix_val = "object-detection-tensorflow/data/processing/output/tfrecords/val/"  # S3 prefix for validation data

The next cells download the data and prepare the datasets for training and validation. They generate lists of local file paths for the 'TFRecord' files and then load those files into TensorFlow datasets, which can be used for training or evaluating the object detection model.

In [None]:
import glob

# Search all files inside a specific folder
# *.* matches any file name that contains an extension
# Create an empty list to store the locations of the training files
filenames_train = []
# Create an empty list to store the locations of the validation files
filenames_val = []

# Define the directory paths for train and validation data
dir_path_train = "./data/tfrecords/train/*.*"
dir_path_val = "./data/tfrecords/val/*.*"
dir_path = "./data/tfrecords/"

# Download the processing job output from S3 to the local directory
sagemaker.s3.S3Downloader.download(processing_job_output_uri, dir_path)

# Iterate through all files in the train directory and append their paths to the filenames_train list
for file in glob.glob(dir_path_train, recursive=True):
    filenames_train.append(file)

# Iterate through all files in the validation directory and append their paths to the filenames_val list
for file in glob.glob(dir_path_val, recursive=True):
    filenames_val.append(file)
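
The glob patterns used above can be tried out in isolation. The following is a minimal sketch on a throwaway directory (the file names are made up for illustration). It suggests that `*.*` only matches files directly inside the folder, and that `recursive=True` makes a difference only when the pattern contains `**`.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree; the file names are hypothetical.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
os.makedirs(os.path.join(train_dir, "nested"))
for rel in ("part-0.tfrecord", "part-1.tfrecord", os.path.join("nested", "part-2.tfrecord")):
    open(os.path.join(train_dir, rel), "wb").close()

# "*.*" matches names containing a dot directly under train/.
flat = sorted(glob.glob(os.path.join(train_dir, "*.*"), recursive=True))

# "**" together with recursive=True also descends into subdirectories.
deep = sorted(glob.glob(os.path.join(train_dir, "**", "*.tfrecord"), recursive=True))

print(len(flat), len(deep))
```

If the processing job ever writes its shards into subfolders, a `**` pattern like the second one would be needed to pick them up.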

In [None]:
# Load the training/validation TFRecords dataset from the files stored in the filenames list
dataset = tf.data.TFRecordDataset(filenames_train, num_parallel_reads=tf.data.experimental.AUTOTUNE)
val_dataset = tf.data.TFRecordDataset(
    filenames_val, num_parallel_reads=tf.data.experimental.AUTOTUNE
)

### Using data loaders and parsing 'TFRecords'
Next we can create a data loader to parse the 'TFRecord' examples and create batches of images and labels for visualization.

This cell sets up the necessary functions and imports for working with 'TFRecord' data in the context of object detection. The 'parse_tfrecord_fn' function defines how to parse the 'TFRecord' file format, which stores the image data and bounding box annotations. It reads the features from the 'TFRecord' file, decodes the image data, and creates a dictionary with the image and bounding box information. The prepare_sample function is a helper function that formats the parsed data into the expected format for the object detection model. The 'plot_boxes_tfrecords' function is a utility for visualizing the bounding boxes on the images.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image, ImageDraw, ImageFont


# Function to plot bounding boxes on an image
def plot_boxes_tfrecords(features):
    # Open the image
    data = features["images"].numpy().astype(np.uint8)
    image = Image.fromarray(data, "RGB")

    # Create a drawing object
    draw = ImageDraw.Draw(image)

    # Iterate over the bounding boxes in the features dictionary
    for index in range(len(features["bounding_boxes"]["boxes"])):
        # print(index)
        # Extract the coordinates of the current bounding box
        box = features["bounding_boxes"]["boxes"][index]
        x1, y1, x2, y2 = box
        category = class_mapping[int(features["bounding_boxes"]["classes"][index])]
        draw.rectangle([(x1, y1), (x2, y2)], outline="red", width=2)
        # Measure the label; textbbox replaces draw.textsize, which was removed in Pillow 10
        _, top, _, bottom = draw.textbbox((0, 0), category)
        label_height = bottom - top
        label_x = x1
        label_y = y1 - label_height - 5  # Adjust the offset as needed

        # Draw the label
        draw.text((label_x, label_y), category, fill="red")

    # Display the image
    image.show()


# Parses a TFRecord example and returns a dictionary.
def parse_tfrecord_fn(example):
    # Define the feature description for parsing the TFRecord example
    feature_description = {
        "height": tf.io.FixedLenFeature((), tf.int64),
        "width": tf.io.FixedLenFeature((), tf.int64),
        "filename": tf.io.FixedLenFeature((), tf.string),
        "image": tf.io.FixedLenFeature((), tf.string),
        "object/bbox/xmin": tf.io.VarLenFeature(tf.float32),
        "object/bbox/xmax": tf.io.VarLenFeature(tf.float32),
        "object/bbox/ymin": tf.io.VarLenFeature(tf.float32),
        "object/bbox/ymax": tf.io.VarLenFeature(tf.float32),
        "object/text": tf.io.VarLenFeature(tf.string),
        "object/label": tf.io.VarLenFeature(tf.int64),
    }
    # Parse the example using the feature description
    example = tf.io.parse_single_example(example, feature_description)
    # Decode the JPEG image data and convert it to float32
    example["image"] = tf.cast(tf.io.decode_jpeg(example["image"], channels=3), tf.float32)
    # Convert the filename to a string
    example["filename"] = tf.cast(example["filename"], tf.string)
    # Convert the sparse tensors to dense tensors
    example["object/bbox/xmin"] = tf.sparse.to_dense(example["object/bbox/xmin"])
    example["object/bbox/xmax"] = tf.sparse.to_dense(example["object/bbox/xmax"])
    example["object/bbox/ymin"] = tf.sparse.to_dense(example["object/bbox/ymin"])
    example["object/bbox/ymax"] = tf.sparse.to_dense(example["object/bbox/ymax"])
    example["object/text"] = tf.sparse.to_dense(example["object/text"])
    example["object/label"] = tf.sparse.to_dense(example["object/label"])
    # Combine the bounding box coordinates into a single tensor
    example["object/bbox"] = tf.stack(
        [
            example["object/bbox/xmin"],
            example["object/bbox/ymin"],
            example["object/bbox/xmax"],
            example["object/bbox/ymax"],
        ],
        axis=1,
    )
    return example


# Prepares a sample dictionary for input to the TensorFlow model
def prepare_sample(inputs):
    image = inputs["image"]
    boxes = inputs["object/bbox"]
    labels = inputs["object/label"]
    bounding_boxes = {
        "classes": inputs["object/label"],
        "boxes": boxes,
    }
    return {"images": image, "bounding_boxes": bounding_boxes}
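
To make the box layout concrete, here is a small pure-Python sketch of what stacking the four coordinate tensors along `axis=1` produces. The coordinate values and label ids are made up for illustration, and plain lists stand in for tensors.

```python
# Per-coordinate lists as they might come out of a parsed record (made-up values).
xmin = [10.0, 50.0]
ymin = [20.0, 60.0]
xmax = [110.0, 90.0]
ymax = [220.0, 120.0]
labels = [11, 14]  # hypothetical class ids

# Equivalent of stacking along axis=1: one [xmin, ymin, xmax, ymax] row per object.
boxes = [list(row) for row in zip(xmin, ymin, xmax, ymax)]

# Same nesting as the dictionary returned by prepare_sample (without the image).
sample = {"bounding_boxes": {"classes": labels, "boxes": boxes}}
print(sample["bounding_boxes"]["boxes"])
```

Each row follows the 'xyxy' order that the Keras CV layers later in the notebook are configured with.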

This code prepares the 'TFRecord' dataset to test an object detection model. It loads the dataset, applies necessary preprocessing, shuffles the data, and retrieves a sample for visualization.

In [None]:
# Set the batch size for data loader
BATCH_SIZE = 4

# Load the TFRecord dataset and parse it using the parse_tfrecord_fn function
raw_dataset_sample = dataset.map(parse_tfrecord_fn)

# Apply data preparation and preprocessing steps to the parsed dataset
parsed_dataset_sample = raw_dataset_sample.map(lambda x: prepare_sample(x))

# Shuffle the prepared dataset
sample_ds = parsed_dataset_sample.shuffle(BATCH_SIZE * 4)

# Repeat the sample dataset indefinitely
sample_ds = sample_ds.repeat()

# Get a sample from the shuffled dataset
data_not_augmented = next(iter(sample_ds.take(1)))

# Visualize the sample with its bounding boxes
plot_boxes_tfrecords(data_not_augmented)

### Data Augmentation
One major benefit of using TensorFlow data loaders is that we can easily apply data augmentation. This helps prevent overfitting and improves the robustness of the model.

Some common augmentation techniques for object detection include:

* Random horizontal/vertical flipping
* Random cropping
* Color jittering
* Adding noise

These can be implemented using the Keras CV layers for computer vision use cases:

In [None]:
import tensorflow as tf
import keras
import keras_cv
from tensorflow import data as tf_data
from keras_cv import visualization

# Using sequential layers to add augmentation to all samples
augmenter = keras.Sequential(
    layers=[
        # This layer randomly flips the input images horizontally. The bounding_box_format parameter
        # specifies the format of the bounding box coordinates, which is "xyxy" in this case.
        keras_cv.layers.RandomFlip(mode="horizontal", bounding_box_format="xyxy"),
        # This layer randomly resizes the input images while maintaining the aspect ratio.
        keras_cv.layers.JitteredResize(
            target_size=(640, 640), scale_factor=(0.75, 1.3), bounding_box_format="xyxy"
        ),
    ]
)

# using an augmentation pipeline to add augmentation randomly to some samples
pipeline = keras_cv.layers.RandomAugmentationPipeline(
    layers=[
        # This layer applies a grid-like mask to the input images, which can help the model generalize better.
        keras_cv.layers.GridMask(
            ratio_factor=(0, 0.3),
        ),
        # This layer randomly applies color degeneration to the input images
        keras_cv.layers.RandomColorDegeneration(0.5),
        # This layer randomly adjusts the saturation of the input images
        keras_cv.layers.RandomSaturation(0.8),
    ],
    # This parameter specifies the number of augmentations to apply to each input image.
    augmentations_per_image=3,
)


# Apply the random augmentation pipeline to the images only; these layers do not move the bounding boxes
def apply_pipeline(inputs):
    inputs["images"] = pipeline(inputs["images"])
    return inputs


# This line applies the random augmentation pipeline to the sample dataset.
sample_ds_augmented = sample_ds.map(apply_pipeline, num_parallel_calls=tf.data.AUTOTUNE)

# This line applies the sequential augmentation layers to the augmented sample dataset.
sample_ds_augmented = sample_ds_augmented.map(augmenter, num_parallel_calls=tf.data.AUTOTUNE)

# This line gets the first batch of augmented data from the sample dataset.
data_augmented = next(iter(sample_ds_augmented.take(1)))

# This line plots the bounding boxes on the augmented images.
plot_boxes_tfrecords(data_augmented)

Adding augmentation during training helps prevent overfitting and makes the model more robust to variations in input images.
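
Geometric augmentations have to transform the boxes together with the pixels, which is why layers such as 'RandomFlip' take a bounding_box_format argument. The following is a minimal pure-Python sketch of the box update for a horizontal flip. It is an illustration only, not the Keras CV implementation, and the function name is hypothetical.

```python
def flip_boxes_horizontally(boxes, image_width):
    """Mirror xyxy boxes around the vertical centre line of the image."""
    flipped = []
    for x_min, y_min, x_max, y_max in boxes:
        # The old right edge becomes the new left edge and vice versa.
        flipped.append([image_width - x_max, y_min, image_width - x_min, y_max])
    return flipped


boxes = [[10.0, 20.0, 110.0, 220.0]]
flipped = flip_boxes_horizontally(boxes, image_width=500)
print(flipped)
```

Flipping twice returns the original boxes, which is a handy sanity check for any custom geometric augmentation.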

## Model Selection

You will need to select an appropriate model architecture for your object detection task. When choosing a model, there are several factors to consider:
* The type of objects you want to detect - Are they general everyday objects, or more specialized categories like faces or text? Simpler architectures like SSD and YOLO work well for detecting common objects, while more complex models like Mask R-CNN may be better for niche categories.

* Model size and speed - Larger models like 'RetinaNet' will be more accurate but slower, while smaller models like MobileNet will be faster but less accurate. Choose a model size that fits your speed and accuracy needs.

* Amount of training data - If you have a large dataset, you can train bigger models with more parameters. With less data, stick to smaller models to avoid overfitting.

* Inference speed - Some models like 'MobileNet' are optimized specifically for fast inference after training. Prioritize this if you need to run detection very quickly.

* Built-in vs custom models - Many pre-made model architectures like 'Faster R-CNN' are available. But you can also build custom models better tailored to your specific objects.

A good starting point is to evaluate pre-trained models like 'Faster R-CNN' and 'SSD' (Single Shot Detector) that are available in model zoos. 'Faster R-CNN' with a 'ResNet-50' backbone offers a good balance of accuracy and speed for this dataset. 'SSD' is faster but slightly less accurate, so you may want to try different backbone architectures like 'ResNet', 'MobileNet' and 'EfficientNet' to find the right tradeoff. 

To simplify model development, we will leverage the pre-trained object detection models available in Keras CV. Keras CV provides reference implementations and pre-trained weights for state-of-the-art computer vision models. We can quickly test training and inference for object detection by using a model like 'RetinaNet' with a backbone such as 'ResNet' or 'EfficientNet', initialized with weights pre-trained on COCO or other datasets. By taking advantage of these pre-trained models in Keras CV, we can prototype and experiment with minimal code and setup time. This allows us to focus on customizing and optimizing the model for our specific use case.

### How to Load Pretrained models    

#### Loading Pretrained models

This cell loads a RetinaNet object detection model with a ResNet50 backbone, pre-trained on the Pascal VOC dataset. The from_preset function is used to load the pre-trained weights and architecture. The bounding_box_format parameter specifies the format of the bounding box coordinates, which is 'xyxy' in this case. The prediction_decoder parameter is set to the 'NonMaxSuppression' layer initialized at the top of the same cell, which will be used to filter out overlapping bounding boxes during inference. The load_weights parameter is set to True to load the pre-trained weights along with the model architecture.

In [None]:
# Filter out overlapping bounding boxes based on intersection over union (IoU) thresholds.
prediction_decoder = keras_cv.layers.NonMaxSuppression(
    bounding_box_format="xyxy",  # The format of the bounding boxes (x, y, x, y)
    from_logits=True,  # Indicates that the input is logits (raw output from the model)
    iou_threshold=0.7,  # IoU threshold for filtering overlapping bounding boxes
    confidence_threshold=0.3,  # Confidence threshold for filtering low-confidence predictions
)

# Load Resnet architecture and weights from pre-trained model
model1 = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",  # Load pre-trained RetinaNet model with ResNet50 backbone
    bounding_box_format="xyxy",  # The format of the bounding boxes (x, y, x, y)
    prediction_decoder=prediction_decoder,  # Use the custom NMS layer for filtering predictions
    load_weights=True,  # Load pre-trained weights for the model
)
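
To illustrate what the two thresholds control, here is a simplified, class-agnostic sketch of IoU and greedy non-maximum suppression in plain Python. It is not the Keras CV layer: the function names are hypothetical, the scores are assumed to be probabilities already, and the boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two xyxy boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.7, confidence_threshold=0.3):
    """Return the indices of the boxes that are kept, highest score first."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, score in enumerate(scores) if score >= confidence_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [[0, 0, 100, 100], [5, 5, 105, 105], [200, 200, 300, 300], [0, 0, 50, 50]]
scores = [0.9, 0.8, 0.75, 0.1]
kept = greedy_nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the last box is dropped by the confidence threshold. Lowering iou_threshold removes more near-duplicates, and raising confidence_threshold removes more weak predictions.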

This code creates a resizing layer for inference. The 'keras_cv.layers.Resizing' layer is used to resize the input images to a fixed size of 640x640 pixels. The bounding_box_format parameter specifies the format of the bounding box coordinates, where 'xyxy' means that the coordinates are in the format of [x_min, y_min, x_max, y_max]. The pad_to_aspect_ratio parameter ensures that the aspect ratio of the images is preserved during resizing by adding padding if necessary.

In [None]:
# Get the image data from the 'data_not_augmented' dictionary
image_na = data_not_augmented["images"]

# Resizing the images for inference
inference_resizing = keras_cv.layers.Resizing(
    640, 640, bounding_box_format="xyxy", pad_to_aspect_ratio=True
)

# Apply the resizing layer to the image data
image_batch_na = inference_resizing([image_na])
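
The arithmetic behind an aspect-preserving resize can be sketched in a few lines. This is an illustration under stated assumptions rather than the layer's actual code: the function names are hypothetical, and the padding is assumed to be added at the bottom and right, so box coordinates only need scaling. If the padding were centred, an offset would also be required.

```python
def letterbox_params(height, width, target_height=640, target_width=640):
    """Scale factor and resized size when fitting an image inside the target."""
    scale = min(target_height / height, target_width / width)
    new_height, new_width = round(height * scale), round(width * scale)
    return scale, new_height, new_width


def scale_box(box, scale):
    """Scale an xyxy box by the same factor as the image."""
    return [coord * scale for coord in box]


# A 500x375 (width x height) image, a common Pascal VOC size.
scale, new_h, new_w = letterbox_params(height=375, width=500)
pad_rows = 640 - new_h
scaled_box = scale_box([100.0, 50.0, 300.0, 200.0], scale)
print(scale, new_h, new_w, pad_rows)
```

Here the width is the limiting dimension, so the image is scaled by 1.28 and the remaining rows are filled with padding.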

#### Prediction test
To validate the performance of a pretrained model, we can run a prediction test.

In [None]:
# Prediction of not augmented image
y_pred1 = model1.predict(image_batch_na)

In [None]:
# Plot the bounding box gallery for the given image batch and predictions
visualization.plot_bounding_box_gallery(
    image_batch_na,
    value_range=(0, 255),
    rows=1,
    cols=1,
    y_pred=y_pred1,
    scale=5,
    font_scale=0.7,
    bounding_box_format="xyxy",
    class_mapping=class_mapping,
)

In [None]:
del model1
import gc

gc.collect()

In [None]:
tf.keras.backend.clear_session()

## Training Object Detection Models with SageMaker and TensorFlow

To train object detection models on SageMaker, we first need to configure a SageMaker training job. The key components are:
* Choosing an estimator
    * We can use the TensorFlow estimator to leverage the Keras API and pretrained models like RetinaNet.
* Selecting an instance type
    * GPU instances like ml.p3.2xlarge are best suited for training convolutional neural networks.
* Configuring the training script
    * This sets up the model architecture, loads pretrained weights, and defines the training loop.
* Specifying the training image
    * We can use a TensorFlow image from the SageMaker registry.
* Setting hyperparameters
    * Learning rate, batch size, and epochs are key hyperparameters to tune.
    
For model evaluation, we need to choose appropriate metrics like precision, recall, and 'mAP'. Since object detection involves classifying many bounding boxes, metrics that account for class imbalance like F1 score are also useful.
The pretrained models in Keras CV combined with SageMaker's managed training provide an optimized environment for iterating on object detection models. We can efficiently improve accuracy by tuning hyperparameters and leverage SageMaker infrastructure for scalable distributed training.
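
As a reminder of how these metrics relate, the sketch below computes precision, recall, and F1 from counts of matched detections. It assumes a detection counts as a true positive when it overlaps a ground-truth box of the same class above some IoU threshold (0.5 is a common choice). The counts are made up and the function name is hypothetical.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from matched-detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 8 correct detections, 2 spurious ones, and 4 missed objects.
precision, recall, f1 = detection_scores(true_positives=8, false_positives=2, false_negatives=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

'mAP' goes one step further by averaging precision over recall levels and then over classes, so it summarizes the whole precision-recall trade-off rather than a single operating point.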
   
### How to create a training job in SageMaker

In [None]:
import sagemaker
import json
import pandas as pd
import glob
import s3fs
import boto3
from datetime import datetime
import os

#### Loading Preprocessed Data from S3

A key advantage of using SageMaker for model training is it can directly access data stored in S3 buckets. After preprocessing our dataset in a previous step, we staged the output in an S3 location. The SageMaker TensorFlow estimator handles loading this data from S3 into our training script. We simply specify the S3 path when creating the TensorFlow estimator.

If you want to see how the train and validation 'TFRecords' datasets are created in detail, look at [Build an Object Detection Model on Tensorflow and SageMaker: Overview and Data Preparation](1_object_detection_preprocessing.ipynb).

In [None]:
# Set paths to training and validation data in S3
s3_train_data = f"{processing_job_output_uri}/tfrecords/train"
s3_validation_data = f"{processing_job_output_uri}/tfrecords/val"

#### Initialize Model Hyperparameters

In [None]:
# Define hyperparameters
hyperparameters = {
    "batch_size": 8,  # Number of samples to include in each batch during training
    "learning_rate": 0.001,  # The learning rate for the optimizer during training
    "epochs": 2,  # Number of epochs (complete passes through the training data) for training
    "global_clipnorm": 0.3,  # Maximum norm of the gradients for clipping to prevent exploding gradients
    "train_samples": training_dataset_size,  # Number of samples in the training dataset
    "eval_samples": validation_dataset_size,  # Number of samples in the validation dataset
    "model_dir": "/opt/ml/model",  # Directory where the trained model will be saved
    "checkpoint_local_path": "/opt/ml/checkpoints",  # Directory where model checkpoints will be saved during training
    "finetune": True,  # Set to True if you want to fine-tune a pre-trained model, False if training from scratch
}
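
In script mode, SageMaker passes these hyperparameters to the entry point as command-line arguments. The training script itself is not shown in this notebook, so the following is only a sketch of how such a script might read them. The helper names are hypothetical, and the steps-per-epoch calculation is an assumption about why the dataset sizes are passed.

```python
import argparse


def str_to_bool(value):
    """Interpret "True"/"False" strings; type=bool would treat any non-empty string as True."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--global_clipnorm", type=float, default=0.3)
    parser.add_argument("--train_samples", type=int, default=0)
    parser.add_argument("--eval_samples", type=int, default=0)
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")
    parser.add_argument("--checkpoint_local_path", type=str, default="/opt/ml/checkpoints")
    parser.add_argument("--finetune", type=str_to_bool, default=True)
    # parse_known_args ignores any extra arguments appended by the platform.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training job might receive (values are made up).
args = parse_args(["--batch_size", "8", "--finetune", "False", "--train_samples", "100"])
steps_per_epoch = args.train_samples // args.batch_size
print(args.finetune, steps_per_epoch)
```

Boolean hyperparameters deserve particular care, because they arrive as strings and a naive conversion turns "False" into True.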

#### Define SageMaker estimator

This code sets up a TensorFlow estimator for training an object detection model using SageMaker.

In [None]:
# Specify instance type
train_instance_type = "ml.p3.2xlarge"  # this is the recommended instance for testing purposes
# train_instance_type="ml.p3.8xlarge"
# train_instance_type="ml.g5.48xlarge"

In [None]:
tf_model_image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",  # Specify the framework as TensorFlow
    region=region,  # The AWS region where you want to run the training job
    version="2.14",  # The version of TensorFlow you want to use (2.14 in this case)
    image_scope="training",  # Specify that you want the training container image
    py_version="py310",  # Specify the Python version you want to use (3.10 in this case)
    instance_type=train_instance_type,  # The type of EC2 instance to use for training
)

In [None]:
%%time
from time import gmtime, strftime
import os

# Import TensorFlow estimator
import sagemaker
from sagemaker.tensorflow import TensorFlow

# Set the arguments for the TensorFlow estimator
estimator_args = dict(
    source_dir="code",  # The directory containing the training code
    entry_point="train.py",  # The Python script to run for training
    model_dir="/opt/ml/model",  # The path to save the trained model
    instance_type=train_instance_type,  # The type of EC2 instance for training
    instance_count=1,  # The number of instances for training
    framework_version="2.14",  # The TensorFlow version to use
    # py_version="py310",# Uncomment this line to specify the Python version
    image_uri=tf_model_image_uri,  # Image URI for the TensorFlow container
    debugger_hook_config=None,  # Disable the TensorFlow debugger hook
    disable_profiler=True,  # Disable the profiler for training
    # max_run=60 * 20,  # Uncomment this line to set a maximum runtime for training (20 minutes)
    role=role,  # The IAM role for the training job
    # keep_alive_period_in_seconds=3600,# Uncomment this line to set the keep-alive period (1 hour)
    metric_definitions=[
        # Define the metrics to be captured during training, validation, and testing
        {"Name": "train:loss", "Regex": "loss: ([0-9.]*?) "},
        {"Name": "train:box_loss", "Regex": "box_loss: ([0-9.]*?) "},
        {"Name": "train:classification_loss", "Regex": "classification_loss: ([0-9.]*?) "},
        {"Name": "train:MaP", "Regex": "MaP: (.*?) "},
        {"Name": "val:loss", "Regex": "val_loss: ([0-9.]*?) "},
        {"Name": "val:box_loss", "Regex": "val_box_loss: ([0-9.]*?) "},
        {"Name": "val:classification_loss", "Regex": "val_classification_loss: ([0-9.]*?) "},
        {"Name": "val:MaP", "Regex": "val_MaP: (.*?) "},
        {"Name": "test:loss", "Regex": "Test loss: ([0-9.]+)"},
        {"Name": "test:box_loss", "Regex": "Test box_loss: ([0-9.]+)"},
        {"Name": "test:Map", "Regex": "Test MaP: (-?[0-9.]+)"},
    ],
)
# Configure the TensorFlow estimator with the specified arguments and hyperparameters
estimator = TensorFlow(
    hyperparameters=hyperparameters,  # Hyperparameters for the model
    **estimator_args,  # Pass the estimator arguments as keyword arguments
)
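
Metric regexes are easy to check locally before launching a job. The sketch below runs a few patterns against hypothetical log lines; the real output of the training script may look different. It also illustrates a pitfall: a lazy group such as `([0-9.]*?)` at the very end of a pattern can match an empty string, so a greedy `+` is the safer choice there.

```python
import re

# Hypothetical log lines in the style Keras prints; the real output may differ.
epoch_line = "100/100 - 35s - loss: 1.2345 - box_loss: 0.4567 - val_loss: 1.5000 - val_MaP: 0.2100 "
test_line = "Test loss: 0.9876"


def first_match(pattern, line):
    """Return the first captured group, or None if the pattern does not match."""
    match = re.search(pattern, line)
    return match.group(1) if match else None


train_loss = first_match(r"loss: ([0-9.]*?) ", epoch_line)
val_loss = first_match(r"val_loss: ([0-9.]*?) ", epoch_line)

# With nothing after the group, the lazy quantifier is satisfied by zero characters.
lazy_at_end = first_match(r"Test loss: ([0-9.]*?)", test_line)
greedy_at_end = first_match(r"Test loss: ([0-9.]+)", test_line)

print(repr(train_loss), repr(val_loss), repr(lazy_at_end), repr(greedy_at_end))
```

Running the same check against a few real lines from the job logs is a quick way to confirm that every metric definition captures a number.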

#### Create Training Job

The next cell starts the training job for the object detection model. The fit method of the estimator object is used to initiate the training process.

In [None]:
%%time
import boto3
from sagemaker.session import Session

# Create a SageMaker session
sagemaker_session = Session()

# Set an experiment name with a timestamp
exp_name = "object-detection-tensorflow-exp-finetuned-{}".format(
    datetime.now().strftime("%Y%m%d-%H%M%S")
)

# Start training job
estimator.fit(
    inputs={"train": s3_train_data, "eval": s3_validation_data}, wait=True, job_name=exp_name
)
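
Training job names have to follow SageMaker's naming rules: at most 63 characters, made of letters, digits, and hyphens. The following sketch builds a timestamped name and validates it before use. The helper name is hypothetical, and the pattern reflects the documented constraint as best understood here, so treat it as an assumption to verify against the API reference.

```python
import re
from datetime import datetime

# Assumed constraint: alphanumerics and hyphens, starting and ending with an alphanumeric.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def make_job_name(prefix, now=None):
    """Build '<prefix>-<timestamp>' and fail early if the name is not valid."""
    now = now or datetime.now()
    name = f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid training job name: {name!r}")
    return name


job_name = make_job_name(
    "object-detection-tensorflow-exp-finetuned", datetime(2024, 4, 2, 19, 56, 44)
)
print(job_name, len(job_name))
```

Checking the name locally gives a clearer error than a rejected API call, for example when a prefix contains an underscore or grows too long.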

### How to load models from training jobs to use them locally

#### Load last training job metadata
This cell imports the necessary AWS SDK for Python (Boto3) and retrieves the name of the last training job from the SageMaker service. It uses the list_training_jobs API call to get a list of training jobs sorted by creation time in descending order, and takes the first result (the most recent training job that was completed).

In [None]:
import boto3

# Create a SageMaker client object
client = boto3.client("sagemaker")

# Retrieve up to 10 completed training jobs, sorted by creation time in descending order
last_training_job = client.list_training_jobs(
    SortOrder="Descending", SortBy="CreationTime", StatusEquals="Completed", MaxResults=10
)
# Extract the name of the last training job from the list
last_training_job_name = last_training_job["TrainingJobSummaries"][0]["TrainingJobName"]

In [None]:
# prints last completed training job name
print(last_training_job_name)
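
Indexing the first summary directly raises an IndexError when no completed job exists, and the most recent completed job in the account is not necessarily the one from this notebook. The following sketch shows a more defensive lookup on the response dictionary. The helper latest_job_name and the sample response are illustrative and only include the fields used here.

```python
def latest_job_name(response, prefix=None):
    """Return the first job name in the response, optionally requiring a name prefix."""
    for summary in response.get("TrainingJobSummaries", []):
        name = summary["TrainingJobName"]
        if prefix is None or name.startswith(prefix):
            return name
    return None


# Hypothetical response, already sorted by creation time in descending order.
sample_response = {
    "TrainingJobSummaries": [
        {"TrainingJobName": "some-other-job-20240403-101500"},
        {"TrainingJobName": "object-detection-tensorflow-exp-finetuned-20240402-195644"},
    ]
}

print(latest_job_name(sample_response, prefix="object-detection-tensorflow"))
print(latest_job_name({"TrainingJobSummaries": []}))
```

The list_training_jobs call also accepts a NameContains filter, which can narrow the results on the service side instead of filtering locally.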

This cell calls the describe_training_job API to retrieve the details of the training job with the given name. Finally, it prints the S3 URI of the model artifacts generated by the training job.

In [None]:
# Optional: to load a specific job instead of the most recent one, set its name here
# last_training_job_name = "object-detection-tensorflow-exp-finetuned-20240402-195644"

# Retrieve the details of the training job using the describe_training_job method
# from the SageMaker client
last_training_job_data = client.describe_training_job(TrainingJobName=last_training_job_name)
# S3 location where the model artifacts (trained model) are stored
model_s3_uri = last_training_job_data["ModelArtifacts"]["S3ModelArtifacts"]
print(model_s3_uri)
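
The model artifact location is a single S3 URI. If you need the bucket and object key separately (for example, to call the Boto3 S3 client directly), the URI can be split with the standard library. This is a minimal sketch; split_s3_uri is a hypothetical helper and the URI below is a made-up example.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not a valid S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/example-job/output/model.tar.gz")
print(bucket)
print(key)
```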

In [None]:
# store S3 uri with model artifact
%store model_s3_uri

In [None]:
# read last training job model s3 uri
%store -r model_s3_uri

#### Download model artifacts from S3 bucket for local testing
The next cells download the trained model artifact (model.tar.gz) from the Amazon S3 bucket and extract it locally.

In [None]:
import sagemaker

# Create an instance of the S3Downloader class
s3_downloader = sagemaker.s3.S3Downloader()

# Download the model from the specified S3 URI to the local 'model' directory
s3_downloader.download(model_s3_uri, "model")

In [None]:
# extract the trained model from the tar.gz file
!tar -xzvf model/model.tar.gz -C model
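
The shell command above depends on a tar binary being available in the notebook environment. As an alternative, the same extraction can be sketched with Python's tarfile module. The helper name extract_model_archive is an assumption for illustration; the paths in the commented usage line match the ones used in this notebook.

```python
import tarfile
from pathlib import Path


def extract_model_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives from a trusted source, such as your own training job.
        archive.extractall(path=target)
    return names


# Usage in this notebook (assumes the download cell has already run):
# extract_model_archive("model/model.tar.gz", "model")
```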

#### Load model locally
This code loads the trained model from the extracted artifact.

In [None]:
import keras

# Load the trained object detection model from the specified path
model = keras.saving.load_model("model/1/model.keras")
model.compile()

#### Load validation dataset
This cell prepares the validation dataset for evaluating the object detection model. It parses the 'TFRecord' files, shuffles and batches the samples, resizes the images to 640x640 (padding to preserve the aspect ratio), and converts each batch into an (images, bounding boxes) tuple that the model accepts. The resulting dataset is then used to evaluate the model on data it did not see during training.

In [None]:
# Import the bounding_box module from keras_cv
from keras_cv import bounding_box

# Set the batch size for validation
BATCH_SIZE = 4

# Create a Resizing layer from keras_cv.layers
inference_resizing = keras_cv.layers.Resizing(
    640, 640, bounding_box_format="xyxy", pad_to_aspect_ratio=True
)


# Define a function to convert dictionary inputs to tuples
# This function is used for mapping the input data to a format suitable for the model
def dict_to_tuple(inputs):
    # Extract the images from the input dictionary
    images = inputs["images"]
    # Convert the bounding box coordinates to a dense tensor format
    bounding_boxes = bounding_box.to_dense(inputs["bounding_boxes"], max_boxes=32)
    # Return the images and bounding boxes as a tuple
    return images, bounding_boxes


# Create a TensorFlow dataset from the validation dataset
val_ds = val_dataset.map(parse_tfrecord_fn)
# Preprocess the validation dataset
val_ds = val_ds.map(lambda x: prepare_sample(x), num_parallel_calls=tf_data.AUTOTUNE)
# Shuffle the validation dataset
val_ds = val_ds.shuffle(BATCH_SIZE * 4)

# Batch the validation dataset using ragged batching
val_ds = val_ds.ragged_batch(BATCH_SIZE)

# Resize the images in the validation dataset using the inference_resizing layer
val_ds = val_ds.map(inference_resizing, num_parallel_calls=tf_data.AUTOTUNE)

# Convert the input data to the required format (tuples of images and bounding boxes)
val_ds = val_ds.map(dict_to_tuple, num_parallel_calls=tf_data.AUTOTUNE)

# Repeat the validation dataset indefinitely
val_ds = val_ds.repeat()

# Prefetch data to improve performance
val_ds = val_ds.prefetch(tf_data.AUTOTUNE)

# Get a sample batch from the validation dataset
sample_val_ds = next(iter(val_ds.take(1)))
image, _ = sample_val_ds
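
Images contain different numbers of objects, so the bounding boxes in a batch are ragged. The dict_to_tuple function above converts them to a dense layout with a fixed number of boxes per image (max_boxes=32). The following pure-Python sketch illustrates the idea of that padding. It is not the KerasCV implementation, and the fill value of -1 is an assumption about the library default that you should confirm in its documentation.

```python
def pad_boxes(boxes, classes, max_boxes=32, fill_value=-1):
    """Pad (or truncate) one image's boxes and classes to exactly max_boxes entries."""
    if len(boxes) != len(classes):
        raise ValueError("boxes and classes must have the same length")
    padded_boxes = [list(box) for box in boxes[:max_boxes]]
    padded_classes = list(classes[:max_boxes])
    missing = max_boxes - len(padded_boxes)
    # Fill the remaining slots with a sentinel so downstream code can ignore them.
    padded_boxes += [[fill_value] * 4 for _ in range(missing)]
    padded_classes += [fill_value] * missing
    return padded_boxes, padded_classes


example_boxes, example_classes = pad_boxes([[0, 0, 10, 10]], [3], max_boxes=3)
print(example_boxes)
print(example_classes)
```

With a fixed shape per image, every batch can be stacked into regular tensors, which is what the model and the metrics expect.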

#### Local prediction testing

The 'keras_cv.layers.MultiClassNonMaxSuppression' layer is used to create a prediction decoder for the object detection model. This layer performs non-maximum suppression on the raw output of the model, which helps to remove duplicate or overlapping bounding box predictions.

In [None]:
# Create a prediction decoder
prediction_decoder = keras_cv.layers.MultiClassNonMaxSuppression(
    bounding_box_format="xyxy",  # Specify the format of the bounding boxes (x, y, x, y)
    from_logits=True,  # Indicate that the input is logits (raw output from the model)
    iou_threshold=0.75,  # Set the Intersection over Union (IoU) threshold for non-maximum suppression
    confidence_threshold=0.5,  # Set the confidence threshold for non-maximum suppression
)

# Assign the prediction decoder to the model
model.prediction_decoder = prediction_decoder
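
To make the two thresholds concrete, here is a minimal, single-class sketch of greedy non-maximum suppression in plain Python. It is not the KerasCV implementation (which also handles multiple classes and batches), and the function names and sample boxes are illustrative. Boxes use the "xyxy" format, that is (x_min, y_min, x_max, y_max).

```python
def iou_xyxy(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.75, confidence_threshold=0.5):
    """Return indices of boxes kept after confidence filtering and greedy NMS."""
    # Drop low-confidence boxes, then visit the rest from highest to lowest score.
    candidates = [i for i, score in enumerate(scores) if score >= confidence_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30), (5, 5, 15, 15)]
scores = [0.9, 0.8, 0.6, 0.4]
kept = greedy_nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of 0.8, which exceeds the 0.75 threshold, so it is suppressed. The fourth box falls below the confidence threshold and is never considered. A lower iou_threshold suppresses more overlapping boxes, while a higher confidence_threshold discards more weak predictions.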

The visualize_detections function plots one batch of images from a dataset with both the ground truth boxes and the boxes predicted by the trained model.

In [None]:
# This function visualizes the object detection results
def visualize_detections(model, dataset, bounding_box_format):
    # Get the first batch of images and ground truth bounding boxes from the dataset
    images, y_true = next(iter(dataset.take(1)))  # takes one batch from dataset
    images = images.numpy().astype(np.uint8)

    # Make predictions on the batch of images using the trained model
    y_pred = model.predict(images)

    # Plot the images with ground truth and predicted bounding boxes
    visualization.plot_bounding_box_gallery(
        images,
        value_range=(0, 255),
        bounding_box_format=bounding_box_format,
        y_true=y_true,
        y_pred=y_pred,
        scale=4,
        rows=2,
        cols=2,
        show=True,
        font_scale=0.7,
        class_mapping=class_mapping,
    )

In [None]:
visualize_detections(model, bounding_box_format="xyxy", dataset=val_ds)

#### Create metrics for trained model
This cell creates an instance of the 'BoxCOCOMetrics' class from the keras_cv.metrics module. The class computes COCO-style evaluation metrics for object detection models, such as mean average precision and recall. The name refers to the evaluation protocol popularized by the COCO (Common Objects in Context) dataset; the metrics can be computed for any dataset with bounding box annotations.

In [None]:
metrics_val = keras_cv.metrics.BoxCOCOMetrics(
    bounding_box_format="xyxy",
    evaluate_freq=2,
)

The next cell runs the trained model on 100 batches of the validation dataset and accumulates the metrics. The results can be used to evaluate the model and decide whether further fine-tuning is necessary.

In [None]:
# Loop through the first 100 batches of the validation dataset
for batch in val_ds.take(100):
    images, y_true = batch  # Unpack the batch into images and ground truth labels
    # Make predictions on the images using the trained model
    y_pred = model.predict(images, verbose=1)
    # Update the validation metrics using the ground truth labels and predicted values
    metrics_val.update_state(y_true, y_pred)
# Compute the final validation metrics (a dictionary of metric names and values)
metrics_val = metrics_val.result(force=True)

The next cell prints the computed metrics, which include mean average precision ('mAP') and recall values derived from the ground truth and predicted boxes. Use these metrics to assess the model's accuracy and to compare training runs.

In [None]:
# Print the validation metrics
for key in metrics_val.keys():
    print(f"{key}:{metrics_val[key]}")
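
To build intuition for what these metrics measure, the sketch below computes precision and recall for a single image at one IoU threshold. It is a simplified illustration, not the COCO procedure used by 'BoxCOCOMetrics': COCO-style mAP additionally sweeps over confidence levels, averages over several IoU thresholds, and averages over classes. The function names and sample boxes are assumptions made for this example.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Match predictions (sorted by descending confidence) to ground truth boxes."""
    matched = set()
    true_positives = 0
    for pred in pred_boxes:
        best_iou, best_index = 0.0, None
        for index, gt in enumerate(gt_boxes):
            if index in matched:
                continue  # Each ground truth box can be matched only once.
            overlap = box_iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_index = overlap, index
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(pred_boxes) if pred_boxes else 0.0
    recall = true_positives / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 9), (50, 50, 60, 60), (21, 21, 30, 30)]
precision, recall = precision_recall(gt, preds)
print(precision, recall)
```

Here two of the three predictions match a ground truth box, so precision is 2/3, and both ground truth boxes are found, so recall is 1.0.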

Note: To achieve good Mean Average Precision ('mAP') values, you may need to run the training for more epochs and perform hyperparameter tuning. This code is just an exercise, and further optimization might be required for real-world object detection tasks.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|2_object_detection_train_eval.ipynb)
