# Build an Object Detection Model on TensorFlow and SageMaker: Overview and Data Preparation




## Background

This notebook is one of a sequence of notebooks that show you how to use SageMaker functionality to build, train, and test an object detection model, including data preprocessing steps such as ingestion, cleaning, and processing. The demo has two parts:

1. Overview and Data Preparation (current notebook) - you will preprocess the data, then create a JSON Lines file from the cleaned data. By the end of part 1, you will have a complete data set that contains all features used for object detection, ready to be ingested by a data loader in *[TensorFlow](https://github.com/tensorflow/tensorflow)* using 'TFRecords'.
1. Data loader creation and Model Training - you will use the data set built in part 1 to create a data loader for TensorFlow using *[Keras CV](https://github.com/keras-team/keras-cv)*, train the model, and then evaluate the model's predictions on the test data.


## Content
* [Overview](#Overview)
* [Data Selection](#Data-Selection)
* [Prepare Data](#Prepare-Data)
* [Preprocessing the Dataset with SageMaker](#Preprocessing-the-Dataset-with-SageMaker)


## Overview

### What is Object Detection, and why is it important?

Object detection refers to detecting instances of objects from certain classes in images or videos. It allows for multiple objects to be detected and localized in an image. Object detection is commonly used in applications such as self-driving cars, face detection, video surveillance, etc.  

### Use Cases for Object Detection

Some common use cases of object detection include:

- Self-driving cars - detect pedestrians, cars, traffic signs, etc.
- Face detection - detect faces in images and videos for applications like security and tagging people in images.
- Video surveillance - detect suspicious activities or objects.
- Medical imaging - detect anomalies, tumors, etc. in medical scans.
- Retail - detect objects on shelves for inventory management.

### Define the Machine Learning Problem  

Object detection can be formulated as a supervised machine learning problem:

- Given a set of labelled images containing objects from certain classes, train a model to detect the presence and location of those objects in new images.

- The model needs to identify the class of objects present and draw bounding boxes around them indicating their locations.

### Data Requirements

- Large dataset of images with object annotation - Object locations are annotated using bounding boxes around them.

- Variety of images - Objects captured under different conditions of illumination, scales, occlusion, viewpoints etc. 

### Challenges

- Data annotation - Time consuming and expensive process.

- Class imbalance - Models tend to perform better for classes with more examples.

- Viewpoint variation - Objects look different from different angles and viewpoints. 

- Background clutter - Objects may blend with their surroundings.

- Small objects - Harder to detect smaller objects.

- Occlusion - Objects hidden behind other things are tougher to detect.

## Data Selection

The dataset that will be used for object detection is the *[PASCAL VOC 2012](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html)* dataset. This dataset contains images, annotations, and evaluation tools for object detection.

### Dataset Structure
The folder structure is:

- `Annotations` - contains XML annotation files 
- `ImageSets` - contains text files that specify train, val, test splits
- `JPEGImages` - contains JPEG images
- `SegmentationClass` - contains segmentation class PNGs 
- `SegmentationObject` - contains segmented objects PNGs

### Annotation Format

The XML annotation files contain information about objects present in each image:

- `<filename>`: name of the JPEG image
- `<size>`: image width, height, depth
- `<object>`: contains information about each object instance
    - `<name>`: object class name 
    - `<pose>`: orientation of object (left, right, frontal, rear) 
    - `<truncated>`: whether object is truncated 
    - `<difficult>`: whether object is difficult to detect
    - `<bndbox>`: bounding box of object 
        - contains `<xmin>, <ymin>, <xmax>, <ymax>` coordinates
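
To make this structure concrete, here is a minimal sketch that parses a hand-written annotation with Python's standard `xml.etree.ElementTree`. The XML snippet and all of its values are made up for illustration and are not taken from the dataset. The notebook itself uses `defusedxml` for the real files, which is the safer choice for XML you did not write yourself.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout (all values are illustrative)
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <pose>Left</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(SAMPLE_XML.strip())
size = root.find("size")

# Image-level fields
image_info = {
    "filename": root.findtext("filename"),
    "width": int(size.findtext("width")),
    "height": int(size.findtext("height")),
}

# One entry per <object> element: class name plus [xmin, ymin, xmax, ymax]
objects = []
for obj in root.findall("object"):
    box = obj.find("bndbox")
    objects.append(
        {
            "name": obj.findtext("name"),
            "bbox": [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")],
        }
    )

print(image_info)
print(objects)
```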

### Classes

There are 20 classes representing objects:

'person', 'bird', 'cat', 'cow', 'dog', 'horse', 'sheep', 'aeroplane', 'bicycle', 'boat', 'bus', 'car', 'motorbike', 'train', 'bottle', 'chair', 'diningtable', 'pottedplant', 'sofa', 'tvmonitor'

The names are shown as they are spelled in the `<name>` element of the annotation files.

### Data Splits

The data is split into train, validation, and test sets defined in the 'ImageSets' folder text files.

- Train - 5717 images 
- Validation - 5823 images
- Test - 10991 images

This dataset will be used to train an object detection model to detect and localize the defined classes. The data will be downloaded from an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) bucket.

For this specific use case, you will focus on a solution to detect and localize objects in images. Some possible expansions of the work include:

- Detect additional classes by expanding the dataset with more images and annotations
- Improve localization accuracy by generating more precise bounding boxes 
- Add image attributes (lighting, orientation, occlusion) and object attributes (size, color) to the data
- Expand to video data and perform object tracking over frames
- Perform instance segmentation instead of bounding box detection
- Develop a real-time object detection application 


## Prepare Data

### Set Up Notebook

In [None]:
!pip install --upgrade pip --quiet
!pip install --upgrade sagemaker boto3 --quiet
!pip install keras-cv tensorflow~=2.13.0 --upgrade --quiet
#!pip install keras-cv --upgrade --quiet
#!pip install pickleshare --upgrade --quiet
#!pip install opencv-python
#!pip install matplotlib
#!conda install opencv

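# Note: Keras CV on Keras 2 needs a matching TensorFlow 2.x release; the pin above (2.13)
# may need to be raised, for example to 2.15, depending on the installed keras-cv version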

In [None]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"
import keras_cv

In [None]:
import sagemaker
import json
import pandas as pd
import glob
import boto3

### Parameters 
The following cell lists the configurable parameters that are used throughout the notebook.

In [None]:
# Create a SageMaker session
sagemaker_session = sagemaker.Session()
# Get the default S3 bucket associated with your SageMaker session
bucket = sagemaker_session.default_bucket()  # replace with your own bucket name if you have one
# Create an S3 resource client
s3 = boto3.resource("s3")
# Get the AWS region name
region = boto3.Session().region_name
# Get the execution role for SageMaker
role = sagemaker.get_execution_role()
# Create a SageMaker client
smclient = boto3.Session().client("sagemaker")
# Set a prefix for your S3 objects
prefix = "object-detection-tensorflow"

### Ingest Data

We ingest the dataset from a public SageMaker S3 training bucket.

In [None]:
!mkdir -p data/

In [None]:
##### Alternative: you can copy data from public S3 bucket to your own bucket
import boto3
import botocore

BUCKET_NAME = f"sagemaker-example-files-prod-{region}"  # dataset bucket name
KEY = "datasets/image/VOC2012/VOCtrainval_11-May-2012.tar"  # dataset object key

# Define the local path where the dataset will be downloaded
raw_dataset_folder = "./data/VOCtrainval_11-May-2012.tar"

# Define the target S3 bucket and prefix where the dataset will be copied
s3_target_bucket = f"s3://{bucket}/{prefix}/data"

try:
    # Download the dataset from the source bucket to the local path
    s3.Bucket(BUCKET_NAME).download_file(KEY, raw_dataset_folder)
except botocore.exceptions.ClientError as e:
    # Handle exceptions related to the S3 download operation
    if e.response["Error"]["Code"] == "404":
        print("The object does not exist.")
    else:
        raise

In [None]:
# upload local tar file to s3_target_bucket
S3_dataset_bucket = sagemaker.s3.S3Uploader.upload(raw_dataset_folder, s3_target_bucket)

In [None]:
%%time
# This step can take a while, depending on the instance type.
# Untar the dataset into the data folder
!tar -xf {raw_dataset_folder} -C ./data

In [None]:
# remove local tar file
!rm {raw_dataset_folder}

In [None]:
# location of tar file on S3 bucket for later processing
S3_dataset_bucket

### Data Cleaning

Because of the size and complexity of the data (images, XML, and text files, about 2 GB in total), you will start exploring the data through the text files that define the train, val, and test splits. 
The image metadata is in the XML files, so you need to create a file that is easier to process in the data loader. 
For this case, you will use the JSON Lines format, which stores one JSON object per line, and use those records to create the TensorFlow data loaders.
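
As a minimal sketch of the JSON Lines idea, the snippet below writes a list of dictionaries with one JSON object per line and reads it back. The records, the file name, and the helper names `write_jsonl` and `read_jsonl` are hypothetical and only mirror the shape of the data built later in this notebook.

```python
import json
import os
import tempfile

# Hypothetical records in the shape this notebook builds from the XML files
records = [
    {
        "image": "a.jpg",
        "width": "500",
        "height": "375",
        "annotation": [{"category": "dog", "bbox": ["48", "240", "195", "371"]}],
    },
    {"image": "b.jpg", "width": "640", "height": "480", "annotation": []},
]


def write_jsonl(rows, path):
    # One JSON object per line, with no enclosing list
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    # Each non-empty line is parsed independently of the others
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels.jsonl")
    write_jsonl(records, path)
    restored = read_jsonl(path)

print(restored == records)
```

Because every line is a complete JSON document, a reader can stream the file line by line instead of loading the whole dataset into memory.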

#### Remove irrelevant features

Upon initial inspection of the XML structure, features such as 'folder', 'source', 'object/pose', 'object/truncated', 'object/difficult', and 'object/part' are not relevant and can be excluded. You will need to write a function that parses the XML files, extracts the relevant features, and creates a dictionary of these features that can then be used to generate JSON Lines files for further analysis.

In [None]:
import os
from defusedxml.ElementTree import parse

# this variable helps to track class names
class_ids = set()


# Parse one XML annotation file and return its relevant fields as a dict
def xml_data_parser(xml_file):
    if not os.path.isfile(xml_file):
        print(f"{xml_file} does not exist")
        return {}

    with open(xml_file) as f:
        root = parse(f).getroot()

    size = root.find("size")
    annotation_dict = {
        "image": root.find("filename").text,
        "width": size.find("width").text,
        "height": size.find("height").text,
        "annotation": [],
    }
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        name = obj.find("name").text
        annotation_dict["annotation"].append(
            {
                "category": name,
                # Coordinates are kept as strings, in [xmin, ymin, xmax, ymax] order
                "bbox": [bndbox.find(tag).text for tag in ("xmin", "ymin", "xmax", "ymax")],
            }
        )
        class_ids.add(name)  # record every class name seen
    return annotation_dict


# Parse a single annotation file as a quick check of the returned dict structure
xml_data_parser("./data/VOCdevkit/VOC2012/Annotations/2007_000027.xml")

### Data Exploration
To avoid repeatedly loading the individual XML files, we will create a JSON Lines format file containing the training and validation datasets. The VOC2012 dataset includes TXT files for object detection that contain the names of the corresponding XML and JPEG files. We can use these names as IDs when generating the JSON Lines format files. 
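
The mapping from the IDs in a split file to the annotation and image files can be sketched as follows. This is only an illustration: `split_ids_to_paths` is a hypothetical helper, the IDs in `example_split` are placeholders, and the default root assumes the folder layout produced by extracting the archive into `./data`.

```python
from pathlib import Path


def split_ids_to_paths(split_text, root="./data/VOCdevkit/VOC2012"):
    """Map each image ID in a split file to its annotation and image paths."""
    root = Path(root)
    pairs = []
    for line in split_text.splitlines():
        image_id = line.strip()
        if not image_id:
            continue  # skip blank lines
        pairs.append(
            (
                root / "Annotations" / f"{image_id}.xml",
                root / "JPEGImages" / f"{image_id}.jpg",
            )
        )
    return pairs


# Illustrative split content: one image ID per line
example_split = "2008_000008\n2008_000015\n"
pairs = split_ids_to_paths(example_split)
for xml_path, jpg_path in pairs:
    print(xml_path, jpg_path)
```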

In [None]:
import json
from sklearn.model_selection import train_test_split


# Read a split file and parse the XML annotation of every image ID it lists
def parse_dataset_files(filename):
    with open(filename, "r") as f:
        image_ids = [line.strip() for line in f if line.strip()]

    xml_files = [f"./data/VOCdevkit/VOC2012/Annotations/{image_id}.xml" for image_id in image_ids]
    return [xml_data_parser(xml_file) for xml_file in xml_files]


# Write the list of dicts as JSON Lines (one JSON object per line),
# then read the file back into a list of dicts
def dump_and_load_dataset(dataset, filename):
    with open(filename, "w") as outfile:
        for record in dataset:
            outfile.write(json.dumps(record) + "\n")
    with open(filename, "r") as f:
        output_dataset = [json.loads(line) for line in f if line.strip()]
    return output_dataset


# Create the datasets (lists of dicts) from the official split files
train_dataset = parse_dataset_files("./data/VOCdevkit/VOC2012/ImageSets/Main/train.txt")
val_dataset = parse_dataset_files("./data/VOCdevkit/VOC2012/ImageSets/Main/val.txt")
print(
    f"{len(train_dataset)} samples for training and {len(val_dataset)} samples for validation in the official splits"
)

# Dump the dicts to a JSON Lines file and load them back as a round-trip check
train_filename = "./data/train_labels_VOC.jsonl"
# val_filename="./data/val_labels_VOC.jsonl"
train_dataset = dump_and_load_dataset(train_dataset, train_filename)
# val_dataset=dump_and_load_dataset(val_dataset,val_filename)

# Hold out 33% of the training samples as the validation set (random_state=42).
# The result is assigned back to train_dataset so that the validation samples
# are not also written to the training TFRecords later on.
train_dataset, val_dataset = train_test_split(train_dataset, test_size=0.33, random_state=42)
training_dataset_size = len(train_dataset)
validation_dataset_size = len(val_dataset)
print(
    f"{training_dataset_size} training and {validation_dataset_size} validation samples after the split"
)

# Persist the dataset sizes so that later notebooks can restore them with %store -r
%store training_dataset_size
%store validation_dataset_size

In [None]:
# creation of a dict to map classes with class ids
class_ids = sorted(list(class_ids))
# Creating a dictionary that maps numerical class IDs to their corresponding class names
class_mapping = dict(zip(range(len(class_ids)), class_ids))
# Creating an inverse dictionary that maps class names to their corresponding numerical class IDs
class_mapping_by_label = dict(zip(class_ids, range(len(class_ids))))


# This function takes a class label (string) as input
# and returns the corresponding numerical class ID
def class_text_to_int(label):
    if label in class_mapping_by_label.keys():
        return class_mapping_by_label[label]

In [None]:
# Persist the class mapping with IPython's %store magic so that later notebooks can restore it
%store class_mapping

In [None]:
# Print one sample record (the third element) of the training dataset
print(train_dataset[2])

#### Evaluating the Object Detection Data

To validate that our data is correct and that the bounding boxes are in the proper format, we can load and visualize the first 5 images in our dataset. This allows us to inspect the current image along with the bounding boxes and labels. By sampling these initial images, we can confirm the data is formatted appropriately before proceeding.
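
Visual inspection covers only a few images, so a programmatic check over every record can complement it. The following is a minimal sketch under the assumption that each record has the `width`, `height`, and `annotation` keys produced by the parser above. `find_invalid_boxes` and the sample record are hypothetical.

```python
def find_invalid_boxes(sample):
    """Return annotations whose box is degenerate or falls outside the image."""
    width = float(sample["width"])
    height = float(sample["height"])
    invalid = []
    for ann in sample["annotation"]:
        xmin, ymin, xmax, ymax = (float(v) for v in ann["bbox"])
        inside = 0 <= xmin and 0 <= ymin and xmax <= width and ymax <= height
        ordered = xmin < xmax and ymin < ymax
        if not (inside and ordered):
            invalid.append(ann)
    return invalid


# Hypothetical record: the second box has xmax beyond the image width
sample = {
    "image": "example.jpg",
    "width": "500",
    "height": "375",
    "annotation": [
        {"category": "dog", "bbox": ["48", "240", "195", "371"]},
        {"category": "cat", "bbox": ["400", "10", "520", "90"]},
    ],
}
bad = find_invalid_boxes(sample)
print([ann["category"] for ann in bad])
```

Running a check like this over the whole training list before writing any files makes annotation problems visible early.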

In [None]:
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw


# Function to plot bounding boxes on an image
def plot_boxes(image_path, bboxes, width, height):
    # width and height stay in the signature for reference; the boxes are already in pixels
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)

    # Loop over the annotations and draw each box with its label
    for bbox in bboxes:
        x1, y1, x2, y2 = (float(v) for v in bbox["bbox"])
        draw.rectangle([(x1, y1), (x2, y2)], outline="red", width=2)

        # Measure the label with textbbox (textsize was removed in Pillow 10)
        left, top, right, bottom = draw.textbbox((0, 0), bbox["category"])
        text_width, text_height = right - left, bottom - top
        # Keep the label inside the image when the box touches the top edge
        label_top = max(y1 - text_height - 5, 0)
        draw.rectangle(
            [(x1, label_top), (x1 + text_width + 10, label_top + text_height + 5)], fill="red"
        )
        draw.text((x1 + 5, label_top), bbox["category"], fill="white")

    # Display the image inline in the notebook
    plt.figure(figsize=(8, 6))
    plt.imshow(image)
    plt.axis("off")
    plt.show()


# Loop through the first 5 images in the dataset and plot their bounding boxes
for sample in train_dataset[:5]:
    plot_boxes(
        f"./data/VOCdevkit/VOC2012/JPEGImages/{sample['image']}",
        sample["annotation"],
        float(sample["width"]),
        float(sample["height"]),
    )

#### Addressing Class Imbalance
Imbalanced class distributions, where some classes are much more frequent than others, are very common in object detection datasets. Useful tactics for handling imbalance include undersampling, oversampling, and augmentation. In this case, we will use data augmentation in the Keras preprocessing layers inside our data loader. Image augmentation via random scaling, cropping, flipping, and rotation generates additional variations of the training images, including those that contain minority classes, which helps the model learn the rare classes too. Augmentation has been reported to improve mean average precision (mAP) substantially on imbalanced detection datasets, in some cases nearly doubling it.
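
For comparison with the augmentation approach used in this notebook, oversampling can be sketched in a few lines. This is an illustrative assumption and not what the notebook does: each image is weighted by the inverse frequency of its rarest class, and `oversampling_weights` and the toy dataset are hypothetical.

```python
from collections import Counter


def oversampling_weights(dataset):
    """Weight each image by the inverse frequency of its rarest class."""
    counts = Counter(ann["category"] for sample in dataset for ann in sample["annotation"])
    weights = []
    for sample in dataset:
        labels = {ann["category"] for ann in sample["annotation"]}
        # Images without annotations get weight 0 in this sketch
        weights.append(max((1.0 / counts[c] for c in labels), default=0.0))
    return counts, weights


# Hypothetical toy dataset: 'person' appears three times, 'boat' once
toy = [
    {"annotation": [{"category": "person"}]},
    {"annotation": [{"category": "person"}, {"category": "boat"}]},
    {"annotation": [{"category": "person"}]},
]
counts, weights = oversampling_weights(toy)
print(counts)
print(weights)
```

The weights could then be passed to `random.choices(dataset, weights=weights, k=...)` to draw a resampled training list in which images containing rare classes appear more often.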

In [None]:
import collections


# Define a function to count the number of instances for each class in the dataset
def count_dataset(dataset, name):
    categories = []
    for sample in dataset:
        for category in sample["annotation"]:
            categories.append(category["category"])
    class_counts = collections.Counter(categories)
    class_df = pd.DataFrame.from_dict(class_counts, orient="index").reset_index()
    class_df.columns = ["label", "count"]
    class_df["type"] = name

    return class_df

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

datasets = {"train": train_dataset, "val": val_dataset}
for key in datasets:
    dataset_df = count_dataset(datasets[key], key)
    ax = sns.catplot(
        data=dataset_df,
        x="count",
        y="label",
        col="type",
        kind="bar",
        height=6,
        aspect=0.8,
    )

In preprocessing, you have:
* Removed irrelevant features from the data

* Generated JSON Lines format files and created list of dictionary datasets for training and validation

* Visually validated bounding boxes, labels, and images to ensure proper formatting

* Identified imbalanced class distributions that may need augmentation or sampling to improve

Overall, you have cleaned and formatted the raw data into train and validation sets, confirmed integrity via sampling, and identified potential areas for improvement via balancing. After these key preprocessing steps, the data is ready for model training.


## 'TFRecords' dataset processing
When working with large datasets in TensorFlow, it is common to use the 'TFRecord' format. 'TFRecords' are a simple format for storing a sequence of binary records, typically data serialized as protobuf messages. The benefits of using 'TFRecords' include:

* More efficient I/O performance - reading serialized protobuf messages sequentially from a few large files is much faster than opening and parsing many individual image and annotation files.
* Optimization - the records plug directly into the 'tf.data' input pipeline, so parsing, pre-processing, and data augmentation can run as the data is read.
* Portability - 'TFRecord' files can be used across different environments.
* Compact - 'TFRecord' files take up less space compared to uncompressed images.

For object detection datasets, we can store images, bounding boxes, classes, etc. in 'tf.train.Example' protocol buffer messages in a 'TFRecord' file.

'TFRecords' work nicely with TensorFlow's input pipeline for reading and parsing data efficiently during training. They can also be used with SageMaker Pipe Mode for large datasets. Pipe Mode allows us to stream data directly from Amazon S3 via 'TFRecord' files during training without needing to store the full dataset locally. This enables training on datasets that are larger than local storage capacity.

#### Creating 'TFRecord' Files
'TFRecord' is a simple format for storing machine learning data. It allows us to store image and annotation data together in a single file.

To create 'TFRecord' files for our object detection dataset, we first need to encode the image data and bounding box information from our dataset into 'tf.train.Example' protocol buffers. Each Example 'proto' contains the following fields:
* 'height': Image height
* 'width': Image width
* 'filename': Filename of the image
* 'image': Encoded image bytes
* 'object/bbox/xmin': Left x coordinates of the bounding boxes, in pixels
* 'object/bbox/xmax': Right x coordinates of the bounding boxes, in pixels
* 'object/bbox/ymin': Top y coordinates of the bounding boxes, in pixels
* 'object/bbox/ymax': Bottom y coordinates of the bounding boxes, in pixels
* 'object/label': Class label indices
* 'object/text': Class label texts

The 'object/...' fields are lists with one entry per object in the image. The bounding box coordinates are stored as they appear in the annotations, that is, in absolute pixel units rather than normalized to the image size. We can use the 'tf.io.encode_jpeg()' function to encode the image bytes, and the bounding box data can be obtained from the annotations.
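
Data loaders differ in whether they expect box coordinates in pixels or as fractions of the image size, so it helps to be explicit about which convention the stored values follow. The following is a minimal sketch of the conversion in both directions. `normalize_box` and `denormalize_box` are hypothetical helper names, not part of this notebook or of TensorFlow, and the example values are made up.

```python
def normalize_box(bbox, width, height):
    """Convert [xmin, ymin, xmax, ymax] in pixels to fractions of the image size."""
    xmin, ymin, xmax, ymax = (float(v) for v in bbox)
    return [xmin / width, ymin / height, xmax / width, ymax / height]


def denormalize_box(bbox, width, height):
    """Convert fractional [xmin, ymin, xmax, ymax] back to pixel coordinates."""
    xmin, ymin, xmax, ymax = bbox
    return [xmin * width, ymin * height, xmax * width, ymax * height]


# Illustrative box in a 500 x 375 image; string inputs mimic the parsed XML values
pixel_box = ["50", "75", "250", "300"]
relative_box = normalize_box(pixel_box, width=500, height=375)
restored_box = denormalize_box(relative_box, width=500, height=375)
print(relative_box)
print(restored_box)
```

Note that `create_example` in this notebook appends the annotation values unchanged, so a conversion like this would only be needed if a downstream loader is configured for relative coordinates.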

In [None]:
import tensorflow as tf


# TFRecords helper functions
def image_feature(value):
    # This function takes a value (likely an image) and encodes it as a JPEG image
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()]))


def int64_feature(value):
    # This function takes an integer value and returns a tf.train.Feature object
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def int64_list_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def bytes_list_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))


def float_list_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


def create_example(image, image_path, group):
    # This function takes an image, its file path, and a group of annotations
    # It creates a tf.train.Example object with various features extracted from the data

    width = int(group["width"])
    height = int(group["height"])
    classes_text = []
    classes = []
    xmin = []
    ymin = []
    xmax = []
    ymax = []

    # Iterate over the annotations in the group
    for row in group["annotation"]:
        # Extract the bounding box coordinates (in pixels) and the class information
        xmin.append(float(row["bbox"][0]))
        ymin.append(float(row["bbox"][1]))
        xmax.append(float(row["bbox"][2]))
        ymax.append(float(row["bbox"][3]))
        classes.append(class_text_to_int(row["category"]))
        classes_text.append(row["category"].encode("utf8"))

    # Create a tf.train.Example object with various features
    feature = tf.train.Example(
        features=tf.train.Features(
            feature={
                "height": int64_feature(height),
                "width": int64_feature(width),
                "filename": bytes_feature(group["image"].encode("utf8")),
                "image": image_feature(image),
                "object/bbox/xmin": float_list_feature(xmin),
                "object/bbox/xmax": float_list_feature(xmax),
                "object/bbox/ymin": float_list_feature(ymin),
                "object/bbox/ymax": float_list_feature(ymax),
                "object/text": bytes_list_feature(classes_text),
                "object/label": int64_list_feature(classes),
            }
        )
    )

    return feature

Once we have encoded the data into Examples, we can write them to a 'TFRecord' file using 'tf.io.TFRecordWriter'. Multiple Examples are serialized and written to each file.
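
The split of a dataset into files comes down to ceiling division. As a small worked sketch, assuming 1024 samples per file and the 5717 training images listed in the data splits above (`shard_bounds` is a hypothetical helper):

```python
import math


def shard_bounds(total, per_file):
    """Return (start, end) index pairs covering `total` samples in chunks of `per_file`."""
    num_files = math.ceil(total / per_file)
    return [(i * per_file, min((i + 1) * per_file, total)) for i in range(num_files)]


bounds = shard_bounds(5717, 1024)
sizes = [end - start for start, end in bounds]
print(len(bounds), sizes)
```

With these numbers there are six files: five full files of 1024 samples and a last file with the remaining 597.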

In [None]:
# Create the output folders for the TFRecord files
tfrecords_dir = "./data/tfrecords"
os.makedirs(tfrecords_dir + "/train", exist_ok=True)
os.makedirs(tfrecords_dir + "/val", exist_ok=True)

In [None]:
num_samples = 1024  # number of samples on each TFRecord file
num_tfrecords_train = len(train_dataset) // num_samples
num_tfrecords_val = len(val_dataset) // num_samples
if len(train_dataset) % num_samples:
    num_tfrecords_train += 1  # add one record if there are any remaining samples
if len(val_dataset) % num_samples:
    num_tfrecords_val += 1  # add one record if there are any remaining samples

In [None]:
# Define the path to the directory containing the images
images_dir = "data/VOCdevkit/VOC2012/JPEGImages"
# Loop through the specified number of TFRecord files for the training dataset
for tfrec_num in range(num_tfrecords_train):
    # Select a subset of samples from the training dataset based on the current TFRecord file index
    samples = train_dataset[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]
    # Open a TFRecordWriter to create a new TFRecord file
    with tf.io.TFRecordWriter(
        tfrecords_dir + "/train/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
    ) as writer:
        # Iterate through the selected samples
        for sample in samples:
            # Construct the full path to the image file
            image_path = f"{images_dir}/{sample['image']}"
            # Read the image file and decode the JPEG data
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            # Create a TFExample from the image data and sample metadata
            example = create_example(image, image_path, sample)
            # Write the TFExample to the TFRecord file
            writer.write(example.SerializeToString())

In [None]:
# Loop through the specified number of TFRecord files for the validation dataset
for tfrec_num in range(num_tfrecords_val):
    samples = val_dataset[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]
    with tf.io.TFRecordWriter(
        tfrecords_dir + "/val/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
    ) as writer:
        for sample in samples:
            image_path = f"{images_dir}/{sample['image']}"
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            example = create_example(image, image_path, sample)
            writer.write(example.SerializeToString())

During training, we can read the image and label data efficiently from the 'TFRecord' files using 'tf.data.TFRecordDataset' and parse the Examples back into tensors. This provides an optimized input pipeline for our model.

In [None]:
# Helper function to parse TFRecordDataset back to tensors
def parse_tfrecord_fn(example):
    # Define the structure of the TFRecord file
    feature_description = {
        "height": tf.io.FixedLenFeature((), tf.int64),
        "width": tf.io.FixedLenFeature((), tf.int64),
        "filename": tf.io.FixedLenFeature((), tf.string),
        "image": tf.io.FixedLenFeature((), tf.string),
        "object/bbox/xmin": tf.io.VarLenFeature(tf.float32),
        "object/bbox/xmax": tf.io.VarLenFeature(tf.float32),
        "object/bbox/ymin": tf.io.VarLenFeature(tf.float32),
        "object/bbox/ymax": tf.io.VarLenFeature(tf.float32),
        "object/text": tf.io.VarLenFeature(tf.string),
        "object/label": tf.io.VarLenFeature(tf.int64),
    }
    # Parse the example from the TFRecord file
    example = tf.io.parse_single_example(example, feature_description)
    # Decode the JPEG image data and convert it to a float32 tensor
    example["image"] = tf.cast(tf.io.decode_jpeg(example["image"], channels=3), tf.float32)
    # Convert the filename to a string tensor
    example["filename"] = tf.cast(example["filename"], tf.string)
    # Convert the sparse tensors to dense tensors
    example["object/bbox/xmin"] = tf.sparse.to_dense(example["object/bbox/xmin"])
    example["object/bbox/xmax"] = tf.sparse.to_dense(example["object/bbox/xmax"])
    example["object/bbox/ymin"] = tf.sparse.to_dense(example["object/bbox/ymin"])
    example["object/bbox/ymax"] = tf.sparse.to_dense(example["object/bbox/ymax"])
    example["object/text"] = tf.sparse.to_dense(example["object/text"])
    example["object/label"] = tf.sparse.to_dense(example["object/label"])

    # Combine the bounding box coordinates into a single tensor
    example["object/bbox"] = tf.stack(
        [
            example["object/bbox/xmin"],
            example["object/bbox/ymin"],
            example["object/bbox/xmax"],
            example["object/bbox/ymax"],
        ],
        axis=1,
    )

    return example


# Helper function to prepare a sample in the following format
# {"images": image, "bounding_boxes": {
#        "classes": label_list,
#        "boxes": boxes_list,
#    }
# }
def prepare_sample(inputs):
    image = inputs["image"]  # Get the image tensor
    boxes = inputs["object/bbox"]  # Get the bounding box tensor
    bounding_boxes = {
        "classes": inputs["object/label"],  # Get the object labels
        "boxes": boxes,  # Get the bounding box coordinates
    }
    return {
        "images": image,
        "bounding_boxes": bounding_boxes,
    }  # Return the sample in the desired format

#### Validating the 'TFRecord' Files
After creating the 'TFRecord' files, it's important to validate that they were created correctly before training.

We can load and parse a sample of the data and visualize the bounding boxes overlaid on the images. This allows us to verify that the image data and annotations match up properly.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image, ImageDraw


# Function to plot bounding boxes on an image parsed from a TFRecord
def plot_boxes_tfrecords(features):
    # Convert the float32 image tensor back to an 8-bit RGB image
    data = features["images"].numpy().astype(np.uint8)
    image = Image.fromarray(data)

    # Create a drawing object
    draw = ImageDraw.Draw(image)

    boxes = features["bounding_boxes"]["boxes"].numpy()
    classes = features["bounding_boxes"]["classes"].numpy()
    # Iterate over the bounding boxes and draw each one with its class name
    for box, class_id in zip(boxes, classes):
        x1, y1, x2, y2 = (float(v) for v in box)
        category = class_mapping[int(class_id)]
        draw.rectangle([(x1, y1), (x2, y2)], outline="red", width=2)
        draw.text((x1 + 5, y1 + 5), category, fill="red")

    # Display the image inline in the notebook
    plt.imshow(image)
    plt.axis("off")
    plt.show()

This plots the image and overlays the bounding box coordinates loaded from the 'TFRecord' file.

By visualizing several examples in this way, we can verify that the 'TFRecord' files contain the correct data before training our model.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

BATCH_SIZE = 4

# Load the first TFRecord file for testing; this name assumes the file holds exactly `num_samples` examples
raw_dataset_sample = tf.data.TFRecordDataset(f"{tfrecords_dir}/train/file_00-{num_samples}.tfrec")
# Parse each serialized example into a dict of tensors
raw_dataset_sample = raw_dataset_sample.map(parse_tfrecord_fn)
# Reshape each parsed example into the {"images", "bounding_boxes"} format
parsed_dataset_sample = raw_dataset_sample.map(prepare_sample)
# Shuffle using a buffer of `BATCH_SIZE * 4` elements
sample_ds = parsed_dataset_sample.shuffle(BATCH_SIZE * 4)
# Take a single (unbatched) example from the shuffled dataset
data = next(iter(sample_ds.take(1)))
# Visualize the sample
plot_boxes_tfrecords(data)

## Preprocessing the Dataset with SageMaker


One of the benefits of using SageMaker for preprocessing data is that we can leverage powerful compute instances to speed up our data preparation scripts. Although for this use case we could run the preprocessing locally, for larger datasets it is useful to execute our scripts on optimized SageMaker instances.

We will use a custom script that implements the object detection data preprocessing steps discussed previously. By containerizing this script, we can execute it on SageMaker processing jobs and save the outputs to S3.



The 'TensorFlowProcessor' is a SageMaker component that allows you to run TensorFlow scripts or Docker containers as processing jobs on SageMaker. It provides a convenient way to preprocess data, perform feature engineering, or run any other data processing tasks using TensorFlow.

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlowProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Get the current AWS region
region = boto3.session.Session().region_name
# Get the execution role for SageMaker
role = get_execution_role()

# Initialize the TensorFlowProcessor
tp = TensorFlowProcessor(
    framework_version="2.3",  # TensorFlow version to use
    role=role,  # AWS IAM role for SageMaker to access AWS resources
    instance_type="ml.m5.xlarge",  # Instance type for the processing job
    instance_count=1,  # Number of instances to use for the processing job
    base_job_name="frameworkprocessor-TF",  # Base name for the processing job
    py_version="py37",  # Python version to use
    sagemaker_session=sagemaker_session,  # SageMaker session object
)

In [None]:
# Define the name of the output file for the processing job
processing_job_output_name = "processing_job_output.csv"

First, we will write our data preparation code to a standalone script that the processing job can run. The script handles tasks like splitting the dataset into train and validation sets, converting labels to the required format, and encoding the image data.
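
To make the label-conversion step concrete, here is a minimal sketch, not the script itself, of how one Pascal VOC annotation can be turned into a record and how class names can be mapped to integer ids. The inline XML is a made-up example, and the names `parse_annotation` and `class_to_id` are illustrative. The sketch uses the standard-library `xml.etree.ElementTree` to stay self-contained, whereas the script uses `defusedxml`, which is the safer choice for untrusted files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down Pascal VOC annotation used only for illustration.
VOC_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>cat</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Turn one VOC annotation into a dict of image size and labeled boxes."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append(
            {
                "category": obj.find("name").text,
                "bbox": [float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")],
            }
        )
    return {
        "image": root.find("filename").text,
        "width": int(size.find("width").text),
        "height": int(size.find("height").text),
        "annotation": objects,
    }


record = parse_annotation(VOC_XML)

# Map class names to integer ids in sorted order so the mapping is reproducible.
class_names = sorted({obj["category"] for obj in record["annotation"]})
class_to_id = {name: index for index, name in enumerate(class_names)}
labels = [class_to_id[obj["category"]] for obj in record["annotation"]]
print(labels)  # prints [1, 0]
```

Sorting the class names before assigning ids keeps the mapping stable across runs, which matters because the integer labels are what end up in the 'TFRecord' files.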

Next, we will configure and launch a SageMaker processing job to run this script on the input dataset from S3. The processor defined above uses a single ml.m5.xlarge instance; for larger datasets, a more powerful instance type can be selected to accelerate the preprocessing.

The output preprocessed object detection datasets are saved back to S3. Now we have an efficient way to preprocess large volumes of data while leveraging SageMaker's managed compute resources. Running on capable instances improves the speed of our data pipeline.

This demonstrates how SageMaker processing jobs allow us to customize and automate preprocessing on scalable infrastructure for computer vision tasks like object detection.
You can find a complete guide to the SageMaker Processing job in [this blog](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/).
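
The script that follows writes its output as 'TFRecord' files holding up to 1,024 samples each, named with the shard index and the number of samples in the shard. As a rough sketch of that layout (the helper `shard_names` and the sample count of 2,500 are hypothetical):

```python
def shard_names(total_samples, samples_per_file=1024, prefix="file"):
    """List the TFRecord file names produced when samples are written in fixed-size shards."""
    names = []
    for shard, start in enumerate(range(0, total_samples, samples_per_file)):
        # The last shard holds whatever is left over.
        count = min(samples_per_file, total_samples - start)
        names.append("%s_%.2i-%i.tfrec" % (prefix, shard, count))
    return names


print(shard_names(2500))
# prints ['file_00-1024.tfrec', 'file_01-1024.tfrec', 'file_02-452.tfrec']
```

Encoding the sample count in the file name makes it possible to compute the dataset size from a directory listing without reading the records.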

In [None]:
# creates local dir
!mkdir -p preprocessing

In [None]:
%%writefile preprocessing/preprocessing_dataset.py
# Import necessary libraries
import sys
import subprocess
import os
import warnings
import time
import argparse
import boto3
import pandas as pd
import sagemaker
import json
import glob
import tensorflow as tf
import logging
import pathlib
from sagemaker.s3 import S3Downloader
from sagemaker.session import Session
from sklearn.model_selection import train_test_split
from defusedxml.ElementTree import parse

start_time = time.time()  # Record the start time for timing purposes

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

# This set will store unique class names from the dataset
class_ids = set()


# Function to parse an XML file and return a dictionary with image and annotation details
def xml_data_parser(xml_file):
    # Return an empty dict if the XML file is missing
    if not os.path.isfile(xml_file):
        print(f"{xml_file} does not exist")
        return {}

    with open(xml_file) as f:
        root = parse(f).getroot()

    size = root.find("size")
    annotation_dict = {
        "image": root.find("filename").text,
        "width": size.find("width").text,
        "height": size.find("height").text,
        "annotation": [],
    }
    # Collect the class name and bounding box of every annotated object
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        name = obj.find("name").text
        annotation_dict["annotation"].append(
            {
                "category": name,
                "bbox": [bndbox.find(tag).text for tag in ("xmin", "ymin", "xmax", "ymax")],
            }
        )
        class_ids.add(name)  # Add the class name to the set
    return annotation_dict


# Function to parse the dataset files and return a list of dictionaries
def parse_dataset_files(base_path, filename):
    # Load the file and create a list of filenames
    with open(filename, "r") as f:
        dataset_lines = f.readlines()
        dataset_files = [i.strip() for i in dataset_lines]

    # Create a list of XML file paths
    xml_files = [
        f"{base_path}/VOCdevkit/VOC2012/Annotations/{dataset_files[i]}.xml"
        for i in range(len(dataset_files))
    ]
    result = [xml_data_parser(xml_file) for xml_file in xml_files]
    return result


# Function to dump a dataset to a file and load it back
# Note: the file holds a single JSON array, even though callers use a .jsonl extension
def dump_and_load_dataset(dataset, filename):
    # Dumps the list of dicts to a JSON file
    with open(filename, "w") as outfile:
        json.dump(dataset, outfile)
    # Loads the JSON file back into a list of dicts
    with open(filename, "r") as f:
        output_dataset = json.load(f)
    return output_dataset


# Function to map class labels to integers
def class_text_to_int(label):
    if label in class_mapping_by_label.keys():
        return class_mapping_by_label[label]


# TFRecords helper functions
def image_feature(value):
    """Returns a bytes_list holding the JPEG-encoded image tensor."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()]))


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def int64_list_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def bytes_list_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))


def float_list_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


# Function to create a TFRecord example from a sample
def create_example(image, image_path, group):
    width = int(group["width"])
    height = int(group["height"])
    classes_text = []
    classes = []
    xmin = []
    ymin = []
    xmax = []
    ymax = []

    for index, row in enumerate(group["annotation"]):
        xmin.append(float(row["bbox"][0]))
        ymin.append(float(row["bbox"][1]))
        xmax.append(float(row["bbox"][2]))
        ymax.append(float(row["bbox"][3]))
        classes.append(class_text_to_int(row["category"]))
        classes_text.append(row["category"].encode("utf8"))

    feature = tf.train.Example(
        features=tf.train.Features(
            feature={
                "height": int64_feature(height),
                "width": int64_feature(width),
                "filename": bytes_feature(group["image"].encode("utf8")),
                "image": image_feature(image),
                "object/bbox/xmin": float_list_feature(xmin),
                "object/bbox/xmax": float_list_feature(xmax),
                "object/bbox/ymin": float_list_feature(ymin),
                "object/bbox/ymax": float_list_feature(ymax),
                "object/text": bytes_list_feature(classes_text),
                "object/label": int64_list_feature(classes),
            }
        )
    )

    return feature


# Helper function to parse TFRecordDataset back to tensors
def parse_tfrecord_fn(example):
    feature_description = {
        "height": tf.io.FixedLenFeature((), tf.int64),
        "width": tf.io.FixedLenFeature((), tf.int64),
        "filename": tf.io.FixedLenFeature((), tf.string),
        "image": tf.io.FixedLenFeature((), tf.string),
        "object/bbox/xmin": tf.io.VarLenFeature(tf.float32),
        "object/bbox/xmax": tf.io.VarLenFeature(tf.float32),
        "object/bbox/ymin": tf.io.VarLenFeature(tf.float32),
        "object/bbox/ymax": tf.io.VarLenFeature(tf.float32),
        "object/text": tf.io.VarLenFeature(tf.string),
        "object/label": tf.io.VarLenFeature(tf.int64),
    }
    # Parse the single example
    example = tf.io.parse_single_example(example, feature_description)
    # Preprocess the image and bounding box data
    example["image"] = tf.cast(tf.io.decode_jpeg(example["image"], channels=3), tf.float32)
    example["filename"] = tf.cast(example["filename"], tf.string)
    example["object/bbox/xmin"] = tf.sparse.to_dense(example["object/bbox/xmin"])
    example["object/bbox/xmax"] = tf.sparse.to_dense(example["object/bbox/xmax"])
    example["object/bbox/ymin"] = tf.sparse.to_dense(example["object/bbox/ymin"])
    example["object/bbox/ymax"] = tf.sparse.to_dense(example["object/bbox/ymax"])
    example["object/text"] = tf.sparse.to_dense(example["object/text"])
    example["object/label"] = tf.sparse.to_dense(example["object/label"])
    # Combine the bounding box coordinates into a single tensor
    example["object/bbox"] = tf.stack(
        [
            example["object/bbox/xmin"],
            example["object/bbox/ymin"],
            example["object/bbox/xmax"],
            example["object/bbox/ymax"],
        ],
        axis=1,
    )

    return example


# Helper function to prepare sample in the expected format
def prepare_sample(inputs):
    image = inputs["image"]
    boxes = inputs["object/bbox"]
    bounding_boxes = {
        "classes": inputs["object/label"],
        "boxes": boxes,
    }
    return {"images": image, "bounding_boxes": bounding_boxes}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    args, _ = parser.parse_known_args()
    print("Received arguments {}".format(args))

    base_dir = "/opt/ml/processing"
    pathlib.Path(f"{base_dir}/input").mkdir(parents=True, exist_ok=True)
    output_dir = base_dir + "/output"
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

    # SageMaker has already copied the input data from S3 to the local input path
    logger.info("locating input files")

    input_path = f"{base_dir}/input"

    # untar the partitioned data files into the data folder
    logger.info("extracting files")
    subprocess.run(["tar", "-xf", f"{input_path}/VOCtrainval_11-May-2012.tar", "-C", input_path])

    # Creation of datasets (list of dicts)
    logger.info("Creating Json files")
    train_dataset = parse_dataset_files(
        input_path, f"{input_path}/VOCdevkit/VOC2012/ImageSets/Main/train.txt"
    )
    val_dataset = parse_dataset_files(
        input_path, f"{input_path}/VOCdevkit/VOC2012/ImageSets/Main/val.txt"
    )
    print(
        f"{len(train_dataset)} samples for training and {len(val_dataset)} samples for validation before loading"
    )

    # Dump the dicts to a JSON file and load them back for testing purposes
    train_filename = f"{output_dir}/train_labels_VOC.jsonl"
    val_filename = f"{output_dir}/val_labels_VOC.jsonl"
    train_dataset = dump_and_load_dataset(train_dataset, train_filename)
    # val_dataset=dump_and_load_dataset(val_dataset,val_filename)

    # Split the train dataset into train and validation sets.
    # Reassign train_dataset so the validation samples are not also written to the training records.
    train_dataset, val_dataset = train_test_split(train_dataset, test_size=0.33, random_state=42)

    print(
        f"{len(train_dataset)} samples for training and {len(val_dataset)} samples for validation after splitting"
    )

    # creation of a dict to map classes with class ids
    class_ids = sorted(list(class_ids))
    class_mapping = dict(zip(range(len(class_ids)), class_ids))
    class_mapping_by_label = dict(zip(class_ids, range(len(class_ids))))

    logger.info("Creating TFRecords")
    # creates output folders
    tfrecords_dir = f"{output_dir}/tfrecords"
    if not os.path.exists(tfrecords_dir + "/train"):
        os.makedirs(tfrecords_dir + "/train")
    if not os.path.exists(tfrecords_dir + "/val"):
        os.makedirs(tfrecords_dir + "/val")

    num_samples = 1024  # number of samples on each TFRecord file
    num_tfrecords_train = len(train_dataset) // num_samples
    num_tfrecords_val = len(val_dataset) // num_samples
    if len(train_dataset) % num_samples:
        num_tfrecords_train += 1  # add one record if there are any remaining samples
    if len(val_dataset) % num_samples:
        num_tfrecords_val += 1  # add one record if there are any remaining samples

    images_dir = f"{input_path}/VOCdevkit/VOC2012/JPEGImages"
    logger.info("Creating Training Records")
    for tfrec_num in range(num_tfrecords_train):
        samples = train_dataset[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]

        with tf.io.TFRecordWriter(
            tfrecords_dir + "/train/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
        ) as writer:
            for sample in samples:
                image_path = f"{images_dir}/{sample['image']}"
                image = tf.io.decode_jpeg(tf.io.read_file(image_path))
                example = create_example(image, image_path, sample)
                writer.write(example.SerializeToString())
    logger.info("Creating validation records")
    for tfrec_num in range(num_tfrecords_val):
        samples = val_dataset[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]

        with tf.io.TFRecordWriter(
            tfrecords_dir + "/val/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
        ) as writer:
            for sample in samples:
                image_path = f"{images_dir}/{sample['image']}"
                image = tf.io.decode_jpeg(tf.io.read_file(image_path))
                example = create_example(image, image_path, sample)
                writer.write(example.SerializeToString())
    logger.info(f"Processed data saved to {output_dir}")

In [None]:
# output_path = processing_output_filename
s3_dataset_uri = S3_dataset_bucket

In [None]:
%%time
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Define the S3 path where the output of the processing job will be stored
processing_job_output_path = f"s3://{bucket}/{prefix}/data/processing/output"
# Run the processing job
tp.run(
    # Specify the Python script to run for preprocessing
    code="preprocessing_dataset.py",
    # Specify the Python script folder for preprocessing
    source_dir="preprocessing",
    # Define the input data for the processing job
    inputs=[
        ProcessingInput(
            source=s3_dataset_uri,  # The S3 URI of the input data
            destination="/opt/ml/processing/input",  # The local path where the input data will be copied
        )
    ],
    # Define the output location for the processed data
    outputs=[
        ProcessingOutput(
            output_name="processed_data",  # A name for the output
            source="/opt/ml/processing/output",  # The local path where the processed data will be saved
            destination=processing_job_output_path,  # The S3 path where the processed data will be uploaded
        )
    ],
)

# Describe the processing job to get details about its status, output, etc.
preprocessing_job_description = tp.jobs[-1].describe()

In [None]:
preprocessing_job_description["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]

Congratulations! You have preprocessed the data. Now you can find the processed data at the S3 URI from the preprocessing job outputs.
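
The lookup above takes the first entry of the outputs list. When a job declares several outputs, selecting by name can be more robust. The following is a minimal sketch, assuming the job description has the shape returned when describing a processing job. The helper `output_uri` is hypothetical, and the example dictionary is a heavily trimmed stand-in with a placeholder bucket and prefix.

```python
def output_uri(job_description, output_name):
    """Return the S3 URI of the named output from a processing job description."""
    for output in job_description["ProcessingOutputConfig"]["Outputs"]:
        if output["OutputName"] == output_name:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"No output named {output_name!r} in the job description")


# Hypothetical, heavily trimmed description; the bucket and prefix are placeholders.
example_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "processed_data",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/example-prefix/data/processing/output",
                    "LocalPath": "/opt/ml/processing/output",
                },
            }
        ]
    }
}

print(output_uri(example_description, "processed_data"))
# prints s3://example-bucket/example-prefix/data/processing/output
```

With the real job, the same helper could be called as `output_uri(preprocessing_job_description, "processed_data")`, since `processed_data` is the output name passed to `ProcessingOutput`.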

### Accessing Processing Job Results

Once a SageMaker processing job has finished running, its results, including any output artifacts, are stored in the S3 location specified when the job was created. To determine where these results are located, we can view the 'ProcessingOutputConfig' field returned when describing the processing job:

#### Find the output of the Processing Job

In [None]:
processing_job_output_uri = preprocessing_job_description["ProcessingOutputConfig"]["Outputs"][0][
    "S3Output"
]["S3Uri"]
processing_job_output_uri

In [None]:
# Store the S3 URI of the preprocessing output
# This value will be available for use in the second Jupyter notebook
# It allows you to access the output from the preprocessing job
# without having to hardcode the S3 path again
%store processing_job_output_uri

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|object_detection_with_tensorflow_and_tfrecords|1_object_detection_preprocessing.ipynb)
