# TensorFlow Object Detection API and AWS SageMaker

In this notebook, you will train and evaluate different models using the [TensorFlow Object Detection API](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/) and [AWS SageMaker](https://aws.amazon.com/sagemaker/).

If you ever feel stuck, you can refer to this [tutorial](https://aws.amazon.com/blogs/machine-learning/training-and-deploying-models-using-tensorflow-2-with-the-object-detection-api-on-amazon-sagemaker/).

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported to the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved at a resolution of 640x640.

In [1]:
%%capture
%pip install tensorflow_io sagemaker -U

In [2]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Save the IAM role in a variable called `role`. It will be needed when launching the training jobs.

In [3]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::118177437226:role/service-role/AmazonSageMaker-ExecutionRole-20260119T161592


In [4]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
          'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://udacity-nd0013-proj1-bucket/logs/'

## Container

To train the model, you will first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clone the TensorFlow models repository
* get the exporter and training scripts from the repository
* build the docker image and push it 
* print the container name

In [5]:
%%bash

# clone the repo and get the scripts
git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
cp docker/models/research/object_detection/model_main_tf2.py source_dir

fatal: destination path 'docker/models' already exists and is not an empty directory.


In [6]:
# build and push the docker image. This code can be commented out after being run once.
# This will take around 10 mins.
image_name = 'tf2-object-detection'
#!sh ./docker/build_and_push.sh $image_name

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look it up in the AWS console. The example below shows three different images pushed to ECR; you should see only one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [7]:
# display the container name
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # the first line holds the full image URI; strip the trailing newline
    container = f.readline().strip()

print(container)

118177437226.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20260121145704
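
The container name printed above follows the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a quick sanity check before launching a job, the string can be split into its parts. This is a minimal sketch; `parse_ecr_uri` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Pattern for "<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>".
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Split an ECR image URI into its parts, or raise ValueError."""
    match = ECR_URI.match(uri.strip())
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


# The URI is the one printed by the cell above.
parts = parse_ecr_uri(
    "118177437226.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20260121145704"
)
print(parts["repository"], parts["tag"])
```

If the repository is not `tf2-object-detection`, the build script most likely pushed under a different `image_name`.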


## Pre-trained model from model zoo

As is common practice, we do not train from scratch; instead, we start from a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640	
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)	
* Faster R-CNN ResNet50 V1 640x640	
* EfficientDet D1 640x640	
* Faster R-CNN ResNet152 V1 640x640	

In the code below, the EfficientDet D1 model is downloaded and extracted. This code should be adjusted if you were to experiment with other architectures.

In [8]:
%%bash
mkdir -p /tmp/checkpoint
mkdir -p source_dir/checkpoint
wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint

efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001
efficientdet_d1_coco17_tpu-32/checkpoint/checkpoint
efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.index


mkdir: cannot create directory ‘source_dir/checkpoint’: File exists
--2026-01-22 12:31:17--  http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 192.178.218.207, 172.253.62.207, 172.253.115.207, ...
Connecting to download.tensorflow.org (download.tensorflow.org)|192.178.218.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51839363 (49M) [application/x-tar]
Saving to: ‘/tmp/efficientdet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 11.1M 4s
    50K .......... .......... .......... .......... ..........  0% 20.8M 3s
   100K .......... .......... .......... .......... ..........  0% 22.9M 3s
   150K .......... .......... .......... .......... ..........  0% 20.9M 3s
   200K .......... .......... .......... .......... ..........  0% 73.5M 2s
   250K .......... .......... .......... .......... ..........  0%  125M 2s
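
When switching architectures, only the model name and the destination folder change in the download cell. The sketch below generates the same three shell commands for any model name. The URL prefix is copied from the cell above; other zoo entries may live under a different date folder, so treat it as an assumption. `checkpoint_commands` and its default `archive_path` are hypothetical.

```python
# URL prefix copied from the download cell in this notebook (assumption:
# the model you want is published under the same date folder).
ZOO_PREFIX = "http://download.tensorflow.org/models/object_detection/tf2/20200711"


def checkpoint_commands(model_name, dest_dir, archive_path="/tmp/model.tar.gz"):
    """Return shell commands that fetch `model_name` and unpack only its checkpoint."""
    url = f"{ZOO_PREFIX}/{model_name}.tar.gz"
    return [
        f"mkdir -p {dest_dir}",
        f"wget -O {archive_path} {url}",
        # --strip-components 2 drops "<model_name>/checkpoint/" from each path.
        f"tar -zxvf {archive_path} --strip-components 2 "
        f"--directory {dest_dir} {model_name}/checkpoint",
    ]


for cmd in checkpoint_commands(
    "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8",
    "source_dir/mobilenet/checkpoint",
):
    print(cmd)
```

The printed lines can be pasted into a `%%bash` cell.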
  

## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. The different config files are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You would need to modify it when working with other models.
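
The fields that usually need changing when adapting a zoo config are `num_classes`, `batch_size`, `fine_tune_checkpoint`, and `fine_tune_checkpoint_type`. The sketch below patches such scalar fields in the text of a config without any TensorFlow dependency. `set_config_field` is a hypothetical helper, the excerpt is made up, and the values (3 classes, `checkpoint/ckpt-0`) are placeholders to adapt to your own setup.

```python
import re


def set_config_field(config_text, field, value):
    """Replace every `field: <old>` occurrence in a text-format config.

    Strings are written with double quotes, numbers as-is. Raises KeyError
    when the field is absent, so a typo does not pass silently.
    """
    rendered = f'"{value}"' if isinstance(value, str) else str(value)
    pattern = re.compile(
        rf"^([ \t]*{re.escape(field)}[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + rendered, config_text)
    if count == 0:
        raise KeyError(field)
    return new_text


# Hypothetical excerpt; a real pipeline.config has many more fields.
sample = """model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 64
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
"""

updated = set_config_field(sample, "num_classes", 3)
updated = set_config_field(updated, "fine_tune_checkpoint", "checkpoint/ckpt-0")
updated = set_config_field(updated, "fine_tune_checkpoint_type", "detection")
print(updated)
```

Editing the file by hand works just as well; the point is that each architecture needs these fields checked.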

## Launch Training Job

Now that we have a dataset, a Docker image, and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) estimator, in which we specify the container name, the name of the config file, the number of training steps, and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model

Different metrics are displayed during the evaluation phase, including the mean average precision (mAP). Use these metrics to quantify your model's performance and to compare it across iterations.

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the web UI.
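
The contents of `run_training.sh` are not shown in this notebook, so the following is only a sketch of what the three steps could look like. It maps the estimator's hyperparameters to flags that `model_main_tf2.py` and `exporter_main_v2.py` accept. `training_commands` and `export_dir` are hypothetical.

```python
def training_commands(hp, export_dir="/tmp/exported"):
    """Return the three commands as argument lists: train, evaluate, export.

    `hp` mirrors the `hyperparameters` dict passed to the estimator.
    `export_dir` is a hypothetical output location.
    """
    common = [
        f"--pipeline_config_path={hp['pipeline_config_path']}",
        f"--model_dir={hp['model_dir']}",
    ]
    train = ["python", "model_main_tf2.py", *common,
             f"--num_train_steps={hp['num_train_steps']}"]
    # Passing --checkpoint_dir switches model_main_tf2.py to evaluation mode.
    evaluate = ["python", "model_main_tf2.py", *common,
                f"--checkpoint_dir={hp['model_dir']}",
                f"--sample_1_of_n_eval_examples={hp['sample_1_of_n_eval_examples']}"]
    export = ["python", "exporter_main_v2.py",
              "--input_type=image_tensor",
              f"--pipeline_config_path={hp['pipeline_config_path']}",
              f"--trained_checkpoint_dir={hp['model_dir']}",
              f"--output_directory={export_dir}"]
    return train, evaluate, export


hyperparameters = {
    "model_dir": "/opt/training",
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": "2000",
    "sample_1_of_n_eval_examples": "1",
}
for command in training_commands(hyperparameters):
    print(" ".join(command))
```

Keeping `model_dir` identical across the three steps matters: evaluation and export both read the checkpoints that training wrote there.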

In [None]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)
estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir": "/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.g5.xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

.
2026-01-21 15:04:45 Downloading - Downloading the training image............
2026-01-21 15:06:37 Training - Training image download completed. Training in progress.
2026-01-21 15:06:43,461 sagemaker-training-toolkit INFO     Provided path: /opt/ml/code  is empty, unzipping
2026-01-21 15:06:44,346 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-21 15:06:44,381 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-21 15:06:44,416 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-21 15:06:44,430 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/val"
    },
    "current_host": "algo-1",
    "current_instance_group": "homoge

You should be able to see your model training in the AWS console, as shown below:
![Training jobs example](../data/example_trainings.png)


In [9]:
%%bash
# Download alternative 1 checkpoint
mkdir -p source_dir/mobilenet/checkpoint
wget -O /tmp/ssd-mobilenetv2.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/ssd-mobilenetv2.tar.gz --strip-components 2 --directory source_dir/mobilenet/checkpoint ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint

ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


--2026-01-22 12:31:18--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.167.207, 142.251.163.207, 172.253.139.207, ...
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.167.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20518283 (20M) [application/x-tar]
Saving to: ‘/tmp/ssd-mobilenetv2.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 17.8M 1s
    50K .......... .......... .......... .......... ..........  0% 29.8M 1s
   100K .......... .......... .......... .......... ..........  0% 70.4M 1s
   150K .......... .......... .......... .......... ..........  0% 32.3M 1s
   200K .......... .......... .......... .......... ..........  1% 36.8M 1s
   250K .......... .......... .......... .......... ..........  1%  242M 1s
   300K .......... .......... .......... .........

In [None]:
custom_suffix = "mobilenet"
new_tensorboard_s3_prefix = f"{tensorboard_s3_prefix}{custom_suffix}"
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=new_tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)
estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir": "/opt/training",        
        "pipeline_config_path": "pipeline_ssdmobilenetv2.config",
        "num_train_steps": "10000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.g5.xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection2'
)

estimator.fit(inputs)

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: tf2-object-detection2-2026-01-22-17-08-31-851


2026-01-22 17:08:43 Starting - Starting the training job
2026-01-22 17:08:43 Pending - Training job waiting for capacity......

In [11]:
%%bash
# Download alternative 2 checkpoint
mkdir -p source_dir/resnet50/checkpoint
wget -O /tmp/ssd-resnet50-v1.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/ssd-resnet50-v1.tar.gz --strip-components 2 --directory source_dir/resnet50/checkpoint ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint


ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


--2026-01-22 12:50:30--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.253.139.207, 142.251.163.207, 142.251.167.207, ...
Connecting to download.tensorflow.org (download.tensorflow.org)|172.253.139.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 244817203 (233M) [application/x-tar]
Saving to: ‘/tmp/ssd-resnet50-v1.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 11.1M 21s
    50K .......... .......... .......... .......... ..........  0% 16.2M 18s
   100K .......... .......... .......... .......... ..........  0% 50.4M 13s
   150K .......... .......... .......... .......... ..........  0% 34.6M 12s
   200K .......... .......... .......... .......... ..........  0% 69.4M 10s
   250K .......... .......... .......... .......... ..........  0% 54.5M 9s
   300K .......... .......... .......... .......

In [15]:
new_tensorboard_s3_prefix = f"{tensorboard_s3_prefix}{custom_suffix}"
new_tensorboard_s3_prefix

's3://udacity-nd0013-proj1-bucket/logs/mobilenet'
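
The f-string concatenation above only yields a valid path because `tensorboard_s3_prefix` ends with a slash. A slightly more defensive variant, sketched here with a hypothetical helper `experiment_prefix`, produces the same result whether or not the trailing slash is present.

```python
def experiment_prefix(base_prefix, suffix):
    """Join an S3 prefix and an experiment name with exactly one slash."""
    if not base_prefix.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {base_prefix!r}")
    return f"{base_prefix.rstrip('/')}/{suffix.strip('/')}"


# Same bucket as in the cells above, with and without the trailing slash.
print(experiment_prefix("s3://udacity-nd0013-proj1-bucket/logs/", "mobilenet"))
print(experiment_prefix("s3://udacity-nd0013-proj1-bucket/logs", "resnet50"))
```

Giving every experiment its own prefix keeps the TensorBoard runs separate when they are compared later.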

In [14]:
custom_suffix = "resnet50"
new_tensorboard_s3_prefix = f"{tensorboard_s3_prefix}{custom_suffix}"
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=new_tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)
estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir": "/opt/training",        
        "pipeline_config_path": "pipeline_ssdresnet50.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
    instance_type='ml.g5.xlarge',
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection3'
)

estimator.fit(inputs)

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: tf2-object-detection3-2026-01-21-17-26-20-442


2026-01-21 17:26:35 Starting - Starting the training job
2026-01-21 17:26:35 Pending - Training job waiting for capacity............
2026-01-21 17:28:11 Pending - Preparing the instances for training...
2026-01-21 17:28:59 Downloading - Downloading the training image............
2026-01-21 17:30:50 Training - Training image download completed. Training in progress.
2026-01-21 17:31:00,128 sagemaker-training-toolkit INFO     Provided path: /opt/ml/code  is empty, unzipping
2026-01-21 17:31:04,179 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-21 17:31:04,218 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-21 17:31:04,257 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-21 17:31:04,272 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_par

## Improve on the initial model

Most likely, this initial experiment did not yield optimal results. However, you can make multiple changes to the `pipeline.config` file to improve the model. One obvious change is to improve the data augmentation strategy. The [`preprocessor.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/preprocessor.proto) file lists the data augmentation methods available in the TF Object Detection API. Justify your choice of augmentations in the write-up.

Keep in mind that the following are also available:
* experiment with the optimizer: type of optimizer, learning rate, scheduler, etc.
* experiment with the architecture. The TF Object Detection API model zoo offers many architectures. Keep in mind that the `pipeline.config` file is specific to each architecture, so you will have to edit it.
* visualize results on the test frames using the `2_deploy_model` notebook available in this repository.

In the cell below, write down all the different approaches you have experimented with, why you have chosen them and what you would have done if you had more time and resources. Justify your choices using the tensorboard visualizations (take screenshots and insert them in your write-up), the metrics on the evaluation set and the generated animation you have created with [this tool](../2_run_inference/2_deploy_model.ipynb).

# 1. Overview and Model Selection
For this project, I evaluated SSD MobileNet v2 FPN-lite 640x640 and SSD ResNet50 v1 FPN 640x640, and selected MobileNet v2 as the better fit for this problem. The decision was driven by the constraints of automotive hardware: in a vehicle, electrical energy is a limited resource. A model with fewer parameters (~3M vs ~25M for ResNet50) needs less computation, which means higher energy efficiency and less heat generation, both critical for integrated vehicle hardware. The lighter model also allows faster inference and, in these experiments, reached the higher mAP of the two (see the results table in section 3).

# 2. Evolution of the Training Strategy

My experimental process involved several critical adjustments based on initial observations:

* Learning Rate & Optimizer: In the initial run with EfficientDet, I used the default settings from the provided `pipeline.config`. The warmup period was set to 2,000 steps, which was exactly the total training duration. As a result, the learning rate only ramped up and the model never trained at the peak rate. To correct this, I switched to a manual step learning rate schedule (`manual_step_learning_rate`) and changed the optimizer to Adam. Adam provided faster initial convergence, which mattered given the limited compute budget.

* Data Augmentation Refinement: Initially, I experimented with a broad suite of augmentations, including 90° rotations and random_black_patches.

    - However, I decided to remove these specific options. The reasoning was that 90° rotations made the task unnecessarily difficult for the model to learn within the short training window, especially if the target objects have a natural upright orientation.

    - I also removed the black patches because they can hide too much information during the early stages of learning. For a model to learn to detect an object despite partial occlusion, it first needs to master the basic features of that object. Focusing on simpler augmentations like horizontal flips and cropping allowed the model to converge more effectively within my budget.
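
The warmup issue described above can be reproduced with a few lines of arithmetic. The sketch below implements a simplified linear-warmup plus cosine-decay schedule. Only the 2,000-step warmup and the 2,000-step run come from the write-up; the learning rates and `total_steps` are made-up values, and the function is an illustration rather than the API's own implementation.

```python
import math


def cosine_lr_with_warmup(step, base_lr, warmup_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from warmup_lr up to base_lr.
        return warmup_lr + (base_lr - warmup_lr) * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, except warmup_steps and the 2,000-step run length.
schedule = dict(base_lr=0.08, warmup_lr=0.001, warmup_steps=2000, total_steps=300000)
lrs = [cosine_lr_with_warmup(s, **schedule) for s in range(0, 2000, 500)]
print(lrs)  # strictly increasing: the run ends before the peak is reached
```

With a run this short, either the warmup has to shrink to a small fraction of the step budget or the schedule has to be replaced, which is the route taken here.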

# 3. Results and Metrics (mAP)
The models showed a clear difference in learning efficiency:

| Metric              | SSD MobileNet v2 (Step 10k) | SSD ResNet50 (Step 2k) |
|---------------------|-----------------------------|------------------------|
| mAP (all)           | 0.126                       | 0.039                  |
| mAP @ 0.50 IoU      | 0.263                       | 0.084                  |
| mAP (Large Objects) | 0.572                       | 0.207                  |
| Total Loss          | ~0.36                       | ~1.10                  |

<img src="../tensorboard/tensorboard_loss.png">
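
For readers less familiar with the metrics in the table: a detection counts as correct at "mAP @ 0.50 IoU" only when its box overlaps a ground-truth box with an intersection over union of at least 0.5, while COCO-style mAP averages over stricter thresholds as well. A minimal sketch of the IoU computation, using made-up boxes in the `(ymin, xmin, ymax, xmax)` order:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(ymax - ymin, 0.0) * max(xmax - xmin, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes in normalized coordinates.
ground_truth = (0.2, 0.2, 0.6, 0.6)
prediction = (0.3, 0.3, 0.7, 0.7)
score = iou(ground_truth, prediction)
print(score >= 0.5)  # False: this prediction would not match at IoU 0.50
```

A box shifted by a quarter of its size already falls below the threshold, which is one reason small objects score so much lower than large ones.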

# 4. Discussion of Losses and Behavior
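Validation vs. Training Loss: The training loss for MobileNet reached ~0.36, while the validation loss was higher at ~0.79. Some gap is expected, because the validation loss is measured on frames the model was never optimized on. Augmentation does not explain it: augmented inputs tend to raise the training loss, not lower it. A validation loss about twice the training loss is therefore better read as an early sign of overfitting.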

Loss Trends and Early Stopping: The total loss reached a minimum and then began to rise again. When this happens on the validation set while the training loss keeps falling, it indicates overfitting: the model starts to memorize the training data rather than generalize. To preserve generalization, early stopping should be applied at the point where the validation loss is lowest.

Expected Behavior: ResNet50 was clearly under-trained at 2,000 steps (its regularization loss was still high at ~0.51). MobileNet's lighter architecture, given 10,000 steps, made more progress on localization and classification. Because the two runs differ in step count, the comparison is not like-for-like.

# 5. Future Improvements
To further improve performance:

* Anchor Box Optimization: The low mAP for small objects (0.053) suggests that the default anchor scales in the configuration do not match the size of the smaller targets in the dataset, so tuning them is a natural next step.

* Resolution: Increasing the input resolution could improve recall for small objects, provided the energy budget of the target hardware allows for the increased compute.

* Early Stopping Implementation: Automating the training termination when the validation loss starts to diverge to ensure the best possible generalization.
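
The early-stopping rule can be sketched in plain Python, independent of the Object Detection API. The function below scans a sequence of validation losses and reports where to stop and which checkpoint to keep. The loss values and the `patience` setting are made up for illustration.

```python
def best_checkpoint(val_losses, patience=2):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive evaluations; best_index marks the checkpoint to keep.
    """
    best_index = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return i, best_index
    return len(val_losses) - 1, best_index


# Hypothetical validation losses, one per evaluation round.
losses = [1.20, 0.95, 0.82, 0.79, 0.83, 0.88, 0.91]
stop_at, keep = best_checkpoint(losses, patience=2)
print(stop_at, keep)
```

A small patience avoids stopping on a single noisy evaluation while still ending the job soon after the minimum.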

# Conclusion
SSD MobileNet v2 FPN-lite is the suggested model. It proved to be far more efficient to train and offers the energy-conscious performance necessary for automotive applications.