# TensorFlow Object Detection API and AWS SageMaker

## Dataset

We are using the [Waymo Open Dataset](https://waymo.com/open/) for this project. The dataset has already been exported in the TFRecord format, following the layout described [here](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#create-tensorflow-records). The data is stored on [AWS S3](https://aws.amazon.com/s3/), the AWS object storage service. The images are saved with a resolution of 640x640.

In [1]:
%%capture
%pip install tensorflow_io sagemaker -U

In [1]:
import os
import sagemaker
from sagemaker.estimator import Estimator
from framework import CustomFramework

Save the IAM role in a variable called `role`. It will be needed when launching the training job.

In [2]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::157430746956:role/service-role/AmazonSageMaker-ExecutionRole-20230329T033942


In [3]:
# The train and val paths below are public S3 buckets created by Udacity for this project
inputs = {'train': 's3://cd2688-object-detection-tf2/train/', 
        'val': 's3://cd2688-object-detection-tf2/val/'} 

# Insert path of a folder in your personal S3 bucket to store tensorboard logs.
tensorboard_s3_prefix = 's3://object-detection-in-urban-env/logs/'
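
Before launching a job, it can help to check that every channel in `inputs` points at a well-formed S3 URI. The sketch below is only an illustration: `split_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and it only inspects the string without contacting S3.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix/' into (bucket, key_prefix).

    Hypothetical helper for illustration; it does not check that the
    bucket or prefix actually exists.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Same channel layout as the `inputs` dictionary used in this notebook.
example_inputs = {
    "train": "s3://cd2688-object-detection-tf2/train/",
    "val": "s3://cd2688-object-detection-tf2/val/",
}

for channel, uri in example_inputs.items():
    bucket, prefix = split_s3_uri(uri)
    print(f"{channel}: bucket={bucket} prefix={prefix}")
```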

## Container

To train the model, you first need to build a [Docker](https://www.docker.com/) container with all the dependencies required by the TF Object Detection API. The code below does the following:
* clone the TensorFlow models repository
* get the exporter and training scripts from the repository
* build the Docker image and push it
* print the container name

In [5]:
%%bash

# clone the repo and get the scripts
# git clone https://github.com/tensorflow/models.git docker/models

# get model_main and exporter_main files from TF2 Object Detection GitHub repository
# cp docker/models/research/object_detection/exporter_main_v2.py source_dir 
# cp docker/models/research/object_detection/model_main_tf2.py source_dir

Cloning into 'docker/models'...


In [6]:
# Build and push the Docker image. This code can be commented out after it has been run once.
# This will take around 10 minutes.
# image_name = 'tf2-object-detection'
# !sh ./docker/build_and_push.sh $image_name

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Building image with name tf2-object-detection
Sending build context to Docker daemon  723.5MB
Step 1/17 : FROM tensorflow/tensorflow:2.9.0-gpu
2.9.0-gpu: Pulling from tensorflow/tensorflow

To verify that the image was correctly pushed to the [Elastic Container Registry](https://aws.amazon.com/ecr/), you can look it up in the AWS console. The example below shows three different images pushed to ECR; in your account you should only see one, called `tf2-object-detection`.
![ECR Example](../data/example_ecr.png)


In [6]:
# Read the full image name written to docker/ecr_image_fullname.txt and display it
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    container = f.readline().strip()

print(container)

157430746956.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230329035020
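
The printed container name follows the usual ECR layout of account, region, repository and tag. As a minimal sketch, assuming that layout holds for your image, the hypothetical helper `parse_ecr_image` below splits the name into its parts, which is handy for confirming the region and tag before training.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image(uri):
    """Return the account, region, repository and tag of an ECR image name."""
    match = ECR_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Unexpected ECR image name: {uri!r}")
    return match.groupdict()


parts = parse_ecr_image(
    "157430746956.dkr.ecr.us-east-1.amazonaws.com/tf2-object-detection:20230329035020"
)
print(parts["repository"], parts["tag"])
```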


## Pre-trained model from model zoo

As is common practice, we are not training from scratch: we will use a pretrained model from the TF Object Detection model zoo. You can find pretrained checkpoints [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). Because your time is limited for this project, we recommend experimenting only with the following models:
* SSD MobileNet V2 FPNLite 640x640
* SSD ResNet50 V1 FPN 640x640 (RetinaNet50)
* Faster R-CNN ResNet50 V1 640x640
* EfficientDet D1 640x640
* Faster R-CNN ResNet152 V1 640x640

In the code below, the SSD MobileNet V2 FPNLite checkpoint is downloaded and extracted; the equivalent commands for EfficientDet D1 are kept as comments. Adjust the URL and archive paths if you experiment with other architectures.

In [7]:
# def download_model(name,url):
#     zip_file = f"/tmp/{name}.tar.gz"
#     Checkpoint = f"{name}/checkpoint"
#     %%bash
#     !mkdir -p /tmp/checkpoint
#     !mkdir -p source_dir/checkpoint
#     !wget -O $zip_file $url
#     !tar -zxvf $zip_file --strip-components 2 --directory source_dir/checkpoint $Checkpoint
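
The commented-out helper above mixes cell magics with Python and would not run as written. A pure-Python alternative is sketched below using only the standard library. The function names `download_archive` and `extract_checkpoint` are hypothetical, and the sketch assumes the archive stores the checkpoint files directly under `<model_name>/checkpoint/`, as the `tar` output in this notebook suggests.

```python
import os
import shutil
import tarfile
import urllib.request


def download_archive(url, archive_path):
    """Download a model zoo archive to a local path (needs network access)."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_checkpoint(archive_path, model_name, dest_dir):
    """Copy the files under `<model_name>/checkpoint/` into dest_dir.

    Mirrors `tar --strip-components 2`: the two leading path components
    are dropped so the files land directly in dest_dir.
    """
    prefix = f"{model_name}/checkpoint/"
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile() or not member.name.startswith(prefix):
                continue
            relative = member.name[len(prefix):]
            # Only flat file names are expected; skip anything nested.
            if not relative or os.path.basename(relative) != relative:
                continue
            source = archive.extractfile(member)
            with open(os.path.join(dest_dir, relative), "wb") as target:
                shutil.copyfileobj(source, target)
            extracted.append(relative)
    return sorted(extracted)


# Example usage (not executed here because it downloads about 20 MB):
# name = "ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8"
# url = f"http://download.tensorflow.org/models/object_detection/tf2/20200711/{name}.tar.gz"
# download_archive(url, "/tmp/mobilenet.tar.gz")
# extract_checkpoint("/tmp/mobilenet.tar.gz", name, "source_dir/checkpoint")
```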

In [4]:
%%bash
mkdir -p /tmp/checkpoint
mkdir -p source_dir/checkpoint
wget -O /tmp/mobilenet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
tar -zxvf /tmp/mobilenet.tar.gz --strip-components 2 --directory source_dir/checkpoint ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint

# Alternative: EfficientDet D1 (uncomment to use instead of MobileNet)
# mkdir -p /tmp/checkpoint
# mkdir -p source_dir/checkpoint
# wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
# tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint

ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.data-00000-of-00001
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/checkpoint
ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8/checkpoint/ckpt-0.index


--2023-04-05 03:24:40--  http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.253.63.128, 2607:f8b0:4004:c19::80
Connecting to download.tensorflow.org (download.tensorflow.org)|172.253.63.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20518283 (20M) [application/x-tar]
Saving to: ‘/tmp/mobilenet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 15.5M 1s
    50K .......... .......... .......... .......... ..........  0% 29.4M 1s
   100K .......... .......... .......... .......... ..........  0% 27.8M 1s
   150K .......... .......... .......... .......... ..........  0% 38.6M 1s
   200K .......... .......... .......... .......... ..........  1% 97.9M 1s
   250K .......... .......... .......... .......... ..........  1% 94.2M 1s
   300K .......... .......... ..........

In [11]:
!ls source_dir/checkpoint

checkpoint  ckpt-0.data-00000-of-00001	ckpt-0.index


In [8]:
# %%bash
# mkdir /tmp/checkpoint
# mkdir source_dir/checkpoint
# wget -O /tmp/efficientdet.tar.gz http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
# tar -zxvf /tmp/efficientdet.tar.gz --strip-components 2 --directory source_dir/checkpoint efficientdet_d1_coco17_tpu-32/checkpoint

efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.data-00000-of-00001
efficientdet_d1_coco17_tpu-32/checkpoint/checkpoint
efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0.index


--2023-03-29 04:12:27--  http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.253.62.128, 2607:f8b0:4004:c08::80
Connecting to download.tensorflow.org (download.tensorflow.org)|172.253.62.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51839363 (49M) [application/x-tar]
Saving to: ‘/tmp/efficientdet.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 11.7M 4s
    50K .......... .......... .......... .......... ..........  0% 21.4M 3s
   100K .......... .......... .......... .......... ..........  0% 22.6M 3s
   150K .......... .......... .......... .......... ..........  0% 74.8M 2s
   200K .......... .......... .......... .......... ..........  0% 61.3M 2s
   250K .......... .......... .......... .......... ..........  0% 55.5M 2s
   300K .......... .......... .......... .......... ..........  0% 70.9M 2s
   350K ..

## Edit pipeline.config file

The [`pipeline.config`](source_dir/pipeline.config) in the `source_dir` folder should be updated when you experiment with different models. The different config files are available [here](https://github.com/tensorflow/models/tree/master/research/object_detection/configs/tf2).

>Note: The provided `pipeline.config` file works well with the `EfficientDet` model. You will need to modify it when working with other models.
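
Typical edits when switching models include `num_classes`, `batch_size`, `fine_tune_checkpoint` and `fine_tune_checkpoint_type`. The sketch below shows one way to patch such fields with plain text substitution. The helper `set_config_field` is hypothetical, and the sample text and values (three classes, a batch size of 8, the `checkpoint/ckpt-0` path) are assumptions for illustration, not the contents of the provided file.

```python
import re


def set_config_field(config_text, field, value):
    """Replace every `field: <old>` line with `field: <value>`.

    `value` is inserted verbatim, so string values must include their
    double quotes. Raises KeyError if the field is not present.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(field)}[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise KeyError(f"Field not found in config: {field}")
    return new_text


# Assumed excerpt of a pipeline.config file, for illustration only.
sample = """model {
  ssd {
    num_classes: 90
  }
}
train_config {
  batch_size: 128
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
  fine_tune_checkpoint_type: "classification"
}
"""

updated = set_config_field(sample, "num_classes", "3")
updated = set_config_field(updated, "batch_size", "8")
updated = set_config_field(updated, "fine_tune_checkpoint", '"checkpoint/ckpt-0"')
updated = set_config_field(updated, "fine_tune_checkpoint_type", '"detection"')
print(updated)
```

For anything beyond a handful of scalar fields, parsing the file with the protobuf definitions shipped with the Object Detection API is likely more robust than text substitution.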

## Launch Training Job

Now that we have a dataset, a Docker image and pretrained model weights, we can launch the training job. To do so, we create a [SageMaker Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) estimator, where we specify the container name, the name of the config file, the number of training steps and so on.

The `run_training.sh` script does the following:
* train the model for `num_train_steps` 
* evaluate over the val dataset
* export the model
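
The contents of `run_training.sh` are not shown in this notebook, so the following is only a sketch of the three commands such a script might run, built from the hyperparameters passed to the estimator. The flags are those of `model_main_tf2.py` and `exporter_main_v2.py` in the TF Object Detection API; the function `build_commands` and the export directory are hypothetical.

```python
import shlex


def build_commands(hyperparameters, export_dir="/tmp/exported"):
    """Return the train, evaluate and export commands as argument lists.

    `export_dir` is a placeholder; the real script may export elsewhere.
    """
    model_dir = hyperparameters["model_dir"]
    config = hyperparameters["pipeline_config_path"]
    common = [f"--pipeline_config_path={config}", f"--model_dir={model_dir}"]

    train = [
        "python", "model_main_tf2.py", *common,
        f"--num_train_steps={hyperparameters['num_train_steps']}",
    ]
    # Passing --checkpoint_dir switches model_main_tf2.py to evaluation mode.
    evaluate = [
        "python", "model_main_tf2.py", *common,
        f"--checkpoint_dir={model_dir}",
        f"--sample_1_of_n_eval_examples={hyperparameters['sample_1_of_n_eval_examples']}",
    ]
    export = [
        "python", "exporter_main_v2.py",
        "--input_type=image_tensor",
        f"--pipeline_config_path={config}",
        f"--trained_checkpoint_dir={model_dir}",
        f"--output_directory={export_dir}",
    ]
    return [train, evaluate, export]


commands = build_commands({
    "model_dir": "/opt/training",
    "pipeline_config_path": "pipeline.config",
    "num_train_steps": "2000",
    "sample_1_of_n_eval_examples": "1",
})
for command in commands:
    print(shlex.join(command))
```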

Several metrics are displayed during the evaluation phase, including the mean average precision (mAP). Use these metrics to quantify your model's performance and to compare results across experiments.
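
Mean average precision is built on the intersection over union (IoU) between predicted and ground-truth boxes: a detection counts as correct when its IoU with a matching ground-truth box exceeds a threshold. Assuming COCO-style metrics are configured, the reported mAP averages over IoU thresholds from 0.50 to 0.95. A minimal sketch of the IoU computation, with boxes given as `(ymin, xmin, ymax, xmax)`:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])

    # Clamp at zero so that disjoint boxes yield no intersection.
    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```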

You can also monitor the training progress by navigating to **Training -> Training Jobs** from the Amazon SageMaker dashboard in the Web UI.

In [10]:
tensorboard_output_config = sagemaker.debugger.TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_prefix,
    container_local_output_path='/opt/training/'
)

estimator = CustomFramework(
    role=role,
    image_uri=container,
    entry_point='run_training.sh',
    source_dir='source_dir/',
    hyperparameters={
        "model_dir":"/opt/training",        
        "pipeline_config_path": "pipeline.config",
        "num_train_steps": "2000",    
        "sample_1_of_n_eval_examples": "1"
    },
    instance_count=1,
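    # Note: ml.trn1 instances use AWS Trainium accelerators, not NVIDIA GPUs;
    # the job log for this run reports "No GPUs detected".
    instance_type='ml.trn1.2xlarge',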
    tensorboard_output_config=tensorboard_output_config,
    disable_profiler=True,
    base_job_name='tf2-object-detection'
)

estimator.fit(inputs)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating training-job with name: tf2-object-detection-2023-04-05-04-02-25-230


2023-04-05 04:02:26 Starting - Starting the training job...
2023-04-05 04:02:41 Starting - Preparing the instances for training...
2023-04-05 04:03:24 Downloading - Downloading input data...
2023-04-05 04:03:49 Training - Downloading the training image...............
2023-04-05 04:06:00 Training - Training image download completed. Training in progress..
2023-04-05 04:06:29,118 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-04-05 04:06:29,120 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-04-05 04:06:29,132 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-04-05 04:06:29,134 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-04-05 04:06:29,145 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-04-05 04:06:29,147 sagemaker-training-tool

You should be able to see your training job in the AWS console, as shown below:
![Training jobs example](../data/example_trainings.png)
