In [1]:
# Parameters
kms_key = "arn:aws:kms:us-west-2:521695447989:key/6e9984db-50cf-4c7e-926c-877ec47a8b25"


# Amazon SageMaker Object Detection Incremental Training

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Data Preparation](#Data-Preparation)
  1. [Download data](#Download-data)
  2. [Convert data into RecordIO](#Convert-data-into-RecordIO)
  3. [Upload data to S3](#Upload-data-to-S3)
4. [Initial Training](#Initial-Training)
5. [Incremental Training](#Incremental-Training)
6. [Hosting](#Hosting)
7. [Inference](#Inference)

## Introduction

In this example, we show how to train an object detector by reusing a model you previously trained in Amazon SageMaker. Reusing a model saves training time when you update the model with new data or want to improve model quality on the same data. In the first half of this notebook ([Initial Training](#Initial-Training)), we follow the [training with RecordIO format example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/object_detection_pascalvoc_coco/object_detection_recordio_format.ipynb) to train an object detection model on the [Pascal VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/). In the second half, we show how to reuse the trained model and improve its quality without repeating the entire training process.

## Setup

To train the Object Detection algorithm on Amazon SageMaker, we need to set up and authenticate the use of AWS services. To begin with, we need an AWS account role with SageMaker access. This role, which gives SageMaker access to your data in S3, is obtained automatically from the role used to start the notebook.

In [2]:
%%time
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()

arn:aws:iam::521695447989:role/ProdBuildSystemStack-ReleaseBuildRoleFB326D49-QK8LUA2UI1IC
CPU times: user 1.13 s, sys: 326 ms, total: 1.46 s
Wall time: 1.47 s


We also need the S3 bucket that you want to use for training and for storing the trained model artifacts. In this notebook, we use the default bucket of the SageMaker session. You can also substitute a custom bucket that already exists if you prefer to keep the naming clean.

In [3]:
bucket = sess.default_bucket()  # Use the default bucket. You can also customize bucket name.
prefix = "DEMO-ObjectDetection"

Lastly, we need the Amazon SageMaker Object Detection Docker image, which is static and need not be changed.

In [4]:
from sagemaker.amazon.amazon_estimator import get_image_uri

training_image = get_image_uri(sess.boto_region_name, "object-detection", repo_version="latest")
print(training_image)

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


433757028032.dkr.ecr.us-west-2.amazonaws.com/object-detection:1


## Data Preparation
[Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/) was a popular computer vision challenge that released annual challenge datasets for object detection from 2005 to 2012. In this notebook, we use the datasets from 2007 and 2012, named VOC07 and VOC12 respectively. Together they contain more than 20,000 images with about 50,000 annotated objects, grouped into 20 categories.

While using the Pascal VOC dataset, please be aware of the database usage rights:
"The VOC data includes images obtained from the "flickr" website. Use of these images must respect the corresponding terms of use: 
* "flickr" terms of use (https://www.flickr.com/help/terms)"

### Download data
Let us download the Pascal VOC datasets from 2007 and 2012.

In [5]:
%%time

# Download the dataset
!wget -P /tmp http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
!wget -P /tmp http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
!wget -P /tmp http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
# Extract the data.
!tar -xf /tmp/VOCtrainval_11-May-2012.tar && rm /tmp/VOCtrainval_11-May-2012.tar
!tar -xf /tmp/VOCtrainval_06-Nov-2007.tar && rm /tmp/VOCtrainval_06-Nov-2007.tar
!tar -xf /tmp/VOCtest_06-Nov-2007.tar && rm /tmp/VOCtest_06-Nov-2007.tar

--2021-06-08 00:15:27--  http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
Resolving host.robots.ox.ac.uk (host.robots.ox.ac.uk)... 129.67.94.152
Connecting to host.robots.ox.ac.uk (host.robots.ox.ac.uk)|129.67.94.152|:80... failed: No route to host.

--2021-06-08 00:15:30--  http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
Resolving host.robots.ox.ac.uk (host.robots.ox.ac.uk)... 129.67.94.152
Connecting to host.robots.ox.ac.uk (host.robots.ox.ac.uk)|129.67.94.152|:80... failed: No route to host.

--2021-06-08 00:15:33--  http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
Resolving host.robots.ox.ac.uk (host.robots.ox.ac.uk)... 129.67.94.152
Connecting to host.robots.ox.ac.uk (host.robots.ox.ac.uk)|129.67.94.152|:80... failed: No route to host.


tar: /tmp/VOCtrainval_11-May-2012.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now


tar: /tmp/VOCtrainval_06-Nov-2007.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now


tar: /tmp/VOCtest_06-Nov-2007.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now


CPU times: user 148 ms, sys: 44.9 ms, total: 192 ms
Wall time: 10.4 s


### Convert data into RecordIO
[RecordIO](https://mxnet.incubator.apache.org/architecture/note_data_loading.html) is a highly efficient binary data format from [MXNet](https://mxnet.incubator.apache.org/) that makes it simple to prepare the dataset and transfer it to the instance that will run the training job. To generate a RecordIO file, we use the tools from MXNet. The provided tools first generate a list file and then use the [im2rec tool](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) to create the [RecordIO](https://mxnet.incubator.apache.org/architecture/note_data_loading.html) file. For more details on how to generate a RecordIO file for an object detection task, see the [MXNet example](https://github.com/apache/incubator-mxnet/tree/master/example/ssd).

We will combine the training and validation sets from both 2007 and 2012 as the training data set, and use the test set from 2007 as our validation set.

In [6]:
!python tools/prepare_dataset.py --dataset pascal --year 2007,2012 --set trainval --target VOCdevkit/train.lst
!rm -rf VOCdevkit/VOC2012
!python tools/prepare_dataset.py --dataset pascal --year 2007 --set test --target VOCdevkit/val.lst --no-shuffle
!rm -rf VOCdevkit/VOC2007

Traceback (most recent call last):
  File "tools/prepare_dataset.py", line 31, in <module>
    from pascal_voc import PascalVoc
  File "/opt/ml/processing/input/tools/pascal_voc.py", line 23, in <module>
    import cv2
ModuleNotFoundError: No module named 'cv2'


Traceback (most recent call last):
  File "tools/prepare_dataset.py", line 31, in <module>
    from pascal_voc import PascalVoc
  File "/opt/ml/processing/input/tools/pascal_voc.py", line 23, in <module>
    import cv2
ModuleNotFoundError: No module named 'cv2'


Along with this notebook, we have provided tools that directly generate the RecordIO files so that you do not need to do additional work. These tools work with the Pascal dataset's .lst format, which is also common among other datasets. If your data or its annotations are stored in a different format than the Pascal VOC dataset, you can still create the RecordIO file by first generating the .lst file and then using the im2rec tool provided by MXNet. To make things clear, we explain the definition of a .lst file so that you can prepare it in your own way. The following example shows the first three lines of the .lst file generated for the Pascal VOC dataset.

In [7]:
!head -n 3 VOCdevkit/train.lst > example.lst
# Read the extracted sample; the context manager closes the file afterwards.
with open("example.lst", "r") as f:
    lst_content = f.read()
print(lst_content)

head: cannot open 'VOCdevkit/train.lst' for reading: No such file or directory





As can be seen, each line in the .lst file holds the annotations for one image. A .lst file is a **tab**-delimited file with multiple columns. The first column specifies a unique image index. The second column specifies the header size of the current row. In the above example .lst file, the value 2 in the second column means that the second and third columns are header information, which is not treated as label or bounding box information for the image in that row.

The third column specifies the label width of a single object. In the first row of the above sample .lst file, the value 5 in the third column means that each object within an image is described by 5 numbers: the class index followed by the bounding box coordinates. If there are multiple objects within one image, all the label information should be listed in one line. The annotation information for each object is represented as ``[class_index, xmin, ymin, xmax, ymax]``.

The classes should be labeled with successive numbers starting from 0. The bounding box coordinates are ratios of the top-left (xmin, ymin) and bottom-right (xmax, ymax) corner positions to the overall image size. Note that the top-left corner of the entire image is the origin (0, 0). The last column specifies the relative path of the image file.
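
To make the row layout concrete, here is a minimal sketch that builds one `.lst` row from bounding boxes given in pixels. The helper name `make_lst_line`, the image path, and the sample box are hypothetical and are not part of the provided tools. The sketch assumes a header size of 2 and a label width of 5, as described above.

```python
def make_lst_line(index, image_path, width, height, objects):
    """Build one tab-delimited .lst row (hypothetical helper).

    objects: list of (class_index, xmin, ymin, xmax, ymax) in pixels.
    """
    header_size, label_width = 2, 5
    fields = [str(index), str(header_size), str(label_width)]
    for class_index, xmin, ymin, xmax, ymax in objects:
        # Box corners are stored as ratios of the image width and height.
        fields += [
            f"{float(class_index):.4f}",
            f"{xmin / width:.4f}",
            f"{ymin / height:.4f}",
            f"{xmax / width:.4f}",
            f"{ymax / height:.4f}",
        ]
    # The relative image path is always the last column.
    fields.append(image_path)
    return "\t".join(fields)


# Hypothetical 500x400 image with a single object of class 11.
line = make_lst_line(0, "images/example.jpg", 500, 400, [(11, 50, 100, 250, 300)])
print(line)
```

An image with several objects simply appends further groups of five values before the path, so all annotations for that image stay on one line.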

After generating the .lst file, the RecordIO can be created by running the following command:

In [8]:
# python /tools/im2rec.py --pack-label --num-thread 4 your_lst_file_name /your_image_folder

### Upload data to S3
Upload the data to the S3 bucket. We do this in multiple channels. Channels are simply directories in the bucket that differentiate between training and validation data. Let us simply call these directories `train` and `validation`.

In [9]:
%%time

# Upload the RecordIO files to train and validation channels
train_channel = prefix + "/train"
validation_channel = prefix + "/validation"

s3_train_data = "s3://{}/{}".format(bucket, train_channel)
s3_validation_data = "s3://{}/{}".format(bucket, validation_channel)

sess.upload_data(path="VOCdevkit/train.rec", bucket=bucket, key_prefix=train_channel)
sess.upload_data(path="VOCdevkit/val.rec", bucket=bucket, key_prefix=validation_channel)

FileNotFoundError: [Errno 2] No such file or directory: 'VOCdevkit/train.rec'
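
The upload above fails because the earlier steps did not produce the RecordIO files. A small pre-flight check can surface this before calling `upload_data`. This is only a sketch using the standard library; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# The two RecordIO files this notebook expects from the conversion step.
required = ["VOCdevkit/train.rec", "VOCdevkit/val.rec"]
missing = missing_files(required)
if missing:
    print("Missing RecordIO files, rerun the data preparation steps:", missing)
```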

Next, we need to set up an output location in S3 where the model artifacts will be stored. These artifacts are the output of the algorithm's training job.

In [10]:
s3_output_location = "s3://{}/{}/output".format(bucket, prefix)

## Initial Training
Now that all the required setup is done, we are ready to train our object detector. To begin, let us create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job.

In [11]:
od_model = sagemaker.estimator.Estimator(
    training_image,
    role,
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    train_volume_size=50,
    train_max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    sagemaker_session=sess,
)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


train_max_run has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
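
The warnings above come from argument renames in version 2 of the SageMaker Python SDK. As a sketch, the mapping below lists the v2 names for the estimator arguments used in this notebook. The helper `to_v2_kwargs` is hypothetical: it only renames dictionary keys and does not call the SDK.

```python
# v1 -> v2 argument names for sagemaker.estimator.Estimator.
V2_RENAMES = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_max_run": "max_run",
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(name, name): value for name, value in kwargs.items()}


v1_kwargs = {
    "train_instance_count": 1,
    "train_instance_type": "ml.p3.2xlarge",
    "train_volume_size": 50,
    "train_max_run": 360000,
    "input_mode": "File",
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(v2_kwargs)
```

The other two renames reported in this notebook follow the same migration: `get_image_uri` becomes `sagemaker.image_uris.retrieve`, and `sagemaker.session.s3_input` becomes `sagemaker.inputs.TrainingInput`.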


At its core, the object detection algorithm is the [Single-Shot Multi-Box detection algorithm (SSD)](https://arxiv.org/abs/1512.02325). This algorithm uses a `base_network`, which is typically a [VGG](https://arxiv.org/abs/1409.1556) or a [ResNet](https://arxiv.org/abs/1512.03385). The Amazon SageMaker object detection algorithm currently supports VGG-16 and ResNet-50. It also offers many hyperparameters that help configure the training job. The next step in our training is to set up these hyperparameters and data channels for training the model. Consider the following example definition of hyperparameters. See the SageMaker Object Detection [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html) for more details on the hyperparameters.

One of the hyperparameters here, for instance, is `epochs`. It defines how many passes over the dataset we make and largely determines the training time of the algorithm. In this example, we train the model for `5` epochs to generate a basic model for the PASCAL VOC dataset.

In [12]:
od_model.set_hyperparameters(
    base_network="resnet-50",
    use_pretrained_model=1,
    num_classes=20,
    mini_batch_size=16,
    epochs=5,
    learning_rate=0.001,
    lr_scheduler_step="3,6",
    lr_scheduler_factor=0.1,
    optimizer="rmsprop",
    momentum=0.9,
    weight_decay=0.0005,
    overlap_threshold=0.5,
    nms_threshold=0.45,
    image_shape=300,
    label_width=350,
    num_training_samples=16551,
)
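
As a quick sanity check on these values, the number of batches per epoch follows from `num_training_samples` and `mini_batch_size`. This is a minimal sketch, assuming the final partial batch counts as a batch.

```python
import math

# Values copied from the hyperparameters above.
num_training_samples = 16551
mini_batch_size = 16
epochs = 5

# Round up so that the last, partially filled batch is counted.
batches_per_epoch = math.ceil(num_training_samples / mini_batch_size)
total_batches = batches_per_epoch * epochs
print(batches_per_epoch, total_batches)  # 1035 5175
```

This is in line with the training log of this run, which reports 1034 to 1035 batches at the end of each epoch.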

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [13]:
train_data = sagemaker.session.s3_input(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.session.s3_input(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


We have our `Estimator` object, we have set its hyperparameters, and we have our data channels linked with the algorithm. The only remaining thing to do is to train the algorithm, which the following command does. Training involves a few steps. First, the instances that we requested while creating the `Estimator` object are provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data downloading take time, depending on the size of the data, so it might be a few minutes before we start getting logs for our training job. After every epoch (one full pass over the dataset), the logs also print the mean average precision (mAP) on the validation data, along with the training losses. This metric is a proxy for the quality of the algorithm.

Once the job has finished, a "Training job completed" message will be printed. The trained model can be found in the S3 bucket that was set up as `output_path` in the estimator.
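
To follow the validation mAP across epochs, the relevant log lines can be pulled out with a regular expression. This is a sketch under the assumption that the `#quality_metric` line format matches the one printed in this run; the function name `extract_validation_map` is hypothetical, and the two sample lines are copied from this notebook's training log.

```python
import re

# Matches e.g. "epoch=0, validation mAP <score>=(0.199...)".
MAP_PATTERN = re.compile(r"epoch=(\d+), validation mAP <score>=\(([0-9.eE+-]+)\)")


def extract_validation_map(log_text):
    """Return {epoch: mAP} parsed from training log text."""
    return {int(epoch): float(score) for epoch, score in MAP_PATTERN.findall(log_text)}


sample_log = """
#quality_metric: host=algo-1, epoch=0, validation mAP <score>=(0.19905460940354514)
#quality_metric: host=algo-1, epoch=1, validation mAP <score>=(0.28848972282396856)
"""
scores = extract_validation_map(sample_log)
print(scores)
```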

In [14]:
od_model.fit(inputs=data_channels, logs=True)

2021-06-08 00:15:40 Starting - Starting the training job...
2021-06-08 00:16:06 Starting - Launching requested ML instances
ProfilerReport-1623111339: InProgress
......
2021-06-08 00:17:07 Starting - Preparing the instances for training.........
2021-06-08 00:18:34 Downloading - Downloading input data......
2021-06-08 00:19:27 Training - Downloading the training image...
2021-06-08 00:20:11 Training - Training image download completed. Training in progress..

Docker entrypoint called with argument(s): train
[06/08/2021 00:20:14 INFO 139763492013888] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/default-input.json: {'base_network': 'vgg-16', 'use_pretrained_model': '0', 'num_classes': '', 'mini_batch_size': '32', 'epochs': '30', 'learning_rate': '0.001', 'lr_scheduler_step': '', 'lr_scheduler_factor': '0.1', 'optimizer': 'sgd', 'momentum': '0.9', 'weight_decay': '0.0005', 'overlap_threshold': '0.5', 'nms_threshold': '0.45', 'num_training_samples': '', 'image_shape': '300', '_tuning_objective_metric': '', '_kvstore': 'device', 'kv_store': 'device', '_num_kv_servers': 'auto', 'label_width': '350', 'freeze_layer_pattern': '', 'nms_topk': '400', 'early_stopping': 'False', 'early_stopping_min_epochs': '10', 'early_stopping_patience': '5', 'early_stopping_tolerance': '0.0', '_begin_epoch': '0'}
[06/08/2021 00:20:14 INFO 139763492013888] Merging with provided configuration from /opt/ml/i

[00:20:22] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/io/iter_image_det_recordio.cc:340: ImageDetRecordIOParser: /opt/ml/input/data/train/train.rec, label padding width: 350
[00:20:23] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/io/iter_image_det_recordio.cc:283: ImageDetRecordIOParser: /opt/ml/input/data/validation/val.rec, use 7 threads for decoding..
[00:20:25] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/io/iter_image_det_recordio.cc:340: ImageDetRecordIOParser: /opt/ml/input/data/validation/val.rec, label padding width: 350
[06/08/2021 00:20:27 INFO 139763492013888] Number of GPUs being used: 1
[06/08/2021 00:20:27 INFO 139763492013888] Using [gpu(0)] as training context.
[06/08/2021 00:20:27 INFO 139763492013888] Number of GPUs being used: 1
[06/08/2021 00:20:27 INFO 139763492013888] Create Store: device
[06/08/2021 00:20:27 INFO 139763492013888] Using (gpu(0)) as training context.
[06/08/2021 00:20:27 INFO 139763492013888] Start training from pretrained model 1.
[00:20:27] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[00:20:27] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[06/08/2021 00:20:28 INFO 1397634920

[06/08/2021 00:20:39 INFO 139763492013888] Creating a new state instance.
#metrics {"StartTime": 1623111639.4975927, "EndTime": 1623111639.4976678, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "Meta": "init_train_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Reset Count": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Number of Records Since Last Reset": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Number of Batches Since Last Reset": {"sum": 0.0, "count": 1, "min": 0, "max": 0}}}
[00:20:39] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/operator/nn/./cudnn/./c

[06/08/2021 00:21:01 INFO 139763492013888] Epoch:    0, batches:    100, num_examples:   1600, 72.5 samples/sec, epoch time so far:  0:00:22.067521
[06/08/2021 00:21:19 INFO 139763492013888] Epoch:    0, batches:    200, num_examples:   3200, 80.4 samples/sec, epoch time so far:  0:00:39.817381
[06/08/2021 00:21:36 INFO 139763492013888] Epoch:    0, batches:    300, num_examples:   4800, 84.1 samples/sec, epoch time so far:  0:00:57.089647
[06/08/2021 00:21:53 INFO 139763492013888] Epoch:    0, batches:    400, num_examples:   6400, 86.1 samples/sec, epoch time so far:  0:01:14.296925
[06/08/2021 00:22:12 INFO 139763492013888] Epoch:    0, batches:    500, num_examples:   8000, 85.8 samples/sec, epoch time so far:  0:01:33.250690
[06/08/2021 00:22:30 INFO 139763492013888] Epoch:    0, batches:    600, num_examples:   9600, 86.8 samples/sec, epoch time so far:  0:01:50.537062
[06/08/2021 00:22:47 INFO 139763492013888] Epoch:    0, batches:    700, num_examples:   11200, 87.6 samples/sec, epoch time so far:  0:02:07.825017
[06/08/2021 00:23:04 INFO 139763492013888] Epoch:    0, batches:    800, num_examples:   12800, 88.4 samples/sec, epoch time so far:  0:02:24.791444
[06/08/2021 00:23:22 INFO 139763492013888] Epoch:    0, batches:    900, num_examples:   14400, 88.3 samples/sec, epoch time so far:  0:02:43.149788
[06/08/2021 00:23:39 INFO 139763492013888] Epoch:    0, batches:    1000, num_examples:   16000, 88.7 samples/sec, epoch time so far:  0:03:00.341104
[06/08/2021 00:23:45 INFO 139763492013888] #quality_metric: host=algo-1, epoch=0, batch=1035 train cross_entropy <loss>=(1.1592809385995073)
[06/08/2021 00:23:45 INFO 139763492013888] #quality_metric: host=algo-1, epoch=0, batch=1035 train smooth_l1 <loss>=(0.5428036072072824)
[06/08/2021 00:23:45 INFO 139763492013888] Round of batches complete
[06/08/2021 00:23:45 INFO 139763492013888] Updated the metrics
[06/08/2021 00:24:43 INFO 139763492013888] #quality_metric: host=algo-1, epoch=0, validation mAP <score>=(0.19905460940354514)
[06/08/2021 00:24:43 INFO 139763492013888] Updating the best model with validation-mAP=0.19905460940354514
[06/08/2021 00:24:44 INFO 139763492013888] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[06/08/2021 00:24:44 INFO 139763492013888] #progress_metric: host=algo-1, completed 20.0 % of epochs
#metrics {"StartTime": 1623111639.4980025, "EndTime": 1623111884.0310733, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "epoch": 0, "Meta": "training_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}

[06/08/2021 00:25:02 INFO 139763492013888] Epoch:    1, batches:    100, num_examples:   1600, 89.0 samples/sec, epoch time so far:  0:00:17.982581
[06/08/2021 00:25:19 INFO 139763492013888] Epoch:    1, batches:    200, num_examples:   3200, 89.1 samples/sec, epoch time so far:  0:00:35.902736
[06/08/2021 00:25:37 INFO 139763492013888] Epoch:    1, batches:    300, num_examples:   4800, 90.2 samples/sec, epoch time so far:  0:00:53.222602
[06/08/2021 00:25:54 INFO 139763492013888] Epoch:    1, batches:    400, num_examples:   6400, 90.2 samples/sec, epoch time so far:  0:01:10.919142
[06/08/2021 00:26:12 INFO 139763492013888] Epoch:    1, batches:    500, num_examples:   8000, 90.5 samples/sec, epoch time so far:  0:01:28.444747
[06/08/2021 00:26:29 INFO 139763492013888] Epoch:    1, batches:    600, num_examples:   9600, 90.7 samples/sec, epoch time so far:  0:01:45.897447
[06/08/2021 00:26:47 INFO 139763492013888] Epoch:    1, batches:    700, num_examples:   11200, 90.6 samples/sec, epoch time so far:  0:02:03.603351
[06/08/2021 00:27:05 INFO 139763492013888] Epoch:    1, batches:    800, num_examples:   12800, 90.7 samples/sec, epoch time so far:  0:02:21.101812
[06/08/2021 00:27:23 INFO 139763492013888] Epoch:    1, batches:    900, num_examples:   14400, 90.5 samples/sec, epoch time so far:  0:02:39.187824
[06/08/2021 00:27:40 INFO 139763492013888] Epoch:    1, batches:    1000, num_examples:   16000, 90.5 samples/sec, epoch time so far:  0:02:56.753671
[06/08/2021 00:27:45 INFO 139763492013888] #quality_metric: host=algo-1, epoch=1, batch=1034 train cross_entropy <loss>=(0.9489656753341774)
[06/08/2021 00:27:45 INFO 139763492013888] #quality_metric: host=algo-1, epoch=1, batch=1034 train smooth_l1 <loss>=(0.43904152421094456)
[06/08/2021 00:27:45 INFO 139763492013888] Round of batches complete
[06/08/2021 00:27:46 INFO 139763492013888] Updated the metrics
[06/08/2021 00:28:42 INFO 139763492013888] #quality_metric: host=algo-1, epoch=1, validation mAP <score>=(0.28848972282396856)
[06/08/2021 00:28:42 INFO 139763492013888] Updating the best model with validation-mAP=0.28848972282396856
[06/08/2021 00:28:43 INFO 139763492013888] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[06/08/2021 00:28:43 INFO 139763492013888] #progress_metric: host=algo-1, completed 40.0 % of epochs
#metrics {"StartTime": 1623111884.0313184, "EndTime": 1623112123.1270182, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "epoch": 1, "Meta": "training_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}

[06/08/2021 00:29:01 INFO 139763492013888] Epoch:    2, batches:    100, num_examples:   1600, 89.1 samples/sec, epoch time so far:  0:00:17.954888
[06/08/2021 00:29:18 INFO 139763492013888] Epoch:    2, batches:    200, num_examples:   3200, 89.5 samples/sec, epoch time so far:  0:00:35.745898
[06/08/2021 00:29:36 INFO 139763492013888] Epoch:    2, batches:    300, num_examples:   4800, 90.3 samples/sec, epoch time so far:  0:00:53.139882
[06/08/2021 00:29:53 INFO 139763492013888] Epoch:    2, batches:    400, num_examples:   6400, 90.9 samples/sec, epoch time so far:  0:01:10.410421
[06/08/2021 00:30:11 INFO 139763492013888] Epoch:    2, batches:    500, num_examples:   8000, 90.5 samples/sec, epoch time so far:  0:01:28.351588
[06/08/2021 00:30:28 INFO 139763492013888] Epoch:    2, batches:    600, num_examples:   9600, 90.7 samples/sec, epoch time so far:  0:01:45.795781
[06/08/2021 00:30:46 INFO 139763492013888] Epoch:    2, batches:    700, num_examples:   11200, 90.9 samples/sec, epoch time so far:  0:02:03.230409
[06/08/2021 00:31:03 INFO 139763492013888] Epoch:    2, batches:    800, num_examples:   12800, 90.9 samples/sec, epoch time so far:  0:02:20.871359
[06/08/2021 00:31:21 INFO 139763492013888] Epoch:    2, batches:    900, num_examples:   14400, 90.8 samples/sec, epoch time so far:  0:02:38.557186
[06/08/2021 00:31:38 INFO 139763492013888] Epoch:    2, batches:    1000, num_examples:   16000, 91.0 samples/sec, epoch time so far:  0:02:55.769493
[06/08/2021 00:31:44 INFO 139763492013888] Update[3103]: Change learning rate to 1.00000e-05
[06/08/2021 00:31:44 INFO 139763492013888] #quality_metric: host=algo-1, epoch=2, batch=1035 train cross_entropy <loss>=(0.8826827490410961)
[06/08/2021 00:31:44 INFO 139763492013888] #quality_metric: host=algo-1, epoch=2, batch=1035 train smooth_l1 <loss>=(0.4071018180537036)
[06/08/2021 00:31:44 INFO 139763492013888] Round of batches complete
[06/08/2021 00:31:44 INFO 139763492013888] Updated the metrics
[06/08/2021 00:32:43 INFO 139763492013888] #quality_metric: host=algo-1, epoch=2, validation mAP <score>=(0.3345820201721826)
[06/08/2021 00:32:43 INFO 139763492013888] Updating the best model with validation-mAP=0.3345820201721826
[06/08/2021 00:32:43 INFO 139763492013888] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[06/08/2021 00:32:43 INFO 139763492013888] #progress_metric: host=algo-1, completed 60.0 % of epochs
#metrics {"StartTime": 1623112123.127278, "EndTime": 1623112363.3974922, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "epoch": 2, "Meta": "training_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "

[06/08/2021 00:33:01 INFO 139763492013888] Epoch:    3, batches:    100, num_examples:   1600, 89.9 samples/sec, epoch time so far:  0:00:17.799003
[06/08/2021 00:33:19 INFO 139763492013888] Epoch:    3, batches:    200, num_examples:   3200, 89.7 samples/sec, epoch time so far:  0:00:35.672482
[06/08/2021 00:33:36 INFO 139763492013888] Epoch:    3, batches:    300, num_examples:   4800, 90.3 samples/sec, epoch time so far:  0:00:53.139785
[06/08/2021 00:33:53 INFO 139763492013888] Epoch:    3, batches:    400, num_examples:   6400, 90.7 samples/sec, epoch time so far:  0:01:10.572073
[06/08/2021 00:34:11 INFO 139763492013888] Epoch:    3, batches:    500, num_examples:   8000, 90.8 samples/sec, epoch time so far:  0:01:28.143912
[06/08/2021 00:34:28 INFO 139763492013888] Epoch:    3, batches:    600, num_examples:   9600, 91.1 samples/sec, epoch time so far:  0:01:45.413063
[06/08/2021 00:34:46 INFO 139763492013888] Epoch:    3, batches:    700, num_examples:   11200, 91.3 samples/sec, epoch time so far:  0:02:02.620887
[06/08/2021 00:35:03 INFO 139763492013888] Epoch:    3, batches:    800, num_examples:   12800, 91.5 samples/sec, epoch time so far:  0:02:19.815928
[06/08/2021 00:35:22 INFO 139763492013888] Epoch:    3, batches:    900, num_examples:   14400, 90.8 samples/sec, epoch time so far:  0:02:38.660342
[06/08/2021 00:35:39 INFO 139763492013888] Epoch:    3, batches:    1000, num_examples:   16000, 90.9 samples/sec, epoch time so far:  0:02:55.955788
[06/08/2021 00:35:44 INFO 139763492013888] #quality_metric: host=algo-1, epoch=3, batch=1034 train cross_entropy <loss>=(0.7979718979717247)
[06/08/2021 00:35:44 INFO 139763492013888] #quality_metric: host=algo-1, epoch=3, batch=1034 train smooth_l1 <loss>=(0.3725223473486877)
[06/08/2021 00:35:44 INFO 139763492013888] Round of batches complete
[06/08/2021 00:35:44 INFO 139763492013888] Updated the metrics
[06/08/2021 00:36:37 INFO 139763492013888] #quality_metric: host=algo-1, epoch=3, validation mAP <score>=(0.4314690367406807)
[06/08/2021 00:36:37 INFO 139763492013888] Updating the best model with validation-mAP=0.4314690367406807
[06/08/2021 00:36:37 INFO 139763492013888] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[06/08/2021 00:36:37 INFO 139763492013888] #progress_metric: host=algo-1, completed 80.0 % of epochs
#metrics {"StartTime": 1623112363.3977618, "EndTime": 1623112597.6371493, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "epoch": 3, "Meta": "training_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0},

[06/08/2021 00:36:55 INFO 139763492013888] Epoch:    4, batches:    100, num_examples:   1600, 91.8 samples/sec, epoch time so far:  0:00:17.422129
[06/08/2021 00:37:12 INFO 139763492013888] Epoch:    4, batches:    200, num_examples:   3200, 90.9 samples/sec, epoch time so far:  0:00:35.190087
[06/08/2021 00:37:30 INFO 139763492013888] Epoch:    4, batches:    300, num_examples:   4800, 91.3 samples/sec, epoch time so far:  0:00:52.573873
[06/08/2021 00:37:47 INFO 139763492013888] Epoch:    4, batches:    400, num_examples:   6400, 91.7 samples/sec, epoch time so far:  0:01:09.794477
[06/08/2021 00:38:04 INFO 139763492013888] Epoch:    4, batches:    500, num_examples:   8000, 91.9 samples/sec, epoch time so far:  0:01:27.022665
[06/08/2021 00:38:22 INFO 139763492013888] Epoch:    4, batches:    600, num_examples:   9600, 91.9 samples/sec, epoch time so far:  0:01:44.491156
[06/08/2021 00:38:39 INFO 139763492013888] Epoch:    4, batches:    700, num_examples:   11200, 92.0 samples/sec, epoch time so far:  0:02:01.728222
[06/08/2021 00:38:56 INFO 139763492013888] Epoch:    4, batches:    800, num_examples:   12800, 92.1 samples/sec, epoch time so far:  0:02:18.996809
[06/08/2021 00:39:14 INFO 139763492013888] Epoch:    4, batches:    900, num_examples:   14400, 91.9 samples/sec, epoch time so far:  0:02:36.644035
[06/08/2021 00:39:31 INFO 139763492013888] Epoch:    4, batches:    1000, num_examples:   16000, 92.1 samples/sec, epoch time so far:  0:02:53.722785
[06/08/2021 00:39:36 INFO 139763492013888] #quality_metric: host=algo-1, epoch=4, batch=1035 train cross_entropy <loss>=(0.7741237876253803)
[06/08/2021 00:39:36 INFO 139763492013888] #quality_metric: host=algo-1, epoch=4, batch=1035 train smooth_l1 <loss>=(0.3593430273048518)
[06/08/2021 00:39:36 INFO 139763492013888] Round of batches complete
[06/08/2021 00:39:36 INFO 139763492013888] Updated the metrics
[06/08/2021 00:40:31 INFO 139763492013888] #quality_metric: host=algo-1, epoch=4, validation mAP <score>=(0.4523574392628644)
[06/08/2021 00:40:31 INFO 139763492013888] Updating the best model with validation-mAP=0.4523574392628644
[06/08/2021 00:40:31 INFO 139763492013888] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[06/08/2021 00:40:31 INFO 139763492013888] #progress_metric: host=algo-1, completed 100.0 % of epochs
#metrics {"StartTime": 1623112597.6374252, "EndTime": 1623112831.2886467, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "epoch": 4, "Meta": "training_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0},


2021-06-08 00:40:52 Uploading - Uploading generated training model


2021-06-08 00:41:12 Completed - Training job completed
ProfilerReport-1623111339: IssuesFound


Training seconds: 1343
Billable seconds: 1343


As the log shows, the training job ran for about `22` minutes (1343 training seconds) and reached a validation mAP of about `0.45`. To improve the detection quality, you can start a new training job with a larger `epochs` value so that the algorithm trains for more iterations. However, such a job would re-learn everything the previous training job already learned in its first `5` epochs. To avoid wasting training resources and time, we can instead start the new training job from the model generated by the previous SageMaker training job.

## Incremental Training
In this section, we start a new training job from the model obtained in the previous section. We set up the estimator and hyperparameters in the same way as for the previous training job. Note that the SageMaker object detection algorithm currently supports re-training only with the same network, which means the new training job must use the same `base_network` and `num_classes` as the previous training job.
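
The constraint above can be checked before a job is launched. The following is a minimal sketch, not part of the SageMaker SDK: `check_incremental_compat` is a hypothetical helper that compares two hyperparameter dictionaries on the keys named in the text (`base_network` and `num_classes`). The values in `previous_hp` are assumed for illustration; the values in `new_hp` mirror the hyperparameters set for the new job in this section.

```python
# Hypothetical helper (not a SageMaker API): verify that a new job can reuse
# the model from a previous job under the same-network constraint.
MUST_MATCH = ("base_network", "num_classes")


def check_incremental_compat(previous, new, keys=MUST_MATCH):
    """Return a list of (key, previous_value, new_value) mismatches."""
    mismatches = []
    for key in keys:
        # Compare as strings because hyperparameters are often stored as strings.
        if str(previous.get(key)) != str(new.get(key)):
            mismatches.append((key, previous.get(key), new.get(key)))
    return mismatches


# Assumed values for the initial job; the new values mirror this section.
previous_hp = {"base_network": "resnet-50", "num_classes": 20, "epochs": 5}
new_hp = {"base_network": "resnet-50", "num_classes": 20, "epochs": 1}

problems = check_incremental_compat(previous_hp, new_hp)
print("compatible" if not problems else f"mismatched: {problems}")
```

Hyperparameters outside `MUST_MATCH`, such as `epochs` or `learning_rate`, are free to differ between the two jobs.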

In [15]:
new_od_model = sagemaker.estimator.Estimator(
    training_image,
    role,
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    train_volume_size=50,
    train_max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    sagemaker_session=sess,
)

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


train_max_run has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
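
The warnings above indicate that the `train_*` keyword arguments were renamed in version 2 of the SageMaker Python SDK. As a minimal sketch, the mapping below lists the four renames those warnings refer to. `rename_v1_kwargs` is a hypothetical helper written for illustration, not an SDK function.

```python
# Renames introduced in sagemaker>=2 for the Estimator arguments used above.
V1_TO_V2 = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_max_run": "max_run",
}


def rename_v1_kwargs(kwargs):
    """Return a copy of kwargs with v1 Estimator names replaced by v2 names."""
    return {V1_TO_V2.get(name, name): value for name, value in kwargs.items()}


v1_kwargs = {
    "train_instance_count": 1,
    "train_instance_type": "ml.p3.2xlarge",
    "train_volume_size": 50,
    "train_max_run": 360000,
    "input_mode": "File",
}
print(rename_v1_kwargs(v1_kwargs))
```

Arguments that were not renamed, such as `input_mode`, pass through unchanged.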


In [16]:
new_od_model.set_hyperparameters(
    base_network="resnet-50",
    num_classes=20,
    mini_batch_size=16,
    epochs=1,
    learning_rate=0.001,
    optimizer="rmsprop",
    momentum=0.9,
    image_shape=300,
    label_width=350,
    num_training_samples=16551,
)

We use the same training data as in the previous job. To use the pre-trained model, we only need to add a `model` channel to the `inputs` and set its content type to `application/x-sagemaker-model`.

In [17]:
# Use the same data for training and validation as the previous job.
train_data = sagemaker.session.s3_input(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.session.s3_input(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
)

# Use the output model from the previous job.
s3_model_data = od_model.model_data

model_data = sagemaker.session.s3_input(
    s3_model_data,
    distribution="FullyReplicated",
    content_type="application/x-sagemaker-model",
    s3_data_type="S3Prefix",
)

# In addition to two data channels, add a 'model' channel for the training.
new_data_channels = {"train": train_data, "validation": validation_data, "model": model_data}

The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
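
For reference, each channel defined above corresponds to one entry of `InputDataConfig` in the low-level `CreateTrainingJob` request. The sketch below builds those entries as plain dictionaries so that the role of the `model` channel is visible. `make_channel` is a hypothetical helper and the S3 URIs are placeholders, not the locations used in this notebook. In sagemaker>=2, the input class used above is available as `sagemaker.inputs.TrainingInput`.

```python
def make_channel(name, s3_uri, content_type):
    """Build one InputDataConfig entry in the CreateTrainingJob request shape."""
    return {
        "ChannelName": name,
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Placeholder URIs for illustration only.
channels = [
    make_channel("train", "s3://example-bucket/train", "application/x-recordio"),
    make_channel("validation", "s3://example-bucket/validation", "application/x-recordio"),
    # The extra channel carries the model artifact from the previous job.
    make_channel("model", "s3://example-bucket/output/model.tar.gz", "application/x-sagemaker-model"),
]

for channel in channels:
    print(channel["ChannelName"], "->", channel["ContentType"])
```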


Fit the new model with all three input channels.

In [18]:
new_od_model.fit(inputs=new_data_channels, logs=True)

2021-06-08 00:41:39 Starting - Starting the training job...
2021-06-08 00:42:03 Starting - Launching requested ML instances
ProfilerReport-1623112898: InProgress
......
2021-06-08 00:43:04 Starting - Preparing the instances for training.........
2021-06-08 00:44:34 Downloading - Downloading input data......
2021-06-08 00:45:24 Training - Downloading the training image..

Docker entrypoint called with argument(s): train
[06/08/2021 00:45:51 INFO 139827178395456] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/default-input.json: {'base_network': 'vgg-16', 'use_pretrained_model': '0', 'num_classes': '', 'mini_batch_size': '32', 'epochs': '30', 'learning_rate': '0.001', 'lr_scheduler_step': '', 'lr_scheduler_factor': '0.1', 'optimizer': 'sgd', 'momentum': '0.9', 'weight_decay': '0.0005', 'overlap_threshold': '0.5', 'nms_threshold': '0.45', 'num_training_samples': '', 'image_shape': '300', '_tuning_objective_metric': '', '_kvstore': 'device', 'kv_store': 'device', '_num_kv_servers': 'auto', 'label_width': '350', 'freeze_layer_pattern': '', 'nms_topk': '400', 'early_stopping': 'False', 'early_stopping_min_epochs': '10', 'early_stopping_patience': '5', 'early_stopping_tolerance': '0.0', '_begin_epoch': '0'}
[06/08/2021 00:45:51 INFO 139827178395456] Merging with provided configuration from /opt/ml/i

2021-06-08 00:46:04 Training - Training image download completed. Training in progress.
[00:45:58] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/io/iter_image_det_recordio.cc:340: ImageDetRecordIOParser: /opt/ml/input/data/train/train.rec, label padding width: 350
[00:45:59] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/io/iter_image_det_recordio.cc:283: ImageDetRecordIOParser: /opt/ml/input/data/validation/val.rec, use 7 threads for decoding..
[00:46:01] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/io/iter_image_det_recordio.cc:340: ImageDetRecordIOParser: /opt/ml/input/data/validation/val.rec, label padding width: 350
[06/08/2021 00:46:03 INFO 139827178395456] Number of GPUs being used: 1
[06/08/2021 00:46:

[06/08/2021 00:46:17 INFO 139827178395456] Creating a new state instance.
#metrics {"StartTime": 1623113177.8059494, "EndTime": 1623113177.8060446, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "Meta": "init_train_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Reset Count": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Number of Records Since Last Reset": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Number of Batches Since Last Reset": {"sum": 0.0, "count": 1, "min": 0, "max": 0}}}
[00:46:17] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.4.x_ecl_Cuda_9.x.135.0/AL2_x86_64/generic-flavor/src/src/operator/nn/./cudnn/./c

[06/08/2021 00:46:40 INFO 139827178395456] Epoch:    0, batches:    100, num_examples:   1600, 70.8 samples/sec, epoch time so far:  0:00:22.609019
[06/08/2021 00:46:57 INFO 139827178395456] Epoch:    0, batches:    200, num_examples:   3200, 80.0 samples/sec, epoch time so far:  0:00:40.021103
[06/08/2021 00:47:15 INFO 139827178395456] Epoch:    0, batches:    300, num_examples:   4800, 83.0 samples/sec, epoch time so far:  0:00:57.816701
[06/08/2021 00:47:33 INFO 139827178395456] Epoch:    0, batches:    400, num_examples:   6400, 85.1 samples/sec, epoch time so far:  0:01:15.199314
[06/08/2021 00:47:50 INFO 139827178395456] Epoch:    0, batches:    500, num_examples:   8000, 86.3 samples/sec, epoch time so far:  0:01:32.655668
[06/08/2021 00:48:07 INFO 139827178395456] Epoch:    0, batches:    600, num_examples:   9600, 87.1 samples/sec, epoch time so far:  0:01:50.168201
[06/08/2021 00:48:25 INFO 139827178395456] Epoch:    0, batches:    700, num_examples:   11200, 87.5 samples/sec, epoch time so far:  0:02:08.007687
[06/08/2021 00:48:43 INFO 139827178395456] Epoch:    0, batches:    800, num_examples:   12800, 87.8 samples/sec, epoch time so far:  0:02:25.761494
[06/08/2021 00:49:00 INFO 139827178395456] Epoch:    0, batches:    900, num_examples:   14400, 88.4 samples/sec, epoch time so far:  0:02:42.912719
[06/08/2021 00:49:18 INFO 139827178395456] Epoch:    0, batches:    1000, num_examples:   16000, 88.6 samples/sec, epoch time so far:  0:03:00.500626
[06/08/2021 00:49:23 INFO 139827178395456] #quality_metric: host=algo-1, epoch=0, batch=1035 train cross_entropy <loss>=(0.8369018467318383)
[06/08/2021 00:49:23 INFO 139827178395456] #quality_metric: host=algo-1, epoch=0, batch=1035 train smooth_l1 <loss>=(0.3878045983785223)
[06/08/2021 00:49:23 INFO 139827178395456] Round of batches complete
[06/08/2021 00:49:23 INFO 139827178395456] Updated the metrics
[06/08/2021 00:50:19 INFO 139827178395456] #quality_metric: host=algo-1, epoch=0, validation mAP <score>=(0.38044882677318465)
[06/08/2021 00:50:19 INFO 139827178395456] Updating the best model with validation-mAP=0.38044882677318465
[06/08/2021 00:50:19 INFO 139827178395456] Saved checkpoint to "/opt/ml/model/model_algo_1-0000.params"
[06/08/2021 00:50:19 INFO 139827178395456] #progress_metric: host=algo-1, completed 100.0 % of epochs
#metrics {"StartTime": 1623113177.8064299, "EndTime": 1623113419.280929, "Dimensions": {"Algorithm": "AWS/Object Detection", "Host": "algo-1", "Operation": "training", "epoch": 0, "Meta": "training_data_iter"}, "Metrics": {"Total Records Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Total Batches Seen": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Records Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}, "Max Batches Seen Between Resets": {"sum": 0.0, "count": 1, "min": 0, "max": 0}


2021-06-08 00:51:11 Uploading - Uploading generated training model
2021-06-08 00:51:11 Completed - Training job completed


Training seconds: 397
Billable seconds: 397


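Instead of repeating the first `5` epochs from the previous job, we started the training from the trained model and continued for only one more epoch. Note that in this run the validation mAP reported after the extra epoch was about `0.38`, compared with about `0.45` at the end of the initial job, so a single additional epoch with these hyperparameters did not improve the score here. Models pre-trained in SageMaker can nevertheless be re-used in this way to avoid repeating earlier epochs when further training is needed.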
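
To compare jobs, the validation mAP can be read directly from the `#quality_metric` lines in the training logs. The following minimal sketch uses a regular expression written for the log format shown in this notebook. `parse_validation_map` is a hypothetical helper, and the sample lines are copied from the logs above.

```python
import re

# Matches lines such as:
#   ... #quality_metric: host=algo-1, epoch=4, validation mAP <score>=(0.45...)
PATTERN = re.compile(r"epoch=(\d+), validation mAP <score>=\(([0-9.eE+-]+)\)")


def parse_validation_map(log_text):
    """Return {epoch: mAP} for every validation mAP line found in log_text."""
    return {int(epoch): float(score) for epoch, score in PATTERN.findall(log_text)}


initial_log = (
    "[06/08/2021 00:36:37 INFO 139763492013888] #quality_metric: host=algo-1, "
    "epoch=3, validation mAP <score>=(0.4314690367406807)\n"
    "[06/08/2021 00:40:31 INFO 139763492013888] #quality_metric: host=algo-1, "
    "epoch=4, validation mAP <score>=(0.4523574392628644)\n"
)
incremental_log = (
    "[06/08/2021 00:50:19 INFO 139827178395456] #quality_metric: host=algo-1, "
    "epoch=0, validation mAP <score>=(0.38044882677318465)\n"
)

print(parse_validation_map(initial_log))
print(parse_validation_map(incremental_log))
```

Epoch numbers restart at `0` in the incremental job, so the two dictionaries should be compared by job rather than merged by epoch.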

## Hosting
Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same instance (or type of instance) that we used to train. Training is a prolonged, compute-heavy job with compute and memory requirements that hosting typically does not have. We can choose any type of instance we want to host the model. In our case we chose the `ml.p3.2xlarge` instance to train, but we choose to host the model on a less expensive CPU instance, `ml.m4.xlarge`. The endpoint deployment can be accomplished as follows:

In [19]:
object_detector = new_od_model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

---------------!

## Inference
Now that the trained model is deployed at an endpoint that is up and running, we can use this endpoint for inference. To do this, let us download an image from [PEXELS](https://www.pexels.com/) that the algorithm has not seen so far.

In [20]:
!wget -O test.jpg https://images.pexels.com/photos/980382/pexels-photo-980382.jpeg
file_name = "test.jpg"

with open(file_name, "rb") as image:
    f = image.read()
    b = bytearray(f)

# Write a copy of the bytes, closing the file when done.
with open("n.txt", "wb") as ne:
    ne.write(b)

--2021-06-08 00:58:57--  https://images.pexels.com/photos/980382/pexels-photo-980382.jpeg
Resolving images.pexels.com (images.pexels.com)... 104.17.209.102, 104.17.208.102, 2606:4700::6811:d166, ...
Connecting to images.pexels.com (images.pexels.com)|104.17.209.102|:443... connected.


HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2021-06-08 00:58:57 ERROR 503: Service Temporarily Unavailable.



Let us use our endpoint to try to detect objects within this image. Since the image is a `jpeg`, we use the appropriate `content_type` to run the prediction job. The endpoint returns a JSON response that we can simply load and inspect.

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [21]:
import json

object_detector.content_type = "image/jpeg"
results = object_detector.predict(b)
detections = json.loads(results)
print(detections)

AttributeError: can't set attribute
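
The `AttributeError` above is consistent with version 2 of the SageMaker Python SDK, where `content_type` on a predictor is a read-only property derived from its serializer. One possible fix, sketched below under that assumption, is to assign a serializer that carries the content type instead. `predict_jpeg` is a hypothetical helper; `IdentitySerializer` is the class in `sagemaker.serializers` that passes bytes through unchanged. Note also that the `wget` call above returned a 503 error, so `test.jpg` would be empty in this run; the helper rejects an empty payload for that reason.

```python
import json


def predict_jpeg(predictor, payload, serializer=None):
    """Send JPEG bytes to an endpoint and return the parsed JSON response.

    predictor  -- an object with a settable `serializer` and a `predict` method,
                  such as the `object_detector` returned by `deploy`.
    serializer -- optional override; defaults to the SDK v2 IdentitySerializer.
    """
    if not payload:
        raise ValueError("image payload is empty; check that the download succeeded")
    if serializer is None:
        # Imported lazily so the helper can be exercised without the SDK installed.
        from sagemaker.serializers import IdentitySerializer

        serializer = IdentitySerializer(content_type="image/jpeg")
    # In SDK v2 the request content type is taken from the serializer.
    predictor.serializer = serializer
    return json.loads(predictor.predict(bytes(payload)))


# Intended use in this notebook (requires a live endpoint):
#   detections = predict_jpeg(object_detector, b)
```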

The results are in a format similar to the .lst format, with the addition of a confidence score for each detected object. Each detection can be represented as `[class_index, confidence_score, xmin, ymin, xmax, ymax]`. Typically, we don't consider low-confidence predictions.
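
As a minimal sketch of how this format can be consumed, the hypothetical helper below keeps detections above a confidence threshold and scales the box corners to pixel coordinates. It assumes, as the visualization script in this notebook does, that the coordinates are fractions of the image width and height. The sample detections are made up for illustration.

```python
def filter_detections(dets, width, height, thresh=0.6):
    """Keep detections with score >= thresh and convert boxes to pixels.

    Each input row is [class_index, confidence_score, xmin, ymin, xmax, ymax]
    with coordinates assumed to be normalized to the range [0, 1].
    """
    kept = []
    for class_index, score, x0, y0, x1, y1 in dets:
        if score < thresh:
            continue
        box = (int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height))
        kept.append((int(class_index), score, box))
    return kept


# Made-up detections for illustration, not real endpoint output.
sample = [
    [14.0, 0.92, 0.25, 0.25, 0.50, 0.75],
    [11.0, 0.15, 0.50, 0.50, 0.75, 1.00],
]
print(filter_detections(sample, width=400, height=200))
```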

We have provided an additional script to visualize the detection outputs easily. The script below draws bounding boxes for the high-confidence predictions and filters out low-confidence detections:

In [None]:
def visualize_detection(img_file, dets, classes=None, thresh=0.6):
    """
    Visualize detections in one image.

    Parameters
    ----------
    img_file : str
        path to the image file
    dets : list or numpy.array
        SSD detections, [[id, score, x1, y1, x2, y2], ...];
        each row is one object, with coordinates relative to the image size
    classes : tuple or list of str, optional
        class names
    thresh : float
        score threshold
    """
    import random
    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg

    img = mpimg.imread(img_file)
    plt.imshow(img)
    height = img.shape[0]
    width = img.shape[1]
    colors = dict()
    for det in dets:
        (klass, score, x0, y0, x1, y1) = det
        if score < thresh:
            continue
        cls_id = int(klass)
        if cls_id not in colors:
            colors[cls_id] = (random.random(), random.random(), random.random())
        xmin = int(x0 * width)
        ymin = int(y0 * height)
        xmax = int(x1 * width)
        ymax = int(y1 * height)
        rect = plt.Rectangle(
            (xmin, ymin),
            xmax - xmin,
            ymax - ymin,
            fill=False,
            edgecolor=colors[cls_id],
            linewidth=3.5,
        )
        plt.gca().add_patch(rect)
        class_name = str(cls_id)
        if classes and len(classes) > cls_id:
            class_name = classes[cls_id]
        plt.gca().text(
            xmin,
            ymin - 2,
            "{:s} {:.3f}".format(class_name, score),
            bbox=dict(facecolor=colors[cls_id], alpha=0.5),
            fontsize=12,
            color="white",
        )
    plt.show()

For the sake of this notebook, we trained the model with only one epoch. This implies that the results might not be optimal. To achieve better detection results, you can try to tune the hyperparameters and train the model for more epochs.

In [None]:
object_categories = [
    "aeroplane",
    "bicycle",
    "bird",
    "boat",
    "bottle",
    "bus",
    "car",
    "cat",
    "chair",
    "cow",
    "diningtable",
    "dog",
    "horse",
    "motorbike",
    "person",
    "pottedplant",
    "sheep",
    "sofa",
    "train",
    "tvmonitor",
]

# Setting a threshold 0.20 will only plot detection results that have a confidence score greater than 0.20.
threshold = 0.20

# Visualize the detections.
visualize_detection(file_name, detections["prediction"], object_categories, threshold)

## Delete the Endpoint
A running endpoint incurs costs. Therefore, as a clean-up step, we delete the endpoint.

In [None]:
sagemaker.Session().delete_endpoint(object_detector.endpoint)