# SageMakerCV TensorFlow Tutorial

SageMakerCV is a collection of computer vision tools developed to take full advantage of Amazon SageMaker by providing state-of-the-art model accuracy, training speed, and training cost reductions. SageMakerCV is based on the lessons we learned from developing the record-breaking computer vision models we announced at re:Invent in 2019 and 2020, along with talking to our customers and understanding the challenges they faced in training their own computer vision models.

The tutorial in this notebook walks through using SageMakerCV to train Mask RCNN on the COCO dataset. The only prerequisite is to set up SageMaker Studio, the instructions for which can be found in [Onboard to Amazon SageMaker Studio Using Quick Start](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). Everything else, from getting the COCO data to launching a distributed training cluster, is included here.

## Setup and Roadmap

Before diving into the tutorial itself, let's take a minute to discuss the various tools we'll be using.

#### SageMaker Studio
[SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) is a machine learning focused IDE where you can interactively develop models and launch SageMaker training jobs all in one place. SageMaker Studio provides a Jupyter Lab like environment, but with a number of enhancements. We'll just scratch the surface here. See the [SageMaker Studio Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) for more details.

For our purposes, the biggest difference from regular Jupyter Lab is that SageMaker Studio allows you to change your compute resources as needed, by connecting notebooks to Docker containers on different ML instances. This is a little confusing to just describe, so let's walk through an example.

Once you've completed the setup on [Onboard to Amazon SageMaker Studio Using Quick Start](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html), go to the [SageMaker Console](https://us-west-2.console.aws.amazon.com/sagemaker) and click `Open SageMaker Studio` near the top right of the page.

<img src="../assets/SageMaker_console.png" style="width: 600px">

If you haven't yet created a user, do so via the link at the top left of the page. Give it any name you like. For execution role, you can either use an existing SageMaker role, or create a new one. If you're unsure, create a new role. On the `Create IAM Role` window, make sure to select `Any S3 Bucket`. 

<img src="../assets/Create_IAM_role.png" style="width: 600px">

Back on the SageMaker Studio page, select `Open Studio` next to the user you just created.

<img src="../assets/Studio_domain.png" style="width: 600px">

This will take a couple minutes to start up the first time. Once it starts, you'll have a Jupyter Lab like interface running on a small instance with an attached EBS volume. Let's start by taking a look at the `Launcher` tab.

<img src="../assets/Studio_launcher.png" style="width: 750px">

If you don't see the `Launcher`, you can bring one up by clicking the `+` on the menu bar in the upper left corner.

<img src="../assets/Studio_menu_bar.png" style="width: 600px">

The `Launcher` gives you access to all kinds of tools. This is where you can create new notebooks, text files, or get a terminal for your instance. Try the `System Terminal`. This gives you a new terminal tab for your Studio instance. It's useful for things like downloading data or cloning GitHub repos into Studio. For example, you can run `aws s3 ls` to browse your current S3 buckets. Go ahead and clone this repo onto Studio with:

`git clone https://github.com/aws-samples/amazon-sagemaker-cv`

Let's look at the launcher one more time. Bring another one up with the `+`. Notice you have an option for `Select a SageMaker image` above the button to launch a notebook. This allows you to select a Docker image that will launch on a new instance. The notebook you create will be attached to that new instance, along with the EBS volume on your Studio instance. Let's try it out. On the `Launcher` page, click the drop down menu next to `Select a SageMaker image` and select `TensorFlow 2.3 Python 3.7 (Optimized for GPU)`, then click the `Notebook` button below the dropdown.

<img src="../assets/Select_tensorflow_image.png" style="width: 600px">

Take a look at the upper righthand corner of the notebook. 

<img src="../assets/notebook_tensorflow_kernel.png" style="width: 600px">

The `Python 3 (TensorFlow 2.3 Python 3.7 GPU Optimized)` refers to the kernel associated with this notebook. The `Unknown` refers to the current instance type. Click `Unknown` and select `ml.g4dn.xlarge`.

<img src="../assets/instance_types.png" style="width: 600px">

This will launch a `ml.g4dn.xlarge` instance and attach this notebook to it. This will take a couple of minutes, because Studio needs to download the TensorFlow Docker image to the new instance. Once an instance has started, launching new notebooks with the same instance type and kernel is immediate. You'll also see the `Unknown` replaced with an instance description `4 vCPU + 16 GiB + 1 GPU`. You can also change instances as needed. Say you want to run your notebook on a `ml.p3dn.24xlarge` to get 8 GPUs. To change instances, just click the instance description. To get more instances in the menu, deselect `Fast launch only`.

Once your notebook is up and running, you can also get a terminal into your new instance.

<img src="../assets/Launch_terminal.png" style="width: 600px">

This can be useful for customizing your image with setup scripts, pip installing new packages, or using MPI to launch multi-GPU training jobs. Click to get a terminal and run `ls`. Note that you have the same directories as your main Studio instance. Studio will attach the same EBS volume to all the instances you start, so all your files and data are shared across any notebooks you start. This means that you can prototype a model on a single GPU instance, then switch to a multi-GPU instance while still having access to all of your data and scripts.

Finally, when you want to shut down instances, click the circle with a square in it on the left hand side.

<img src="../assets/running_instances.png" style="width: 600px">

This shows your current running instances, and the Docker containers attached to those instances. To shut them down, just click the power button to their right.

Now that we've explored Studio a bit, let's get started with SageMakerCV. If you followed the instructions above to clone the repo, you should have `amazon-sagemaker-cv` in the file browser on the left. Navigate to `amazon-sagemaker-cv/tensorflow/tutorial.ipynb` to open this notebook on your instance. If you still have a `g4dn` running, it should automatically attach to it.

The rest of this notebook is broken into 4 sections.

- Installing SageMakerCV and Downloading the COCO Data

Since we're using the base AWS Deep Learning Container image, we need to add the SageMakerCV tools. Then we'll download the COCO dataset and upload it to S3.

- Prototyping in Studio

We'll walk through how to train a model on Studio, how SageMakerCV is structured, and how you can add your own models and features.

- Launching a SageMaker Training Job

There are lots of bells and whistles available to train your models fast, and on large datasets. We'll put a lot of those together to launch a high performance training job. Specifically, we'll create a training job with 4 P4d.24xlarge instances connected with 400 Gbps EFA, streaming our training data from S3 so we don't have to load the dataset onto the instances before training. You could even use this same configuration to train on a dataset that wouldn't fit on the instances. If you'd rather launch a smaller (or larger) training cluster, we'll discuss how to modify the configuration.

- Testing Our Model

Finally, we'll take the output trained Mask RCNN model and visualize its performance in Studio.

#### Installing SageMakerCV

To install SageMakerCV on the TensorFlow Studio Docker image, just run `pip install -e .` in the `amazon-sagemaker-cv/tensorflow` directory. You can do this with either an image terminal, or by running the paragraph below. Note that we use the `-e` option. This will keep the SageMakerCV modules editable, so any changes you make will be launched on your training job.

In [3]:
!pip install -e .

Obtaining file:///root/amazon-sagemaker-cv/tensorflow
Collecting pycocotools@ https://aws-smcv-us-west-2.s3.us-west-2.amazonaws.com/utils/binaries/cocoapi/pycocotools-2.0%2Bnv0.6.0-cp37-cp37m-linux_x86_64.whl
  Downloading https://aws-smcv-us-west-2.s3.us-west-2.amazonaws.com/utils/binaries/cocoapi/pycocotools-2.0%2Bnv0.6.0-cp37-cp37m-linux_x86_64.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 52.4 MB/s eta 0:00:01
Installing collected packages: sagemakercv
  Attempting uninstall: sagemakercv
    Found existing installation: sagemakercv 0.1
    Uninstalling sagemakercv-0.1:
      Successfully uninstalled sagemakercv-0.1
  Running setup.py develop for sagemakercv
Successfully installed sagemakercv-0.1
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.


***
### Setup on S3 and Download COCO data

Next we need to set up an S3 bucket for all our data and results. Enter a name for your S3 bucket below. You can either create a new bucket, or use an existing bucket. If you use an existing bucket, make sure it's in the same region where you plan to run training. For new buckets, we'll specify that it needs to be in the current SageMaker region. By default we'll put everything in an S3 location on your bucket named `smcv-tensorflow-tutorial`, and locally in `/root/smcv-tensorflow-tutorial`, but you can change these locations.
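
The cells below join the bare bucket name with `"s3://"` to form S3 locations, which is why the name must not already carry that prefix. As a minimal sketch of the same idea, the helper below (`s3_uri` is a hypothetical name, not part of SageMakerCV) tolerates a pasted prefix or stray slashes before building the URI.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Build an s3:// URI from a bucket name and optional key parts (hypothetical helper)."""
    # Tolerate a pasted "s3://" prefix and stray slashes around the bucket name.
    bare = bucket[len("s3://"):] if bucket.startswith("s3://") else bucket
    bare = bare.strip("/")
    # Drop empty parts and surrounding slashes so the key never contains "//".
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bare}/{key}" if key else f"s3://{bare}"


print(s3_uri("sagemaker-smcv-tensorflow-tutorial", "smcv-tensorflow-tutorial", "data", "coco"))
```

The notebook itself uses `os.path.join("s3://", ...)`, which gives the same result on the Linux Studio image as long as the bucket name is bare.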

In [4]:
S3_BUCKET = 'sagemaker-smcv-tensorflow-tutorial' # Don't include s3:// in your bucket name
S3_DIR = 'smcv-tensorflow-tutorial'
LOCAL_DATA_DIR = '/root/smcv-tensorflow-tutorial' # for reasons detailed in Distributed Training, do not put this dir in your source dir

In [5]:
import os
import zipfile
from pathlib import Path
from s3fs import S3FileSystem
from concurrent.futures import ThreadPoolExecutor
import boto3
from botocore.exceptions import ClientError
from tqdm import tqdm

In [6]:
s3 = boto3.resource('s3')
boto_session = boto3.session.Session()
region = boto_session.region_name

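# Check if bucket exists. If it doesn't, create it.

try:
    s3.meta.client.head_bucket(Bucket=S3_BUCKET)
    print(f"S3 Bucket {S3_BUCKET} Exists")
except ClientError as error:
    # head_bucket also fails (403) when the bucket exists but we cannot access it
    if error.response['Error']['Code'] != '404':
        raise
    print(f"Creating Bucket {S3_BUCKET}")
    # us-east-1 is the default region and rejects an explicit LocationConstraint
    if region == 'us-east-1':
        bucket = s3.create_bucket(Bucket=S3_BUCKET)
    else:
        bucket = s3.create_bucket(Bucket=S3_BUCKET, CreateBucketConfiguration={'LocationConstraint': region})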

S3 Bucket sagemaker-smcv-tensorflow-tutorial Exists


***

Next we'll download the COCO data to Studio, unzip the files, and upload to S3. The reason we want the data in two places is that it's convenient to have the data locally on Studio for prototyping. We also want to unarchive the data before moving it to S3 so that we can stream it to our training instances instead of downloading it all at once.

Once this is finished, you'll have copies of the COCO data on your Studio instance, and in S3. Be careful not to open the `data/coco/train2017` dir in the Studio file browser. It contains 118287 images, and can cause your web browser to crash. If you need to browse these files, use the terminal.

This only needs to be done once, and only if you don't already have the data. The COCO 2017 dataset is about 20GB, so this step takes around 30 minutes to complete. The next paragraph sets up all the file directories we'll use for downloading, and later in training. 

In [7]:
COCO_URL="http://images.cocodataset.org"
ANNOTATIONS_ZIP="annotations_trainval2017.zip"
TRAIN_ZIP="train2017.zip"
VAL_ZIP="val2017.zip"
COCO_DIR=os.path.join(LOCAL_DATA_DIR, 'data', 'coco')
TF_RECORD_DIR=os.path.join(LOCAL_DATA_DIR, 'data', 'coco', 'tfrecord')
os.makedirs(COCO_DIR, exist_ok=True)
os.makedirs(TF_RECORD_DIR, exist_ok=True)
S3_DATA_LOCATION=os.path.join("s3://", S3_BUCKET, S3_DIR, "data", "coco")
S3_WEIGHTS_LOCATION=os.path.join("s3://", S3_BUCKET, S3_DIR, "data", "weights", "resnet")
WEIGHTS_DIR=os.path.join(LOCAL_DATA_DIR, 'data', 'weights')
os.makedirs(WEIGHTS_DIR, exist_ok=True)
R50_WEIGHTS_SRC="s3://aws-smcv-us-west-2/tensorflow/pretrained-weights/resnet/"

***

And this paragraph will download everything, and take around 30 minutes to complete.

In [None]:
print("Downloading annotations")
!wget -O $COCO_DIR/$ANNOTATIONS_ZIP $COCO_URL/annotations/$ANNOTATIONS_ZIP
!unzip $COCO_DIR/$ANNOTATIONS_ZIP -d $COCO_DIR
!aws s3 cp --recursive $COCO_DIR/annotations $S3_DATA_LOCATION/annotations

print("Downloading COCO training data")
!wget -O $COCO_DIR/$TRAIN_ZIP $COCO_URL/zips/$TRAIN_ZIP

# train data has ~118000 images. Plain unzip is too slow, about 1.5 hours, because of disk read and write speed on the EBS volume.
# This technique is much faster because it grabs all the zip metadata at once, then uses threading to unzip multiple files at once.
print("Unzipping COCO training data")
train_zip = zipfile.ZipFile(os.path.join(COCO_DIR, TRAIN_ZIP))
jpeg_files = [image.filename for image in train_zip.filelist if image.filename.endswith('.jpg')]
os.makedirs(os.path.join(COCO_DIR, 'train2017'), exist_ok=True) # exist_ok so the cell can be rerun
with ThreadPoolExecutor() as executor:
    threads = list(tqdm(executor.map(lambda x: train_zip.extract(x, COCO_DIR), jpeg_files), total=len(jpeg_files)))

print("Downloading COCO validation data")
!wget -O $COCO_DIR/$VAL_ZIP $COCO_URL/zips/$VAL_ZIP
# TODO: unzip the validation set with threads as well
!unzip -q $COCO_DIR/$VAL_ZIP -d $COCO_DIR
val_images = [i for i in Path(os.path.join(COCO_DIR, 'val2017')).glob("*.jpg")]
    
!apt-get update -y && apt-get install -y protobuf-compiler
!cd sagemakercv/data/coco && ./process_coco_tfrecord.sh $COCO_DIR $TF_RECORD_DIR


tfrecord_train = list(Path(TF_RECORD_DIR).glob('train-*.tfrecord'))
tfrecord_val = list(Path(TF_RECORD_DIR).glob('val-*.tfrecord'))
s3fs = S3FileSystem()
print("Uploading training tfrecords to S3")
with ThreadPoolExecutor() as executor:
    threads = list(tqdm(executor.map(lambda record: s3fs.put(record.as_posix(), 
                                     os.path.join(S3_DATA_LOCATION, 'tfrecord', 'train2017', record.name)), 
                                     tfrecord_train), total=len(tfrecord_train)))
print("Uploading validation tfrecords to S3")
with ThreadPoolExecutor() as executor:
    threads = list(tqdm(executor.map(lambda record: s3fs.put(record.as_posix(), 
                                     os.path.join(S3_DATA_LOCATION, 'tfrecord', 'val2017', record.name)), 
                                     tfrecord_val), total=len(tfrecord_val)))

print("Downloading Resnet Weights")
s3fs.get(R50_WEIGHTS_SRC, WEIGHTS_DIR, recursive=True)
s3fs.put(WEIGHTS_DIR, S3_WEIGHTS_LOCATION, recursive=True)

print("Finished!")

***
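
The threaded unzip used for the training images in the download cell above can be tried in isolation. The sketch below is a self-contained illustration of that technique on a throwaway archive; `extract_threaded` is a hypothetical helper name, and the worker count is an arbitrary assumption rather than a tuned value.

```python
import os
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor


def extract_threaded(zip_path, out_dir, suffix=".jpg", max_workers=8):
    """Extract members ending in `suffix` with a thread pool; return the extracted paths."""
    with zipfile.ZipFile(zip_path) as archive:
        # The archive index is read once when the file is opened, so listing is cheap.
        names = [info.filename for info in archive.infolist()
                 if info.filename.endswith(suffix)]
        # Create target directories up front so worker threads never race to make them.
        for folder in {os.path.dirname(name) for name in names}:
            os.makedirs(os.path.join(out_dir, folder), exist_ok=True)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda name: archive.extract(name, out_dir), names))


# Demo: build a small archive, then extract only its .jpg members.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        for i in range(20):
            archive.writestr(f"train/{i:03d}.jpg", b"not a real image")
        archive.writestr("train/readme.txt", b"skipped")
    extracted = extract_threaded(demo_zip, os.path.join(tmp, "out"))
    print(len(extracted))
```

Pre-creating the output directory is the same reason the download cell calls `os.makedirs` on `train2017` before starting the pool.
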
### Training on Studio

Now that we have the data, we can get to training a Mask RCNN model to detect objects in the COCO dataset images. 

Since training on a single GPU can take days, we'll just train for a couple thousand steps, and run a single evaluation to make sure our model is at least starting to learn something. We'll train a full model on a larger cluster of GPUs in a SageMaker training job.

The reason we first want to train in Studio is that we want to dig a bit into the SageMakerCV framework, and talk about the model architecture, since we expect many users will want to modify models for their own use cases.

#### Mask RCNN

First, just a very brief overview of Mask RCNN. If you would like a more in depth examination, we recommend taking a look at the [original paper](https://arxiv.org/abs/1703.06870), the [feature pyramid paper](https://arxiv.org/abs/1612.03144) which describes a popular architectural change we'll use in our model, and blog posts from [viso.ai](https://viso.ai/deep-learning/mask-r-cnn/), [tryo labs](https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/), [Jonathan Hui](https://jonathan-hui.medium.com/image-segmentation-with-mask-r-cnn-ebe6d793272), and [Lilian Weng](https://lilianweng.github.io/lil-log/2017/12/31/object-recognition-for-dummies-part-3.html).

Mask RCNN is a two stage object detection model that locates objects in images by placing bounding boxes around, and segmentation masks over, any object that the model is trained to find. It also provides a classification for each object.

<img src="../assets/traffic.png" style="width: 1200px">

Mask RCNN is called a two stage model because it performs detection in two steps. The first stage identifies any objects in the image, versus background. The second stage determines the specific class of each object, and applies the segmentation mask. Below is an architectural diagram of the model. Let's walk through each step.

<img src="../assets/mask_rcnn_arch.jpeg" style="width: 1200px">
Credit: Jonathan Hui

The `Convolution Network` is often referred to as the model backbone. This is a pretrained image classification model, commonly ResNet, which has been trained on a large image classification dataset, like ImageNet. The classification layer is removed, and instead the backbone outputs a set of convolution feature maps. The idea is, the classification model learned to identify objects in the process of classifying images, and now we can use that information to build a more complex model that can find those objects in the image. We want to pretrain because training the backbone at the same time as training the object detector tends to be very unstable.

One additional component that is sometimes added to the backbone is a `Feature Pyramid Network`. This takes the outputs of the backbone and combines them into a new set of feature maps by performing both up and down convolutions. The backbone produces feature maps of different sizes, which help the model detect objects of different sizes. The feature pyramid strengthens this by allowing the different feature maps to share information with each other.
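
To make the idea of feature maps sharing information concrete, here is a deliberately simplified sketch of a top-down merge: each coarser map is upsampled and added to the next finer one. It uses plain nested lists, leaves out every learned convolution, and covers only the top-down direction, so treat it as an illustration of the data flow rather than SageMakerCV's implementation. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling: repeat every value twice along both axes."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D feature maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_merge(backbone_maps):
    """Merge maps ordered from highest to lowest resolution, working coarse to fine."""
    merged = [backbone_maps[-1]]  # the coarsest map starts the pyramid
    for lateral in reversed(backbone_maps[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]  # back to highest resolution first


c3 = [[1] * 4 for _ in range(4)]    # 4x4 map
c4 = [[10] * 2 for _ in range(2)]   # 2x2 map
c5 = [[100]]                        # 1x1 map
p3, p4, p5 = top_down_merge([c3, c4, c5])
print(p3[0], p4[0], p5[0])
```

After the merge, the finest map carries a contribution from every coarser level, which is the sharing described above.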

The outputs of the feature pyramid are then passed to the `Region Proposal Network` which is responsible for finding regions of the image that might contain an object (this is the first of the two stages). The RPN will output several hundred thousand regions, each with a probability of containing an object. We'll typically take the top few thousand most likely regions. Because these several thousand regions will usually have a lot of overlap, we perform [non-max suppression](https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c), which removes regions with large areas of overlap. This gives us a set of `regions of interest`: regions of the image that we think might contain an object.
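
For readers who want to see the suppression step spelled out, below is a minimal pure-Python sketch of greedy non-max suppression. It is not the batched GPU implementation used in training; the function names, the `(x1, y1, x2, y2)` box layout, and the default threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, top_n=1000):
    """Greedy NMS: visit boxes by descending score, keep one only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) == top_n:
            break
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

The second box overlaps the first almost entirely, so only the first and third indices survive.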

Next, we use those regions to crop out the corresponding sections of the feature maps that came from the feature pyramid network using a technique called [ROI align](https://firiuza.medium.com/roi-pooling-vs-roi-align-65293ab741db).

We pass our cropped feature maps to the `box head` which classifies each region into either a specific object category, or as background. It also refines the position of the bounding box. In Mask RCNN, we also pass the feature maps to a `mask head` which produces a segmentation mask over the object.

#### SageMakerCV Internals

An important feature of Mask RCNN is its multiple heads. One head constructs a bounding box, while another creates a mask. These are referred to as the `ROI heads`. It's common for users to extend this and other two stage models by adding their own ROI heads. For example, a keypoint head is common. Doing so means modifying SageMakerCV's internals, so let's talk about those for a second.

The high level Mask RCNN model can be found in `amazon-sagemaker-cv/pytorch/sagemakercv/detection/detector/generalized_rcnn.py`. If you trace through the forward function, you'll see that the model first passes an image through the backbone (which also contains the feature pyramid), then the RPN in the graphable module. The results are then passed through non-max suppression, and into the ROI heads.

Probably the most important feature to be aware of are the `build` imports at the top. Each section of the model has an associated build function `(build_backbone, build_rpn, build_roi_heads)`. These functions simplify building the model by letting us pass in a single configuration file for building all the different pieces.  

For example, if you open `amazon-sagemaker-cv/pytorch/sagemakercv/detection/roi_heads/roi_heads.py`, you'll find the `build_roi_heads` function at the bottom. To add a new head, you would write a torch module with its own build function, and call the build function from here.

For example, say you want to add a keypoint head to the model. An example keypoint module and associated build function are in `amazon-sagemaker-cv/pytorch/sagemakercv/detection/roi_heads/keypoint_head/keypoint_head.py`. To enable the keypoint head, you would set `cfg.MODEL.KEYPOINT_ON=True` and add the keypoint parameters to your configuration yaml file.
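
The build-function pattern is easier to see in miniature. The sketch below uses a plain dictionary in place of the yacs config and empty placeholder classes in place of real heads; every class, key, and value in it is hypothetical and only mirrors the shape of the idea, not SageMakerCV's actual code.

```python
class BoxHead:
    def __init__(self, num_classes):
        self.num_classes = num_classes


class MaskHead:
    def __init__(self, resolution):
        self.resolution = resolution


class KeypointHead:
    def __init__(self, num_keypoints):
        self.num_keypoints = num_keypoints


def build_keypoint_head(cfg):
    """Each head owns a build function that reads only its own settings."""
    return KeypointHead(cfg["KEYPOINT_HEAD"]["NUM_KEYPOINTS"])


def build_roi_heads(cfg):
    """Assemble only the heads that the configuration switches on."""
    heads = {"box": BoxHead(cfg["NUM_CLASSES"])}
    if cfg.get("MASK_ON"):
        heads["mask"] = MaskHead(cfg["MASK_HEAD"]["RESOLUTION"])
    if cfg.get("KEYPOINT_ON"):
        # A new head is wired in by calling its build function here.
        heads["keypoint"] = build_keypoint_head(cfg)
    return heads


toy_cfg = {
    "NUM_CLASSES": 91,
    "MASK_ON": True,
    "MASK_HEAD": {"RESOLUTION": 28},
    "KEYPOINT_ON": True,
    "KEYPOINT_HEAD": {"NUM_KEYPOINTS": 17},
}
roi_heads = build_roi_heads(toy_cfg)
print(sorted(roi_heads))
```

Switching a flag off in the config removes that head without touching the model code, which is the main benefit of routing construction through build functions.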

SageMakerCV uses similar build functions for the optimizers and schedulers, which you can add to or modify in the `amazon-sagemaker-cv/pytorch/sagemakercv/training/optimizers/` directory.

Finally, data loading tools are located in `amazon-sagemaker-cv/pytorch/data/`. Here you can add a new dataset, sampler, and preprocessing data transformations. Data loaders are constructed in the `build.py` file. Notice the `@DATASETS.register("COCO")` decorator at the top of the COCO `make_coco_dataloader` function. This adds the function to a dictionary of datasets, so that when you specify `COCO` in your configuration file, the `make_data_loader` knows which data loader to create.
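
A registry decorator of this kind takes only a few lines. The following is a minimal sketch of the pattern, assuming a dictionary-backed `Registry` class; the class and the placeholder loader body are illustrative and are not copied from SageMakerCV.

```python
class Registry(dict):
    """A dictionary that registers functions under a string key via a decorator."""

    def register(self, name):
        def decorator(fn):
            if name in self:
                raise KeyError(f"{name!r} is already registered")
            self[name] = fn
            return fn  # return the function unchanged so it stays callable by name
        return decorator


DATASETS = Registry()


@DATASETS.register("COCO")
def make_coco_dataloader(cfg):
    # Placeholder body: a real loader would build and return a dataset pipeline.
    return f"COCO loader with batch size {cfg['batch_size']}"


def make_data_loader(cfg):
    """Look up the builder named in the config and call it."""
    return DATASETS[cfg["dataset"]](cfg)


print(make_data_loader({"dataset": "COCO", "batch_size": 4}))
```

Adding a dataset then means writing one decorated function; nothing in `make_data_loader` has to change.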

#### Setting Up Training

Let's actually use some of these functions to train a model.

Start by importing the default configuration file.

In [8]:
from configs import cfg

***
We use the [yacs](https://github.com/rbgirshick/yacs) format for configuration files. To see the entire config, run `print(cfg.dump())`. That prints a lot, so we'll focus only on the settings we want to change for this model.

***
First, let's put in all the file directories for the data and weights we downloaded in the previous section, as well as an output directory for the model results.

In [None]:
cfg.PATHS.TRAIN_FILE_PATTERN = os.path.join(TF_RECORD_DIR, "train*")
cfg.PATHS.VAL_FILE_PATTERN = os.path.join(TF_RECORD_DIR, "val*")
cfg.PATHS.WEIGHTS = os.path.join(WEIGHTS_DIR, "model.ckpt-112603")
cfg.PATHS.VAL_ANNOTATIONS = os.path.join(COCO_DIR, "annotations", "instances_val2017.json")
cfg.PATHS.OUT_DIR = os.path.join(LOCAL_DATA_DIR, "output")

# create output dir if it doesn't exist
os.makedirs(cfg.PATHS.OUT_DIR, exist_ok=True)

***
This section specifies model details, including the type of model and internal hyperparameters. We won't cover the details of all of these, but more information can be found in the blog posts listed above, as well as the original paper.

In [None]:
cfg.LOG_INTERVAL = 50 # Number of training steps between logging interval
cfg.MODEL.DENSE.PRE_NMS_TOP_N_TRAIN = 2000 # Top regions of interest to select before NMS
cfg.MODEL.DENSE.POST_NMS_TOP_N_TRAIN = 1000 # Top regions of interest to select after NMS
cfg.MODEL.RCNN.ROI_HEAD = "StandardRoIHead"
cfg.MODEL.FRCNN.LOSS_TYPE = "giou"

***
Next we set up the configuration for training, including the optimizer, hyperparameters, batch size, and training length. Batch size is global, so if you set a batch size of 64 across 8 GPUs, it will be a batch size of 8 per GPU. SageMakerCV currently supports the following optimizers: SGD (stochastic gradient descent), Adam, Lamb, and NovoGrad (which speeds up training by allowing increased batch sizes), and the following learning rate schedulers: stepwise and cosine decay. New, custom optimizers and schedulers can be added by modifying the `sagemakercv/training/builder.py` file.
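
The global batch size arithmetic can be checked with a few lines. The helper below is a hypothetical sketch (not a SageMakerCV function) and assumes the global batch must divide evenly across all GPUs.

```python
def per_gpu_batch_size(global_batch_size, nodes, gpus_per_node):
    """Split a global batch evenly across every GPU in the cluster."""
    world_size = nodes * gpus_per_node
    if global_batch_size % world_size:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by {world_size} GPUs")
    return global_batch_size // world_size


# The example from the text: 64 images across 8 GPUs on one instance.
print(per_gpu_batch_size(64, nodes=1, gpus_per_node=8))
# The distributed job later in this notebook: 384 images across 4 nodes with 8 GPUs each.
print(per_gpu_batch_size(384, nodes=4, gpus_per_node=8))
```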

For training on Studio, we'll just run for a couple thousand steps. We'll be using SageMaker training instances for the full training on multiple GPUs.
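
To get a feel for what a warmup followed by cosine decay does to the learning rate, here is a small standalone sketch. It assumes a linear warmup from zero and a decay to zero at the final step; SageMakerCV's own schedule may differ in those details. The default values mirror the `LR`, `WARMUP_STEPS`, and `MAX_ITERS` settings in the next cell, and the function name is hypothetical.

```python
import math


def learning_rate(step, base_lr=0.002, warmup_steps=250, max_iters=2500):
    """Linear warmup to base_lr, then cosine decay to zero at max_iters."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, max_iters - warmup_steps))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 125, 250, 1375, 2500):
    print(step, round(learning_rate(step), 6))
```

The rate climbs to the base value over the warmup, reaches half of it at the midpoint of the decay, and ends at zero.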



In [None]:
cfg.INPUT.TRAIN_BATCH_SIZE = 4 # Training batch size
cfg.INPUT.EVAL_BATCH_SIZE = 4 # Evaluation batch size
cfg.SOLVER.SCHEDULE = "CosineDecay" # Learning rate schedule, either CosineDecay or PiecewiseConstantDecay
cfg.SOLVER.OPTIMIZER = "NovoGrad" # Optimizer type NovoGrad or Momentum
cfg.SOLVER.LR = .002 # Base learning rate after warmup
cfg.SOLVER.BETA_1 = 0.9 # NovoGrad beta 1 value
cfg.SOLVER.BETA_2 = 0.5 # NovoGrad beta 2 value
cfg.SOLVER.MAX_ITERS = 2500 # Total training steps
cfg.SOLVER.WARMUP_STEPS = 250 # Warmup steps
cfg.SOLVER.XLA = True # Train with XLA
cfg.SOLVER.FP16 = True # Train with mixed precision enabled
cfg.SOLVER.TF32 = False # Train with TF32 data type enabled, only available on Ampere GPUs and TF 2.4 and up

In [None]:
cfg.HOOKS=["CheckpointHook",
           "IterTimerHook",
           "TextLoggerHook"]

In [29]:
import yaml
from contextlib import redirect_stdout

In [None]:
local_config_file = "configs/local-config-studio.yaml"
with open(local_config_file, 'w') as outfile:
    outfile.write(cfg.dump())

In [None]:
cfg.merge_from_file(local_config_file)

In [None]:
from sagemakercv.detection import build_detector
from sagemakercv.training import build_optimizer, build_scheduler, build_trainer
from sagemakercv.data import build_dataset
from sagemakercv.utils.dist_utils import get_dist_info, MPI_size, is_sm_dist
from sagemakercv.utils.runner import Runner, build_hooks
import tensorflow as tf

In [None]:
rank, local_rank, size, local_size = get_dist_info()
devices = tf.config.list_physical_devices('GPU')
for device in devices:
    tf.config.experimental.set_memory_growth(device, True)
tf.config.set_visible_devices([devices[local_rank]], 'GPU')
logical_devices = tf.config.list_logical_devices('GPU')
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": cfg.SOLVER.FP16})
tf.config.optimizer.set_jit(cfg.SOLVER.XLA)
if int(tf.__version__.split('.')[1])>=4:
    tf.config.experimental.enable_tensor_float_32_execution(cfg.SOLVER.TF32)

In [None]:
dataset = iter(build_dataset(cfg))

In [None]:
detector = build_detector(cfg)

In [None]:
features, labels = next(dataset)

In [None]:
result = detector(features, training=False)

In [None]:
optimizer = build_optimizer(cfg)

In [None]:
trainer = build_trainer(cfg, detector, optimizer, dist='smd' if is_sm_dist() else 'hvd')

In [None]:
runner = Runner(trainer, cfg)
hooks = build_hooks(cfg)
for hook in hooks:
    runner.register_hook(hook)

In [None]:
runner.run(dataset)

In [None]:
from sagemakercv.utils.visualization import build_image, restore_image
from sagemakercv.data.coco.coco_labels import coco_categories
import matplotlib.pyplot as plt

In [None]:
features, labels = next(dataset)

In [None]:
result = detector(features, training=False)

In [None]:
image_num = 1

In [None]:
image = restore_image(result['images'][image_num], features['image_info'][image_num])
boxes = result['detection_boxes'][image_num]
classes = result['detection_classes'][image_num]
scores = result['detection_scores'][image_num]

In [None]:
detection_image = build_image(image, boxes, scores, classes, coco_categories, threshold=0.8)

In [None]:
plt.figure(figsize = (15, 15))
plt.imshow(detection_image)

In [19]:
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
from datetime import datetime

In [20]:
# Launch the training job in the same region as the S3 bucket (for example, don't train in us-east-1 with a bucket in us-west-2)
os.environ['AWS_DEFAULT_REGION'] = region # This is the region we set at the beginning, when creating the S3 bucket for our data

# this is all for naming
user_id="jbsnyder-smcv-tutorial" # This is used for naming your training job, and organizing your results on S3. It can be anything you like.
date_str=datetime.now().strftime("%d-%m-%Y")
time_str=datetime.now().strftime("%d-%m-%Y-%H-%M-%S")

In [21]:
# specify training type, s3 src and nodes
instance_type="ml.p4d.24xlarge" # This can be any of 'ml.p3dn.24xlarge', 'ml.p4d.24xlarge', 'ml.p3.16xlarge', 'ml.p3.8xlarge', 'ml.p3.2xlarge', 'ml.g4dn.12xlarge'
nodes=4
s3_location=os.path.join("s3://", S3_BUCKET, S3_DIR)
role=get_execution_role() #give Sagemaker permission to launch nodes on our behalf
source_dir='.'
entry_point='train.py'

In [22]:
cfg.LOG_INTERVAL = 50 # Number of training steps between logging interval
cfg.MODEL.DENSE.PRE_NMS_TOP_N_TRAIN = 2000 # Top regions of interest to select before NMS
cfg.MODEL.DENSE.POST_NMS_TOP_N_TRAIN = 1000 # Top regions of interest to select after NMS
cfg.MODEL.RCNN.ROI_HEAD = "StandardRoIHead"
cfg.MODEL.FRCNN.LOSS_TYPE = "giou"

In [23]:
cfg.INPUT.TRAIN_BATCH_SIZE = 384 # Training batch size
cfg.INPUT.EVAL_BATCH_SIZE = 128 # Evaluation batch size
cfg.SOLVER.SCHEDULE = "CosineDecay" # Learning rate schedule, either CosineDecay or PiecewiseConstantDecay
cfg.SOLVER.OPTIMIZER = "NovoGrad" # Optimizer type NovoGrad or Momentum
cfg.SOLVER.LR = .042 # Base learning rate after warmup
cfg.SOLVER.BETA_1 = 0.9 # NovoGrad beta 1 value
cfg.SOLVER.BETA_2 = 0.3 # NovoGrad beta 2 value
cfg.SOLVER.MAX_ITERS = 5000 # Total training steps
cfg.SOLVER.WARMUP_STEPS = 500 # Warmup steps
cfg.SOLVER.XLA = True # Train with XLA
cfg.SOLVER.FP16 = True # Train with mixed precision enabled
cfg.SOLVER.TF32 = True # Train with TF32 data type enabled, only available on Ampere GPUs and TF 2.4 and up

In [24]:
cfg.HOOKS=["CheckpointHook",
           "IterTimerHook",
           "TextLoggerHook",
           "CocoEvaluator"]

In [25]:
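# GPUs per instance type, used to size the MPI launch
gpus_per_instance = {'ml.p3dn.24xlarge': 8, 'ml.p4d.24xlarge': 8, 'ml.p3.16xlarge': 8,
                     'ml.p3.8xlarge': 4, 'ml.p3.2xlarge': 1, 'ml.g4dn.12xlarge': 4}

if nodes>1 and instance_type in ['ml.p3dn.24xlarge', 'ml.p4d.24xlarge', 'ml.p3.16xlarge']:
    # Multi-node jobs on these instance types use SageMaker distributed data parallel
    distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }
else:
    # Assumption: one training process per GPU (processes_per_host was previously undefined)
    processes_per_host = gpus_per_instance[instance_type]
    custom_mpi_options = [
         '-x FI_EFA_USE_DEVICE_RDMA=1',
    ]
    distribution = {
        "mpi": {
            "enabled": True,
            "processes_per_host": processes_per_host,
            "custom_mpi_options": " ".join(custom_mpi_options)
        }
    }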

In [26]:
job_name = f'{user_id}-{time_str}'
output_path = os.path.join(s3_location, "sagemaker-output", date_str, job_name)
code_location = os.path.join(s3_location, "sagemaker-code", date_str, job_name)

In [27]:
channels = {'train2017': os.path.join(s3_location, 'data', 'coco', 'tfrecord', 'train2017'),
            'val2017': os.path.join(s3_location, 'data', 'coco', 'tfrecord', 'val2017'),
            'annotations': os.path.join(s3_location, 'data', 'coco', 'annotations'),
            'weights': os.path.join(s3_location, 'data', 'weights', 'resnet')}

In [28]:
CHANNELS_DIR='/opt/ml/input/data/' # on node
cfg.PATHS.TRAIN_FILE_PATTERN = os.path.join(CHANNELS_DIR, "train2017", "train*")
cfg.PATHS.VAL_FILE_PATTERN = os.path.join(CHANNELS_DIR, "val2017", "val*")
cfg.PATHS.WEIGHTS = os.path.join(CHANNELS_DIR, "weights", "model.ckpt-112603")
cfg.PATHS.VAL_ANNOTATIONS = os.path.join(CHANNELS_DIR, "annotations", "instances_val2017.json")
cfg.PATHS.OUT_DIR = '/opt/ml/checkpoints'

In [30]:
dist_config_file = "configs/dist-training-config.yaml"
with open(dist_config_file, 'w') as outfile:
    outfile.write(cfg.dump())

In [31]:
hyperparameters = {"config": dist_config_file}

In [32]:
estimator = TensorFlow(
                entry_point=entry_point, 
                source_dir=source_dir, 
                py_version='py37',
                framework_version='2.4.1',
                role=role,
                instance_count=nodes,
                instance_type=instance_type,
                distribution=distribution,
                output_path=output_path,
                checkpoint_s3_uri=output_path,
                model_dir=output_path,
                hyperparameters=hyperparameters,
                volume_size=500,
                disable_profiler=True,
                debugger_hook_config=False,
                code_location=code_location,
)

In [33]:
estimator.fit(channels, wait=True, job_name=job_name)

2021-10-27 20:25:23 Starting - Starting the training job...
2021-10-27 20:25:26 Starting - Launching requested ML instances............
2021-10-27 20:27:43 Starting - Preparing the instances for training.................................
2021-10-27 20:33:14 Downloading - Downloading input data.........
2021-10-27 20:34:53 Training - Training image download completed. Training in progress..
2021-10-27 20:34:53.484974: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-10-27 20:34:53.489043: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-10-27 20:34:53.592201: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-10-27 20:34:53.696329: W tensorflow/core/profiler/internal/smprofiler_timeline.cc: