# Amazon SageMaker Experiment Trials for Distributed Training of Mask-RCNN


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

---


This notebook is a step-by-step tutorial on using Amazon SageMaker Experiment Trials for distributed training of [Mask R-CNN](https://arxiv.org/abs/1703.06870) implemented in the [TensorFlow](https://www.tensorflow.org/) framework.

Concretely, we describe the steps for running SageMaker Experiment Trials that train [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) and [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) in [Amazon SageMaker](https://aws.amazon.com/sagemaker/), using [Amazon S3](https://aws.amazon.com/s3/) and [Amazon EFS](https://aws.amazon.com/efs/) as data sources.

The outline of steps is as follows:

1. Stage the COCO 2017 dataset on [Amazon S3](https://aws.amazon.com/s3/).
2. Stage the COCO 2017 dataset on [Amazon EFS](https://aws.amazon.com/efs/), if EFS is attached to the notebook.
3. Build the SageMaker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/).
4. Configure data input channels.
5. Configure hyper-parameters.
6. Define training metrics.
7. Define the training job.
8. Define SageMaker Experiment Trials to start the training jobs.

## Initialize SageMaker Session

First, let us specify the ```s3_bucket``` that we will use throughout the notebook. The ```s3_bucket``` must be located in the same region as this notebook instance. If you do not specify your S3 bucket name in `s3_bucket`, the default SageMaker bucket is used, if it exists. We will also initialize the SageMaker session.

In [None]:
import os
import time
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow.estimator import TensorFlow

s3_bucket  = None # your-s3-bucket-name

role = get_execution_role() # you may provide a pre-existing role ARN here
print(f"SageMaker Execution Role: {role}")

session = boto3.session.Session()
aws_region = session.region_name
print(f"AWS Region: {aws_region}")

sagemaker_session = sagemaker.session.Session(boto_session=session)

if s3_bucket is None:
    s3_bucket = sagemaker_session.default_bucket()
    
print(f"Using S3 bucket: {s3_bucket}")

try:
    s3_client = boto3.client('s3')
    response = s3_client.get_bucket_location(Bucket=s3_bucket)
    bucket_region = response['LocationConstraint']
    bucket_region = 'us-east-1' if bucket_region is None else bucket_region
    
    print(f"Bucket region: {bucket_region}")
except Exception as error:
    # Report the underlying error instead of silently swallowing it
    print(f"Access Error: Check if '{s3_bucket}' S3 bucket is in '{aws_region}' region ({error})")
    
sts = boto3.client("sts")
aws_account_id = sts.get_caller_identity()["Account"]

print(f"Account: {aws_account_id}")
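
The cell above prints the bucket region but does not compare it with the notebook's region, even though the bucket must be in the same region. The following is a minimal sketch of that comparison. The helper names `normalize_bucket_region` and `bucket_matches_region` are hypothetical and not part of this notebook. It relies on the documented behavior that `get_bucket_location` reports `None` for `us-east-1` and may report the legacy value `EU` for `eu-west-1`.

```python
from typing import Optional


def normalize_bucket_region(location_constraint: Optional[str]) -> str:
    """Map an S3 LocationConstraint value to a region name."""
    if location_constraint is None:
        # Buckets in us-east-1 report no location constraint.
        return "us-east-1"
    if location_constraint == "EU":
        # Legacy value that older eu-west-1 buckets may still report.
        return "eu-west-1"
    return location_constraint


def bucket_matches_region(location_constraint: Optional[str], notebook_region: str) -> bool:
    """Return True when the bucket lives in the notebook's region."""
    return normalize_bucket_region(location_constraint) == notebook_region


# Example with literal values; in the notebook you would pass
# response['LocationConstraint'] and aws_region instead.
print(bucket_matches_region(None, "us-east-1"))
print(bucket_matches_region("us-west-2", "us-east-1"))
```

In the notebook, a `False` result would be a reasonable point to stop and choose a different bucket before staging any data.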

## Check for Attached EFS File-system

We check whether an EFS file-system is attached to this notebook. If one is attached, we use it for data input; otherwise, we use Amazon S3.

**Note:**
If you created this notebook instance using the [stack-sm.sh](stack-sm.sh) script in this repository, an EFS file-system is automatically attached to this notebook. 

In [None]:
import re
notebook_attached_efs=!df -kh | grep 'fs-' | sed 's/\(fs-[0-9a-z]*\)\.efs\..*/\1/'

efs_enabled = False
if notebook_attached_efs and re.match(r'fs-[0-9a-z]+', notebook_attached_efs[0]):
    efs_enabled=True
    print(f"SageMaker notebook has attached EFS: {notebook_attached_efs}")
else:
    print("No EFS file-system is attached to this notebook")
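
The shell pipeline above extracts the file-system ID from the `df` output. As an illustration, the same extraction can be sketched in plain Python. The function name `find_efs_id` and the sample `df` lines are hypothetical, and the sketch assumes the EFS mount source has the form `fs-<id>.efs.<region>.amazonaws.com:/`, which is the pattern the `sed` expression above looks for.

```python
import re
from typing import Iterable, Optional

# Same pattern the sed expression targets: an 'fs-...' id followed by '.efs.'
EFS_ID_PATTERN = re.compile(r"(fs-[0-9a-z]+)\.efs\.")


def find_efs_id(df_lines: Iterable[str]) -> Optional[str]:
    """Return the first EFS file-system id found in df output, or None."""
    for line in df_lines:
        match = EFS_ID_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


# Hypothetical df output, for illustration only.
sample_df_output = [
    "Filesystem      Size  Used Avail Use% Mounted on",
    "/dev/xvda1      104G   80G   25G  77% /",
    "fs-0123abcd.efs.us-west-2.amazonaws.com:/  8.0E  0  8.0E  0% /home/ec2-user/efs",
]
print(find_efs_id(sample_df_output))
```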


## Stage COCO 2017 dataset on Amazon S3

We use the [COCO 2017](http://cocodataset.org/#home) dataset. This step downloads the COCO 2017 training and validation datasets to this notebook instance, extracts the files, and uploads the extracted files to your Amazon S3 bucket. The expected time to execute this step is 30 minutes.

In [None]:
%%time

import sys, os, subprocess

key="mask-rcnn/sagemaker/input/train/pretrained-models/ImageNet-R50-AlignPadding.npz"
response = None

try:
    response = s3_client.head_object(Bucket=s3_bucket, Key=key)
except Exception:
    # The object is missing or inaccessible, so the data is staged below
    pass

file_size = response.get('ContentLength', 0) if response else 0

if file_size == 0:
    print(f"Uploading data to s3://{s3_bucket}/mask-rcnn/sagemaker/input/train/")
    print("Estimated time: 30 minutes")
    subprocess.check_call(['./prepare-s3-bucket.sh', s3_bucket], 
                          stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
    print(f"Uploaded data to s3://{s3_bucket}/mask-rcnn/sagemaker/input/train/")
else:
    print(f"Data already available in {s3_bucket} bucket")

## Stage COCO 2017 dataset on Amazon EFS

Next, we stage the [COCO 2017](http://cocodataset.org/#home) dataset on Amazon EFS, if an EFS file-system is attached. The [prepare-efs.sh](prepare-efs.sh) script executes this step. The expected time to execute this step is 30 minutes.

In [None]:
%%time
import sys, os, subprocess

if efs_enabled:
    # Specify relative directory path for input data on the EFS file system.
    file_system_directory_path = "mask-rcnn/sagemaker/input/train"
    print(f"EFS file-system data input path: {file_system_directory_path}")
    train_path = os.path.join(os.getenv('HOME'), 'efs', file_system_directory_path)
    
    if not os.path.exists(train_path):
        print(f"Staging data on efs file-system: {train_path}")
        subprocess.check_call(['./prepare-efs.sh', s3_bucket], 
                              stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
    else:
        print(f"Data already available in efs file-system: {train_path}")

## Specify Model Type

We have a choice of two different models:

1. [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) implementation supports a maximum per-GPU batch size of 1.

2. [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) is an optimized implementation that supports a maximum per-GPU batch size of 4, assuming 32 GB of memory per GPU.

Below, set the `model_type` to `"aws-samples-mask-rcnn"`, or `"tensorpack-mask-rcnn"`.


In [None]:
# Select the model type you want to use
model_type = "tensorpack-mask-rcnn" # "aws-samples-mask-rcnn"

## Build and push SageMaker Training Image to ECR

Next, we build the training image and push it to Amazon ECR, based on the selected model type. This may take several minutes the first time the image is built on this notebook instance. We also set the `training_script` based on the selected model type.

**Note:**
For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to the Amazon ECR service. If you created this notebook instance using the [stack-sm.sh](stack-sm.sh) script, the IAM Role attached to this notebook instance is already set up with full access to the ECR service.

In [None]:
%%time
import sys, os, subprocess

with open("training-image-build.log", "w") as logfile:
    if "tensorpack" in model_type:
        print("Building and pushing Tensorpack Faster-RCNN/Mask-RCNN docker image to ECR")
        subprocess.check_call(['./container-script-mode/build_tools/build_and_push.sh', 
                               aws_region], stdout=logfile, stderr=subprocess.STDOUT)
        
        image_tag = !cat ./container-script-mode/build_tools/set_env.sh \
            | grep 'IMAGE_TAG' | sed 's/.*IMAGE_TAG=\(.*\)/\1/'
        
        image_name="mask-rcnn-tensorpack-sagemaker-script-mode"
        full_name=f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/{image_name}"
        tensorpack_image = f"{full_name}:{image_tag[0]}"
        training_image = tensorpack_image
        training_script= "tensorpack-mask-rcnn.py"

    else:
        print("Building and pushing AWS Samples Mask R-CNN docker image to ECR")
        subprocess.check_call(['./container-optimized-script-mode/build_tools/build_and_push.sh',
                               aws_region], stdout=logfile, stderr=subprocess.STDOUT)
        
        image_tag = !cat ./container-optimized-script-mode/build_tools/set_env.sh \
            | grep 'IMAGE_TAG' | sed 's/.*IMAGE_TAG=\(.*\)/\1/'
        
        image_name="mask-rcnn-tensorflow-sagemaker-script-mode"
        full_name=f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/{image_name}"
        aws_samples_image = f"{full_name}:{image_tag[0]}"
       
        training_image = aws_samples_image
        training_script= "aws-mask-rcnn.py" 

print(f"Training Image: {training_image}")
print(f"Training Script: {training_script}")
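
The cell above reads `IMAGE_TAG` with a `cat | grep | sed` pipeline and then assembles the ECR image URI. The sketch below shows the same two steps in plain Python. The helper names `read_image_tag` and `ecr_image_uri`, the sample script text, the tag value, and the account ID are all hypothetical; the actual contents of `set_env.sh` are not shown in this notebook.

```python
import re
from typing import Optional


def read_image_tag(set_env_text: str) -> Optional[str]:
    """Return the value assigned to IMAGE_TAG in shell-script text, or None."""
    for line in set_env_text.splitlines():
        match = re.match(r"\s*(?:export\s+)?IMAGE_TAG=(.*)", line)
        if match:
            # Drop surrounding whitespace and optional shell quotes.
            return match.group(1).strip().strip("'\"")
    return None


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose an ECR image URI in the same format the notebook uses."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical script contents and account id, for illustration only.
sample_env = "#!/bin/bash\nexport IMAGE_TAG=example-tag-1.0\n"
tag = read_image_tag(sample_env)
uri = ecr_image_uri(
    "123456789012", "us-west-2", "mask-rcnn-tensorpack-sagemaker-script-mode", tag
)
print(uri)
```

Returning `None` when no tag is found makes a missing `IMAGE_TAG` line easy to detect, whereas indexing `image_tag[0]` on an empty result raises an `IndexError`.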


## Define SageMaker Data Channels

We define `train` data channels for Amazon S3 and Amazon EFS. Any given training job uses either the Amazon S3 `train` data channel or the Amazon EFS `train` data channel, not both.

### Define S3 Train Data Channel

We first define S3 `train` data channel below.

In [None]:
from sagemaker.inputs import TrainingInput

prefix = "mask-rcnn/sagemaker"  # prefix in your S3 bucket

s3train = f"s3://{s3_bucket}/{prefix}/input/train"
train_input = TrainingInput(
    s3_data=s3train, distribution="FullyReplicated", s3_data_type="S3Prefix", input_mode="File"
)

s3_data_channels = {"train": train_input}

### Define Amazon EFS Train Data Channel 

Next, we define the *train* data channel using the EFS file-system, if an Amazon EFS file-system is attached to this notebook.

In [None]:
from sagemaker.inputs import FileSystemInput

if efs_enabled:
    # Specify EFS file system id.
    file_system_id = notebook_attached_efs[0]
    print(f"EFS file-system-id: {file_system_id}")

    # Specify the access mode of the mount of the directory associated with the file system.
    # Directory must be mounted  'ro'(read-only).
    file_system_access_mode = "ro"

    # Specify your file system type
    file_system_type = "EFS"

    train = FileSystemInput(
        file_system_id=file_system_id,
        file_system_type=file_system_type,
        directory_path=f"/{file_system_directory_path}",
        file_system_access_mode=file_system_access_mode,
    )

### Define Model Output Location

Next, we define the model output location in the S3 bucket.

In [None]:
prefix = "mask-rcnn/sagemaker"  # prefix in your bucket
s3_output_location = f"s3://{s3_bucket}/{prefix}/output"
print(f"Model output location: {s3_output_location}")

## Define Security Group and Subnets

If an EFS file-system is attached to this notebook, we retrieve the security groups and subnets associated with the EFS file-system mount-targets, and use them in defining the training job.

**Note:**
For this step, the IAM Role attached to this notebook instance needs permission to describe EFS mount targets and mount target security groups. If you created this notebook instance using the [stack-sm.sh](stack-sm.sh) script, the IAM Role attached to this notebook instance is already set up with the required permissions.


In [None]:
import os
import boto3

security_group_ids=None
subnets=None

if efs_enabled:
    file_system_id = notebook_attached_efs[0]
    efs_client = boto3.client("efs")
    response = efs_client.describe_mount_targets(FileSystemId=file_system_id)
    mount_targets = response.get('MountTargets', [])
        
    for mount_target in mount_targets:
        subnet_id = mount_target['SubnetId']
        if subnets is None:
            subnets = [subnet_id]
        else:
            subnets.append(subnet_id)
        
        mt_id = mount_target['MountTargetId']
        response = efs_client.describe_mount_target_security_groups(MountTargetId=mt_id)
        security_groups = response['SecurityGroups']
        if security_group_ids is None:
            security_group_ids = security_groups
        else:  
            security_group_ids.extend(security_groups)
    
    
subnets = list(set(subnets)) if isinstance(subnets, list) else None
security_group_ids = list(set(security_group_ids)) if isinstance(security_group_ids, list) \
                        else None

print(f"Subnets: {subnets}")
print(f"Security groups: {security_group_ids}")
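
The de-duplication logic in the cell above can also be expressed with sets. The sketch below is an illustration that works on plain dictionaries, so it runs without AWS access. The function name `collect_network_config` and the sample IDs are hypothetical. In the notebook, the inputs would come from `describe_mount_targets` and `describe_mount_target_security_groups`.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def collect_network_config(
    mount_targets: Iterable[Dict[str, str]],
    security_groups_by_target: Dict[str, List[str]],
) -> Tuple[Optional[List[str]], Optional[List[str]]]:
    """Return unique subnet ids and security group ids, or None when empty."""
    subnets = set()
    group_ids = set()
    for target in mount_targets:
        subnets.add(target["SubnetId"])
        group_ids.update(security_groups_by_target.get(target["MountTargetId"], []))
    # An empty list becomes None, matching the notebook's convention.
    return (sorted(subnets) or None, sorted(group_ids) or None)


# Hypothetical mount targets that share one security group.
sample_targets = [
    {"MountTargetId": "fsmt-1", "SubnetId": "subnet-a"},
    {"MountTargetId": "fsmt-2", "SubnetId": "subnet-b"},
]
sample_groups = {"fsmt-1": ["sg-1"], "fsmt-2": ["sg-1"]}
print(collect_network_config(sample_targets, sample_groups))
```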


## Configure Hyper-parameters
Next, we define the hyper-parameters. 

Note that some hyper-parameters differ between the two implementations. The batch size per GPU in TensorPack Faster-RCNN/Mask-RCNN is fixed at 1, but is configurable in AWS Samples Mask-RCNN. The learning rate schedule is specified in units of steps in TensorPack Faster-RCNN/Mask-RCNN, but in epochs in AWS Samples Mask-RCNN.

The default learning rate schedule values shown below correspond to training for a total of 24 epochs, at 120,000 images per epoch.
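
To make the relation between steps and epochs concrete, here is a small arithmetic sketch for the default TensorPack schedule `[240000, 320000, 360000]`. It assumes that each step in the schedule corresponds to 8 images. That value is an assumption not stated in this notebook, chosen because it makes the final boundary land on 24 epochs. The function name `steps_to_epochs` is hypothetical.

```python
def steps_to_epochs(steps: int, images_per_epoch: int = 120_000, images_per_step: int = 8) -> float:
    """Convert a step count to epochs.

    images_per_step=8 is an assumption made for this illustration.
    """
    return steps * images_per_step / images_per_epoch


schedule_steps = [240000, 320000, 360000]
schedule_epochs = [round(steps_to_epochs(steps), 2) for steps in schedule_steps]
print(schedule_epochs)
```

Under this assumption the boundaries fall at 16, about 21.33, and 24 epochs, so the last boundary matches the 24-epoch total.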

### TensorPack Faster-RCNN/Mask-RCNN Hyper-parameters

| Hyper-parameter | Description | Default |
|-----------|-------------|---------------|
| backbone_weights | ResNet backbone pre-trained weights file | 'ImageNet-R50-AlignPadding.npz' |
| batch_norm | Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') | 'FreezeBN' |
| config: | Any hyper-parameter prefixed with **config:** is set as a model config parameter | - |
| data_train | Training data | 'coco_train2017' |
| data_val | Validation data | 'coco_val2017' |
| eval_period | Number of epochs period for evaluation during training | 1 |
| images_per_epoch | Images per epoch | 120000 |
| load_model | Pre-trained model to load | - |
| lr_schedule | Learning rate schedule in training steps | '[240000, 320000, 360000]' |
| mode_fpn | Use Feature Pyramid Network (FPN) mode | True |
| mode_mask | Compute masks | True |
| resnet_arch | Must be 'resnet50' or 'resnet101' | 'resnet50' |


### AWS Samples Mask-RCNN Hyper-parameters

| Hyper-parameter | Description | Default |
|-----------|-------------|---------------|
| backbone_weights | ResNet backbone pre-trained weights file | 'ImageNet-R50-AlignPadding.npz' |
| batch_norm | Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') | 'FreezeBN' |
| batch_size_per_gpu | Batch size per gpu, 1 - 6 | 16 GB: 2, 32 GB: 4, > 32 GB : 6|
| config: | Any hyper-parameter prefixed with **config:** is set as a model config parameter | - |
| data_train | Training data | 'train2017' |
| data_val | Validation data | 'val2017' |
| eval_period | Number of epochs period for evaluation during training | 1 |
| lr_schedule | Learning rate schedule in epochs | '[(16, 0.1), (20, 0.01), (24, None)]' |
| images_per_epoch | Images per epoch | 120000 |
| load_model | Pre-trained model to load | - |
| mode_fpn | Use Feature Pyramid Network (FPN) mode. Must be True. | True |
| mode_mask | Compute masks | True |
| resnet_arch | Must be 'resnet50' or 'resnet101' | 'resnet50' |


In [None]:
hyperparameters = {
    "mode_fpn": "True",
    "mode_mask": "True",
    "eval_period": 1,
    "batch_norm": "FreezeBN",
}
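
The tables above mention that any hyper-parameter prefixed with `config:` is set as a model config parameter. The sketch below illustrates how such prefixed entries could be separated from regular hyper-parameters. It is not the parsing code used by the training scripts, which is not shown in this notebook. The function name `split_config_overrides` and the key `config:EXAMPLE.SETTING` are hypothetical.

```python
from typing import Any, Dict, Tuple


def split_config_overrides(
    hyperparameters: Dict[str, Any], prefix: str = "config:"
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split hyper-parameters into regular entries and prefixed config overrides."""
    regular: Dict[str, Any] = {}
    overrides: Dict[str, Any] = {}
    for name, value in hyperparameters.items():
        if name.startswith(prefix):
            # Strip the prefix so only the config key remains.
            overrides[name[len(prefix):]] = value
        else:
            regular[name] = value
    return regular, overrides


# 'config:EXAMPLE.SETTING' is a made-up key for illustration.
example = {"mode_fpn": "True", "eval_period": 1, "config:EXAMPLE.SETTING": 42}
regular, overrides = split_config_overrides(example)
print(regular)
print(overrides)
```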

## Define Training Metrics
Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from training logs and send them to [Amazon CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in the SageMaker console.

In [None]:
metric_definitions = [
    {"Name": "fastrcnn_losses/box_loss", "Regex": ".*fastrcnn_losses/box_loss:\\s*(\\S+).*"},
    {"Name": "fastrcnn_losses/label_loss", "Regex": ".*fastrcnn_losses/label_loss:\\s*(\\S+).*"},
    {
        "Name": "fastrcnn_losses/label_metrics/accuracy",
        "Regex": ".*fastrcnn_losses/label_metrics/accuracy:\\s*(\\S+).*",
    },
    {
        "Name": "fastrcnn_losses/label_metrics/false_negative",
        "Regex": ".*fastrcnn_losses/label_metrics/false_negative:\\s*(\\S+).*",
    },
    {
        "Name": "fastrcnn_losses/label_metrics/fg_accuracy",
        "Regex": ".*fastrcnn_losses/label_metrics/fg_accuracy:\\s*(\\S+).*",
    },
    {
        "Name": "fastrcnn_losses/num_fg_label",
        "Regex": ".*fastrcnn_losses/num_fg_label:\\s*(\\S+).*",
    },
    {"Name": "maskrcnn_loss/accuracy", "Regex": ".*maskrcnn_loss/accuracy:\\s*(\\S+).*"},
    {
        "Name": "maskrcnn_loss/fg_pixel_ratio",
        "Regex": ".*maskrcnn_loss/fg_pixel_ratio:\\s*(\\S+).*",
    },
    {"Name": "maskrcnn_loss/maskrcnn_loss", "Regex": ".*maskrcnn_loss/maskrcnn_loss:\\s*(\\S+).*"},
    {"Name": "maskrcnn_loss/pos_accuracy", "Regex": ".*maskrcnn_loss/pos_accuracy:\\s*(\\S+).*"},
    {"Name": "mAP(bbox)/IoU=0.5", "Regex": ".*mAP\\(bbox\\)/IoU=0\\.5:\\s*(\\S+).*"},
    {"Name": "mAP(bbox)/IoU=0.5:0.95", "Regex": ".*mAP\\(bbox\\)/IoU=0\\.5:0\\.95:\\s*(\\S+).*"},
    {"Name": "mAP(bbox)/IoU=0.75", "Regex": ".*mAP\\(bbox\\)/IoU=0\\.75:\\s*(\\S+).*"},
    {"Name": "mAP(bbox)/large", "Regex": ".*mAP\\(bbox\\)/large:\\s*(\\S+).*"},
    {"Name": "mAP(bbox)/medium", "Regex": ".*mAP\\(bbox\\)/medium:\\s*(\\S+).*"},
    {"Name": "mAP(bbox)/small", "Regex": ".*mAP\\(bbox\\)/small:\\s*(\\S+).*"},
    {"Name": "mAP(segm)/IoU=0.5", "Regex": ".*mAP\\(segm\\)/IoU=0\\.5:\\s*(\\S+).*"},
    {"Name": "mAP(segm)/IoU=0.5:0.95", "Regex": ".*mAP\\(segm\\)/IoU=0\\.5:0\\.95:\\s*(\\S+).*"},
    {"Name": "mAP(segm)/IoU=0.75", "Regex": ".*mAP\\(segm\\)/IoU=0\\.75:\\s*(\\S+).*"},
    {"Name": "mAP(segm)/large", "Regex": ".*mAP\\(segm\\)/large:\\s*(\\S+).*"},
    {"Name": "mAP(segm)/medium", "Regex": ".*mAP\\(segm\\)/medium:\\s*(\\S+).*"},
    {"Name": "mAP(segm)/small", "Regex": ".*mAP\\(segm\\)/small:\\s*(\\S+).*"},
]
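
Before launching a long training job, it can help to check a metric regular expression locally against a log line. The sketch below does this with Python's `re` module for one of the definitions above. The sample log line and the helper name `extract_metric` are hypothetical, and SageMaker's own regex matching may differ in details from Python's.

```python
import re
from typing import Dict, Optional

# One of the metric definitions from the list above.
definition = {
    "Name": "maskrcnn_loss/accuracy",
    "Regex": ".*maskrcnn_loss/accuracy:\\s*(\\S+).*",
}


def extract_metric(metric_definition: Dict[str, str], log_line: str) -> Optional[float]:
    """Return the captured metric value as a float, or None if the line does not match."""
    match = re.search(metric_definition["Regex"], log_line)
    return float(match.group(1)) if match else None


# Made-up log line; the real log format depends on the training script.
sample_line = "[0101 12:00:00 @monitor.py:1] maskrcnn_loss/accuracy: 0.8512"
print(extract_metric(definition, sample_line))
print(extract_metric(definition, "unrelated log line"))
```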

## Define SageMaker Experiment

To define a SageMaker Experiment, we first install the `sagemaker-experiments` package.

In [None]:
! pip install --upgrade pip
! pip install sagemaker-experiments

Next, we import the SageMaker Experiment modules.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
import time

Next, we define a `Tracker` for tracking input data used in the SageMaker Trials in this Experiment. Specify the S3 URL of your dataset in the `value` below and change the name of the dataset if you are using a different dataset.

In [None]:
sm = session.client("sagemaker")
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
    # we can log the s3 uri to the dataset used for training
    tracker.log_input(
        name="coco-2017-dataset",
        media_type="s3/uri",
        value=f"s3://{s3_bucket}/{prefix}/input/train",  # specify S3 URL to your dataset
    )

Next, we create a SageMaker Experiment.

In [None]:
mrcnn_experiment = Experiment.create(
    experiment_name=f"mask-rcnn-experiment-{int(time.time())}",
    description="Mask R-CNN experiment",
    sagemaker_boto_client=sm,
)
print(mrcnn_experiment)

## Define SageMaker Experiment Trials

Next, we define SageMaker experiment trials for the experiment we just defined. For each experiment trial, we use the SageMaker [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html) API to define a SageMaker Training Job that uses SageMaker script mode.

### Download Pre-trained Model Weights for Trial

For one of the trials in our experiment, we use the ResNet-101 pretrained model weights: [ImageNet-R101-AlignPadding.npz](http://models.tensorpack.com/FasterRCNN/ImageNet-R101-AlignPadding.npz), so we download and stage the model weights.

In [None]:
import urllib.request  # the request submodule must be imported explicitly
from  tempfile import NamedTemporaryFile
import sys, os, subprocess

with NamedTemporaryFile(mode="w+b", prefix="ImageNet-R101-AlignPadding", suffix=".npz") as file:
    print("Downloading ImageNet-R101-AlignPadding.npz")
    imagenet_101_url = "http://models.tensorpack.com/FasterRCNN/ImageNet-R101-AlignPadding.npz"
    urllib.request.urlretrieve(imagenet_101_url, file.name)
    
    file.seek(0)
    print("Uploading ImageNet-R101-AlignPadding.npz to S3")
    s3_client.upload_file(file.name, f"{s3_bucket}",
        "mask-rcnn/sagemaker/input/train/pretrained-models/ImageNet-R101-AlignPadding.npz")
    
    file.seek(0)
    if efs_enabled:
        print("Copying ImageNet-R101-AlignPadding.npz to EFS file-system")
        home = os.getenv('HOME')
        dst_path = os.path.join(home, "efs", file_system_directory_path,
                                "pretrained-models/ImageNet-R101-AlignPadding.npz")
        subprocess.check_call(['sudo', 'cp', file.name, dst_path])
    


### Define SageMaker TensorFlow Estimator for Trials

Next, we use SageMaker TensorFlow Estimator API to define a SageMaker Training Job for each SageMaker Trial we need to run within the SageMaker Experiment.

We recommend using 16 GPUs for each training job, so we set ```instance_count=2```. We recommend using 100 GB [Amazon EBS](https://aws.amazon.com/ebs/) storage volume with each training instance, so we set ```volume_size = 100```. 
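
The GPU arithmetic behind this recommendation can be sketched as follows. Both instance types allowed in this notebook provide 8 GPUs per instance, so two instances give 16 GPUs. The helper names `total_gpus` and `global_batch_size` are hypothetical.

```python
# GPUs per instance for the two instance types this notebook allows.
GPUS_PER_INSTANCE = {"ml.p3.16xlarge": 8, "ml.p3dn.24xlarge": 8}


def total_gpus(instance_type: str, instance_count: int) -> int:
    """Total number of GPUs across all training instances."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def global_batch_size(instance_type: str, instance_count: int, batch_size_per_gpu: int) -> int:
    """Number of images processed per training step across all GPUs."""
    return total_gpus(instance_type, instance_count) * batch_size_per_gpu


print(total_gpus("ml.p3.16xlarge", 2))
print(global_batch_size("ml.p3.16xlarge", 2, batch_size_per_gpu=1))
print(global_batch_size("ml.p3dn.24xlarge", 2, batch_size_per_gpu=4))
```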

Next, we will iterate through the Trial parameters and start two trials, one for ResNet architecture `resnet50`, and a second Trial for `resnet101`.

In [None]:
trial_params = [ ('resnet50', 'ImageNet-R50-AlignPadding.npz'), 
                ('resnet101', 'ImageNet-R101-AlignPadding.npz')]

instance_type = 'ml.p3.16xlarge'  # You may optionally use 'ml.p3dn.24xlarge' or larger instance
assert instance_type in ['ml.p3.16xlarge', 'ml.p3dn.24xlarge']

if 'aws-samples' in model_type:
    hyperparameters['batch_size_per_gpu'] = 2 if instance_type == 'ml.p3.16xlarge' else 4

mpi_distribution = None
instance_count = 2 # Between 1 - 4
if instance_count > 1:
    device_min_sys_mem_mb = 2560
    custom_mpi_options = f"--verbose --output-filename /opt/ml/model/logs \
        -x TF_DEVICE_MIN_SYS_MEMORY_IN_MB={device_min_sys_mem_mb}"
    mpi_distribution = {"mpi": { "enabled": True, "custom_mpi_options": custom_mpi_options } }  
    
training_jobs = []
for resnet_arch, backbone_weights in trial_params:
    
    hyperparameters['resnet_arch'] = resnet_arch
    hyperparameters['backbone_weights'] = backbone_weights
    
    trial_name = f"mask-rcnn-script-mode-{resnet_arch}-{int(time.time())}"
    mrcnn_trial = Trial.create(
                        trial_name=trial_name, 
                        experiment_name=mrcnn_experiment.experiment_name,
                        sagemaker_boto_client=sm,
    )
    
    # associate the preprocessing trial component with the current trial
    mrcnn_trial.add_trial_component(tracker.trial_component)
    print(mrcnn_trial)
        
    mask_rcnn_estimator = TensorFlow(image_uri=training_image,
                                role=role, 
                                py_version='py3',
                                instance_count=instance_count, 
                                instance_type=instance_type,
                                distribution=mpi_distribution,
                                entry_point=training_script,
                                volume_size = 100,
                                max_run = 400000,
                                output_path=s3_output_location,
                                sagemaker_session=sagemaker_session, 
                                hyperparameters = hyperparameters,
                                metric_definitions = metric_definitions,
                                subnets=subnets,
                                security_group_ids=security_group_ids)
    
    if efs_enabled:
        # Specify directory path for log output on the EFS file system.
        # You need to provide normalized and absolute path below.
        # For example, '/mask-rcnn/sagemaker/output/log'
        # Log output directory must not exist
        file_system_directory_path = f'/mask-rcnn/sagemaker/output/{mrcnn_trial.trial_name}'
        print(f"EFS log directory:{file_system_directory_path}")

        # Create the log output directory. 
        # EFS file-system is mounted on '$HOME/efs' mount point for this notebook.
        home_dir=os.environ['HOME']
        local_efs_path = os.path.join(home_dir,'efs', file_system_directory_path[1:])
        print(f"Creating log directory on EFS: {local_efs_path}")

        assert not os.path.isdir(local_efs_path)
        ! sudo mkdir -p -m a=rw {local_efs_path}
        assert os.path.isdir(local_efs_path)

        # Specify the access mode of the mount of the directory associated with the file system. 
        # Directory must be mounted 'rw'(read-write).
        file_system_access_mode = 'rw'


        log = FileSystemInput(file_system_id=file_system_id,
                                        file_system_type=file_system_type,
                                        directory_path=file_system_directory_path,
                                        file_system_access_mode=file_system_access_mode)

    data_channels = {'train': train, 'log': log} \
        if (efs_enabled and security_group_ids and subnets) else s3_data_channels

    mask_rcnn_estimator.fit(inputs=data_channels, 
                            job_name=mrcnn_trial.trial_name,
                            logs=True,  
                            experiment_config={"TrialName": mrcnn_trial.trial_name, 
                                           "TrialComponentDisplayName": "Training"},
                            wait=False)

    training_jobs.append(mrcnn_trial.trial_name)
    
    # sleep in between starting two trials
    time.sleep(2)
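
In the cell above, `custom_mpi_options` is built with a backslash continuation inside the f-string, which keeps the indentation of the second line as extra spaces in the resulting string. The sketch below shows one way to assemble the same options from a list, so the result contains single spaces only. The helper names `build_mpi_options` and `build_mpi_distribution` are hypothetical.

```python
from typing import Any, Dict, Optional


def build_mpi_options(device_min_sys_mem_mb: int = 2560, log_dir: str = "/opt/ml/model/logs") -> str:
    """Join the MPI options used in this notebook with single spaces."""
    options = [
        "--verbose",
        "--output-filename",
        log_dir,
        "-x",
        f"TF_DEVICE_MIN_SYS_MEMORY_IN_MB={device_min_sys_mem_mb}",
    ]
    return " ".join(options)


def build_mpi_distribution(instance_count: int) -> Optional[Dict[str, Any]]:
    """Return the distribution dict for multi-instance jobs, else None."""
    if instance_count <= 1:
        return None
    return {"mpi": {"enabled": True, "custom_mpi_options": build_mpi_options()}}


print(build_mpi_options())
print(build_mpi_distribution(2))
```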

### Check Training Jobs are Completed

Next, we check that the training jobs have completed. We cannot analyze the experiment trials until the training jobs have completed.

In [None]:
import boto3
import time

client = boto3.client('sagemaker')
status = 'InProgress'

for training_job in training_jobs:
    response = client.describe_training_job(TrainingJobName=training_job)
    print(response)
    status = response['TrainingJobStatus']
    if status != 'Completed':
        break
        
if status != 'Completed':
    print(f"Training jobs {training_jobs} have not yet completed")
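
The cell above checks the job status once. If you prefer to wait until every job reaches a terminal status, a polling loop along the following lines is one option. This is a sketch: the function name `wait_for_training_jobs` and its parameters are hypothetical, and the `describe` argument is any callable that returns a dictionary with a `TrainingJobStatus` key, so the example can run here with a stub instead of a real client.

```python
import time
from typing import Callable, Dict, Iterable

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_jobs(
    describe: Callable[[str], Dict[str, str]],
    job_names: Iterable[str],
    poll_seconds: float = 60.0,
    max_polls: int = 10,
) -> Dict[str, str]:
    """Poll each job until it is terminal or max_polls is reached; return last statuses."""
    statuses: Dict[str, str] = {}
    pending = list(job_names)
    for _ in range(max_polls):
        still_pending = []
        for name in pending:
            status = describe(name)["TrainingJobStatus"]
            statuses[name] = status
            if status not in TERMINAL_STATUSES:
                still_pending.append(name)
        pending = still_pending
        if not pending:
            break
        time.sleep(poll_seconds)
    return statuses


# Stub that reports 'InProgress' on the first call and 'Completed' afterwards.
calls = {"count": 0}


def fake_describe(job_name: str) -> Dict[str, str]:
    calls["count"] += 1
    return {"TrainingJobStatus": "InProgress" if calls["count"] < 2 else "Completed"}


result = wait_for_training_jobs(fake_describe, ["example-job"], poll_seconds=0, max_polls=5)
print(result)
```

In the notebook, `describe` could be `lambda name: client.describe_training_job(TrainingJobName=name)` and `job_names` could be `training_jobs`.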

### Define Search Expression for Training Metrics

Below, we define a search expression that selects the `Training` trial components whose maximum `maskrcnn_loss/accuracy` metric is less than 1.

In [None]:
search_expression = {
    "Filters": [
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        },
        {
            "Name": "metrics.maskrcnn_loss/accuracy.max",
            "Operator": "LessThan",
            "Value": "1",
        },
    ],
}

### Analyze Trial Data

Below, we analyze the experiment trial data once the training jobs have completed.


In [None]:
from sagemaker.analytics import ExperimentAnalytics


if status == 'Completed':
    trial_component_analytics = ExperimentAnalytics(
        sagemaker_session=sagemaker_session,
        experiment_name=mrcnn_experiment.experiment_name,
        search_expression=search_expression,
        sort_by="metrics.maskrcnn_loss/accuracy.max",
        sort_order="Descending",
        parameter_names=["resnet_arch"],
    )
    

In [None]:
if status == 'Completed':
    analytic_table = trial_component_analytics.dataframe()
    for col in analytic_table.columns:
        print(col)

In [None]:
if status == 'Completed':
    bbox_map = analytic_table[
        ["resnet_arch", "mAP(bbox)/small - Max", "mAP(bbox)/medium - Max", "mAP(bbox)/large - Max"]
    ]
    display(bbox_map)  # a bare expression inside an if block is not rendered by Jupyter

In [None]:
if status == 'Completed':
    segm_map = analytic_table[
        ["resnet_arch", "mAP(segm)/small - Max", "mAP(segm)/medium - Max", "mAP(segm)/large - Max"]
    ]
    display(segm_map)  # a bare expression inside an if block is not rendered by Jupyter

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)
