## Setup

Training needs two things: an S3 location and an IAM role.

The S3 bucket and prefix hold the training and model data. The bucket should be in the same region as the notebook instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK creates a default bucket in the same region, following a predefined naming convention.
The IAM role ARN gives SageMaker access to your data. If you run this notebook in SageMaker Studio, you can fetch it with the `get_execution_role` method from the SageMaker Python SDK.

- `profile` = AWS profile
- `role` = predefined role ARN
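
For reference, the default bucket mentioned above follows the pattern `sagemaker-{region}-{account-id}`. The helpers below are a hypothetical, offline illustration of that convention and of how a bucket and prefix combine into an S3 URI; in a notebook you would normally call `sagemaker.Session().default_bucket()` instead, and the account ID used here is a placeholder.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Return the bucket name that follows the SageMaker default naming pattern."""
    return f"sagemaker-{region}-{account_id}"


def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key parts into an s3:// URI without doubled slashes."""
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Placeholder account ID; the bucket and prefix are the ones used in this notebook.
print(default_bucket_name("us-east-2", "123456789012"))
print(s3_uri("st-crayon-dev", "sagemaker/labelbox/", "output"))
```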

In [3]:
%%time
import sagemaker
from sagemaker import get_execution_role
import os
import boto3 
import time
import shutil
import random

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
role='arn:aws:iam::395166463292:role/service-role/AmazonSageMaker-ExecutionRole-20200714T182988'
from PIL import Image

profile = 'crayon-site'
region_name='us-east-2'
bucket = 'st-crayon-dev'
prefix = 'sagemaker/labelbox/'

from botocore.exceptions import ProfileNotFound

# Use the named profile when it exists; otherwise fall back to the default profile.
try:
    boto3.setup_default_session(profile_name=profile)
    session = boto3.session.Session(profile_name=profile, region_name=region_name)
except ProfileNotFound:
    print(f"{profile} profile not found. Using default aws profile.")
    session = boto3.session.Session(region_name=region_name)
sess = sagemaker.Session(session,default_bucket=bucket)
print(sess.boto_session)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


Session(region_name='us-east-2')
CPU times: user 540 ms, sys: 198 ms, total: 738 ms
Wall time: 1.16 s


## Set up the Experiment
Create an experiment to track all the model training iterations. Experiments are a great way to organize your data science work. Think of an experiment as a “folder” for organizing your “files”.

In [3]:
sm = session.client(service_name = 'sagemaker')

experiment_name = 'labelbox-semantic-segmentation-all'

experiments = []

for exp in Experiment.list(sagemaker_boto_client=sm):
    experiments.append(exp.experiment_name)

print(f'List of experiments : {experiments}')

if experiment_name not in experiments:
    experiment = Experiment.create(experiment_name=experiment_name,
                                   description="semantic segmentation of drone pictures",
                                   sagemaker_boto_client=sm)

print(f'Experiment used for notebook = {experiment_name}')

List of experiments : ['site-tech-drone-img-seg-full-res', 'labelbox-semantic-segmentation512', 'site-tech-drone-img-seg']
Experiment used for notebook = labelbox-semantic-segmentation512


### Track Experiment
Create a Trial for each training run to track its inputs, parameters, and metrics.

In [34]:
# Setting up parameters used for the hyperparameters and to name the trial
num_training_samples = 2397 # 2397 and 11294
base_size = 256
crop_size = 224
num_epochs = 70
algorithm = 'fcn'
tiles_type = 'tiles_1024'
base_trial_name = f"I{tiles_type.replace('_','-')}-B{base_size}-C{crop_size}-E{num_epochs}-{algorithm}"
# Add some randomness to each trial name
rand_string = str(random.randint(0, 999)).zfill(4)  # `random` is imported in the setup cell
trial_name = f'ssl-{base_trial_name}-{rand_string}'
# Creating the trial
ss_trial = Trial.create(trial_name = trial_name,
                          experiment_name = experiment_name,
                          sagemaker_boto_client = sm,
                          tags = [{'Key': 'experiment_name', 'Value': 'aws-ss-drone-dataset'}])
ss_trial

Trial(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f82f0e2d100>,trial_name='semantic-segmentation-labelbox-dataset-1024-1597188988',experiment_name='labelbox-semantic-segmentation512',tags=[{'Key': 'experiment_name', 'Value': 'aws-ss-drone-dataset'}],trial_arn='arn:aws:sagemaker:us-east-2:395166463292:experiment-trial/semantic-segmentation-labelbox-dataset-1024-1597188988',response_metadata={'RequestId': '1dbff408-fdf7-4885-b316-fce4bb147e61', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1dbff408-fdf7-4885-b316-fce4bb147e61', 'content-type': 'application/x-amz-json-1.1', 'content-length': '127', 'date': 'Tue, 11 Aug 2020 23:36:28 GMT'}, 'RetryAttempts': 0})

In [None]:
s3_output_location = f's3://{bucket}/{prefix}output'
s3_output_checkpoints = f's3://{bucket}/{prefix}output/checkpoints'
s3_output_location

## Training image
We are using the prebuilt Amazon SageMaker semantic segmentation algorithm, so we need its Docker image. This is static and does not need to be changed.

In [45]:
from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(sess.boto_region_name, 'semantic-segmentation', repo_version="latest")
print(training_image)



825641698319.dkr.ecr.us-east-2.amazonaws.com/semantic-segmentation:latest


## Training (using Pipe Mode)

In File-mode, the training data is downloaded to an encrypted EBS volume before training starts. Once downloaded, the training algorithm simply reads the training data files from disk.

In Pipe-mode, on the other hand, the input data is transferred to the algorithm while it is training. This offers a few significant advantages over File-mode:


* In File-mode, training startup time is proportional to the size of the input data. In Pipe-mode, the startup delay is constant, independent of the size of the input data. This translates to much faster training startup for training jobs with large (GB- to PB-scale) training datasets.
* You do not need to allocate (and pay for) a large disk volume to be able to download the dataset.
* Throughput on IO-bound Pipe-mode algorithms can be multiple times faster than on equivalent File-mode algorithms.

However, these advantages come at a cost: a more complicated programming model than simply reading from files on a disk. This notebook aims to clarify what you need to do to use Pipe-mode in your custom training algorithm.
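
The difference in programming model can be illustrated without SageMaker. The sketch below is an illustration only: an in-memory `io.BytesIO` stands in for the pipe, and `iter_chunks` is a hypothetical helper. It shows the single-pass, sequential reading style that a stream imposes, as opposed to opening a file and seeking freely.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 4) -> Iterator[bytes]:
    """Yield successive chunks from a stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# BytesIO is a stand-in for the pipe; a real pipe cannot be rewound or re-read.
fake_pipe = io.BytesIO(b"0123456789")
chunks = list(iter_chunks(fake_pipe))
print(chunks)
```

Once the loop finishes, the stream is empty, so anything that needs a second pass over the data has to buffer it or receive a fresh stream.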



### Prepare manifest file for data
Manifest files need to be created, but only once. A separate notebook, `semantic_seg_create_manifest.ipynb`, performs this operation.
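
As a rough sketch of what such a manifest contains: an augmented manifest is a JSON Lines file with one JSON object per line, and the channels defined later in this notebook read the `source-ref` and `annotation-ref` attributes from each object. The helper name, bucket, and S3 keys below are placeholders, not the contents of `semantic_seg_create_manifest.ipynb`.

```python
import json


def build_manifest_lines(bucket, image_keys, annotation_keys):
    """Pair each image key with its annotation key and emit one JSON object per line."""
    if len(image_keys) != len(annotation_keys):
        raise ValueError("Each image needs exactly one annotation.")
    lines = []
    for image_key, annotation_key in zip(image_keys, annotation_keys):
        record = {
            "source-ref": f"s3://{bucket}/{image_key}",
            "annotation-ref": f"s3://{bucket}/{annotation_key}",
        }
        lines.append(json.dumps(record))
    return lines


# Placeholder bucket and keys, for illustration only.
lines = build_manifest_lines(
    "example-bucket",
    ["imgs/tile_0001.jpg", "imgs/tile_0002.jpg"],
    ["masks/tile_0001.png", "masks/tile_0002.png"],
)
manifest_text = "\n".join(lines) + "\n"
print(manifest_text)
```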

### Prepare handshake
We need to set up the handshake between our data channels and the algorithm. To do this, we create `sagemaker.session.s3_input` objects from our data channels. In Pipe mode, the data channels are the manifest files, which contain the S3 locations of the images and annotations.

In [47]:
manifest_train = f'data/raw/imgs/{tiles_type}/manifests/manifest_file_train_imgs.json'
manifest_train = f's3://{bucket}/{manifest_train}'
manifest_val = f'data/raw/imgs/{tiles_type}/manifests/manifest_file_val_imgs.json'
manifest_val = f's3://{bucket}/{manifest_val}'

print(f'{manifest_train=}\n{manifest_val=}')

manifest_train='s3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json'
manifest_val='s3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_val_imgs.json'


In [48]:
distribution = 'FullyReplicated'
# Create sagemaker s3_input objects
train_data = sagemaker.session.s3_input(manifest_train, 
                                        distribution=distribution, 
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        attribute_names=['source-ref', 'annotation-ref'],
                                        input_mode='Pipe',
                                        record_wrapping="RecordIO")
validation_data = sagemaker.session.s3_input(manifest_val, 
                                        distribution=distribution, 
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        attribute_names=['source-ref', 'annotation-ref'],
                                        input_mode='Pipe',
                                        record_wrapping="RecordIO")


data_channels = {'train': train_data, 
                 'validation': validation_data}
data_channels



{'train': <sagemaker.inputs.s3_input at 0x7f82f23555e0>,
 'validation': <sagemaker.inputs.s3_input at 0x7f82f2355e20>}

### Train
To begin training, we have to create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job. We name our training job ``ssl-job-<train_params>``. For training, we need to select a GPU instance type.

In [None]:
base_job_name = f'ssl-job-{base_trial_name}-{rand_string}'

ss_model = sagemaker.estimator.Estimator(training_image,
                                         role=role, 
                                         train_instance_count = 1, 
                                         train_instance_type = 'ml.p2.xlarge',
                                         train_volume_size = 10,
                                         train_max_run = 60 * 60 * 12,
                                         train_max_wait= 60 * 60 * 18,
                                         output_path = s3_output_location,
                                         checkpoint_s3_uri = s3_output_checkpoints,
                                         base_job_name = base_job_name,
                                         train_use_spot_instances=True,
                                         input_mode='Pipe',
                                         sagemaker_session = sess,
                                         enable_sagemaker_metrics=True)

The semantic segmentation algorithm at its core has two components.

- An encoder (typically contains a CNN-based backbone)
- A decoder network

The encoder is typically a regular convolutional neural network (CNN). The particular CNN chosen is called the backbone. The backbone can use weights pre-trained on another task or be trained from scratch. The two backbones available for the Amazon SageMaker semantic segmentation algorithm are [ResNet](https://arxiv.org/abs/1512.03385)-50 and ResNet-101. The pre-trained versions of these backbones were trained on the [ImageNet classification task](http://www.image-net.org/).

The decoder is a network that picks up the outputs of one or many layers from the backbone and reconstructs the segmentation mask from these inputs. The Amazon SageMaker Semantic Segmentation algorithm comes with a choice of the [Fully-convolutional network (FCN)](https://arxiv.org/abs/1605.06211), the [Pyramid scene parsing (PSP) network](https://arxiv.org/abs/1612.01105) and [DeepLab v3](https://arxiv.org/abs/1706.05587).

The algorithm also has ample options for hyperparameters that help configure the training job. One of the important steps during experimentation is figuring out the best networks and hyperparameters. Consider the example definition of hyperparameters in the cell below and see the SageMaker Semantic Segmentation [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html) for more details on the hyperparameters.

For instance, one of the hyperparameters here is the number of `epochs`. This defines how many times we pass through all the training examples in the dataset, which affects how long the model is allowed to train. With too few epochs, the model does not receive enough training and won't be powerful enough. With too many, it tends to memorize the input and fail to generalize to unseen examples. There are always tradeoffs for each hyperparameter, and the space needs to be explored to determine optimal values.
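
To make the cost of an epoch concrete, the arithmetic below uses the values set in this notebook (2397 training samples, a mini-batch size of 16, 70 epochs). It assumes the final partial batch counts as a step; the algorithm may handle that last batch differently.

```python
import math

num_training_samples = 2397
mini_batch_size = 16
num_epochs = 70

# One epoch is one full pass over the training set, processed in mini-batches.
iterations_per_epoch = math.ceil(num_training_samples / mini_batch_size)
total_iterations = iterations_per_epoch * num_epochs

print(iterations_per_epoch, total_iterations)
```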

In [52]:
ss_model.set_hyperparameters(backbone='resnet-50', # This is the encoder. Other option is 'resnet-101'.
                             algorithm=algorithm, # This is the decoder. Options are 'fcn', 'psp' and 'deeplab'.
                             use_pretrained_model='True', # Use the pre-trained model.
                             base_size=base_size,
                             crop_size=crop_size, # Size of image random crop.
                             num_classes=8, # Number of classes to segment. This is a mandatory parameter.
                             epochs=num_epochs, # Number of epochs to run.
                             learning_rate=0.001,
                             optimizer='rmsprop', # Other options include 'adam', 'adagrad', 'nag' and 'sgd'.
                             lr_scheduler='poly', # Other options include 'cosine' and 'step'.
                             mini_batch_size=16, # Setup some mini batch size.
                             validation_mini_batch_size=16,
                             early_stopping=True, # Turn on early stopping. If OFF, other early stopping parameters are ignored.
                             early_stopping_patience=5, # Tolerate this many epochs if the mIoU doesn't increase.
                             early_stopping_tolerance=.001,
                             early_stopping_min_epochs=10, # No matter what, run at least this many epochs.
                             num_training_samples=num_training_samples) # This is a mandatory parameter, set to 2397 above.
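
How the three early-stopping settings interact can be sketched as follows. This is a generic patience-based rule written for illustration, assuming validation mIoU is the monitored metric and treating the tolerance as a minimum absolute improvement (a simplification). It is not the algorithm's actual implementation, and `should_stop` and the mIoU values are hypothetical.

```python
def should_stop(miou_history, patience=5, tolerance=0.001, min_epochs=10):
    """Return True when the last `patience` epochs failed to beat the earlier best by `tolerance`."""
    epochs_run = len(miou_history)
    # Never stop before the minimum number of epochs, or before any baseline exists.
    if epochs_run < min_epochs or epochs_run <= patience:
        return False
    best_before = max(miou_history[:-patience])
    best_recent = max(miou_history[-patience:])
    return best_recent - best_before < tolerance


# Hypothetical validation mIoU values, one per epoch.
improving = [0.30, 0.35, 0.40, 0.44, 0.47, 0.50, 0.52, 0.54, 0.55, 0.56, 0.57]
plateau = [0.30, 0.35, 0.40, 0.44, 0.47, 0.50, 0.50, 0.5002, 0.5001, 0.50, 0.5003]

print(should_stop(improving))    # still improving, keep training
print(should_stop(plateau))      # flat for the last 5 epochs, stop
print(should_stop(plateau[:8]))  # fewer than min_epochs, keep training
```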

In [53]:
print(f'{train_data.config}\n{validation_data.config}')

{'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'annotation-ref']}}, 'ContentType': 'application/x-recordio', 'RecordWrapperType': 'RecordIO', 'InputMode': 'Pipe'}
{'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_val_imgs.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'annotation-ref']}}, 'ContentType': 'application/x-recordio', 'RecordWrapperType': 'RecordIO', 'InputMode': 'Pipe'}


In [56]:
ss_model.fit(inputs=data_channels,
             logs=True,
             experiment_config={
                "TrialName": ss_trial.trial_name,
                "TrialComponentDisplayName": "Training",
                },
             wait=True
            )

INFO:sagemaker:Creating training-job with name: ss-labelbox-train-pipe-1597192033-2020-08-12-00-27-42-908


2020-08-12 00:27:43 Starting - Starting the training job...
2020-08-12 00:27:45 Starting - Launching requested ML instances.........
2020-08-12 00:29:20 Starting - Preparing the instances for training......
2020-08-12 00:30:47 Downloading - Downloading input data
2020-08-12 00:30:47 Training - Downloading the training image.........
2020-08-12 00:32:11 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
Running custom environment configuration script
[08/12/2020 00:32:14 INFO 139679679547200] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'syncbn': u'False', u'gamma2': u'0.9', u'gamma1': u'0.9', u'early_stopping_min_epochs': u'5', u'epochs': u'10', u'_workers': u'16', u'lr_scheduler_factor': u'0.1', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0001', u'crop_size': u'240', u'use_p

[08/12/2020 00:33:39 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 20 speed: 13.065683281 samples/sec learning_rate: 0.000994
[08/12/2020 00:34:37 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 40 speed: 13.0313129634 samples/sec learning_rate: 0.000988
[08/12/2020 00:35:36 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 60 speed: 12.5782996947 samples/sec learning_rate: 0.000982
[08/12/2020 00:36:35 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 80 speed: 12.527658761 samples/sec learning_rate: 0.000975
[08/12/2020 00:37:33 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 100 speed: 12.6394085764 samples/sec learning_rate: 0.000969
[08/12/2020 00:38:32 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 120 speed: 13.0143507414 samples/sec learning_rate: 0.000963
[08/12/2020 00:39:31 INFO 139679679547200] #progress_notice. epoch: 0, iterations: 

[08/12/2020 00:56:17 INFO 139679679547200] #progress_notice. epoch: 3, iterations: 20 speed: 12.6570275983 samples/sec learning_rate: 0.000857
[08/12/2020 00:57:15 INFO 139679679547200] #progress_notice. epoch: 3, iterations: 40 speed: 13.6298626389 samples/sec learning_rate: 0.000851
[08/12/2020 00:58:13 INFO 139679679547200] #progress_notice. epoch: 3, iterations: 60 speed: 13.082700599 samples/sec learning_rate: 0.000845
[08/12/2020 00:59:11 INFO 139679679547200] #progress_notice. epoch: 3, iterations: 80 speed: 13.0006983804 samples/sec learning_rate: 0.000839
[08/12/2020 01:00:09 INFO 139679679547200] #progress_notice. epoch: 3, iterations: 100 speed: 12.6132816288 samples/sec learning_rate: 0.000832
[08/12/2020 01:01:09 INFO 139679679547200] #progress_notice. epoch: 3, iterations: 120 speed: 13.0427200527 samples/sec learning_rate: 0.000826
[08/12/2020 01:02:06 INFO 139679679547200] #progress_notice. epoch: 3, iterations:

[08/12/2020 01:18:47 INFO 139679679547200] #progress_notice. epoch: 6, iterations: 20 speed: 12.9176014008 samples/sec learning_rate: 0.000718
[08/12/2020 01:19:46 INFO 139679679547200] #progress_notice. epoch: 6, iterations: 40 speed: 13.0467086297 samples/sec learning_rate: 0.000712
[08/12/2020 01:20:45 INFO 139679679547200] #progress_notice. epoch: 6, iterations: 60 speed: 13.006458375 samples/sec learning_rate: 0.000705
[08/12/2020 01:21:43 INFO 139679679547200] #progress_notice. epoch: 6, iterations: 80 speed: 13.4489854772 samples/sec learning_rate: 0.000699
[08/12/2020 01:22:40 INFO 139679679547200] #progress_notice. epoch: 6, iterations: 100 speed: 12.7613499687 samples/sec learning_rate: 0.000693
[08/12/2020 01:23:39 INFO 139679679547200] #progress_notice. epoch: 6, iterations: 120 speed: 13.0400615114 samples/sec learning_rate: 0.000686
[08/12/2020 01:24:38 INFO 139679679547200] #progress_notice. epoch: 6, iterations:

[08/12/2020 01:41:25 INFO 139679679547200] #progress_notice. epoch: 9, iterations: 20 speed: 13.0270025158 samples/sec learning_rate: 0.000576
[08/12/2020 01:42:23 INFO 139679679547200] #progress_notice. epoch: 9, iterations: 40 speed: 12.6773152124 samples/sec learning_rate: 0.000569
[08/12/2020 01:43:23 INFO 139679679547200] #progress_notice. epoch: 9, iterations: 60 speed: 13.7377858351 samples/sec learning_rate: 0.000563
[08/12/2020 01:44:22 INFO 139679679547200] #progress_notice. epoch: 9, iterations: 80 speed: 13.0201657863 samples/sec learning_rate: 0.000556
[08/12/2020 01:45:22 INFO 139679679547200] #progress_notice. epoch: 9, iterations: 100 speed: 12.6952586918 samples/sec learning_rate: 0.000550
[08/12/2020 01:46:20 INFO 139679679547200] #progress_notice. epoch: 9, iterations: 120 speed: 13.6584963961 samples/sec learning_rate: 0.000543
[08/12/2020 01:47:19 INFO 139679679547200] #progress_notice. epoch: 9, iterations

[08/12/2020 02:04:09 INFO 139679679547200] #progress_notice. epoch: 12, iterations: 20 speed: 12.6359934392 samples/sec learning_rate: 0.000430
[08/12/2020 02:05:08 INFO 139679679547200] #progress_notice. epoch: 12, iterations: 40 speed: 12.9912304322 samples/sec learning_rate: 0.000423
[08/12/2020 02:06:06 INFO 139679679547200] #progress_notice. epoch: 12, iterations: 60 speed: 12.7198555549 samples/sec learning_rate: 0.000416
[08/12/2020 02:07:04 INFO 139679679547200] #progress_notice. epoch: 12, iterations: 80 speed: 13.1951254089 samples/sec learning_rate: 0.000409
[08/12/2020 02:08:02 INFO 139679679547200] #progress_notice. epoch: 12, iterations: 100 speed: 13.0680087355 samples/sec learning_rate: 0.000403
[08/12/2020 02:09:01 INFO 139679679547200] #progress_notice. epoch: 12, iterations: 120 speed: 12.8971106024 samples/sec learning_rate: 0.000396
[08/12/2020 02:09:59 INFO 139679679547200] #progress_notice. epoch: 12, ite

[08/12/2020 02:26:46 INFO 139679679547200] #progress_notice. epoch: 15, iterations: 20 speed: 12.9446810464 samples/sec learning_rate: 0.000277
[08/12/2020 02:27:44 INFO 139679679547200] #progress_notice. epoch: 15, iterations: 40 speed: 13.039372343 samples/sec learning_rate: 0.000270
[08/12/2020 02:28:42 INFO 139679679547200] #progress_notice. epoch: 15, iterations: 60 speed: 12.9186756468 samples/sec learning_rate: 0.000263
[08/12/2020 02:29:40 INFO 139679679547200] #progress_notice. epoch: 15, iterations: 80 speed: 12.9224742403 samples/sec learning_rate: 0.000256
[08/12/2020 02:30:39 INFO 139679679547200] #progress_notice. epoch: 15, iterations: 100 speed: 12.9050693109 samples/sec learning_rate: 0.000249
[08/12/2020 02:31:37 INFO 139679679547200] #progress_notice. epoch: 15, iterations: 120 speed: 13.7233741381 samples/sec learning_rate: 0.000242
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max":