## Setup

First, specify the S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK creates a default bucket in the same region, following a predefined naming convention.
Second, specify the IAM role ARN that gives SageMaker access to your data. If you run this notebook in SageMaker Studio, you can fetch it with the `get_execution_role` method from the SageMaker Python SDK.

- `profile` = name of the AWS profile to use
- `role` = ARN of a predefined role
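The choice between a predefined role ARN and `get_execution_role` can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the helper name `resolve_role`, the ARN pattern, and the placeholder account id and role name are assumptions made for illustration.

```python
import re

# IAM role ARNs look like arn:aws:iam::<12-digit account id>:role/<path/name>.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def resolve_role(explicit_arn=None):
    """Return explicit_arn if given, otherwise ask the SageMaker SDK for the role."""
    if explicit_arn is None:
        # Imported lazily so the helper also works where the SDK is not installed.
        from sagemaker import get_execution_role
        explicit_arn = get_execution_role()
    if not ROLE_ARN_PATTERN.match(explicit_arn):
        raise ValueError(f"not an IAM role ARN: {explicit_arn!r}")
    return explicit_arn


# Placeholder account id and role name, for illustration only.
print(resolve_role("arn:aws:iam::123456789012:role/service-role/ExampleRole"))
```

Outside SageMaker Studio or a notebook instance, passing the ARN explicitly is the safer path, since `get_execution_role` relies on the identity the notebook runs under.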

In [32]:
%%time
import sagemaker
from sagemaker import get_execution_role
import os
import boto3 
import time
import shutil
import random

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
role='arn:aws:iam::395166463292:role/service-role/AmazonSageMaker-ExecutionRole-20200714T182988'
from PIL import Image

profile = 'sites'
region_name='us-east-2'
bucket = 'st-crayon-dev'
prefix = 'sagemaker/labelbox/'

session = boto3.session.Session(profile_name = profile, region_name = region_name)
sess = sagemaker.Session(session,default_bucket=bucket)
print(sess.boto_session)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
Session(region_name='us-east-2')
CPU times: user 469 ms, sys: 578 ms, total: 1.05 s
Wall time: 3.31 s


## Set up the Experiment
Create an experiment to track all the model training iterations. Experiments are a great way to organize your data science work. Think of an experiment as a "folder" for organizing your "files".

In [36]:
sm = session.client(service_name = 'sagemaker')

experiment_name = 'site-tech-drone-img-seg'

experiments = []

for exp in Experiment.list(sagemaker_boto_client=sm):
    experiments.append(exp.experiment_name)

print(f'List of experiments : {experiments}')

# Create the experiment only if it does not exist yet
if experiment_name not in experiments:
    experiment = Experiment.create(experiment_name=experiment_name,
                                   description="semantic segmentation of drone pictures",
                                   sagemaker_boto_client=sm)
print(f'Experiment used for notebook = {experiment_name}')

List of experiments : ['labelbox-semantic-segmentation512', 'site-tech-drone-img-seg']
Experiment used for notebook = site-tech-drone-img-seg
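The "create only if missing" logic can be looked at in isolation from the SDK. The sketch below is an illustration under assumptions: `get_or_create` is a hypothetical helper, and a plain Python list stands in for the experiment store instead of `Experiment.list` and `Experiment.create`.

```python
def get_or_create(name, list_names, create):
    """Return (name, created); call create(name) only when the name is missing."""
    existing = set(list_names())
    if name in existing:
        return name, False
    create(name)
    return name, True


# In-memory stand-in for the experiment store, for illustration only.
store = ['labelbox-semantic-segmentation512']
print(get_or_create('site-tech-drone-img-seg', lambda: store, store.append))
print(get_or_create('site-tech-drone-img-seg', lambda: store, store.append))
```

The second call finds the name and skips creation, so rerunning the cell does not try to create a duplicate. The name is always taken from the variable, never from a position in the listing, because the listing order is not guaranteed.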


### Track Experiment
Create a Trial for each training run to track its inputs, parameters, and metrics.

In [37]:
trial_name = f'semantic-segmentation-labelbox-dataset-1024-{int(time.time())}'
ss_trial = Trial.create(trial_name = trial_name,
                          experiment_name = experiment_name,
                          sagemaker_boto_client = sm,
                          tags = [{'Key': 'experiment_name', 'Value': 'aws-ss-drone-dataset'}])
ss_trial

Trial(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f22af756a90>,trial_name='semantic-segmentation-labelbox-dataset-1024-1596790424',experiment_name='site-tech-drone-img-seg',tags=[{'Key': 'experiment_name', 'Value': 'aws-ss-drone-dataset'}],trial_arn='arn:aws:sagemaker:us-east-2:395166463292:experiment-trial/semantic-segmentation-labelbox-dataset-1024-1596790424',response_metadata={'RequestId': 'cab39a80-bbef-427f-99b8-ff6b6b62b296', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'cab39a80-bbef-427f-99b8-ff6b6b62b296', 'content-type': 'application/x-amz-json-1.1', 'content-length': '127', 'date': 'Fri, 07 Aug 2020 08:53:44 GMT'}, 'RetryAttempts': 0})
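Trial names have to be unique, which is why a timestamp is appended to the base name. The sketch below builds such a name and checks it before any API call is made. The helper `make_trial_name` is hypothetical, and the name pattern (1 to 120 alphanumeric characters, hyphens only between them) is an assumption based on the CreateTrial API reference; verify it against the current documentation.

```python
import re
import time

# Assumed constraint: 1-120 characters, alphanumeric, hyphens only in between.
TRIAL_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}$")


def make_trial_name(base, now=None):
    """Append a Unix timestamp to base and validate the resulting trial name."""
    stamp = int(time.time() if now is None else now)
    name = f"{base}-{stamp}"
    if not TRIAL_NAME_PATTERN.match(name):
        raise ValueError(f"invalid trial name: {name!r}")
    return name


print(make_trial_name('semantic-segmentation-labelbox-dataset-1024'))
```

Passing `now` explicitly makes the helper deterministic, which is convenient when a run has to be reproduced or tested.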

In [45]:
s3_output_location = f's3://{bucket}/{prefix}output'
s3_output_location

's3://st-crayon-dev/sagemaker/labelbox/output'
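The output location is built by string formatting and relies on `prefix` ending with a slash. A small join helper removes that dependency. This is a sketch only; `s3_uri` is a hypothetical name and not part of the SageMaker SDK.

```python
def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an S3 URI with single slashes between parts."""
    key = '/'.join(p.strip('/') for p in parts if p.strip('/'))
    return f's3://{bucket}/{key}' if key else f's3://{bucket}'


print(s3_uri('st-crayon-dev', 'sagemaker/labelbox/', 'output'))
# prints s3://st-crayon-dev/sagemaker/labelbox/output
```

The same helper gives identical results whether or not the prefix carries a leading or trailing slash.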

## Training image
Since we are using the built-in AWS semantic segmentation algorithm, we need the Amazon SageMaker Semantic Segmentation Docker image. This image is static and does not need to be changed.

In [38]:
from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(sess.boto_region_name, 'semantic-segmentation', repo_version="latest")
print(training_image)

825641698319.dkr.ecr.us-east-2.amazonaws.com/semantic-segmentation:latest


## Training File Mode (we plan to deprecate File mode; go to the section Training Pipe Mode)
To begin training, we have to create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job. We use ``ss-labelbox-train`` as the base name of our training job. Training requires a GPU instance type.

In [252]:
base_job_name = f'ss-labelbox-train-{int(time.time())}'

ss_model = sagemaker.estimator.Estimator(training_image,
                                         role=role, 
                                         train_instance_count = 1, 
                                         train_instance_type = 'ml.p2.8xlarge',
                                         train_volume_size = 10,
                                         train_max_run = 7200,
                                         train_max_wait=7200,
                                         output_path = s3_output_location,
                                         base_job_name = base_job_name,
                                         train_use_spot_instances=True,
                                         input_mode='File',
                                         sagemaker_session = sess,
                                         enable_sagemaker_metrics=True)
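With managed spot training, the maximum wait time covers both the time spent waiting for spot capacity and the training run itself, so it must not be smaller than the maximum run time. The sketch below checks this before an estimator is created. `check_spot_limits` is a hypothetical helper, not an SDK function.

```python
def check_spot_limits(max_run, max_wait=None, use_spot=False):
    """Validate the time limits and return the seconds left for waiting on capacity."""
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    if not use_spot:
        return 0
    if max_wait is None:
        raise ValueError("max_wait is required when spot instances are used")
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return max_wait - max_run


# Values from the estimator above: both limits are 7200 seconds.
print(check_spot_limits(max_run=7200, max_wait=7200, use_spot=True))
# prints 0
```

A result of 0 means the job has no time budget left for waiting if training uses its full run time, so a larger `train_max_wait` may be worth considering.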



In [24]:
ss_model.set_hyperparameters(backbone='resnet-50', # This is the encoder. Other option is 'resnet-101'.
                             algorithm='deeplab', # This is the decoder. Other options are 'fcn' and 'psp'.
                             use_pretrained_model='True', # Use the pre-trained model.
                             crop_size=512, # Size of image random crop.
                             base_size=1024,
                             num_classes=8, # Number of classes in this dataset. This is a mandatory parameter.
                             epochs=20, # Number of epochs to run.
                             learning_rate=0.0001,
                             optimizer='rmsprop', # Other options include 'adam', 'adagrad', 'nag' and 'sgd'.
                             lr_scheduler='poly', # Other options include 'cosine' and 'step'.
                             mini_batch_size=32, # Training mini-batch size.
                             validation_mini_batch_size=32,
                             early_stopping=True, # Turn on early stopping. If OFF, other early stopping parameters are ignored.
                             early_stopping_patience=5, # Tolerate this many epochs if the mIoU doesn't increase.
                             early_stopping_tolerance=.001,
                             early_stopping_min_epochs=2, # No matter what, run at least this many epochs.
                             num_training_samples=1190) # This is a mandatory parameter: the number of training images.
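A typo in a hyperparameter value only surfaces after the training instance has started, so a local check can save a failed job. The sketch below is a hypothetical helper, not part of the SDK. The allowed values are taken from the semantic segmentation hyperparameter documentation and should be verified against the current version.

```python
import math

# Allowed values as documented for the built-in algorithm (verify before relying on them).
ALLOWED = {
    'backbone': {'resnet-50', 'resnet-101'},
    'algorithm': {'fcn', 'psp', 'deeplab'},
    'optimizer': {'adam', 'adagrad', 'nag', 'rmsprop', 'sgd'},
    'lr_scheduler': {'poly', 'step', 'cosine'},
}
REQUIRED = ('num_classes', 'num_training_samples')


def check_hyperparameters(params):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = [f"missing required parameter: {key}" for key in REQUIRED if key not in params]
    for key, allowed in ALLOWED.items():
        if key in params and params[key] not in allowed:
            problems.append(f"{key}={params[key]!r} is not one of {sorted(allowed)}")
    return problems


def batches_per_epoch(num_training_samples, mini_batch_size):
    """Upper bound on mini-batches per epoch (the last batch may be partial)."""
    return math.ceil(num_training_samples / mini_batch_size)


params = {'backbone': 'resnet-50', 'algorithm': 'deeplab', 'optimizer': 'rmsprop',
          'lr_scheduler': 'poly', 'num_classes': 8, 'num_training_samples': 1190}
print(check_hyperparameters(params))
print(batches_per_epoch(1190, 32))
```

With 1190 samples and a mini-batch size of 32, one epoch is at most 38 mini-batches.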

In [26]:
train_channel = prefix + 'first-batch/train_tiles1024'
validation_channel = prefix + 'first-batch/valid_tiles1024'
train_annotation_channel = prefix + 'first-batch/train_annotation_tiles1024'
validation_annotation_channel = prefix + 'first-batch/valid_annotation_tiles1024'

s3_train_data = f's3://{bucket}/{train_channel}/'
s3_validation_data = f's3://{bucket}/{validation_channel}/'
s3_train_annotation = f's3://{bucket}/{train_annotation_channel}/'
s3_validation_annotation = f's3://{bucket}/{validation_annotation_channel}/'

In [27]:
print(f'{s3_train_data=}\n{s3_train_annotation=}\n{s3_validation_data=}\n{s3_validation_annotation=}')

s3_train_data='s3://st-crayon-dev/sagemaker/labelbox/first-batch/train_tiles1024/'
s3_train_annotation='s3://st-crayon-dev/sagemaker/labelbox/first-batch/train_annotation_tiles1024/'
s3_validation_data='s3://st-crayon-dev/sagemaker/labelbox/first-batch/valid_tiles1024/'
s3_validation_annotation='s3://st-crayon-dev/sagemaker/labelbox/first-batch/valid_annotation_tiles1024/'


In [28]:
distribution = 'FullyReplicated'
# Create sagemaker s3_input objects
train_data = sagemaker.session.s3_input(s3_train_data, distribution=distribution, 
                                        content_type='image/jpeg', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution=distribution, 
                                        content_type='image/jpeg', s3_data_type='S3Prefix')
train_annotation = sagemaker.session.s3_input(s3_train_annotation, distribution=distribution, 
                                        content_type='image/png', s3_data_type='S3Prefix')
validation_annotation = sagemaker.session.s3_input(s3_validation_annotation, distribution=distribution, 
                                        content_type='image/png', s3_data_type='S3Prefix')

data_channels = {'train': train_data, 
                 'validation': validation_data,
                 'train_annotation': train_annotation, 
                 'validation_annotation':validation_annotation}
data_channels



{'train': <sagemaker.inputs.s3_input at 0x7f1af8ef57f0>,
 'validation': <sagemaker.inputs.s3_input at 0x7f1af8ef58e0>,
 'train_annotation': <sagemaker.inputs.s3_input at 0x7f1af8ef5580>,
 'validation_annotation': <sagemaker.inputs.s3_input at 0x7f1af9c5cc40>}
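In File mode the algorithm pairs each JPEG image with the PNG annotation of the same file name, so an image without a matching mask (or the reverse) is worth catching before training starts. The sketch below compares two key listings. `unmatched_stems` and the example keys are hypothetical; in practice the listings would come from S3.

```python
from pathlib import PurePosixPath


def unmatched_stems(image_keys, annotation_keys):
    """Return (images without annotation, annotations without image) by file stem."""
    images = {PurePosixPath(key).stem for key in image_keys}
    annotations = {PurePosixPath(key).stem for key in annotation_keys}
    return sorted(images - annotations), sorted(annotations - images)


# Hypothetical keys, for illustration only.
images = ['train_tiles1024/tile_001.jpg', 'train_tiles1024/tile_002.jpg']
masks = ['train_annotation_tiles1024/tile_001.png']
print(unmatched_stems(images, masks))
# prints (['tile_002'], [])
```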

In [29]:
ss_model.fit(inputs=data_channels,
             logs=True,
             experiment_config={
                "TrialName": ss_trial.trial_name,
                "TrialComponentDisplayName": "Training",
                },
             wait=True
            )

put_metric. host: algo-1, epoch: 9, validation throughput: 57.9324090311 samples/sec.
[08/02/2020 20:10:54 INFO 140367851128640] Serializing model to /opt/ml/model/model_best.params
[08/02/2020 20:10:54 INFO 140367851128640] #progress_metric: host=algo-1, completed 50 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 10, "sum": 10.0, "min": 10}}, "EndTime": 1596399054.551507, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/Sem

## Training Pipe Mode

In File mode, the training data is downloaded to an encrypted EBS volume before training starts. Once the download has finished, the training algorithm simply reads the downloaded files.

In Pipe mode, on the other hand, the input data is streamed to the algorithm while it is training. This has a few significant advantages over File mode:


* In File mode, training startup time is proportional to the size of the input data. In Pipe mode, the startup delay is constant and independent of the size of the input data. This translates to much faster startup for training jobs with large (GB to PB scale) datasets.
* You do not need to allocate (and pay for) a large disk volume to be able to download the dataset.
* Throughput of IO-bound algorithms in Pipe mode can be several times higher than that of the equivalent algorithms in File mode.

However, these advantages come at a cost: a more complicated programming model than simply reading from files on a disk. The rest of this section shows what is needed to use Pipe mode with the semantic segmentation algorithm in this notebook.
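To give a feel for the streaming programming model, the sketch below reads length-prefixed records from a forward-only stream, one at a time, without ever seeking. It assumes the MXNet RecordIO framing (a magic number, a length in the lower 29 bits of the second header word, zero padding to 4 bytes) and ignores continuation flags for multi-part records. An in-memory buffer stands in for the pipe, and the function names are hypothetical. With the built-in algorithm this parsing happens inside the training container, so the code is for illustration only.

```python
import io
import struct

MAGIC = 0xCED7230A           # magic number that starts every RecordIO record
LENGTH_MASK = (1 << 29) - 1  # lower 29 bits of the second header word hold the length


def write_record(stream, payload):
    """Append one complete record: magic, length, payload, zero padding to 4 bytes."""
    stream.write(struct.pack('<II', MAGIC, len(payload)))
    stream.write(payload)
    stream.write(b'\x00' * (-len(payload) % 4))


def read_records(stream):
    """Yield payloads one by one from a forward-only stream."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        magic, word = struct.unpack('<II', header)
        if magic != MAGIC:
            raise ValueError('stream is not positioned at a RecordIO record')
        length = word & LENGTH_MASK
        payload = stream.read(length)
        stream.read(-length % 4)  # skip the padding bytes
        yield payload


buffer = io.BytesIO()
for item in (b'image-bytes', b'mask'):
    write_record(buffer, item)
buffer.seek(0)  # rewind once so the reader starts at the first record
print(list(read_records(buffer)))
```

The reader never needs the total size of the data, which is what allows training to start before the whole dataset has been transferred.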



### Prepare manifest file for data
Let's look at the Python script that creates the manifest files for training in Pipe mode.

In [40]:
%load ../../../../src/site_tools/site_tools/site_data_manifestfiles.py

In [80]:
%run ../../../../src/site_tools/site_tools/site_data_manifestfiles.py

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
train_imgs 2361
data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json
{'ResponseMetadata': {'RequestId': 'DQ9S6PCZ5W0QENEW', 'HostId': 'df3JuU8Vw4BVnTafs4sTM4Bccy6PSZCgIVyiWbMfXp3ZvXT/OKdUHvbo618RbkTFND9tvfhPGjs=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'df3JuU8Vw4BVnTafs4sTM4Bccy6PSZCgIVyiWbMfXp3ZvXT/OKdUHvbo618RbkTFND9tvfhPGjs=', 'x-amz-request-id': 'DQ9S6PCZ5W0QENEW', 'date': 'Fri, 07 Aug 2020 10:43:57 GMT', 'etag': '"eedd93b4e6c878e557452d6ed700815c"', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'ETag': '"eedd93b4e6c878e557452d6ed700815c"'}
test_imgs 292
data/raw/imgs/tiles_1024/manifests/manifest_file_test_imgs.json
{'ResponseMetadata': {'RequestId': 'DFF37CA47DFF23C2', 'HostId': 'Ec+UAoEYsHmIughfufPtLLLRpE156TVMXCqQTlGcarkQSWyODbWXn3Z3qtVi2z7MSV1wHEWhfnQ
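An augmented manifest is a JSON Lines file: each line is one JSON object that points to an image and its annotation. The sketch below shows how such lines can be generated. It is not the `site_data_manifestfiles.py` script; the function name and the example keys are hypothetical. Only the attribute names `source-ref` and `annotation-ref` are taken from this notebook.

```python
import json
from pathlib import PurePosixPath


def manifest_lines(bucket, image_keys, annotation_prefix):
    """Yield one JSON line per image, paired with the PNG mask of the same file stem."""
    prefix = annotation_prefix.strip('/')
    for key in image_keys:
        stem = PurePosixPath(key).stem
        yield json.dumps({
            'source-ref': f's3://{bucket}/{key}',
            'annotation-ref': f's3://{bucket}/{prefix}/{stem}.png',
        })


# Hypothetical bucket and keys, for illustration only.
for line in manifest_lines('example-bucket', ['imgs/tile_001.jpg'], 'masks/'):
    print(line)
```

Writing the joined lines to a file and uploading it to S3 gives a manifest that can be referenced with `s3_data_type='AugmentedManifestFile'`.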

### Prepare handshake
Next, we set up the handshake between our data channels and the algorithm. To do this, we create `sagemaker.session.s3_input` objects from our data channels. In Pipe mode, the data channels are the manifest files, which contain the `s3` locations of the images and annotations.

In [81]:
manifest_train = 'data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json'
manifest_train = f's3://{bucket}/{manifest_train}'
manifest_val = 'data/raw/imgs/tiles_1024/manifests/manifest_file_val_imgs.json'
manifest_val = f's3://{bucket}/{manifest_val}'

print(f'{manifest_train=}\n{manifest_val=}')

manifest_train='s3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json'
manifest_val='s3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_val_imgs.json'


In [82]:
distribution = 'FullyReplicated'
# Create sagemaker s3_input objects
train_data = sagemaker.session.s3_input(manifest_train, 
                                        distribution=distribution, 
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        attribute_names=['source-ref', 'annotation-ref'],
                                        input_mode='Pipe',
                                        record_wrapping="RecordIO")
validation_data = sagemaker.session.s3_input(manifest_val, 
                                        distribution=distribution, 
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        attribute_names=['source-ref', 'annotation-ref'],
                                        input_mode='Pipe',
                                        record_wrapping="RecordIO")


data_channels = {'train': train_data, 
                 'validation': validation_data}
data_channels



{'train': <sagemaker.inputs.s3_input at 0x7f22ae64a400>,
 'validation': <sagemaker.inputs.s3_input at 0x7f22af7f1430>}
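Each `s3_input` object wraps a plain channel configuration dictionary, available through its `config` attribute. The sketch below builds the same structure by hand, which can help when comparing channels or debugging a rejected training request. `pipe_channel_config` is a hypothetical helper, and the layout mirrors the `config` output printed in this notebook.

```python
def pipe_channel_config(manifest_uri, attribute_names=('source-ref', 'annotation-ref')):
    """Build the channel configuration for an augmented manifest read in Pipe mode."""
    return {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'AugmentedManifestFile',
                'S3Uri': manifest_uri,
                'S3DataDistributionType': 'FullyReplicated',
                'AttributeNames': list(attribute_names),
            }
        },
        'ContentType': 'application/x-recordio',
        'RecordWrapperType': 'RecordIO',
        'InputMode': 'Pipe',
    }


config = pipe_channel_config(
    's3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json')
print(config['DataSource']['S3DataSource']['AttributeNames'])
# prints ['source-ref', 'annotation-ref']
```

The order of `AttributeNames` matters: the image reference comes first and the annotation reference second, matching the order in which the objects are streamed to the algorithm.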

### Train
To begin training, we have to create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job. We use ``ss-labelbox-train-pipe`` as the base name of our training job. Training requires a GPU instance type.

In [83]:
base_job_name = f'ss-labelbox-train-pipe-{int(time.time())}'

ss_model = sagemaker.estimator.Estimator(training_image,
                                         role=role, 
                                         train_instance_count = 1, 
                                         train_instance_type = 'ml.p2.8xlarge',
                                         train_volume_size = 10,
                                         train_max_run = 7200,
                                         train_max_wait=7200,
                                         output_path = s3_output_location,
                                         base_job_name = base_job_name,
                                         train_use_spot_instances=True,
                                         input_mode='Pipe',
                                         sagemaker_session = sess,
                                         enable_sagemaker_metrics=True)



The semantic segmentation algorithm at its core has two components.

- An encoder or backbone network,
- A decoder or algorithm network. 

The encoder or backbone network is typically a regular convolutional neural network whose layers may or may not have been pre-trained on an alternate task, such as the [classification task of ImageNet images](http://www.image-net.org/). The Amazon SageMaker Semantic Segmentation algorithm comes with two backbone choices ([ResNets](https://arxiv.org/abs/1512.03385) 50 or 101), each of which can be pre-trained or trained from scratch.

The decoder is a network that picks up the outputs of one or many layers from the backbone and reconstructs the segmentation mask from them. The Amazon SageMaker Semantic Segmentation algorithm comes with a choice of the [Fully-convolutional network (FCN)](https://arxiv.org/abs/1605.06211), the [Pyramid scene parsing (PSP) network](https://arxiv.org/abs/1612.01105) and [DeepLab v3](https://arxiv.org/abs/1706.05587).

The algorithm also has ample options for hyperparameters that help configure the training job. The next step in our training is to set up these networks and hyperparameters. Consider the following example definition of hyperparameters. See the SageMaker Semantic Segmentation [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html) for more details on the hyperparameters.

One of the hyperparameters here, for instance, is `epochs`. It defines how many passes over the dataset we make and determines the training time of the algorithm. Based on our tests, training the model for `x` epochs with similar settings should give us 'reasonable' segmentation results.

In [84]:
ss_model.set_hyperparameters(backbone='resnet-50', # This is the encoder. Other option is 'resnet-101'.
                             algorithm='deeplab', # This is the decoder. Other options are 'fcn' and 'psp'.
                             use_pretrained_model='True', # Use the pre-trained model.
                             base_size=1024,
                             crop_size=448, # Size of image random crop.
                             num_classes=8, # Number of classes in this dataset. This is a mandatory parameter.
                             epochs=20, # Number of epochs to run.
                             learning_rate=0.0001,
                             optimizer='rmsprop', # Other options include 'adam', 'adagrad', 'nag' and 'sgd'.
                             lr_scheduler='poly', # Other options include 'cosine' and 'step'.
                             mini_batch_size=16, # Training mini-batch size.
                             validation_mini_batch_size=16,
                             early_stopping=True, # Turn on early stopping. If OFF, other early stopping parameters are ignored.
                             early_stopping_patience=2, # Tolerate this many epochs if the mIoU doesn't increase.
                             early_stopping_tolerance=.001,
                             early_stopping_min_epochs=2, # No matter what, run at least this many epochs.
                             num_training_samples=2361) # This is a mandatory parameter: the number of training images in the manifest.
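The three early stopping parameters interact, and a small simulation makes the interaction easier to reason about. The sketch below is an assumption about the rule, not the algorithm's actual implementation: an epoch counts as an improvement when the validation mIoU exceeds the best value so far by more than the relative tolerance, and training stops once `patience` consecutive epochs have not improved and at least `min_epochs` epochs have run. The mIoU values are made up for illustration.

```python
def early_stop_epoch(miou_history, patience=2, tolerance=0.001, min_epochs=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = None
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, miou in enumerate(miou_history, start=1):
        if best is None or (miou - best) / max(abs(best), 1e-12) > tolerance:
            best = miou
            stale = 0
        else:
            stale += 1
        if epoch >= min_epochs and stale >= patience:
            return epoch
    return None


# Made-up validation mIoU values: two large gains, then a plateau.
print(early_stop_epoch([0.40, 0.50, 0.5001, 0.5002, 0.60]))
# prints 4
```

In this example the later jump to 0.60 is never reached, which shows the trade-off of a small `early_stopping_patience`: training is cheaper, but a slow plateau can end it before a later improvement.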

In [87]:
print(f'{train_data.config}\n{validation_data.config}')

{'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_train_imgs.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'annotation-ref']}}, 'ContentType': 'application/x-recordio', 'RecordWrapperType': 'RecordIO', 'InputMode': 'Pipe'}
{'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://st-crayon-dev/data/raw/imgs/tiles_1024/manifests/manifest_file_val_imgs.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'annotation-ref']}}, 'ContentType': 'application/x-recordio', 'RecordWrapperType': 'RecordIO', 'InputMode': 'Pipe'}


In [88]:
ss_model.fit(inputs=data_channels,
             logs=True,
             experiment_config={
                "TrialName": ss_trial.trial_name,
                "TrialComponentDisplayName": "Training",
                },
             wait=True
            )

INFO:sagemaker:Creating training-job with name: ss-labelbox-train-pipe-1596797102-2020-08-07-10-45-27-126
2020-08-07 10:45:27 Starting - Starting the training job...
2020-08-07 10:45:29 Starting - Launching requested ML instances......
2020-08-07 10:47:03 Starting - Preparing the instances for training............
2020-08-07 10:49:04 Downloading - Downloading input data
2020-08-07 10:49:04 Training - Downloading the training image.....Docker entrypoint called with argument(s): train
Running default environment configuration script
Running custom environment configuration script
[08/07/2020 10:49:59 INFO 139733208078144] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'syncbn': u'False', u'gamma2': u'0.9', u'gamma1': u'0.9', u'early_stopping_min_epochs': u'5', u'epochs': u'10', u'_workers': u'16', u'lr_scheduler_factor': u'0.1', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0001', u'crop_size'

## Deploy model: go to notebook site_data_model_deployment.ipynb
The code below is for experimentation and needs to be cleaned up before final delivery.
