## Setup

First, choose the S3 bucket and prefix to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting. If you don't specify a bucket, the SageMaker Python SDK creates a default bucket in that region, following a predefined naming convention.

Second, provide the IAM role ARN that gives SageMaker access to your data. If you run this notebook in SageMaker Studio, you can fetch it with the `get_execution_role` function from the SageMaker Python SDK.

- `profile` = AWS profile name
- `role` = predefined role ARN
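
The choice between a fetched role and a predefined one can be sketched as a small fallback helper. This is a minimal sketch, not part of the original notebook: `resolve_role` and the example ARN are hypothetical, and catching every exception from the fetch is an assumption made for brevity.

```python
def resolve_role(fallback_arn, fetch_role=None):
    """Return the execution role if it can be fetched, else a predefined ARN."""
    if fetch_role is None:
        try:
            # Real SDK function; imported lazily so the helper also works
            # on machines where the sagemaker package is not installed.
            from sagemaker import get_execution_role as fetch_role
        except ImportError:
            return fallback_arn
    try:
        return fetch_role()
    except Exception:
        # Outside SageMaker (e.g. on a laptop) the lookup usually fails,
        # so fall back to the predefined ARN.
        return fallback_arn


# Hypothetical ARN, for illustration only.
EXAMPLE_ARN = 'arn:aws:iam::123456789012:role/ExampleSageMakerRole'
print(resolve_role(EXAMPLE_ARN, fetch_role=lambda: 'arn:fetched'))
```

Passing `fetch_role` explicitly keeps the helper easy to exercise without AWS credentials.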

In [None]:
%%time
import sagemaker
from sagemaker import get_execution_role
import os
import boto3 
import time
import shutil
import random
import numpy as np

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
role='arn:aws:iam::395166463292:role/service-role/AmazonSageMaker-ExecutionRole-20200714T182988'
from PIL import Image

profile = 'crayon-site'
region_name='us-east-2'
bucket = 'st-crayon-dev'
prefix = 'sagemaker/labelbox/'

session = boto3.session.Session(profile_name = profile, region_name = region_name)
sess = sagemaker.Session(session,default_bucket=bucket)
print(sess.boto_session)

## Set up the Experiment
Create an experiment to track all the model training iterations. Experiments are a good way to organize your data science work: think of an experiment as a "folder" for organizing your "files".

In [None]:
sm = session.client(service_name = 'sagemaker')

experiment_name = 'semantic-seg-background'

experiments = []

for exp in Experiment.list(sagemaker_boto_client=sm):
    experiments.append(exp.experiment_name)

print(f'List of experiments : {experiments}')

if experiment_name not in experiments:
    experiment = Experiment.create(experiment_name=experiment_name,
                                   description="semantic segmentation of drone pictures",
                                   sagemaker_boto_client=sm)

print(f'Experiment used for notebook = {experiment_name}')

### Track Experiment
Create a Trial for each training run to track its inputs, parameters, and metrics.

In [None]:
num_training_samples = 2397 # 2397 and 11294
backbone = 'resnet-50'  # resnet-101
base_size = 512
crop_size = 384
num_epochs = 90
algorithm = 'fcn'  # 'psp' 'deeplab'
tiles_type = 'tiles_1024'
learning_rate = 0.0001
optimizer='adam' # 'sgd' 'adam', 'rmsprop', 'nag', 'adagrad'.
lrs='cosine' # 'poly' 'cosine' and 'step'.                           
# Add some randomness to each trial name
rand_string = str(np.random.randint(100)).zfill(2)
base_trial_name = f"I{tiles_type.replace('_','-')}-B{base_size}-C{crop_size}-E{num_epochs}-{algorithm}-LR{str(learning_rate).split('.')[1]}-O{optimizer}-LRS{lrs}-{rand_string}"
trial_name = f'ssl-{base_trial_name}'
ss_trial = Trial.create(trial_name = trial_name,
                          experiment_name = experiment_name,
                          sagemaker_boto_client = sm,
                          tags = [{'Key': 'experiment_name', 'Value': 'aws-ss-drone-dataset'}])
ss_trial
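
The learning-rate tag in the trial name relies on `str(learning_rate).split('.')[1]`, which raises an `IndexError` for values that Python prints in scientific notation (for example `1e-05`). The sketch below shows one way to build the same tag more robustly. The helper names `lr_tag` and `is_valid_name` are hypothetical, and the name pattern (alphanumerics and hyphens, at most 120 characters) is an assumption about SageMaker's naming rules that should be checked against the API reference.

```python
import re
from decimal import Decimal


def lr_tag(learning_rate):
    """Return the digits after the decimal point, e.g. 0.0001 -> '0001'."""
    # Decimal formatting with 'f' expands scientific notation (1e-05 -> 0.00001).
    fixed = format(Decimal(str(learning_rate)), 'f')
    return fixed.partition('.')[2] or '0'


# Assumed naming rule: alphanumerics separated by hyphens, max 120 characters.
NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}$')


def is_valid_name(name):
    """Check a candidate trial name against the assumed pattern."""
    return NAME_PATTERN.match(name) is not None


name = f"ssl-Itiles-1024-B512-C384-E90-fcn-LR{lr_tag(0.0001)}-Oadam-LRScosine-07"
print(name, is_valid_name(name))
```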

In [None]:
base_trial_name

In [None]:
s3_output_location = f's3://{bucket}/{prefix}output'
s3_output_checkpoints = f's3://{bucket}/{prefix}output/checkpoints/{base_trial_name}'
s3_output_location, s3_output_checkpoints

## Training image
Since we are using the prebuilt AWS semantic segmentation algorithm, we need the Amazon SageMaker Semantic Segmentation Docker image. The image is static and does not need to be changed.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(sess.boto_region_name, 'semantic-segmentation', repo_version="latest")
print(training_image)

## Training Pipe Mode

In File-mode, training data is downloaded to an encrypted EBS volume before training starts. Once the download completes, the training algorithm simply reads the downloaded training data files.

On the other hand, in Pipe-mode the input data is transferred to the algorithm while it is training. This poses a few significant advantages over File-mode:


* In File-mode, training startup time is proportional to the size of the input data. In Pipe-mode, the startup delay is constant, independent of the size of the input data. This translates to much faster training startup for training jobs with large GB/PB-scale training datasets.
* You do not need to allocate (and pay for) a large disk volume to be able to download the dataset.
* Throughput on IO-bound Pipe-mode algorithms can be multiple times faster than on equivalent File-mode algorithms.

However, these advantages come at a cost: a more complicated programming model than simply reading from files on disk. This notebook aims to clarify what you need to do in order to use Pipe-mode for training.
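
To make the difference in programming model concrete, the sketch below reads a stream sequentially in chunks, which is how a Pipe-mode consumer has to treat its input: no seeking and no second pass over the same stream. It is an illustration only. `stream_chunks` is a hypothetical helper, an in-memory buffer stands in for the channel, and the FIFO path mentioned in the comment is an assumption about the training container layout.

```python
import io


def stream_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for a Pipe-mode channel. Inside a training container this would
# be a FIFO opened for reading, e.g. open('/opt/ml/input/data/train_0', 'rb')
# (assumed path, not taken from this notebook).
fake_channel = io.BytesIO(b'x' * 2500)
sizes = [len(chunk) for chunk in stream_chunks(fake_channel, chunk_size=1024)]
print(sizes)  # [1024, 1024, 452]
```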



### Prepare handshake
Next, we set up the handshake between our data channels and the algorithm. To do this, we create `sagemaker.session.s3_input` objects from our data channels. In Pipe mode, the data channels are the manifest files, which contain the `s3` locations of the images and annotations.

In [None]:
manifest_train = f'data/raw/imgs/{tiles_type}/manifests/manifest_file_train_imgs.json'
manifest_train = f's3://{bucket}/{manifest_train}'
manifest_val = f'data/raw/imgs/{tiles_type}/manifests/manifest_file_val_imgs.json'
manifest_val = f's3://{bucket}/{manifest_val}'

print(f'{manifest_train=}\n{manifest_val=}')
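
For reference, an augmented manifest is a JSON Lines file: one JSON object per line, with keys matching the `attribute_names` passed to the channel (`source-ref` and `annotation-ref` in this notebook). A minimal sketch of generating such lines follows; the bucket and object names are hypothetical placeholders, and `manifest_lines` is not part of the original notebook.

```python
import json


def manifest_lines(pairs):
    """Yield one JSON line per (image URI, annotation URI) pair."""
    for image_uri, mask_uri in pairs:
        yield json.dumps({'source-ref': image_uri, 'annotation-ref': mask_uri})


# Hypothetical S3 URIs, for illustration only.
pairs = [
    ('s3://example-bucket/imgs/tile_0001.jpg', 's3://example-bucket/masks/tile_0001.png'),
    ('s3://example-bucket/imgs/tile_0002.jpg', 's3://example-bucket/masks/tile_0002.png'),
]
manifest_text = '\n'.join(manifest_lines(pairs))
print(manifest_text)
```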

In [None]:
distribution = 'FullyReplicated'
# Create sagemaker s3_input objects
train_data = sagemaker.session.s3_input(manifest_train, 
                                        distribution=distribution, 
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        attribute_names=['source-ref', 'annotation-ref'],
                                        input_mode='Pipe',
                                        record_wrapping="RecordIO")
validation_data = sagemaker.session.s3_input(manifest_val, 
                                        distribution=distribution, 
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        attribute_names=['source-ref', 'annotation-ref'],
                                        input_mode='Pipe',
                                        record_wrapping="RecordIO")

# s3model = 's3://st-crayon-dev/sagemaker/labelbox/output/ssl-job-Itiles-1024-B256-C224-E80-fcn-N-2020-08-17-19-55-46-533/output/model.tar.gz'
# model_data = sagemaker.session.s3_input(s3model, distribution= 'FullyReplicated',s3_data_type='S3Prefix', input_mode='File', content_type='application/x-sagemaker-model')

data_channels = {'train': train_data, 
                 'validation': validation_data}
# data_channels = {'train': train_data, 'validation': validation_data, 'model': model_data}
data_channels

### Train
To begin training, we create a ``sagemaker.estimator.Estimator`` object. This estimator launches the training job. The job name is derived from ``base_job_name``, which starts with ``ssl-job-``. Training requires a GPU instance type.

In [None]:
base_job_name = f'ssl-job-{base_trial_name}'

ss_model = sagemaker.estimator.Estimator(training_image,
                                         role=role, 
                                         train_instance_count = 1, 
                                         train_instance_type = 'ml.p3.2xlarge',
                                         train_volume_size = 20,
                                         train_max_run = 60 * 60 * 12,
                                         train_max_wait= 60 * 60 * 18,
                                         output_path = s3_output_location,
                                         checkpoint_s3_uri = s3_output_checkpoints,
                                         base_job_name = base_job_name,
                                         train_use_spot_instances=True,
                                         input_mode='Pipe',
                                         sagemaker_session = sess,
#                                          model_uri=ss_model,
                                         enable_sagemaker_metrics=True)
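
With spot instances, the maximum wait time covers both the time spent waiting for capacity and the training run itself, so it must be at least as large as the maximum run time. The check below is a small sketch using the values from the estimator above; `check_spot_limits` is a hypothetical helper, not an SDK function.

```python
def check_spot_limits(max_run, max_wait):
    """Return the slack (seconds) available for waiting on spot capacity."""
    if max_wait < max_run:
        raise ValueError(
            f'max_wait ({max_wait}s) must be >= max_run ({max_run}s) for spot training'
        )
    return max_wait - max_run


max_run = 60 * 60 * 12   # 12 hours of training time
max_wait = 60 * 60 * 18  # 18 hours including waiting for spot capacity
slack = check_spot_limits(max_run, max_wait)
print(slack / 3600)  # 6.0
```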

The semantic segmentation algorithm has two components at its core.

- An encoder or backbone network,
- A decoder or algorithm network. 

The encoder or backbone network is typically a regular convolutional neural network whose layers may or may not have been pre-trained on an alternate task such as the [classification task of ImageNet images](http://www.image-net.org/). The Amazon SageMaker Semantic Segmentation algorithm offers two backbone networks ([ResNets](https://arxiv.org/abs/1512.03385) 50 or 101), each either pre-trained or to be trained from scratch.

The decoder is a network that picks up the outputs of one or many layers from the backbone and reconstructs the segmentation mask from them. The Amazon SageMaker Semantic Segmentation algorithm comes with a choice of the [Fully-convolutional network (FCN)](https://arxiv.org/abs/1605.06211), the [Pyramid scene parsing (PSP) network](https://arxiv.org/abs/1612.01105), and [DeepLab v3](https://arxiv.org/abs/1706.05587).

The algorithm also has ample options for hyperparameters that help configure the training job. The next step in our training is to set up these networks and hyperparameters. Consider the following example definition of hyperparameters. See the SageMaker Semantic Segmentation [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html) for more details on the hyperparameters.

One of the hyperparameters here, for instance, is `epochs`. It defines how many passes over the dataset we make and largely determines the training time of the algorithm. Based on our tests, training the model for `x` epochs with similar settings should give us 'reasonable' segmentation results.
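
As a rough worked example of how `epochs` drives training time, the arithmetic below counts mini-batches for the values used in this notebook. It assumes the final partial batch of each epoch is still processed, which may not match the built-in algorithm's behavior, and it ignores early stopping.

```python
import math

num_training_samples = 2397
mini_batch_size = 8
num_epochs = 90

# Assumes the last, partially filled batch of each epoch is not dropped.
batches_per_epoch = math.ceil(num_training_samples / mini_batch_size)
total_batches = batches_per_epoch * num_epochs

print(batches_per_epoch)  # 300
print(total_batches)      # 27000
```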

In [None]:
ss_model.set_hyperparameters(backbone=backbone, # This is the encoder. Options are 'resnet-50' and 'resnet-101'.
                             algorithm=algorithm, # This is the decoder. Options are 'fcn', 'psp' and 'deeplab'.
                             use_pretrained_model=True, # Use the pre-trained model.
                             base_size=base_size,
                             crop_size=crop_size, # Size of image random crop.
                             num_classes=6, # Number of classes in this dataset. This is a mandatory parameter.
                             epochs=num_epochs, # Number of epochs to run.
                             learning_rate=learning_rate,
                             optimizer=optimizer, # Options include 'sgd', 'adam', 'rmsprop', 'nag', 'adagrad'.
                             lr_scheduler=lrs, # Options include 'poly', 'cosine' and 'step'.
                             mini_batch_size=8, # Setup some mini batch size.
                             validation_mini_batch_size=8,
                             early_stopping=True, # Turn on early stopping. If OFF, other early stopping parameters are ignored.
                             early_stopping_patience=6, # Tolerate this many epochs if the mIoU doesn't increase.
                             early_stopping_tolerance=.001,
                             early_stopping_min_epochs=12, # No matter what, run at least this many epochs.
                             num_training_samples=num_training_samples) # This is a mandatory parameter, 2397 in this case.
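
The three early stopping parameters interact as follows: training always runs for at least `early_stopping_min_epochs`, and after that it stops once the validation mIoU has failed to improve by more than the tolerance for `early_stopping_patience` consecutive epochs. The sketch below illustrates that general scheme using a relative improvement threshold. It is an assumption about how such a rule typically works, not the built-in algorithm's actual implementation, and both function names are hypothetical.

```python
def epochs_without_improvement(miou_history, tolerance):
    """Count trailing epochs whose mIoU did not beat the best by the tolerance."""
    best = miou_history[0]
    stale = 0
    for score in miou_history[1:]:
        # An epoch counts as improved only if it beats the best score
        # by more than the relative tolerance.
        if score > best * (1 + tolerance):
            best, stale = score, 0
        else:
            stale += 1
    return stale


def should_stop(miou_history, patience=6, tolerance=0.001, min_epochs=12):
    """Return True if training would stop after the latest recorded epoch."""
    if len(miou_history) < min_epochs:
        return False
    return epochs_without_improvement(miou_history, tolerance) >= patience


plateau = [0.50, 0.52, 0.54, 0.56, 0.58, 0.60] + [0.60] * 6
print(should_stop(plateau))  # True
```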

In [None]:
print(f'{train_data.config}\n{validation_data.config}')

In [None]:
ss_model.fit(inputs=data_channels,
             logs=True,
             experiment_config={
                "TrialName": ss_trial.trial_name,
                "TrialComponentDisplayName": "Training",
                },
             wait=True
            )