## Set up the notebook instance to support local mode
Local mode lets you test the container on the notebook instance without pushing it to ECR. It currently requires docker-compose to be installed.

In [1]:
!/bin/bash setup.sh

The user has root access.
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!
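
Once the setup script reports success, a container can be run in local mode by requesting the special instance type `local` from the SageMaker Python SDK. The sketch below only assembles the keyword arguments for `Estimator`. It assumes the SDK v2 parameter names (`image_uri`, `instance_count`, `instance_type`); SDK v1 used `image_name`, `train_instance_count`, and `train_instance_type` instead. The helper name `local_mode_kwargs` is hypothetical and not part of this notebook.

```python
def local_mode_kwargs(image, role, sdk_major=2):
    """Build Estimator keyword arguments that select SageMaker local mode.

    Hypothetical helper for illustration only; the parameter names differ
    between major versions of the SageMaker Python SDK.
    """
    if sdk_major >= 2:
        return {
            "image_uri": image,
            "role": role,
            "instance_count": 1,
            "instance_type": "local",  # run the container with Docker on this machine
        }
    # SDK v1 spelling of the same arguments.
    return {
        "image_name": image,
        "role": role,
        "train_instance_count": 1,
        "train_instance_type": "local",
    }
```

With the SDK installed, something like `Estimator(**local_mode_kwargs(image, role)).fit(inputs)` would then train inside a local container rather than on a managed training instance.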


In [1]:
import sagemaker 
role = sagemaker.get_execution_role()
role

'arn:aws:iam::662612070855:role/service-role/AmazonSageMaker-ExecutionRole-20200303T115551'

## Set up the environment
We will set up a few things before starting the workflow:

1. Get the execution role, which is passed to SageMaker so that it can access your resources, such as the S3 bucket.
2. Specify the S3 bucket and prefix where the training data set and model artifacts are stored.

In [2]:
import tensorflow as tf

import os
import shlex
import boto3
import tarfile
import sagemaker
import subprocess
from time import gmtime, strftime
from sagemaker.estimator import Estimator

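# Project-local helper module; presumably provides get_config, get_definition and upload_code used later.
from tuning_utils import *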

region = boto3.Session().region_name

sagemaker_session = sagemaker.Session()
smclient = boto3.client('sagemaker')

repository = 'sagemaker-ecr'
bucket = 'rsrch-cynamics-datasets'
prefix = 'autoencoders'
tensorflow_version = '2.1.0-py3'


role = sagemaker.get_execution_role()

# Fix attempts below; restarting the notebook solved it
#iam = boto3.client('iam')
#role = iam.get_role(RoleName='arn:aws:iam::662612070855:role/service-role/AmazonSageMaker-ExecutionRole-20200303T115551')['Role']['Arn']
#try:
#    role = sagemaker.get_execution_role()
#except ValueError:
#iam = boto3.client('iam')
#role = iam.get_role(RoleName='AmazonSageMakerFullAccess')['Role']['Arn']
#from sagemaker import get_execution_role
#sagemaker_session = sagemaker.Session()
#role = get_execution_role()
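
One detail in the commented-out fallback above: it passes a full ARN as `RoleName`, whereas IAM's `get_role` expects the role's friendly name, which is the last path segment of the ARN. A minimal sketch of extracting that name, with `role_name_from_arn` as a hypothetical helper that is not part of this notebook:

```python
def role_name_from_arn(arn):
    """Return the friendly role name from an IAM role ARN.

    The resource part of a role ARN looks like 'role/<optional/path/>Name';
    only the final segment is the role name.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("role/"):
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    return parts[5].rsplit("/", 1)[-1]
```

The result could then be passed as `boto3.client('iam').get_role(RoleName=...)['Role']['Arn']` in environments where `sagemaker.get_execution_role()` is not usable.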






## Prepare the data

In [3]:
devices_models = [
('Sioux', 1, 1/1000,1)
]

features_version = 1
client, device, sr, model_version = devices_models[-1]
print(client, device, sr, model_version, f'v{features_version}')

channels = {
    'train': f's3://rsrch-cynamics-datasets/clients/{client}/sr={sr}/device={device}/version={features_version}/model={model_version}/type=train/',
    'val': f's3://rsrch-cynamics-datasets/clients/{client}/sr={sr}/device={device}/version={features_version}/model={model_version}/type=val/',
}
print(channels)


Sioux 1 0.001 1 v1
{'train': 's3://rsrch-cynamics-datasets/clients/Sioux/sr=0.001/device=1/version=1/model=1/type=train/', 'val': 's3://rsrch-cynamics-datasets/clients/Sioux/sr=0.001/device=1/version=1/model=1/type=val/'}
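
The same channel layout is built again in the experiment loop later on, so it may be convenient to factor it into a function. The following is a small sketch under that assumption; `channel_uris` is a hypothetical name, and it takes the bucket as a parameter instead of hard-coding it.

```python
def channel_uris(bucket, client, sr, device, features_version, model_version):
    """Return the train/val S3 prefixes following the key layout used above."""
    base = (
        f"s3://{bucket}/clients/{client}/sr={sr}/device={device}"
        f"/version={features_version}/model={model_version}"
    )
    # One entry per data split, each ending in a trailing slash.
    return {split: f"{base}/type={split}/" for split in ("train", "val")}
```

Calling `channel_uris(bucket, client, sr, device, features_version, model_version)` with the values from the cell above should reproduce the printed dictionary.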


## Building the image
We will build the Docker image from the TensorFlow images on Docker Hub. The full list of TensorFlow tags can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/
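
The build cell later in this notebook passes the TensorFlow tag to Docker as a `VERSION` build argument. As a rough illustration, the invocation can be assembled as an argument list, which sidesteps quoting problems that string formatting plus `shlex.split` can run into. `docker_build_args` is a hypothetical name, and the Dockerfile is assumed to declare a matching `ARG VERSION`.

```python
import subprocess


def docker_build_args(name, version, dockerfile="Dockerfile", context="."):
    """Return the `docker build` command as a list of arguments."""
    return [
        "docker", "build",
        "-t", name,                           # tag for the resulting image
        "--build-arg", f"VERSION={version}",  # consumed by ARG VERSION in the Dockerfile
        "-f", dockerfile,
        context,
    ]


def build_image(name, version):
    """Run the build and raise CalledProcessError if Docker exits non-zero."""
    subprocess.check_call(docker_build_args(name, version))
```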


## Pushing the container to ECR
Now that the container has been tested locally and works, we can move on to hyperparameter tuning. Before starting the tuning job, the Docker image must be pushed to ECR.

The cell below builds the image and pushes it to ECR. The step that creates the ECR repository when it does not exist is currently commented out, so the repository must already exist.
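
The "create the repository if it does not exist" step can also be done through boto3 rather than the AWS CLI. The sketch below is one possible way to do it, not the notebook's own method. `ensure_repository` is a hypothetical helper that takes the client as a parameter, so it can be exercised without AWS access.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository `name` unless it already exists.

    `ecr_client` is expected to behave like boto3.client('ecr').
    Returns True if the repository was created, False if it was found.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True
```

In the notebook this would be called as `ensure_repository(boto3.client('ecr'), repository)` before the push.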

In [4]:
%%time

def build_image(name, version):
    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)
    subprocess.check_call(shlex.split(cmd))

account = boto3.client('sts').get_caller_identity()['Account']

image_name = f'{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{prefix}'

print('building image:'+image_name, end=' ')
build_image(image_name, tensorflow_version)
print('Done!')

# # If the repository doesn't exist in ECR, create it.
# exist_repo = !aws ecr describe-repositories --repository-names {repository} > /dev/null 2>&1

# if not exist_repo:
#     print('Creating')
#     !aws ecr create-repository --repository-name {repository} > /dev/null

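# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` exists only in AWS CLI v1; CLI v2 replaced it with `aws ecr get-login-password`.
!$(aws ecr get-login --region {region} --no-include-email)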

!docker push {image_name}



building image:662612070855.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-ecr:autoencoders Done!
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
The push refers to repository [662612070855.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-ecr]

70a9d75f: Preparing
43a44c71: Preparing
cba87b53: Preparing
8cb63302: Preparing
afd6f834: Preparing
40119697: Preparing
f89939b4: Preparing
4f91799b: Preparing
2e59dcc5: Preparing
b2c9ed16: Preparing
fb8f161b: Preparing
43ea46a8: Preparing
fcc4a1a8: Preparing
cba87b53: Pushed

## Run experiments on all combinations
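
The cell that follows delegates to `get_config` from `tuning_utils`, which is not shown in this notebook. As a rough guide to the shape that `create_hyper_parameter_tuning_job` expects for `HyperParameterTuningJobConfig`, here is a sketch with placeholder values. The function name, the metric name, and the parameter ranges are all assumptions for illustration, not what `get_config` actually returns.

```python
def sketch_tuning_config(max_jobs, max_parallel):
    """Return a dict shaped like HyperParameterTuningJobConfig.

    The metric name and parameter ranges are placeholders, not the
    values used by this notebook's tuning_utils.get_config.
    """
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": "val_loss",  # hypothetical metric name
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        "ParameterRanges": {
            # The API takes range bounds as strings.
            "ContinuousParameterRanges": [
                {"Name": "learning_rate", "MinValue": "0.0001",
                 "MaxValue": "0.01", "ScalingType": "Logarithmic"},
            ],
            "IntegerParameterRanges": [
                {"Name": "hidden_units", "MinValue": "8",
                 "MaxValue": "128", "ScalingType": "Auto"},
            ],
        },
    }
```

For a custom container, the objective metric generally also has to be declared in the training job definition's metric definitions so that SageMaker can parse it from the training logs.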

In [5]:
training_image = image_name
test = False
global_jobs = 'test-eldad' + ('-test' if test else '')

for features_version in [1,3]:
    for client, device, sr, model_version in devices_models:
        tuning_job_name = f'{global_jobs}-{client}-{device}-v{features_version}'
        channels = {
            'train': f's3://rsrch-cynamics-datasets/clients/{client}/sr={sr}/device={device}/version={features_version}/model={model_version}/type=train/',
            'val': f's3://rsrch-cynamics-datasets/clients/{client}/sr={sr}/device={device}/version={features_version}/model={model_version}/type=val/',
        }
        tuning_job_config = get_config(max_jobs=1 if test else 50, max_parallel=2)
        training_job_definition = get_definition(training_image, channels, bucket, 
                                                 global_jobs, tuning_job_name, 1 if test else 500,
                                                 client, device, sr, role)
        training_job_definition['StaticHyperParameters']['model_version'] = str(model_version)
        training_job_definition['StaticHyperParameters']['features_version'] = str(features_version)
        training_job_definition['ResourceConfig']['InstanceType'] = 'ml.m5.xlarge' if features_version == 1 else 'ml.m5.2xlarge'
        #display(training_job_definition['OutputDataConfig'])
        
        try:
            output = smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name,
                                                                HyperParameterTuningJobConfig=tuning_job_config,
                                                                TrainingJobDefinition=training_job_definition,
                                                               )

            status = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)['HyperParameterTuningJobStatus']
            display(status)
            if status =='InProgress':
                upload_code(training_job_definition['OutputDataConfig']['S3OutputPath'])
        except Exception as e:
            print(e)

'InProgress'

Uploading code to s3://rsrch-cynamics-datasets/TuningJobs/test-eldad/test-eldad-Sioux-1-v1




'InProgress'

Uploading code to s3://rsrch-cynamics-datasets/TuningJobs/test-eldad/test-eldad-Sioux-1-v3
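
The loop above checks each job's status only once, right after creation. To follow a job to completion, the same `describe_hyper_parameter_tuning_job` call can be polled. A minimal sketch, assuming that `InProgress` and `Stopping` are the only states worth waiting on; `wait_for_tuning_job` is a hypothetical helper, and the `sleep` parameter exists only so the loop can be tested without real delays.

```python
import time


def wait_for_tuning_job(client, name, poll_seconds=60, sleep=time.sleep):
    """Poll a tuning job until it leaves its in-progress states.

    `client` is expected to behave like boto3.client('sagemaker').
    Returns the last status seen (for example 'Completed' or 'Failed').
    """
    while True:
        status = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=name
        )["HyperParameterTuningJobStatus"]
        if status not in ("InProgress", "Stopping"):
            return status
        sleep(poll_seconds)
```

With the names from this notebook, that would be `wait_for_tuning_job(smclient, tuning_job_name)`.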
