# SageMaker

**Two different approaches**
- Using Boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#id180)
- Using the SageMaker Python SDK (https://sagemaker.readthedocs.io/en/stable/index.html)

**SageMaker Boto3 Client**

In [None]:
import boto3

client = boto3.client('sagemaker')
sm = client  # some cells below refer to the same client as `sm`

**Pre-Built AWS Containers**

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

**create_training_job(\*\*kwargs)**: starts a model training job. After training completes, SageMaker saves the resulting model artifacts to the S3 location you specify.

In [None]:
client.create_training_job(**create_training_params)

**describe_training_job(\*\*kwargs)**: returns information about a training job.

In [None]:
# getting model artifact location
client.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts']
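
The artifacts are only written to S3 once training has finished, so it helps to wait for a terminal status before reading the location. A minimal sketch, assuming a hypothetical helper name `wait_for_training_job` and an arbitrary polling interval; it relies only on the `TrainingJobStatus` field returned by `describe_training_job`:

```python
import time

# Statuses after which the job will not change state any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status.

    `client` is expected to behave like boto3.client('sagemaker').
    Returns the last describe_training_job response.
    """
    while True:
        desc = client.describe_training_job(TrainingJobName=job_name)
        if desc["TrainingJobStatus"] in TERMINAL_STATUSES:
            return desc
        sleep(poll_seconds)


# Usage (requires AWS credentials and an existing job):
# desc = wait_for_training_job(client, job_name)
# if desc["TrainingJobStatus"] == "Completed":
#     print(desc["ModelArtifacts"]["S3ModelArtifacts"])
```

Passing the client and the sleep function as arguments keeps the helper easy to try out with a stub before pointing it at a real job.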

**Model Artifacts**: output that results from training a model. Typically consist of trained parameters, a model definition that describes how to compute inferences, and other metadata.
- S3ModelArtifacts: the path of the S3 object that contains the model artifacts (s3://bucket-name/keynameprefix/model.tar.gz)
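
To download or inspect the artifact, the S3 URI usually has to be split into a bucket name and an object key. A small sketch using only the standard library; `split_s3_uri` is a hypothetical helper name, and the URI is the placeholder from the bullet above:

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri("s3://bucket-name/keynameprefix/model.tar.gz")
print(bucket_name)  # bucket-name
print(key)          # keynameprefix/model.tar.gz
```

The resulting pair can then be passed to an S3 client, for example `boto3.client('s3').download_file(bucket_name, key, 'model.tar.gz')`.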

In [None]:
# create_training_params is a large JSON-like dict that configures the
# container, role, S3 locations, etc. (see "Training Parameters" under References)
sm.create_training_job(**create_training_params)

**create_model(\*\*kwargs)**: creates a model in SageMaker. In the request, you name the model and describe a primary container. For the primary container, you specify the Docker image with the inference code, artifacts (from prior training), and a custom environment map that the inference code uses when you deploy the model for predictions (environment variables to set in the Docker container). 

In [None]:
model_name=job_name + '-mdl'
xgboost_hosting_container = {
    'Image': container,
    'ModelDataUrl': sm.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts'],
    'Environment': {'this': 'is'}  # optional: environment variables set in the container
}

create_model_response = sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=xgboost_hosting_container)

After setting up the model, we need to create an **endpoint configuration** and then create the endpoint.

In [None]:
from time import gmtime, strftime

endpoint_config_name = 'DEMO-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'InitialVariantWeight':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])
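
With a single variant, `InitialVariantWeight` has no visible effect. With several variants, each one receives traffic in proportion to its weight divided by the sum of all weights. A sketch of that arithmetic; the variant and model names below are made up for illustration, and `traffic_shares` is a hypothetical helper:

```python
def traffic_shares(production_variants):
    """Return the fraction of traffic each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in production_variants)
    return {
        v["VariantName"]: v["InitialVariantWeight"] / total
        for v in production_variants
    }


# Hypothetical two-variant configuration (names are placeholders).
variants = [
    {"VariantName": "VariantA", "ModelName": "model-a", "InitialVariantWeight": 3},
    {"VariantName": "VariantB", "ModelName": "model-b", "InitialVariantWeight": 1},
]
print(traffic_shares(variants))  # {'VariantA': 0.75, 'VariantB': 0.25}
```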

**Endpoint** to serve the model:

In [None]:
%%time
import time

endpoint_name = 'DEMO-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)
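
Once the status is `InService`, the endpoint can be called through the separate `sagemaker-runtime` client and its `invoke_endpoint` call. A hedged sketch, assuming the model was trained on CSV input (as in the training parameters under References) and that the container answers with comma- or newline-separated scores; `invoke_csv` is a hypothetical helper name:

```python
def invoke_csv(runtime, endpoint_name, rows):
    """Send rows as CSV to an endpoint and return the scores as floats.

    `runtime` is expected to behave like boto3.client('sagemaker-runtime').
    """
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    text = response["Body"].read().decode("utf-8")
    # Accept both separators, since the output format depends on the container.
    return [float(x) for x in text.replace("\n", ",").split(",") if x.strip()]


# Usage (requires AWS credentials and a running endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# scores = invoke_csv(runtime, endpoint_name, [[0.5, 1.0, 3.2]])
```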

## References

**Training Parameters**

In [None]:
create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/single-xgboost/".format(bucket, prefix),
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m4.4xlarge",
        "VolumeSizeInGB": 20
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth":"5",
        "eta":"0.1",
        "gamma":"1",
        "min_child_weight":"1",
        "silent":"0",
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "num_round": "20"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 60 * 60
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri":  "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "csv",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/val/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "csv",
            "CompressionType": "None"
        }
    ]
}

## Hyperparameter Tuning

**Tips**  
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/xgboost_direct_marketing/hpo_xgboost_direct_marketing_sagemaker_APIs.ipynb

- Recommendation: keep the number of parallel jobs below 10% of the total number of training jobs.
- Make sure your algorithm emits the selected optimization metric during training. If you use validation:auc, for example, make sure that your algorithm emits it. This is especially important when using your own algorithms.
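
The first tip is easy to check before launching a job. A small sketch with a hypothetical helper, `parallel_fraction`, applied to the resource limits used in this notebook's tuning configuration (20 jobs, 3 in parallel):

```python
def parallel_fraction(resource_limits):
    """Fraction of the total training jobs that may run at the same time."""
    return (
        resource_limits["MaxParallelTrainingJobs"]
        / resource_limits["MaxNumberOfTrainingJobs"]
    )


limits = {"MaxNumberOfTrainingJobs": 20, "MaxParallelTrainingJobs": 3}
fraction = parallel_fraction(limits)
print(f"{fraction:.0%} of the jobs may run in parallel")
if fraction >= 0.10:
    print("Above the suggested 10%: consider fewer parallel jobs or more total jobs.")
```

With these limits the fraction is 15%, so the example configuration is slightly more parallel than the tip suggests.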

Example of JSON file for tuning configurations:

In [None]:
from time import gmtime, strftime, sleep
tuning_job_name = 'xgboost-tuningjob-' + strftime("%d-%H-%M-%S", gmtime())

print(tuning_job_name)

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta",
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight",
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha",            
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_depth",
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 20,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:auc",
      "Type": "Maximize"
    }
  }

Training job parameters:
- Container for the image of the algorithm
- Input configuration for training and validation data (e.g. using bucket URI - s3://...)
- Configuration of the output of the algorithm (e.g. bucket/folder)
- Static hyperparameters
- Type and number of instances
- Stopping condition for the training jobs (e.g. maximum runtime per job)
- When using custom algorithms, you need to add a **MetricDefinitions** list (each entry names a metric and gives the regex that extracts it from the training logs)
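
For the last item, each metric definition pairs a metric name with a regex whose first capture group is the value. The sketch below applies such a regex locally to check that it matches; the log line and the regex are made up for illustration and would need to match whatever your own algorithm prints:

```python
import re

# Hypothetical definition for a custom algorithm. In a tuning job it would go
# under AlgorithmSpecification["MetricDefinitions"] in the training job definition.
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc:([0-9\.]+)"},
]

# Hypothetical log line, invented for this example.
log_line = "[12] train-auc:0.913 validation-auc:0.874"


def extract_metrics(line, definitions):
    """Apply each regex to a log line and collect the captured values."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_line, metric_definitions))  # {'validation:auc': 0.874}
```

Testing the regex against a few real log lines before starting a tuning job avoids runs that finish without ever reporting the objective metric.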

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

region = boto3.Session().region_name
training_image = get_image_uri(region, 'xgboost', repo_version='latest')

# bucket and prefix are assumed to be defined earlier in the session
s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation = 's3://{}/{}/validation/'.format(bucket, prefix)

training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m4.xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "auc",
      "num_round": "100",
      "objective": "binary:logistic",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}

Launch the hyperparameter tuning job by calling the create_hyper_parameter_tuning_job API:

In [205]:
smclient = boto3.Session().client('sagemaker')
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                            HyperParameterTuningJobConfig = tuning_job_config,
                                            TrainingJobDefinition = training_job_definition)
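
Progress and the best result so far can be read back with `describe_hyper_parameter_tuning_job`. A minimal sketch, assuming a hypothetical helper name `summarize_tuning_job`; it only reads the `HyperParameterTuningJobStatus`, `TrainingJobStatusCounters` and `BestTrainingJob` fields of the response:

```python
def summarize_tuning_job(client, tuning_job_name):
    """Return status, completed-job count and best job of a tuning job.

    `client` is expected to behave like boto3.client('sagemaker').
    """
    desc = client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name
    )
    summary = {
        "status": desc["HyperParameterTuningJobStatus"],
        "completed": desc["TrainingJobStatusCounters"]["Completed"],
        "best_job": None,
        "best_value": None,
    }
    # BestTrainingJob may be missing until at least one job has finished.
    best = desc.get("BestTrainingJob")
    if best:
        summary["best_job"] = best["TrainingJobName"]
        summary["best_value"] = best[
            "FinalHyperParameterTuningJobObjectiveMetric"
        ]["Value"]
    return summary


# Usage (requires AWS credentials and an existing tuning job):
# print(summarize_tuning_job(smclient, tuning_job_name))
```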

# Videogame Sales (Workshop)

In [189]:
import json
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import boto3
import sagemaker
from sklearn.datasets import dump_svmlight_file

from botocore.exceptions import ClientError

In [82]:
BUCKET_NAME = 'sagemaker-workshop-dsb'
ROLE_NAME = 'sagemaker_role'
PREFIX = 'sagemaker/videogames-xgboost'

## Configuration

Let's create our own bucket:

In [83]:
s3 = boto3.resource('s3')

try:
    s3.meta.client.head_bucket(Bucket=BUCKET_NAME)
    exists = True
except ClientError:
    exists = False
    
if not exists:
    bucket = s3.create_bucket(Bucket=BUCKET_NAME)
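
Note that `create_bucket` with only a `Bucket` argument targets `us-east-1`; in any other region S3 expects a `CreateBucketConfiguration` with a matching `LocationConstraint`. A sketch of a hypothetical helper, `create_bucket_kwargs`, that builds the right arguments (the region names below are just examples):

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for create_bucket for the given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("sagemaker-workshop-dsb", "us-east-1"))
print(create_bucket_kwargs("sagemaker-workshop-dsb", "eu-west-1"))

# Usage (requires AWS credentials):
# region = boto3.Session().region_name
# bucket = s3.create_bucket(**create_bucket_kwargs(BUCKET_NAME, region))
```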

Let's also create a role for SageMaker:

In [84]:
iam = boto3.resource('iam')
allowed_services = ["sagemaker.amazonaws.com"]
trust_policy = {
        'Version': '2012-10-17',
        'Statement': [{
                'Effect': 'Allow',
                'Principal': {'Service': service},
                'Action': 'sts:AssumeRole'
            } for service in allowed_services
        ]
    }

In [57]:
try:
    role = iam.create_role(
        RoleName=ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(trust_policy)
    )
    created = True
except ClientError:
    print("Role already exists")
    created = False

Finally, let's attach a policy to the role above:

In [59]:
iam_client = boto3.client('iam')

if created:
    iam_client.attach_role_policy(
        PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
        RoleName=ROLE_NAME
    )


In [157]:
iam.Role(ROLE_NAME)

iam.Role(name='sagemaker_role')

## Data

Downloading data from a public repository: 

In [60]:
import os

raw_data_filename = 'Video_Games_Sales_as_at_22_Dec_2016.csv'
data_bucket = 'sagemaker-workshop-pdx'

os.makedirs('./data', exist_ok=True)  # download_file does not create missing directories
s3 = boto3.resource('s3')
s3.Bucket(data_bucket).download_file(raw_data_filename, './data/raw_data.csv')

Data preparation:

In [72]:
data = pd.read_csv('./data/raw_data.csv')
data['y'] = (data['Global_Sales'] > 1)
data = data.drop(['Name', 'Year_of_Release', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Count', 'User_Count', 'Developer'], axis=1)
data = data.dropna()
data['User_Score'] = data['User_Score'].apply(pd.to_numeric, errors='coerce')
data['User_Score'] = data['User_Score'].mask(np.isnan(data["User_Score"]), data['Critic_Score'] / 10.0)

In [73]:
if data['y'].dtype == bool:
    data['y'] = data['y'].apply(lambda y: 'yes' if y else 'no')
model_data = pd.get_dummies(data)

In [76]:
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])   

Converting data to libSVM format and copying files to S3:

In [80]:
dump_svmlight_file(X=train_data.drop(['y_no', 'y_yes'], axis=1), y=train_data['y_yes'], f='./data/train.libsvm')
dump_svmlight_file(X=validation_data.drop(['y_no', 'y_yes'], axis=1), y=validation_data['y_yes'], f='./data/validation.libsvm')
dump_svmlight_file(X=test_data.drop(['y_no', 'y_yes'], axis=1), y=test_data['y_yes'], f='./data/test.libsvm')
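
For reference, each line of a libSVM file has the form `label index:value index:value ...`, and zero-valued features are simply left out. The sketch below is a simplified, hypothetical re-implementation (`to_libsvm_line`) meant only to show the layout; it uses zero-based indices like the default of `dump_svmlight_file`, but the exact number formatting of the real function may differ:

```python
def to_libsvm_line(label, features):
    """Format one row as 'label index:value ...', skipping zero features."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features) if v != 0]
    return " ".join([f"{label:g}"] + pairs)


print(to_libsvm_line(1, [2.5, 0, 0, 1]))  # 1 0:2.5 3:1
```

Because the label is always the first space-separated token, it can be read back with `line.split(' ')[0]`, which is how the test labels are recovered later in this notebook.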

In [136]:
# Accessing our bucket
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)

# uploading files
bucket.Object(PREFIX + '/train/train.libsvm').upload_file('./data/train.libsvm')
bucket.Object(PREFIX + '/validation/validation.libsvm').upload_file('./data/validation.libsvm')

We also need to create training channels for the training job.

In [140]:
s3_input_train = sagemaker.s3_input(s3_data=f's3://{BUCKET_NAME}/{PREFIX}/train',
                   content_type='libsvm')
s3_input_validation = sagemaker.s3_input(s3_data=f's3://{BUCKET_NAME}/{PREFIX}/validation',
                   content_type='libsvm')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


## Train

In this example, we will use one of the built-in algorithms. To build the estimator object, we need the algorithm image (xgboost in this case):

In [162]:
from sagemaker.amazon.amazon_estimator import get_image_uri

# returns the registry URI of the Docker image for the built-in algorithm
container = get_image_uri(boto3.Session().region_name, 
                          'xgboost', 
                          '1.0-1')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


In [163]:
xgb = sagemaker.estimator.Estimator(container,
                                    ROLE_NAME,
                                    train_instance_count=1,
                                    train_instance_type='ml.c5.xlarge',
                                    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
                                    sagemaker_session=sagemaker.Session())

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


In [164]:
xgb.set_hyperparameters(max_depth=3,
                        eta=0.1,
                        subsample=0.5,
                        eval_metric='auc',
                        objective='binary:logistic',
                        scale_pos_weight=2.0,
                        num_round=100)

In [165]:
xgb.fit(inputs={'train':s3_input_train,
                'validation':s3_input_validation})

2020-07-30 19:30:01 Starting - Starting the training job...
2020-07-30 19:30:04 Starting - Launching requested ML instances......
2020-07-30 19:31:23 Starting - Preparing the instances for training...
2020-07-30 19:31:55 Downloading - Downloading input data...
2020-07-30 19:32:35 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
[19:32:36] 5614x331 matrix with 33684 entries loaded from /opt/ml/input/data/train
[19:32:36]

## Host

In [166]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                          instance_type='ml.m5.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!

Calculating predictions on test data:

In [190]:
xgb_predictor.content_type = 'text/x-libsvm'
xgb_predictor.deserializer = None

def do_predict(data):
    payload = '\n'.join(data)
    response = xgb_predictor.predict(payload).decode('utf-8')
    result = response.split(',')
    preds = [float((num)) for num in result]
    preds = [round(num) for num in preds]
    return preds

def batch_predict(data, batch_size):
    # Slicing past the end of a list is safe, so the last (shorter)
    # batch needs no special case.
    arrs = []
    for offset in range(0, len(data), batch_size):
        arrs.extend(do_predict(data[offset:offset + batch_size]))
        sys.stdout.write('.')
    return arrs

In [191]:
%%time
import json

with open('./data/test.libsvm', 'r') as f:
    payload = f.read().strip()

labels = [int(line.split(' ')[0]) for line in payload.split('\n')]
test_data = [line for line in payload.split('\n')]
preds = batch_predict(test_data, 100)

print('\nerror rate=%f' % (sum(1 for p, l in zip(preds, labels) if p != l) / len(preds)))

.........
error rate=0.144458
Wall time: 624 ms


In [192]:
pd.crosstab(index=np.array(labels), columns=np.array(preds))

| row_0 (label) \ col_0 (prediction) | 0 | 1 |
|---|---|---|
| 0 | 616 | 55 |
| 1 | 61 | 71 |

In [203]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)