## Comet.ml: SageMaker Random Cut Forests Introduction Integration

The code below is taken directly from Amazon SageMaker's official [An Introduction to SageMaker Random Cut Forests](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb) notebook.

Most of the descriptive text has been removed, but the code is essentially unchanged.

Follow along below to learn how to log SageMaker training jobs to Comet.ml.

#### Install the `comet_ml_sagemaker` Python package

Comet's SageMaker integration is available to Enterprise customers only. If you would like to learn more about Comet Enterprise, or you are in a trial period with Comet.ml and would like to evaluate the SageMaker integration, email support@comet.ml to request credentials for downloading the required packages.

### Select Amazon S3 Bucket

In [None]:
import boto3
import botocore
import sagemaker
import sys


bucket = "NAME_YOUR_BUCKET"  # <--- specify a bucket you have access to
prefix = "sagemaker/rcf-benchmarks"
execution_role = sagemaker.get_execution_role()


# check if the bucket exists
try:
    boto3.Session().client("s3").head_bucket(Bucket=bucket)
except botocore.exceptions.ParamValidationError as e:
    print(
        "Hey! You either forgot to specify your S3 bucket"
        " or you gave your bucket an invalid name!"
    )
except botocore.exceptions.ClientError as e:
    if e.response["Error"]["Code"] == "403":
        print("Hey! You don't have permission to access the bucket, {}.".format(bucket))
    elif e.response["Error"]["Code"] == "404":
        print("Hey! Your bucket, {}, doesn't exist!".format(bucket))
    else:
        raise
else:
    print("Training input/output will be stored in: s3://{}/{}".format(bucket, prefix))

### Obtain and Inspect Example Data

In [None]:
%%time

import pandas as pd
import urllib.request

data_filename = "nyc_taxi.csv"
data_source = "https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv"

urllib.request.urlretrieve(data_source, data_filename)
taxi_data = pd.read_csv(data_filename, delimiter=",")
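
The cell above only downloads and loads the data. A minimal sketch of the inspection step follows. The `summarize` helper is hypothetical, and the inline rows are made-up stand-in values, not real taxi counts. The two-column `timestamp`/`value` layout is an assumption about the CSV; only the `value` column is confirmed by the training code further down.

```python
import io

import pandas as pd

# Made-up stand-in rows in the assumed two-column layout of nyc_taxi.csv.
sample_csv = io.StringIO(
    "timestamp,value\n"
    "2014-07-01 00:00:00,100\n"
    "2014-07-01 00:30:00,80\n"
    "2014-07-01 01:00:00,60\n"
)
sample = pd.read_csv(sample_csv, delimiter=",")


def summarize(df):
    """Collect a few basic facts worth checking before training."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "missing_values": int(df["value"].isna().sum()),
        "min": float(df["value"].min()),
        "max": float(df["value"].max()),
    }


summary = summarize(sample)
print(sample.head())  # first rows of the frame
print(summary)
```

In the notebook itself, `summarize(taxi_data)` and `taxi_data.head()` would give the same kind of overview for the downloaded file.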

### Training

#### Hyperparameters

In [None]:
from sagemaker import RandomCutForest

session = sagemaker.Session()

# specify general training job information
rcf = RandomCutForest(
    role=execution_role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    data_location="s3://{}/{}/".format(bucket, prefix),
    output_path="s3://{}/{}/output".format(bucket, prefix),
    num_samples_per_tree=512,
    num_trees=50,
)

# automatically upload the training data to S3 and run the training job
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
rcf.fit(rcf.record_set(taxi_data.value.to_numpy().reshape(-1, 1)))
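
The `reshape(-1, 1)` call in the cell above turns the one-dimensional `value` column into a two-dimensional array with one feature per row, which is the shape `record_set` is documented to expect. A small sketch of that step, using made-up numbers in place of the taxi counts:

```python
import numpy as np

# Made-up stand-in for the one-dimensional `taxi_data.value` column.
values = np.array([100.0, 80.0, 60.0, 75.0])

# -1 lets NumPy infer the row count; 1 fixes a single feature column.
features = values.reshape(-1, 1)

print(values.shape)    # (4,)
print(features.shape)  # (4, 1)
```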

### Logging to Comet.ml

Define your Comet [REST API key](https://www.comet.com/docs/rest-api/getting-started/) and your [workspace](https://www.comet.com/docs/user-interface/#workspaces). See the [configuration documentation](http://docs.comet.ml/python-sdk/advanced/#python-configuration) for details on how to set both.

In [None]:
COMET_REST_API = "YOUR_API_KEY"
COMET_WORKSPACE = "YOUR_WORKSPACE"
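
To keep credentials out of the notebook, the two values can instead be read from environment variables. The sketch below assumes the variables `COMET_API_KEY` and `COMET_WORKSPACE`, which are the names used by Comet's general Python configuration. Whether `comet_ml_sagemaker` reads them on its own is not confirmed here, so the values are still passed explicitly. The `comet_setting` helper is hypothetical.

```python
import os


def comet_setting(env_name, placeholder):
    """Return the environment value if set, otherwise the placeholder."""
    return os.environ.get(env_name, placeholder)


# Fall back to the notebook's placeholders when the variables are not set.
COMET_REST_API = comet_setting("COMET_API_KEY", "YOUR_API_KEY")
COMET_WORKSPACE = comet_setting("COMET_WORKSPACE", "YOUR_WORKSPACE")
```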

Import the `comet_ml_sagemaker` package.

In [None]:
import comet_ml_sagemaker

### comet_ml_sagemaker.log_sagemaker_job(estimator/regressor, api_key, workspace, project_name)
Logs a SageMaker training job based on an estimator/regressor object.

* estimator/regressor = Sagemaker estimator/regressor object
* api_key = your Comet REST API key
* workspace = your Comet workspace
* project_name = your Comet project_name

In [None]:
# Use this when you trained a model with the SageMaker SDK and still have the
# estimator/regressor object. Arguments: the estimator/regressor object, the
# Comet REST API key (optional; it can be taken from the usual config source),
# the Comet workspace, and the Comet project name.
experiment = comet_ml_sagemaker.log_sagemaker_job(
    rcf, api_key=COMET_REST_API, workspace=COMET_WORKSPACE, project_name="sagemaker"
)
print(experiment.url)
experiment.add_tags(["random_forest"])

### comet_ml_sagemaker.log_sagemaker_job_by_name(job_name, api_key, workspace, project_name)
Logs a specific SageMaker training job, identified by its job name from the SageMaker SDK.

* job_name = CloudWatch/SageMaker training job name
* api_key = your Comet REST API key
* workspace = your Comet workspace
* project_name = your Comet project_name

In [None]:
# Use this when you have the name of a completed training job you want to log.
# Same as log_sagemaker_job, except that you pass the job name instead of the
# regressor/estimator object.
SAGEMAKER_TRAINING_JOB_NAME = "SAGEMAKER_TRAINING_JOB_NAME"
experiment = comet_ml_sagemaker.log_sagemaker_job_by_name(
    SAGEMAKER_TRAINING_JOB_NAME,
    api_key=COMET_REST_API,
    workspace=COMET_WORKSPACE,
    project_name="sagemaker",
)
print(experiment.url)
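
If the job name is not at hand, it can be looked up before calling `log_sagemaker_job_by_name`. The sketch below is one possible approach, not part of `comet_ml_sagemaker`. Both helper names are hypothetical. `fetch_recent_job_summaries` uses the boto3 SageMaker client's `list_training_jobs` call and needs valid AWS credentials, so it is defined but not called here. The example summaries are made up and only mimic the `TrainingJobName` and `CreationTime` fields of the response.

```python
from datetime import datetime, timezone


def newest_job_name(summaries):
    """Return the TrainingJobName with the latest CreationTime, or None."""
    if not summaries:
        return None
    newest = max(summaries, key=lambda job: job["CreationTime"])
    return newest["TrainingJobName"]


def fetch_recent_job_summaries(max_results=10):
    """Fetch recent training job summaries (requires AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    response = boto3.client("sagemaker").list_training_jobs(
        SortBy="CreationTime", SortOrder="Descending", MaxResults=max_results
    )
    return response["TrainingJobSummaries"]


# Made-up summaries shaped like the entries in the boto3 response.
example_summaries = [
    {
        "TrainingJobName": "job-a",
        "CreationTime": datetime(2020, 1, 1, tzinfo=timezone.utc),
    },
    {
        "TrainingJobName": "job-b",
        "CreationTime": datetime(2020, 1, 2, tzinfo=timezone.utc),
    },
]
print(newest_job_name(example_summaries))  # job-b
```

With credentials configured, `newest_job_name(fetch_recent_job_summaries())` would yield a value suitable for `SAGEMAKER_TRAINING_JOB_NAME`.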

### comet_ml_sagemaker.log_last_sagemaker_job(api_key, workspace, project_name)
Logs the most recently *started* SageMaker training job, based on the current configuration.

* api_key = your Comet REST API key
* workspace = your Comet workspace
* project_name = your Comet project_name

In [None]:
# Logs the last job found under your current AWS configuration (region and credentials)
experiment = comet_ml_sagemaker.log_last_sagemaker_job(
    api_key=COMET_REST_API, workspace=COMET_WORKSPACE, project_name="sagemaker"
)
print(experiment.url)
experiment.add_tags(["random_forest"])

#### Note on SageMaker configuration

The Comet.ml SageMaker integration uses boto to find your training job data. Refer to the [boto documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html) to configure the region and/or credentials if needed.
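
One way to supply that configuration is through the standard environment variables that boto3 reads, set before any AWS call is made. The sketch below uses only the standard library. The helper name is hypothetical, and `us-east-1` is just an example region, not a recommendation.

```python
import os

# Environment variables that boto3 consults for region and credentials.
AWS_VARIABLES = (
    "AWS_DEFAULT_REGION",
    "AWS_PROFILE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def aws_settings_in_environment(environ=None):
    """Report which boto3-related variables are set, without their values."""
    if environ is None:
        environ = os.environ
    return {name: name in environ for name in AWS_VARIABLES}


# Set an example region only if none is configured yet.
os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1")
print(aws_settings_in_environment())
```

Variables missing here may still be covered by the shared `~/.aws/config` and `~/.aws/credentials` files, which boto3 also reads.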