# Forest Cover Type 2a): SageMaker Autopilot

In this notebook, we'll tackle our Forest Cover Type classification problem using [**Amazon SageMaker Autopilot**](https://aws.amazon.com/sagemaker/autopilot/): a service that automatically trains and tunes the best machine learning models for classification or regression based on your data, while allowing you to maintain full control and visibility.

## Libraries and configuration

In [None]:
%load_ext autoreload
%autoreload 2

# External Dependencies:
import boto3
import numpy as np
import pandas as pd
import sagemaker
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# Local Dependencies:
import util

In [None]:
%store -r bucket_name
%store -r experiment_name
%store -r preproc_trial_component_name
%store -r project_id

bucket = boto3.resource("s3").Bucket(bucket_name)
role = sagemaker.get_execution_role()
smclient = boto3.client("sagemaker")
smsess = sagemaker.session.Session()

project_config = util.project.init(project_id)  # Read project stack parameters from the AWS SSM store
print(project_config)

## Training the model

For the purposes of **our Experiment**, the (best outcome of the) Autopilot approach is one trial to be compared against other qualitatively different approaches.

Autopilot will automatically log **its own Experiment** describing the different candidate pre-processing and modelling configurations it explored: We can think of this as a lower-level experiment contributing towards our overall Forest Cover exercise.

In [None]:
automl_trial = Trial.create(
    trial_name=util.append_timestamp("autopilot"),
    experiment_name=experiment_name,
    sagemaker_boto_client=smclient,
)
automl_trial.add_trial_component(preproc_trial_component_name)

preproc_trial_component = TrialComponent.load(preproc_trial_component_name)

With the [high-level SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/api/training/automl.html), defining and running an AutoML job is very similar to the `Estimator` API, but with higher-level parameters.

As always, it's possible to use the lower-level, cross-AWS [boto3 SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to achieve the same results, usually with more verbose code. The alternative boto3 syntax can be seen in the [official Autopilot samples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/autopilot).
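To give a feel for the difference, here is a rough sketch of how the same job definition might look as a request for boto3's `create_auto_ml_job` call. The helper `build_automl_request` and the example bucket, job name and role ARN are hypothetical placeholders, not part of this project; the sketch only builds the request dictionary and does not call AWS.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                         target="Cover_Type", max_candidates=30):
    """Assemble keyword arguments for the SageMaker create_auto_ml_job API.

    Hypothetical helper for illustration only: it mirrors the parameters
    passed to the high-level AutoML class in this notebook.
    """
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
                },
                "TargetAttributeName": target,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": "MulticlassClassification",
        "AutoMLJobObjective": {"MetricName": "Accuracy"},
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        "RoleArn": role_arn,
        "GenerateCandidateDefinitionsOnly": False,
    }


# Placeholder values (assumptions, not real resources):
request = build_automl_request(
    job_name="auto-forestcover-demo",
    train_s3_uri="s3://example-bucket/data/train.csv",
    output_s3_uri="s3://example-bucket/automl",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)

# Starting the job would need AWS credentials and permissions:
# boto3.client("sagemaker").create_auto_ml_job(**request)
```

The nesting of `InputDataConfig` and `AutoMLJobConfig` is where most of the extra verbosity comes from, compared with the flat keyword arguments of the high-level SDK.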

In [None]:
autoestimator = sagemaker.automl.automl.AutoML(
    role=role,
    sagemaker_session=smsess,
    target_attribute_name="Cover_Type",
    problem_type="MulticlassClassification",
    job_objective={ "MetricName": "Accuracy" },
    output_path=f"s3://{bucket_name}/automl",
    base_job_name="auto-forestcover",
    max_candidates=30,
    #max_runtime_per_training_job_in_seconds=None,
    #total_job_runtime_in_seconds=None,
    generate_candidate_definitions_only=False,
    tags=None,
)

Owing to the amount of parallel experimentation going on, Autopilot log streams can be overwhelming. Instead, we'll kick off the job asynchronously, then show a simple status spinner in the following cells.

Note in particular that we **use the `preproc_trial_component` to set the source data location**: creating these links directly in code wherever we can helps to ensure the integrity of our records, even if cells are re-run in different orders during debugging and iteration.

In [None]:
autoestimator.fit(
    [preproc_trial_component.output_artifacts["train-csv"].value],
    wait=False,
    logs=False,  # Streaming logs (logs=True) only works with wait=True
    # Might want to set the job name explicitly because the default gives you very few free prefix chars!
    #job_name=util.append_timestamp("auto-frstcv"),
)

auto_ml_job_name = autoestimator.current_job_name
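
If you do set the job name explicitly, it helps to check its length up front. The sketch below is a hypothetical stand-in (it is not the `util.append_timestamp` helper used in this notebook) and assumes a 32-character limit on AutoML job names and a `YYYYMMDD-HHMMSS` timestamp format; verify both against the current service limits before relying on them.

```python
from datetime import datetime, timezone

# Assumption: AutoML job names are limited to 32 characters.
MAX_AUTOML_JOB_NAME_LEN = 32


def timestamped_job_name(prefix, now=None, max_len=MAX_AUTOML_JOB_NAME_LEN):
    """Return '<prefix>-<timestamp>', truncating the prefix to fit max_len."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")  # Always 15 characters
    room = max_len - len(stamp) - 1  # Characters left for the prefix (1 for the hyphen)
    trimmed = prefix[:max(room, 0)].rstrip("-")
    if not trimmed:
        raise ValueError(f"No room for a prefix within {max_len} characters")
    return f"{trimmed}-{stamp}"


print(timestamped_job_name("auto-frstcv"))
```

With these assumptions the timestamp and separator consume 16 characters, leaving 16 for the prefix, which is why a short prefix such as `auto-frstcv` is suggested in the commented-out line above.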

In [None]:
def is_automl_status_done(status):
    if status["AutoMLJobStatus"] == "Completed":
        return True
    elif status["AutoMLJobStatus"] in ("Failed", "Stopped"):
        raise ValueError(f"Job ended in non-successful state '{status['AutoMLJobStatus']}'\n{status}")
    else:
        return False

util.progress.polling_spinner(
    autoestimator.describe_auto_ml_job,
    is_automl_status_done,
    fn_stringify_result=lambda status: f"{status['AutoMLJobStatus']} - {status['AutoMLJobSecondaryStatus']}",
    spinner_secs=0.4,
    poll_secs=30
)
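
The `util.progress.polling_spinner` helper is local to this project and its implementation is not shown here. As a rough idea of the underlying pattern only, a minimal polling loop without the spinner animation might look like the sketch below. The name `poll_until` and its parameters are hypothetical, and the canned responses simply imitate the shape of `describe_auto_ml_job` output.

```python
import time


def poll_until(fn_poll, fn_is_done, poll_secs=30, max_polls=None, sleep=time.sleep):
    """Call fn_poll() repeatedly until fn_is_done(result) returns True.

    fn_is_done may raise to signal a failed job, as is_automl_status_done does.
    max_polls optionally bounds the number of attempts; sleep is injectable so
    the loop can be exercised without real waiting.
    """
    polls = 0
    while True:
        result = fn_poll()
        polls += 1
        if fn_is_done(result):
            return result
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Not done after {polls} polls; last result: {result}")
        sleep(poll_secs)


# Demo with canned responses instead of a real AutoML job:
fake_responses = iter([
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "ModelTuning"},
    {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
])
final_status = poll_until(
    fn_poll=lambda: next(fake_responses),
    fn_is_done=lambda status: status["AutoMLJobStatus"] == "Completed",
    sleep=lambda secs: None,  # Skip the real delay in this demo
)
print(final_status["AutoMLJobStatus"])
```

In the notebook itself, `fn_poll` would be `autoestimator.describe_auto_ml_job` and `fn_is_done` would be `is_automl_status_done`.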

## Logging in our Experiment

Autopilot always creates an **Experiment** with associated Trials and Trial Components describing the detail of the flow it undertook.

For the purposes of **our Experiment** (as created in Notebook 1), which is to compare Autopilot with other methods, the Autopilot run is just one Trial and we only care about the best/selected results.

In [None]:
# describe_auto_ml_job() doesn't seem to give us anything to reconstruct what the Experiment name is, so
# we'll assume it was created with the AutoML job name + standard suffix:
automl_experiment = Experiment.load(f"{auto_ml_job_name}-aws-auto-ml-job")

In [None]:
# TODO: Extract relevant data from the 'best' Trial/Components of AutoML Experiment, and copy the info to a Trial in our Experiment
list(Trial.load(list(automl_experiment.list_trials())[0].trial_name).list_trial_components())
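
As one possible starting point for the TODO above, the job description itself may already carry the headline result: once candidates exist, the `describe_auto_ml_job` response is expected to include a `BestCandidate` entry. The sketch below pulls out the fields we would most likely want to copy to our own Trial. The helper name `summarize_best_candidate` is hypothetical, and the sample description uses made-up values purely to show the shape.

```python
def summarize_best_candidate(job_description):
    """Extract name and objective metric of the best candidate, if any.

    Returns None when the description has no 'BestCandidate' entry yet
    (for example while the job is still analyzing data).
    """
    best = job_description.get("BestCandidate")
    if best is None:
        return None
    metric = best.get("FinalAutoMLJobObjectiveMetric", {})
    return {
        "candidate_name": best["CandidateName"],
        "metric_name": metric.get("MetricName"),
        "metric_value": metric.get("Value"),
    }


# Made-up example in the shape of a describe_auto_ml_job response:
sample_description = {
    "AutoMLJobStatus": "Completed",
    "BestCandidate": {
        "CandidateName": "example-candidate-001",
        "FinalAutoMLJobObjectiveMetric": {
            "MetricName": "validation:accuracy",
            "Value": 0.5,
        },
    },
}
print(summarize_best_candidate(sample_description))
```

In the notebook, the input would be `autoestimator.describe_auto_ml_job()`, and the resulting values could then be recorded against `automl_trial`.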

## Deploy

In [None]:
autoestimator.deploy(...)