# Forest Cover Type 2a): SageMaker Autopilot

In this notebook, we'll tackle our Forest Cover Type classification problem using [**Amazon SageMaker Autopilot**](https://aws.amazon.com/sagemaker/autopilot/): a service that automatically trains and tunes the best machine learning models for classification or regression based on your data, while allowing you to maintain full control and visibility.

## Libraries and configuration

This notebook recovers configuration saved with `%store` from the first notebook, so if you've restarted your notebook instance / image you may need to re-run cells from the first notebook to re-save the values. See the [storemagic docs](https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html) for more details.

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import os
import time

# External Dependencies:
import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sagemaker
import seaborn as sn
from sklearn import metrics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# Local Dependencies:
import util

In [None]:
%store -r bucket_name
%store -r experiment_name
%store -r preproc_trial_component_name
%store -r project_id

s3 = boto3.client("s3")
bucket = boto3.resource("s3").Bucket(bucket_name)
role = sagemaker.get_execution_role()
smclient = boto3.client("sagemaker")
smsess = sagemaker.session.Session()

project_config = util.project.init(project_id)  # Read project stack parameters from the AWS SSM store
print(project_config)

## Training the model

For the purposes of **our Experiment**, the (best outcome of the) Autopilot approach is one trial to be compared against other qualitatively different approaches.

Autopilot will automatically log **its own Experiment** describing the different candidate pre-processing and modelling configurations it explored: We can think of this as a lower-level experiment contributing towards our overall Forest Cover exercise.

In [None]:
automl_trial = Trial.create(
    trial_name=util.append_timestamp("autopilot"), 
    experiment_name=experiment_name,
    sagemaker_boto_client=smclient,
)
automl_trial.add_trial_component(preproc_trial_component_name)

preproc_trial_component = TrialComponent.load(preproc_trial_component_name)

In [None]:
# (Or load existing trial instead)
#automl_trial = Trial.load("autopilot-2020-07-28-05-41-14")
#preproc_trial_component = TrialComponent.load(preproc_trial_component_name)

With the [high-level SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/api/training/automl.html), defining and running an AutoML job is very similar to the `Estimator` API, but with higher-level parameters.

As always, it's possible to use the lower-level, cross-AWS [boto3 SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to achieve the same results, usually with more verbose code. The alternative boto3 syntax can be seen in the [official Autopilot samples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/autopilot).

In [None]:
autoestimator = sagemaker.AutoML(
    role=role,
    sagemaker_session=smsess,
    target_attribute_name="Cover_Type",
    problem_type="MulticlassClassification",
    job_objective={ "MetricName": "Accuracy" },
    output_path=f"s3://{bucket_name}/automl",
    base_job_name="auto-forestcover",
    max_candidates=30,
    #max_runtime_per_training_job_in_seconds=None,
    #total_job_runtime_in_seconds=None,
    generate_candidate_definitions_only=False,
    tags=None,
)

Owing to the amount of parallel experimentation going on, Autopilot log streams can be a bit much... Instead, we'll asynchronously kick off the job then produce a simple status spinner in the cell below.

Note in particular that we **use the `preproc_trial_component` to set the source data location**: Anywhere we can directly create these links in our code will help to ensure the integrity of our records - even if cells are re-run in different orders during debugging and iteration.

In [None]:
autoestimator.fit(
    [preproc_trial_component.output_artifacts["train-csv"].value],
    wait=False,
    logs=False,  # logs=True only works with wait=True
    # Might want to set the job name explicitly because the default gives you very few free prefix chars!
    #job_name=util.append_timestamp("auto-frstcv"),
)

auto_ml_job_name = autoestimator.current_job_name

In [None]:
# (Or attach to a previous AutoML job)
#auto_ml_job_name = "auto-for-2020-06-26-09-43-01-819"
#autoestimator = sagemaker.AutoML.attach(auto_ml_job_name)

In [None]:
def is_automl_status_done(status):
    if status["AutoMLJobStatus"] == "Completed":
        return True
    elif status["AutoMLJobStatus"] in ("Failed", "Stopped"):
        raise ValueError(f"Job ended in non-successful state '{status['AutoMLJobStatus']}'\n{status}")
    else:
        return False

util.progress.polling_spinner(
    autoestimator.describe_auto_ml_job,
    is_automl_status_done,
    fn_stringify_result=lambda status: f"{status['AutoMLJobStatus']} - {status['AutoMLJobSecondaryStatus']}",
    spinner_secs=0.4,
    poll_secs=30
)

print("Done!")

## Reviewing the results

The AutoML job generates a set of **candidate** solutions, each of which is typically a [pipeline model](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) including feature pre- and post-processing containers as well as the core model container.

There are lots of tools available in the SDK to explore the results, including:

- Listing the candidate leaderboard
- Downloading the auto-created notebooks
- Drilling in to the detailed attributes and metrics for each candidate

In [None]:
job_description = autoestimator.describe_auto_ml_job()
job_description

In [None]:
candidates = autoestimator.list_candidates(
    sort_by="FinalObjectiveMetricValue",
    sort_order="Descending",
)
candidates_df = pd.DataFrame([
    {
        "CandidateName": candidate["CandidateName"],
        "InferenceContainers": len(candidate["InferenceContainers"]),
        "MetricValue": candidate["FinalAutoMLJobObjectiveMetric"]["Value"],
        "MetricName": candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"],
        # (Plenty of other fields we could plot here if we wanted!)
    }
    for candidate in candidates
])
candidates_df

Downloading the notebooks from the Autopilot job lets us actually see the data exploration and candidate model generation code - which we could use as a starting point to customize the candidates manually and improve performance even further.

If there's time, you're encouraged to go open up these notebooks and see what's there!

In [None]:
result_folder = os.path.join("data", "automl-results", job_description["AutoMLJobName"])
os.makedirs(result_folder, exist_ok=True)
for item in job_description["AutoMLJobArtifacts"]:
    artifact_uri = job_description["AutoMLJobArtifacts"][item]
    artifact_name = item.replace("Location", "")
    artifact_bucket, artifact_key = util.boto.s3uri_to_bucket_and_key(artifact_uri)
    filename = os.path.join(result_folder, artifact_key.rpartition("/")[2])
    print(f"Downloading {artifact_name} to {filename}")
    s3.download_file(artifact_bucket, artifact_key, filename)
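
The `util.boto.s3uri_to_bucket_and_key` function used above is another project-local helper whose source isn't shown here. A minimal sketch of what such a function might do, assuming it simply splits an `s3://bucket/key` URI into its two parts (the real helper may behave differently, and the example URI is made up):

```python
from typing import Tuple


def s3uri_to_bucket_and_key(s3uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Illustrative stand-in only: not the project's actual util.boto implementation.
    """
    prefix = "s3://"
    if not s3uri.lower().startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3uri}")
    bucket_name, _, key = s3uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"S3 URI has no bucket name: {s3uri}")
    return bucket_name, key


# Made-up URI, for illustration:
example_bucket, example_key = s3uri_to_bucket_and_key(
    "s3://example-bucket/automl/notebooks/ExampleNotebook.ipynb"
)
# Same trick as the cell above to get just the file name from the key:
local_filename = example_key.rpartition("/")[2]
print(example_bucket, example_key, local_filename)
```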

## Testing top model candidates (Batch Transform)

Arguably, the comparison we're going to make here against TabNet isn't strictly fair (and the terminology might get a bit confusing), because Autopilot did its own train/validation splitting **within** the train dataset we gave it.

However, in the context of our experiment, we'll test the top **N** candidates proposed by Autopilot against our project's validation dataset - and compare their performance to the TabNet model in the next notebook.

<div class="alert alert-info"><b>Note:</b> We consider the <b>top N</b> candidates, rather than the winner alone, in case their performance on our validation dataset ranks differently than their observed metrics on Autopilot's "validation" split of the training dataset (which is all it had access to).</div>
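
To make the point in the note concrete, here is a small sketch with **entirely hypothetical scores** (the candidate names and numbers are invented for illustration, not results from this project). It shows how a ranking by Autopilot's internal metric can disagree with a ranking by accuracy on a separate hold-out dataset:

```python
# Hypothetical scores for illustration only (NOT results from this project):
autopilot_metric = {"candidate-a": 0.912, "candidate-b": 0.910, "candidate-c": 0.905}
holdout_accuracy = {"candidate-a": 0.897, "candidate-b": 0.903, "candidate-c": 0.899}

# Rank candidate names from best to worst under each scoring:
autopilot_ranking = sorted(autopilot_metric, key=autopilot_metric.get, reverse=True)
holdout_ranking = sorted(holdout_accuracy, key=holdout_accuracy.get, reverse=True)

print("Autopilot ranking:", autopilot_ranking)
print("Hold-out ranking: ", holdout_ranking)
print("Same winner?", autopilot_ranking[0] == holdout_ranking[0])
```

When the top few candidates score this closely, testing only the single winner could miss a model that generalizes slightly better.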

When creating models from Autopilot candidates, we can [configure the output types](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development-container-output.html) as below - selecting e.g. just the predicted label, the prediction confidence, and/or the confidences associated with every label.

In [None]:
N_TEST_CANDIDATES = 3  # Could be increased or reduced

# We use predicted_label below for accuracy measurement, and also collect probabilities to be comparable to
# the TabNet model in the next notebook:
inference_response_keys = ["predicted_label", "probabilities"]

In [None]:
models = []
transformers = []

# The candidates list is already sorted by metric (above)
for candidate in candidates[0:N_TEST_CANDIDATES]:
    model = autoestimator.create_model(
        name=candidate["CandidateName"],
        candidate=candidate,
        inference_response_keys=inference_response_keys,
    )
    models.append(model)

    # Note the model isn't actually registered with the API until a transformer or endpoint is created:
    # (because CPU vs GPU instance type affects the target container)
    transformer = model.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        accept="text/csv",  # Need to specify input and output types when using filters (below)
        assemble_with="Line",  # Join the predictions back together in to one output file with newlines
        output_path=f"s3://{bucket_name}/automl/test-transforms/{candidate['CandidateName']}/",
    )
    transformers.append(transformer)

    # Now the model has been registered, we can tag it:
    model_desc = smclient.describe_model(ModelName=transformer.model_name)
    smclient.add_tags(
        ResourceArn=model_desc["ModelArn"],
        Tags=[
            { "Key": "ExperimentName", "Value": experiment_name },
            { "Key": "TrialName", "Value": automl_trial.trial_name },
        ],
    )

Now we're ready to kick off each candidate's transform job.

Note that we:

- Use our `TrialComponent` tracking as the record of the test CSV's location
- Filter out the target column (we don't want to give the model the answer!)
- **Join** the model outputs back on to the input columns in the result CSV - to save ourselves having to reconcile later here in the notebook
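
Conceptually, the filter-then-join behaviour described in the list above works per record as in the sketch below. This is a plain-Python illustration under stated assumptions, not how Batch Transform is implemented: `filter_predict_join` and `dummy_model` are hypothetical names, the record is made up, and the sketch removes the target by position rather than reproducing the JSONPath index semantics of the notebook's `input_filter`.

```python
from typing import Callable, List

seen_by_model: List[List[str]] = []


def dummy_model(features: List[str]) -> List[str]:
    """Placeholder model: records what it was given and returns a fixed prediction."""
    seen_by_model.append(features)
    return ["2", "[0.1, 0.9]"]  # Made-up predicted_label and probabilities


def filter_predict_join(
    row: List[str],
    target_index: int,
    fn_model: Callable[[List[str]], List[str]],
) -> List[str]:
    """Hide the target from the model, then append its outputs to the FULL input row."""
    model_input = [value for ix, value in enumerate(row) if ix != target_index]
    return row + fn_model(model_input)


row = ["2596", "51", "3", "5"]  # Made-up record: three features, then the target
joined = filter_predict_join(row, target_index=len(row) - 1, fn_model=dummy_model)
print(joined)
```

The joined record still contains the target value even though the model never saw it, which is what lets us score predictions directly from the output file.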


In [None]:
for ix, transformer in enumerate(transformers):
    transformer.transform(
        preproc_trial_component.output_artifacts["test-csv"].value,
        split_type="Line",
        content_type="text/csv",  # Need to specify input and output types when using filters
        # TODO: Check why -2 is required vs -1 per JSONPath spec for trimming last column
        input_filter="$[:-2]",  # Exclude target column from input to the model
        join_source="Input",  # Store both input and output in the result (saves us re-joining in notebook)
        # No output_filter so our output will be all source columns (incl target) + all prediction columns
        experiment_config={
            "ExperimentName": experiment_name,
            "TrialName": automl_trial.trial_name,
            "TrialComponentDisplayName": f"Test-{candidates[ix]['CandidateName']}",
        },
        wait=False,
        logs=False,
    )
    print(f"Started transform job {transformer._current_job_name}")

...And then wait for them all to complete:

In [None]:
for transformer in transformers:
    transformer.wait(logs=False)
print("All test transform jobs complete")

Each candidate should now have a transform output CSV containing:

- All the columns from the input dataset
- The target column (which we filtered from going to the model with `input_filter`, but which still gets passed through to the output file)
- The outputs we requested including `predicted_label` and `probabilities`

Here we'll configure a loader to read in and transform the result CSVs, and then loop through the candidates printing out some metrics and visualizations:

In [None]:
# The list of training columns was saved in data prep:
with open("data/columns.json", "r") as f:
    train_columns = json.load(f)

# TODO: Save from data prep, as we did with columns
# Note our first cover_type is a dummy because the dataset's encoding starts at 1.
cover_types = ("N/A", "Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz")

def standardize_results_df(filepath: str) -> pd.DataFrame:
    """Load a test transform result CSV and standardize its columns"""
    df_test_results = pd.read_csv(
        filepath,
        names=train_columns[:-1] + ["Actual_Cover_Type"] + inference_response_keys,
    )

    if "probabilities" in inference_response_keys:
        # By default this field is a JSON-stringified array of numbers, so we'll unpack it into a dataframe
        # and name the columns. (See https://stackoverflow.com/a/36816769/13352657 for .apply(pd.Series))
        probs_df = df_test_results["probabilities"].apply(json.loads).apply(pd.Series)
        probs_df.columns = ["Pred " + typ for typ in cover_types[1:]]
        df_test_results.drop(columns=["probabilities"], inplace=True)
        df_test_results = pd.concat([df_test_results, probs_df], axis=1)

    if "predicted_label" in inference_response_keys:
        df_test_results.rename(columns={ "predicted_label": "Pred_Cover_Type" }, inplace=True)
        df_test_results["Pred_Correct"] = (
            df_test_results["Pred_Cover_Type"] == df_test_results["Actual_Cover_Type"]
        )
        
    return df_test_results

In [None]:
scores = []
test_root_filename = preproc_trial_component.output_artifacts["test-csv"].value.rpartition("/")[2]
for ix, transformer in enumerate(transformers):
    print("Analysing result for model: {}".format(transformer.model_name))
    os.makedirs(f"data/test/{transformer.model_name}", exist_ok=True)
    candidate = candidates[ix]
    bucket.download_file(
        f"automl/test-transforms/{candidate['CandidateName']}/{test_root_filename}.out",  # Batch Transform appends ".out"
        f"data/test/{transformer.model_name}/{test_root_filename}",
    )

    df_test_results = standardize_results_df(f"data/test/{transformer.model_name}/{test_root_filename}")

    n_correct = sum(df_test_results["Pred_Correct"])
    n_tested = len(df_test_results)
    print(f"{n_correct} of {n_tested} samples correct: Accuracy={n_correct/n_tested:.3%}")
    scores.append(n_correct/n_tested)

    confusion = metrics.confusion_matrix(df_test_results["Actual_Cover_Type"], df_test_results["Pred_Cover_Type"])
    plt.figure(figsize = (10,7))
    sn.heatmap(
        pd.DataFrame(
            confusion,
            index = cover_types[1:],
            columns = cover_types[1:],
        ),
        annot=True
    )
    plt.show()

ixbest = np.argmax(scores)
print(f"\nBest model ix {ixbest}: {models[ixbest].name}\nwith score of {scores[ixbest]:.3%}")

Here we have [confusion matrices](https://en.wikipedia.org/wiki/Confusion_matrix) and accuracy scores for each of the top-N candidate models that we brought forward for analysis.
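
For readers less familiar with how these matrices are built, the following sketch counts one by hand on a made-up 3-class toy problem (the labels are invented and `confusion_counts` is a hypothetical helper, not part of scikit-learn). It follows the same convention as `metrics.confusion_matrix`, with actual classes on the rows and predicted classes on the columns, and uses labels starting at 1 like our cover types:

```python
from typing import List, Sequence


def confusion_counts(
    actual: Sequence[int],
    predicted: Sequence[int],
    n_classes: int,
) -> List[List[int]]:
    """Count (actual, predicted) pairs for labels 1..n_classes.

    Rows are actual classes and columns are predicted classes.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a - 1][p - 1] += 1  # Shift 1-based labels to 0-based indices
    return matrix


# Made-up labels for a 3-class toy problem:
actual = [1, 1, 2, 2, 3, 3]
predicted = [1, 2, 2, 2, 3, 1]

matrix = confusion_counts(actual, predicted, n_classes=3)
# Correct predictions sit on the diagonal, so accuracy is the diagonal sum over the total:
accuracy = sum(matrix[i][i] for i in range(3)) / len(actual)

for matrix_row in matrix:
    print(matrix_row)
print(f"Accuracy={accuracy:.3%}")
```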

You might have found that index 0 was the "best" model on our test set - which is encouraging, because it means Autopilot's metrics (from internal validation) generalized well to the unseen test dataset.


## Logging Results in Our Experiment

Because Autopilot is internally experimenting with different featurizations and algorithm configurations, it **creates its own Experiment** with associated Trials and Trial Components describing the detail of the flow it undertook.

As we've seen so far, there's a lot of tracking data available for us to explore the Autopilot results (and further refine the candidates it produced).

For the purposes of **our Experiment** though (as created in Notebook 1) - which is to compare Autopilot with other methods - the Autopilot run is just one Trial and we only care about the best/selected outputs.

We've already logged our N test transforms (`transform()` calls above), and the pre-processing step, and there are a couple of different strategies we could take for logging other aspects:

- Verbose: Copy all Trial Components from the *winning* Autopilot Trial into our AutoML Trial
- Concise: Create a custom Trial Component called something like "Training" which just logs the fact that models were created via Autopilot (including what parameters were provided), and links to the Autopilot Experiment

Below we show the Verbose approach. For the Concise alternative, you'd be creating a custom Trial Component and adding parameters/artifacts, much like we did for Pre-processing:
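
As a starting point for the Concise alternative, the sketch below picks out a handful of fields from an AutoML job description that might be worth recording. It is an illustration under assumptions: `summarize_automl_job` is a hypothetical helper, the example description is made up, and the choice of fields is arbitrary. The resulting dictionary could then be logged on a custom Trial Component, for example via the `log_parameters` method of the `Tracker` class imported at the top of this notebook.

```python
from typing import Any, Dict


def summarize_automl_job(job_description: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out a few fields worth logging as Trial Component parameters."""
    best = job_description.get("BestCandidate", {})
    metric = best.get("FinalAutoMLJobObjectiveMetric", {})
    return {
        "automl_job_name": job_description.get("AutoMLJobName"),
        "best_candidate_name": best.get("CandidateName"),
        "objective_metric_name": metric.get("MetricName"),
        "objective_metric_value": metric.get("Value"),
    }


# Made-up description with the same field names used elsewhere in this notebook:
example_description = {
    "AutoMLJobName": "example-automl-job",
    "BestCandidate": {
        "CandidateName": "example-candidate",
        "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.9},
    },
}

training_params = summarize_automl_job(example_description)
print(training_params)
```

The cells that follow continue with the Verbose approach.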

In [None]:
# describe_auto_ml_job() doesn't seem to give us anything to reconstruct what the Experiment name is, so
# we'll assume it was created with the AutoML job name + standard suffix:
automl_experiment = Experiment.load(f"{auto_ml_job_name}-aws-auto-ml-job")

In [None]:
# TODO: This ignores our ixbest, in case it wasn't 0:
best_candidate_name = job_description["BestCandidate"]["CandidateName"]
print(f"Searching for {best_candidate_name} in logged Experiment...")

matching_trials = [
    t for t in automl_experiment.list_trials() if t.display_name == f"{best_candidate_name}-aws-trial"
]
n_matching_trials = len(matching_trials)

if n_matching_trials > 1:
    raise ValueError("Found {} possible AutoML trials for best candidate {}:\n\n{}".format(
        n_matching_trials,
        best_candidate_name,
        matching_trials,
    ))
elif n_matching_trials < 1:
    raise ValueError("Couldn't find AutoML trial matching candidate {}:\n\n{}".format(
        best_candidate_name,
        list(automl_experiment.list_trials()),
    ))

matching_trial = Trial.load(matching_trials[0].trial_name)
print(f"Found exactly one matching trial:\n{matching_trial.trial_name}")

In [None]:
for component in matching_trial.list_trial_components():
    automl_trial.add_trial_component(component.trial_component_name)

## Pipeline-based deployment

We'll skip over adding the Autopilot model to our project model registry and submitting it for production deployment here. We'll revisit the topic in the next notebook instead, because the model artifact there is a simpler, single-container model rather than an inference pipeline!


## A quick demo of real-time deployment

Our main workflow is to test the model against an offline dataset, and in general the SageMaker architecture makes batch transforms and real-time deployment pretty much interchangeable with no code change.

However, there might be some cases where we want to quickly experiment with deploying our model to a test endpoint in case the real-time feed format is intended to be slightly different from the offline dataset.

That direct deployment could look something like this:

In [None]:
try:
    predictor.delete_endpoint()
    time.sleep(5)  # (Otherwise can trigger errors if creating again immediately)
except Exception:  # e.g. NameError if no predictor has been created yet
    pass

predictor = autoestimator.deploy(
    endpoint_name=auto_ml_job_name,
    initial_instance_count=1,
    instance_type="ml.m5.large",
    inference_response_keys=inference_response_keys,
    predictor_cls=sagemaker.predictor.RealTimePredictor,
    #wait=False
)

In [None]:
# (Or attach to an existing endpoint)
# predictor = sagemaker.predictor.RealTimePredictor("...")

In [None]:
# Because we're using the default RealTimePredictor class, we need to explicitly configure for CSV:
predictor.accept = "text/csv"
predictor.content_type = "text/csv"
predictor.serializer = sagemaker.predictor.csv_serializer
predictor.deserializer = sagemaker.predictor.csv_deserializer

In [None]:
with open("data/columns.json", "r") as f:
    train_columns = json.load(f)
df_test = pd.read_csv(
    "data/test-noheader.csv",
    names=train_columns
)

In [None]:
# The csv_serializer is capable of processing array-like objects, so we'll use Pandas to filter our data
# (remove target column and send in only a small batch of rows), but then convert to numpy:
result = predictor.predict(df_test.drop("Cover_Type", axis=1).iloc[0:10].to_numpy())
result

Note that the `probabilities` are reported in the same JSON-stringified array format as we saw earlier when interpreting the batch transform results, so these would need to be unpacked for numerical analysis.
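
A minimal sketch of that unpacking, using only the standard library, might look like the following. The response text here is made up, and its layout (a `predicted_label` followed by a quoted JSON array of `probabilities`) is an assumption based on the `inference_response_keys` requested earlier; the class list is truncated to three entries for brevity.

```python
import csv
import io
import json

# Made-up response body in the assumed layout: predicted_label, then a quoted JSON array
raw_csv = '2,"[0.10, 0.85, 0.05]"\n1,"[0.70, 0.20, 0.10]"\n'

# Truncated to 3 classes for brevity (the real dataset has 7 cover types):
class_names = ["Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine"]

records = []
for predicted_label, probabilities_json in csv.reader(io.StringIO(raw_csv)):
    probabilities = json.loads(probabilities_json)  # JSON string -> list of floats
    record = {"predicted_label": int(predicted_label)}
    # One named column per class, mirroring the "Pred ..." columns used for batch results:
    for name, prob in zip(class_names, probabilities):
        record[f"Pred {name}"] = prob
    records.append(record)

for record in records:
    print(record)
```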