# Forest CoverType 2c): Scikit-Learn

> *This notebook works well with the `Python 3 (Data Science)` kernel in [SageMaker Studio](https://aws.amazon.com/sagemaker/studio/)*

In this notebook, we'll tackle our Forest Cover Type classification problem using [**scikit-learn's RandomForestClassifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import os

# External Dependencies:
import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sagemaker
from sagemaker.sklearn import SKLearn as SKLearnEstimator
import seaborn as sn
from sklearn import metrics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# Local Dependencies:
import util

In [None]:
%store -r experiment_name
%store -r preproc_trial_component_name
%store -r project_id
%store -r target_model

smclient = boto3.client("sagemaker")
role = sagemaker.get_execution_role()
smsess = sagemaker.session.Session()

project = util.project.init(project_id, role)
print(project)

sandbox_bucket = boto3.resource("s3").Bucket(project.sandbox.sandbox_bucket)

## Training the model

For the purposes of our **Experiment**, the (best outcome of the) SKLearn approach is one trial to be compared against other qualitatively different approaches.

If we planned to iterate extensively on model parameters, we might also define **another Experiment** to track those tests leading up to the higher-level business outcome.

In [None]:
randfor_trial = Trial.create(
    trial_name=util.append_timestamp("randfor-sklearn"), 
    experiment_name=experiment_name,
    sagemaker_boto_client=smclient,
)
randfor_trial.add_trial_component(preproc_trial_component_name)

preproc_trial_component = TrialComponent.load(preproc_trial_component_name)

The definition of the Estimator should be quite familiar to experienced framework container users:

In [None]:
hyperparameters = {
    "target": "Cover_Type",
    "seed": 1337,
}

estimator = SKLearnEstimator(
    role=role,
    entry_point="main.py",
    source_dir="src_sklearn",
    framework_version="0.23-1",
    py_version="py3",

    base_job_name="forestcover-randfor",
    output_path=f"s3://{sandbox_bucket.name}/trainjobs",
    checkpoint_s3_uri=f"s3://{sandbox_bucket.name}/trainjobs",

    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters=hyperparameters,
    metric_definitions=[
        { "Name": "train:accuracy", "Regex": r"train:accuracy=(.*?);", },
        { "Name": "validation:accuracy", "Regex": r"validation:accuracy=(.*?);", },
    ],
    enable_sagemaker_metrics=True,
)

...But one important difference is the `experiment_config` parameter for `.fit()`, which automatically creates a training job-linked Trial Component for us:

In [None]:
estimator.fit(
    inputs={
        "train": preproc_trial_component.output_artifacts["train-csv"].value,
        "validation": preproc_trial_component.output_artifacts["validation-csv"].value,
    },
    experiment_config={
        # This will create a TrainingJob-linked TrialComponent and automatically attach hyperparameters etc
        "TrialName": randfor_trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    #wait=False,
)
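
The `metric_definitions` passed to the Estimator are regular expressions that get matched against the training job's log output. As a minimal sketch of how those two patterns pick out values, the snippet below applies them to a made-up log line. The line's format is an assumption for illustration: the real output depends on the training script in `src_sklearn`, which isn't shown here.

```python
import re

# Same patterns as the metric_definitions passed to the Estimator
metric_definitions = [
    {"Name": "train:accuracy", "Regex": r"train:accuracy=(.*?);"},
    {"Name": "validation:accuracy", "Regex": r"validation:accuracy=(.*?);"},
]

# Hypothetical log line: the real format depends on the training script
log_line = "train:accuracy=0.9512; validation:accuracy=0.9034;"


def extract_metrics(line, definitions):
    """Return {metric name: float value} for every pattern that matches `line`."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


parsed_metrics = extract_metrics(log_line, metric_definitions)
print(parsed_metrics)
```

Note that both patterns require a trailing `;` after the value, so a log line that omits it would not produce a metric.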

In [None]:
# OR re-attach to previous training job by name, e.g:
#estimator = SKLearnEstimator.attach("...")

## Testing the model (Batch Transform)

We've trained something, and we have validation metrics that should give us an idea of its performance. At some point, though, we'll want to try out inference to validate that everything works (and perhaps see what performance we can expect on our test set).

In [None]:
training_job_desc = estimator.latest_training_job.describe()
model_path = training_job_desc["ModelArtifacts"]["S3ModelArtifacts"]
model_name = training_job_desc["TrainingJobName"]

In [None]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="MultiRecord",
    max_payload=1,  # 1MB
    accept="text/csv",  # Need to specify input and output types when using filters for .transform()
    assemble_with="Line",
    output_path=f"s3://{sandbox_bucket.name}/test/{model_name}",
)

In [None]:
# The API model is created now, so let's tag it straight away with Boto3:
model_desc = smclient.describe_model(ModelName=transformer.model_name)
smclient.add_tags(
    ResourceArn=model_desc["ModelArn"],
    Tags=[
        { "Key": "ExperimentName", "Value": experiment_name },
        { "Key": "TrialName", "Value": randfor_trial.trial_name },
    ]
)

We'll use SageMaker Batch Transform's [data processing functionality](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to control how data gets batched into our algorithm, and to [associate output predictions with the input data](https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/), which will make measuring accuracy much easier later.

As with the training job, providing an `experiment_config` lets the transform job create a **Trial Component** for us, recording the test:

In [None]:
transformer.transform(
    preproc_trial_component.output_artifacts["test-csv"].value,
    split_type="Line",
    content_type="text/csv",  # Need to specify input and output types when using filters
    # TODO: Check why -2 is required vs -1 per JSONPath spec for trimming last column
    input_filter="$[:-2]",  # Exclude target column from input to the model
    join_source="Input",  # Store both input and output in the result (saves us re-joining in notebook)
    # No output_filter so our output will be all source columns (incl target) + all prediction columns
    experiment_config={
        "ExperimentName": experiment_name,
        "TrialName": randfor_trial.trial_name,
        "TrialComponentDisplayName": "Test",
    },
    wait=True,
    logs=True,
)
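
To make the effect of `join_source="Input"` concrete, here is a plain-Python sketch of what happens to a single CSV record: the model only sees the filtered feature columns, but its predictions are appended to the full original record, target included. This is an illustration only, not SageMaker's implementation. `simulate_join`, `fake_predict` and the sample record are all hypothetical, and the sketch deliberately sidesteps the exact JSONPath slice semantics.

```python
def simulate_join(csv_record, n_target_columns, predict_fn):
    """Mimic join_source="Input" for one CSV record (illustration only).

    The model sees the record minus its trailing target column(s), and the
    predictions are appended to the *full* original record.
    """
    fields = csv_record.split(",")
    features = fields[: len(fields) - n_target_columns]
    predictions = predict_fn(features)
    return ",".join(fields + [str(p) for p in predictions])


# Hypothetical stand-in for the model: returns two class probabilities
def fake_predict(features):
    return [0.25, 0.75]


# Two feature columns followed by the target column
joined = simulate_join("2596,51,5", n_target_columns=1, predict_fn=fake_predict)
print(joined)  # 2596,51,5,0.25,0.75
```

This layout is why the result columns are later reconstructed as the feature columns, then the actual cover type, then one probability column per class.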

With the transform complete, we simply need to download and plot the results:

In [None]:
test_root_filename = preproc_trial_component.output_artifacts["test-csv"].value.rpartition("/")[2]

!mkdir -p data/test/$model_name
sandbox_bucket.download_file(
    f"test/{model_name}/{test_root_filename}.out",  # Batch Transform appends ".out"
    f"data/test/{model_name}/{test_root_filename}",
)
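
The download above relies on Batch Transform naming each result after its input file with `.out` appended. The sketch below spells out that key derivation in isolation. The function name, URI and job name are hypothetical, and it assumes the `test-csv` artifact points at a single object rather than an S3 prefix.

```python
def transform_output_key(input_uri, output_prefix):
    """Build the S3 key of a Batch Transform result for a single input file."""
    filename = input_uri.rpartition("/")[2]  # Text after the final "/"
    return f"{output_prefix}/{filename}.out"


# Hypothetical input URI and job name, for illustration only
key = transform_output_key(
    "s3://example-bucket/preproc/test.csv",
    "test/forestcover-randfor-example",
)
print(key)  # test/forestcover-randfor-example/test.csv.out
```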

In [None]:
with open("data/columns.json", "r") as f:
    train_columns = json.load(f)

# TODO: Save from data prep, as we did with columns
# Note our first cover_type is a dummy because the dataset's encoding starts at 1.
cover_types = ("N/A", "Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz")

#result_cols = train_columns[:-1] + ["Actual_Cover_Type"] + ["Pred_Cover_Type"]
result_cols = train_columns[:-1] + ["Actual_Cover_Type"] + ["Pred " + typ for typ in cover_types[1:]]

In [None]:
df_test_results = pd.read_csv(
    f"data/test/{model_name}/{test_root_filename}",
    names=result_cols
)

print(f"Shape: {df_test_results.shape}")
df_test_results.head()

In [None]:
# Results just have class probabilities: Recover the predicted class column (argmax):
df_pred_probs = df_test_results[["Pred " + typ for typ in cover_types[1:]]]
# idxmax(axis=1) returns index values (i.e. column names), so first rename the columns to
# their class numbers (1-based, matching the dataset's encoding):
predicted_classes = df_pred_probs.rename(
    columns=dict(zip(df_pred_probs.columns, range(1, len(df_pred_probs.columns) + 1)))
).idxmax(axis=1)

# Now we can add the post-processed results into the main dataframe:
df_test_results["Pred_Cover_Type"] = predicted_classes
df_test_results["Pred_Correct"] = df_test_results["Pred_Cover_Type"] == df_test_results["Actual_Cover_Type"]
df_test_results.head()
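
The rename-then-`idxmax` step amounts to taking the position of the largest probability in each row and adding 1, because the dataset's class encoding starts at 1. Here is a dependency-free sketch of that logic on two made-up probability rows. `predicted_class` and the sample rows are hypothetical.

```python
def predicted_class(probabilities, first_label=1):
    """Return the label of the largest probability (first one wins on ties)."""
    best_position = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_position + first_label


# Hypothetical probability rows for the 7 cover types
rows = [
    [0.10, 0.70, 0.05, 0.05, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.05, 0.05, 0.10, 0.70],
]
predicted = [predicted_class(row) for row in rows]
print(predicted)  # [2, 7]
```

In the notebook itself, `np.argmax(df_pred_probs.to_numpy(), axis=1) + 1` should give the same class numbers without renaming any columns.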

In [None]:
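# Plot confusion matrix and accuracy summary:
class_labels = list(range(1, len(cover_types)))  # 1..7, skipping the dummy "N/A" entry
confusion = metrics.confusion_matrix(
    df_test_results["Actual_Cover_Type"],
    df_test_results["Pred_Cover_Type"],
    labels=class_labels,  # Keep the matrix 7x7 even if a class is missing from the test results
)

plt.figure(figsize=(10, 7))
sn.heatmap(
    pd.DataFrame(
        confusion,
        index=cover_types[1:],
        columns=cover_types[1:],
    ),
    annot=True,
    fmt="d",  # Show raw integer counts rather than scientific notation
)
plt.ylabel("Actual")
plt.xlabel("Predicted")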

n_correct = sum(df_test_results["Pred_Correct"])
n_tested = len(df_test_results)
print(f"{n_correct} of {n_tested} samples correct: Accuracy={n_correct/n_tested:.3%}")
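
Overall accuracy can hide weak performance on rarer cover types, so it can be worth reading per-class recall off the confusion matrix too: each diagonal entry divided by its row total, since scikit-learn puts actual classes on rows and predicted classes on columns. The sketch below uses a small made-up 3-class matrix. `per_class_recall` and the example numbers are hypothetical.

```python
def per_class_recall(confusion, class_names):
    """Recall per class, given rows = actual class and columns = predicted class."""
    recalls = {}
    for i, name in enumerate(class_names):
        row_total = sum(confusion[i])
        # Guard against classes with no samples in the test set
        recalls[name] = confusion[i][i] / row_total if row_total else float("nan")
    return recalls


# Hypothetical 3-class confusion matrix, for illustration only
example_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 5, 5],
]
recalls = per_class_recall(example_confusion, ["Spruce/Fir", "Lodgepole Pine", "Aspen"])
for name, value in recalls.items():
    print(f"{name}: {value:.1%}")
```

Assuming the notebook's `confusion` matrix has one row per real cover type, `per_class_recall(confusion, cover_types[1:])` should apply the same calculation to the actual results.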