# Forest CoverType 2b): PyTorch TabNet

In this notebook, we'll tackle our Forest Cover Type classification problem using [**PyTorch-TabNet**](https://github.com/dreamquark-ai/tabnet): a PyTorch-based implementation of TabNet ([Arik, S. O., & Pfister, T. (2019). TabNet: Attentive Interpretable Tabular Learning. arXiv preprint arXiv:1908.07442](https://arxiv.org/pdf/1908.07442.pdf)).

To grossly over-simplify: TabNet applies deep learning techniques that have recently shown revolutionary success in NLP (attention, transformers, unsupervised pre-training via masking) to tabular problems.

## Libraries and config (yawn)

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import os
import time

# External Dependencies:
import boto3
from botocore import exceptions as botoexceptions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sagemaker
from sagemaker.pytorch.estimator import PyTorch as PyTorchEstimator
from sagemaker.pytorch.model import PyTorchModel
import seaborn as sn
from sklearn import metrics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# Local Dependencies:
import util

In [None]:
%store -r experiment_name
%store -r preproc_trial_component_name
%store -r project_id
%store -r target_model

smclient = boto3.client("sagemaker")
role = sagemaker.get_execution_role()
smsess = sagemaker.session.Session()

project = util.project.init(project_id, role)
print(project)

sandbox_bucket = boto3.resource("s3").Bucket(project.sandbox.sandbox_bucket)

## Training the model

For the purposes of our **Experiment**, the (best outcome of the) PyTorch approach is one trial to be compared against other qualitatively different approaches.

If we were planning to iterate a lot on TabNet parameters, we may also decide to define **another Experiment** to capture our tests leading up to the higher-level business outcome.

In [None]:
tabnet_trial = Trial.create(
    trial_name=util.append_timestamp("tabnet-pytorch"), 
    experiment_name=experiment_name,
    sagemaker_boto_client=smclient,
)
tabnet_trial.add_trial_component(preproc_trial_component_name)

preproc_trial_component = TrialComponent.load(preproc_trial_component_name)

In [None]:
# OR load existing trial instead, e.g:
#tabnet_trial = Trial.load("tabnet-pytorch-2020-07-28-04-56-23")
#preproc_trial_component = TrialComponent.load(preproc_trial_component_name)

The definition of the Estimator should be quite familiar to experienced framework container users:

In [None]:
hyperparameters = {
    "model-type": "classification",
    "target": "Cover_Type",
    "seed": 1337,
    "n-d": 64,
    "n-a": 64,
    "n-steps": 5,
    "lr": 0.02,
    "gamma": 1.5,
    "n-independent": 2,
    "n-shared": 2,
    # TODO: Could integrate some additional hyperparams...
    #"cat-idxs": ",".join(map(lambda i: str(i), cat_idxs)),
    # cat-dims???
    #"cat-emb-dim": ",".join(map(lambda i: str(i), cat_emb_dim)),
    "lambda-sparse": 1e-4,
    "momentum": 0.3,
    "clip-value": 2.,
    "max-epochs": 200,  # Try 1000 for accuracy
    "patience": 100,
    "batch-size": 16384,
    "virtual-batch-size": 256,
    "num-workers": 2,
}


estimator = PyTorchEstimator(
    role=role,
    entry_point="train.py",
    source_dir="src",
    # Anecdotally, have seen accuracy degrade on 1.5.1 (88%) vs 1.4 (96%)... And v1.6 was in the middle (93%)
    framework_version="1.4",
    py_version="py3",

    base_job_name="forestcover-tabnet",
    output_path=f"s3://{sandbox_bucket.name}/trainjobs",
    checkpoint_s3_uri=f"s3://{sandbox_bucket.name}/trainjobs",

    debugger_hook_config=False,

    instance_count=1,
    instance_type="ml.p3.2xlarge",
    hyperparameters=hyperparameters,
    metric_definitions=[
        # One console log per output e.g.:
        # | EPOCH | train | valid | total time (s)
        # | 1 | 0.58782 | 0.06811 | 25.5
        # Since these rows are a bit brusque, we'll write quite precise/picky regexes to stay safe:
        { "Name": "train:accuracy", "Regex": r"\| +\d+ +\| +(.*?) +\| +[^\s]+ +\| +[^\s]+", },
        { "Name": "validation:accuracy", "Regex": r"\| +\d+ +\| +[^\s]+ +\| +(.*?) +\| +[^\s]+", },
    ],
    enable_sagemaker_metrics=True,
)
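To sanity-check the `metric_definitions` regexes before launching a job, one option is to run them locally against the sample log rows quoted in the comments above. This is only a sketch: the two sample lines are copied from those comments, and `extract_metric` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Same patterns as in metric_definitions above
TRAIN_REGEX = r"\| +\d+ +\| +(.*?) +\| +[^\s]+ +\| +[^\s]+"
VALID_REGEX = r"\| +\d+ +\| +[^\s]+ +\| +(.*?) +\| +[^\s]+"

# Sample console rows, copied from the comments in the estimator definition
HEADER_LINE = "| EPOCH | train | valid | total time (s)"
EPOCH_LINE = "| 1 | 0.58782 | 0.06811 | 25.5"


def extract_metric(regex, line):
    """Return the first capture group as a float, or None if the line doesn't match."""
    match = re.search(regex, line)
    return float(match.group(1)) if match else None


print(extract_metric(TRAIN_REGEX, EPOCH_LINE))   # value from the "train" column
print(extract_metric(VALID_REGEX, EPOCH_LINE))   # value from the "valid" column
print(extract_metric(TRAIN_REGEX, HEADER_LINE))  # header row has no epoch number, so no match
```

Requiring a numeric epoch in the first column is what keeps the header row from being picked up as a metric.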

One important difference from a standard framework container workflow is the `experiment_config` parameter for `.fit()`, which automatically creates training-job-linked Trial Components for us:

In [None]:
estimator.fit(
    inputs={
        "train": preproc_trial_component.output_artifacts["train-csv"].value,
        "validation": preproc_trial_component.output_artifacts["validation-csv"].value,
    },
    experiment_config={
        # This will create a TrainingJob-linked TrialComponent and automatically attach hyperparameters etc
        "TrialName": tabnet_trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    #wait=False,
)

In [None]:
# OR re-attach to previous training job instead, e.g:
#estimator = estimator.attach("forestcover-tabnet-2020-07-28-05-11-17-070")

## Testing the model (Batch Transform)

We've trained a model, and the validation metrics give us an idea of its performance. At some point, though, we'll want to test inference to check that everything works end to end (and to see what performance we can expect on our test set).

In particular, this model needs a [src/inference.py](src/inference.py) script to:
- Meet the I/O requirements of the `DefaultModelMonitor` we'll use later, and
- Tell the PyTorch container how to load the model, since pytorch-tabnet currently only offers saving via pickle (not the standard PyTorch model save)

Rather than using the **"one-click deploy"** APIs `Estimator.transformer()` and `Estimator.deploy()`, we'll explicitly create a `Model` in SageMaker to give us a little more control over our experiment's lifecycle.

In [None]:
training_job_desc = estimator.latest_training_job.describe()
model_path = training_job_desc["ModelArtifacts"]["S3ModelArtifacts"]
model_name = training_job_desc["TrainingJobName"]

In [None]:
try:
    model.delete_model()
except (AttributeError, NameError, ValueError):
    # AttributeError: Model() wasn't initialized with a sagemaker_session and hasn't created one yet
    # NameError: model hasn't been defined yet
    # ValueError: Current model isn't saved to SageMaker
    pass
except botoexceptions.ClientError as e:
    if (
        e.response["Error"]["Code"] == "ValidationException"
        and e.response["Error"]["Message"].startswith("Could not find")
    ):
        # SDK tried to delete but model wasn't found
        pass
    else:
        raise e

# Note a SageMaker SDK "Model" isn't strictly backed by a SageMaker API "Model": The Model only gets created
# when .transformer() or .deploy() are called... So we can't apply tags at the point of SDK "Model" creation
# or load an SDK "Model" from an API Model name/ARN.
model = PyTorchModel(
    name=model_name,
    model_data=model_path,
    role=role,
    source_dir="src/",
    entry_point="inference.py",  # Path is relative to source_dir
    framework_version="1.4",
    py_version="py3",
    sagemaker_session=smsess,
)

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # g4dn family not yet supported for batch transform
    strategy="MultiRecord",
    max_payload=1,  # 1MB
    accept="text/csv",  # Need to specify input and output types when using filters for .transform()
    assemble_with="Line",
    output_path=f"s3://{sandbox_bucket.name}/test/{model.name}",
)

# The API model is created now, so let's tag it straight away with Boto3:
model_desc = smclient.describe_model(ModelName=transformer.model_name)
smclient.add_tags(
    ResourceArn=model_desc["ModelArn"],
    Tags=[
        { "Key": "ExperimentName", "Value": experiment_name },
        { "Key": "TrialName", "Value": tabnet_trial.trial_name },
    ]
)

We'll use SageMaker Batch Transform's [data processing functionality](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to control how data gets batched into our algorithm, and to [associate output predictions with the input data](https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/), which will make measuring accuracy much easier later.

As with the training job, providing an `experiment_config` lets the transform job create a **Trial Component** recording the test:

In [None]:
transformer.transform(
    preproc_trial_component.output_artifacts["test-csv"].value,
    split_type="Line",
    content_type="text/csv",  # Need to specify input and output types when using filters
    # TODO: Check why -2 is required vs -1 per JSONPath spec for trimming last column
    input_filter="$[:-2]",  # Exclude target column from input to the model
    join_source="Input",  # Store both input and output in the result (saves us re-joining in notebook)
    # No output_filter so our output will be all source columns (incl target) + all prediction columns
    experiment_config={
        "ExperimentName": experiment_name,
        "TrialName": tabnet_trial.trial_name,
        "TrialComponentDisplayName": "Test",
    },
    wait=True,
    logs=True,
)
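As a rough mental model of what the filters above do to each record, the following pure-Python sketch mimics the intent of `input_filter` (hide the target column from the model) and `join_source="Input"` (append the predictions to the original record). It does not call SageMaker or evaluate JSONPath; `fake_model` and the sample row are made up for illustration.

```python
import csv
import io


def fake_model(features):
    """Hypothetical stand-in for the model: returns made-up class probabilities."""
    return [0.7, 0.2, 0.1]


def transform_line(line):
    """Mimic input filtering and input/output joining for one CSV record."""
    fields = next(csv.reader([line]))
    model_input = fields[:-1]  # Intent of input_filter: drop the target (last) column
    prediction = fake_model(model_input)
    joined = fields + [str(p) for p in prediction]  # Intent of join_source="Input"
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(joined)
    return out.getvalue()


sample = "2596,51,3,5"  # Three made-up features followed by the target class
print(transform_line(sample))
```

Because the full input record (including the target) is kept in the joined output, actual and predicted values end up side by side in one file.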

With the transform complete, we simply need to download and plot the results:

In [None]:
test_root_filename = preproc_trial_component.output_artifacts["test-csv"].value.rpartition("/")[2]

!mkdir -p data/test/$model_name
sandbox_bucket.download_file(
    f"test/{model.name}/{test_root_filename}.out",  # Batch Transform appends ".out"
    f"data/test/{model.name}/{test_root_filename}",
)
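The object key used in the download above follows from the transformer's `output_path` plus the input file name with `.out` appended. A small sketch of that derivation, using a made-up bucket, path and model name (`transform_output_key` is a hypothetical helper):

```python
def transform_output_key(input_s3_uri, model_name):
    """Build the S3 key where this notebook expects the transform output to land."""
    filename = input_s3_uri.rpartition("/")[2]  # Keep only the part after the last "/"
    return f"test/{model_name}/{filename}.out"


# Hypothetical values, for illustration only
print(transform_output_key("s3://example-bucket/data/test.csv", "forestcover-tabnet-example"))
```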

In [None]:
with open("data/columns.json", "r") as f:
    train_columns = json.load(f)

# TODO: Save from data prep, as we did with columns
# Note our first cover_type is a dummy because the dataset's encoding starts at 1.
cover_types = ("N/A", "Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz")

result_cols = train_columns[:-1] + ["Actual_Cover_Type"] + ["Pred " + typ for typ in cover_types[1:]]

In [None]:
df_test_results = pd.read_csv(
    f"data/test/{model.name}/{test_root_filename}",
    names=result_cols
)

print(f"Shape: {df_test_results.shape}")
df_test_results.head()

In [None]:
# Results just have class probabilities: Recover the predicted class column (argmax):
df_pred_probs = df_test_results[["Pred " + typ for typ in cover_types[1:]]]
# idxmax axis=1 returns index values (i.e. column names) - so rename columns to numbers:
predicted_classes = df_pred_probs.rename(
    columns={col: ix for ix, col in enumerate(df_pred_probs.columns, start=1)}
).idxmax(axis=1)

# Now we can add the post-processed results into the main dataframe:
df_test_results["Pred_Cover_Type"] = predicted_classes
df_test_results["Pred_Correct"] = df_test_results["Pred_Cover_Type"] == df_test_results["Actual_Cover_Type"]
df_test_results.head()
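To make the class-recovery step concrete, here is a toy example with made-up probabilities for just three of the classes. It shows the rename-then-`idxmax` approach next to a positional `argmax` alternative; both yield the same 1-based class IDs.

```python
import pandas as pd

class_names = ["Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine"]  # Subset, for brevity
probs = pd.DataFrame(
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],  # Made-up probabilities, one row per sample
    columns=["Pred " + name for name in class_names],
)

# Rename columns to 1-based class IDs, then take the label of the row-wise maximum:
by_idxmax = probs.rename(
    columns={col: ix for ix, col in enumerate(probs.columns, start=1)}
).idxmax(axis=1)

# Alternative: positional argmax on the raw array, shifted to 1-based IDs:
by_argmax = probs.to_numpy().argmax(axis=1) + 1

print(by_idxmax.tolist(), by_argmax.tolist())
```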

In [None]:
# Plot confusion matrix and accuracy summary:

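# Pass labels explicitly so the matrix is always 7x7, even if a class is missing from the test set:
class_labels = list(range(1, len(cover_types)))
confusion = metrics.confusion_matrix(
    df_test_results["Actual_Cover_Type"],
    df_test_results["Pred_Cover_Type"],
    labels=class_labels,
)

plt.figure(figsize=(10, 7))
ax = sn.heatmap(
    pd.DataFrame(confusion, index=cover_types[1:], columns=cover_types[1:]),
    annot=True,
    fmt="d",  # Show raw counts rather than scientific notation
)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")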

n_correct = int(df_test_results["Pred_Correct"].sum())
n_tested = len(df_test_results)
print(f"{n_correct} of {n_tested} samples correct: Accuracy={n_correct/n_tested:.3%}")
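For reading the heatmap: each row is an actual class, each column a predicted class, and accuracy is the diagonal total divided by the number of samples. The sketch below reproduces that layout by hand on a made-up six-sample, three-class example, without scikit-learn.

```python
from collections import Counter

# Made-up labels for illustration
actual = [1, 1, 2, 2, 2, 3]
predicted = [1, 2, 2, 2, 3, 3]
labels = [1, 2, 3]

# Count each (actual, predicted) pair, then lay the counts out as rows=actual, columns=predicted
pair_counts = Counter(zip(actual, predicted))
confusion = [[pair_counts[(a, p)] for p in labels] for a in labels]

# Correct predictions sit on the diagonal
accuracy = sum(confusion[i][i] for i in range(len(labels))) / len(actual)

for row in confusion:
    print(row)
print(f"Accuracy={accuracy:.3%}")
```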

In [None]:
# TODO: Interactive ROC curve?

In [None]:
# TODO: Associate accuracy metric with existing Test TrialComponent?

# It's not trivial because neither training nor processing jobs (currently) advertise what TrialComponentName
# they created during run: We only have access to the DisplayName we requested... So could work around by
# appending timestamps to our displayname (to make lookups unique when job retried multiple times within
# trial), but that would make the structure within a trial less straightforward.

## Register/submit the model for deployment

Now we have (in our data science sandbox) a candidate model which is performing well. We'd like to promote it up to the project space and deploy it to an environment.

How we set up our "model registry" depends on the governance requirements of the project and organization. Note that while the `ModelPackage` and `Algorithm` constructs built in to SageMaker are one interesting option, they don't currently support resource tagging. :-(

In [None]:
submission_result = project.submit_model(
    {
        "EndpointName": target_model,
        "TrainingJob": training_job_desc,
        "Model": model_desc,
    },
    wait=True
)
print(submission_result)

## Bonus: A quick demo of real-time deployment

Sure, our main workflow is to test the model with Batch Transform, but what about cases where the batch transform feed format might differ slightly from production?

If we wanted to quickly experiment with deploying our model to a test endpoint here, it could look something like this:

In [None]:
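try:
    predictor.delete_endpoint()
    time.sleep(5)  # (Otherwise can trigger ResourceLimitExceeded if creating again immediately)
except (NameError, botoexceptions.ClientError):
    # NameError: predictor hasn't been defined yet
    # ClientError: endpoint doesn't exist (e.g. already deleted)
    pass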

predictor = model.deploy(
    endpoint_name=model.name,  # Use model.name to avoid us accidentally deploying the model twice
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    #wait=False
)

In [None]:
with open("data/columns.json", "r") as f:
    train_columns = json.load(f)
df_test = pd.read_csv(
    "data/test-noheader.csv",
    names=train_columns
)

In [None]:
# By default, PyTorch predictors serialize requests as NumPy arrays with content type application/x-npy
# (in SageMaker SDK v2, via sagemaker.serializers.NumpySerializer), so it's as easy as dropping the
# target column and converting pandas->numpy.
# (Note you can do many more than 10 samples at once - see the batch transform logs)
result = predictor.predict(df_test.drop("Cover_Type", axis=1).iloc[0:10].to_numpy())
result
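For a feel of what an `application/x-npy` payload is, the sketch below round-trips a small array through NumPy's `.npy` binary format in memory. It is an approximation of what a NumPy serializer/deserializer pair does, not the SDK's actual code; `to_npy_bytes`, `from_npy_bytes` and the sample values are made up for illustration.

```python
import io

import numpy as np


def to_npy_bytes(array):
    """Serialize an array to .npy-format bytes (the body of an application/x-npy request)."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    return buffer.getvalue()


def from_npy_bytes(payload):
    """Deserialize .npy-format bytes back into an array."""
    return np.load(io.BytesIO(payload))


batch = np.array([[2596.0, 51.0, 3.0], [2590.0, 56.0, 2.0]])  # Made-up feature rows
payload = to_npy_bytes(batch)
restored = from_npy_bytes(payload)

print(payload[:6], restored.shape)
```

The format carries dtype and shape in its header, which is why no extra schema is needed when sending a batch of rows.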