# Forest Cover Type with SageMaker Experiments - Introduction

This series of notebooks demonstrates techniques for tabular data ML in SageMaker, on the popular **"Forest Cover Type"** multiclass classification task.

**This notebook** handles initial loading of the data and some basic transformations, after which you'll be ready to run the follow-on notebooks to train and deploy predictive models.

## Contents

TODO: Maybe?

## About the task & Acknowledgements

The Forest Cover Type dataset is copyright Jock A. Blackard and Colorado State University, and made available to us via the [**UCI Machine Learning Repository page**](https://archive.ics.uci.edu/ml/datasets/covertype).

The task is to predict, for each of 581,012 patches of forest in northern Colorado, which of 7 types of tree cover dominates.

See [**Forest Cover Type Classification Study**](https://rstudio-pubs-static.s3.amazonaws.com/160297_f7bcb8d140b74bd19b758eb328344908.html) (Thomas Kolasa and Aravind Kolumum Raja) for a nicely presented review of the problem, using traditional data science methods and interactive graphics.

## Getting Started

In [None]:
# SageMaker Experiments SDK is not installed on SageMaker notebooks by default:
!pip install sagemaker-experiments

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import os

# External Dependencies:
import boto3
import numpy as np
import pandas as pd
import sagemaker
from sagemaker.pytorch.model import PyTorchModel
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# Local Dependencies:
import util

In [None]:
role = sagemaker.get_execution_role()
smclient = boto3.client("sagemaker")
smsess = sagemaker.session.Session()


bucket_name = # TODO: Bucket
%store bucket_name

bucket = boto3.resource("s3").Bucket(bucket_name)

Create (or load) the **Experiment** in which we'll track our work:

In [None]:
experiment_name = util.append_timestamp("forest-cover-type")
experiment = util.smexps.create_or_load_experiment(
    experiment_name=experiment_name,
    description="Classification of forest type from cartographic variables",
    sagemaker_boto_client=smclient,  # (Optional)
)
%store experiment_name
print(experiment)

In [None]:
preproc_tracker = Tracker.create(
    display_name="Preprocessing",
    sagemaker_boto_client=smclient,  # (Optional)
)

preproc_trial_component_name = preproc_tracker.trial_component.trial_component_name
%store preproc_trial_component_name

print(preproc_tracker.trial_component)

For the sake of illustration we could create a "Trial" including **only** the pre-processing step, but a trial with no training step wouldn't be very meaningful, so the cell below is left commented out.

In [None]:
# preproc_trial = Trial.create(
#     trial_name=util.append_timestamp("preproc-only"), 
#     experiment_name=experiment.experiment_name,
#     sagemaker_boto_client=smclient,
# )
# preproc_trial.add_trial_component(preproc_tracker.trial_component)

## Download and Explore the Data


In [None]:
raw_data_uri = "https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"

!mkdir -p data/raw
!wget -O data/raw/covtype.data.gz $raw_data_uri
!wget -O data/raw/covtype.info https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info
!gunzip -f data/raw/covtype.data.gz
assert os.path.isfile("data/raw/covtype.data")  # (Because some of the shell cmds can fail without raising error)

preproc_tracker.log_input(
    name="UCI-Covertype",
    media_type="text/csv",
    value=raw_data_uri
)

The data format is documented in the [data/raw/covtype.info](data/raw/covtype.info) file we just downloaded.

In [None]:
cover_types = ("N/A", "Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine", "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz")
wilderness_areas = ("Rawah", "Neota", "Comanche Peak", "Cache la Poudre")
preproc_tracker.log_parameters({
    "cover_types": cover_types,
    "n_cover_types": len(cover_types) - 1,
    "wilderness_areas": wilderness_areas,
    "n_wilderness_areas": len(wilderness_areas),
})

df_raw = pd.read_csv(
    "data/raw/covtype.data",
    names=[
        "Elevation_m",  # Elevation in meters
        "Aspect_deg",  # Aspect in degrees azimuth
        "Slope_deg",  # Slope in degrees
        "Horizontal_Distance_To_Hydrology_m",  # Horz Dist to nearest surface water features
        "Vertical_Distance_To_Hydrology_m",  # Vert Dist to nearest surface water features
        "Horizontal_Distance_To_Roadways_m",  # Horz Dist to nearest roadway
        "Hillshade_9am_uint8",  # Hillshade index at 9am, summer solstice
        "Hillshade_Noon_uint8",  # Hillshade index at noon, summer solstice
        "Hillshade_3pm_uint8",  # Hillshade index at 3pm, summer solstice
        "Horizontal_Distance_To_Fire_Points_m",  # Horz Dist to nearest wildfire ignition points
    ]
    + ["Area_is_{}".format(area.replace(" ", "_")) for area in wilderness_areas]
    + ["Soil_Type_is_{:02}".format(typ) for typ in range(1, 41)]
    + [
        "Cover_Type",  # Forest Cover Type designation
    ]
)

bool_columns = [col for col in df_raw.columns if "_is_" in col]

print(f"Raw dataframe shape (rows, cols) = {df_raw.shape}")
df_raw.head()
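
As a quick sanity check on the column layout above, the following sketch rebuilds the same list of names without pandas and confirms that it contains 55 fields: 10 quantitative features, 4 wilderness-area indicators, 40 soil-type indicators, and the target. The helper name `build_column_names` is an assumption made for illustration; it is not part of this notebook's `util` module.

```python
def build_column_names(wilderness_areas, n_soil_types=40):
    """Rebuild the covtype column names used above (illustrative helper only)."""
    quantitative = [
        "Elevation_m",
        "Aspect_deg",
        "Slope_deg",
        "Horizontal_Distance_To_Hydrology_m",
        "Vertical_Distance_To_Hydrology_m",
        "Horizontal_Distance_To_Roadways_m",
        "Hillshade_9am_uint8",
        "Hillshade_Noon_uint8",
        "Hillshade_3pm_uint8",
        "Horizontal_Distance_To_Fire_Points_m",
    ]
    # One binary indicator column per wilderness area, spaces replaced by underscores:
    area_flags = ["Area_is_{}".format(area.replace(" ", "_")) for area in wilderness_areas]
    # One binary indicator column per soil type, zero-padded to two digits:
    soil_flags = ["Soil_Type_is_{:02}".format(typ) for typ in range(1, n_soil_types + 1)]
    return quantitative + area_flags + soil_flags + ["Cover_Type"]


wilderness_areas = ("Rawah", "Neota", "Comanche Peak", "Cache la Poudre")
cover_types = (
    "N/A", "Spruce/Fir", "Lodgepole Pine", "Ponderosa Pine",
    "Cottonwood/Willow", "Aspen", "Douglas-fir", "Krummholz",
)

columns = build_column_names(wilderness_areas)
print(len(columns))

# Cover_Type labels start at 1, so the "N/A" placeholder at index 0 lets us
# decode a label by indexing directly into the tuple:
print(cover_types[2])
```

Keeping `"N/A"` at index 0 is why the cell above logs `n_cover_types` as `len(cover_types) - 1`.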

## Pre-Process and Split

In [None]:
# This cell from the DreamQuark Forest CoverType demo notebook is not actually necessary because the data is
# already clean:


# from sklearn.preprocessing import LabelEncoder

# categorical_columns = []
# categorical_dims =  {}
# for col in df_raw.columns[df_raw.dtypes == object]:
#     print(col, df_raw[col].nunique())
#     l_enc = LabelEncoder()
#     # df_raw[col] = df_raw[col].fillna("Unknown")
#     df_raw[col] = l_enc.fit_transform(df_raw[col].values)
#     categorical_columns.append(col)
#     categorical_dims[col] = len(l_enc.classes_)

# preproc_tracker.log_parameters({
#     "categorical_columns": categorical_columns,
#     "categorical_dims": categorical_dims,
# })

df_all = df_raw  # No pre-processing to do, so we'll just map same ref to a different variable name

In [None]:
train_pct = 0.8
val_pct = 0.1
test_pct = 1. - train_pct - val_pct
preproc_tracker.log_parameters({
    "train_pct": train_pct,
    "val_pct": val_pct,
    "test_pct": test_pct,
})

df_train, df_val, df_test = np.split(
    df_all.sample(frac=1),
    [int(train_pct*len(df_all)), int((train_pct + val_pct)*len(df_all))]
)

print(f"Split randomly into train={len(df_train):,}, validation={len(df_val):,}, test={len(df_test):,} samples")
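
The split above shuffles the rows and then cuts the shuffled sequence at two boundaries. Because the cell does not fix a random seed, each run produces a different partition. The sketch below shows the same shuffle-and-cut idea on plain row indices using only the standard library, with an optional seed for repeatability. The function name `split_indices` and the `seed` parameter are assumptions for illustration, not something the notebook defines.

```python
import random


def split_indices(n_rows, train_pct, val_pct, seed=None):
    """Shuffle row indices and cut them into train / validation / test groups."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle repeatable for a given seed
    # without touching the global random state:
    random.Random(seed).shuffle(indices)
    train_end = int(train_pct * n_rows)
    val_end = int((train_pct + val_pct) * n_rows)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(581012, 0.8, 0.1, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, a comparable effect can be had by passing `random_state` to `DataFrame.sample` before splitting.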

## Extract a Deliberately Biased Testing Subset

We'll extract a deliberately biased subset of our testing data to use later in demonstrating how SageMaker's **Model Monitoring** functionality can be applied to detect [Concept Drift](https://en.wikipedia.org/wiki/Concept_drift) over time after a live model endpoint is deployed.

Specifically, we'll use the top 20% of test records by `Elevation_m`, because the target variable is known to be strongly affected by this feature.

In [None]:
biascheck_field = "Elevation_m"
biascheck_test_pct = .2
preproc_tracker.log_parameters({
    "biascheck_field": biascheck_field,
    "biascheck_test_pct": biascheck_test_pct,
})

df_test_bias, _ = np.split(
    df_test.sort_values(biascheck_field, ascending=False),
    [int(biascheck_test_pct*len(df_test))]
)
# ...And re-randomize:
df_test_bias = df_test_bias.sample(frac=1)

# Simpler version of summary showing elevation only:
# pd.DataFrame({
#     "Test Set Elevation_m": df_test["Elevation_m"].describe(),
#     "Biased Subset Elevation_m": df_test_bias["Elevation_m"].describe(),
# })

# Create summaries:
test_summary = df_test.describe()
test_bias_summary = df_test_bias.describe()

# Log summary metrics to Experiment:
for (dsname, summary) in (("test", test_summary), ("biastest", test_bias_summary)):
    for (fname, field) in (("feature", biascheck_field), ("target", "Cover_Type")):
        for stat in ("mean", "std"):
            preproc_tracker.log_metric(
                f"biascheck-{fname}-{dsname}-{stat}",
                summary[field][stat]
            )

# Present nested-column summary tables in the notebook:
test_summ_cpy = test_summary.copy()
test_summ_cpy.columns = pd.MultiIndex.from_product([["Test Set"], test_summ_cpy.columns])
test_bias_summ_cpy = test_bias_summary.copy()
test_bias_summ_cpy.columns = pd.MultiIndex.from_product([["Biased Subset"], test_bias_summ_cpy.columns])
pd.concat([test_summ_cpy, test_bias_summ_cpy], axis=1).loc[:, pd.IndexSlice[:, [biascheck_field, "Cover_Type"]]]
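
To make the selection logic concrete, here is a minimal standard-library sketch of the same idea: rank records by one field in descending order, keep the top fraction, and compare the mean of that field before and after. The records are toy values invented for the example, and `top_fraction` is a hypothetical helper name rather than anything defined in this notebook.

```python
from statistics import mean


def top_fraction(records, field, fraction):
    """Return the records with the highest values of `field`, keeping `fraction` of them."""
    ranked = sorted(records, key=lambda rec: rec[field], reverse=True)
    return ranked[: int(fraction * len(records))]


# Toy data: ten records with elevations 2000, 2100, ..., 2900 (not real covtype rows)
records = [{"Elevation_m": 2000 + 100 * i, "Cover_Type": 1 + i % 2} for i in range(10)]
biased = top_fraction(records, "Elevation_m", 0.2)

full_mean = mean(rec["Elevation_m"] for rec in records)
biased_mean = mean(rec["Elevation_m"] for rec in biased)
print(full_mean, biased_mean)
```

The gap between the two means is the kind of shift that the `biascheck-*` metrics logged above are meant to capture.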

## Upload Prepared Datasets to S3

In [None]:
# It's helpful to have our training and validation datasets with headers so we can reference columns by name
# in training hyperparams:
df_train.to_csv("data/train-withheader.csv", index=False)
df_val.to_csv("data/validation-withheader.csv", index=False)

# ...But useful to skip headers in our test datasets so we can push the files through batch transformations:
df_test.to_csv("data/test-noheader.csv", index=False, header=False)
df_test_bias.to_csv("data/test-bias-noheader.csv", index=False, header=False)

In [None]:
# Let's save our columns list as well, since some files don't include them:
with open("data/columns.json", "w") as f:
    json.dump(df_train.columns.to_list(), f)
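
Since the test files are written without headers, any later consumer needs `columns.json` to recover the field names. The sketch below shows one way to pair the two files using only the standard library. It writes small temporary stand-in files so that it can run on its own; the helper name `read_headerless_csv` and the two-column sample data are assumptions for illustration.

```python
import csv
import json
import os
import tempfile


def read_headerless_csv(csv_path, columns_path):
    """Read a headerless CSV into dicts keyed by the names stored in a JSON list."""
    with open(columns_path) as f:
        columns = json.load(f)
    with open(csv_path, newline="") as f:
        return [dict(zip(columns, row)) for row in csv.reader(f)]


with tempfile.TemporaryDirectory() as tmpdir:
    columns_path = os.path.join(tmpdir, "columns.json")
    csv_path = os.path.join(tmpdir, "test-noheader.csv")
    # Stand-in files with just two columns, to keep the example short:
    with open(columns_path, "w") as f:
        json.dump(["Elevation_m", "Cover_Type"], f)
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows([[2596, 5], [3100, 1]])
    rows = read_headerless_csv(csv_path, columns_path)

print(rows[0])
```

Note that the `csv` module returns every value as a string. With pandas, `pd.read_csv(path, names=columns)` applies the names and infers numeric types in one step.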

In [None]:
# The upload() function returns the created S3 URI, allowing for nice feed-in to logging:
preproc_tracker.log_output(
    "columns",
    sagemaker.s3.S3Uploader.upload("data/columns.json", f"s3://{bucket_name}/data"),
    "application/json"
)
preproc_tracker.log_output(
    "train-csv",
    sagemaker.s3.S3Uploader.upload("data/train-withheader.csv", f"s3://{bucket_name}/data"),
    "text/csv"
)
preproc_tracker.log_output(
    "validation-csv",
    sagemaker.s3.S3Uploader.upload("data/validation-withheader.csv", f"s3://{bucket_name}/data"),
    "text/csv"
)
preproc_tracker.log_output(
    "test-csv",
    sagemaker.s3.S3Uploader.upload("data/test-noheader.csv", f"s3://{bucket_name}/data"),
    "text/csv"
)
preproc_tracker.log_output(
    "test-biased-csv",
    sagemaker.s3.S3Uploader.upload("data/test-bias-noheader.csv", f"s3://{bucket_name}/data"),
    "text/csv"
)

In [None]:
# Double-check everything saves OK, because otherwise errors can be silent:
preproc_tracker.trial_component.save()

# I don't *think* these are necessary?
#preproc_trial.save()
#experiment.save()


## Next Steps

TODO