# Orchestrate Jobs to Train and Evaluate Models 


## Notebook Description

**Dataset Reference:** https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html

**Type of problem:** Regression

**Type of solution:** XGBoost using SageMaker Training Job


Source of this notebook: https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html

This notebook has been adapted to run outside of SageMaker Pipelines: each step runs immediately as a standalone SageMaker job.


**Stack:**
- pandas, numpy 
- SageMaker Training Job
- Studio's prebuilt image DataScience 3.0 (conda) and XGBoost Stack

**Steps:**
- download data
- do some data preparation
- split the datasets and upload the datasets to S3
- configure and run the training job
- check the model evaluation


## Dataset

The dataset you use is the [UCI Machine Learning Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone) [1].  The aim for this task is to determine the age of an abalone snail from its physical measurements. At the core, this is a regression problem.

The dataset contains several features: length (the longest shell measurement), diameter (the diameter perpendicular to length), height (the height with meat in the shell), whole_weight (the weight of whole abalone), shucked_weight (the weight of meat), viscera_weight (the gut weight after bleeding), shell_weight (the weight after being dried), sex ('M', 'F', 'I' where 'I' is Infant), and rings (integer).

The number of rings turns out to be a good approximation for age (age is rings + 1.5). However, obtaining this number requires cutting the shell through the cone, staining the section, and counting the rings through a microscope, which is time-consuming. The other physical measurements are easier to determine, so you use the dataset to build a model that predicts the variable rings from them.
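
The ring-to-age relation quoted above is simple enough to write down directly. The helper name `rings_to_age` below is illustrative and does not appear elsewhere in this notebook; it only restates the formula age = rings + 1.5.

```python
def rings_to_age(rings: float) -> float:
    """Approximate abalone age from the ring count (age = rings + 1.5)."""
    return rings + 1.5


# A model that predicts `rings` can be converted to an age estimate afterwards.
predicted_rings = 9
print(rings_to_age(predicted_rings))
```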

Before you upload the data to an S3 bucket, install the SageMaker Python SDK and gather some constants you can use later in this notebook.

> [1] Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

# Session initialisation

In [None]:
import sys

!{sys.executable} -m pip install "sagemaker>=2.121.0"  # noqa

In [None]:
from pprint import pprint

In [None]:
import boto3
import sagemaker

sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

# Names carried over from the pipeline version of this example.
model_package_group_name = "LabBench-Abalone-Jobs-GroupName"
pipeline_name = "LabBench-Abalone-Jobs"


# Parameters

In [None]:
from time import gmtime, strftime
import time

run_id = f"{strftime('%y%m%d%H%M', gmtime())}"

stage_prefix = "L"
project_prefix = "abalone"
variant_prefix = "xgbjob"

In [None]:
job_prefix_short = f"{variant_prefix}/{run_id}"
job_prefix_long = f"{stage_prefix}/{project_prefix}/{job_prefix_short}"

In [None]:
print(f"{job_prefix_short=}")
print(f"{job_prefix_long=}")

In [None]:
import os
base_folder = os.path.join("./generated", job_prefix_short)

base_uri = f"s3://{default_bucket}/{job_prefix_long}"
base_uri_for_jobs = f"s3://{default_bucket}/{stage_prefix}-jobs"

In [None]:
print(f"{base_folder=}")
print(f"{base_uri=}")
print(f"{base_uri_for_jobs=}")
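
The prefixes above combine into one local folder and two S3 locations. The sketch below shows that layout in a single place, assuming a placeholder bucket name (`example-bucket`) and a fixed `run_id`. The function `build_locations` is hypothetical and only mirrors the string formatting used in the cells above.

```python
from time import gmtime, strftime


def build_locations(bucket, stage="L", project="abalone", variant="xgbjob", run_id=None):
    """Return the local folder and S3 URIs derived from the naming prefixes."""
    if run_id is None:
        # Same timestamp format as the notebook: yymmddHHMM in UTC.
        run_id = strftime("%y%m%d%H%M", gmtime())
    job_prefix_short = f"{variant}/{run_id}"
    job_prefix_long = f"{stage}/{project}/{job_prefix_short}"
    return {
        "base_folder": f"./generated/{job_prefix_short}",
        "base_uri": f"s3://{bucket}/{job_prefix_long}",
        "base_uri_for_jobs": f"s3://{bucket}/{stage}-jobs",
    }


# "example-bucket" and the run_id are placeholders, not real resources.
locations = build_locations("example-bucket", run_id="2401011200")
for name, value in locations.items():
    print(f"{name}={value!r}")
```

Because `run_id` has minute resolution, two runs started within the same minute would share the same prefix.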

# Data Preparation

Download the sample dataset, then upload it to the default bucket. You can point `input_data_uri` at your own dataset instead.

In [None]:
import os

# tmp directory
data_folder = os.path.join(base_folder, "data")

raw_data_folder = os.path.join(data_folder, "raw")
os.makedirs(raw_data_folder, exist_ok=True)

local_path = os.path.join(raw_data_folder, "abalone-dataset.csv")

s3 = boto3.resource("s3")
s3.Bucket("sagemaker-sample-files").download_file(
    "datasets/tabular/uci_abalone/abalone.csv", local_path
)

print(f"{local_path=}")

In [None]:
input_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path,
    desired_s3_uri=base_uri,
)
print(f"{input_data_uri=}")

## Define Parameters 

The parameters defined in this workflow include:

* `processing_instance_count` - The instance count of the processing job.
* `instance_type` - The `ml.*` instance type of the processing jobs; it is also passed when retrieving the XGBoost image URI.
* `input_data` - The S3 URI of the input data.
* `batch_data` - The S3 URI of the batch data (listed in the source pipeline example; not defined in this notebook).
* `mse_threshold` - The Mean Squared Error (MSE) threshold used to verify the accuracy of a model.

In [None]:
processing_instance_count = 1
#instance_type = "ml.m5.xlarge"
#instance_type = "ml.m5.2xlarge"
instance_type = "ml.t3.medium"
input_data = input_data_uri
mse_threshold = 6.0

In [None]:
base_job_name = f"{stage_prefix}-{project_prefix}-{variant_prefix}"
# SageMaker appends a timestamp to this base name to form each job name

In [None]:
print(f"{base_job_name=}")

## Define a Processing Step for Feature Engineering

First, develop a preprocessing script to run in the processing job.

The cell below writes the preprocessing script to `generated/code/preprocessing.py`. You can update the script and rerun the cell to overwrite the file. The preprocessing script uses `scikit-learn` to do the following:

* Fill in missing `sex` values with a constant placeholder category and one-hot encode the column.
* Impute missing numerical feature values with the median and standardize them; `sex` and the `rings` label are excluded.
* Split the data into training, validation, and test datasets.

The Processing step executes the script on the input data. The Training step uses the preprocessed training features and labels to train a model. The Evaluation step uses the trained model and preprocessed test features and labels to evaluate the model.
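
To make the first two transformations concrete, here is a minimal standard-library sketch of standardization and one-hot encoding. It is not the code the job runs (the script uses `StandardScaler` and `OneHotEncoder`), and the helper names `standardize` and `one_hot` are illustrative only.

```python
import math


def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    # A constant column has std == 0; map it to zeros instead of dividing by zero.
    return [(v - mean) / std if std else 0.0 for v in values]


def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


scaled = standardize([1.0, 2.0, 3.0])
categories, encoded = one_hot(["M", "F", "I", "M"])
print(scaled)
print(categories, encoded)
```

Each `sex` value becomes one indicator column per category, so the categorical column expands into several numeric columns in the processed output.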

In [None]:
import os

# tmp directory
code_folder="generated/code"
os.makedirs(code_folder, exist_ok=True)

In [None]:
%%writefile generated/code/preprocessing.py
import argparse
import os
import requests
import tempfile

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


# Since we get a headerless CSV file, we specify the column names here.
feature_columns_names = [
    "sex",
    "length",
    "diameter",
    "height",
    "whole_weight",
    "shucked_weight",
    "viscera_weight",
    "shell_weight",
]
label_column = "rings"

feature_columns_dtype = {
    "sex": str,
    "length": np.float64,
    "diameter": np.float64,
    "height": np.float64,
    "whole_weight": np.float64,
    "shucked_weight": np.float64,
    "viscera_weight": np.float64,
    "shell_weight": np.float64,
}
label_column_dtype = {"rings": np.float64}


def merge_two_dicts(x, y):
    z = x.copy()
    z.update(y)
    return z


if __name__ == "__main__":
    base_dir = "/opt/ml/processing"

    df = pd.read_csv(
        f"{base_dir}/input/abalone-dataset.csv",
        header=None,
        names=feature_columns_names + [label_column],
        dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype),
    )
    numeric_features = list(feature_columns_names)
    numeric_features.remove("sex")
    numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )

    categorical_features = ["sex"]
    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )

    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    y = df.pop("rings")
    X_pre = preprocess.fit_transform(df)
    y_pre = y.to_numpy().reshape(len(y), 1)

    X = np.concatenate((y_pre, X_pre), axis=1)

    np.random.shuffle(X)
    train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])

    pd.DataFrame(train).to_csv(f"{base_dir}/train/train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(
        f"{base_dir}/validation/validation.csv", header=False, index=False
    )
    pd.DataFrame(test).to_csv(f"{base_dir}/test/test.csv", header=False, index=False)
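
The script above shuffles with NumPy's global random state and never seeds it, so each run produces a different 70/15/15 split. The following is a minimal sketch of a seeded equivalent using only the standard library. `split_rows` and the seed value are illustrative and are not part of the original script.

```python
import random


def split_rows(rows, train_end=0.7, validation_end=0.85, seed=42):
    """Shuffle a copy of `rows` and cut it at the two cumulative fractions."""
    shuffled = list(rows)
    # A local generator keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(shuffled)
    first_cut = int(train_end * len(shuffled))
    second_cut = int(validation_end * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


rows = list(range(4177))  # placeholder rows, sized like the UCI Abalone dataset
train, validation, test = split_rows(rows)
print(len(train), len(validation), len(test))
```

In the NumPy script, seeding the generator before the shuffle would serve the same purpose.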

Next, create an instance of `SKLearnProcessor` and run it directly instead of wrapping it in a `ProcessingStep`.

You also specify the `framework_version` to use throughout this notebook.

Passing `sagemaker_session` (rather than a `PipelineSession`) makes the job run immediately.

API Reference 
- SK Framework https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor


framework_version = "0.23-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_name}-preprocess",
    role=role,
    sagemaker_session=sagemaker_session
)

Finally, call the processor's `run` method to start the job.
Note the `train`, `validation`, and `test` named outputs specified in the output configuration for the processing job.

In [None]:
%%time

from sagemaker.processing import ProcessingInput, ProcessingOutput

preprocessor_job = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="generated/code/preprocessing.py",
)

In [None]:
pprint(sklearn_processor.__dict__)

In [None]:
pprint(sklearn_processor.latest_job.__dict__)

In [None]:
pprint(sklearn_processor.latest_job.outputs[0].__dict__)

In [None]:
train_dataset_uri = sklearn_processor.latest_job.outputs[0].destination
validation_dataset_uri = sklearn_processor.latest_job.outputs[1].destination
test_dataset_uri = sklearn_processor.latest_job.outputs[2].destination
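
The cell above picks the outputs by position, which silently breaks if the order of the `outputs` list changes. A small sketch of looking them up by name instead is shown below. `FakeOutput` is a stand-in used so the example runs without AWS access, `destinations_by_name` is a hypothetical helper, and the S3 paths are placeholders. It relies only on the `output_name` and `destination` attributes of `ProcessingOutput`.

```python
from collections import namedtuple

# Stand-in for sagemaker.processing.ProcessingOutput (assumption: only the
# `output_name` and `destination` attributes are needed here).
FakeOutput = namedtuple("FakeOutput", ["output_name", "destination"])


def destinations_by_name(outputs):
    """Map each output name to its S3 destination."""
    return {output.output_name: output.destination for output in outputs}


outputs = [
    FakeOutput("train", "s3://example-bucket/job/output/train"),
    FakeOutput("validation", "s3://example-bucket/job/output/validation"),
    FakeOutput("test", "s3://example-bucket/job/output/test"),
]
uris = destinations_by_name(outputs)
print(uris["validation"])
```

In this notebook the same helper could be called as `destinations_by_name(sklearn_processor.latest_job.outputs)`.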

## Define a Training Step to Train a Model

In this section, use Amazon SageMaker's [XGBoost Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train on this dataset. Configure an Estimator for the XGBoost algorithm and the input dataset. A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later.

The model path where the models from training are saved is also specified.

Use sagemaker_session in order to run this immediately.

Note that `instance_type` is reused here only to retrieve the XGBoost image URI. The estimator itself is given `ml.m5.large`, presumably because the `ml.t3.medium` value set earlier is not an instance type offered for training jobs.

API Reference
- XGBoost Framework https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

In [None]:
%%time

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

model_path = f"{base_uri}/model"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=instance_type,  
)
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type="ml.m5.large",  #instance_type,
    instance_count=1,
    base_job_name=f"{base_job_name}-train",
    output_path=model_path,
    role=role,
    sagemaker_session=sagemaker_session,
)
xgb_train.set_hyperparameters(
    objective="reg:squarederror",  # "reg:linear" is the deprecated name for this objective
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
)

train_job = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=train_dataset_uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=validation_dataset_uri,
            content_type="text/csv",
        ),
    }
)

In [None]:
pprint(xgb_train.__dict__)

In [None]:
pprint(xgb_train.latest_training_job.__dict__)

In [None]:
pprint(xgb_train.output_path)

In [None]:
job_name = xgb_train.latest_training_job.job_name
model_uri = f"{xgb_train.output_path}/{job_name}/output/model.tar.gz"

In [None]:
print(model_uri)
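
The model artifact location follows the pattern `<output_path>/<job_name>/output/model.tar.gz`. The sketch below wraps that pattern in a hypothetical helper, `model_artifact_uri`, that also tolerates a trailing slash in the output path. The bucket and job name are placeholders.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the S3 URI of the model archive written by a training job."""
    # Strip a trailing slash so the result never contains "//" after the bucket path.
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Placeholder values for illustration only.
print(model_artifact_uri("s3://example-bucket/L/abalone/xgbjob/2401011200/model/", "L-abalone-xgbjob-train-001"))
```

After `fit` completes, the estimator's `model_data` attribute should report the same location, which avoids rebuilding the string by hand.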

## Evaluate the Trained Model

First, develop an evaluation script to run in a processing job that performs the model evaluation.

After execution, you can examine the resulting `evaluation.json` for analysis.

The evaluation script uses `xgboost` to do the following:

* Load the model.
* Read the test data.
* Issue predictions against the test data.
* Build a regression report containing the mean squared error (MSE) and the standard deviation of the prediction errors.
* Save the evaluation report to the evaluation directory.

Because the model is not deployed to an endpoint, a `ScriptProcessor` job is used to load it and run the predictions.
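
The two numbers in the report can be reproduced by hand. This is a minimal standard-library sketch of the same calculation. The function name `regression_report` and the sample values are illustrative, while the report keys match the ones the evaluation script writes.

```python
import json
import math


def regression_report(y_true, y_pred):
    """Compute the MSE and the population standard deviation of the residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    mean_residual = sum(residuals) / n
    std = math.sqrt(sum((r - mean_residual) ** 2 for r in residuals) / n)
    return {"regression_metrics": {"mse": {"value": mse, "standard_deviation": std}}}


# Made-up labels and predictions, for illustration only.
report = regression_report([10.0, 8.0, 12.0], [9.0, 8.0, 14.0])
print(json.dumps(report))
```

Note that `standard_deviation` describes the spread of the residuals (`y_test - predictions`), not the uncertainty of the MSE itself.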

In [None]:
%%writefile generated/code/evaluation.py
import json
import pathlib
import pickle
import tarfile

import joblib
import numpy as np
import pandas as pd
import xgboost

from sklearn.metrics import mean_squared_error


if __name__ == "__main__":
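    model_path = "/opt/ml/processing/model/model.tar.gz"
    with tarfile.open(model_path) as tar:
        tar.extractall(path=".")

    # Load the pickled model extracted from the archive and close the file afterwards.
    with open("xgboost-model", "rb") as model_file:
        model = pickle.load(model_file)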

    test_path = "/opt/ml/processing/test/test.csv"
    df = pd.read_csv(test_path, header=None)

    y_test = df.iloc[:, 0].to_numpy()
    df.drop(df.columns[0], axis=1, inplace=True)

    X_test = xgboost.DMatrix(df.values)

    predictions = model.predict(X_test)

    mse = mean_squared_error(y_test, predictions)
    std = np.std(y_test - predictions)
    report_dict = {
        "regression_metrics": {
            "mse": {"value": mse, "standard_deviation": std},
        },
    }

    output_dir = "/opt/ml/processing/evaluation"
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

    evaluation_path = f"{output_dir}/evaluation.json"
    with open(evaluation_path, "w") as f:
        f.write(json.dumps(report_dict))

Next, create an instance of a `ScriptProcessor` processor.

API Reference
- https://sagemaker.readthedocs.io/en/stable/api/training/processing.html

In [None]:
%%time

from sagemaker.processing import ScriptProcessor

script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=instance_type,
    instance_count=1,
    base_job_name=f"{base_job_name}-eval",
    role=role,
    sagemaker_session=sagemaker_session,
)

eval_job = script_eval.run(
    inputs=[
        ProcessingInput(
            source=model_uri,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=test_dataset_uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="generated/code/evaluation.py",
)

In [None]:
pprint(script_eval.__dict__)

In [None]:
test_result_uri = script_eval.latest_job.outputs[0].destination

# Examining the Evaluation

Examine the resulting model evaluation after the evaluation job completes. Download the resulting `evaluation.json` file from S3 and print the report.

In [None]:
from pprint import pprint
import json

evaluation_json = sagemaker.s3.S3Downloader.read_file(
    "{}/evaluation.json".format(test_result_uri)
)
pprint(json.loads(evaluation_json))
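
The `mse_threshold` parameter defined earlier is never compared against the result. A minimal sketch of that check follows, assuming the report layout written by the evaluation script. `passes_threshold` is a hypothetical helper and the sample report values are made up.

```python
import json


def passes_threshold(evaluation_json: str, mse_threshold: float = 6.0) -> bool:
    """Return True when the reported MSE is at or below the threshold."""
    report = json.loads(evaluation_json)
    mse = report["regression_metrics"]["mse"]["value"]
    return mse <= mse_threshold


# Made-up report for illustration; in the notebook, pass `evaluation_json` instead.
sample = '{"regression_metrics": {"mse": {"value": 4.9, "standard_deviation": 2.2}}}'
print(passes_threshold(sample))
```

In the notebook this could be called as `passes_threshold(evaluation_json, mse_threshold)` to decide whether the model is accurate enough to keep.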