This notebook has been adapted from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-lineage/sagemaker-lineage-multihop-queries.ipynb

## Setup

This notebook must be run with the Python 3 (Data Science 2.0) kernel.

Let's start by installing the SageMaker Python SDK, boto3, botocore, and the AWS CLI.

In [None]:
!pip install sagemaker botocore boto3 awscli --upgrade

In [None]:
!pip install sagemaker-experiments pyvis

In [None]:
!python --version

In [None]:
import os
import boto3
import sagemaker
import pprint
from botocore.config import Config

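# Retry with adaptive backoff so that throttled API calls are retried
config = Config(retries={"max_attempts": 50, "mode": "adaptive"})

# Pass the retry config to the SageMaker client used by the session
sm_client = boto3.client("sagemaker", config=config)
sagemaker_session = sagemaker.Session(sagemaker_client=sm_client)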

region = sagemaker_session.boto_region_name

default_bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

# Pretty-printer used to display query outputs
pp = pprint.PrettyPrinter()

In [None]:
from datetime import datetime

training_instance_type = "ml.m5.xlarge"
inference_instance_type = "ml.m5.xlarge"
s3_prefix = "astronomer-example"

unique_id = str(datetime.now().timestamp()).split(".")[0]
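
The `unique_id` above is the current Unix time truncated to whole seconds, so two runs started within the same second would produce the same experiment and trial names. A minimal sketch of one way to reduce that risk, assuming a short random suffix is acceptable in resource names (`make_unique_id` is a hypothetical helper, not part of the original notebook):

```python
import uuid
from datetime import datetime


def make_unique_id(now=None):
    """Return '<unix-seconds>-<6 hex chars>' for use in resource names."""
    now = now or datetime.now()
    # Whole seconds, the same value as str(now.timestamp()).split(".")[0]
    seconds = str(int(now.timestamp()))
    # Random part to avoid collisions between runs in the same second
    suffix = uuid.uuid4().hex[:6]
    return f"{seconds}-{suffix}"


print(make_unique_id())
```

The result contains only digits, lowercase hex characters, and a hyphen, so it can be appended to names such as `AstronomerExperiment-` in the same way as the original `unique_id`.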

## Create an Experiment and Trial for a training job

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

experiment_name = f"AstronomerExperiment-{unique_id}"
exp = Experiment.create(experiment_name=experiment_name, sagemaker_boto_client=sm_client)

trial = Trial.create(
    experiment_name=exp.experiment_name,
    trial_name=f"AstronomerTrial-{unique_id}",
    sagemaker_boto_client=sm_client,
)

print(exp.experiment_name)
print(trial.trial_name)

## Training Data

We create a `data/` directory to hold the preprocessed [UCI Abalone](https://archive.ics.uci.edu/ml/datasets/abalone) dataset. The preprocessing was done with the script defined in the [Orchestrating Jobs with Amazon SageMaker Model Building Pipelines](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb) notebook. The training and validation data are then uploaded to S3 so that they can be used in the training and inference jobs.

In [None]:
default_bucket

In [None]:
if not os.path.exists("./data/"):
    os.makedirs("./data/")
    print("Directory Created ")
else:
    print("Directory already exists")

# Download the processed abalone dataset files from the public sample bucket
s3 = boto3.client("s3")
s3.download_file(
    "sagemaker-sample-files",
    "datasets/tabular/uci_abalone/preprocessed/test.csv",
    "./data/test.csv",
)
s3.download_file(
    "sagemaker-sample-files",
    "datasets/tabular/uci_abalone/preprocessed/train.csv",
    "./data/train.csv",
)
s3.download_file(
    "sagemaker-sample-files",
    "datasets/tabular/uci_abalone/preprocessed/validation.csv",
    "./data/validation.csv",
)

# Upload the datasets to the SageMaker session default bucket
boto3.Session().resource("s3").Bucket(default_bucket).Object(
    "experiments-demo/train.csv"
).upload_file("data/train.csv")
boto3.Session().resource("s3").Bucket(default_bucket).Object(
    "experiments-demo/validation.csv"
).upload_file("data/validation.csv")

training_data = f"s3://{default_bucket}/experiments-demo/train.csv"
validation_data = f"s3://{default_bucket}/experiments-demo/validation.csv"
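
The three downloads above differ only in the split name, so they could be expressed as a loop. A minimal sketch, assuming the client passed in behaves like a boto3 S3 client (`download_splits` is a hypothetical helper introduced here for illustration):

```python
import os


def download_splits(s3_client, bucket, prefix, splits, local_dir="./data"):
    """Download '<prefix>/<split>.csv' for each split and return local paths."""
    os.makedirs(local_dir, exist_ok=True)  # no error if the directory exists
    paths = {}
    for split in splits:
        key = f"{prefix}/{split}.csv"
        path = os.path.join(local_dir, f"{split}.csv")
        # download_file(Bucket, Key, Filename) is the boto3 S3 client method
        s3_client.download_file(bucket, key, path)
        paths[split] = path
    return paths
```

In this notebook it could be called as `download_splits(s3, "sagemaker-sample-files", "datasets/tabular/uci_abalone/preprocessed", ["train", "validation", "test"])`.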

In [None]:
print(f"s3://{default_bucket}/experiments-demo/train.csv")
print(f"s3://{default_bucket}/experiments-demo/validation.csv")

## Create a training job

We train a simple XGBoost model on the Abalone dataset.
`sagemaker.image_uris.retrieve()` is used to look up the SageMaker container image for XGBoost so that it can be passed to the Estimator.

In the `.fit()` call, we pass in a training and a validation dataset along with an `experiment_config`. The `experiment_config` ensures that the metrics, parameters, and artifacts associated with this training job are logged to the experiment and trial created above.


In [None]:
from sagemaker.estimator import Estimator

model_path = f"s3://{default_bucket}/{s3_prefix}/xgb_model"
# Overrides the ml.m5.xlarge value set during setup
training_instance_type = "ml.m5.large"

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.2-2",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
    base_job_name="astronomer-blogpost",
)

xgb_train.set_hyperparameters(
    objective="reg:squarederror",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    verbosity=0,
)

In [None]:
from sagemaker.inputs import TrainingInput

xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=training_data,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=validation_data,
            content_type="text/csv",
        ),
    },
    experiment_config={
        "ExperimentName": experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "AstronomerTrialComponent",
    },
)
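
Because the training job was started with an `experiment_config`, it should show up as a trial component of the trial. A minimal sketch of how that could be checked, assuming a boto3 SageMaker client is passed in (`trial_component_names` is a hypothetical helper, not part of the original notebook):

```python
def trial_component_names(sm_client, trial_name):
    """Return the names of all trial components attached to a trial."""
    names = []
    kwargs = {"TrialName": trial_name}
    while True:
        response = sm_client.list_trial_components(**kwargs)
        names += [
            summary["TrialComponentName"]
            for summary in response["TrialComponentSummaries"]
        ]
        # Results are paginated; follow NextToken until it is absent
        token = response.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token
```

For example, `pp.pprint(trial_component_names(sm_client, trial.trial_name))` would list the components recorded for the trial created earlier.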

## Create a Model Package Group for the trained model to be registered

Create a new Model Package Group or use an existing one to register the model.

In [None]:
model_package_group_name = "astronomer-blogpost"
mpg = sm_client.create_model_package_group(ModelPackageGroupName=model_package_group_name)
mpg_arn = mpg["ModelPackageGroupArn"]
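
`create_model_package_group` fails when a group with the same name already exists, so re-running the cell above is likely to raise an error. A minimal sketch of the "use an existing one" path, assuming a boto3 SageMaker client is passed in (`get_or_create_model_package_group` is a hypothetical helper introduced here):

```python
def get_or_create_model_package_group(sm_client, name):
    """Return the ARN of the named group, creating it only if it is missing."""
    kwargs = {"NameContains": name}
    while True:
        response = sm_client.list_model_package_groups(**kwargs)
        for summary in response["ModelPackageGroupSummaryList"]:
            # NameContains is a substring filter, so require an exact match
            if summary["ModelPackageGroupName"] == name:
                return summary["ModelPackageGroupArn"]
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    created = sm_client.create_model_package_group(ModelPackageGroupName=name)
    return created["ModelPackageGroupArn"]
```

With this helper, `mpg_arn = get_or_create_model_package_group(sm_client, model_package_group_name)` could be run repeatedly without failing on the second run.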

## Register the model in the Model Registry
Once the model is registered, it appears in the Model Registry tab of the SageMaker Studio UI. Here the model is registered with `approval_status` set to "Approved". If `approval_status` is omitted, the model is registered as "PendingManualApproval" by default; users can then navigate to the Model Registry to approve the model manually based on any criteria set for model evaluation, or approve it via the API.

In [None]:
inference_instance_type = "ml.m5.xlarge"
model_package = xgb_train.register(
    model_package_group_name=mpg_arn,
    inference_instances=[inference_instance_type],
    transform_instances=[inference_instance_type],
    content_types=["text/csv"],
    response_types=["text/csv"],
    approval_status="Approved",
)

model_package_arn = model_package.model_package_arn
print("Model Package ARN : ", model_package_arn)
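
Changing the approval status via the API can be sketched as follows, assuming a boto3 SageMaker client is passed in. `set_approval_status` and `VALID_STATUSES` are hypothetical names introduced for illustration; `update_model_package` is the underlying client call.

```python
VALID_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def set_approval_status(sm_client, model_package_arn, status):
    """Update the approval status of a registered model package version."""
    # Check the value locally so typos fail before any API call is made
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    return sm_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus=status,
    )
```

For example, `set_approval_status(sm_client, model_package_arn, "Rejected")` would mark the version registered above as rejected.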

In [None]:
create_model_response = sm_client.create_model(
    # Name the model after the model package version (last segment of the ARN)
    ModelName=f"astronomer-blogpost-v{model_package_arn.split('/')[-1]}",
    # Account-specific role ARN; replace it with a role from your own account
    ExecutionRoleArn="arn:aws:iam::936535839574:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole",
    PrimaryContainer={"ModelPackageName": model_package_arn},
)
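
The model name above is built from the last segment of the model package ARN, which is the version number within the group. A minimal sketch of that parsing, assuming the versioned ARN has the form `...:model-package/<group>/<version>`; the account ID and region in the example are placeholders, and `parse_model_package_arn` is a hypothetical helper:

```python
def parse_model_package_arn(arn):
    """Split '...:model-package/<group>/<version>' into (group, version)."""
    # An ARN has six colon-separated fields; the last one is the resource
    resource = arn.split(":", 5)[5]
    kind, group, version = resource.split("/")
    if kind != "model-package":
        raise ValueError(f"not a model package ARN: {arn}")
    return group, int(version)


# Placeholder ARN for illustration only
example = "arn:aws:sagemaker:us-east-1:123456789012:model-package/astronomer-blogpost/1"
group, version = parse_model_package_arn(example)
print(f"{group}-v{version}")  # prints "astronomer-blogpost-v1"
```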