# Feature Transformation with an Amazon SageMaker Processing Job and Scikit-Learn

In this notebook, we convert raw text into BERT embeddings.  This will allow us to perform natural language processing tasks such as text classification.

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, libraries such as Scikit-Learn are used to pre-process data sets in order to prepare them for training. Scikit-Learn is not a distributed framework on its own, so in this notebook we'll use Amazon SageMaker Processing to run our Scikit-Learn workload in a managed SageMaker environment and spread it across multiple instances.

# NOTE:  THIS NOTEBOOK WILL TAKE 5-10 MINUTES TO COMPLETE.

# PLEASE BE PATIENT.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


## Contents

1. Setup Environment
1. Setup Input Data
1. Setup Output Data
1. Build a Scikit-Learn container for running the processing job
1. Run the Processing Job using Amazon SageMaker
1. Inspect the Processed Output Data

# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [2]:
!pip install --disable-pip-version-check -q sagemaker-experiments==0.1.26
import sagemaker
import boto3

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name="sagemaker", region_name=region)
s3 = boto3.Session().client(service_name="s3", region_name=region)

# Setup Input Data

In [3]:
%store -r s3_public_path_tsv

In [4]:
try:
    s3_public_path_tsv
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the INGEST section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [5]:
print(s3_public_path_tsv)

s3://sagemaker-us-east-ads508-sp23-t8


In [6]:
%store -r s3_private_path_tsv

In [7]:
try:
    s3_private_path_tsv
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the INGEST section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [8]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-657724983756/team_8_data


In [9]:
raw_input_data_s3_uri = "s3://{}/amazon-reviews-pds/tsv/".format(bucket)
print(raw_input_data_s3_uri)

s3://sagemaker-us-east-1-657724983756/amazon-reviews-pds/tsv/


In [10]:
!aws s3 ls $raw_input_data_s3_uri

2023-03-18 17:44:16   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2023-03-18 17:44:18   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2023-03-18 17:44:19   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz


# Run the Processing Job using Amazon SageMaker

Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom Python script.

# Review the Processing Script

In [11]:
!pygmentize preprocess-scikit-text-to-bert-feature-store.py

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import functools
import multiprocessing

from datetime import datetime
from time import gmtime, strftime, sleep

import sys
import re
import collections
import argparse
import

Run this script as a processing job.  You need to specify one `ProcessingInput` whose `source` is the Amazon S3 location of the raw data and whose `destination` is the local path inside the Docker container where the script reads that data (a directory under `/opt/ml/processing/input`).  All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method a `ProcessingOutput` for each output, where the `source` is the local path the script writes output data to.  For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.  Each `ProcessingOutput` also takes an `output_name`, which makes it easier to retrieve the output artifacts after the job has run.
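
As a quick illustration of the default destination format described above, the helper below assembles the URI from its parts. The function name `default_output_uri` is hypothetical and not part of the SageMaker SDK (the SDK computes the destination for you); the bucket and job name are simply the values from this notebook's run.

```python
def default_output_uri(bucket: str, processing_job_name: str, output_name: str) -> str:
    """Assemble the default S3 destination for a named processing output.

    Hypothetical helper for illustration only; the SDK does this internally.
    """
    return "s3://{}/{}/output/{}".format(bucket, processing_job_name, output_name)


uri = default_output_uri(
    bucket="sagemaker-us-east-1-657724983756",
    processing_job_name="sagemaker-scikit-learn-2023-03-30-00-49-35-904",
    output_name="bert-train",
)
print(uri)
```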

The `arguments` parameter of the `run()` method is a list of command-line arguments passed to our `preprocess-scikit-text-to-bert-feature-store.py` script.
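
The script's own argument handling is not shown in full here, so the following is only a minimal sketch of how such command-line arguments could be parsed with `argparse`. The flag names mirror those passed to `run()` later in this notebook; the defaults and the `str_to_bool` helper are assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert strings such as 'True' or 'False' (the result of str(bool)) to bool."""
    return value.strip().lower() in ("true", "1", "yes")


def parse_args(argv=None):
    """Parse the processing-job flags; defaults here are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Illustrative argument parser")
    parser.add_argument("--train-split-percentage", type=float, default=0.90)
    parser.add_argument("--validation-split-percentage", type=float, default=0.05)
    parser.add_argument("--test-split-percentage", type=float, default=0.05)
    parser.add_argument("--max-seq-length", type=int, default=64)
    parser.add_argument("--balance-dataset", type=str_to_bool, default=True)
    parser.add_argument("--feature-store-offline-prefix", type=str, default=None)
    parser.add_argument("--feature-group-name", type=str, default=None)
    return parser.parse_args(argv)


# Every value arrives as a string, which is why the notebook wraps them in str().
args = parse_args([
    "--train-split-percentage", "0.9",
    "--max-seq-length", "64",
    "--balance-dataset", "False",
])
print(args.train_split_percentage, args.max_seq_length, args.balance_dataset)
```

Note that a plain `type=bool` would treat the string `"False"` as truthy, which is why the sketch converts the flag explicitly.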

Note that we are sharding the data using `ShardedByS3Key` to spread the transformations across all worker nodes in the cluster.
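
With `ShardedByS3Key`, each instance receives a subset of the input objects rather than a full copy. The sketch below illustrates the idea with a simple round-robin assignment over sorted keys. This is an assumption for illustration, not SageMaker's documented algorithm, and `shard_by_key` is a hypothetical name. For this notebook's three input files and two instances, it happens to reproduce the `part-algo-1` / `part-algo-2` grouping visible in the output listings later on.

```python
def shard_by_key(keys, instance_count):
    """Assign each S3 key to exactly one instance, round-robin over sorted keys.

    Illustrative only: the real assignment is made by SageMaker.
    """
    shards = {"algo-{}".format(i + 1): [] for i in range(instance_count)}
    for index, key in enumerate(sorted(keys)):
        shards["algo-{}".format(index % instance_count + 1)].append(key)
    return shards


input_keys = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    "amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
]

shards = shard_by_key(input_keys, instance_count=2)
for instance, assigned in shards.items():
    print(instance, assigned)
```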

# Track the `Experiment`
We will track every step of this experiment: `prepare`, `train`, `optimize`, and `deploy`.

# Concepts

**Experiment**: A collection of related Trials.  Add Trials to an Experiment that you wish to compare together.

**Trial**: A description of a multi-step machine learning workflow. Each step in the workflow is described by a Trial Component. Trial Components have no inherent relationship to one another, such as ordering.

**Trial Component**: A description of a single step in a machine learning workflow. For example, data cleaning, feature extraction, model training, model evaluation, etc.

**Tracker**: A logger of information about a single TrialComponent.
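
To make the containment relationship concrete, here is a small sketch of the hierarchy using plain dataclasses. The class names (`ExperimentNode`, `TrialNode`, `TrialComponentNode`) are hypothetical and unrelated to the `smexperiments` classes used in this notebook; they only model "an Experiment holds Trials, a Trial holds Trial Components".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrialComponentNode:
    name: str  # a single workflow step, e.g. "prepare"


@dataclass
class TrialNode:
    name: str
    # Stored in a list for convenience; the order carries no meaning.
    components: List[TrialComponentNode] = field(default_factory=list)


@dataclass
class ExperimentNode:
    name: str
    trials: List[TrialNode] = field(default_factory=list)


demo_experiment = ExperimentNode(name="example-experiment")
demo_trial = TrialNode(name="example-trial")
for step in ("prepare", "train", "optimize", "deploy"):
    demo_trial.components.append(TrialComponentNode(name=step))
demo_experiment.trials.append(demo_trial)

print([component.name for component in demo_experiment.trials[0].components])
```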

<img src="img/sagemaker-experiments.png" width="90%" align="left">


# Create the `Experiment`

In [12]:
import time
from smexperiments.experiment import Experiment

timestamp = int(time.time())

experiment = Experiment.create(
    experiment_name="Amazon-Customer-Reviews-BERT-Experiment-{}".format(timestamp),
    description="Amazon Customer Reviews BERT Experiment",
    sagemaker_boto_client=sm,
)

experiment_name = experiment.experiment_name
print("Experiment name: {}".format(experiment_name))

Experiment name: Amazon-Customer-Reviews-BERT-Experiment-1680137374


# Create the `Trial`

In [13]:
import time
from smexperiments.trial import Trial

timestamp = int(time.time())

trial = Trial.create(
    trial_name="trial-{}".format(timestamp), experiment_name=experiment_name, sagemaker_boto_client=sm
)

trial_name = trial.trial_name
print("Trial name: {}".format(trial_name))

Trial name: trial-1680137374


# Create the `Experiment Config`

In [14]:
experiment_config = {
    "ExperimentName": experiment_name,
    "TrialName": trial_name,
    "TrialComponentDisplayName": "prepare",
}

In [15]:
print(experiment_name)

Amazon-Customer-Reviews-BERT-Experiment-1680137374


In [16]:
%store experiment_name

Stored 'experiment_name' (str)


In [17]:
print(trial_name)

trial-1680137374


In [18]:
%store trial_name

Stored 'trial_name' (str)


# Create Feature Store and Feature Group

In [19]:
featurestore_runtime = boto3.Session().client(service_name="sagemaker-featurestore-runtime", region_name=region)

In [20]:
timestamp = int(time.time())

feature_store_offline_prefix = "reviews-feature-store-" + str(timestamp)

print(feature_store_offline_prefix)

reviews-feature-store-1680137375


In [21]:
feature_group_name = "reviews-feature-group-" + str(timestamp)

print(feature_group_name)

reviews-feature-group-1680137375


In [22]:
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition,
    FeatureTypeEnum,
)

feature_definitions = [
    FeatureDefinition(feature_name="input_ids", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="input_mask", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="segment_ids", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="label_id", feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="review_id", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="date", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="label", feature_type=FeatureTypeEnum.INTEGRAL),
    #    FeatureDefinition(feature_name='review_body', feature_type=FeatureTypeEnum.STRING)
]

In [23]:
from sagemaker.feature_store.feature_group import FeatureGroup

feature_group = FeatureGroup(name=feature_group_name, feature_definitions=feature_definitions, sagemaker_session=sess)

print(feature_group)

FeatureGroup(name='reviews-feature-group-1680137375', sagemaker_session=<sagemaker.session.Session object at 0x7f9744774d90>, feature_definitions=[FeatureDefinition(feature_name='input_ids', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='input_mask', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='segment_ids', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='label_id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>), FeatureDefinition(feature_name='review_id', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='date', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='label', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>)])


# Set the Processing Job Hyper-Parameters 

In [24]:
processing_instance_type = "ml.c5.xlarge"
processing_instance_count = 2
train_split_percentage = 0.90
validation_split_percentage = 0.05
test_split_percentage = 0.05
balance_dataset = True
max_seq_length = 64
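
The three split percentages above must sum to 1.0. As a rough, standard-library-only sketch of what a 90/5/5 split means in terms of row counts (the processing script imports scikit-learn's `train_test_split`, so treat this as a stand-in, not the script's actual logic), consider the following. The `split_indices` name and the seed are illustrative assumptions.

```python
import math
import random


def split_indices(n_rows, train_pct, validation_pct, test_pct, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists."""
    if not math.isclose(train_pct + validation_pct + test_pct, 1.0):
        raise ValueError("split percentages must sum to 1.0")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_pct)
    n_validation = round(n_rows * validation_pct)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]  # the remainder goes to test
    return train, validation, test


train, validation, test = split_indices(1000, 0.90, 0.05, 0.05)
print(len(train), len(validation), len(test))
```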

# Choosing a `max_seq_length` for BERT
Since a smaller `max_seq_length` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `80%` of our reviews.

Remember our distribution of review lengths from a previous section?

```
mean         51.683405
std         107.030844
min           1.000000
10%           2.000000
20%           7.000000
30%          19.000000
40%          22.000000
50%          26.000000
60%          32.000000
70%          43.000000
80%          63.000000
90%         110.000000
100%       5347.000000
max        5347.000000
```

![](img/review_word_count_distribution.png)

Review length `63` represents the `80th` percentile for this dataset.  However, it's best to stick with powers-of-2 when using BERT.  So let's choose `64` as this is the smallest power-of-2 greater than `63`.  Reviews with length > `64` will be truncated to `64`.
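
The choice of `64` and the effect of truncation can be sketched in a few lines. This is a simplified illustration, not the processing script's code: `next_power_of_two` and `pad_or_truncate` are hypothetical helpers, the padding id of `0` is an assumption, and a real BERT tokenizer also manages the special `[CLS]` and `[SEP]` tokens, which this sketch ignores.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Clip token_ids to max_len, then right-pad with pad_id.

    Returns (input_ids, input_mask), where the mask is 1 for real tokens
    and 0 for padding.
    """
    clipped = list(token_ids[:max_len])
    padding = max_len - len(clipped)
    input_ids = clipped + [pad_id] * padding
    input_mask = [1] * len(clipped) + [0] * padding
    return input_ids, input_mask


seq_len = next_power_of_two(63)  # the 80th-percentile review length
print(seq_len)

ids, mask = pad_or_truncate([101, 2023, 2003, 102], max_len=8)
print(ids)
print(mask)
```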

In [25]:
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    env={"AWS_DEFAULT_REGION": region},
    max_runtime_in_seconds=7200,
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [26]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor.run(
    code="preprocess-scikit-text-to-bert-feature-store.py",
    inputs=[
        ProcessingInput(
            input_name="raw-input-data",
            source=raw_input_data_s3_uri,
            destination="/opt/ml/processing/input/data/",
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="bert-train", s3_upload_mode="EndOfJob", source="/opt/ml/processing/output/bert/train"
        ),
        ProcessingOutput(
            output_name="bert-validation",
            s3_upload_mode="EndOfJob",
            source="/opt/ml/processing/output/bert/validation",
        ),
        ProcessingOutput(
            output_name="bert-test", s3_upload_mode="EndOfJob", source="/opt/ml/processing/output/bert/test"
        ),
    ],
    arguments=[
        "--train-split-percentage",
        str(train_split_percentage),
        "--validation-split-percentage",
        str(validation_split_percentage),
        "--test-split-percentage",
        str(test_split_percentage),
        "--max-seq-length",
        str(max_seq_length),
        "--balance-dataset",
        str(balance_dataset),
        "--feature-store-offline-prefix",
        str(feature_store_offline_prefix),
        "--feature-group-name",
        str(feature_group_name),
    ],
    experiment_config=experiment_config,
    logs=True,
    wait=False,
)

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2023-03-30-00-49-35-904


In [27]:
scikit_processing_job_name = processor.jobs[-1].describe()["ProcessingJobName"]
print(scikit_processing_job_name)

sagemaker-scikit-learn-2023-03-30-00-49-35-904


In [28]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(
            region, scikit_processing_job_name
        )
    )
)

In [29]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(
            region, scikit_processing_job_name
        )
    )
)

In [30]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(
            bucket, scikit_processing_job_name, region
        )
    )
)

# Monitor the Processing Job

In [31]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
    processing_job_name=scikit_processing_job_name, sagemaker_session=sess
)

processing_job_description = running_processor.describe()

print(processing_job_description)

{'ProcessingInputs': [{'InputName': 'raw-input-data', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-657724983756/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/input/code/preprocess-scikit-text-to-bert-feature-store.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}, 'AppMan

In [32]:
running_processor.wait(logs=False)

...........................................................................................................................................................................!

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._

# Inspect the Processed Output Data

Take a look at a few rows of the transformed dataset to make sure the processing was successful.

In [33]:
processing_job_description = running_processor.describe()

output_config = processing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "bert-train":
        processed_train_data_s3_uri = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "bert-validation":
        processed_validation_data_s3_uri = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "bert-test":
        processed_test_data_s3_uri = output["S3Output"]["S3Uri"]

print(processed_train_data_s3_uri)
print(processed_validation_data_s3_uri)
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-train
s3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-validation
s3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-test


In [34]:
!aws s3 ls $processed_train_data_s3_uri/

2023-03-30 01:03:55   10484051 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2023-03-30 01:03:55    2311567 part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
2023-03-30 01:03:55   11706188 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [35]:
!aws s3 ls $processed_validation_data_s3_uri/

2023-03-30 01:03:56     579800 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2023-03-30 01:03:56     128602 part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
2023-03-30 01:03:56     651605 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [36]:
!aws s3 ls $processed_test_data_s3_uri/

2023-03-30 01:03:56     582700 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2023-03-30 01:03:56     129436 part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
2023-03-30 01:03:56     649742 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


# Pass Variables to the Next Notebook(s)

In [37]:
%store raw_input_data_s3_uri

Stored 'raw_input_data_s3_uri' (str)


In [38]:
%store max_seq_length

Stored 'max_seq_length' (int)


In [39]:
%store train_split_percentage

Stored 'train_split_percentage' (float)


In [40]:
%store validation_split_percentage

Stored 'validation_split_percentage' (float)


In [41]:
%store test_split_percentage

Stored 'test_split_percentage' (float)


In [42]:
%store balance_dataset

Stored 'balance_dataset' (bool)


In [43]:
%store feature_store_offline_prefix

Stored 'feature_store_offline_prefix' (str)


In [44]:
%store feature_group_name

Stored 'feature_group_name' (str)


In [45]:
%store processed_train_data_s3_uri

Stored 'processed_train_data_s3_uri' (str)


In [46]:
%store processed_validation_data_s3_uri

Stored 'processed_validation_data_s3_uri' (str)


In [47]:
%store processed_test_data_s3_uri

Stored 'processed_test_data_s3_uri' (str)


In [48]:
%store

Stored variables and their in-db values:
balance_dataset                                       -> True
balanced_bias_data_jsonlines_s3_uri                   -> 's3://sagemaker-us-east-1-657724983756/bias-detect
balanced_bias_data_s3_uri                             -> 's3://sagemaker-us-east-1-657724983756/bias-detect
bias_data_s3_uri                                      -> 's3://sagemaker-us-east-1-657724983756/bias-detect
experiment_name                                       -> 'Amazon-Customer-Reviews-BERT-Experiment-168013737
feature_group_name                                    -> 'reviews-feature-group-1680137375'
feature_store_offline_prefix                          -> 'reviews-feature-store-1680137375'
ingest_create_athena_db_passed                        -> True
ingest_create_athena_table_parquet_passed             -> True
ingest_create_athena_table_tsv_passed                 -> True
max_seq_length                                        -> 64
processed_test_data_s3_uri         

# Query The Feature Store

In [49]:
feature_store_query = feature_group.athena_query()

In [50]:
feature_store_table = feature_store_query.table_name

In [51]:
query_string = """
SELECT input_ids, input_mask, segment_ids, label_id, split_type  FROM "{}" WHERE split_type='train' LIMIT 5
""".format(
    feature_store_table
)

print("Running " + query_string)

Running 
SELECT input_ids, input_mask, segment_ids, label_id, split_type  FROM "reviews_feature_group_1680137375_1680137851" WHERE split_type='train' LIMIT 5



In [52]:
feature_store_query.run(
    query_string=query_string,
    output_location="s3://" + bucket + "/" + feature_store_offline_prefix + "/query_results/",
)

feature_store_query.wait()

INFO:sagemaker:Query 48034212-5e5e-4f4f-90b3-27655bed4b3e is being executed.
INFO:sagemaker:Query 48034212-5e5e-4f4f-90b3-27655bed4b3e successfully executed.


In [53]:
feature_store_query.as_dataframe()

| | input_ids | input_mask | segment_ids | label_id | split_type |
|---|---|---|---|---|---|
| 0 | [101, 1045, 1005, 1049, 9364, 2138, 1045, 2359... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 | train |
| 1 | [101, 2348, 1045, 3641, 5592, 4003, 1999, 2085... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | train |
| 2 | [101, 1045, 2903, 1996, 7799, 12246, 5632, 199... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | train |
| 3 | [101, 3733, 2000, 4965, 1998, 4604, 2021, 1996... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 | train |
| 4 | [101, 1045, 2134, 2102, 2066, 1996, 2755, 2043... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 | train |
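
Because `input_ids`, `input_mask`, and `segment_ids` are defined as `STRING` features, the query returns them as text. A minimal sketch of decoding them back into integer lists follows, assuming the stored strings are JSON-style list literals as they appear in the query result. The `decode_feature` helper and the short example row are made up for illustration.

```python
import json


def decode_feature(value: str):
    """Parse a string such as "[101, 1045, 0]" into a list of ints."""
    parsed = json.loads(value)
    return [int(item) for item in parsed]


# Made-up, shortened row; real rows hold max_seq_length entries per feature.
example_row = {
    "input_ids": "[101, 1045, 1005, 1049, 102, 0, 0, 0]",
    "input_mask": "[1, 1, 1, 1, 1, 0, 0, 0]",
}

decoded_ids = decode_feature(example_row["input_ids"])
decoded_mask = decode_feature(example_row["input_mask"])
print(decoded_ids)
print(sum(decoded_mask))  # number of non-padding tokens
```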


# Show the Experiment Tracking Lineage

In [54]:
from sagemaker.analytics import ExperimentAnalytics

import pandas as pd

pd.set_option("max_colwidth", 500)
# pd.set_option("max_rows", 100)

experiment_analytics = ExperimentAnalytics(
    sagemaker_session=sess, experiment_name=experiment_name, sort_by="CreationTime", sort_order="Descending"
)

experiment_analytics_df = experiment_analytics.dataframe()
experiment_analytics_df

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,AWS_DEFAULT_REGION,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,SageMaker.ImageUri - MediaType,SageMaker.ImageUri - Value,code - MediaType,...,raw-input-data - MediaType,raw-input-data - Value,bert-test - MediaType,bert-test - Value,bert-train - MediaType,bert-train - Value,bert-validation - MediaType,bert-validation - Value,Trials,Experiments
0,sagemaker-scikit-learn-2023-03-30-00-49-35-904-aws-processing-job,prepare,arn:aws:sagemaker:us-east-1:657724983756:processing-job/sagemaker-scikit-learn-2023-03-30-00-49-35-904,us-east-1,2.0,ml.c5.xlarge,30.0,,683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3,,...,,s3://sagemaker-us-east-1-657724983756/amazon-reviews-pds/tsv/,,s3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-test,,s3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-train,,s3://sagemaker-us-east-1-657724983756/sagemaker-scikit-learn-2023-03-30-00-49-35-904/output/bert-validation,[trial-1680137374],[Amazon-Customer-Reviews-BERT-Experiment-1680137374]


In [55]:
trial_component_name = experiment_analytics_df.TrialComponentName[0]
print(trial_component_name)

sagemaker-scikit-learn-2023-03-30-00-49-35-904-aws-processing-job


In [56]:
trial_component_description = sm.describe_trial_component(TrialComponentName=trial_component_name)
trial_component_description

{'TrialComponentName': 'sagemaker-scikit-learn-2023-03-30-00-49-35-904-aws-processing-job',
 'TrialComponentArn': 'arn:aws:sagemaker:us-east-1:657724983756:experiment-trial-component/sagemaker-scikit-learn-2023-03-30-00-49-35-904-aws-processing-job',
 'DisplayName': 'prepare',
 'Source': {'SourceArn': 'arn:aws:sagemaker:us-east-1:657724983756:processing-job/sagemaker-scikit-learn-2023-03-30-00-49-35-904',
  'SourceType': 'SageMakerProcessingJob'},
 'Status': {'PrimaryStatus': 'Completed',
  'Message': 'Status: Completed, exit message: null, failure reason: null'},
 'StartTime': datetime.datetime(2023, 3, 30, 0, 53, 41, tzinfo=tzlocal()),
 'EndTime': datetime.datetime(2023, 3, 30, 1, 4, 1, tzinfo=tzlocal()),
 'CreationTime': datetime.datetime(2023, 3, 30, 0, 49, 37, 151000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:657724983756:user-profile/d-lhk93bs8oqqq/default-1677777930654',
  'UserProfileName': 'default-1677777930654',
  'DomainId': 'd-lhk93bs

# Show SageMaker ML Lineage Tracking 

Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. 

Amazon SageMaker Lineage enables events that happen within SageMaker to be traced via a graph structure. The data simplifies generating reports, making comparisons, or discovering relationships between events. For example, you can easily trace both how a model was generated and where the model was deployed.

The lineage graph is created automatically by SageMaker, and you can also directly create or modify your own graphs.

## Key Concepts

* **Lineage Graph** - A connected graph tracing your machine learning workflow end to end.

* **Artifacts** - Represents a URI addressable object or data. Artifacts are typically inputs or outputs to Actions.

* **Actions** - Represents an action taken such as a computation, transformation, or job.

* **Contexts** - Provides a method to logically group other entities.

* **Associations** - A directed edge in the lineage graph that links two entities.

* **Lineage Traversal** - Starting from an arbitrary point, trace the lineage graph to discover and analyze relationships between steps in your workflow.

* **Experiments** - Experiment entities (Experiments, Trials, and Trial Components) are also part of the lineage graph and can be associated with Artifacts, Actions, or Contexts.
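
To illustrate Associations and Lineage Traversal, the sketch below represents a lineage graph as a list of directed edges and walks it breadth-first. It does not use the SageMaker Lineage API; the entity names are shortened placeholders modeled loosely on this notebook's processing job (script, raw data, and container image as inputs; three BERT datasets as outputs), and `downstream` is a hypothetical helper.

```python
from collections import defaultdict, deque

# (source, destination, association_type); names are illustrative placeholders.
associations = [
    ("processing-script", "processing-job", "ContributedTo"),
    ("raw-tsv-data", "processing-job", "ContributedTo"),
    ("sklearn-image", "processing-job", "ContributedTo"),
    ("processing-job", "bert-train", "Produced"),
    ("processing-job", "bert-validation", "Produced"),
    ("processing-job", "bert-test", "Produced"),
]


def downstream(start, edges):
    """Breadth-first traversal returning every entity reachable from start."""
    graph = defaultdict(list)
    for source, destination, _ in edges:
        graph[source].append(destination)
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


reachable = downstream("raw-tsv-data", associations)
print(reachable)
```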

## Show Lineage Artifacts For Our Processing Job

In [57]:
from sagemaker.lineage.visualizer import LineageTableVisualizer

lineage_table_viz = LineageTableVisualizer(sess)
lineage_table_viz_df = lineage_table_viz.show(processing_job_name=scikit_processing_job_name)
lineage_table_viz_df

| | Name/Source | Direction | Type | Association Type | Lineage Type |
|---|---|---|---|---|---|
| 0 | s3://...ess-scikit-text-to-bert-feature-store.py | Input | DataSet | ContributedTo | artifact |
| 1 | s3://...t-1-657724983756/amazon-reviews-pds/tsv/ | Input | DataSet | ContributedTo | artifact |
| 2 | 68331...om/sagemaker-scikit-learn:0.23-1-cpu-py3 | Input | Image | ContributedTo | artifact |
| 3 | s3://...2023-03-30-00-49-35-904/output/bert-test | Output | DataSet | Produced | artifact |
| 4 | s3://...3-30-00-49-35-904/output/bert-validation | Output | DataSet | Produced | artifact |
| 5 | s3://...023-03-30-00-49-35-904/output/bert-train | Output | DataSet | Produced | artifact |


# Release Resources

In [58]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [59]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}

<IPython.core.display.Javascript object>