# NOTE:  THIS NOTEBOOK WILL TAKE ABOUT 30 MINUTES TO COMPLETE.

# PLEASE BE PATIENT.

# Fine-Tune a BERT Model and Create a Text Classifier

In the previous section, we performed feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we saved the files in TFRecord format.

Now, let’s fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

![BERT Training](img/bert_training.png)

As mentioned earlier, BERT is built on an attention-based architecture called the Transformer. This is, not coincidentally, the name of the popular BERT Python library, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), which requires less memory and compute but maintains very good accuracy on our dataset.

In [None]:
!pip install -q sagemaker==2.20.0
!pip install -q smdebug==1.0.1

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# _PRE-REQUISITE: You need to have successfully run the notebooks in the `PREPARE` section before you continue with this notebook._

In [2]:
%store -r processed_train_data_s3_uri

In [3]:
try:
    processed_train_data_s3_uri
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the PREPARE section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [4]:
print(processed_train_data_s3_uri)

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-train


In [5]:
%store -r processed_validation_data_s3_uri

In [6]:
try:
    processed_validation_data_s3_uri
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the PREPARE section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [7]:
print(processed_validation_data_s3_uri)

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-validation


In [8]:
%store -r processed_test_data_s3_uri

In [9]:
try:
    processed_test_data_s3_uri
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the PREPARE section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [10]:
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-test
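
The three checks above follow the same pattern: look up a variable restored by `%store -r` and print an error if it is missing. As a minimal sketch, the pattern can be folded into one function that reports every missing variable at once. The helper `missing_stored_variables` and the example namespace are hypothetical and not part of the workshop code; in the notebook you would pass `globals()` as the namespace.

```python
def missing_stored_variables(names, namespace):
    """Return the names that are not defined in the given namespace."""
    return [name for name in names if name not in namespace]


required = [
    'processed_train_data_s3_uri',
    'processed_validation_data_s3_uri',
    'processed_test_data_s3_uri',
]

# Simulated namespace in which only the train URI was restored (assumption for the demo).
example_namespace = {'processed_train_data_s3_uri': 's3://example-bucket/bert-train'}

missing = missing_stored_variables(required, example_namespace)
if missing:
    print('[ERROR] Please run the notebooks in the PREPARE section. Missing: {}'.format(missing))
```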


# Track the `Experiment`
We will track every step of this experiment throughout the `prepare`, `train`, `optimize`, and `deploy` phases.

# Concepts

**Experiment**: A collection of related Trials.  Add Trials to an Experiment that you wish to compare together.

**Trial**: A description of a multi-step machine learning workflow. Each step in the workflow is described by a Trial Component. There is no relationship between Trial Components such as ordering.

**Trial Component**: A description of a single step in a machine learning workflow. For example data cleaning, feature extraction, model training, model evaluation, etc.

**Tracker**: A logger of information about a single TrialComponent.

<img src="img/sagemaker-experiments.png" width="90%" align="left">
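
To make the containment relationship concrete, here is a minimal sketch of the concepts as plain Python dataclasses. These classes are hypothetical stand-ins for illustration only; they are not the `smexperiments` classes and carry none of their behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrialComponentSketch:
    """One step of the workflow, e.g. 'prepare' or 'train'."""
    display_name: str
    parameters: Dict[str, object] = field(default_factory=dict)


@dataclass
class TrialSketch:
    """A multi-step workflow. The list is only a container; no ordering is implied."""
    trial_name: str
    components: List[TrialComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    """A collection of related trials that are compared together."""
    experiment_name: str
    trials: List[TrialSketch] = field(default_factory=list)


sketch_prepare = TrialComponentSketch('prepare', {'max_seq_length': 64})
sketch_train = TrialComponentSketch('train', {'epochs': 3})
sketch_trial = TrialSketch('trial-example', [sketch_prepare, sketch_train])
sketch_experiment = ExperimentSketch('experiment-example', [sketch_trial])

for t in sketch_experiment.trials:
    print(t.trial_name, [c.display_name for c in t.components])
```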


# Create the `Experiment`

In [11]:
import time
from smexperiments.experiment import Experiment

timestamp = int(time.time())

experiment = Experiment.create(
                experiment_name='Amazon-Customer-Reviews-BERT-Experiment-{}'.format(timestamp),
                description='Amazon Customer Reviews BERT Experiment', 
                sagemaker_boto_client=sm)

experiment_name = experiment.experiment_name
print('Experiment name: {}'.format(experiment_name))

Experiment name: Amazon-Customer-Reviews-BERT-Experiment-1608314434


# Create the `Trial`

In [12]:
import time
from smexperiments.trial import Trial

timestamp = int(time.time())

trial = Trial.create(trial_name='trial-{}'.format(timestamp),
                     experiment_name=experiment_name,
                     sagemaker_boto_client=sm)

trial_name = trial.trial_name
print('Trial name: {}'.format(trial_name))

Trial name: trial-1608314434


# Create the `prepare` Trial Component and Tracker
Note:  A Trial Component is actually created through a Tracker.  This is a bit confusing, we know.

In [13]:
from smexperiments.tracker import Tracker

tracker_prepare = Tracker.create(display_name='prepare', 
                                 sagemaker_boto_client=sm)

prepare_trial_component_name = tracker_prepare.trial_component.trial_component_name
print('Prepare trial component name {}'.format(prepare_trial_component_name))

Prepare trial component name TrialComponent-2020-12-18-180035-flma


# Attach the `prepare` Trial Component and Tracker as a Component to the Trial

In [14]:
trial.add_trial_component(tracker_prepare.trial_component)

# Log All Parameters Used During `prepare` Phase

In [15]:
%store -r raw_input_data_s3_uri

In [16]:
print(raw_input_data_s3_uri)

s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/


In [17]:
tracker_prepare.log_input(name='raw_data_s3_uri', 
                          media_type='s3/uri', 
                          value=raw_input_data_s3_uri)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f3d17c9b748>,trial_component_name='TrialComponent-2020-12-18-180035-flma',display_name='prepare',tags=None,trial_component_arn='arn:aws:sagemaker:us-east-1:231218423789:experiment-trial-component/trialcomponent-2020-12-18-180035-flma',response_metadata={'RequestId': '43410760-b849-454f-ba88-1261f52864ba', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '43410760-b849-454f-ba88-1261f52864ba', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Fri, 18 Dec 2020 18:00:36 GMT'}, 'RetryAttempts': 0},parameters={},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={})

In [18]:
%store -r train_split_percentage

In [19]:
print(train_split_percentage)

0.9


In [20]:
%store -r validation_split_percentage

In [21]:
print(validation_split_percentage)

0.05


In [22]:
%store -r test_split_percentage

In [23]:
print(test_split_percentage)

0.05
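
As a quick sanity check, the three split fractions should add up to 1 so that every example lands in exactly one split. The sketch below verifies this and shows the resulting split sizes for a hypothetical dataset of 100,000 reviews; the dataset size is an assumption chosen only for the arithmetic.

```python
import math

train_pct, validation_pct, test_pct = 0.90, 0.05, 0.05

# The three fractions should partition the dataset.
fractions_sum_to_one = math.isclose(train_pct + validation_pct + test_pct, 1.0)
print(fractions_sum_to_one)

example_total = 100_000  # hypothetical number of reviews
train_count = round(example_total * train_pct)
validation_count = round(example_total * validation_pct)
# Give the remainder to the test split so the counts always add up.
test_count = example_total - train_count - validation_count
print(train_count, validation_count, test_count)
```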


In [24]:
%store -r max_seq_length

In [25]:
print(max_seq_length)

64


In [26]:
%store -r balance_dataset

In [27]:
print(balance_dataset)

True


In [28]:
tracker_prepare.log_parameters({
    'max_seq_length': max_seq_length,
    'train_split_percentage': train_split_percentage,
    'validation_split_percentage': validation_split_percentage,
    'test_split_percentage': test_split_percentage, 
    'balance_dataset': str(balance_dataset)
})

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f3d17c9b748>,trial_component_name='TrialComponent-2020-12-18-180035-flma',display_name='prepare',tags=None,trial_component_arn='arn:aws:sagemaker:us-east-1:231218423789:experiment-trial-component/trialcomponent-2020-12-18-180035-flma',response_metadata={'RequestId': '64c95660-37da-4dcf-a745-05482e32a939', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '64c95660-37da-4dcf-a745-05482e32a939', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Fri, 18 Dec 2020 18:00:39 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 64, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05, 'balance_dataset': 'True'},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={})

In [29]:
tracker_prepare.log_output(name='train_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_train_data_s3_uri)

tracker_prepare.log_output(name='validation_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_validation_data_s3_uri)

tracker_prepare.log_output(name='test_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_test_data_s3_uri)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f3d17c9b748>,trial_component_name='TrialComponent-2020-12-18-180035-flma',display_name='prepare',tags=None,trial_component_arn='arn:aws:sagemaker:us-east-1:231218423789:experiment-trial-component/trialcomponent-2020-12-18-180035-flma',response_metadata={'RequestId': '344c5163-31bb-422c-bbe6-844f82886346', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '344c5163-31bb-422c-bbe6-844f82886346', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Fri, 18 Dec 2020 18:00:41 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 64, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05, 'balance_dataset': 'True'},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={'train_data_s3_uri': TrialComponentArtifact(value='s3:/

# Specify the Dataset in S3
We are using the train, validation, and test splits created in the previous section.

In [30]:
print(processed_train_data_s3_uri)

!aws s3 ls $processed_train_data_s3_uri/

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-train
2020-12-18 17:53:19     352994 part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-12-18 17:53:19      11896 part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-12-18 17:51:44      10622 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord


In [31]:
print(processed_validation_data_s3_uri)

!aws s3 ls $processed_validation_data_s3_uri/

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-validation
2020-12-18 17:53:19      20127 part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-12-18 17:53:19        660 part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-12-18 17:51:45        649 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord


In [32]:
print(processed_test_data_s3_uri)

!aws s3 ls $processed_test_data_s3_uri/

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-test
2020-12-18 17:53:20      19789 part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-12-18 17:53:20        720 part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-12-18 17:51:45        603 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord


# Specify S3 `Distribution Strategy`

In [33]:
from sagemaker.inputs import TrainingInput

s3_input_train_data = TrainingInput(s3_data=processed_train_data_s3_uri,
                                    distribution='ShardedByS3Key')
s3_input_validation_data = TrainingInput(s3_data=processed_validation_data_s3_uri,
                                         distribution='ShardedByS3Key')
s3_input_test_data = TrainingInput(s3_data=processed_test_data_s3_uri,
                                   distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-18-17-46-35-248/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
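
With `ShardedByS3Key`, each training instance receives only a subset of the S3 objects under the prefix, whereas `FullyReplicated` copies every object to every instance. The sketch below illustrates the idea with the three training files listed earlier. The round-robin assignment and the helper `shard_keys` are assumptions made for illustration; the service decides the actual assignment.

```python
def shard_keys(keys, instance_count):
    """Assign S3 keys to instances round-robin (illustrative assumption only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


train_files = [
    'part-algo-1-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord',
    'part-algo-1-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord',
    'part-algo-2-amazon_reviews_us_Digital_Software_v1_00.tfrecord',
]

one_instance = shard_keys(train_files, instance_count=1)
two_instances = shard_keys(train_files, instance_count=2)
print([len(shard) for shard in one_instance])
print([len(shard) for shard in two_instances])
```

With a single instance, as configured in this notebook, that instance sees all of the files, so the sharding only matters once the instance count is increased.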


# Set Up Hyper-Parameters for the Classification Layer

In [34]:
print(max_seq_length)

64


In [35]:
epochs=3
learning_rate=0.00001
epsilon=0.00000001
train_batch_size=128
validation_batch_size=128
test_batch_size=128
train_steps_per_epoch=100
validation_steps=100
test_steps=100
train_instance_count=1
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024
use_xla=True
use_amp=True
freeze_bert_layer=False
enable_sagemaker_debugger=True
enable_checkpointing=False
enable_tensorboard=False
input_mode='File'
run_validation=True
run_test=True
run_sample_predictions=True
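
Because `train_steps_per_epoch` caps the number of batches per epoch, these settings bound how many training examples the job processes. A short worked calculation, assuming each step consumes exactly one batch of `train_batch_size` examples on the single training instance:

```python
epochs = 3
train_batch_size = 128
train_steps_per_epoch = 100

examples_per_epoch = train_batch_size * train_steps_per_epoch
total_examples = examples_per_epoch * epochs

print(examples_per_epoch)  # 12800
print(total_examples)      # 38400
```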

# Set Up Metrics to Track Model Performance

These sample log lines...
```
45/50 [=====>..] - ETA: 3s - loss: 0.425 - accuracy: 0.881
50/50 [=======>] - ETA: 0s - val_loss: 0.407 - val_accuracy: 0.885
```
...will produce the following 4 metrics in CloudWatch, named according to the metric definitions below:

`train:loss` = 0.425

`train:accuracy` = 0.881

`validation:loss` = 0.407

`validation:accuracy` = 0.885

<img src="img/cloudwatch_train_accuracy.png" width="50%" align="left">

<img src="img/cloudwatch_train_loss.png" width="50%" align="left">

In [36]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
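
To see what these patterns capture, the sketch below applies them to the two sample log lines with Python's `re.search`. This is a stand-in for illustration; how SageMaker applies the expressions internally may differ. The variable `strict_train_loss` is a hypothetical stricter variant, and whether the SageMaker regex engine accepts a lookbehind is an assumption to verify.

```python
import re

sample_lines = [
    '45/50 [=====>..] - ETA: 3s - loss: 0.425 - accuracy: 0.881',
    '50/50 [=======>] - ETA: 0s - val_loss: 0.407 - val_accuracy: 0.885',
]

# Same expressions as in metrics_definitions, written as raw strings.
patterns = {
    'train:loss': r'loss: ([0-9\.]+)',
    'train:accuracy': r'accuracy: ([0-9\.]+)',
    'validation:loss': r'val_loss: ([0-9\.]+)',
    'validation:accuracy': r'val_accuracy: ([0-9\.]+)',
}

for line in sample_lines:
    for name, pattern in patterns.items():
        match = re.search(pattern, line)
        if match:
            print(name, '=', match.group(1))

# Hypothetical stricter variant: the negative lookbehind stops the training
# pattern from matching inside 'val_loss'.
strict_train_loss = r'(?<!val_)loss: ([0-9\.]+)'
print(re.search(strict_train_loss, sample_lines[1]))
```

Under `re.search`, the unanchored `loss:` and `accuracy:` patterns also match inside `val_loss:` and `val_accuracy:`, so the second sample line yields values for the training metrics as well. The lookbehind variant returns `None` for that line.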

# Set Up SageMaker Debugger
Define Debugger Rules as described here:  https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html

In [37]:
# from sagemaker.debugger import Rule
# from sagemaker.debugger import rule_configs
# from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

# rules=[
#         Rule.sagemaker(
#             rule_configs.loss_not_decreasing(),
#             rule_parameters={
#                 'collection_names': 'losses,metrics',
#                 'use_losses_collection': 'true',
#                 'num_steps': '10',
#                 'diff_percent': '50'
#             },
#             collections_to_save=[
#                 CollectionConfig(name='losses',
#                                  parameters={
#                                      'save_interval': '10',
#                                  }),
#                 CollectionConfig(name='metrics',
#                                  parameters={
#                                      'save_interval': '10',
#                                  })
#             ]
#         ),
#         Rule.sagemaker(
#             rule_configs.overtraining(),
#             rule_parameters={
#                 'collection_names': 'losses,metrics',
#                 'patience_train': '10',
#                 'patience_validation': '10',
#                 'delta': '0.5'
#             },
#             collections_to_save=[
#                 CollectionConfig(name='losses',
#                                  parameters={
#                                      'save_interval': '10',
#                                  }),
#                 CollectionConfig(name='metrics',
#                                  parameters={
#                                      'save_interval': '10',
#                                  })
#             ]
#         )
#     ]

hook_config = DebuggerHookConfig(
    hook_parameters={
        'save_interval': '10', # number of steps
        'export_tensorboard': 'true',
        'tensorboard_dir': 'hook_tensorboard/',
    })

## Configure Debugger Rules
We specify the following rules:

* loss_not_decreasing: checks whether the loss is decreasing and triggers if the loss has not decreased by a certain percentage in the last few iterations
* overtraining: checks whether the model is overtraining, that is, whether the validation loss stops improving while training continues
* ProfilerReport: runs the entire set of performance rules (including LowGPUUtilization, which checks whether the GPU is under-utilized) and creates a final output report with further insights and recommendations.

In [38]:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[ 
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overtraining()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

## Specify a Debugger profiler configuration

The following configuration will capture system metrics every 500 milliseconds. The system metrics include utilization per CPU and GPU, memory utilization per CPU and GPU, as well as I/O and network.

Debugger will capture detailed profiling information for 10 steps, starting at step 5. This information includes Horovod metrics, data loading, preprocessing, and operators running on CPU and GPU.

In [39]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(local_path="/opt/ml/output/profiler/", start_step=5, num_steps=10)
)
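
The numbers in this configuration translate into a sampling rate and a profiling window. The arithmetic below is a sketch that assumes `start_step` is the first profiled step and that exactly `num_steps` consecutive steps are profiled:

```python
system_monitor_interval_millis = 500
start_step = 5
num_steps = 10

# Steps covered by detailed framework profiling (assumed to be consecutive).
profiled_steps = list(range(start_step, start_step + num_steps))

# Number of system-metric samples collected per minute.
samples_per_minute = 60_000 // system_monitor_interval_millis

print(profiled_steps[0], profiled_steps[-1])  # 5 14
print(samples_per_minute)                     # 120
```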

# Specify Checkpoint S3 Location
This is used for Spot Instance training.  If an instance is replaced, the new instance resumes training from the latest checkpoint.

In [40]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-231218423789/checkpoints/f4f53595-bbfb-4ad7-bf45-c1ea409d0387/


# Set Up Our BERT + TensorFlow Script to Run on SageMaker
Prepare our TensorFlow model to run on the managed SageMaker service.

In [41]:
!pygmentize src/tf_bert_reviews.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '

In [43]:
tf_image = sagemaker.image_uris.retrieve('tensorflow', region=region, version='2.3.1', py_version='py37', image_scope='training', instance_type='ml.p3.2xlarge')
print(tf_image)

763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37


In [44]:
print(role)

arn:aws:iam::231218423789:role/TeamRole


In [45]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='tf_bert_reviews.py',
                       source_dir='src-tf231-profiler',
                       role=role,
                       image_uri=tf_image,
                       instance_count=train_instance_count,
                       instance_type=train_instance_type,
                       volume_size=train_volume_size,
#                        use_spot_instances=True,
#                        max_wait=7200, # Seconds to wait for spot instances to become available
                       checkpoint_s3_uri=checkpoint_s3_uri,
#                       py_version='py37',
#                       framework_version='2.3.1',
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_sagemaker_debugger': enable_sagemaker_debugger,
                                        'enable_checkpointing': enable_checkpointing,
                                        'enable_tensorboard': enable_tensorboard,                                        
                                        'run_validation': run_validation,
                                        'run_test': run_test,
                                        'run_sample_predictions': run_sample_predictions},
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions,
                       rules=rules,
                       debugger_hook_config=hook_config, 
                       profiler_config=profiler_config
#                       max_run=7200, # number of seconds
                      )
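
In script mode, the entries of the `hyperparameters` dictionary reach the entry-point script as command-line arguments, and every value arrives as a string. The sketch below shows one way a training script could parse a few of them with `argparse`. It is not the contents of `tf_bert_reviews.py`; the helpers `str_to_bool` and `parse_hyperparameters` and the default values are hypothetical.

```python
import argparse


def str_to_bool(value):
    """Interpret 'true'/'false' in any letter case as a boolean."""
    normalized = str(value).strip().lower()
    if normalized in ('true', '1'):
        return True
    if normalized in ('false', '0'):
        return False
    raise argparse.ArgumentTypeError('expected a boolean, got {!r}'.format(value))


def parse_hyperparameters(argv):
    """Parse a subset of the hyperparameters from a list of CLI tokens."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=0.00001)
    parser.add_argument('--use_xla', type=str_to_bool, default=False)
    parser.add_argument('--freeze_bert_layer', type=str_to_bool, default=False)
    # Ignore arguments that this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


example_args = parse_hyperparameters(
    ['--epochs', '3', '--learning_rate', '1e-05', '--use_xla', 'True',
     '--freeze_bert_layer', 'false', '--max_seq_length', '64'])
print(example_args.epochs, example_args.learning_rate,
      example_args.use_xla, example_args.freeze_bert_layer)
```

Parsing booleans explicitly matters because a plain `type=bool` would treat any non-empty string, including `'False'`, as `True`.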

# Create the `Experiment Config`

In [46]:
experiment_config = {
    'ExperimentName': experiment_name,
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'train'
}

In [47]:
print(experiment_name)

Amazon-Customer-Reviews-BERT-Experiment-1608314434


In [48]:
%store experiment_name

Stored 'experiment_name' (str)


In [49]:
print(trial_name)

trial-1608314434


In [50]:
%store trial_name

Stored 'trial_name' (str)


# Train the Model on SageMaker

In [51]:
estimator.fit(inputs={'train': s3_input_train_data, 
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
              },              
              experiment_config=experiment_config,                   
              wait=False)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: tensorflow-training-2020-12-18-18-01-49-949


In [52]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

Training Job Name:  tensorflow-training-2020-12-18-18-01-49-949


In [53]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [54]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [55]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))


In [56]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Checkpoint Data</a> After The Training Job Has Completed</b>'.format(bucket, checkpoint_s3_prefix, region)))


In [57]:
%%time

estimator.latest_training_job.wait(logs=False)


2020-12-18 18:01:54 Starting - Launching requested ML instances.............
2020-12-18 18:03:08 Starting - Preparing the instances for training..................
2020-12-18 18:04:44 Downloading - Downloading input data.....
2020-12-18 18:05:17 Training - Downloading the training image.....
2020-12-18 18:05:48 Training - Training image download completed. Training in progress..................................................................................
2020-12-18 18:12:38 Uploading - Uploading generated training model...............
2020-12-18 18:13:57 Completed - Training job completed
CPU times: user 576 ms, sys: 43.6 ms, total: 619 ms
Wall time: 11min 59s
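
The status lines above show where the time goes. As a small worked example, the sketch below parses the timestamps copied from that output and computes the duration of each phase; the status messages are shortened copies of the logged text.

```python
from datetime import datetime

status_log = [
    ('2020-12-18 18:01:54', 'Starting - Launching requested ML instances'),
    ('2020-12-18 18:03:08', 'Starting - Preparing the instances for training'),
    ('2020-12-18 18:04:44', 'Downloading - Downloading input data'),
    ('2020-12-18 18:05:17', 'Training - Downloading the training image'),
    ('2020-12-18 18:05:48', 'Training - Training in progress'),
    ('2020-12-18 18:12:38', 'Uploading - Uploading generated training model'),
    ('2020-12-18 18:13:57', 'Completed - Training job completed'),
]

timestamps = [datetime.strptime(ts, '%Y-%m-%d %H:%M:%S') for ts, _ in status_log]

# Duration of each phase = next timestamp minus this phase's timestamp.
phase_seconds = [int((end - start).total_seconds())
                 for start, end in zip(timestamps, timestamps[1:])]

for (_, message), seconds in zip(status_log, phase_seconds):
    print('{:>4d}s  {}'.format(seconds, message))

total_seconds = int((timestamps[-1] - timestamps[0]).total_seconds())
print('total: {}s'.format(total_seconds))
```

For this run, the "Training in progress" phase accounts for 410 of the 723 seconds; the rest is instance provisioning, downloads, and the model upload.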


In [59]:
# Reuse the SageMaker client and job name defined earlier instead of hard-coding them
sm.describe_training_job(TrainingJobName=training_job_name)

{'TrainingJobName': 'tensorflow-training-2020-12-18-18-01-49-949',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:231218423789:training-job/tensorflow-training-2020-12-18-18-01-49-949',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'enable_checkpointing': 'false',
  'enable_sagemaker_debugger': 'true',
  'enable_tensorboard': 'false',
  'epochs': '3',
  'epsilon': '1e-08',
  'freeze_bert_layer': 'false',
  'learning_rate': '1e-05',
  'max_seq_length': '64',
  'model_dir': '"s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/model"',
  'run_sample_predictions': 'true',
  'run_test': 'true',
  'run_validation': 'true',
  'sagemaker_container_log_level': '20',
  'sagemaker_job_name': '"tensorflow-training-2020-12-18-18-01-49-949"',
  'sagemaker_program': '"tf_bert_reviews

# Wait Until the ^^ Training Job ^^ Completes Above!

# [INFO] _Feel free to continue to the next workshop section while this notebook is running._

## Analyze Profiling Data

Copy outputs of the following cell (`training_job_name` and `region`) to run the analysis notebooks `profiling_generic_dashboard.ipynb`, `analyze_performance_bottlenecks.ipynb`, and `profiling_interactive_analysis.ipynb`.

In [63]:
training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")

Training jobname: tensorflow-training-2020-12-18-18-01-49-949
Region: us-east-1


While the training is still in progress, you can visualize the performance data in SageMaker Studio or in the notebook.
Debugger provides utilities to plot system metrics in the form of timeline charts or heatmaps. Check out the notebook 
[profiling_interactive_analysis.ipynb](analysis_tools/profiling_interactive_analysis.ipynb) for more details. In the following code cell, we plot the total CPU and GPU utilization as time-series charts. To visualize other metrics such as I/O, memory, or network, extend the lists passed to `select_dimensions` and `select_events`.

In [64]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

[2020-12-18 18:19:18.734 ip-172-16-176-76:18469 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None


ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-231218423789/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 5, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler/', 'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 5, "NumSteps": 10, }'}}
s3 path:s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/profiler-output


Profiler data from system is available


In [65]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts  = TimelineCharts(system_metrics_reader, 
                                       framework_metrics_reader=None,
                                       select_dimensions=["CPU", "GPU"],
                                       select_events=["total"])

[2020-12-18 18:19:30.724 ip-172-16-176-76:18469 INFO metrics_reader_base.py:134] Getting 9 event files
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'CPUUtilization-nodeid:algo-1', 'GPUUtilization-nodeid:algo-1', 'GPUMemoryUtilization-nodeid:algo-1'}


## Download Debugger Profiling Report and Notebook

The ProfilerReport rule will create an HTML report, `profiler-report.html`, with a summary of the built-in rules and recommendations for next steps. You can find this report in your S3 bucket.

In [None]:
#s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/rule-output/ProfilerReport/profiler-output/profiler-report.html

In [69]:
profiler_report = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output/ProfilerReport/profiler-output/profiler-report.html"
print(profiler_report)

s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/rule-output/ProfilerReport/profiler-output/profiler-report.html


In [70]:
!aws s3 cp $profiler_report .

download: s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/rule-output/ProfilerReport/profiler-output/profiler-report.html to ./profiler-report.html


In [73]:
profiler_notebook = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb"
print(profiler_notebook)

s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb


In [74]:
!aws s3 cp $profiler_notebook .

download: s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to ./profiler-report.ipynb


## Download Trained Model

In [60]:
!aws s3 cp s3://$bucket/$training_job_name/output/model.tar.gz ./model.tar.gz

download: s3://sagemaker-us-east-1-231218423789/tensorflow-training-2020-12-18-18-01-49-949/output/model.tar.gz to ./model.tar.gz


In [61]:
!mkdir -p ./model/
!tar -xvzf ./model.tar.gz -C ./model/

metrics/
metrics/confusion_matrix.png
tensorflow/
tensorflow/saved_model/
tensorflow/saved_model/0/
tensorflow/saved_model/0/variables/
tensorflow/saved_model/0/variables/variables.index
tensorflow/saved_model/0/variables/variables.data-00000-of-00001
tensorflow/saved_model/0/saved_model.pb
tensorflow/saved_model/0/assets/
transformers/
transformers/fine-tuned/
transformers/fine-tuned/tf_model.h5
transformers/fine-tuned/config.json
code/
code/inference.py
tensorboard/


In [62]:
!saved_model_cli show --all --dir ./model/tensorflow/saved_model/0/

2020-12-18 18:18:04.676279: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/efa/lib:/opt/amazon/efa/lib:/opt/amazon/efa/lib64:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:
2020-12-18 18:18:04.676386: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/ef

# Show the Experiment Tracking Lineage

In [None]:
from sagemaker.analytics import ExperimentAnalytics

lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    metric_names=['validation:accuracy'],
    sort_by="CreationTime",
    sort_order="Ascending",
)

lineage_df = lineage_table.dataframe()
lineage_df.shape

In [None]:
lineage_df

In [None]:
sm.describe_trial_component(TrialComponentName=lineage_df.TrialComponentName[0])

# Analyze Debugger Rules

In [None]:
estimator.latest_training_job.rule_job_summary()

In [None]:
training_job_debugger_artifacts_path = estimator.latest_job_debugger_artifacts_path()
print(training_job_debugger_artifacts_path)

# Pass Variables to the Next Notebook(s)

In [None]:
%store training_job_name

In [None]:
%store prepare_trial_component_name

In [None]:
%store training_job_debugger_artifacts_path

In [None]:
%store

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();