# Fine-Tune a BERT Model and Create a Text Classifier

Now, let’s fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

As mentioned earlier, BERT is built on the Transformer architecture, which relies on an attention mechanism. This is, not coincidentally, the name of the popular BERT Python library, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), which requires less memory and compute but maintains very good accuracy on our dataset.


## Feature Engineering

In the previous section, we performed the feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we saved the files in TFRecord format.

![BERT Training](img/bert_training.png)

![BERT Pre-Processing](img/prepare_dataset_bert.png)
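BERT-style models expect every input to have the same length, so each review is truncated or padded before it is written out. The following is a minimal sketch of that idea only. The toy vocabulary, the `encode` helper, and the special-token IDs are assumptions made for illustration; they are not the preprocessing code from the previous section.

```python
# Hypothetical toy vocabulary; a real BERT tokenizer ships its own vocabulary.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "great": 4, "product": 5, "terrible": 6}


def encode(text, max_seq_length):
    """Return (input_ids, input_mask), each exactly max_seq_length long."""
    tokens = text.lower().split()
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[:max_seq_length - 2]
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    ids.append(VOCAB["[SEP]"])
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    padding = max_seq_length - len(ids)
    ids += [VOCAB["[PAD]"]] * padding
    mask += [0] * padding
    return ids, mask


input_ids, input_mask = encode("Great product", max_seq_length=8)
print(input_ids)   # [2, 4, 5, 3, 0, 0, 0, 0]
print(input_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask lets the model ignore the padded positions, which is why both lists are stored together.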

In [189]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [190]:
!pip install -q smdebug==0.8.0
!pip install -q sagemaker-experiments==0.1.13

# Track the `Experiment`
We will track every step of this experiment throughout the `prepare`, `train`, and `deploy` phases.

# Concepts

**Experiment**: A collection of related Trials.  Add Trials to an Experiment that you wish to compare together.

**Trial**: A description of a multi-step machine learning workflow. Each step in the workflow is described by a Trial Component. Trial Components have no inherent relationship to one another, such as ordering.

**Trial Component**: A description of a single step in a machine learning workflow, for example, data cleaning, feature extraction, model training, or model evaluation.

**Tracker**: A logger of information about a single Trial Component.

![SageMaker Experiments](img/sagemaker-experiments.png)
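One way to picture how these concepts nest is as plain data classes. This is a sketch under our own assumptions: the class names ending in `Sketch` are hypothetical and are not part of the `smexperiments` library, which talks to the SageMaker service instead of holding objects in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrialComponentSketch:
    """One workflow step, e.g. 'prepare' or 'train'."""
    display_name: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class TrialSketch:
    """A multi-step workflow made of trial components."""
    trial_name: str
    components: List[TrialComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    """A collection of related trials that we want to compare."""
    experiment_name: str
    trials: List[TrialSketch] = field(default_factory=list)


sketch_prepare = TrialComponentSketch("prepare", {"max_seq_length": 128})
sketch_train = TrialComponentSketch("train", {"learning_rate": 0.00001})
sketch_trial = TrialSketch("trial-1", [sketch_prepare, sketch_train])
sketch_experiment = ExperimentSketch("reviews-bert", [sketch_trial])

for t in sketch_experiment.trials:
    print(t.trial_name, [c.display_name for c in t.components])
```

A Tracker would then be the object that writes parameters and artifacts into a single `TrialComponentSketch`.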


# Create the Experiment

In [191]:
import time
from smexperiments.experiment import Experiment

timestamp = '{}'.format(int(time.time()))

experiment = Experiment.create(
                experiment_name='Amazon-Customer-Reviews-BERT-Experiment-{}'.format(timestamp),
                description='Amazon Customer Reviews BERT Experiment', 
                sagemaker_boto_client=sm)

experiment_name = experiment.experiment_name
print('Experiment name: {}'.format(experiment_name))

Experiment name: Amazon-Customer-Reviews-BERT-Experiment-1592027317


# Create the Trial

In [192]:
import time
from smexperiments.trial import Trial

timestamp = '{}'.format(int(time.time()))

trial = Trial.create(trial_name='trial-{}'.format(timestamp),
                     experiment_name=experiment_name,
                     sagemaker_boto_client=sm)

trial_name = trial.trial_name
print('Trial name: {}'.format(trial_name))

Trial name: trial-1592027317


# Create the `prepare` Trial Component and Tracker
Note:  A Trial Component is actually created through a Tracker.  This is a bit confusing, we know.

In [193]:
from smexperiments.tracker import Tracker

tracker_prepare = Tracker.create(display_name='prepare', 
                                 sagemaker_boto_client=sm)

prepare_trial_component_name = tracker_prepare.trial_component.trial_component_name
print('Prepare trial component name {}'.format(prepare_trial_component_name))

Prepare trial component name TrialComponent-2020-06-13-054837-cpaj


# Attach the `prepare` Trial Component and Tracker as a Component to the Trial

In [194]:
trial.add_trial_component(tracker_prepare.trial_component)

# Log All Parameters Used During `prepare` Phase

In [195]:
%store -r s3_raw_input_data

In [196]:
print(s3_raw_input_data)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/


In [197]:
tracker_prepare.log_input(name='raw_data_s3_uri', 
                          media_type='s3/uri', 
                          value=s3_raw_input_data)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fe2decd2358>,trial_component_name='TrialComponent-2020-06-13-054837-cpaj',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:835319576252:experiment-trial-component/trialcomponent-2020-06-13-054837-cpaj',response_metadata={'RequestId': 'e0e3301c-f2e1-4131-8eb2-53cff1accfe2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e0e3301c-f2e1-4131-8eb2-53cff1accfe2', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Sat, 13 Jun 2020 05:48:36 GMT'}, 'RetryAttempts': 0},parameters={},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/',media_type='s3/uri')},output_artifacts={})

In [198]:
%store -r train_split_percentage

In [199]:
print(train_split_percentage)

0.9


In [200]:
%store -r validation_split_percentage

In [201]:
print(validation_split_percentage)

0.05


In [202]:
%store -r test_split_percentage

In [203]:
print(test_split_percentage)

0.05


In [204]:
%store -r max_seq_length

In [205]:
print(max_seq_length)

128


In [206]:
tracker_prepare.log_parameters({
    'max_seq_length': max_seq_length,
    'train_split_percentage': train_split_percentage,
    'validation_split_percentage': validation_split_percentage,
    'test_split_percentage': test_split_percentage,
})

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fe2decd2358>,trial_component_name='TrialComponent-2020-06-13-054837-cpaj',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:835319576252:experiment-trial-component/trialcomponent-2020-06-13-054837-cpaj',response_metadata={'RequestId': '13d9a6a5-e0ad-4828-a0c9-1c0a6926b246', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '13d9a6a5-e0ad-4828-a0c9-1c0a6926b246', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Sat, 13 Jun 2020 05:48:37 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 128, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/',media_type='s3/uri')},output_artifacts={})
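Before logging split parameters like these, it can help to sanity-check that they describe a complete partition of the data. The sketch below is illustrative only: `split_sizes` is a hypothetical helper, and the row count of 1000 is made up rather than taken from the reviews dataset.

```python
import math


def split_sizes(num_rows, train_pct, validation_pct, test_pct):
    """Return (train, validation, test) row counts that sum to num_rows."""
    total = train_pct + validation_pct + test_pct
    if not math.isclose(total, 1.0):
        raise ValueError("split percentages sum to {}, expected 1.0".format(total))
    train_rows = round(num_rows * train_pct)
    validation_rows = round(num_rows * validation_pct)
    # Give the remainder to the test split so no row is lost to rounding.
    test_rows = num_rows - train_rows - validation_rows
    return train_rows, validation_rows, test_rows


# 1000 is a made-up row count, not the size of the reviews dataset.
print(split_sizes(1000, 0.9, 0.05, 0.05))  # (900, 50, 50)
```

Note that the logged "percentage" values are fractions between 0 and 1, so they are compared against 1.0 and not 100.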

In [207]:
%store -r processed_train_data_s3_uri

In [208]:
print(processed_train_data_s3_uri)

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-train


In [209]:
%store -r processed_validation_data_s3_uri

In [210]:
print(processed_validation_data_s3_uri)

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-validation


In [211]:
%store -r processed_test_data_s3_uri

In [212]:
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-test


In [213]:
tracker_prepare.log_output(name='train_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_train_data_s3_uri)

tracker_prepare.log_output(name='validation_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_validation_data_s3_uri)

tracker_prepare.log_output(name='test_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_test_data_s3_uri)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fe2decd2358>,trial_component_name='TrialComponent-2020-06-13-054837-cpaj',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:835319576252:experiment-trial-component/trialcomponent-2020-06-13-054837-cpaj',response_metadata={'RequestId': 'fb924ddf-2cb2-448a-be3e-95d922e9deac', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'fb924ddf-2cb2-448a-be3e-95d922e9deac', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Sat, 13 Jun 2020 05:48:37 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 128, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/',media_type='s3/uri')},output_artifacts={'train_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-83531957625

# Specify the Dataset in S3
We are using the train, validation, and test splits created in the previous section.

In [214]:
print(processed_train_data_s3_uri)

!aws s3 ls $processed_train_data_s3_uri/

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-train
2020-06-04 01:16:04    2841914 part-algo-1-amazon_reviews_us_Apparel_v1_00.tfrecord
2020-06-04 01:16:04     816731 part-algo-1-amazon_reviews_us_Digital_Music_Purchase_v1_00.tfrecord
2020-06-04 01:16:04    1291015 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.tfrecord
2020-06-04 01:16:04     450177 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-06-04 01:16:04    2382227 part-algo-1-amazon_reviews_us_Toys_v1_00.tfrecord
2020-06-04 01:16:24    2592266 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-06-04 01:16:24     361320 part-algo-10-amazon_reviews_us_Home_Entertainment_v1_00.tfrecord
2020-06-04 01:16:24    2465781 part-algo-10-amazon_reviews_us_Music_v1_00.tfrecord
2020-06-04 01:16:24     857614 part-algo-10-amazon_reviews_us_Tools_v1_00.tfrecord
2020-06-04 01:16:34    1701886 part-algo-2-amazon_reviews_us_Automotive_v1_00.tfre

In [215]:
print(processed_validation_data_s3_uri)

!aws s3 ls $processed_validation_data_s3_uri/

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-validation
2020-06-04 01:16:04     158328 part-algo-1-amazon_reviews_us_Apparel_v1_00.tfrecord
2020-06-04 01:16:04      45879 part-algo-1-amazon_reviews_us_Digital_Music_Purchase_v1_00.tfrecord
2020-06-04 01:16:04      72380 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.tfrecord
2020-06-04 01:16:04      25413 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-06-04 01:16:04     133330 part-algo-1-amazon_reviews_us_Toys_v1_00.tfrecord
2020-06-04 01:16:24     143760 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-06-04 01:16:24      20252 part-algo-10-amazon_reviews_us_Home_Entertainment_v1_00.tfrecord
2020-06-04 01:16:24     137101 part-algo-10-amazon_reviews_us_Music_v1_00.tfrecord
2020-06-04 01:16:24      48587 part-algo-10-amazon_reviews_us_Tools_v1_00.tfrecord
2020-06-04 01:16:34      93564 part-algo-2-amazon_reviews_us_Automotive_v1_00

In [216]:
print(processed_test_data_s3_uri)

!aws s3 ls $processed_test_data_s3_uri/

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-test
2020-06-04 01:16:04     158099 part-algo-1-amazon_reviews_us_Apparel_v1_00.tfrecord
2020-06-04 01:16:04      45984 part-algo-1-amazon_reviews_us_Digital_Music_Purchase_v1_00.tfrecord
2020-06-04 01:16:04      71297 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.tfrecord
2020-06-04 01:16:04      25022 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-06-04 01:16:04     133335 part-algo-1-amazon_reviews_us_Toys_v1_00.tfrecord
2020-06-04 01:16:24     145407 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-06-04 01:16:24      20527 part-algo-10-amazon_reviews_us_Home_Entertainment_v1_00.tfrecord
2020-06-04 01:16:24     137989 part-algo-10-amazon_reviews_us_Music_v1_00.tfrecord
2020-06-04 01:16:24      48058 part-algo-10-amazon_reviews_us_Tools_v1_00.tfrecord
2020-06-04 01:16:34      94658 part-algo-2-amazon_reviews_us_Automotive_v1_00.tfrec

# Specify S3 Distribution Strategy

In [217]:
s3_input_train_data = sagemaker.s3_input(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.s3_input(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.s3_input(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-06-04-01-07-43-728/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
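With `ShardedByS3Key`, each training instance receives only a subset of the S3 objects under the prefix, not a full copy. The sketch below illustrates the idea with a simple round-robin assignment. That assignment order is an assumption for illustration (SageMaker decides the real one), and the file names are shortened, hypothetical versions of the listing above.

```python
def shard_by_key(s3_keys, instance_count):
    """Assign each key to exactly one instance (round-robin is an assumption)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(s3_keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical file names, shortened from the listing style above.
keys = ["part-algo-1-Apparel.tfrecord",
        "part-algo-1-Toys.tfrecord",
        "part-algo-10-Music.tfrecord",
        "part-algo-2-Automotive.tfrecord"]

for instance, shard in enumerate(shard_by_key(keys, instance_count=2)):
    print(instance, shard)

# With more instances than files, at least one instance gets no data at all.
empty_shards = [s for s in shard_by_key(keys, instance_count=5) if not s]
print(len(empty_shards))  # 1
```

The last two lines show why the number of input files should be at least the number of training instances when this strategy is used.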


# Show TensorFlow Training Code

In [218]:
!pygmentize src/tf_bert_reviews.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '

            print('validation_data_filenames {}'.format(validation_data_filenames))
            validation_dataset = file_based_input_dataset_builder(
                channel='validation',
                input_filenames=validation_data_filenames,
                pipe_mode=pipe_mode,
                is_training=False,
                drop_remainder=False,
                batch_size=validation_batch_size,
                epochs=epochs,
                steps_per_epoch=validation_steps,
                max_seq_length=max_seq_length).map(select_data_and_label_from_record)

            print('Starting Training and Validation...')
            validation_dataset = validation_dataset.take(validation_steps)
            train_and_validation_history = model.fit(train

# Set Up Hyperparameters for Classification Layer

In [219]:
print(max_seq_length)

128


In [220]:
epochs=1
learning_rate=0.00001
epsilon=0.00000001
train_batch_size=128
validation_batch_size=128
test_batch_size=128
train_steps_per_epoch=50
validation_steps=50
test_steps=50
train_instance_count=1
train_instance_type='ml.c5.9xlarge'
train_volume_size=1024
use_xla=True
use_amp=True
freeze_bert_layer=True
enable_sagemaker_debugger=True
enable_checkpointing=False
enable_tensorboard=False
input_mode='Pipe'
run_validation=True
run_test=True
run_sample_predictions=True
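These settings cap how much data one run actually sees. As a rough sketch, assuming each step consumes exactly one batch on a single instance (the hypothetical `examples_seen` helper below is ours, not part of the training script), the arithmetic is:

```python
def examples_seen(batch_size, steps_per_epoch, epochs):
    """Number of examples consumed, assuming one batch per step."""
    return batch_size * steps_per_epoch * epochs


# Values copied from the hyperparameter cell above.
print(examples_seen(batch_size=128, steps_per_epoch=50, epochs=1))  # 6400
```

So with these values the run is a quick smoke test over 6,400 training examples, not a full pass over the dataset.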

# Set Up Metrics to Track Model Performance

In [221]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
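To see what these regular expressions capture, we can apply them to a sample log line with Python's `re` module. The log line below is made up in the style that Keras prints at the end of an epoch; it is not output from this training job.

```python
import re

# Same patterns as the metric definitions above, keyed by metric name.
patterns = {
    'train:loss': 'loss: ([0-9\\.]+)',
    'train:accuracy': 'accuracy: ([0-9\\.]+)',
    'validation:loss': 'val_loss: ([0-9\\.]+)',
    'validation:accuracy': 'val_accuracy: ([0-9\\.]+)',
}

# Made-up log line in the style Keras prints at the end of an epoch.
log_line = ('50/50 - 95s - loss: 1.4721 - accuracy: 0.3812 '
            '- val_loss: 1.3990 - val_accuracy: 0.4105')

matches = {name: re.findall(regex, log_line) for name, regex in patterns.items()}
for name, values in matches.items():
    print(name, values)
```

In this sketch the unanchored `loss:` and `accuracy:` patterns also match inside `val_loss:` and `val_accuracy:`, so `re.findall` returns two values for the training metrics. If that overlap matters, a stricter pattern may be worth considering.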

# Set Up SageMaker Debugger
Define the Debugger rules.

In [222]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'use_losses_collection': 'true',
                'num_steps': '10',
                'diff_percent': '50'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        ),
        Rule.sagemaker(
            rule_configs.overtraining(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'patience_train': '10',
                'patience_validation': '10',
                'delta': '0.5'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        )
    ]

hook_config = DebuggerHookConfig(
    hook_parameters={
        'save_interval': '10', # number of steps
        'export_tensorboard': 'true',
        'tensorboard_dir': 'hook_tensorboard/',
    })
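To build intuition for the `num_steps` and `diff_percent` parameters, here is a deliberately simplified sketch of a "loss not decreasing" check. It reflects our reading of those two parameters only; it is not the SageMaker Debugger implementation, and the `loss_not_decreasing` function and sample loss curves are hypothetical.

```python
def loss_not_decreasing(losses, num_steps, diff_percent):
    """Return True if the loss failed to drop by diff_percent over num_steps.

    losses: one loss value per training step, oldest first.
    """
    if len(losses) <= num_steps:
        return False  # not enough history to judge yet
    previous = losses[-1 - num_steps]
    current = losses[-1]
    if previous == 0:
        return current >= previous
    decrease_percent = 100.0 * (previous - current) / abs(previous)
    return decrease_percent < diff_percent


# Hypothetical loss curves: one barely improving, one improving quickly.
slow_losses = [2.0 - 0.01 * step for step in range(11)]
fast_losses = [2.0 * (0.9 ** step) for step in range(11)]

print(loss_not_decreasing(slow_losses, num_steps=10, diff_percent=50))  # True
print(loss_not_decreasing(fast_losses, num_steps=10, diff_percent=50))  # False
```

Under this reading, a `diff_percent` of 50 is a strict threshold: the loss must halve within the window to avoid triggering the rule.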

# Specify Checkpoint S3 Location
This is used for Spot Instance training. If a node is replaced, the new node resumes training from the latest checkpoint.

In [223]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-835319576252/checkpoints/0afec2b4-b23a-45b4-87d2-bd296084b284/
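Resuming from a checkpoint means the training script has to find the newest checkpoint file when it starts. The following is a minimal sketch of that lookup on a local directory. The `checkpoint-<epoch>.h5` naming scheme and the `latest_checkpoint` helper are assumptions for illustration and may not match what `tf_bert_reviews.py` does.

```python
import os
import re
import tempfile


def latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) of the newest checkpoint, or (None, 0) if none exist."""
    pattern = re.compile(r'^checkpoint-(\d+)\.h5$')  # hypothetical naming scheme
    found = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    if not found:
        return None, 0
    epoch, name = max(found)
    return os.path.join(checkpoint_dir, name), epoch


# Simulate a local checkpoint directory left behind by an interrupted job.
with tempfile.TemporaryDirectory() as checkpoint_dir:
    for epoch in (1, 2, 10):
        open(os.path.join(checkpoint_dir, 'checkpoint-{}.h5'.format(epoch)), 'w').close()
    path, initial_epoch = latest_checkpoint(checkpoint_dir)
    print(os.path.basename(path), initial_epoch)  # checkpoint-10.h5 10
```

Comparing the parsed epoch numbers, not the file names as strings, keeps `checkpoint-10.h5` ahead of `checkpoint-2.h5`.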


# Set Up Our BERT + TensorFlow Script to Run on SageMaker
Prepare our TensorFlow model to run on the managed SageMaker service.

In [224]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='tf_bert_reviews.py', 
                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=role,
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
#                        train_use_spot_instances=True,
#                        train_max_wait=7200, # Seconds to wait for spot instances to become available
                       checkpoint_s3_uri=checkpoint_s3_uri,
                       py_version='py3',
                       framework_version='2.1.0',
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_sagemaker_debugger': enable_sagemaker_debugger,
                                        'enable_checkpointing': enable_checkpointing,
                                        'enable_tensorboard': enable_tensorboard,                                        
                                        'run_validation': run_validation,
                                        'run_test': run_test,
                                        'run_sample_predictions': run_sample_predictions},
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions,
                       rules=rules,
                       debugger_hook_config=hook_config,                       
#                       train_max_run=7200, # max 2 hours * 60 minutes per hour * 60 seconds per minute
                      )

# Create the Experiment Config

In [225]:
experiment_config = {
    'ExperimentName': experiment_name,
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'train'
}

# Train the Model on SageMaker

In [226]:
estimator.fit(inputs={'train': s3_input_train_data, 
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
              },              
              experiment_config=experiment_config,                   
              wait=False)

INFO:sagemaker:Creating training-job with name: tensorflow-training-2020-06-13-05-48-40-400


In [227]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

Training Job Name:  tensorflow-training-2020-06-13-05-48-40-400


In [228]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [229]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [230]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))


In [231]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Checkpoint Data</a> After The Training Job Has Completed</b>'.format(bucket, checkpoint_s3_prefix, region)))


# Wait Until the ^^ Training Job ^^ Completes Above!

In [232]:
estimator.latest_training_job.wait(logs=False)


2020-06-13 05:49:59 Starting - Starting the training job
2020-06-13 05:50:01 Starting - Launching requested ML instances..............
2020-06-13 05:51:15 Starting - Preparing the instances for training.......
2020-06-13 05:51:58 Downloading - Downloading input data....
2020-06-13 05:52:25 Training - Downloading the training image..
2020-06-13 05:52:39 Training - Training image download completed. Training in progress.......................
2020-06-13 05:54:31 Uploading - Uploading generated training model
2020-06-13 05:54:39 Failed - Training job failed


UnexpectedStatusException: Error for Training job tensorflow-training-2020-06-13-05-48-40-400: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/bin/python3 tf_bert_reviews.py --enable_checkpointing False --enable_sagemaker_debugger True --enable_tensorboard False --epochs 1 --epsilon 1e-08 --freeze_bert_layer True --learning_rate 1e-05 --max_seq_length 128 --model_dir s3://sagemaker-us-east-1-835319576252/tensorflow-training-2020-06-13-05-48-40-400/model --run_sample_predictions True --run_test True --run_validation True --test_batch_size 128 --test_steps 50 --train_batch_size 128 --train_steps_per_epoch 50 --use_amp True --use_xla True --validation_batch_size 128 --validation_steps 50"
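When a job ends in `Failed`, the status and failure reason can also be read programmatically from the `describe_training_job` response. The sketch below wraps that call in a small helper. `summarize_training_job` and the `FakeSageMakerClient` stand-in are hypothetical names used so the example runs without AWS credentials, and the canned response is made up.

```python
def summarize_training_job(sm_client, training_job_name):
    """Return (status, failure_reason) for a SageMaker training job."""
    response = sm_client.describe_training_job(TrainingJobName=training_job_name)
    status = response['TrainingJobStatus']
    # FailureReason is only present when the job has failed.
    return status, response.get('FailureReason')


class FakeSageMakerClient:
    """Stand-in client so the sketch runs without AWS credentials."""

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobName': TrainingJobName,
                'TrainingJobStatus': 'Failed',
                'FailureReason': 'AlgorithmError: ExecuteUserScriptError'}


status, reason = summarize_training_job(FakeSageMakerClient(), 'example-job')
print(status, reason)
```

In this notebook, the equivalent call would pass the real client and job name, as in `summarize_training_job(sm, training_job_name)`. The full stack trace is usually easier to find in the CloudWatch logs linked above.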

# Show the Experiment Tracking Lineage

In [None]:
from sagemaker.analytics import ExperimentAnalytics

lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    metric_names=['validation:accuracy'],
    sort_by="CreationTime",
    sort_order="Ascending",
)

lineage_df = lineage_table.dataframe()
lineage_df.shape

In [None]:
lineage_df

In [None]:
sm.describe_trial_component(TrialComponentName=lineage_df.TrialComponentName[0])

# Analyze Debugger Rules

In [None]:
estimator.latest_training_job.rule_job_summary()

In [None]:
training_job_debugger_artifacts_path = estimator.latest_job_debugger_artifacts_path()
print(training_job_debugger_artifacts_path)

# Pass Variables to the Next Notebook(s)

In [None]:
print(training_job_name)

In [None]:
%store training_job_name

In [None]:
print(experiment_name)

In [None]:
%store experiment_name

In [None]:
print(trial_name)

In [None]:
%store trial_name

In [None]:
print(prepare_trial_component_name)

In [None]:
%store prepare_trial_component_name

In [None]:
print(training_job_debugger_artifacts_path)

In [None]:
%store training_job_debugger_artifacts_path