# BERT Pre-Training, Feature Engineering, and Fine-Tuning with TensorFlow

In some cases, the entire BERT model needs to be **pre-trained** from scratch because the domain-specific dataset contains words and entities that were not in the original Wikipedia and Google Books training datasets used to create the BERT pre-trained models.  In these cases, the BERT neural network architecture is re-used, but the pre-trained model weights are thrown out and the BERT model is trained from scratch.  Fortunately, BERT has been pre-trained on a very large amount of data and therefore contains a huge vocabulary applicable to a large number of domains including ours.  We will safely re-use and build upon the existing pre-trained BERT model.

**Fine-tuning** re-uses the language understanding and semantics learned by the base BERT model (pre-trained on Wikipedia and Google Books) and adapts them to our domain-specific dataset.  Fine-tuning happens very quickly and requires a relatively small number of samples (i.e., reviews, in our case).  This translates to lower processing power and lower cost.


![BERT Training](img/bert_training.png)

# Feature Engineering

In the previous section, we performed the feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation and test files. To optimize for TensorFlow training, we saved the files in TFRecord format.

![BERT Pre-Processing](img/bert_preprocessing.png)
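To make the shape of these features concrete, here is a minimal sketch of how one review might become the `input_ids`, `input_mask` and `segment_ids` fields that the training script reads. The whitespace tokenizer, the tiny `VOCAB` and the shortened sequence length are assumptions for illustration only; the real features come from the pre-trained tokenizer and vocabulary.

```python
# Toy illustration only: the real pipeline uses the pre-trained tokenizer and vocabulary.
MAX_SEQ_LENGTH = 8  # the notebook uses 128; shortened here so the output is readable

# Hypothetical miniature vocabulary (not the real BERT vocabulary).
VOCAB = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, 'great': 4, 'game': 5}


def to_features(text, max_seq_length=MAX_SEQ_LENGTH):
    # Leave room for the [CLS] and [SEP] markers, then truncate.
    tokens = text.lower().split()[:max_seq_length - 2]
    tokens = ['[CLS]'] + tokens + ['[SEP]']

    # Map tokens to integer ids; unknown words fall back to [UNK].
    input_ids = [VOCAB.get(token, VOCAB['[UNK]']) for token in tokens]
    # The mask marks real tokens with 1 and padding with 0.
    input_mask = [1] * len(input_ids)

    # Pad both lists to the fixed sequence length.
    padding = max_seq_length - len(input_ids)
    input_ids += [VOCAB['[PAD]']] * padding
    input_mask += [0] * padding

    # Single-sentence input: every position belongs to segment 0.
    segment_ids = [0] * max_seq_length
    return {'input_ids': input_ids, 'input_mask': input_mask, 'segment_ids': segment_ids}


features = to_features('Great game')
print(features)
```

Every record has the same fixed length, which is why the sequence length used here must match the one used during training.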

# Fine-Tuning

Now, let’s fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

As mentioned earlier, BERT is built on the Transformer architecture, whose core is the attention mechanism. This is, not coincidentally, the name of the popular Python library, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf) which requires less memory and compute, but maintains very good accuracy on our dataset.


In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [2]:
!pip install -q smdebug==0.7.2
!pip install -q sagemaker-experiments==0.1.11

# Specify the Dataset in S3

## Option 1 - Upload the Features from `./data-tfrecord` to S3

In [3]:
#!aws s3 cp --recursive ./data-tfrecord/ s3://$bucket/data-tfrecord/output/

In [4]:
#scikit_processing_job_name = 'data-tfrecord'

In [5]:
#%store scikit_processing_job_name

## Option 2 - Use the Features in S3 from the Previous Section

In [6]:
%store -r scikit_processing_job_name

In [7]:
print(scikit_processing_job_name)

sagemaker-scikit-learn-2020-04-25-18-17-26-390


In [8]:
# scikit_processing_job_s3_output_prefix = 'data'
print('Previous Scikit Processing Job Name: {}'.format(scikit_processing_job_name))

Previous Scikit Processing Job Name: sagemaker-scikit-learn-2020-04-25-18-17-26-390


In [9]:
prefix_train = '{}/output/bert-train'.format(scikit_processing_job_name)
prefix_validation = '{}/output/bert-validation'.format(scikit_processing_job_name)
prefix_test = '{}/output/bert-test'.format(scikit_processing_job_name)

train_s3_uri = 's3://{}/{}'.format(bucket, prefix_train)
validation_s3_uri = 's3://{}/{}'.format(bucket, prefix_validation)
test_s3_uri = 's3://{}/{}'.format(bucket, prefix_test)

In [10]:
print(train_s3_uri)
!aws s3 ls $train_s3_uri/

s3://sagemaker-us-west-2-086401037028/sagemaker-scikit-learn-2020-04-25-18-17-26-390/output/bert-train
2020-04-25 18:23:34   50961824 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-04-25 18:24:24   71731312 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


# Specify S3 Distribution Strategy

In [11]:
s3_input_train_data = sagemaker.s3_input(s3_data=train_s3_uri, distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.s3_input(s3_data=validation_s3_uri, distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.s3_input(s3_data=test_s3_uri, distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-086401037028/sagemaker-scikit-learn-2020-04-25-18-17-26-390/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-086401037028/sagemaker-scikit-learn-2020-04-25-18-17-26-390/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-086401037028/sagemaker-scikit-learn-2020-04-25-18-17-26-390/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
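With `ShardedByS3Key`, each training instance receives a subset of the S3 objects rather than a full copy of the data. The sketch below simulates that split for the two training files listed earlier. The round-robin assignment in `shard_round_robin` is an assumption for illustration; the actual assignment is made by SageMaker.

```python
def shard_round_robin(s3_keys, instance_count):
    """Assign each key to exactly one instance (round-robin is an assumed policy)."""
    shards = {instance: [] for instance in range(instance_count)}
    for index, key in enumerate(sorted(s3_keys)):
        shards[index % instance_count].append(key)
    return shards


keys = [
    'part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord',
    'part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord',
]

two = shard_round_robin(keys, instance_count=2)
three = shard_round_robin(keys, instance_count=3)

# Instances that received no files would have nothing to train on.
starved = [instance for instance, assigned in three.items() if not assigned]
print(two)
print('Instances without data when using 3 instances:', starved)
```

This is why the number of input files should be at least as large as the number of training instances.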


# Show TensorFlow Training Code

In [12]:
!cat src/tf_bert_reviews.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'transformers==2.8.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker-tensorflow==2.1.0.1.0.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'smdebug==0.7.2'])
from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification
from transformers import TextClassificationPipeline
from transformers.configuration_distilbert import DistilBertConfig

CLASSES = [1, 2, 3, 4, 5]

def select_data_and_label_from_record(record):
    x = {
        'input_ids': record['input_ids'],
        'input_mask': record['input_mask'],
        'segment_ids': record['segment_ids']
 

# Setup Hyper-Parameters For Classification Layer

In [13]:
epochs=1
learning_rate=0.00001
epsilon=0.00000001
train_batch_size=128
validation_batch_size=128
test_batch_size=128
train_steps_per_epoch=1000
validation_steps=1000
test_steps=1000
train_instance_count=1
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024
use_xla=True
use_amp=True
max_seq_length=128                                # must match setting used when engineering features 
freeze_bert_layer=False
enable_sagemaker_debugger=True                    # Enable SM Debugger
input_mode='Pipe'                                 # 'File' or 'Pipe'; Pipe mode streams data from S3 as needed, similar to Linux pipes
run_validation=True
run_test=True
run_sample_predictions=True
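As a quick sanity check on these settings, the arithmetic below shows how many examples one epoch can consume at most and how many token positions a single batch holds. This is a sketch using the values above; it assumes every step processes a full batch.

```python
epochs = 1
train_batch_size = 128
train_steps_per_epoch = 1000
max_seq_length = 128

# Upper bound on examples seen per epoch, assuming every step gets a full batch.
examples_per_epoch = train_batch_size * train_steps_per_epoch
total_examples = examples_per_epoch * epochs

# Token positions (including padding) held in one batch.
token_positions_per_batch = train_batch_size * max_seq_length

print('Examples per epoch (at most):', examples_per_epoch)
print('Examples over all epochs (at most):', total_examples)
print('Token positions per batch:', token_positions_per_batch)
```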

# Setup SageMaker Experiment Tracking For Training Job

In [14]:
import time
timestamp = '{}'.format(int(time.time()))

from smexperiments.experiment import Experiment
experiment=Experiment.create(
    experiment_name='Train-Reviews-BERT-Experiment-{}'.format(timestamp),
    description='Train Reviews BERT', 
    sagemaker_boto_client=sm)

# Log The Training Hyper-Parameters

In [15]:
from smexperiments.tracker import Tracker
tracker_display_name='Train-Reviews-BERT-Experiment-{}'.format(timestamp)
print(tracker_display_name)

tracker = Tracker.create(display_name=tracker_display_name, sagemaker_boto_client=sm)
tracker.log_parameters({
    'epochs': epochs,
    'learning_rate': learning_rate,
    'epsilon': epsilon,
    'train_batch_size': train_batch_size,
    'validation_batch_size': validation_batch_size,
    'test_batch_size': test_batch_size,
    'train_steps_per_epoch': train_steps_per_epoch,
    'validation_steps': validation_steps,
    'test_steps': test_steps,
    'train_instance_count': train_instance_count,
    'train_instance_type': train_instance_type,
    'train_volume_size': train_volume_size,
    'use_xla': use_xla,
    'use_amp': use_amp,
    'max_seq_length': max_seq_length,
    'freeze_bert_layer': freeze_bert_layer,
    'enable_sagemaker_debugger': enable_sagemaker_debugger,
    'input_mode': input_mode, 
    'run_validation': run_validation,
    'run_test': run_test,
    'run_sample_predictions': run_sample_predictions,    
})


Train-Reviews-BERT-Experiment-1587839693


# Log the S3 Input Locations

In [16]:
tracker.log_input(name='reviews_dataset_train', media_type='s3/uri', value=train_s3_uri)
tracker.log_input(name='reviews_dataset_validation', media_type='s3/uri', value=validation_s3_uri)
tracker.log_input(name='reviews_dataset_test', media_type='s3/uri', value=test_s3_uri)

# Specify a Trial for this Experiment

In [17]:
from smexperiments.trial import Trial
trial_name='Train-Reviews-BERT-Trial-{}'.format(timestamp)
trial = Trial.create(trial_name=trial_name, 
                     experiment_name=experiment.experiment_name, 
                     sagemaker_boto_client=sm)
trial.add_trial_component(tracker.trial_component)
trial_component_display_name='Train-Reviews-BERT-Trial-{}'.format(timestamp)
experiment_config={'ExperimentName': experiment.experiment_name,
                   'TrialName': trial.trial_name,
                   'TrialComponentDisplayName': trial_component_display_name}

# Setup Metrics To Track Model Performance

In [18]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
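To see what these patterns capture, we can run them with Python's `re` module against a single log line. The line below is made up for illustration in the style of Keras progress output; it is not taken from the actual training logs. Note that the unanchored `loss:` and `accuracy:` patterns also match inside `val_loss:` and `val_accuracy:`, which is worth keeping in mind when reading the training metrics.

```python
import re

metrics_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# Hypothetical log line with invented values, for illustration only.
log_line = '1000/1000 - 250s - loss: 0.71 - accuracy: 0.72 - val_loss: 0.66 - val_accuracy: 0.74'

# Collect every match for each metric definition.
matches = {m['Name']: re.findall(m['Regex'], log_line) for m in metrics_definitions}
for name, values in matches.items():
    print(name, values)

# One possible stricter variant: require a space before the metric name,
# so that 'val_loss' no longer matches the training pattern.
strict_train_loss = re.findall(' loss: ([0-9\\.]+)', log_line)
print('strict train:loss', strict_train_loss)
```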

# Setup SageMaker Debugger
Define Debugger Rules

In [19]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'use_losses_collection': 'true',
                'num_steps': '10',
                'diff_percent': '50'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        ),
        Rule.sagemaker(
            rule_configs.overtraining(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'patience_train': '10',
                'patience_validation': '10',
                'delta': '0.5'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        )
    ]

hook_config = DebuggerHookConfig(
    hook_parameters={
        'save_interval': '10', # number of steps
        'export_tensorboard': 'true',
        'tensorboard_dir': 'hook_tensorboard/',
    })
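The `loss_not_decreasing` rule is configured with `num_steps` and `diff_percent`. The sketch below is a simplified illustration of that idea, not the actual SageMaker Debugger implementation: it flags the first point where the loss has dropped by less than `diff_percent` compared with the value `num_steps` entries earlier. The function name and the sample loss values are hypothetical.

```python
def first_step_loss_not_decreasing(losses, num_steps=10, diff_percent=50.0):
    """Return the first index where the loss fell by less than diff_percent
    relative to the value num_steps entries earlier, or None if it never did."""
    for step in range(num_steps, len(losses)):
        previous = losses[step - num_steps]
        current = losses[step]
        if previous <= 0:
            continue  # avoid dividing by zero on degenerate values
        decrease_percent = (previous - current) / previous * 100.0
        if decrease_percent < diff_percent:
            return step
    return None


# Invented loss curves for illustration.
steadily_falling = [1.0, 0.4, 0.15, 0.05]
plateau = [1.0, 0.4, 0.39]

print(first_step_loss_not_decreasing(steadily_falling, num_steps=1))
print(first_step_loss_not_decreasing(plateau, num_steps=1))
```

A `diff_percent` of 50 is a demanding threshold, so it is reasonable to expect this kind of rule to trigger once the loss curve flattens.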

# Setup Our BERT + TensorFlow Script to Run on SageMaker
Prepare our TensorFlow model to run on the managed SageMaker service

In [20]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='tf_bert_reviews.py', 
                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=role,
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
                       py_version='py3',
                       framework_version='2.1.0',
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_sagemaker_debugger': enable_sagemaker_debugger,
                                        'run_validation': run_validation,
                                        'run_test': run_test,
                                        'run_sample_predictions': run_sample_predictions},
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions,
                       rules=rules,
                       debugger_hook_config=hook_config,                       
                       train_max_run=7200 # max 2 hours * 60 minutes per hour * 60 seconds per minute
                      )
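The `hyperparameters` dictionary reaches the entry point script as command-line arguments, to the best of our understanding of SageMaker script mode. The sketch below shows one way a script could parse a few of them with `argparse`. The `parse_args` and `str_to_bool` helpers are hypothetical and are not taken from `tf_bert_reviews.py`, whose argument handling is not shown in the truncated listing above.

```python
import argparse


def str_to_bool(value):
    # argparse's type=bool would treat the string 'False' as True,
    # so boolean flags need explicit string parsing.
    return str(value).lower() in ('true', '1', 'yes')


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--learning_rate', type=float, default=0.00001)
    parser.add_argument('--train_batch_size', type=int, default=128)
    parser.add_argument('--use_xla', type=str_to_bool, default=False)
    parser.add_argument('--freeze_bert_layer', type=str_to_bool, default=False)
    # Ignore arguments this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args([
    '--epochs', '1',
    '--learning_rate', '1e-05',
    '--use_xla', 'True',
    '--freeze_bert_layer', 'False',
    '--max_seq_length', '128',
])
print(args)
```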

# Train the Model on SageMaker

In [21]:
estimator.fit(inputs={'train': s3_input_train_data,
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
                     },
              experiment_config=experiment_config,
              wait=False)

INFO:sagemaker:Creating training-job with name: tensorflow-training-2020-04-25-18-34-53-740


In [22]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

Training Job Name:  tensorflow-training-2020-04-25-18-34-53-740


In [23]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))

In [24]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))

In [25]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))

# Wait Until the Training Job Completes Above!
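Because the job was started with `wait=False`, the notebook does not block while training runs. One way to wait programmatically is to poll the job status, as in the sketch below. The `wait_for_training_job` helper and the fake describe function are hypothetical; in the notebook, the describe function would be `sm.describe_training_job`, which returns a `TrainingJobStatus` field.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(describe_fn, job_name, poll_seconds=60, max_polls=240):
    """Poll describe_fn until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = describe_fn(TrainingJobName=job_name)['TrainingJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('Training job {} did not finish in time'.format(job_name))


def make_fake_describe(statuses):
    """Stand-in for the SageMaker client so the sketch runs without AWS access."""
    remaining = list(statuses)

    def describe(TrainingJobName):
        status = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        return {'TrainingJobStatus': status}

    return describe


final_status = wait_for_training_job(
    make_fake_describe(['InProgress', 'InProgress', 'Completed']),
    'example-job',
    poll_seconds=0)
print(final_status)
```

In this notebook, the equivalent call would be `wait_for_training_job(sm.describe_training_job, training_job_name)`.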

# Show Experiment Results

In [33]:
from sagemaker.analytics import ExperimentAnalytics

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=experiment.experiment_name,
    sort_by='metrics.validation:accuracy.max',  # Sorting By Validation Accuracy
    sort_order='Descending',
    metric_names=['validation:accuracy'],
    parameter_names=['epochs', 'train_batch_size']
)

In [34]:
analytics_table = trial_component_analytics.dataframe()
analytics_table

| | TrialComponentName | DisplayName | SourceArn | epochs | train_batch_size | validation:accuracy - Min | validation:accuracy - Max | validation:accuracy - Avg | validation:accuracy - StdDev | validation:accuracy - Last | validation:accuracy - Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tensorflow-training-2020-04-25-18-34-53-740-aw... | Train-Reviews-BERT-Trial-1587839693 | arn:aws:sagemaker:us-west-2:086401037028:train... | 1.0 | 128.0 | 0.7198 | 0.7198 | 0.7198 | 0.0 | 0.7198 | 1.0 |
| 1 | TrialComponent-2020-04-25-183453-qtks | Train-Reviews-BERT-Experiment-1587839693 | | | | | | | | | |


In [35]:
lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment.experiment_name,
    sort_by="CreationTime",
    sort_order="Ascending",
)
lineage_table.dataframe()

| | TrialComponentName | DisplayName | SourceArn | SageMaker.ImageUri | SageMaker.InstanceCount | SageMaker.InstanceType | SageMaker.VolumeSizeInGB | enable_sagemaker_debugger | epochs | epsilon | ... | train:accuracy - Avg | train:accuracy - StdDev | train:accuracy - Last | train:accuracy - Count | loss_EVAL - Min | loss_EVAL - Max | loss_EVAL - Avg | loss_EVAL - StdDev | loss_EVAL - Last | loss_EVAL - Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TrialComponent-2020-04-25-183453-qtks | Train-Reviews-BERT-Experiment-1587839693 | | | | | | | | | ... | | | | | | | | | | |
| 1 | tensorflow-training-2020-04-25-18-34-53-740-aw... | Train-Reviews-BERT-Trial-1587839693 | arn:aws:sagemaker:us-west-2:086401037028:train... | 763104351884.dkr.ecr.us-west-2.amazonaws.com/t... | 1.0 | ml.p3.2xlarge | 1024.0 | True | 1.0 | 1e-08 | ... | 0.634206 | 0.139124 | 0.7248 | 17.0 | 0.534383 | 0.88062 | 0.710282 | 0.068981 | 0.667141 | 201.0 |


# Analyze Debugger Rules

In [36]:
estimator.latest_training_job.rule_job_summary()

[{'RuleConfigurationName': 'LossNotDecreasing',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:086401037028:processing-job/tensorflow-training-2020-0-lossnotdecreasing-18545fcf',
  'RuleEvaluationStatus': 'InProgress',
  'LastModifiedTime': datetime.datetime(2020, 4, 25, 18, 45, 26, 249000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'Overtraining',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:086401037028:processing-job/tensorflow-training-2020-0-overtraining-969ea0ef',
  'RuleEvaluationStatus': 'IssuesFound',
  'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule Overtraining at step 1100 resulted in the condition being met\n',
  'LastModifiedTime': datetime.datetime(2020, 4, 25, 18, 45, 26, 249000, tzinfo=tzlocal())}]
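When several rules are attached, it can help to pull out only the ones that reported a problem. The sketch below filters a rule summary list for the `IssuesFound` status. The `rules_with_issues` helper is hypothetical, and the sample list is a trimmed copy of the output above so the example runs without AWS access.

```python
# Trimmed copy of the rule summary shown above (ARNs and timestamps omitted).
rule_summaries = [
    {'RuleConfigurationName': 'LossNotDecreasing',
     'RuleEvaluationStatus': 'InProgress'},
    {'RuleConfigurationName': 'Overtraining',
     'RuleEvaluationStatus': 'IssuesFound',
     'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule Overtraining '
                      'at step 1100 resulted in the condition being met\n'},
]


def rules_with_issues(summaries):
    """Return (rule name, details) pairs for every rule that reported issues."""
    return [
        (summary['RuleConfigurationName'], summary.get('StatusDetails', '').strip())
        for summary in summaries
        if summary['RuleEvaluationStatus'] == 'IssuesFound'
    ]


flagged = rules_with_issues(rule_summaries)
for name, details in flagged:
    print('{}: {}'.format(name, details))
```

In the notebook, the same helper could be applied to `estimator.latest_training_job.rule_job_summary()`.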

In [37]:
training_job_debugger_artifacts_path = estimator.latest_job_debugger_artifacts_path()
print(training_job_debugger_artifacts_path)

s3://sagemaker-us-west-2-086401037028/tensorflow-training-2020-04-25-18-34-53-740/debug-output


# Pass Variables to the Next Notebook(s)

In [38]:
%store training_job_debugger_artifacts_path

Stored 'training_job_debugger_artifacts_path' (str)


In [39]:
%store training_job_name

Stored 'training_job_name' (str)
