# Fine-Tune a BERT Model and Create a Text Classifier

Now, let’s fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

As mentioned earlier, BERT is built on an attention-based architecture called the Transformer. This is, not coincidentally, the name of the popular Python library for BERT, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), which requires less memory and compute but maintains very good accuracy on our dataset.


## Feature Engineering

In the previous section, we performed the feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we saved the files in TFRecord format.
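To make the shape of those features concrete, here is a minimal sketch of how one review could be turned into the fixed-length inputs a BERT-style model expects. The tiny vocabulary and the whitespace tokenizer are hypothetical stand-ins for the pre-trained WordPiece tokenizer, and the feature names `input_ids` and `input_mask` are assumptions chosen for illustration, not names confirmed by this notebook.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real pipeline would use the tokenizer's own vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "great": 4, "product": 5}


def encode_review(text: str, max_seq_length: int) -> Dict[str, List[int]]:
    """Convert text into fixed-length token ids plus an attention mask."""
    # Whitespace splitting stands in for WordPiece tokenization (assumption).
    tokens = text.lower().split()
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[: max_seq_length - 2]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids.append(TOY_VOCAB["[SEP]"])
    # Real tokens get mask 1, padding gets mask 0.
    mask = [1] * len(ids)
    padding = max_seq_length - len(ids)
    ids += [TOY_VOCAB["[PAD]"]] * padding
    mask += [0] * padding
    return {"input_ids": ids, "input_mask": mask}


features = encode_review("Great product overall", max_seq_length=8)
print(features["input_ids"])   # [2, 4, 5, 1, 3, 0, 0, 0]
print(features["input_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every record ends up with the same length, which is what allows the examples to be batched and written to TFRecord files with a fixed schema.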

![BERT Training](img/bert_training.png)

![BERT Pre-Processing](img/prepare_dataset_bert.png)

In [1]:
!pip install --user -qU 'sagemaker[local]'

In [2]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
#role = sagemaker.get_execution_role()
role = 'arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881'
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

ClientError: An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid.

# Specify the Dataset
We are using the train, validation, and test splits created in the previous section.

In [3]:
!ls ./data-tfrecord/

ls: ./data-tfrecord/: No such file or directory


In [4]:
processed_train_data_local = 'file://data-tfrecord/bert-train'
print(processed_train_data_local)

file://data-tfrecord/bert-train


In [5]:
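# NOTE: this still points at the training split; swap in the validation split directory written by the previous section.
processed_validation_data_local = 'file://data-tfrecord/bert-train'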
print(processed_validation_data_local)

file://data-tfrecord/bert-train


In [6]:
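# NOTE: this still points at the training split; swap in the test split directory written by the previous section.
processed_test_data_local = 'file://data-tfrecord/bert-train'
print(processed_test_data_local)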

file://data-tfrecord/bert-train
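Because the `ls` above reported that `./data-tfrecord/` does not exist, it can help to verify the local channel directories before starting a job. The following is a minimal sketch using only the standard library; the helper names `local_path` and `missing_channels` are hypothetical and not part of the SageMaker SDK.

```python
from pathlib import Path
from typing import Dict, List


def local_path(uri: str) -> Path:
    """Strip the file:// scheme used for local mode and return a Path."""
    prefix = "file://"
    return Path(uri[len(prefix):] if uri.startswith(prefix) else uri)


def missing_channels(channels: Dict[str, str]) -> List[str]:
    """Return the names of channels whose local directory does not exist."""
    return [name for name, uri in channels.items() if not local_path(uri).is_dir()]


channels = {
    "train": "file://data-tfrecord/bert-train",
    "validation": "file://data-tfrecord/bert-train",
    "test": "file://data-tfrecord/bert-train",
}
print("Missing channels:", missing_channels(channels))
```

An empty list means every channel directory is present; any name that is printed should be fixed before calling `fit`.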


# Show TensorFlow Training Code

In [7]:
!pygmentize src/tf_bert_reviews.py

[34mimport[39;49;00m [04m[36mtime[39;49;00m
[34mimport[39;49;00m [04m[36mrandom[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mfrom[39;49;00m [04m[36mglob[39;49;00m [34mimport[39;49;00m glob
[34mimport[39;49;00m [04m[36mpprint[39;49;00m
[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36msubprocess[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mtensorflow[39;49;00m [34mas[39;49;00m [04m[36mtf[39;49;00m
[37m#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])[39;49;00m
subprocess.check_call([sys.executable, [33m'[39;49;00m[33m-m[39;49;00m[33m'[39;49;00m, [33m'[39;49;00m[33mpip[39;49;00m[33m'[39;49;00m, [33m'[39;49;00m[33minstall[39;49;00m[33m'[39;49;00m, [33m'[39;49;00m[33

# Set Up Hyperparameters for the Classification Layer

In [15]:
epochs=1
learning_rate=0.00001
epsilon=0.00000001
train_batch_size=128
validation_batch_size=128
test_batch_size=128
train_steps_per_epoch=50
validation_steps=50
test_steps=50
train_instance_count=1
train_instance_type='local'
train_volume_size=1024
use_xla=True
use_amp=True
freeze_bert_layer=False
enable_sagemaker_debugger=True                    
input_mode='File'
run_validation=True
run_test=True
run_sample_predictions=True
max_seq_length=128
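The batch sizes and step counts together bound how much data a run touches. Assuming the training script treats `train_steps_per_epoch` as the number of batches drawn per epoch (the usual Keras `steps_per_epoch` meaning, which this notebook does not show), a quick check of the numbers above looks like this. The helper `examples_seen` is a hypothetical name used only for this illustration.

```python
def examples_seen(batch_size: int, steps: int, epochs: int = 1) -> int:
    """Number of examples consumed when each step draws one full batch."""
    return batch_size * steps * epochs


print(examples_seen(128, 50, epochs=1))  # 6400 training examples
print(examples_seen(128, 50))            # 6400 validation (or test) examples
```

If the dataset holds more than 6,400 reviews, one epoch with these settings does not cover all of it.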

# Set Up Our BERT + TensorFlow Script to Run Locally
Prepare our TensorFlow model to run on our local machine.

In [20]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='tf_bert_reviews.py', 
                       source_dir='src', # put requirements.txt in this directory and it gets picked up
                       role=role,
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
                       train_max_wait=7200, # Seconds to wait for spot instances to become available
                       py_version='py3',
                       framework_version='2.1.0',
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'run_validation': run_validation,
                                        'run_test': run_test,
                                        'run_sample_predictions': run_sample_predictions},
                       input_mode=input_mode,                    
                       train_max_run=7200 # max 2 hours * 60 minutes per hour * 60 seconds per minute
                      )
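In script mode, SageMaker hands the `hyperparameters` dictionary to the entry point as command-line arguments, so values such as `True` arrive as strings. The sketch below shows one way a training script could parse a few of them with `argparse`. It is not the actual contents of `tf_bert_reviews.py`, and the `str_to_bool` helper is a hypothetical name.

```python
import argparse
from typing import List, Optional


def str_to_bool(value: str) -> bool:
    """Interpret 'True'/'False' strings, since CLI values arrive as text."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--train_batch_size", type=int, default=128)
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--use_xla", type=str_to_bool, default=False)
    parser.add_argument("--freeze_bert_layer", type=str_to_bool, default=False)
    # Ignore arguments this sketch does not declare instead of failing.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--learning_rate", "1e-05",
                   "--use_xla", "True", "--freeze_bert_layer", "False"])
print(args.epochs, args.learning_rate, args.use_xla, args.freeze_bert_layer)
```

Parsing booleans explicitly matters because `bool("False")` is `True` in Python, so a plain `type=bool` would silently enable every flag.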

# Train the Model

### Note: The TensorFlow 2.0.0+ SageMaker Deep Learning Docker container does not find `resourceconfig.json`, so we can't run this job. TF 1.15 containers seem to be OK.

We're seeing this error:
```
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/50/1dlms49d3013ybsdl_k9nph0m05pfl/T/tmp0xw_kh44/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
```

In [21]:
estimator.fit(inputs={'train': processed_train_data_local, 
                      'validation': processed_validation_data_local,
                      'test': processed_test_data_local
              },                             
              wait=False)

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


Creating tmp0xw_kh44_algo-1-1eri7_1 ... done
Attaching to tmp0xw_kh44_algo-1-1eri7_1
algo-1-1eri7_1  | 2020-06-01 18:33:34,486 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-1eri7_1  | 2020-06-01 18:33:34,500 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-1eri7_1  | 2020-06-01 18:33:35,770 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-1eri7_1  | 2020-06-01 18:33:35,801 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-1eri7_1  | 2020-06-01 18:33:35,823 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-1eri7_1  | 2020-06-01 18:33:35,834 sagemaker-containers INFO     Invoking user script
algo-1-1eri7_1  |
algo-1-1eri7_1  | Training Env:
algo-1-1eri7_1  |
algo-1-1eri7_1  | {
algo-1-1eri7_1  |

[36malgo-1-1eri7_1  |[0m Collecting transformers==2.8.0
[36malgo-1-1eri7_1  |[0m   Downloading transformers-2.8.0-py3-none-any.whl (563 kB)
[K     |████████████████████████████████| 563 kB 673 kB/s eta 0:00:01
[36malgo-1-1eri7_1  |[0m Collecting tokenizers==0.5.2
[36malgo-1-1eri7_1  |[0m   Downloading tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB)
[K     |████████████████████████████████| 3.7 MB 3.0 MB/s eta 0:00:01
[36malgo-1-1eri7_1  |[0m [?25hCollecting regex!=2019.12.17
[36malgo-1-1eri7_1  |[0m   Downloading regex-2020.5.14-cp36-cp36m-manylinux2010_x86_64.whl (675 kB)
[K     |████████████████████████████████| 675 kB 4.0 MB/s eta 0:00:01
[36malgo-1-1eri7_1  |[0m [?25hCollecting sentencepiece
[36malgo-1-1eri7_1  |[0m   Downloading sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 4.7 MB/s eta 0:00:01
[36malgo-1-1eri7_1  |[0m [?25hCollecting dataclasses; python_version < "3.7"
[36malgo-

Downloading: 100% 232k/232k [00:00<00:00, 460kB/s]  
Downloading: 100% 442/442 [00:00<00:00, 412kB/s]
Downloading: 100% 363M/363M [01:30<00:00, 4.00MB/s]    
[36malgo-1-1eri7_1  |[0m 2020-06-01 18:35:21.696041: W tensorflow/python/util/util.cc:319] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[36malgo-1-1eri7_1  |[0m Sucessfully downloaded after 0 retries.
[36malgo-1-1eri7_1  |[0m ** use_amp True
[36malgo-1-1eri7_1  |[0m enable_sagemaker_debugger False
[36malgo-1-1eri7_1  |[0m *** OPTIMIZER <tensorflow.python.keras.mixed_precision.experimental.loss_scale_optimizer.LossScaleOptimizer object at 0x7f08cc10df98> ***
[36malgo-1-1eri7_1  |[0m Trained model <transformers.modeling_tf_distilbert.TFDistilBertForSequenceClassification object at 0x7f08ec116320>
[36malgo-1-1eri7_1  |[0m Model: "tf_distil_bert_for_sequence_classification"
[36malgo-1-1eri7_1  |[0m ______________________________________________________

'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.


tmp0xw_kh44_algo-1-1eri7_1 exited with code 1
Aborting on container exit...


RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/50/1dlms49d3013ybsdl_k9nph0m05pfl/T/tmp0xw_kh44/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

In [None]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

In [None]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))

In [None]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))

In [None]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))

# Wait Until the ^^ Training Job ^^ Completes Above!

In [None]:
estimator.latest_training_job.wait(logs=False)
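Waiting on a job amounts to polling its status until it reaches a terminal state. The sketch below illustrates that idea with a status source passed in as a function, so it runs without AWS credentials. The helper `wait_for_job` is a hypothetical name and is not how the SDK's own `wait` is implemented. With credentials, the status function could call `describe_training_job` on the `sm` client created earlier and read its `TrainingJobStatus` field.

```python
import time
from typing import Callable

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status: Callable[[], str], poll_seconds: float = 30.0,
                 timeout_seconds: float = 7200.0) -> str:
    """Poll get_status until a terminal status is returned or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job still {} after {} seconds".format(status, timeout_seconds))
        time.sleep(poll_seconds)


# Example with a fake status source; with AWS credentials this could instead be
# lambda: sm.describe_training_job(TrainingJobName=training_job_name)["TrainingJobStatus"]
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0.01))  # Completed
```

Checking the returned status before moving on avoids inspecting a model from a job that ended in `Failed` or `Stopped`.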

# Inspect the Model