# Fine-Tune a RoBERTa Model and Create a Text Classifier (Sentiment Analysis)

BERT is built on the Transformer architecture, which relies on an attention mechanism. This is, not coincidentally, the name of the popular Python library, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [RoBERTa](https://arxiv.org/abs/1907.11692), a Robustly Optimized BERT Pretraining Approach.

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Retrieve Pre-Processed Data

In [2]:
%store -r processed_train_data_s3_uri

In [3]:
print(processed_train_data_s3_uri)
!aws s3 ls $processed_train_data_s3_uri/

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-05-02-16-20-10-330/output/sentiment-train
2021-05-02 16:24:02    9837753 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tsv
2021-05-02 16:24:02   13594432 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tsv


In [4]:
%store -r processed_validation_data_s3_uri

In [5]:
print(processed_validation_data_s3_uri)
!aws s3 ls $processed_validation_data_s3_uri/

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-05-02-16-20-10-330/output/sentiment-validation
2021-05-02 16:24:02     543306 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tsv
2021-05-02 16:24:02     811124 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tsv


In [6]:
%store -r processed_test_data_s3_uri

In [7]:
print(processed_test_data_s3_uri)
!aws s3 ls $processed_test_data_s3_uri/

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-05-02-16-20-10-330/output/sentiment-test
2021-05-02 16:24:03     534853 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tsv
2021-05-02 16:24:03     791466 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tsv


# Specify S3 `Distribution Strategy`

In [8]:
from sagemaker.inputs import TrainingInput

s3_input_train_data = TrainingInput(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = TrainingInput(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = TrainingInput(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-05-02-16-20-10-330/output/sentiment-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-05-02-16-20-10-330/output/sentiment-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-05-02-16-20-10-330/output/sentiment-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
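
`ShardedByS3Key` gives each training instance a subset of the S3 objects under the prefix, while the default `FullyReplicated` copies every object to every instance. The following is a rough sketch of the sharded case, not SageMaker's actual assignment logic: `shard_keys` is a hypothetical helper and the round-robin order is an assumption made for illustration.

```python
def shard_keys(keys, instance_count):
    """Split S3 object keys across instances, round-robin (illustrative assumption)."""
    shards = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(keys)):
        shards[i % instance_count].append(key)
    return shards


# File names taken from the `aws s3 ls` listing of the training prefix.
train_keys = [
    "part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tsv",
    "part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tsv",
]

shards = shard_keys(train_keys, instance_count=2)
for instance, shard in enumerate(shards):
    print(f"instance {instance}: {shard}")
```

With two files and two instances, each instance would receive one file. If there were fewer files than instances, some instances would receive no data, so it helps to keep the number of files at least as large as the instance count.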


# Set Up Hyper-Parameters for Classification Layer

## Choosing a `max_seq_len` for RoBERTa
Since a smaller `max_seq_len` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `70%` of our reviews.

Remember our distribution of review lengths from a previous section?

<img src="img/review_word_count_distribution.png" width="50%" align="left">

```
mean         67.930174
std         130.954079
min           1.000000
10%           4.000000
20%          14.000000
30%          21.000000
40%          25.000000
50%          31.000000
60%          42.000000
70%          59.000000
80%          87.000000
90%         149.000000
100%       5347.000000
max        5347.000000
```

A review length of `59` words is the `70th` percentile for this dataset.  Sequence lengths are conventionally chosen as powers of 2 when using BERT, so let's choose `64`, the smallest power of 2 greater than `59`.  Reviews longer than `64` will be truncated to `64`.
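
The rounding step can be sketched as follows. `next_power_of_two` is a hypothetical helper written for this illustration, and the value `59` is the 70th percentile from the table above.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of 2 that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


percentile_70 = 59  # 70th percentile of review length, from the table above
max_seq_len = next_power_of_two(percentile_70)
print(max_seq_len)  # 64

# Truncation keeps only the first max_seq_len items of a longer sequence.
long_review = list(range(100))  # stand-in for a 100-token review
print(len(long_review[:max_seq_len]))  # 64
```

Note that the percentiles are measured in words, while `max_seq_len` counts tokenizer tokens. RoBERTa's subword tokenizer can split one word into several tokens and adds special start and end tokens, so somewhat more than 30% of reviews may be truncated in practice.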

In [9]:
max_seq_len=64

In [10]:
model_name='roberta-base'
epochs=3
lr=2e-5
train_batch_size=64
train_steps_per_epoch=100
validation_batch_size=64
test_batch_size=64
seed=42
backend='gloo'
train_instance_count=2
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024
enable_sagemaker_debugger=True
input_mode='File'
run_validation=True
run_test=False
run_sample_predictions=False

In [11]:
hyperparameters={
        'model_name': model_name,
        'epochs': epochs,
        'lr': lr,
        'train_batch_size': train_batch_size,
        'train_steps_per_epoch': train_steps_per_epoch,
        'validation_batch_size': validation_batch_size,
        'test_batch_size': test_batch_size,
        'seed': seed,
        'max_seq_len': max_seq_len,
        'backend': backend,
        'enable_sagemaker_debugger': enable_sagemaker_debugger,
        'run_validation': run_validation,
        'run_sample_predictions': run_sample_predictions}
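
In script mode, SageMaker passes each entry of `hyperparameters` to the entry-point script as a `--name value` command-line argument, with values arriving as strings. The contents of `train_simple.py` are not fully shown here, so the following is only a sketch of how such arguments might be parsed, assuming the argument names mirror the dictionary keys. `str_to_bool` and `parse_args` are hypothetical helpers.

```python
import argparse


def str_to_bool(value) -> bool:
    """Parse 'True'/'False' strings; argparse's type=bool would treat 'False' as True."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="roberta-base")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--lr", type=float, default=2e-5)
    parser.add_argument("--max_seq_len", type=int, default=64)
    parser.add_argument("--run_validation", type=str_to_bool, default=True)
    # Hyperparameters not declared in this sketch are ignored rather than rejected.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line that the training container might receive.
args = parse_args([
    "--epochs", "3",
    "--lr", "2e-05",
    "--max_seq_len", "64",
    "--run_validation", "True",
    "--seed", "42",
])
print(args.epochs, args.lr, args.max_seq_len, args.run_validation)
```

The explicit boolean parser matters because flags such as `run_validation` are sent as the strings `True` or `False`, and any non-empty string is truthy in Python.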

# Set Up Metrics To Track Model Performance

A sample log line such as...
```
[step: 0] val_loss: 0.55 - val_acc: 74.64%
```

...will produce the following 2 metrics in CloudWatch:

`validation:loss` =  0.55

`validation:accuracy` = 74.64

<img src="img/cloudwatch_train_accuracy.png" width="50%" align="left">

<img src="img/cloudwatch_train_loss.png" width="50%" align="left">

In [12]:
metric_definitions = [
     {'Name': 'train:loss', 'Regex': 'train_loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'train_acc: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9\\.]+)'},
]
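
To check that the regular expressions capture the intended values, they can be tried locally against the sample log line. SageMaker applies the patterns on the service side, so Python's `re` module is used here only as an approximation.

```python
import re

# The two validation patterns from metric_definitions, repeated so this block runs on its own.
validation_metrics = [
    {"Name": "validation:loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "validation:accuracy", "Regex": "val_acc: ([0-9\\.]+)"},
]

log_line = "[step: 0] val_loss: 0.55 - val_acc: 74.64%"

parsed = {}
for definition in validation_metrics:
    match = re.search(definition["Regex"], log_line)
    if match:
        # The first capture group holds the numeric metric value.
        parsed[definition["Name"]] = float(match.group(1))

print(parsed)  # {'validation:loss': 0.55, 'validation:accuracy': 74.64}
```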

# Set Up SageMaker Debugger
Configure the Debugger hook and the tensor collections to save. Built-in Debugger rules are described here:  https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html

In [13]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

In [14]:
debugger_hook_config = DebuggerHookConfig(
    s3_output_path='s3://{}'.format(bucket),
    hook_parameters={
        "save_interval": "10",
    },
    collection_configs=[
        CollectionConfig(
            name="all"
        )
    ]
)
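
A `save_interval` of `10` asks the hook to save tensors every 10 steps. The following back-of-the-envelope sketch estimates how many training steps would be saved, assuming the hook counts one step per training batch and that the job runs `epochs * train_steps_per_epoch` batches per instance. Both are assumptions about the training script, which is not fully shown here.

```python
save_interval = 10
epochs = 3
train_steps_per_epoch = 100

# Assumed total number of training steps seen by the hook on one instance.
total_steps = epochs * train_steps_per_epoch

# A step is saved when its index is a multiple of save_interval.
saved_steps = [step for step in range(total_steps) if step % save_interval == 0]

print(len(saved_steps))                  # 30
print(saved_steps[:3], saved_steps[-1])  # [0, 10, 20] 290
```

Because the `all` collection saves every tensor, a smaller `save_interval` increases the amount of data written to S3 and can slow training.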

# Set Up Our RoBERTa + PyTorch Script to Run on SageMaker
Prepare our PyTorch model to run on the managed SageMaker service.

In [15]:
!pygmentize ./src/train_simple.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mpprint[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mlogging[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mrandom[39;49;00m
[34mimport[39;49;00m [04m[36mtime[39;49;00m

[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mdistributed[39;49;00m [34mas[39;49;00m [04m[36mdist[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m[04m[36m.[39;49;00m[04m[36mfunctional[39;49;00m [34mas[39;49;00m 

In [16]:
from sagemaker.pytorch import PyTorch as PyTorchEstimator

estimator = PyTorchEstimator(
    entry_point='train_simple.py',
    source_dir='src',
    role=role,
    instance_count=train_instance_count,
    instance_type=train_instance_type,
    volume_size=train_volume_size,
    py_version='py3',
    framework_version='1.6.0',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    input_mode=input_mode,
    debugger_hook_config=debugger_hook_config
)

In [17]:
estimator.fit(inputs={'train': s3_input_train_data, 
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
                     },
              wait=False)

In [18]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

Training Job Name:  pytorch-training-2021-05-02-16-26-03-928


In [19]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [20]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [21]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))


In [22]:
estimator.latest_training_job.wait(logs=False)


2021-05-02 16:26:04 Starting - Starting the training job
2021-05-02 16:26:06 Starting - Launching requested ML instances.................
2021-05-02 16:27:36 Starting - Preparing the instances for training....................
2021-05-02 16:29:22 Downloading - Downloading input data..
2021-05-02 16:29:40 Training - Downloading the training image................
2021-05-02 16:31:04 Training - Training image download completed. Training in progress..........................................................................................................................................................................................................................................................................................................................................................................................................................
2021-05-02 17:05:29 Uploading - Uploading generated training model............................................................
2021-05-02 1

# _Wait Until the ^^ Training Job ^^ Completes Above!_
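
If the kernel restarts while the job is running, the status can also be polled with the `describe_training_job` API. This is a minimal sketch, assuming a boto3 SageMaker client such as `sm` from the first cell. `wait_for_training_job` is a hypothetical helper, not part of the SageMaker SDK, and the polling interval is an arbitrary choice.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, max_polls=120):
    """Poll a training job until it reaches a terminal status and return that status."""
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Training job {job_name} did not finish after {max_polls} polls")


# Example usage in this notebook (requires AWS credentials, so it is left commented out):
# status = wait_for_training_job(sm, training_job_name)
# print(status)
```

Only continue to the next cells when the returned status is `Completed`, since `model.tar.gz` is uploaded at the end of a successful job.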

In [23]:
model_s3_uri = estimator.model_data
print(model_s3_uri)

s3://sagemaker-us-east-1-231218423789/pytorch-training-2021-05-02-16-26-03-928/output/model.tar.gz


In [24]:
!mkdir -p ./tmp/model/

In [25]:
!aws s3 cp s3://$bucket/$training_job_name/output/model.tar.gz ./tmp/model/model.tar.gz

download: s3://sagemaker-us-east-1-231218423789/pytorch-training-2021-05-02-16-26-03-928/output/model.tar.gz to tmp/model/model.tar.gz


In [26]:
!tar -xvzf ./tmp/model/model.tar.gz -C ./tmp/model/

transformer/
transformer/pytorch_model.bin
transformer/config.json
model.pth
transformer/
transformer/pytorch_model.bin
transformer/config.json
model.pth
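
The listing above shows each entry twice. As a rough sketch, the archive contents can also be inspected from Python with the standard `tarfile` module, collapsing repeated names. `unique_member_names` is a hypothetical helper, and the small archive built below is a stand-in with empty placeholder files so that the example runs without S3 access.

```python
import os
import tarfile
import tempfile


def unique_member_names(archive_path):
    """Return archive member names in first-seen order, without repeats."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return list(dict.fromkeys(tar.getnames()))


# Build a stand-in archive that mimics the duplicated listing.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "model.tar.gz")
placeholder = os.path.join(workdir, "placeholder")
open(placeholder, "w").close()

with tarfile.open(archive_path, "w:gz") as tar:
    for _ in range(2):  # add every entry twice
        tar.add(placeholder, arcname="transformer/config.json")
        tar.add(placeholder, arcname="model.pth")

names = unique_member_names(archive_path)
print(names)  # ['transformer/config.json', 'model.pth']
```

For the real artifact, the same helper could be called as `unique_member_names('./tmp/model/model.tar.gz')`.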


# Pass Variables to the Next Notebook(s)

In [39]:
%store model_s3_uri

Stored 'model_s3_uri' (str)


In [40]:
%store training_job_name

Stored 'training_job_name' (str)


In [41]:
%store training_job_debugger_artifacts_path

Stored 'training_job_debugger_artifacts_path' (str)


In [42]:
%store

Stored variables and their in-db values:
balance_dataset                                       -> True
ingest_create_athena_table_parquet_passed             -> True
model_s3_uri                                          -> 's3://sagemaker-us-east-1-231218423789/pytorch-tra
pipeline_experiment_name                              -> 'BERT-pipeline-1617561099'
pipeline_name                                         -> 'BERT-pipeline-1617561099'
pipeline_trial_name                                   -> 'trial-1617561100'
processed_test_data_s3_uri                            -> 's3://sagemaker-us-east-1-231218423789/sagemaker-s
processed_train_data_s3_uri                           -> 's3://sagemaker-us-east-1-231218423789/sagemaker-s
processed_validation_data_s3_uri                      -> 's3://sagemaker-us-east-1-231218423789/sagemaker-s
raw_input_data_s3_uri                                 -> 's3://sagemaker-us-east-1-231218423789/pytorch/ama
s3_private_path_tsv                                

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}