# BertBaseCased - PyTorch
This notebook shows how to fine-tune the `bert-base-cased` PyTorch model on AWS Trainium (trn1 instances) using the AWS Neuron SDK. The original model implementation is provided by Hugging Face.

The example has two stages:
1. Compile the model for the AWS Trainium device with the `neuron_parallel_compile` utility.
1. Run the fine-tuning script to train the model on the selected task (e.g. `mrpc`). The training job uses 2 data-parallel workers to speed up training. On a larger instance (trn1.32xlarge) you can increase the worker count to 8 or 32.
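
The worker counts mentioned in the second stage can be captured in a small lookup. This is a minimal sketch, not part of the original notebook: the helper name `pick_num_workers` and the `WORKER_OPTIONS` table are illustrative assumptions, with values taken only from the description above (2 workers on trn1.2xlarge; 2, 8 or 32 on trn1.32xlarge).

```python
from typing import Optional

# Hypothetical table (an assumption for illustration): worker counts named in
# the text above. Adjust it if your setup supports other values.
WORKER_OPTIONS = {
    "trn1.2xlarge": (2,),
    "trn1.32xlarge": (2, 8, 32),
}


def pick_num_workers(instance_type: str, requested: Optional[int] = None) -> int:
    """Return a worker count suitable for torchrun's --nproc_per_node."""
    options = WORKER_OPTIONS.get(instance_type)
    if options is None:
        raise ValueError(f"unknown instance type: {instance_type}")
    if requested is None:
        # Default to the largest count listed for this instance type.
        return max(options)
    if requested not in options:
        raise ValueError(
            f"{requested} workers is not listed for {instance_type}: {options}"
        )
    return requested


num_workers = pick_num_workers("trn1.2xlarge")
print(num_workers)  # 2
```

The returned value could be assigned to `num_workers` in the parameters cell instead of hard-coding it.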

It has been tested on a trn1.2xlarge instance.

**Reference:** https://huggingface.co/bert-base-cased

## 1) Install dependencies

In [1]:
# Point pip at the Neuron repository
%pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# Now restart the kernel

Writing to /home/ec2-user/.config/pip/pip.conf
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Install the Neuron compiler and the Neuron/XLA packages
%pip install -U torch-neuronx=="1.11.0.1.*" "numpy<=1.20.0" "protobuf<4" "transformers==4.16.2" datasets scikit-learn
# Use --force-reinstall if you run into issues while loading the modules
# Now restart the kernel again

## 2) Set the parameters

In [2]:
model_name = "bert-base-cased"
env_var_options = "XLA_USE_BF16=1 NEURON_CC_FLAGS=\"--model-type=transformer\""
num_workers = 2
task_name = "mrpc"
batch_size = 8
max_seq_length = 128
learning_rate = 2e-05
num_train_epochs = 5
model_base_name = model_name
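
These parameters determine how many optimizer steps a run takes. The following is a rough sketch of that arithmetic, not code from `transformers`: the helper `optimization_steps` is an illustrative name, and it assumes the dataset is split evenly across data-parallel workers. With the settings reported in the compile log further down (128 examples, batch size 8, a single process, 3 epochs) it reproduces the logged `Total optimization steps = 48`.

```python
import math


def optimization_steps(num_examples, per_device_batch_size, num_workers,
                       num_epochs, grad_accum_steps=1):
    """Estimate the total number of optimizer steps (illustrative helper)."""
    # Data parallelism: each worker processes its own share of the dataset.
    examples_per_worker = math.ceil(num_examples / num_workers)
    batches_per_epoch = math.ceil(examples_per_worker / per_device_batch_size)
    # One optimizer step per `grad_accum_steps` batches, at least one per epoch.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


# Values reported in the compile log: 128 examples, batch size 8, 1 process, 3 epochs.
print(optimization_steps(128, 8, 1, 3))  # 48

# Effective (global) batch size for the 2-worker training run configured above.
num_workers, batch_size = 2, 8
print(num_workers * batch_size)  # 16
```

Raising `num_workers` increases the effective batch size, so the step count per epoch drops accordingly.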

## 3) Compile the model with neuron_parallel_compile

In [7]:
print("Compile model")
COMPILE_CMD = f"""{env_var_options} neuron_parallel_compile python3 ./run_glue.py \
--model_name_or_path {model_name} \
--task_name {task_name} \
--do_train \
--max_seq_length {max_seq_length} \
--per_device_train_batch_size {batch_size} \
--learning_rate {learning_rate} \
--max_train_samples 128 \
--overwrite_output_dir \
--output_dir {model_base_name}-{task_name}-{batch_size}bs |& tee log_compile_{model_base_name}-{task_name}-{batch_size}bs"""

print(f'Running command: \n{COMPILE_CMD}')
! {COMPILE_CMD}

Compile model
Running command: 
XLA_USE_BF16=1 NEURON_CC_FLAGS="--model-type=transformer" neuron_parallel_compile python3 ./run_glue.py --model_name_or_path bert-base-cased --task_name mrpc --do_train --max_seq_length 128 --per_device_train_batch_size 8 --learning_rate 2e-05 --max_train_samples 128 --overwrite_output_dir --output_dir bert-base-cased-mrpc-8bs |& tee log_compile_bert-base-cased-mrpc-8bs
2022-10-19 21:11:57.000896: INFO ||PARALLEL_COMPILE||: Removing existing workdir /tmp/parallel_compile_workdir
2022-10-19 21:11:57.000898: INFO ||PARALLEL_COMPILE||: Running trial run (add option to terminate trial run early; also ignore trial run's generated outputs, i.e. loss, checkpoints)
__main__: Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
__main__: Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=Fa

[INFO|file_utils.py:2140] 2022-10-19 21:12:06,424 >> https://huggingface.co/bert-base-cased/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /home/ec2-user/.cache/huggingface/transformers/tmpsutvy1n4
Downloading: 100%|██████████| 426k/426k [00:00<00:00, 1.22MB/s]
[INFO|file_utils.py:2144] 2022-10-19 21:12:07,059 >> storing https://huggingface.co/bert-base-cased/resolve/main/tokenizer.json in cache at /home/ec2-user/.cache/huggingface/transformers/226a307193a9f4344264cdc76a12988448a25345ba172f2c7421f3b6810fddad.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6
[INFO|file_utils.py:2152] 2022-10-19 21:12:07,059 >> creating metadata file for /home/ec2-user/.cache/huggingface/transformers/226a307193a9f4344264cdc76a12988448a25345ba172f2c7421f3b6810fddad.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6
[INFO|tokenization_utils_base.py:1771] 2022-10-19 21:12:07,846 >> loading file https://huggingface.co/bert-base-cased/

[INFO|trainer.py:554] 2022-10-19 21:12:15,618 >> The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx.
[INFO|trainer.py:1244] 2022-10-19 21:12:15,626 >> ***** Running training *****
[INFO|trainer.py:1245] 2022-10-19 21:12:15,626 >>   Num examples = 128
[INFO|trainer.py:1246] 2022-10-19 21:12:15,626 >>   Num Epochs = 3
[INFO|trainer.py:1247] 2022-10-19 21:12:15,626 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1248] 2022-10-19 21:12:15,626 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1249] 2022-10-19 21:12:15,626 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1250] 2022-10-19 21:12:15,626 >>   Total optimization steps = 48
  0%|          | 0/48 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...

100%|██████████| 48/48 [00:07<00:00,  6.82it/s]
[INFO|trainer.py:2060] 2022-10-19 21:12:24,759 >> Saving model checkpoint to bert-base-cased-mrpc-8bs
[INFO|configuration_utils.py:430] 2022-10-19 21:12:24,761 >> Configuration saved in bert-base-cased-mrpc-8bs/config.json
2022-10-19 21:12:25.000533: DEBUG ||NCC_WRAPPER||: Compiling HLO: /tmp/MODULE_SyncTensorsGraph.20342_16825510754301468670.hlo.pb 
2022-10-19 21:12:25.000533: INFO ||NCC_WRAPPER||: No candidate found under /var/tmp/neuron-compile-cache/USER_neuroncc-2.2.0.46+c9e7f6a72/MODULE_16825510754301468670.
2022-10-19 21:12:25.000534: INFO ||NCC_WRAPPER||: Cache dir for the neff: /var/tmp/neuron-compile-cache/USER_neuroncc-2.2.0.46+c9e7f6a72/MODULE_16825510754301468670/MODULE_SyncTensorsGraph.20342_16825510754301468670/186f72f8-bee8-41ba-9f7c-e1e64d8fa0f7
2022-10-19 21:12:25.000540: INFO ||NCC_WRAPPER||: Extracting graphs for ahead-of-time parallel compilation. Nocompilation was done.
[INFO|modeling_utils.py:1074] 2022-10-19 21:12:
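
The `huggingface/tokenizers` fork warning in the output above refers to the `TOKENIZERS_PARALLELISM` environment variable. A minimal sketch, assuming you would like to silence that warning: prepend the variable to the `env_var_options` prefix from section 2. The name `quiet_env_var_options` is illustrative, and setting the variable to `false` turns tokenizer parallelism off, so treat this as an optional trade-off rather than a required step.

```python
import shlex

# Same prefix as in the parameters cell (section 2).
env_var_options = 'XLA_USE_BF16=1 NEURON_CC_FLAGS="--model-type=transformer"'

# Illustrative variant with the tokenizers setting added in front.
quiet_env_var_options = f"TOKENIZERS_PARALLELISM=false {env_var_options}"

# Split the prefix the way a POSIX shell would, to inspect each variable.
env_vars = dict(item.split("=", 1) for item in shlex.split(quiet_env_var_options))
for name, value in env_vars.items():
    print(f"{name} -> {value}")
```

Using `quiet_env_var_options` in place of `env_var_options` when building the compile and training commands would pass the extra variable to both runs.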

## 4) Fine-tune the model

In [8]:
print("Train model")
RUN_CMD = f"""{env_var_options} torchrun --nproc_per_node={num_workers} ./run_glue.py \
--model_name_or_path {model_name} \
--task_name {task_name} \
--do_train \
--do_eval \
--max_seq_length {max_seq_length} \
--per_device_train_batch_size {batch_size} \
--learning_rate {learning_rate} \
--num_train_epochs {num_train_epochs} \
--overwrite_output_dir \
--output_dir {model_base_name}-{task_name}-{num_workers}w-{batch_size}bs |& tee log_train_{model_base_name}-{task_name}-{num_workers}w-{batch_size}bs"""

print(f'Running command: \n{RUN_CMD}')
! {RUN_CMD}

Train model
Running command: 
XLA_USE_BF16=1 NEURON_CC_FLAGS="--model-type=transformer" torchrun --nproc_per_node=2 ./run_glue.py --model_name_or_path bert-base-cased --task_name mrpc --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 8 --learning_rate 2e-05 --num_train_epochs 5 --overwrite_output_dir --output_dir bert-base-cased-mrpc-2w-8bs |& tee log_train_bert-base-cased-mrpc-2w-8bs
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
__main__: Process rank: 1, device: xla:0, n_gpu: 0distributed training: True, 16-bits training: False
__main__: Process rank: 0, device: xla:1, n_gpu: 0distributed training: True, 16-bits training: False
__main__: Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_

[INFO|modeling_utils.py:1427] 2022-10-19 21:21:53,019 >> loading weights file https://huggingface.co/bert-base-cased/resolve/main/pytorch_model.bin from cache at /home/ec2-user/.cache/huggingface/transformers/092cc582560fc3833e556b3f833695c26343cb54b7e88cd02d40821462a74999.1f48cab6c959fc6c360d22bea39d06959e90f5b002e77e836d2da45464875cda
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
- This IS expected if you are initializing BertForSequenc

datasets.arrow_dataset: Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-933fca1f571e4614.arrow
datasets.arrow_dataset: Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-939825a6cb95f0d3.arrow
Running tokenizer on dataset:   0%|          | 0/2 [00:00<?, ?ba/s]Ignored unknown kwarg option direction
Ignored unknown kwarg option direction
Running tokenizer on dataset: 100%|██████████| 2/2 [00:00<00:00, 18.83ba/s]
[INFO|trainer.py:554] 2022-10-19 21:21:54,731 >> The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
[INFO|trainer.py:1244] 2022-10-19 21:21:54,739 >> ***** Running training *****
[INFO|trainer.py:1245] 2022-10-19 21:21:54,739 >>   N

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

2022-10-19 21:38:47.000210: INFO ||NCC_WRAPPER||: No candidate found under /var/tmp/neuron-compile-cache/USER_neuroncc-2.2.0.46+c9e7f6a72/MODULE_8461474067289264054.
2022-10-19 21:38:47.000212: INFO ||NCC_WRAPPER||: Cache dir for the neff: /var/tmp/neuron-compile-cache/USER_neuroncc-2.2.0.46+c9e7f6a72/MODULE_8461474067289264054/MODULE_7_SyncTensorsGraph.20606_8461474067289264054_ip-172-31-51-63.us-west-2.compute.internal-93b4b990-13942-5eb6a089c86ee/61ff5a58-4ed4-4c65-ae16-1fc43e0cd7c9
..............
Compiler status PASS
2022-10-19 21:43:22.000345: INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph
2022-10-19 21:43:22.000346: INFO ||NCC_WRAPPER||: No candidate found under /var/tmp/neuron-compile-cache/USER_neuroncc-2.2.0.46+c9e7f6a72/MODULE_16960748577934655409.
2022-10-19 21:43:22.000347: INFO ||NCC_WRAPPER||: Cache dir for the neff: /var/tmp/neuron-compile-cache/USER_neuroncc-2.2.0.46+c9e7f6a72/MODULE_16960748577934655409/MODULE_8_SyncTensorsGraph.20610_1696074857793465
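
Because both commands pipe their output through `tee`, the saved log files can be scanned afterwards for compiler activity. The sketch below is an assumption-laden illustration rather than a Neuron tool: `summarize_neuron_log` and the `MARKERS` table are hypothetical names, and the marker strings are copied from the log lines shown above. The sample lines are shortened stand-ins for real log entries.

```python
# Substrings copied from the NCC_WRAPPER output shown in this notebook.
MARKERS = {
    "cache_miss": "No candidate found under",
    "compiled_ok": "Compiler status PASS",
    "graph_extracted": "Extracting graphs for ahead-of-time parallel compilation",
}


def summarize_neuron_log(lines):
    """Count how often each marker appears in an iterable of log lines."""
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, marker in MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts


# Shortened sample lines (placeholders, not a real log).
sample = [
    "INFO ||NCC_WRAPPER||: No candidate found under <cache dir>",
    "..............",
    "Compiler status PASS",
    "INFO ||NCC_WRAPPER||: Exiting with a successfully compiled graph",
    "INFO ||NCC_WRAPPER||: No candidate found under <cache dir>",
]
print(summarize_neuron_log(sample))
```

To apply it to a real run, pass an open file instead of the sample list, for example `summarize_neuron_log(open("log_train_bert-base-cased-mrpc-2w-8bs"))`. A non-zero `cache_miss` count in the training log indicates graphs that were compiled during training rather than loaded from the cache.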