# DistilbertBaseUncased - PyTorch
This notebook shows how to fine-tune the "distilbert-base-uncased" PyTorch model on AWS Trainium (trn1 instances) using the Neuron SDK. The original implementation is provided by Hugging Face.

The example has 2 stages:
1. Compile the model for the AWS Trainium device with the `neuron_parallel_compile` utility.
1. Run the fine-tuning script to train the model on the associated task (e.g. mrpc). The training job uses 2 data-parallel workers to speed up training. If you have a larger instance (trn1.32xlarge), you can increase the worker count to 8 or 32.

It has been tested on a trn1.2xlarge instance.

**Reference:** https://huggingface.co/distilbert-base-uncased

## 1) Install dependencies

In [None]:
# Set the pip repository to point to the Neuron repository
%pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# now restart the kernel

In [None]:
# Refer to the Neuron documentation to install Neuron SDK v2.7
# Install extra packages ("sklearn" is a deprecated PyPI alias; the package is "scikit-learn")
%pip install -U "numpy<=1.20.0" "protobuf<4" "transformers==4.26.0" datasets scikit-learn
# Use --force-reinstall if you run into issues while loading the modules
# Now restart the kernel again

In [None]:
# Clone transformers from GitHub
!git clone https://github.com/huggingface/transformers --branch v4.26.0


In [None]:

# workaround for torchrun
!sed -i '49i# Disable DDP for torchrun' transformers/examples/pytorch/text-classification/run_glue.py
!sed -i '50ifrom transformers import __version__, Trainer' transformers/examples/pytorch/text-classification/run_glue.py
!sed -i '51iTrainer._wrap_model = lambda self, model, training=True, dataloader=None: model' transformers/examples/pytorch/text-classification/run_glue.py
# workaround for neuron_parallel_compile
!sed -i '52i# Workaround for neuron_parallel_compile' transformers/examples/pytorch/text-classification/run_glue.py
!sed -i '53iif os.environ.get("NEURON_EXTRACT_GRAPHS_ONLY", None):' transformers/examples/pytorch/text-classification/run_glue.py
!sed -i '54i\ \ \ \ import torch.distributed as dist' transformers/examples/pytorch/text-classification/run_glue.py
!sed -i '55i\ \ \ \ _verify_param_shape_across_processes = lambda process_group, tensors, logger=None: True' transformers/examples/pytorch/text-classification/run_glue.py
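
The `sed` commands above insert text at fixed line numbers, so running the cell a second time would insert the same lines again. A minimal sketch of a guard against that, assuming the marker comment `# Disable DDP for torchrun` is present only after the patch has been applied. The helper name `is_patched` is hypothetical and is not part of the notebook or of transformers.

```python
from pathlib import Path

# Marker comment inserted by the first sed command in the cell above.
PATCH_MARKER = "# Disable DDP for torchrun"
RUN_GLUE = Path("transformers/examples/pytorch/text-classification/run_glue.py")


def is_patched(source: str, marker: str = PATCH_MARKER) -> bool:
    """Return True if the marker comment already appears as a line of the script."""
    return any(line.strip() == marker for line in source.splitlines())


# Only inspect the file if the repository has already been cloned.
if RUN_GLUE.exists():
    status = "already patched" if is_patched(RUN_GLUE.read_text()) else "not patched yet"
    print(f"run_glue.py: {status}")
```

If the script reports "already patched", skip the `sed` cell or re-clone the repository before applying it again.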


## 2) Set the parameters

In [None]:
model_name = "distilbert-base-uncased"
env_var_options = "XLA_USE_BF16=1 NEURON_CC_FLAGS=\"--model-type=transformer --cache_dir=./compiler_cache_torchrun\""
num_workers = 2
task_name = "mrpc"
batch_size = 8
max_seq_length = 128
learning_rate = 2e-05
num_train_epochs = 5
model_base_name = model_name

accuracy_baseline = 0.7
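
With data parallelism each worker processes its own batch, so the effective batch size per optimizer step is the per-device batch size multiplied by the number of workers. A small sketch of that arithmetic, using the values set above and the 128 training samples used in the compile step. The helper names are hypothetical, and the sketch assumes no gradient accumulation and that the last partial batch is kept.

```python
import math


def global_batch_size(per_device_batch_size: int, num_workers: int) -> int:
    """Samples consumed per optimizer step across all data-parallel workers."""
    return per_device_batch_size * num_workers


def steps_per_epoch(num_samples: int, per_device_batch_size: int, num_workers: int) -> int:
    """Optimizer steps needed to see num_samples once (last partial batch kept)."""
    return math.ceil(num_samples / global_batch_size(per_device_batch_size, num_workers))


# Values from this notebook: batch_size = 8, num_workers = 2.
print(global_batch_size(8, 2))     # 16
print(steps_per_epoch(128, 8, 2))  # 8
```

Increasing the worker count on a larger instance raises the global batch size proportionally, which may call for revisiting the learning rate.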

## 3) Compile the model with neuron_parallel_compile

In [None]:
print("Compile model")
COMPILE_CMD = f"""{env_var_options} neuron_parallel_compile \
torchrun --nproc_per_node={num_workers} \
transformers/examples/pytorch/text-classification/run_glue.py \
--model_name_or_path {model_name} \
--task_name {task_name} \
--do_train \
--max_seq_length {max_seq_length} \
--per_device_train_batch_size {batch_size} \
--learning_rate {learning_rate} \
--max_train_samples 128 \
--overwrite_output_dir \
--output_dir {model_base_name}-{task_name}-{batch_size}bs |& tee log_compile_{model_base_name}-{task_name}-{batch_size}bs"""

print(f'Running command: \n{COMPILE_CMD}')
! {COMPILE_CMD}
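
`NEURON_CC_FLAGS` above points the compiler cache at `./compiler_cache_torchrun`, so one quick sanity check before training is to confirm that the compile step left something there. A rough sketch, where `count_cached_files` is a hypothetical helper that only counts files and assumes nothing about the cache's internal layout:

```python
import os


def count_cached_files(cache_dir: str) -> int:
    """Count regular files under cache_dir recursively; 0 if it does not exist."""
    total = 0
    for _root, _dirs, files in os.walk(cache_dir):
        total += len(files)
    return total


n_files = count_cached_files("./compiler_cache_torchrun")
print(f"{n_files} file(s) found in the compiler cache directory")
```

A count of zero suggests the compile step did not populate the cache, in which case the compile log is the first place to look.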

## 4) Fine-tune the model

In [None]:
print("Train model")
RUN_CMD = f"""{env_var_options} torchrun --nproc_per_node={num_workers} \
transformers/examples/pytorch/text-classification/run_glue.py \
--model_name_or_path {model_name} \
--task_name {task_name} \
--do_train \
--do_eval \
--max_seq_length {max_seq_length} \
--per_device_train_batch_size {batch_size} \
--learning_rate {learning_rate} \
--num_train_epochs {num_train_epochs} \
--overwrite_output_dir \
--output_dir {model_base_name}-{task_name}-{num_workers}w-{batch_size}bs |& tee log_train_{model_base_name}-{task_name}-{num_workers}w-{batch_size}bs"""

print(f'Running command: \n{RUN_CMD}')
! {RUN_CMD}

## 5) Evaluate results

In [None]:
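import json

# Load the metrics that run_glue.py wrote to the training output directory
results_path = f'{model_base_name}-{task_name}-{num_workers}w-{batch_size}bs/all_results.json'
with open(results_path) as f:
    all_results = json.load(f)
print(all_results['eval_accuracy'])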

In [None]:
assert all_results['eval_accuracy'] > accuracy_baseline, f"Accuracy must be greater than {accuracy_baseline}"
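
The lookups above fail with a bare `KeyError` if `eval_accuracy` is missing from the results, for example when the run did not include `--do_eval`. A minimal sketch of a more defensive check follows. `check_accuracy` is a hypothetical helper, and the sample dictionary holds made-up values, not output from a real run.

```python
def check_accuracy(results: dict, baseline: float, key: str = "eval_accuracy") -> bool:
    """Return True if results[key] exceeds baseline; explain clearly if key is absent."""
    if key not in results:
        raise ValueError(f"'{key}' not found; available keys: {sorted(results)}")
    return results[key] > baseline


# Illustrative values only.
sample_results = {"eval_accuracy": 0.85, "eval_loss": 0.4}
print(check_accuracy(sample_results, 0.7))  # True
```

In the notebook, the same call would take `all_results` and `accuracy_baseline` as arguments.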