# Compile and Train a Vision Transformer model on the ImageNet Dataset for Image Classification on a Single-Node Single-GPU

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [SageMaker environment](#SageMaker-environment)
3. [Preprocessing](#Preprocessing)
    1. [Tokenization](#Tokenization)
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)
4. [SageMaker Training Job](#SageMaker-Training-Job)
    1. [Training with Native TensorFlow](#Training-with-Native-TensorFlow)  
    2. [Training with Optimized TensorFlow](#Training-with-Optimized-TensorFlow)  
    3. [Analysis](#Analysis)
5. [Clean Up](#Clean-Up)


## SageMaker Training Compiler Overview

SageMaker Training Compiler is a capability of SageMaker that applies hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes deep learning (DL) models to accelerate training by using SageMaker machine learning (ML) GPU instances more efficiently. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training.

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler-enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to speed up your training job on SageMaker ML instances for accelerated computing.

For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.

## Introduction

In this demo, you'll use the Vision Transformer (ViT) implementation from the TensorFlow Model Garden (`tensorflow/models`) with Amazon SageMaker Training Compiler to train an image classification model. The training job below runs the `vit_imagenet_pretrain` experiment configuration on TFRecord files of the Caltech-256 dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on.

**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, `Python 3 (PyTorch x.y Python 3.x CPU Optimized)` or `conda_pytorch_p36` respectively.

**NOTE:** This notebook uses two `ml.p3.2xlarge` instances, each with a single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

In [None]:
from sagemaker.tensorflow import TensorFlow
from sagemaker.training_compiler.config import TrainingCompilerConfig

import boto3

HOPPER_IMAGE_URI = '669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-tensorflow-training:2.9.0-gpu-py39-cu112-ubuntu20.04-sagemaker-pr-1839-2022-05-17-00-38-02'
epochs = 1
batch = 56
train_steps = int(30000 * epochs / batch)  # training steps for `epochs` passes over 30000 examples
steps_per_loop = train_steps // 10  # also reused below as the summary interval
overrides = (
    "runtime.enable_xla=False,"
    "runtime.num_gpus=1,"
    "runtime.distribution_strategy=one_device,"
    "runtime.mixed_precision_dtype=float16,"
    f"task.train_data.global_batch_size={batch},"
    "task.train_data.input_path=/opt/ml/input/data/training/caltech*,"
    "task.train_data.cache=False,"
    f"trainer.train_steps={train_steps},"
    f"trainer.steps_per_loop={steps_per_loop},"
    f"trainer.summary_interval={steps_per_loop},"
    f"trainer.checkpoint_interval={train_steps},"
    "task.model.backbone.type=vit,"
)
estimator = TensorFlow(
    git_config={
        'repo': 'https://github.com/tensorflow/models.git',
        'branch': 'v2.9.2',
    },
    source_dir='.',
    entry_point='official/projects/vit/train.py',
    model_dir=False,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    image_uri=HOPPER_IMAGE_URI,
    hyperparameters={
        TrainingCompilerConfig.HP_ENABLE_COMPILER: False,  # native run: compiler disabled
        'experiment': 'vit_imagenet_pretrain',
        'mode': 'train',
        'model_dir': '/opt/ml/model',
        'params_override': overrides,
    },
    debugger_hook_config=None,
    disable_profiler=True,
    max_run=60 * 60 * 12,  # 12 hours
    base_job_name='native-tf29-vit',
    role=boto3.client('iam').get_role(RoleName='SageMaker-Execution-Role-For-PyTest')['Role']['Arn'],
)
estimator.fit(inputs='s3://collection-of-ml-datasets/Caltech-256-tfrecords')


Cloning into '/var/folders/5r/j40pqpnd4lv66lxzrjmygjzr0000gs/T/tmpvf3uunxo'...
Note: switching to 'v2.9.2'.

HEAD is now at 675d26469 Make preprocess_ops visible from tensorflow_models import.
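
The step counts and the `params_override` string in the cell above follow from a few lines of arithmetic and string joining. The sketch below reproduces that logic in isolation so the numbers are easy to check. The helper names `training_schedule` and `format_overrides` are hypothetical and are not part of the SageMaker SDK or the TensorFlow Model Garden. Only a subset of the override keys is shown.

```python
def training_schedule(num_examples, epochs, batch_size, loops=10):
    """Return (train_steps, steps_per_loop) the same way the cell above does."""
    # int() truncates, so a partial final batch is not counted as a step.
    train_steps = int(num_examples * epochs / batch_size)
    steps_per_loop = train_steps // loops
    return train_steps, steps_per_loop


def format_overrides(settings):
    """Join a dict into a comma-separated `key=value` string (hypothetical helper)."""
    return ",".join(f"{key}={value}" for key, value in settings.items())


train_steps, steps_per_loop = training_schedule(num_examples=30000, epochs=1, batch_size=56)

overrides = format_overrides({
    "runtime.enable_xla": False,
    "runtime.num_gpus": 1,
    "task.train_data.global_batch_size": 56,
    "trainer.train_steps": train_steps,
    "trainer.steps_per_loop": steps_per_loop,
    "trainer.summary_interval": steps_per_loop,
    "trainer.checkpoint_interval": train_steps,
})

print(train_steps, steps_per_loop)  # 535 53
print(overrides)
```

With `batch = 56` and one epoch, this gives 535 training steps and 53 steps per loop, so summaries are written every 53 steps and a single checkpoint is written at the end. Unlike the hand-written string in the cell above, the joined string has no trailing comma.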
