## Fine-tuning a Billion-Scale Generative AI Model Using the Hugging Face PEFT Library on SageMaker

 “Generative AI refers to artificial intelligence that can generate novel content, rather than simply analyzing or acting on existing data.” by Brandon Kaplan
 
This is in contrast with discriminative models which classify data. The ability of generative models to generate content makes them extremely useful for applications like text generation, image generation, and speech generation.

While generative models have many advantages, they are also complex and large. A generative model usually has a large number of parameters, which describe the model's internal dynamics as well as the features of the data the model generates.

The conventional paradigm is large-scale pretraining on generic data, followed by fine-tuning on downstream tasks. Fine-tuning pretrained large language models (LLMs) on downstream datasets yields large performance gains compared to using the pretrained LLMs out of the box (zero-shot inference, for example). However, training these models, even on fine-tuning datasets that are relatively small, requires a lot of compute, because the model may not fit in the memory of a single GPU together with the batch of data it is trained on. Additionally, storing and deploying fine-tuned models is expensive, because each one is the same size as the original model.

To overcome this challenge and reduce cost, to the point where consumer hardware can be used for fine-tuning, parameter-efficient fine-tuning (PEFT) approaches are used.

In this notebook we cover how to fine-tune large language models using the recent `peft` library, with `bitsandbytes` for loading large models in 8-bit.
The fine-tuning relies on a recent method called low-rank adaptation (LoRA): instead of fine-tuning the entire model, you fine-tune only a small set of adapter weights and load them into the model.
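
To make the saving concrete, here is a small sketch of the parameter arithmetic behind LoRA. It assumes the usual formulation, in which a frozen weight matrix of shape `d_out x d_in` is augmented with a trainable product of two low-rank matrices of rank `r`. The 4096 x 4096 layer shape and `r = 16` are illustrative values, not settings read from this notebook's training script.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    The adapter consists of A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix itself."""
    return d_in * d_out


# Illustrative (assumed) layer shape and rank.
d_in, d_out, r = 4096, 4096, 16

adapter = lora_param_count(d_in, d_out, r)
dense = dense_param_count(d_in, d_out)

print(f"adapter params: {adapter:,}")
print(f"dense params:   {dense:,}")
print(f"adapter / dense: {adapter / dense:.2%}")
```

Because the adapter size grows with `r * (d_in + d_out)` rather than `d_in * d_out`, a small rank keeps the trainable fraction of each adapted layer well under one percent in this example.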

References:
- https://github.com/huggingface/peft/tree/main/examples/int8_training
- https://huggingface.co/blog/peft

## Set up

In [3]:
!pip install sagemaker==2.123.0

Collecting sagemaker==2.123.0
  Using cached sagemaker-2.123.0-py2.py3-none-any.whl
Collecting schema
  Using cached schema-0.7.5-py2.py3-none-any.whl (17 kB)
Collecting boto3<2.0,>=1.26.28
  Downloading boto3-1.26.80-py3-none-any.whl (132 kB)
Collecting s3transfer<0.7.0,>=0.6.0
  Using cached s3transfer-0.6.0-py3-none-any.whl (79 kB)
Collecting botocore<1.30.0,>=1.29.80
  Downloading botocore-1.29.80-py3-none-any.whl (10.5 MB)
Collecting contextlib2>=0.5.5
  Using cached contextlib2-21.6.0-py2.py3-none-any.whl (13 kB)
Installing collected packages: contextlib2, schema, botocore, s3transfer, boto3, sagemaker
  Attempting uninstall: botocore
    Found existing installation: botocore 1.24.13
    Uninstalling 

In [2]:
import sagemaker
sagemaker.__version__

'2.123.0'

In [3]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
role_name = role.split("/")[-1]  # the role name is the last path segment of the ARN
print(f"The Amazon Resource Name (ARN) of the role used for this demo is: {role}")
print(f"The name of the role used for this demo is: {role_name}")

The Amazon Resource Name (ARN) of the role used for this demo is: arn:aws:iam::706553727873:role/SagemakerEMRNoAuthProductWi-SageMakerExecutionRole-I48AJ9D41LXR
The name of the role used for this demo is: SagemakerEMRNoAuthProductWi-SageMakerExecutionRole-I48AJ9D41LXR


## Fine-tuning the OPT-6.7B model from the Hugging Face Hub, which is approximately 13GB in size
The following training job fails with an out-of-memory (OOM) error on an `ml.g5.2xlarge` instance, which has only one GPU with 24GB of GPU memory.
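
A rough back-of-the-envelope estimate shows why this job runs out of memory. The sketch below assumes 6.7 billion parameters, 2 bytes per parameter for float16 weights, and gradients held at the same precision. It ignores optimizer state and activations, which only add to the total, so treat the result as a lower bound rather than a measurement.

```python
GB = 1e9  # decimal gigabytes, matching the "13GB" figure quoted above


def tensor_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold n_params values at the given precision."""
    return n_params * bytes_per_param / GB


n_params = 6.7e9      # assumed parameter count for OPT-6.7B
gpu_memory_gb = 24    # GPU memory on ml.g5.2xlarge, as stated above

fp16_weights_gb = tensor_memory_gb(n_params, 2)  # model weights in float16
fp16_grads_gb = tensor_memory_gb(n_params, 2)    # one gradient per weight (assumption)

lower_bound_gb = fp16_weights_gb + fp16_grads_gb
print(f"weights:     {fp16_weights_gb:.1f} GB")
print(f"gradients:   {fp16_grads_gb:.1f} GB")
print(f"lower bound: {lower_bound_gb:.1f} GB vs {gpu_memory_gb} GB available")
print("fits on the GPU:", lower_bound_gb <= gpu_memory_gb)
```

Weights and gradients alone already exceed the 24GB available, before any optimizer state or activations are counted.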

In [10]:
# fine tuning model with 
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    base_job_name="hf-peft-optj6",
    source_dir="code",
    entry_point="train-fine-tune.py",
    role=role,
    transformers_version='4.17',
    pytorch_version='1.10',
    py_version='py38',
    instance_count=1,
    instance_type="ml.g5.2xlarge", # relatively smaller GPU
    sagemaker_session=sagemaker_session,
    debugger_hook_config=False,
)

In [12]:
estimator.fit()

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: hf-peft-optj6-2023-02-27-19-46-06-030


2023-02-27 19:46:06 Starting - Starting the training job...
2023-02-27 19:46:30 Starting - Preparing the instances for trainingProfilerReport-1677527166: InProgress
......
2023-02-27 19:47:35 Downloading - Downloading input data
2023-02-27 19:47:35 Training - Downloading the training image.....................
2023-02-27 19:51:07 Training - Training image download completed. Training in progress.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-27 19:51:40,743 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-27 19:51:40,763 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-27 19:51:40,765 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-02-27 19:51:40,947 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:

UnexpectedStatusException: Error for Training job hf-peft-optj6-2023-02-27-19-46-06-030: Failed. Reason: ClientError: Please use an instance type with more memory, or reduce the size of training data processed on an instance.

## Fine-tuning the 6.7B-parameter OPT model using PEFT
PEFT approaches achieve performance comparable to full fine-tuning while training only a small number of parameters.

In [4]:
!pygmentize code/train-peft.py

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install("git+https://github.com/huggingface/transformers.git@main")
install("git+https://github.com/huggingface/peft.git")

import

## Understanding the training script

### Step 1 - Load the model
The screenshot below shows the model being loaded in 8-bit, which requires around 7GB of memory, compared with around 13GB when the model is loaded in half precision (float16).

<img src="images/load_model.png"  width="500" height="300">
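
Since the screenshot is not reproduced as text, here is a minimal sketch of what 8-bit loading typically looks like with `transformers` and `bitsandbytes`. The Hub ID `facebook/opt-6.7b`, the helper names, and the keyword arguments are assumptions modeled on the int8 training example referenced earlier, not a copy of `train-peft.py`. Recent `transformers` releases prefer passing a `BitsAndBytesConfig` over the `load_in_8bit=True` shortcut, so check the version you have installed.

```python
MODEL_ID = "facebook/opt-6.7b"  # assumed Hub ID for the OPT-6.7B checkpoint


def int8_load_kwargs() -> dict:
    """Keyword arguments that request 8-bit weights via bitsandbytes (assumed settings)."""
    return {
        "load_in_8bit": True,   # quantize linear layers to int8 at load time
        "device_map": "auto",   # let accelerate place the layers on the available GPU
    }


def load_model_and_tokenizer(model_id: str = MODEL_ID):
    """Load the causal LM in 8-bit together with its tokenizer.

    Needs a CUDA GPU plus the transformers, accelerate, and bitsandbytes packages.
    The import is deferred so the settings above can be inspected without them.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_id, **int8_load_kwargs())
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling `load_model_and_tokenizer()` downloads the full checkpoint, so it is best run inside the training job rather than in the notebook kernel.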

### Step 2 - Prepare model for training

Some preprocessing is needed before training an int8 model with `peft`, so let's import a utility function, `prepare_model_for_int8_training`, that will:
- Cast the layer norm to `float32` for stability
- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states
- Enable gradient checkpointing for more memory-efficient training
- Cast the output logits to `float32` for smoother sampling during the sampling procedure

`model = prepare_model_for_int8_training(model)`


### Step 3 - Apply LoRA

Let's create a `PeftModel` and specify that we are going to use low-rank adapters (LoRA), using the `get_peft_model` utility function from `peft`.
<img src="images/prepare_model_apply_lora.png"  width="800" height="400">
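
As a text counterpart to the screenshot, the following sketch shows one way to attach LoRA adapters with `peft`. The hyperparameter values and the `q_proj`/`v_proj` target modules are assumptions that follow the int8 training example referenced earlier; the settings actually used in `train-peft.py` may differ. `LORA_SETTINGS` and `apply_lora` are hypothetical names introduced here for illustration.

```python
# Assumed LoRA hyperparameters; adjust to match your own training script.
LORA_SETTINGS = {
    "r": 16,                                 # rank of the low-rank update
    "lora_alpha": 32,                        # scaling factor applied to the update
    "target_modules": ["q_proj", "v_proj"],  # attention projections that get adapters
    "lora_dropout": 0.05,
    "bias": "none",                          # leave bias terms frozen
    "task_type": "CAUSAL_LM",
}


def apply_lora(model):
    """Wrap an already loaded causal LM with LoRA adapters.

    Needs the peft package; the import is deferred so the settings above
    can be inspected without it.
    """
    from peft import LoraConfig, get_peft_model

    peft_model = get_peft_model(model, LoraConfig(**LORA_SETTINGS))
    # Prints the trainable vs. total parameter counts to the training log.
    peft_model.print_trainable_parameters()
    return peft_model
```

Only the adapter weights are marked trainable after this call, while the base model weights stay frozen.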

In [9]:
# define metrics
metric_definitions = [{"Name": "loss", "Regex": "'loss': ([0-9\\.]+)"}]
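
SageMaker applies each `Regex` in `metric_definitions` to the training job's log output and publishes the captured group as a metric. As a quick local check, the sketch below runs the same pattern against a log line copied from this notebook's training output. The variable names are illustrative.

```python
import re

# Same pattern as in metric_definitions above.
LOSS_REGEX = "'loss': ([0-9\\.]+)"

# A log line in the format emitted by the training job.
sample_line = "{'loss': 2.0314, 'learning_rate': 7.2e-05, 'epoch': 1.05}"

match = re.search(LOSS_REGEX, sample_line)
loss_value = float(match.group(1)) if match else None
print(loss_value)
```

The character class only covers digits and dots, which is enough for the loss values here but would truncate a value logged in scientific notation.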

In [10]:
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    base_job_name="hf-peft-optj6",
    source_dir="code",
    entry_point="train-peft.py",
    role=role,
    transformers_version='4.17',
    pytorch_version='1.10',
    py_version='py38',
    instance_count=1,
    # Train on an ml.g5.2xlarge instance, which has 1 GPU and 24GB of GPU memory
    instance_type="ml.g5.2xlarge",
    sagemaker_session=sagemaker_session,
    debugger_hook_config=False,
    metric_definitions=metric_definitions,
    keep_alive_period_in_seconds=15*60,
)

In [None]:
estimator.fit()

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: hf-peft-optj6-2023-02-27-23-15-23-077


2023-02-27 23:15:23 Starting - Starting the training job...
2023-02-27 23:15:47 Starting - Preparing the instances for trainingProfilerReport-1677539723: InProgress
......
2023-02-27 23:16:52 Downloading - Downloading input data
2023-02-27 23:16:52 Training - Downloading the training image........
82%|████████▏ | 164/200 [11:23<02:24,  4.00s/it]
{'loss': 2.0314, 'learning_rate': 7.2e-05, 'epoch': 1.05}
82%|████████▏ | 164/200 [11:23<02:24,  4.00s/it]
82%|████████▎ | 165/200 [11:27<02:15,  3.86s/it]
{'loss': 1.8068, 'learning_rate': 7e-05, 'epoch': 1.06}
82%|████████▎ | 165/200 [11:27<02:15,  3.86s/it]
83%|████████▎ | 166/200 [11:30<02:07,  3.74s/it]
{'loss': 1.7299, 'learning_rate': 6.800000000000001e-05, 'epoch': 1.06}
83%|████████▎ | 166/200 [11:30<02:07,  3.74s/it]
84%|████████▎ | 167/200 [11:36<02:24,  4.38s/it]
{'loss': 1.7871, 'learning_rate': 6.6e-05, 'epoch': 1.07}
84%|███████

## Check logs for trainable parameters
The logs should show numbers similar to the following:

`trainable params: 7340032 || all params: 6058222816 || trainable%: 0.12115817167725645`

indicating that we are fine-tuning only about 0.12% of the parameters.
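
The percentage in that log line can be reproduced directly from the two counts. This is plain arithmetic on the numbers printed above, with illustrative variable names.

```python
# Counts copied from the log line above.
trainable_params = 7_340_032
all_params = 6_058_222_816

trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")
```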

In [None]:
## Prepare model for inference
