# How to Fine-Tune Phi-2 on Amazon SageMaker

This notebook walks you through how to fine-tune Microsoft Phi-2 from Hugging Face using Amazon SageMaker.

## Model Description

Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source consisting of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showcased nearly state-of-the-art performance among models with less than 13 billion parameters.

If you are looking for a lightweight large language model with generalized capability, Phi-2 can be a good choice.

## Fine-tuning task
Here we fine-tune the Phi-2 model on summarization samples from the Dolly dataset on the Hugging Face Hub. The goal is to improve the model's overall summarization capability.

## 1. Set Up Development Environment

Our first step is to install the Hugging Face libraries we need on the client to prepare our dataset and start our training and evaluation jobs.

In [38]:
!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" --upgrade --quiet

## 2. Import and prepare the dataset

In [37]:
import boto3
import sagemaker

sagemaker_session_bucket = None
sess = sagemaker.Session()

if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker role arn: arn:aws:iam::275631959608:role/service-role/AmazonSageMaker-ExecutionRole-20231011T123445
sagemaker bucket: sagemaker-us-east-1-275631959608
sagemaker session region: us-east-1



We are going to use [`trl`](https://huggingface.co/docs/trl/en/index) for fine-tuning, which supports popular instruction and conversation dataset formats. This means we only need to convert our dataset to one of the supported formats and `trl` will take care of the rest. Those formats include:

- conversational format

```json
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

- instruction format

```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
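
To make the two layouts concrete, here is a minimal sketch that serializes one record into each format. The sample record and the helper names `to_conversational` and `to_instruction` are hypothetical and only for illustration; the field names follow the Dolly columns used later in this notebook.

```python
import json

# Hypothetical Dolly-style record; the field names match the columns used
# later in this notebook ("instruction", "context", "response").
sample = {
    "instruction": "Summarize the passage in one sentence.",
    "context": "Sim racing software attempts to simulate auto racing accurately.",
    "response": "Sim racing is software that realistically simulates auto racing.",
}

def to_conversational(record):
    """Map one record to the conversational ("messages") format."""
    return {
        "messages": [
            {"role": "system", "content": record["instruction"]},
            {"role": "user", "content": record["context"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }

def to_instruction(record):
    """Map one record to the instruction ("prompt"/"completion") format."""
    prompt = f"{record['instruction']}\n\n{record['context']}"
    return {"prompt": prompt, "completion": record["response"]}

# One JSON object per line gives the JSON Lines layout shown above.
conversational_line = json.dumps(to_conversational(sample))
instruction_line = json.dumps(to_instruction(sample))
print(conversational_line)
print(instruction_line)
```

Either layout carries the same information; the conversational one keeps the roles explicit, which is why it is the one used in the rest of this notebook.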

In our example we load the open-source `dolly` dataset using the 🤗 Datasets library and convert it into the conversational format, where the instruction of each sample is placed in the system message (in the slot the prompt template calls `SCHEMA`). We then save the dataset as a JSON Lines file, which we can use to fine-tune our model.

In [23]:
from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are a text summarizer. Users will provide you a text in English and you will generate a summary based on the provided SCHEMA.
SCHEMA:
{schema}"""

def create_conversation(sample):
  return {
    "messages": [
      {"role": "system", "content": system_message.format(schema=sample["instruction"])},
      {"role": "user", "content": sample["context"]},
      {"role": "assistant", "content": sample["response"]}
    ]
  }

# Load dataset from the hub
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Filter the dataset to include only summarization examples
summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == "summarization")
summarization_dataset = summarization_dataset.remove_columns("category")

# Convert dataset to the conversational format
summarization_dataset = summarization_dataset.map(create_conversation, batched=False)
# hold out 20% of the samples as a test set (test_size = 2500/12500 = 0.2)
dataset = summarization_dataset.train_test_split(test_size=2500/12500)

print(dataset["train"][0]["messages"])


[{'content': 'You are a text summarizer. Users will provide you a text in English and you will generate a summary based on the provided SCHEMA.\nSCHEMA:\nWhat is Sim racing', 'role': 'system'}, {'content': "Simulated racing or racing simulation, commonly known as simply sim racing, are the collective terms for racing game software that attempts to accurately simulate auto racing, complete with real-world variables such as fuel usage, damage, tire wear and grip, and suspension settings. To be competitive in sim racing, a driver must understand all aspects of car handling that make real-world racing so difficult, such as threshold braking, how to maintain control of a car as the tires lose traction, and how properly to enter and exit a turn without sacrificing speed. It is this level of difficulty that distinguishes sim racing from arcade racing-style driving games where real-world variables are taken out of the equation and the principal objective is to create a sense of speed as oppose

After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [24]:
# Save training dataset to S3 using SageMaker session

training_input_path = f's3://{sess.default_bucket()}/datasets/summarization'

# save datasets to s3
dataset["train"].to_json(f"{training_input_path}/train_dataset.json", orient="records")
dataset["test"].to_json(f"{training_input_path}/test_dataset.json", orient="records")

print(f"Training data uploaded to:")
print(f"{training_input_path}/train_dataset.json")
print(f"https://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={training_input_path.split('/', 3)[-1]}/")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Training data uploaded to:
s3://sagemaker-us-east-1-275631959608/datasets/summarization/train_dataset.json
https://s3.console.aws.amazon.com/s3/buckets/sagemaker-us-east-1-275631959608/?region=us-east-1&prefix=datasets/summarization/
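
A quick structural check on the records can catch formatting problems before a training job is launched. The sketch below is one possible check, not part of the original notebook; `validate_record` and `REQUIRED_ROLES` are hypothetical names, and the rule it enforces (exactly one system, user, and assistant message, in that order) simply mirrors what `create_conversation` produces.

```python
REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_record(record):
    """Return a list of problems found in one conversational record."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    roles = [message.get("role") for message in messages]
    if roles != REQUIRED_ROLES:
        problems.append(f"unexpected role order: {roles}")
    for index, message in enumerate(messages):
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty content")
    return problems

# Hypothetical records for illustration.
good = {"messages": [
    {"role": "system", "content": "You are a text summarizer."},
    {"role": "user", "content": "Some long text."},
    {"role": "assistant", "content": "A short summary."},
]}
bad = {"messages": [
    {"role": "user", "content": "Some long text."},
    {"role": "assistant", "content": ""},
]}

print(validate_record(good))
print(validate_record(bad))
```

In the notebook this could be applied to each row of `dataset["train"]` and `dataset["test"]`, for example by collecting the rows for which the returned list is not empty.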


## 3. Fine-Tune Phi-2 with QLoRA on Amazon SageMaker

We are now ready to fine-tune our model. We will use the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` to fine-tune our model. The `SFTTrainer` makes it straightforward to run supervised fine-tuning of open LLMs.

We will use the dataset formatting, packing, and PEFT features in our example. As the PEFT method we will use [QLoRA](https://arxiv.org/abs/2305.14314), a technique that uses quantization to reduce the memory footprint of large language models during fine-tuning without sacrificing performance.

In addition to QLoRA, we will leverage the [Flash Attention 2 integration with Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flash-attention-2) to speed up training.
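
To get a feel for why quantization and low-rank adapters reduce memory use, the following back-of-the-envelope sketch compares weight storage at 16 and 4 bits and counts the trainable parameters of one LoRA adapter. It is an estimate only: it ignores quantization constants, activations, optimizer state, and CUDA overhead. The layer size (2560 x 2560) and the rank (64) are assumptions chosen for illustration, not values taken from the training script.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to store the weights, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

def lora_trainable_params(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)

PHI2_PARAMS = 2.7e9  # parameter count stated in the model description

bf16_gb = weight_memory_gb(PHI2_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(PHI2_PARAMS, 4)    # 4-bit quantized weights

# Assumed layer shape and rank, for illustration only.
full_layer = 2560 * 2560
adapter = lora_trainable_params(2560, 2560, rank=64)

print(f"bf16 weights:  {bf16_gb:.2f} GB")
print(f"4-bit weights: {nf4_gb:.2f} GB")
print(f"adapter / full layer: {adapter / full_layer:.1%}")
```

The base weights stay frozen in 4-bit form while only the small adapter matrices receive gradients, which is where most of the savings come from.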


In [25]:
# hyperparameters, which are passed into the training job
hyperparameters = {
  ### SCRIPT PARAMETERS ###
  'dataset_path': '/opt/ml/input/data/training/train_dataset.json', # path where sagemaker will save training dataset
  'model_id': "microsoft/phi-2",           # or `mistralai/Mistral-7B-v0.1`
  'max_seq_len': 3072,                               # max sequence length for model and packing of the dataset
  'use_qlora': True,                                 # use QLoRA model
  ### TRAINING PARAMETERS ###
  'num_train_epochs': 3,                             # number of training epochs
  'per_device_train_batch_size': 1,                  # batch size per device during training
  'gradient_accumulation_steps': 4,                  # number of micro-batches accumulated before each optimizer update
  'gradient_checkpointing': True,                    # use gradient checkpointing to save memory
  'optim': "adamw_torch_fused",                      # use fused adamw optimizer
  'logging_steps': 10,                               # log every 10 steps
  'save_strategy': "epoch",                          # save checkpoint every epoch
  'learning_rate': 2e-4,                             # learning rate, based on QLoRA paper
  'bf16': True,                                      # use bfloat16 precision
  'tf32': True,                                      # use tf32 precision
  'max_grad_norm': 0.3,                              # max gradient norm based on QLoRA paper
  'warmup_ratio': 0.03,                              # warmup ratio based on QLoRA paper
  'lr_scheduler_type': "constant",                   # use constant learning rate scheduler
  'report_to': "tensorboard",                        # report metrics to tensorboard
  'output_dir': '/tmp/tun',                          # Temporary output directory for model checkpoints
  'merge_adapters': True,                            # merge LoRA adapters into model for easier deployment
}
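
The batch-related values above combine into an effective batch size per optimizer update. The sketch below works through that arithmetic under the assumption of a single GPU (the `ml.g5.4xlarge` instance used in this notebook has one); `effective_batch_size` is a hypothetical helper, and the token count is an upper bound that is only reached when packing fills every sequence to `max_seq_len`.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of sequences that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

# Values taken from the hyperparameters above; one GPU is assumed.
sequences_per_update = effective_batch_size(
    per_device_batch=1, grad_accum_steps=4, num_devices=1
)
max_tokens_per_update = sequences_per_update * 3072  # max_seq_len

print(f"sequences per update: {sequences_per_update}")
print(f"tokens per update (upper bound): {max_tokens_per_update}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing peak GPU memory, at the cost of fewer optimizer updates per epoch.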

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us. Amazon SageMaker starts and manages all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. It then starts the training job by running the script given as `entry_point`.

> Note: Make sure that you include the `requirements.txt` in the `source_dir` if you are using a custom training script. We recommend cloning the whole repository.
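
The link between the S3 input and the `dataset_path` hyperparameter is the channel name: each key of the dictionary passed to `.fit()` becomes a directory under `/opt/ml/input/data`. The sketch below shows how a training script could resolve that directory. It assumes the default file input mode and relies on the `SM_CHANNEL_<NAME>` environment variable that the SageMaker training toolkit sets; `channel_dir` is a hypothetical helper, not a function from the training script used here.

```python
import os
from pathlib import PurePosixPath

def channel_dir(channel_name):
    """Directory where SageMaker places the files of one input channel."""
    env_var = f"SM_CHANNEL_{channel_name.upper()}"
    default = PurePosixPath("/opt/ml/input/data") / channel_name
    # Inside a training container the environment variable is set;
    # outside of one we fall back to the documented default location.
    return PurePosixPath(os.environ.get(env_var, str(default)))

train_file = channel_dir("training") / "train_dataset.json"
print(train_file)
```

With a channel named `training`, this resolves to the same path that is passed as `dataset_path` in the hyperparameters.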

In [26]:
from sagemaker.huggingface import HuggingFace

# define Training Job Name 
job_name = f'phi2-7b-hf-text-to-sql-exp1'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_sft.py',    # train script
    source_dir           = '../scripts/trl',      # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # prefix for the training job name (SageMaker appends a timestamp)
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.36',            # the transformers version used in the training job
    pytorch_version      = '2.1',             # the pytorch_version version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    disable_output_compression = True,        # do not compress output, to save training time and cost
    environment          = {
                            "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", # set env variable to cache models in /tmp
                            # "HF_TOKEN": "REPLACE_WITH_YOUR_TOKEN" # huggingface token to access gated models, e.g. llama 2
                            }, 
)

> You can also use the `g5.2xlarge` instance type instead of `g5.4xlarge`, but then you cannot use the `merge_adapters` hyperparameter, because merging the LoRA weights into the model weights requires the model to fit into memory. Instead, you can save the adapter weights and merge them after training using [merge_adapter_weights.py](../scripts/merge_adapter_weights.py).

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [28]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: phi2-7b-hf-text-to-sql-exp1-2024-03-25-20-42-18-255


2024-03-25 20:42:19 Starting - Starting the training job
2024-03-25 20:42:19 Pending - Training job waiting for capacity...
2024-03-25 20:42:45 Pending - Preparing the instances for training......
2024-03-25 20:43:36 Downloading - Downloading the training image...............
2024-03-25 20:45:51 Training - Training image download completed. Training in progress.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-03-25 20:46:47,413 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-03-25 20:46:47,431 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-03-25 20:46:47,441 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-03-25 20:46:47,442 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-03-25 20:46:48

In our example for Phi-2, the SageMaker training job took `1497 seconds`, which is about `0.416 hours`. The ml.g5.4xlarge instance we used costs `$2.03 per hour` for on-demand usage. As a result, the total cost for training our Phi-2 model was about `$0.84` (0.416 h × $2.03/h), which is under `$1`.
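
The cost estimate can be reproduced with a few lines of arithmetic. The helper name `training_cost_usd` is hypothetical, and the hourly rate is the on-demand price quoted above, which varies by region and over time.

```python
def training_cost_usd(billable_seconds, hourly_rate_usd):
    """Cost of a job billed per second at a fixed hourly rate."""
    return billable_seconds / 3600 * hourly_rate_usd

billable_seconds = 1497   # training time reported for this run
hourly_rate_usd = 2.03    # ml.g5.4xlarge on-demand price quoted above

hours = billable_seconds / 3600
cost_usd = training_cost_usd(billable_seconds, hourly_rate_usd)
print(f"{hours:.3f} hours, ${cost_usd:.2f}")
```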

In [29]:
huggingface_estimator.model_data["S3DataSource"]["S3Uri"].replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")

'https://s3.console.aws.amazon.com/s3/buckets/sagemaker-us-east-1-275631959608/phi2-7b-hf-text-to-sql-exp1-2024-03-25-20-42-18-255/output/model/'

## 4. Deploy & evaluate LLM on Amazon SageMaker and compare with the base model

In [30]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.4.0",
  session=sess,
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

INFO:sagemaker.image_uris:Defaulting to only available Python version: py310
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.


llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.0-gpu-py310-cu121-ubuntu20.04


We can now create a `HuggingFaceModel` using the container URI and the S3 path to our model. We also need to set our Text Generation Inference (TGI) configuration, including the number of GPUs and the maximum number of input tokens. You can find a full list of configuration options [here](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher).

In [31]:
import json
from sagemaker.huggingface import HuggingFaceModel

# S3 path of the fine-tuned model artifacts
# to deploy the model at a later time, set this S3 path manually
model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
fine_tuned_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data={'S3DataSource':{'S3Uri': model_s3_path,'S3DataType': 'S3Prefix','CompressionType': 'None'}},
  env=config
)
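
Container environment variables are strings, which is why the numeric settings above are wrapped in `json.dumps`. The sketch below builds the same configuration through a small helper that also checks that the input limit leaves room for generated tokens, since TGI expects the input length to be smaller than the total token limit. `build_tgi_env` is a hypothetical helper, not part of the SageMaker SDK.

```python
import json

def build_tgi_env(num_gpus, max_input_length, max_total_tokens):
    """Build the TGI environment dict with string-encoded values."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "HF_MODEL_ID": "/opt/ml/model",
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }

env = build_tgi_env(num_gpus=1, max_input_length=1024, max_total_tokens=2048)

# Tokens left for generation when the input uses its full allowance.
generation_budget = int(env["MAX_TOTAL_TOKENS"]) - int(env["MAX_INPUT_LENGTH"])
print(env)
print(f"generation budget: {generation_budget} tokens")
```

With the values used in this notebook, a prompt that fills the full 1024 input tokens still leaves 1024 tokens for the generated summary.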

In [33]:
# We will also deploy the base Phi-2 model from Huggingface to compare the summarization performance

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'microsoft/phi-2',
	'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_base_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="1.4.2"),
	env=hub,
	role=role, 
) 


INFO:sagemaker.image_uris:Defaulting to only available Python version: py310
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.


After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method.

In [35]:
# Deploy tuned model to an endpoint
tuned_llm = fine_tuned_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 5 minutes (health_check_timeout = 300 s) to give SageMaker time to download the model
)


# Deploy base model to an endpoint
base_llm = huggingface_base_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 5 minutes (health_check_timeout = 300 s) to give SageMaker time to download the model
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2024-03-25-21-20-07-570
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2024-03-25-21-20-08-384
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2024-03-25-21-20-08-384


--------!

INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2024-03-25-21-24-40-613
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2024-03-25-21-24-41-300
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2024-03-25-21-24-41-300


----------!

In [36]:
inputs = "The long-awaited Eglinton Crosstown LRT continues without a completion date as the head of provincial transit agency Metrolinx warns the software “nerve centre” designed to run trains along the route is struggling with defects. Speaking on Monday, Metrolinx CEO Phil Verster said the line was moving forward but warned that software needed to run trains along the route continued to be a problem. The software is so problematic that, come June, it will already be on its seventh iteration, Verster said. All of the major construction is now complete,” Verster said, suggesting only minor tweaks like water leaks or broken tiles remain on the construction to-do list. What concerns me most, though, is the software defects in the signalling and train control system and the rectification of those defects by CTS and Alstom,” he said, referring to the contractors working on the line. They’re making good progress with it, but it’s not as fast as we would like it to be."

# send request
tuned_response = tuned_llm.predict({
	"inputs": inputs,
})

# send request
base_response = base_llm.predict({
	"inputs": inputs,
})
print (f"tuned_response: {tuned_response}")
print (f"base_response: {base_response}")


tuned_response: [{'generated_text': ' The chief engineers on the project are very focused on trying to resolve some of those software defects. ”\nThe terms of reference for Surrey Council’s audit of recycling centre have been adopted and audit kick-offs are scheduled to take place in May. The audit commenced in August 2017 and its findings will be reported to Council at a special meeting scheduled on Monday, September 20, 2018. All findings will be reported to Council in three phases as follows: • Phase A: the findings from the'}]
base_response: [{'generated_text': 'The long-awaited Eglinton Crosstown LRT continues without a completion date as the head of provincial transit agency Metrolinx warns the software “nerve centre” designed to run trains along the route is struggling with defects. Speaking on Monday, Metrolinx CEO Phil Verster said the line was moving forward but warned that software needed to run trains along the route continued to be a problem. The software is so problematic

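Note that the output from the tuned model (fine-tuned with summarization data) is more concise than the base model's output, which starts by repeating the input text. A single prompt gives only a rough impression; for a more reliable comparison, evaluate both endpoints on the held-out test split.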
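
The requests above send the raw article text without an instruction or generation parameters. As a minimal sketch, a more explicit payload could look like the following. The prompt layout and the default instruction are assumptions for illustration (the template applied during training is defined by the training script, which is not shown here), and `build_payload` is a hypothetical helper; `max_new_tokens`, `do_sample`, and `return_full_text` are TGI generation parameters.

```python
def build_payload(text, instruction="Summarize the following text.", max_new_tokens=128):
    """Build a TGI request body with an explicit instruction and parameters."""
    # Assumed plain-text prompt layout; adapt it to the template used in training.
    prompt = f"{instruction}\n\n{text}\n\nSummary:"
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # cap on generated tokens
            "do_sample": False,                # greedy decoding for repeatable output
            "return_full_text": False,         # return only the generated continuation
        },
    }

payload = build_payload("Metrolinx says software defects are delaying the Eglinton Crosstown LRT.")
print(payload["inputs"])
print(payload["parameters"])
```

The same payload can then be sent to both endpoints, for example `tuned_llm.predict(payload)` and `base_llm.predict(payload)`, so that the two models are compared under identical settings.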

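## 5. Delete the endpoints

In [29]:
# delete both endpoints and their models to stop incurring charges
tuned_llm.delete_model()
tuned_llm.delete_endpoint()
base_llm.delete_model()
base_llm.delete_endpoint()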

INFO:sagemaker:Deleting model with name: huggingface-pytorch-tgi-inference-2024-03-08-10-50-17-803
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-tgi-inference-2024-03-08-10-50-18-669
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-tgi-inference-2024-03-08-10-50-18-669
