# Efficient Large Language Model training with LoRA and Hugging Face

In this SageMaker example, we are going to learn how to apply [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) to fine-tune GPT-J on a single GPU. We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

You will learn how to:

1. Set up the development environment
2. Load and prepare the dataset
3. Fine-Tune GPT-J with LoRA and bnb int-8 on Amazon SageMaker
4. Deploy the model to Amazon SageMaker Endpoint

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
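To make the saving from LoRA concrete, here is a minimal sketch of how many parameters it trains for a single weight matrix. LoRA freezes the pre-trained weight `W` (shape `d x k`) and learns a low-rank update `B A`, with `B` of shape `d x r` and `A` of shape `r x k`. The dimensions below (a 4096 x 4096 projection and rank 8) are illustrative assumptions, not values taken from this example's training script.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one d x k weight matrix."""
    full = d * k        # every entry of W is trainable
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora


# Assumed sizes for illustration only
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=8):       {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.2%}")
```

With these assumed sizes, the low-rank factors hold well under 1% of the parameters of the matrix they adapt.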


In [2]:
!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker py7zr --upgrade --quiet
!pip install torch --upgrade --quiet


If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [3]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

bucket = sess.default_bucket()
region = sess.boto_region_name
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")


model_id = "EleutherAI/gpt-j-6b"  # alternative: "bigscience/bloomz-7b1"

prefix = f"lora-finetuning/{model_id}"

print(f"prefix: {prefix}")

sagemaker role arn: arn:aws:iam::376678947624:role/service-role/AmazonSageMaker-ExecutionRole-20230315T093911
sagemaker bucket: sagemaker-us-west-2-376678947624
sagemaker session region: us-west-2
prefix: lora-finetuning/EleutherAI/gpt-j-6b


## 2. Load and prepare the dataset

We will use the [spider](https://huggingface.co/datasets/spider) dataset, a text-to-SQL dataset in which each natural-language question is paired with the SQL query that answers it. The copy loaded below contains 7,000 training and 1,034 validation examples.

```python
{
  "question": "How many heads of the departments are older than 56 ?",
  "query": "SELECT count(*) FROM head WHERE age > 56"
}
```

To load the `spider` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [5]:
from datasets import load_dataset

dataset = load_dataset("spider")
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['validation'])}")

Found cached dataset spider (/root/.cache/huggingface/datasets/spider/spider/1.0.0/4e5143d825a3895451569c8b9b55432b91a4bc2d04d390376c950837f4680daa)


  0%|          | 0/2 [00:00<?, ?it/s]

Train dataset size: 7000
Test dataset size: 1034


To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out **[chapter 6](https://huggingface.co/course/chapter6/1?fw=tf)** of the Hugging Face Course.

In [21]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer of the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 # overwrite wrong value

token_dir = "token"
tokenizer.save_pretrained(token_dir)

('token/tokenizer_config.json',
 'token/special_tokens_map.json',
 'token/vocab.json',
 'token/merges.txt',
 'token/added_tokens.json',
 'token/tokenizer.json')

In [6]:
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path = model_id,
    cache_dir="model_cache",
    torch_dtype=torch.float16,
)

model_dir = model_id.split("/")[-1]
model.save_pretrained(model_dir)

Downloading (…)lve/main/config.json:   0%|          | 0.00/930 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/24.2G [00:00<?, ?B/s]

## Package and upload the model

In [22]:
!rm -rf `find -type d -name .ipynb_checkpoints`

In [23]:
!aws s3 cp {token_dir} s3://{bucket}/{prefix}/models/{model_dir}/tokenizer --recursive

upload: token/added_tokens.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer/added_tokens.json
upload: token/special_tokens_map.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer/special_tokens_map.json
upload: token/tokenizer_config.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer/tokenizer_config.json
upload: token/tokenizer.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer/tokenizer.json
upload: token/merges.txt to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer/merges.txt
upload: token/vocab.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer/vocab.json


In [30]:
!aws s3 cp {model_dir} s3://{bucket}/{prefix}/models/{model_dir}/model --recursive

upload: gpt-j-6b/config.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/model/config.json
upload: gpt-j-6b/pytorch_model.bin.index.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/model/pytorch_model.bin.index.json
upload: gpt-j-6b/generation_config.json to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/model/generation_config.json
upload: gpt-j-6b/pytorch_model-00002-of-00002.bin to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/model/pytorch_model-00002-of-00002.bin
upload: gpt-j-6b/pytorch_model-00001-of-00002.bin to s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/model/pytorch_model-00001-of-00002.bin


In [31]:
model_url = f"s3://{bucket}/{prefix}/models/{model_dir}/model"
token_url = f"s3://{bucket}/{prefix}/models/{model_dir}/tokenizer"

print(f"Tokenizer uploaded here: {token_url}")
print(f"Model uploaded here: {model_url}")

Tokenizer uploaded here: s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/tokenizer
Model uploaded here: s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/models/gpt-j-6b/model


Before we can start training, we need to preprocess our data. Text-to-SQL generation is a text-generation task: our model takes a natural-language question as input and generates a SQL query as output. We want to know how long our inputs and outputs are so that we can batch our data efficiently.

We define a `prompt_template` to construct an instruction-style prompt, which helps the performance of our model. The template has fixed parts ("Question:", a separator, and "Query:") around the question and query of each sample. This means we need to ensure that the fixed template parts plus the question and query do not exceed the maximum length of the model.
We preprocess our dataset before training, save it, and upload it to S3. You could also run this step on your local machine or a CPU instance and upload the result to the [Hugging Face Hub](https://huggingface.co/docs/hub/datasets-overview).
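As a rough illustration of the length check described above, the sketch below adds up the fixed template parts, the question, and the query, and compares the total against a maximum length. The helper names are hypothetical, and whitespace splitting is only a stand-in for the real tokenizer, so the counts are approximations rather than true token counts.

```python
MAX_LENGTH = 2048  # same value this example sets as tokenizer.model_max_length


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for len(tokenizer(text)["input_ids"])
    return len(text.split())


def fits(question: str, query: str, max_length: int = MAX_LENGTH) -> bool:
    """Check that fixed template parts + question + query stay within max_length."""
    fixed = count_tokens("Question:\n\n---\nQuery:\n") + 1  # +1 for the EOS token
    return fixed + count_tokens(question) + count_tokens(query) <= max_length


question = "How many heads of the departments are older than 56 ?"
query = "SELECT count(*) FROM head WHERE age > 56"
print(fits(question, query))                 # well within 2048
print(fits(question, query, max_length=10))  # too long for a tiny budget
```

In practice you would swap `count_tokens` for the model's tokenizer, since subword tokenization usually produces more tokens than whitespace splitting.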

In [25]:
from random import randint
from itertools import chain
from functools import partial

# custom instruct prompt start
prompt_template = f"Question:\n{{question}}\n---\nQuery:\n{{query}}{{eos_token}}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["question"],
                                            query=sample["query"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# print a random sample (randint is inclusive on both ends)
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": []}


def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # number of tokens that fit into full chunks (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_train_dataset)}")



Question:
List the creation year, name and budget of each department.
---
Query:
SELECT creation ,  name ,  budget_in_billions FROM department<|endoftext|>
Total number of samples: 210
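The packing step is easier to follow on toy data than on real token IDs. The sketch below shows the same idea: concatenate the tokenized samples of a batch, cut the stream into fixed-size chunks, and carry the leftover tokens into the next batch. The function name `pack` and the tiny `chunk_length` are illustrative assumptions; unlike the `chunk()` function above, this version returns the remainder instead of storing it in a global variable.

```python
from itertools import chain


def pack(batch, remainder, chunk_length):
    """Concatenate token lists, split into full chunks, return (chunks, new_remainder)."""
    stream = remainder + list(chain.from_iterable(batch))
    usable = (len(stream) // chunk_length) * chunk_length  # 0 if the stream is too short
    chunks = [stream[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, stream[usable:]


# Two batches of "tokenized" samples with varying lengths
chunks_1, rest = pack([[1, 2, 3], [4, 5]], remainder=[], chunk_length=4)
chunks_2, rest = pack([[6, 7], [8, 9, 10]], remainder=rest, chunk_length=4)

print(chunks_1)  # full chunks from the first batch
print(chunks_2)  # starts with the token left over from the first batch
print(rest)      # tokens still waiting for the next batch
```

Because samples are concatenated before splitting, a chunk can contain the end of one sample and the start of the next; the EOS token appended by the prompt template marks the boundary.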


After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [26]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{prefix}/data/train'
lm_train_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/210 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-us-west-2-376678947624/lora-finetuning/EleutherAI/gpt-j-6b/data/train


## 3. Fine-Tune GPT-J with LoRA and bnb int-8 on Amazon SageMaker

In addition to the LoRA technique, we will use [bitsandbytes LLM.int8()](https://huggingface.co/blog/hf-bitsandbytes-integration) to quantize our frozen LLM to int8. This reduces the memory needed for the GPT-J weights by roughly 4x compared to float32.
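A back-of-the-envelope calculation shows where a factor of about 4 comes from when going from float32 to int8. The sketch below counts only the memory for the model weights and assumes a round 6 billion parameters; activations, adapter weights, optimizer state, and anything kept in higher precision are ignored, so real usage will differ.

```python
NUM_PARAMS = 6_000_000_000  # assumption: approximate parameter count of a 6B model

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(NUM_PARAMS, dtype):5.1f} GB")

ratio = weight_memory_gb(NUM_PARAMS, "float32") / weight_memory_gb(NUM_PARAMS, "int8")
print(f"float32 / int8 = {ratio:.0f}x")
```

The float32 estimate is in line with the 24.2G checkpoint downloaded earlier in this example.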

We prepared a [run_clm.py](./scripts/run_clm.py) script, which uses PEFT to train our model. If you are interested in how this works, check out the [Efficient Large Language Model training with LoRA and Hugging Face](https://www.philschmid.de/fine-tune-flan-t5-peft) blog post, where we explain the training script in detail.


In order to create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us.
SageMaker takes care of starting and managing all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then it starts the training job by running the entry-point script.
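SageMaker passes the hyperparameters defined for the Estimator to the entry-point script as command-line arguments. The sketch below shows how a training script might read them with `argparse`. The argument names mirror the hyperparameters used in this example, but the actual parsing code in `run_clm.py` is not shown here, so treat this as an assumption about its shape.

```python
import argparse


def parse_args(argv=None):
    """Parse training arguments; defaults mirror the values used in this example."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default="/opt/ml/input/data/model")
    parser.add_argument("--token_path", type=str, default="/opt/ml/input/data/token")
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=2e-4)
    # parse_known_args tolerates extra arguments that are not declared above
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate a command line built from a hyperparameters dictionary
args = parse_args(["--epochs", "3", "--lr", "0.0001"])
print(args.epochs, args.lr, args.dataset_path)
```

Giving every argument a default keeps the script runnable outside SageMaker, for example in a quick local smoke test.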



In [32]:
import time
# define Training Job Name 
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_path': '/opt/ml/input/data/model',  # pre-trained model
  'token_path': '/opt/ml/input/data/token',
  'dataset_path': '/opt/ml/input/data/training', # path where sagemaker will save training dataset
  'epochs': 1,                                         # number of training epochs
  'per_device_train_batch_size': 2,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28.1',          # the transformers version used in the training job
    pytorch_version      = '2.0.0',           # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      = hyperparameters,
    # distribution={"torch_distributed":{"enabled":True}},
)

We can now start our training job with the `.fit()` method, passing our S3 paths as input channels.

In [33]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path, 'model': model_url, 'token':token_url}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-peft-2023-07-14-17-37-52-2023-07-14-17-37-53-649


Using provided s3_resource
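Each key of the `data` dictionary becomes an input channel, and SageMaker makes the contents of a channel available inside the container under `/opt/ml/input/data/<channel name>`. The following sketch spells out that mapping for the three channels used here; the helper name `channel_path` is hypothetical.

```python
from pathlib import PurePosixPath

INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_path(channel: str) -> str:
    """Container directory where the data of an input channel is placed."""
    return str(INPUT_ROOT / channel)


channels = ["training", "model", "token"]
paths = {name: channel_path(name) for name in channels}

for name, path in paths.items():
    print(f"{name:>8} -> {path}")
```

These are the same directories that the `model_path`, `token_path`, and `dataset_path` hyperparameters point to.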


## 4. Deploy the model to Amazon SageMaker Endpoint

When using `peft` for training, you normally end up with adapter weights only. We added the `merge_and_unload()` method to merge the base model with the adapter, which makes the model easier to deploy because we can then use the `pipelines` feature of the `transformers` library.
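Conceptually, merging folds the low-rank update into the frozen weight so that inference needs only a single matrix again: `W_merged = W + scale * B A`. The pure-Python sketch below checks this on tiny hand-written matrices. The values, the `scale` factor, and the helper names are illustrative assumptions and not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
B = [[1.0], [2.0]]            # 2x1 adapter factor
A = [[0.5, -0.5]]             # 1x2 adapter factor
scale = 2.0                   # assumed scaling factor

# Merge once, ahead of inference
W_merged = add(W, matmul(B, A), scale)

# Compare: base path plus adapter path vs. merged weight
x = [[3.0], [1.0]]  # column vector input
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)), scale)
merged_out = matmul(W_merged, x)

print(W_merged)
print(adapter_out == merged_out)
```

After merging, the model has the same shape as the original base model, so it can be served like any regular checkpoint.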

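We can now deploy our model by calling `deploy()` on a `HuggingFaceModel` object, passing in our desired number of instances and instance type. Since the training job was started with `wait=False`, make sure it has finished before running the next cell.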


In [34]:
from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=huggingface_estimator.model_data,
   #model_data="s3://hf-sagemaker-inference/model.tar.gz",  # Change to your model path
   role=role, 
   transformers_version="4.26", 
   pytorch_version="1.13", 
   py_version="py39",
   model_server_workers=1
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type= "ml.g5.4xlarge"
)


INFO:sagemaker:Creating model with name: huggingface-pytorch-inference-2023-07-14-18-46-24-417
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-inference-2023-07-14-18-46-25-113
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-inference-2023-07-14-18-46-25-113


--------!

SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.

Let's test it using an example from the `validation` split.

In [35]:
from random import randint
from datasets import load_dataset

# Load dataset from the hub
test_dataset = load_dataset("spider", split="validation")

# select a random test sample
sample = test_dataset[randint(0, len(test_dataset) - 1)]  # randint is inclusive on both ends

# format sample
prompt_template = f"Question:\n{{question}}\n---\nQuery:\n"

formatted_sample = {
  "inputs": prompt_template.format(question=sample["question"]),
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.1,
    "max_new_tokens": 100,
  }
}

# predict
res = predictor.predict(formatted_sample)

# print the full generated text (prompt followed by the generated query)
print(res[0]["generated_text"])




Question:
Show names of teachers that teach at least two courses.
---
Query:
SELECT T.Teacher_Name FROM Teacher AS T JOIN Course AS C ON T.Teacher_ID = C.Teacher_ID GROUP BY T.Teacher_ID HAVING COUNT(*)  >=  2


Let's compare it to the reference query.

In [36]:
print(sample["query"])

SELECT T2.Name FROM course_arrange AS T1 JOIN teacher AS T2 ON T1.Teacher_ID  =  T2.Teacher_ID GROUP BY T2.Name HAVING COUNT(*)  >=  2
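The generated query and the reference differ in their table and column names, so they do not match. As a rough first check, the sketch below compares two queries after normalizing case and whitespace. This is a deliberately simple heuristic with hypothetical helper names, not the official Spider evaluation.

```python
import re


def normalize_sql(query: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase the query.

    Note: lowercasing also affects string literals, which is acceptable only
    for a rough comparison.
    """
    collapsed = re.sub(r"\s+", " ", query).strip()
    return collapsed.rstrip(";").strip().lower()


def exact_match(predicted: str, reference: str) -> bool:
    return normalize_sql(predicted) == normalize_sql(reference)


predicted = (
    "SELECT T.Teacher_Name FROM Teacher AS T JOIN Course AS C "
    "ON T.Teacher_ID = C.Teacher_ID GROUP BY T.Teacher_ID HAVING COUNT(*)  >=  2"
)
reference = (
    "SELECT T2.Name FROM course_arrange AS T1 JOIN teacher AS T2 "
    "ON T1.Teacher_ID  =  T2.Teacher_ID GROUP BY T2.Name HAVING COUNT(*)  >=  2"
)

print(exact_match(predicted, reference))
print(exact_match("select count(*)  from head", "SELECT count(*) FROM head;"))
```

A string comparison like this cannot tell whether two differently written queries return the same rows; executing both against the database is a stricter check.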


Finally, clean up after you are done.

In [None]:

predictor.delete_model()
predictor.delete_endpoint()