# Fine-tune TinyLlama 1.1B with LoRA and the SageMaker remote decorator

## Question Answering

---

In this demo notebook, we demonstrate how to fine-tune the TinyLlama 1.1B model using Hugging Face PEFT.

We use the SageMaker remote decorator to run the fine-tuning job as an Amazon SageMaker Training job.

---

JupyterLab Instance Type: ml.t3.medium

SageMaker Distribution image: SageMaker Distribution 2.0

Python version: 3.11

Fine-Tuning:
* Instance Type: ml.g5.12xlarge

Install the required libraries, including the Hugging Face libraries, and restart the kernel.

In [None]:
%pip install -r requirements.txt


## Setup Configuration file path

We set the directory that contains the config.yaml file so that the remote decorator can pick up its settings through [SageMaker Defaults](https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk).

This notebook uses the Hugging Face container for the `us-east-1` region. Make sure you are using the right image for your AWS region; if not, edit [config.yaml](./config.yaml). Container images are listed [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).


In [None]:
import os

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()
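
As a quick sanity check before launching a job, the sketch below resolves the file that the override points at and reports whether it exists. It assumes that, when the override is a directory, the file is named `config.yaml`; the helper `resolve_config_file` is hypothetical and not part of the SageMaker SDK.

```python
import os
from pathlib import Path


def resolve_config_file(override=None):
    """Return the config file path implied by the override (hypothetical helper).

    Assumption: a directory override is expected to contain `config.yaml`;
    a file override is used as-is.
    """
    location = Path(
        override or os.environ.get("SAGEMAKER_USER_CONFIG_OVERRIDE", os.getcwd())
    )
    return location / "config.yaml" if location.is_dir() else location


config_file = resolve_config_file()
print(f"{config_file} exists: {config_file.is_file()}")
```

If the file is reported as missing, the remote decorator would not find the settings described above, so it is worth checking the working directory first.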

## Visualize and upload the dataset

Read the training dataset into a pandas DataFrame.

In [None]:
import pandas as pd
df = pd.read_csv('train_2.csv.gz', compression='gzip', sep=',')
print("Number of elements: ", len(df))
df.head()

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=42)

print("Number of train elements: ", len(train))
print("Number of test elements: ", len(test))

Create a prompt template and a helper that applies it to each sample of the dataset. A random sample is printed in a later cell to check the result.

In [None]:
from random import randint

# custom instruct prompt start
# plain string (no f-prefix): the placeholders are filled in later by str.format
prompt_template = """
<|user|>
{question}
<|assistant|>
{answer}
"""

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["question"],
                                            answer=sample["answer"])
    return sample
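
To make the effect of the template concrete, here is a minimal, self-contained sketch that applies it to a single record. The question and answer are invented for illustration and do not come from the dataset.

```python
# Same template and helper as above, repeated so the example runs on its own
prompt_template = """
<|user|>
{question}
<|assistant|>
{answer}
"""


def template_dataset(sample):
    # Add a "text" field containing the fully rendered prompt
    sample["text"] = prompt_template.format(
        question=sample["question"], answer=sample["answer"]
    )
    return sample


# Hypothetical record, for illustration only
example = template_dataset(
    {"question": "What is LoRA?", "answer": "A parameter-efficient fine-tuning method."}
)
print(example["text"])
```

The rendered text starts and ends with a newline because the triple-quoted template does.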

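Convert the train and test DataFrames into Hugging Face `Dataset` objects and apply the prompt template to every sample. Later, the Hugging Face Trainer class fine-tunes the model on these datasets with the hyperparameters we define, and a DataCollator takes care of padding our inputs and labels.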

In [None]:
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# randint is inclusive on both ends, so the upper bound is len - 1
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))



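To train our model, we need to convert our inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. We then freeze the base LLM and attach LoRA adapters to it, so only the adapter weights are trained. Note that `train_fn` below loads the base model without bitsandbytes 4-bit quantization; bitsandbytes is only requested at deployment time, through the `QUANTIZE` setting of the inference container. The following helper reports how many parameters are trainable.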



In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
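
To give a feel for the numbers this helper prints, the sketch below computes how many parameters a LoRA adapter adds to one linear layer. It assumes the standard LoRA decomposition into two low-rank matrices; the 2048 x 2048 layer size is an illustrative assumption, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Trainable parameters added by LoRA to one linear layer.

    LoRA learns A with shape (r, in_features) and B with shape (out_features, r),
    while the original weight matrix stays frozen.
    """
    return r * (in_features + out_features)


# Illustrative layer size (assumption), with the default rank r=8 used by train_fn
frozen = 2048 * 2048
added = lora_param_count(2048, 2048, r=8)
print(f"frozen: {frozen} || LoRA trainable: {added} || ratio: {100 * added / frozen:.2f}%")
```

Doubling `r` doubles the number of trainable adapter parameters, which is why small ranks keep the trainable share of the model so low.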

Define the training function. The `@remote` decorator runs it as a SageMaker training job.

In [None]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

In [None]:
from accelerate import Accelerator
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import transformers

# Start training
@remote(
    keep_alive_period_in_seconds=0,
    instance_type="ml.g5.xlarge",
    volume_size=100,
    job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}"
)
def train_fn(
        model_name,
        train_ds,
        test_ds=None,
        lora_r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=1,
        learning_rate=2e-4,
        num_train_epochs=1,
        gradient_checkpointing=False,
        merge_weights=False,
        seed=42,
        token=None
):

    set_seed(seed)

    accelerator = Accelerator()

    if token is not None:
        login(token=token)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set Tokenizer pad Token
    tokenizer.pad_token = tokenizer.eos_token

    with accelerator.main_process_first():
        # tokenize and chunk dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features)
        )

        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:
            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features)
            )

            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache"
    )

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
    
    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)
    print_trainable_parameters(model)
    
    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=False,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs"
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()

    if merge_weights:
        output_dir = "/tmp/model"

        # Save the LoRA adapter weights first; they are merged with the base model below
        trainer.model.save_pretrained(output_dir, safe_serialization=False)
        
        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer
    
            torch.cuda.empty_cache()
    
            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
            )
    
            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                "/opt/ml/model", safe_serialization=True, max_shard_size="2GB"
            )
    else:
        trainer.model.save_pretrained("/opt/ml/model", safe_serialization=True)

    if accelerator.is_main_process:
        tokenizer.save_pretrained("/opt/ml/model")

In [None]:
train_fn(
    model_id,
    train_ds=train_dataset,
    test_ds=test_dataset,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=3,
    merge_weights=True
)
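
The call above combines a per-device batch size of 2 with 2 gradient accumulation steps. The sketch below shows roughly how these settings interact. It assumes a single GPU (`num_devices=1`) and a hypothetical dataset of 1,000 samples; both helpers are illustrative, and the exact step count reported by the Trainer can differ slightly between library versions.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    # Samples that contribute to a single optimizer update
    return per_device_batch_size * grad_accum_steps * num_devices


def estimated_optimizer_steps(num_samples, batch_size, epochs):
    # Rough estimate: one update per effective batch, repeated for every epoch
    return math.ceil(num_samples / batch_size) * epochs


batch = effective_batch_size(per_device_batch_size=2, grad_accum_steps=2, num_devices=1)
steps = estimated_optimizer_steps(num_samples=1000, batch_size=batch, epochs=3)
print(f"effective batch size: {batch}, estimated optimizer steps: {steps}")
```

Gradient accumulation trades time for memory: the effective batch grows without increasing the number of samples held on the GPU at once.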

***

## Load Fine-Tuned model

Note: Run `train_fn` with `merge_weights=True` to merge the trained LoRA adapter into the base model before deploying it.

### Download model

In [None]:
import json
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bucket_name = sagemaker_session.default_bucket()
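# Must match the `job_name_prefix` used in the @remote decorator, otherwise the search below finds no job
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}"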

In [None]:
def get_last_job_name(job_name_prefix):
    import boto3
    sagemaker_client = boto3.client('sagemaker')
    
    search_response = sagemaker_client.search(
        Resource='TrainingJob',
        SearchExpression={
            'Filters': [
                {
                    'Name': 'TrainingJobName',
                    'Operator': 'Contains',
                    'Value': job_name_prefix
                },
                {
                    'Name': 'TrainingJobStatus',
                    'Operator': 'Equals',
                    'Value': "Completed"
                }
            ]
        },
        SortBy='CreationTime',
        SortOrder='Descending',
        MaxResults=1)

    return search_response['Results'][0]['TrainingJob']['TrainingJobName']

In [None]:
job_name = get_last_job_name(job_prefix)

job_name
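
`get_last_job_name` indexes the first search result directly, so it raises an `IndexError` when no completed job matches the prefix. A more defensive variant could look like the sketch below. The helper `extract_job_name` is hypothetical, and the response dictionaries are hand-written stand-ins that only mirror the keys used above.

```python
def extract_job_name(search_response):
    """Return the first training job name, or None if the search found nothing."""
    results = search_response.get("Results", [])
    if not results:
        return None
    return results[0]["TrainingJob"]["TrainingJobName"]


# Hand-written stand-ins for search responses (illustrative job name)
empty_response = {"Results": []}
found_response = {
    "Results": [{"TrainingJob": {"TrainingJobName": "train-example-job"}}]
}

print(extract_job_name(empty_response))
print(extract_job_name(found_response))
```

Returning `None` lets the caller print a clear message, for example that no completed training job with this prefix exists yet, instead of failing with a stack trace.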

#### Inference configurations

In [None]:
instance_count = 1
instance_type = "ml.g5.xlarge"
number_of_gpu = 1
health_check_timeout = 700

In [None]:
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="2.2.0"
)

image_uri

In [None]:
model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz",
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
        'QUANTIZE': 'bitsandbytes',
        'MAX_INPUT_LENGTH': '1024',
        'MAX_TOTAL_TOKENS': '2048'
    }
)

In [None]:
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

#### Predict

In [None]:
from sagemaker.huggingface.model import HuggingFacePredictor

In [None]:
endpoint_name = "<ENDPOINT_NAME>"  # Required if you want to create a predictor without running the previous code

In [None]:
if 'predictor' not in locals() and 'predictor' not in globals():
    print("Create predictor")
    predictor = HuggingFacePredictor(
        endpoint_name=endpoint_name
    )

#### Load data

In [None]:
import pandas as pd
df = pd.read_csv('train_2.csv.gz', compression='gzip', sep=',')
print("Number of elements: ", len(df))
df.head()

In [None]:
from sklearn.model_selection import train_test_split

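train, test = train_test_split(df, test_size=0.1, random_state=42)
# Evaluate on held-out rows: sample up to 10 elements from the test split,
# which was not used to update the model during fine-tuning
valid = test.sample(n=min(10, len(test)), random_state=42)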

print("Number of validation elements: ", len(valid))

#### Test model

In [None]:
import json
import time

evaluation_set = []

# Prompt template, defined once and left-aligned so that no indentation leaks into the prompt
base_prompt = """
<|system|>
You are an experienced chat assistant, specialized in the Porsche brand.
<|user|>
{question}
<|assistant|>
"""

for index, row in valid.iterrows():

    prompt = base_prompt.format(question=row["question"])

    start_time_fine_tuned = time.time()

    response = predictor.predict({
        "inputs": prompt,
        "parameters": {
            "temperature": 0.2,
            "top_p": 0.9,
            "return_full_text": False
        }
    })

    end_time_fine_tuned = time.time()

    response_fine_tuned = response[0]["generated_text"].strip()

    evaluation_set.append({
        "question": row["question"],
        "target_answer": row["answer"],
        "fine_tuned_answer": response_fine_tuned
    })

    print(f"Generated response with fine-tuned model: {end_time_fine_tuned - start_time_fine_tuned:.6f} seconds")
    print("*****************************")

with open("tiny_evaluation_dataset.json", "w") as f:
    json.dump(evaluation_set, f, indent=4)
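
The saved records pair each target answer with the model's answer, so they can be scored afterwards. As one possible starting point, the sketch below computes a token-overlap F1 score per record. This metric is an assumption chosen for illustration, not the notebook's own evaluation method, and the sample records are invented.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (case-insensitive, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented records with the same keys as `evaluation_set`
sample_records = [
    {
        "question": "How long is the warranty?",
        "target_answer": "the warranty lasts four years",
        "fine_tuned_answer": "the warranty lasts two years",
    }
]

scores = [token_f1(r["fine_tuned_answer"], r["target_answer"]) for r in sample_records]
print(f"mean token F1: {sum(scores) / len(scores):.2f}")
```

Token overlap rewards shared wording rather than correctness, so it is best read alongside a manual review of the answers.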

#### Delete Endpoint

In [None]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)