# Fine-tune Meta Llama 3.1 - 8B Instruct with SageMaker JumpStart models

In this notebook, we fine-tune Meta Llama 3.1 - 8B Instruct using SageMaker JumpStart models.

## Prerequisites

In [None]:
%pip install -q -U datasets==3.1.0

***

## Prepare the dataset

In [None]:
import boto3
import sagemaker
import pandas as pd

In [None]:
s3 = boto3.client('s3')

sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
dataset_name = "telco_promotions"

Read the dataset

In [None]:
import pandas as pd

df = pd.read_json(f"./{dataset_name}.json")

df.head()

Next, create a prompt template for using the data in an instruction format for the training job. You will also use this template during model inference.

In [None]:
template = {
    "prompt": (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>{instruction}<|eot_id|>"
    ),
    "completion": "<|start_header_id|>assistant<|end_header_id|>{completion}<|eot_id|>",
}
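
To see what a single training example looks like once the template is applied, the sketch below fills the `prompt` and `completion` strings with `str.format`. The record values are hypothetical and only illustrate the three field names the template expects (`system`, `instruction`, `completion`). The helper `render_example` is not part of the notebook, and the rendering inside the training container may differ.

```python
# Same template as in the cell above, repeated so the sketch is self-contained.
template = {
    "prompt": (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>{instruction}<|eot_id|>"
    ),
    "completion": "<|start_header_id|>assistant<|end_header_id|>{completion}<|eot_id|>",
}

# Hypothetical record; real rows come from the dataset.
record = {
    "system": "You are a marketing bot.",
    "instruction": "Write a message for Emily.",
    "completion": "<promotion>$5 discount for birthday month</promotion>",
}


def render_example(template, record):
    """Fill both template strings with the fields of one record and join them."""
    prompt = template["prompt"].format(**record)
    completion = template["completion"].format(**record)
    return prompt + completion


text = render_example(template, record)
print(text)
```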

In [None]:
import json

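# Write one JSON object per line (JSON Lines), the format the training job expects
df.to_json("train.jsonl", orient="records", lines=True)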

# Write JSON template to local dir
with open("template.json", "w") as f:
    json.dump(template, f)
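
Because the training job reads JSON Lines, it can be worth checking the file before uploading it. The following is a minimal sketch of such a check, not part of the original workflow: `check_jsonl` is a hypothetical helper, and the demo writes a throwaway sample file instead of reading `train.jsonl`.

```python
import json
import os
import tempfile


def check_jsonl(path, required_keys):
    """Return the number of records; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"line {line_number} is not valid JSON") from err
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            missing = set(required_keys) - set(record)
            if missing:
                raise ValueError(f"line {line_number} is missing {sorted(missing)}")
            count += 1
    return count


# Demo on a temporary file with two hypothetical records.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for i in range(2):
            row = {"system": "s", "instruction": f"i{i}", "completion": f"c{i}"}
            f.write(json.dumps(row) + "\n")
    n_records = check_jsonl(sample_path, ["system", "instruction", "completion"])

print(n_records)
```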

### Upload to Amazon S3

In [None]:
import sagemaker

In [None]:
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix

In [None]:
from sagemaker.s3 import S3Uploader

if default_prefix:
    default_path = f"{bucket_name}/{default_prefix}/datasets/workshop-fine-tuning"
else:
    default_path = f"{bucket_name}/datasets/workshop-fine-tuning"

train_data_location = f"s3://{default_path}/{dataset_name}"

S3Uploader.upload("train.jsonl", train_data_location)
S3Uploader.upload("template.json", train_data_location)

print(f"Training data location: {train_data_location}")
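
The prefix handling above can also be folded into a small helper. This is only a sketch: `build_s3_uri` is a hypothetical function, and the bucket and prefix names in the demo are placeholders.

```python
def build_s3_uri(bucket, *parts, prefix=None):
    """Join a bucket, an optional prefix, and path parts into an s3:// URI."""
    segments = [bucket]
    if prefix:
        segments.append(prefix)
    segments.extend(parts)
    # Drop empty segments and stray slashes so the URI has no "//" inside the path.
    cleaned = [s.strip("/") for s in segments if s and s.strip("/")]
    return "s3://" + "/".join(cleaned)


with_prefix = build_s3_uri(
    "my-bucket", "datasets/workshop-fine-tuning", "telco_promotions", prefix="team-a"
)
without_prefix = build_s3_uri(
    "my-bucket", "datasets/workshop-fine-tuning", "telco_promotions"
)
print(with_prefix)
print(without_prefix)
```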

***

## Train the model

In this section, you fine-tune the model. The fine-tuning scripts are based on the scripts provided by [this repo](https://github.com/facebookresearch/llama-recipes/tree/main). To learn more about the fine-tuning scripts, see section [5. Few notes about the fine-tuning method](#5-few-notes-about-the-fine-tuning-method). For a list of supported hyperparameters and their default values, see section [3. Supported Hyper-parameters for fine-tuning](#3-supported-hyper-parameters-for-fine-tuning). By default, these models train via domain adaptation, so you must enable instruction tuning explicitly by setting the `instruction_tuned` hyperparameter to `"True"`.


In [None]:
import sagemaker
from sagemaker.jumpstart.estimator import JumpStartEstimator

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
instance_type = "ml.g5.12xlarge"
instance_count = 1

model_id = "meta-textgeneration-llama-3-1-8b-instruct"

In [None]:
estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},  # set "accept_eula": "true" to accept the EULA for gated models
    instance_type=instance_type,
    instance_count=instance_count,
    disable_output_compression=False,
    hyperparameters={
        "instruction_tuned": "True",
        "epoch": "5",
        "chat_dataset": "False",
        "enable_fsdp": "True",
    },
    sagemaker_session=sagemaker_session,
)

In [None]:
estimator.fit({"training": train_data_location})

## Deploy and invoke the fine-tuned model

You can deploy the fine-tuned model to an endpoint directly from the estimator.


In [None]:
import boto3
from datetime import datetime
import pytz
import sagemaker

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
def get_last_job_name(job_name_prefixes):
    sagemaker_client = boto3.client('sagemaker')
    latest_job = None
    # Set latest_creation_time to the minimum possible datetime with timezone info
    latest_creation_time = datetime.min.replace(tzinfo=pytz.UTC)

    for prefix in job_name_prefixes:
        search_response = sagemaker_client.search(
            Resource='TrainingJob',
            SearchExpression={
                'Filters': [
                    {
                        'Name': 'TrainingJobName',
                        'Operator': 'Contains',
                        'Value': prefix
                    },
                    {
                        'Name': 'TrainingJobStatus',
                        'Operator': 'Equals',
                        'Value': "Completed"
                    }
                ]
            },
            SortBy='CreationTime',
            SortOrder='Descending',
            MaxResults=1
        )

        if search_response['Results']:
            current_job = search_response['Results'][0]['TrainingJob']
            creation_time = current_job['CreationTime']

            if creation_time > latest_creation_time:
                latest_job = current_job
                latest_creation_time = creation_time

    if latest_job:
        return latest_job['TrainingJobName']
    else:
        return None

In [None]:
job_name_prefixes = [
    "llama-3-1-8b-instruct",
    "jumpstart-dft-meta-textgeneration-l"
]

# Invoke the function with the prefixes
job_name = get_last_job_name(job_name_prefixes)

job_name

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator.attach(job_name)

In [None]:
instance_count = 1
instance_type = "ml.g5.4xlarge"

In [None]:
predictor = estimator.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
)

Next, we use the test data to invoke the fine-tuned model. To do this, we template an example datapoint and query the model. For instruction fine-tuning, a special demarcation key is inserted between the example input and output text, so the same key is added here when templating the prompt for inference.

In [None]:
from datetime import date

# Get the current date
current_date = date.today()

# Format the date as "YYYY-MM-DD"
date_string = current_date.strftime("%Y-%m-%d")

base_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system}<|eot_id|><|start_header_id|>user<|end_header_id|>{instruction}. Today is {current_date}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

prompt = base_prompt.format(
    system="You are a marketing bot for a US-based telecom company called AnyCompany. Based on a customer profile, you choose an appropriate promotion for that customer and write a personalized marketing message to send to that customer.",
    instruction="Choose an appropriate promotion and write a personalized marketing message for a customer by following these steps: 1. Read the customer profile, 2. Think step-by-step to choose an appropriate promotion from the list of approved promotions and enclose it in <promotion> tags, 3. Write a personalized message for the customer based on the chosen promotion, the customer profile, and the time of year and enclose it in <personalized_message> tags. For the next customer, you can choose from the list of approved promotions. <approved_promotions> - $5 monthly winter holiday discount every month for the months of November-January - 10GB extra phone data for winter holidays every month from November to January - 20GB extra internet data for winter holidays every month from November to January - 10 extra minutes for winter holidays every month from November to January - 2GB extra phone data for birthday month - 5GB extra internet data for birthday month - $5 discount for birthday month - 30 extra minutes for birthday month - $2 discount on annual plan for customer with 2+ year tenure - $5 discount on annual plan for customer with 5+ year tenure - 5% discount for 6 months on new internet plan for customer with existing phone plan - 5% discount for 6 months on new phone plan for customer with existing internet plan - $5 voucher to spend on any phone or internet product </approved_promotions> Choose an appropriate promotion and write a personalized message for the customer below. Remember to use <promotion> and <personalized_message> tags. <customer_data> Name: Emily State: California DOB: 1985-11-12 Job: Software Engineer Join Date: 2018-05-15 Internet Service: 200Mbps Fiber Internet Contract: Annual Monthly Internet Costs: $65.99 Phone Service: 100GB Unlimited Minutes SmartPhone Contract: Annual Monthly Phone Costs: $89.99</customer_data>",
    current_date=date_string
)

print(prompt)

In [None]:
predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 1000,
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['<|eot_id|>', '<|end_of_text|>']
    }
})
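
The prompt asks the model to wrap its answer in `<promotion>` and `<personalized_message>` tags, so the generated text can be split with a regular expression. The sketch below assumes the generated text has already been pulled out of the endpoint response into a plain string (the exact response shape is not shown in this notebook). `extract_tag` and the sample text are hypothetical.

```python
import re


def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical model output, used only to exercise the helper.
generated_text = (
    "<promotion>$5 discount for birthday month</promotion>\n"
    "<personalized_message>Happy birthday, Emily!</personalized_message>"
)

promotion = extract_tag(generated_text, "promotion")
message = extract_tag(generated_text, "personalized_message")
print(promotion)
print(message)
```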

## Clean up

In [None]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

## Appendix

### 1. Supported Inference Parameters

This model supports the following payload parameters. You may specify any subset of these parameters when invoking an endpoint.

* **do_sample:** If True, activates logits sampling. If specified, it must be boolean.
* **max_new_tokens:** Maximum number of generated tokens. If specified, it must be a positive integer.
* **repetition_penalty:** A penalty for repetitive generated text. 1.0 means no penalty.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.
* **seed:** Random sampling seed.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **truncate:** Truncate input tokens to the given size.
* **typical_p:** Typical decoding mass, according to [Typical Decoding for Natural Language Generation](https://arxiv.org/abs/2202.00666).
* **best_of:** Generate `best_of` sequences and return the one with the highest token logprobs.
* **watermark:** Whether to perform watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226).
* **details:** Return generation details, to include output token logprobs and IDs.
* **decoder_input_details:** Return decoder input token logprobs and IDs.
* **top_n_tokens:** Return the N most likely tokens at each step.
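
Some of the constraints above can be checked on the client before a request is sent. The following sketch covers only a subset of the parameters; `validate_parameters` is a hypothetical helper, and reading "between 0 and 1" for `top_p` as excluding both endpoints is an assumption. The endpoint remains the authority on what it accepts.

```python
def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_positive_int(value):
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_parameters(parameters):
    """Raise ValueError if a value violates the constraints listed above."""
    for name in ("do_sample", "return_full_text"):
        if name in parameters and not isinstance(parameters[name], bool):
            raise ValueError(f"{name} must be boolean")
    for name in ("max_new_tokens", "top_k"):
        if name in parameters and not _is_positive_int(parameters[name]):
            raise ValueError(f"{name} must be a positive integer")
    if "temperature" in parameters:
        temperature = parameters["temperature"]
        if not (_is_number(temperature) and temperature > 0):
            raise ValueError("temperature must be a positive float")
    if "top_p" in parameters:
        top_p = parameters["top_p"]
        # Assumption: both endpoints are excluded.
        if not (_is_number(top_p) and 0 < top_p < 1):
            raise ValueError("top_p must be a float between 0 and 1")
    if "stop" in parameters:
        stop = parameters["stop"]
        if not (isinstance(stop, list) and all(isinstance(s, str) for s in stop)):
            raise ValueError("stop must be a list of strings")
    return parameters


parameters = validate_parameters({
    "max_new_tokens": 1000,
    "temperature": 0.2,
    "top_p": 0.9,
    "return_full_text": False,
    "stop": ["<|eot_id|>", "<|end_of_text|>"],
})
```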

### 2. Dataset formatting instruction for training

We currently offer two types of fine-tuning: instruction fine-tuning and domain adaptation fine-tuning. You can switch between the two training
methods by setting the `instruction_tuned` parameter to 'True' or 'False'.


#### 2.1. Domain adaptation fine-tuning
The Text Generation model can also be fine-tuned on any domain specific dataset. After being fine-tuned on the domain specific dataset, the model
is expected to generate domain specific text and solve various NLP tasks in that specific domain with **few shot prompting**.

Below are the instructions for how the training data should be formatted for input to the model.

- **Input:** A train and an optional validation directory. Each directory contains a CSV/JSON/TXT file. 
  - For CSV/JSON files, the train or validation data is used from the column called 'text' or the first column if no column called 'text' is found.
  - The train directory and the validation directory (if provided) should each contain exactly one file.
- **Output:** A trained model that can be deployed for inference. 
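
The column rule for CSV files can be illustrated with the standard `csv` module. This is a sketch of the rule as described above, not the training container's code; `select_text_column` is a hypothetical helper, and it assumes the CSV has a header row.

```python
import csv
import io


def select_text_column(csv_text):
    """Pick the 'text' column if present, otherwise the first column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if not fieldnames:
        raise ValueError("CSV input has no header row")
    column = "text" if "text" in fieldnames else fieldnames[0]
    return column, [row[column] for row in reader]


# Hypothetical inputs: one with a 'text' column, one without.
with_text = "id,text\n1,first document\n2,second document\n"
without_text = "body,source\nalpha,web\nbeta,book\n"

chosen_a, values_a = select_text_column(with_text)
chosen_b, values_b = select_text_column(without_text)
print(chosen_a, values_a)
print(chosen_b, values_b)
```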

Below is an example of a TXT file for fine-tuning the Text Generation model. The TXT file contains SEC filings of Amazon from 2021 to 2022.

```
Note About Forward-Looking Statements
This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.
GENERAL
Embracing Our Future ...
```


#### 2.2. Instruction fine-tuning
The Text Generation model can be instruction-tuned on any text data, provided that the data
is in the expected format. The instruction-tuned model can then be deployed for inference.
Below are the instructions for how the training data should be formatted for input to the
model.

- **Input:** A train and an optional validation directory. The train and validation directories should contain one or more JSON lines (`.jsonl`) formatted files. The train directory can also contain an optional `*.json` file describing the input and output formats.
  - The best model is selected according to the validation loss, calculated at the end of each epoch.
    If a validation set is not given, an (adjustable) percentage of the training data is
    automatically split and used for validation.
  - The training data must be formatted in a JSON lines (`.jsonl`) format, where each line is a dictionary
    representing a single data sample. All training data must be in a single folder; however,
    it can be saved in multiple `.jsonl` files. The `.jsonl` file extension is mandatory. The training
    folder can also contain a `template.json` file describing the input and output formats. If no
    template file is given, the following template will be used:
  ```json
  {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}"
  }
  ```
  - In this case, the data in the JSON lines entries must include `instruction`, `context` and `response` fields. If a custom template is provided, it must also use `prompt` and `completion` keys to define
    the input and output templates.
    Below is a sample custom template:

  ```json
  {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}"
  }
  ```
  Here, the data in the JSON lines entries must include `question`, `context` and `answer` fields.
- **Output:** A trained model that can be deployed for inference. 
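
Since every placeholder in the template must exist as a field in each record, the required field names can be derived from the template itself. The sketch below uses `string.Formatter().parse` for that. `template_fields` and `missing_fields` are hypothetical helpers, and the sample record is made up.

```python
from string import Formatter

# Default template quoted above.
default_template = {
    "prompt": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n### Instruction:\n{instruction}\n\n"
        "### Input:\n{context}"
    ),
    "completion": "{response}",
}


def template_fields(template):
    """Collect the placeholder names used in the prompt and completion strings."""
    fields = set()
    for value in (template["prompt"], template["completion"]):
        for _, field_name, _, _ in Formatter().parse(value):
            if field_name:
                fields.add(field_name)
    return fields


def missing_fields(template, record):
    """Return the placeholders that the record does not provide, sorted."""
    return sorted(template_fields(template) - set(record))


required = template_fields(default_template)
# Hypothetical record that lacks the 'context' field.
gaps = missing_fields(default_template, {"instruction": "Summarize.", "response": "Done."})
print(sorted(required))
print(gaps)
```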


### 3. Supported Hyper-parameters for fine-tuning

- epoch: The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default: 5
- learning_rate: The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default: 1e-4.
- instruction_tuned: Whether to instruction-train the model or not. Must be 'True' or 'False'. Default: 'False'
- per_device_train_batch_size: The batch size per GPU core/CPU for training. Must be a positive integer. Default: 4.
- per_device_eval_batch_size: The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default: 1
- max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of training samples. Must be a positive integer or -1. Default: -1. 
- max_val_samples: For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of validation samples. Must be a positive integer or -1. Default: -1. 
- max_input_length: Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default: -1. 
- validation_split_ratio: If validation channel is none, ratio of train-validation split from the train data. Must be between 0 and 1. Default: 0.2. 
- train_data_split_seed: If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default: 0.
- preprocessing_num_workers: The number of processes to use for the preprocessing. If None, main process is used for preprocessing. Default: "None"
- lora_r: LoRA rank (r). Must be a positive integer. Default: 8.
- lora_alpha: LoRA alpha. Must be a positive integer. Default: 32.
- lora_dropout: LoRA dropout. Must be a positive float between 0 and 1. Default: 0.05.
- int8_quantization: If True, model is loaded with 8 bit precision for training. Default for 7B/13B: False. Default for 70B: True.
- enable_fsdp: If True, training uses Fully Sharded Data Parallelism. Default for 7B/13B: True. Default for 70B: False.
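
The `max_input_length` rule described above can be written as a small function. This is a sketch of the stated rule only; `resolve_max_input_length` is a hypothetical helper, and the `model_max_length` values in the demo are placeholders, not values read from a tokenizer.

```python
def resolve_max_input_length(max_input_length, model_max_length, default_cap=1024):
    """Apply the rule: -1 caps at 1024, a positive value caps at the model limit."""
    if max_input_length == -1:
        return min(default_cap, model_max_length)
    if isinstance(max_input_length, int) and max_input_length > 0:
        return min(max_input_length, model_max_length)
    raise ValueError("max_input_length must be a positive integer or -1")


# Placeholder tokenizer limits, for illustration only.
default_case = resolve_max_input_length(-1, model_max_length=8192)
explicit_case = resolve_max_input_length(4096, model_max_length=8192)
clipped_case = resolve_max_input_length(4096, model_max_length=2048)
print(default_case, explicit_case, clipped_case)
```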

Note 1: `int8_quantization` is not supported with FSDP. Also, setting both `int8_quantization = 'False'` and `enable_fsdp = 'False'` is not supported on any of the g5 family instances due to CUDA memory issues. Thus, we recommend setting exactly one of `int8_quantization` or `enable_fsdp` to 'True'.

Note 2: Due to the size of the model, the 70B model cannot be fine-tuned with `enable_fsdp = 'True'` on any of the supported instance types.
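
The recommendation in Note 1 amounts to an exclusive-or check on the two flags. The sketch below encodes it for the string values used in the `hyperparameters` dictionary. `check_memory_settings` is a hypothetical helper and does not cover the 70B restriction from Note 2.

```python
def check_memory_settings(int8_quantization, enable_fsdp):
    """Classify a combination of the two flags, given as 'True'/'False' strings."""
    int8 = str(int8_quantization).lower() == "true"
    fsdp = str(enable_fsdp).lower() == "true"
    if int8 and fsdp:
        return "unsupported: int8_quantization cannot be combined with FSDP"
    if not int8 and not fsdp:
        return "unsupported on g5 instances: enable one of the two options"
    return "ok"


# The estimator in this notebook sets enable_fsdp to "True"; int8_quantization
# is written out as "False" here only to make the combination explicit.
status = check_memory_settings(int8_quantization="False", enable_fsdp="True")
print(status)
```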

### 4. Supported Instance types

We have tested our scripts on the following instances types:

| Model | Model ID | All Supported Instances Types for fine-tuning |
| - | - | - |
| Llama 2 7B | meta-textgeneration-llama-2-7b | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge |
| Llama 2 13B | meta-textgeneration-llama-2-13b | ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge |
| Llama 2 70B | meta-textgeneration-llama-2-70b | ml.g5.48xlarge |
| Llama 3 8B | meta-textgeneration-llama-3-8b | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge |
| Llama 3 70B | meta-textgeneration-llama-3-70b | ml.g5.48xlarge, ml.p4d.24xlarge |
| Llama 3.1 8B | meta-textgeneration-llama-3-1-8b | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge |
| Llama 3.1 70B | meta-textgeneration-llama-3-1-70b | ml.g5.48xlarge, ml.p4d.24xlarge |
| Llama 3.1 405B FP8 | meta-textgeneration-llama-3-1-405b-fp8 | ml.p5.48xlarge |


Other instance types may also work for fine-tuning. Note: When using p3 instances, training is done with 32-bit precision because bfloat16 is not supported on these instances. As a result, a training job consumes twice as much CUDA memory on p3 instances as on g5 instances.
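
The factor of two follows from the bytes needed per parameter. As a rough illustration, the sketch below estimates the memory taken by the model weights alone, using a nominal 8 billion parameters (an approximation). It ignores activations, optimizer state, and framework overhead, so it is not a sizing guide.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_memory_gib(num_parameters, dtype):
    """Memory for the weights alone, in GiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1024**3


NUM_PARAMETERS = 8_000_000_000  # nominal size of an 8B model (approximation)

bf16_gib = weight_memory_gib(NUM_PARAMETERS, "bfloat16")
fp32_gib = weight_memory_gib(NUM_PARAMETERS, "float32")
print(f"bfloat16: {bf16_gib:.1f} GiB, float32: {fp32_gib:.1f} GiB")
```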

### 5. Few notes about the fine-tuning method

- Fine-tuning scripts are based on [this repo](https://github.com/facebookresearch/llama-recipes/tree/main). 
- Instruction tuning dataset is first converted into domain adaptation dataset format before fine-tuning. 
- Fine-tuning scripts use Fully Sharded Data Parallel (FSDP) as well as the Low Rank Adaptation (LoRA) method to fine-tune the models.
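
To give a sense of why LoRA keeps fine-tuning cheap, the sketch below counts the trainable parameters that a rank-`r` adapter adds to one weight matrix, using the default `lora_r` and `lora_alpha` from section 3. The 4096 x 4096 layer shape is an illustrative assumption, not a value taken from the fine-tuning scripts, and which layers receive adapters is not specified here.

```python
def lora_trainable_parameters(d_out, d_in, r):
    """A rank-r adapter adds two matrices: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


def lora_scaling(lora_alpha, r):
    """Scaling applied to the low-rank update B @ A in the standard LoRA formulation."""
    return lora_alpha / r


D_OUT, D_IN = 4096, 4096  # illustrative layer shape (assumption)
LORA_R, LORA_ALPHA = 8, 32  # defaults listed in section 3

full_parameters = D_OUT * D_IN
adapter_parameters = lora_trainable_parameters(D_OUT, D_IN, LORA_R)
scaling = lora_scaling(LORA_ALPHA, LORA_R)

print(f"full matrix: {full_parameters}, adapter: {adapter_parameters}")
print(f"trainable fraction: {adapter_parameters / full_parameters:.4%}, scaling: {scaling}")
```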
