# Fine-tuning Small Language Models through SageMaker JumpStart (Notebook for larger instances)

---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to fine-tune and deploy small language models for text generation. We cover two types of fine-tuning: instruction fine-tuning and domain adaptation fine-tuning.

Our goal is to show that fine-tuning a small language model can be more cost-effective, and a better strategy, than relying on a larger model.

---

We begin by installing and upgrading necessary packages. Restart the kernel after executing the cell below for the first time.

In [None]:
!pip install sagemaker datasets --upgrade --quiet --no-warn-conflicts

## 1. Deploying Llama 3 through JumpStart

We'll start by deploying our Llama 3 8B model and testing the endpoint.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

In [None]:
model_id, model_version = "meta-textgeneration-llama-3-8b", "2.3.0"

In [None]:
pretrained_model = JumpStartModel(model_id=model_id, model_version=model_version)
# accept_eula=True confirms that you accept the model's end-user license agreement (EULA)
pretrained_predictor = pretrained_model.deploy(accept_eula=True, instance_type="ml.g5.2xlarge")

---
Next, we invoke the endpoint with some sample queries. Later in this notebook, we will fine-tune this model with a custom dataset and carry out inference using the fine-tuned model. We will also show a comparison between the results obtained from the pre-trained and the fine-tuned models.

---

In [None]:
def print_response(payload, response):
    print(payload["inputs"])
    print(f"> {response.get('generated_text')}")
    print("\n==================================\n")

In [None]:
payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
        "return_full_text": False,
    },
}
try:
    response = pretrained_predictor.predict(
        payload, custom_attributes="accept_eula=True"
    )
    print_response(payload, response)
except Exception as e:
    print(e)

In [None]:
pretrained_predictor.delete_model()
pretrained_predictor.delete_endpoint()

## 2. Fine-tuning Llama 3

Now, we demonstrate how to instruction-tune the `meta-textgeneration-llama-3-8b` model for a new task. Llama 3 8B is a pretrained generative text model with 8 billion parameters. Training a model requires larger instances than inference. In this case we will use `ml.g5.12xlarge`, which is not available in workshops.

### 2.1. Preparing training data

You can fine-tune on a dataset in either the domain adaptation format or the instruction tuning format. In this section, we use a subset of the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) in the instruction tuning format. The Dolly dataset contains roughly 15,000 instruction-following records in categories such as question answering, summarization, and information extraction. It is available under the CC BY-SA 3.0 license. We select the summarization examples for fine-tuning.

Training data is formatted in JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, however it can be saved in multiple jsonl files. The training folder can also contain a template.json file describing the input and output formats.
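
To make the layout concrete, here is a minimal sketch that writes and reads a JSON Lines file with the standard library only. The two records and the file name `sample.jsonl` are made up for illustration; the field names follow the Dolly dataset used in this notebook.

```python
import json
import tempfile
from pathlib import Path

# Two made-up records that use the Dolly field names.
records = [
    {"instruction": "Summarize the text.",
     "context": "JSON Lines stores one JSON object per line.",
     "response": "One object per line."},
    {"instruction": "Summarize the text.",
     "context": "All training files live in a single folder.",
     "response": "One folder holds all files."},
]


def write_jsonl(path, rows):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
write_jsonl(sample_path, records)
loaded = read_jsonl(sample_path)
print(len(loaded), sorted(loaded[0]))
```

Each line can be parsed independently, which is why a dataset can be split across several `.jsonl` files in the same folder.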

In [None]:
import boto3
import sagemaker
import json

# Get current region, role, and default bucket
aws_region = boto3.Session().region_name
aws_role = sagemaker.session.Session().get_caller_identity_arn()
output_bucket = sagemaker.Session().default_bucket()

# This will be useful for printing
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

print(f"{bold}aws_region:{unbold} {aws_region}")
print(f"{bold}aws_role:{unbold} {aws_role}")
print(f"{bold}output_bucket:{unbold} {output_bucket}")

In [None]:
from datasets import load_dataset

dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# To train for question answering or information extraction, replace the condition in the next line with example["category"] == "closed_qa" or "information_extraction".
summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == "summarization")
summarization_dataset = summarization_dataset.remove_columns("category")

# We split the dataset into two where test data is used to evaluate at the end.
train_and_test_dataset = summarization_dataset.train_test_split(test_size=0.1)

# Dumping the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")

In [None]:
train_and_test_dataset["train"][0]

Note that the `.jsonl` file extension is mandatory for the training files described above.

If no template file is given, the following default template will be used:

```json
{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}"
}
```

In this case, the data in the JSON lines entries must include `instruction`, `context`, and `response` fields. If a custom template is provided, it must also use `prompt` and `completion` keys to define the input and output templates. Below is a sample custom template:

  ```json
  {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}"
  }
  ```
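
As a rough illustration of how such a template relates to a record, the sketch below fills the sample custom template with Python's `str.format`. The record is made up, and the helper `render` is hypothetical; the actual substitution happens inside the JumpStart training script, which is not shown in this notebook.

```python
# The sample custom template from above.
custom_template = {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}",
}

# A made-up record whose keys match the placeholders in the template.
record = {
    "question": "What format is the training data in?",
    "context": "Training data is stored as JSON lines.",
    "answer": "JSON lines.",
}


def render(template, row):
    """Fill the prompt and completion templates with one record's fields."""
    prompt = template["prompt"].format(**row)
    completion = template["completion"].format(**row)
    return prompt, completion


prompt_text, completion_text = render(custom_template, record)
print(prompt_text)
print(completion_text)
```

A record that lacks one of the placeholder fields would raise a `KeyError` in this sketch, which mirrors the requirement that every entry carries the fields the template refers to.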

In this demo we use a custom template instead of the default prompt template (see below).

In [None]:
import json

template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}
with open("template.json", "w") as f:
    json.dump(template, f)

Next, we upload the training data and the prompt template to S3. The training data is the Dolly summarization subset saved above as `train.jsonl`. Given the prompt template defined in the cell above, each entry in `train.jsonl` includes **`instruction`**, **`context`**, and **`response`** fields.

In [None]:
from sagemaker.s3 import S3Uploader
import sagemaker
import random

output_bucket = sagemaker.Session().default_bucket()
local_data_file = "train.jsonl"
train_data_location = f"s3://{output_bucket}/dolly_dataset_mistral"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"Training data: {train_data_location}")

The prompt template (`template.json`) and the training data (`train.jsonl`) are now stored under the same S3 prefix, which is passed to the training job below.

### 2.2. Prepare training parameters

In [None]:
from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
print(my_hyperparameters)

You can overwrite the hyperparameters here, or simply pass them to the training job. **Note: You can select the LoRA method for fine-tuning by setting `peft_type` to `lora` in the hyperparameters.**

If you decide to change hyperparameters, it is recommended to validate the changes.

In [None]:
hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

### 2.3. Starting training

If your selected model is gated, you will need to set accept_eula to True to accept the model end-user license agreement (EULA).

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

instruction_tuned_estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},  # "true" accepts the model EULA, which gated models require
    disable_output_compression=True,
    instance_type="ml.g5.12xlarge",  # For Llama-3-70b, use instance_type="ml.g5.48xlarge"
)

# Instruction tuning is disabled by default, so enable it explicitly to train on an instruction tuning dataset
instruction_tuned_estimator.set_hyperparameters(
    instruction_tuned="True", epoch="5", max_input_length="1024"
)

instruction_tuned_estimator.fit({"train": train_data_location})

Training performance metrics such as training loss and validation accuracy/loss can be accessed through CloudWatch during training. We can also fetch these metrics and analyze them within the notebook; the `TrainingJobAnalytics` cell in the Appendix shows how.

### 2.4. Deploying inference endpoints

In [None]:
instruction_tuned_predictor = instruction_tuned_estimator.deploy()

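Next, we compare the pre-trained model with the fine-tuned model on a few test examples. The cell below also calls `pretrained_predictor`, so the pre-trained endpoint from Section 1 must be running; redeploy it first if you have already deleted it.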

In [None]:
import pandas as pd
from IPython.display import display, HTML

test_dataset = train_and_test_dataset["test"]

(
    inputs,
    ground_truth_responses,
    responses_before_finetuning,
    responses_after_finetuning,
) = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": template["prompt"].format(
            instruction=datapoint["instruction"], context=datapoint["context"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    # The pre-trained model is gated, so accept the EULA as in Section 1
    pretrained_response = pretrained_predictor.predict(
        payload, custom_attributes="accept_eula=true"
    )
    responses_before_finetuning.append(pretrained_response.get("generated_text"))
    # Fine-tuned Llama 3 models do not require "accept_eula=true"
    finetuned_response = instruction_tuned_predictor.predict(payload)
    responses_after_finetuning.append(finetuned_response.get("generated_text"))


try:
    for i, datapoint in enumerate(test_dataset.select(range(5))):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)

### 2.5 Clean up the endpoint

In [None]:
# Delete the SageMaker endpoint
instruction_tuned_predictor.delete_model()
instruction_tuned_predictor.delete_endpoint()

## 3. Serving LoRA-based Llama 2 adapters with high performance on SageMaker

Now, we will demonstrate how to deploy multiple fine-tuned LoRA adapters with a single copy of the base model on SageMaker, using the DJL Serving Large Model Inference (LMI) DLC. LoRA (Low-Rank Adaptation) is a powerful technique for fine-tuning large language models. It significantly reduces the number of trainable parameters compared to traditional fine-tuning while often achieving comparable performance. You can learn more about the technique in the paper "LoRA: Low-Rank Adaptation of Large Language Models".
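
To give a feel for the reduction in trainable parameters, here is a small arithmetic sketch. LoRA keeps a `d x k` weight matrix frozen and trains two low-rank factors of shapes `d x r` and `r x k` instead. The matrix size of 4096 x 4096 and the rank of 8 are illustrative assumptions, not values read from any model in this notebook.

```python
def full_params(d, k):
    """Trainable parameters when a d x k weight matrix is updated directly."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for the LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Illustrative sizes (assumption): one 4096 x 4096 projection matrix, rank 8.
d, k, r = 4096, 4096, 8
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full}, lora: {lora}, reduction: {full // lora}x")
```

Because each adapter is this small relative to the base weights, many adapters can share one copy of the base model on the same endpoint.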


### 3.1 Install, import the required libraries; set some variables

In [None]:
!pip install sagemaker boto3 huggingface_hub awscli --upgrade --quiet --no-warn-conflicts

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from huggingface_hub import snapshot_download

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

In [None]:
model_bucket = sess.default_bucket()  # bucket to house model artifacts
s3_code_prefix = "hf-large-model-djl/multi-lora/Llama-2-7b-fp16/code"  # folder within bucket where model/code artifacts will go
region = sess.boto_region_name  # public attribute, preferred over the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

We will be deploying an endpoint with 3 LoRA adapters. These are the models we will be using:

- Base Model: https://huggingface.co/TheBloke/Llama-2-7B-Chat-fp16
- LoRA Fine Tuned Adapter 1: https://huggingface.co/UnderstandLing/llama-2-7b-chat-ru
- LoRA Fine Tuned Adapter 2: https://huggingface.co/UnderstandLing/llama-2-7b-chat-es
- LoRA Fine Tuned Adapter 3: https://huggingface.co/UnderstandLing/llama-2-7b-chat-fr

The core structure to cover here is the model directory. We include both the base model and LoRA adapters in the model directory like this:

```
|- model_dir
    |- adapters/
        |--- <adapter_1>/
        |--- <adapter_2>/
        |--- ...
        |--- <adapter_n>/
    |- serving.properties
    |- model.py (optional)

```

It is also possible to keep the model files in a separate S3 location by setting `option.model_id` in `serving.properties` to an S3 URI. In this case, the adapters directory can be located either alongside `serving.properties` or alongside the model files in S3.

Each folder in the `adapters` directory contains the artifacts of one LoRA adapter. Typically there are two files, `adapter_model.bin` and `adapter_config.json`, which hold the adapter weights and the adapter configuration respectively. These are typically produced by the PEFT library via the `PeftModel.save_pretrained()` method.
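
Before packaging the directory, it can help to check that every adapter folder is complete. The sketch below is a hypothetical local check, not part of the LMI container. It accepts `adapter_model.safetensors` as an alternative weights file on the assumption that some PEFT versions write that format instead of `adapter_model.bin`.

```python
import tempfile
from pathlib import Path

CONFIG_FILE = "adapter_config.json"
# Assumption: either of these file names may hold the adapter weights.
WEIGHT_FILES = ("adapter_model.bin", "adapter_model.safetensors")


def find_incomplete_adapters(model_dir):
    """Return the names of adapter folders missing a config or a weights file."""
    incomplete = []
    for adapter_dir in sorted(Path(model_dir, "adapters").iterdir()):
        if not adapter_dir.is_dir():
            continue
        has_config = (adapter_dir / CONFIG_FILE).is_file()
        has_weights = any((adapter_dir / name).is_file() for name in WEIGHT_FILES)
        if not (has_config and has_weights):
            incomplete.append(adapter_dir.name)
    return incomplete


# Build a throwaway directory tree with empty files to exercise the check.
demo_dir = Path(tempfile.mkdtemp()) / "lora-multi-adapter"
layout = {"es": [CONFIG_FILE, "adapter_model.bin"], "fr": [CONFIG_FILE]}
for adapter_name, file_names in layout.items():
    (demo_dir / "adapters" / adapter_name).mkdir(parents=True)
    for file_name in file_names:
        (demo_dir / "adapters" / adapter_name / file_name).touch()

print(find_incomplete_adapters(demo_dir))
```

In this demo tree the `fr` folder is reported because it has a configuration file but no weights.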

In [None]:
!rm -rf lora-multi-adapter
!mkdir -p lora-multi-adapter/adapters
!echo "Lora Multi Adapter Model" > lora-multi-adapter/README.txt

In [None]:
snapshot_download("UnderstandLing/llama-2-7b-chat-ru", local_dir="lora-multi-adapter/adapters/ru")

In [None]:
snapshot_download("UnderstandLing/llama-2-7b-chat-es", local_dir="lora-multi-adapter/adapters/es")

In [None]:
snapshot_download("UnderstandLing/llama-2-7b-chat-fr", local_dir="lora-multi-adapter/adapters/fr")

In [None]:
!rm -f model.tar.gz
!rm -rf lora-multi-adapter/.ipynb_checkpoints
!tar czvf model.tar.gz -C lora-multi-adapter .

In [None]:
s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", model_bucket, s3_code_prefix)

### 3.2 LMI configuration parameters and container

To optimize the deployment of Large Language Models (LLMs), one needs to choose the appropriate model partitioning framework, batching technique, batch size, tensor parallelism degree, etc. The right configuration depends on the use case.

Hence, based on the use case, you need to:
1. Set the configuration parameters for the container.
2. Select the appropriate container image to be used for inference.

SageMaker offers optimized [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that contain different frameworks for model parallelism, enabling inference of LLMs on multiple GPUs.

In this scenario, since we use `vllm` as the batching technique, we use the `djl-lmi` container (stored in the variable `deepspeed_image_uri` below), which includes frameworks such as DeepSpeed and vLLM.

In [None]:
deepspeed_image_uri = image_uris.retrieve(
    framework="djl-lmi", 
    region=sess.boto_session.region_name, 
    version="0.28.0"
)

env_generation = {"HUGGINGFACE_HUB_CACHE": "/tmp",
                  "TRANSFORMERS_CACHE": "/tmp",
                  "SERVING_LOAD_MODELS": "test::Python=/opt/ml/model",
                  "OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "max",
                  "OPTION_ROLLING_BATCH": "vllm",
                  "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
                  "OPTION_DTYPE": "fp16",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

In [None]:
# - Select the appropriate environment variable which will tune the deployment server.
env = env_generation # use this in case it is 'generation' task 
# - now we select the appropriate container 
inference_image_uri = deepspeed_image_uri # use this in case it is 'generation' task 
#inference_image_uri = trtllm_image_uri # enable this in case your use case is summarization ( high input and medium output sizes ) 

print(f"Environment variables are ---- > {env}")
print(f"Image going to be used is ---- > {inference_image_uri}")

To create the endpoint, the steps are:
- Create the model using the inference container image.

- Create the endpoint config with the instance type, the container startup health check timeout, and the routing strategy.

- Create the endpoint from the model and the endpoint config.

In this notebook we use the boto3 SDK. You can also use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/).

### 3.3 Create the Model
Use the `inference_image_uri` to create a model object. We will also use the least outstanding requests (LOR) [routing strategy](https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/), which is configured in the endpoint config in the next section. This SageMaker feature has been shown to reduce latency by 10% or more when multiple instances are configured to serve the endpoint.

In [None]:
model_name = sagemaker.utils.name_from_base("lmi-llama2-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
        "ModelDataUrl": s3_code_artifact_accelerate,

    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

### 3.4 Create an endpoint config
Create an endpoint configuration using the appropriate instance type. Set `ContainerStartupHealthCheckTimeoutInSeconds` to account for the time taken to download the LLM weights from S3 or the model hub, plus the time taken to load the model onto the GPUs.

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 500,
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        },
    ],
)
endpoint_config_response

### 3.5 Create an endpoint using the model and endpoint config

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

#### This step can take ~15 mins or longer

In [None]:
#
# Using helper function to wait for the endpoint to be ready
#
sess.wait_for_endpoint(endpoint_name)
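
If you prefer to control the wait yourself, the sketch below shows the kind of polling loop such a helper performs, built on the `describe_endpoint` call of the boto3 SageMaker client. The function `wait_until_in_service`, its defaults, and the `FakeSageMakerClient` stand-in (used so the example runs without AWS access) are all hypothetical.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30,
                          max_polls=60, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    for _ in range(max_polls):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not ready after {max_polls} polls")


class FakeSageMakerClient:
    """Stand-in for the boto3 SageMaker client so the sketch runs offline."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake_client = FakeSageMakerClient(["Creating", "Creating", "InService"])
# Skip the real sleep so the demo finishes immediately.
final_status = wait_until_in_service(fake_client, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

With a real client you would pass `sm_client` and `endpoint_name` from the cells above and keep the default `sleep`.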

### 3.6 Invoke the endpoint with a sample prompt

In [None]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Piensa en una excusa creativa para decir que no necesito ir a la fiesta."],
                     "adapters": ["es"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")
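
The request body above pairs one prompt with one adapter name. The sketch below builds such a body for several prompts, assuming the container expects exactly one adapter name per input; the helper `build_adapter_request` and the French prompt are made up for illustration.

```python
import json


def build_adapter_request(prompts, adapter_names):
    """Pair each prompt with the adapter that should serve it."""
    if len(prompts) != len(adapter_names):
        raise ValueError("Provide exactly one adapter name per prompt")
    return json.dumps({"inputs": list(prompts), "adapters": list(adapter_names)})


# Adapter names match the folder names created under lora-multi-adapter/adapters.
body = build_adapter_request(
    [
        "Piensa en una excusa creativa para decir que no necesito ir a la fiesta.",
        "Donne-moi une excuse pour ne pas aller au travail demain.",
    ],
    ["es", "fr"],
)
print(body)
```

The resulting string can be passed as `Body` to `invoke_endpoint` in the same way as the single-prompt example.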

### 3.7 Clean up the environment

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

# Appendix

### 1. Supported Inference Parameters

---
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 


### Notes
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 8K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For 8B and 70B models, we recommend setting `max_new_tokens` no greater than 1500 and 500 respectively, while keeping the total number of tokens less than 8K.
- In order to support an 8K context length, this model restricts query payloads to a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
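
The constraints above can be checked on the client before a request is sent. The helper `build_payload` below is a hypothetical sketch, not part of the SageMaker SDK. It always sets `max_new_tokens` (defaulting to 64, an arbitrary choice) and treats the bounds of `top_p` as inclusive, which is an assumption.

```python
def build_payload(prompt, max_new_tokens=64, temperature=None, top_p=None,
                  return_full_text=None):
    """Build an inference payload, enforcing the constraints listed above."""
    if (isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int)
            or max_new_tokens <= 0):
        raise ValueError("max_new_tokens must be a positive integer")
    parameters = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        if temperature <= 0:
            raise ValueError("temperature must be a positive float")
        parameters["temperature"] = float(temperature)
    if top_p is not None:
        if not 0 <= top_p <= 1:
            raise ValueError("top_p must be a float between 0 and 1")
        parameters["top_p"] = float(top_p)
    if return_full_text is not None:
        if not isinstance(return_full_text, bool):
            raise ValueError("return_full_text must be a boolean")
        parameters["return_full_text"] = return_full_text
    return {"inputs": prompt, "parameters": parameters}


payload = build_payload("I believe the meaning of life is",
                        max_new_tokens=64, top_p=0.9, temperature=0.6)
print(payload)
```

Parameters left as `None` are omitted from the payload, so any subset of them can be specified.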

---

### 2. Supported Hyper-parameters for fine-tuning
---
- epoch: The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default: 5
- learning_rate: The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default: 1e-4.
- instruction_tuned: Whether to instruction-train the model or not. Must be 'True' or 'False'. Default: 'False'
- per_device_train_batch_size: The batch size per GPU core/CPU for training. Must be a positive integer. Default: 4.
- per_device_eval_batch_size: The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default: 1
- max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of training samples. Must be a positive integer or -1. Default: -1. 
- max_val_samples: For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of validation samples. Must be a positive integer or -1. Default: -1. 
- max_input_length: Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default: -1. 
- validation_split_ratio: If validation channel is none, ratio of train-validation split from the train data. Must be between 0 and 1. Default: 0.2. 
- train_data_split_seed: If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default: 0.
- preprocessing_num_workers: The number of processes to use for the preprocessing. If None, main process is used for preprocessing. Default: "None"
- lora_r: Lora R. Must be a positive integer. Default: 8.
- lora_alpha: Lora Alpha. Must be a positive integer. Default: 32
- lora_dropout: Lora Dropout. Must be a positive float between 0 and 1. Default: 0.05. 
- int8_quantization: If True, model is loaded with 8 bit precision for training. Default for 8B: False. Default for 70B: True.
- enable_fsdp: If True, training uses Fully Sharded Data Parallelism. Default for 8B: True. Default for 70B: False.

Note 1: int8_quantization is not supported with FSDP. Also, int8_quantization = 'False' and enable_fsdp = 'False' is not supported due to CUDA memory issues on any of the g5 family instances. Thus, we recommend setting exactly one of int8_quantization or enable_fsdp to 'True'.

Note 2: Due to the size of the model, the 70B model cannot be fine-tuned with enable_fsdp = 'True' on any of the supported instance types.
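
The rule in Note 1 can be expressed as a small pre-flight check on a hyperparameter dictionary. This is a hypothetical local helper, separate from `hyperparameters.validate`. For simplicity it treats a missing key as `'False'`, whereas the real defaults depend on the model size as listed above.

```python
def check_memory_settings(hyperparams):
    """Return a list of problems with the int8_quantization / enable_fsdp pair."""
    # Simplification: a missing key counts as 'False' in this sketch.
    int8 = hyperparams.get("int8_quantization", "False") == "True"
    fsdp = hyperparams.get("enable_fsdp", "False") == "True"
    problems = []
    if int8 and fsdp:
        problems.append("int8_quantization is not supported with FSDP")
    if not int8 and not fsdp:
        problems.append("set exactly one of int8_quantization or enable_fsdp to 'True'")
    return problems


print(check_memory_settings({"int8_quantization": "False", "enable_fsdp": "True"}))
print(check_memory_settings({"int8_quantization": "True", "enable_fsdp": "True"}))
print(check_memory_settings({"int8_quantization": "False", "enable_fsdp": "False"}))
```

An empty list means the pair of settings is consistent with Note 1.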

---

### 3. Supported Instance types for fine-tuning Llama 3

---
We have tested our scripts on the following instance types for fine-tuning Llama 3:

| Model | Model ID | All Supported Instance Types for Fine-tuning |
| - | - | - |
| Llama 3 8B | meta-textgeneration-llama-3-8b | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge |
| Llama 3 8B Instruct | meta-textgeneration-llama-3-8b-instruct | ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge  |
| Llama 3 70B | meta-textgeneration-llama-3-70b | ml.g5.48xlarge, ml.p4d.24xlarge |
| Llama 3 70B Instruct | meta-textgeneration-llama-3-70b-instruct | ml.g5.48xlarge, ml.p4d.24xlarge |

Other instance types may also work to fine-tune. Note: When using p3 instances, training will be done with 32 bit precision as bfloat16 is not supported on these instances. Thus, training job would consume double the amount of CUDA memory when training on p3 instances compared to g5 instances.
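
The doubling follows directly from the number of bytes per parameter. The sketch below computes the memory needed for the weights alone of a model with 8 billion parameters; it ignores gradients, optimizer state, and activations, so actual training memory is higher.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 8_000_000_000  # Llama 3 8B, rounded
for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(num_params, dtype), "GB")
```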

---

### 4. A few notes about the fine-tuning method

---
- Fine-tuning scripts are based on [this repo](https://github.com/facebookresearch/llama-recipes/tree/main). 
- Instruction tuning dataset is first converted into domain adaptation dataset format before fine-tuning. 
- Fine-tuning scripts use Fully Sharded Data Parallel (FSDP) as well as the Low-Rank Adaptation (LoRA) method to fine-tune the models.

---

### 5. Extracting training performance metrics

The cell below fetches the metrics of the instruction tuning job from Section 2.3 so that they can be analyzed within the notebook.

In [None]:
from sagemaker import TrainingJobAnalytics

training_job_name = instruction_tuned_estimator.latest_training_job.job_name

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

### 6. LMI configuration parameters using environment variables
1. `SERVING_LOAD_MODELS` - specifies the engine that will be used for this workload. In this case we'll be hosting a model using the **Python** engine.

2. `OPTION_MODEL_ID`: Set this to the URI of the Amazon S3 bucket that contains the model. When this is set, the container leverages [s5cmd](https://github.com/peak/s5cmd) to download the model from s3. This enables faster deployments by utilizing optimized approach within the DJL inference container to transfer the model from S3 into the hosting instance.
If you want to download the model from huggingface.co, you can set `OPTION_MODEL_ID` to the model id of a pre-trained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository on huggingface.co.

3. `OPTION_TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model which will be started up when DJL Serving runs. In this example it is set to `max`, which uses all GPUs available on the instance; the `ml.g5.2xlarge` instance used here has a single GPU.

4. `OPTION_ROLLING_BATCH`: This parameter selects the batching technique used for continuous (iteration-level) batching, which merges multiple concurrent requests that arrive at different times for inference.
Scenarios that involve open-ended generation and chatbots need high throughput. [vLLM](https://arxiv.org/pdf/2309.06180.pdf) is a fast LLM inference and serving framework that uses techniques like PagedAttention and continuous batching to improve throughput. Hence, we set the `rolling_batch` parameter to `vllm`. When using `vllm`, you can also use some [additional parameters](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md#vllm).

5. `OPTION_MAX_ROLLING_BATCH_SIZE`: The maximum number of concurrent requests to be used in a batch by the model server for inference. Clients can still send more requests to the endpoint, they will be queued.

6. `OPTION_ENABLE_LORA`: This config enables support for LoRA adapters. Default: false.

7. `OPTION_MAX_LORAS`: This config determines the maximum number of LoRA adapters that can be run at once. Allocates more GPU memory for those adapters. Default: 4

8. `OPTION_MAX_LORA_RANK`: This config determines the maximum rank allowed for a LoRA adapter. Setting a larger value will enable more adapters at a greater memory usage cost. Default: 16

9. `OPTION_LORA_EXTRA_VOCAB_SIZE`: This config determines the maximum additional vocabulary that can be added through a LoRA adapter. Default: 256

10. `OPTION_MAX_CPU_LORAS`: This config determines the maximum number of LoRA adapters to cache in memory. All others will be evicted to disk. Default: None


For more information on the available options, please refer to the [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md)
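
The same settings can also be written to a `serving.properties` file. The sketch below converts `OPTION_*` environment variables into `option.*` lines, assuming the naming convention suggested by the `OPTION_MODEL_ID` / `option.model_id` pair mentioned earlier in this notebook. The function `env_to_serving_properties` is hypothetical, and variables without the `OPTION_` prefix are skipped.

```python
def env_to_serving_properties(env):
    """Convert OPTION_* environment variables to option.* property lines."""
    lines = []
    for name, value in env.items():
        if name.startswith("OPTION_"):
            key = name[len("OPTION_"):].lower()
            lines.append(f"option.{key}={value}")
    return "\n".join(lines)


sample_env = {
    "SERVING_LOAD_MODELS": "test::Python=/opt/ml/model",  # no OPTION_ prefix, skipped
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORA_RANK": "64",
}
print(env_to_serving_properties(sample_env))
```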


### 7. Domain adaptation fine-tuning

We also have domain adaptation fine-tuning enabled for Llama 3 models. Unlike instruction fine-tuning, you do not need to prepare an instruction-formatted dataset and can directly use unstructured text documents, as demonstrated below. However, a domain-adaptation fine-tuned model may not give responses as concise as an instruction-tuned model because of the less restrictive requirements on training data formats.

We will use financial text from SEC filings to fine-tune the Llama 3 8B model for financial applications. 

Here are the requirements for train and validation data.

- **Input**: A train and an optional validation directory. Each directory contains a CSV/JSON/TXT file.
    - For CSV/JSON files, the train or validation data is used from the column called 'text' or the first column if no column called 'text' is found.
    - The number of files under train and validation (if provided) should each equal one.
- **Output**: A trained model that can be deployed for inference.
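
The column rule for CSV files can be illustrated with the standard library. This is only a sketch of the selection logic described above, not the training script's actual code; the two CSV snippets are made up.

```python
import csv
import io


def extract_text_column(csv_text):
    """Return the 'text' column, or the first column if 'text' is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    column = "text" if "text" in reader.fieldnames else reader.fieldnames[0]
    return [row[column] for row in reader]


with_text = "id,text\n1,Revenue grew.\n2,Costs fell.\n"
without_text = "body,id\nRevenue grew.,1\nCosts fell.,2\n"

print(extract_text_column(with_text))
print(extract_text_column(without_text))
```

Both calls return the same two sentences: the first through the `text` column, the second by falling back to the first column.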

Below is an example of a TXT file for fine-tuning the text generation model. The TXT file contains Amazon's SEC filings from 2021 to 2022.

---
```
This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.

...
```
---
Amazon's SEC filings data is downloaded from the publicly available [EDGAR](https://www.sec.gov/edgar/searchedgar/companysearch) system. Instructions for accessing the data are shown [here](https://www.sec.gov/os/accessing-edgar-data).

#### 7.1. Preparing training data

The training data of SEC filing of Amazon has been pre-saved in the S3 bucket.

In [None]:
from sagemaker.jumpstart.utils import get_jumpstart_content_bucket

# Sample training data is available in this bucket
data_bucket = get_jumpstart_content_bucket(aws_region)
data_prefix = "training-datasets/sec_data"

training_dataset_s3_path = f"s3://{data_bucket}/{data_prefix}/train/"
validation_dataset_s3_path = f"s3://{data_bucket}/{data_prefix}/validation/"

#### 7.2. Prepare training parameters

We pick the `max_input_length` to be 2048 on `g5.12xlarge`. You can use higher input length on larger instance type.

In [None]:
from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

my_hyperparameters["epoch"] = "3"
my_hyperparameters["per_device_train_batch_size"] = "2"
my_hyperparameters["instruction_tuned"] = "False"
my_hyperparameters["max_input_length"] = "2048"
print(my_hyperparameters)

Validate hyperparameters

In [None]:
hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

#### 7.3. Starting training

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

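domain_adaptation_estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,  # same version the hyperparameters were validated against
    environment={"accept_eula": "true"},  # the model is gated, so the EULA must be accepted
    hyperparameters=my_hyperparameters,
    instance_type="ml.g5.12xlarge",
)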
domain_adaptation_estimator.fit(
    {"train": training_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True
)

Training performance metrics such as training loss and validation accuracy/loss can be accessed through CloudWatch during training. We can also fetch these metrics and analyze them within the notebook.

In [None]:
from sagemaker import TrainingJobAnalytics

training_job_name = domain_adaptation_estimator.latest_training_job.job_name

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

#### 7.4. Deploying inference endpoint

We deploy the domain-adaptation fine-tuned model

In [None]:
domain_adaptation_predictor = domain_adaptation_estimator.deploy()

In [None]:
domain_adaptation_predictor.delete_model()
domain_adaptation_predictor.delete_endpoint()