# Fine-tune and host your model for code generation in Amazon SageMaker

# Introduction 
Publicly available code LLMs such as Code Llama are good at generating code that follows general programming language principles and syntax, but they may not align with an organization’s internal conventions or be aware of its proprietary libraries, coding standards and rubrics, and design best practices. In this notebook, we show how you can fine-tune a code LLM on private code bases to improve its contextual awareness and its usefulness for your organization’s needs.



# Solution overview and approach

In this demo notebook, we demonstrate how to use the SageMaker Python SDK to fine-tune, deploy, and evaluate Code Llama for code generation.
## Model details

Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets and sampling more data from that same dataset for longer. Code Llama features enhanced coding capabilities. It can generate code and natural language about code, from both code and natural language prompts (for example, “Write me a function that outputs the Fibonacci sequence”). You can also use it for code completion and debugging. It supports many of the most popular programming languages used today, including Python, C++, Java, PHP, Typescript (JavaScript), C#, Bash, and more.
Details about the Code Llama model can be found [here](https://huggingface.co/codellama).

Meta published Code Llama [performance benchmarks](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) on HumanEval and MBPP for common coding languages such as Python, Java, and JavaScript. On HumanEval, the Code Llama Python models scored from 38% for the 7B Python model to 57% for the 70B Python model. In addition, Code Llama models fine-tuned on the SQL programming language have shown better results, as evident in SQL evaluation benchmarks. These published benchmarks highlight the potential benefits of fine-tuning Code Llama models: better performance, customization, and adaptation to specific coding domains and tasks.

## Dataset

We use the Code Llama model as a baseline and fine-tune it on the [dolphin-coder dataset](https://huggingface.co/datasets/cognitivecomputations/dolphin-coder).

Instruction fine-tuning requires a specific dataset format. The training data must be in JSON Lines (`.jsonl`) format, where each line is a dictionary representing a data sample. Each sample in the JSON Lines files must include `system_prompt`, `question`, and `response` fields.
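
As a minimal sketch of this format, the snippet below serializes one record as a single JSON line and checks that the three required fields are present. The sample record and the helper name `validate_record` are illustrative assumptions; they are not part of the dataset or of the SageMaker SDK.

```python
import json

REQUIRED_FIELDS = ("system_prompt", "question", "response")

# Hypothetical sample; real records come from the dolphin-coder dataset.
sample = {
    "system_prompt": "You are a helpful coding assistant.",
    "question": "Write a Python function that adds two numbers.",
    "response": "def add(a, b):\n    return a + b",
}


def validate_record(line: str) -> dict:
    """Parse one JSON line and confirm it carries every required field."""
    record = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


# One record is one line: json.dumps escapes the newline inside `response`.
json_line = json.dumps(sample)
record = validate_record(json_line)
print(sorted(record))
```

Running a check like this over every line before uploading can catch malformed samples early, before a training job is started.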

## Fine-tuning techniques 
Language models such as Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly high CUDA memory. Furthermore, training these models can be very slow due to the size of the model. Therefore, for efficient fine-tuning, we use the following optimizations:

* Low-Rank Adaptation (LoRA)
This is a type of parameter-efficient fine-tuning (PEFT) for large models. With this method, you freeze the whole model and add only a small set of adjustable parameters or layers to it. For instance, instead of training all 7 billion parameters of Llama 2 7B, you can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because you need to store gradients, optimizer states, and other training-related information for only 1% of the parameters. It also reduces training time and cost.

* Int8 quantization 
Even with optimizations such as LoRA, models such as Llama 70B are still too big to train. To decrease the memory footprint during training, you can use Int8 quantization. Quantization typically reduces the precision of floating point data types. Although this decreases the memory required to store model weights, it can degrade performance due to loss of information. Int8 quantization uses only a quarter of the precision but doesn’t incur degradation of performance because it doesn’t simply drop the bits; it rounds the data from one type to another.


* Fully Sharded Data Parallel (FSDP) 
This is a type of data-parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance via communication and computation overlapping during training.
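
To make the Int8 rounding idea from the list above concrete, here is a minimal sketch of absmax quantization, one common Int8 scheme. It is an illustration under stated assumptions: the weight values are made up, and the training container may use a different quantization method.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using absmax scaling."""
    max_abs = max(abs(v) for v in values)
    scale = 127.0 / max_abs if max_abs else 1.0
    quantized = [round(v * scale) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the Int8 codes."""
    return [q / scale for q in quantized]


weights = [0.12, -0.98, 0.45, 0.003, -0.27]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"largest round-trip error: {max_error:.5f}")
```

Because values are rounded to the nearest code rather than truncated, the round-trip error of each value is bounded by half a quantization step.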

We will fine-tune the Code Llama model from SageMaker JumpStart using LoRA.
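
As a rough worked example of why LoRA trains so few parameters, the sketch below counts the weights added by rank-`r` adapters. The numbers are assumptions chosen for illustration (Llama 2 7B-like dimensions, adapters on two attention projections per layer, rank 8); the actual JumpStart defaults may differ.

```python
def lora_trainable_params(d_in, d_out, rank):
    """A rank-r update B @ A adds rank * (d_in + d_out) trainable weights."""
    return rank * (d_in + d_out)


# Illustrative assumptions, not values read from the JumpStart model.
hidden_size = 4096
num_layers = 32
adapted_matrices_per_layer = 2
rank = 8
total_params = 7_000_000_000

trainable = (
    lora_trainable_params(hidden_size, hidden_size, rank)
    * adapted_matrices_per_layer
    * num_layers
)
fraction = trainable / total_params
print(f"trainable parameters: {trainable:,}")
print(f"share of the full model: {fraction:.4%}")
```

Under these assumptions the adapters hold well under 1% of the model's parameters, which is why the gradient and optimizer state memory shrinks so much.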


## Notebook walkthrough

### High level architecture

![arch](./images/codellama.png)

We will walk through two different ways to host Code Llama models in SageMaker.

### Option 1 - Using a model from Hugging Face and hosting it on SageMaker

![huggingface](./images/huggingface.png)

### Option 2 - Using SageMaker JumpStart

![jumpstart](./images/jumpstart.png)

The notebook requires you to specify the following variables:
* Specify `model_id` (default value: `meta-textgeneration-llama-codellama-7b`).
* Set the `accept_eula` argument to `True` in `model.deploy()` to accept the end-user license agreement (EULA) before deploying the model to an endpoint, because the Code Llama model is gated.
* Specify `"accept_eula": "true"` in the `environment` argument to accept the EULA before fine-tuning.

We will also see evaluation and comparison.

So, let us get started!


## 1. Pre-req Setup
First, upgrade to the latest SageMaker Python SDK to ensure all available models are deployable.

In [None]:
%pip install --quiet --upgrade sagemaker jmespath datasets

# Option 1: Import the model from Hugging Face and host it on Amazon SageMaker

## 1. Import

In [None]:
import json
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

## Select the appropriate configuration parameters and container
To optimize the deployment of large language models (LLMs), you need to choose the appropriate model partitioning framework, batching technique, batch size, tensor parallelism degree, and so on. The right configuration depends on the use case.

Hence, based on the use case, you need to:
1. set the configuration parameters for the container.
2. select the appropriate container image to be used for inference.

### Set the configuration parameters using environment variables
1. `SERVING_LOAD_MODELS`: Specifies the engine that will be used for this workload. In this case we host the model using **MPI**, an engine that allows the model server to start distributed processes to load and serve the model.

2. `OPTION_MODEL_ID`: Set this to the URI of the Amazon S3 bucket that contains the model. When this is set, the container uses [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This speeds up deployment by using an optimized approach within the DJL inference container to transfer the model from S3 to the hosting instance.
If you want to download the model from huggingface.co, set `OPTION_MODEL_ID` to the model id of a pre-trained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co.

3. `OPTION_TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model which will be started up when DJL serving runs. In this example we use the `ml.g5.4xlarge` instance that has 1 GPU; this is set to `max` to utilize all the GPUs on the instance.

4. `OPTION_ROLLING_BATCH`: This parameter enables the use of a particular batching technique for continuous or iteration level batching to enable merging multiple concurrent requests that arrive at different times for inference. [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is a TensorRT Toolbox for Optimized Large Language Model Inference on Nvidia GPUs. To leverage this, we set this parameter to `trtllm`.

5. `OPTION_MAX_ROLLING_BATCH_SIZE`: The maximum number of concurrent requests to be used in a batch by the model server for inference. Clients can still send more requests to the endpoint; they will be queued.


For more information on the available options, please refer to the [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md).
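
The options above are passed to the container as environment variables, so every value must be a string. The sketch below assembles a subset of them with a hypothetical helper, `build_lmi_env`, which is not part of the SageMaker SDK or of DJL Serving; it only shows the shape of the resulting map.

```python
def build_lmi_env(model_id, rolling_batch, max_batch_size, dtype="fp16",
                  tensor_parallel_degree="max"):
    """Assemble container options as a string-to-string map."""
    options = {
        "OPTION_MODEL_ID": model_id,
        "OPTION_TENSOR_PARALLEL_DEGREE": tensor_parallel_degree,
        "OPTION_ROLLING_BATCH": rolling_batch,
        "OPTION_MAX_ROLLING_BATCH_SIZE": max_batch_size,
        "OPTION_DTYPE": dtype,
    }
    # Environment values are strings, so convert integers such as 32 here.
    return {key: str(value) for key, value in options.items()}


env_example = build_lmi_env("codellama/CodeLlama-7b-hf", "trtllm", 32)
print(env_example)
```

Keeping the conversion in one place avoids accidentally passing an integer batch size where a string is expected.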

In [None]:
env_trtllm = {"HUGGINGFACE_HUB_CACHE": "/tmp",
              "TRANSFORMERS_CACHE": "/tmp",
              "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
              "OPTION_MODEL_ID": "codellama/CodeLlama-7b-hf",
              "OPTION_TRUST_REMOTE_CODE": "true",
              "OPTION_TENSOR_PARALLEL_DEGREE": "max",
              "OPTION_ROLLING_BATCH": "trtllm",
              "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
              "OPTION_DTYPE":"fp16"
             }

We use the TensorRT-LLM container; for other containers, refer to [Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

In [None]:
trtllm_image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=sess.boto_session.region_name,
    version="0.26.0"
)

### When generating a large number of output tokens (> 1024), use the following configuration



In [None]:
env_lmidist = {"HUGGINGFACE_HUB_CACHE": "/tmp",
               "TRANSFORMERS_CACHE": "/tmp",
               "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
               "OPTION_MODEL_ID": "codellama/CodeLlama-7b-hf",
               "OPTION_TRUST_REMOTE_CODE": "true",
               "OPTION_TENSOR_PARALLEL_DEGREE": "max",
               "OPTION_ROLLING_BATCH": "lmi-dist",
               "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
               "OPTION_DTYPE":"fp16"
              }

deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", 
    region=sess.boto_session.region_name, 
    version="0.26.0"
)

In [None]:
# - Select the appropriate environment variable which will tune the deployment server.
#env = env_trtllm
env = env_lmidist # use this when generating tokens > 1024  

# - now we select the appropriate container 
inference_image_uri = deepspeed_image_uri # use this when generating tokens > 1024 
#inference_image_uri = trtllm_image_uri

print(f"Environment variables are ---- > {env}")
print(f"Image going to be used is ---- > {inference_image_uri}")

## 2. Host the model on Sagemaker

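To create the endpoint, the steps are:
1. Create the model using the inference container image.

2. Create the endpoint configuration using key parameters such as the instance type, the initial instance count, and the container startup health check timeout.

3. Create the endpoint from the endpoint configuration.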



In [None]:
# create the model
model_name = sagemaker.utils.name_from_base("lmi-codellama-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.4xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

#### This step can take ~15 mins or longer

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
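
The polling loop above waits indefinitely and does not distinguish a failed deployment from a successful one. A minimal sketch of a variant with a timeout and a failure check follows. The helper `wait_for_endpoint` is hypothetical; it takes any callable that returns a `describe_endpoint`-style dict, so it can be demonstrated here with a fake and no AWS calls.

```python
import time


def wait_for_endpoint(describe, poll_seconds=60, timeout_seconds=3600,
                      sleep=time.sleep):
    """Poll `describe()` until the endpoint leaves the Creating state.

    `describe` is any zero-argument callable that returns a dict with an
    "EndpointStatus" key, for example
    lambda: sm_client.describe_endpoint(EndpointName=endpoint_name).
    """
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe()["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"endpoint ended in status {status}")
    return status


# Offline demonstration with a fake describe function (no AWS calls).
fake_statuses = iter(["Creating", "Creating", "InService"])
result = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)},
    sleep=lambda seconds: None,
)
print(result)
```

Raising on any final status other than `InService` stops the notebook before the invocation cell runs against an endpoint that never came up.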

## 3. Test/Invoke the deployed LLM

In [None]:
prompt = "import socket\n\ndef ping_exponential_backoff(host: str):"
params = {"max_new_tokens": 256, "temperature": 0.1}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": params
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")
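
The cell above returns the raw JSON body as a string. The sketch below pulls out just the generated text. The response schema is an assumption: the helper accepts either a `{"generated_text": ...}` object or a list of such objects, and the sample bodies are made up, so adjust it to whatever your container actually returns.

```python
import json


def extract_generated_text(raw_body):
    """Return the generated text from a JSON response body.

    Assumes the body is either {"generated_text": ...} or a list of such
    objects; adjust this if your container returns a different schema.
    """
    parsed = json.loads(raw_body)
    if isinstance(parsed, list):
        parsed = parsed[0]
    return parsed["generated_text"]


# Hypothetical response bodies, for illustration only.
dict_body = b'{"generated_text": "    return 1"}'
list_body = b'[{"generated_text": "    return 2"}]'
print(extract_generated_text(dict_body))
print(extract_generated_text(list_body))
```

In the notebook, the argument would be the value read from `response_model["Body"]`. That stream can be read only once, so store the result in a variable before parsing it.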

## 4. Cleanup

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

# Option 2 - Using SageMaker JumpStart

## 2. Deploy model

Create a `JumpStartModel` object, which initializes default model configurations conditioned on the selected instance type. JumpStart already sets a default instance type, but you can deploy the model on other instance types by passing `instance_type` to the `JumpStartModel` class.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgeneration-llama-codellama-7b"
model_version = "*"

model = JumpStartModel(model_id=model_id, model_version=model_version)

You can now deploy the vanilla model using SageMaker JumpStart. If the selected model is gated, you need to accept the end-user license agreement (EULA) prior to deployment. This is accomplished by providing the `accept_eula=True` argument to the `deploy` method. The deployment might take a few minutes.

In [None]:
predictor = model.deploy(
    accept_eula=True
)  # `accept_eula=True` accepts the EULA for the gated model.

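JumpStart stores model-specific default example payloads in its SDK. You can retrieve and view them using the following code.

In [None]:
# View the default example payloads that JumpStart provides for this model.
for payload in model.retrieve_all_examples():
    print(payload.body)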

### Invoke the endpoint

This section demonstrates how to invoke the endpoint using example payloads that are retrieved programmatically from the `JumpStartModel` object. You can replace these example payloads with your own payloads.

In [None]:
example_payloads = model.retrieve_all_examples()

In [None]:
import jmespath


for payload in example_payloads:
    response = predictor.predict(payload.body)
    generated_text = jmespath.search(payload.raw_payload["output_keys"]["generated_text"], response)
    print("Input:\n", payload.body[payload.prompt_key])
    print("Output:\n", generated_text.strip())
    print("\n===============\n")

## 3. Fine-tune model with LoRA

### Dataset preparation for instruction fine-tuning

The training data must be in JSON Lines (`.jsonl`) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it can be split across multiple `.jsonl` files. The `.jsonl` file extension is mandatory. The training folder can also contain a `template.json` file describing the input and output formats. If no template file is given, the following template will be used:
  ```json
  {
    "prompt": "{prompt}",
    "completion": "{completion}"
  }
  ```

In this case, the data in the JSON lines entries must include `prompt` and `completion` fields. If a custom template is provided it must also use `prompt` and `completion` keys to define the input and output templates. Below is a sample custom template:
  
  ```json
  {
    "prompt": "{system_prompt} \n\n### Input: {question}",
    "completion": " {response}"
  }
  ```
Here, each example in the JSON lines must include `system_prompt`, `question` and `response` fields.
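
To illustrate how such a template relates to a record, the sketch below fills the placeholders with Python's `str.format`. This only demonstrates the placeholder substitution; how the training job applies the template internally is not shown here, and the record is a made-up example.

```python
template_example = {
    "prompt": "{system_prompt} \n\n### Input: {question}",
    "completion": " {response}",
}

# Hypothetical record with the three fields the template refers to.
example_record = {
    "system_prompt": "You are a helpful coding assistant.",
    "question": "Reverse a string in Python.",
    "response": "def reverse(s):\n    return s[::-1]",
}

# str.format ignores unused keyword arguments, so the full record can be
# passed to both the prompt and the completion template.
rendered_prompt = template_example["prompt"].format(**example_record)
rendered_completion = template_example["completion"].format(**example_record)
print(rendered_prompt)
print(rendered_completion)
```

If a record lacks a field that the template names, `str.format` raises a `KeyError`, which is a quick way to spot a mismatch between the data and the template.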

In this demo, we use a subset of the [Dolphin-coder dataset](https://huggingface.co/datasets/cognitivecomputations/dolphin-coder) in an instruction tuning format. The dataset is available under the Apache 2.0 license.

In [None]:
from datasets import load_dataset


dolphin = load_dataset("cognitivecomputations/dolphin-coder", split="train")

# Split the dataset: 10% becomes the training subset used in this demo and the
# remaining 90% is held out, from which a few samples are used for evaluation at the end.
train_and_test_dataset = dolphin.train_test_split(test_size=0.9, seed=0)

# Dump the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")
# Save 10 held-out samples for the evaluation step.
train_and_test_dataset["test"].select(range(10)).to_json("test.jsonl")

In [None]:
train_and_test_dataset["train"][0]

Next, we prepare the prompt template used to process the data into an instruction format.

In [None]:
import json

template = {
    "prompt": """{system_prompt}

### Input:
{question}
""",
    "completion": " {response}",
}
with open("template.json", "w") as f:
    json.dump(template, f)

### Upload dataset to S3

In [None]:
from sagemaker.s3 import S3Uploader
import sagemaker

output_bucket = sagemaker.Session().default_bucket()
local_data_file = "train.jsonl"
train_data_location = f"s3://{output_bucket}/dolphin_coder_dataset"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"Training data: {train_data_location}")

### Retrieve and customize hyperparameters

In [None]:
from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

print(my_hyperparameters)

In [None]:
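my_hyperparameters["epoch"] = "1"
# The data is in instruction format, so enable instruction tuning. This assumes the
# model exposes an `instruction_tuned` hyperparameter; check the defaults printed above.
my_hyperparameters["instruction_tuned"] = "True"
print(my_hyperparameters)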

hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator


estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters=my_hyperparameters,
    environment={
        "accept_eula": "true"
    },  # please change `accept_eula` to be `true` to accept EULA.
)

estimator.fit({"training": train_data_location})

### Deploy the fine-tuned model
Next, we deploy the fine-tuned model. We will compare the performance of the fine-tuned and pre-trained models.

In [None]:
finetuned_predictor = estimator.deploy()

# Evaluation and comparison

## 4.1 Qualitatively evaluate the pre-trained and fine-tuned models
Next, we use the test data to evaluate the performance of the fine-tuned model and compare it with the pre-trained model.


In [None]:
import pandas as pd
from IPython.display import display, HTML

test_dataset = load_dataset("json", data_files="test.jsonl")["train"]
prompt_inference = template["prompt"]
inputs, ground_truth_responses, responses_before_finetuning, responses_after_finetuning = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": prompt_inference.format(
            system_prompt=datapoint["system_prompt"], question=datapoint["question"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    pretrained_response = predictor.predict(payload)
    responses_before_finetuning.append(pretrained_response[0]["generated_text"])
    finetuned_response = finetuned_predictor.predict(payload)
    responses_after_finetuning.append(finetuned_response[0]["generated_text"])


try:
    for i, datapoint in enumerate(test_dataset.select(range(5))):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)

## 4.2 Quantitatively evaluate the pre-trained and fine-tuned models using [Human-Eval repository](https://github.com/openai/human-eval)

Let's now evaluate whether our model has improved on the HumanEval benchmark from OpenAI. HumanEval is a standard benchmark for code generation models that was created using hand-written Python problems. We will generate solutions to 164 Python problems and then run a test suite on the solutions to generate a score. If you want to read more, [here is the official paper](https://arxiv.org/abs/2107.03374).
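
The score reported by HumanEval is pass@k. As a minimal sketch, the function below implements the unbiased estimator described in the paper linked above, $1 - \binom{n-c}{k} / \binom{n}{k}$, for a single problem; the function name and the sample inputs are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass the tests,
    k: number of attempts the metric allows.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With one sample per problem, pass@1 is 1.0 if it passes and 0.0 if not,
# so the benchmark score is simply the fraction of problems solved.
print(pass_at_k(n=1, c=1, k=1))
print(pass_at_k(n=1, c=0, k=1))
print(pass_at_k(n=10, c=3, k=1))
```

This notebook generates one sample per task, so the reported pass@1 is the share of the 164 problems whose single completion passes all tests.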

In [None]:
!pip3 install human_eval --quiet

In [None]:
from human_eval.evaluation import evaluate_functional_correctness
from human_eval.data import write_jsonl, read_problems
from tqdm import tqdm


def generate_one_completion(prompt, predictor):
    body = {"inputs": prompt, "parameters": {"max_new_tokens": 384, "temperature": 0.2}}

    response = predictor.predict(body)

    # Remove the prompt if it is echoed in the response, then keep only the
    # text before the first run of two blank lines.
    completion = (response[0]["generated_text"]).replace(prompt, "").split("\n\n\n")[0]
    # Remove code-fence markers if the model produced a Markdown code block.
    completion = completion.replace("```", "")
    print(f"payload: {prompt}")
    print(f"completion: {completion}")
    return completion


# perform HumanEval
problems = read_problems()

num_samples_per_task = 1
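
The post-processing inside `generate_one_completion` can be checked without an endpoint. The sketch below isolates the same three steps in a hypothetical pure function, `clean_completion`, and runs it on a made-up model output.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def clean_completion(generated_text, prompt):
    """Apply the same post-processing steps as generate_one_completion."""
    # Drop the prompt if the endpoint echoes it back.
    completion = generated_text.replace(prompt, "")
    # Keep only the text before the first run of two blank lines.
    completion = completion.split("\n\n\n")[0]
    # Remove Markdown code-fence markers.
    return completion.replace(FENCE, "")


# Hypothetical raw model output that echoes the prompt.
demo_prompt = "def add(a, b):\n"
raw_output = demo_prompt + "    return a + b\n\n\nprint(add(1, 2))"
cleaned = clean_completion(raw_output, demo_prompt)
print(repr(cleaned))
```

Truncating at two blank lines is a heuristic for cutting off text the model generates after the function body; it can also cut a valid completion short if the body itself contains two consecutive blank lines.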

Generate responses from the pre-trained and fine-tuned models for the 164 Python problems, starting with the pre-trained model.

In [None]:
samples = [
    dict(
        task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"], predictor)
    )
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("pretrained.jsonl", samples)

In [None]:
evaluate_functional_correctness("./pretrained.jsonl")

Now let's compare the pre-trained model with our new fine-tuned model!

In [None]:
samples = [
    dict(
        task_id=task_id,
        completion=generate_one_completion(problems[task_id]["prompt"], finetuned_predictor),
    )
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("fine-tuned.jsonl", samples)

In [None]:
evaluate_functional_correctness("./fine-tuned.jsonl")
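
To compare the two runs side by side, the results can be reduced to a per-metric difference. The sketch below assumes `evaluate_functional_correctness` returns a dict keyed by metric name such as `pass@1`; the helper `compare_scores` is hypothetical and the scores are placeholder values for illustration only, not measured results.

```python
def compare_scores(pretrained, finetuned):
    """Return the per-metric change (fine-tuned minus pre-trained)."""
    shared = sorted(set(pretrained) & set(finetuned))
    return {metric: finetuned[metric] - pretrained[metric] for metric in shared}


# Placeholder values for illustration only; these are not measured results.
pretrained_scores = {"pass@1": 0.25}
finetuned_scores = {"pass@1": 0.5}
delta = compare_scores(pretrained_scores, finetuned_scores)
for metric, change in delta.items():
    print(f"{metric}: {change:+.3f}")
```

In the notebook, the two inputs would be the values returned by the two `evaluate_functional_correctness` calls, captured in variables.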

### Clean up the endpoint
Don't forget to clean up resources when finished to avoid unnecessary charges.

In [None]:
predictor.delete_predictor()
finetuned_predictor.delete_predictor()

# Conclusion

In this notebook, we saw how you can fine-tune and host an LLM for code generation in Amazon SageMaker. You can iterate further with your own dataset, different LLMs, and different fine-tuning techniques and hyperparameters.