# Fine-tune Code Llama, Deploy, and Evaluate the Fine-tuning with the [human-eval Repository](https://github.com/openai/human-eval)

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

---

In this demo notebook, we demonstrate how to use the SageMaker Python SDK to fine-tune Code Llama, deploy it, and evaluate the fine-tuning performance with the [human-eval repository](https://github.com/openai/human-eval).

Below is the content of the notebook.

1. [Setup](#1.-Setup)
2. [Deploy model](#2.-Deploy-model)
3. [Fine-tune model with LoRA](#3.-Fine-tune-model-with-LoRA)
4. [Qualitatively evaluate the pre-trained and fine-tuned model](#4.1-Qualitatively-evaluate-the-pre-trained-and-fine-tuned-model)
5. [Quantitatively evaluate the pre-trained and fine-tuned model using Human-Eval repository](#4.2-Quantitatively-evaluation-using-Human-Eval-repository)

The notebook requires users to specify the following variables to start with.
* Specify `model_id` (default value: `meta-textgeneration-llama-codellama-7b`).
* Set the `accept_eula` argument to `True` in `model.deploy()` to accept the end-user license agreement (EULA) before deploying the model to an endpoint, because the Code Llama model is gated.
* Specify `"accept_eula": "true"` in the `environment` argument to accept the end-user license agreement (EULA) before fine-tuning.

## 1. Setup
First, upgrade to the latest sagemaker SDK to ensure all available models are deployable.

In [None]:
%pip install --quiet --upgrade sagemaker jmespath datasets

Select the desired model to deploy. The provided dropdown filters all text generation models available in SageMaker JumpStart.

In [None]:
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models


try:
    dropdown = Dropdown(
        options=list_jumpstart_models("search_keywords includes Text Generation"),
        value="meta-textgeneration-llama-codellama-7b",
        description="Select a JumpStart text generation model:",
        style={"description_width": "initial"},
        layout={"width": "max-content"},
    )
    display(dropdown)
except Exception:
    # Fall back to the default model ID if the dropdown cannot be rendered.
    dropdown = None

In [None]:
if dropdown:
    model_id = dropdown.value
else:
    model_id = "meta-textgeneration-llama-codellama-7b"
model_version = "*"

## 2. Deploy model

Create a `JumpStartModel` object, which initializes default model configurations conditioned on the selected instance type. JumpStart already sets a default instance type, but you can deploy the model on other instance types by passing `instance_type` to the `JumpStartModel` class.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel


model = JumpStartModel(model_id=model_id, model_version=model_version)

You can now deploy the model using SageMaker JumpStart. If the selected model is gated, you will need to accept the end-user license agreement (EULA) prior to deployment. This is accomplished by providing the `accept_eula=True` argument to the `deploy` method. The deployment might take a few minutes.

In [None]:
predictor = model.deploy(
    accept_eula=False
)  # please change `accept_eula` to be True to accept EULA.

### Invoke the endpoint

This section demonstrates how to invoke the endpoint using example payloads that are retrieved programmatically from the `JumpStartModel` object. You can replace these example payloads with your own payloads.

JumpStart stores model-specific default example payloads in its SDK. You can retrieve and view them using the following code.

In [None]:
example_payloads = model.retrieve_all_examples()

In [None]:
import jmespath


for payload in example_payloads:
    response = predictor.predict(payload.body)
    generated_text = jmespath.search(payload.raw_payload["output_keys"]["generated_text"], response)
    print("Input:\n", payload.body[payload.prompt_key])
    print("Output:\n", generated_text.strip())
    print("\n===============\n")

## 3. Fine-tune model with LoRA

### Dataset preparation for instruction fine-tuning

The training data must be formatted in JSON lines (`.jsonl`) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it can be split across multiple `.jsonl` files. The `.jsonl` file extension is mandatory. The training folder can also contain a `template.json` file describing the input and output formats. If no template file is given, the following template will be used:
  ```json
  {
    "prompt": "{prompt}",
    "completion": "{completion}"
  }
  ```

In this case, the data in the JSON lines entries must include `prompt` and `completion` fields. If a custom template is provided, it must also use `prompt` and `completion` keys to define the input and output templates. Below is a sample custom template:
  
  ```json
  {
    "prompt": "{system_prompt} \n\n### Input: {question}",
    "completion": " {response}"
  }
  ```
Here, each example in the JSON lines must include `system_prompt`, `question` and `response` fields.
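
As a minimal sketch of how such a template maps onto one JSON lines record, the snippet below fills the placeholders with Python's `str.format`. The record values and the `render_example` helper are hypothetical, and the actual preprocessing inside the JumpStart training container may differ.

```python
# Illustrative only: the sample custom template and one made-up JSON lines record.
template = {
    "prompt": "{system_prompt} \n\n### Input: {question}",
    "completion": " {response}",
}

record = {
    "system_prompt": "You are a helpful coding assistant.",
    "question": "Write a function that adds two numbers.",
    "response": "def add(a, b):\n    return a + b",
}


def render_example(template, record):
    """Fill the prompt and completion templates with fields from one record."""
    return {
        "prompt": template["prompt"].format(**record),
        "completion": template["completion"].format(**record),
    }


example = render_example(template, record)
print(example["prompt"])
print(example["completion"])
```

A record that lacks one of the placeholder fields raises a `KeyError` in this sketch, which is one way to spot malformed samples before uploading them.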

In this demo, we will use a subset of the [Dolphin-coder dataset](https://huggingface.co/datasets/cognitivecomputations/dolphin-coder) in an instruction tuning format. The dataset is available under the Apache 2.0 license.

In [None]:
from datasets import load_dataset


dolphin = load_dataset("cognitivecomputations/dolphin-coder", split="train")

# Split the dataset in two. With test_size=0.9, only 10% of the data is used for training;
# a few examples from the held-out portion are used for evaluation at the end.
train_and_test_dataset = dolphin.train_test_split(test_size=0.9, seed=0)

# Dumping the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")
train_and_test_dataset["test"].select(range(10)).to_json("test.jsonl")

In [None]:
train_and_test_dataset["train"][0]

Next, we prepare the prompt template used to process the data into an instruction format.

In [None]:
import json

template = {
    "prompt": """{system_prompt}

### Input:
{question}
""",
    "completion": " {response}",
}
with open("template.json", "w") as f:
    json.dump(template, f)
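
Before uploading, it can help to confirm that every record provides the fields the template refers to. The following is a minimal sketch using only the standard library; `template_fields`, `find_incomplete_lines`, and the demo records are hypothetical names and data introduced for illustration, not part of the SageMaker SDK.

```python
import json
from string import Formatter


def template_fields(template):
    """Return the set of placeholder names used in the prompt and completion templates."""
    fields = set()
    for text in template.values():
        for _, field_name, _, _ in Formatter().parse(text):
            if field_name:
                fields.add(field_name)
    return fields


def find_incomplete_lines(jsonl_lines, required_fields):
    """Return (line_number, missing_fields) for every record that lacks a required field."""
    problems = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        record = json.loads(line)
        missing = sorted(required_fields - set(record))
        if missing:
            problems.append((line_number, missing))
    return problems


# Hypothetical template and records, mirroring the structure used in this notebook.
demo_template = {
    "prompt": "{system_prompt}\n\n### Input:\n{question}\n",
    "completion": " {response}",
}
demo_lines = [
    json.dumps({"system_prompt": "s", "question": "q", "response": "r"}),
    json.dumps({"system_prompt": "s", "question": "q"}),  # missing "response"
]

required = template_fields(demo_template)
print(sorted(required))
print(find_incomplete_lines(demo_lines, required))
```

To check the real file, the lines of `train.jsonl` could be passed in place of `demo_lines`.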

### Upload dataset to S3

In [None]:
from sagemaker.s3 import S3Uploader
import sagemaker

output_bucket = sagemaker.Session().default_bucket()
local_data_file = "train.jsonl"
train_data_location = f"s3://{output_bucket}/dolphin_coder_dataset"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"Training data: {train_data_location}")

### Retrieve and customize hyperparameters

In [None]:
from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

print(my_hyperparameters)

In [None]:
my_hyperparameters["epoch"] = "1"
print(my_hyperparameters)

hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator


estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters=my_hyperparameters,
    environment={
        "accept_eula": "false"
    },  # please change `accept_eula` to be `true` to accept EULA.
)

estimator.fit({"training": train_data_location})

### Deploy the fine-tuned model
Next, we deploy the fine-tuned model. We will then compare the performance of the fine-tuned and pre-trained models.

In [None]:
finetuned_predictor = estimator.deploy()

## 4.1 Qualitatively evaluate the pre-trained and fine-tuned model
Next, we use the test data to evaluate the performance of the fine-tuned model and compare it with the pre-trained model. 


In [None]:
import pandas as pd
from IPython.display import display, HTML

test_dataset = load_dataset("json", data_files="test.jsonl")["train"]
prompt_inference = template["prompt"]
inputs, ground_truth_responses, responses_before_finetuning, responses_after_finetuning = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": prompt_inference.format(
            system_prompt=datapoint["system_prompt"], question=datapoint["question"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    pretrained_response = predictor.predict(payload)
    responses_before_finetuning.append(pretrained_response[0]["generated_text"])
    finetuned_response = finetuned_predictor.predict(payload)
    responses_after_finetuning.append(finetuned_response[0]["generated_text"])


try:
    for datapoint in test_dataset.select(range(5)):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)
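
To inspect exactly what is sent to each endpoint without invoking it, the prompt assembly can be reproduced offline. This is a minimal sketch: `build_payload` and the sample datapoint are hypothetical, and the mock response only imitates the list-of-dicts shape that the cell above indexes into.

```python
# The prompt template and response marker used in this notebook.
prompt_template = "{system_prompt}\n\n### Input:\n{question}\n"
response_marker = "\n\n### Response:\n"


def build_payload(datapoint, max_new_tokens=100):
    """Assemble the request body sent to the endpoint for one test example."""
    prompt = prompt_template.format(
        system_prompt=datapoint["system_prompt"], question=datapoint["question"]
    )
    return {
        "inputs": prompt + response_marker,
        "parameters": {"max_new_tokens": max_new_tokens},
    }


# Made-up datapoint for illustration.
datapoint = {
    "system_prompt": "You are a helpful coding assistant.",
    "question": "Reverse a string in Python.",
}
payload = build_payload(datapoint)
print(payload["inputs"])

# Hypothetical response in the shape [{"generated_text": ...}].
mock_response = [{"generated_text": "def reverse(s):\n    return s[::-1]"}]
print(mock_response[0]["generated_text"])
```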

## 4.2 Quantitatively evaluate the pre-trained and fine-tuned models using [Human-Eval repository](https://github.com/openai/human-eval)

Let's now evaluate whether our model has improved on the HumanEval benchmark from OpenAI. HumanEval is a standard benchmark for code generation models, built from hand-written Python problems. We will generate solutions to its 164 Python problems and then run a test suite on the solutions to produce a score. To read more, see [the official paper](https://arxiv.org/abs/2107.03374).
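
The score reported by HumanEval is pass@k. The paper linked above estimates it per problem as $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of generated samples and $c$ the number that pass the tests, and then averages over problems. The sketch below is an independent re-implementation of that formula for illustration, not the code shipped in the `human_eval` package; the function names are our own.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# With one sample per problem, pass@1 is simply the fraction of problems solved.
print(mean_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)], k=1))
```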

In [None]:
!pip3 install human_eval --quiet

In [None]:
from human_eval.evaluation import evaluate_functional_correctness
from human_eval.data import write_jsonl, read_problems
from tqdm import tqdm


def generate_one_completion(prompt, predictor):
    body = {"inputs": prompt, "parameters": {"max_new_tokens": 384, "temperature": 0.2}}

    response = predictor.predict(body)

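    # Strip the prompt in case the response echoes it, and keep only the text
    # before the first run of two blank lines.
    completion = (response[0]["generated_text"]).replace(prompt, "").split("\n\n\n")[0]
    # Remove Markdown code-fence markers in case the model wraps its answer in a code block.
    completion = completion.replace("```", "")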
    print(f"payload: {prompt}")
    print(f"completion: {completion}")
    return completion


# perform HumanEval
problems = read_problems()

num_samples_per_task = 1
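
The post-processing inside `generate_one_completion` can be tried offline on a made-up response. The sketch below repeats the same three string operations in a hypothetical helper, `clean_completion`, so their effect is visible without an endpoint; the sample prompt and generated text are invented.

```python
def clean_completion(generated_text, prompt):
    """Apply the same clean-up steps as generate_one_completion to a raw model output."""
    # Drop the prompt if the model echoed it back.
    completion = generated_text.replace(prompt, "")
    # Keep only the text before the first run of two blank lines.
    completion = completion.split("\n\n\n")[0]
    # Remove Markdown code-fence markers.
    return completion.replace("```", "")


sample_prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
sample_output = sample_prompt + "    return a + b\n\n\ndef unrelated():\n    pass\n"

print(repr(clean_completion(sample_output, sample_prompt)))
```

Truncating at two blank lines is a heuristic: it discards extra functions the model may keep generating, but it would also cut a correct solution that happens to contain two consecutive blank lines.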

Generate responses from the pre-trained and fine-tuned models for the 164 Python problems, starting with the pre-trained model.

In [None]:
samples = [
    dict(
        task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"], predictor)
    )
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("pretrained.jsonl", samples)

In [None]:
evaluate_functional_correctness("./pretrained.jsonl")
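
Conceptually, a sample counts as correct when the problem's prompt followed by the generated completion passes the problem's unit tests. The sketch below illustrates that idea on a toy problem that mimics the `prompt`, `test`, and `entry_point` fields of a HumanEval problem. It is a simplification under stated assumptions: `passes_tests` and `toy_problem` are hypothetical, and the real harness additionally applies timeouts and other safeguards.

```python
def passes_tests(problem, completion):
    """Return True if prompt + completion passes the problem's check function."""
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + "\n"
        + f"check({problem['entry_point']})"
    )
    namespace = {}
    try:
        # Toy example only: never execute untrusted model output outside a sandbox.
        exec(program, namespace)
        return True
    except Exception:
        return False


toy_problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}

print(passes_tests(toy_problem, "    return a + b\n"))  # correct completion
print(passes_tests(toy_problem, "    return a - b\n"))  # incorrect completion
```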

Now let's compare the pre-trained model with our new fine-tuned model!

In [None]:
samples = [
    dict(
        task_id=task_id,
        completion=generate_one_completion(problems[task_id]["prompt"], finetuned_predictor),
    )
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("fine-tuned.jsonl", samples)

In [None]:
evaluate_functional_correctness("./fine-tuned.jsonl")
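
Beyond the aggregate score, it can be useful to see which tasks changed outcome after fine-tuning. The sketch below assumes that each evaluation produces per-sample records containing `task_id` and `passed` fields (for example, loaded from the results file written next to each samples file); that layout, the `compare_results` helper, and the rows shown here are assumptions for illustration.

```python
def compare_results(pretrained_rows, finetuned_rows):
    """List the tasks whose pass/fail outcome differs between the two models."""
    before = {row["task_id"]: row["passed"] for row in pretrained_rows}
    after = {row["task_id"]: row["passed"] for row in finetuned_rows}
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [t for t in shared if not before[t] and after[t]],
        "regressed": [t for t in shared if before[t] and not after[t]],
    }


# Made-up per-sample results for three tasks.
pretrained_rows = [
    {"task_id": "HumanEval/0", "passed": True},
    {"task_id": "HumanEval/1", "passed": False},
    {"task_id": "HumanEval/2", "passed": True},
]
finetuned_rows = [
    {"task_id": "HumanEval/0", "passed": True},
    {"task_id": "HumanEval/1", "passed": True},
    {"task_id": "HumanEval/2", "passed": False},
]

print(compare_results(pretrained_rows, finetuned_rows))
```

With a single sample per task and a non-zero temperature, individual flips can be noise, so a handful of fixed or regressed tasks should be read with caution.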

### Clean up the endpoint
Don't forget to clean up resources when finished to avoid unnecessary charges.

In [None]:
predictor.delete_predictor()
finetuned_predictor.delete_predictor()

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|jumpstart-foundation-models|code-llama-fine-tuning-evaluate-human-eval.ipynb)
