# Evaluation

Now we'll evaluate our fine-tuned LLM to see how well it performs on our task.

<div class="alert alert-info">
<b>Here is the roadmap for this notebook:</b>

<ul>
    <li><b>Part 1:</b> Overview of LLM Evaluation</li>
    <li><b>Part 2:</b> Loading Test Data</li>
    <li><b>Part 3:</b> Forming our Inputs and Outputs</li>
    <li><b>Part 4:</b> Running Model Inference</li>
    <li><b>Part 5:</b> Generating Evaluation Metrics</li>
    <li><b>Part 6:</b> Comparing with a Baseline Model</li>
</ul>

</div>


## Imports

In [None]:
import os
import re
from typing import Any, Optional

import anyscale
import numpy as np
import pandas as pd
import ray

from rich import print as rprint
from transformers import AutoTokenizer
from vllm.lora.request import LoRARequest
from vllm import LLM, SamplingParams

In [None]:
ctx = ray.data.DataContext.get_current()
ctx.enable_operator_progress_bars = False
ctx.enable_progress_bars = False

## 0. Overview of LLM Evaluation

Here are the main steps for evaluating a language model:

1. Prepare Evaluation Data:
    1. Get data representative of the task you want to evaluate the model on.
    2. Prepare it in the proper format for the model.
2. Generate responses using your LLM
    1. Run batch inference on the evaluation data.
3. Produce evaluation metrics
    1. Choose a metric based on the model's output.
    2. Compare the model's performance to a baseline model to see if it's better.

Here is a diagram of the evaluation process:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/e2e-llms/evaluation_metrics_v3.png" width="700">
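
The three steps above can be sketched in plain Python. This is only a minimal illustration, not the pipeline used in this notebook: `stub_model`, `always_zero`, and the two example records are hypothetical stand-ins for a real LLM, a baseline model, and a real test set.

```python
from typing import Callable

# Step 1: hypothetical evaluation records, each with a prompt and the expected answer.
eval_data = [
    {"prompt": "2 + 2", "expected": "4"},
    {"prompt": "3 + 5", "expected": "8"},
]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: answers 'a + b' prompts."""
    left, right = prompt.split(" + ")
    return str(int(left) + int(right))


def always_zero(prompt: str) -> str:
    """Hypothetical baseline that ignores the prompt."""
    return "0"


def evaluate(model: Callable[[str], str], records: list[dict]) -> float:
    # Step 2: generate a response for every record.
    responses = [model(record["prompt"]) for record in records]
    # Step 3: compute a metric, here exact-match accuracy.
    matches = [resp == rec["expected"] for resp, rec in zip(responses, records)]
    return sum(matches) / len(matches)


model_accuracy = evaluate(stub_model, eval_data)
baseline_accuracy = evaluate(always_zero, eval_data)
print(f"model: {model_accuracy:.0%}, baseline: {baseline_accuracy:.0%}")
```

The rest of the notebook follows the same shape, with Ray Data handling the data, vLLM generating the responses, and task-specific accuracy metrics replacing the exact-match check.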



## 1. Load model artifacts

Now that our fine-tuning is complete, we can load the model artifacts from cloud storage to local [cluster storage](https://docs.anyscale.com/workspaces/storage/#cluster-storage) to use for other workloads.

To retrieve information about your fine-tuned model, Anyscale provides a convenient model registry SDK.

<b style="background-color: yellow;">&nbsp;🔄 REPLACE&nbsp;</b>: Use the job ID of your fine-tuning run

In [None]:
model_info = anyscale.llm.model.get(job_id="prodjob_123") # REPLACE with the job ID for your fine-tuning run
rprint(model_info)

Let's extract the model ID from the model info.

In [None]:
model_id = model_info.id

We will download the artifacts from the cloud storage bucket to our local cluster storage.

In [None]:
s3_storage_uri = (
    f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}"
    f"/lora_fine_tuning/{model_id}"
)
# s3_storage_uri = model_info.storage_uri 
s3_path_wo_bucket = '/'.join(s3_storage_uri.split('/')[3:])

local_artifacts_dir = "/mnt/cluster_storage"
local_artifacts_path = os.path.join(local_artifacts_dir, s3_path_wo_bucket)
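
To make the path handling above concrete, here is a small sketch of how the bucket is stripped from the URI. The URI below is a made-up example; the real value depends on your `ANYSCALE_ARTIFACT_STORAGE` setting and model ID.

```python
import os

# Hypothetical artifact URI, for illustration only.
example_uri = "s3://example-bucket/org_123/lora_fine_tuning/my-model"

# "s3://bucket/key".split("/") yields ["s3:", "", "bucket", <key parts...>],
# so dropping the first three items leaves only the object key.
parts = example_uri.split("/")
key_without_bucket = "/".join(parts[3:])

# The key is then mirrored under the local cluster storage directory.
local_path = os.path.join("/mnt/cluster_storage", key_without_bucket)
print(local_path)
```

Mirroring the key under `/mnt/cluster_storage` keeps the local layout aligned with the bucket layout, which makes it easy to tell which artifacts belong to which model.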

In [None]:
!aws s3 sync {s3_storage_uri} {local_artifacts_path}

<div class="alert alert-block alert-info">

<b>Backup:</b> If you don't have access to a successful fine-tuning job, you can download the artifacts by running this code in a Python cell.

```python
model_id = "mistralai/Mistral-7B-Instruct-v0.1:aitra:qzoyg"
local_artifacts_path = f"/mnt/cluster_storage/llm-finetuning/lora_fine_tuning/{model_id}"
!aws s3 sync s3://anyscale-public-materials/llm-finetuning/lora_fine_tuning/{model_id} {local_artifacts_path}
```

</div>

## 2. Reading the test data

Let's start by reading the test data to evaluate our fine-tuned LLM. This test data has undergone the same preparation process as the training data, so it already follows the expected schema.

In [None]:
test_ds = (
    ray.data.read_json(
        "s3://anyscale-public-materials/llm-finetuning/viggo_inverted/test/data.jsonl"
    )
)
test_ds

In [None]:
test_ds = test_ds.limit(100)  # Limit to 100 rows to save time; this is still enough for a meaningful evaluation.

<div class="alert alert-block alert-warning">

<b>NOTE:</b> It is important to split the dataset into a train, validation, and test set. The test set should be used only for evaluation purposes. The model should not be trained or tuned on the test set.

</div>
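
As a rough illustration of the split described in the note, the sketch below shuffles a list of records once with a fixed seed and cuts it into three disjoint parts. The function name, the fractions, and the integer records are assumptions made for this example; the dataset used in this notebook was already split ahead of time.

```python
import random


def split_dataset(records, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train/val/test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set


# Hypothetical records; in practice these would be conversations.
records = list(range(100))
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))
```

Fixing the seed makes the split reproducible, so the test set stays the same across runs and never leaks into training.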

## 3. Forming our Inputs and Outputs

Let's split the test data into inputs and outputs. Our inputs are the "system" and "user" prompts, and the outputs are the responses generated by the "assistant".


In [None]:
def split_inputs_outputs(row):
    row["input_messages"] = [
        message for message in row["messages"] if message["role"] != "assistant"
    ]
    row["output_messages"] = [
        message for message in row["messages"] if message["role"] == "assistant"
    ]
    del row["messages"]
    return row

test_ds_inputs_outputs = test_ds.map(split_inputs_outputs)
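
To see what this transformation does to a single row without running Ray, here is a standalone sketch. The function is repeated so the block runs on its own, and the message contents are invented for illustration rather than taken from the test set.

```python
def split_inputs_outputs(row):
    # Everything except the assistant turn is model input.
    row["input_messages"] = [
        message for message in row["messages"] if message["role"] != "assistant"
    ]
    # The assistant turn is the expected output.
    row["output_messages"] = [
        message for message in row["messages"] if message["role"] == "assistant"
    ]
    del row["messages"]
    return row


# Hypothetical row in the chat-message schema.
row = {
    "messages": [
        {"role": "system", "content": "Convert the sentence into a functional representation."},
        {"role": "user", "content": "Portal 2 is a puzzle game released in 2011."},
        {"role": "assistant", "content": "inform(name[Portal 2], release_year[2011], genres[puzzle])"},
    ]
}

result = split_inputs_outputs(row)
print([message["role"] for message in result["input_messages"]])
print(result["output_messages"][0]["content"])
```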

Let's inspect a sample batch

In [None]:
sample_batch = test_ds_inputs_outputs.take_batch(1)
sample_batch

We fetch the base model files from an S3 bucket instead of the Hugging Face Hub, which is closer to what you would do in a production environment.

In [None]:
base_model = "/mnt/cluster_storage/mistralai--Mistral-7B-Instruct-v0.1/"

In [None]:
!aws s3 sync "s3://anyscale-public-materials/llm/mistralai--Mistral-7B-Instruct-v0.1/" {base_model} --region us-west-2

We'll load the appropriate tokenizer to apply to our input data.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

A tokenizer encodes the input text into a list of token ids that the model can understand.

In [None]:
tokenizer.encode("Hello there")

The token ids are simply the indices of the tokens in the model's vocabulary.

In [None]:
tokenizer.tokenize("Hello there", add_special_tokens=True)
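
The relationship between tokens and token ids can be illustrated with a toy example. This sketch uses a made-up five-entry vocabulary and whitespace splitting; a real tokenizer such as the one loaded above uses a much larger learned subword vocabulary, so the ids will not match.

```python
# Hypothetical vocabulary: a token's id is simply its index in this list.
vocab = ["<unk>", "<s>", "hello", "there", "world"]
token_to_id = {token: idx for idx, token in enumerate(vocab)}


def toy_encode(text: str) -> list[int]:
    """Prepend a start token, split on whitespace, and look up each token's id."""
    tokens = ["<s>"] + text.lower().split()
    return [token_to_id.get(token, token_to_id["<unk>"]) for token in tokens]


def toy_decode(ids: list[int]) -> list[str]:
    """Map ids back to tokens by indexing into the vocabulary."""
    return [vocab[i] for i in ids]


ids = toy_encode("Hello there")
print(ids)
print(toy_decode(ids))
```

Words missing from the toy vocabulary fall back to `<unk>`; subword tokenizers largely avoid this by breaking unknown words into smaller known pieces.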

In addition to tokenizing, we will need to convert the prompt into the template format that the model expects.

In [None]:
tokenizer.apply_chat_template(
    conversation=sample_batch["input_messages"][0],
    add_generation_prompt=True,
    tokenize=False,  # return the formatted prompt string rather than token ids
)

To apply the prompt template and tokenize the input data, we'll use the following stateful transformation:

In [None]:
class MistralTokenizer:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(base_model)

    def __call__(self, row: dict[str, Any]) -> dict[str, Any]:
        row["input_tokens"] = self.tokenizer.apply_chat_template(
            conversation=row["input_messages"],
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="np",
        ).squeeze()
        return row


test_ds_inputs_tokenized = test_ds_inputs_outputs.map(
    MistralTokenizer,
    concurrency=2,
)
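
The reason for passing a class rather than a function is that Ray Data constructs the class once per worker and then calls the instance for every row, so the tokenizer is loaded only once. The sketch below imitates that pattern in plain Python, without Ray; `ExpensiveResource` and `StatefulTransform` are hypothetical names used only for this illustration.

```python
class ExpensiveResource:
    """Hypothetical stand-in for a tokenizer or model that is slow to load."""

    load_count = 0  # counts how many times the resource has been constructed

    def __init__(self):
        ExpensiveResource.load_count += 1

    def process(self, text: str) -> str:
        return text.upper()


class StatefulTransform:
    def __init__(self):
        # Runs once per instance: the costly setup happens here.
        self.resource = ExpensiveResource()

    def __call__(self, row: dict) -> dict:
        # Runs once per row and reuses the already-loaded resource.
        row["processed"] = self.resource.process(row["text"])
        return row


transform = StatefulTransform()
rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
results = [transform(row) for row in rows]
print(ExpensiveResource.load_count, [r["processed"] for r in results])
```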

In [None]:
sample_tokenized_batch = test_ds_inputs_tokenized.take_batch(1)
sample_tokenized_batch["input_tokens"][0].shape

We can then proceed to materialize the dataset.

In [None]:
test_ds_inputs_tokenized = test_ds_inputs_tokenized.materialize()

Materializing the dataset is useful when we want to compute statistics on the tokens, such as the maximum input token length.

In [None]:
def compute_token_length(row: dict) -> dict:
    row["token_length"] = len(row["input_tokens"])
    return row

max_input_length = test_ds_inputs_tokenized.map(compute_token_length).max(on="token_length")
max_input_length

## 5. Running Model Inference



#### Quick Intro to vLLM

vLLM is a library for high-throughput LLM inference. It achieves this through several performance optimizations, primarily:

* Efficient management of attention key and value memory with PagedAttention 
* Fast model execution with CUDA/HIP graphs
* Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
* Optimized CUDA kernels



vLLM makes available an `LLM` class which can be called along with sampling parameters to generate outputs.

Here is how we can build a stateful transformation to perform batch inference on our test data:

In [None]:
class LLMPredictor:
    def __init__(
        self, hf_model: str, sampling_params: SamplingParams, lora_path: Optional[str] = None
    ):
        # 1. Load the LLM
        self.llm = LLM(
            model=hf_model,
            enable_lora=bool(lora_path),
            gpu_memory_utilization=0.95,
            kv_cache_dtype="fp8",
        )

        self.sampling_params = sampling_params
        # 2. Prepare a LoRA request if a LoRA path is provided
        self.lora_request = (
            LoRARequest(
                lora_name="lora_adapter", lora_int_id=1, lora_local_path=lora_path
            )
            if lora_path
            else None
        )

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        # 3. Generate outputs
        responses = self.llm.generate(
            prompt_token_ids=[ids.squeeze().tolist() for ids in batch["input_tokens"]],
            sampling_params=self.sampling_params,
            lora_request=self.lora_request,
        )

        return {
            "prompt": [
                " ".join(message["content"] for message in messages)
                for messages in batch["input_messages"]
            ],
            "expected_output": [
                message["content"]
                for messages in batch["output_messages"]
                for message in messages
            ],
            "generated_text": [resp.outputs[0].text for resp in responses],
        }


We then apply the transformation like so:

In [None]:
sampling_params = SamplingParams(temperature=0, max_tokens=1024, detokenize=True)

test_ds_responses = test_ds_inputs_tokenized.map_batches(
    LLMPredictor,
    fn_constructor_kwargs={
        "hf_model": base_model,
        "sampling_params": sampling_params,
        "lora_path": local_artifacts_path,
    },
    concurrency=1,  # number of LLM instances
    num_gpus=1,  # GPUs per LLM instance
    batch_size=40,
)

test_ds_responses = test_ds_responses.materialize()

<div class="alert alert-block alert-warning">

<b>Note:</b> Running inference can take a long time depending on the size of the dataset and the model. Additional time may be required for the cluster to automatically scale up to handle the workload.

</div>

In [None]:
sample_response = test_ds_responses.take_batch(2)
sample_response

<div class="alert alert-block alert-info">

### Activity: Find the optimal batch size

To run batch inference efficiently, we should always look to maximize our hardware utilization. 

To that end, you need to find the batch size that will maximize our GPU memory usage. 

Hint: make use of the metrics tab to look at the hardware utilization and iteratively find your batch size.


```python
test_ds_inputs_tokenized.map_batches(
    LLMPredictor,
    fn_constructor_kwargs={
        "hf_model": base_model,
        "sampling_params": sampling_params,
        "lora_path": local_artifacts_path,
    },
    concurrency=1,  
    num_gpus=1,  
    batch_size=40, # Hint: find the optimal batch size.
).materialize()
```




</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">

<details>

<summary> Click here to see the solution </summary>

```python
test_ds_inputs_tokenized.map_batches(
    LLMPredictor,
    fn_constructor_kwargs={
        "hf_model": base_model,
        "sampling_params": sampling_params,
        "lora_path": local_artifacts_path,
    },
    concurrency=1,  
    num_gpus=1, 
    batch_size=70,
).materialize()
```

</details>

</div>

## 6. Generating Evaluation Metrics

Depending on your task, you will want to choose the proper evaluation metric. 

In our functional representation task, the output is constrained to a limited set of categories, so standard classification evaluation metrics are a good choice.

In more open-ended response generation tasks, you might want to consider making use of an LLM as a judge to generate a scoring metric.

### Post-processing the responses

We will evaluate the accuracy at two levels:
- accuracy of predicting the correct function type
- accuracy of predicting the correct attribute types (a much more difficult task)

Let's post-process the outputs to extract the ground-truth and model-predicted function types and attribute types.

In [None]:
def extract_function_type(response: str) -> Optional[str]:
    """Extract the function type from the response."""
    if response is None:
        return None

    # pattern to match is "{function_type}({attributes})"
    expected_pattern = re.compile(r"^(?P<function_type>.+?)\((?P<attributes>.+)\)$")

    # remove any "Output: " prefix and strip the response
    match = expected_pattern.match(response.split("Output: ")[-1].strip())

    if match is None:
        return None

    # return the function type
    ret = match.group("function_type")
    return ret.replace("\\_", "_") # handle escapes of underscores

Given this expected response

In [None]:
expected_output = sample_response['expected_output'][0]
expected_output

We extract its function type like so

In [None]:
extract_function_type(expected_output)

Given the generated output from our finetuned LLM

In [None]:
generated_output = sample_response["generated_text"][0]
generated_output

We extract its function type like so

In [None]:
extract_function_type(generated_output)

We define a similar function to extract the attribute types.

In [None]:
def extract_attribute_types(response: Optional[str]) -> list[str]:
    if response is None:
        return []

    # pattern to match is "{function_type}({attributes})"
    expected_pattern = re.compile(r"^(?P<function_type>.+?)\((?P<attributes>.+)\)$")

    # remove any "Output: " prefix and strip the response
    match = expected_pattern.match(response.split("Output: ")[-1].strip())

    if match is None:
        return []

    attributes = match.group("attributes")

    # pattern is "{attribute_type}[{attribute_value}], ..."
    attr_types = re.findall(r"(\w+)\[", attributes)

    return attr_types

Given a sample expected output

In [None]:
expected_output

Here are the expected attribute types to output

In [None]:
extract_attribute_types(expected_output)

Let's take the output generated by our fine-tuned LLM

In [None]:
generated_output

We can now extract its attribute types

In [None]:
extract_attribute_types(generated_output)
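
The two extraction functions share the same parsing logic, and it can help to see how that logic behaves on a few fixed strings. The sketch below combines both steps into one hypothetical `parse` helper and runs it on invented responses: a clean one, one with an `Output: ` prefix and an escaped underscore, and one that does not follow the expected format.

```python
import re
from typing import Optional

# Same pattern as above: "{function_type}({attributes})"
PATTERN = re.compile(r"^(?P<function_type>.+?)\((?P<attributes>.+)\)$")


def parse(response: str) -> tuple[Optional[str], list[str]]:
    """Return (function_type, attribute_types), or (None, []) if the format is wrong."""
    match = PATTERN.match(response.split("Output: ")[-1].strip())
    if match is None:
        return None, []
    fn_type = match.group("function_type").replace("\\_", "_")
    attr_types = re.findall(r"(\w+)\[", match.group("attributes"))
    return fn_type, attr_types


# Hypothetical responses, for illustration only.
clean = parse("inform(name[Portal 2], release_year[2011])")
prefixed = parse("Output: give\\_opinion(name[Portal 2], rating[good])")
malformed = parse("I am not sure how to answer that.")

print(clean)
print(prefixed)
print(malformed)
```

A response that does not match the pattern yields no function type and no attribute types, so it counts as a miss when compared with a well-formed ground truth.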

Let's apply this post processing to our entire dataset

In [None]:
def post_process(row: dict[str, Any]) -> dict[str, Any]:
    row.update(
        {
            "ground_truth_fn_type": extract_function_type(row["expected_output"]),
            "ground_truth_attr_types": extract_attribute_types(row["expected_output"]),
            "model_fn_type": extract_function_type(row["generated_text"]),
            "model_attr_types": extract_attribute_types(row["generated_text"]),
        }
    )
    return row


test_ds_responses_processed = test_ds_responses.map(post_process)
sample_processed = test_ds_responses_processed.take_batch(2)
sample_processed

In [None]:
def check_function_type_accuracy(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    batch["fn_type_match"] = batch["ground_truth_fn_type"] == batch["model_fn_type"]
    return batch

fn_type_accuracy_percent = test_ds_responses_processed.map_batches(check_function_type_accuracy).mean(on="fn_type_match") * 100 
print(f"The correct function type is predicted at {fn_type_accuracy_percent}% accuracy")

In [None]:
def check_attribute_types_accuracy(batch: pd.DataFrame) -> pd.DataFrame:
    batch["attr_types_match"] = batch["ground_truth_attr_types"].apply(set) == batch["model_attr_types"].apply(set)
    return batch

attr_types_accuracy_percent = test_ds_responses_processed.map_batches(check_attribute_types_accuracy, batch_format="pandas").mean(on="attr_types_match") * 100 
print(f"The correct attribute types are predicted at {attr_types_accuracy_percent}% accuracy")

<div class="alert alert-block alert-info">

### Activity: Change the attribute types accuracy metric

Our current metric for attribute types is not very strict. 

Can you make it stricter by setting `attr_types_match` to `True` only when the model's predicted attribute types and the ground truth attribute types are exactly the same in the order they appear?



</div>

In [None]:
# Write your solution here

<div class="alert alert-block alert-info">

<details>

<summary> Click here to see the solution </summary>

```python
def check_attribute_types_accuracy(batch: pd.DataFrame) -> pd.DataFrame:
    batch["attr_types_match"] = batch["ground_truth_attr_types"].apply(list) == batch["model_attr_types"].apply(list)
    return batch

attr_types_accuracy_percent = test_ds_responses_processed.map_batches(check_attribute_types_accuracy, batch_format="pandas").mean(on="attr_types_match") * 100 
print(f"The correct attribute types are predicted at {attr_types_accuracy_percent}% accuracy")
```
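
To see why comparing lists is stricter than comparing sets, consider this small sketch with invented attribute lists. Set equality ignores both ordering and duplicates, while list equality requires the same items in the same order.

```python
# Hypothetical attribute types, for illustration only.
ground_truth = ["name", "release_year", "genres"]
reordered = ["genres", "name", "release_year"]
duplicated = ["name", "name", "release_year", "genres"]

# Set comparison: order and repetition do not matter.
set_match_reordered = set(ground_truth) == set(reordered)
set_match_duplicated = set(ground_truth) == set(duplicated)

# List comparison: order and repetition both matter.
list_match_reordered = ground_truth == reordered
list_match_duplicated = ground_truth == duplicated

print(set_match_reordered, set_match_duplicated)
print(list_match_reordered, list_match_duplicated)
```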

</details>

</div>


## 7. Running Baseline Model Inference

We will benchmark the fine-tuned model against the base (not fine-tuned) version of the same LLM.

### Using Few-shot learning for the baseline model

The base model fails to perform the task out of the box. To make the comparison fair, we will augment its prompt with few-shot examples, a prompt-engineering approach that shows the model what the expected output looks like.

Let's read up to 20 examples from our training data.

In [None]:
df_few_shot = ray.data.read_json("s3://anyscale-public-materials/llm-finetuning/viggo_inverted/train/subset-500.jsonl").limit(20).to_pandas()
examples = df_few_shot['messages'].tolist()
examples[:2]

Let's take a sample conversation from our test dataset

In [None]:
sample_conversations = test_ds.take_batch(2)
sample_conversations["messages"][0]

Here is how we will build our prompt with few shot examples

In [None]:
def few_shot(messages: list, examples: list) -> list:
    """Build a prompt for few-shot learning given a user input and examples."""
    system_message, user_message, assistant_message = messages
    user_text = user_message["content"]

    example_preface = (
        "Examples are printed below."
        if len(examples) > 1
        else "An example is printed below."
    )
    example_preface += (
        ' Note: you are to respond with the string after "Output: " only.'
    )
    examples_parsed = "\n\n".join(
        [
            f"{user['content']}\nOutput: {assistant['content']}"
            for (system, user, assistant) in examples
        ]
    )
    response_preface = "Now please provide the output for:"
    user_text = f"{example_preface}\n\n{examples_parsed}\n\n{response_preface}\n{user_text}\nOutput: "
    return [system_message, {"role": "user", "content": user_text}, assistant_message]

Now we apply the `few_shot` function with only two examples

In [None]:
conversation = sample_conversations["messages"][0]
conversation_with_few_shot = few_shot(conversation, examples[:2])
conversation_with_few_shot

Here is the updated user prompt

In [None]:
print(conversation_with_few_shot[1]["content"])

Let's map this across our entire dataset

In [None]:
def apply_few_shot(row: dict[str, Any]) -> dict[str, Any]:
    row["messages"] = few_shot(row["messages"], examples)
    return row

test_ds_with_few_shot = test_ds.map(apply_few_shot)

We now proceed to generate responses

In [None]:
sampling_params = SamplingParams(temperature=0, max_tokens=2048, detokenize=True)

test_ds_responses_few_shot = (
    test_ds_with_few_shot.map(split_inputs_outputs)
    .map(
        MistralTokenizer,
        concurrency=2,
    )
    .map_batches(
        LLMPredictor,
        fn_constructor_kwargs={
            "hf_model": base_model,
            "sampling_params": sampling_params,
        },
        concurrency=1,  # number of LLM instances
        num_gpus=1,  # GPUs per LLM instance
        batch_size=10,
    )
    .map(post_process)
    .materialize()
)

## 8. Comparing Evaluation Metrics

Let's compute the same evaluation metrics for the baseline model so we can compare them with the fine-tuned model's results.

In [None]:
fn_type_accuracy_percent_few_shot = test_ds_responses_few_shot.map_batches(check_function_type_accuracy).mean(on="fn_type_match") * 100 
print(f"The correct function type is predicted at {fn_type_accuracy_percent_few_shot}% accuracy")

In [None]:
attr_types_accuracy_percent_few_shot = test_ds_responses_few_shot.map_batches(check_attribute_types_accuracy, batch_format="pandas").mean(on="attr_types_match") * 100 
print(f"The correct attribute types are predicted at {attr_types_accuracy_percent_few_shot}% accuracy")

In [None]:
# clean up - uncomment to delete the artifacts
# !rm -rf /mnt/cluster_storage/llm-finetuning/