# Evaluating Quantized Foundation Models Using SageMaker Clarify and Inference

In this sample notebook, we use the [SageMaker Inference Optimization Toolkit](https://aws.amazon.com/blogs/machine-learning/achieve-up-to-2x-higher-throughput-while-reducing-costs-by-50-for-generative-ai-inference-on-amazon-sagemaker-with-the-new-inference-optimization-toolkit-part-1/) to show how you can quantize a Llama3-70B model and deploy it on a smaller GPU instance type. Quantization is a popular model compression technique that uses lower-precision data types to reduce the memory footprint and accelerate inference. The Inference Optimization Toolkit currently supports [AWQ](https://arxiv.org/abs/2306.00978) as a quantization method.

Quantization comes with a trade-off: reducing the precision of the model's parameters can reduce the accuracy of the model. In this notebook, we explore how you can use SageMaker Clarify's Foundation Model Evaluation tool, [FMEval](https://github.com/aws/fmeval/tree/main), to evaluate a base and a quantized Llama3-70B model and measure the accuracy difference between the two.
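
To make the precision loss concrete, here is a minimal sketch of plain symmetric round-to-nearest quantization applied to a few made-up weight values. This is not AWQ, which additionally uses activation statistics to protect important weights; the function name `quantize_dequantize` and the example values are assumptions chosen only for illustration.

```python
# Illustrative only: plain symmetric round-to-nearest quantization, not AWQ.
def quantize_dequantize(weights, num_bits=4):
    """Map floats to signed integers of `num_bits` bits, then back to floats."""
    qmax = 2 ** (num_bits - 1) - 1              # largest integer level, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # round each weight to the nearest integer level and clamp to the valid range
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [level * scale for level in levels]


weights = [0.12, -0.53, 0.98, -0.07, 0.33]      # made-up example values
restored = quantize_dequantize(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("restored weights:", [round(r, 3) for r in restored])
print(f"largest absolute error: {max_error:.3f}")
```

The restored weights differ slightly from the originals, and it is the accumulated effect of such rounding errors that an evaluation needs to quantify.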

Note that with the FMEval package, you must pick the NLP task and algorithm that fit your use case. This notebook uses the Factual Knowledge algorithm; if you are working on a summarization use case, for instance, instantiate the corresponding algorithm instead. For more FMEval samples, please refer to the [FMEval examples](https://github.com/aws/fmeval/tree/main/examples).

## Additional Resources
- [Inference Optimization Toolkit Blog](https://aws.amazon.com/blogs/machine-learning/achieve-up-to-2x-higher-throughput-while-reducing-costs-by-50-for-generative-ai-inference-on-amazon-sagemaker-with-the-new-inference-optimization-toolkit-part-1/)
- [FMEval Intro Blog](https://aws.amazon.com/blogs/machine-learning/evaluate-large-language-models-for-quality-and-responsibility/)
- [Model Builder SageMaker Python SDK Class](https://aws.amazon.com/blogs/machine-learning/package-and-deploy-classical-ml-and-llms-easily-with-amazon-sagemaker-part-1-pysdk-improvements/)

## Setup & Base Model Deployment

For this example, we use the Model Builder class from the SageMaker Python SDK to deploy a Llama3-70B model. We deploy the base Llama model with the default hardware configuration, then quantize the model with the Inference Optimization Toolkit and deploy it on a smaller instance. Both steps go through the Model Builder class, which abstracts away container and parameter selection and provides optimized default configurations for popular LLMs such as Llama3-70B.

In [None]:
# quote the version specifier so the shell does not treat ">" as an output redirect
%pip install "sagemaker>=2.225.0" accelerate boto3 jsonlines fmeval --upgrade --quiet

In [None]:
import boto3
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.session import Session
import logging
from pathlib import Path

In [None]:
sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# can specify a JumpStart Model ID for deployment via Model Builder
js_model_id = "meta-textgeneration-llama-3-70b"
gpu_instance_type = "ml.p4d.24xlarge"

In [None]:
# define input/output payload shapes for the Model Builder class
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

In [None]:
base_model = model_builder.build()

In [None]:
base_predictor = base_model.deploy(instance_type=gpu_instance_type, accept_eula=True)

In [None]:
base_predictor.predict(sample_input)

## Quantization
With the SageMaker Inference Optimization Toolkit we can specify AWQ quantization. This reduces the memory footprint of the model and allows us to deploy it on a smaller GPU instance.
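
As a rough back-of-the-envelope check of why a smaller instance becomes feasible, the sketch below estimates the memory needed for the weights alone. It assumes roughly 70 billion parameters, 16-bit weights for the base model, and 4-bit weights after AWQ; it ignores the KV cache, activations, and the quantization scales, so real requirements are higher. The helper name `weight_memory_gb` is hypothetical, and the GPU memory totals in the comments come from the published instance specifications and are worth double-checking for your region.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 70e9                              # assumption: ~70 billion parameters

fp16_gb = weight_memory_gb(num_params, 16)     # base model, 16-bit weights
int4_gb = weight_memory_gb(num_params, 4)      # assumption: 4-bit weights after AWQ

# Assumed GPU memory totals:
#   ml.p4d.24xlarge: 8 x 40 GB = 320 GB
#   ml.g5.12xlarge:  4 x 24 GB =  96 GB
print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:  ~{int4_gb:.0f} GB")
```

Under these assumptions, the 16-bit weights alone exceed the total GPU memory of an ml.g5.12xlarge, while the 4-bit weights leave headroom for the KV cache and activations.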

In [None]:
# define quantized model builder class
quantized_model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

In [None]:
# an optimization job can take up to 2 hours; you can also monitor it in the Studio UI
optimized_model = quantized_model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq",
        },
    },
    output_path=f"s3://{artifacts_bucket_name}/awq-quantization/",
)

In [None]:
quantized_instance_type = "ml.g5.12xlarge"
quantized_predictor = optimized_model.deploy(instance_type=quantized_instance_type, accept_eula=True)

In [None]:
quantized_predictor.predict(sample_input)['generated_text']

## Evaluation via SageMaker Foundation Model Evaluations
Now that both the base model and the quantized model are deployed, we can evaluate them with the open-source FMEval package. Ensure that you have installed the library and have the requisite dataset available locally (it is also attached to this repo). Here we use the Factual Knowledge algorithm to test both models' abilities on fact-based questions, using the `trex_sample_subset.jsonl` file, a subset of the built-in `trex_sample.jsonl` dataset.

FMEval provides two methods for running an evaluation:
- <b>evaluate_sample</b>: Here you can evaluate a single record by providing the model output against the target (ground truth) output.
- <b>evaluate</b>: Pass an entire dataset to run an evaluation across.
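
To build intuition for what a factual knowledge score captures, the following sketch approximates the idea: a response scores 1 if it contains at least one of the accepted answers, compared case-insensitively, and 0 otherwise. This is a simplified stand-in, not FMEval's implementation; the function name `inclusion_score` is hypothetical, and the `<OR>` delimiter mirrors the one passed to the algorithm configuration in this notebook.

```python
def inclusion_score(target_output: str, model_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any accepted answer appears in the model output, else 0."""
    # the target may list several accepted answers separated by the delimiter
    targets = [t.strip().lower() for t in target_output.split(delimiter)]
    response = model_output.lower()
    return int(any(t and t in response for t in targets))


hit = inclusion_score(
    "Cantal<OR>Cantal department",
    "Aurillac is the capital of the Cantal department in France.",
)
miss = inclusion_score("Cantal", "Aurillac is the capital of Paris.")
print(hit, miss)
```

Because the check is substring-based, a verbose answer that mentions the target still counts as correct, which is worth keeping in mind when comparing models that differ in verbosity.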

In [None]:
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig
# "<OR>" separates multiple accepted answers within a single target output
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

In [None]:
sample_fact_input = {
    "inputs": "Aurillac is the capital of",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}
model_output = quantized_predictor.predict(sample_fact_input)['generated_text']
model_output

In [None]:
# evaluate a single sample
eval_algo.evaluate_sample(target_output="Cantal", model_output=model_output)

## Evaluate Quantized and Base Model
We run inference on the subset dataset with both the quantized model and the base model to create two datasets for evaluation. The two helper functions below create these datasets, which we then evaluate with the FMEval package.

In [None]:
import glob

# Check that the dataset file to be used by the evaluation is present
if not glob.glob("trex_sample_subset.jsonl"):
    print("ERROR - please make sure the file, trex_sample_subset.jsonl, exists.")

In [None]:
import json
import jsonlines
runtime_client = boto3.client('sagemaker-runtime')
content_type = "application/json"

def create_payload(prompt: str, parameters: dict = None) -> dict:
    """
    Creates a model invocation payload.

    Args:
        prompt (str): Prompt for the LLM
        parameters (dict): Customizable model invocation parameters; defaults are used when omitted

    Returns:
        Payload to be used when invoking the model.
    """
    if not prompt:
        raise ValueError("Please provide a non-empty prompt.")

    # build the defaults per call to avoid sharing one mutable default dict across calls
    if parameters is None:
        parameters = {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6}

    return {
        "inputs": prompt,
        "parameters": parameters
    }


def create_eval_files(endpoint_name: str, model_outputs_file: str, input_file: str = "trex_sample_subset.jsonl") -> str:
    """
    Invokes the endpoint for every question in input_file and writes each record,
    extended with a "model_output" field, to model_outputs_file.

    Returns:
        Path of the model outputs file that was written.
    """
    try:
        with jsonlines.open(input_file) as input_fh, jsonlines.open(model_outputs_file, "w") as output_fh:
            for line in input_fh:
                # skip records that have no question to send to the model
                if "question" not in line:
                    continue
                payload = create_payload(line["question"])
                response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload), ContentType=content_type)
                line["model_output"] = json.loads(response["Body"].read().decode())["generated_text"]
                output_fh.write(line)
    except Exception as e:
        # report the failure and re-raise so a partial file is not mistaken for a complete one
        print(f"An error occurred: {e}")
        raise
    print(f"Created Model Outputs File: {model_outputs_file}")
    return model_outputs_file

quantized_model_output_file = create_eval_files(endpoint_name=quantized_predictor.endpoint_name, model_outputs_file="quantized_model_outputs.jsonl")
base_model_output_file = create_eval_files(endpoint_name=base_predictor.endpoint_name, model_outputs_file="base_model_outputs.jsonl")

### Prepare Data Config
FMEval requires Data Config objects that point to the datasets we just created and name the fields that hold the model input, the target output, and the model output.
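
Before handing the files to FMEval, it can help to confirm that every record carries the three fields the Data Config objects refer to. The sketch below is an optional check using only the standard library; the helper name `find_incomplete_records` and the demo records are hypothetical, and the field names are the ones used in this notebook.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("question", "answers", "model_output")


def find_incomplete_records(path, required_fields=REQUIRED_FIELDS):
    """Return (line_number, missing_fields) for each record lacking a required field."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for line_number, raw in enumerate(fh, start=1):
            if not raw.strip():
                continue  # ignore blank lines
            record = json.loads(raw)
            missing = [name for name in required_fields if name not in record]
            if missing:
                problems.append((line_number, missing))
    return problems


# Demo on a throwaway file with made-up records; the second one lacks "model_output".
demo_records = [
    {"question": "Aurillac is the capital of", "answers": "Cantal", "model_output": "Cantal."},
    {"question": "Aurillac is the capital of", "answers": "Cantal"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_outputs.jsonl")
    with open(demo_path, "w", encoding="utf-8") as fh:
        for record in demo_records:
            fh.write(json.dumps(record) + "\n")
    problems = find_incomplete_records(demo_path)

print(problems)
```

Running the same helper on `quantized_model_outputs.jsonl` and `base_model_outputs.jsonl` should return an empty list if every record was written completely.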

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES


# prepare data config for quantized model
quantized_data_config = DataConfig(
    dataset_name="trex_sample_with_quantized_model_outputs",
    dataset_uri="quantized_model_outputs.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
    model_output_location="model_output"
)

# data config for base model
base_data_config = DataConfig(
    dataset_name="trex_sample_with_base_model_outputs",
    dataset_uri="base_model_outputs.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
    model_output_location="model_output"
)

### Evaluate Methods
After defining our Data Config objects and instantiating the evaluation algorithm, we can run the evaluate method on both the quantized and base model outputs to see the accuracy difference for the Factual Knowledge algorithm. For the results on each data point, check the following directory after the evaluation: `/tmp/eval_results/`.

In [None]:
eval_output_quantized = eval_algo.evaluate(dataset_config=quantized_data_config, save=True)
print(eval_output_quantized)

In [None]:
eval_output_base = eval_algo.evaluate(dataset_config=base_data_config, save=True)
print(eval_output_base)
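
To compare the two runs side by side, one option is to collect the dataset-level scores by name and subtract them. The sketch below uses small stand-in classes so that it runs on its own; it assumes each evaluation output exposes a `dataset_scores` list whose items have `name` and `value` attributes, which you should verify against your installed FMEval version. The stand-in classes, the helper names, and the numbers are placeholders, not results from this notebook.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StubScore:
    """Stand-in for a single named score (assumed shape)."""
    name: str
    value: float


@dataclass
class StubEvalOutput:
    """Stand-in for one evaluation output (assumed shape)."""
    dataset_scores: List[StubScore]


def scores_by_name(eval_outputs) -> Dict[str, float]:
    """Flatten a list of evaluation outputs into {score_name: value}."""
    return {
        score.name: score.value
        for output in eval_outputs
        for score in (output.dataset_scores or [])
    }


def score_deltas(base_outputs, quantized_outputs) -> Dict[str, float]:
    """Quantized minus base for every score present in both runs."""
    base = scores_by_name(base_outputs)
    quantized = scores_by_name(quantized_outputs)
    return {name: quantized[name] - base[name] for name in base if name in quantized}


# Placeholder values for illustration only.
demo_base = [StubEvalOutput([StubScore("factual_knowledge", 0.75)])]
demo_quantized = [StubEvalOutput([StubScore("factual_knowledge", 0.5)])]
deltas = score_deltas(demo_base, demo_quantized)
print(deltas)
```

If the assumed attributes match, calling `score_deltas(eval_output_base, eval_output_quantized)` would report how much each score changed after quantization, with negative values indicating a drop.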