# How to optimize the Meta Llama-3 70B Amazon JumpStart model for inference using Amazon SageMaker model optimization jobs
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to apply state-of-the-art optimization techniques to an Amazon JumpStart model (JumpStart model ID: `meta-textgeneration-llama-3-70b`) using Amazon SageMaker ahead-of-time (AOT) model optimization capabilities. Each example includes the deployment of the optimized model to an Amazon SageMaker endpoint. In all cases, the inference image is the SageMaker-managed [LMI (Large Model Inference)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) Docker image. LMI images feature a [DJL serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

You will quantize the model weights using the AWQ algorithm.


**Notices:**
* Make sure that the `ml.p4d.24xlarge`, `ml.inf2.48xlarge`, and `ml.g5.12xlarge` instance types used in this tutorial are available in your AWS Region.
* Make sure that your "ml.p4d.24xlarge for endpoint usage", "ml.inf2.48xlarge for endpoint usage", and "ml.g5.12xlarge for endpoint usage" Amazon SageMaker service quotas allow you to deploy at least one Amazon SageMaker endpoint on each of these instance types.

This notebook leverages the [`ModelBuilder` class](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html) within the [`sagemaker` Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to abstract away container and model server management and tuning. With `ModelBuilder`, you can work with JumpStart models, Hugging Face Hub models, and custom models by pointing to an S3 path that contains your model data. This sample focuses on the JumpStart optimization path.

### License agreement
* This model is released under the Meta license. Please refer to the original model card.
* This notebook is a sample notebook and not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#)
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.225.0 

Let's install or upgrade these dependencies using the following command:

In [None]:
# Quote the version specifier so the shell does not treat ">" as an output redirection
%pip install "sagemaker>=2.225.0" boto3 huggingface_hub --upgrade --quiet

### Setup

In [None]:
import boto3
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.session import Session
import logging
import huggingface_hub
from pathlib import Path

In [None]:
sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "meta-textgeneration-llama-3-70b"
gpu_instance_type = "ml.p4d.24xlarge"
neuron_instance_type = "ml.inf2.48xlarge"

In [None]:
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

## 3. Run optimization job to quantize the model using AWQ, then deploy the quantized model
In this section, you will quantize the `meta-textgeneration-llama-3-70b` JumpStart model with the AWQ quantization algorithm by running an Amazon SageMaker optimization job. 

### What is quantization?
In our particular context, quantization means casting the weights of a pre-trained LLM to a data type with a lower number of bits and therefore a smaller memory footprint. The benefits of LLM quantization include:
* Reduced hardware requirements for model serving: A quantized model can be served using less expensive and more available GPUs or even made accessible on consumer devices or mobile platforms.
* Increased space for the KV cache to enable larger batch sizes and/or sequence lengths.
* Lower decoding latency. Because decoding is memory-bandwidth bound, moving fewer bytes of weights directly reduces decoding latency, unless the gain is offset by dequantization overhead.
* A higher compute-to-memory access ratio (through reduced data movement), known as arithmetic intensity. This allows for fuller utilization of available compute resources during decoding.  
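
To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the weight footprint before and after 4-bit quantization. It is only a sketch: it assumes exactly 70 billion parameters and ignores the per-group scales and zero-points that INT4 formats also store, so real artifacts will be somewhat larger. The names `NUM_PARAMS` and `weight_memory_gb` are hypothetical and are not part of the SageMaker SDK.

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
# Assumptions (not from the notebook): exactly 70e9 parameters, and the
# per-group scales/zero-points that INT4 formats also store are ignored.

NUM_PARAMS = 70_000_000_000


def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Return the size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"INT4 weights: {int4_gb:.0f} GB")
print(f"Reduction:    {fp16_gb / int4_gb:.0f}x")
```

Under these assumptions the weights shrink from about 140 GB to about 35 GB. Whatever accelerator memory the weights no longer occupy becomes available for the KV cache.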

AWQ (Activation-aware Weight Quantization) is a post-training, weight-only quantization algorithm introduced by [J. Lin et al. (MLSys 2024)](https://arxiv.org/abs/2306.00978) that makes it possible to quantize LLMs to low-bit integer types such as INT4 with virtually no loss in model accuracy.
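
To illustrate what "weight-only quantization to INT4" means mechanically, the following toy sketch quantizes one small group of weights with plain round-to-nearest and then dequantizes it. This is deliberately not AWQ: it omits the activation-aware scaling that gives AWQ its accuracy, and the function names and sample values are hypothetical. It only shows how a shared scale and offset map floats to 4-bit integer codes.

```python
# Toy round-to-nearest (RTN) INT4 quantization of one group of weights.
# This is NOT AWQ: there is no activation-aware scaling step. All names
# here are hypothetical and exist only for illustration.


def quantize_group(weights, bits=4):
    """Map floats to integers in [0, 2**bits - 1] using a shared scale and offset."""
    q_max = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / q_max if w_max > w_min else 1.0
    codes = [min(q_max, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


group = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.08, 0.49]
codes, scale, w_min = quantize_group(group)
restored = dequantize_group(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print("codes:", codes)
print("max reconstruction error:", max_error)
print("half a quantization step:", scale / 2)
```

In this scheme the reconstruction error of each weight is bounded by half a quantization step. Algorithms such as AWQ aim to keep the impact of that error on model outputs small by treating the most important weights more carefully.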

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

Quantizing the model is as easy as supplying the following inputs:
* The location of the unquantized model artifacts, here the Amazon SageMaker JumpStart model ID.
* The quantization configuration.
* The Amazon S3 URI where the quantized output artifacts need to be stored.

Everything else (compute provisioning and configuration, for example) is managed by SageMaker. This operation takes around 120 minutes.

In [None]:
optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq",
        },
    },
    output_path=f"s3://{artifacts_bucket_name}/awq-quantization/",
)

Now let's deploy the quantized model to an Amazon SageMaker endpoint. This operation may take a few minutes.

In [None]:
quantized_instance_type = "ml.g5.12xlarge"  # We can use a smaller instance type once quantized
predictor = optimized_model.deploy(instance_type=quantized_instance_type, accept_eula=True)

Once the deployment has finished successfully, you can send queries to the model using the predictor's `predict` method, as shown below. Once you have tested it, you can connect this endpoint to the Generative AI Application Builder on AWS and query it from there.

In [None]:
predictor.predict(sample_input)
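
If you want to use the generated string in downstream code, a small helper like the one below can unwrap the response. This is a sketch under an assumption: that the endpoint returns either a list shaped like `sample_output` (`[{"generated_text": ...}]`) or a single dictionary with a `generated_text` key. The actual shape depends on the container configuration, so inspect a real response first. `extract_generated_text` is a hypothetical helper, not part of the SageMaker SDK.

```python
# Hypothetical helper: pull the generated text out of an endpoint response.
# Assumption: the response is either [{"generated_text": ...}] (the shape of
# `sample_output` above) or a single {"generated_text": ...} dict.


def extract_generated_text(response):
    """Return the generated string, or raise ValueError on an unexpected shape."""
    if isinstance(response, list) and response:
        response = response[0]
    if isinstance(response, dict) and "generated_text" in response:
        return response["generated_text"]
    raise ValueError(f"Unexpected response shape: {response!r}")


# Stand-in for `predictor.predict(sample_input)` so the sketch runs offline.
example_response = [{"generated_text": "Hello, I'm a language model, and I'm here to help."}]
print(extract_generated_text(example_response))
```

Raising on an unexpected shape, rather than returning `None`, makes a change in response format visible immediately.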

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)