# How to optimize the Meta Llama-3 70B Amazon JumpStart model for inference using Amazon SageMaker model optimization jobs
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to apply state-of-the-art optimization techniques to an Amazon JumpStart model (JumpStart model ID: `meta-textgeneration-llama-3-70b`) using Amazon SageMaker ahead-of-time (AOT) model optimization capabilities. Each example includes the deployment of the optimized model to an Amazon SageMaker endpoint. In all cases, the inference image is the SageMaker-managed [LMI (Large Model Inference)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) Docker image. LMI images feature a [DJL serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

You will customize speculative decoding with an open-source draft model, then deploy the optimized model.

**Notices:**
* Make sure that the `ml.p4d.24xlarge` and `ml.inf2.48xlarge` instance types required for this tutorial are available in your AWS Region.
* Make sure that the value of your "ml.p4d.24xlarge for endpoint usage" and "ml.inf2.48xlarge for endpoint usage" Amazon SageMaker service quotas allow you to deploy at least one Amazon SageMaker endpoint using these instance types.
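
One way to check the quota notice programmatically is to query the Service Quotas API. The sketch below is not part of the original notebook: the helper names are illustrative, and it assumes that your credentials may list service quotas and that the quota names match the strings quoted above.

```python
from typing import Dict, Iterable, List


def find_endpoint_quotas(quotas: Iterable[dict], instance_types: List[str]) -> Dict[str, float]:
    """Map each instance type to its "<type> for endpoint usage" quota value, if listed."""
    wanted = {f"{t} for endpoint usage": t for t in instance_types}
    found = {}
    for quota in quotas:
        instance_type = wanted.get(quota.get("QuotaName"))
        if instance_type is not None:
            found[instance_type] = quota.get("Value", 0.0)
    return found


def list_sagemaker_quotas(region_name=None):
    """Yield every SageMaker quota in the Region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works without it

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        yield from page["Quotas"]


# Usage (needs AWS credentials):
# quotas = find_endpoint_quotas(
#     list_sagemaker_quotas(), ["ml.p4d.24xlarge", "ml.inf2.48xlarge"]
# )
# A missing key or a value below 1 suggests that a quota increase is needed.
```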

This notebook uses the [ModelBuilder class](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html) from the [`sagemaker` Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to abstract away container and model server management and tuning. With ModelBuilder, you can work with JumpStart models, Hugging Face Hub models, and custom models by pointing to an S3 path that holds your model data. This sample focuses on the JumpStart optimization path.

### License agreement
* This model is provided under the Meta license. Refer to the original model card for details.
* This notebook is a sample notebook and not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#)
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.225.0
* Hugging Face `huggingface_hub`, used later to download the draft model artifacts

Let's install or upgrade these dependencies using the following command:

In [None]:
%pip install "sagemaker>=2.225.0" boto3 huggingface_hub --upgrade --quiet
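
After the install, it can be worth confirming that the running kernel sees a recent enough `sagemaker` version, since an upgrade may not take effect until the kernel is restarted. The check below is an illustrative sketch (the helper names are not from the original notebook) that uses only the standard library.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn "2.225.0" into (2, 225, 0), ignoring suffixes such as "rc1"."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad "3.0" to (3, 0, 0) so comparisons line up
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.225.0") -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("sagemaker")
    print(f"sagemaker {installed} meets minimum: {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("sagemaker is not installed in this kernel")
```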

### Setup

In [None]:
import boto3
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.session import Session
import logging
import huggingface_hub
from pathlib import Path

In [None]:
sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "meta-textgeneration-llama-3-70b"
gpu_instance_type = "ml.p4d.24xlarge"
neuron_instance_type = "ml.inf2.48xlarge"

In [None]:
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

## 2. Customize speculative decoding with an open-source draft model, then deploy the optimized model
In this section, instead of relying on a pre-optimized model, you will use the Amazon SageMaker optimization toolkit to enable speculative decoding on the `meta-textgeneration-llama-3-70b` JumpStart model. The draft model comes from the Hugging Face model hub. We use the `huggingface_hub` package to download its artifacts locally and then upload them to S3; optionally, you can provide your Hugging Face model ID instead. The draft model here is Meta-Llama-3-8B, so make sure your Hugging Face account has access to its artifacts.

In [None]:
custom_draft_model_id = "meta-llama/Meta-Llama-3-8B"

hf_local_download_dir = Path.cwd() / "model_repo"
hf_local_download_dir.mkdir(exist_ok=True)

huggingface_hub.snapshot_download(
    repo_id=custom_draft_model_id,
    revision="main",
    local_dir=hf_local_download_dir,
    local_dir_use_symlinks=False,
)

In [None]:
custom_draft_model_uri = sagemaker_session.upload_data(
    path=hf_local_download_dir.as_posix(),
    bucket=artifacts_bucket_name,
    key_prefix="spec-dec-custom-draft-model",
)

In [None]:
draft_uri = custom_draft_model_uri + "/"  # must point to the uncompressed model artifacts
draft_uri

In [None]:
!aws s3 ls {draft_uri} #verify model artifacts
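
The `aws s3 ls` check above can also be done from Python, which is handy when the AWS CLI is not available in the kernel. This is a sketch under assumptions: `split_s3_uri` and `list_s3_keys` are illustrative helper names, and the listing call needs `boto3` plus credentials that can read the bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split "s3://bucket/some/prefix/" into ("bucket", "some/prefix/")."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_s3_keys(uri: str, max_keys: int = 50):
    """Return up to max_keys object keys under the URI (needs AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without it

    bucket, prefix = split_s3_uri(uri)
    response = boto3.client("s3").list_objects_v2(
        Bucket=bucket, Prefix=prefix, MaxKeys=max_keys
    )
    return [obj["Key"] for obj in response.get("Contents", [])]


# Usage (needs AWS credentials):
# for key in list_s3_keys(draft_uri):
#     print(key)
```

An empty result would suggest that the upload did not land under the expected prefix.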

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

The optimization operation may take a few minutes.

In [None]:
optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    speculative_decoding_config={
        "ModelSource": draft_uri
    },
)
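
For intuition about what the draft model contributes once speculative decoding is enabled, here is a toy greedy draft-and-verify loop. It is a simplified sketch, not how the LMI container implements the technique: `target_next` and `draft_next` are hypothetical stand-ins for the 70B target and 8B draft models, verification runs one position at a time rather than in a single batched pass, and sampling-based acceptance is left out.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_generate(target_next: NextToken, draft_next: NextToken,
                         prompt: List[str], num_new: int, k: int = 4) -> List[str]:
    """Greedy draft-and-verify: output equals plain greedy decoding with target_next."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks each proposed position in order.
        accepted: List[str] = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if guess != expected:
                break  # 3. First mismatch: discard the remaining draft tokens.
        tokens.extend(accepted)
    return tokens


# Toy "models": the target recites a fixed sentence; the draft gets one word wrong.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_next(tokens: List[str]) -> str:
    return SENTENCE[len(tokens) % len(SENTENCE)]


def draft_next(tokens: List[str]) -> str:
    word = target_next(tokens)
    return "red" if word == "brown" else word


print(" ".join(speculative_generate(target_next, draft_next, [], num_new=9)))
```

Because every kept token is one the target model would have chosen itself, the output matches plain greedy decoding with the target. In a real system the speedup comes from verifying several draft tokens per target forward pass.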

Now let's deploy the optimized model to an Amazon SageMaker endpoint. This operation may take a few minutes.

In [None]:
predictor = optimized_model.deploy(accept_eula=True)

Once the deployment has finished successfully, you can send queries to the model using the predictor's `predict` method, as shown below. After testing it, you can connect this endpoint to the Generative AI Application Builder on AWS and query it from there.

In [None]:
predictor.predict(sample_input)
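
To use the result in downstream code, you will usually want just the generated string. The helper below is an illustrative sketch that assumes the response follows the `sample_output` shape defined earlier (a list of dictionaries with a `generated_text` key) or a single such dictionary. The actual shape can vary with the container and serializer settings, so inspect a real response first.

```python
import json


def extract_generated_text(response) -> str:
    """Return the generated text from a list-of-dicts or single-dict response."""
    if isinstance(response, (bytes, bytearray, str)):
        response = json.loads(response)  # raw JSON payloads are decoded first
    if isinstance(response, list):
        if not response:
            raise ValueError("empty response list")
        response = response[0]
    if isinstance(response, dict) and "generated_text" in response:
        return response["generated_text"]
    raise ValueError(f"unexpected response shape: {type(response).__name__}")


# Usage with the endpoint deployed above:
# text = extract_generated_text(predictor.predict(sample_input))
```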

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)
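
If the first cleanup call fails, the cell above stops and the endpoint keeps running and accruing cost. A more defensive variant, sketched here with an illustrative `cleanup` helper, attempts both deletions and reports any errors afterward. It uses only the two predictor methods already shown.

```python
def cleanup(predictor) -> list:
    """Attempt both deletions; return the errors raised instead of stopping early."""
    steps = (
        ("delete_model", lambda: predictor.delete_model()),
        ("delete_endpoint",
         lambda: predictor.delete_endpoint(delete_endpoint_config=True)),
    )
    errors = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later steps still run
            errors.append((name, exc))
    return errors


# Usage:
# for name, exc in cleanup(predictor):
#     print(f"{name} failed: {exc}")
```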