# Deploy 4-bit quantized Llama V2 70B Chat to SageMaker on ml.g5.12xlarge

[GGML](https://github.com/ggerganov/llama.cpp) is the tensor library and model format behind llama.cpp, a popular framework for running LLMs, including Llama V2. GGML offers aggressively quantized model formats, including 4-bit formats. This notebook shows how to deploy a quantized LLM on an instance type with limited GPU memory. That helps when larger instance types are not available in your region, or when you want to reduce the cost of LLM inference.

## Deployment details

This notebook deploys the following configuration:
- Model: [Llama V2 70B Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- Quantization: GGML Q4_K_M; the quantized weights are available [here](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/)
- Web Framework: [llama-cpp-python, which uses FastAPI and Starlette](https://github.com/abetlen/llama-cpp-python)
- Instance Type: ml.g5.12xlarge; region availability and pricing can be found [here](https://aws.amazon.com/sagemaker/pricing/)
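
To see why 4-bit quantization matters for this instance type, a back-of-the-envelope estimate of weight memory can help. The sketch below is illustrative only: the parameter count is rounded to 70e9, the 4.5 bits per weight is an assumed stand-in for a 4-bit format (the real Q4_K_M average may differ), and 4 × 24 is the nominal GPU memory of an ml.g5.12xlarge with four A10G GPUs. It ignores the KV cache and runtime overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 70e9          # assumption: rounded parameter count
GPU_MEMORY_GIB = 4 * 24  # assumption: nominal total across four A10G GPUs

for label, bits in [("fp16", 16), ("4-bit (assumed 4.5 bits/weight)", 4.5)]:
    need = weight_memory_gib(N_PARAMS, bits)
    verdict = "fits" if need < GPU_MEMORY_GIB else "does not fit"
    print(f"{label}: ~{need:.0f} GiB of weights, {verdict} in {GPU_MEMORY_GIB} GiB")
```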

## Installing packages

In [None]:
%%bash
pip install -U pip --quiet
pip install -U sagemaker boto3 huggingface_hub --quiet

## Model weights upload
We download a specific quantized format of the model, create a compressed archive, and upload it to S3 for use by SageMaker.

In [None]:
download_link = "https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/llama-2-70b-chat.ggmlv3.q4_K_M.bin"
!apt-get -y install wget
!wget {download_link}

In [None]:
!tar -czf llama-2-70b-chat.tar.gz llama-2-70b-chat.ggmlv3.q4_K_M.bin
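
The archive layout matters: SageMaker extracts the model archive into `/opt/ml/model`, and the `MODEL` path used later in this notebook expects the `.bin` file directly under that directory. As a sketch, the same archive could be built and checked from Python with the standard `tarfile` module. `make_model_archive` is a hypothetical helper name, not part of any SDK.

```python
import os
import tarfile


def make_model_archive(weights_path: str, archive_path: str) -> list:
    """Gzip-tar a single weights file at the archive root and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops any directory prefix so the file lands at the archive root.
        tar.add(weights_path, arcname=os.path.basename(weights_path))
    # Re-open the archive to confirm what it actually contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Example call (not executed here because the weights file is very large):
# make_model_archive("llama-2-70b-chat.ggmlv3.q4_K_M.bin", "llama-2-70b-chat.tar.gz")
```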

In [None]:
import sagemaker

sess = sagemaker.Session()

# Set this to a bucket name to override the session's default bucket.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None:
    sagemaker_session_bucket = sess.default_bucket()

In [None]:
from sagemaker.s3 import S3Uploader

res = S3Uploader.upload(
    "llama-2-70b-chat.tar.gz", f"s3://{sagemaker_session_bucket}/ggml-quantized-models"
)

## Image packaging

We use the [SageMaker Docker Build CLI](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/) to build a container image for our custom GGML LLM server and push it to Amazon ECR.

In [None]:
!pip install sagemaker-studio-image-build --quiet -U

In [None]:
# Clone the modified llama-cpp-python repository
!git clone https://github.com/billcai/llama-cpp-python

In [None]:
%%sh
# Create the ECR repository for the image if it does not already exist.
account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

dockerName=llama-cpp-python-example

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${dockerName}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${dockerName}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${dockerName}" > /dev/null
fi

In [None]:
%%sh
# Build the image with sm-docker and push it to the ECR repository.
account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

dockerName=llama-cpp-python-example

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${dockerName}:latest"
cd llama-cpp-python && sm-docker build . --repository "${dockerName}:latest" --file docker/cuda_simple/Dockerfile
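
The pushed image is addressed by an ECR URI of the same shape as the `fullname` variable in the shell cells above. A minimal Python sketch of that naming scheme follows. `ecr_image_uri` is a hypothetical helper, the account ID in the example is a placeholder, and the `amazonaws.com` suffix assumes the standard AWS partition.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build an ECR image URI (assumes the standard AWS partition)."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "llama-cpp-python-example"))
```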

## Deploy model to SageMaker
We deploy the model as a SageMaker endpoint, given the packaged image and weights.

In [None]:
import boto3
from sagemaker import get_execution_role

client = boto3.client("sts")
account_id = client.get_caller_identity()["Account"]
region_name = sagemaker.Session().boto_region_name
dockerName = "llama-cpp-python-example"
role = get_execution_role()

In [None]:
image_uri = f"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{dockerName}:latest"
model_artifact_location = (
    f"s3://{sagemaker_session_bucket}/ggml-quantized-models/llama-2-70b-chat.tar.gz"
)

In [None]:
from sagemaker.model import Model

llama70bchat = Model(
    image_uri=image_uri,
    model_data=model_artifact_location,
    role=role,
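    env={
        "MODEL": "/opt/ml/model/llama-2-70b-chat.ggmlv3.q4_K_M.bin",  # path inside the container
        "N_CTX": "4096",  # context window, in tokens
        "N_GPU_LAYERS": "999",  # larger than the layer count, so all layers are offloaded to GPU
        "N_GQA": "8",  # grouped-query attention factor needed for the 70B GGML model
        "PORT": "8080",  # SageMaker sends inference requests to port 8080
        "USE_MLOCK": "0",
    },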
)
predictor = llama70bchat.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="llama-2-70b-chat-quantized",
)
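
Container environment variables are passed to SageMaker as strings, which is why the numeric settings above are quoted. If you prefer to keep typed values in Python, one option is to convert them just before creating the `Model`. The helper below is a hypothetical sketch, not part of the SageMaker SDK; rendering booleans as `"1"`/`"0"` is an assumption that mirrors the `USE_MLOCK` value used above.

```python
def to_container_env(settings: dict) -> dict:
    """Convert typed settings into a string-to-string map for container env vars."""
    env = {}
    for key, value in settings.items():
        if isinstance(value, bool):
            # Assumption: booleans are rendered as "1"/"0", mirroring USE_MLOCK above.
            env[key] = "1" if value else "0"
        else:
            env[key] = str(value)
    return env


env = to_container_env(
    {
        "MODEL": "/opt/ml/model/llama-2-70b-chat.ggmlv3.q4_K_M.bin",
        "N_CTX": 4096,
        "N_GPU_LAYERS": 999,
        "N_GQA": 8,
        "PORT": 8080,
        "USE_MLOCK": False,
    }
)
```

The resulting `env` dictionary could then be passed as the `env` argument of `Model`.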

## Let's test the endpoint
We will test the endpoint with a sample query. Feel free to use your own query below.

In [None]:
input_data = {
    "messages": [
        {
            "content": "You are an expert developer providing advice to AWS customers about how to build cloud solutions.",
            "role": "system",
        },
        {
            "content": (
                "Provide a detailed step-by-step guide for customers deploying a PyTorch model "
                "onto Amazon SageMaker, including sample code if needed."
            ),
            "role": "user",
        },
        {
            "role": "assistant",
            "content": (
                "Certainly! Here's a step-by-step guide to help you deploy your PyTorch model "
                "onto AWS using SageMaker, along with some sample code snippets. Before we begin, "
                "I recommend you have an understanding of the AWS Cloud and basic knowledge of "
                "Python and machine learning. Let's dive in!"
            ),
        },
        {"content": "Step One:", "role": "user"},
    ],
    "max_tokens": 4096,
    "temperature": 0.9,
}
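
To try your own query, it can help to build the payload with a small function. The sketch below assumes the endpoint accepts the same chat-style fields as `input_data` above (`messages`, `max_tokens`, `temperature`). `build_chat_payload`, its default values, and the example messages are hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chat_payload(messages, max_tokens=512, temperature=0.9):
    """Validate chat messages and return a payload shaped like input_data."""
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"Unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("Each message needs a string 'content' field")
    return {
        "messages": list(messages),
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_chat_payload(
    [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is Amazon SageMaker?"},
    ]
)
```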

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = Predictor(
    endpoint_name="llama-2-70b-chat-quantized",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

In [None]:
model.predict(data=input_data)
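
The llama-cpp-python server is designed to be OpenAI-compatible, so the reply is likely to arrive in a `choices` list. That is an assumption here, since the response schema of this modified image is not shown in this notebook; inspect the raw output of `model.predict` before relying on it. `extract_reply` is a hypothetical helper, and the sample response is made up for illustration.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    return choices[0]["message"]["content"]


# Made-up response, used only to exercise the helper.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Step one: ..."}}]
}
print(extract_reply(sample_response))
```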