# Deploy Bitsandbytes and GPTQ quantized models on SageMaker with Hugging Face TGI

In this notebook we will deploy a 7 billion parameter [Llama 2 Chat model](https://huggingface.co/Trelis/Llama-2-7b-chat-hf-function-calling) quantized with Bitsandbytes, and then a pre-quantized GPTQ 7 billion parameter [Llama 2 Chat model](https://huggingface.co/Trelis/Llama-2-7b-chat-hf-function-calling-GPTQ).

The original models are stored and served in half-precision fp16 format, which translates to 2 bytes per parameter. For a model with 13 billion parameters, that comes to about 26 GB of weights, which is too large to fit in the memory of a single A10 GPU with only 24 GB. Such a model requires a more expensive multi-GPU instance such as an `ml.g5.12xlarge`. The same arithmetic gives about 14 GB for the 7 billion parameter models used in this notebook. An alternative is to quantize the model, which can significantly reduce the amount of VRAM required to host it.

In this notebook, we first deploy a model with its weights quantized to int8, which greatly reduces the memory footprint compared with the initial fp16 format. See this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) from Hugging Face for additional information.

Then we deploy a 7 billion parameter model that has been pre-quantized to 4 bits using the [GPTQ algorithm](https://arxiv.org/abs/2210.17323). With 4-bit quantization the amount of memory per parameter is reduced from 2 bytes to 4 bits (0.5 bytes), which translates to a 75% reduction in memory footprint. This allows us to host the model on a single A10 GPU instance, such as an `ml.g5.xlarge`, which is significantly cheaper than a multi-GPU instance (the cells below use `ml.g5.2xlarge`, which also has a single A10 GPU). Quantization does come with some loss in model accuracy. The drop is typically small and the model is still able to generate coherent responses, but it is important to evaluate the model on your use case.
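
The memory arithmetic above can be checked with a few lines of Python. This is a rough sketch that counts weight storage only (taking 1 GB as 10^9 bytes) and ignores activations, the KV cache, and framework overhead. The helper name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, num_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gb(num_params, 16)
    int8 = weight_memory_gb(num_params, 8)
    int4 = weight_memory_gb(num_params, 4)
    saving = 1 - int4 / fp16  # fraction saved by going from fp16 to 4-bit
    print(
        f"{name}: fp16={fp16:.1f} GB, int8={int8:.1f} GB, "
        f"4-bit={int4:.1f} GB ({saving:.0%} smaller than fp16)"
    )
```

Actual GPU memory use will be higher than these figures, so treat them as a lower bound when picking an instance type.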


*Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved

In [None]:
%pip install -Uq sagemaker

In [None]:
import boto3
import sagemaker
import json
from sagemaker import Model
from sagemaker.huggingface import get_huggingface_llm_image_uri
import time
from pathlib import Path

boto3_session = boto3.session.Session()

smr = boto3_session.client(
    "sagemaker-runtime"
)  # sagemaker runtime client for invoking the endpoint
sm = boto3_session.client("sagemaker")  # sagemaker client for creating the endpoint


role = sagemaker.get_execution_role()  # execution role for the endpoint

sess = sagemaker.session.Session(
    boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr
)  # sagemaker session for interacting with different AWS APIs

bucket = sess.default_bucket()  # default S3 bucket for this SageMaker session
region = sess.boto_region_name  # region name of the current SageMaker Studio environment

We will be using the Hugging Face Text Generation Inference (TGI) container, which runs the optimized [TGI](https://github.com/huggingface/text-generation-inference) LLM hosting solution from Hugging Face.


In [None]:
# retrieve the llm image uri
llm_image_uri = get_huggingface_llm_image_uri("huggingface", version="1.0.3", session=sess)
llm_image_uri

The helper function below will deploy the model to a SageMaker endpoint.

In [None]:
def deploy_model(
    endpoint_name,
    instance_type,
    env=None,
    image_uri=None,
    model_artifact=None,
    s3_bucket=None,
    s3_prefix=None,
    wait=True,
):
    """Uploads the model artifact to S3 and deploys the model to SageMaker."""
    if model_artifact:
        code_artifact = sess.upload_data(model_artifact, s3_bucket, s3_prefix)
        print(f"Inference Code tar ball uploaded to --- > {code_artifact}")
    else:
        code_artifact = None

    model = Model(
        sagemaker_session=sess,
        image_uri=image_uri,
        model_data=code_artifact,
        env=env,
        role=role,
    )

    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        wait=wait,
    )

    return model

## Deploy Llama 2 7B Chat quantized model with Bitsandbytes

It will take around 8 minutes for the endpoint to be ready. 
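
If `deploy_model` is called with `wait=False`, the notebook is not blocked while the endpoint is created, and the status can be polled separately. The following is a minimal sketch of such a loop built on the `describe_endpoint` call of the boto3 SageMaker client. The function name `wait_for_endpoint` and its default poll interval and timeout are assumptions chosen for illustration.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

In this notebook the call would look like `wait_for_endpoint(sm, llama_7b_bitsandbytes_endpoint_name)`. boto3 also ships an `endpoint_in_service` waiter on the SageMaker client that serves the same purpose.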

In [None]:
%%time
llama_7b_bitsandbytes_endpoint_name = sagemaker.utils.name_from_base("llama2-7b-bitsandbytes")

llama_7b_bitsandbytes_env = {
    # "HF_MODEL_ID":"openlm-research/open_llama_7b_v2",
    "HF_MODEL_ID":"Trelis/Llama-2-7b-chat-hf-function-calling",
    "HF_TASK":"text-generation",
    "SM_NUM_GPUS": "1",
    "HF_MODEL_QUANTIZE": "bitsandbytes"
}

llama_7b_bitsandbytes_model = deploy_model(
    endpoint_name=llama_7b_bitsandbytes_endpoint_name,
    instance_type="ml.g5.2xlarge",
    env=llama_7b_bitsandbytes_env,
    image_uri=llm_image_uri,
    wait=True,
)

Let's build a prompt

In [None]:
system_message = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

prompt = "Why is New York City sometimes referred to as the Big Apple?"
# Llama 2 chat format: the system message sits on its own lines inside <<SYS>> ... <</SYS>>
prompt_template = f"""[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{prompt} [/INST]"""

In [None]:
%%time
# invoke llm
# first invocation is going to be slower. Subsequent ones should be faster

body = {
    "inputs": prompt_template,
    "parameters": {
        "max_new_tokens": 250,
        "temperature": 0.7,
        "return_full_text": False, # if True this will return our original prompt along with the generated text
    },
}
resp = smr.invoke_endpoint(
    EndpointName=llama_7b_bitsandbytes_endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)
output = json.loads(resp["Body"].read().decode("utf-8"))
print(output[0]["generated_text"])

The first request took around 11 seconds because the model takes time to load. Let's run the prompt again now that the model is loaded.

In [None]:
%%time
resp = smr.invoke_endpoint(
    EndpointName=llama_7b_bitsandbytes_endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)
output = json.loads(resp["Body"].read().decode("utf-8"))
print(output[0]["generated_text"])

The second request takes less time because the Llama 2 7B Chat model quantized with Bitsandbytes is already loaded.
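
A single `%%time` measurement can be noisy, so it may help to repeat the request a few times and look at the first call separately from the later ones. The sketch below does this for any zero-argument callable. The helper names `time_calls` and `summarize` are hypothetical and not part of the SageMaker SDK.

```python
import statistics
import time


def time_calls(fn, n=5):
    """Call fn() n times and return the wall-clock latency of each call in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Report the first call separately from the median of the remaining calls."""
    later = latencies[1:] or latencies  # fall back to the only call if n == 1
    return {"first_s": latencies[0], "later_median_s": statistics.median(later)}
```

In this notebook, `fn` could be a lambda wrapping the `smr.invoke_endpoint(...)` call from the cell above. Because sampling with a non-zero temperature produces outputs of varying length, latencies will differ from run to run.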

In [None]:
# Clean up
sm.delete_endpoint(EndpointName=llama_7b_bitsandbytes_endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=llama_7b_bitsandbytes_endpoint_name)

## Deploy Llama 2 7B Chat quantized model with GPTQ

The repo maintained by [TheBloke](https://huggingface.co/TheBloke) contains many pre-quantized models. You can also easily quantize your own models with GPTQ by using the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library.

In [None]:
%%time
llama_7b_gptq_endpoint_name = sagemaker.utils.name_from_base("llama2-7b-gptq")

llama_7b_gptq_env = {
    # "HF_MODEL_ID": "TheBloke/Llama-2-7b-Chat-GPTQ",  # model_id from hf.co/models
    "HF_MODEL_ID": "Trelis/Llama-2-7b-chat-hf-function-calling-GPTQ",  # model_id from hf.co/models
    "HF_TASK":"text-generation",
    "SM_NUM_GPUS": "1",  # Number of GPU used per replica
    "HF_MODEL_QUANTIZE": "gptq",  # serve a pre-quantized model,
}

llama_7b_gptq_model = deploy_model(
    endpoint_name=llama_7b_gptq_endpoint_name,
    instance_type="ml.g5.2xlarge",
    env=llama_7b_gptq_env,
    image_uri=llm_image_uri,
    wait=True,
)

Once the endpoint is deployed we can invoke it using the boto3 SDK. We reuse the prompt built earlier, which follows the [Llama 2 recommended prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) and combines a system message with a user instruction, and pass it as the payload to the model.
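
For reference, the single-turn prompt format described in the linked post can be wrapped in a small helper. This is a sketch: the function name `build_llama2_prompt` is hypothetical, and the leading `<s>` token is left out on the assumption that the serving tokenizer adds the beginning-of-sequence token itself.

```python
def build_llama2_prompt(system_message: str, user_message: str) -> str:
    """Format a single-turn Llama 2 chat prompt with a system message."""
    return (
        f"[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


example_prompt = build_llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Why is New York City sometimes referred to as the Big Apple?",
)
print(example_prompt)
```

The returned string can be used as the `"inputs"` value of the request body.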

In [None]:
%%time
# invoke llm
# first invocation is going to be slower. Subsequent ones should be faster

body = {
    "inputs": prompt_template,
    "parameters": {
        "max_new_tokens": 250,
        "temperature": 0.8,
        "return_full_text": False, # if True this will return our original prompt along with the generated text
    },
}
resp = smr.invoke_endpoint(
    EndpointName=llama_7b_gptq_endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)
output = json.loads(resp["Body"].read().decode("utf-8"))
print(output[0]["generated_text"])

The first request took around 11 seconds because the model takes time to load. Let's run the prompt again now that the model is loaded.

In [None]:
%%time
resp = smr.invoke_endpoint(
    EndpointName=llama_7b_gptq_endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)
output = json.loads(resp["Body"].read().decode("utf-8"))
print(output[0]["generated_text"])

The second request takes less time because the Llama 2 7B Chat model quantized with GPTQ is already loaded.

Run the code below to delete the endpoint and avoid any additional charges.
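
Deleting an endpoint and its endpoint config leaves the SageMaker model object in place. The sketch below looks up the config and model names from the endpoint itself and removes all three resources using boto3 SageMaker client calls. The function name `delete_endpoint_resources` is an assumption for illustration, and error handling for missing resources is left out.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models the config references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    # Look everything up first, then delete in order: endpoint, config, models.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}
```

In this notebook the call would be `delete_endpoint_resources(sm, llama_7b_gptq_endpoint_name)`, used in place of the individual delete calls rather than in addition to them.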

In [None]:
# Clean up
sm.delete_endpoint(EndpointName=llama_7b_gptq_endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=llama_7b_gptq_endpoint_name)

# Conclusion

In this notebook we deployed a Llama 2 7B Chat model quantized with Bitsandbytes and then one quantized with GPTQ. For each model we sent a prompt for inference. We saw that the first request on a new endpoint took longer because the model was still loading, while the second invocation took less time.

Lastly, we compared the inference time of the Bitsandbytes and GPTQ endpoints and saw that the GPTQ model took around 50% less time than the Bitsandbytes model.
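
To put a number on such a comparison, the relative reduction in latency can be computed from repeated measurements on each endpoint. This is a minimal sketch: the function name `percent_faster` is hypothetical, and it uses the median of each list to limit the effect of a slow first request.

```python
import statistics


def percent_faster(baseline_latencies, candidate_latencies):
    """Percentage reduction in median latency of the candidate relative to the baseline."""
    baseline = statistics.median(baseline_latencies)
    candidate = statistics.median(candidate_latencies)
    return (baseline - candidate) / baseline * 100
```

Here `baseline_latencies` would hold the timings collected from the Bitsandbytes endpoint and `candidate_latencies` those from the GPTQ endpoint, ideally with the same prompt and generation parameters for both.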