# Run Small and Large Language Models (SLM, LLM) on AWS Graviton CPU Instances with Amazon SageMaker

> This notebook has been tested on the Python 3 kernel of a SageMaker Jupyter notebook instance (ml.m5.xlarge) with 50 GB of disk.


Small language models (SLMs) can be a compelling choice for applications requiring lower latency, reduced compute requirements, and cost-effectiveness. This workshop focuses on deploying and scaling both SLMs and LLMs on AWS Graviton or x86 CPU-based ML instances. Attendees will gain hands-on experience in optimizing model performance, understanding the trade-offs between model size and computational efficiency, and implementing scalable inference using Amazon SageMaker.


Amazon SageMaker AI provides the ability to build Docker containers to run on SageMaker endpoints, where they listen for health checks on /ping and receive real-time inference requests on /invocations. Using SageMaker AI for inference offers several benefits:

- **Scalability**: SageMaker AI can automatically scale your inference endpoints up and down based on demand, ensuring your models can handle varying workloads.
- **High Availability**: SageMaker AI manages the infrastructure and maintains the availability of your inference endpoints, so you don't have to worry about managing the underlying resources.
- **Monitoring and Logging**: SageMaker AI provides built-in monitoring and logging capabilities, making it easier to track the performance and health of your inference endpoints.
- **Security**: SageMaker AI integrates with other AWS services, such as AWS Identity and Access Management (IAM), to provide robust security controls for your inference workloads.

SageMaker provides [pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that help you get started quickly with model inference on SageMaker. It also allows you to [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html) and use it inside SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:

- Your container must have a web server listening on port 8080.
- Your container must accept POST requests to the /invocations and /ping real-time endpoints.
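
To make this contract concrete, here is a minimal sketch of a server that satisfies it, written with Python's standard library only. The handler class, the helper names, and the echo behaviour are hypothetical and stand in for a real model; Llama.cpp's own server is what this notebook actually deploys. The sketch answers `/ping` on both GET and POST, mirroring the routes added to Llama.cpp later in this notebook.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class SageMakerHandler(BaseHTTPRequestHandler):
    """Hypothetical handler illustrating the two routes SageMaker calls."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: a 200 response signals that the container is ready
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        elif self.path == "/invocations":
            # Inference request: echo the JSON body in place of a real model
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length)) if length else {}
            self._send_json(200, {"echo": request})
        else:
            self._send_json(404, {"error": "not found"})

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(host="127.0.0.1", port=8080):
    """Start the server on a background thread and return it."""
    server = HTTPServer((host, port), SageMakerHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(port, method, path, payload=None):
    """Send one request to the local server; return (status, parsed JSON body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    body = json.dumps(payload).encode() if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    conn.request(method, path, body=body, headers=headers)
    response = conn.getresponse()
    result = (response.status, json.loads(response.read()))
    conn.close()
    return result
```

Inside a container the server would bind to `0.0.0.0` on port `8080`; the loopback default here only keeps the sketch safe to run locally.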

In this notebook, we'll demonstrate how to adapt the [**Llama.cpp**](https://github.com/ggml-org/llama.cpp) framework to run on SageMaker AI endpoints. Llama.cpp is an open-source C++ inference engine, developed by Georgi Gerganov and the community, that enables efficient CPU-based inference for a large set of language model architectures, including Llama, Mistral, Qwen, and Falcon.

By using Llama.cpp and building a custom Docker container, you can run small and large language models such as [Qwen 2.5 3B Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF) on a SageMaker AI endpoint using CPU-based Graviton or x86 ML instances.


### Clone Llama.cpp

In [None]:
! git clone https://github.com/ggml-org/llama.cpp.git

### Setup SageMaker

Install the necessary dependencies

In [None]:
! pip install -U sagemaker boto3 "huggingface_hub[cli]"

Import the necessary packages and initialize the variables

In [None]:
import json
import sagemaker
import boto3
from typing import List, Dict
from datetime import datetime

region = boto3.Session().region_name
session = sagemaker.session.Session(boto_session=boto3.Session(region_name=region))
role = sagemaker.get_execution_role()
client = boto3.client('sagemaker-runtime', region_name=region)
sagemaker_client = boto3.client('sagemaker', region_name=region)
bucket = session.default_bucket()

### Prepare the Llama.cpp SageMaker container

SageMaker AI makes extensive use of Docker containers for build and runtime tasks. Using containers, you can train machine learning algorithms and deploy models quickly and reliably at any scale. See [this link](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image) to understand how SageMaker AI runs your inference image. 

- For model inference, SageMaker AI runs the container as:
```
docker run image serve
```

- You can provide your entrypoint script in `exec` form to define how the container performs inference, for example:
```
ENTRYPOINT ["python", "inference.py"]
```

- To receive inference requests, the container must have a web server listening on port `8080` and must accept `POST` requests to the `/invocations` and `/ping` endpoints.


If you already have a Docker image, see the instructions for [adapting your own inference container for SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html). Note also that the containers provided by SageMaker AI automatically implement a web server that responds to `/invocations` and `/ping` (health check) requests. You can find more about the [prebuilt SageMaker AI Docker images for deep learning in the SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html).




Llama.cpp provides a base [Dockerfile here](https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md). You can directly extend the base image with

In this example, we have copied the whole base Dockerfile and added the lines below to install the AWS CLI, which is used to download the model from S3.
You can add additional layers to the container image to accommodate your specific use case.
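
The `ENTRYPOINT` at the end of the Dockerfile below decides what to run based on the `serve` argument that SageMaker passes and on the `MODEL_S3_PATH` environment variable. As a reading aid, the same decision logic can be sketched in Python. The function name `plan_startup` is hypothetical, and the container itself runs the bash version, not this code.

```python
import os
import posixpath


def plan_startup(arg, model_s3_path):
    """Return the commands the entrypoint would run, as argument lists."""
    if arg != "serve":
        return []  # the entrypoint only reacts to the 'serve' argument
    if not model_s3_path:
        # No S3 model configured: llama-server starts without the -m flag
        return [["/app/llama-server"]]
    model_file = posixpath.basename(model_s3_path)
    return [
        ["aws", "s3", "cp", model_s3_path, "/models/"],
        ["/app/llama-server", "-m", f"/models/{model_file}"],
    ]


# Show the plan for the current environment (empty path if the variable is unset)
for command in plan_startup("serve", os.environ.get("MODEL_S3_PATH", "")):
    print(" ".join(command))
```

When `MODEL_S3_PATH` is empty, `llama-server` still starts and can pick up its model from its own environment variables instead.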

In [None]:
%%writefile ./llama.cpp/.devops/sagemaker.Dockerfile
ARG UBUNTU_VERSION=22.04

FROM ubuntu:$UBUNTU_VERSION AS build

ARG TARGETARCH

ARG GGML_CPU_ARM_ARCH=armv8-a

RUN apt-get update && \
    apt-get install -y build-essential git cmake libcurl4-openssl-dev

WORKDIR /app

COPY . .


# Build llama.cpp for x86 or Graviton

RUN ARCH=`uname -m` && \
    echo "Building for architecture: $ARCH" && \
    if [ "$ARCH" = "x86_64" ]; then \
        cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=ON -DGGML_NATIVE=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON; \
    elif [ "$ARCH" = "aarch64" ]; then \
        cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=ON -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=${GGML_CPU_ARM_ARCH}; \
    else \
        echo "Unsupported architecture"; \
        exit 1; \
    fi && \
    cmake --build build -j $(nproc)

RUN mkdir -p /app/lib && \
    find build -name "*.so" -exec cp {} /app/lib \;

RUN mkdir -p /app/full \
    && cp build/bin/* /app/full \
    && cp *.py /app/full \
    && cp -r gguf-py /app/full \
    && cp -r requirements /app/full \
    && cp requirements.txt /app/full \
    && cp .devops/tools.sh /app/full/tools.sh

## Base image
FROM ubuntu:$UBUNTU_VERSION AS base

RUN apt-get update \
    && apt-get install -y libgomp1 unzip curl\
    && apt autoremove -y \
    && apt clean -y \
    && rm -rf /tmp/* /var/tmp/* \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
    && find /var/cache -type f -delete

COPY --from=build /app/lib/ /app


### Server, Server only
FROM base AS server

ENV LLAMA_ARG_HOST=0.0.0.0
ENV MODEL_S3_PATH=""

# Download the AWS CLI installer that matches the build architecture (x86 or Graviton)

RUN ARCH=`uname -m` && \
    echo "Installing AWS CLI for architecture: $ARCH" && \
    if [ "$ARCH" = "x86_64" ]; then \
        curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"; \
    elif [ "$ARCH" = "aarch64" ]; then \
        curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"; \
    else \
        echo "Unsupported architecture"; \
        exit 1; \
    fi

# Clean up unnecessary files after installation
RUN unzip -qq awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws && \
    mkdir -p /models

COPY --from=build /app/full/llama-server /app

WORKDIR /app

# Expose port for the application to run on, has to be 8080
EXPOSE 8080

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]


# Add serve argument to entrypoint to download model from S3 and start llama-server
ENTRYPOINT ["/bin/bash", "-c", "\
    echo \"Starting entrypoint with arg: $1\"; \
    if [ \"$1\" = \"serve\" ]; then \
        if [ ! -z \"$MODEL_S3_PATH\" ]; then \
            echo \"serve command detected and MODEL_S3_PATH is set to: $MODEL_S3_PATH\"; \
            MODEL_FILE=$(basename $MODEL_S3_PATH); \
            echo \"Downloading model file: $MODEL_FILE\"; \
            aws s3 cp $MODEL_S3_PATH /models/; \
            echo \"Starting llama-server with model: /models/$MODEL_FILE\"; \
            /app/llama-server -m /models/$MODEL_FILE; \
        else \
            echo \"MODEL_S3_PATH not set, starting llama-server without model\"; \
            /app/llama-server; \
        fi \
    else \
        echo \"'serve' command not provided\"; \
    fi", "--"]



Next, create an ECR repository for the custom Docker image, build the image locally, and push it to the repository. Make sure the IAM role used here has permission to push to ECR.

The build cell below might take some time, so please be patient. If you have already built the Docker image in another development environment, feel free to skip it.

<div class="alert alert-block alert-danger">
If you want to deploy the model on a Graviton CPU-based ML instance such as the c7g instance family, <b>you must build the docker image on an ARM64/AARCH64 CPU</b>, since Graviton is based on ARM64/AARCH64 architecture.

If you only have access to an x86-based CPU, you can still build the image and use it to deploy the model on an x86 CPU ML instance on Amazon SageMaker, such as the c7i instance family.
</div>

You need to adapt the llama.cpp source code so that the server exposes SageMaker's required paths */ping* and */invocations* on port *8080*.

To do that, follow these steps:

1. Go to **./llama.cpp/tools/server/server.cpp**

In [None]:
! cat ./llama.cpp/tools/server/server.cpp | grep 'svr->'

2. Find the API routes and add three new routes to the list

```c++
    svr->Post("/invocations",         handle_chat_completions);
    svr->Get ("/ping",                handle_health);
    svr->Post("/ping",                handle_health);

```
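
If you prefer to script this edit instead of making it by hand, the following is a minimal sketch. It assumes that routes are registered on lines beginning with `svr->Get` or `svr->Post`, that the last such line is a complete single-line statement, and that `handle_chat_completions` and `handle_health` are already defined at that point in the file. None of this is guaranteed across llama.cpp versions, so review the resulting diff before building. The helper name `add_sagemaker_routes` is hypothetical.

```python
from pathlib import Path

SAGEMAKER_ROUTES = [
    '    svr->Post("/invocations",         handle_chat_completions);',
    '    svr->Get ("/ping",                handle_health);',
    '    svr->Post("/ping",                handle_health);',
]


def add_sagemaker_routes(source: str) -> str:
    """Insert the SageMaker routes after the last existing route registration."""
    if '"/invocations"' in source:
        return source  # already patched, nothing to do
    lines = source.split("\n")
    route_indexes = [
        i for i, line in enumerate(lines)
        if line.lstrip().startswith(("svr->Get", "svr->Post"))
    ]
    if not route_indexes:
        raise ValueError("no svr->Get/svr->Post route registrations found")
    insert_at = route_indexes[-1] + 1
    return "\n".join(lines[:insert_at] + SAGEMAKER_ROUTES + lines[insert_at:])


# Patch the file in place only if the cloned repository is present
server_cpp = Path("./llama.cpp/tools/server/server.cpp")
if server_cpp.exists():
    server_cpp.write_text(add_sagemaker_routes(server_cpp.read_text()))
```

The function is idempotent, so re-running the cell does not add the routes twice.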

In [None]:
%%sh
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPOSITORY_NAME=llama.cpp-sagemaker

# Create ECR repository if needed
if aws ecr describe-repositories --repository-names "${REPOSITORY_NAME}" >/dev/null 2>&1; then
    echo "Repository ${REPOSITORY_NAME} already exists"
else
    echo "Creating ECR repository ${REPOSITORY_NAME}..."
    aws ecr create-repository \
        --repository-name "${REPOSITORY_NAME}" \
        --region "${REGION}"
fi

# Build the Docker image and push it to the ECR repository
cd ./llama.cpp
sudo docker build -t llama.cpp-sagemaker --target server -f .devops/sagemaker.Dockerfile .
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker tag llama.cpp-sagemaker:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest

Now, you can download a model from Hugging Face. Llama.cpp expects models to be in GGUF format. You can either convert your favorite model from Safetensors to GGUF, or you can just download a model in GGUF format from Hugging Face.

In this example, we will download a GGUF version of the Qwen 2.5 3B Instruct model quantized to 8 bits (Q8_0).

In [None]:
! huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF qwen2.5-3b-instruct-q8_0.gguf --local-dir ./models
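
Before uploading, it can be worth confirming that the downloaded file really is GGUF. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check is possible without loading the model. This is a minimal sketch; the helper name `looks_like_gguf` is hypothetical, and it checks only the magic bytes, not the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path):
    """Return True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(len(GGUF_MAGIC)) == GGUF_MAGIC


# Check the file downloaded above, if it is present
model_path = Path("./models/qwen2.5-3b-instruct-q8_0.gguf")
if model_path.exists():
    print(f"{model_path} looks like GGUF: {looks_like_gguf(model_path)}")
```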

In [None]:
! aws s3 cp ./models/qwen2.5-3b-instruct-q8_0.gguf s3://{bucket}/models/qwen2.5-3b-instruct-q8_0.gguf

### Prepare and deploy the model on SageMaker

Choose an appropriate image URI, model name and endpoint name for hosting your model.

In [None]:
llama_cpp_inference_image_uri = f'{session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama.cpp-sagemaker:latest'
# add datetime to names

model_name = f"qwen-2-5-3b-llama-cpp-{datetime.now().strftime('%m-%d-%Y-%Hh%Mm')}"
endpoint_name = f"{model_name}-ep-{datetime.now().strftime('%m-%d-%Y-%Hh%Mm')}"
s3_model_path = f's3://{bucket}/models/qwen2.5-3b-instruct-q8_0.gguf'

<div class="alert alert-block alert-warning">
In the SageMaker Llama.cpp container image, you can use the environment variables of Llama.cpp's llama-server to choose a GGUF model from Hugging Face using <b><i>LLAMA_ARG_HF_FILE</i></b> and <b><i>LLAMA_ARG_HF_REPO</i></b>.
<br>
<br>
    
⚠️ However, for better download performance, we recommend storing the GGUF model in S3 and downloading it from there. Once you have uploaded your model to your S3 bucket of choice, set the <b><i>MODEL_S3_PATH</i></b> environment variable to the S3 URI of your GGUF model.
</div>
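
The two options described above are mutually exclusive, so a small helper can make the choice explicit and catch a mixed configuration early. This is a sketch under that assumption; the function name `build_container_env` is hypothetical, while the environment variable names are the ones used in this notebook.

```python
def build_container_env(s3_model_path=None, hf_repo=None, hf_file=None, port=8080):
    """Build the container environment for exactly one model source."""
    env = {"LLAMA_ARG_PORT": str(port)}
    if s3_model_path and (hf_repo or hf_file):
        raise ValueError("choose either an S3 path or a Hugging Face repo/file, not both")
    if s3_model_path:
        if not s3_model_path.startswith("s3://"):
            raise ValueError("s3_model_path must be an s3:// URI")
        env["MODEL_S3_PATH"] = s3_model_path
    elif hf_repo and hf_file:
        env["LLAMA_ARG_HF_REPO"] = hf_repo
        env["LLAMA_ARG_HF_FILE"] = hf_file
    else:
        raise ValueError("provide s3_model_path, or both hf_repo and hf_file")
    return env


# Example with a placeholder bucket name
print(build_container_env(s3_model_path="s3://my-bucket/models/qwen2.5-3b-instruct-q8_0.gguf"))
```

The returned dictionary can be passed as the `env` argument of `sagemaker.Model`.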

In [None]:
llama_cpp_model = sagemaker.Model(
    image_uri=llama_cpp_inference_image_uri,
    env={
        "LLAMA_ARG_PORT": "8080",
#        "LLAMA_ARG_HF_FILE": "qwen2.5-3b-instruct-q8_0.gguf",
#        "LLAMA_ARG_HF_REPO": "Qwen/Qwen2.5-3B-Instruct-GGUF",
##################
#       OR       #
##################
        "MODEL_S3_PATH": s3_model_path,
    },
    role=role,
    name=model_name,
    sagemaker_session=sagemaker.Session()
)

> Make sure you have enough quota for the ML instance types you want to use. In the cell below, we use memory-optimized ML instances such as r7i or r8g, which provide better memory bandwidth than general-purpose ML instances. Memory bandwidth is essential for token generation.
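
To see why memory bandwidth matters, a common rule of thumb is that generating one token for a single request requires streaming roughly all model weights from memory once. Under that simplifying assumption, bandwidth divided by model size gives a rough ceiling on tokens per second. The numbers below are illustrative placeholders, not measurements of any instance type or of the model used here.

```python
def estimate_max_tokens_per_second(memory_bandwidth_gb_s, model_size_gb):
    """Rule-of-thumb ceiling: each generated token reads all weights once."""
    if memory_bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return memory_bandwidth_gb_s / model_size_gb


# Illustrative numbers only
model_size_gb = 3.5  # hypothetical size of an 8-bit 3B-parameter GGUF file
for bandwidth in (50, 100, 200):  # hypothetical memory bandwidths in GB/s
    ceiling = estimate_max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{bandwidth:>4} GB/s -> at most ~{ceiling:.0f} tokens/s")
```

Real throughput is lower than this ceiling, but the estimate shows why doubling memory bandwidth can matter more for generation speed than adding cores.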

In [None]:
pretrained_llama_cpp_predictor = llama_cpp_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.r7i.2xlarge",
##################
#       OR       #
##################
#    instance_type="ml.r8g.2xlarge",
    container_startup_health_check_timeout=1200,
    wait=True
)
print(f"Your Llama.cpp Model Endpoint: {endpoint_name} has been deployed!")

> The endpoint startup time should be around 3 minutes.
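
The `deploy` call above already blocks because it uses `wait=True`. If you deploy with `wait=False` instead, you can poll the endpoint status yourself. The sketch below uses the `describe_endpoint` call of the boto3 SageMaker client; the function name `wait_for_endpoint` and its default polling interval and timeout are hypothetical choices.

```python
import time


def wait_for_endpoint(sagemaker_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "no failure reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still in status {status}")
        time.sleep(poll_seconds)
```

With the variables defined earlier in this notebook, the call would be `wait_for_endpoint(sagemaker_client, endpoint_name)`.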

In [None]:
prompt="What is artificial intelligence?"


response = client.invoke_endpoint_with_response_stream(
                EndpointName=endpoint_name,
                ContentType="application/json",
                Body=json.dumps({
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 4000,
                    "temperature": 0.6,
                    "top_p": 0.9,
                    "stream": True,
                    "logprobs": False,
                    "stream_options":{
                        "include_usage": False
                    }
                })
            )

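full_response = ""
buffer = b""
for event in response['Body']:
    # A PayloadPart can hold a partial line or several SSE lines, so buffer the bytes
    buffer += event.get('PayloadPart', {}).get('Bytes', b'')
    *lines, buffer = buffer.split(b"\n")
    for line in lines:
        line = line.decode().strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            continue
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if chunk.get('choices'):
            content = chunk['choices'][0].get('delta', {}).get('content') or ''
            print(content, end='')
            full_response += content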
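
If you do not need streaming, the same endpoint can be called with `invoke_endpoint`. The sketch below only builds the request body and extracts the reply, so it can be tried without a live endpoint. It assumes the server returns an OpenAI-style chat completion with the text under `choices[0].message.content`; the helper names are hypothetical.

```python
import json


def build_chat_payload(prompt, max_tokens=512, temperature=0.6, top_p=0.9):
    """Build a non-streaming chat request body as a JSON string."""
    return json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": False,
    })


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Usage against a deployed endpoint (requires AWS credentials, so left commented):
# raw = client.invoke_endpoint(
#     EndpointName=endpoint_name,
#     ContentType="application/json",
#     Body=build_chat_payload("What is artificial intelligence?"),
# )["Body"].read()
# print(extract_reply(raw))
```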

### Cleanup

In [None]:
# Delete model and endpoint
try:
    print(f"Deleting model: {model_name} ✅")
    sagemaker_client.delete_model(ModelName=model_name)
except Exception as e:
    print(f"{e}")

try:
    print(f"Deleting endpoint: {endpoint_name} ✅")
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
except Exception as e:
    print(f"{e}")

try:
    print(f"Deleting endpoint config: {endpoint_name} ✅")
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
except Exception as e:
    print(f"{e}")

print(f"Done")