## Installation and Configuration Guide for GGUF Model Deployment to SageMaker Real-Time Inference

In this tutorial, you will learn how to serve GGUF models with the llama.cpp server in a custom container, using the Bring Your Own Container (BYOC) approach, and deploy them to SageMaker Real-Time Inference.

### 1. Preparation  

#### 1.1 Docker Validation
Confirm that Docker is available in the notebook environment; it is needed to build the container image.

In [None]:
!docker version

#### 1.2 Initialize Environment Variables

In [None]:
import subprocess
S3_BUCKET_NAME = "llama-model-file" # Change the bucket name before running this cell
ECR_REPOSITORY_NAME = "llama-gguf-ecr"
AWS_ACCOUNT_ID = subprocess.check_output("aws sts get-caller-identity --query Account --output text", shell=True).decode().strip()
AWS_REGION = subprocess.check_output("aws configure get region", shell=True).decode().strip()
ECR_URI=f"{AWS_ACCOUNT_ID}.dkr.ecr.{AWS_REGION}.amazonaws.com"
ECR_REPOSITORY_URI=f"{ECR_URI}/{ECR_REPOSITORY_NAME}"
IMAGE_TAG="latest"
MODEL_NAME="Meta-Llama-3-8B.Q4_K_M.gguf"

##### Print and Validate Environment Variables

In [None]:
!echo $S3_BUCKET_NAME
!echo $AWS_ACCOUNT_ID
!echo $AWS_REGION
!echo $ECR_URI
!echo $ECR_REPOSITORY_URI
!echo $IMAGE_TAG
!echo $MODEL_NAME
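
Echoing the values shows them, but it does not catch an empty region or a malformed bucket name. The following is a minimal sketch of a programmatic check, not part of the original notebook. `validate_settings` is a hypothetical helper, and the bucket-name pattern covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit).

```python
import re

# Simplified S3 bucket-name rule; not every S3 restriction is checked here.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def validate_settings(settings):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, value in settings.items():
        if not value:
            problems.append(f"{name} is empty")
    bucket = settings.get("S3_BUCKET_NAME", "")
    if bucket and not _BUCKET_RE.match(bucket):
        problems.append(f"S3_BUCKET_NAME {bucket!r} is not a valid bucket name")
    account = settings.get("AWS_ACCOUNT_ID", "")
    if account and not re.fullmatch(r"\d{12}", account):
        problems.append("AWS_ACCOUNT_ID should be a 12-digit number")
    return problems


# Example values only; in the notebook you would pass the variables defined above.
example = {
    "S3_BUCKET_NAME": "llama-model-file",
    "AWS_ACCOUNT_ID": "123456789012",
    "AWS_REGION": "us-east-1",
    "MODEL_NAME": "Meta-Llama-3-8B.Q4_K_M.gguf",
}
print(validate_settings(example))
```

An empty list means nothing suspicious was found. An empty `AWS_REGION` usually means no default region is configured for the AWS CLI.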

#### 1.3 Confirm that the IAM role has permission to push images to ECR

In [None]:
import sagemaker
import boto3
import json

# Get SageMaker session
sagemaker_session = sagemaker.Session()

# Get the role ARN attached to the Notebook instance
role_arn = sagemaker_session.get_caller_identity_arn()
print(f"SageMaker Notebook Role ARN: {role_arn}")
role_name = role_arn.split('/')[-1]
print(f"\nOpen this link to check ExecutionPolicy: https://console.aws.amazon.com/iam/home?#/roles/details/{role_name}?section=permissions")

##### Check whether the execution policy of the IAM role includes the following ECR permissions. If not, add them.

```json
{
    "Effect": "Allow",
    "Action": [
        "ecr:CompleteLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:InitiateLayerUpload",
        "ecr:BatchCheckLayerAvailability",
        "ecr:PutImage",
        "ecr:BatchGetImage"
    ],
    "Resource": "arn:aws:ecr:*:*:*"
}
```
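
If you would rather not compare the policy by eye, the check can be sketched in a few lines. This is an illustration under stated assumptions. `missing_ecr_actions` is a hypothetical helper that takes the `Statement` list of a policy document (for example, pasted from the IAM console) and reports which of the six actions above are not covered by any `Allow` statement. It ignores `Deny` statements, conditions, and resource scoping, so treat the result as a hint rather than an authoritative evaluation.

```python
from fnmatch import fnmatchcase

# The six ECR actions listed in the policy statement above.
REQUIRED_ECR_ACTIONS = [
    "ecr:CompleteLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:InitiateLayerUpload",
    "ecr:BatchCheckLayerAvailability",
    "ecr:PutImage",
    "ecr:BatchGetImage",
]


def missing_ecr_actions(statements):
    """Return the required actions that no Allow statement grants."""
    allowed_patterns = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a single string
            actions = [actions]
        # IAM action names are case-insensitive and may contain wildcards.
        allowed_patterns.extend(a.lower() for a in actions)
    return [
        action
        for action in REQUIRED_ECR_ACTIONS
        if not any(fnmatchcase(action.lower(), p) for p in allowed_patterns)
    ]


# Made-up statements for illustration only.
example_statements = [
    {"Effect": "Allow", "Action": ["ecr:BatchGetImage", "ecr:Put*"], "Resource": "*"},
]
print(missing_ecr_actions(example_statements))
```

With the sample statements, the four layer-related actions are reported as missing, while `ecr:PutImage` is covered by the `ecr:Put*` wildcard.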

#### 1.4 Download GGUF model file from HuggingFace

In [None]:
%cd /home/ec2-user/SageMaker
!mkdir -p /home/ec2-user/SageMaker/workspace
!wget -O /home/ec2-user/SageMaker/workspace/$MODEL_NAME https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF/resolve/main/Meta-Llama-3-8B.Q4_K_M.gguf
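
A failed or redirected download can leave an HTML error page on disk under the model's filename. As a quick sanity check, the sketch below reads the file header: GGUF files begin with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit version number. `read_gguf_version` is a hypothetical helper name, and the check confirms only the header, not the integrity of the whole file.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the header is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # The magic is followed by a little-endian uint32 version number.
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Usage in the notebook (path taken from the download cell above):
# read_gguf_version("/home/ec2-user/SageMaker/workspace/Meta-Llama-3-8B.Q4_K_M.gguf")
```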

#### 1.5 Create S3 Bucket for Storing Models

In [None]:
!aws s3 mb s3://$S3_BUCKET_NAME

#### 1.6 Upload Model file to S3

In [None]:
!aws s3 cp ./workspace/$MODEL_NAME s3://$S3_BUCKET_NAME

#### 1.7 Create ECR Repository for Storing BYOC Model

In [None]:
!aws ecr create-repository --repository-name $ECR_REPOSITORY_NAME

### 2. Build BYOC Code

Organize the files that go into the Docker image as follows:
```structured text
workspace
|-- Dockerfile
|-- main.py
|-- requirements.txt
|-- serve
|-- server.sh
```

- **Dockerfile**: Describes how to build the container, which is based on the llama.cpp image.
- **main.py**: A Flask application (wrapped as an ASGI app and served by uvicorn) that forwards requests to llama.cpp and meets the hosting requirements of SageMaker inference nodes. It accepts POST requests on `/invocations` and GET requests on `/ping`.
- **requirements.txt**: Python dependencies.
- **serve**: The entry file for SageMaker inference nodes. It starts the uvicorn server for main.py on port 8080; main.py in turn talks to the llama.cpp service.
- **server.sh**: Starts the llama.cpp runtime server on port 8081.
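
To make the hosting contract concrete, here is a toy server that answers the two routes SageMaker calls on the container: `GET /ping` for health checks and `POST /invocations` for inference. It is a sketch built on the Python standard library rather than the Flask and uvicorn stack used in this guide. `ContractHandler`, `call`, and `demo` are illustrative names, and the handler simply echoes the prompt instead of calling llama.cpp.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ContractHandler(BaseHTTPRequestHandler):
    """Toy handler for the two routes SageMaker calls on port 8080."""

    def do_GET(self):
        # Health check: SageMaker expects HTTP 200 from GET /ping.
        self.send_response(200 if self.path == "/ping" else 404)
        self.end_headers()

    def do_POST(self):
        if self.path != "/invocations":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # A real container would forward the prompt to llama.cpp here.
        reply = json.dumps({"echo": body.get("prompt", "")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def call(port, method, path, payload=None):
    """Send one request to the local toy server; return (status, body bytes)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    body = json.dumps(payload).encode() if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    conn.request(method, path, body=body, headers=headers)
    resp = conn.getresponse()
    data = resp.read()
    conn.close()
    return resp.status, data


def demo():
    """Start the toy server on a free local port and exercise both routes."""
    server = HTTPServer(("127.0.0.1", 0), ContractHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        ping_status, _ = call(server.server_port, "GET", "/ping")
        _, raw = call(server.server_port, "POST", "/invocations", {"prompt": "hello"})
    finally:
        server.shutdown()
        server.server_close()
    return ping_status, json.loads(raw)


print(demo())
```

The same two calls are a handy smoke test against the real image when it is run locally with port 8080 published.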

*References:*
- *https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html*
- *https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.html*

In [None]:
#Switch the working directory to workspace
%cd /home/ec2-user/SageMaker/workspace

#### 2.1 Create Dockerfile
The Dockerfile in the example uses a CUDA-based llama.cpp image to support inference on GPU. You can also use the llama.cpp:full image for CPU-based inference.

In [None]:
%%writefile Dockerfile
FROM ghcr.io/ggerganov/llama.cpp:full-cuda

# Send Python log output straight to the stream instead of buffering it
ENV PYTHONUNBUFFERED=1
# Set MODELPATH environment variable
ENV MODELPATH=/app/llm_model.bin

ENV PATH=$PATH:/app

ENV BUCKET=""
ENV BUCKET_KEY=""
ENV GPU_LAYERS=999

# The working directory in the Docker image
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    unzip \
    libcurl4-openssl-dev \
    python3 \
    python3-pip \
    python3-dev \
    git \
    psmisc \
    pciutils 

# Copy requirements.txt; the dependencies are installed further down
COPY requirements.txt ./requirements.txt
# Main application file
COPY main.py /app/
# SageMaker endpoints run the serve file to start the application
COPY serve /app/
COPY server.sh /app/

RUN chmod u+x serve
RUN chmod u+x server.sh

# /app is already on PATH through the ENV instruction above
RUN pip3 install -r requirements.txt

ENTRYPOINT ["/bin/bash"]

# Expose port for the application to run on, has to be 8080
EXPOSE 8080

#### 2.2 Create the main.py file

This code is adapted from https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/api_like_OAI.py

The original script is a simple HTTP API server and a basic web frontend for interacting with llama.cpp. To support GGUF model inference on Amazon SageMaker, this guide makes the following targeted changes to the original code:

Port Configuration

- The original `api_like_OAI.py` used default port 8081, connecting to llama.cpp on port 8080
- To accommodate SageMaker inference listening ports, main.py's default port was changed to 8080, connecting to the llama.cpp API port on 8081

Routing Differences

- main.py modified the route from `/chat/completions` to `/invocations` to meet SageMaker inference requirements
- A new health check route `/ping` was added in main.py

Model Loading

- main.py adds an `update_model` function that uses boto3 to download the model file from S3 automatically when the container starts
- Model files can also be replaced while the container is running, which makes model management more flexible
- After the model download completes, a subprocess starts the llama.cpp server on port 8081
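
The core of the request handling is a translation step: the JSON body received on `/invocations` is turned into the body that the llama.cpp `/completion` endpoint expects. The sketch below is a condensed illustration of what `make_postData` in main.py does, not a replacement for it. `to_llama_request` and `PASSTHROUGH_KEYS` are hypothetical names, and only a subset of the parameters is handled.

```python
# Keys copied through unchanged (a subset of those handled in main.py).
PASSTHROUGH_KEYS = ("temperature", "top_k", "top_p", "repeat_penalty", "seed")


def to_llama_request(body, default_stop="</s>"):
    """Translate an /invocations body into a llama.cpp /completion body (condensed)."""
    post = {"prompt": body["prompt"], "stream": bool(body.get("stream", False))}
    for key in PASSTHROUGH_KEYS:
        if body.get(key) is not None:
            post[key] = body[key]
    # The OpenAI-style max_tokens parameter becomes llama.cpp's n_predict.
    if body.get("max_tokens") is not None:
        post["n_predict"] = body["max_tokens"]
    # The default stop string comes first, followed by any caller-supplied ones.
    post["stop"] = [default_stop] + list(body.get("stop") or [])
    return post


print(to_llama_request({"prompt": "Hi", "max_tokens": 16, "temperature": 0.7}))
```

The notable rename is `max_tokens` to `n_predict`; most sampling parameters keep their names.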


In [None]:
%%writefile main.py
#!/usr/bin/env python3
import argparse
from asgiref.wsgi import WsgiToAsgi
from flask import Flask, jsonify, request, Response
import urllib.parse
import requests
import time
import json
import boto3
import os
import subprocess
import traceback


app = Flask(__name__)
slot_id = -1

parser = argparse.ArgumentParser(description="An example of using server.cpp with a similar API to OAI. It must be used together with server.cpp.")
parser.add_argument("--chat-prompt", type=str, help="the top prompt in chat completions(default: 'A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\\n')", default='A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.\\n')
parser.add_argument("--user-name", type=str, help="USER name in chat completions(default: '\\nUSER: ')", default="\\nUSER: ")
parser.add_argument("--ai-name", type=str, help="ASSISTANT name in chat completions(default: '\\nASSISTANT: ')", default="\\nASSISTANT: ")
parser.add_argument("--system-name", type=str, help="SYSTEM name in chat completions(default: '\\nASSISTANT's RULE: ')", default="\\nASSISTANT's RULE: ")
parser.add_argument("--stop", type=str, help="the end of response in chat completions(default: '</s>')", default="</s>")
parser.add_argument("--llama-api", type=str, help="Set the address of server.cpp in llama.cpp(default: http://127.0.0.1:8081)", default='http://127.0.0.1:8081')
parser.add_argument("--api-key", type=str, help="Set the api key to allow only few user(default: NULL)", default="")
parser.add_argument("--host", type=str, help="Set the ip address to listen.(default: 127.0.0.1)", default='127.0.0.1')
parser.add_argument("--port", type=int, help="Set the port to listen.(default: 8080)", default=8080)

args, unknown = parser.parse_known_args()

def is_present(json, key):
    # True only when the key exists and its value is not None
    return key in json and json[key] is not None

#convert chat to prompt
def convert_chat(messages):
    prompt = "" + args.chat_prompt.replace("\\n", "\n")

    system_n = args.system_name.replace("\\n", "\n")
    user_n = args.user_name.replace("\\n", "\n")
    ai_n = args.ai_name.replace("\\n", "\n")
    stop = args.stop.replace("\\n", "\n")


    for line in messages:
        if (line["role"] == "system"):
            prompt += f"{system_n}{line['content']}"
        if (line["role"] == "user"):
            prompt += f"{user_n}{line['content']}"
        if (line["role"] == "assistant"):
            prompt += f"{ai_n}{line['content']}{stop}"
    prompt += ai_n.rstrip()

    return prompt

def make_postData(body, chat=False, stream=False):
    postData = {}
    if (chat):
        postData["prompt"] = convert_chat(body["messages"])
    else:
        postData["prompt"] = body["prompt"]
    if(is_present(body, "temperature")): postData["temperature"] = body["temperature"]
    if(is_present(body, "top_k")): postData["top_k"] = body["top_k"]
    if(is_present(body, "top_p")): postData["top_p"] = body["top_p"]
    if(is_present(body, "max_tokens")): postData["n_predict"] = body["max_tokens"]
    if(is_present(body, "presence_penalty")): postData["presence_penalty"] = body["presence_penalty"]
    if(is_present(body, "frequency_penalty")): postData["frequency_penalty"] = body["frequency_penalty"]
    if(is_present(body, "repeat_penalty")): postData["repeat_penalty"] = body["repeat_penalty"]
    if(is_present(body, "mirostat")): postData["mirostat"] = body["mirostat"]
    if(is_present(body, "mirostat_tau")): postData["mirostat_tau"] = body["mirostat_tau"]
    if(is_present(body, "mirostat_eta")): postData["mirostat_eta"] = body["mirostat_eta"]
    if(is_present(body, "seed")): postData["seed"] = body["seed"]
    if(is_present(body, "logit_bias")): postData["logit_bias"] = [[int(token), body["logit_bias"][token]] for token in body["logit_bias"].keys()]
    if (args.stop != ""):
        postData["stop"] = [args.stop]
    else:
        postData["stop"] = []
    if(is_present(body, "stop")): postData["stop"] += body["stop"]
    postData["n_keep"] = -1
    postData["stream"] = stream
    postData["cache_prompt"] = True
    postData["slot_id"] = slot_id
    return postData

def make_resData(data, chat=False, promptToken=[]):
    resData = {
        "id": "chatcmpl" if (chat) else "cmpl",
        "object": "chat.completion" if (chat) else "text_completion",
        "created": int(time.time()),
        "truncated": data["truncated"],
        "model": "LLaMA_CPP",
        "usage": {
            "prompt_tokens": data["tokens_evaluated"],
            "completion_tokens": data["tokens_predicted"],
            "total_tokens": data["tokens_evaluated"] + data["tokens_predicted"]
        }
    }
    if (len(promptToken) != 0):
        resData["promptToken"] = promptToken
    if (chat):
        #only one choice is supported
        resData["choices"] = [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": data["content"],
            },
            "finish_reason": "stop" if (data.get("stopped_eos", False) or data.get("stopped_word", False)) else "length"
        }]
    else:
        #only one choice is supported
        resData["choices"] = [{
            "text": data["content"],
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop" if (data.get("stopped_eos", False) or data.get("stopped_word", False)) else "length"
        }]
    return resData

def make_resData_stream(data, chat=False, time_now = 0, start=False):
    resData = {
        "id": "chatcmpl" if (chat) else "cmpl",
        "object": "chat.completion.chunk" if (chat) else "text_completion.chunk",
        "created": time_now,
        "model": "LLaMA_CPP",
        "choices": [
            {
                "finish_reason": None,
                "index": 0
            }
        ]
    }
    if (chat):
        if (start):
            resData["choices"][0]["delta"] =  {
                "role": "assistant"
            }
        else:
            resData["choices"][0]["delta"] =  {
                "content": data["content"]
            }
            if (data["stop"]):
                resData["choices"][0]["finish_reason"] = "stop" if (data.get("stopped_eos", False) or data.get("stopped_word", False)) else "length"
    else:
        resData["choices"][0]["text"] = data["content"]
        if (data["stop"]):
            resData["choices"][0]["finish_reason"] = "stop" if (data.get("stopped_eos", False) or data.get("stopped_word", False)) else "length"

    return resData

def update_model(bucket, key):
    try:
        s3 = boto3.client('s3')
        s3.download_file(bucket, key, os.environ.get('MODELPATH'))
        subprocess.run(["/app/server.sh", os.environ.get('MODELPATH')])
        return True
    except Exception as e:
        print(e)
        print(str(traceback.format_exc()))
        return False

@app.route('/ping', methods=['GET'])
def ping():
    return Response(status=200)

@app.route("/invocations", methods=['POST'])
def completion():
    if (args.api_key != "" and request.headers["Authorization"].split()[1] != args.api_key):
        return Response(status=403)
    body = request.get_json()
    stream = False
    tokenize = False
    if (is_present(body, "configure")): 
        res = update_model(body["configure"]["bucket"], body["configure"]["key"])
        return Response(status=200) if (res) else Response(status=500)
    if(is_present(body, "stream")): stream = body["stream"]
    if(is_present(body, "tokenize")): tokenize = body["tokenize"]
    postData = make_postData(body, chat=False, stream=stream)

    promptToken = []
    if (tokenize):
        tokenData = requests.request("POST", urllib.parse.urljoin(args.llama_api, "/tokenize"), data=json.dumps({"content": postData["prompt"]})).json()
        promptToken = tokenData["tokens"]

    if (not stream):
        data = requests.request("POST", urllib.parse.urljoin(args.llama_api, "/completion"), data=json.dumps(postData))
        print(data.json())
        resData = make_resData(data.json(), chat=False, promptToken=promptToken)
        return jsonify(resData)
    else:
        def generate():
            data = requests.request("POST", urllib.parse.urljoin(args.llama_api, "/completion"), data=json.dumps(postData), stream=True)
            time_now = int(time.time())
            for line in data.iter_lines():
                if line:
                    decoded_line = line.decode('utf-8')
                    resData = make_resData_stream(json.loads(decoded_line[6:]), chat=False, time_now=time_now)
                    yield 'data: {}\n'.format(json.dumps(resData))
        return Response(generate(), mimetype='text/event-stream')

update_model(os.environ.get('BUCKET'), os.environ.get('BUCKET_KEY'))

asgi_app = WsgiToAsgi(app)

#if __name__ == '__main__':
#    app.run(args.host, port=args.port)


#### 2.3 Create requirements.txt

In [None]:
%%writefile requirements.txt
flask
asgiref
boto3
starlette
uvicorn
requests

#### 2.4 Create the entry file 'serve' for SageMaker inference node

In [None]:
%%writefile serve
#!/bin/sh
echo "serve"
uvicorn 'main:asgi_app' --host 0.0.0.0 --port 8080 --workers 8

#### 2.5 Create the startup script for llama.cpp service

In [None]:
%%writefile server.sh
#!/bin/sh
echo "server.sh"
echo "args: $1"
echo "GPU Layer: $GPU_LAYERS"

# Check if NVIDIA GPU is available
if lspci | grep -i nvidia > /dev/null 2>&1; then
  echo "NVIDIA GPU is available."
  NGL="$GPU_LAYERS"
  CPU_PER_SLOT=1
else
  echo "No NVIDIA GPU found."
  NGL=0
  CPU_PER_SLOT=4
fi

killall llama-server
/app/llama-server -m "$1" -c 2048 -t $(nproc --all) --host 0.0.0.0 --port 8081 -cb -np $(($(nproc --all) / $CPU_PER_SLOT)) -ngl $NGL &
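
The values passed to `-ngl` and `-np` follow from simple arithmetic on the CPU count. The sketch below mirrors that arithmetic in Python so the outcome for a given instance can be checked in advance. `slot_plan` is a hypothetical helper, and the per-slot context figure rests on an assumption: that the llama.cpp server divides the total context (`-c`) evenly between parallel slots. Verify this against the llama.cpp version in your image.

```python
def slot_plan(cpu_count, has_gpu, gpu_layers=999, ctx_size=2048):
    """Mirror the arithmetic in server.sh for the -ngl and -np arguments."""
    cpu_per_slot = 1 if has_gpu else 4
    ngl = gpu_layers if has_gpu else 0
    # Shell $(( a / b )) truncates, like Python's // for positive numbers.
    parallel = cpu_count // cpu_per_slot
    # Assumption: the total context is split evenly between slots.
    ctx_per_slot = ctx_size // parallel if parallel else 0
    return {"ngl": ngl, "parallel": parallel, "ctx_per_slot": ctx_per_slot}


# Example: 8 vCPUs with a GPU, then the same CPU count without one.
print(slot_plan(8, has_gpu=True, gpu_layers=32))
print(slot_plan(8, has_gpu=False))
```

Under that assumption, 8 vCPUs with a GPU give 8 slots of 256 tokens each, which is small for long prompts. On a CPU-only host with fewer than 4 vCPUs the division yields 0 slots, so the script would need a lower bound for such instances.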

#### 2.6 Package Docker Image and push to ECR

In [None]:
!aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ECR_URI

In [None]:
!docker build -t "$ECR_REPOSITORY_URI":"$IMAGE_TAG" .

In [None]:
!docker push "$ECR_REPOSITORY_URI":"$IMAGE_TAG"

### 3. Deploy to SageMaker inference endpoint

In [None]:
import sagemaker
import os
import boto3
import json

# Setup role and sagemaker session
iam_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
sagemaker_runtime = boto3.client('sagemaker-runtime')

In [None]:
container_uri = f"{ECR_REPOSITORY_URI}:{IMAGE_TAG}"
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("llama-cpp-gguf-byoc")
print(container_uri)
print(endpoint_name)

In [None]:
model = sagemaker.Model(
    image_uri=container_uri,
    role=iam_role,
    name=endpoint_name,
    env={
        "MODELPATH": f"/app/{MODEL_NAME}",
        "BUCKET": S3_BUCKET_NAME,
        "BUCKET_KEY": MODEL_NAME,
        "GPU_LAYERS": "32",
    }
)

In [None]:
# Deploy the model to a SageMaker endpoint
# Estimated deployment time: about 10 minutes
from datetime import datetime

print(datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
model.deploy(
    instance_type=instance_type,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
)
print(datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

To replace the model file at runtime, send a request whose body contains a `configure` object with the S3 bucket name and object key. The container then downloads that file and restarts the llama.cpp server.

---
```python
payload = {
    "configure": {
        "bucket": S3_BUCKET_NAME,
        "key": MODEL_NAME
    }
}
response = sagemaker_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/json',
            Body=json.dumps(payload)
        )
print(f"response: {response}")
```
---

In [None]:
# Define the invoke function. The first invocation may take a while because the model has to load.

def invoke_sagemaker_endpoint(endpoint_name, llama_args):
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(llama_args),
        ContentType='application/json',
    )
    response_body = json.loads(response['Body'].read().decode())
    return response_body


In [None]:
"""
Non-streaming inference example. 
"""
llama_args = {
    "prompt": "What are the most popular tourist attractions in Beijing?",
    "max_tokens": 512,
    "temperature": 3,
    "repeat_penalty":10,
    "frequency_penalty":1.1,
    "top_p": 1
}
inference = invoke_sagemaker_endpoint(endpoint_name, llama_args)
print(inference['choices'][0]['text'])
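
Besides the generated text, the response carries a finish reason and token counts. The sketch below pulls them out of a response body shaped like the output of `make_resData` in main.py. `summarize_completion` is a hypothetical helper, and the sample values are invented for illustration.

```python
def summarize_completion(response_body):
    """Extract the text, finish reason, and token counts from one response."""
    choice = response_body["choices"][0]  # main.py returns exactly one choice
    usage = response_body.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written sample shaped like the make_resData output; values are made up.
sample = {
    "id": "cmpl",
    "object": "text_completion",
    "model": "LLaMA_CPP",
    "truncated": False,
    "usage": {"prompt_tokens": 12, "completion_tokens": 512, "total_tokens": 524},
    "choices": [{"text": "The Forbidden City ...", "index": 0,
                 "logprobs": None, "finish_reason": "length"}],
}
print(summarize_completion(sample))
```

A `finish_reason` of `length` means generation stopped at the `max_tokens` limit rather than at a stop string or end-of-sequence token.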

In [None]:
# Define Streaming processing function

def invoke_sagemaker_streaming_endpoint(endpoint_name, llama_args):
    response = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(llama_args),
        ContentType='application/json',
    )    
    event_stream = response['Body']
    print(event_stream)
    for line in event_stream:
        itm = line['PayloadPart']['Bytes'][6:]
        try:
            res = json.loads(itm, strict=False )
            print(res["choices"][0]["text"], end='')
        except (ValueError, KeyError, IndexError):
            # Not valid JSON, e.g. an empty token
            pass
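
The function above assumes that every payload part holds exactly one complete `data: {...}` line. Payload parts may not line up with line boundaries, in which case an event split across two parts is silently dropped. The following is a minimal sketch of a buffered alternative, under that assumption. `iter_stream_text` is a hypothetical helper that accepts any iterable of byte chunks, so it can be tried without an endpoint.

```python
import json


def iter_stream_text(byte_chunks):
    """Yield text fragments from `data: {...}` lines, buffering partial lines."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Only complete lines (terminated by a newline) are parsed.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            try:
                event = json.loads(line[len(b"data:"):])
            except ValueError:
                continue  # skip malformed events instead of failing the stream
            yield event["choices"][0].get("text", "")


# Simulated payload parts: the second event is split across two chunks.
chunks = [
    b'data: {"choices": [{"text": "Hel"}]}\n',
    b'data: {"choices": [{"te',
    b'xt": "lo"}]}\n',
]
print("".join(iter_stream_text(chunks)))
```

With a real endpoint, the chunks could be supplied as `(e["PayloadPart"]["Bytes"] for e in response["Body"] if "PayloadPart" in e)`.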

In [None]:
"""
Streaming inference example.
To enable streaming mode, set "stream": True in the request body.
"""

llama_args = {
    "prompt": "What are the most popular tourist attractions in Beijing?",
    "max_tokens": 512,
    "temperature": 3,
    "repeat_penalty":10,
    "frequency_penalty":1.1,
    "top_p": 1,
    "stream": True
}

invoke_sagemaker_streaming_endpoint(endpoint_name, llama_args)

#### Delete Model, Endpoint, Endpoint config
*To delete the model, endpoint, and endpoint configuration created above, run the following cell.*

In [None]:
sagemaker_client = boto3.client('sagemaker')
response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
config_name = response['EndpointConfigName']

# Delete the model
try:
    model.delete_model()
    print(f"Deleted model: {model.name}")
except Exception as e:
    print(f"Error deleting model: {e}")

# Delete the endpoint
try:
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    print(f"Deleted endpoint: {endpoint_name}")
except Exception as e:
    print(f"Error deleting endpoint: {e}")

# Delete the endpoint configuration
try:
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    print(f"Deleted endpoint configuration: {config_name}")
except Exception as e:
    print(f"Error deleting endpoint configuration: {e}")