# Deploy zai-org/GLM-4.5 model

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.

See the [technical blog](https://z.ai/blog/glm-4.5) for more details.

## Environment Setup

First, we'll upgrade the SageMaker SDK to ensure compatibility with the latest features, particularly those needed for large language model deployment and streaming inference.

> **Note**: The `--quiet` and `--no-warn-conflicts` flags are used to minimize unnecessary output while installing dependencies.

> ⚠️ **Important**: After running the installation cell below, you may need to restart your notebook kernel to ensure the updated packages are properly loaded.

In [None]:
%pip install --upgrade --quiet --no-warn-conflicts sagemaker sagemaker-core python-dotenv

In [None]:
import json
from dotenv import load_dotenv
import boto3
import sagemaker

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker session (public attribute)

smr_client = boto3.client("sagemaker-runtime")

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")
print(f"sagemaker version: {sagemaker.__version__}")

## Configure Model Container and Instance

For deploying GLM-4.5, we'll use:
- **vLLM v0.10.1**: An inference engine optimized for serving large language models, packaged as a container
- **P5en instance**: An NVIDIA H200-based GPU instance type suited to large model inference

Key configurations:
- The container URI points to the vLLM image in Amazon ECR (Elastic Container Registry)
- We use an `ml.p5en.48xlarge` instance, which offers:
  - 8 NVIDIA H200 GPUs
  - 1128 GB of total GPU memory
  - High network bandwidth for optimal inference performance

> **Note**: The region in the container URI must match your AWS region. The deployment cell builds the URI from the session's `region` variable; replace the account ID in it with the account that hosts your ECR repository.
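
To make the URI layout concrete, here is a minimal sketch that assembles a private ECR image URI from its parts. The helper name `ecr_image_uri` and the account ID `123456789012` are placeholders for illustration, not values from this notebook.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Build a private ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# "123456789012" is a placeholder account ID, not a real account
example_uri = ecr_image_uri("123456789012", "us-west-2", "vllm", "v0.10.1")
print(example_uri)
```

Changing only the `region` argument shows which part of the URI has to follow the region where the endpoint is created.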

## Deployment of GLM-4.5

Amazon SageMaker AI provides the ability to build Docker containers to run on SageMaker endpoints, where they listen for health checks on `/ping` and receive real-time inference requests on `/invocations`.
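
To illustrate that contract, here is a toy server built only on the Python standard library. It is a sketch of the two routes, not the vLLM container's actual server: the handler name `ContractHandler`, the `call` helper, and the echo-style response are all made up for this example.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ContractHandler(BaseHTTPRequestHandler):
    """Toy handler for the two routes SageMaker calls on a container."""

    def do_GET(self):
        # Health check: return HTTP 200 once the container is ready
        if self.path == "/ping":
            self._send(200, {"status": "ok"})
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self):
        # Real-time inference requests arrive on /invocations
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")
            # Placeholder "model": report how many messages were received
            self._send(200, {"received_messages": len(request.get("messages", []))})
        else:
            self._send(404, {"error": "not found"})

    def _send(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def call(server, method, path, payload=None):
    """Send one request to the local demo server and return (status, parsed JSON)."""
    host, port = server.server_address
    conn = http.client.HTTPConnection(host, port, timeout=5)
    body = json.dumps(payload) if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    conn.request(method, path, body=body, headers=headers)
    response = conn.getresponse()
    result = (response.status, json.loads(response.read()))
    conn.close()
    return result


# Port 0 picks any free local port; a real SageMaker container listens on port 8080
server = HTTPServer(("127.0.0.1", 0), ContractHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ping_result = call(server, "GET", "/ping")
invoke_result = call(server, "POST", "/invocations",
                     {"messages": [{"role": "user", "content": "hi"}]})
server.shutdown()
server.server_close()
print(ping_result, invoke_result)
```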

Below, we'll demonstrate how to adapt the [vLLM](https://github.com/vllm-project/vllm) framework to run on SageMaker AI endpoints.

#### Container preparation:
1. Enable Docker access in your Studio domain. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local-get-started.html#studio-updated-local-enable) for details.
2. Install Docker in your Studio environment. See the [installation instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-local-get-started.html#studio-updated-local-docker-installation) for details.
3. Modify `build.sh` to set the account ID, region, repository name, and tag as required.
4. Build the image and push it to the ECR repository by running `build.sh` in the `docker` directory.

In [None]:
%%writefile docker/build.sh
export REGION=YOUR_REGION
export ACCOUNT_ID=YOUR_ACCOUNT_ID
export REPOSITORY_NAME=vllm
export TAG=v0.10.1

full_name="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPOSITORY_NAME}:${TAG}"

DOCKER_BUILDKIT=1 docker build . --tag $REPOSITORY_NAME:$TAG --file Dockerfile

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region ${REGION} --repository-names "${REPOSITORY_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --region ${REGION} --repository-name "${REPOSITORY_NAME}" > /dev/null
fi

docker tag $REPOSITORY_NAME:$TAG ${full_name}
docker push ${full_name}

In [None]:
# !cd docker; ./build.sh; cd ..

Let's deploy the full version of the model using the vLLM container. See the [Hugging Face repo](https://huggingface.co/zai-org/GLM-4.5) for more details about the model.

Note that you need access to an `ml.p5en.48xlarge` instance to deploy the full version of the model.
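
A rough back-of-the-envelope calculation shows why an instance of this size is needed. The sketch below assumes BF16 weights (2 bytes per parameter) and 141 GB of memory per H200 GPU. Both figures are assumptions added here, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 355e9   # GLM-4.5 total parameters (from the model description)
BYTES_PER_PARAM = 2    # assumption: BF16 weights
GPU_COUNT = 8          # GPUs on ml.p5en.48xlarge
GPU_MEMORY_GB = 141    # assumption: memory capacity of one H200

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9   # total weight footprint
total_gpu_gb = GPU_COUNT * GPU_MEMORY_GB            # aggregate GPU memory
per_gpu_weights_gb = weights_gb / GPU_COUNT         # weights per GPU when sharded 8 ways
headroom_gb = GPU_MEMORY_GB - per_gpu_weights_gb    # left per GPU for KV cache and overhead

print(f"weights: {weights_gb:.0f} GB of {total_gpu_gb} GB total GPU memory")
print(f"per GPU: {per_gpu_weights_gb:.2f} GB weights, {headroom_gb:.2f} GB headroom")
```

Under these assumptions the weights alone take roughly 710 GB, so they only fit when sharded across all eight GPUs.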

In [None]:
env = {
    "OPTION_MODEL": "zai-org/GLM-4.5",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_TENSOR_PARALLEL_SIZE": "8"
}
inference_image = f"736221153822.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.1"
model_name = sagemaker.utils.name_from_base("glm4-5")
endpoint_config_name = model_name
endpoint_name = model_name
instance_type = "ml.p5en.48xlarge"
timeout = 3600
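
Container environment variables are passed to SageMaker as strings, so numeric settings such as the tensor parallel size are quoted above. The sketch below is a hypothetical pre-flight check: `validate_env` is a name invented here, and the rule that the tensor parallel size must lie between 1 and the instance's GPU count is a simplifying assumption.

```python
def validate_env(env, gpu_count):
    """Return a list of problems found in a container environment dict."""
    problems = []
    for key, value in env.items():
        # SageMaker expects every environment value to be a string
        if not isinstance(value, str):
            problems.append(f"{key}: value must be a string, got {type(value).__name__}")

    tp = env.get("OPTION_TENSOR_PARALLEL_SIZE")
    if isinstance(tp, str) and tp.isdigit():
        # Assumption: tensor parallelism cannot use more GPUs than the instance has
        if not 1 <= int(tp) <= gpu_count:
            problems.append(
                f"OPTION_TENSOR_PARALLEL_SIZE={tp} is outside the range 1..{gpu_count}"
            )
    return problems


example_env = {
    "OPTION_MODEL": "zai-org/GLM-4.5",
    "OPTION_MAX_MODEL_LEN": "16384",
    "OPTION_TENSOR_PARALLEL_SIZE": "8",
}
print(validate_env(example_env, gpu_count=8))
```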

We use the sagemaker-core SDK for model deployment.

This [SDK](https://github.com/aws/sagemaker-core) offers a more Pythonic, object-oriented syntax than boto3.

**Key Features:**
- Object-Oriented Interface: Provides a structured way to interact with SageMaker resources, making it easier to manage them using familiar object-oriented programming techniques.
- Resource Chaining: Allows seamless connection of SageMaker resources by passing outputs as inputs between them, simplifying workflows and reducing the complexity of parameter management.
- Full Parity with SageMaker APIs: Ensures access to all SageMaker capabilities through the SDK, providing a comprehensive toolset for building and deploying machine learning models.
- Abstraction of Low-Level Details: Automatically handles resource state transitions and polling logic, freeing developers from managing these intricacies and allowing them to focus on higher-level tasks.
- Auto Code Completion: Enhances the developer experience by offering real-time suggestions and completions in popular IDEs, reducing syntax errors and speeding up the coding process.
- Comprehensive Documentation and Type Hints: Provides detailed guidance and type hints to help developers understand functionalities, write code faster, and reduce errors without complex API navigation.
- Incorporation of Intelligent Defaults: Integrates the previous SageMaker SDK feature of intelligent defaults, allowing developers to set default values for parameters like IAM roles and VPC configurations. This streamlines the setup process, enabling developers to focus on customizations specific to their use case.

In [None]:
from sagemaker_core.shapes import ContainerDefinition, ProductionVariant
from sagemaker_core.resources import Model, EndpointConfig, Endpoint
from sagemaker_core.shapes import ProductionVariantRoutingConfig

In [None]:
model = Model.create(
    model_name=model_name,
    primary_container=ContainerDefinition(image=inference_image, environment=env),
    execution_role_arn=role,
    session=sess.boto_session,
    region=region,
)

In [None]:
endpoint_config = EndpointConfig.create(
    endpoint_config_name=endpoint_config_name,
    production_variants=[
        ProductionVariant(
            variant_name=model_name,
            initial_instance_count=1,
            instance_type=instance_type,
            model_name=model,  # resource chaining: pass the Model object created above
            container_startup_health_check_timeout_in_seconds=timeout,
            model_data_download_timeout_in_seconds=timeout,
            routing_config=ProductionVariantRoutingConfig(routing_strategy="LEAST_OUTSTANDING_REQUESTS"),
        )
    ],
)  # no trailing comma here, otherwise endpoint_config becomes a one-element tuple

In [None]:
endpoint = Endpoint.create(
    endpoint_name=endpoint_name,
    endpoint_config_name=endpoint_config_name
)
endpoint.wait_for_status("InService")

## Inference examples

### Invocation

GLM-4.5 is a hybrid (reasoning/non-reasoning) model.

You can enable or disable "thinking" mode per request by setting `enable_thinking` to `True` or `False` in `chat_template_kwargs`, for example `{"chat_template_kwargs": {"enable_thinking": False}}`.
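
A small helper can make the toggle explicit when you build requests. This is a minimal sketch: `build_chat_payload` is a hypothetical name, and the extra keyword arguments are simply merged into the request body as sampling parameters.

```python
import json


def build_chat_payload(prompt, enable_thinking=True, **sampling_params):
    """Return a JSON request body for a single-turn chat completion."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    body.update(sampling_params)  # e.g. max_tokens, temperature
    return json.dumps(body)


fast_payload = build_chat_payload("Hello", enable_thinking=False, max_tokens=64)
print(fast_payload)
```

The returned string can be passed as `Body` to `invoke_endpoint`, in the same way as the hand-written payloads in the cells below.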

In [None]:
payload = json.dumps(
    {
        "messages": [
            {"role": "user", "content": "How many times does the letter b appear in blueberry?"}
        ],
        "chat_template_kwargs": {
            "enable_thinking": False
        }
    }
)
res = smr_client.invoke_endpoint(EndpointName=endpoint_name,
                                 Body=payload,
                                 ContentType="application/json")

response = json.loads(res["Body"].read().decode("utf8"))
content = response["choices"][0]["message"]["content"]
usage = response["usage"]

print("###############\n" + content + "\n###############\n")
print(usage)

###############
The letter "b" appears **two** times in "blueberry".
###############

{'prompt_tokens': 21, 'total_tokens': 37, 'completion_tokens': 16, 'prompt_tokens_details': None}


In [None]:
payload = json.dumps(
    {
        "messages": [
            {"role": "user", "content": "How many times does the letter b appear in blueberry?"}
        ],
        "chat_template_kwargs": {
            "enable_thinking": True
        }
    }
)
res = smr_client.invoke_endpoint(EndpointName=endpoint_name,
                                 Body=payload,
                                 ContentType="application/json")

response = json.loads(res["Body"].read().decode("utf8"))
content = response["choices"][0]["message"]["content"]
usage = response["usage"]

print("###############\n" + content + "\n###############\n")
print(usage)

###############

<think>First, the question is: "How many times does the letter 'b' appear in 'blueberry'?"

I need to count the occurrences of the letter 'b' in the word "blueberry".

Let me write down the word: B-L-U-E-B-E-R-R-Y.

Now, I'll go through each letter and check if it's a 'b'.

- Position 1: B – yes, that's a 'b'.

- Position 2: L – not 'b'.

- Position 3: U – not 'b'.

- Position 4: E – not 'b'.

- Position 5: B – yes, that's another 'b'.

- Position 6: E – not 'b'.

- Position 7: R – not 'b'.

- Position 8: R – not 'b'.

- Position 9: Y – not 'b'.

So, I found 'b' at positions 1 and 5. That's two times.

I should confirm the spelling. Is "blueberry" spelled correctly? Yes, it's B-L-U-E-B-E-R-R-Y. No 'b's after that.

The question is about the letter 'b', and it's case-sensitive? The word "blueberry" is typically written in lowercase, but in this context, it's probably not case-sensitive, or we should consider it as given. The question says "the letter b", and in "blueber
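
With thinking mode enabled, the reasoning appears inside `<think>` tags in the message content, as in the output above. The sketch below separates the reasoning from the final answer, assuming the reasoning is returned inline in this form. `split_thinking` is a hypothetical helper, and it treats a missing closing tag as reasoning that was cut off before an answer was produced.

```python
def split_thinking(content):
    """Return (reasoning, answer) from a response that may contain <think> tags."""
    open_tag, close_tag = "<think>", "</think>"
    start = content.find(open_tag)
    if start == -1:
        # No reasoning block: the whole content is the answer
        return "", content.strip()
    end = content.find(close_tag, start)
    if end == -1:
        # Closing tag missing: generation stopped while still reasoning
        return content[start + len(open_tag):].strip(), ""
    reasoning = content[start + len(open_tag):end].strip()
    answer = content[end + len(close_tag):].strip()
    return reasoning, answer


reasoning, answer = split_thinking("\n<think>Count the b's: positions 1 and 5.</think>\nTwo.")
print("reasoning:", reasoning)
print("answer:", answer)
```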

In [None]:
#
# By default for chat completion the 'thinking' mode is enabled.
#
payload = json.dumps(
    {
        "messages": [
            {"role": "user", "content": "What is bigger 9.8 or 9.11?"}
        ],
    }
)
res = smr_client.invoke_endpoint(EndpointName=endpoint_name,
                                 Body=payload,
                                 ContentType="application/json")

response = json.loads(res["Body"].read().decode("utf8"))
content = response["choices"][0]["message"]["content"]
usage = response["usage"]

print("###############\n" + content + "\n###############\n")
print(usage)

### Streaming invocation

In [None]:
import io
import json
import time
import boto3
from IPython.display import clear_output

class LineIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if "PayloadPart" not in chunk:
                print(f"Unknown event type: {chunk}")  # chunk is a dict, so format it instead of concatenating
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

def stream_response(endpoint_name, inputs, max_tokens=8189, temperature=0.7, top_p=0.9):
    body = {
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": inputs}]}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": True,
    }

    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )

    event_stream = resp["Body"]
    start_json = b"{"
    full_response = ""
    start_time = time.time()
    token_count = 0

    for line in LineIterator(event_stream):
        if line != b"" and start_json in line:
            data = json.loads(line[line.find(start_json):].decode("utf-8"))
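            # The delta may omit "content" or set it to null, so fall back to an empty string
            token_text = data['choices'][0]['delta'].get('content') or ''
            full_response += token_text
            token_count += 1

            # Approximate tokens per second, counting each streamed chunk as one token
            elapsed_time = time.time() - start_time
            tps = token_count / elapsed_time if elapsed_time > 0 else 0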

            # Clear the output and reprint everything
            clear_output(wait=True)
            print(full_response)
            print(f"\nTokens per Second: {tps:.2f}", end="")

    print("\n") # Add a newline after response is complete

    return full_response
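
To see how the streamed bytes turn into text without calling an endpoint, the sketch below runs a hand-made event stream through a compact line splitter and a delta extractor. It assumes the OpenAI-style server-sent events format (`data: {...}` lines ending with `data: [DONE]`). The names `iter_lines`, `extract_delta_text`, and `fake_stream` are illustrative and not part of any SDK.

```python
import json


def iter_lines(event_stream):
    """Yield complete lines from PayloadPart chunks, even when a line spans chunks."""
    pending = b""
    for event in event_stream:
        if "PayloadPart" not in event:
            continue  # skip non-payload events
        pending += event["PayloadPart"]["Bytes"]
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:
        yield pending  # trailing data without a final newline


def extract_delta_text(line):
    """Return the text carried by one 'data: {...}' line, or '' if there is none."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return ""
    data = text[len("data:"):].strip()
    if data == "[DONE]":
        return ""  # end-of-stream marker
    choices = json.loads(data).get("choices") or []
    if not choices:
        return ""
    return choices[0].get("delta", {}).get("content") or ""


# Hand-made stream: the first JSON line is deliberately split across two chunks
fake_stream = [
    {"PayloadPart": {"Bytes": b'data: {"choices": [{"delta": {"content": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}]}\n\ndata: {"choices": [{"delta": {"content": " world"}}]}\n\n'}},
    {"PayloadPart": {"Bytes": b"data: [DONE]\n\n"}},
]
streamed_text = "".join(extract_delta_text(line) for line in iter_lines(fake_stream))
print(streamed_text)
```

Buffering until a newline arrives is what lets the parser tolerate chunk boundaries that fall in the middle of a JSON object. The `LineIterator` class above does the same job with a `BytesIO` buffer.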

In [None]:
inputs = "What is greater 9.11 or 9.8?"
output = stream_response(endpoint_name, inputs, max_tokens=8000)

In [None]:
inputs = "How many times does the letter b appear in blueberry?"
output = stream_response(endpoint_name, inputs, max_tokens=8000)

## Cleanup

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(model_name)