# Qwen3-Next-80B-A3B-Instruct on SageMaker with SGLang

This notebook deploys **Qwen/Qwen3-Next-80B-A3B-Instruct** on **Amazon SageMaker** using **SGLang**. It:
- Downloads model weights from Hugging Face → uploads to your **S3** bucket
- Builds a **SageMaker-compatible SGLang** container (extends upstream `Dockerfile.sagemaker`)
- Creates a **real-time endpoint** on a multi-GPU instance (e.g., `ml.p5.48xlarge`)
- Exposes an **OpenAI-compatible** API for chat completions

> ⚠️ **Heads-up**: Qwen3-Next-80B-A3B is *very* large. Make sure your account has quota for H100 `p5.48xlarge` instances and sufficient EBS/S3 storage and network bandwidth. You’ll also need a Hugging Face token with access to the model.

## References
- Upstream SGLang Dockerfile for SageMaker: `docker/Dockerfile.sagemaker` in the SGLang repo
- Example structure based on the **DeepSeek SGLang** SageMaker notebook from the `aws-samples/sagemaker-genai-hosting-examples` repo
- Qwen3-Next-80B-A3B-Instruct model card on Hugging Face

### About **Qwen3-Next-80B-A3B-Instruct**

**Qwen3-Next-80B-A3B-Instruct** is a very large, instruction-tuned multilingual LLM from the Qwen family, designed for high-quality reasoning, tool-use/code assistance, and general chat. As an “Instruct” variant, it’s optimized to follow natural-language directions with guardrails for safer, more helpful outputs.

**Highlights**
- **Scale & quality:** 80B total parameters in a sparse mixture-of-experts design, with roughly 3B activated per token (the “A3B” in the name), for strong reasoning and generation quality across broad tasks (analysis, coding, data wrangling, multi-turn chat).
- **Instruction tuning (“Instruct”):** Aligned for cooperative behavior and concise, on-topic responses.
- **Long-context capable:** Supports long prompts/outputs (check the model card & your hardware budget for practical limits).
- **Multilingual coverage:** Performs across multiple languages; English-first workflows typically see top quality.
- **Enterprise considerations:** Requires multi-GPU nodes (e.g., H100/A100) and fast storage/network for weight loading and high-throughput serving.
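
To make the multi-GPU requirement concrete, the sketch below estimates the memory needed for the weights alone. It assumes 2 bytes per parameter (bf16/fp16 weights), which is an assumption rather than a figure taken from the model card, and it ignores KV cache, activations, and runtime overhead. `weight_memory_gb` is a hypothetical helper used only for this illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 80B parameters, assumed to be stored as bf16 (2 bytes each)
total_gb = weight_memory_gb(80e9)
# Weights sharded evenly across 8 GPUs with tensor parallelism
per_gpu_gb = total_gb / 8

print(f"weights: {total_gb:.0f} GB total, {per_gpu_gb:.0f} GB per GPU")
```

Under these assumptions the weights fit comfortably on eight 80 GB GPUs, leaving the remaining memory for the KV cache. Note that every expert has to be resident in GPU memory even though only a fraction of the parameters is active for any given token.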

**Common uses**
- Analyst-style reasoning and summarization
- Code explanation/refactoring and structured extraction
- RAG/chat over private corpora (pair with a vector store + tools)
- Function/tool calling in agentic workflows


### Why **SGLang on SageMaker** for Qwen3-Next-80B

**SGLang** is a high-throughput LLM serving stack that exposes an **OpenAI-compatible API** and focuses on efficient multi-request scheduling on GPUs. Running it on **Amazon SageMaker** lets you scale and operate the service with managed infrastructure.

**Serving benefits with SGLang**
- **OpenAI-style API**: Easy drop-in for chat/completions tooling.
- **High throughput**: Continuous batching, RadixAttention (prefix caching), and smart scheduling to keep GPUs busy.
- **Distributed inference**: Tensor/Data/Expert parallelism flags (e.g., `--tp`, `--dp`, `--ep`) to shard 80B across multiple GPUs.
- **Latency controls**: Tunables like `--chunked-prefill-size`, max batch tokens, and KV-cache dtype to trade latency vs. throughput.
- **Simple entrypoint**: The repo’s `docker/serve` script launches `sglang.launch_server` on **port 8080** (SageMaker’s expected port).
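
As a rough illustration of how these flags fit together, the sketch below assembles a `sglang.launch_server` command line. `build_launch_args` is a hypothetical helper, and the exact arguments that the upstream `docker/serve` script passes are not reproduced here; treat this as an approximation of a typical launch rather than the script's actual contents.

```python
import shlex
from typing import List


def build_launch_args(model_path: str, tp: int, dp: int = 1, port: int = 8080) -> List[str]:
    """Assemble an illustrative sglang.launch_server command (hypothetical helper)."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),   # SageMaker expects the container to listen on 8080
        "--tp", str(tp),       # tensor parallel degree
    ]
    if dp > 1:
        args += ["--dp", str(dp)]  # only add data parallelism when requested
    return args


cmd = build_launch_args("/opt/ml/model", tp=8)
print(shlex.join(cmd))
```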


## 0) Prerequisites
- AWS credentials with permissions for SageMaker, ECR, S3, CloudWatch
- A SageMaker Notebook/Studio instance with **Docker** available (or run the Docker build steps elsewhere and push the image to ECR), with an EBS volume large enough to hold the downloaded model
- HF token with model access: https://huggingface.co/settings/tokens
- (Optional) VPC/Subnets/Security Groups if deploying privately

In [None]:
# If needed, install packages in the notebook kernel
%pip install --quiet --upgrade boto3 sagemaker botocore huggingface_hub "pyyaml<7"

In [None]:
import os
import json
import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker import get_execution_role
from huggingface_hub import login
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.model import Model
from sagemaker.predictor import Predictor

# -----------------------
# 1) Configuration
# -----------------------
AWS_REGION       = os.environ.get("AWS_REGION", "region") #update with your region
ACCOUNT_ID       = boto3.client("sts").get_caller_identity()["Account"]
S3_BUCKET        = os.environ.get("S3_BUCKET", f"sagemaker-{AWS_REGION}-{ACCOUNT_ID}")
S3_PREFIX        = os.environ.get("S3_PREFIX", "models/qwen3-next-80b-a3b-instruct")
MODEL_S3_PREFIX = f"s3://{S3_BUCKET}/{S3_PREFIX}/"
HF_TOKEN         = os.environ.get("HF_TOKEN", "")  # or set manually
HF_REPO_ID       = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# Choose a big multi-GPU instance
INSTANCE_TYPE    = os.environ.get("INSTANCE_TYPE", "ml.p5.48xlarge")  # 8x H100 80GB
INSTANCE_COUNT   = int(os.environ.get("INSTANCE_COUNT", "1"))

# Parallelism and server tuning (tweak for your hardware & budget)
TP_SIZE          = int(os.environ.get("TP_SIZE", "8"))   # tensor parallel across 8 GPUs on p5.48xlarge
DP_SIZE          = int(os.environ.get("DP_SIZE", "1"))   # data parallel replicas per instance

# ECR image naming
ECR_REPO_NAME    = os.environ.get("ECR_REPO_NAME", "sglang-sagemaker-qwen80b")
ECR_TAG          = os.environ.get("ECR_TAG", "v1")
ECR_URI          = f"{ACCOUNT_ID}.dkr.ecr.{AWS_REGION}.amazonaws.com/{ECR_REPO_NAME}:{ECR_TAG}"

# Try to resolve the SageMaker role (falls back to env var if running outside Studio)
try:
    ROLE = get_execution_role()
except Exception:
    ROLE = os.environ.get("SAGEMAKER_EXECUTION_ROLE", "arn:aws:iam::<YOUR-ACCOUNT-ID>:role/<YOUR-SM-ROLE>")

# Set the default region first so the SageMaker session picks it up
boto3.setup_default_session(region_name=AWS_REGION)
sm_sess = Session()
s3 = boto3.client("s3", region_name=AWS_REGION)

print("Region: ", AWS_REGION)
print("Account:", ACCOUNT_ID)
print("Role:   ", ROLE)
print("Bucket: ", S3_BUCKET)
print("HF Repo:", HF_REPO_ID)
print("Model Prefix: ", MODEL_S3_PREFIX)
print("S3 Prefix:", S3_PREFIX)

In [None]:
# 2) Ensure S3 bucket exists (no-op if already created)
def ensure_bucket(bucket: str, region: str):
    s3 = boto3.client("s3", region_name=region)
    try:
        s3.head_bucket(Bucket=bucket)
        print(f"Bucket {bucket} exists")
    except Exception:
        params = {"Bucket": bucket}
        if region != "us-east-1":  # us-east-1 rejects an explicit LocationConstraint
            params["CreateBucketConfiguration"] = {"LocationConstraint": region}
        s3.create_bucket(**params)
        print(f"Created bucket {bucket}")

ensure_bucket(S3_BUCKET, AWS_REGION)
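
The cell above treats any `head_bucket` failure as "bucket missing". A slightly more defensive variant, sketched below, separates a missing bucket from one you are not allowed to access. `bucket_status` is a hypothetical helper; it relies on the `response["Error"]["Code"]` field that botocore's `ClientError` carries, and the set of codes it checks is an assumption about what S3 returns in these two cases.

```python
def bucket_status(s3_client, bucket: str) -> str:
    """Return 'exists', 'missing', or 'forbidden' for a bucket name."""
    try:
        s3_client.head_bucket(Bucket=bucket)
        return "exists"
    except Exception as err:  # botocore raises ClientError, which carries .response
        code = str(getattr(err, "response", {}).get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchBucket", "NotFound"):
            return "missing"
        if code in ("403", "AccessDenied", "Forbidden"):
            return "forbidden"
        raise  # anything else (throttling, networking) should surface as-is


# Usage in the notebook (requires AWS credentials):
# status = bucket_status(boto3.client("s3", region_name=AWS_REGION), S3_BUCKET)
```

A `forbidden` result usually means the bucket name is already owned by another account, in which case creating it would fail as well.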

In [None]:
# 3) Login to Hugging Face (required to download gated weights if applicable)
if HF_TOKEN:
    login(token=HF_TOKEN)
    print("HF_TOKEN is set.")
else:
    print("HF_TOKEN is empty. If the model requires auth, set HF_TOKEN first.")

## 4) Download the model from Hugging Face and upload the model artifacts to Amazon S3

In this example, we download a copy of the model from Hugging Face, upload it to an S3 location in your AWS account, and then deploy an endpoint from those uploaded artifacts.

**Best practice: store models in your own S3 bucket**

For production use cases, always download and store model files in your own S3 bucket so that you deploy validated artifacts. This provides verified provenance, improved access control, consistent availability, protection against upstream changes, and compliance with organizational security protocols.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

# Download the model into the directory the notebook is running in
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = HF_REPO_ID
# Only download config, tokenizer, and weight files
allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]

# Use snapshot_download to fetch the whole repository, including the weight files stored with LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
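
To see what the `allow_patterns` filter keeps, the sketch below applies the same glob patterns to a handful of file names using Python's `fnmatch`. The file names are hypothetical examples, not a listing of the actual repository, and `is_allowed` is an illustrative helper rather than part of `huggingface_hub`.

```python
from fnmatch import fnmatch

allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]


def is_allowed(filename: str, patterns) -> bool:
    """Return True when the file name matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical file names, for illustration only
candidates = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "README.md",
    "tokenizer.json",
    "merges.txt",
]
kept = [name for name in candidates if is_allowed(name, allow_patterns)]
print(kept)
```

Files such as `README.md` are skipped, which keeps the download and the later S3 upload limited to what the server needs.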

### Upload model files to S3
SageMaker AI can load uncompressed model files, so we upload the folder that contains the model files directly to S3.

Note: The default SageMaker bucket follows the naming pattern `sagemaker-{region}-{account-id}`.

In [None]:
# Upload to the configured bucket so the files land under MODEL_S3_PREFIX
model_artifact = sm_sess.upload_data(path=model_download_path, bucket=S3_BUCKET, key_prefix=S3_PREFIX)
print(f"Model uploaded to --- > {model_artifact}")
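
Before deploying, it can be worth confirming that the upload is complete, since a missing shard only shows up later as a failed container start. The sketch below compares file count and total size between the local snapshot and the S3 prefix. Both helpers are hypothetical, and `summarize_s3` expects a boto3 S3 client to be passed in.

```python
from pathlib import Path


def summarize_local(root) -> tuple:
    """Return (file_count, total_bytes) for all files under a local directory."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def summarize_s3(s3_client, bucket: str, prefix: str) -> tuple:
    """Return (object_count, total_bytes) for all objects under an S3 prefix."""
    count = total = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):  # empty prefixes have no "Contents"
            count += 1
            total += obj["Size"]
    return count, total


# Usage in the notebook (requires AWS credentials):
# assert summarize_local(model_download_path) == summarize_s3(s3, S3_BUCKET, S3_PREFIX)
```

The comparison only holds when the prefix contains nothing but this upload; a prefix that already held other objects would report a larger count.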

## 5) Build and Push a SageMaker-compatible SGLang image

We build the **official SGLang SageMaker Dockerfile**. The resulting container:

1. **Loads** the model that SageMaker downloads from your S3 location to the container’s local path
2. **Launches** `sglang` with your configured parallelism settings on port **8080** for SageMaker

> If Docker is not available in this environment, run the build steps on your workstation or in CodeBuild and skip to the ECR push and deploy cells.

In [None]:
%%bash
set -euxo pipefail
REPO_DIR=sglang
if [ ! -d "$REPO_DIR" ]; then
  git clone --depth 1 https://github.com/sgl-project/sglang.git "$REPO_DIR"
fi
cd "$REPO_DIR"
test -f docker/serve
docker build -f docker/Dockerfile.sagemaker -t sglang-sagemaker-base:latest docker


In [None]:
%%bash
set -euo pipefail

AWS_REGION="${AWS_REGION:-$(aws configure get region || echo us-west-2)}"
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
ECR_REPO_NAME="${ECR_REPO_NAME:-sglang-sagemaker-qwen80b}"
ECR_TAG="${ECR_TAG:-v1}"

IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO_NAME}:${ECR_TAG}"
echo "Pushing image: ${IMAGE_URI}"

# Ensure repo exists
aws ecr describe-repositories --repository-names "${ECR_REPO_NAME}" --region "${AWS_REGION}" >/dev/null 2>&1 || \
  aws ecr create-repository --repository-name "${ECR_REPO_NAME}" --region "${AWS_REGION}" >/dev/null

# Login, tag, push
aws ecr get-login-password --region "${AWS_REGION}" | \
  docker login --username AWS --password-stdin "${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

docker tag sglang-sagemaker-base:latest "${IMAGE_URI}"
docker push "${IMAGE_URI}"

echo "Pushed ${IMAGE_URI}"


## 6) Create SageMaker Model, Endpoint Config, and Endpoint

We point SageMaker at the uncompressed model prefix in S3 and pass environment variables to the container so the entrypoint can **launch SGLang** with the right parallelism automatically.

In [None]:
model = Model(
    # Use ModelDataSource with S3Prefix (no tarball) so SageMaker lays files under /opt/ml/model
    model_data={
        "S3DataSource": {
            "S3Uri": MODEL_S3_PREFIX,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    },
    role=ROLE,
    image_uri=ECR_URI,
    env={
        "TENSOR_PARALLEL_DEGREE": str(TP_SIZE),  # reuse the value from the configuration cell
        #"PORT": "8080",
    },
    predictor_cls=Predictor,
    sagemaker_session=sm_sess,
)

In [None]:
# --- deploy ---
# INSTANCE_TYPE and INSTANCE_COUNT come from the configuration cell (default: ml.p5.48xlarge, 8x H100)
ENDPOINT_NAME   = "qwen3-next-80b-sglang"

predictor = model.deploy(
    initial_instance_count=INSTANCE_COUNT,
    instance_type=INSTANCE_TYPE,
    endpoint_name=ENDPOINT_NAME,
)
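
If the deployment fails or takes longer than expected, the endpoint description usually says why. The sketch below reads the `EndpointStatus` and, when present, the `FailureReason` fields returned by `describe_endpoint`. `endpoint_summary` is a hypothetical helper that takes a boto3 SageMaker client as an argument.

```python
def endpoint_summary(sm_client, endpoint_name: str) -> dict:
    """Return the endpoint status and, if the endpoint failed, the reason."""
    desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return {
        "status": desc["EndpointStatus"],              # e.g. Creating, InService, Failed
        "failure_reason": desc.get("FailureReason"),   # only set when something went wrong
    }


# Usage in the notebook (requires AWS credentials):
# print(endpoint_summary(boto3.client("sagemaker", region_name=AWS_REGION), ENDPOINT_NAME))
```

For container-level errors, the endpoint's CloudWatch logs contain the SGLang startup output.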

## 7) Test the endpoint (OpenAI-compatible)

SGLang exposes an OpenAI-compatible API. SageMaker forwards HTTP to the container on port 8080 at the `/invocations` path.

The code below sends a basic chat request.

In [None]:
import json, boto3, os

runtime = boto3.client("sagemaker-runtime", region_name=AWS_REGION)
payload = {
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role":"system","content":"You are a helpful assistant."},
        {"role":"user","content":"Howdy, what is SGLang?"}
    ],
    # Standard OpenAI-style sampling params; add SGLang-specific params here if needed
    "temperature": 0.2,
    "max_tokens": 128
}
resp = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload).encode("utf-8"),
)
print("StatusCode:", resp["ResponseMetadata"]["HTTPStatusCode"])
print("Body:", resp["Body"].read().decode("utf-8")[:2000])
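
The raw body is a JSON document in the OpenAI chat-completions shape, so the assistant's reply sits under `choices[0].message.content`. The sketch below extracts it. `extract_reply` is a hypothetical helper, and `sample_body` is a made-up response used only to show the structure, not real model output.

```python
import json


def extract_reply(body: str) -> str:
    """Pull the assistant message text out of a chat-completions response body."""
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]


# Made-up response in the OpenAI chat-completions shape, for illustration only
sample_body = json.dumps({
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "SGLang is an LLM serving framework."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 9, "total_tokens": 29},
})

print(extract_reply(sample_body))
```

In the notebook, read the body once into a variable before passing it to such a helper, because `resp["Body"]` is a stream that can only be consumed a single time.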

## 8) Cleanup

In [None]:
# Stop and remove the endpoint + config + model
# (Be careful! This will delete the endpoint.)
import boto3

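sm = boto3.client("sagemaker", region_name=AWS_REGION)

# Look up the endpoint config and model names before deleting anything
endpoint_config_name = sm.describe_endpoint(EndpointName=ENDPOINT_NAME)["EndpointConfigName"]
endpoint_config = sm.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_model_name = endpoint_config["ProductionVariants"][0]["ModelName"]

print("Deleting endpoint...")
sm.delete_endpoint(EndpointName=ENDPOINT_NAME)
sm.get_waiter("endpoint_deleted").wait(EndpointName=ENDPOINT_NAME)
print("Deleted endpoint:", ENDPOINT_NAME)

print("Deleting endpoint config...")
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
print("Deleted endpoint config:", endpoint_config_name)

print("Deleting model...")
sm.delete_model(ModelName=sm_model_name)
print("Deleted model:", sm_model_name)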