# How to deploy Magistral-Small-2506 for inference on Amazon SageMaker AI
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to deploy the Magistral-Small-2506 model (Hugging Face model ID: [mistralai/Magistral-Small-2506](https://huggingface.co/mistralai/Magistral-Small-2506)) using Amazon SageMaker AI. The inference image will be the SageMaker-managed [LMI (Large Model Inference) v15](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) Docker image. LMI images feature a [DJL serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

Magistral Small builds on Mistral Small 3.1 (2503) and adds reasoning capabilities through SFT on Magistral Medium traces, followed by RL. The result is a small, efficient reasoning model with 24B parameters.

Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

Learn more about Magistral in Mistral AI's [blog post](https://mistral.ai/news/magistral/).

### Key Features
- **Reasoning**: Capable of long chains of reasoning traces before providing an answer.

- **Multilingual**: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.

- **Apache 2.0 License**: Open license allowing usage and modification for both commercial and non-commercial purposes.

- **Context Window**: A 128k context window, but performance might degrade past 40k. Hence we recommend setting the maximum model length to 40k.

### Execution environment setup
This notebook has been tested with the following:
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.247.1

Let's install or upgrade this dependency using the following command:

In [None]:
%pip install -Uq sagemaker

### Setup

In [None]:
import sagemaker
import boto3
import logging
import time
from sagemaker.session import Session
from sagemaker.s3 import S3Uploader

print(sagemaker.__version__)

In [None]:
try:
    boto_region = boto3.Session().region_name
    sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
    role = sagemaker.get_execution_role()
    
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

In [None]:
HF_MODEL_ID = "mistralai/Magistral-Small-2506"

base_name = HF_MODEL_ID.split('/')[-1].replace('.', '-').lower()
model_lineage = HF_MODEL_ID.split("/")[0]
base_name

## Download the model from Hugging Face and upload the model artifacts to Amazon S3
If you are deploying a model hosted on the Hugging Face Hub, you must specify the `option.model_id=<hf_hub_model_id>` configuration. When using a model directly from the Hub, we recommend you also specify the model revision (commit hash or branch) via `option.revision=<commit hash/branch>`.

Since model artifacts are downloaded at runtime from the Hub, pinning a specific revision ensures you are using a model compatible with the package versions in the runtime environment. Open-source model artifacts on the Hub are subject to change at any time, and such changes may cause issues when instantiating the model (for example, updated artifacts may require a newer version of a dependency than the one bundled in the container). If a model provides custom model (`modeling.py`) and/or custom tokenizer (`tokenizer.py`) files, you need to specify `option.trust_remote_code=true` to load and use the model.
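
As a rough illustration of the Hub-based setup described above, the sketch below renders a minimal `serving.properties` that pins a revision. The helper name `hub_serving_properties` is hypothetical, and `main` is only a placeholder branch; for production you would typically substitute a specific commit hash.

```python
def hub_serving_properties(model_id: str, revision: str, trust_remote_code: bool = False) -> str:
    """Render a minimal serving.properties for a model pulled from the Hub.

    Only the options discussed in the text are emitted; this is a sketch,
    not a complete configuration.
    """
    lines = [
        "engine=Python",
        f"option.model_id={model_id}",
        # Pin a commit hash or branch so upstream changes do not affect the endpoint
        f"option.revision={revision}",
        # Java-style properties expect lowercase true/false
        f"option.trust_remote_code={str(trust_remote_code).lower()}",
    ]
    return "\n".join(lines) + "\n"


# "main" is a placeholder; prefer an explicit commit hash
hub_config = hub_serving_properties("mistralai/Magistral-Small-2506", revision="main")
print(hub_config)
```

The rest of this notebook does not use this Hub-based configuration; it points `option.model_id` at a copy of the model in S3 instead.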

In this example, we will demonstrate how to download your own copy of the model from Hugging Face, upload it to an S3 location in your AWS account, and then deploy the model to an endpoint using the uploaded artifacts.

**Best Practices**:
>
> **Store Models in Your Own S3 Bucket**
> For production use cases, always download and store model files in your own S3 bucket to ensure validated artifacts. This provides verified provenance, improved access control, consistent availability, protection against upstream changes, and compliance with organizational security protocols.
>

First, download the model artifacts from Hugging Face.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

magistral_small_2506 = "mistralai/Magistral-Small-2506"

# Download the model into the current directory, wherever the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = magistral_small_2506
# Only download configuration, tokenizer, and weight files
allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]

# Use snapshot_download to fetch the model, since the repository stores its files with Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

### Upload model files to S3
SageMaker AI allows us to provide [uncompressed](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html) files, so we upload the folder that contains the model files directly to S3.
> **Note**: The default SageMaker bucket follows the naming pattern: `sagemaker-{region}-{account-id}`

In [None]:
s3_model_prefix = (
    "hf-large-models/magistral-small-2506"  # folder within bucket where model artifact will go
)

model_artifact = sagemaker_session.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
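
To make the naming pattern from the note above concrete, here is a small sketch that predicts where the upload should land. The helper names and the account ID `123456789012` are placeholders for illustration, and the result only holds if the session uses the default bucket without a custom bucket or prefix.

```python
def default_sagemaker_bucket(region: str, account_id: str) -> str:
    """Compose the default bucket name from the documented pattern."""
    return f"sagemaker-{region}-{account_id}"


def expected_artifact_uri(region: str, account_id: str, key_prefix: str) -> str:
    """Build the S3 URI that a directory upload under key_prefix is expected to produce."""
    bucket = default_sagemaker_bucket(region, account_id)
    return f"s3://{bucket}/{key_prefix.strip('/')}"


# Placeholder region and account ID, for illustration only
expected_uri = expected_artifact_uri(
    "us-east-1", "123456789012", "hf-large-models/magistral-small-2506"
)
print(expected_uri)
```

Comparing this value with the printed `model_artifact` is a quick way to confirm that the upload went to the bucket you expected.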

### Configure Model Serving Properties

Now we'll create a `serving.properties` file that configures how the model will be served. 

In [None]:
# Create the directory that will contain the configuration files
from pathlib import Path

model_dir = Path('config-magistral-small-2506')
model_dir.mkdir(exist_ok=True)

**Best Practices**:
>
>**Separate Configuration from Model Artifacts**
> The LMI container supports separating configuration files from model artifacts. While you can store `serving.properties` with your model files, placing configurations in a distinct S3 location allows for better management of all your configuration files.
>
> **Note**: When your model and configuration files are in different S3 locations, set `option.model_id=<s3_model_uri>` in your `serving.properties` file, where `s3_model_uri` is the S3 object prefix containing your model artifacts. SageMaker AI will automatically download the model files from the S3 URI given in `model_id`.

In [None]:
config = f"""engine=Python
option.async_mode=true
option.rolling_batch=disable
option.entryPoint=djl_python.lmi_vllm.vllm_async_service
option.model_loading_timeout=1500
fail_fast=true
option.max_rolling_batch_size=8
option.trust_remote_code=false
option.model_id={model_artifact}
option.tool_call_parser=mistral
option.enable_auto_tool_choice=true
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
"""
with open("config-magistral-small-2506/serving.properties", "w") as f:
    f.write(config)
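
The Key Features section recommends capping the maximum model length at 40k. One possible way to apply that is sketched below, assuming the LMI container forwards `option.max_model_len` to vLLM and taking "40k" to mean 40,960 tokens; both are assumptions rather than settings used in this notebook, and the helper name `with_max_model_len` is hypothetical.

```python
def with_max_model_len(properties: str, max_model_len: int = 40960) -> str:
    """Return serving.properties text with a max_model_len cap appended.

    Any existing option.max_model_len line is dropped first so the
    setting appears only once. Blank lines are removed as a side effect.
    """
    kept = [
        line
        for line in properties.splitlines()
        if line.strip() and not line.startswith("option.max_model_len=")
    ]
    # Assumption: 40960 tokens stands in for the "40k" recommendation
    kept.append(f"option.max_model_len={max_model_len}")
    return "\n".join(kept) + "\n"


capped = with_max_model_len("engine=Python\noption.tokenizer_mode=mistral\n")
print(capped)
```

If you adopt this, you would pass the `config` string through the helper before writing it to `serving.properties`.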

#### Optional configuration files

(Optional) You can also specify a `requirements.txt` to install additional libraries.
Here we pin vLLM to `vllm==0.8.5.post1` for Magistral support.

In [None]:
%%writefile config-magistral-small-2506/requirements.txt
vllm==0.8.5.post1

### Upload config files to S3
Here we will upload our config files to a different path to keep model files and config separate.

In [None]:
from sagemaker.s3 import S3Uploader

sagemaker_default_bucket = sagemaker_session.default_bucket()

config_files_uri = S3Uploader.upload(
    local_path="config-magistral-small-2506",
    desired_s3_uri=f"s3://{sagemaker_default_bucket}/lmi/{base_name}/config-files"
)

print(f"code_model_uri: {config_files_uri}")

## Configure Model Container and Instance

For deploying Magistral-Small-2506, we'll use:
- **LMI (Large Model Inference) Container**: A DJL Serving-based container optimized for large language model inference
- **[G6e Instance](https://aws.amazon.com/ec2/instance-types/g6e/)**: AWS's GPU instance type powered by NVIDIA L40S Tensor Core GPUs 

Key configurations:
- The container URI points to the DJL inference container in ECR (Elastic Container Registry)
- We use an `ml.g6e.12xlarge` instance
> **Note**: The region in the container URI should match your AWS region.

In [None]:
gpu_instance_type = "ml.g6e.12xlarge"

In [None]:
CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'
# Build the image URI from the session region and the container version defined above
region = sagemaker_session.boto_session.region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"
print(image_uri)

## Create SageMaker Model

Now we'll create a SageMaker Model object. This step defines the model configuration but doesn't deploy it yet. The Model object combines:

1. **Container Image** (`image_uri`): The LMI (DJL Inference) container optimized for LLMs
2. **Model Data** (`model_data`): Points to our configuration files in S3
3. **IAM Role** (`role`): Permissions for model execution

### Required Permissions
The IAM role needs:
- S3 read access for model artifacts
- CloudWatch permissions for logging
- ECR permissions to pull the container
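
As an illustration of the three permission groups listed above, the sketch below assembles an IAM policy document as a Python dictionary. The helper name and the bucket name are placeholders, the wildcard resources are a simplification you would normally scope down, and your organization may require additional or narrower statements.

```python
import json


def endpoint_role_policy(bucket: str) -> dict:
    """Build an illustrative IAM policy covering S3 reads, logging, and ECR pulls."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Sid": "WriteEndpointLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",  # simplification: scope to your log groups in practice
            },
            {
                "Sid": "PullContainerImage",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": "*",  # simplification: scope to the image repository in practice
            },
        ],
    }


# Placeholder bucket name, for illustration only
policy = endpoint_role_policy("sagemaker-us-east-1-123456789012")
print(json.dumps(policy, indent=2))
```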

In [None]:
# Specify the S3 URI for your uncompressed config files
model_data = {
    "S3DataSource": {
        "S3Uri": f"{config_files_uri}/",
        "S3DataType": "S3Prefix",
        "CompressionType": "None"
    }
}

> **Note**: Here the S3 URI points to the S3 location of the configuration files, not the model weights.

In [None]:
from sagemaker.utils import name_from_base
from sagemaker.model import Model

model_name = name_from_base(base_name, short=True)

# Create model
magistral_small_model = Model(
    name=model_name,
    image_uri=image_uri,
    model_data=model_data,  # Path to uncompressed config code files
    role=role,
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
    },
    sagemaker_session=sagemaker_session
)

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. 
> ⚠️ **Important**: 
> - Deployment can take up to 15 minutes
> - Monitor the CloudWatch logs for progress

In [None]:
%%time

from sagemaker.utils import name_from_base

endpoint_name = name_from_base(base_name, short=True)

magistral_small_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=gpu_instance_type
)
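
By default, `deploy()` blocks until the endpoint is ready. If you deploy without waiting, or want to check on the endpoint from another session, a polling loop along the following lines can help. This is a minimal sketch: the helper name `wait_for_endpoint`, the 30-second poll interval, and the 30-minute timeout are assumptions, and the client is passed in so the logic can be exercised without AWS access.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService or has failed.

    sm_client is expected to behave like boto3.client("sagemaker").
    """
    waited = 0
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```

boto3 also ships an `endpoint_in_service` waiter on the SageMaker client, which may be preferable when you do not need custom handling.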

### Create a predictor from an existing endpoint and run inference

In [None]:
from sagemaker.serializers import JSONSerializer, IdentitySerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor

endpoint_name = "magistral-small-2506-250711-1534"  # Replace with your endpoint name

magistral_small_predictor = Predictor(
    sagemaker_session=sagemaker_session,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

#### Use the predictor to run inference

For best results, it is highly recommended to include the default system prompt used during RL. You can edit and customize it if needed for your specific use case.

```
<s>[SYSTEM_PROMPT]system_prompt

A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts (i.e. your summary should be succinct but contain all the critical steps you needed to reach the conclusion). You should use Markdown to format your response. Write both your thoughts and summary in the same language as the task posed by the user. NEVER use \boxed{} in your response.

Your thinking process must follow the template below:
<think>
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate a correct answer.
</think>

Here, provide a concise summary that reflects your reasoning and presents a clear final answer to the user. Don't mention that this is a summary.

Problem:

[/SYSTEM_PROMPT][INST]user_message[/INST]<think>
reasoning_traces
</think>
assistant_response</s>[INST]user_message[/INST]
```

We can download the default system prompt from the Hugging Face Hub:

In [None]:
from huggingface_hub import hf_hub_download

# Retrieve the prompt template from huggingface_hub
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    return system_prompt

SYSTEM_PROMPT = load_system_prompt(HF_MODEL_ID, "SYSTEM_PROMPT.txt")

In [None]:
payload = {
    "messages" : [
        {
            "role": "system",
            "content": SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}]
        }
    ],
    "max_tokens":500,
    "temperature": 0.6,
    "top_p": 0.9,
}

response = magistral_small_predictor.predict(payload)
print(response['choices'][0]['message']['content'])

# Print usage statistics
print("=== Token Usage ===")
usage = response['usage']
print(f"Prompt Tokens: {usage['prompt_tokens']}")
print(f"Completion Tokens: {usage['completion_tokens']}")
print(f"Total Tokens: {usage['total_tokens']}")
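
Because the system prompt asks the model to wrap its reasoning in `<think>` tags, it can be handy to separate the reasoning from the final answer. The sketch below assumes the tags appear verbatim in the returned `content`, which may depend on the serving configuration. The helper name `split_reasoning` is hypothetical, and the sample string is made up for illustration.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> block
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> tuple:
    """Return (reasoning, answer) extracted from a model response.

    If no complete <think> block is found (for example, because the
    response was truncated by max_tokens), reasoning is empty and the
    whole text is returned as the answer.
    """
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


# Made-up sample response, for illustration only
sample = "<think>\nS-T-R-A-W-B-E-R-R-Y: R appears at positions 3, 8, 9.\n</think>\nThere are 3 R's in STRAWBERRY."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a real response, you would call `split_reasoning(response['choices'][0]['message']['content'])`.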

#### Invoke endpoint with boto3
Now you can invoke the endpoint with the boto3 `invoke_endpoint` or `invoke_endpoint_with_response_stream` runtime API calls. If you have an existing endpoint, you don't need to recreate the `predictor`; you can follow the example below to invoke the endpoint by name.

As noted on the [Magistral Small Hugging Face model page](https://huggingface.co/mistralai/Magistral-Small-2506), including the default system prompt used during RL is highly recommended for best results. You can edit and customize it for your specific use case, and you can switch to no thinking by omitting or customizing the default system prompt.

In [None]:
import boto3
import json
sagemaker_runtime = boto3.client('sagemaker-runtime')

prompt = {
    'messages': [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"},
    ],
    'temperature': 0.7,
    'top_p': 0.8,
    'top_k': 20,
    'max_tokens': 512,
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

#### Disable thinking by omitting the system prompt
You can always customize the system prompt for your use case.

In [None]:
prompt = {
    'messages': [
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"},
    ],
    'temperature': 0.7,
    'top_p': 0.8,
    'top_k': 20,
    'max_tokens': 512,
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)
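
The thinking and no-thinking requests above differ only in whether a system message is present. A small helper such as the following sketch can build either variant; the name `build_payload` is hypothetical, and the sampling values simply mirror the ones used in the cells above.

```python
from typing import Optional


def build_payload(user_text: str, system_prompt: Optional[str] = None, **sampling) -> dict:
    """Build a chat payload, adding a system message only when one is given."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    # Sampling parameters (temperature, top_p, max_tokens, ...) pass through unchanged
    return {"messages": messages, **sampling}


question = "How many R are in STRAWBERRY? Keep your answer and explanation short!"
thinking_payload = build_payload(question, system_prompt="<your system prompt>", temperature=0.7, max_tokens=512)
plain_payload = build_payload(question, temperature=0.7, max_tokens=512)
print(len(thinking_payload["messages"]), len(plain_payload["messages"]))
```

In the notebook you would pass `SYSTEM_PROMPT` as `system_prompt` and send the result with `json.dumps(...)` as the request body.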

### Streaming content

In [None]:
body = {
    'messages':[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"},
    ],
    'temperature':0.9,
    'max_tokens':800,
    'stream': True,
}

In [None]:
import json
import time

# Create SageMaker Runtime client
smr_client = boto3.client("sagemaker-runtime")
# Replace with your endpoint name
endpoint_name = "magistral-small-2506-250711-1534"

# Invoke the model
response_stream = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(body)
)

first_token_received = False
ttft = None
token_count = 0
start_time = time.time()

print("Response:", end=' ', flush=True)
full_response = ""

for event in response_stream['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        
        try:
            # Handle SSE format (data: prefix)
            if chunk.startswith('data: '):
                data = json.loads(chunk[6:])  # Skip "data: " prefix
            else:
                data = json.loads(chunk)
            
            # Extract token based on OpenAI format
            if 'choices' in data and len(data['choices']) > 0:
                if 'delta' in data['choices'][0] and 'content' in data['choices'][0]['delta']:
                    token_count += 1
                    token_text = data['choices'][0]['delta']['content']
                    # Record time to first token
                    if not first_token_received:
                        ttft = time.time() - start_time
                        first_token_received = True
                    full_response += token_text
                    print(token_text, end='', flush=True)
        
        except json.JSONDecodeError:
            continue
            
# Print metrics after completion
end_time = time.time()
total_latency = end_time - start_time

print("\n\nMetrics:")
print(f"Time to First Token (TTFT): {ttft:.2f} seconds" if ttft is not None else "TTFT: N/A")
print(f"Total Tokens Generated: {token_count}")
print(f"Total Latency: {total_latency:.2f} seconds")
if token_count > 0 and total_latency > 0:
    print(f"Tokens per second: {token_count/total_latency:.2f}")
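
The loop above parses each `PayloadPart` on its own, so an event that is split across two parts, or a part carrying several events, is silently skipped by the `JSONDecodeError` handler. A buffered variant is sketched below, assuming each event ends with a newline. The helper names are hypothetical, and the demo bytes are made up rather than captured from a real endpoint.

```python
import json
from typing import Iterable, Iterator, Optional


def _parse_sse_line(raw: bytes) -> Optional[dict]:
    """Decode one line and return the JSON object it carries, or None."""
    text = raw.decode("utf-8", errors="replace").strip()
    if text.startswith("data:"):
        text = text[len("data:"):].strip()
    if not text or text == "[DONE]":
        return None
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def iter_sse_json(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield JSON events from byte chunks, buffering incomplete lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; the remainder waits for the next chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            event = _parse_sse_line(line)
            if event is not None:
                yield event
    # Flush whatever is left once the stream ends
    event = _parse_sse_line(buffer)
    if event is not None:
        yield event


# Made-up chunks: the second event is split across two payload parts
demo_chunks = [
    b'data: {"choices": [{"delta": {"content": "Th"}}]}\n\ndata: {"choi',
    b'ces": [{"delta": {"content": "ree"}}]}\n\n',
    b"data: [DONE]\n\n",
]
demo_text = "".join(e["choices"][0]["delta"]["content"] for e in iter_sse_json(demo_chunks))
print(demo_text)
```

To use it with the stream above, you could feed it `event['PayloadPart']['Bytes']` for every event in `response_stream['Body']` that contains a `PayloadPart`.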

## Clean up

In [None]:
# Clean up
magistral_small_predictor.delete_model()
magistral_small_predictor.delete_endpoint(delete_endpoint_config=True)