In [None]:
%pip install sagemaker boto3 litellm aiohttp -qU

<div class="alert alert-block alert-info">
<center>⚠️ <b>Important:</b> Please restart the kernel after installing the dependencies. ⚠️</center>
</div>

## Deploy the model from SageMaker JumpStart on a SageMaker Inference endpoint

> Note: skip the cell below if you have already deployed your model.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.enums import EndpointType
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements


resources = ResourceRequirements(
    requests = {
        "num_accelerators": 4,  # Number of accelerators required
        "memory": 96*1024,  # Minimum memory in MB (required)
        "copies": 1,
    }
)

model = JumpStartModel(
    model_id="huggingface-llm-mistral-small-24B-Instruct-2501", model_version="2.0.1",
    instance_type="ml.g5.12xlarge"
)
predictor = model.deploy(
    accept_eula=True,
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    serializer=JSONSerializer(), deserializer=JSONDeserializer(),
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    resources=resources,
    managed_instance_scaling={
        "MinInstanceCount": 0,
        "MaxInstanceCount": 1
    }
)

**NOTE:** Deployment takes roughly 5 to 7 minutes.
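If you reconnect from another session while the endpoint is still being created, a small polling helper can tell you when it is ready. The sketch below is one possible approach, not part of the original notebook: `wait_until_in_service` and its parameters are hypothetical names, and it assumes `sm_client` is a `boto3.client("sagemaker")` object whose `describe_endpoint` response carries an `EndpointStatus` field.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=900):
    """Poll describe_endpoint until the endpoint reports InService.

    Hypothetical helper: sm_client is assumed to behave like
    boto3.client("sagemaker").
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} is still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

A typical call would be `wait_until_in_service(boto3.client("sagemaker"), "YOUR-ENDPOINT-NAME-HERE")`. Note that `model.deploy(...)` already blocks until deployment finishes by default, so this is mainly useful outside the deployment cell.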

## Test it

### Using the Predictor object from the SageMaker Python SDK

In [None]:
try:
    predictor  # reuse the predictor created by the deployment cell, if any
except NameError:
    import boto3
    from sagemaker.session import Session
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    endpoint_name = "YOUR-ENDPOINT-NAME-HERE"
    component_name = "YOUR-INFERENCE-COMPONENT-NAME-HERE"

    boto_session = boto3.session.Session(region_name=boto3.Session().region_name)
    session = Session(boto_session=boto_session)

    predictor = Predictor(
        sagemaker_session=session,
        endpoint_name=endpoint_name, component_name=component_name,
        serializer=JSONSerializer(), deserializer=JSONDeserializer()
    )

# The Boto3 and LiteLLM cells below need these names in both cases.
endpoint_name = predictor.endpoint_name
component_name = getattr(predictor, "component_name", None)





In [13]:
%%time
prompt = "What is the town of Bari, Italy, known for?"
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "max_tokens": 4*1024,
    "temperature": 0.1,
    "top_p": 0.9,
}

response = predictor.predict(payload)
print(response['choices'][0]['message']['content'])

Bari, Italy, is known for several things:

1. **Historical Sites**: Bari is home to numerous historical sites, including the Basilica of Saint Nicholas, a significant pilgrimage site for both Roman Catholics and Orthodox Christians. The basilica houses the relics of Saint Nicholas, who is believed to have been buried there in the 11th century.

2. **Cultural Heritage**: The city has a rich cultural heritage, with influences from various periods, including Roman, Byzantine, and Norman eras. The Old Town (Bari Vecchia) is a well-preserved medieval quarter with narrow streets and historic buildings.

3. **Food**: Bari is famous for its cuisine, particularly seafood dishes. Local specialties include orecchiette pasta, a type of ear-shaped pasta often served with turnip greens or tomato sauce, and panzerotti, a type of fried calzone.

4. **University**: The University of Bari is one of the largest universities in Southern Italy and is known for its contributions to various fields of study.
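The same payload shape is reused by the later cells, so it can be convenient to build it with a small function. This is a minimal sketch rather than part of the original notebook: `build_chat_payload` is a hypothetical helper, and its defaults simply mirror the values used in the cell above, except for a smaller `max_tokens`.

```python
def build_chat_payload(prompt, system_prompt=None, max_tokens=1024,
                       temperature=0.1, top_p=0.9):
    """Build a Messages-API style request body (hypothetical helper)."""
    messages = []
    if system_prompt:
        # The system message, when present, goes first.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
```

With this in place, the call above could be written as `predictor.predict(build_chat_payload(prompt, max_tokens=4*1024))`.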


### Using Boto3

In [None]:
%%time
import boto3
import json

payload = {
    "inputs": "What is the town of Bari, Italy, known for? Provide a short answer.",
    "parameters": {
        "max_new_tokens": 4*1024,
        "top_p": 0.9,
        "temperature": 0.2,
    }
}

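runtime = boto3.client('sagemaker-runtime', region_name=boto3.Session().region_name)

# boto3 rejects None for string parameters, so only pass the
# inference component name when one is set.
invoke_kwargs = {
    "EndpointName": endpoint_name,
    "ContentType": "application/json",
    "Body": json.dumps(payload),
}
if component_name:
    invoke_kwargs["InferenceComponentName"] = component_name
response = runtime.invoke_endpoint(**invoke_kwargs)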

result = json.loads(response['Body'].read().decode())
print(result['generated_text'])

 The town of Bari, Italy, is known for its historic center, the Basilica di San Nicola, and its role as a major port city.
CPU times: user 14.9 ms, sys: 5.91 ms, total: 20.8 ms
Wall time: 1.71 s


### Using Boto3 and the Messages API (for compatible models only)

In [15]:
%%time
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful and honest assistant."},
        {"role": "user", "content": "What is the town of Bari, Italy, known for? Provide a short answer."}
    ],
    "max_tokens": 4*1024,
    # Sampling settings are top-level fields in the Messages API,
    # matching the Predictor example above.
    "top_p": 0.9,
    "temperature": 0.6,
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=component_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode())
print(result['choices'][0]['message'])

{'role': 'assistant', 'content': 'Bari, Italy, is known for several things:\n\n1. **Saint Nicholas**: Bari is famous for being the home of the relics of Saint Nicholas, who is beloved by both Eastern and Western Christianity.\n\n2. **Port City**: Bari is a major seaport city in Italy, serving as a gateway to the Eastern Mediterranean.\n\n3. **Festivals**: The city hosts the annual Fiera del Levante, one of the largest trade fairs in Italy.\n\n4. **Old Town**: The historic old town, or "Bari Vecchia," features narrow cobblestone streets, trulli houses, and vibrant markets.\n\nIt is also a vibrant culture, historical site, and economic center of the region.'}
CPU times: user 3.11 ms, sys: 2.78 ms, total: 5.89 ms
Wall time: 4.66 s
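The two Boto3 cells return differently shaped JSON: the text-generation request yields a `generated_text` field, while the Messages API yields a `choices` list. A small accessor can hide that difference. The sketch below is an illustration only: `extract_text` is a hypothetical name, and the list-of-dicts branch is an assumption about what some serving containers return, not something shown in this notebook.

```python
def extract_text(result):
    """Return the generated text from either response shape (hypothetical helper)."""
    # Assumption: some containers wrap the result in a one-element list.
    if isinstance(result, list) and result:
        result = result[0]
    if not isinstance(result, dict):
        raise ValueError(f"Unexpected response type: {type(result).__name__}")
    if "choices" in result:
        # Messages API shape, as in the cell above.
        return result["choices"][0]["message"]["content"]
    if "generated_text" in result:
        # Plain text-generation shape, as in the first Boto3 cell.
        return result["generated_text"]
    raise ValueError(f"Unrecognized response keys: {sorted(result)}")
```

Either Boto3 cell could then end with `print(extract_text(result))`.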


### Using LiteLLM

In [None]:
from litellm import completion


response = completion(
    model=f"sagemaker/{endpoint_name}", 
    model_id=component_name,
    messages=[
        {"role": "system", "content": "You are a helpful and honest assistant."},
        {"role": "user", "content": "What is the town of Bari, Italy, known for? Provide a short answer."}
    ],
    temperature=0.2,
    max_tokens=1024
)
response.choices[0].message.content

' Bari is known for its historic center, the Basilica di San Nicola, and its role as a major port city.'


<div class="alert alert-block alert-info">
⚠️ <b>Important:</b> As of LiteLLM v1.67.2, the `sagemaker_chat` provider does not correctly pass the inference component name, which causes `HTTPStatusError: Client error '400 Bad Request'`. Use the `sagemaker` provider instead.
</div>
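To keep the provider prefix and the inference component name in one place, the arguments for `completion` can be assembled by a helper. This is a sketch under the assumptions of the cell above, namely that the `sagemaker/` prefix selects the provider and that `model_id` carries the inference component name. The function name `litellm_sagemaker_kwargs` is hypothetical.

```python
def litellm_sagemaker_kwargs(endpoint_name, component_name, messages, **params):
    """Assemble keyword arguments for litellm.completion (hypothetical helper)."""
    kwargs = {
        # Use the `sagemaker` provider, per the note above.
        "model": f"sagemaker/{endpoint_name}",
        "messages": messages,
        **params,
    }
    if component_name:
        # Only needed for inference-component-based endpoints.
        kwargs["model_id"] = component_name
    return kwargs
```

Usage would look like `completion(**litellm_sagemaker_kwargs(endpoint_name, component_name, messages, temperature=0.2, max_tokens=1024))`.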