# Gemma-3 example with HubCAP

HubCAP is a container automation platform we built to distribute LLM inference images for popular models available on the Hugging Face Hub.

These inference images can be powered by different open-source inference backends ([TGI](https://github.com/huggingface/text-generation-inference), [vLLM](https://github.com/vllm-project/vllm), [SGLANG](https://github.com/sgl-project/sglang), [llamacpp](https://github.com/ggml-org/llama.cpp) or [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)). The goal is to provide optimized, model-specific inference containers that run with the best possible setup on any hardware.

Images would be created and made available by Hugging Face on a [public ECR repo](https://gallery.ecr.aws/u9a4y4p1/huggingface/hubcap) and then distributed by AWS for use in SageMaker.

In this notebook we show how the first HubCAP image, built for [google/gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it), can be pulled from the public repo and deployed to SageMaker as a [Hugging Face DLC](https://huggingface.co/docs/sagemaker/en/tutorials/sagemaker-sdk/deploy-sagemaker-sdk).

### Download public HubCAP image and copy to private ECR

SageMaker doesn't allow deploying models from public ECR images directly unless specific VPC access settings are configured, which is often restrictive or unavailable.  
To ensure compatibility and control, we first pull the public HubCAP image locally, then re-tag and push it to our own private ECR repository.  
This gives us full ownership of the image and allows seamless deployment within private or VPC-secured SageMaker environments.

#### 0. Configure and Authenticate AWS CLI

Before using SageMaker or ECR from this notebook, ensure you're authenticated with AWS via the CLI.  
You can use either standard credentials or AWS IAM Identity Center (SSO), depending on your setup.

##### Option A – Standard credentials

In a terminal (not in the notebook), run:

```bash
aws configure
```
You'll be prompted to enter:

- AWS Access Key ID
- AWS Secret Access Key
- Default region name (e.g., us-east-1)
- Default output format (json is fine)

##### Option B – AWS SSO 

If your organization uses AWS SSO, configure it using:

```bash
aws configure sso
```

Follow the prompts to:
- Choose the SSO start URL
- Select the AWS account and role
- Set your default region and output format

Then authenticate using:


```bash
aws sso login
```

##### Verify your credentials
You can confirm you're authenticated by running the following:

In [None]:
!aws sts get-caller-identity

#### 1. Authenticate to public & private ECR

In [None]:
!aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

Login Succeeded


In [None]:
%%bash

export AWS_REGION=<your-region>
export ACCOUNT_ID=<your-account-id>

aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

Login Succeeded


#### 2. Create the Private Repo (if not already done)



In [None]:
!aws ecr create-repository --repository-name <your-private-repo-name> --region <your-region>

#### 3. Pull the Public Image

In [None]:
!docker pull public.ecr.aws/u9a4y4p1/huggingface/hubcap:google_gemma-3-27b-it-sglang-gpu

google_gemma-3-27b-it-sglang-gpu: Pulling from u9a4y4p1/huggingface/hubcap

#### 4. Tag It for Your Private ECR and Push it There

In [None]:
%%bash


export AWS_REGION=<your-region>
export ACCOUNT_ID=<your-account-id>
export ECR_REPO=<your-private-repo-name>
export IMAGE_TAG=google_gemma-3-27b-it-sglang-gpu


export PUBLIC_IMAGE=public.ecr.aws/u9a4y4p1/huggingface/hubcap:$IMAGE_TAG
export PRIVATE_IMAGE=$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPO:$IMAGE_TAG

# Tag for private ECR
docker tag $PUBLIC_IMAGE $PRIVATE_IMAGE

# Push to private ECR
docker push $PRIVATE_IMAGE

The image is now available in your private ECR and ready to be used by SageMaker.
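
The private image URI assembled in the shell step above is needed again on the Python side when deploying. As a minimal sketch, the same string can be built with a small helper; `private_image_uri` is a hypothetical name introduced here, and the account ID and repository name in the example call are placeholders, not real values.

```python
def private_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Build a private ECR image URI in the same form used for `docker push`."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Placeholder values for illustration only; replace them with your own.
uri = private_image_uri(
    account_id="123456789012",
    region="us-east-1",
    repo="my-hubcap-repo",
    tag="google_gemma-3-27b-it-sglang-gpu",
)
print(uri)
```

Building the URI from its parts keeps the account, region, repository and tag in one place, which avoids typos when the same values are reused across cells.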

### Set up SageMaker development environment

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [None]:
!pip install "sagemaker>=2.248.1" --upgrade --quiet

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

iam = boto3.client('iam')
role = iam.get_role(RoleName='<your-sagemaker-execution-role>')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

### Deploy model

In [None]:
hubcap_image_uri = "<your-account-id>.dkr.ecr.<your-region>.amazonaws.com/<your-private-repo-name>:google_gemma-3-27b-it-sglang-gpu"
instance_type = "ml.g5.12xlarge"  # instance type suited to gemma-3-27b-it
health_check_timeout = 900
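
To see why a multi-GPU instance is a reasonable fit, a back-of-the-envelope estimate of the weight memory helps. The sketch below assumes roughly 27 billion parameters (taken from the model name), 2 bytes per weight for `bfloat16`, and the 4 NVIDIA A10G GPUs with 24 GB each that `ml.g5.12xlarge` provides. It ignores the KV cache, activations and framework overhead, so it is only a lower bound on what the deployment needs.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~27e9 parameters, stored as bfloat16 (2 bytes each).
weights_gb = weight_memory_gb(27e9, 2)

# ml.g5.12xlarge: 4 GPUs with 24 GB of memory each.
total_gpu_gb = 4 * 24

# Memory left for KV cache, activations and runtime overhead.
headroom_gb = total_gpu_gb - weights_gb

print(f"weights: {weights_gb:.0f} GB, GPU memory: {total_gpu_gb} GB, headroom: {headroom_gb:.0f} GB")
```

The weights alone exceed the 24 GB of a single A10G, which is why the model has to be sharded across several GPUs on this instance family.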

Gemma-3 is a gated model: to get access to it, you need approval on the Hugging Face Hub (see [here](https://huggingface.co/docs/hub/en/models-gated)). Visit the [gemma-3 model card](https://huggingface.co/google/gemma-3-27b-it) and accept the license terms.

When deploying the container, you will need to provide a Hugging Face token linked to your account.

> This image uses **SGLANG** as its inference backend. It comes with preconfigured deployment parameters, which are suitable for most use cases. However, these settings can be easily overridden to fit your specific needs. For a full list of available options, refer to the [SGLANG server arguments documentation](https://docs.sglang.ai/backend/server_arguments.html).
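
Overrides are passed to the container as environment variables, and SageMaker expects both keys and values of the environment map to be strings. As a minimal sketch, a hypothetical helper `build_env` can merge a set of overrides into a base configuration and stringify every value. The `CONTEXT_LENGTH` value of 8192 below is purely illustrative and not a tested setting for this image.

```python
def build_env(base: dict, overrides: dict) -> dict:
    """Merge overrides into a base config and convert all values to strings."""
    merged = {**base, **overrides}  # keys in `overrides` win over `base`
    return {key: str(value) for key, value in merged.items()}


# Base values mirror the settings used in this notebook.
base_config = {
    "TENSOR_PARALLEL_DEGREE": "4",
    "DTYPE": "bfloat16",
    "CONTEXT_LENGTH": "4096",
}

# Illustrative override only; check memory limits before changing it for real.
env = build_env(base_config, {"CONTEXT_LENGTH": 8192})
print(env)
```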


In [None]:

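import logging
import time
import json

# Show INFO-level messages; the root logger only emits WARNING and above by default
logging.basicConfig(level=logging.INFO)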

from sagemaker.huggingface import HuggingFaceModel

config = {
    "TENSOR_PARALLEL_DEGREE": "4",  # Number of GPUs used for tensor parallelism (already set in the image but can be overridden)
    "DTYPE": "bfloat16",  # Data type (already set in the image but can be overridden)
    "CONTEXT_LENGTH": "4096",  # Max length of input text (already set in the image but can be overridden)
    "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>"
    }

# Create Hugging Face Model Class
endpoint_name = "hubcap-huggingface-gemma-3-27b-it-sglang"
endpoint_name = (
    endpoint_name + "-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
)
model = HuggingFaceModel(
    name=endpoint_name,
    env=config,
    role=role,
    image_uri=hubcap_image_uri,
    sagemaker_session=sess, 
)

deploy_parameters = {
    "instance_type": instance_type,
    "initial_instance_count": 1,
    "endpoint_name": endpoint_name,
    "container_startup_health_check_timeout": health_check_timeout,
}

predictor = model.deploy(**deploy_parameters)

logging.info("Endpoint deployment complete.")
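
The timestamp suffix makes each endpoint name unique, but it also adds 20 characters. SageMaker limits endpoint names to 63 characters made of alphanumerics and hyphens, so a longer prefix can fail at deployment time. The following sketch checks the name up front; `timestamped_name` is a hypothetical helper that is not part of the SageMaker SDK.

```python
import re
import time

MAX_NAME_LENGTH = 63
# Alphanumerics and hyphens, starting and ending with an alphanumeric character.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")


def timestamped_name(prefix: str) -> str:
    """Append a UTC timestamp to `prefix` and validate the resulting name."""
    name = prefix + "-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name is {len(name)} characters, limit is {MAX_NAME_LENGTH}")
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"name contains unsupported characters: {name}")
    return name


name = timestamped_name("hubcap-huggingface-gemma-3-27b-it-sglang")
print(name, len(name))
```

Failing early with a clear message is cheaper than waiting for the deployment call to reject the name.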

### Run Inference

In [None]:
# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning?"},
]

# Generation arguments

parameters = {
    "model": "google/gemma-3-27b-it",
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 200,
}
output = predictor.predict({"messages": messages, **parameters})
logging.info("Output: " + json.dumps(output))
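
The request above uses a chat-style payload (`messages`, `model`, `max_tokens`). Assuming the endpoint answers in the OpenAI-compatible chat completion format, the generated text sits under `choices[0].message.content`. The sketch below extracts it; the response layout is an assumption and the `sample` dictionary is hand-written for illustration, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the assumed response format (not real model output).
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Deep learning is ..."}}
    ]
}
print(extract_reply(sample))
```

With a live endpoint, the same helper could be applied to the `output` returned by `predictor.predict`, provided the response follows this layout.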

### Clean up

In [None]:
predictor.delete_model()
predictor.delete_endpoint()