# How to deploy VaultGemma 1B for inference using Amazon SageMaker AI
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to deploy the VaultGemma 1B model (Hugging Face model ID: [google/vaultgemma-1b](https://huggingface.co/google/vaultgemma-1b)) using Amazon SageMaker AI.

VaultGemma is a variant of the Gemma family of lightweight, state-of-the-art open models from Google. It is pre-trained from the ground up using Differential Privacy (DP). This provides strong, mathematically-backed privacy guarantees for its training data, limiting the extent to which the model's outputs can reveal information about any single training example.
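
To make the idea concrete, one common way to train with differential privacy is DP-SGD: clip each example's gradient to a fixed norm, sum the clipped gradients, and add Gaussian noise scaled to that norm before averaging. The pure-Python sketch below illustrates only that mechanism. It is not VaultGemma's training code, and the function name and parameters are hypothetical.

```python
import math
import random


def dp_sgd_average(per_example_grads, clip_norm, noise_multiplier, rng=None):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # The noise standard deviation is proportional to the clipping threshold.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# Two toy gradients with L2 norms 5.0 and 0.5; only the first one gets clipped.
grads = [[3.0, 4.0], [0.3, 0.4]]
print(dp_sgd_average(grads, clip_norm=1.0, noise_multiplier=1.0))
```

Clipping bounds how much any single example can move the update, and the noise masks what remains of that contribution, which is the intuition behind the per-example guarantee described above.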

VaultGemma uses an architecture similar to Gemma 2. It is a pretrained model that can be instruction tuned for a variety of language understanding and generation tasks. Its relatively small size (< 1B parameters) makes it possible to deploy in environments with limited resources, democratizing access to state-of-the-art AI models that are built with privacy at their core.

### License agreement
* This model is gated on Hugging Face. Refer to the original [model card](https://huggingface.co/google/vaultgemma-1b) for the license terms.
* This notebook is a sample notebook and is not intended for production use.

### Execution environment setup
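This notebook requires the following third-party Python dependencies:
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html), version 2.242.0 or later (this notebook was run with 2.253.1)
* [`transformers`](https://huggingface.co/docs/transformers), pinned to 4.57.0
* [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub), used to authenticate and to download the gated model files

Install or upgrade them with the following command:

In [None]:
%pip install -Uq "sagemaker>=2.242.0,<3" transformers==4.57.0 huggingface_hub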

### Setup

In [2]:
import datetime
import json
import logging
import os
import shutil
import tarfile
import time

import boto3
import sagemaker
from huggingface_hub import snapshot_download
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.s3 import S3Uploader
from sagemaker.session import Session

print(sagemaker.__version__)

2.253.1


In [6]:
session = sagemaker.Session()
role = sagemaker.get_execution_role()

instance_type = "ml.m5.xlarge"
instance_count = 1

HUGGING_FACE_HUB_TOKEN = "<REPLACE_ME>"
model_id = "google/vaultgemma-1b"
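model_id_filesafe = model_id.replace("/", "_").replace(".", "_")
# Endpoint names may only contain alphanumeric characters and hyphens.
# Building the timestamp first avoids nested double quotes inside the f-string,
# which only parse on Python 3.12 and later.
timestamp = str(datetime.datetime.now().timestamp()).replace(".", "-")
endpoint_name = f"{model_id_filesafe.replace('_', '-')}-endpoint-{timestamp}"
print(endpoint_name)

base_name = model_id.split("/")[-1].replace(".", "-").lower()
model_lineage = model_id.split("/")[0]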

bucket_name = session.default_bucket()
default_prefix = session.default_bucket_prefix or f"models/{model_id_filesafe}"
print(f"Saving model artifacts to {bucket_name}/{default_prefix}")

os.makedirs("code", exist_ok=True)

google-vaultgemma-1b-endpoint-1761683900-420591
Saving model artifacts to sagemaker-us-east-1-329542461890/models/google_vaultgemma-1b
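
SageMaker endpoint names may contain only alphanumeric characters and hyphens, and are limited to 63 characters. The sketch below shows one way to derive a compliant name from a Hub model ID. The helper `make_endpoint_name` is hypothetical and is not used elsewhere in this notebook.

```python
import re
from datetime import datetime, timezone


def make_endpoint_name(model_id, suffix=None, max_len=63):
    """Build a SageMaker-compatible endpoint name from a Hub model ID."""
    suffix = suffix or datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    # Replace every run of characters outside [a-zA-Z0-9] with a single hyphen.
    base = re.sub(r"[^a-zA-Z0-9]+", "-", model_id).strip("-")
    tail = f"-endpoint-{suffix}"
    # Trim the model part rather than the suffix so that names stay unique.
    return base[: max_len - len(tail)].rstrip("-") + tail


print(make_endpoint_name("google/vaultgemma-1b", suffix="20251028-204200"))
# prints: google-vaultgemma-1b-endpoint-20251028-204200
```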


### Local Model Test

In [10]:
from huggingface_hub import login
login(HUGGING_FACE_HUB_TOKEN)

In [11]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository is gated, so pass the token to both loaders.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=HUGGING_FACE_HUB_TOKEN)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto", token=HUGGING_FACE_HUB_TOKEN)

prompt_text = "Explain in simple terms how differential privacy works."
# The tokenizer returns input_ids and attention_mask; move both to the model's device.
inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

generated_outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)

# The decoded text includes the prompt followed by the generated continuation.
response = tokenizer.decode(generated_outputs[0], skip_special_tokens=True)
print(response)

Explain in simple terms how differential privacy works.

Explain in simple terms how differential privacy works.

The following is a list of some of the functions of the nervous system:

$\begin{array}{lll} \text { (a) } & \text { muscle } & \text { (b) } & \text { nerve } \\ \text { (c) } & \text { gland } & \text { (d) } & \text { muscle } \\ \text { (e) } & \text { gland } & \text { (f) } & \text { gland } \\ \text { (g) } & \text { gland } & \text { (h) } & \text { gland } \\ \text { (


## Create SageMaker Model

#### HUGGING_FACE_HUB_TOKEN 
VaultGemma-1B is a gated model. Therefore, if you deploy model files hosted on the Hub, you need to provide your Hugging Face token as an environment variable. This enables SageMaker AI to download the files at runtime.
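
Hard-coding the token in a notebook cell makes it easy to leak through version control or shared copies. As a minimal sketch, the token can instead be read from the environment of the notebook kernel. The helper `resolve_hf_token` is hypothetical, and it assumes the token was exported as `HF_TOKEN` before the kernel started.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN", placeholder="<REPLACE_ME>"):
    """Return the Hugging Face token from the environment, or fail loudly."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == placeholder:
        raise ValueError(
            f"Set the {env_var} environment variable to a valid Hugging Face token."
        )
    return token


# Usage in this notebook (left commented so the cell runs without a token set):
# HUGGING_FACE_HUB_TOKEN = resolve_hf_token()
```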

In [None]:
env = {
    "HF_MODEL_ID": model_id,
    "HF_TOKEN": HUGGING_FACE_HUB_TOKEN,
    # VaultGemma is a text-only causal language model.
    "HF_TASK": "text-generation",
    "SM_NUM_GPUS": json.dumps(1),
    # The OPTION_* and SERVING_* settings below are DJL Serving (LMI) options.
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_MODEL_LOADING_TIMEOUT": "3600",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_TENSOR_PARALLEL_DEGREE": "1",
    "OPTION_MAX_MODEL_LEN": "5000",
    "OPTION_ASYNC_MODE": "true",
    "SERVING_FAIL_FAST": "true",
}


In [None]:
%%writefile code/requirements.txt
transformers==4.57.0
huggingface_hub

In [None]:
%%writefile code/inference.py
import logging

from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def model_fn(model_dir):
    """Load the tokenizer and model from the unpacked model.tar.gz."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", dtype="auto")
    return {"tokenizer": tokenizer, "model": model}


def predict_fn(data, model_obj):
    """Generate a completion for the prompt in the request's "inputs" field."""
    tokenizer = model_obj["tokenizer"]
    model = model_obj["model"]

    # Read the prompt from the request payload instead of hard-coding it.
    prompt_text = data["inputs"]
    # Default sampling settings; callers may override them through "parameters".
    generation_kwargs = {
        "max_new_tokens": 150,
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        **(data.get("parameters") or {}),
    }

    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    generated_outputs = model.generate(**inputs, **generation_kwargs)

    response = tokenizer.decode(generated_outputs[0], skip_special_tokens=True)
    return response

In [None]:
def filter_function(tarinfo):
    """Tarfile filter that excludes .cache entries and .gitattributes from the archive."""
    if ".cache" in tarinfo.name or ".gitattributes" in tarinfo.name:
        return None
    return tarinfo
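
To see what a filter like this does, the sketch below applies the same rule to a throwaway directory and lists what ends up in the archive. The names `exclude_cache_files` and `archive_members` are hypothetical and exist only for this illustration.

```python
import os
import tarfile
import tempfile


def exclude_cache_files(tarinfo):
    """Same rule as filter_function: drop .cache and .gitattributes entries."""
    if ".cache" in tarinfo.name or ".gitattributes" in tarinfo.name:
        return None
    return tarinfo


def archive_members(src_dir, tar_path):
    """Archive src_dir at the tarball root and return the stored member names."""
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".", filter=exclude_cache_files)
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "model")
    os.makedirs(os.path.join(src, ".cache"))
    # One file that should be kept and two entries that should be filtered out.
    for rel in ("config.json", ".gitattributes", os.path.join(".cache", "lock")):
        with open(os.path.join(src, rel), "w") as f:
            f.write("{}")
    members = archive_members(src, os.path.join(tmp, "model.tar.gz"))
    print(members)
```

Because the filter returns `None` for the `.cache` directory itself, `tarfile` does not descend into it, so nothing underneath it is archived either.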

In [None]:
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")
key = f"{default_prefix}/model.tar.gz"
force_rebuild_tarball = True


def s3_object_exists(bucket, object_key):
    """Return True if the object exists; head_object raises ClientError otherwise."""
    try:
        s3_client.head_object(Bucket=bucket, Key=object_key)
        return True
    except ClientError:
        return False


if force_rebuild_tarball or not s3_object_exists(bucket_name, key):
    try:
        model_path = snapshot_download(repo_id=model_id, local_dir="./model", token=HUGGING_FACE_HUB_TOKEN)
        print(f"Successfully downloaded to {model_path}")
    except Exception as e:
        print(f"Failed to download the model: {e}")
        raise  # model_path is undefined past this point, so stop here

    print("Building gzipped tarball...")
    with tarfile.open("./model.tar.gz", "w:gz") as tar:
        # Model files go at the archive root; the inference script goes under code/.
        tar.add(model_path, arcname=".", filter=filter_function)
        tar.add("./code", arcname="code", filter=filter_function)
    print("Successfully built the tarball.")

    print(f"Uploading tarball to s3://{bucket_name}/{key}...")
    s3_client.upload_file("./model.tar.gz", bucket_name, key)
    # Uncomment to remove the local copies after the upload:
    # shutil.rmtree("./model")
    # os.remove("./model.tar.gz")
    print("Successfully uploaded.")

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (M5 instance)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.m5.xlarge`, a general-purpose CPU instance (4 vCPUs, 16 GiB of memory)

> ⚠️ **Important**: 
> - Deployment can take up to 15 minutes
> - Monitor the CloudWatch logs for progress
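
Besides the CloudWatch logs, the endpoint status can be polled programmatically, for example from a second session while the deployment is in progress. The sketch below assumes a boto3 SageMaker client and uses its `describe_endpoint` call. The helper `wait_for_endpoint` and its defaults are hypothetical.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        # Any other status (for example Creating or Updating) means keep waiting.
        time.sleep(poll_seconds)


# Usage (requires AWS credentials, so it is left commented here):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), "<name of the endpoint being created>")
```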

In [None]:
# Create the Hugging Face Model class from the tarball uploaded to S3
huggingface_model = HuggingFaceModel(
    model_data=f"s3://{bucket_name}/{default_prefix}/model.tar.gz",
    transformers_version="4.49.0",
    pytorch_version="2.6.0",
    py_version="py312",
    env=env,  # already carries HF_MODEL_ID, HF_TASK, and HF_TOKEN
    role=role,
    entry_point="inference.py",
    enable_network_isolation=False,
)

# Deploy the model to a SageMaker real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count,  # number of instances
    instance_type=instance_type,  # EC2 instance type
    endpoint_name=endpoint_name,
)
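
Once the endpoint is in service, it can be called either through `predictor.predict({"inputs": ...})` or directly through the SageMaker runtime API. The sketch below shows the second route with boto3's `invoke_endpoint`. The helper `invoke_text_generation` is hypothetical, and whether the optional `parameters` field is honored depends on the `predict_fn` in `inference.py`.

```python
import json


def invoke_text_generation(runtime_client, endpoint_name, prompt, parameters=None):
    """Send a JSON request to a SageMaker endpoint and return the decoded response."""
    payload = {"inputs": prompt}
    if parameters:
        payload["parameters"] = parameters
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding the JSON.
    return json.loads(response["Body"].read())


# Usage (requires a live endpoint, so it is left commented here):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# print(invoke_text_generation(runtime, predictor.endpoint_name,
#                              "Explain in simple terms how differential privacy works."))
```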

In [None]:
# Using DJL Serving
# UNDER CONSTRUCTION
# %%time

# image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128-v1.2"
# model = HuggingFaceModel(
#     model_data=f"s3://{bucket_name}/{default_prefix}/model.tar.gz",
#     image_uri=image_uri,
#     env=env,
#     role=role,
#     entry_point="inference.py",
#     enable_network_isolation=False
# )

# predictor = model.deploy(
#     initial_instance_count=instance_count,
#     instance_type=instance_type,
#     endpoint_name=endpoint_name
# )

# predictor.predict({
# 	"inputs": "Can you please let us know more details about your training using differential privacy?",
# })

## Clean up

In [None]:
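# Delete the endpoint together with its endpoint configuration, then the model.
predictor.delete_endpoint(delete_endpoint_config=True)
huggingface_model.delete_model()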