# How to deploy VaultGemma 1B for inference using Amazon SageMaker AI
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to deploy the VaultGemma 1B model (Hugging Face model ID: [google/vaultgemma-1b](https://huggingface.co/google/vaultgemma-1b)) using Amazon SageMaker AI.

VaultGemma is a variant of the Gemma family of lightweight, state-of-the-art open models from Google. It is pre-trained from the ground up using Differential Privacy (DP). This provides strong, mathematically-backed privacy guarantees for its training data, limiting the extent to which the model's outputs can reveal information about any single training example.
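
For reference, the guarantee mentioned above is usually stated with the standard textbook definition of differential privacy. This is the general definition, not a statement of VaultGemma's specific training parameters: a randomized algorithm $M$ is $(\varepsilon, \delta)$-differentially private if, for all datasets $D$ and $D'$ that differ in a single example and for every set of outputs $S$,

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

Smaller values of $\varepsilon$ and $\delta$ mean a stronger guarantee. In this setting $M$ would be the training procedure and its output the trained weights; the model card is the place to check the privacy unit and the values of $\varepsilon$ and $\delta$ that apply to this model.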

VaultGemma uses an architecture similar to Gemma 2. It is a pretrained model that can be instruction tuned for a variety of language understanding and generation tasks. Its relatively small size (< 1B parameters) makes it possible to deploy in environments with limited resources, democratizing access to state-of-the-art AI models that are built with privacy at their core.

### License agreement
* This model is gated on Hugging Face. Please refer to the original [model card](https://huggingface.co/google/vaultgemma-1b) for the license.
* This notebook is a sample notebook and is not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.242.0

Let's install or upgrade these dependencies using the following command:

In [4]:
%pip install -Uq sagemaker

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.4.0 requires nvidia-ml-py3<8.0,>=7.352.0, which is not installed.
aiobotocore 2.21.1 requires botocore<1.37.2,>=1.37.0, but you have botocore 1.40.30 which is incompatible.
autogluon-multimodal 1.4.0 requires transformers[sentencepiece]<4.50,>=4.38.0, but you have transformers 4.57.0.dev0 which is incompatible.
autogluon-timeseries 1.4.0 requires transformers[sentencepiece]<4.50,>=4.38.0, but you have transformers 4.57.0.dev0 which is incompatible.
sagemaker-studio-analytics-extension 0.2.0 requires sparkmagic==0.22.0, but you have sparkmagic 0.21.0 which is incompatible.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.3.1 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


### Setup

In [5]:
import sagemaker
import boto3
import logging
import time
from sagemaker.session import Session
from sagemaker.s3 import S3Uploader

print(sagemaker.__version__)

2.245.0
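
The printed version can be checked against the 2.242.0 minimum stated in the setup section. The following is a minimal sketch of such a check; the helper names `version_tuple` and `meets_minimum` are hypothetical and not part of the SageMaker SDK, and the parsing only looks at the leading `X.Y.Z` part of the version string.

```python
import re

MIN_SAGEMAKER_VERSION = "2.242.0"  # minimum version stated in this notebook


def version_tuple(version: str) -> tuple:
    """Turn the leading 'X.Y.Z' part of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(installed: str, minimum: str = MIN_SAGEMAKER_VERSION) -> bool:
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


# In the notebook you would pass sagemaker.__version__ instead of a literal.
print(meets_minimum("2.245.0"))
```

If the check fails after running `%pip install`, restarting the kernel (as the pip output suggests) is usually the first thing to try.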


In [6]:
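# Create the session first so it is defined whichever branch below runs
sagemaker_session = sagemaker.Session()

try:
    # Works inside SageMaker Studio and notebook instances
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look the role up by name (adjust RoleName to your account)
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']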

In [15]:
HF_MODEL_ID = "google/vaultgemma-1b"
HUGGING_FACE_HUB_TOKEN = "<REPLACE WITH TOKEN>"

base_name = HF_MODEL_ID.split('/')[-1].replace('.', '-').lower()
model_lineage = HF_MODEL_ID.split('/')[0]
base_name

'vaultgemma-1b'
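
One possible use for `base_name` is as the prefix of a unique endpoint name, which could then be passed to `deploy()` through its `endpoint_name` argument. The sketch below is an assumption about how you might do that, not something this notebook does: `make_endpoint_name` is a hypothetical helper, and the 63-character limit reflects the SageMaker endpoint naming constraint.

```python
import re
import time


def make_endpoint_name(model_id: str, max_length: int = 63) -> str:
    """Build a timestamped endpoint name from a Hugging Face model ID."""
    # Same derivation as base_name in the cell above
    base = model_id.split("/")[-1].replace(".", "-").lower()
    # Replace anything that is not a lowercase letter, digit, or hyphen
    base = re.sub(r"[^a-z0-9-]", "-", base)
    suffix = time.strftime("%Y%m%d-%H%M%S", time.gmtime())
    # Trim the base so that base + "-" + suffix fits within max_length
    base = base[: max_length - len(suffix) - 1].strip("-")
    return f"{base}-{suffix}"


endpoint_name = make_endpoint_name("google/vaultgemma-1b")
print(endpoint_name)
```

A timestamp suffix keeps repeated runs of the notebook from colliding with an endpoint that already exists under the same name.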

In [11]:
instance_type = "ml.m5.xlarge"
instance_count = 1

## Create SageMaker Model

#### HUGGING_FACE_HUB_TOKEN
VaultGemma-1B is a gated model. If you deploy model files hosted on the Hub, you therefore need to provide your Hugging Face token as an environment variable of the model container. This enables SageMaker AI to download the files at runtime.
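
Rather than pasting the token into the notebook, you may prefer to read it from the local environment. The following is a minimal sketch under that assumption; `resolve_hf_token` is a hypothetical helper, and the choice of `HF_TOKEN` as the local variable name is a convention assumed here, not a requirement.

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face token from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    # Reject missing values and unreplaced placeholders such as "<REPLACE WITH TOKEN>"
    if not token or token.startswith("<"):
        raise RuntimeError(f"Set the {var_name} environment variable to a valid token")
    return token


# Demonstration with an explicit mapping and a made-up value (not a real token).
demo_token = resolve_hf_token({"HF_TOKEN": "hf_example_not_a_real_token"})

# In the notebook you could then write:
# HUGGING_FACE_HUB_TOKEN = resolve_hf_token()
```

This keeps the secret out of the saved notebook file, which matters if the notebook is shared or committed to version control.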

In [19]:
from sagemaker.huggingface import HuggingFaceModel

# Hub model configuration, see https://huggingface.co/models
hub = {
    'HF_MODEL_ID': HF_MODEL_ID,          # model to download from the Hub
    'HF_TASK': 'text-generation',        # pipeline task served by the container
    'HF_TOKEN': HUGGING_FACE_HUB_TOKEN,  # required because the model is gated
}

# Create the Hugging Face model object
huggingface_model = HuggingFaceModel(
    transformers_version='4.49.0',
    pytorch_version='2.6.0',
    py_version='py312',
    env=hub,
    role=role,
)

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (M5 instance)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.m5.xlarge`, a general-purpose CPU instance (4 vCPUs, 16 GiB of memory), which this notebook uses because the model is small

> ⚠️ **Important**: 
> - Deployment can take up to 15 minutes
> - Monitor the CloudWatch logs for progress

In [None]:
%%time

# Deploy the model to a SageMaker real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count,  # number of instances
    instance_type=instance_type,            # SageMaker ML instance type
)

In [None]:
predictor.predict({
	"inputs": "Can you please let us know more details about your training using differential privacy?",
})
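
The request above relies on the container's default generation settings. As a sketch, generation parameters can be sent alongside the prompt under a `parameters` key, and the generated text pulled out of the response. Both helper names are hypothetical, the parameter values are arbitrary, and the response shape (a list of dictionaries with a `generated_text` key) is an assumption based on the usual output of a text-generation pipeline, so verify it against what your endpoint actually returns.

```python
def build_payload(prompt, max_new_tokens=128, temperature=0.7, do_sample=True):
    """Build a request body with generation parameters for a text-generation endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # upper bound on generated tokens
            "temperature": temperature,        # sampling temperature
            "do_sample": do_sample,            # sample instead of greedy decoding
        },
    }


def extract_text(response):
    """Return the generated text from the assumed response shape."""
    if isinstance(response, list) and response and isinstance(response[0], dict):
        return response[0].get("generated_text", "")
    raise ValueError(f"Unexpected response shape: {response!r}")


payload = build_payload("Differential privacy is", max_new_tokens=64)

# Usage in the notebook (requires the deployed endpoint, so it is left commented out):
# print(extract_text(predictor.predict(payload)))
```

Because VaultGemma is a pretrained model rather than an instruction-tuned one, a prompt phrased as the beginning of a passage to complete may work better than a direct question.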

## Clean up

In [None]:
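predictor.delete_endpoint()       # deletes the endpoint and its configuration to stop charges
huggingface_model.delete_model()  # deletes the SageMaker model resource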