# How to deploy the SEA-LION v4 instruct model (Gemma-SEA-LION-v4-27B-IT) for inference using Amazon SageMaker AI with LMI v15 powered by vLLM 0.8.4
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to deploy the Gemma SEA-LION v4 27B instruct model (HuggingFace model ID: [aisingapore/Gemma-SEA-LION-v4-27B-IT](https://huggingface.co/aisingapore/Gemma-SEA-LION-v4-27B-IT)) using Amazon SageMaker AI. The inference image is the SageMaker-managed [LMI (Large Model Inference) v15 powered by vLLM 0.8.4](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) Docker image. LMI images feature a [DJL Serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

The SEA-LION v4 (Gemma 3-based) models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained and instruction-tuned variants. The base Gemma 3 model has a 128K-token context window, multilingual support for over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well suited to a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

### License agreement
* This model is gated on HuggingFace. Please refer to the original [model card](https://huggingface.co/google/gemma-3-27b-it) for the license terms.
* This notebook is a sample notebook and is not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.242.0

Let's install or upgrade these dependencies using the following command:

In [1]:
%pip install -Uq sagemaker
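
After installing, it can help to confirm that the installed SDK meets the 2.242.0 minimum stated above. The following is a minimal sketch that assumes plain `major.minor.patch` version strings; `version_tuple` and `meets_minimum` are hypothetical helpers, not part of the `sagemaker` SDK.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '2.251.0' into (2, 251, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.242.0") -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Example with the version printed later in this notebook
print(meets_minimum("2.251.0"))
```

In the notebook you would pass `sagemaker.__version__` instead of a literal string.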

### Setup

In [2]:
import sagemaker
import boto3
import logging
import time
from sagemaker.session import Session
from sagemaker.s3 import S3Uploader

print(sagemaker.__version__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
2.251.0


In [3]:
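# Create the session first so it exists even when the role lookup falls back
sagemaker_session = sagemaker.Session()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker (for example on a local machine), look up the role by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']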

In [4]:
HF_MODEL_ID = "aisingapore/Gemma-SEA-LION-v4-27B-IT"

base_name = HF_MODEL_ID.split('/')[-1].replace('.', '-').lower()
model_lineage = HF_MODEL_ID.split("/")[0]
print(f'base_name = {base_name}')

base_name = gemma-sea-lion-v4-27b-it


## Configure Model Serving Properties

Now we'll create a `serving.properties` file that configures how the model will be served. This configuration is crucial for optimal performance and memory utilization.

Key configurations explained:
- **Engine**: Python backend for model serving
- **Model Settings**:
  - Uses the `aisingapore/Gemma-SEA-LION-v4-27B-IT` model from the HuggingFace Hub
  - Maximum sequence length of 32768 tokens
  - Model loading timeout of 1200 seconds (20 minutes)
- **Performance Optimizations**:
  - Tensor parallelism across all available GPUs
  - Max rolling batch size of 16 for efficient batching
  
#### Understanding KV Cache and Context Window

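The `max_model_len` parameter controls the maximum sequence length the model can handle, which directly affects the size of the KV (Key-Value) cache in GPU memory. Longer sequences need more cache per request, which leaves less memory for serving requests concurrently. A practical way to tune it: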

1. Start with a conservative value (current: 32768)
2. Monitor GPU memory usage
3. Incrementally increase if memory permits
4. Target the model's full context window 
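
To make the relationship between `max_model_len` and GPU memory concrete, here is a rough back-of-the-envelope estimate. The architecture values below (`num_layers`, `num_kv_heads`, `head_dim`) are assumptions for illustration and should be replaced with the values from the model's `config.json`. The formula also ignores optimizations such as sliding-window attention layers, so treat the result as a rough upper bound rather than what vLLM will actually allocate.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size: keys and values (factor 2) for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# ASSUMED architecture values for illustration only; read the real ones from config.json
ASSUMED = dict(num_layers=62, num_kv_heads=16, head_dim=128)

for max_model_len in (32768, 65536, 131072):
    gib = kv_cache_bytes(max_model_len, **ASSUMED) / 2**30
    print(f"max_model_len={max_model_len}: ~{gib:.1f} GiB per full-length sequence")
```

Under these assumptions the cost grows linearly with `max_model_len`, which is why the steps above suggest increasing it gradually while watching GPU memory.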

In [5]:
# Create the directory that will contain the configuration files
from pathlib import Path

model_dir = Path('config')
model_dir.mkdir(exist_ok=True)

If you are deploying a model hosted on the HuggingFace Hub, you must specify the `option.model_id=<hf_hub_model_id>` configuration. When using a model directly from the hub, we recommend you also specify the model revision (commit hash or branch) via `option.revision=<commit hash/branch>`. 

Since model artifacts are downloaded at runtime from the Hub, using a specific revision ensures you are using a model compatible with the package versions in the runtime environment. Open-source model artifacts on the Hub are subject to change at any time. These changes may cause issues when instantiating the model (for example, updated model artifacts may require a newer version of a dependency than the one bundled in the container). If a model provides custom model (`modeling.py`) and/or custom tokenizer (`tokenizer.py`) files, you need to specify `option.trust_remote_code=true` to load and use the model.
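
As an illustration of pinning a revision, the sketch below renders a small subset of the properties with an explicit commit hash instead of a branch name. `render_serving_properties` is a hypothetical helper, and `0123abc` is a placeholder rather than a real commit of this model; the real hash would come from the model's commit history on the Hub.

```python
def render_serving_properties(model_id: str, revision: str, max_model_len: int = 32768) -> str:
    """Render a subset of serving.properties with the model revision pinned."""
    lines = [
        "engine=Python",
        f"option.model_id={model_id}",
        f"option.revision={revision}",
        f"option.max_model_len={max_model_len}",
        "option.trust_remote_code=true",
    ]
    return "\n".join(lines) + "\n"


# "0123abc" is a PLACEHOLDER, not a real commit of this model
pinned = render_serving_properties("aisingapore/Gemma-SEA-LION-v4-27B-IT", revision="0123abc")
print(pinned)
```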

In [6]:
config = f"""engine=Python
option.async_mode=true
option.rolling_batch=disable
option.entryPoint=djl_python.lmi_vllm.vllm_async_service
option.tensor_parallel_degree=max
option.model_loading_timeout=1200
fail_fast=true
option.max_model_len=32768
option.max_rolling_batch_size=16
option.trust_remote_code=true
option.model_id={HF_MODEL_ID}
option.revision=main
"""

# If you have copied the model data from HuggingFace to an S3 bucket, you can replace
# option.model_id={HF_MODEL_ID}
# with
# option.model_id={model_s3_uri}

with open("config/serving.properties", "w") as f:
    f.write(config)
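
As a sanity check on the file just written, the properties can be read back into a dictionary and inspected. This is a minimal sketch that handles only simple `key=value` lines; `parse_properties` is a hypothetical helper, and the sample string stands in for the file content.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Sample content standing in for config/serving.properties
sample = "engine=Python\noption.max_model_len=32768\noption.revision=main\n"
props = parse_properties(sample)
print(props["option.max_model_len"])
```

In the notebook, `parse_properties(Path("config/serving.properties").read_text())` would apply the same check to the real file.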

In [7]:
# Check that the file config/serving.properties was generated properly
!pygmentize config/serving.properties

engine=Python
option.async_mode=true
option.rolling_batch=disable
option.entryPoint=djl_python.lmi_vllm.vllm_async_service
option.tensor_parallel_degree=max
option.model_loading_timeout=1200
fail_fast=true
option.max_model_len=32768
option.max_rolling_batch_size=16
option.trust_remote_code=true
option.model_id=aisingapore/Gemma-SEA-LION-v4-27B-IT
option.revision=main


**Best Practices**:
>
> **Store Models in Your Own S3 Bucket**
> For production use cases, always download and store model files in your own S3 bucket to ensure validated artifacts. This provides verified provenance, improved access control, consistent availability, protection against upstream changes, and compliance with organizational security protocols.
>
> **Separate Configuration from Model Artifacts**
> The LMI container supports separating configuration files from model artifacts. While you can store `serving.properties` with your model files, placing configurations in a distinct S3 location allows for better management of all your configuration files.
>
> When your model and configuration files are in different S3 locations, set `option.model_id=<s3_model_uri>` in your `serving.properties` file, where `s3_model_uri` is the S3 object prefix containing your model artifacts.

#### Optional configuration files

(Optional) You can also specify a `requirements.txt` to install additional libraries.

### Upload config files to S3
SageMaker AI allows us to provide [uncompressed](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html) files. Thus, we directly upload the folder that contains `serving.properties` to S3.
> **Note**: The default SageMaker bucket follows the naming pattern: `sagemaker-{region}-{account-id}`

In [9]:
from sagemaker.s3 import S3Uploader

sagemaker_default_bucket = sagemaker_session.default_bucket()

config_files_uri = S3Uploader.upload(
    local_path="config",
    desired_s3_uri=f"s3://{sagemaker_default_bucket}/lmi/{base_name}/config-files"
)

In [10]:
# print(f"code_model_uri: {config_files_uri}")
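
For reference, the upload location follows directly from the default bucket naming pattern and the prefix used in the upload cell. The sketch below reconstructs it with hypothetical helpers; the account ID `123456789012` is a placeholder.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Default SageMaker bucket name: sagemaker-{region}-{account-id}."""
    return f"sagemaker-{region}-{account_id}"


def config_prefix_uri(bucket: str, base_name: str) -> str:
    """S3 prefix used in this notebook for the uploaded configuration files."""
    return f"s3://{bucket}/lmi/{base_name}/config-files"


bucket = default_bucket_name("us-east-1", "123456789012")  # placeholder account ID
uri = config_prefix_uri(bucket, "gemma-sea-lion-v4-27b-it")
print(uri)
```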

## Configure Model Container and Instance

For deploying Gemma-SEA-LION-v4-27B-IT, we'll use:
- **LMI (Large Model Inference) container**: A DJL Serving-based container optimized for large language model inference
- **[G6e Instance](https://aws.amazon.com/ec2/instance-types/g6e/)**: AWS's GPU instance type powered by NVIDIA L40S Tensor Core GPUs 

Key configurations:
- The container URI points to the DJL inference container in ECR (Elastic Container Registry)
- We use the `ml.g6e.48xlarge` instance, which offers:
  - 8 NVIDIA L40S Tensor Core GPUs
  - 384 GB of total GPU memory (48 GB of memory per GPU)
  - up to 400 Gbps of network bandwidth
  - up to 1.536 TB of system memory
  - and up to 7.6 TB of local NVMe SSD storage.
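
A quick calculation shows why this instance leaves comfortable room for the model. The GPU figures come from the list above; the 2 bytes per parameter is an assumption (16-bit weights), and the remaining headroom is shared between the KV cache, activations, and runtime overhead, so it is only a rough guide.

```python
GPUS = 8                 # from the instance specification above
GPU_MEMORY_GB = 48       # per GPU, from the instance specification above
PARAMS_BILLION = 27      # from the model name
BYTES_PER_PARAM = 2      # ASSUMPTION: 16-bit weights (bf16/fp16)

total_gpu_memory_gb = GPUS * GPU_MEMORY_GB
weights_gb = PARAMS_BILLION * BYTES_PER_PARAM      # billions of params * bytes each = GB
weights_per_gpu_gb = weights_gb / GPUS             # tensor parallelism shards the weights
headroom_per_gpu_gb = GPU_MEMORY_GB - weights_per_gpu_gb

print(f"Total GPU memory:   {total_gpu_memory_gb} GB")
print(f"Model weights:      ~{weights_gb} GB (~{weights_per_gpu_gb} GB per GPU)")
print(f"Headroom per GPU:   ~{headroom_per_gpu_gb} GB")
```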

> **Note**: The region in the container URI should match your AWS region.

In [11]:
gpu_instance_type = "ml.g6e.48xlarge"

In [12]:
image_uri = "763104351884.dkr.ecr.{}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128".format(sagemaker_session.boto_session.region_name)
print(image_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128
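
If you deploy in several regions, the URI can be built by a small function. `lmi_image_uri` is a hypothetical helper that simply reproduces the string formatting from the cell above. The registry account ID may differ in some regions, so check the AWS Deep Learning Containers image list before relying on the default.

```python
def lmi_image_uri(
    region: str,
    tag: str = "0.33.0-lmi15.0.0-cu128",
    account: str = "763104351884",  # may differ in some regions; verify for yours
) -> str:
    """Build the ECR URI of the DJL inference image for a given region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(lmi_image_uri("us-east-1"))
```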


## Create SageMaker Model

Now we'll create a SageMaker Model object that combines our:
- Container image (LMI)
- Model artifacts (configuration files)
- IAM role (for permissions)

This step defines the model configuration but doesn't deploy it yet. The Model object represents the combination of:

1. **Container Image** (`image_uri`): DJL Inference optimized for LLMs
2. **Model Data** (`model_data`): Our configuration files in S3
3. **IAM Role** (`role`): Permissions for model execution

### Required Permissions
The IAM role needs:
- S3 read access for model artifacts
- CloudWatch permissions for logging
- ECR permissions to pull the container
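
As an illustration of what these three permissions can look like, the sketch below builds an IAM policy document as a Python dictionary. It is a starting point under stated assumptions, not a vetted least-privilege policy: the bucket name is a placeholder, and the `"*"` resources should be narrowed for real use.

```python
import json

CONFIG_BUCKET = "sagemaker-us-east-1-123456789012"  # PLACEHOLDER bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadModelArtifacts",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{CONFIG_BUCKET}",
                f"arn:aws:s3:::{CONFIG_BUCKET}/*",
            ],
        },
        {
            "Sid": "WriteLogs",
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",  # narrow to specific log groups for real use
        },
        {
            "Sid": "PullContainerImage",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
            ],
            "Resource": "*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```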

#### HUGGING_FACE_HUB_TOKEN 
Gemma-3-27B-Instruct is a gated model. Therefore, if you deploy model files hosted on the Hub, you need to provide your HuggingFace token as an environment variable. This enables SageMaker AI to download the files at runtime.

In [13]:
# Specify the S3 URI for your uncompressed config files
model_data = {
    "S3DataSource": {
        "S3Uri": f"{config_files_uri}/",
        "S3DataType": "S3Prefix",
        "CompressionType": "None"
    }
}

In [15]:
HUGGING_FACE_HUB_TOKEN = "hf_ (fill in your Hugging Face Hub token here)"
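
To avoid saving a token in the notebook itself, one option is to read it from an environment variable of the notebook process. This is a minimal sketch; `read_hf_token` is a hypothetical helper, and it assumes you have exported the variable before starting the kernel.

```python
import os


def read_hf_token(env_var: str = "HUGGING_FACE_HUB_TOKEN") -> str:
    """Return the token from the environment, failing early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return token
```

With this in place, `HUGGING_FACE_HUB_TOKEN = read_hf_token()` would replace the hard-coded string.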

In [17]:
from sagemaker.utils import name_from_base
from sagemaker.model import Model

model_name = name_from_base(base_name, short=True)
model_name

'gemma-sea-lion-v4-27b-it-250825-0926'

In [18]:
# Create model
sealion_4_model = Model(
    name = model_name,
    image_uri=image_uri,
    model_data=model_data,  # Path to uncompressed code files
    role=role,
    env={
        "HF_TASK": "Image-Text-to-Text",
        "OPTION_LIMIT_MM_PER_PROMPT": "image=2", # Limit the number of images that can be sent per prompt
        "HUGGING_FACE_HUB_TOKEN": HUGGING_FACE_HUB_TOKEN # HF Token for gated models
    },
)

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (G6e instance)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.g6e.48xlarge` for high-performance inference

> ⚠️ **Important**: 
> - Deployment can take up to 15 minutes
> - Monitor the CloudWatch logs for progress

The following cell may take approximately 11-12 minutes (or more) to run.
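
`Model.deploy` waits for the endpoint by default. If you deploy with `wait=False`, or want to check progress from another session, the status can be polled through the `describe_endpoint` API. The sketch below takes the boto3 SageMaker client as an argument; `wait_for_endpoint` is a hypothetical helper, and the polling interval and timeout are arbitrary choices.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        print(f"Endpoint status: {status}")
        if status != "Creating":
            return status  # for example 'InService' or 'Failed'
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was still creating after {timeout_seconds}s")
```

Usage would look like `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)`.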

In [None]:
%%time

from sagemaker.utils import name_from_base

endpoint_name = name_from_base(base_name, short=True)

try:
    sealion_4_model.deploy(
        endpoint_name=endpoint_name,
        initial_instance_count=1,
        instance_type=gpu_instance_type
    )
except Exception as e:
    print(f"Exception: {e}")


### Use the code below to create a predictor from an existing endpoint and run inference

In [21]:
from sagemaker.serializers import JSONSerializer, IdentitySerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor

In [23]:
# Option 1: Automatically retrieve the first endpoint (if this is your only endpoint)
sagemaker_client = boto3.client('sagemaker')
response = sagemaker_client.list_endpoints()
endpoint_names = [ endpoint['EndpointName'] for endpoint in response['Endpoints'] ]
if len(endpoint_names):
    endpoint_name = endpoint_names[0]
    print(f'Using endpoint: {endpoint_name}')

# Option 2: Set the endpoint name manually (Uncomment below to use Option 2)
# endpoint_name = "gemma-3-27b-it-... (replace with your endpoint name)"

Using endpoint: gemma-sea-lion-v4-27b-it-250825-0928
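
Option 1 picks whichever endpoint is listed first, which can be the wrong one in an account with several endpoints. A slightly safer sketch filters by name and status using the `NameContains`, `StatusEquals`, `SortBy`, and `SortOrder` parameters of `list_endpoints`. `find_endpoint` is a hypothetical helper that takes the boto3 SageMaker client as an argument.

```python
def find_endpoint(client, name_contains: str):
    """Return the newest InService endpoint whose name contains the given text, or None."""
    response = client.list_endpoints(
        NameContains=name_contains,
        StatusEquals="InService",
        SortBy="CreationTime",
        SortOrder="Descending",
    )
    endpoints = response["Endpoints"]
    return endpoints[0]["EndpointName"] if endpoints else None
```

For this notebook, `find_endpoint(sagemaker_client, base_name)` would restrict the search to endpoints created from this model.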


In [25]:
predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

## Text only Inference

In [37]:
def invoke_sealion(prompt: str, print_response=True, **kwargs):
    payload = {
        "messages" : [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ],
        "max_tokens": 500,
        "temperature": 0.1,
        "top_p": 0.9,
    }
    for k in kwargs:
        if k in ['max_tokens', 'temperature', 'top_p']:
            payload[k] = kwargs[k]
    response = predictor.predict(payload)
    
    if print_response:
        # Print usage statistics
        usage = response['usage']
        print(response['choices'][0]['message']['content'].strip())
        print(f"=== Token Usage: {usage['prompt_tokens']} (prompt), {usage['completion_tokens']} (completions), {usage['total_tokens']} (total) ===")
    return response['choices'][0]['message']['content']

In [38]:
_ = invoke_sealion("Write me a poem about Machine Learning.", max_tokens=500, temperature=0.1, top_p=0.9)

## The Algorithm's Dream

No longer coded, line by rigid line,
A new intelligence begins to shine.
Machine Learning, a whisper, then a roar,
Learning from data, wanting more and more.

It starts with patterns, hidden in the haze,
A million examples in a digital maze.
Features extracted, a careful, keen eye,
To find the connections as the moments fly.

Regression's curve, predicting what will be,
Classification sorting, for you and for me.
Clustering groups, where similarities reside,
Uncovering secrets, deep inside.

Neural networks layered, a mimicking brain,
Connections strengthening, again and again.
Backpropagation's dance, a subtle, slow art,
Adjusting the weights, playing a crucial part.

From spam detection to faces it knows,
From medical diagnoses to where the river flows.
It learns to translate, to write, and to see,
A powerful tool, for you and for me.

But caution is needed, a mindful embrace,
Bias can creep in, leaving a flawed trace.
Explainability sought, a transparent vi

In [39]:
SAMPLE_PROMPTS = [
    """Terjemahkan teks berikut ini ke dalam Bahasa Inggris. Teks: Anak laki-laki ini, yang secara teknis tidak diijinkan untuk memiliki akun situs ini untuk tiga tahun mendatang,menemukan sebuah bug (kesalahan akibat ketidaksempurnaan desain) yang memungkinkan dia menghapus komentar yang dibuat oleh pengguna lain. Masalah ini dengan “cepat” diperbaiki setelah ditemukan, demikian keterangan Facebook, perusahaan media sosial yang memiliki Instagram. Jani kemudian dibayar - yang membuat dia sebagai anak yang termuda yang pernah menerima hadiah atas penemuan bug ini. Setelah menemukan kekurangan itu pada Februari, dia mengirim email ke Facebook. Beli sepeda dan peralatan sepak bola Sejumlah ahli teknik keamanan di perusahaan itu telah membuat akun uji coba kepada Jani untuk membuktikan teorinya - dan dia dapat melakukannya. Anak laki-laki ini, dari Helsinki, mengatakan kepada koran Finlandia Iltalehti, dia berencana untuk menggunakan uang itu untuk membeli sepeda baru, peralatan sepak bola dan komputer untuk saudara laki-lakinya. Facebook mengatakan kepada BBC, telah membayar $4.3 juta sebagai hadiah bagi yang menemukan bug sejak 2011. Banyak perusahaan menawarkan sebuah insentif keuangan bagi profesional keamanan - dan anak-anak muda, yang menyampaikan kekurangan itu kepada perusahaan, dibandingkan menjualnya ke pasar gelap. Terjemahan:""",
    """Apa sentimen dari kalimat berikut ini? Kalimat: Buku ini sangat membosankan. Jawaban:""",
    """Anda akan diberikan sebuah teks dan pertanyaan. Jawablah pertanyaan tersebut berdasarkan teks yang tersedia. > Teks: “Isyana lahir di Bandung pada 2 Mei 1993. Dia menghabiskan masa kecilnya di berbagai lokasi, karena orang tuanya bekerja & melanjutkan studi mereka di Belgia. Namun, pada usia 7 tahun keluarganya pindah ke Bandung, Indonesia. Isyana adalah putri bungsu dari pasangan Luana Marpanda, seorang guru musik, dan Sapta Dwikardana, Ph.D seorang dosen dan terapis (grafologis). Ia memiliki kakak perempuan bernama Rara Sekar Larasati, yang juga merupakan vokalis band bernama Banda Neira. Dibesarkan dalam keluarga pendidik, Isyana diperkenalkan ke dunia musik pada usia 4 tahun oleh ibunya. Isyana telah menguasai sejumlah instrumen. Termasuk piano, electone, flute, biola, dan saksofon.” > Pertanyaan: Siapa nama orang tua Isyana?""",
    """Sebutkan persamaan dan perbedaan antara gado-gado, ketoprak dan karedok""",
    """Jelaskan budaya Indonesia menyapa orang yang lebih tua?""",
    """Jelaskan budaya pulang kampung ketika lebaran?""",
    """Sebutkan berbagai jenis kopi dan karakteristik rasanya yang berasal dari Indonesia"""   
]

In [41]:
for prompt in SAMPLE_PROMPTS:
    print(f"##### Prompt: {prompt} #####")
    print(f"##### Response #####")
    _ = invoke_sealion(prompt, max_tokens=500, temperature=0.1, top_p=0.9)
    print()

##### Prompt: Terjemahkan teks berikut ini ke dalam Bahasa Inggris. Teks: Anak laki-laki ini, yang secara teknis tidak diijinkan untuk memiliki akun situs ini untuk tiga tahun mendatang,menemukan sebuah bug (kesalahan akibat ketidaksempurnaan desain) yang memungkinkan dia menghapus komentar yang dibuat oleh pengguna lain. Masalah ini dengan “cepat” diperbaiki setelah ditemukan, demikian keterangan Facebook, perusahaan media sosial yang memiliki Instagram. Jani kemudian dibayar - yang membuat dia sebagai anak yang termuda yang pernah menerima hadiah atas penemuan bug ini. Setelah menemukan kekurangan itu pada Februari, dia mengirim email ke Facebook. Beli sepeda dan peralatan sepak bola Sejumlah ahli teknik keamanan di perusahaan itu telah membuat akun uji coba kepada Jani untuk membuktikan teorinya - dan dia dapat melakukannya. Anak laki-laki ini, dari Helsinki, mengatakan kepada koran Finlandia Iltalehti, dia berencana untuk menggunakan uang itu untuk membeli sepeda baru, peralatan se

## Multimodality

SEA-LION v4 models (based on Gemma 3) are multimodal, handling text and image input and generating text output, with open weights for both pre-trained and instruction-tuned variants.

#### Single image

In [42]:
from IPython.display import Image as IPyImage
IPyImage(url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG", height=300, width= 300)

In [43]:
def invoke_sealion_multimodal(prompt: str, image_url: str, system_prompt="You are a helpful assistant.", print_response=True, **kwargs):
    payload = {
        "messages": [
            {
              "role": "system",
              "content": [
                  {
                      "type": "text",
                      "text": system_prompt
                  }
              ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url", 
                        "image_url": {
                            "url": image_url
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ],
        "max_tokens": 500,
        "temperature": 0.1,
        "top_p": 0.9,
    }
    for k in kwargs:
        if k in ['max_tokens', 'temperature', 'top_p']:
            payload[k] = kwargs[k]
    response = predictor.predict(payload)
    
    if print_response:
        # Print usage statistics
        usage = response['usage']
        print(response['choices'][0]['message']['content'].strip())
        print(f"=== Token Usage: {usage['prompt_tokens']} (prompt), {usage['completion_tokens']} (completions), {usage['total_tokens']} (total) ===")
    return response['choices'][0]['message']['content']

In [46]:
_ = invoke_sealion_multimodal(
    prompt = "What animal is on the candy?",
    image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
)

Based on the image, the animal on the candy appears to be a **turtle**. Each candy has a small turtle design on it.
=== Token Usage: 282 (prompt), 28 (completions), 310 (total) ===
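
The example above passes a public image URL. For a local image, OpenAI-style chat APIs commonly accept a base64 `data:` URL in the same `image_url` field. Whether this endpoint accepts it is an assumption that you should verify against your deployment. The helpers below are hypothetical, and the three bytes are a stand-in for real image content.

```python
import base64


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_block(url: str) -> dict:
    """Build an image content block in the message format used above."""
    return {"type": "image_url", "image_url": {"url": url}}


placeholder_bytes = b"\xff\xd8\xff"  # stand-in for the bytes of a real JPEG file
data_url = image_bytes_to_data_url(placeholder_bytes)
print(image_block(data_url))
```

With a real file, `Path("photo.jpg").read_bytes()` would supply the bytes, and the resulting URL could be passed as `image_url` to `invoke_sealion_multimodal`.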


### Streaming responses
You can also directly stream responses from your endpoint. To achieve this, we will use the `invoke_endpoint_with_response_stream` API.

You can **interleave images with text**. To do so, just cut off the input text where you want to insert an image, and insert it with an image block like the following.

In [47]:
from IPython.display import Image as IPyImage
# IPyImage(url="https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3018.JPG", height=300, width= 300)
IPyImage(url="https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/fruit_knife.png", height=300, width= 300)

In [48]:
from IPython.display import Image as IPyImage
# IPyImage(url="https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3015.jpg", height=300, width= 300)
IPyImage(url="https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/candy.JPG", height=300, width= 300)

In [49]:
body = {
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "I have this "},  # "I'm already using this supplement "},
        {
          "type": "image_url", 
          "image_url": {"url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/candy.JPG"},
        },
        {"type": "text", "text": "and I want to use this as well "},
        {
          "type": "image_url", 
          "image_url": {"url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/fruit_knife.png"}, # "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_3015.jpg"},
        },
        {"type": "text", "text": " what are cautions?"},
      ]
    }
  ],
  "max_tokens": 1500,
  "temperature": 0.6,
  "top_p": 0.9,
  "stream": True
}

In [50]:
import json
import time

# Create SageMaker Runtime client
sagemaker_runtime_client = boto3.client("sagemaker-runtime")
print(f'Using endpoint: {endpoint_name}')

# Invoke the model
response_stream = sagemaker_runtime_client.invoke_endpoint_with_response_stream(
    EndpointName = endpoint_name,
    ContentType = "application/json",
    Body = json.dumps(body)
)

first_token_received = False
ttft = None
token_count = 0
start_time = time.time()

print("Response:", end=' ', flush=True)
full_response = ""

for event in response_stream['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        
        try:
            # Handle SSE format (data: prefix)
            if chunk.startswith('data: '):
                data = json.loads(chunk[6:])  # Skip "data: " prefix
            else:
                data = json.loads(chunk)
            
            # Extract token based on OpenAI format
            if 'choices' in data and len(data['choices']) > 0:
                if 'delta' in data['choices'][0] and 'content' in data['choices'][0]['delta']:
                    token_count += 1
                    token_text = data['choices'][0]['delta']['content']
                    # Record time to first token
                    if not first_token_received:
                        ttft = time.time() - start_time
                        first_token_received = True
                    full_response += token_text
                    print(token_text, end='', flush=True)
        
        except json.JSONDecodeError:
            continue
            
# Print metrics after completion
end_time = time.time()
total_latency = end_time - start_time

print("\n\nMetrics:")
print(f"Time to First Token (TTFT): {ttft:.2f} seconds" if ttft else "TTFT: N/A")
print(f"Total Tokens Generated: {token_count}")
print(f"Total Latency: {total_latency:.2f} seconds")
if token_count > 0 and total_latency > 0:
    print(f"Tokens per second: {token_count/total_latency:.2f}")

Using endpoint: gemma-sea-lion-v4-27b-it-250825-0928
Response: , see you a picture of what appears to be jelly beans (or similar candies) in a hand, and a picture of fruit and a knife on a cutting board. 

Here are some cautions combining these images, assuming you're thinking about combining them in some way (like a composite image, a story, or a concept):

**1. Health & Safety (Especially if portraying consumption):**

*   **Sugar Content:** The candies are high in sugar. Combining this with fruit could reinforce a message of excessive sugar intake. If you're creating content for children, be mindful of this.
*   **Knife Safety:** The knife in the second image is a potential hazard. If you're depicting someone using it, emphasize safe handling. Avoid any imagery that could encourage dangerous behavior.
*   **Choking Hazard:** Small candies can be a choking hazard, especially for young children. If your image includes children, avoid showing them putting the candies in their mouths.
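
The streaming loop above assumes that each `PayloadPart` carries exactly one complete JSON event and silently skips anything that fails to parse. A part may instead contain several events or only a fragment of one, in which case tokens can be dropped; the sample output starting mid-sentence is consistent with that. The sketch below buffers bytes and parses line by line. `iter_sse_json` is a hypothetical helper, and the exact event framing used by the container is an assumption.

```python
import json


def _parse_line(line: bytes):
    """Parse one line as JSON, accepting an optional 'data:' prefix; return None to skip."""
    text = line.decode("utf-8").strip()
    if text.startswith("data:"):
        text = text[len("data:"):].strip()
    if not text or text == "[DONE]":
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def iter_sse_json(byte_chunks):
    """Yield JSON events from byte chunks that may split or merge lines."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            event = _parse_line(line)
            if event is not None:
                yield event
    # Flush a final line that arrived without a trailing newline
    event = _parse_line(buffer)
    if event is not None:
        yield event


# Simulated chunks: the first event is split across two chunks
chunks = [
    b'data: {"choices":[{"delta":{"content":"Hel',
    b'lo"}}]}\n\ndata: {"choices":[{"delta":{"content":" world"}}]}\n',
    b"data: [DONE]\n",
]
text = "".join(e["choices"][0]["delta"]["content"] for e in iter_sse_json(chunks))
print(text)
```

In the loop above, the chunks would come from `event['PayloadPart']['Bytes']` for each event in `response_stream['Body']` that contains a `PayloadPart`.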



# Clean up

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
# Use this to generate html version of the Jupyter notebook
!jupyter nbconvert Gemma-SEA-LION-v4-27B-Instruct.ipynb --to html