# Introduction

Gemma 3 was launched on March 12, 2025, in four sizes, each available as a pretrained (pt) and an instruction-tuned (it) variant:

* **1B** - input: 32K, output: 8K, text input and output
* **4B** - input: 128K, output: 8K, multimodal input, text output 
* **12B** - input: 128K, output: 8K, multimodal input, text output 
* **27B** - input: 128K, output: 8K, multimodal input, text output

In this notebook we use `gemma-3/transformers/gemma-3-4b-it/1` and `gemma-3/transformers/gemma-3-27b-it/1`.
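
On Kaggle, attached models are mounted under `/kaggle/input`, which is how these handles turn into the local paths used later in this notebook. A minimal sketch of that mapping, assuming the mount root seen in the model-loading cell; `kaggle_model_path` is a hypothetical helper, not part of any library:

```python
from pathlib import PurePosixPath

# Assumed mount root for attached Kaggle inputs (taken from the paths used in this notebook).
KAGGLE_INPUT_ROOT = PurePosixPath("/kaggle/input")


def kaggle_model_path(size: str, variant: str = "it", version: int = 1) -> str:
    """Build the local path of a Gemma 3 checkpoint, e.g. size="4b", variant="it"."""
    if size not in {"1b", "4b", "12b", "27b"}:
        raise ValueError(f"unknown Gemma 3 size: {size}")
    if variant not in {"pt", "it"}:
        raise ValueError(f"unknown variant: {variant}")
    model_dir = f"gemma-3-{size}-{variant}"
    return str(KAGGLE_INPUT_ROOT / "gemma-3" / "transformers" / model_dir / str(version))


print(kaggle_model_path("4b"))
```

Validating the size and variant up front gives a clear error for a mistyped handle instead of a missing-directory failure at load time.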


# Installations

We install the transformers release tagged for Gemma 3 (`v4.49.0-Gemma-3`), together with accelerate and bitsandbytes.

In [None]:
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3 -q --no-cache
!pip install accelerate bitsandbytes

# Import packages

We import `pipeline` and `AutoProcessor` from the transformers library, and torch, which we use to set the parameter dtype. We also print each package version so that the results remain reproducible.

In [None]:
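import gc
import os
import sys
from importlib.metadata import version

# Set the allocator options before torch is imported so the CUDA allocator picks them up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,garbage_collection_threshold:0.6"

import torch
from transformers import pipeline, AutoProcessor

print("Python:", sys.version.replace("\n", " "))
# Print the installed version of each key package for reproducibility.
for pkg in ("torch", "transformers", "accelerate", "bitsandbytes"):
    print(f"{pkg}:", version(pkg))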

# Initialize the pipeline

In [None]:
# Point at the local model directory and let HF pull in the custom classes
pipe = pipeline(
    "image-text-to-text",
    model="/kaggle/input/gemma-3/transformers/gemma-3-4b-it/1",
    revision="main",                 # or the actual branch/tag if different
    trust_remote_code=True,          # allow custom Gemma3* classes to be loaded
    device_map="auto",
    load_in_8bit=True,               # quantize weights to 8-bit
    torch_dtype=torch.bfloat16,      # mixed bfloat16 math
    low_cpu_mem_usage=True,          # minimize CPU RAM spikes
    offload_folder="/kaggle/working/offload"
)

# The processor is auto-discovered, but here we load it explicitly:
processor = AutoProcessor.from_pretrained(
    "/kaggle/input/gemma-3/transformers/gemma-3-4b-it/1",
    trust_remote_code=True
)

# re-wrap so the processor is used
pipe = pipeline(
    "image-text-to-text",
    model=pipe.model,
    processor=processor,
    device_map="auto",
    trust_remote_code=True,
    use_cache=False                  # disable KV cache to save ~0.3 GiB
)

print("Successfully loaded Gemma 3!")

# Try simple multimodal prompts


We build a message that declares the type of each input (image and text) and provides the image path and the text.
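
As a sketch of that structure: the chat format used in this notebook is a list of turns, each with a `role` and a list of typed `content` parts. `build_messages` is a hypothetical helper, and the path and texts below are placeholders:

```python
from typing import Any


def build_messages(image_path: str, texts: list[str]) -> list[dict[str, Any]]:
    """Return a single user turn holding one image part followed by the text parts."""
    content: list[dict[str, str]] = [{"type": "image", "url": image_path}]
    content += [{"type": "text", "text": t} for t in texts]
    return [{"role": "user", "content": content}]


# Placeholder inputs, for illustration only.
messages = build_messages("/path/to/image.jpg", ["Describe this image.", "Keep it short."])
print([part["type"] for part in messages[0]["content"]])
```

Keeping the image first and the texts in order mirrors how the parts are laid out in the generation loop of this notebook.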

In [None]:
import pandas as pd
import json

folder_img_path = "/kaggle/input/eventa-img-cieldt/database_images_compressed90/"

df_submission = pd.read_csv("/kaggle/input/eventa-private-submission/submission_ensembleScore.csv")

# The context manager closes the file on exit, so no explicit close() is needed.
with open("/kaggle/input/eventa-json-cieldt/database.json", encoding="utf-8") as f:
    database = json.load(f)

In [None]:
%%time

for i in range(0, 2000):
    # --- 1. Get the information from the ensemble score ---
    query_id = df_submission.loc[i, "query_id"]
    retrieved_img_id = df_submission.loc[i, "retrieved_id_img"]
    article_id = df_submission.loc[i, "article_id_1"]
    # Join the article fields with newlines so they do not run together, then cap at 15,000 characters.
    entry = database[article_id]
    article_content = f"date: {entry['date']}\ntitle: {entry['title']}\ncontent: {entry['content']}"
    article_content = article_content[:15000]
    if (i % 10 == 0):
        print(i, ", len article_content:", len(article_content))
    
    # --- 2. Path to the image ---
    image_path = folder_img_path + retrieved_img_id + ".jpg"
    # Special case: this image is problematic in the main folder, so use the copy in the exception dataset.
    if retrieved_img_id == '3a565cf1ad272aaf':
        image_path = '/kaggle/input/eventa-img-exception-cieldt/3a565cf1ad272aaf.jpg'
    
    # --- 3. Build the prompt ---
    prompt = """You are a seasoned expert in journalistic photo caption writing. Your primary goal is to create a SINGLE, comprehensive caption that **masterfully connects the provided IMAGE to the key events, figures, and narratives within the ARTICLE.**
    
    This caption must achieve the following, with a strong emphasis on the connection to the article:
    
    1.  **Contextualize the Image through the Article First:**
        *   Before describing visual details, understand the **core message, main event, or central figures discussed in the ARTICLE.**
        *   Identify **how the IMAGE serves to illustrate, support, or provide a visual anchor for these key aspects of the article.**
    
    2.  **Describe the Image in Service of the Article's Narrative:**
        *   Briefly describe the **most relevant visual elements of the IMAGE** – focusing on subjects, actions, and settings that directly pertain to the article's content.
        *   **Prioritize details that help the reader understand the image's role in the story.** Less critical visual minutiae can be omitted if they don't serve this primary purpose.
        *   If specific named entities (people, places, organizations) from the article are clearly visible and relevant, ensure they are mentioned to strengthen the link.
    
    3.  **Clearly Articulate the Significance and Connection:**
        *   The caption's main thrust should be to clarify the **image's significance and its direct relationship to the information presented in the article.**
        *   Ensure the reader understands *why* this specific image accompanies this article.
    
    4.  **Professional and Informative Style:**
        *   Use precise, informative, and coherent language.
        *   The style should be journalistic, conveying information efficiently and accurately.
    
    **Crucial Requirement:** Begin writing the caption IMMEDIATELY. Do not include any introductory phrases such as 'Here is the caption:', 'The caption is:', or similar.
    """
    requirement_hard = f'''**Crucial Requirement:** Begin writing the caption IMMEDIATELY. Do not include any introductory phrases such as 'Here is the caption:', 'The caption is:', or similar.....'''
    
    # --- 4. Message structure ---
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_path},        # accompanying image
                {"type": "text", "text": prompt},            # prompt
                {"type": "text", "text": article_content},   # article content
                {"type": "text", "text": requirement_hard},  # only generate the caption
            ]
        }
    ]
    
    # --- 5. Run the model, then print and store the result ---
    output = pipe(text=messages, max_new_tokens=1500)
    caption = output[0]["generated_text"][-1]["content"]
    if i % 10 == 0:
        print(caption)

    df_submission.loc[i, "generated_caption"] = caption

    # --- 6. Free memory ---
    del output
    del messages
    del article_content 
    gc.collect()
    torch.cuda.empty_cache()
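
The article text fed to the model in the loop above is built from the `date`, `title` and `content` fields and capped at 15,000 characters. A minimal standalone sketch of that step, assuming newline separators between fields; `build_article_text` is a hypothetical helper and the sample entry is made up:

```python
def build_article_text(entry: dict, max_chars: int = 15000) -> str:
    """Join the date, title and content fields, then cap the result at max_chars characters."""
    parts = [f"{field}: {entry.get(field, '')}" for field in ("date", "title", "content")]
    return "\n".join(parts)[:max_chars]


# Made-up entry with an overlong body, for illustration only.
sample = {"date": "2024-01-01", "title": "Example", "content": "x" * 20000}
text = build_article_text(sample)
print(len(text))
print(text.splitlines()[0])
```

Note that the cap counts characters, not tokens, so the number of tokens the model actually sees depends on the tokenizer.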

In [None]:
df_submission.to_csv("submission_private.csv", index=False)
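
Since the generation loop only fills the first 2,000 rows, it may be worth checking the saved file for rows that have no caption. A minimal sketch using only the standard library and inline sample data; `rows_missing_caption` is a hypothetical helper, and in practice the text would come from reading `submission_private.csv`:

```python
import csv
import io


def rows_missing_caption(csv_text: str, column: str = "generated_caption") -> list[str]:
    """Return the query_id of every row whose caption column is absent or blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["query_id"] for row in reader if not (row.get(column) or "").strip()]


# Made-up sample data, for illustration only.
sample = "query_id,generated_caption\nq1,A caption\nq2,\nq3,   \n"
print(rows_missing_caption(sample))
```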