# Introduction

This notebook presents a functional prototype built with Gemma 3n, Google’s latest on-device, multimodal AI model. The goal: use compact, private, offline-ready AI to solve a real-world problem. Full demo, code, and technical details follow.

## What are the key features of Gemma 3n?

The key features of this new model from Google are:

1. On-Device Performance
Optimized for mobile and edge devices, Gemma 3n delivers real-time AI with minimal memory usage. The models with 5B and 8B raw parameters run with a memory footprint comparable to 2B and 4B models, thanks to innovations like Per-Layer Embeddings (PLE).

2. Mix’n’Match Model Scaling
A single model can act as multiple: the 4B version includes a 2B submodel, enabling dynamic tradeoffs between performance and efficiency. Developers can also create custom-sized submodels tailored to specific tasks.

3. Privacy-First and Offline-Ready
Gemma 3n runs entirely on-device, ensuring user data never leaves the device. This makes it ideal for privacy-sensitive applications and for use in low- or no-connectivity environments.

4. Multimodal Understanding
Supports text, image, audio, and enhanced video input, enabling powerful applications like voice interfaces, transcription, translation, visual recognition, and more—all locally.

5. Multilingual Proficiency
Strong performance across major global languages including Japanese, German, Korean, Spanish, and French, expanding access and inclusivity.



# Prepare the model

## Install prerequisites

In [1]:
!pip install timm --upgrade
!pip install accelerate
!pip install git+https://github.com/huggingface/transformers.git

Collecting timm
  Downloading timm-1.0.16-py3-none-any.whl.metadata (57 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch->timm)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch->timm)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch->timm)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch->timm)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch->timm)
  Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusp
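Since `transformers` is installed from the `main` branch here, the exact versions can differ between runs of this notebook. A small sketch, using only the standard library, for recording what was actually installed (the helper name `installed_versions` is illustrative, not part of the original notebook):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["timm", "accelerate", "transformers", "torch"]).items():
    print(f"{name}: {ver or 'not installed'}")
```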

## Import packages

In [2]:
from time import time
import kagglehub
import transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from transformers import AutoProcessor, AutoModelForImageTextToText

2025-07-03 14:27:01.609157: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751552821.866423      13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751552821.940765      13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Load the model

In [3]:
GEMMA_PATH = kagglehub.model_download("google/gemma-3n/transformers/gemma-3n-e2b-it")
processor = AutoProcessor.from_pretrained(GEMMA_PATH)
model = AutoModelForImageTextToText.from_pretrained(GEMMA_PATH, torch_dtype="auto", device_map="auto")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## Test the model with a simple prompt

In [4]:
prompt = """What is the France capital?"""
input_ids = processor(text=prompt, 
                      return_tensors="pt").to(model.device, 
                                              dtype=model.dtype)

outputs = model.generate(**input_ids, 
                         max_new_tokens=32, 
                         disable_compile=True)
text = processor.batch_decode(
    outputs,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)
print(text[0])

What is the France capital?

Paris.



Let's wrap this inside a function.

In [5]:
def query_model(prompt, max_new_tokens=32):
    start_time = time()
    input_ids = processor(text=prompt, 
                          return_tensors="pt").to(model.device, 
                                                  dtype=model.dtype)
    
    outputs = model.generate(**input_ids, 
                             max_new_tokens=max_new_tokens, 
                             disable_compile=True)
    text = processor.batch_decode(
        outputs,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )
    total_time = round(time() - start_time, 2)
    response = text[0].split(prompt)[-1]
    return response, total_time
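
`query_model` recovers the answer by splitting the decoded text on the prompt. That works as long as decoding reproduces the prompt character for character. A plain-Python sketch of a slightly more defensive variant (the helper `strip_prompt` is hypothetical and not part of the notebook):

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Return the text that follows the first occurrence of `prompt`.

    If the prompt is not found verbatim (for example because decoding
    normalized some whitespace), the decoded text is returned unchanged.
    """
    _head, sep, tail = decoded.partition(prompt)
    return tail.strip() if sep else decoded.strip()


print(strip_prompt("What is the France capital?\n\nParis.", "What is the France capital?"))
```

Unlike `split(prompt)[-1]`, this keeps the whole answer even when the answer itself repeats the prompt.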
    


In [6]:
prompt = "Quelle est la capitale de la France?"
response, total_time = query_model(prompt, max_new_tokens=16)
print(f"Execution time: {total_time}")
print(f"Question: {prompt}")
print(f"Response: {response}")

Execution time: 3.02
Question: Quelle est la capitale de la France?
Response: 

Paris.


# Test the model with history questions

In [7]:
prompt = "When started WW2?"
response, total_time = query_model(prompt, max_new_tokens=32)
print(f"Execution time: {total_time}")
print(f"Question: {prompt}")
print(f"Response: {response}")

Execution time: 5.8
Question: When started WW2?
Response: 
WW2 began in 1939.



The answer is correct, but I would like to keep it as short as possible. Let's refine the function a bit by adding a system prompt.

In [8]:
def query_model_v2(prompt, max_new_tokens=32):
    start_time = time()
    
    system_prompt = """
            You are a smart AI expert in answering questions.
            Just answer to the point, do not elaborate.
            For example, if you are asked to provide a year, a name, a location,
            return just the information, without any other words.
            """
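    # One dict per turn: the system prompt first, then the user question.
    # (Putting both roles in a single dict would keep only the last one.)
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": system_prompt}
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt}
            ]
        }
    ]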
    
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device, dtype=model.dtype)

    # retrieve input length
    input_len = inputs["input_ids"].shape[-1]
    
    outputs = model.generate(**inputs, 
                             max_new_tokens=max_new_tokens, 
                             disable_compile=True)
    text = processor.batch_decode(
        # use input length to filter only the response from the output
        outputs[:, input_len:],
        # skip special tokens
        skip_special_tokens=True,
        # cleanup tokenization spaces
        clean_up_tokenization_spaces=True
    )
    total_time = round(time() - start_time, 2)
    response = text[0]
    return response, total_time
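
Chat templates generally expect one dictionary per turn. A Python dict literal that repeats a key keeps only the last value, so a system turn and a user turn written into the same dict collapse into the user turn alone. A minimal sketch of building the list with one dict per role (the helper `build_messages` is an illustrative name, not a `transformers` API):

```python
def build_messages(system_prompt: str, user_prompt: str) -> list:
    """Build a two-turn chat: one dict for the system turn, one for the user turn."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [{"type": "text", "text": user_prompt}]},
    ]


# Repeating a key inside one dict literal silently keeps only the last value.
collapsed = {"role": "system", "role": "user"}
print(collapsed)

messages = build_messages("Answer briefly.", "What year started WW2?")
print([m["role"] for m in messages])
```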

In [9]:
prompt = "What year started WW2?"
response, total_time = query_model_v2(prompt, max_new_tokens=12)
print(f"Execution time: {total_time}")
print(f"Question: {prompt}")
print(f"Response: {response}")

Execution time: 7.24
Question: What year started WW2?
Response: World War II started in **1939**. 


## Colorize the output

In [10]:
from IPython.display import display, Markdown

def colorize_text(text):
    # Label -> color used for that label in the rendered Markdown
    label_colors = {"Reasoning": "blue", "Question": "red", "Response": "green",
                    "Explanation": "darkblue", "Execution time": "magenta"}
    for word, color in label_colors.items():
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text
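
Because `colorize_text` calls `str.replace` on the whole string, a label such as `Response:` that happens to appear inside the model's answer would be colorized as well. A sketch of an alternative that formats each field separately (the function `format_fields` and the `LABEL_COLORS` table are assumptions for illustration):

```python
LABEL_COLORS = {"Execution time": "magenta", "Question": "red", "Response": "green"}


def format_fields(fields: dict) -> str:
    """Render {label: value} pairs as Markdown, coloring only the labels."""
    parts = []
    for label, value in fields.items():
        color = LABEL_COLORS.get(label, "black")
        parts.append(f"**<font color='{color}'>{label}:</font>** {value}")
    return "\n\n".join(parts)


print(format_fields({"Question": "What does 'Response:' mean?", "Response": "It is a label."}))
```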

In [11]:
prompt = "Between what years was Obama president?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 11.62



**<font color='red'>Question:</font>** Between what years was Obama president?



**<font color='green'>Response:</font>** Barack Obama was president of the United States from **2009 to 2017**.


In [12]:
prompt = "Between what years was the 30 years war?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 15.29



**<font color='red'>Question:</font>** Between what years was the 30 years war?



**<font color='green'>Response:</font>** The Thirty Years' War was fought between **1618 and 1648**. 

While it's often considered to have started in

In [13]:
prompt = "Between what years was the WW1?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 10.89



**<font color='red'>Question:</font>** Between what years was the WW1?



**<font color='green'>Response:</font>** World War I took place between **1914 and 1918**. 


In [14]:
prompt = "What year was the Lepanto battle?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 8.95



**<font color='red'>Question:</font>** What year was the Lepanto battle?



**<font color='green'>Response:</font>** The Battle of Lepanto took place in **1571**.


In [15]:
prompt = "What happened in 1868 in Japan?"
response, total_time = query_model_v2(prompt, max_new_tokens=64)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 28.15



**<font color='red'>Question:</font>** What happened in 1868 in Japan?



**<font color='green'>Response:</font>** 1868 was a pivotal year in Japanese history, marking the end of the Edo period and the beginning of the Meiji Restoration. Here's a breakdown of the key events:

*   **The Boshin War Begins:** This was a civil war between the Tokugawa shogunate (the ruling military government

Let's modify the query function to stop generation once a maximum number of characters has been reached.

In [16]:
from transformers import StoppingCriteria, StoppingCriteriaList

class MaxCharLengthCriteria(StoppingCriteria):
    def __init__(self, tokenizer, max_chars, input_len):
        self.tokenizer = tokenizer
        self.max_chars = max_chars
        self.input_len = input_len

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the generated part
        gen_tokens = input_ids[:, self.input_len:]
        text = self.tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]
        return len(text) >= self.max_chars

def query_model_v3(prompt, max_chars=128, max_new_tokens=64):
    start_time = time()
    
    system_prompt = """
            You are a smart AI expert in answering questions.
            Just answer to the point, do not elaborate.
            For example, if you are asked to provide a year, a name, a location,
            return just the information, without any other words.
            """
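    # One dict per turn: the system prompt first, then the user question.
    # (Putting both roles in a single dict would keep only the last one.)
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": system_prompt}
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt}
            ]
        }
    ]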
    
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device, dtype=model.dtype)

    # retrieve input length
    input_len = inputs["input_ids"].shape[-1]
    
    stopping_criteria = StoppingCriteriaList([
        MaxCharLengthCriteria(processor, max_chars=max_chars, input_len=input_len)
    ])

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        stopping_criteria=stopping_criteria,
        disable_compile=True
    )

    text = processor.batch_decode(
        # use input length to filter only the response from the output
        outputs[:, input_len:],
        # skip special tokens
        skip_special_tokens=True,
        # cleanup tokenization spaces
        clean_up_tokenization_spaces=True
    )
    total_time = round(time() - start_time, 2)
    response = text[0]
    return response, total_time
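
The stopping criterion is evaluated after each generated token, so the response can overshoot `max_chars` by up to one token's worth of text. A plain-Python sketch of the same check, simulated over a hypothetical list of decoded token pieces (no model involved):

```python
def generate_until(pieces, max_chars):
    """Append pieces one at a time and stop once the text reaches max_chars."""
    text = ""
    for piece in pieces:
        text += piece
        # Same test as MaxCharLengthCriteria: evaluated after each new piece
        if len(text) >= max_chars:
            break
    return text


pieces = ["1868", " was", " a", " pivotal", " year", " in", " Japanese", " history"]
result = generate_until(pieces, max_chars=20)
print(repr(result), len(result))
```

Slicing the final string to `max_chars` afterwards would give a hard limit, at the cost of possibly cutting a word in half.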

In [17]:
prompt = "What happened in 1868 in Japan?"
response, total_time = query_model_v3(prompt, max_chars=128, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 14.45



**<font color='red'>Question:</font>** What happened in 1868 in Japan?



**<font color='green'>Response:</font>** 1868 was a pivotal year in Japanese history, marking the end of the Tokugawa Shogunate and the beginning of the Meiji Restoration

In [18]:
prompt = "Who was the first American president?"
response, total_time = query_model_v3(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 12.85



**<font color='red'>Question:</font>** Who was the first American president?



**<font color='green'>Response:</font>** The first American president was **George Washington**. He served from 1789 to 1797.





## Pop culture questions

In [19]:
prompt = "In what novel the number 42 is important?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 16.25



**<font color='red'>Question:</font>** In what novel the number 42 is important?



**<font color='green'>Response:</font>** The number 42 is famously important in **"The Hitchhiker's Guide to the Galaxy" by Douglas Adams.**

In the story, a

In [20]:
prompt = "Name the famous boyfriend of Yoko Ono."
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 15.79



**<font color='red'>Question:</font>** Name the famous boyfriend of Yoko Ono.



**<font color='green'>Response:</font>** The famous boyfriend of Yoko Ono was **John Lennon**. 

They were married from 1969 to 1970.


In [21]:
prompt = "Who was nicknamed 'The King' in music?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 15.57



**<font color='red'>Question:</font>** Who was nicknamed 'The King' in music?



**<font color='green'>Response:</font>** There are several musicians who have been nicknamed "The King" in music, but the most famous and widely recognized is **Elvis Presley**. 

However, here

In [22]:
prompt = "What actor played Sheldon in TBBT?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 11.76



**<font color='red'>Question:</font>** What actor played Sheldon in TBBT?



**<font color='green'>Response:</font>** Jim Parsons played Sheldon Cooper in The Big Bang Theory (TBBT). 


In [23]:
prompt = "What acctress from `The Friends` married Brad Pitt?"
response, total_time = query_model_v2(prompt, max_new_tokens=16)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 10.54



**<font color='red'>Question:</font>** What acctress from `The Friends` married Brad Pitt?



**<font color='green'>Response:</font>** This is a trick question! **No actress from "The Friends" married Brad

## Math questions

In [24]:
prompt = "34 + 21"
response, total_time = query_model_v2(prompt, max_new_tokens=16)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 7.51



**<font color='red'>Question:</font>** 34 + 21



**<font color='green'>Response:</font>** 34 + 21 = 55


In [25]:
prompt = "49 x 27"
response, total_time = query_model_v2(prompt, max_new_tokens=16)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 9.59



**<font color='red'>Question:</font>** 49 x 27



**<font color='green'>Response:</font>** 49 x 27 = 1323

Here's

In [26]:
prompt = "Brian and Sarah are brothers. Brian is 5yo, Sarah is 6 years older. How old is Sarah?"
response, total_time = query_model_v2(prompt, max_new_tokens=32)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 19.16



**<font color='red'>Question:</font>** Brian and Sarah are brothers. Brian is 5yo, Sarah is 6 years older. How old is Sarah?



**<font color='green'>Response:</font>** Sarah is 6 years old.

Since Sarah is 6 years older than Brian, and Brian is 5, Sarah is 5 + 6 =

In [27]:
prompt = "x + 2 y = 5; y - x = 1. What are x and y? Just return x and y."
response, total_time = query_model_v2(prompt, max_new_tokens=64)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 11.08



**<font color='red'>Question:</font>** x + 2 y = 5; y - x = 1. What are x and y? Just return x and y.



**<font color='green'>Response:</font>** x = 1
y = 2

In [28]:
prompt = "What is the total area of a sphere or radius 3? Just return the result."
response, total_time = query_model_v2(prompt, max_new_tokens=64)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 12.9



**<font color='red'>Question:</font>** What is the total area of a sphere or radius 3? Just return the result.



**<font color='green'>Response:</font>** 113.09733552923255


In [29]:
prompt = "A rectangle with diagonal 4 is circumscribed by a circle. What is the circle's area?"
response, total_time = query_model_v2(prompt, max_new_tokens=200)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 65.25



**<font color='red'>Question:</font>** A rectangle with diagonal 4 is circumscribed by a circle. What is the circle's area?



**<font color='green'>Response:</font>** Let the rectangle have length $l$ and width $w$.
Since the rectangle is circumscribed by a circle, the diagonal of the rectangle is equal to the diameter of the circle.
We are given that the diagonal of the rectangle is 4, so the diameter of the circle is 4.
Therefore, the radius of the circle is $r = \frac{4}{2} = 2$.
The area of the circle is given by the formula $A = \pi r^2$.
Substituting $r=2$, we have $A = \pi (2^2) = 4\pi$.

Final Answer: The final answer is $\boxed{4\pi}$
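
The math answers above are easy to cross-check. A short sketch using only the standard library (the variable names are mine, and the linear system is solved by hand-substitution rather than with a solver):

```python
import math

# Arithmetic prompts
assert 34 + 21 == 55
assert 49 * 27 == 1323

# Word problem: Sarah is 6 years older than 5-year-old Brian
sarah_age = 5 + 6

# Linear system: x + 2y = 5 and y - x = 1  =>  y = x + 1, so 3x + 2 = 5
x = (5 - 2) / 3
y = x + 1

# Surface area of a sphere of radius 3: 4 * pi * r^2
sphere_area = 4 * math.pi * 3**2

# Circle around a rectangle with diagonal 4: the diameter equals the diagonal
circle_area = math.pi * (4 / 2) ** 2

print(sarah_age, x, y, sphere_area, circle_area)
```

The check gives 11 for Sarah's age, so the opening line of that response ("Sarah is 6 years old") is wrong, even though the follow-up reasoning starts from the right sum.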

## Multiple languages

In [30]:
#Romanian
prompt = "Cine este Mircea Cartarescu?"
response, total_time = query_model_v2(prompt, max_new_tokens=128)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 55.47



**<font color='red'>Question:</font>** Cine este Mircea Cartarescu?



**<font color='green'>Response:</font>** Mircea Cartarescu este unul dintre cei mai importanți și controversați scriitori români contemporani. Este un autor de ficțiune complexă, care explorează teme precum timpul, memoria, spațiul, identitatea și condiția umană într-o manieră poetică și neconvențională. 

Iată câteva aspecte cheie despre Mircea Cartarescu:

**Cariera și Stil:**

*   **Scriitor:** Este cunoscut pentru romanele sale ambițioase, stilul său unic și utilizarea metaforelor complexe.
*   **Poet

In [31]:
#Albanian
prompt = "Kush ishte Ismail Kadare?"
response, total_time = query_model_v2(prompt, max_new_tokens=128)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 55.34



**<font color='red'>Question:</font>** Kush ishte Ismail Kadare?



**<font color='green'>Response:</font>** Ismail Kadare ishte një shkrija shqiptar i njohur ndërmjetësor i literatura. Ai vildi për ndërrimet politike në Shqipëri, si në mënyrë direkte edhe me alegori dhe simbolizëm.

Këtu janë disa pika të rëndësishme për të shkruar për Ismail Kadare:

*   **Përancë:** Ai është një nga autorët më të vlerësuar shqiptarë në botë, i njohur me stilin e tij të unikut

In [32]:
#Japanese
prompt = "夏目漱石とは誰ですか?"
response, total_time = query_model_v2(prompt, max_new_tokens=128)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 55.35



**<font color='red'>Question:</font>** 夏目漱石とは誰ですか?



**<font color='green'>Response:</font>** 夏目漱石（なつめ そうせき、1867年7月9日 – 1916年4月29日）は、日本の近代文学を代表する作家です。明治時代から昭和時代初期にかけて活躍し、小説、評論、随筆など幅広い分野で才能を発揮しました。

**彼の代表的な特徴や業績は以下の通りです。**

*   **近代日本文学の確立:** 西洋文学の影響を受けながら、日本の社会や国民性、人間の心理を深く掘り下げた作品を数多く残し、近代日本文学の確立

In [33]:
#Chinese
prompt = "马拉多纳是谁?"
response, total_time = query_model_v2(prompt, max_new_tokens=128)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 54.49



**<font color='red'>Question:</font>** 马拉多纳是谁?



**<font color='green'>Response:</font>** 迭戈·马拉多纳（Diego Armando Maradona，1960年10月30日 – 2020年11月25日）是一位阿根廷足球运动员，被广泛认为是足球史上最伟大的球员之一。

以下是关于马拉多纳的一些关键信息：

* **职业生涯：** 他在阿根廷的俱乐部博卡青年开始职业生涯，后来在欧洲的俱乐部踢球，包括意大利的纳多尔和拿破仑。他以其出色的技术、盘带、组织能力和进球能力而闻名。
* **

In [34]:
#French
prompt = "Qui était Marguerite Yourcenar?"
response, total_time = query_model_v2(prompt, max_new_tokens=128)
display(Markdown(colorize_text(f"Execution time: {total_time}\n\nQuestion: {prompt}\n\nResponse: {response}")))



**<font color='magenta'>Execution time:</font>** 55.12



**<font color='red'>Question:</font>** Qui était Marguerite Yourcenar?



**<font color='green'>Response:</font>** Marguerite Yourcenar (1900-1983) était une écrivaine française majeure, reconnue pour ses romans historiques exceptionnels et son style d'écriture raffiné et intellectuel. Elle est souvent considérée comme l'une des plus grandes écrivaines françaises du XXe siècle. Voici un résumé de sa vie et de son œuvre :

**Sa vie:**

* **Origines et éducation:** Née à Nancy en 1900, elle bénéficia d'une éducation brillante et étudiée.  Elle était passionnée par l'histoire et les langues anciennes, notamment le grec

# Conclusions


Preliminary conclusion, after testing the model with:
* History questions
* Pop culture
* Math (arithmetic, algebra, geometry)
* Multiple languages

The model performs reasonably well on easy and medium-level questions.

**Good points**:
- When prompted to answer to the point, the model tends to behave well.
- Math is mostly accurate: the arithmetic, algebra, and geometry answers were correct, although the response to the age word problem opened with a wrong answer (6 instead of 11).
- Language capability is extensive.

**Areas to improve**:
- Modify the output to stop at the end of a sentence.
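
A possible way to stop at a sentence boundary is to post-process the decoded text. The sketch below assumes that a sentence ends with `.`, `!`, `?` or the CJK full stop `。`; the helper `trim_to_last_sentence` is hypothetical and was not run against the model:

```python
import re

# Sentence-ending punctuation, optionally followed by closing brackets
# or Markdown bold markers such as "**"
_SENTENCE_END = re.compile(r"[.!?。][*)\]]*")


def trim_to_last_sentence(text: str) -> str:
    """Cut `text` after its last sentence-ending mark; leave it as-is if none is found."""
    last_end = None
    for match in _SENTENCE_END.finditer(text):
        last_end = match.end()
    return text[:last_end].rstrip() if last_end else text.rstrip()


sample = "World War I took place between **1914 and 1918**. \n\nHowever, here"
print(trim_to_last_sentence(sample))
```

This is deliberately simple: a decimal point or an abbreviation also counts as a sentence end, so numeric answers would need a stricter rule.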