# Part 3: Creating a Basic Image Agent

![Alt text](img/augLLMs.png)

In this tutorial we'll build a simplified image classifier/agent with Gemma 3.

There are two parts:

* **Multimodal Gemma Classifier** - Using a Gemma model to detect what's in the image and provide a specific output.
* **Downstream Action** - A simple function that processes the classifier's result and acts on it, such as sending an email or anything else!
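
The two parts fit together as a small pipeline. Here is a minimal sketch of that shape, assuming stand-in functions so it runs without a model: `run_agent`, `fake_classifier`, and `describe_action` are hypothetical names used only for illustration, and the real classifier built below calls Gemma instead of inspecting the file name.

```python
from typing import Callable


def run_agent(image_path: str,
              classify: Callable[[str], str],
              act: Callable[[str, str], str]) -> str:
    """Classify an image, then hand the label to a downstream action."""
    label = classify(image_path)
    return act(image_path, label)


# Hypothetical stand-in for the model: it only looks at the file name.
def fake_classifier(image_path: str) -> str:
    return "yes" if "hot_dog" in image_path.lower() else "no"


# Hypothetical downstream action: it only builds a description string.
def describe_action(image_path: str, label: str) -> str:
    verdict = "hot dog" if label == "yes" else "not hot dog"
    return f"{image_path}: {verdict}"


print(run_agent("img/Hot_dog_with_mustard.png", fake_classifier, describe_action))
# img/Hot_dog_with_mustard.png: hot dog
```

Keeping the classifier and the action as separate callables means either one can be swapped out without touching the other.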

## Putting it all together

Now that we have an image model ready, let's set up a simple function to interact with the model and its outputs. We'll define `image_chat(prompt, img_path)`, an image-aware version of the `model_call(prompt)` function, which sends the user's prompt and an image to the LLM and returns the response.

In [13]:
from ollama import chat
from IPython.display import Markdown, display

image_path = "img/NotHotDog.jpg"  # Replace with the actual path to your image file


def image_chat(prompt, img_path, model="gemma3:27b-it-qat"):
    response = chat(
        model = model, 
        messages=[
            {
                'role': 'user',
                'content': prompt,
                'images': [img_path]
            }
        ]
    )
    return response["message"]["content"]

prompt = "What is this an image of?"
output = image_chat(prompt, image_path)

display(Markdown(output))

This image shows a **dachshund dog dressed in a hot dog costume**. 

The costume is designed to look like a full hot dog with a bun, ketchup, and mustard on top of the dog's back. It's a playful and humorous outfit!

## Hot Dog or Not Hot Dog Classifier

In [14]:
image_path = "img/NotHotDog.jpg"  # Replace with the actual path to your image file

def image_chat(prompt, img_path, model="gemma3:27b-it-qat"):
    response = chat(
        model = model, 
        messages=[
            {
                'role': 'user',
                'content': prompt,
                'images': [img_path]
            }
        ]
    )
    return response["message"]["content"]

prompt = 'Is this an image of the food item hot dog say yes, otherwise say no, no other output'
image_chat(prompt, image_path) 

'no'

In [15]:
image_path = "img/NotHotDog.jpg"  # Replace with the actual path to your image file

def image_chat(prompt, img_path):
    """Classify an image as hot dog or not. The prompt is hard-coded, so the `prompt` argument is ignored."""
    response = chat(
        model="gemma3:27b-it-qat",  # Any vision-capable Ollama model should work here
        messages=[
            {
                'role': 'user',
                'content': 'Is this an image of the food item hot dog say yes, otherwise say no, no other output',
                'images': [img_path]
            }
        ]
    )
    return response["message"]["content"]


image_chat(None, image_path)

'no'

## Parsing the response
With our prompt complete, we can turn this into a simple classifier. From here you can plug in any Python logic you like, whether it's sending an email or anything else.
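
Model replies are not always the bare word we asked for; they may carry extra whitespace, capitalization, or punctuation. A minimal sketch of a tolerant parser follows. `parse_yes_no` is a hypothetical helper, not part of Ollama, and looking only at the first word of the reply is a simplifying assumption.

```python
import re
from typing import Optional


def parse_yes_no(raw: str) -> Optional[bool]:
    """Map a free-text reply to True (yes), False (no), or None (unclear)."""
    # Keep only alphabetic words so "Yes.", " yes\n" and "**no**" all parse.
    words = re.findall(r"[a-z]+", raw.lower())
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


print(parse_yes_no(" Yes.\n"))  # True
print(parse_yes_no("no"))       # False
print(parse_yes_no("A dog"))    # None
```

Returning `None` for anything unexpected lets the calling code decide whether to retry, log the reply, or fall back to a default.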

In [16]:
image_path = "img/Hot_dog_with_mustard.png"  # Replace with the actual path to your image file

image_chat(None, image_path) 

'yes'

In [60]:
def call_a_tool(img_path):
    """Return a placeholder action string; each branch could trigger anything else."""
    response = image_chat(prompt, img_path)

    if response.strip().lower() == 'yes':
        return "Give treat"
    return "Add ketchup"

call_a_tool(image_path)

'Give treat'
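
With only two labels an `if` statement is enough, but a lookup table scales better as labels are added. A minimal sketch under that assumption: `ACTIONS`, `dispatch`, `give_treat`, and `add_ketchup` are hypothetical names, and the actions only build strings.

```python
from typing import Callable, Dict


def give_treat(img_path: str) -> str:
    return f"Give treat ({img_path})"


def add_ketchup(img_path: str) -> str:
    return f"Add ketchup ({img_path})"


# Map each expected model label to the action it should trigger.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "yes": give_treat,
    "no": add_ketchup,
}


def dispatch(label: str, img_path: str) -> str:
    """Run the action registered for a label, or report that none exists."""
    action = ACTIONS.get(label.strip().lower())
    if action is None:
        return f"No action for label {label!r}"
    return action(img_path)


print(dispatch("yes", "img/Hot_dog_with_mustard.png"))
# Give treat (img/Hot_dog_with_mustard.png)
```

Unknown labels are reported rather than silently falling through to a default branch.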

In [59]:
import os

from PIL import Image, ImageDraw, ImageFont

def add_not_hotdog_text(input_image_path: str, output_image_path: str, text: str = "NOT HOTDOG", text_scale_factor: float = .5):
    """
    Adds the specified text to an image and saves it to a new file.

    Args:
        input_image_path (str): The path to the input image file.
        output_image_path (str): The path where the modified image will be saved.
        text (str): The text to be written on the image (defaults to "NOT HOTDOG").
        text_scale_factor (float): A multiplier to adjust the font size. Higher values mean larger text.
                                   Defaults to 0.5.
    """
    try:
        # Open the image and ensure it's in RGBA mode for transparency, especially for text
        img = Image.open(input_image_path).convert("RGBA")
        draw = ImageDraw.Draw(img)

        # Define font and text color
        font = None
        # Significantly increase the base font size relative to image width
        # Aim for the text to take up about 1/3 to 1/2 of the image width
        # Adjusting the divisor to make text proportionally larger
        base_font_size = img.width / 3.5 # Making it a larger fraction of image width
        font_size = int(base_font_size * text_scale_factor)
        
        # Ensure a very high minimum size, regardless of image dimensions or scale factor,
        # to guarantee visibility even if image is very small or scaling is minimal.
        font_size = max(font_size, 200) # Set an even higher absolute minimum font size
        
        print(f"Attempting to use font size: {font_size}")

        # Prefer a local Roboto font file, then common system TrueType fonts.
        # Fall back to Pillow's default bitmap font if none can be loaded.
        try:
            local_font_name = "Roboto-ExtraBold.ttf"
            current_dir = os.getcwd()
            local_font_path = os.path.join(current_dir, local_font_name)

            if os.path.exists(local_font_path):
                font = ImageFont.truetype(local_font_path, font_size)
                print(f"Using local font: {local_font_path}")

            # Common system font names/paths, tried in order
            font_candidates = [
                ("arial.ttf", None), # Windows common
                ("Arial.ttf", "/Library/Fonts/"), # macOS common
                ("DejaVuSans.ttf", "/usr/share/fonts/truetype/dejavu/"), # Linux common
                ("LiberationSans-Regular.ttf", "/usr/share/fonts/truetype/liberation/"), # Another Linux common
                ("sans-serif", None) # Generic (Pillow might find a system default)
            ]

            for font_name, font_dir in font_candidates:
                if font is not None:
                    break # Keep the first font that loaded, so the local font wins
                try:
                    if font_dir:
                        font_path_attempt = os.path.join(font_dir, font_name)
                        if not os.path.exists(font_path_attempt):
                            continue
                    else:
                        # Pillow may search system font directories for bare names
                        font_path_attempt = font_name

                    font = ImageFont.truetype(font_path_attempt, font_size)
                    print(f"Using font: {font_path_attempt}")
                except Exception:
                    continue # Try next font path if current one fails to load
            
            if font is None:
                # If no TrueType font found after all attempts
                font = ImageFont.load_default()
                print(f"Warning: No TrueType font found at common paths. Using default bitmap font. "
                      f"Text will appear small and will NOT scale with font_size ({font_size}). "
                      "Consider installing 'arial.ttf' or 'dejavusans.ttf' on your system for proper scaling.")
        except Exception as e:
            font = ImageFont.load_default()
            print(f"Critical error during TrueType font loading ({e}). Using default bitmap font. "
                  f"Text will appear small and will NOT scale with font_size ({font_size}).")


        # Text color: fully opaque black
        text_color = (0, 0, 0, 255)  # Opaque black (RGBA)


        # Calculate text bounding box for positioning
        try:
            # bbox = draw.textbbox((0, 0), text, font=font) is preferred for Pillow 8.0+
            # It calculates (left, top, right, bottom) including ascenders/descenders
            bbox = draw.textbbox((0, 0), text, font=font)
            text_width = bbox[2] - bbox[0]
            text_height = bbox[3] - bbox[1]
        except AttributeError:
            # Fallback for older Pillow versions that only have textsize (deprecated)
            text_width, text_height = draw.textsize(text, font=font)
            print("Using deprecated draw.textsize. Consider updating Pillow to 8.0.0+ for draw.textbbox for more accurate bounding box calculation.")

        print(f"Calculated text dimensions: width={text_width}, height={text_height}")

        # Calculate position to center the text
        x = (img.width - text_width) / 2
        y = (img.height - text_height) / 2

        # Ensure x and y are not negative and are within image bounds
        x = max(0, min(x, img.width - text_width))
        y = max(0, min(y, img.height - text_height))

        print(f"Text position: x={x}, y={y}")

        # Draw the text on the image
        draw.text((x, y), text, font=font, fill=text_color)
        print(f"Text '{text}' drawn on image.")

        # Get the file extension from the output path
        file_extension = os.path.splitext(output_image_path)[1].lower()

        # If the output format is JPEG, convert to RGB mode before saving (JPEG does not support RGBA)
        if file_extension in ['.jpg', '.jpeg']:
            # Create a new RGB image (e.g., with a white background)
            rgb_img = Image.new("RGB", img.size, (255, 255, 255))
            # Paste the original RGBA image onto the new RGB image using its alpha channel as a mask.
            # This correctly handles transparency by compositing the image onto the white background.
            rgb_img.paste(img, (0, 0), img)
            rgb_img.save(output_image_path)
            print(f"Successfully saved image with text to {output_image_path} (converted to RGB for JPEG).")
        else:
            # For PNG or other formats that support transparency, save as is (RGBA)
            img.save(output_image_path)
            print(f"Successfully saved image with text to {output_image_path}.")

    except FileNotFoundError:
        print(f"Error: Input image file not found at {input_image_path}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}. Please ensure Pillow is installed and the image path is correct.")

add_not_hotdog_text(image_path, "test_tool_call_output.jpg")

Attempting to use font size: 366
Using local font: /Users/canyon/repos/ai_image_agent/notebooks/Roboto-ExtraBold.ttf
Calculated text dimensions: width=2310, height=268
Text position: x=126.0, y=491.5
Text 'NOT HOTDOG' drawn on image.
Successfully saved image with text to test_tool_call_output.jpg (converted to RGB for JPEG).
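
The logged numbers can be checked by hand. The sketch below mirrors the sizing and centering arithmetic from `add_not_hotdog_text`, assuming the image is 2562×1251 pixels (a size inferred from the log above, not stated anywhere in the notebook). `font_size_for` and `centered_origin` are hypothetical helper names.

```python
from typing import Tuple


def font_size_for(image_width: int, scale: float = 0.5, minimum: int = 200) -> int:
    """Font size as a fraction of image width, with a lower bound."""
    return max(int(image_width / 3.5 * scale), minimum)


def centered_origin(image_size: Tuple[int, int],
                    text_size: Tuple[int, int]) -> Tuple[float, float]:
    """Top-left corner that centers the text, clamped so it is never negative."""
    (img_w, img_h), (text_w, text_h) = image_size, text_size
    x = max(0, min((img_w - text_w) / 2, img_w - text_w))
    y = max(0, min((img_h - text_h) / 2, img_h - text_h))
    return x, y


print(font_size_for(2562))                          # 366
print(centered_origin((2562, 1251), (2310, 268)))   # (126.0, 491.5)
```

Note that the 200-pixel minimum dominates on small images: any image narrower than 1400 pixels gets the same font size at the default scale, so the text can end up wider than the image.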


In [62]:
def call_a_tool(img_path):
    """Stamp the image with a label that matches the classifier's answer."""
    response = image_chat(prompt, img_path)
    if response.strip().lower() == 'yes':
        add_not_hotdog_text(img_path, "hot_dog.jpg", text="HOTDOG")
        return
    add_not_hotdog_text(img_path, "not_hot_dog.jpg", text="Not HOTDOG")

call_a_tool(image_path)


## Upgrading with Gemini 
If you need a more powerful model, it's only one API call away. The model no longer runs locally, but in return you get access to frontier capabilities.

In [11]:
from google import genai
from google.genai import types

with open(image_path, 'rb') as f:
    image_bytes = f.read()

def generate():
    client = genai.Client(
        api_key="INSERT_API_KEY",
    )

    # Pick any model from AI Studio
    model = "gemini-2.0-flash"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_bytes(
                    data=image_bytes,
                    mime_type='image/jpeg',  # Change this if the image is not a JPEG
                ),
                types.Part.from_text(text="""What is this?"""),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="text/plain",
    )

    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")


generate()

This is a dachshund dog wearing a hot dog costume.
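
Two details in the cell above are easy to trip over: the MIME type is hard-coded to JPEG, and the API key sits in the source. A minimal sketch of one way to handle both with the standard library follows. `guess_image_mime_type` and `api_key_from_env` are hypothetical helpers, and the environment variable name `GEMINI_API_KEY` is an assumption you can change.

```python
import mimetypes
import os


def guess_image_mime_type(path: str) -> str:
    """Infer an image MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    return mime_type


def api_key_from_env(var_name: str = "GEMINI_API_KEY") -> str:
    """Read the API key from an environment variable instead of the source."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


print(guess_image_mime_type("img/NotHotDog.jpg"))             # image/jpeg
print(guess_image_mime_type("img/Hot_dog_with_mustard.png"))  # image/png
```

The results could then be passed as `mime_type=` to `types.Part.from_bytes` and `api_key=` to `genai.Client` in place of the literals.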

## 🎯 Recap: What We Learned 

In this section, we built our first basic image classification agent: it tells a hot dog apart from anything else and acts on the result.

Here are the key ideas to remember:
- **Not everything needs to be chat**: Models can be prompted to return parseable outputs quite easily, with no architecture changes needed.
- **The model can be an intermediate part of a system**: The model doesn't always need to be front and center in every application.
- **From there you can do anything**: We only returned strings and stamped text onto images, but with Python (or any other language) the system can act agentically and do anything you can code.