# Object detection and segmentation with the Gemini API


**References:**
* [Conversational image segmentation with Gemini 2.5](https://developers.googleblog.com/en/conversational-image-segmentation-gemini-2-5/)
* [Use Gemini 2.5 for Zero-Shot Object Detection & Segmentation](https://blog.roboflow.com/gemini-2-5-object-detection-segmentation/)

---

## Setup

### Software Development Kit (SDK)

This section installs and imports the Google Gen AI SDK (`google-genai`), a library for interacting with Google's Gemini models.  
The setup enables tasks such as text generation, image analysis, and segmentation.

In [None]:
# Install
#!pip install google-genai supervision

# Initialize
from google import genai
from google.genai import types

### Application Programming Interface (API) key and SDK client

The API key is loaded from an environment file (.env) to keep credentials secure.  
The client object authenticates requests and enables communication with Gemini models.

In [None]:
# Packages
import os
from dotenv import load_dotenv

# Load API key from environment file
load_dotenv("gemini_api_key.env")
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
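
If the environment file is missing or misnamed, `os.getenv` returns `None` and the failure only surfaces at the first API call. A minimal sketch of an early check is shown below; the helper name `require_env` is hypothetical and not part of the SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: it only wraps os.getenv with an explicit check.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file."
        )
    return value
```

After `load_dotenv(...)` has run, the client could then be created with `genai.Client(api_key=require_env("GOOGLE_API_KEY"))`.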

### Model

Spatial understanding is available from the **Gemini 2.0 Flash** model onward.  
**Gemini 2.5 models** (like `gemini-2.5-pro`) generally perform better because they are more capable "thinking models", though they are slightly slower.  
Some advanced features, such as **image segmentation**, are only supported by **Gemini 2.5 models**.

**Available model options:**
- `gemini-2.0-flash`
- `gemini-2.5-flash-lite`
- `gemini-2.5-flash-lite-preview-09-2025`
- `gemini-2.5-flash`
- `gemini-2.5-flash-preview-09-2025`
- `gemini-2.5-pro`

**Temperature (creativity control):**
- Controls randomness/creativity of model output.  
- `0.0` → very deterministic, consistent output (best for precise tasks).  
- `0.5` → moderately creative (good balance for tasks like bounding boxes).  
- `1.0` → more creative, potentially less consistent output.

In [None]:
# Models
model_gemini_2_0 = "gemini-2.0-flash" 
model_gemini_2_5 = "gemini-2.5-flash"

# Temperature
temperature = 0.5

### Safety settings

Safety settings set a blocking threshold for each harm category. Here, content in the dangerous-content category is blocked only when the probability of harm is rated high.

In [None]:
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

### Other packages

This cell imports the additional packages required for image processing and data handling.  
- **Pillow (PIL)** is used to open and manipulate images.  
- **io** and **requests** modules help handle image data streams and HTTP requests.  
- **Supervision (sv)** is used for utilities like plotting bounding boxes and working with detection results.

In [None]:
# Packages

#!pip install Pillow
from PIL import Image

from pathlib import Path

import io
from io import BytesIO

import requests

import supervision as sv

import json
import base64

### Error handling and retry parameters

In [None]:
# Packages
import time
from google.genai import errors as genai_errors

# Define retry configuration
max_retries = 10
wait_time = 1
current_retry = 0
response = None
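
The retry parameters above describe an exponential backoff: wait, retry, and double the wait after each failure. A minimal, self-contained sketch of that pattern follows. The function name `call_with_backoff` and its parameters are assumptions for illustration and are not part of the SDK; the `sleep` argument exists only so the wait can be replaced in tests.

```python
import time


def call_with_backoff(func, max_retries=10, initial_wait=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Hypothetical helper: re-raises the last error once max_retries
    attempts have been made.
    """
    wait = initial_wait
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt}/{max_retries} failed ({exc}); "
                  f"retrying in {wait} s...")
            sleep(wait)
            wait *= 2  # exponential backoff
```

With the names used in this notebook, a call might be wrapped as `call_with_backoff(lambda: client.models.generate_content(...), retry_on=(genai_errors.ClientError,))`.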

## Object detection

### Street View images

In [None]:
# Root folder
root_path = Path(r"C:\Users\amand\Amanda\GitHub\virtual_audit_ai\street_view_images")

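# Get all images recursively (.jpg, .jpeg, .png; case-insensitive).
# The glob pattern "*.[jp][pn]g" does not match ".jpeg", so filter by suffix instead.
image_suffixes = {".jpg", ".jpeg", ".png"}
all_images = sorted(p for p in root_path.rglob("*") if p.suffix.lower() in image_suffixes)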

print(f"Total images found: {len(all_images)}")

In [None]:
# -----------------------------
# Prompt for Street View object detection
# -----------------------------
prompt = (
    "You are performing object detection on a Street View image. "
    "Identify all visible objects in the scene (e.g., cars, pedestrians, traffic signs, street lights, trees, buildings, sidewalks, bicycles, etc.). "
    "Output a JSON list of bounding boxes. Each entry should contain:\n"
    " - 'box_2d': the coordinates of the 2D bounding box\n"
    " - 'label': a descriptive text label of the object\n"
    "Use clear, descriptive labels for each object."
)

# -----------------------------
# Output folder
# -----------------------------
output_folder = Path(r"C:\Users\amand\Amanda\GitHub\virtual_audit_ai\street_view_outputs")
output_folder.mkdir(parents=True, exist_ok=True)

# -----------------------------
# Process all images
# -----------------------------
for image_path in all_images:
    print(f"\nProcessing image: {image_path}")

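    # Load, convert to RGB (JPEG output cannot store an alpha channel), and
    # resize to a width of 1024 px while preserving the aspect ratio
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    target_height = int(1024 * height / width)
    resized_image = image.resize((1024, target_height), Image.Resampling.LANCZOS)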

    # Reset retry variables
    current_retry = 0
    wait_time = 1
    detections_json = None

    # -----------------------------
    # Retry loop for API call with JSON validation
    # -----------------------------
    while current_retry < max_retries:
        try:
            print(f"Attempt {current_retry + 1}/{max_retries} to call the API...")

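            # thinking_config applies to Gemini 2.5 thinking models, so the 2.5 model
            # is used here; this also matches the sv.VLM.GOOGLE_GEMINI_2_5 parser below
            response = client.models.generate_content(
                model=model_gemini_2_5,
                contents=[resized_image, prompt],
                config=types.GenerateContentConfig(
                    temperature=temperature,
                    safety_settings=safety_settings,
                    thinking_config=types.ThinkingConfig(thinking_budget=0)
                )
            )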

            # Extract the JSON list from the response text (it may be wrapped in a
            # code fence); response.text can be None, e.g. when the reply is blocked
            text = response.text or ""
            json_start = text.find("[")
            json_end = text.rfind("]") + 1
            try:
                detections_json = json.loads(text[json_start:json_end])
                print(f"Valid JSON detected, proceeding with {len(detections_json)} objects.")
                break  # JSON valid -> exit retry loop
            except json.JSONDecodeError:
                current_retry += 1
                print(f"Invalid JSON in response. Retrying {current_retry}/{max_retries}...")
                time.sleep(wait_time)
                wait_time *= 2

        except genai_errors.APIError as e:
            # APIError covers both client (4xx, e.g. rate limit) and server (5xx) errors
            current_retry += 1
            print(f"API error, retrying {current_retry}/{max_retries}... Error: {e}")
            time.sleep(wait_time)
            wait_time *= 2

        except Exception as e:
            print(f"Unexpected error: {e}")
            break

    # Skip image if no valid JSON after retries
    if not detections_json:
        print(f"Failed to get valid detections for {image_path.name}, skipping...")
        continue

    # -----------------------------
    # Convert to Supervision Detections
    # -----------------------------
    resolution_wh = resized_image.size
    detections = sv.Detections.from_vlm(
        vlm=sv.VLM.GOOGLE_GEMINI_2_5,
        result=json.dumps(detections_json),
        resolution_wh=resolution_wh
    )

    # -----------------------------
    # Annotate image
    # -----------------------------
    thickness = sv.calculate_optimal_line_thickness(resolution_wh)
    text_scale = sv.calculate_optimal_text_scale(resolution_wh)

    box_annotator = sv.BoxAnnotator(thickness=thickness)
    label_annotator = sv.LabelAnnotator(
        smart_position=True,
        text_color=sv.Color.BLACK,
        text_scale=text_scale,
        text_position=sv.Position.CENTER
    )

    annotated = resized_image
    for annotator in (box_annotator, label_annotator):
        annotated = annotator.annotate(scene=annotated, detections=detections)

    # -----------------------------
    # Save annotated image
    # -----------------------------
    output_path = output_folder / f"{image_path.stem}_annotated.jpg"
    annotated.save(output_path)
    print(f"Saved annotated image: {output_path}")
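
The loop above relies on `sv.Detections.from_vlm` to interpret the model output. To make that step less opaque, here is a minimal sketch of the two operations involved: pulling the JSON list out of the response text, and converting a `box_2d` entry to pixel coordinates. It assumes the format described in the Gemini documentation, where `box_2d` is `[y_min, x_min, y_max, x_max]` normalized to a 0-1000 range. Both function names are hypothetical, and this is not how Supervision implements the conversion internally.

```python
import json


def extract_json_list(text: str) -> list:
    """Parse the outermost JSON list found in a model response.

    Tolerates surrounding text such as a Markdown code fence.
    """
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("No JSON list found in response text")
    return json.loads(text[start:end + 1])


def box_2d_to_pixels(box_2d, width, height):
    """Convert [y_min, x_min, y_max, x_max] on a 0-1000 scale
    to pixel coordinates (x_min, y_min, x_max, y_max)."""
    y_min, x_min, y_max, x_max = box_2d
    return (
        round(x_min * width / 1000),
        round(y_min * height / 1000),
        round(x_max * width / 1000),
        round(y_max * height / 1000),
    )


# Illustrative response text (made up for this example)
sample = '```json\n[{"box_2d": [100, 200, 500, 600], "label": "car"}]\n```'
for item in extract_json_list(sample):
    print(item["label"], box_2d_to_pixels(item["box_2d"], width=1000, height=500))
# prints: car (200, 50, 600, 250)
```

Note that the y coordinate comes first in `box_2d`, which is easy to get wrong when drawing boxes by hand.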
