# Object detection and segmentation with the Gemini API


**References:**
* [Conversational image segmentation with Gemini 2.5](https://developers.googleblog.com/en/conversational-image-segmentation-gemini-2-5/)
* [Use Gemini 2.5 for Zero-Shot Object Detection & Segmentation](https://blog.roboflow.com/gemini-2-5-object-detection-segmentation/)

---

## Setup

### Software Development Kit (SDK)

This cell installs and initializes the Google Gen AI SDK (`google-genai`), a collection of tools and libraries for interacting with Google's Gemini models.  
The setup enables tasks such as text generation, image analysis, and segmentation.

In [1]:
# Install
#!pip install google-genai supervision

# Initialize
from google import genai
from google.genai import types
from google.genai import errors as genai_errors

### Application Programming Interface (API) key and SDK client

The API key is loaded from an environment file (.env) to keep credentials secure.  
The client object authenticates requests and enables communication with Gemini models.

In [2]:
# Packages
import os
from dotenv import load_dotenv

# Load API key from environment file
load_dotenv("gemini_api_key.env")
client = genai.Client(api_key = os.getenv("GOOGLE_API_KEY"))
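
`os.getenv` returns `None` when a variable is missing, so a misnamed or unreadable `.env` file only surfaces later as a failed API call. A minimal sketch of an earlier check follows; `require_env` is a hypothetical helper, not part of the SDK or of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Check that the .env file was found and loaded."
        )
    return value
```

With this helper, the client line above could read `genai.Client(api_key=require_env("GOOGLE_API_KEY"))`.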

### Model

Spatial understanding works well with the **Gemini 2.0 Flash** model.  
**Gemini 2.5 models** (such as `gemini-2.5-pro`) perform better still: they are more capable "thinking models", though slightly slower.  
Some advanced features, such as **image segmentation**, are supported only by **Gemini 2.5 models**.

**Available model options:**
- `gemini-2.0-flash`
- `gemini-2.5-flash-lite`
- `gemini-2.5-flash-lite-preview-09-2025`
- `gemini-2.5-flash`
- `gemini-2.5-flash-preview-09-2025`
- `gemini-2.5-pro`

**Temperature (creativity control):**
- Controls randomness/creativity of model output.  
- `0.0` → very deterministic, consistent output (best for precise tasks).  
- `0.5` → moderately creative (good balance for tasks like bounding boxes).  
- `1.0` → more creative, potentially less consistent output.
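
To make the effect of temperature concrete, the sketch below applies the standard textbook formulation: token scores are divided by the temperature before a softmax turns them into sampling probabilities. This illustrates the general idea only; how Gemini applies the parameter internally is not described in this notebook, and the scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Limit case: all probability mass goes to the highest-scoring option(s)
        top = max(logits)
        winners = [1.0 if x == top else 0.0 for x in logits]
        total = sum(winners)
        return [w / total for w in winners]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.0, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring candidate, which is why low values give more consistent bounding boxes.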

In [3]:
# Models
model_gemini_2_0 = "gemini-2.0-flash" 
model_gemini_2_5 = "gemini-2.5-flash"

# Temperature
temperature = 0.5

### Safety settings

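The `safety_settings` parameter sets, per harm category, the threshold at which the model blocks content. Here, dangerous content is blocked only when it is rated as high probability.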

In [4]:
safety_settings = [types.SafetySetting(category="HARM_CATEGORY_DANGEROUS_CONTENT",
                                       threshold="BLOCK_ONLY_HIGH",),]

### Other packages

Importing the additional packages required for image processing, file handling, and data encoding.

- **time**: Used to measure execution time or implement delays during retries.

- **Pillow (PIL)**: Used to open, resize, and manipulate images.

- **pathlib**: Simplifies path handling and directory/file management.

- **Pandas (pd)**: Used for data manipulation and analysis, including reading from and writing to Excel or CSV files.

- **io**: Provides tools for working with in-memory streams such as BytesIO.

- **requests**: Handles HTTP requests (e.g., downloading images or interacting with APIs).

- **Supervision (sv)**: Provides utilities for detection handling, annotation, and visualization.

- **json**: Used for parsing and generating JSON-formatted data.

- **base64**: Used for encoding and decoding binary data, especially images or API payloads.

In [5]:
# Packages

# Standard Library (Python built-in)
import base64
import json
import time
from io import BytesIO
from pathlib import Path

# Third-party Libraries
import pandas as pd
import requests
import supervision as sv
from PIL import Image

# Start timer (read again at the end of the object detection cell)
start_time = time.time()
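
As an illustration of where `base64` and `BytesIO` come in: Gemini 2.5 segmentation results are expected to carry each mask as a base64-encoded PNG string, possibly prefixed with `data:image/png;base64,`. The sketch below decodes such a string. The prefix handling and the helper name `decode_png_data_uri` are assumptions for illustration, and the round trip uses placeholder bytes rather than a real mask.

```python
import base64

DATA_URI_PREFIX = "data:image/png;base64,"


def decode_png_data_uri(value: str) -> bytes:
    """Strip an optional data-URI prefix and return the decoded bytes."""
    if value.startswith(DATA_URI_PREFIX):
        value = value[len(DATA_URI_PREFIX):]
    return base64.b64decode(value)


# Round trip on placeholder bytes (the 8-byte PNG signature, not a full image)
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = DATA_URI_PREFIX + base64.b64encode(png_signature).decode("ascii")
print(decode_png_data_uri(encoded) == png_signature)
```

For a real mask, the decoded bytes could be opened with `Image.open(BytesIO(decoded_bytes))`.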

## Object detection
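
The prompt used in this section asks for a `box_2d` per detection. Gemini reports these boxes as `[ymin, xmin, ymax, xmax]` on a 0 to 1000 scale, independent of image size, so they must be rescaled to pixels before drawing. In this notebook that step is left to `sv.Detections.from_vlm`; the sketch below shows the arithmetic by hand, with a hypothetical reply and the hypothetical helper `box_2d_to_pixels`.

```python
import json


def box_2d_to_pixels(box_2d, width, height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel (x1, y1, x2, y2)."""
    ymin, xmin, ymax, xmax = box_2d
    x1 = int(xmin / 1000 * width)
    y1 = int(ymin / 1000 * height)
    x2 = int(xmax / 1000 * width)
    y2 = int(ymax / 1000 * height)
    return x1, y1, x2, y2


# Hypothetical model reply for a 1024x768 image
reply = '[{"box_2d": [100, 250, 600, 500], "label": "tree"}]'
for item in json.loads(reply):
    print(item["label"], box_2d_to_pixels(item["box_2d"], 1024, 768))
    # prints: tree (256, 76, 512, 460)
```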

In [6]:
# Root folder
root_path = Path(r"C:\Users\amand\Amanda\GitHub\virtual_audit_ai\street_view_images")

# Get all images recursively
all_images = list(root_path.rglob("*.[jp][pn]g"))  # matches .jpg and .png; .jpeg files are not matched

print(f"Total images found: {len(all_images)}")

Total images found: 36
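
A note on the pattern `*.[jp][pn]g`: each bracket matches exactly one character, so the extension must be three characters long and `.jpeg` falls outside it. The sketch below checks this with `fnmatchcase` (used here as a stand-in for pathlib's matching) and shows a suffix-based alternative. `find_images`, `is_image`, and `IMAGE_SUFFIXES` are hypothetical names, not part of the notebook.

```python
from fnmatch import fnmatchcase
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def is_image(path) -> bool:
    """True when the file suffix is a known image extension (case-insensitive)."""
    return Path(path).suffix.lower() in IMAGE_SUFFIXES


def find_images(root):
    """Recursively collect image files under root, sorted for a stable order."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and is_image(p))


# Each bracket in the pattern matches exactly one character
for name in ("a.jpg", "a.png", "a.jpeg"):
    print(name, fnmatchcase(name, "*.[jp][pn]g"))
```

Sorting also makes the processing order reproducible across runs, which helps when comparing logs.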


In [8]:
# Parameters
excel_file = Path(r"C:\Users\amand\Amanda\GitHub\virtual_audit_ai\coordinates.xlsx")
root_path = Path(r"C:\Users\amand\Amanda\GitHub\virtual_audit_ai\street_view_images")
output_folder = Path(r"C:\Users\amand\Amanda\GitHub\virtual_audit_ai\street_view_outputs")
output_folder.mkdir(parents=True, exist_ok=True)

max_retries = 10

# Load Excel
df = pd.read_excel(excel_file)

# Reset Trees column to default missing "."
df["Trees"] = "."

# Load all images
all_images = list(root_path.rglob("*.[jp][pn]g"))

# Prompt for object detection
prompt = (
    "Detect ONLY real trees in this Street View image. "
    "Only label plants with a visible leafy canopy. Do NOT label objects without leaves (trunks, poles, shrubs, grass, fences, artificial trees, etc.). "
    "Output strictly a JSON list of bounding boxes with 'box_2d' and 'label':'tree'. No extra text."
)

# Process each image
for idx, image_path in enumerate(all_images, start=1):
    print(f"\nProcessing image {idx}/{len(all_images)}: {image_path.name}")

    try:
        # Extract main ID and point number from filename
        parts = image_path.stem.split("_")
        main_id = parts[0]
        point_number = parts[1].lstrip("p")
        subpoint_folder = f"point_{point_number}"

        # Load and resize image
        image = Image.open(image_path)
        width, height = image.size
        target_height = int(1024 * height / width)
        resized_image = image.resize((1024, target_height), Image.Resampling.LANCZOS)

        # Retry API call
        current_retry = 0
        wait_time = 1
        detections_json = None

        while current_retry < max_retries:
            try:
                response = client.models.generate_content(
                    model=model_gemini_2_0,
                    contents=[resized_image, prompt],
                    config=types.GenerateContentConfig(
                        temperature=temperature,
                        safety_settings=safety_settings,
                        thinking_config=types.ThinkingConfig(thinking_budget=0)))

                text = response.text
                json_start = text.find("[")
                json_end = text.rfind("]") + 1

                try:
                    detections_json = json.loads(text[json_start:json_end])
                    break  # Success
                except Exception:
                    current_retry += 1
                    print(f"(JSON error) Retrying {current_retry}/{max_retries} for {image_path.name}")
                    time.sleep(wait_time)
                    wait_time *= 2

            except genai_errors.ClientError:
                current_retry += 1
                print(f"(ClientError) Retrying {current_retry}/{max_retries} for {image_path.name}")
                time.sleep(wait_time)
                wait_time *= 2

            except Exception as e:
                current_retry += 1
                print(f"(Other error: {e}) Retrying {current_retry}/{max_retries} for {image_path.name}")
                time.sleep(wait_time)
                wait_time *= 2

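        if detections_json is None:
            print(f"⚠️ Failed to detect objects for {image_path.name} after {max_retries} retries")
            continue

        # An empty list is a valid reply: the model found no trees
        if not detections_json:
            print(f"No trees detected in {image_path.name}")
            continue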

        # Update Trees column in Excel
        mask = (df["id"].astype(str) == main_id) & (df["points"].astype(str) == point_number)
        if len(detections_json) > 0 and mask.any():
            df.loc[mask, "Trees"] = "Yes"
        else:
            print(f"⚠️ No match in Excel for id={main_id}, point={point_number}")

        # Annotate image
        resolution_wh = resized_image.size
        detections = sv.Detections.from_vlm(
            vlm=sv.VLM.GOOGLE_GEMINI_2_5,
            result=json.dumps(detections_json),
            resolution_wh=resolution_wh)

        thickness = sv.calculate_optimal_line_thickness(resolution_wh)
        text_scale = sv.calculate_optimal_text_scale(resolution_wh)

        box_annotator = sv.BoxAnnotator(thickness=thickness)
        label_annotator = sv.LabelAnnotator(
            smart_position=True,
            text_color=sv.Color.BLACK,
            text_scale=text_scale,
            text_position=sv.Position.CENTER)

        annotated = resized_image
        for annotator in (box_annotator, label_annotator):
            annotated = annotator.annotate(scene=annotated, detections=detections)

        # Save annotated image
        id_folder = output_folder / main_id / subpoint_folder
        id_folder.mkdir(parents=True, exist_ok=True)

        output_path = id_folder / f"{image_path.stem}_annotated.jpg"
        annotated.save(output_path)

        print(f"✅ Saved annotated image: {output_path.name}")

    except Exception as e:
        print(f"❌ Error processing {image_path.name}: {e}")
        continue

# Normalize the download flag to "True"/"False" ("VERDADEIRO" is Portuguese for "TRUE")
df["street_view_downloaded"] = df["street_view_downloaded"].map(
    lambda x: "True" if str(x).upper() in ["TRUE", "VERDADEIRO"] else "False")

# Save updated Excel once, after all column updates
df.to_excel(excel_file, index=False, engine='openpyxl')

# End timer
end_time = time.time()
total_seconds = end_time - start_time
minutes = int(total_seconds // 60)
seconds = total_seconds % 60

print("Processing complete.")
print(f"Total execution time: {minutes} minutes and {seconds:.2f} seconds.")


Processing image 1/36: 5103403535_p1_h0.jpg
(ClientError) Retrying 1/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 2/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 3/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 4/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 5/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 6/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 7/10 for 5103403535_p1_h0.jpg
(ClientError) Retrying 8/10 for 5103403535_p1_h0.jpg
✅ Saved annotated image: 5103403535_p1_h0_annotated.jpg

Processing image 2/36: 5103403535_p1_h180.jpg
(ClientError) Retrying 1/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 2/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 3/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 4/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 5/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 6/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 7/10 for 5103403535_p1_h180.jpg
(ClientError) Retrying 8/10 for 5103403535_

KeyboardInterrupt: 
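
The log above shows eight consecutive `ClientError` retries per image. With the wait doubling from 1 second, those eight waits add up to 1 + 2 + ... + 128 = 255 seconds before the request succeeds. The sketch below packages the same retry idea as a reusable helper with a cap on the wait. It is a variant, not the notebook's loop: `call_with_backoff` and its parameters are hypothetical, and it re-raises the last error instead of skipping the image.

```python
import time


def call_with_backoff(func, max_retries=10, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait (up to max_delay) after each failure."""
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            print(f"({type(exc).__name__}) Retrying {attempt}/{max_retries} in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Demo with a fake flaky function; waits are recorded instead of slept
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise ValueError("temporary failure")
    return "ok"


waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the detection loop, `func` would wrap the `client.models.generate_content` call, with `retry_on=(genai_errors.ClientError,)` to retry only on client errors.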