## Setup and Dependencies

This example requires several Python libraries: boto3 to call AWS APIs, Pillow for image processing, matplotlib for visualization, GDAL for reading GeoTIFF files, requests for downloading aerial images, and pyproj for coordinate system conversions. NumPy, which is installed as a dependency of matplotlib, handles the image arrays. The first cell installs GDAL from conda-forge and the Python packages with pip.

In [None]:
%conda install -c conda-forge gdal -y
%pip install boto3 matplotlib pillow pyproj gdal requests --quiet
import os
os.environ['GTIFF_SRS_SOURCE'] = 'EPSG'

In [None]:
import boto3
import json
import matplotlib.pyplot as plt
from textwrap import dedent
import time
from pyproj import Transformer
from PIL import Image
import io
import base64
from IPython.display import display, clear_output
from osgeo import gdal
import numpy as np
import random
import requests

Create a Bedrock Runtime client and set the MODEL_ID variable. This example uses Amazon Nova Pro, which supports multimodal input and vision understanding. You can use other multimodal models, but you may need to adjust prompts and parameters for optimal results.

In [None]:
boto3.setup_default_session()
bedrock_runtime = boto3.client("bedrock-runtime", region_name=os.environ.get('AWS_REGION', 'us-east-1'))
MODEL_ID = "amazon.nova-pro-v1:0"

Test model access with a simple request. If you encounter errors, verify that you have the required IAM permissions and that model access is enabled in the Amazon Bedrock console.

In [None]:
try:
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": "Hello"}]}]
    )
    print(f"✅ Model '{MODEL_ID}' is activated and available.")

except Exception as e:
    if "AccessDeniedException" in str(e) or "is not enabled" in str(e):
        print(f"❌ Model '{MODEL_ID}' is not activated.")
        print("🔗 Activate it here: https://console.aws.amazon.com/bedrock/home?modelaccess#/modelaccess")
    else:
        print(f"❌ Error: {e}")

This example uses high-resolution aerial imagery from SwissTopo's SWISSIMAGE 10 cm orthophoto mosaic. The imagery covers all of Switzerland with 10 cm ground resolution in plains and main valleys, and 25 cm resolution in the Alps. SwissTopo updates the imagery every 3 years. Download 1 square kilometer tiles from the [SwissTopo website](https://www.swisstopo.admin.ch/en/orthoimage-swissimage-10). The GeoTIFF format includes coordinate metadata that enables pixel-to-coordinate conversion.

In [None]:
# Replace the link with the link to the image you want to inspect
url = "https://data.geo.admin.ch/ch.swisstopo.swissimage-dop10/swissimage-dop10_2022_2752-1227/swissimage-dop10_2022_2752-1227_0.1_2056.tif"
satellite_image_path = os.path.basename(url)
with requests.get(url, stream=True, timeout=30) as r:
    r.raise_for_status()
    with open(satellite_image_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)


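In [None]:
# Load the GeoTIFF and read the pixel data and georeferencing metadata
dataset = gdal.Open(satellite_image_path)
if dataset is None:
    raise FileNotFoundError(f"GDAL could not open {satellite_image_path}")
image_data = dataset.ReadAsArray()     # (bands, height, width) for multi-band images
transform = dataset.GetGeoTransform()  # six-element affine geotransform
crs = dataset.GetProjection()          # coordinate reference system as WKT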


## Image Processing Functions

SwissTopo's high-resolution images work well for landmark detection, but a full tile exceeds Amazon Bedrock's request size limits. At 10 cm resolution, a 1 km x 1 km tile measures 10,000 x 10,000 pixels, so splitting it into 1000 x 1000 pixel chunks produces a 10 x 10 grid of 100 chunks that stay within these limits. The `split_image` function returns the chunks together with the pixel offset of each chunk's top-left corner in the full image.

In [None]:
def split_image(image_data, chunk_size=1000):
    """Split a large image into smaller chunks."""
    chunks = []
    chunk_transforms = []

    if len(image_data.shape) == 3:
        bands, height, width = image_data.shape
    else:
        height, width = image_data.shape
        bands = 1

    num_chunks_y = (height + chunk_size - 1) // chunk_size
    num_chunks_x = (width + chunk_size - 1) // chunk_size

    for y in range(num_chunks_y):
        for x in range(num_chunks_x):
            y_start = y * chunk_size
            y_end = min((y + 1) * chunk_size, height)
            x_start = x * chunk_size
            x_end = min((x + 1) * chunk_size, width)

            if bands > 1:
                chunk = image_data[:, y_start:y_end, x_start:x_end]
            else:
                chunk = image_data[y_start:y_end, x_start:x_end]

            chunks.append(chunk)
            chunk_transforms.append((x_start, y_start))

    return chunks, chunk_transforms


In [None]:
chunk_size = 1000 # Pixels - chunking required to not exceed request limit for large images
chunks, chunk_transforms = split_image(image_data, chunk_size)
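
As a quick sanity check on the chunk count, the following sketch reproduces the tiling arithmetic without loading any imagery. `chunk_offsets` is a hypothetical helper that is not part of the notebook, and the 10,000 x 10,000 pixel size is an assumption derived from a 1 km tile at 10 cm resolution.

```python
def chunk_offsets(height, width, chunk_size=1000):
    """Return the (x_start, y_start) pixel offset of every chunk, row by row."""
    # Ceiling division so a partial chunk at the right or bottom edge is still counted
    num_chunks_y = (height + chunk_size - 1) // chunk_size
    num_chunks_x = (width + chunk_size - 1) // chunk_size
    return [
        (x * chunk_size, y * chunk_size)
        for y in range(num_chunks_y)
        for x in range(num_chunks_x)
    ]


# Assumed tile size: 1 km / 0.1 m per pixel = 10,000 pixels per side
offsets = chunk_offsets(10_000, 10_000)
print(len(offsets))   # number of chunks
print(offsets[:3])    # first offsets in the top row
```

The offsets follow the same row-by-row order as `chunk_transforms`, so the same index refers to the same chunk in both lists.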

Convert image arrays to JPEG bytes for the Converse API, which requires raw image data in bytes format.

In [None]:
def encode_image_to_bytes(image_array):
    """Convert a (bands, height, width) or (height, width) numpy array to JPEG bytes."""
    if image_array.ndim == 2:  # Single band stored as a 2D array
        img = Image.fromarray(image_array.astype(np.uint8))
    elif image_array.shape[0] == 1:  # Single band with an explicit band axis
        img = Image.fromarray(image_array[0].astype(np.uint8))
    else:  # Multiple bands: keep the first three (assumed to be RGB), since JPEG has no alpha channel
        # Rearrange from (bands, height, width) to (height, width, bands)
        img = Image.fromarray(np.transpose(image_array[:3], (1, 2, 0)).astype(np.uint8))

    buffered = io.BytesIO()
    img.save(buffered, format="JPEG", quality=100, optimize=False)
    return buffered.getvalue()


Convert pixel positions to real-world coordinates using GeoTIFF metadata and coordinate reference system information.

In [None]:
def convert_image_coords_to_geo(points, transform, crs):
    """Convert (x, y) pixel coordinates to (lon, lat) in WGS 84 using the GDAL geotransform."""
    # Build the transformer once; always_xy=True keeps (x, y) and (lon, lat) axis order
    transformer = Transformer.from_crs(crs, "EPSG:4326", always_xy=True)
    coordinates = []

    for point in points:
        # GDAL transform: [x_origin, pixel_width, 0, y_origin, 0, -pixel_height]
        projected_x = transform[0] + point[0] * transform[1] + point[1] * transform[2]
        projected_y = transform[3] + point[0] * transform[4] + point[1] * transform[5]
        lon, lat = transformer.transform(projected_x, projected_y)
        coordinates.append((lon, lat))

    return coordinates
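
To make the affine step concrete, here is a minimal worked example that uses only the geotransform and skips the reprojection to WGS 84. `pixel_to_projected` is a hypothetical helper, and the geotransform values are illustrative assumptions for a north-up tile with 10 cm pixels, not values read from the downloaded file.

```python
def pixel_to_projected(geotransform, col, row):
    """Apply a GDAL-style six-element affine geotransform to a pixel position."""
    x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height = geotransform
    x = x_origin + col * pixel_width + row * row_rotation
    y = y_origin + col * col_rotation + row * pixel_height
    return x, y


# Assumed values: top-left corner at (2752000, 1228000) in the projected CRS,
# 0.1 m pixels, no rotation; pixel_height is negative because rows run southward
example_geotransform = (2_752_000.0, 0.1, 0.0, 1_228_000.0, 0.0, -0.1)

# Pixel 500 columns to the right of and 250 rows below the top-left corner
x, y = pixel_to_projected(example_geotransform, col=500, row=250)
print(x, y)
```

With these assumed values, the point lies 50 m east and 25 m south of the tile's top-left corner. The result is still in the projected coordinate system; the pyproj transformer in the function above handles the conversion to longitude and latitude.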

## Object Detection Configuration

Use function calling (tool use) to get structured output from Amazon Nova Pro. When the model calls the tool, its arguments arrive as JSON that follows the schema you define, so downstream code does not need to parse free-form text.

In [None]:
# Configuration for object detection
OBJECT_CRITERIA = {
    "crosswalks": "Visible parallel striped markings (white or yellow paint) on paved roads or intersections with clear pedestrian crossing pattern. NOT lane markings, parking lines, building shadows, or road edges.",
    "red_cars": "Vehicles that are clearly red in color, parked or moving on roads or parking areas. Must be distinguishable as cars (not trucks, motorcycles, or other vehicles)."
}

# Define the generic object detection tool
object_detection_tool = {
    "toolSpec": {
        "name": "detect_objects",
        "description": "ONLY call this function if you find the requested objects in the aerial image. Do not call if no objects are present.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "Written description of what was found in the image"
                    },
                    "objects": {
                        "type": "array",
                        "description": "List of detected objects matching the search criteria",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {
                                    "type": "string",
                                    "description": "Unique identifier for this object instance"
                                },
                                "object_type": {
                                    "type": "string",
                                    "description": "Type of object detected (e.g., crosswalk, red_car, swimming_pool, parking_lot)"
                                },
                                "confidence": {
                                    "type": "number",
                                    "minimum": 0,
                                    "maximum": 1,
                                    "description": "Confidence score (0-1) that this is actually the requested object type"
                                },
                                "attributes": {
                                    "type": "object",
                                    "description": "Object-specific attributes (e.g., color, size, shape)",
                                    "additionalProperties": True
                                },
                                "bounding_box": {
                                    "type": "object",
                                    "properties": {
                                        "x_min": {"type": "integer"},
                                        "y_min": {"type": "integer"},
                                        "x_max": {"type": "integer"},
                                        "y_max": {"type": "integer"}
                                    },
                                    "required": ["x_min", "y_min", "x_max", "y_max"]
                                }
                            },
                            "required": ["name", "object_type", "confidence", "bounding_box"]
                        }
                    }
                },
                "required": ["text", "objects"]
            }
        }
    }
}

def create_system_prompt(object_type):
    criteria = OBJECT_CRITERIA.get(object_type, "Objects matching the specified type and characteristics.")
    return dedent(f"""\
        You are a precise object detection system. Analyze aerial images to find ONLY genuine {object_type}.
        
        DETECTION CRITERIA: {criteria}
        
        ONLY call detect_objects function if you find actual {object_type} meeting the criteria.
        If no {object_type} are found, do NOT call any function - just respond with text explaining what you see.
        Be extremely conservative - false negatives are better than false positives.
    """)

def build_request(image_bytes, object_type="crosswalks"):
    system_prompt = create_system_prompt(object_type)
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                    {"text": f"Analyze this aerial image and detect any {object_type}. Use the detect_objects function to report your findings."}
                ]
            }
        ],
        "system": [{"text": system_prompt}],
        "toolConfig": {
            "tools": [object_detection_tool],
            "toolChoice": {"auto": {}}
        },
        "inferenceConfig": {
            "maxTokens": 2500,
            "temperature": 0.1
        }
    }

def invoke_model(request_params, model_id):
    max_retries = 3

    for attempt in range(max_retries):
        try:
            response = bedrock_runtime.converse(modelId=model_id, **request_params)
            print("✅ Inference completed successfully.")
            return response
        except Exception as e:
            if (
                    hasattr(e, 'response') and 
                    e.response.get("Error", {}).get("Code") == "ThrottlingException"
                    and attempt < max_retries - 1
                ):
                wait_time = (2**attempt) + random.randint(0, 60)
                print(
                    f"ThrottlingException, retrying in {wait_time}s (attempt {attempt + 1}/{max_retries})"
                )
                time.sleep(wait_time)
                continue
            else:
                print(f"❌ Model invocation error: {str(e)}")
                return None
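
The retry logic above adds a random wait of up to 60 seconds on top of an exponential term. One common alternative is "full jitter" backoff, sketched below as a standalone function. This is a different technique from the one in the cell above, and `backoff_delay`, `base`, and `cap` are hypothetical names chosen for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Return a 'full jitter' delay in seconds for a zero-based retry attempt."""
    # Exponential ceiling, capped so late retries do not wait unboundedly long
    ceiling = min(cap, base * (2 ** attempt))
    # Pick uniformly below the ceiling so concurrent clients spread out
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only to make the demonstration repeatable
for attempt in range(4):
    delay = backoff_delay(attempt, rng=rng)
    print(f"attempt {attempt}: ceiling {min(30.0, 2 ** attempt):.0f}s, chose {delay:.2f}s")
```

Capping the ceiling keeps the worst-case wait predictable, which matters when the loop processes 100 chunks in sequence.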



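Function calling provides structured output that matches your defined schema. The `parse_response` function reads the `detect_objects` arguments directly from the `toolUse` content block, so no text parsing is needed. The remaining helpers draw the bounding boxes and print a report for each chunk.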

In [None]:
def parse_response(model_response):
    if not model_response:
        return None
    try:
        # Extract function call result from the response
        content = model_response["output"]["message"]["content"]
        for item in content:
            if "toolUse" in item:
                tool_use = item["toolUse"]
                if tool_use["name"] == "detect_objects":
                    parsed = tool_use["input"]
                    print("✅ Parsed Output:")
                    print(json.dumps(parsed, indent=2))
                    return parsed
            elif "text" in item:
                print(f"Model response: {item['text']}")
        return None
    except (KeyError, TypeError) as e:  # missing keys or an unexpected response structure
        print(f"❌ Response parsing failed: {str(e)}")
        return None


def draw_bounding_boxes(base64_img, objects, offset_x, offset_y):
    img = Image.open(io.BytesIO(base64.b64decode(base64_img)))
    img_width, img_height = img.size

    model_width = 1000
    model_height = 1000

    scale_x = img_width / model_width
    scale_y = img_height / model_height

    _, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(img)

    for obj in objects:
        box = obj["bounding_box"]
        # Rescale from the model's 1000 x 1000 grid to chunk pixels before any further use
        x_min = box["x_min"] * scale_x
        y_min = box["y_min"] * scale_y
        x_max = box["x_max"] * scale_x
        y_max = box["y_max"] * scale_y
        # Georeference the top-left corner; transform and crs are the notebook-level GeoTIFF metadata
        point = convert_image_coords_to_geo([[x_min + offset_x, y_min + offset_y]], transform, crs)
        obj["coordinates"] = point[0]
        label = f"{obj['name']}"
        color = "orange"
        rect = plt.Rectangle((x_min, y_min), x_max - x_min, y_max - y_min,
                            linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        ax.text(x_min, y_min - 10, label, color=color, fontsize=12, weight='bold')

    plt.axis('off')
    plt.show()


def display_result(parsed, base64_image, offset_x, offset_y):
    """Plot the detections for one chunk and print a text report."""
    if parsed.get("objects"):
        print("🧾 Image with objects:")
        # Drawing also attaches (lon, lat) "coordinates" to each object
        draw_bounding_boxes(base64_image, parsed["objects"], offset_x, offset_y)

        report_text = f"Description:\n{parsed['text']}\n"

        for obj in parsed["objects"]:
            box = obj["bounding_box"]
            lon, lat = obj["coordinates"]
            report_text += (
                f"\n- {obj['name']}:\n"
                f"  Bounding Box: x_min={box['x_min']}, y_min={box['y_min']}, "
                f"x_max={box['x_max']}, y_max={box['y_max']}\n"
                f"  Coordinates (lat, lon): {lat} {lon}\n"
            )
    else:
        report_text = "No objects found"
    print(report_text)
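
To show the response shape that `parse_response` walks through, the sketch below extracts tool arguments from a hand-written stand-in for a Converse response and drops low-confidence detections. The sample values, the `extract_detections` helper, and the 0.5 threshold are assumptions made for illustration; only the nesting (`output`, `message`, `content`, `toolUse`, `input`) mirrors what the notebook code reads.

```python
# Hand-written stand-in for a Converse response; all values are made up
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {"text": "I found candidate red cars."},
                {
                    "toolUse": {
                        "toolUseId": "example-id",
                        "name": "detect_objects",
                        "input": {
                            "text": "Two candidate red cars near the road.",
                            "objects": [
                                {
                                    "name": "red_car_1",
                                    "object_type": "red_car",
                                    "confidence": 0.9,
                                    "bounding_box": {"x_min": 120, "y_min": 340, "x_max": 160, "y_max": 380},
                                },
                                {
                                    "name": "red_car_2",
                                    "object_type": "red_car",
                                    "confidence": 0.4,
                                    "bounding_box": {"x_min": 600, "y_min": 710, "x_max": 640, "y_max": 750},
                                },
                            ],
                        },
                    }
                },
            ],
        }
    },
    "stopReason": "tool_use",
}


def extract_detections(response, tool_name="detect_objects", min_confidence=0.5):
    """Return objects from the first matching toolUse block, filtered by confidence."""
    for block in response["output"]["message"]["content"]:
        tool_use = block.get("toolUse")
        if tool_use and tool_use["name"] == tool_name:
            objects = tool_use["input"].get("objects", [])
            return [obj for obj in objects if obj["confidence"] >= min_confidence]
    return []  # the model answered with text only


detections = extract_detections(sample_response)
print([d["name"] for d in detections])
```

Filtering on the model's self-reported confidence is a heuristic rather than a calibrated probability, so treat the threshold as a tuning knob.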

## Point of Interest Detection

Process the image chunks to detect points of interest.

In [None]:
OBJECT_TYPE = "red_cars"

for i, (chunk, (offset_x, offset_y)) in enumerate(zip(chunks, chunk_transforms)):
    print(f"Chunk {i + 1}/{len(chunks)} (offset x={offset_x}, y={offset_y})")
    img_bytes = encode_image_to_bytes(chunk)
    request_params = build_request(img_bytes, OBJECT_TYPE)
    model_response = invoke_model(request_params, MODEL_ID)
    parsed_output = parse_response(model_response)
    if parsed_output:
        img_base64 = base64.b64encode(img_bytes).decode('utf-8')
        display_result(parsed_output, img_base64, offset_x, offset_y)
    else:
        print(f"No {OBJECT_TYPE} found")
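
The loop above only displays results. If you want to keep the detections for downstream processing, one option is to collect them as GeoJSON points. The following is a minimal sketch under that assumption: `detections_to_geojson` is a hypothetical helper, and the example object, including its coordinates, is made up. It expects each object to carry the `(lon, lat)` tuple that `draw_bounding_boxes` stores under `"coordinates"`.

```python
import json


def detections_to_geojson(objects):
    """Build a GeoJSON FeatureCollection from objects that carry (lon, lat) coordinates."""
    features = []
    for obj in objects:
        lon, lat = obj["coordinates"]
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered longitude, latitude
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {
                "name": obj["name"],
                "object_type": obj["object_type"],
                "confidence": obj["confidence"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up detection standing in for the output of one chunk
example_objects = [
    {"name": "red_car_1", "object_type": "red_car", "confidence": 0.9, "coordinates": (9.47, 47.17)}
]
collection = detections_to_geojson(example_objects)
print(json.dumps(collection, indent=2))
```

In the notebook, you could extend a list with `parsed_output["objects"]` after each `display_result` call and pass that list to the helper once the loop finishes.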


To search for a different object, change `OBJECT_TYPE`. To search for an object type that is not yet defined, add its detection criteria to `OBJECT_CRITERIA`, set `OBJECT_TYPE` to the new key, and run the loop again.
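
For example, adding a new object type could look like the sketch below. The `swimming_pools` key and its criteria text are illustrative assumptions; adjust the wording to your imagery. The first assignment is a stand-in so the snippet runs on its own. In the notebook, skip it and add the new entry to the existing dictionary.

```python
# Stand-in for the notebook's dictionary so this snippet runs on its own
OBJECT_CRITERIA = {
    "red_cars": "Vehicles that are clearly red in color, parked or moving on roads or parking areas.",
}

# Illustrative criteria for a new object type: say what counts and what does not
OBJECT_CRITERIA["swimming_pools"] = (
    "Outdoor pools with a clearly visible blue or turquoise water surface and a regular outline. "
    "NOT ponds, rivers, tarpaulins, or blue rooftops."
)

OBJECT_TYPE = "swimming_pools"

# Same lookup that create_system_prompt performs, including its fallback text
criteria = OBJECT_CRITERIA.get(OBJECT_TYPE, "Objects matching the specified type and characteristics.")
print(criteria)
```

Listing explicit exclusions, as the existing criteria do, gives the model concrete look-alikes to reject.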

## Cleanup

Delete the CloudFormation stack to remove the resources you deployed:

```bash
aws cloudformation delete-stack --stack-name GenerativePOIDetection
```

If you ran the Jupyter notebook locally, no cleanup is needed because you did not provision any AWS resources.