# SP1: 3D Open-Vocabulary Object Detection Pipeline

**Interactive Testing Notebook**

This notebook demonstrates the complete RGB-only 3D object detection pipeline:
- **Stage A**: Open-vocabulary 2D detection (YOLO-World)
- **Stage B**: Monocular depth estimation (Depth Anything V2)
- **Stage C**: 3D geometric projection
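
Stage C is not spelled out in this notebook. One common way to lift a 2D detection into 3D is pinhole back-projection, sketched below. The function name `backproject_pixel` and the intrinsics values are illustrative assumptions, not the pipeline's actual implementation or calibration.

```python
import numpy as np

def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at metric depth depth_m into the camera frame.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m], dtype=float)

# Hypothetical intrinsics for an 800x600 image (assumed values, not calibrated)
fx = fy = 600.0
cx, cy = 400.0, 300.0

# Centre pixel of a hypothetical 2D box, with an estimated depth of 2 m
center = backproject_pixel(u=520.0, v=300.0, depth_m=2.0, fx=fx, fy=fy, cx=cx, cy=cy)
print(center)
```

In this convention X points right, Y points down, and Z points forward from the camera, so `center[2]` is the depth along the optical axis.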

## 1. Setup and Installation

In [None]:
# Install dependencies (run once)
%pip install -q torch torchvision ultralytics transformers pillow opencv-python matplotlib supervision pyyaml requests

In [None]:
import sys
import os
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import requests
from io import BytesIO

# Add src to path
sys.path.insert(0, '.')

print("Imports successful!")

## 2. Initialize Pipeline

In [None]:
from src import SP1Pipeline, PipelineVisualizer

# Initialize pipeline (use 'cuda:0' if GPU available)
pipeline = SP1Pipeline(
    detector_model='yolov8s-world',
    device='cpu',  # Change to 'cuda:0' for GPU
    confidence_threshold=0.25
)

print("\nPipeline ready!")

## 3. Load Test Image

In [None]:
# Download sample indoor scene
image_url = "https://images.unsplash.com/photo-1586023492125-27b2c045efd7?w=800"

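response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of handing an error page to PIL
image = Image.open(BytesIO(response.content)).convert('RGB')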
image_np = np.array(image)

print(f"Image shape: {image_np.shape}")

plt.figure(figsize=(12, 8))
plt.imshow(image_np)
plt.title("Test Image: Indoor Room Scene")
plt.axis('off')
plt.show()

## 4. Run 3D Object Detection

In [None]:
# Define objects to detect (open-vocabulary - you can use any object names!)
query_classes = [
    "chair", "sofa", "table", "lamp", 
    "tv", "door", "window", "plant", "pillow"
]

print(f"Searching for: {query_classes}")
print("\nRunning pipeline...")

# Run detection
result = pipeline.detect(image_np, query_classes)

# Print summary
print(result.summary())

## 5. Visualize Results

In [None]:
# Create comprehensive visualization
viz = PipelineVisualizer()

fig = viz.visualize_pipeline_result(
    image=image_np,
    detections_2d=result.detections_2d,
    depth_result=result.depth_result,
    detections_3d=result.detections_3d,
    title="SP1 3D Detection Pipeline Results"
)

plt.show()

## 6. Inspect Individual Detections

In [None]:
print("=" * 60)
print("DETAILED 3D DETECTION RESULTS")
print("=" * 60)

for i, det in enumerate(result.detections_3d):
    print(f"\n[{i+1}] {det.class_name.upper()}")
    print(f"    Confidence: {det.confidence:.2%}")
    print(f"    3D Center: X={det.center[0]:.2f}m, Y={det.center[1]:.2f}m, Z={det.center[2]:.2f}m")
    print(f"    Dimensions: W={det.dimensions[0]:.2f}m, H={det.dimensions[1]:.2f}m, D={det.dimensions[2]:.2f}m")
    print(f"    Depth along camera Z axis: {det.center[2]:.2f} meters")

## 7. Navigation Waypoint Generation

In [None]:
# Generate navigation waypoint to a target object
target_object = "sofa"  # Try different objects!

waypoint = pipeline.get_waypoint(
    image_np, 
    target_object=target_object,
    offset_distance=0.5  # Stop 0.5m from object
)

if waypoint:
    print(f"\n🎯 Navigation waypoint to '{target_object}':")
    print(f"   Object position: {waypoint['object_position']}")
    print(f"   Waypoint (stop here): {waypoint['waypoint_position']}")
    print(f"   Distance: {waypoint['distance_to_object']:.2f}m")
    print(f"   Confidence: {waypoint['confidence']:.2%}")
else:
    print(f"\n❌ '{target_object}' not found in scene")
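
The notebook does not show how the waypoint is derived from `offset_distance`. One plausible geometric reading, sketched here, is to stop on the straight line between the camera origin and the object, `offset_distance` short of the object. The helper `waypoint_before_object` is hypothetical and may differ from what `pipeline.get_waypoint` actually does.

```python
import numpy as np

def waypoint_before_object(object_position, offset_distance=0.5):
    """Return the point on the camera-to-object ray that stops
    offset_distance (meters) short of the object.

    Assumes the camera sits at the origin of the frame in which
    object_position is expressed.
    """
    p = np.asarray(object_position, dtype=float)
    dist = np.linalg.norm(p)
    if dist <= offset_distance:
        # Already within the stand-off distance: stay at the camera origin
        return np.zeros(3)
    # Scale the position vector so its length shrinks by offset_distance
    return p * (dist - offset_distance) / dist

wp = waypoint_before_object([0.0, 0.0, 2.0], offset_distance=0.5)
print(wp)
```

A real navigation stack would usually also project the waypoint onto the ground plane and check it against an occupancy map; this sketch covers only the offset geometry.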

## 8. Try Custom Queries

In [None]:
# Try your own queries. Open vocabulary means class names are free-form text, though detection quality varies by phrase.
custom_queries = ["couch", "coffee table", "potted plant", "cushion"]

print(f"Custom query: {custom_queries}\n")

custom_result = pipeline.detect(image_np, custom_queries)

for det in custom_result.detections_3d:
    print(f"Found: {det.class_name} at {det.center[2]:.2f}m depth (conf: {det.confidence:.2%})")

if not custom_result.detections_3d:
    print("No objects detected with these queries")

## 9. Depth Map Analysis

In [None]:
from src.depth_estimator import DepthMapVisualizer

# Visualize depth map with different colormaps
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

colormaps = ['plasma', 'viridis', 'magma']
for ax, cmap in zip(axes, colormaps):
    depth_vis = DepthMapVisualizer.to_colormap(result.depth_result.depth_map, cmap)
    ax.imshow(depth_vis)
    ax.set_title(f"Depth Map ({cmap})")
    ax.axis('off')

plt.suptitle(f"Depth Range: {result.depth_result.min_depth:.2f}m - {result.depth_result.max_depth:.2f}m", 
             fontsize=14)
plt.tight_layout()
plt.show()

# Depth statistics
print("\nDepth Statistics:")
print(f"  Min depth: {result.depth_result.min_depth:.2f}m")
print(f"  Max depth: {result.depth_result.max_depth:.2f}m")
print(f"  Mean depth: {result.depth_result.mean_depth:.2f}m")

## 10. Performance Benchmark

In [None]:
# Run performance benchmark
print("Running benchmark (5 iterations)...\n")

times = []
for i in range(5):
    r = pipeline.detect(image_np, query_classes)
    times.append(r.total_time_ms)
    print(f"  Run {i+1}: {r.total_time_ms:.1f}ms")

avg_time = np.mean(times)
fps = 1000 / avg_time

print("\n📊 Benchmark Results:")
print(f"   Average time: {avg_time:.1f}ms")
print(f"   FPS: {fps:.2f}")
print("   Breakdown (last run):")
print(f"     - Detection: {r.detection_time_ms:.1f}ms")
print(f"     - Depth: {r.depth_time_ms:.1f}ms")
print(f"     - Projection: {r.projection_time_ms:.1f}ms")
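
The first call to a model often includes one-time costs such as weight loading or kernel compilation, which can skew a short benchmark. The sketch below discards warm-up calls and reports the median alongside the mean. The `benchmark` helper is an illustrative assumption, not part of the `src` package, and it measures wall-clock time rather than reading `total_time_ms`.

```python
import statistics
import time

def benchmark(fn, warmup=2, iterations=5):
    """Time fn() in milliseconds, discarding the warm-up calls."""
    for _ in range(warmup):
        fn()  # Results ignored; lets caches and lazy initialisation settle

    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
        "runs": len(times_ms),
    }

# Stand-in workload so the sketch runs without the pipeline
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

In this notebook the workload would be something like `lambda: pipeline.detect(image_np, query_classes)`.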

## 11. Export Results

In [None]:
import json

# Export results to JSON
os.makedirs('outputs', exist_ok=True)

result.save_json('outputs/detection_results.json')
print("Results saved to outputs/detection_results.json")

# Preview JSON structure
print("\nJSON Structure:")
print(json.dumps(result.to_dict(), indent=2)[:1000] + "...")
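
If you build a custom export instead of using `save_json`, note that `json.dumps` rejects NumPy arrays and most NumPy scalar types. A minimal sketch of a `default` handler follows. The `record` dictionary is made-up sample data, and whether `result.to_dict()` already returns plain Python types is not something this notebook shows.

```python
import json
import numpy as np

def to_jsonable(obj):
    """Fallback for json.dumps: convert NumPy containers and scalars."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()  # e.g. np.float32 -> float
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Made-up detection record for illustration only
record = {
    "class_name": "sofa",
    "center": np.array([0.4, 0.0, 2.0]),
    "confidence": np.float32(0.5),
}

text = json.dumps(record, default=to_jsonable, indent=2)
print(text)
```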

## 12. Test with Your Own Image

In [None]:
# Upload your own image (Google Colab only)
# from google.colab import files
# uploaded = files.upload()
# your_image = np.array(Image.open(list(uploaded.keys())[0]).convert('RGB'))

# Or use a URL:
# your_url = "https://your-image-url.jpg"
# your_image = np.array(Image.open(BytesIO(requests.get(your_url).content)).convert('RGB'))

# Then run:
# your_result = pipeline.detect(your_image, ['chair', 'table', 'person'])
# print(your_result.summary())

---

## 🎉 Pipeline Test Complete!

### Key Takeaways:

1. **Open-Vocabulary Detection**: Query any object by text description
2. **RGB-Only Depth**: No depth sensor needed; depth is estimated monocularly from a single image
3. **3D Localization**: Get estimated metric positions (meters) in the camera frame; accuracy depends on the depth model and the camera intrinsics
4. **Navigation Ready**: Generate waypoints for robot navigation

### Next Steps for Deployment:

1. **TensorRT Optimization**: Convert models to FP16/INT8 for Jetson Orin
2. **Camera Calibration**: Use actual camera intrinsics for your robot
3. **ROS Integration**: Wrap the pipeline in a ROS 2 node for the navigation stack
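
For the camera calibration step, the quantities that matter are the focal lengths and principal point of the pinhole intrinsics matrix. When no calibration is available yet, a rough placeholder can be derived from the image size and the horizontal field of view, as sketched below. The function `intrinsics_from_fov` and the 90° field of view are assumptions for illustration; a proper calibration (for example with a checkerboard) should replace them on a real robot.

```python
import math
import numpy as np

def intrinsics_from_fov(width, height, hfov_deg):
    """Approximate a 3x3 pinhole intrinsics matrix from image size and horizontal FOV.

    Assumes square pixels, no lens distortion, and a principal point
    at the image centre.
    """
    fx = width / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
    fy = fx  # Square-pixel assumption
    cx, cy = width / 2.0, height / 2.0
    return np.array([
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ])

K = intrinsics_from_fov(width=800, height=600, hfov_deg=90.0)
print(K)
```

Errors in the focal length scale the estimated X and Y positions proportionally, so a wrong field of view shifts objects sideways even when the depth is correct.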