# VisionDetect: Object Detection with Deep Learning

This notebook demonstrates how to use the VisionDetect framework for object detection tasks. We'll cover:

1. Setting up the environment
2. Loading and preprocessing data
3. Creating and training a model
4. Evaluating model performance
5. Making predictions on new images
6. Running real-time inference with a webcam

## 1. Setup

First, let's import the necessary modules and set up our environment.

In [None]:
import os
import sys
import torch
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Import VisionDetect modules
from src.data.preprocessing import DataProcessor
from src.models.architecture import ObjectDetectionModel
from src.models.trainer import ModelTrainer
from src.models.predictor import Predictor
from src.utils.visualization import visualize_detection, plot_training_metrics

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. Data Preparation

Next, we'll set up our data processing pipeline. For this example, we'll use a small dataset of images with object annotations.

In [None]:
# Create data directories if they don't exist
data_dir = project_root / "data"
os.makedirs(data_dir, exist_ok=True)

# Initialize data processor
processor = DataProcessor(
    data_dir=data_dir,
    img_size=(640, 640),
    batch_size=4,  # Small batch size for demonstration
    num_workers=2,
    augment=True
)

# Prepare dataset
# Note: In a real scenario, you would have your dataset ready or use the download option
dataloaders = processor.prepare_dataset(download=False)

# Print dataset information
for split, dataloader in dataloaders.items():
    print(f"{split} dataset: {len(dataloader.dataset)} samples, {len(dataloader)} batches")
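
The batch count printed above follows from the sample count and the batch size. Below is a minimal sketch of that arithmetic. It assumes the loaders behave like a PyTorch `DataLoader`, which keeps a final partial batch unless `drop_last=True`. `expected_batches` is a hypothetical helper, not part of VisionDetect.

```python
import math


def expected_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return num_samples // batch_size  # Final partial batch is discarded
    return math.ceil(num_samples / batch_size)  # Final partial batch is kept


print(expected_batches(10, 4))                  # 3
print(expected_batches(10, 4, drop_last=True))  # 2
```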

### Visualize Sample Data

Let's visualize some samples from our dataset to understand what we're working with.

In [None]:
# Get a batch of training data
if 'train' in dataloaders:
    batch = next(iter(dataloaders['train']))
    images = batch['images']
    targets = batch['targets']
    
    # Display a few images with their bounding boxes
    fig, axs = plt.subplots(1, min(4, len(images)), figsize=(15, 5))
    if len(images) == 1:
        axs = [axs]
    
    for i, (image, target) in enumerate(zip(images[:4], targets[:4])):
        # Convert tensor to numpy array and denormalize
        img = image.permute(1, 2, 0).cpu().numpy()
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        img = (img * std + mean) * 255.0
        # Clip before casting so out-of-range values don't wrap around in uint8
        img = np.clip(img, 0, 255).astype(np.uint8)
        
        # Display image
        axs[i].imshow(img)
        
        # Draw bounding boxes
        boxes = target['boxes'].cpu().numpy()
        labels = target['labels'].cpu().numpy()
        
        for box, label in zip(boxes, labels):
            x1, y1, x2, y2 = box
            rect = plt.Rectangle((x1, y1), x2-x1, y2-y1, fill=False, edgecolor='red', linewidth=2)
            axs[i].add_patch(rect)
            axs[i].text(x1, y1, f"Class {label}", color='white', backgroundcolor='red', fontsize=8)
        
        axs[i].set_title(f"Image {i+1}")
        axs[i].axis('off')
    
    plt.tight_layout()
    plt.show()
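
The denormalization step inverts the per-channel normalization `(x - mean) / std` and rescales to the 0-255 range. The same logic can be factored into a small helper, sketched below. `denormalize` is a hypothetical name, and the sketch assumes a NumPy array in CHW layout, so a tensor would need `.cpu().numpy()` first. It clips before casting to `uint8` so that augmented or slightly out-of-range values don't wrap around.

```python
import numpy as np

# Per-channel statistics, matching the values used in the cell above
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])


def denormalize(chw_image):
    """Convert a normalized CHW float array to an HWC uint8 image."""
    hwc = np.transpose(chw_image, (1, 2, 0))  # CHW -> HWC so channels broadcast last
    pixels = (hwc * STD + MEAN) * 255.0       # Undo (x - mean) / std, then scale
    return np.clip(pixels, 0, 255).astype(np.uint8)


restored = denormalize(np.zeros((3, 2, 2)))
print(restored.shape, restored.dtype)  # (2, 2, 3) uint8
```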

## 3. Model Creation and Training

Now, let's create and train our object detection model. We'll use a Faster R-CNN model with a ResNet50 backbone.

In [None]:
# Create model trainer
trainer = ModelTrainer(
    model_type="faster_rcnn",
    backbone_name="resnet50",
    num_classes=91,  # COCO: 80 object categories with IDs up to 90, plus background (index 0)
    learning_rate=0.001,
    batch_size=4,
    num_epochs=10,  # Small number of epochs for demonstration
    checkpoint_dir=project_root / "checkpoints",
    device=device
)

# Print model summary
print(f"Model: {trainer.model_type} with {trainer.backbone_name} backbone")
print(f"Number of classes: {trainer.num_classes}")
print(f"Training device: {trainer.device}")

### Training the Model

Now let's train our model. This will take some time, especially if you're training on CPU.

In [None]:
# Train the model
# Note: In a real scenario, you would train for more epochs
if 'train' in dataloaders:
    metrics = trainer.train(
        train_dataloader=dataloaders['train'],
        val_dataloader=dataloaders.get('val')
    )
    
    # Plot training metrics
    plot_training_metrics(metrics, show=True)
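
The evaluation section loads `best_model.pth` from the checkpoint directory. This notebook does not show how `ModelTrainer` decides which epoch is the best one. A common convention is to keep the epoch with the lowest validation loss, and the sketch below illustrates only that convention. `BestEpochTracker` and the loss values are hypothetical.

```python
class BestEpochTracker:
    """Remembers the epoch with the lowest validation loss seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None

    def update(self, epoch, val_loss):
        """Return True when this epoch improves on the best loss so far."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            return True  # Caller would save a checkpoint here
        return False


tracker = BestEpochTracker()
made_up_losses = [0.92, 0.71, 0.75, 0.64, 0.66]  # Illustrative values only
for epoch, val_loss in enumerate(made_up_losses, start=1):
    if tracker.update(epoch, val_loss):
        print(f"Epoch {epoch}: new best validation loss {val_loss:.2f}")
```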

## 4. Model Evaluation

Let's evaluate our trained model on the test dataset to see how well it performs.

In [None]:
# Create predictor with the best model
best_model_path = project_root / "checkpoints" / "best_model.pth"
if best_model_path.exists():
    predictor = Predictor(
        model_path=best_model_path,
        device=device,
        confidence_threshold=0.5
    )
    
    # Evaluate on test dataset
    if 'test' in dataloaders:
        # Initialize metrics
        all_predictions = []
        all_targets = []
        
        # Process each batch
        for batch in dataloaders['test']:
            # Get images and targets
            images = batch["images"].cpu().numpy()
            targets = batch["targets"]
            
            # Make predictions
            batch_predictions = []
            for i in range(len(images)):
                # Convert image from CHW to HWC format
                image = np.transpose(images[i], (1, 2, 0))
                
                # Denormalize image
                mean = np.array([0.485, 0.456, 0.406])
                std = np.array([0.229, 0.224, 0.225])
                image = (image * std + mean) * 255.0
                # Clip before casting so out-of-range values don't wrap around in uint8
                image = np.clip(image, 0, 255).astype(np.uint8)
                
                # Make prediction
                prediction = predictor.predict(image)
                batch_predictions.append(prediction)
            
            # Store predictions and targets for mAP calculation
            all_predictions.extend(batch_predictions)
            all_targets.extend(targets)
        
        # Calculate metrics
        from src.utils.metrics import calculate_map, calculate_map_range
        map_50 = calculate_map(all_predictions, all_targets, iou_threshold=0.5)
        map_range = calculate_map_range(all_predictions, all_targets)
        
        print("Evaluation Results:")
        print(f"mAP@0.5: {map_50:.4f}")
        print(f"mAP@[0.5:0.95]: {map_range:.4f}")
else:
    print("No trained model found. Please train the model first.")
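
The internals of `calculate_map` and `calculate_map_range` are not shown here. To make the two reported numbers concrete, the sketch below computes IoU for `(x1, y1, x2, y2)` boxes, a single-class average precision at one IoU threshold, and the mean of that value over the thresholds 0.50 to 0.95 in steps of 0.05, which is what mAP@[0.5:0.95] conventionally denotes. The function names are hypothetical, and the greedy matching with all-point interpolation is a simplification that may differ from the framework's implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.5):
    """Single-class AP: area under the interpolated precision-recall curve."""
    if not gt_boxes:
        return 0.0
    # Visit predictions from most to least confident
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched, tp_flags = set(), []
    for i in order:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue  # Each ground-truth box can be matched only once
            overlap = iou(pred_boxes[i], gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        is_tp = best_j is not None and best_iou >= iou_threshold
        if is_tp:
            matched.add(best_j)
        tp_flags.append(is_tp)
    # Running precision and recall after each prediction
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / k)
        recalls.append(tp / len(gt_boxes))
    # Make precision non-increasing from right to left (all-point interpolation)
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def average_precision_range(pred_boxes, pred_scores, gt_boxes):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * k for k in range(10)]
    values = [average_precision(pred_boxes, pred_scores, gt_boxes, t) for t in thresholds]
    return sum(values) / len(values)


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (50, 50, 60, 60), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(f"AP@0.5: {average_precision(preds, scores, gt):.4f}")  # 0.8333
```

In the example, the second-ranked prediction is a false positive, so precision drops before the last true positive is found. That is why the AP is below 1 even though both ground-truth boxes are eventually detected.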

## 5. Making Predictions

Now let's use our trained model to make predictions on new images.

In [None]:
# Function to load and preprocess an image
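def load_image(image_path):
    import cv2
    image = cv2.imread(str(image_path))
    if image is None:
        # cv2.imread returns None instead of raising when a file is missing or unreadable
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # OpenCV loads images as BGR; convert to RGB for the predictor and matplotlib
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return image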

# Make predictions on sample images
if best_model_path.exists():
    # Get sample images
    sample_images_dir = project_root / "data" / "samples"
    if not sample_images_dir.exists():
        os.makedirs(sample_images_dir, exist_ok=True)
        print(f"Please add sample images to {sample_images_dir}")
    else:
        # Get all image files
        image_extensions = [".jpg", ".jpeg", ".png", ".bmp"]
        image_paths = []
        for ext in image_extensions:
            image_paths.extend(list(sample_images_dir.glob(f"*{ext}")))
        
        if image_paths:
            # Process each image
            for image_path in image_paths[:5]:  # Process up to 5 images
                # Load image
                image = load_image(image_path)
                
                # Make prediction
                result = predictor.predict(image, return_visualization=True)
                
                # Display results
                plt.figure(figsize=(10, 8))
                plt.imshow(result["visualization"])
                plt.title(f"Predictions for {image_path.name}")
                plt.axis('off')
                plt.show()
                
                # Print detection results
                print(f"\nDetections for {image_path.name}:")
                for i, (box, score, class_name) in enumerate(zip(result['boxes'], result['scores'], result['class_names'])):
                    print(f"  Object {i+1}: {class_name}, Score: {score:.2f}, Box: {box}")
        else:
            print(f"No sample images found in {sample_images_dir}")
else:
    print("No trained model found. Please train the model first.")
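
The glob patterns above match extensions case-sensitively on case-sensitive file systems (typical on Linux), so a file named `photo.JPG` would be skipped. The order of the results is also not guaranteed. A possible alternative is sketched below with a hypothetical helper, `list_images`, which compares lowercased suffixes and sorts by name.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def list_images(directory, limit=None):
    """Return image files in `directory`, sorted, matching extensions case-insensitively."""
    paths = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    return paths if limit is None else paths[:limit]
```

With this helper, the loop in the cell above could iterate over `list_images(sample_images_dir, limit=5)` instead of building `image_paths` by hand.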

## 6. Real-time Inference with Webcam

If you have a webcam connected, you can run real-time object detection.

In [None]:
def run_webcam_detection(predictor, confidence_threshold=0.5):
    # Note: confidence_threshold is accepted but not applied anywhere below
    import cv2

    # Open webcam
    cap = cv2.VideoCapture(0)

    if not cap.isOpened():
        print("Error: Could not open webcam.")
        return

    print("Press 'q' to quit.")

    try:
        while True:
            # Read frame
            ret, frame = cap.read()
            if not ret:
                break

            # Convert frame to RGB
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

            # Make prediction
            result = predictor.predict(frame_rgb, return_visualization=True)

            # Convert visualization back to BGR for display
            visualization = cv2.cvtColor(result["visualization"], cv2.COLOR_RGB2BGR)

            # Display result
            cv2.imshow("Object Detection", visualization)

            # Check for quit key
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
    finally:
        # Release resources even if prediction or display raises
        cap.release()
        cv2.destroyAllWindows()

# Run webcam detection if model is available
if best_model_path.exists():
    # Uncomment the line below to run webcam detection
    # run_webcam_detection(predictor, confidence_threshold=0.5)
    pass  # Keeps the `if` block valid while the call is commented out
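
`run_webcam_detection` accepts `confidence_threshold` but never applies it to the predictions. One way to honor it is to filter each result before using it, as sketched below. This assumes the result dict holds parallel sequences under `boxes`, `scores`, and `class_names`, as in the printing loop of section 5. `filter_detections` and the example values are hypothetical.

```python
def filter_detections(result, confidence_threshold=0.5):
    """Keep only detections whose score meets the threshold."""
    keep = [i for i, score in enumerate(result["scores"]) if score >= confidence_threshold]
    return {
        "boxes": [result["boxes"][i] for i in keep],
        "scores": [result["scores"][i] for i in keep],
        "class_names": [result["class_names"][i] for i in keep],
    }


# Made-up result, shaped like the dict used in section 5
example = {
    "boxes": [[10, 20, 110, 220], [30, 40, 60, 80], [5, 5, 50, 50]],
    "scores": [0.91, 0.42, 0.67],
    "class_names": ["person", "dog", "bicycle"],
}
print(filter_detections(example, confidence_threshold=0.6)["class_names"])
# ['person', 'bicycle']
```

Note that this filters only the returned lists. The image under `result["visualization"]` is drawn by the predictor and would still show every box it chose to render.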

## 7. Conclusion

In this notebook, we've demonstrated how to use the VisionDetect framework for object detection tasks. We covered:

1. Setting up the environment
2. Loading and preprocessing data
3. Creating and training a model
4. Evaluating model performance
5. Making predictions on new images
6. Running real-time inference with a webcam

The VisionDetect framework provides a comprehensive solution for object detection tasks, with support for different models, backbones, and deployment options. It's designed to be modular, extensible, and easy to use, making it suitable for a wide range of computer vision applications.