# Foundation Model Active Learning (FMAL)

Foundation Model Active Learning (FMAL) for autonomous robot object discovery.

FMAL fuses three vision-language foundation models -- GroundingDINO, DINO, and CLIP -- into a unified acquisition function for active learning. The system enables robots to efficiently discover and learn novel objects in unstructured environments with minimal human annotation.
## Installation

```bash
pip install cane-robotics
```

## Usage

```bash
# Run a single active learning experiment
cane-robotics run --images-dir data/images --labels-dir data/labels --classes box laptop chair

# Run all ablation variants across multiple seeds
cane-robotics ablations --images-dir data/images --labels-dir data/labels

# Evaluate sim-to-real transfer
cane-robotics sim2real --synthetic-dir data/synthetic --real-dir data/real

# Launch annotation GUI
cane-robotics annotate novel_detections/

# Plot experiment results
cane-robotics plot results/

# Generate synthetic training data (Isaac Sim)
cane-robotics generate --output-dir data/synthetic --num-scenes 50
```

## How it works

The active learning pipeline scores candidate object detections using three complementary signals:
- GroundingDINO -- open-vocabulary detection confidence
- DINO ViT -- class-agnostic attention saliency (filters background clutter)
- CLIP -- semantic novelty relative to known object classes
These are combined into a unified acquisition score:

```
score(x) = 0.5 * conf_gdino + 0.3 * attn_dino + 0.2 * sim_fg - 0.2 * sim_bg
```
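As a reference, the weighted fusion above can be sketched in a few lines of plain Python. The argument names mirror the symbols in the formula; this is an illustrative sketch, not the package's internal API:

```python
def acquisition_score(conf_gdino: float, attn_dino: float,
                      sim_fg: float, sim_bg: float) -> float:
    """Weighted fusion of the three VLM signals.

    The CLIP terms act as a foreground/background gate: similarity to
    foreground content adds to the score, background similarity subtracts.
    """
    return 0.5 * conf_gdino + 0.3 * attn_dino + 0.2 * sim_fg - 0.2 * sim_bg


# A confident, salient detection that looks unlike the background
# scores high; a low-confidence background patch scores near zero.
print(acquisition_score(0.9, 0.8, 0.6, 0.1))
```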
A temporal deduplication module tracks previously queried objects via embedding similarity, reducing redundant annotation queries by ~69%.
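The deduplication idea can be sketched as cosine-similarity matching against a memory of previously queried embeddings. The class name, threshold, and memory policy below are illustrative assumptions, not the actual `TemporalDeduplicator` implementation:

```python
import numpy as np


class EmbeddingDedup:
    """Skip proposals whose embedding is close to a previously queried one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # cosine-similarity cutoff (assumed value)
        self.memory: list[np.ndarray] = []

    def is_duplicate(self, emb: np.ndarray) -> bool:
        emb = emb / np.linalg.norm(emb)  # unit-normalize for cosine similarity
        for seen in self.memory:
            if float(emb @ seen) >= self.threshold:
                return True
        self.memory.append(emb)  # first sighting: remember it
        return False


dedup = EmbeddingDedup()
a = np.array([1.0, 0.0, 0.0])
print(dedup.is_duplicate(a))        # False: stored as a new object
print(dedup.is_duplicate(3.0 * a))  # True: same direction after normalization
```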
Each round, the top-scoring proposals are labeled (by human or oracle), added to the training set, and a YOLOv8 detector is retrained. The loop repeats until convergence.
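The loop can be summarized in sketch form; `oracle_label` and `retrain_yolo` are hypothetical placeholders standing in for the package's annotation and training components:

```python
def active_learning_loop(pipeline, unlabeled_images, oracle_label,
                         retrain_yolo, rounds=5, budget_per_round=20):
    """One FMAL run: score proposals, query top-k labels, retrain, repeat."""
    train_set = []
    for _ in range(rounds):
        # Score every candidate detection in the unlabeled pool.
        proposals = []
        for img in unlabeled_images:
            proposals.extend(pipeline.process_image(img)["novel_objects"])
        # Spend the annotation budget on the top-scoring proposals only.
        proposals.sort(key=lambda p: p["score"], reverse=True)
        for p in proposals[:budget_per_round]:
            train_set.append(oracle_label(p))
        # Retrain the downstream detector on the grown training set.
        retrain_yolo(train_set)
    return train_set
```

In practice the loop would also remove queried objects from the pool (the deduplication module handles this) and stop early on convergence; both are omitted here for brevity.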
## Project structure

```
cane_robotics/
  pipeline/      Core active learning pipeline, offline replay, ROS node
  models/        Foundation model wrappers (GDINO, CLIP, DINO, dedup)
  dataset/       Dataset management and augmentation
  config/        Experiment configuration (dataclasses + YAML)
  experiments/   Experiment runners, ablations, sim2real evaluation
  training/      YOLO training and dataset preparation
  sim/           Isaac Sim synthetic data generation
  tools/         Annotation GUI, result plotting
```
## Python API

```python
from cane_robotics import (
    ActiveLearningPipeline,
    create_gdino_pipeline,
    ExperimentConfig,
    DatasetManager,
    TemporalDeduplicator,
)

# Create pipeline with full multi-VLM acquisition
pipeline = create_gdino_pipeline(
    known_classes=["mug", "bowl", "can"],
    acquisition_type="full",
    enable_dedup=True,
)

# Process a single image
result = pipeline.process_image("frame_001.jpg")
for obj in result["novel_objects"]:
    print(f"{obj['label']} (score={obj['score']:.3f})")
```

## Acquisition variants

The experiment framework supports 8 acquisition function variants for systematic comparison:
| Variant | Description |
|---|---|
| `full` | All three VLM signals combined (default) |
| `random` | Random scoring baseline |
| `gdino_only` | GroundingDINO confidence only |
| `clip_only` | CLIP novelty signal only |
| `dino_only` | DINO attention only |
| `no_fg_bg_gate` | Full formula without foreground/background gating |
| `no_dedup` | Full scoring with deduplication disabled |
| `no_sam` | Full scoring with SAM splitting disabled |
## Dependencies

Core: numpy, pyyaml, torch, torchvision, ultralytics, opencv-python, Pillow, transformers

Optional extras:
- `[sim]` -- Isaac Sim for synthetic data generation
- `[dev]` -- pytest, ruff for development
## License

MIT