
BEHAVIOR Vision Suite


Stanford   USC   Harvard   Meta   UT Austin   UIUC

CVPR 2024 (Highlight)     Project Page

Overview

We introduce the BEHAVIOR Vision Suite (BVS), a toolkit for computer vision research. BVS builds on the extended object assets and scene instances of BEHAVIOR-1K and provides a customizable data generator that lets users produce photorealistic, physically plausible labeled data in a controlled manner. We demonstrate BVS with three representative applications.

Installation

Our code depends on OmniGibson. Please refer to its installation guide.

In addition to OmniGibson, users also need to install:

pip install fire
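
The `fire` package turns a Python function's arguments into command-line flags, which is presumably how the scripts below expose options such as `--scene_id`. A minimal, self-contained sketch (the example function is illustrative, not part of the BVS code):

```python
# Minimal illustration of how Python Fire exposes function arguments as CLI flags.
# The function below is only an example, not part of the BVS scripts.
import fire

def main(scene_id: int = 0, output_root: str = "output_data"):
    """Print the arguments that would be passed to a BVS-style script."""
    print(f"scene_id={scene_id}, output_root={output_root}")

if __name__ == "__main__":
    # Running `python example.py --scene_id=3` sets scene_id to 3.
    fire.Fire(main)
```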

Extended B1K Assets

Covering a wide range of object categories and scene types, our 3D assets have high visual and physical fidelity and rich annotations of semantic properties, allowing us to generate 1,000+ realistic scene configurations.

The left figure below shows examples of 3D objects and the semantic properties they support; the right figure shows the distributions of scenes, room types, and objects.

These extended assets will be merged into the BEHAVIOR datasets and released soon.

Scene Instance Augmentation

We enable the generation of diverse scene variations by altering furniture object models and incorporating additional everyday objects. Specifically, BVS can swap scene objects with alternative models from the same category, where categories are grouped based on visual and functional similarities. This randomization significantly varies scene appearances while maintaining the layouts' semantic integrity.

teaser.mp4
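
Conceptually, the replacement step can be sketched as follows; the scene representation and helper structures here are hypothetical stand-ins, not the BVS implementation (the actual script operates on OmniGibson scene files):

```python
# Illustrative sketch of category-preserving model replacement. The scene
# representation below is a hypothetical stand-in, not the BVS data format.
import random

def replace_models(scene_objects, candidate_models, replace_ratio=0.2):
    """Randomly swap each object's model for another model of the same category.

    scene_objects:    list of dicts like {"category": ..., "model": ..., "pose": ...}
    candidate_models: dict mapping category -> list of interchangeable model ids
    """
    for obj in scene_objects:
        alternatives = [m for m in candidate_models.get(obj["category"], [])
                        if m != obj["model"]]
        if alternatives and random.random() < replace_ratio:
            # Pose and category are kept, so the layout's semantics are preserved;
            # only the object's appearance changes.
            obj["model"] = random.choice(alternatives)
    return scene_objects
```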

How to augment a scene

We provide a script that automatically generates an augmented scene via object insertion and model replacement:

# this will save the augmented scene in the output_root (default is "output_data/aug_scene.json")
python augment_scene.py

Additional arguments are:

--scene_id (int) # ID of the raw scene to augment. Default is 0.
--num_insert (int) # Number of additional objects to insert. Default is 5. Set to 0 to skip object insertion.
--replace_ratio (float) # Probability that each object is replaced by a new model. Default is 0.2. Set to 0.0 to skip model replacement.
--save_json_name (str) # JSON file name for the augmented scene file. Default is aug_scene.json.
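
For example, the arguments can be combined as follows (values chosen only for illustration):

# insert 10 extra objects and replace each object model with probability 0.3
python augment_scene.py --scene_id=0 --num_insert=10 --replace_ratio=0.3 --save_json_name=my_aug_scene.json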

Applications

To showcase what BVS can support, we present three key applications, detailed below.

Holistic Scene Understanding

One of the major advantages of synthetic datasets, including BVS, is that they offer various types of labels (segmentation masks, depth maps, and bounding boxes) for the same set of input images. We believe this feature can fuel the development of versatile vision models that perform multiple perception tasks at the same time. We generated extensive traversal videos across representative scenes, each with 10+ camera trajectories. For each image, BVS generates various labels (e.g., scene graphs, segmentation masks, depth).

traverse.mp4

How to sample a trajectory

OmniGibson provides a series of scenes. For a given scene_id, we can sample a trajectory as follows:

# this will save its outputs in the output_root (default is "output_data")
python sample_fps.py --scene_id=<scene_id>

# this will save the trajectory poses in the output_root (default is "output_data/fps/poses.npy")
# also save the rendered video in the output_root (default is "output_data/fps/video.mp4")
python collect_data.py --scene_id=<scene_id>

# this will sample a trajectory in the augmented scene instance you generated before:
python collect_data.py --scene_id=<scene_id> --scene_file=<path to your saved json scene file>
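
After collection, the saved trajectory can be loaded for inspection or downstream use. A minimal sketch, assuming poses.npy stores one camera pose per rendered frame (the exact array layout is not documented here):

```python
# Minimal sketch: load the saved camera trajectory for inspection.
# Assumption: poses.npy stores one camera pose per rendered frame.
import numpy as np

poses = np.load("output_data/fps/poses.npy")
print("number of poses:", len(poses))
print("first pose:", poses[0])
```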

Parametric Model Evaluation

Parametric model evaluation is essential for developing and understanding perception models, enabling a systematic assessment of performance robustness against various domain shifts. Leveraging the flexibility of the simulator, our generator extends parametric evaluation to more diverse axes, including scene, camera, and object state changes.
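
As a rough illustration of the evaluation loop (not the BVS API), the sketch below sweeps a single axis and records a score per parameter value; the render and evaluate callables are placeholders that a user would supply:

```python
# Illustrative sketch of a parametric sweep over one evaluation axis.
# The render/evaluate callables are user-supplied placeholders, not BVS APIs.
from typing import Callable, Dict, Sequence

def parametric_sweep(values: Sequence[float],
                     render: Callable[[float], object],
                     evaluate: Callable[[object], float]) -> Dict[float, float]:
    """Render the scene at each parameter value (e.g., camera pitch) and record
    the model's score, so performance can be plotted against the axis."""
    return {v: evaluate(render(v)) for v in values}

# Example with trivial stand-ins, just to show the call pattern:
scores = parametric_sweep(
    values=[0, 15, 30, 45, 60],             # e.g., camera pitch in degrees
    render=lambda pitch: {"pitch": pitch},  # stand-in for rendering a frame
    evaluate=lambda frame: 1.0,             # stand-in for a model metric
)
print(scores)
```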

Below we show model predictions along the Articulation, Visibility, Lighting, Zoom, and Pitch axes, respectively.

articulation.mp4
visibility.mp4
lighting.mp4
zoom.mp4
pitch.mp4

Object States and Relations Prediction

Users can also leverage BVS to generate training data with specific object configurations that are difficult to accumulate or annotate in the real world. We illustrate BVS's practical value by synthesizing a dataset that supports training a vision model capable of zero-shot transfer to real-world images on the task of object relationship prediction.

predicate_prediction.mp4
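
BVS obtains such relations directly from the simulator's object states; purely as a toy illustration of the kind of spatial predicate being predicted, one could test an "on top of" relation from axis-aligned 3D boxes:

```python
# Toy illustration of one spatial predicate ("on top of") computed from
# axis-aligned 3D boxes. This is NOT the BVS labeling code; BVS reads such
# relations directly from the simulator's object states.
def on_top_of(box_a, box_b, touch_tol=0.02):
    """box = (xmin, ymin, zmin, xmax, ymax, zmax); z is up.

    A is 'on top of' B if A's bottom rests near B's top and their
    horizontal footprints overlap."""
    ax0, ay0, az0, ax1, ay1, az1 = box_a
    bx0, by0, bz0, bx1, by1, bz1 = box_b
    resting = abs(az0 - bz1) <= touch_tol
    overlap_x = min(ax1, bx1) > max(ax0, bx0)
    overlap_y = min(ay1, by1) > max(ay0, by0)
    return resting and overlap_x and overlap_y

# Example: a mug sitting on a table -> True
print(on_top_of((0.4, 0.4, 0.75, 0.6, 0.6, 0.85),   # mug
                (0.0, 0.0, 0.00, 1.0, 1.0, 0.75)))  # table
```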

Citation

If you find our project helpful, please cite our paper: