
## Overview

The 2D-3D-S dataset provides a variety of mutually registered modalities from 2D, 2.5D, and 3D domains, with instance-level semantic and geometric annotations. It covers over 6,000 m² and contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, and global XYZ images (all in the form of both regular and 360° equirectangular images), as well as camera information. It also includes registered raw and semantically annotated 3D meshes and point clouds. The dataset enables the development of joint and cross-modal learning models, and potentially of unsupervised approaches that exploit the regularities present in large-scale indoor spaces.

In more detail, the dataset was collected in 6 large-scale indoor areas that originate from 3 different buildings of mainly educational and office use. For each area, all modalities are registered in the same reference system, yielding pixel-to-pixel correspondences among them. In a nutshell, the dataset contains a total of 70,496 regular RGB and 1,413 equirectangular RGB images, along with their corresponding depths, surface normals, semantic annotations, global XYZ images in OpenEXR format, and camera metadata. In addition, we provide whole-building 3D reconstructions as textured meshes, as well as the corresponding 3D semantic meshes. We also include the colored 3D point cloud data of these areas, with a total of 695,878,620 points, which was previously presented in the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS).
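Since the global XYZ images are stored as OpenEXR files, below is a minimal sketch of reading one into a NumPy array, assuming an OpenCV build with OpenEXR support; the file path is hypothetical, and the R→X, G→Y, B→Z channel mapping is an assumption to verify against your download.

```python
# Hedged sketch: load a global XYZ OpenEXR image with OpenCV.
# Recent OpenCV builds disable EXR I/O unless this variable is set
# before the import.
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"

import cv2
import numpy as np

def load_xyz_exr(path):
    """Return an (H, W, 3) float32 array of per-pixel world XYZ coordinates."""
    img = cv2.imread(path, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
    if img is None:
        raise IOError("could not read %s" % path)
    # OpenCV returns channels in BGR order; flipping assumes the file
    # stores X, Y, Z in the R, G, B channels (an assumption to verify).
    return np.asarray(img[..., ::-1], dtype=np.float32)

xyz = load_xyz_exr("area_1/data/global_xyz/example_domain_global_xyz.exr")  # hypothetical path
print(xyz.shape, xyz.dtype)
```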

For more information on the dataset visit: http://3Dsemantics.stanford.edu


## Dataset Modalities

### 3D modalities

The dataset contains colored point clouds and textured meshes for each scanned area. 3D semantic annotations for objects and scenes are offered for both modalities, with point-level and face-level labels respectively. The annotations were initially performed on the point cloud and then projected onto the closest surface of the 3D mesh model. Faces in the 3D mesh that account for no projected points belong to non-annotated parts of the dataset and are labeled with a default null value. We also provide the tightest axis-aligned bounding box of each object instance and further voxelize it into a 6x6x6 grid with binary occupancy and point correspondence; a sketch of this voxelization follows.
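The voxelization described above can be illustrated with a short sketch: compute the tightest axis-aligned bounding box of an object's points, divide it into a 6x6x6 grid, and record binary occupancy plus a point-to-voxel correspondence. This is an illustrative re-implementation, not the dataset's reference code.

```python
import numpy as np

def voxelize(points, grid=6):
    """points: (N, 3) array of one object's XYZ coordinates.
    Returns (occupancy, voxel_idx): a (grid, grid, grid) boolean grid
    and an (N, 3) array mapping each point to its voxel cell."""
    lo, hi = points.min(axis=0), points.max(axis=0)   # tightest AABB
    size = np.maximum(hi - lo, 1e-9)                  # guard against flat boxes
    idx = np.floor((points - lo) / size * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)                   # points on the max face
    occ = np.zeros((grid, grid, grid), dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ, idx

pts = np.random.rand(1000, 3)   # stand-in for one object's point cloud
occ, idx = voxelize(pts)
print(occ.sum(), "of", occ.size, "voxels occupied")
```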

### 2D modalities

The dataset contains densely sampled RGB images per scan location. These images were sampled from equirectangular images that were generated per scan location and modality using the raw data captured by the scanner. All images in the dataset are stored in full high definition at 1080x1080 resolution. For more details on the random sampling of RGB images, see Section 4.2 of the paper. We provide the camera metadata for each generated image.
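As a usage sketch, pairing an RGB image with its camera metadata might look like the following; the directory layout and file names are assumptions about the release format, so adjust them to the files in your download.

```python
import json
from PIL import Image

# Hypothetical paths; the actual naming scheme may differ.
rgb_path = "area_1/data/rgb/example_domain_rgb.png"
pose_path = "area_1/data/pose/example_domain_pose.json"

image = Image.open(rgb_path)        # 1080x1080 RGB image
with open(pose_path) as f:
    pose = json.load(f)             # per-image camera metadata

print(image.size, sorted(pose.keys()))
```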

We also provide depth images, which were computed on the 3D mesh rather than directly on the raw scanner data, as well as surface normal images. 2D semantic annotations are computed for each image by projecting the 3D mesh labels onto the image plane. Due to certain geometric artifacts present in the mesh model, mainly stemming from the level of detail in the reconstruction, the 2D annotations occasionally show small local misalignments with the underlying pixels, especially for points close to the camera. This issue can be easily addressed by fusing image content with the projected annotations using graphical models. The dataset also includes 3D-coordinate-encoded images, where each pixel encodes the X, Y, Z location of the corresponding point in the world coordinate system. Last, an equirectangular projection is also provided per scan location and modality.
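For the depth images, a hedged decoding sketch follows. The 16-bit PNG container, the 1/512 m per unit scale, and the 65535 missing-value sentinel are assumptions about the release format; verify them against the dataset documentation before relying on the values.

```python
import numpy as np
from PIL import Image

def load_depth(path, scale=1.0 / 512.0, missing=2**16 - 1):
    """Decode a 16-bit depth PNG into meters, with invalid pixels as NaN."""
    raw = np.asarray(Image.open(path))       # integer depth counts
    depth = raw.astype(np.float32) * scale   # assumed 1/512 m per unit
    depth[raw == missing] = np.nan           # assumed missing-value sentinel
    return depth

depth = load_depth("area_1/data/depth/example_domain_depth.png")  # hypothetical path
print(np.nanmin(depth), np.nanmax(depth))
```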


## Train and Test Splits

Certain areas in the dataset represent parts of buildings with similarities in their appearance and architectural features, so we define standard training and testing splits such that no areas from similar-looking buildings appear in both. We split the 6 areas as shown in the table below and follow a 3-fold cross-validation scheme; a small code sketch of the folds follows the table.
| Fold # | Training (Area #) | Testing (Area #) |
|--------|-------------------|------------------|
| 1      | 1, 2, 3, 4, 6     | 5                |
| 2      | 1, 3, 5, 6        | 2, 4             |
| 3      | 2, 4, 5           | 1, 3, 6          |
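For convenience, the folds above can be encoded directly in code; the helper below is a minimal sketch, and the area_5a/area_5b folder names are assumptions based on the note in the next section.

```python
# 3-fold cross-validation splits from the table above.
FOLDS = {
    1: {"train": [1, 2, 3, 4, 6], "test": [5]},
    2: {"train": [1, 3, 5, 6],    "test": [2, 4]},
    3: {"train": [2, 4, 5],       "test": [1, 3, 6]},
}

def area_folders(area):
    """Map an area number to its folder name(s); area 5 ships in two parts."""
    return ["area_5a", "area_5b"] if area == 5 else ["area_%d" % area]

for fold, split in FOLDS.items():
    train = [d for a in split["train"] for d in area_folders(a)]
    test = [d for a in split["test"] for d in area_folders(a)]
    print("fold %d: train=%s test=%s" % (fold, train, test))
```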

## Notes
- The data was collected in 6 large-scale spaces, but the dataset contains 7 area folders. Due to the very large surface covered by area 5, data collection had to be split in two in order not to jeopardize quality. Hence, area 5 is contained in two folders, area 5a and area 5b.