### Balanced Subset Creation from ScanNet Dataset


#### About the ScanNet Dataset

The ScanNet dataset is a collection of annotated RGB-D videos of indoor environments, aimed primarily at 3D scene understanding tasks such as segmentation, reconstruction, and semantic labeling. It contains about 2.5 million frames across more than 1,500 scans.

Key Features:

- RGB-D Sequences: Captures video streams with both RGB and depth data.
- 3D Reconstructions: Includes 3D meshes (.ply files) reconstructed from RGB-D sequences.
- Semantic Annotations: Manually labeled scene objects with semantic categories.
- Benchmark Splits: Defined splits for training, validation, and testing.


Data Structure:

Each scene is stored in a directory named `scene<spaceId>_<scanId>`. Key contents include:

- Reconstructed Meshes: 3D .ply files for entire scenes.
- Frames: RGB and depth frames stored in a compressed .sens format.
- Annotations: Semantic labels and segmentation.

Associated Files:
- `scannet-labels.combined.tsv`: Maps raw annotation labels to semantic categories (NYU40, Eigen13, ModelNet, ShapeNet, WordNet).
- `scannetv2_train.txt`, `scannetv2_val.txt`, `scannetv2_test.txt`: List the scene IDs in the training, validation, and test splits of the 2D dataset.
- `train_scans.txt`, `val_scans.txt`: List the scan IDs in the training and validation splits of the 3D dataset.
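
The directory naming scheme can be checked programmatically. The following is a minimal sketch, assuming space IDs are four digits and scan IDs two digits (as in names such as `scene0000_00`); `parse_scene_id` is a hypothetical helper and not part of any ScanNet tooling.

```python
import re

# Assumed pattern: "scene" + 4-digit space ID + "_" + 2-digit scan ID.
SCENE_ID_PATTERN = re.compile(r"scene(\d{4})_(\d{2})")


def parse_scene_id(scene_id):
    """Split a scene directory name into (space_id, scan_id) integers."""
    match = SCENE_ID_PATTERN.fullmatch(scene_id)
    if match is None:
        raise ValueError(f"Not a scene<spaceId>_<scanId> name: {scene_id!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_scene_id("scene0191_02"))
```

Raising on malformed names, rather than returning `None`, is a design choice that makes a mistyped entry in a list file fail early.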

In [40]:
import os 
import random
import shutil
import pandas as pd

# Data paths
original_data = '/Volumes/datasets/scannet/scans'  # original ScanNet scans
data_path = '/Volumes/projects/open3dsg/data/SCANNET'  # split lists and label mapping
subset_path = '/Volumes/projects/open3dsg/data/subset_scannet'  # output folder for the subset lists

#### Dataset General Overview

##### ScanNet 2D

In [54]:
def load_txt_file(file_path):
    """Return the non-empty, whitespace-stripped lines of a text file."""
    with open(file_path, 'r') as file:
        return [line.strip() for line in file if line.strip()]

train_files = load_txt_file(os.path.join(data_path, 'scannetv2_train.txt'))
val_files = load_txt_file(os.path.join(data_path,'scannetv2_val.txt'))
test_files = load_txt_file(os.path.join(data_path,'scannetv2_test.txt'))
total_data = len(train_files) + len(val_files) + len(test_files)

print('2D SCANNET OVERVIEW: ')
print(f"    Number of training files: {len(train_files)}")
print(f"    Number of validation files: {len(val_files)}")
print(f"    Number of test files: {len(test_files)}")
print(f"    Number of TOTAL files: {total_data}")
print(f"        Sample training files: {train_files[:3]}\n")


2D SCANNET OVERVIEW: 
    Number of training files: 1201
    Number of validation files: 312
    Number of test files: 100
    Number of TOTAL files: 1613
        Sample training files: ['scene0191_00', 'scene0191_01', 'scene0191_02']
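
The sample above shows three scans (`_00`, `_01`, `_02`) of the same space, `scene0191`. Because several scans can share one space, it can help to count scans per space before drawing a subset. A minimal sketch, assuming the `scene<spaceId>_<scanId>` naming; `scans_per_space` is a hypothetical helper and `example_ids` is an illustrative list only.

```python
from collections import Counter


def scans_per_space(scene_ids):
    """Count how many scans belong to each space (the part before the last '_')."""
    return Counter(scene_id.rsplit("_", 1)[0] for scene_id in scene_ids)


# Illustrative input; in the notebook this could be the loaded training list.
example_ids = ["scene0191_00", "scene0191_01", "scene0191_02", "scene0547_00"]
counts = scans_per_space(example_ids)
print(counts.most_common())
```

In the notebook, the same call could be applied to `train_files` to see which spaces dominate the split.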



##### ScanNet 3D

In [55]:
train_files_3d = load_txt_file(os.path.join(data_path, 'train_scans.txt'))
val_files_3d = load_txt_file(os.path.join(data_path,'val_scans.txt'))
total_data = len(train_files_3d) + len(val_files_3d)

print('3D SCANNET OVERVIEW: ')
print(f"    Number of training files: {len(train_files_3d)}")
print(f"    Number of validation files: {len(val_files_3d)}")
print(f"    Number of TOTAL files: {total_data}")
print(f"        Sample training files: {train_files_3d[:3]}\n")

3D SCANNET OVERVIEW: 
    Number of training files: 1178
    Number of validation files: 157
    Number of TOTAL files: 1335
        Sample training files: ['7272e161-a01b-20f6-8b5a-0b97efeb6545', '7272e182-a01b-20f6-89b8-3bdec0091c89', '7272e189-a01b-20f6-8a2e-05b6c8395143']



In [None]:
labels_df = pd.read_csv(os.path.join(data_path, 'scannet-labels.combined.tsv'), sep='\t')

unique_labels = labels_df['category'].unique()
print(f"    Number of unique labels: {len(unique_labels)}")
print(f"       - {unique_labels}")
labels_df.head()

    Number of unique labels: 1163
       - ['wall' 'chair' 'floor' ... 'stove top' 'monitor from pc' 'stick']


Unnamed: 0,category,count,nyuId,nyu40id,eigen13id,nyuClass,nyu40class,eigen13class,ModelNet40,ModelNet10,ShapeNetCore55,synsetoffset,wnsynsetid,wnsynsetkey
0,wall,7274,21.0,1.0,12.0,wall,wall,Wall,,,,,n04546855,wall.n.01
1,chair,5419,5.0,5.0,4.0,chair,chair,Chair,chair,chair,chair,3001627.0,n03001627,chair.n.01
2,floor,3910,11.0,2.0,5.0,floor,floor,Floor,,,,,n03365592,floor.n.01
3,table,2664,19.0,7.0,10.0,table,table,Table,table,table,table,4379243.0,n04379243,table.n.02
4,door,1400,28.0,8.0,12.0,door,door,Wall,door,,,,n03221720,door.n.01
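
The columns above show how a raw label relates to coarser benchmark categories. As a minimal sketch, the snippet below builds a `category` to `eigen13class` lookup with the standard `csv` module; the inline sample reproduces three rows and three columns from the table above, and `build_label_map` is a hypothetical helper.

```python
import csv
import io

# Three rows copied from the table above, restricted to three columns.
SAMPLE_TSV = (
    "category\tnyu40class\teigen13class\n"
    "wall\twall\tWall\n"
    "chair\tchair\tChair\n"
    "door\tdoor\tWall\n"
)


def build_label_map(tsv_file, key="category", value="eigen13class"):
    """Map each raw label to the chosen category column, skipping empty values."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[key]: row[value] for row in reader if row[value]}


label_map = build_label_map(io.StringIO(SAMPLE_TSV))
print(label_map)
```

With the DataFrame already loaded in the notebook, `labels_df.set_index('category')['eigen13class'].to_dict()` would give a comparable lookup, although it keeps missing values as NaN instead of skipping them.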


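The identifiers in the 3D lists, such as `7272e161-a01b-20f6-8b5a-0b97efeb6545`, are UUID-style strings (Universally Unique Identifiers) rather than directory names of the form `scene<spaceId>_<scanId>` (e.g. `scene0000_00`). Note that `scannet-labels.combined.tsv`, as loaded above, contains only label mappings (columns such as `category`, `nyu40class`, and `eigen13class`) and no scan identifiers, so it cannot map these UUIDs to scene directories. The format resembles the scan IDs used by the 3RScan dataset, which suggests that `train_scans.txt` and `val_scans.txt` may list 3RScan scans rather than ScanNet scenes; this is an assumption and should be verified against the source of these files.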
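
The two identifier formats seen in the list files can be told apart mechanically. A minimal sketch, assuming scene names follow `scene` + four digits + `_` + two digits and UUIDs use the canonical 8-4-4-4-12 hexadecimal layout; `classify_scan_id` is a hypothetical helper.

```python
import re

# Assumed formats, based on the identifiers printed in this notebook.
SCENE_RE = re.compile(r"scene\d{4}_\d{2}")
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def classify_scan_id(scan_id):
    """Return 'scene', 'uuid', or 'unknown' depending on the identifier format."""
    if SCENE_RE.fullmatch(scan_id):
        return "scene"
    if UUID_RE.fullmatch(scan_id.lower()):
        return "uuid"
    return "unknown"


for scan_id in ["scene0191_00", "7272e161-a01b-20f6-8b5a-0b97efeb6545", "readme.txt"]:
    print(scan_id, classify_scan_id(scan_id))
```

A check like this can flag a list file that mixes both formats before any folders are looked up on disk.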


----

#### Create the dataset subset

In [59]:
def create_subset_2d(input_txt_path, subset_path, subset_name, subset_size=10):
    """
    Randomly selects a given number of scene IDs from a list file and writes them to a new list file.
        :param input_txt_path (str): Path to the input .txt file containing one scene ID per line.
        :param subset_path (str): Folder in which the subset .txt file is saved (created if missing).
        :param subset_name (str): Name for the subset (e.g., 'train', 'test', 'val').
        :param subset_size (int): Number of scene IDs to include in the subset.
    Note: only the list file is written; no scene folders are copied.
    Raises ValueError (from random.sample) if subset_size exceeds the number of listed IDs.
    """
    # Create folder for the subset
    os.makedirs(subset_path, exist_ok=True)
    print(f"Subset folder created or already exists at: {subset_path}")

    # Load the original list of files
    with open(input_txt_path, 'r') as file:
        # Skip blank lines so they cannot be sampled as empty folder names
        file_list = [line.strip() for line in file if line.strip()]
    print(f"Loaded {len(file_list)} folders from {input_txt_path}.")

    # Randomly select the subset
    random_subset = random.sample(file_list, subset_size)
    print(f"Randomly selected {len(random_subset)} folders for the {subset_name} subset.")

    # Store the subset list inside a .txt file
    subset_list_file = os.path.join(subset_path, f'scannetv2_{subset_name}.txt')
    with open(subset_list_file, 'w') as file:
        for folder in random_subset:
            file.write(folder + '\n')
    print(f"Subset list for {subset_name} saved to: {subset_list_file}")

    # Optional: Print the selected folders
    print(f"Selected folders for the {subset_name} subset:")
    for folder in random_subset:
        print(folder)
    print('\n\n')


def create_subset_3d(input_txt_path, subset_path, subset_name, subset_size=10):
    """
    Randomly selects a given number of scan IDs from a list file and writes them to a new list file.
        :param input_txt_path (str): Path to the input .txt file containing one scan ID per line.
        :param subset_path (str): Folder in which the subset .txt file is saved (created if missing).
        :param subset_name (str): Name for the subset (e.g., 'train', 'val').
        :param subset_size (int): Number of scan IDs to include in the subset.
    Note: only the list file is written; no scan folders are copied.
    Raises ValueError (from random.sample) if subset_size exceeds the number of listed IDs.
    """
    # Create folder for the subset
    os.makedirs(subset_path, exist_ok=True)
    print(f"Subset folder created or already exists at: {subset_path}")

    # Load the original list of files
    with open(input_txt_path, 'r') as file:
        # Skip blank lines so they cannot be sampled as empty folder names
        file_list = [line.strip() for line in file if line.strip()]
    print(f"Loaded {len(file_list)} folders from {input_txt_path}.")

    # Randomly select the subset
    random_subset = random.sample(file_list, subset_size)
    print(f"Randomly selected {len(random_subset)} folders for the {subset_name} subset.")

    # Store the subset list inside a .txt file
    subset_list_file = os.path.join(subset_path, f'{subset_name}_scans.txt')
    with open(subset_list_file, 'w') as file:
        for folder in random_subset:
            file.write(folder + '\n')
    print(f"Subset list for {subset_name} saved to: {subset_list_file}")

    # Optional: Print the selected folders
    print(f"Selected folders for the {subset_name} subset:")
    for folder in random_subset:
        print(folder)
    print('\n\n')
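
`random.sample` draws a different subset on every run unless the generator is seeded, so the subset lists written above change each time the notebook is executed. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for this workflow; `sample_ids` and its `seed` parameter are hypothetical and not part of the functions above.

```python
import random


def sample_ids(ids, subset_size, seed=0):
    """Draw a reproducible random subset without touching the global RNG state."""
    if subset_size > len(ids):
        raise ValueError(
            f"Requested {subset_size} items but only {len(ids)} are available."
        )
    rng = random.Random(seed)  # local generator; other users of `random` are unaffected
    return rng.sample(ids, subset_size)


ids = [f"scene{i:04d}_00" for i in range(20)]  # illustrative IDs only
first = sample_ids(ids, 5, seed=42)
second = sample_ids(ids, 5, seed=42)
print(first == second)
```

Using a local `random.Random` instance instead of `random.seed` keeps the draw deterministic even if other cells consume random numbers in between.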

In [60]:
# 2D ScanNet subset
create_subset_2d(os.path.join(data_path, 'scannetv2_train.txt'), subset_path, 'train', subset_size=10)
create_subset_2d(os.path.join(data_path, 'scannetv2_val.txt'), subset_path, 'val', subset_size=10)
create_subset_2d(os.path.join(data_path, 'scannetv2_test.txt'), subset_path, 'test', subset_size=10)

Subset folder created or already exists at: /Volumes/projects/open3dsg/data/subset_scannet
Loaded 1201 folders from /Volumes/projects/open3dsg/data/SCANNET/scannetv2_train.txt.
Randomly selected 10 folders for the train subset.
Subset list for train saved to: /Volumes/projects/open3dsg/data/subset_scannet/scannetv2_train.txt
Selected folders for the train subset:
scene0547_00
scene0480_00
scene0173_02
scene0336_00
scene0242_02
scene0687_00
scene0092_04
scene0359_01
scene0039_00
scene0038_02



Subset folder created or already exists at: /Volumes/projects/open3dsg/data/subset_scannet
Loaded 312 folders from /Volumes/projects/open3dsg/data/SCANNET/scannetv2_val.txt.
Randomly selected 10 folders for the val subset.
Subset list for val saved to: /Volumes/projects/open3dsg/data/subset_scannet/scannetv2_val.txt
Selected folders for the val subset:
scene0100_01
scene0595_00
scene0664_00
scene0355_01
scene0412_00
scene0203_02
scene0131_00
scene0351_01
scene0169_01
scene0146_01



Subset folder

In [61]:
# 3D ScanNet subset
create_subset_3d(os.path.join(data_path, 'train_scans.txt'), subset_path, 'train', subset_size=10)
create_subset_3d(os.path.join(data_path, 'val_scans.txt'), subset_path, 'val', subset_size=10)

Subset folder created or already exists at: /Volumes/projects/open3dsg/data/subset_scannet
Loaded 1178 folders from /Volumes/projects/open3dsg/data/SCANNET/train_scans.txt.
Randomly selected 10 folders for the train subset.
Subset list for train saved to: /Volumes/projects/open3dsg/data/subset_scannet/train_scans.txt
Selected folders for the train subset:
2a7f9476-080c-26f9-86e9-c7ce1c76fc07
bcb0fe2f-4f39-2c70-9eaf-a074d1b3e47b
5341b7bf-8a66-2cdd-8794-026113b7c312
6bde609b-9162-246f-8f90-c3d2444a5ab8
bcb0fe15-4f39-2c70-9f48-a26b76dfe042
5104a9c9-adc4-2a85-917e-92cb27d635fb
1d233fea-e280-2b1a-8f15-cd60d12ae197
20c993c1-698f-29c5-86b8-50a2a0907e2b
4e858c81-fd93-2cb4-8469-d9226116b5de
eee5b052-ee2d-28f4-99fd-c3c5380db25e



Subset folder created or already exists at: /Volumes/projects/open3dsg/data/subset_scannet
Loaded 157 folders from /Volumes/projects/open3dsg/data/SCANNET/val_scans.txt.
Randomly selected 10 folders for the val subset.
Subset list for val saved to: /Volumes/projects/op