# BIDS Schema Integration Demo

This notebook demonstrates the BIDS schema integration in PyBIDS and compares the new schema-based configuration with the legacy JSON-based configuration.

In [1]:
from bids import BIDSLayout
from bids.tests import get_test_data_path
import os

## Setup: Using a test dataset

We'll use the 7T TRT dataset that comes with PyBIDS for testing.

In [2]:
# Get path to test dataset
data_path = os.path.join(get_test_data_path(), '7t_trt')
print(f"Using dataset at: {data_path}")

# List the first few files to see the structure
!ls -la {data_path} | head -10

Using dataset at: /home/ashley/repos/bids/pybids/tests/data/7t_trt
total 76
drwxr-xr-x 12 ashley ashley 4096 Aug 21 08:59 .
drwxr-xr-x  8 ashley ashley 4096 Aug 21 08:59 ..
-rw-r--r--  1 ashley ashley    0 Aug 21 08:59 README
-rw-r--r--  1 ashley ashley   55 Aug 21 08:59 dataset_description.json
-rw-r--r--  1 ashley ashley  482 Aug 21 08:59 participants.tsv
drwxr-xr-x  4 ashley ashley 4096 Aug 21 08:59 sub-01
drwxr-xr-x  4 ashley ashley 4096 Aug 21 08:59 sub-02
drwxr-xr-x  4 ashley ashley 4096 Aug 21 08:59 sub-03
drwxr-xr-x  4 ashley ashley 4096 Aug 21 08:59 sub-04


## Part 1: Legacy JSON-based Configuration

The traditional PyBIDS approach uses JSON configuration files to define entities and their patterns.
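
To make "entities and their patterns" concrete, here is a minimal sketch of the idea: each entity is a name paired with a regular expression whose first group captures the value. The patterns and the `parse_entities` helper below are simplified, hypothetical stand-ins written for this notebook, not the contents of the config file that PyBIDS ships.

```python
import re

# Hypothetical, simplified entity definitions (NOT the real PyBIDS config).
# Each pattern's first capture group holds the entity value.
ENTITY_PATTERNS = {
    "subject": r"sub-([a-zA-Z0-9]+)",
    "session": r"ses-([a-zA-Z0-9]+)",
    "task": r"task-([a-zA-Z0-9]+)",
    "acquisition": r"acq-([a-zA-Z0-9]+)",
    "run": r"run-(\d+)",
    "suffix": r"_([a-zA-Z0-9]+)\.[^/\\]+$",
}


def parse_entities(filename):
    """Return the entities whose pattern matches the given file name."""
    entities = {}
    for name, pattern in ENTITY_PATTERNS.items():
        match = re.search(pattern, filename)
        if match:
            entities[name] = match.group(1)
    return entities


print(parse_entities("sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz"))
# {'subject': '01', 'session': '1', 'task': 'rest',
#  'acquisition': 'fullbrain', 'run': '1', 'suffix': 'bold'}
```

An entity whose pattern does not match is simply left out, which is why a file without `run-` in its name has no `run` entity.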

In [3]:
# Initialize layout with legacy config (this is the default)
layout_legacy = BIDSLayout(data_path, config=['bids'])
print(f"Legacy layout initialized with {len(layout_legacy.get())} files")
print(f"\nAvailable entities: {sorted(layout_legacy.get_entities().keys())}")

# Print basic layout information, as in the docs tutorial
print("\nLayout info:")
print(layout_legacy)

Legacy layout initialized with 339 files

Available entities: ['CogAtlasID', 'Columns', 'EchoTime', 'EchoTime1', 'EchoTime2', 'EffectiveEchoSpacing', 'IntendedFor', 'MTState', 'PhaseEncodingDirection', 'RepetitionTime', 'SamplingFrequency', 'SliceEncodingDirection', 'SliceTiming', 'StartTime', 'TaskName', 'acquisition', 'ceagent', 'chunk', 'datatype', 'direction', 'echo', 'extension', 'flip', 'fmap', 'inv', 'modality', 'mt', 'nucleus', 'part', 'proc', 'reconstruction', 'recording', 'run', 'sample', 'scans', 'session', 'space', 'staining', 'subject', 'suffix', 'task', 'tracer', 'tracksys', 'volume']

Layout info:
BIDS Layout: .../bids/pybids/tests/data/7t_trt | Subjects: 10 | Sessions: 20 | Runs: 20


In [4]:
# Query example: Get all BOLD files for subject 01
bold_files_legacy = layout_legacy.get(
    subject='01', 
    suffix='bold', 
    extension='nii.gz',
    return_type='filename'  # Returns filenames instead of BIDSFile objects
)
print(f"Found {len(bold_files_legacy)} BOLD files for subject 01:")
for f in bold_files_legacy:
    print(f"  {os.path.basename(f)}")

# Show the difference between return types, as in the docs tutorial
print("\nCompare return types:")
bold_objects = layout_legacy.get(subject='01', suffix='bold', extension='nii.gz')[:2]
bold_filenames = layout_legacy.get(subject='01', suffix='bold', extension='nii.gz', return_type='filename')[:2]

# First line printed: BIDSFile objects (the default); second line: plain path strings
print(f"return_types={[type(obj) for obj in bold_objects]}")
print(f"return_types={[type(fname) for fname in bold_filenames]}")

Found 6 BOLD files for subject 01:
  sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz
  sub-01_ses-1_task-rest_acq-fullbrain_run-2_bold.nii.gz
  sub-01_ses-1_task-rest_acq-prefrontal_bold.nii.gz
  sub-01_ses-2_task-rest_acq-fullbrain_run-1_bold.nii.gz
  sub-01_ses-2_task-rest_acq-fullbrain_run-2_bold.nii.gz
  sub-01_ses-2_task-rest_acq-prefrontal_bold.nii.gz

Compare return types:
return_types=[<class 'bids.layout.models.BIDSImageFile'>, <class 'bids.layout.models.BIDSImageFile'>]
return_types=[<class 'str'>, <class 'str'>]


## Part 2: New Schema-based Configuration

The new approach uses the BIDS schema directly via `bidsschematools` to understand the dataset structure.
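
For orientation, the schema can also be inspected directly. The sketch below assumes `bidsschematools` is installed and that `load_schema()` returns a namespace whose `objects.entities` section maps each long entity name to a definition with a short `name` key. That layout is an assumption about the installed schema version, and `schema_entity_keys` is a hypothetical helper, not part of PyBIDS.

```python
try:
    from bidsschematools.schema import load_schema
except ImportError:  # bidsschematools is not installed
    load_schema = None


def schema_entity_keys():
    """Map long entity names (e.g. 'subject') to filename keys (e.g. 'sub').

    Returns an empty dict when bidsschematools is unavailable.
    """
    if load_schema is None:
        return {}
    schema = load_schema()
    # Assumed layout: objects.entities[<long name>]["name"] is the short key.
    return {
        long_name: definition["name"]
        for long_name, definition in schema.objects.entities.items()
    }


keys = schema_entity_keys()
for long_name in ("subject", "session", "inversion", "mtransfer"):
    print(long_name, "->", keys.get(long_name, "<not available>"))
```

If the assumption holds, this gives a quick way to see which long names the schema-based layout is likely to expose as entities.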

In [5]:
# Initialize layout with schema-based config
# This uses Config.load('bids-schema') internally when you pass 'bids-schema' in config
layout_schema = BIDSLayout(data_path, config=['bids-schema'])
print(f"Schema-based layout initialized with {len(layout_schema.get())} files")
print(f"\nAvailable entities: {sorted(layout_schema.get_entities().keys())}")

# Print basic layout information
print("\nLayout info:")
print(layout_schema)

Schema-based layout initialized with 339 files

Available entities: ['CogAtlasID', 'Columns', 'EchoTime', 'EchoTime1', 'EchoTime2', 'EffectiveEchoSpacing', 'IntendedFor', 'MTState', 'PhaseEncodingDirection', 'RepetitionTime', 'SamplingFrequency', 'SliceEncodingDirection', 'SliceTiming', 'StartTime', 'TaskName', 'acquisition', 'ceagent', 'chunk', 'datatype', 'density', 'description', 'direction', 'echo', 'extension', 'flip', 'hemisphere', 'inversion', 'label', 'modality', 'mtransfer', 'nucleus', 'part', 'processing', 'reconstruction', 'recording', 'resolution', 'run', 'sample', 'segmentation', 'session', 'space', 'split', 'stain', 'subject', 'suffix', 'task', 'tracer', 'tracksys', 'volume']

Layout info:
BIDS Layout: .../bids/pybids/tests/data/7t_trt | Subjects: 10 | Sessions: 20 | Runs: 20


In [6]:
# Same query with schema-based layout
bold_files_schema = layout_schema.get(
    subject='01', 
    suffix='bold', 
    extension='nii.gz',
    return_type='filename'
)
print(f"Found {len(bold_files_schema)} BOLD files for subject 01:")
for f in bold_files_schema:
    print(f"  {os.path.basename(f)}")

# Show the difference between return types, as in the docs tutorial
print("\nCompare return types:")
bold_objects_schema = layout_schema.get(subject='01', suffix='bold', extension='nii.gz')[:2]
bold_filenames_schema = layout_schema.get(subject='01', suffix='bold', extension='nii.gz', return_type='filename')[:2]

# First line printed: BIDSFile objects (the default); second line: plain path strings
print(f"return_types={[type(obj) for obj in bold_objects_schema]}")
print(f"return_types={[type(fname) for fname in bold_filenames_schema]}")


Found 6 BOLD files for subject 01:
  sub-01_ses-1_task-rest_acq-fullbrain_run-1_bold.nii.gz
  sub-01_ses-1_task-rest_acq-fullbrain_run-2_bold.nii.gz
  sub-01_ses-1_task-rest_acq-prefrontal_bold.nii.gz
  sub-01_ses-2_task-rest_acq-fullbrain_run-1_bold.nii.gz
  sub-01_ses-2_task-rest_acq-fullbrain_run-2_bold.nii.gz
  sub-01_ses-2_task-rest_acq-prefrontal_bold.nii.gz

Compare return types:
return_types=[<class 'bids.layout.models.BIDSImageFile'>, <class 'bids.layout.models.BIDSImageFile'>]
return_types=[<class 'str'>, <class 'str'>]


## Part 3: Comparing the Two Approaches

Let's check that both approaches index the same files and return the same query results, and see where their entity sets differ.

In [7]:
# Compare file counts
legacy_count = len(layout_legacy.get())
schema_count = len(layout_schema.get())

print(f"Legacy layout: {legacy_count} files")
print(f"Schema layout: {schema_count} files")
print(f"\nMatch: {legacy_count == schema_count}")

Legacy layout: 339 files
Schema layout: 339 files

Match: True


In [8]:
# Compare entities
legacy_entities = set(layout_legacy.get_entities().keys())
schema_entities = set(layout_schema.get_entities().keys())

print(f"Legacy entities: {len(legacy_entities)}")
print(f"Schema entities: {len(schema_entities)}")

# Check for differences
only_legacy = legacy_entities - schema_entities
only_schema = schema_entities - legacy_entities

if only_legacy:
    print(f"\nEntities only in legacy: {sorted(only_legacy)}")
if only_schema:
    print(f"\nEntities only in schema: {sorted(only_schema)}")
if not only_legacy and not only_schema:
    print("\nAll entities match!")

Legacy entities: 44
Schema entities: 49

Entities only in legacy: ['fmap', 'inv', 'mt', 'proc', 'scans', 'staining']

Entities only in schema: ['density', 'description', 'hemisphere', 'inversion', 'label', 'mtransfer', 'processing', 'resolution', 'segmentation', 'split', 'stain']
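
Several of these differences look like renames rather than genuinely new entities: the legacy layout appears to use short or older names (`inv`, `mt`, `proc`, `staining`) where the schema-based layout uses the long entity names. The sketch below hard-codes the two lists printed above and pairs them up. The `ASSUMED_RENAMES` mapping is an assumption inferred from name similarity, not something PyBIDS reports.

```python
# Entity names copied from the comparison output above
legacy_only_names = {"fmap", "inv", "mt", "proc", "scans", "staining"}
schema_only_names = {
    "density", "description", "hemisphere", "inversion", "label",
    "mtransfer", "processing", "resolution", "segmentation", "split", "stain",
}

# Assumed legacy -> schema renames (hypothetical, inferred from the names)
ASSUMED_RENAMES = {
    "inv": "inversion",
    "mt": "mtransfer",
    "proc": "processing",
    "staining": "stain",
}

# Whatever is left after pairing up the assumed renames
unmatched_legacy = legacy_only_names - set(ASSUMED_RENAMES)
unmatched_schema = schema_only_names - set(ASSUMED_RENAMES.values())

print(sorted(unmatched_legacy))
# ['fmap', 'scans']
print(sorted(unmatched_schema))
# ['density', 'description', 'hemisphere', 'label', 'resolution', 'segmentation', 'split']
```

Under that assumption, only `fmap` and `scans` lack a schema-side counterpart, and the remaining schema-only names are entities that this legacy layout does not expose.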


In [9]:
# Compare specific query results
def compare_queries(layout1, layout2, name1="Layout 1", name2="Layout 2", **kwargs):
    """Compare results from two layouts for the same query."""
    result1 = set(layout1.get(return_type='filename', **kwargs))
    result2 = set(layout2.get(return_type='filename', **kwargs))
    
    print(f"Query: {kwargs}")
    print(f"{name1}: {len(result1)} files")
    print(f"{name2}: {len(result2)} files")
    
    if result1 == result2:
        print("✓ Results match!\n")
    else:
        only1 = result1 - result2
        only2 = result2 - result1
        if only1:
            print(f"Only in {name1}: {[os.path.basename(f) for f in only1]}")
        if only2:
            print(f"Only in {name2}: {[os.path.basename(f) for f in only2]}")
        print()

# Test various queries
compare_queries(layout_legacy, layout_schema, "Legacy", "Schema", 
                suffix='bold', extension='nii.gz')

compare_queries(layout_legacy, layout_schema, "Legacy", "Schema",
                suffix='T1w', extension='nii.gz')

# Test with scans TSV files (these exist in the dataset)
compare_queries(layout_legacy, layout_schema, "Legacy", "Schema",
                subject='01', suffix='scans', extension='tsv')


Query: {'suffix': 'bold', 'extension': 'nii.gz'}
Legacy: 60 files
Schema: 60 files
✓ Results match!

Query: {'suffix': 'T1w', 'extension': 'nii.gz'}
Legacy: 10 files
Schema: 10 files
✓ Results match!

Query: {'subject': '01', 'suffix': 'scans', 'extension': 'tsv'}
Legacy: 2 files
Schema: 2 files
✓ Results match!



## Part 4: Performance Comparison

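Let's compare the initialization time for both approaches. A single timed run gives only a rough indication, so differences of a few milliseconds should be read as run-to-run noise rather than a real speedup.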

In [10]:
import time

# Time legacy initialization (perf_counter is a monotonic, high-resolution clock)
start = time.perf_counter()
layout_legacy_timed = BIDSLayout(data_path, config=['bids'], reset_database=True)
legacy_time = time.perf_counter() - start

# Time schema initialization
start = time.perf_counter()
layout_schema_timed = BIDSLayout(data_path, config=['bids-schema'], reset_database=True)
schema_time = time.perf_counter() - start

print(f"Legacy initialization: {legacy_time:.3f} seconds")
print(f"Schema initialization: {schema_time:.3f} seconds")
print(f"\nDifference: {abs(legacy_time - schema_time):.3f} seconds")
print(f"Schema is {'faster' if schema_time < legacy_time else 'slower'} by {abs(1 - schema_time/legacy_time)*100:.1f}%")

Legacy initialization: 0.684 seconds
Schema initialization: 0.679 seconds

Difference: 0.005 seconds
Schema is faster by 0.8%
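
A difference of 0.005 seconds from one run each is too small to rank the two approaches. A more robust comparison repeats each measurement and looks at the median. The sketch below is one way to do that. `time_repeated` is a hypothetical helper, and `dummy_workload` is only a stand-in so the block runs on its own.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call `func` `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def dummy_workload():
    """Stand-in for a layout initialization (hypothetical placeholder)."""
    return sum(i * i for i in range(100_000))


durations = time_repeated(dummy_workload, repeats=5)
print(f"median: {statistics.median(durations):.4f} s, "
      f"min: {min(durations):.4f} s, max: {max(durations):.4f} s")
```

To time a real layout, pass a callable such as `lambda: BIDSLayout(data_path, config=['bids-schema'], reset_database=True)` in place of `dummy_workload`, and compare the medians of the two configurations.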
