In [1]:
import json

import pandas as pd

from bids2table import bids2table

In [2]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

## Building the index

Generate the BIDS index with 4 parallel workers and save it to disk (in Parquet format) so it can be reloaded quickly later.

Note that we are simultaneously indexing all datasets in the bids-examples repository.

In [3]:
bids2table(
    root="../bids-examples",
    index_path="bids-examples.b2t",
    persistent=True,
    overwrite=True,
    workers=4,
    return_table=False,
)

193it [00:00, 308.38it/s, tot=193, good=193, rec=2386, err=0]
172it [00:00, 284.34it/s, tot=172, good=172, rec=2240, err=0]
202it [00:00, 284.34it/s, tot=202, good=202, rec=2828, err=0]
213it [00:00, 295.75it/s, tot=213, good=213, rec=2812, err=0]


With `persistent=True`, the index is saved to disk for later use. If no `index_path` is given, it is saved to `index.b2t` inside the dataset root (here that would be `bids-examples/index.b2t`). The index is stored as a directory of Parquet files, one per worker.

From the [Parquet docs](https://parquet.apache.org/):

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

In [4]:
! ls -lht bids-examples.b2t/

total 1608
-rw-------@ 1 clane  staff   197K May  1 16:00 part-20240501160029-0002-of-0004.parquet
-rw-------@ 1 clane  staff   240K May  1 16:00 part-20240501160029-0003-of-0004.parquet
-rw-------@ 1 clane  staff   167K May  1 16:00 part-20240501160029-0000-of-0004.parquet
-rw-------@ 1 clane  staff   194K May  1 16:00 part-20240501160029-0001-of-0004.parquet


## Load and explore the index

Now when `bids2table` is called again with the same `index_path`, the persistent index is loaded from disk instead of being rebuilt.

Each row in the table corresponds to a BIDS data file. The table is organized with several groups of columns:

- dataset (`ds__*`): dataset name, relative dataset path, and the JSON dataset description
- entities (`ent__*`): All [valid BIDS entities](https://bids-specification.readthedocs.io/en/stable/appendices/entities.html) plus an `extra_entities` dict containing any extra entities
- metadata (`meta__*`): BIDS JSON metadata
- file info (`finfo__*`): General file info including the full file path and last modified time

In [5]:
tab = bids2table("../bids-examples", index_path="bids-examples.b2t")
print("Shape:", tab.shape)
tab.head(3)

Shape: (10266, 40)


Unnamed: 0,ds__dataset,ds__dataset_type,ds__dataset_path,ds__dataset_description,ent__sub,ent__ses,ent__sample,ent__task,ent__acq,ent__ce,ent__trc,ent__stain,ent__rec,ent__dir,ent__run,ent__mod,ent__echo,ent__flip,ent__inv,ent__mt,ent__part,ent__proc,ent__hemi,ent__space,ent__split,ent__recording,ent__chunk,ent__atlas,ent__res,ent__den,ent__label,ent__desc,ent__datatype,ent__suffix,ent__ext,ent__extra_entities,meta__json,finfo__file_path,finfo__link_target,finfo__mod_time
0,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,T1w,.nii.gz,{},,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
1,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,inplaneT2,.nii.gz,{},,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
2,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,probabilisticclassification,,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},"{'RepetitionTime': 2.0, 'TaskName': 'probabili...",/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0


Now let's look at the column types.

> TODO: Not all types are preserved when converting from Parquet to pandas. In particular, strings are mapped to `object`, and integer columns containing `None` are mapped to float with `NaN`.

In [6]:
schema = pd.DataFrame.from_records([tab.dtypes.to_dict()])
schema

Unnamed: 0,ds__dataset,ds__dataset_type,ds__dataset_path,ds__dataset_description,ent__sub,ent__ses,ent__sample,ent__task,ent__acq,ent__ce,ent__trc,ent__stain,ent__rec,ent__dir,ent__run,ent__mod,ent__echo,ent__flip,ent__inv,ent__mt,ent__part,ent__proc,ent__hemi,ent__space,ent__split,ent__recording,ent__chunk,ent__atlas,ent__res,ent__den,ent__label,ent__desc,ent__datatype,ent__suffix,ent__ext,ent__extra_entities,meta__json,finfo__file_path,finfo__link_target,finfo__mod_time
0,object,object,object,json,object,object,object,object,object,object,object,object,object,object,float64,object,float64,float64,float64,object,object,object,object,object,float64,object,float64,object,object,object,object,object,object,object,object,json,json,object,object,float64
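
The float columns in the schema above (for example `ent__run`) are a general pandas behavior rather than something specific to the index. As a minimal sketch on a toy series (the values are made up for illustration), this is how a missing value turns integers into floats, and how you could opt in to the nullable `Int64` dtype if you want integers back:

```python
import pandas as pd

# Toy stand-in for an entity column such as `ent__run`, where some rows have no value.
run = pd.Series([1, None, 2], name="ent__run")
print(run.dtype)  # float64: the missing value forces a float column with NaN

# Opt in to the pandas nullable integer dtype to recover integers.
run_int = run.astype("Int64")
print(run_int.dtype)  # Int64
print(run_int.isna().tolist())  # [False, True, False]
```

Whether you want this conversion depends on downstream code, since some libraries do not yet handle the nullable dtypes.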


The dataframe returned by `bids2table` is in fact a special `BIDSTable` subclass of `pandas.DataFrame` with a few extra helper methods.

- You can view the table with [nested columns](https://pandas.pydata.org/docs/user_guide/advanced.html#hierarchical-indexing-multiindex)

In [7]:
tab.nested.head(3)

Unnamed: 0_level_0,ds,ds,ds,ds,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,ent,meta,finfo,finfo,finfo
Unnamed: 0_level_1,dataset,dataset_type,dataset_path,dataset_description,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities,json,file_path,link_target,mod_time
0,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,T1w,.nii.gz,{},,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
1,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,inplaneT2,.nii.gz,{},,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
2,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,probabilisticclassification,,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},"{'RepetitionTime': 2.0, 'TaskName': 'probabili...",/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0


- You can easily access the dataset (`ds`), entities (`ent`), metadata (`meta`), or file info (`finfo`) subtables.

In [8]:
tab.ent.head(3)

Unnamed: 0,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities
0,15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,T1w,.nii.gz,{}
1,15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,inplaneT2,.nii.gz,{}
2,15,,,probabilisticclassification,,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{}


- You can view the full table without the group prefixes

In [9]:
tab.flat.head(3)

Unnamed: 0,dataset,dataset_type,dataset_path,dataset_description,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities,json,file_path,link_target,mod_time
0,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,T1w,.nii.gz,{},,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
1,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,inplaneT2,.nii.gz,{},,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
2,ds002,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.0.0', 'License': 'This data...",15,,,probabilisticclassification,,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},"{'RepetitionTime': 2.0, 'TaskName': 'probabili...",/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0


- You can access flattened JSON metadata.

In [10]:
tab.flat_meta.head(3)

Unnamed: 0,RepetitionTime,TaskName,InstitutionAddress,InstitutionName,InstitutionalDepartmentName,PowerLineFrequency,ManufacturersModelName,EEGReference,Manufacturer,EEGChannelCount,MiscChannelCount,RecordingType,RecordingDuration,SamplingFrequency,EOGChannelCount,ECGChannelCount,EMGChannelCount,SoftwareFilters,onset,duration,trial_type,response_time,sample,value,SoftwareVersions,MagneticFieldStrength,ReceiveCoilName,ReceiveCoilActiveElements,ScanningSequence,SequenceVariant,ScanOptions,SequenceName,PulseSequenceDetails,ParallelReductionFactorInPlane,PartialFourier,EchoTime,InversionTime,DwellTime,FlipAngle,MRAcquisitionType,PulseSequenceType,PhaseEncodingDirection,EffectiveEchoSpacing,TotalReadoutTime,RepetitionTimePreparation,IntendedFor,AcquisitionVoxelsize,NumberShots,ArterialSpinLabelingType,PostLabelingDelay,...,SpoilingRFPhaseIncrement,MagneticFliedStrength,PulseSequence,SpoilingState,SpoilingType,SpoilingGradientMoment,SpoilingGradientDuration,a_comp_cor_179,a_comp_cor_180,a_comp_cor_181,a_comp_cor_182,a_comp_cor_183,t_comp_cor_06,a_comp_cor_184,a_comp_cor_185,a_comp_cor_186,a_comp_cor_187,a_comp_cor_188,aroma_motion_45,aroma_motion_46,aroma_motion_47,aroma_motion_48,aroma_motion_49,aroma_motion_50,aroma_motion_51,aroma_motion_52,aroma_motion_53,dropped_568,dropped_569,dropped_570,dropped_571,PharmaceuticalName,PharmaceuticalDoseAmount,PharmaceuticalDoseAmountUnits,PharmaceuticalDoseRegimen,PharmaceuticalDoseTime,InfusionRadioactivity,InfusionStart,InfusionSpeed,InfusionSpeedUnits,InjectedVolume,TracerInjectionType,InjectionEnd,AttenuationCorrectionMethodReference,NonLinearGradientCorrection,PhaseOversampling,PercentSampling,InjectedMassPerWeight,InjectedMassPerWeightUnits,ElectricalStimulationParameters
0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2.0,probabilistic classification,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


- You can still slice the table and get back a `BIDSTable`

In [11]:
subtab = tab.iloc[:500]
print(type(subtab), subtab.shape)

<class 'bids2table.table.BIDSTable'> (500, 40)
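
The `nested`, `ent`, and `flat` views above all build on the `group__name` column naming convention. As a rough illustration in plain pandas (a sketch on a toy table, not how `BIDSTable` is actually implemented), the same kinds of views can be derived by splitting each column name on the double underscore:

```python
import pandas as pd

# Toy table using the same `group__name` column convention as the index.
df = pd.DataFrame(
    {
        "ds__dataset": ["ds002", "ds002"],
        "ent__sub": ["15", "15"],
        "ent__suffix": ["T1w", "bold"],
        "finfo__file_path": ["sub-15_T1w.nii.gz", "sub-15_task-x_bold.nii.gz"],
    }
)

# Split each name once on the double underscore to get (group, name) pairs.
nested = df.copy()
nested.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("__", 1)) for col in df.columns]
)

# Selecting a top-level key yields that group's subtable without the prefix.
ent = nested["ent"]
print(list(ent.columns))  # ['sub', 'suffix']

# Dropping the top level gives a prefix-free view of the whole table.
flat = nested.droplevel(0, axis=1)
print(list(flat.columns))  # ['dataset', 'sub', 'suffix', 'file_path']
```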


### Sorting rows

By default the rows are in arbitrary order. We can sort the rows by dataset, subject, session, task, and run.

In [12]:
sort_tab = tab.sort_entities(["dataset", "sub", "ses", "task", "run"])
sort_tab.head(3)

Unnamed: 0,ds__dataset,ds__dataset_type,ds__dataset_path,ds__dataset_description,ent__sub,ent__ses,ent__sample,ent__task,ent__acq,ent__ce,ent__trc,ent__stain,ent__rec,ent__dir,ent__run,ent__mod,ent__echo,ent__flip,ent__inv,ent__mt,ent__part,ent__proc,ent__hemi,ent__space,ent__split,ent__recording,ent__chunk,ent__atlas,ent__res,ent__den,ent__label,ent__desc,ent__datatype,ent__suffix,ent__ext,ent__extra_entities,meta__json,finfo__file_path,finfo__link_target,finfo__mod_time
3788,7t_trt,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},{'CogAtlasID': 'https://www.cognitiveatlas.org...,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
3790,7t_trt,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,1.0,,,,,,,,,,,,,,,,,,func,physio,.tsv.gz,{},"{'StartTime': 0, 'SamplingFrequency': 100, 'Co...",/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
3786,7t_trt,,/Users/clane/Projects/B2T/bids2table/bids-exam...,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,2.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},{'CogAtlasID': 'https://www.cognitiveatlas.org...,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0


### Filtering

In addition to all the usual pandas slicing operations, `BIDSTable`s also support higher-level filtering operations inspired by the PyBIDS `BIDSLayout.get` method and the pandas `Series.filter` method.

In [13]:
filtered = (
    tab
    .filter("task", contains="rest")
    .filter("sub", items=["04", "08"])
    .filter("RepetitionTime", 2.5)
)

filtered

Unnamed: 0,ds__dataset,ds__dataset_type,ds__dataset_path,ds__dataset_description,ent__sub,ent__ses,ent__sample,ent__task,ent__acq,ent__ce,ent__trc,ent__stain,ent__rec,ent__dir,ent__run,ent__mod,ent__echo,ent__flip,ent__inv,ent__mt,ent__part,ent__proc,ent__hemi,ent__space,ent__split,ent__recording,ent__chunk,ent__atlas,ent__res,ent__den,ent__label,ent__desc,ent__datatype,ent__suffix,ent__ext,ent__extra_entities,meta__json,finfo__file_path,finfo__link_target,finfo__mod_time
1554,synthetic/derivatives/fmriprep,derivative,/Users/clane/Projects/B2T/bids2table/bids-exam...,{'Name': 'fMRIPrep - fMRI PREProcessing workfl...,4,2,,rest,,,,,,,,,,,,,,,,T1w,,,,,,,,preproc,func,bold,.nii,{},{'Sources': ['bids:raw:sub-04/ses-02/sub-04_se...,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
1567,synthetic/derivatives/fmriprep,derivative,/Users/clane/Projects/B2T/bids2table/bids-exam...,{'Name': 'fMRIPrep - fMRI PREProcessing workfl...,4,2,,rest,,,,,,,,,,,,,,,,MNI152NLin2009cAsym,,,,,,,,preproc,func,bold,.nii,{},{'Sources': ['bids:raw:sub-04/ses-02/sub-04_se...,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
1576,synthetic/derivatives/fmriprep,derivative,/Users/clane/Projects/B2T/bids2table/bids-exam...,{'Name': 'fMRIPrep - fMRI PREProcessing workfl...,4,1,,rest,,,,,,,,,,,,,,,,T1w,,,,,,,,preproc,func,bold,.nii,{},{'Sources': ['bids:raw:sub-04/ses-01/sub-04_se...,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
1579,synthetic/derivatives/fmriprep,derivative,/Users/clane/Projects/B2T/bids2table/bids-exam...,{'Name': 'fMRIPrep - fMRI PREProcessing workfl...,4,1,,rest,,,,,,,,,,,,,,,,MNI152NLin2009cAsym,,,,,,,,preproc,func,bold,.nii,{},{'Sources': ['bids:raw:sub-04/ses-01/sub-04_se...,/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
4222,synthetic,raw,/Users/clane/Projects/B2T/bids2table/bids-exam...,{'Name': 'Synthetic dataset for inclusion in B...,4,2,,rest,,,,,,,,,,,,,,,,,,,,,,,,,func,bold,.nii,{},"{'TaskName': 'Rest', 'RepetitionTime': 2.5}",/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0
4235,synthetic,raw,/Users/clane/Projects/B2T/bids2table/bids-exam...,{'Name': 'Synthetic dataset for inclusion in B...,4,1,,rest,,,,,,,,,,,,,,,,,,,,,,,,,func,bold,.nii,{},"{'TaskName': 'Rest', 'RepetitionTime': 2.5}",/Users/clane/Projects/B2T/bids2table/bids-exam...,,1691420000.0


In [14]:
print("\n".join(sorted([str(f.path.relative_to("/Users/clane/Projects/B2T/bids2table/bids-examples/")) for f in filtered.files])))

synthetic/derivatives/fmriprep/sub-04/ses-01/func/sub-04_ses-01_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii
synthetic/derivatives/fmriprep/sub-04/ses-01/func/sub-04_ses-01_task-rest_space-T1w_desc-preproc_bold.nii
synthetic/derivatives/fmriprep/sub-04/ses-02/func/sub-04_ses-02_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii
synthetic/derivatives/fmriprep/sub-04/ses-02/func/sub-04_ses-02_task-rest_space-T1w_desc-preproc_bold.nii
synthetic/sub-04/ses-01/func/sub-04_ses-01_task-rest_bold.nii
synthetic/sub-04/ses-02/func/sub-04_ses-02_task-rest_bold.nii


You can also apply multiple filters at the same time with `filter_multi`.

In [15]:
filtered2 = tab.filter_multi(
    task={"contains": "rest"},
    sub={"items": ["04", "08"]},
    RepetitionTime=2.5,
)

print("Filters equal:", filtered.equals(filtered2))

Filters equal: True
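
For comparison, here is roughly what the three filters look like as ordinary pandas boolean masks. This is only a sketch on a toy table with made-up rows; the exact matching semantics of `filter` (for example, substring versus regular expression matching for `contains`) are an assumption here.

```python
import pandas as pd

# Toy flattened table; a real index has many more columns.
df = pd.DataFrame(
    {
        "sub": ["04", "04", "08", "15"],
        "task": ["rest", "nback", None, "rest"],
        "RepetitionTime": [2.5, 2.5, 2.5, 2.0],
    }
)

mask = (
    df["task"].str.contains("rest", na=False)  # like contains="rest"
    & df["sub"].isin(["04", "08"])             # like items=["04", "08"]
    & (df["RepetitionTime"] == 2.5)            # like an exact-value match
)
selected = df[mask]
print(selected.index.tolist())  # [0]
```

The higher-level methods save you from spelling out the column prefixes and from handling missing values by hand, as `na=False` does here.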


### Getting files

The rows of the table can also be converted to a list of structured `BIDSFile`s.

In [16]:
files = filtered.files

print("First file:", files[0])

First file: BIDSFile(dataset='synthetic/derivatives/fmriprep', root=PosixPath('/Users/clane/Projects/B2T/bids2table/bids-examples/synthetic/derivatives/fmriprep'), path=PosixPath('/Users/clane/Projects/B2T/bids2table/bids-examples/synthetic/derivatives/fmriprep/sub-04/ses-02/func/sub-04_ses-02_task-rest_space-T1w_desc-preproc_bold.nii'), entities=BIDSEntities(sub='04', ses='02', sample=None, task='rest', acq=None, ce=None, trc=None, stain=None, rec=None, dir=None, run=None, mod=None, echo=None, flip=None, inv=None, mt=None, part=None, proc=None, hemi=None, space='T1w', split=None, recording=None, chunk=None, atlas=None, res=None, den=None, label=None, desc='preproc', datatype='func', suffix='bold', ext='.nii', extra_entities={}), metadata={'Sources': ['bids:raw:sub-04/ses-02/sub-04_ses-02_task-rest_bold.nii'], 'TaskName': 'Rest', 'RepetitionTime': 2.5})


In [17]:
print("File paths:\n", "\n".join([str(f.relative_path) for f in files]), sep="")

File paths:
sub-04/ses-02/func/sub-04_ses-02_task-rest_space-T1w_desc-preproc_bold.nii
sub-04/ses-02/func/sub-04_ses-02_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii
sub-04/ses-01/func/sub-04_ses-01_task-rest_space-T1w_desc-preproc_bold.nii
sub-04/ses-01/func/sub-04_ses-01_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii
sub-04/ses-02/func/sub-04_ses-02_task-rest_bold.nii
sub-04/ses-01/func/sub-04_ses-01_task-rest_bold.nii
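
The entities stored on each `BIDSFile` mirror the key-value pairs in the file name. As a minimal sketch of that relationship, the hypothetical helper below (`parse_bids_name` is not part of `bids2table`) splits a file name into entities, suffix, and extension. It assumes a well-formed name in which every chunk before the suffix has the form `key-value`.

```python
from pathlib import PurePosixPath


def parse_bids_name(path):
    """Split a BIDS file name into (entities, suffix, ext).

    Hypothetical helper for illustration; it is not part of bids2table.
    """
    name = PurePosixPath(path).name
    # Everything from the first dot onward is the extension (e.g. ".nii.gz").
    stem, dot, rest = name.partition(".")
    ext = dot + rest
    # The last underscore-separated chunk is the suffix (e.g. "bold").
    *pairs, suffix = stem.split("_")
    # Each remaining chunk is a "key-value" pair such as "sub-04".
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix, ext


entities, suffix, ext = parse_bids_name(
    "sub-04/ses-02/func/sub-04_ses-02_task-rest_space-T1w_desc-preproc_bold.nii"
)
print(entities)
# {'sub': '04', 'ses': '02', 'task': 'rest', 'space': 'T1w', 'desc': 'preproc'}
print(suffix, ext)  # bold .nii
```

The result matches the non-null fields of the `BIDSEntities` printed for the first file above.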


### Skipping metadata

Extracting JSON sidecar metadata can often be the most time-consuming step of the indexing process. By setting `with_meta=False`, `bids2table` can skip this expensive up-front processing. Here we index without metadata and get a small speedup.

In [18]:
tab_no_meta = bids2table(root="../bids-examples", with_meta=False)

780it [00:02, 319.48it/s, tot=780, good=780, rec=10266, err=0]


If you want to extract metadata for a subset of the files after the fact, you can use the `BIDSTable.with_meta` method.

In [19]:
filtered_no_meta = (
    tab_no_meta
    .filter("task", contains="rest")
    .filter("sub", items=["04", "08"])
)

filtered_with_meta = filtered_no_meta.with_meta()
filtered_with_meta.flat_meta.head(2)

Unnamed: 0,TaskName,Manufacturer,ManufacturersModelName,ImageType,AcquisitionTime,AcquisitionDate,MagneticFieldStrength,FlipAngle,EchoTime,RepetitionTime,EffectiveEchoSpacing,SliceTiming,PhaseEncodingDirection,CogAtlasID,SliceEncodingDirection,StartTime,SamplingFrequency,Columns,Sources
913,Rest,Siemens,Skyra,"[ORIGINAL, PRIMARY, M, MB, ND, MOSAI]",192106.68,20180511.0,3.0,51.0,0.0424,0.735,0.00064,"[0, 0.09, 0.18, 0.2675, 0.3575, 0.4475, 0.5375...",j-,,,,,,
993,Rest,Siemens,Skyra,"[ORIGINAL, PRIMARY, M, MB, ND, MOSAI]",192106.68,20180511.0,3.0,51.0,0.0424,0.735,0.00064,"[0, 0.09, 0.18, 0.2675, 0.3575, 0.4475, 0.5375...",j-,,,,,,
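
To give a feel for why metadata extraction costs time, here is a simplified sketch of a per-file sidecar lookup: every data file needs at least one extra file-system check and a JSON parse. The helper `load_sidecar` is hypothetical and only looks at the same-named `.json` file; a real extraction may also need to merge metadata defined at higher directory levels, which this sketch ignores.

```python
import json
import tempfile
from pathlib import Path


def load_sidecar(data_path):
    """Return the JSON sidecar next to a data file, or None if there is none.

    Hypothetical helper; it only checks the same-named .json file and ignores
    metadata defined at higher directory levels.
    """
    data_path = Path(data_path)
    # Strip every extension (".nii.gz" -> "") before appending ".json".
    stem = data_path.name.split(".", 1)[0]
    sidecar = data_path.with_name(stem + ".json")
    if not sidecar.exists():
        return None
    with sidecar.open() as f:
        return json.load(f)


# Demonstrate on a throwaway directory with one data file and its sidecar.
with tempfile.TemporaryDirectory() as tmp:
    bold = Path(tmp) / "sub-04_task-rest_bold.nii.gz"
    bold.touch()
    bold.with_name("sub-04_task-rest_bold.json").write_text(
        json.dumps({"TaskName": "Rest", "RepetitionTime": 2.5})
    )
    meta = load_sidecar(bold)
    missing = load_sidecar(Path(tmp) / "sub-04_T1w.nii.gz")

print(meta)  # {'TaskName': 'Rest', 'RepetitionTime': 2.5}
print(missing)  # None
```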


## Analyze the table

Next we'll analyze the table in more detail to demonstrate some of the more advanced manipulations that are possible.

### Counting occurrences of BIDS entities

Count the number of non-null entries per BIDS entity.

In [20]:
ent_counts = tab.ent.count(axis=0)
ent_counts

sub               10266
ses                3545
sample               16
task               7972
acq                 422
ce                    0
trc                   0
stain                 8
rec                  58
dir                   3
run                6736
mod                   1
echo                541
flip                 53
inv                  20
mt                   25
part                 16
proc                208
hemi                 83
space               310
split                 0
recording             5
chunk                 8
atlas                 0
res                  84
den                   0
label                84
desc                280
datatype           8885
suffix            10218
ext               10266
extra_entities    10266
dtype: int64

We see that some BIDS entities never appear in any of the example datasets.

In [21]:
ent_counts[ent_counts == 0]

ce       0
trc      0
split    0
atlas    0
den      0
dtype: int64

### File counts

Count the number of data files per dataset and the number of files with JSON metadata.

In [22]:
tab.flat.groupby("dataset").agg(
    {"file_path": "count", "json": "count"}
)

Unnamed: 0_level_0,file_path,json
dataset,Unnamed: 1_level_1,Unnamed: 2_level_1
7t_trt,635,350
asl001,4,2
asl002,5,3
asl003,5,3
asl004,6,4
asl005,5,3
ds000001-fmriprep,416,52
ds000117,1105,657
ds000246,33,23
ds000247,105,75


## Using the command-line interface

bids2table comes with a command-line interface `bids2table` (alias `b2t`). Check out the help message.

In [23]:
! bids2table -h

usage: bids2table [-h] [--output OUTPUT] [--incremental] [--overwrite]
                  [--workers COUNT] [--worker_id RANK] [--verbose]
                  ROOT

positional arguments:
  ROOT                  Path to BIDS dataset

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Path to output parquet dataset directory (default:
                        {ROOT}/index.b2t)
  --incremental, --inc  Update index incrementally with only new or changed
                        files.
  --overwrite, -x       Overwrite previous index.
  --workers COUNT, -w COUNT
                        Number of worker processes. Setting to -1 runs as many
                        processes as there are cores available. (default: 1)
  --worker_id RANK, --id RANK
                        Optional worker ID to use when scheduling parallel
                        tasks externally. Incompatible with --overwrite.
                        (defa

Re-generate the index using the CLI.

In [24]:
! bids2table -o bids-examples.b2t -x -w 4 ../bids-examples/

172it [00:00, 296.08it/s, tot=172, good=172, rec=2240, err=0]
193it [00:00, 314.65it/s, tot=193, good=193, rec=2386, err=0]
202it [00:00, 288.84it/s, tot=202, good=202, rec=2828, err=0]
213it [00:00, 301.57it/s, tot=213, good=213, rec=2812, err=0]


You can also generate each partition independently by calling `bids2table` with `--worker_id`. This can be useful in HPC environments where you want to schedule many extraction tasks in parallel through a scheduler like [SLURM](https://slurm.schedmd.com/documentation.html).


```bash
# Can't use --overwrite together with --worker_id
# Remove in advance
rm -r bids-examples.b2t

for worker_id in {0..3}; do
  bids2table -o bids-examples.b2t --worker_id "$worker_id" --workers 4 ../bids-examples/ &
done
# Block until all background workers have finished
wait
```
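
When partitions are generated by independently scheduled jobs, it can be handy to check that every worker has written its file before loading the index. The sketch below is one way to do that. It assumes the `part-<timestamp>-<rank>-of-<count>.parquet` naming pattern seen in the directory listing earlier, and `missing_partitions` is a hypothetical helper, not part of `bids2table`.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming pattern, inferred from the directory listing earlier:
# part-<timestamp>-<rank>-of-<count>.parquet
PART_RE = re.compile(r"part-\d+-(\d{4})-of-(\d{4})\.parquet$")


def missing_partitions(index_dir, workers):
    """Return the sorted worker ranks that have no partition file yet."""
    found = set()
    for path in Path(index_dir).glob("part-*.parquet"):
        match = PART_RE.match(path.name)
        if match and int(match.group(2)) == workers:
            found.add(int(match.group(1)))
    return sorted(set(range(workers)) - found)


# Simulate an index directory where worker 2 has not finished yet.
with tempfile.TemporaryDirectory() as tmp:
    for rank in (0, 1, 3):
        (Path(tmp) / f"part-20240501160029-{rank:04d}-of-0004.parquet").touch()
    todo = missing_partitions(tmp, workers=4)

print(todo)  # [2]
```

An empty list means all partitions are present, and any remaining ranks can be resubmitted with the matching `--worker_id`.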