In [1]:
import logging

# Required to load columns with extension types
import elbow.dtypes
import pandas as pd

from bids2table import bids2table

In [2]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

## Building the index

Build the BIDS index with 4 parallel workers and save it to disk (in Parquet format) so it can be reloaded quickly later.

Note that we are simultaneously indexing all datasets in the bids-examples repository.

In [3]:
df = bids2table(root="../bids-examples", persistent=True, overwrite=True, workers=4)

194it [00:03, 61.73it/s, tot=194, good=194, rec=2473, err=0]
187it [00:03, 59.59it/s, tot=187, good=187, rec=2410, err=0]
200it [00:03, 61.87it/s, tot=200, good=200, rec=2564, err=0]
199it [00:03, 57.10it/s, tot=199, good=199, rec=2740, err=0]


With `persistent=True`, the index is saved to disk for later use. By default it is written to `{ROOT}/index.b2t`, which here is `../bids-examples/index.b2t`. The index is stored as a directory of Parquet files, one per worker.

From the [Parquet docs](https://parquet.apache.org/):

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

In [4]:
! ls -lht ../bids-examples/index.b2t

total 2056
-rw-------  1 clane  staff   208K Jun 27 12:21 part-20230627122130-0002-of-0004.parquet
-rw-------  1 clane  staff   191K Jun 27 12:21 part-20230627122130-0001-of-0004.parquet
-rw-------  1 clane  staff   197K Jun 27 12:21 part-20230627122130-0000-of-0004.parquet
-rw-------  1 clane  staff   199K Jun 27 12:21 part-20230627122130-0003-of-0004.parquet
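Because each worker writes its own part file, one quick sanity check is to confirm that every partition is present before loading the index. The sketch below parses file names of the form seen in the listing above (`part-<timestamp>-<rank>-of-<count>.parquet`). That pattern is inferred from this single listing and is an assumption, not a documented guarantee, and `missing_partitions` is a hypothetical helper, not part of bids2table.

```python
import re
from pathlib import Path

# Pattern inferred from the directory listing; treat it as an assumption.
PART_RE = re.compile(r"part-(\d+)-(\d{4})-of-(\d{4})\.parquet$")


def missing_partitions(names):
    """Return the sorted worker ranks that have no part file among `names`."""
    found = set()
    total = 0
    for name in names:
        match = PART_RE.search(name)
        if match is None:
            continue  # skip files that do not look like partitions
        found.add(int(match.group(2)))
        total = max(total, int(match.group(3)))
    return sorted(set(range(total)) - found)


def missing_partitions_in_dir(index_dir):
    """Convenience wrapper that inspects a directory on disk."""
    return missing_partitions(p.name for p in Path(index_dir).glob("*.parquet"))
```

Calling `missing_partitions_in_dir("../bids-examples/index.b2t")` on a complete index should return an empty list.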


## Load and explore the index

When `bids2table` is called again on the same root, it loads the persistent index instead of rebuilding it.

Each row in the table corresponds to a BIDS data file. The table is organized with several groups of columns:

- `dataset`: dataset name, relative dataset path, and the JSON dataset description
- `entities`: All [valid BIDS entities](https://bids-specification.readthedocs.io/en/stable/appendices/entities.html) plus an `extra_entities` dict containing any extra entities
- `metadata`: BIDS JSON "sidecar" metadata
- `file`: General file metadata including the full file path and last modified time.

In [5]:
df = bids2table("../bids-examples")

df.head(3)

Unnamed: 0_level_0,dataset,dataset,dataset,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,metadata,file,file,file
Unnamed: 0_level_1,dataset,dataset_path,dataset_description,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities,sidecar,file_path,link_target,mod_time
0,ds002,../bids-examples/ds002,"{'BIDSVersion': '1.0.0', 'License': 'This data...",14,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,T1w,.nii.gz,{},,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1687883000.0
1,ds002,../bids-examples/ds002,"{'BIDSVersion': '1.0.0', 'License': 'This data...",14,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,inplaneT2,.nii.gz,{},,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1687883000.0
2,ds002,../bids-examples/ds002,"{'BIDSVersion': '1.0.0', 'License': 'This data...",14,,,mixedeventrelatedprobe,,,,,,,1.0,,,,,,,,,,,,,,,,,,func,events,.tsv,{},,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1687883000.0


### Columns and types

Now let's look at the column names and pandas types.

> TODO: not all types are preserved when converting parquet to pandas. In particular, strings are mapped to objects and ints with `None` to float with `NaN`.
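The effect described in the note can be reproduced with plain pandas, independent of bids2table. The sketch below uses a made-up `run` column. Converting to the nullable `Int64` dtype is one possible way to get integer values back; it is not something the source says the library does for you.

```python
import pandas as pd

# A toy integer column with a missing value, standing in for e.g. `run`.
run = pd.Series([1, None, 2], name="run")

# pandas falls back to float64 because NaN has no plain-int representation.
print(run.dtype)

# The nullable extension dtype keeps integers and marks the gap as <NA>.
run_int = run.astype("Int64")
print(run_int.dtype)
```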

In [6]:
print("Shape: ", df.shape)
print(
    "Columns:\n"
    + "\n".join(f"  {name}: {typ}" for name, typ in df.dtypes.to_dict().items())
)

Shape:  (10187, 39)
Columns:
  ('dataset', 'dataset'): object
  ('dataset', 'dataset_path'): object
  ('dataset', 'dataset_description'): json
  ('entities', 'sub'): object
  ('entities', 'ses'): object
  ('entities', 'sample'): object
  ('entities', 'task'): object
  ('entities', 'acq'): object
  ('entities', 'ce'): object
  ('entities', 'trc'): object
  ('entities', 'stain'): object
  ('entities', 'rec'): object
  ('entities', 'dir'): object
  ('entities', 'run'): float64
  ('entities', 'mod'): object
  ('entities', 'echo'): float64
  ('entities', 'flip'): float64
  ('entities', 'inv'): float64
  ('entities', 'mt'): object
  ('entities', 'part'): object
  ('entities', 'proc'): object
  ('entities', 'hemi'): object
  ('entities', 'space'): object
  ('entities', 'split'): float64
  ('entities', 'recording'): object
  ('entities', 'chunk'): float64
  ('entities', 'atlas'): object
  ('entities', 'res'): object
  ('entities', 'den'): object
  ('entities', 'label'): object
  ('entities', '

### Sorting rows

By default the rows are in arbitrary order. We can sort the values in place.

If you find the hierarchical index annoying, you can drop the top level with:

```python
df = df.droplevel(0, axis=1)
```

You can also select one group of columns with e.g.

```python
ents = df["entities"]
```

In [7]:
sort_cols = [("dataset", "dataset")] + [("entities", k) for k in ["sub", "ses", "task", "run"]]

df.sort_values(sort_cols, inplace=True)

df.head(3)

Unnamed: 0_level_0,dataset,dataset,dataset,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,entities,metadata,file,file,file
Unnamed: 0_level_1,dataset,dataset_path,dataset_description,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities,sidecar,file_path,link_target,mod_time
6749,7t_trt,../bids-examples/7t_trt,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},{'CogAtlasID': 'https://www.cognitiveatlas.org...,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1687883000.0
6751,7t_trt,../bids-examples/7t_trt,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,1.0,,,,,,,,,,,,,,,,,,func,physio,.tsv.gz,{},"{'StartTime': 0, 'SamplingFrequency': 100, 'Co...",/Users/clane/Projects/ScalableQC/code/bids2tab...,,1687883000.0
6747,7t_trt,../bids-examples/7t_trt,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,2.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},{'CogAtlasID': 'https://www.cognitiveatlas.org...,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1687883000.0
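The column operations used in this section can be tried on a small stand-in table. The sketch below builds a toy frame with the same two-level column layout as the index (all values are made up) and shows sorting by tuple keys, selecting one column group, and flattening the column index.

```python
import pandas as pd

# Toy table mimicking the index's two-level columns; the values are invented.
columns = pd.MultiIndex.from_tuples(
    [
        ("dataset", "dataset"),
        ("entities", "sub"),
        ("entities", "run"),
        ("file", "file_path"),
    ]
)
toy = pd.DataFrame(
    [
        ["ds002", "02", 2.0, "sub-02_run-2_bold.nii.gz"],
        ["ds002", "01", 1.0, "sub-01_run-1_bold.nii.gz"],
        ["7t_trt", "01", 1.0, "sub-01_run-1_bold.nii.gz"],
    ],
    columns=columns,
)

# Sort by tuple keys; without inplace=True a new frame is returned.
toy_sorted = toy.sort_values([("dataset", "dataset"), ("entities", "sub")])

# Selecting one group yields a frame with single-level columns.
ents = toy_sorted["entities"]

# Dropping the top level gives flat column names.
flat = toy_sorted.droplevel(0, axis=1)
```

Returning a new frame rather than sorting in place keeps the original row order available for comparison.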


### Counting occurrences of BIDS entities

Count the number of non-null entries per BIDS entity.

In [8]:
ent_counts = df["entities"].count(axis=0)
ent_counts

sub               10187
ses                3502
sample               16
task               7972
acq                 422
ce                    0
trc                   0
stain                 8
rec                  58
dir                   3
run                6736
mod                   1
echo                541
flip                 53
inv                  20
mt                   25
part                 16
proc                208
hemi                 83
space               301
split                 0
recording             5
chunk                 8
atlas                 0
res                  84
den                   0
label                84
desc                280
datatype           8806
suffix            10139
ext               10187
extra_entities    10187
dtype: int64

We see that some BIDS entities never appear in any of the example datasets.

In [9]:
ent_counts[ent_counts == 0]

ce       0
trc      0
split    0
atlas    0
den      0
dtype: int64
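The same non-null counts can be used to prune unused entity columns or to build row filters. The following is a minimal sketch on an invented entities table; the column names follow the index above, but the rows are hypothetical.

```python
import pandas as pd

# Invented stand-in for df["entities"]; None marks an absent entity.
ents = pd.DataFrame(
    {
        "sub": ["01", "01", "02"],
        "task": ["rest", None, "rest"],
        "ce": [None, None, None],
        "suffix": ["bold", "T1w", "bold"],
    }
)

# Non-null entries per entity, as in the cell above.
counts = ents.count(axis=0)

# Entities that never occur in this table.
unused = counts[counts == 0].index.tolist()

# Keep only the entity columns that occur at least once.
used = ents.dropna(axis=1, how="all")

# Row filter: resting-state BOLD files only.
bold_rest = ents[(ents["suffix"] == "bold") & (ents["task"] == "rest")]
```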

### File counts

Count the number of data files per dataset and the number of files with sidecar metadata.

In [10]:
df.droplevel(0, axis=1).groupby("dataset").agg(
    {"file_path": "count", "sidecar": "count"}
)

| dataset | file_path | sidecar |
|---|---|---|
| 7t_trt | 635 | 350 |
| asl001 | 4 | 2 |
| asl002 | 5 | 3 |
| asl003 | 5 | 3 |
| asl004 | 6 | 4 |
| asl005 | 5 | 3 |
| ds000001-fmriprep | 416 | 52 |
| ds000117 | 1089 | 641 |
| ds000246 | 32 | 22 |
| ds000247 | 100 | 70 |
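A natural follow-up is the share of files in each dataset that carry sidecar metadata. The sketch below uses pandas named aggregation on an invented flat table; the column names `n_files`, `n_sidecar`, and `sidecar_fraction` are illustrative choices, not names produced by bids2table.

```python
import pandas as pd

# Invented stand-in for df.droplevel(0, axis=1); None means no sidecar.
flat = pd.DataFrame(
    {
        "dataset": ["a", "a", "a", "b"],
        "file_path": ["f1", "f2", "f3", "f4"],
        "sidecar": [{"k": 1}, None, None, {"k": 2}],
    }
)

# Named aggregation gives descriptive output column names.
per_dataset = flat.groupby("dataset").agg(
    n_files=("file_path", "count"),
    n_sidecar=("sidecar", "count"),
)

# Fraction of files with sidecar metadata in each dataset.
per_dataset["sidecar_fraction"] = per_dataset["n_sidecar"] / per_dataset["n_files"]
```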


## Using the command-line interface

bids2table comes with a command-line interface, `bids2table` (alias `b2t`). Check out the help message.

In [11]:
! bids2table -h

usage: bids2table [-h] [--output OUTPUT] [--incremental] [--overwrite]
                  [--workers COUNT] [--worker_id RANK] [--verbose]
                  ROOT

positional arguments:
  ROOT                  Path to BIDS dataset

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Path to output parquet dataset directory (default:
                        {ROOT}/index.b2t)
  --incremental, --inc  Update index incrementally with only new or changed
                        files.
  --overwrite, -x       Overwrite previous index.
  --workers COUNT, -w COUNT
                        Number of worker processes. Setting to -1 runs as many
                        processes as there are cores available. (default: 1)
  --worker_id RANK, --id RANK
                        Optional worker ID to use when scheduling parallel
                        tasks externally. Incompatible with --overwrite.
                        (defa

Re-generate the index using the CLI.

In [12]:
! bids2table -x -w 4 bids-examples/

0it [00:00, ?it/s]

0it [00:00, ?it/s]
0it [00:00, ?it/s]


You can also generate each partition independently by calling `bids2table` with `--worker_id`. This can be useful in HPC environments where you want to schedule many extraction tasks in parallel through a scheduler like [SLURM](https://slurm.schedmd.com/documentation.html).


```bash
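# --overwrite cannot be combined with --worker_id,
# so remove any previous index in advance.
rm -r bids-examples/index.b2t

# Launch one background process per partition.
for worker_id in {0..3}; do
  bids2table --worker_id "$worker_id" --workers 4 bids-examples/ &
done

# Block until all background workers have finished.
wait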
```
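When the per-worker commands are generated programmatically (for example, to write one line per scheduler job), the same invocation can be assembled in Python. This is a minimal sketch that uses only the `--worker_id` and `--workers` flags shown in the help message; `worker_commands` is a hypothetical helper, and how each command is submitted to the scheduler is left open.

```python
import shlex


def worker_commands(root, workers):
    """Build one bids2table argument list per worker rank."""
    return [
        ["bids2table", "--worker_id", str(rank), "--workers", str(workers), root]
        for rank in range(workers)
    ]


for argv in worker_commands("bids-examples/", 4):
    # Print the shell form; each line could become a separate scheduler job.
    print(shlex.join(argv))
```

Keeping the commands as argument lists until the last moment avoids quoting problems if the dataset path contains spaces.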