In [1]:
import logging

# Required to load columns with extension types
import elbow.dtypes
import pandas as pd

from bids2table import bids2table

In [2]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

## Building the index

Build the BIDS index with 4 parallel workers and save it to disk (in Parquet format) so it can be reloaded later.

Note that we are simultaneously indexing all datasets in the bids-examples repository.

In [3]:
df = bids2table(root="bids-examples", persistent=True, overwrite=True, workers=4)

110it [00:00, 1099.68it/s, tot=217, good=217, rec=166, err=0]

Traceback (most recent call last):
  File "/Users/clane/Projects/ScalableQC/code/bids2table-v2/bids2table/extractors/image.py", line 23, in extract_image_meta
    img = nib.load(str(path))
  File "/Users/clane/Projects/ScalableQC/code/bids2table-v2/.venv/lib/python3.8/site-packages/nibabel/loadsave.py", line 115, in load
    raise ImageFileError(msg)
nibabel.filebasedimages.ImageFileError: File bids-examples/asl002/sub-Sub103/anat/sub-Sub103_T1w.nii.gz is not a gzip file
Traceback (most recent call last):
  File "/Users/clane/Projects/ScalableQC/code/bids2table-v2/bids2table/extractors/image.py", line 23, in extract_image_meta
    img = nib.load(str(path))
  File "/Users/clane/Projects/ScalableQC/code/bids2table-v2/.venv/lib/python3.8/site-packages/nibabel/loadsave.py", line 115, in load
    raise ImageFileError(msg)
nibabel.filebasedimages.ImageFileError: File bids-examples/asl002/sub-Sub103/perf/sub-Sub103_m0scan.nii.gz is not a gzip file
Traceback (most recent call last):
  File "/U

5027it [00:04, 1237.20it/s, tot=5027, good=5027, rec=2489, err=0]
5169it [00:04, 1257.88it/s, tot=5169, good=5169, rec=2543, err=0]
5088it [00:04, 1224.97it/s, tot=5088, good=5088, rec=2575, err=0]
5125it [00:04, 1229.02it/s, tot=5125, good=5125, rec=2614, err=0]


Note that indexing produces many warnings about empty files. These are caused by the empty placeholder NIfTI files in the bids-examples repository. To suppress the warnings, you can run

```python
logging.getLogger().setLevel(logging.ERROR)
```
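
If you only want to silence the warnings while indexing, and keep normal logging for the rest of the session, a small context manager can restore the previous level afterwards. This is a minimal sketch using only the standard library; `log_level` is a hypothetical helper (not part of bids2table), and it assumes the warnings pass through the root logger, as the snippet above implies.

```python
import logging
from contextlib import contextmanager


@contextmanager
def log_level(level, logger_name=None):
    """Temporarily set a logger's level, restoring the previous level on exit."""
    logger = logging.getLogger(logger_name)  # None selects the root logger
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)


# Usage (commented out because it needs bids2table and the dataset on disk):
# with log_level(logging.ERROR):
#     df = bids2table(root="bids-examples", persistent=True, overwrite=True, workers=4)
```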

With `persistent=True`, the index is saved to disk for later use. By default it's saved to `bids-examples/index.b2t`. The index is saved as a directory of Parquet files, one per worker.

From the [Parquet docs](https://parquet.apache.org/):

> Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

In [4]:
! ls -lht bids-examples/index.b2t

total 2264
-rw-------  1 clane  staff   238K Jun 16 14:47 part-20230616144727-0003-of-0004.parquet
-rw-------  1 clane  staff   211K Jun 16 14:47 part-20230616144727-0002-of-0004.parquet
-rw-------  1 clane  staff   235K Jun 16 14:47 part-20230616144727-0000-of-0004.parquet
-rw-------  1 clane  staff   276K Jun 16 14:47 part-20230616144727-0001-of-0004.parquet
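
The file names in the listing suggest the pattern `part-<timestamp>-<index>-of-<total>.parquet`. Assuming that pattern holds (it is inferred from this listing, not from documented behavior), the sketch below checks whether any partition is missing from an index directory. `missing_partitions` is a hypothetical helper, not a bids2table function.

```python
import re
from pathlib import Path

# Pattern inferred from the listing above (an assumption, not a documented contract).
PART_RE = re.compile(r"^part-(\d+)-(\d+)-of-(\d+)\.parquet$")


def missing_partitions(index_dir):
    """Return the sorted partition indices that are absent from index_dir."""
    found = set()
    total = 0
    for path in Path(index_dir).glob("part-*.parquet"):
        match = PART_RE.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        found.add(int(match.group(2)))
        total = max(total, int(match.group(3)))
    return sorted(set(range(total)) - found)


# Usage: an empty list means every expected partition was found.
# print(missing_partitions("bids-examples/index.b2t"))
```

Note that an empty directory also yields an empty list, since the expected total is read from the file names themselves.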


## Load and explore the index

Now when `bids2table` is called again, it simply loads the persistent index from disk.

Each row in the table corresponds to a BIDS data file. The table is organized with several groups of columns:

- `dataset`: Dataset name, relative dataset path, and the JSON dataset description.
- `bids`: All [valid BIDS entities](https://bids-specification.readthedocs.io/en/stable/appendices/entities.html), a dict containing any extra entities, and the JSON sidecar metadata.
- `image`: Metadata specific to image files.
- `file`: General file metadata, including the full file path and last modified time.

In [5]:
df = bids2table("bids-examples")

df.head(3)

Unnamed: 0_level_0,dataset,dataset,dataset,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,image,image,file,file,file
Unnamed: 0_level_1,dataset,dataset_path,dataset_description,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities,sidecar,image_header,image_affine,file_path,link_target,mod_time
0,asl002,bids-examples/asl002,"{'Name': 'ASL_Philips_PCASL_2DEPI', 'BIDSVersi...",Sub103,,,,,,,,,,,,,,,,,,,,,,,,,,,,perf,m0scan,.nii.gz,{},"{'Manufacturer': 'Philips', 'ManufacturersMode...",,,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1685583000.0
1,asl002,bids-examples/asl002,"{'Name': 'ASL_Philips_PCASL_2DEPI', 'BIDSVersi...",Sub103,,,,,,,,,,,,,,,,,,,,,,,,,,,,perf,asl,.nii.gz,{},"{'Manufacturer': 'Philips', 'ManufacturersMode...",,,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1685583000.0
2,ds002,bids-examples/ds002,"{'BIDSVersion': '1.0.0', 'License': 'This data...",13,,,,,,,,,,,,,,,,,,,,,,,,,,,,anat,T1w,.nii.gz,{},,,,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1685583000.0


### Columns and types

Now let's look at the column names and pandas types.

> TODO: not all types are preserved when converting Parquet to pandas. In particular, strings are mapped to `object`, and integer columns containing `None` are mapped to `float64` with `NaN`.
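
Until the types are preserved, one possible workaround for the integer columns is to cast them to pandas' nullable `Int64` type after loading. The sketch below uses a toy series standing in for an entity column such as `run`; it is not how bids2table handles types internally.

```python
import pandas as pd

# Toy stand-in for an integer entity column where a missing value forced a float dtype.
run = pd.Series([1.0, 2.0, None], name="run")
print(run.dtype)  # float64

# Cast to the nullable integer type; the missing entry becomes <NA>.
run_int = run.astype("Int64")
print(run_int.dtype)  # Int64
```

On the real table, the same cast would apply to a column such as `df[("bids", "run")]`.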

In [6]:
print("Shape: ", df.shape)
print(
    "Columns:\n"
    + "\n".join(f"  {name}: {typ}" for name, typ in df.dtypes.to_dict().items())
)

Shape:  (10221, 41)
Columns:
  ('dataset', 'dataset'): object
  ('dataset', 'dataset_path'): object
  ('dataset', 'dataset_description'): json
  ('bids', 'sub'): object
  ('bids', 'ses'): object
  ('bids', 'sample'): object
  ('bids', 'task'): object
  ('bids', 'acq'): object
  ('bids', 'ce'): object
  ('bids', 'trc'): object
  ('bids', 'stain'): object
  ('bids', 'rec'): object
  ('bids', 'dir'): object
  ('bids', 'run'): float64
  ('bids', 'mod'): object
  ('bids', 'echo'): float64
  ('bids', 'flip'): float64
  ('bids', 'inv'): float64
  ('bids', 'mt'): object
  ('bids', 'part'): object
  ('bids', 'proc'): object
  ('bids', 'hemi'): object
  ('bids', 'space'): object
  ('bids', 'split'): float64
  ('bids', 'recording'): object
  ('bids', 'chunk'): float64
  ('bids', 'atlas'): object
  ('bids', 'res'): object
  ('bids', 'den'): object
  ('bids', 'label'): object
  ('bids', 'desc'): object
  ('bids', 'datatype'): object
  ('bids', 'suffix'): object
  ('bids', 'ext'): object
  ('bids', 

### Sorting rows

By default the rows are in arbitrary order. We can sort the values in place.

If you find the hierarchical column index inconvenient, you can drop the top level with:

```python
df = df.droplevel(0, axis=1)
```

You can also select one group of columns with e.g.

```python
bids_df = df["bids"]
```

In [7]:
sort_cols = [("dataset", "dataset")] + [("bids", k) for k in ["sub", "ses", "task", "run"]]

df.sort_values(sort_cols, inplace=True)

df.head(3)

Unnamed: 0_level_0,dataset,dataset,dataset,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,bids,image,image,file,file,file
Unnamed: 0_level_1,dataset,dataset_path,dataset_description,sub,ses,sample,task,acq,ce,trc,stain,rec,dir,run,mod,echo,flip,inv,mt,part,proc,hemi,space,split,recording,chunk,atlas,res,den,label,desc,datatype,suffix,ext,extra_entities,sidecar,image_header,image_affine,file_path,link_target,mod_time
1602,7t_trt,bids-examples/7t_trt,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,1.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},{'CogAtlasID': 'https://www.cognitiveatlas.org...,,,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1685583000.0
6695,7t_trt,bids-examples/7t_trt,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,1.0,,,,,,,,,,,,,,,,,,func,physio,.tsv.gz,{},"{'StartTime': 0, 'SamplingFrequency': 100, 'Co...",,,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1685583000.0
4186,7t_trt,bids-examples/7t_trt,"{'BIDSVersion': '1.8.0', 'Name': '7t_trt'}",1,1,,rest,fullbrain,,,,,,2.0,,,,,,,,,,,,,,,,,,func,bold,.nii.gz,{},{'CogAtlasID': 'https://www.cognitiveatlas.org...,,,/Users/clane/Projects/ScalableQC/code/bids2tab...,,1685583000.0
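
To see how sorting, group selection, and `droplevel` behave without loading the full index, here is a toy table with the same two-level column layout. The values are made up for illustration.

```python
import pandas as pd

# Toy table with the same (group, field) column layout as the index.
toy = pd.DataFrame(
    {
        ("dataset", "dataset"): ["ds002", "asl002", "asl002"],
        ("bids", "sub"): ["13", "Sub103", "Sub103"],
        ("bids", "run"): [1.0, 2.0, 1.0],
        ("file", "file_path"): ["c.nii.gz", "b.nii.gz", "a.nii.gz"],
    }
)

# Columns are addressed by (group, field) tuples when sorting.
toy_sorted = toy.sort_values([("dataset", "dataset"), ("bids", "run")])

# Selecting a top-level group returns a frame with single-level columns.
bids_df = toy_sorted["bids"]

# Dropping the top level flattens all columns at once.
flat = toy_sorted.droplevel(0, axis=1)
print(list(flat.columns))  # ['dataset', 'sub', 'run', 'file_path']
```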


### Counting occurrences of BIDS entities

Count the number of non-null entries per BIDS entity.

In [8]:
ent_counts = df["bids"].count(axis=0)
ent_counts

sub               10221
ses                3502
sample               16
task               7992
acq                 422
ce                    0
trc                   0
stain                 8
rec                  58
dir                   3
run                6736
mod                   1
echo                541
flip                 53
inv                  20
mt                   25
part                 16
proc                208
hemi                 83
space               301
split                 0
recording             5
chunk                 8
atlas                 0
res                  84
den                   0
label                84
desc                280
datatype           8836
suffix            10159
ext               10221
extra_entities    10221
sidecar            5023
dtype: int64

We see that some BIDS entities never appear in any of the example datasets.

In [9]:
ent_counts[ent_counts == 0]

ce       0
trc      0
split    0
atlas    0
den      0
dtype: int64

### File counts

Count the number of data files per dataset, the number of files with sidecar metadata, and the number of files with intact image headers.

> Note that most of the image files in the bids-examples datasets have empty headers.

In [10]:
df.droplevel(0, axis=1).groupby("dataset").agg(
    {"file_path": "count", "sidecar": "count", "image_header": "count"}
)

dataset,file_path,sidecar,image_header
7t_trt,635,350,0
asl001,4,2,0
asl002,5,3,0
asl003,5,3,0
asl004,6,4,0
asl005,5,3,0
ds000001-fmriprep,420,52,0
ds000117,1089,641,0
ds000246,32,22,0
ds000247,100,70,0
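
The `"count"` aggregation counts only non-null entries, which is why the `sidecar` and `image_header` columns can be smaller than `file_path`. A toy example with made-up values shows the effect:

```python
import pandas as pd

# Toy table: two files in ds_a (one with a sidecar), one file in ds_b (no sidecar).
toy = pd.DataFrame(
    {
        "dataset": ["ds_a", "ds_a", "ds_b"],
        "file_path": ["a.nii.gz", "b.nii.gz", "c.nii.gz"],
        "sidecar": ['{"RepetitionTime": 2.0}', None, None],
    }
)

# "count" skips missing values, so it reports how many files have each field.
counts = toy.groupby("dataset").agg({"file_path": "count", "sidecar": "count"})
print(counts)
```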


## Using the command-line interface

bids2table comes with a command-line interface, `bids2table` (alias `b2t`). Check out the help message.

In [11]:
! bids2table -h

usage: bids2table [-h] [--output OUTPUT] [--incremental] [--overwrite]
                  [--workers COUNT] [--worker_id RANK] [--verbose]
                  ROOT

positional arguments:
  ROOT                  Path to BIDS dataset

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Path to output parquet dataset directory (default:
                        {ROOT}/index.b2t)
  --incremental, --inc  Update index incrementally with only new or changed
                        files.
  --overwrite, -x       Overwrite previous index.
  --workers COUNT, -w COUNT
                        Number of worker processes. Setting to -1 runs as many
                        processes as there are cores available. (default: 1)
  --worker_id RANK, --id RANK
                        Optional worker ID to use when scheduling parallel
                        tasks externally. Incompatible with --overwrite.
                        (defa

Re-generate the index using the CLI.

In [12]:
! bids2table -x -w 4 bids-examples/

5027it [00:04, 1136.72it/s, tot=5027, good=5027, rec=2489, err=0]
5169it [00:04, 1167.26it/s, tot=5169, good=5169, rec=2543, err=0]
5088it [00:04, 1149.19it/s, tot=5088, good=5088, rec=2575, err=0]
5125it [00:04, 1136.73it/s, tot=5125, good=5125, rec=2614, err=0]


You can also generate each partition independently by calling `bids2table` with `--worker_id`. This can be useful in HPC environments where you want to schedule many extraction tasks in parallel through a scheduler like [SLURM](https://slurm.schedmd.com/documentation.html).


```bash
# --overwrite can't be combined with --worker_id,
# so remove any existing index in advance
rm -r bids-examples/index.b2t

for worker_id in {0..3}; do
  bids2table --worker_id "$worker_id" --workers 4 bids-examples/ &
done
wait  # block until all background workers have finished
```
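
The same pattern can be driven from Python, for example when a workflow tool generates the worker tasks. This is a rough sketch that uses only the `--worker_id` and `--workers` flags shown in the help message; `worker_commands` and `run_workers` are hypothetical helpers, not part of bids2table.

```python
import subprocess


def worker_commands(root, workers):
    """Build one bids2table command line per partition."""
    return [
        ["bids2table", "--worker_id", str(rank), "--workers", str(workers), root]
        for rank in range(workers)
    ]


def run_workers(root, workers):
    """Launch every worker, wait for all of them, and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in worker_commands(root, workers)]
    return [proc.wait() for proc in procs]


# Usage (requires the bids2table CLI on PATH and no existing index directory):
# exit_codes = run_workers("bids-examples/", workers=4)
```

Under a scheduler, each command from `worker_commands` would instead be submitted as its own task.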