# Query benchmark

In this benchmark, we compare the query performance of [PyBIDS](https://github.com/bids-standard/pybids), [ancpBIDS](https://github.com/ANCPLabOldenburg/ancp-bids), and [bids2table](https://github.com/childmindresearch/bids2table). The queries are modeled after the [PyBIDS `BIDSLayout` tutorial](https://bids-standard.github.io/pybids/examples/pybids_tutorial.html#querying-the-bidslayout).

For this benchmark, we use raw data from the [Chinese Color Nest Project](http://deepneuro.bnu.edu.cn/?p=163) (195 subjects, 2 resting state sessions per subject).

In [1]:
import datetime
from pathlib import Path

import pandas as pd

import bids
import ancpbids
import bids2table as b2t
from ancpbids import pybids_compat as bids2



In [2]:
print("PyBIDS:", bids.__version__)
print("ancpBIDS:", ancpbids.__version__)
print("bids2table:", b2t.__version__)

PyBIDS: 0.16.0
ancpBIDS: 0.2.2
bids2table: 0.1.dev29+gb7b1658


## Index the dataset using the different backends

We first index the data. We don't benchmark this step since [indexing is benchmarked separately](../indexing/).

Note that PyBIDS and ancpBIDS provide a similar `BIDSLayout` interface, whereas bids2table returns a pandas `DataFrame`. The `DataFrame` is a lower-level representation that arguably offers more flexibility at the price of more complicated query syntax. In the future, we may consider implementing a [PyBIDS-compatible layout interface](https://github.com/bids-standard/pybids/issues/989) on top of the bids2table `DataFrame`.
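To make the trade-off concrete, here is a minimal sketch on a toy table. The column names (`sub`, `suffix`, `ext`, `file_path`) follow the ones used in the queries later in this notebook, the rows are made up, and `simple_get` is a hypothetical helper written for illustration only. It is not part of bids2table or PyBIDS.

```python
import pandas as pd

# Toy stand-in for a bids2table index (made-up rows; a real index has many more columns).
toy_df = pd.DataFrame(
    {
        "sub": ["01", "01", "02"],
        "suffix": ["bold", "T1w", "bold"],
        "ext": [".nii.gz", ".nii.gz", ".nii.gz"],
        "file_path": [
            "sub-01/func/sub-01_task-rest_bold.nii.gz",
            "sub-01/anat/sub-01_T1w.nii.gz",
            "sub-02/func/sub-02_task-rest_bold.nii.gz",
        ],
    }
)


def simple_get(df, **filters):
    """Hypothetical layout-style helper: return file paths of rows matching every filter."""
    mask = pd.Series(True, index=df.index)
    for column, value in filters.items():
        # Accept either a single value or a collection of allowed values.
        values = list(value) if isinstance(value, (list, tuple, set)) else [value]
        mask &= df[column].isin(values)
    return df.loc[mask, "file_path"].tolist()


# Explicit boolean-mask query, in the style used for bids2table in this notebook.
explicit = toy_df.loc[
    (toy_df["suffix"] == "bold") & (toy_df["ext"] == ".nii.gz"),
    "file_path",
].tolist()

# The same query through the keyword-style wrapper.
wrapped = simple_get(toy_df, suffix="bold", ext=".nii.gz")
```

The explicit mask is more verbose but can express arbitrary conditions (ranges, string comparisons, joins), while a keyword wrapper like this only covers equality and membership filters.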

In [3]:
bids_dir = Path("/ocean/projects/med220004p/shared/data_raw/RBC/CCNP_BIDS")
index_dir = Path("indexes")

In [4]:
# pybids indexing
indexer = bids.BIDSLayoutIndexer(
    validate=False,
    index_metadata=True,
)
pb_layout = bids.BIDSLayout(
    root=bids_dir,
    validate=False,
    absolute_paths=True,
    derivatives=False,
    database_path=index_dir / "pybids.db",
    indexer=indexer,
)

Example contents of 'dataset_description.json':
{"Name": "Example dataset", "BIDSVersion": "1.0.2", "GeneratedBy": [{"Name": "Example pipeline"}]}


In [5]:
# ancpbids indexing
ab_layout = bids2.BIDSLayout(bids_dir)

In [6]:
# bids2table indexing
b2t_df = b2t.bids2table(bids_dir, persistent=True, output="indexes/index.b2t")

# drop hierarchical index
b2t_df = b2t_df.droplevel(0, axis=1)

# extract json sidecar data
sidecar_df = pd.json_normalize(b2t_df["sidecar"])
b2t_df = pd.concat([b2t_df, sidecar_df], axis=1)
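The sidecar extraction step can be illustrated on a toy table. This is a minimal sketch, assuming each `sidecar` entry is a plain dict; the values are made up. It also resets the index before concatenating, because `pd.json_normalize` returns a fresh 0..n-1 index and `pd.concat(..., axis=1)` aligns rows by index label.

```python
import pandas as pd

# Toy table with one dict of sidecar metadata per row (made-up values).
toy_df = pd.DataFrame(
    {
        "sub": ["01", "02"],
        "sidecar": [
            {"NumVolumes": 184, "AcquisitionTime": "09:30:00.000000"},
            {"NumVolumes": 200, "AcquisitionTime": "14:05:00.000000"},
        ],
    }
)

# Expand each dict into its own columns: one column per metadata key.
sidecar_df = pd.json_normalize(toy_df["sidecar"].tolist())

# Attach the metadata columns next to the original ones.
# reset_index keeps the row alignment correct even if toy_df had a non-default index.
flat_df = pd.concat([toy_df.reset_index(drop=True), sidecar_df], axis=1)
```

After this step, metadata fields such as `NumVolumes` can be filtered like any other column.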

## Compare query performance

Next we compare the performance of the different indices on four queries:

- Get subjects: Get a list of all unique subjects
- Get BOLD: Get a list of all BOLD NIfTI image files
- Query metadata: Find scans with a specific value for a sidecar metadata field
- Get morning scans: Find scans that were acquired before 10 AM

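Below is a summary table of the query run times in milliseconds. For every query with a reported timing, bids2table is more than 20x faster than both PyBIDS and ancpBIDS. The smallest gap is the metadata query (6.53 ms for PyBIDS vs. 0.312 ms for bids2table), and the largest is the subject query (1350 ms vs. 0.046 ms).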

| Index | Get subjects (ms) | Get BOLD (ms) | Query metadata (ms) | Get morning scans (ms) |
| -- | -- | -- | -- | -- |
| PyBIDS | 1350 | 12.3 | 6.53 | 34.3 |
| ancpBIDS | 30.6 | 19.2 | -- | -- |
| bids2table | **0.046** | **0.346** | **0.312** | **0.352** |


Note that ancpBIDS is missing values for the two queries that require accessing the sidecar metadata. The ancpBIDS metadata query below runs, but it does not return the same files as PyBIDS, so its timing is left out of the table. It's possible that ancpBIDS supports these queries, but [looking at the documentation](https://ancpbids.readthedocs.io/en/latest/advancedQueries.html), it wasn't obvious to me how.
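As a quick check of the speedup claim, the ratios can be computed directly from the summary table above. This is just arithmetic on the reported numbers; the variable names are chosen here for illustration.

```python
# Run times in milliseconds, copied from the summary table.
# None marks the queries for which no ancpBIDS timing is reported.
queries = ["Get subjects", "Get BOLD", "Query metadata", "Get morning scans"]
times_ms = {
    "PyBIDS": [1350, 12.3, 6.53, 34.3],
    "ancpBIDS": [30.6, 19.2, None, None],
    "bids2table": [0.046, 0.346, 0.312, 0.352],
}

# Speedup of bids2table relative to each other backend, per query.
speedups = {}
for name in ("PyBIDS", "ancpBIDS"):
    speedups[name] = [
        None if t is None else t / b
        for t, b in zip(times_ms[name], times_ms["bids2table"])
    ]

# Smallest speedup across all reported timings.
min_speedup = min(s for row in speedups.values() for s in row if s is not None)

for name, row in speedups.items():
    for query, s in zip(queries, row):
        print(f"{name:>8} | {query:<18} | " + ("--" if s is None else f"{s:,.1f}x"))
```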

In [7]:
# for checking that the different indices produce the same result
def is_equal(l1, l2):
    return sorted(l1) == sorted(l2)
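A short illustration of what this check does and does not ignore, using made-up lists. The function is repeated so the snippet runs on its own.

```python
def is_equal(l1, l2):
    # Compare two lists after sorting, so element order does not matter.
    return sorted(l1) == sorted(l2)


# Same elements in a different order: treated as equal.
same_items = is_equal(["sub-02", "sub-01"], ["sub-01", "sub-02"])

# Duplicates are not collapsed: a repeated entry makes the lists differ.
with_duplicate = is_equal(["sub-01", "sub-01"], ["sub-01"])
```

In other words, the check treats the query results as multisets: ordering differences between backends are tolerated, but a missing or duplicated entry is not.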

### Get all subject IDs

In [8]:
def pb_sub_query():
    return pb_layout.get(return_type="id", target="subject")

def ab_sub_query():
    return ab_layout.get(return_type="id", target="subject")

def b2t_sub_query():
    return b2t_df["sub"].unique().tolist()

In [9]:
%timeit pb_sub_query()
%timeit ab_sub_query()
%timeit b2t_sub_query()

1.35 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30.6 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [10]:
print("pybids == ancpbids:", is_equal(pb_sub_query(), ab_sub_query()))
print("pybids == bids2table:", is_equal(pb_sub_query(), b2t_sub_query()))

pybids == ancpbids: True
pybids == bids2table: True


### Get filenames for all BOLD images

In [11]:
def pb_bold_query():
    return pb_layout.get(extension="nii.gz", suffix="bold", return_type="filename")

def ab_bold_query():
    return ab_layout.get(extension="nii.gz", suffix="bold", return_type="filename")

def b2t_bold_query():
    return b2t_df.loc[
        (b2t_df["ext"] == '.nii.gz') & (b2t_df["suffix"] == 'bold'),
        "file_path"
    ].tolist()

In [12]:
%timeit pb_bold_query()
%timeit ab_bold_query()
%timeit b2t_bold_query()

12.3 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.2 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
346 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [13]:
print("pybids == ancpbids:", is_equal(pb_bold_query(), ab_bold_query()))
print("pybids == bids2table:", is_equal(pb_bold_query(), b2t_bold_query()))

pybids == ancpbids: True
pybids == bids2table: True


### Query metadata

In [14]:
def pb_meta_query():
    return pb_layout.get(
        subject=["colornest167", "colornest168"],
        NumVolumes=184,
        return_type="filename",
    )

def ab_meta_query():
    # NOTE: This doesn't work. Does ancpbids support querying sidecar metadata?
    return ab_layout.get(
        subject=["colornest167", "colornest168"],
        NumVolumes=184,
        return_type="filename",
    )

def b2t_meta_query():
    return b2t_df.loc[
        (b2t_df["sub"].isin(["colornest167", "colornest168"]))
        & (b2t_df["NumVolumes"] == 184),
        "file_path"
    ].tolist()

In [15]:
%timeit pb_meta_query()
%timeit ab_meta_query()
%timeit b2t_meta_query()

6.53 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
45.5 ms ± 83.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
312 µs ± 850 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [16]:
print("pybids == ancpbids:", is_equal(pb_meta_query(), ab_meta_query()))
print("pybids == bids2table:", is_equal(pb_meta_query(), b2t_meta_query()))

pybids == ancpbids: False
pybids == bids2table: True


### Get morning scans

In [17]:
def pb_morning_query():
    file_names = pd.Series(
        pb_layout.get(extension="nii.gz", return_type="filename")
    )
    acq_times = pd.Series(
        pb_layout.get(extension="nii.gz", target="AcquisitionTime", return_type="id")
    )
    file_names = file_names[acq_times < datetime.time(10).strftime("%H:%M:%S.%f")]
    return file_names.to_list()

def b2t_morning_query():
    return (
        b2t_df
        .loc[
            (b2t_df["ext"] == ".nii.gz")
            & (b2t_df["AcquisitionTime"] < datetime.time(10).strftime("%H:%M:%S.%f")),
            "file_path"
        ].tolist()
    )
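Both queries compare `AcquisitionTime` to the cutoff as strings rather than as time objects. A minimal sketch of why this works, assuming the times are stored as zero-padded `HH:MM:SS.ffffff` strings (the example values below are made up):

```python
import datetime

# The cutoff used in the queries: 10 AM formatted as a zero-padded string.
cutoff = datetime.time(10).strftime("%H:%M:%S.%f")

# With zero-padded fields, lexicographic order matches chronological order.
times = ["09:59:59.999999", "10:00:00.000000", "14:30:00.500000"]
morning = [t for t in times if t < cutoff]

# Caveat: a value without zero padding sorts incorrectly, because "9" > "1".
unpadded_is_morning = "9:30:00.000000" < cutoff
```

If the stored format were not zero-padded, parsing the values into `datetime.time` objects before comparing would be the safer choice.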

In [18]:
%timeit pb_morning_query()
%timeit b2t_morning_query()

34.3 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
352 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [19]:
print("pybids == bids2table:", is_equal(pb_morning_query(), b2t_morning_query()))

pybids == bids2table: True
