
Inheritance principle clarification for exact matches #1583

Open
effigies opened this issue Aug 11, 2023 · 15 comments


@effigies
Collaborator

From previous discussions, there does not seem to be clear consensus on either relaxing or tightening up the inheritance principle. I can report, however, a practical difficulty with the current rule and derivatives as written. Example:

sub-01/
  anat/
    sub-01_desc-preproc_T1w.json
    sub-01_desc-preproc_T1w.nii.gz
    sub-01_space-MNI_desc-preproc_T1w.json
    sub-01_space-MNI_desc-preproc_T1w.nii.gz

According to the current rule, both JSON files apply to the MNI-space T1w.nii.gz, which violates

  1. There MUST NOT be multiple metadata files applicable to a data file at one level of the directory hierarchy.

One way of relaxing (or clarifying) this would be:

  1. Multiple metadata files may not apply to a data file at a single level of the directory hierarchy. If multiple applicable metadata files exist at the same level of the hierarchy and one file shares all entities with the data file, then only that metadata file that shares all entities with the data file applies to the data file at that level of the hierarchy.

This is what is implemented in bids-standard/bids-validator#1773, which is needed to handle fMRIPrep derivatives right now. The alternative is to clarify derivatives to figure out what fMRIPrep should be doing differently. We have been trying to avoid inheritance principle ambiguities, and yet we ran into this.

For what it's worth, the proposal above comports with PyBIDS' current behavior, which was intended to support the current reading of inheritance, and has not changed for several years.
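
For concreteness, the proposed exact-match resolution could be sketched like this (the `entities` parser and `applicable_at_level` function are hypothetical illustrations, not pybids or validator API):

```python
import re

def entities(name):
    """Hypothetical helper: parse key-value entities from a BIDS filename.
    Suffix and extension are ignored for this sketch."""
    return dict(re.findall(r"([a-zA-Z]+)-([a-zA-Z0-9]+)", name))

def applicable_at_level(datafile, sidecars):
    """Return the sidecars at one level that apply to a data file.
    A sidecar is a candidate if its entities are a subset of the data
    file's.  Proposed relaxation: if one sidecar shares *all* entities
    with the data file, it alone applies at this level."""
    data_ents = entities(datafile)
    candidates = [s for s in sidecars
                  if entities(s).items() <= data_ents.items()]
    exact = [s for s in candidates if entities(s) == data_ents]
    return exact if exact else candidates

sidecars = ["sub-01_desc-preproc_T1w.json",
            "sub-01_space-MNI_desc-preproc_T1w.json"]
print(applicable_at_level("sub-01_space-MNI_desc-preproc_T1w.nii.gz", sidecars))
# -> ['sub-01_space-MNI_desc-preproc_T1w.json']
```

Under the current rule both sidecars would apply (two metadata files at one level); under the proposal the exact match wins.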

Related issues:

@Remi-Gau
Collaborator

Ha. Apparently bids-matlab would get confused by this.
#BugFixingTime

@effigies
Collaborator Author

Copying @Remi-Gau's reply from my original comment:

  • maybe just me but I think that the rephrasing is just making more explicit what was in the original point 4.

@bendhouseart
Collaborator

Multiple metadata files may not apply to a data file at a single level of the directory hierarchy. If multiple applicable metadata files exist at the same level of the hierarchy and one file shares all entities with the data file, then only that metadata file that shares all entities with the data file applies to the data file at that level of the hierarchy.

This seems like a sane enough check for both humans and machines, but, yeah, I can see where this leads with inheritance in derivatives. I look forward to seeing dozens of similar, but not quite, json's littered all over the top of derivative datasets.

@Remi-Gau
Collaborator

I am glad you are volunteering to update the regex that will inevitably become necessary to parse all this. 😘

https://github.com/bids-standard/pybids/blob/ae1bb002b518b02748d9c6e41a264870b5afd475/bids/layout/config/bids.json#L167-L191

@yarikoptic
Collaborator

I think the proposed formulation needs further clarification. The original rule 4 applies at every level, not only the lowest level where the data file resides, which is the concern that led to the rewrite. The proposed formulation should reflect that somehow, since as written it would drop the check at higher levels, reintroducing the ambiguities we have tried to avoid.

E.g., rule 4 currently forbids having both a top-level desc-preproc_T1w.json and a top-level space-MNI_T1w.json. Under the proposed rewrite that would be allowed, creating ambiguity when applying them to files further down the hierarchy (not desired), while the long clarifying second sentence would not apply "as is", since neither of those two .json files would match the data file(s) in lower directories. Only if extended with a sub- entity would sub-01_desc-preproc_T1w.json start matching, and even then neither would be an exact match for the composite sub-01_space-MNI_desc-preproc_T1w.json.

So the reformulation might need to explicitly separate the cases for different levels: the old rule for levels above the data file's level, and the relaxation (multiple sidecars allowed, but only the one exact match is taken...) at the data file's level. But that would complicate it even more :-/

It also requires analysis (I have only a vague memory of this) of cases we may already have where some entities are deliberately omitted (e.g. _part-) to provide common inheritable metadata.

@yarikoptic
Collaborator

yarikoptic commented Aug 11, 2023

for the latter concern, looking at Example 4 at https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#the-inheritance-principle

Example 4: Single metadata file applying to multiple data files (corollary 2)

└─ sub-01/
   ├─ anat/
   └─ func/
      ├─ sub-01_task-xyz_acq-test1_run-1_bold.nii.gz 
      ├─ sub-01_task-xyz_acq-test1_run-2_bold.nii.gz 
      └─ sub-01_task-xyz_acq-test1_bold.json

This would still be OK under the proposed formulation, since the "special casing" of rule 4 would not kick in. But if we extend it with the hypothetical addition of files carrying an extra entity (_desc-1):

└─ sub-01/
   ├─ anat/
   └─ func/
      ├─ sub-01_task-xyz_acq-test1_run-1_bold.nii.gz 
      ├─ sub-01_task-xyz_acq-test1_run-2_bold.nii.gz 
      ├─ sub-01_task-xyz_acq-test1_bold.json
      ├─ sub-01_task-xyz_acq-test1_desc-1_run-1_bold.nii.gz 
      ├─ sub-01_task-xyz_acq-test1_desc-1_run-2_bold.nii.gz 
      └─ sub-01_task-xyz_acq-test1_desc-1_bold.json
  • under the current rule, this is a strict violation -- an error
  • under the proposed formulation, none of those _bold.json files would be applied (silently?), since none matches the data file names exactly.
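
The two bullets above can be checked mechanically; a sketch, using a hypothetical entity parser rather than any tool's actual code:

```python
import re

def entities(name):
    # Hypothetical helper: parse key-value entities from a filename.
    return dict(re.findall(r"([a-zA-Z]+)-([a-zA-Z0-9]+)", name))

data = "sub-01_task-xyz_acq-test1_desc-1_run-1_bold.nii.gz"
sidecars = ["sub-01_task-xyz_acq-test1_bold.json",
            "sub-01_task-xyz_acq-test1_desc-1_bold.json"]

# Current rule: both sidecars' entities are subsets of the data file's,
# so two metadata files apply at one level -- a strict violation.
applicable = [s for s in sidecars
              if entities(s).items() <= entities(data).items()]
print(len(applicable))  # 2

# Proposed rule: neither sidecar shares *all* entities (both lack run-1),
# so neither would apply at this level.
exact = [s for s in applicable if entities(s) == entities(data)]
print(len(exact))  # 0
```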

@effigies
Collaborator Author

So I was going here for a minimal relaxation, but after discussing with @ericearl, @Remi-Gau and @tsalo, we ended up at something dangerously close to @Lestropie's "even more relaxed" approach (Lestropie#5).

This is my understanding of the rules, for the sake of making sure we're agreed (I'm generally happy to accept Rob's wording):

  1. Within a level in the directory hierarchy ("level", hereafter), applicable files are sorted from fewest to most entities. The ordering within the level does not matter (mostly). Levels are then concatenated from shallow (near root) to deep (near leaf).
  2. For non-JSON files, if the last two files in the list have the same number of entities and are from the same level, then this is a prohibited ambiguity and an error must be raised. Otherwise, the last file in the list is the associated file.
  3. For JSON files, the procedure is to accumulate metadata from start-to-finish within each level. If a key appears in multiple files with different values, this is a prohibited ambiguity and an error must be raised. Otherwise the level dictionary is aggregated, and the order is defined but irrelevant due to the lack of conflict. The dictionaries are then merged from shallowest to deepest, and the deeper dictionary overrides the shallower dictionary.

This has the following advantages:

  1. Is unambiguous and as simple as possible for the problem being solved.
  2. Resolves a problem "in the wild".
  3. Is backwards compatible in the sense that existing datasets that followed the 1-per-level rule will continue to be interpreted the same.
  4. Takes advantage of considerable time and thought put in previously by @Lestropie and others.

There is a cliff where tools that do not respect this algorithm may not be able to handle datasets constructed after its establishment. My hope is that this will primarily affect derivatives, as both driving use cases come up in derivatives, where the tooling is in greater flux.
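
To make rule 3 concrete, here is a sketch of the per-level accumulation and shallow-to-deep merge (hypothetical code, not any tool's actual implementation):

```python
def merge_levels(levels):
    """Merge sidecar metadata for one data file.

    levels: lists of sidecar dicts, ordered shallow (near root) to deep
    (near leaf); within each level, sidecars are sorted from fewest to
    most entities (the order is irrelevant absent conflicts).
    """
    merged = {}
    for level in levels:
        level_dict = {}
        for sidecar in level:
            for key, value in sidecar.items():
                if key in level_dict and level_dict[key] != value:
                    # Same key, different values, same level: prohibited.
                    raise ValueError(f"conflicting values for {key!r} at one level")
                level_dict[key] = value
        # Deeper levels override shallower ones.
        merged.update(level_dict)
    return merged

print(merge_levels([[{"EchoTime": 0.03}],
                    [{"TaskName": "rest"}, {"EchoTime": 0.025}]]))
# -> {'EchoTime': 0.025, 'TaskName': 'rest'}
```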

@effigies
Collaborator Author

Note that any relaxation may be disallowed before BIDS 2.0 (#1339). In any case, I think the relaxation I suggested in my last comment is off the table unless the SG reconsiders.

The solutions I see are:

  1. fMRIPrep can add entities to eliminate ambiguity.
  2. We can find some lesser relaxation that is possibly more complicated.

@arokem
Collaborator

arokem commented Aug 17, 2023

I think that we can certainly reconsider #1339. I am wondering whether this also suggests reprioritizing 2.0, given this additional pressure. I also wonder whether the direction pursued (and abandoned) in #1280 could help us circumvent some of these issues without changing the inheritance principle. I still think that it has a naturalness to it that is appealing, possibly preferable to these relatively opaque (if simple!) rules.

@Lestropie
Collaborator

Muahahaha. Welcome all back to my nightmare... 😈

If multiple applicable metadata files exist at the same level of the hierarchy and one file shares all entities with the data file, then only that metadata file that shares all entities with the data file applies to the data file at that level of the hierarchy.

Personally I would strongly discourage this. Firstly, it's an unnecessary and unintuitive layer of complexity on a rule that is already highly criticised for those very attributes. Secondly, it would directly preclude the very situation for which I require a more advanced inheritance principle for BEP016.

There is a cliff where tools that do not respect this algorithm may not be able to handle datasets constructed after its establishment.

Agreed. Unfortunately it may be silent mishandling.

My thoughts here have generally arrived at:

  1. Generate exemplar datasets that can be used for unit testing, to ensure correct sidecar association / correct order of sidecar dictionary load & merge / correct identification of sidecar content clashes.

  2. Need major BIDS parsing software packages to support the new principle before the specification can be updated.

  3. Need to encourage more widespread utilisation of that software in the domain, as hack solutions will be increasingly likely to result in erroneous parsing.

  4. Contemplate how to deal with validator. Entirely out of my depth here.

  5. Consider whether such a change needs to be deferred to BIDS 2.0.

fMRIPrep can add entities to eliminate ambiguity.

Can you link to any relevant discussion? I might have ideas given how long I've spent trying to get around this problem.

@effigies
Collaborator Author

  1. Generate exemplar datasets that can be used for unit testing, to ensure correct sidecar association / correct order of sidecar dictionary load & merge / correct identification of sidecar content clashes.

@Remi-Gau was saying the same thing.

  • Need major BIDS parsing software packages to support the new principle before the specification can be updated.

I am extremely confident that the algorithm described above can be done in pybids and the validator. I'm sure in bids-matlab as well. While I'm okay with pybids being a little experimental, the validator is very nearly identical to the spec for most people, so we'd want to co-release. If a working branch is needed to demonstrate the feasibility, I'm for it.

@ericearl is also interested in writing a set of tools that implement the inheritance principle that can be called/imported in various languages.

  • Need to encourage more widespread utilisation of that software in the domain, as hack solutions will be increasingly likely to result in erroneous parsing.

Very difficult to enforce, but Eric's plan of a reference implementation would help. If it's fast and easy to use from multiple languages, even better.

  • Contemplate how to deal with validator. Entirely out of my depth here.

If you just mean implementation, we can do it. If you mean something like "different behavior based on the dataset's claimed BIDSVersion", that will be a much more complicated demand.

  • Consider whether such a change needs to be deferred to BIDS 2.0.

Possibly. The steering group gets the final word, obviously.

fMRIPrep can add entities to eliminate ambiguity.

Can you link to any relevant discussion? I might have ideas given how long I've spent trying to get around this problem.

I just meant this (from first post):

sub-01/
  anat/
    sub-01_desc-preproc_T1w.json
    sub-01_desc-preproc_T1w.nii.gz
    sub-01_space-MNI_desc-preproc_T1w.json
    sub-01_space-MNI_desc-preproc_T1w.nii.gz

I could change it to:

sub-01/
  anat/
    sub-01_space-orig_desc-preproc_T1w.json
    sub-01_space-orig_desc-preproc_T1w.nii.gz
    sub-01_space-MNI_desc-preproc_T1w.json
    sub-01_space-MNI_desc-preproc_T1w.nii.gz

Practically, this probably means always adding a space- entity, even when not needed to distinguish results, because you don't want the naming scheme of your preprocessed T1w image to depend on whether you also target other spaces in the current command. And a second run of the tool to get more outputs shouldn't turn a valid derivative into an invalid one.

That would be dissatisfying to me, but if that's what's needed, we can do it.

@Remi-Gau
Collaborator

  1. Generate exemplar datasets that can be used for unit testing, to ensure correct sidecar association / correct order of sidecar dictionary load & merge / correct identification of sidecar content clashes.

@Remi-Gau was saying the same thing.

Side comment: I think there is an empty niche in our tooling ecosystem for a package that generates dummy datasets matching certain "requirements" so they can be used for testing.

There is f(ake)mriprep, but it is not very configurable and its use case is very narrow.

Maybe this could be the occasion to produce a first draft of such a tool.

@Remi-Gau
Collaborator

I'm sure in bids-matlab as well

I am sure that bids-matlab could actually be improved by this.

@yarikoptic
Collaborator

yarikoptic commented Aug 18, 2023

3. For JSON files, the procedure is to accumulate metadata from start-to-finish within each level. If a key appears in multiple files with different values, this is a prohibited ambiguity and an error must be raised

Although I see the rationale, I worry that this one would make it too restrictive, as cases like

sub-01/
  anat/
    sub-01_desc-preproc_T1w.json
    sub-01_desc-preproc_T1w.nii.gz
    sub-01_space-MNI_desc-preproc_T1w.json
    sub-01_space-MNI_desc-preproc_T1w.nii.gz

or the like could easily lead to conflicts.

Edit: I do not worry about tools (as you mentioned, they can disambiguate) but rather about humans. But maybe my concern is ungrounded -- it would be worth looking through OpenNeuro datasets to see where such conflicts exist ATM.

@Lestropie
Collaborator

Practically, this probably means always adding a space- entity, even when not needed to distinguish results, because you don't want the naming scheme of your preprocessed T1w image to depend on whether you also target other spaces in the current command.

When it comes to the inclusion of entities, I have always conceptualised it as being on an as-needed-to-disambiguate basis. That's the reason why we don't put _run-1 on literally everything. So I would be fully in support of omitting "_space-*" if the data have not been transformed to any other space and no other data are present that have been, but being forced to include "_space-orig" if the latter is not true. Any downstream applications using these derivatives as input would need to be clever enough to know this and search for the image it wants accordingly.

I don't like the idea of having the difference between two data files being the presence vs. absence of an entity, as opposed to different values for an entity. Given enough thought I wonder if I could make the case that the specification should be forbidding it outright (for data files; for metadata that's just inheritance, and should IMO be used wherever possible).

And a second run of the tool to get more outputs shouldn't turn a valid derivative into an invalid one.

Being more explicit about your use case here, since it's important for the argument: You want it to be possible for a tool to update an existing derivatives dataset with content that was not requested when that dataset was first created. So that's introducing a degree of mutability to a derivatives dataset over and above regenerating absent / withheld data.

In that scenario, I would extend my entity philosophy above and say that you need to include whatever entities are necessary to provide disambiguation for all possible outputs from that tool from all possible re-runs.

Although I see the rationale, I worry about this one would making it too restrictive

Thanks for accidentally providing support for my proposal to ban data files whose entities are a strict subset of another's :-P

Generate exemplar datasets that can be used for unit testing

Regarding this one, I would actually suggest just taking something like "advanced example 1" currently proposed for addition in #1003, and producing a filesystem realisation of it along with the expected associations & orders for cross-checking.
