Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle file exclusions and inclusions in 0.8? #378

Closed
tyarkoni opened this issue Feb 5, 2019 · 20 comments
Closed

How to handle file exclusions and inclusions in 0.8? #378

tyarkoni opened this issue Feb 5, 2019 · 20 comments
Labels

Comments

@tyarkoni
Copy link
Collaborator

tyarkoni commented Feb 5, 2019

This is an issue that has cropped up repeatedly in various guises (e.g., #215, #364, #277, #184, #131, and probably others). The question is how to allow users to specify explicit inclusion and exclusion paths at BIDSLayout initialization. The reason for bringing this up again is that, as of version 0.8 (see #369), pybids will no longer depend on grabbit. The cord-cutting means we can no longer rely on the behavior implemented in grabbit. Since this was entirely undocumented in pybids, I think we have a good opportunity to start afresh and hopefully settle on something that works for everyone.

The main constraints I think we should try to respect are:

  • We want to exclude a bunch of hard-coded subdirectories by default (e.g., 'code', 'stimuli', 'sourcedata', etc.)
  • Users should be able to easily override any of the default exclusions and make sure they're indexed
  • Users should be able to specify arbitrary directories anywhere in the file system that should not be indexed (in the event they're encountered in any raw or derivatives BIDSLayout)

The current approach doesn't allow users to specify explicit exclusions at all (well, it does, but this is an undocumented grabbit feature). It uses an include argument only as a means of negating the default exclusions. E.g., if you want 'stimuli'/ to be indexed, you pass include=['stimuli']. Beyond this, there's no pybids-level ability to control inclusions or exclusions (aside from specifying derivatives, which is a separate matter that I think we're handling in a satisfactory way). I don't think this is satisfactory, and a bunch of the opened issues reflect that.

Here are a few proposals (feel free to suggest others):

  1. Keep the current approach, where include negates values in the default exclusion list, but add an exclude argument that causes any matching files/dirs to be skipped during indexing. The main downside I see here is that the behavior is counterintuitive, as include and exclude act asymmetrically. A potential fix is to give these arguments different names (e.g., override_exclusions and exclude_paths).

  2. Stick with just exclude, and have any manually specified value override the default internal list (e.g., if you pass ['code', 'sourcedata'], then things like 'stimuli' will now be indexed, and only files/dirs that match the elements in your list will be skipped). The downside of this is it requires users to know what the default exclusions are, and reproduce them, and this will probably get pretty messy.

  3. Get rid of the current default exclusion list entirely, and treat exclude as a strict list of paths to exclude from indexing. Now that the validator is working properly, directories like 'stimuli' will automatically be skipped if validate=True, because files won't pass the validator unless they're explicitly part of the spec. The downside of this option is that it makes it difficult to index selectively—e.g., if you want to index only what's in 'stimuli', you need to set validate=False and then pass a whole pile of exclusions (i.e., everything that doesn't pass the validator except for 'stimuli').

I lean towards (1) (with more explicit argument names). Thoughts? If I don't get any feedback in the next couple of days, I'll make an executive decision in the interest of getting 0.8 merged, so speak up now if you have an opinion! (Tagging in @effigies @adelavega @yarikoptic @gkiar)

@tyarkoni tyarkoni mentioned this issue Feb 5, 2019
5 tasks
@satra
Copy link
Contributor

satra commented Feb 5, 2019

if i understand this correctly, there are three potential operators:

  1. V: validate=True - restrict based on spec
  2. E: exclude=[] excludes explicitly
  3. I: include=[] includes explicitly

and then there is order of operations and combinations of operations.

the base state (let's call it B) is index everything in a root and then apply some operations on it. here are some operations in decreasing order of use-cases (in my mind)

  • V(E(B))
  • V(B1) + B2 where + stands for include results of both operands
  • V(B1) + V(B2)

each of the operations returns a layout object. so one could do things like:

L1 = BIDSLayout('/some/root')
L2 = BIDSLayout('/some/other/root')
L1a = L1.exclude([...], in_place=False) ->  returns a different object than L1
L2a = L2.include([...], in_place=False) ->  returns a different object than L1
L3 = (L1 + L2).validate() -> operates on the original L1 and L2
...

the actual implementations can do optimizations under the hood in terms of order of operations, especially if you think of the index being a future. the actual indexing, exclusion, validation doesn't have to happen till the point a program asks for something concrete.

@adelavega
Copy link
Collaborator

adelavega commented Feb 5, 2019

I lean towards 3 for API simplicity (as include is guaranteed to include all non-BIDS valid entities as well). If we allow merging of Layouts as @satra suggested, that would allow the V(B1) + B2 scenario (although it seems that the roots might conflict? and it's also potentially confusing if adding derivatives uses a different mechanism).

1 with different names also seems reasonable. Essentially exclude works as an append to a default exclude list, and include over-rides the resulting complete exclude list.

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 5, 2019

@satra sure, we could add explicit .exclude() and .include() methods to the BIDSLayout (though I think I'd prefer them to operate in-place by default). But I don't think we want to make every user make multiple calls just to end up with the right files indexed. I'd rather settle on a sensible set of initialization arguments that cover nearly all use cases, and it seems like that should be possible.

@adelavega my (3) doesn't include any include argument. Are you proposing to add one, and handle it as the inverse of exclude? I.e., if validate=True, all paths that fail validation are excluded, unless they're found in include? If so, let's call that option (4).

Let's set aside the question of merging BIDSLayout instances for now. It's not entirely clear what it would actually mean to merge two layouts, because a BIDSLayout is defined as a complete, valid BIDS project (either raw or derivatives), not just as an arbitrary file tree. E.g., you can't merge two objects that have different dataset_description.json files—which one takes precedence? (What you can do is nest BIDSLayout instances within each other using derivative/source relationships, which is how things are currently implemented).

@satra
Copy link
Contributor

satra commented Feb 5, 2019

@tyarkoni - i agree that merging bidslayout for different datasets is out of scope for now. i was more thinking of where the derivatives of a dataset are on different filesystems, i can use layouts to integrate them. the following is not an uncommon scenario in hpc clusters.

dataset: /filesystem_read_only/dataset
deriv1: /filesystem1/dataset/fmriprep/
deriv2: /filesystem2/dataset/freesurfer/

now i want a layout that integrates these into a single layout index.

@satra
Copy link
Contributor

satra commented Feb 5, 2019

@tyarkoni - the only reason to add extra calls is to allow for order of operations, otherwise as an argument you would have to somehow be able to specify order of operations.

rules = [('validate', 'True'), ('exclude', [...])]
rules = [('include', [...]), ('validate', 'True'), ('exclude', [...])]
...

@adelavega
Copy link
Collaborator

@tyarkoni Ah, sorry I misspoke. But, what you suggested I think is better. That way, most people don't have to touch include or exclude, but they allow you to over-ride the validator.

But I agree the order of operations is problematic. Is it not suffient to just set an order and have the user work around that order (e.g. validate, include, exclude).?

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 5, 2019

@satra that kind of derivative handling is already supported! You can do:

layout = BIDSLayout('/bids/raw/root', derivatives=['/deriv1', '/deriv2'])

Then each derivative folder is initialized as its own BIDSLayout, and all the derivatives are available from the parent (raw) BIDSLayout in a derivatives dict (the keys are the pipeline names). When you search on the parent layout (e.g., with get()), you can specify the scope to control whether the derivatives are also scanned.

Re: order of operations, I agree there will probably be use cases where that matters. I think implementing .include() and .exclude() (or maybe add() and remove()) methods is a good idea independently of what we decide here. But I think for the vast majority of cases, order shouldn't matter—as @adelavega says, we just need to be clear on what the precedence rules are, and then the user can work with that. At first glance, I think validation --> exclude --> include seems reasonable. I'm operating under the assumption that > 99% of the time, the only reason anyone will want to explicitly use an include directive is to override a default exclusion.

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 5, 2019

@adelavega actually, if we assume that include takes precedence over exclude, and that by default the hard-coded list of dirs is used as the exclude, then I think this option (4) (which is basically (2), but with an added include that has no special respect for the default exclusion list) makes good sense. I like it because exclude and include are then operating symmetrically, and it's just that the latter takes precedence.

The only non-crazy use case I can think of where this wouldn't work is if the user wants to, say, index 'stimuli', but exclude 'stimuli/millionfiles'. In that case, if we provide explicit class methods like Satra suggested, then those users can tweak the layout after initialization. Or, alternatively, we could come up with some alternate advanced specification represented as a dict or something, and allow that as another argument. But I think I'd want to defer something like that until someone has a concrete need for it.

@effigies
Copy link
Collaborator

effigies commented Feb 5, 2019

Just to be obnoxious... If we're considering a .bidsignore file, would it help intuition to rename exclude to ignore? Assuming each entry operates functionally the way we expect an entry in .bidsignore to behave.

@adelavega
Copy link
Collaborator

That makes sense to me. But in that case then the order would be: ignore validate include, correct?

And @tyarkoni in such a case, isn't it possible to pass a regex for include that "excludes' stimuli/millionfiles?

@effigies
Copy link
Collaborator

effigies commented Feb 5, 2019

Maybe? This discussion has been pretty much exclusively at a level I'm not currently prepared to engage, so order of operations is not something I can comment on. This is just a UX consideration.

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 5, 2019

@effigies I'd be fine with ignore; I have no preference between that and exclude. I prefer ignore by itself, but that's offset by the symmetry of exclude and include. EDIT: maybe if we rename include to force_indexing?

@adelavega I don't think switching ignore and validate can make any difference (they both just remove files that would otherwise be indexed), as long as include takes precedence over both.

I'm on the fence about whether or not to allow regexes here, and leaning towards not. The main issue is that allowing regexes by default is dangerous, because lots of things could inadvertently match strings that aren't meant to be interpreted as regexes. E.g., suppose someone passes code, intending that to exclude all code/ directories directly below the root. But if you have code show up anywhere in a filename, poof, now those files are gone too. So then to prevent that we either need to add a bunch of detail in the docstring (which most people won't read carefully) and maybe a new formatting flag, or we need to add yet another argument controlling whether or not regexes are allowed. I kind of like keeping it simple and just sticking with fully-matching names, at the risk of potential redundancy.

@effigies
Copy link
Collaborator

effigies commented Feb 5, 2019

How about ignore and attend? 😛

@adelavega
Copy link
Collaborator

Okay, I suppose I can live without regex. I'm currently using it to exclude confound files from the fmriprep dataset (since we provide our own events, including confounds), but I think I can workaround that in a more explicit fashion.

To be concrete, I'm currently passing the following to fitlins, which passes it on to pybids: --exclude=(fmriprep.*$(?<=tsv))

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 5, 2019

This is one of the places where I really miss Ruby's (and shudder Perl's) native regex support. This kind of thing would be so much easier if you could just pass /(fmriprep.*$(?<=tsv))/ without messing things up for people passing strings. I suppose the next best thing is to allow user to pass SRE_Pattern objects (i.e., compiled regular expressions), and then check for those before branching.

@effigies
Copy link
Collaborator

effigies commented Feb 6, 2019

Okay, thinking about this, a little. I think we should choose a consistent syntax with .bidsignore for the ignore/exclude list. So ignore=[...] would add virtual entries to .bidsignore. force_index would not affect the validation, so would not be removing entries from the ignore list but after-the-fact adding them to the index. I think this is what @adelavega meant by the ignore-validate-include ordering.

If you think that's a good idea, perhaps it makes sense to think more about bids-standard/bids-specification#131. The .gitignore rules are more-or-less fnmatch + ** globs with a couple additions involving comments and bangs (!). That seems fairly reasonable.

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 6, 2019

Seems reasonable. I don't want to deal with .gitignore inside pybids though if I can help it. It makes more sense to me to have the bids-validator deal with .bidsignore than to do file validation in the validator and pre-screening in pybids. The former seems cleaner and more maintainable. I propose we do this by having the BIDSValidator take the .gitignore information either at initialization, or in a .setup() call made prior to any attempt to validate individual files. Then the BIDSValidator can internally set up its own rules for screening files.

That would make things extremely simple on the pybids end, because, as you suggest, anything passed to ignore would literally be appended to the .gitignore list passed to the validator. So that would be handled in one step. It would also make it really easy to handle files in force_index, because the internal _validate_file call could first check if the file matches something in force_index, and if it does, it would never even make a call to the BIDSValidator.

@effigies
Copy link
Collaborator

effigies commented Feb 6, 2019

Would BIDSValidator then act as a filter?

class BIDSValidator:
    ...
    def validate(files):
        out_files = [file for file in files if not ignored(file)]
        ...
        return out_files, warnings, errors

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 6, 2019

I was thinking it would follow the current API and do everything one file at a time. We could e.g. add an is_not_ignored() class method alongside the others like is_top_level(), is_phenotypic(), etc. Then pybids would continue to just call is_bids() on each file, which would now also run the is_not_ignored() check. The .bidsignore info could be made an initialization argument of BIDSValidator, where it uses a list if passed, or extracts the list if a filename is given. Does that work?

@adelavega
Copy link
Collaborator

Yes that's what I meant @effigies and I agree with @tyarkoni that letting the BIDSValidator handle the ignore list, with pybids handing it over, would work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants