ENH: Add --input flag #2464
Conversation
We'd need something like the feature described in gh-1424 to address this.
The next commit will add a `get` call to run, and `get` sets the message to a tuple.
Allow the caller to specify files that we should obtain with 'git annex get' before the run. In addition to data files, this input could be a Singularity container (as described in gh-2158). At least a couple of decisions here should be given more thought:

* The path to --input must be relative to the root of the dataset.
* If a glob is given as the input value, it is expanded at the time of the run. For example, if '*' was used to indicate all annex files, a rerun would consider the inputs to be all of the annex files at the time of the initial run. Expansion during the initial run seems appropriate because most of the time re-executing a command should only require those inputs. However, the caller might want to rerun a command that itself globs for its input files, in which case files added after the initial run wouldn't be retrieved. Perhaps we could also store the unexpanded inputs and add a rerun flag that says to use the unexpanded set.

Re: datalad#2432
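The expand-at-run-time behavior described above can be sketched as follows. This is a minimal illustration, not DataLad's actual implementation; `expand_inputs` is a hypothetical helper name:

```python
import glob
import os


def expand_inputs(patterns, dataset_root):
    """Expand glob patterns relative to the dataset root at run time.

    The expanded list is what gets recorded, so a later rerun
    retrieves exactly the files that matched during the initial run,
    not whatever the pattern would match at rerun time.
    """
    matched = []
    for pattern in patterns:
        hits = glob.glob(os.path.join(dataset_root, pattern))
        matched.extend(sorted(os.path.relpath(h, dataset_root) for h in hits))
    return matched
```

Recording the expanded list is what makes a rerun reproducible, at the cost of missing files added after the initial run.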
Codecov Report
@@ Coverage Diff @@
## master #2464 +/- ##
==========================================
- Coverage 88.99% 80.66% -8.33%
==========================================
Files 283 240 -43
Lines 33146 29629 -3517
==========================================
- Hits 29499 23901 -5598
- Misses 3647 5728 +2081
Continue to review full report at Codecov.
Thanks! First comment: I think we should stick to the expansion style that annex is using, and that we rely on elsewhere too. Otherwise we break the paradigm (for no obvious reason). We should also be careful with expanding the inputs.
My reason is that the ability to glob was requested in gh-2432. What's currently possible is to glob via annex's
The globbing is documented as "what would At any rate, I'll plan to add
Yes, I agree. For globbing in general, I tried to touch on this in my opening comment (and the message of d5d00f2). As I mentioned there, I see the stricter "expand at the initial run time" as the better default, but I can imagine situations where the caller would want the "expand on each rerun" behavior.
We'll use this for --output as well.
An upcoming commit will want lists from partition again, so let's avoid repeating this pattern.
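The refactor described above amounts to a small helper that splits a sequence into two lists in one pass. A rough sketch (the name and return order are assumed for illustration, not necessarily DataLad's actual code):

```python
def partition(items, predicate):
    """Split items into (non-matching, matching) lists in one pass.

    Returning lists rather than iterators lets callers reuse both
    halves without re-consuming the input.
    """
    no, yes = [], []
    for item in items:
        (yes if predicate(item) else no).append(item)
    return no, yes
```

Factoring this out avoids re-implementing the same two-bucket loop at each call site.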
This doesn't matter for inputs, but for outputs we'll need to do this check before unlocking them, so we might as well move the inputs block to keep inputs/outputs processing close together.
This also avoids the false warning of no matching files on rerun.
Right now, we always expand at run time, so there is no point in calling _resolve_files.
While --input maps to 'git annex get', --output maps to 'git annex unlock' or 'git rm', depending on whether content is present. We can't unconditionally unlock because that leads to an error if content isn't present (dataladgh-2432). We're expanding the globs on the initial run in the same way we do with --input. As discussed in d5d00f2, this should be revisited because there are cases for both --input and --output where the caller would want to store the unexpanded patterns and re-glob on each rerun. Re: datalad#2432
Unlike the previous version, this will work with annex files in subdatasets, and it's simpler because we let `remove` handle the annex checks.
This pattern is no longer used in multiple places.
When content isn't present for a file that was added/modified in a revision, it'd be better to remove the file rather than trying to unlock it, which fails. But if we do that, we should make sure we process inputs first since that step may retrieve an added/modified file (perhaps via --input=.). To do both these things, piggyback on run's outputs argument.
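The ordering constraint above (retrieve inputs before deciding how to prepare outputs) can be sketched as a small planner. All names here are hypothetical, and `content_present` stands in for an annex content check:

```python
def prepare_run(inputs, outputs, content_present):
    """Plan preparation steps for a (re)run.

    Inputs are retrieved first, since doing so may bring back content
    for a path that is also an output (e.g. via --input=.). Only then
    is each output inspected: unlock it if its content is present,
    remove it if not (unlocking absent content fails).
    """
    plan = [("get", path) for path in inputs]
    for out in outputs:
        action = "unlock" if content_present(out) else "remove"
        plan.append((action, out))
    return plan
```

The key point is that the presence check on outputs is only meaningful after the input retrieval step has run.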
Playing with the beast. Observations: In a plain Git repo:
The latter code path assumes the presence of an annex. Continuing:
All files are tracked in Git directly, so there is nothing that concerns annex, but the message is still misleading in this case. Here I have a folder with 5461 files under annex:
It aborts before anything happens (10s for checking something). Repo stays clean, no added commit, no untracked stuff. The corresponding fire-and-forget
That is half the time (and probably still slower than what it could be doing). Now the
So it wants
I guess argument expansion hit.
It managed
So no argument expansion per se, but if we want to go down this path, we would need to write the commit message into a file first and then ingest it from that file. But I am not sure at all that this is a good idea. I have real datasets with >200k annex files that are all input into an analysis. In my DICOM case they may be split into ~50k chunks per directory (one acquisition session). So I would routinely end up with 50k-line commit messages just to say "I need this directory" -- if aggregated metadata reaches such a size, we are already putting it in annex ;-)

Thanks @kyleam for pushing this forward!
This test (introduced by this series) will fail when this branch is merged to master because dcdbf65 removes add's commit argument.
Thanks for giving this a test drive @mih. I'll need to digest/work through your comments, but there are clearly issues with my initial attempt at implementing this :/
In a plain git repo, this avoids (1) displaying a misleading "No matching files..." warning when a glob is passed to --input and (2) running an unnecessary 'datalad get .' when '.' is passed to --input. Re: datalad#2464 (comment)
We want to be able to give a directory to --input. Doing so allows a command to depend on a potentially large number of files without depending on everything with "." or globbing with --input='subdir/*', which stores all the subdirectory's file names in the commit message (which isn't pleasant to look at in the log and will currently fail if the message reaches a certain length).

AnnexRepo.is_under_annex, however, fails when passed a directory because it calls AnnexRepo.info, which raises an AssertionError. It's not clear what "under annex" should mean for a directory (contains one or more annexed files?), but at the least AnnexRepo.info should probably return information for the directory rather than raising an AssertionError. So avoid passing directories to is_under_annex.

If is_under_annex / info are adjusted to handle directories, it might make sense to modify _resolve_files to return a directory only when it contains an annexed file.

Re: datalad#2464 (comment)
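The workaround described above amounts to filtering directories out before querying annex status. A minimal sketch, with an assumed helper name (the real check happens around DataLad's is_under_annex call):

```python
import os


def annex_query_candidates(paths):
    """Keep only non-directory paths for per-file annex queries.

    AnnexRepo.is_under_annex raises an AssertionError (via
    AnnexRepo.info) when handed a directory, so directories are
    passed straight through to 'get' instead of being queried.
    """
    return [p for p in paths if not os.path.isdir(p)]
```

This sidesteps the AssertionError without deciding what "under annex" should mean for a directory.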
OK, so the issues I took away from @mih's feedback were:
The first three should be fixed. For (3), passing a directory to --input should now work. The fourth should be less of an issue now because, for a subdirectory with lots of files, the subdirectory name itself can now be passed. But we should probably still address it.
Thanks @kyleam! I'll try to explore it a little more over the next two days and get back to you shortly.
The main motivation for this change is that (1) globbing isn't restricted to the current dataset, (2) passing the pattern to glob.glob respects the current working directory rather than the top-level of the repo, and (3) we might want to make --input/--output about more than just getting/unlocking files, so we may not want to couple the globbing to 'git annex find'.
Globbing can quickly result in really long commit messages, so store the unexpanded form by default. The next commit will add an option to store the expanded globs.
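The store-unexpanded-by-default behavior can be illustrated schematically. A sketch under assumed names (`inputs_to_record` and the `expand` flag mirror the described design, not DataLad's exact code):

```python
import glob
import os


def inputs_to_record(patterns, dataset_root, expand=False):
    """Return what gets written into the run commit message.

    By default the raw patterns are stored (short messages, and the
    patterns are re-globbed on each rerun). With expand=True the
    matched file list itself is stored, freezing the input set at
    the time of the initial run.
    """
    if not expand:
        return list(patterns)
    expanded = []
    for pattern in patterns:
        hits = glob.glob(os.path.join(dataset_root, pattern))
        expanded.extend(sorted(os.path.relpath(h, dataset_root) for h in hits))
    return expanded
```

The trade-off: unexpanded patterns keep commit messages small but make reruns sensitive to files added later; expanded lists pin the inputs exactly.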
The latest push is a bit rough, and a few tests are failing for reasons I don't yet understand. I'll dig into it more tomorrow.
When rerunning, `outputs` is used to unlock the added/modified files. But with --onto, these might not be present. Don't issue a warning if the glob doesn't match in this case, because that's expected and the output isn't a user-supplied output. One issue with this solution is that we no longer warn if user-supplied output globs don't match on rerun. To do that, we'd need to decouple automated and user-supplied outputs.
If a dataset is given on the command line (or when run is called as a dataset method), expand globs relative to the top-level of the dataset. This follows the convention that datalad uses for path handling. The large test diff is due to an indentation change; we no longer need a chpwd block.
Should now be addressed. The main changes in the recent pushes:
I'm gonna go with "ballerina crossword", I guess?
BAMM!
This is a first pass at allowing inputs to be specified when calling `datalad run`. WIP because:

* I haven't tested it too thoroughly.
* I haven't yet tried to add `--output`, but we might want to do that here (as opposed to a separate PR).
* I'm not sure how we should handle glob expansion.
If a glob is given as the input value, it is expanded at the time of the run. As an example, if '*' was used to indicate all annex files, a rerun would consider the inputs to be all of the annex files at the time of the initial run.
Expansion during the initial run seems appropriate because most of the time re-executing a command should only require those inputs. However, it's possible that the caller would want to rerun a command that uses globs itself to get its input files, in which case files that were added after the initial run wouldn't be retrieved.
Perhaps we could also store the unexpanded inputs and add a rerun flag that says to use the unexpanded set.
Update: I'm still not sure, but the current approach is to store the unexpanded form unless the `--expand` option is used.

Note: #2432 mentioned --input=. as meaning all annex files should be present, but I didn't end up adding that because it's possible with globbing (`--input='*'`).

re: #2158, #2432