Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: specify push --data modes; keep --force to only force git push and checkdatapresent #4620

Merged
merged 12 commits into from Jun 23, 2020

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jun 9, 2020

This is an implementation for #4604 .

Notes:

  • it is not harmonized with publish or get (see #4604 for notes on that)
  • from Python programming side perspective I do dislike that we cannot say data=True or False. data='nothing' is not worse IMHO though than force=['no-datatransfer'] though.
  • --force=checkdatapresent (resolves #4603) in b20ed13
  • --force=all instead of --force=pushall in 65aae48 since IMHO forces should be orthogonal and could all be combined
  • Closes #4602 which was an alternative proposed solution (--force=onlywanteddatatransfer)
  • Removed datalad.push.copy-auto-if-wanted config adding a mode auto-if-wanted (1be419f) and making it to be default mode of operation. It would cause transfer ("anything") if no "wanted" is not set, thus fulfilling the "do the right thing by default".
  • Uses --data (see #4622)

See individual commits for more information. I had tried to make them atomic enough so we could drop some if so desired.

yarikoptic added 7 commits Jun 9, 2020
Changes are intensionally minimal.  Subsequent commits might refine further
I am luke warm about this change -- just wanted to compartmentalize
"forces" to _push instead of spreading logic around all helper functions
… (to be more to the point of the effect)

Closes datalad#4603

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "git-sedi datatransfer checkdatapresent",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "datalad/core/distributed/push.py",
  "datalad/core/distributed/tests/test_push.py"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
…just enable all forces - orthogonal)

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "git-sedi pushall all",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "datalad/core/distributed/push.py",
  "datalad/core/distributed/tests/test_push.py"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Any option is about changing "automatic decision making" IMHO. I think it is
important to mention possible effects such as absence of --force for push is
a safety measure. Default --fast  is optimization and name "checkdatapresent" is
self descriptive on the effect
…auto-if-wanted config

I was going back and forth either we need some additional "datalad.push.transfer-data"
config setting which could be set per repository to indicate what "datalad push"
should do with data.  But I remain of an idea that it would be counter-productive
to establish yet another level of data transfer behavior instead of using what is
already there provided by git-annex itself.

To maintain "do the right thing" behavior by default I had decided also to make
auto-if-wanted to be the default mode of operation -- it will transfer data if there
is no explicit "wanted" setting for the remote.
…t-annex

To ease possible troubleshooting later on whenever some expected data transfer
does not happen.  Also tuned up a comment to say "copy" not "move", which is a
valid (but not supported ATM) operation.
@mih
Copy link
Member

@mih mih commented Jun 9, 2020

No in-depth review yet, just a bunch of comments after looking at the general changes.

--force=all instead of --force=pushall in 65aae48 since IMHO forces should be orthogonal and could all be combined

We have discussed that before and came to the conclusion that it should not be all, because it puts a strong limitation on future changes. Namely that any addition is indeed orthogonal. However, there is no requirement for that. A mode is something that is mutually exclusive to other modes, not that the behavior is fully orthogonal (whatever that even means in detail). What does orthogonality bring as a benefit? Original change was e9727d1

I started looking for a push invocation in my workflows that does not require --transfer-data anything now (or rather again, because that was one motivation to keep this separate and different from publish). So far I have not found a single case. I think a command that is meant to transfer things elsewhere should do that by default. I have not manually verified all use of push shown in the handbook, but it looks as if they also all need --transfer-data anything.

'auto-if-wanted' would enable '--auto' mode only if there is a
"wanted" setting for the remote, and transfer 'anything' otherwise.
Note: 'anything' and 'nothing' are "wanted" expressions understood
by git-annex.""",
Copy link
Member

@mih mih Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose has the "note"? These values are not actually passed on to git-annex, hence it seems the note merely points out an inconsequential similarity.

Copy link
Member Author

@yarikoptic yarikoptic Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It avoids possible user questions:

  • why not just 'all' and 'nothing'? (or any other pair of words)
  • why not just 'auto', 'none', 'all' as publish has it?

I did not want to unnecessarily expand it here, but overall docstring above or other docs could be extended to state e.g. that "instead of specifying explicitly --transfer-data=nothing to push, you can configure the specific remote via git annex wanted REMOTE nothing to make no data pushed to that particular remote by default."

Copy link
Member Author

@yarikoptic yarikoptic Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but also it probably asks for RFing publish to use those values, which could be done for 0.13, possibly just introducing equivalences of none == nothing, and all == anything; while issuing deprecation warnings whenever someone uses "old style" values. With publish RFing later on (I could attempt it on top of this PR to verify if my assumptions are correct) to use push - it would probably be just a matter of changing default from auto-if-wanted to auto and be done (for the push aspect altogether. Will need to RF also just to call create-sibling first ;))

Copy link
Contributor

@kyleam kyleam Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also felt this note was out of place.

Copy link
Member Author

@yarikoptic yarikoptic Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will remove it then.

@codecov
Copy link

@codecov codecov bot commented Jun 9, 2020

Codecov Report

Merging #4620 into master will increase coverage by 39.83%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4620       +/-   ##
===========================================
+ Coverage   49.82%   89.65%   +39.83%     
===========================================
  Files         286      288        +2     
  Lines       38822    39073      +251     
===========================================
+ Hits        19344    35032    +15688     
+ Misses      19478     4041    -15437     
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100.00% <ø> (ø)
datalad/core/distributed/push.py 89.17% <100.00%> (+7.87%) ⬆️
datalad/core/distributed/tests/test_push.py 99.42% <100.00%> (+23.15%) ⬆️
datalad/interface/tests/test_utils.py 100.00% <0.00%> (ø)
datalad/tests/test_utils_cached_dataset.py 100.00% <0.00%> (ø)
datalad/tests/utils_cached_dataset.py 92.30% <0.00%> (ø)
datalad/interface/base.py 91.22% <0.00%> (+0.07%) ⬆️
datalad/support/constraints.py 87.66% <0.00%> (+0.44%) ⬆️
datalad/core/local/tests/test_create.py 100.00% <0.00%> (+0.44%) ⬆️
datalad/local/subdatasets.py 95.65% <0.00%> (+0.86%) ⬆️
... and 183 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0adaed...0646464. Read the comment docs.

@bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Jun 9, 2020

I started looking for a push invocation in my workflows that does not require --transfer-data anything now (or rather again, because that was one motivation to keep this separate and different from publish). So far I have not found a single case.

To clarify: Did you try that and this description:

'auto-if-wanted' would enable '--auto' mode only if there is a "wanted" setting for the remote, and transfer 'anything' otherwise.

turns out to be incorrect or did you miss it?

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 9, 2020

--force=all instead of --force=pushall in 65aae48 since IMHO forces should be orthogonal and could all be combined

We have discussed that before and came to the conclusion that it should not be all, because it puts a strong limitation on future changes. Namely that any addition is indeed orthogonal. However, there is no requirement for that. A mode is something that is mutually exclusive to other modes, not that the behavior is fully orthogonal (whatever that even means in detail). What does orthogonality bring as a benefit? Original change was e9727d1

Unfortunately I think discussion was not in the issues, but in our weekly meeting. all made no sense given non-orthogonal "forces" to start with. The ultimate and the only combination which made sense was "datatransfer" + "gitpush" which was indeed "bundled" into "pushall" in e9727d1. Indeed we had a concern that if yet another mode would be introduced - it would make "all" even less clear. As now --force "flags" are separate from modes, and we could force them upon any action where they might make sense, e.g. --transfer-data=auto --force=all to signal that I am using "auto" mode but will force push git branch and do "checkdatapresent". --transfer-data=auto --force=pushall makes it a bit more ambiguous IMHO with "push" in the force name -- "am I now forcing some additional 'push'?" . Overall -- I do not see the point in "push" prefix there after this RF. But if you feel strongly - we could just revert that change, I hope to never use it one way or another due to it incorporating git push --force ;-)

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 9, 2020

To clarify: Did you try that and this description:

'auto-if-wanted' would enable '--auto' mode only if there is a "wanted" setting for the remote, and transfer 'anything' otherwise.

turns out to be incorrect or did you miss it?

Yes @bpoldrack observation is correct: default behavior of push was not changed in cases where wanted was not setup. It will "do the right thing" and "push" data by default. And unless some --force was used/mentioned in the handbook, nothing would need to be changed there. I only found following uses:

docs/basics/101-141-push.rst:  in some cases. Therefore, ``--force datatransfer`` is a slower, but more fail-safe
docs/basics/101-141-push.rst:the ``-f/--force`` option. In the latter case, specifying ``-f no-datatransfer``

needing tune up.

@bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Jun 9, 2020

Re pushall vs all:
Agree with @mih here. --force should still be "modes" (and it is w/ this PR!) rather than flags and we might need to introduce other force modes in the future leading to all not making sense.
For clarity re modes vs flags: Flags would allow for arbitrary combinations. Modes "encode" particular combinations, that you have to choose from. In the state of this PR there isn't much of a difference, because ATM all combinations are available, but it is conceptually different, and it makes a difference for future additions.

@mih
Copy link
Member

@mih mih commented Jun 10, 2020

I went through the code again and besides renaming things and adding an additional option, I cannot see which use case is enabled by this PR. I thought that is the key here, that push cannot do a particular thing and it is being added here. However, it looks as if it effectively turns the default of datalad.push.copy-auto-if-wanted from false to true. Which would invalidate some of the discussions we had, and ultimately led to #4615

I would appreciate if my note from #4602 (comment) could be elaborate on in this new context. How and why is this different from

datalad datalad.push.copy-auto-if-wanted=true push

I have a a few more data moving operations scheduled for next week, and will be able to get some practical experience with this code at that point.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 10, 2020

There were no auto mode before (which wouldn't transfer unless wanted set), only auto-if-wanted (which would transfer if wanted isnt set).

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 10, 2020

For clarity re modes vs flags: Flags would allow for arbitrary combinations. Modes "encode" particular combinations, that you have to choose from. In the state of this PR there isn't much of a difference, because ATM all combinations are available, but it is conceptually different, and it makes a difference for future additions.

"--force" modes were and remain subject for possible combination, e.g. gitpush is completely orthogonal to any other "mode". With RFing a dedicated --transfer-data establishes a set of "modes" which cannot be combined among themselves, and I thought that we agreed with @bpoldrack on the value of having them separated from other "--force modes" which to me are just additional flags, but if you like -- I can call them "modes". gitpush is pretty much orthogonal from --transfer-data and could be used/combined with any data transfer mode. checkdatapresent is orthogonal to gitpush and to any data transfer mode , besides completely ignored when --transfer-data=nothing and IMHO that is Ok as e.g. a matter of fact for --force in other tools: git push, rm etc -- --force flags just allow for some operation to happen, but not necessarily mandating it to happen.

@bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Jun 11, 2020

I thought that we agreed with @bpoldrack on the value of having them separated from other "--force modes"

Yes, agree. Just wanted some clarity in language. ;-)
Because I also agree with @mih, in that the API by design should be concerned with "modes", that only require to be mutually exclusive. Only that now there are two such options for two independent aspects of push behavior. That some of the currently available modes are not only mutually exclusive but also orthogonal in functionality is a different story. The API design should not require that. That's why I agree with that split, but no further.

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "sed -i -e 's,transfer_data,data,g' -e 's,--transfer-data,--data,g' {outputs}",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "datalad/core/distributed/push.py",
  "datalad/core/distributed/tests/test_push.py"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 12, 2020

Added a short 7dd99fd making it into --data, will adjust top description, but ok if we decide to not go with it.

@yarikoptic yarikoptic changed the title ENH: specify push --transfer-data modes; keep --force to only force git push and checkdatapresent ENH: specify push --data modes; keep --force to only force git push and checkdatapresent Jun 12, 2020
@@ -123,16 +116,27 @@ class Push(Interface):
data or changes for those paths are considered for a push.""",
nargs='*',
constraints=EnsureStr() | EnsureNone()),
data=Parameter(
args=("--data",),
doc="""what to do with data (annex'ed) data. 'anything' would cause
Copy link
Contributor

@kyleam kyleam Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: repeated "data"

Copy link
Member Author

@yarikoptic yarikoptic Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
doc="""what to do with data (annex'ed) data. 'anything' would cause
doc="""what to do with (annex'ed) data. 'anything' would cause

ds_repo.get_preferred_content('wanted', target)
):
lgr.debug("Invoking copy --auto")
cmd.append('--auto')
Copy link
Contributor

@kyleam kyleam Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so ignoring the obvious addition of the --data flag, the core change is here. We switch the decision to pass --auto from

you didn't explicitly ask to transfer data and you set this config

to

you didn't explicitly ask to transfer data and you've configured wanted for the remote

As far as I can tell, that'd achieve both the behavior that @yarikoptic is after (respect "wanted") and @mih is after (just transfer the data). It comes with an assumption that there aren't callers that set wanted yet would still prefer push to not use --auto; sounds pretty safe.

It swaps a config.obtain call for a .get_preferred_content call per dataset, which seems acceptable given any data transfer should swamp that expense.

args=("--data",),
doc="""what to do with data (annex'ed) data. 'anything' would cause
transfer of all annexed content, 'nothing' would avoid call to
`git annex copy` altogether. 'auto' would use 'git annex copy' with
Copy link
Contributor

@kyleam kyleam Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we drop 'auto' until someone requests it? It seems no one complained about publish not honoring numcopies, so perhaps 'auto-if-wanted' is sufficient.

Copy link
Member

@mih mih Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, what would remain is identical to datalad datalad.push.copy-auto-if-wanted=true push -- according to #4620 (comment)

I started using this code in daily operations, will report.

Copy link
Member Author

@yarikoptic yarikoptic Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto mode is to be able (not yet tested) to make publish to use push since it is it's mode of operation - to not copy data by default, unless wanted is set, and no paths are explicitly specified

Also placed list of constraints to a new line to make it fit 80 chars
@mih
Copy link
Member

@mih mih commented Jun 16, 2020

So, I have now used it on a few datasets and it didn't break my workflows.

I still feel that this move negates/neglects one of the core aspects of our previous discussions, namely that adding --auto whenever there is wanted, but it seems there is no way around that.

So feel free to go ahead.

However, it would be good to get tests for the additional functionality that was added. In particular, because we are lacking an actual use case for it, on which it could be evaluated.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 16, 2020

However, it would be good to get tests for the additional functionality that was added. In particular, because we are lacking an actual use case for it, on which it could be evaluated.

good point. Will do. Also probably would be valuable to test with paths being provided to restrict effect of auto and subdatasets.

Adapt the test posted at dataladgh-4541.  Aside from tweaking the way to
specify "use --auto", add a block that tests that the following two
behave the same when a file does not match a remote's preferred
content:

  datalad push --to=r --data=auto FILE
  git annex copy --to=r --auto FILE
@kyleam
Copy link
Contributor

@kyleam kyleam commented Jun 18, 2020

@mih:

However, it would be good to get tests for the additional functionality that was added. In particular, because we are lacking an actual use case for it, on which it could be evaluated.

Not really a use case, but I've pushed a commit that tests data='auto' (basically the test from gh-4541).

@yarikoptic:

Also probably would be valuable to test with paths being provided to restrict effect of auto and subdatasets.

I wasn't able to figure out what the intended effect is here and in general I'm now a bit turned around about what the path argument's role is.

@kyleam
Copy link
Contributor

@kyleam kyleam commented Jun 19, 2020

Also probably would be valuable to test with paths being provided to restrict effect of auto and subdatasets.

I've tried to address that with 20d80cf.

I said:

I wasn't able to figure out what the intended effect is here and in general I'm now a bit turned around about what the path argument's role is.

My confusion was based on doing something stupid in the setup of an earlier iteration of the test added in 20d80cf. I think what's now in that test captures the intended effect.

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 19, 2020

Thank you @kylam -- tests look great to me! crippledfs failure - filed separate #4644 (search for the test name didn't point to any existing issue)

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 19, 2020

coverage boost of (+39.83%) is especially impressive ;)

@yarikoptic
Copy link
Member Author

@yarikoptic yarikoptic commented Jun 19, 2020

altogether -- I would just wait for @mih 's final blessing or TODOs ;)

@yarikoptic yarikoptic added the merge-if-ok OP considers this work done, and requests review/merge label Jun 19, 2020
@mih mih merged commit dac6cfb into datalad:master Jun 23, 2020
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok OP considers this work done, and requests review/merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants