BF/RF: push - get diff depth-first not breadth-first #5416
Codecov Report
@@ Coverage Diff @@
## maint #5416 +/- ##
==========================================
- Coverage 90.20% 84.52% -5.68%
==========================================
Files 296 293 -3
Lines 42029 42035 +6
==========================================
- Hits 37912 35530 -2382
- Misses 4117 6505 +2388
sweep results:
and otherwise I do not see any side-effects, so unless I am stopped, I will try to tackle that failing test some time soon.
My immediate response would be that I cannot think of a strong reason to prefer one over the other. But I feel like it deserves some more thinking. The reason for giving
if parentds != cur_ds:
    if ds_res:
        # we switch to another dataset, yield this one so outside
        # code can start processing immediately
        yield (cur_ds, ds_res)
So whenever the dataset changes (which I assume can now happen before the last report of the previous dataset was received) it will report a dataset to be pushed. As a consequence, I fear that we will now process one and the same dataset multiple times (with partial records). It is likely that the tests do not look for such a condition.
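To make the concern above concrete, here is a minimal sketch of the grouping logic, assuming records arrive as `(parentds, path)` pairs. The function name `group_consecutive` and the sample paths are hypothetical and are not taken from the datalad code base; only the `if parentds != cur_ds` condition mirrors the snippet quoted above.

```python
def group_consecutive(records):
    """Yield (dataset, paths) every time the parent dataset changes.

    Hypothetical stand-in for the grouping loop quoted above.
    """
    cur_ds, ds_res = None, []
    for parentds, path in records:
        if parentds != cur_ds:
            if ds_res:
                # we switch to another dataset, hand over what we have so far
                yield (cur_ds, ds_res)
            cur_ds, ds_res = parentds, []
        ds_res.append(path)
    if ds_res:
        # flush the records of the last dataset
        yield (cur_ds, ds_res)


# All records of a dataset arrive together: each dataset is reported once.
contiguous = [("/ds", "/ds/sub"), ("/ds", "/ds/z"), ("/ds/sub", "/ds/sub/y")]
# Subdataset records arrive in the middle of the superdataset's records.
interleaved = [("/ds", "/ds/sub"), ("/ds/sub", "/ds/sub/y"), ("/ds", "/ds/z")]

for name, recs in (("contiguous", contiguous), ("interleaved", interleaved)):
    print(name, [ds for ds, _ in group_consecutive(recs)])
```

With the interleaved input, `/ds` is reported twice, each time with only part of its records, which is the condition the tests would need to check for.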
I think I am getting a clue what
def _datasets_since_(dataset, since, paths, recursive, recursion_limit):
    """Generator"""
should yield in its pairs, but I am a bit puzzled on why
so -- the first record is good since
Clean items are expected in the sense that datalad/datalad/support/gitrepo.py Lines 3610 to 3613 in b13ed99
Then I'm not sure on the deeper question of why
I cannot comment on the implications of a switch from
To be truly depth-first there should be no records yielded for the super-dataset before yielding all the records from the subdatasets. So, similarly to how we "cache" reports from subdatasets for breadth-first, "cache" reports within the dataset until done with subdatasets. While at it, and to catch possible typos etc., turned the if/else for "order" handling into if/elif/else: raise. Such a fix is needed for any outside logic relying on depth-first to really be depth-first (like the ongoing RF for "push").
Remote repositories must already exist for push, and might have hooks enabled to process the "clean" state of the repository. For example, in the hooks we deploy with web-ui, we "aggregate" information about sizes within submodules. For that to work correctly, submodules should be pushed first, and not after their "superdatasets". Closes datalad#5410
21c5775 to 55effbd
The Appveyor failure is unrelated. @mih please have a look again - I think I fixed the underlying issue.
RF: push - get diff depth-first not breadth-first
The rationale in the second commit for making push
process deeper subdatasets first makes sense to me. But...
BF: diff._diff_ds - delay reporting within ds entries for depth-first
To be truly depth-first there should be no records yielded for super-dataset
before yielding all the records from the subdatasets. [...]
... in my opinion sticking with the "depth-first" terminology is confusing.
demo repo
set -eu
cd "$(mktemp -d "${TMPDIR:-/tmp}"/dl-XXXXXXX)"
datalad create
datalad create -d . a
datalad create -d a a/b
datalad create -d . c
datalad save -r
touch a/b/foo
touch c/bar
On maint (2b3468a), here's the depth-first order of datalad diff.
$ datalad -f json diff -r | jq -r '.path' | grep -v '.git'
/tmp/dl-j2jRQzN/.datalad/config
/tmp/dl-j2jRQzN/a
/tmp/dl-j2jRQzN/a/.datalad/config
/tmp/dl-j2jRQzN/a/b
/tmp/dl-j2jRQzN/a/b/foo
/tmp/dl-j2jRQzN/a/b/.datalad/config
/tmp/dl-j2jRQzN/c
/tmp/dl-j2jRQzN/c/bar
/tmp/dl-j2jRQzN/c/.datalad/config
I disagree with the commit message; that is a depth-first walk of the tree. Child nodes are traversed before moving to a sister node.
This PR changes that to
$ datalad -f json diff -r | jq -r '.path' | grep -v '.git'
/tmp/dl-j2jRQzN/a/b/foo
/tmp/dl-j2jRQzN/a/b/.datalad/config
/tmp/dl-j2jRQzN/a/.datalad/config
/tmp/dl-j2jRQzN/a/b
/tmp/dl-j2jRQzN/c/bar
/tmp/dl-j2jRQzN/c/.datalad/config
/tmp/dl-j2jRQzN/.datalad/config
/tmp/dl-j2jRQzN/a
/tmp/dl-j2jRQzN/c
So, perhaps we should use the "bottom up" terminology (used in datalad subdatasets) for the behavior introduced by this PR?
GREAT analysis @kyleam, thank you! Indeed, I made it into "bottom up", and what we had was a legit "depth-first". I will introduce "bottom-up" to represent (and use for
With this script (adjusted version of @kyleam's):
#!/bin/bash
export PS4='> '
set -eu
set -x
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
datalad create
datalad create -d . a
datalad create -d a a/b
datalad create -d . c
datalad save -r
touch a/b/foo
touch c/bar
pwd
for s in depth-first breadth-first bottom-up; do
    datalad -f '{path}' diff -r --order $s | grep -v '\.git'
done
and the following diff in the code base to expose the --order option (not committed)
```diff
diff --git a/datalad/core/local/diff.py b/datalad/core/local/diff.py
index 1534d5e49..ab5fb00f7 100644
--- a/datalad/core/local/diff.py
+++ b/datalad/core/local/diff.py
@@ -34,6 +34,7 @@ from datalad.distribution.dataset import (
 )
 
 from datalad.support.constraints import (
+    EnsureChoice,
     EnsureNone,
     EnsureStr,
 )
@@ -94,6 +95,10 @@ class Diff(Interface):
             any identifier that Git understands. If none is specified,
             the state of the working tree will be compared.""",
             constraints=EnsureStr() | EnsureNone()),
+        order=Parameter(
+            args=("--order",),
+            doc="""TODO""",
+            constraints=EnsureChoice('depth-first', 'breadth-first', 'bottom-up')),
     )
 
     _examples_ = [
@@ -126,7 +131,9 @@ class Diff(Interface):
             annex=None,
             untracked='normal',
             recursive=False,
-            recursion_limit=None):
+            recursion_limit=None,
+            order='depth-first'
+            ):
         yield from diff_dataset(
             dataset=dataset,
             fr=ensure_unicode(fr),
@@ -136,7 +143,9 @@ class Diff(Interface):
             annex=annex,
             untracked=untracked,
             recursive=recursive,
-            recursion_limit=recursion_limit)
+            recursion_limit=recursion_limit,
+            reporting_order=order
+            )
 
     @staticmethod
     def custom_result_renderer(res, **kwargs):  # pragma: more cover
```
we see:
> for s in depth-first breadth-first bottom-up
> datalad -f '{path}' diff -r --order depth-first
> grep -v '\.git'
/home/yoh/.tmp/dl-pMjQD8U/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a
/home/yoh/.tmp/dl-pMjQD8U/a/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a/b
/home/yoh/.tmp/dl-pMjQD8U/a/b/foo
/home/yoh/.tmp/dl-pMjQD8U/a/b/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/c
/home/yoh/.tmp/dl-pMjQD8U/c/bar
/home/yoh/.tmp/dl-pMjQD8U/c/.datalad/config
> for s in depth-first breadth-first bottom-up
> datalad -f '{path}' diff -r --order breadth-first
> grep -v '\.git'
/home/yoh/.tmp/dl-pMjQD8U/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a
/home/yoh/.tmp/dl-pMjQD8U/c
/home/yoh/.tmp/dl-pMjQD8U/a/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a/b
/home/yoh/.tmp/dl-pMjQD8U/a/b/foo
/home/yoh/.tmp/dl-pMjQD8U/a/b/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/c/bar
/home/yoh/.tmp/dl-pMjQD8U/c/.datalad/config
> for s in depth-first breadth-first bottom-up
> datalad -f '{path}' diff -r --order bottom-up
> grep -v '\.git'
/home/yoh/.tmp/dl-pMjQD8U/a/b/foo
/home/yoh/.tmp/dl-pMjQD8U/a/b/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a/b
/home/yoh/.tmp/dl-pMjQD8U/c/bar
/home/yoh/.tmp/dl-pMjQD8U/c/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/.datalad/config
/home/yoh/.tmp/dl-pMjQD8U/a
/home/yoh/.tmp/dl-pMjQD8U/c
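For readers who want to reason about the three orders without a datalad installation, here is a toy model of the demo hierarchy. It is only a sketch: the `DATASETS` table and the `walk` function are hypothetical and do not reflect how `diff_dataset` is implemented; the per-dataset record order is simply copied from the output above, with paths made relative to the temporary directory.

```python
# Each dataset maps to its own records, in the order datalad reported them.
# The boolean marks a record that is itself a subdataset.
DATASETS = {
    "": [(".datalad/config", False), ("a", True), ("c", True)],
    "a": [("a/.datalad/config", False), ("a/b", True)],
    "a/b": [("a/b/foo", False), ("a/b/.datalad/config", False)],
    "c": [("c/bar", False), ("c/.datalad/config", False)],
}


def walk(ds="", order="depth-first"):
    """Yield record paths of `ds` and its subdatasets in the requested order."""
    records = DATASETS[ds]
    if order == "depth-first":
        # report a record, and descend right away if it is a subdataset
        for path, is_subds in records:
            yield path
            if is_subds:
                yield from walk(path, order)
    elif order == "breadth-first":
        # report all of this dataset's records, then descend
        for path, _ in records:
            yield path
        for path, is_subds in records:
            if is_subds:
                yield from walk(path, order)
    elif order == "bottom-up":
        # descend first, hold back this dataset's records until the end
        for path, is_subds in records:
            if is_subds:
                yield from walk(path, order)
        for path, _ in records:
            yield path
    else:
        raise ValueError(f"unknown order: {order!r}")


for order in ("depth-first", "breadth-first", "bottom-up"):
    print(order, list(walk(order=order)))
```

Under these assumptions the three listings come out in the same relative order as the command output above. The final `else: raise` mirrors the if/elif/else handling mentioned in the commit message, so a misspelled order fails loudly instead of silently falling into one of the branches.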
dc92bf8 to 13fbb07
Yep, and there it is a flag (just a note), and it is hyphenated in the doc:
so at some point (may be of some bigger RF) we might want to straighten it up and make it
it could be
for some kind of
if supported by an official "approval", I would merge it immediately ;)
I would also be ok to merge it into
eh, no approvals... I guess if no objections are voiced (@mih?), I will merge in a day or so
Had no chance to look yet. Should be able to do so today. Sorry
Sorry that it took so long, and thanks for the detailed discussions. The changes make sense to me and the rationale for making them seems sound. I did not have the chance to look at potential performance implications for datasets with deep hierarchies or many subdatasets. But I feel like those could be future performance optimizations, if needed at all.
Thank you @mih !
Remote repositories must already exist for push, and might have hooks enabled to process the "clean" state of the repository. For example, in the hooks we deploy with web-ui, we "aggregate" information about sizes within submodules. For that to work correctly, submodules should be pushed first, and not after their "superdatasets".
No reason/use case comes to mind why we might want to push superdatasets first (besides that they are more "readily" analyzable, so push might start pushing sooner), and the original commit 9796a5f does not seem to state a specific reason for choosing the breadth-first behavior.
Closes #5410
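The effect of push order on such an aggregating hook can be illustrated with a small simulation. Everything here is an assumption made for illustration: `simulate_push`, the sizes, and the idea that the hook sums whatever the remote already knows about direct subdatasets are hypothetical and do not describe the actual web-ui hooks.

```python
# Hypothetical own sizes and direct subdatasets for the demo hierarchy.
OWN_SIZE = {"": 1, "a": 2, "a/b": 4, "c": 8}
CHILDREN = {"": ["a", "c"], "a": ["a/b"], "a/b": [], "c": []}


def simulate_push(order):
    """Return the aggregated size the toy hook records for each dataset."""
    remote = {}
    for ds in order:
        # Toy hook, run at push time: own size plus the aggregates of
        # those subdatasets that have already been pushed.
        remote[ds] = OWN_SIZE[ds] + sum(remote.get(c, 0) for c in CHILDREN[ds])
    return remote


top_down = simulate_push(["", "a", "a/b", "c"])
bottom_up = simulate_push(["a/b", "a", "c", ""])
print("superdatasets first:", top_down[""])
print("subdatasets first:  ", bottom_up[""])
```

In this toy setup the top-level aggregate only covers the whole hierarchy when subdatasets are pushed before their superdatasets, which is the motivation given above for the bottom-up order.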
TODOs:
fix up the test (or code) which I marked to be skipped to do the sweep: it magically "fixed" itself, and @yarikoptic wonders if that may be related to the git-annex version boost