Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF/RF: push - get diff depth-first not breadth-first #5416

Merged
merged 4 commits into from Apr 2, 2021

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Feb 9, 2021

Remote repositories must exist for push already, and might have hooks enabled to process the "clean" state of the repository. In hooks we deploy with web-ui e.g. we "aggregate" information about sizes within submodules. For that to work correctly, submodules should be pushed first, and not after their "superdatasets".

No reason/use case comes to mind why we might want to push super datasets first (besides that they are more "readily" analyzable, so push might start pushing sooner), and original commit 9796a5f seems to not state specific choice for the breadth-first behavior.

Closes #5410

TODOs:

  • do full tests sweep (here in this PR) to spot more side-effects
  • fix up the test (or code) which I marked to be skipped to do the sweep magically "Fixed" itself and @yarikoptic wonders if may be related to git-annex version boost
  • if decided to proceed, add a dedicated test to establish this order
  • figure out proper naming... see below

@codecov
Copy link

codecov bot commented Feb 9, 2021

Codecov Report

Merging #5416 (0c354c8) into maint (2b3468a) will decrease coverage by 5.67%.
The diff coverage is 79.31%.

❗ Current head 0c354c8 differs from pull request most recent head 13fbb07. Consider uploading reports for the commit 13fbb07 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #5416      +/-   ##
==========================================
- Coverage   90.20%   84.52%   -5.68%     
==========================================
  Files         296      293       -3     
  Lines       42029    42035       +6     
==========================================
- Hits        37912    35530    -2382     
- Misses       4117     6505    +2388     
Impacted Files Coverage Δ
datalad/core/distributed/push.py 88.52% <ø> (ø)
datalad/core/local/diff.py 90.62% <76.92%> (-4.73%) ⬇️
datalad/core/distributed/tests/test_push.py 98.45% <81.25%> (-1.32%) ⬇️
datalad/support/tests/utils.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/metadata/extractors/tests/test_audio.py 19.35% <0.00%> (-80.65%) ⬇️
datalad/metadata/extractors/xmp.py 12.96% <0.00%> (-79.63%) ⬇️
datalad/metadata/extractors/exif.py 18.75% <0.00%> (-78.13%) ⬇️
datalad/metadata/extractors/image.py 19.35% <0.00%> (-77.42%) ⬇️
datalad/metadata/extractors/audio.py 20.00% <0.00%> (-77.15%) ⬇️
datalad/metadata/extractors/tests/test_exif.py 24.00% <0.00%> (-76.00%) ⬇️
... and 95 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b3468a...13fbb07. Read the comment docs.

@yarikoptic
Copy link
Member Author

sweep results:

  • mac fail is unrelated - connectivity issue "error: Failed to connect to datasets-tests.datalad.org port 80: "
  • travis is still that dreaded NFS stall (I keep forgetting that we have not pinned it down and arrogantly was praising how green we are today, shame! ;) )

and otherwise I do not see any side-effects, so unless I am stopped, I will try to tackle that failing test some time soon.

@mih
Copy link
Member

mih commented Feb 10, 2021

My immediate response would be that I cannot think of a strong reason to prefer one over the other. But I feel like it deserves some more thinking. The reason for giving breadth-first is likely that this code evaluating the diff records assumes this order (as hinted by we rely on all records of a dataset coming out concerning subdataset content (with even more typos fixed to communicate the intent better). The key would be this piece

        if parentds != cur_ds:
            if ds_res:
                # we switch to another dataset, yield this one so outside
                # code can start processing immediately
                yield (cur_ds, ds_res)

So whenever the dataset changes (which I assume can now happen before the last report of the previous dataset was received) it will report a dataset to be pushed. As a consequence, I fear that we will now process one and the same dataset multiple times (with partial records). It is likely that the tests do not look for such a condition.

@yarikoptic
Copy link
Member Author

I think I am getting a clue what

def _datasets_since_(dataset, since, paths, recursive, recursion_limit):
    """Generator"""

should yield in its pairs, but I am a bit puzzled on why diff_dataset gives me records even for the items which did not change:

(git-annex)lena:~/datalad/openneuro[master]git-annex
$> python -c 'from rich import print; from datalad.core.local.diff import diff_dataset; print(list(x for x in diff_dataset("/home/yoh/datalad", "HEAD~3", "HEAD", False, path=".", annex=None, recursive=True, recursion_limit=100, eval_file_type=False, reporting_order="breadth-first") if x["type"] == "dataset"))' | head -n 30
[
    {
        'type': 'dataset',
        'gitshasum': '1998cf752ed31e39b0769f8edc506e9423f66ff6',
        'prev_gitshasum': '365d888426090f3d917198d953e76271d64b7ee4',
        'state': 'modified',
        'path': '/home/yoh/datalad/openneuro',
        'parentds': '/home/yoh/datalad',
        'status': 'ok',
        'refds': '/home/yoh/datalad',
        'logger': <Logger datalad.core.local.diff (INFO)>,
        'action': 'diff'
    },
    {
        'type': 'dataset',
        'gitshasum': 'f8e27ac909e50b5b5e311f6be271f0b1757ebb7b',
        'prev_gitshasum': 'f8e27ac909e50b5b5e311f6be271f0b1757ebb7b',
        'state': 'clean',
        'path': '/home/yoh/datalad/openneuro/ds000001',
        'parentds': '/home/yoh/datalad/openneuro',
        'status': 'ok',
        'refds': '/home/yoh/datalad',
        'logger': <Logger datalad.core.local.diff (INFO)>,
        'action': 'diff'
    },
...

so -- the first record is good since gitshasum is different from the prev_gitshasum. But then it is the same for '/home/yoh/datalad/openneuro/ds000001' ... shouldn't it be just diff's @mih?

@kyleam
Copy link
Contributor

kyleam commented Mar 2, 2021

so -- the first record is good since gitshasum is different from the prev_gitshasum. But then it is the same for '/home/yoh/datalad/openneuro/ds000001' ... shouldn't it be just diff's @mih?

Clean items are expected in the sense that _diff_ds uses GitRepo.diffstatus:

def diffstatus(self, fr, to, paths=None, untracked='all',
eval_submodule_state='full', eval_file_type=True,
_cache=None):
"""Like diff(), but reports the status of 'clean' content too.

Then Diff.custom_result_renderer filters them out from the standard reporting.

I'm not sure on the deeper question of why .diffstatus is used rather than .diff, though my guess is so that Python callers of the dataset-level command can have access to the full comparison.

@mih
Copy link
Member

mih commented Mar 4, 2021

I cannot comment on the implications of a switch from .diffstatus() to .diff(), but the principle motivation re reporting "clean" vs reporting nothing was/is that we had routine issues with deciding whether something evaluated "OK" or was not evaluated at all in the context of git-annex returns (still a major problem for save #3844). I generally switched to the report-clean paradigm in the -revolution times in order to improve on the guessing. Moreover, given that we need to come to a "clean" result anyways, it is relatively cheap to report it, and enable a wider set of use-cases for consumers of more than just a single piece of information. For me this leads to less duplication of very similar code pieces, and to an overall efficiency gain vs the previous model of having multiple more specialized methods to determine one single piece of information from a more complex pile (but haven to query twice, when two pieces from the pile are needed).

To be truly depth-first there should be no records yielded for super-dataset
before yielding all the records from the subdatasets.  So, similarly to how
we "cache" reports from subdasets for breadth-first, "cache" reports within
the dataset until done with subdatasets.  While at it, and to catch possible
typos etc, made if/else for "order" handling into if/elif/else: raise.

Such fix is needed for any outside logic relying on depth-first to really be
depth first (like ongoing RF for "push")
Remote repositories must exist for push already, and might have
hooks enabled to process the "clean" state of the repository.
In hooks we deploy with web-ui e.g. we "aggregate" information
about sizes within submodules.  For that to work correctly,
submodules should be pushed first, and not after their
"superdatasets".

Closes datalad#5410
@yarikoptic
Copy link
Member Author

Appveyor unrelated. @mih please have a look again - I think I fixed underlying issue

@yarikoptic yarikoptic added the merge-if-ok OP considers this work done, and requests review/merge label Mar 30, 2021
Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RF: push - get diff depth-first not breadth-first

The rationale in the second commit for making push process deeper subdatasets first makes sense to me. But...

BF: diff._diff_ds - delay reporting within ds entries for depth-first

To be truly depth-first there should be no records yielded for super-dataset
before yielding all the records from the subdatasets. [...]

... in my opinion sticking with the "depth-first" terminology is confusing.

demo repo
set -eu

cd "$(mktemp -d "${TMPDIR:-/tmp}"/dl-XXXXXXX)"

datalad create
datalad create -d . a
datalad create -d a a/b
datalad create -d . c
datalad save -r
touch a/b/foo
touch c/bar

On maint (2b3468a), here's the depth-first order of datalad diff.

$ datalad -f json diff -r | jq -r '.path' | grep -v '.git'
/tmp/dl-j2jRQzN/.datalad/config
/tmp/dl-j2jRQzN/a
/tmp/dl-j2jRQzN/a/.datalad/config
/tmp/dl-j2jRQzN/a/b
/tmp/dl-j2jRQzN/a/b/foo
/tmp/dl-j2jRQzN/a/b/.datalad/config
/tmp/dl-j2jRQzN/c
/tmp/dl-j2jRQzN/c/bar
/tmp/dl-j2jRQzN/c/.datalad/config

I disagree with the commit message; that is a depth-first walk of the tree. Child nodes are traversed before moving to a sister node.

This PR changes that to

$ datalad -f json diff -r | jq -r '.path' | grep -v '.git'
/tmp/dl-j2jRQzN/a/b/foo
/tmp/dl-j2jRQzN/a/b/.datalad/config
/tmp/dl-j2jRQzN/a/.datalad/config
/tmp/dl-j2jRQzN/a/b
/tmp/dl-j2jRQzN/c/bar
/tmp/dl-j2jRQzN/c/.datalad/config
/tmp/dl-j2jRQzN/.datalad/config
/tmp/dl-j2jRQzN/a
/tmp/dl-j2jRQzN/c

So, perhaps we should use the "bottom up" terminology (used in datalad subdatasets) for the behavior introduced by this PR?

@yarikoptic
Copy link
Member Author

GREAT analysis @kyleam, thank you! Indeed, I make it into "bottom up", and it was a legit "depth-first". I will introduce "bottom-up" to represent (and use for push) behavior I have introduced.

@yarikoptic
Copy link
Member Author

yarikoptic commented Mar 31, 2021

refactored, see the last commit (dc92bf8 13fbb07 ATM) with examples from running @kyleam 's script in the commit message

With this script (adjusted version of @kyleam's):

    #!/bin/bash

    export PS4='> '

    set -eu
    set -x

    cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"

    datalad create
    datalad create -d . a
    datalad create -d a a/b
    datalad create -d . c
    datalad save -r
    touch a/b/foo
    touch c/bar

    pwd
    for s in depth-first breadth-first bottom-up; do
        datalad -f '{path}' diff -r --order $s | grep -v '\.git'
    done

and following diff in the code base to expose the --order option
(not committed)

```diff
diff --git a/datalad/core/local/diff.py b/datalad/core/local/diff.py
index 1534d5e49..ab5fb00f7 100644
--- a/datalad/core/local/diff.py
+++ b/datalad/core/local/diff.py
@@ -34,6 +34,7 @@ from datalad.distribution.dataset import (
 )

 from datalad.support.constraints import (
+    EnsureChoice,
     EnsureNone,
     EnsureStr,
 )
@@ -94,6 +95,10 @@ class Diff(Interface):
             any identifier that Git understands. If none is specified,
             the state of the working tree will be compared.""",
             constraints=EnsureStr() | EnsureNone()),
+        order=Parameter(
+            args=("--order",),
+            doc="""TODO""",
+            constraints=EnsureChoice('depth-first', 'breadth-first', 'bottom-up')),
     )

     _examples_ = [
@@ -126,7 +131,9 @@ class Diff(Interface):
             annex=None,
             untracked='normal',
             recursive=False,
-            recursion_limit=None):
+            recursion_limit=None,
+            order='depth-first'
+    ):
         yield from diff_dataset(
             dataset=dataset,
             fr=ensure_unicode(fr),
@@ -136,7 +143,9 @@ class Diff(Interface):
             annex=annex,
             untracked=untracked,
             recursive=recursive,
-            recursion_limit=recursion_limit)
+            recursion_limit=recursion_limit,
+            reporting_order=order
+        )

     @staticmethod
     def custom_result_renderer(res, **kwargs):  # pragma: more cover
```

we see:

    > for s in depth-first breadth-first bottom-up
    > datalad -f '{path}' diff -r --order depth-first
    > grep -v '\.git'
    /home/yoh/.tmp/dl-pMjQD8U/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a
    /home/yoh/.tmp/dl-pMjQD8U/a/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a/b
    /home/yoh/.tmp/dl-pMjQD8U/a/b/foo
    /home/yoh/.tmp/dl-pMjQD8U/a/b/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/c
    /home/yoh/.tmp/dl-pMjQD8U/c/bar
    /home/yoh/.tmp/dl-pMjQD8U/c/.datalad/config
    > for s in depth-first breadth-first bottom-up
    > datalad -f '{path}' diff -r --order breadth-first
    > grep -v '\.git'
    /home/yoh/.tmp/dl-pMjQD8U/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a
    /home/yoh/.tmp/dl-pMjQD8U/c
    /home/yoh/.tmp/dl-pMjQD8U/a/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a/b
    /home/yoh/.tmp/dl-pMjQD8U/a/b/foo
    /home/yoh/.tmp/dl-pMjQD8U/a/b/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/c/bar
    /home/yoh/.tmp/dl-pMjQD8U/c/.datalad/config
    > for s in depth-first breadth-first bottom-up
    > datalad -f '{path}' diff -r --order bottom-up
    > grep -v '\.git'
    /home/yoh/.tmp/dl-pMjQD8U/a/b/foo
    /home/yoh/.tmp/dl-pMjQD8U/a/b/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a/b
    /home/yoh/.tmp/dl-pMjQD8U/c/bar
    /home/yoh/.tmp/dl-pMjQD8U/c/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/.datalad/config
    /home/yoh/.tmp/dl-pMjQD8U/a
    /home/yoh/.tmp/dl-pMjQD8U/c
@kyleam
Copy link
Contributor

kyleam commented Mar 31, 2021

refactored, see the last commit (dc92bf8 13fbb07 ATM)

Thanks, looks very good to me.

It doesn't matter so much given that there's no exposed option, but just a note that subdatasets spells its value without a hyphen.

@yarikoptic
Copy link
Member Author

It doesn't matter so much given that there's no exposed option, but just a note that subdatasets spells its value without a hyphen.

yeap, and there it is a flag (just a note) and it does have it hypened in a doc:

  --bottomup            whether to report subdatasets in bottom-up order along each branch in
                        the dataset tree, and not top-down.

so at some point (may be of some bigger RF) we might want to straighten it up and make it --bottom-up or even reflect on @mih's stated dislike of flags and convert to --order top-down|bottom-up for higher consistency and opening the road for other orders e.g. instead of current one:

$> datalad -f '{path}' subdatasets -r --bottomup
/home/yoh/.tmp/dl-WyTjJpB/a/b
/home/yoh/.tmp/dl-WyTjJpB/a
/home/yoh/.tmp/dl-WyTjJpB/c/a
/home/yoh/.tmp/dl-WyTjJpB/c

it could be

/home/yoh/.tmp/dl-WyTjJpB/a/b
/home/yoh/.tmp/dl-WyTjJpB/c/a
/home/yoh/.tmp/dl-WyTjJpB/a
/home/yoh/.tmp/dl-WyTjJpB/c

for some kind of depth-up (with similar one for depth-down?) if any such need arises.

Thanks, looks very good to me.

if supported by an official "approval", I would merge it immediately ;)

@yarikoptic
Copy link
Member Author

I would also be ok to merge it into master instead of maint since it came out a bit more intrusive than originally planned, so feedback on this would be welcome too

@yarikoptic
Copy link
Member Author

eh, no approvals... I guess if no objections would be voiced (@mih?), I will merge in a day or so

@mih
Copy link
Member

mih commented Apr 2, 2021

Had no chance to look yet. Should be able to do so today. Sorry

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that it took so long, and thanks for the detailed discussions. The changes make sense to me and rational for doing them seems sound. I did not have the chance to look at potential implications re performance on datasets with deep hierarchies or many subdatasets. But I feel like those could be future performance optimizations, if needed at all.

@yarikoptic
Copy link
Member Author

Thank you @mih !

@yarikoptic yarikoptic merged commit b01be3f into datalad:maint Apr 2, 2021
@yarikoptic yarikoptic deleted the rf-push-depthfirst branch April 29, 2021 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd-push merge-if-ok OP considers this work done, and requests review/merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants