Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Avoid needless calls with diff(fr=None, annex=!None) #4544

Merged
merged 4 commits into from May 20, 2020

Conversation

mih
Copy link
Member

@mih mih commented May 16, 2020

diff treats fr=None as a diff-from-scratch. However, when recursing
into subdatasets it breaks this paradigm by translating this into
fr=PRE_INIT_COMMIT_SHA. This can lead to a needless call to
AnnexRepo.get_content_annexinfo(). This changes rectifies that by
using None also for subdatasets.

Moreover, push uses diff(fr=since), which can also be None. This
change also avoid an equivalent needless call for a top-level dataset.

@mih mih added the performance Improve performance of an existing feature label May 16, 2020
@mih mih added this to the 0.13.0 milestone May 16, 2020
@codecov
Copy link

codecov bot commented May 16, 2020

Codecov Report

Merging #4544 into master will increase coverage by 0.44%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4544      +/-   ##
==========================================
+ Coverage   89.21%   89.66%   +0.44%     
==========================================
  Files         285      285              
  Lines       38597    38611      +14     
==========================================
+ Hits        34435    34620     +185     
+ Misses       4162     3991     -171     
Impacted Files Coverage Δ
datalad/core/distributed/push.py 86.02% <ø> (+0.63%) ⬆️
datalad/core/local/diff.py 95.29% <100.00%> (-0.06%) ⬇️
datalad/core/local/tests/test_diff.py 99.52% <100.00%> (+0.02%) ⬆️
datalad/log.py 89.71% <0.00%> (ø)
datalad/interface/base.py 91.15% <0.00%> (ø)
datalad/interface/tests/test_unlock.py 97.82% <0.00%> (ø)
datalad/distribution/tests/test_create_sibling.py 98.97% <0.00%> (+<0.01%) ⬆️
datalad/support/tests/test_annexrepo.py 95.30% <0.00%> (+<0.01%) ⬆️
datalad/support/annexrepo.py 86.57% <0.00%> (+0.02%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f0876e...31d6d83. Read the comment docs.

`diff` treats `fr=None` as a diff-from-scratch. However, when recursion
into subdatasets it breaks this paradigm by translating this into
`fr=PRE_INIT_COMMIT_SHA`. This can lead to a needless call to
`AnnexRepo.get_content_annexinfo()`. This changes rectifies that by
using `None` also for subdatasets.

Moreover, `push` uses `diff(fr=since)`, which can also be `None`. This
change also avoid an equivalent needless call for a top-level dataset.
Copy link
Member

@bpoldrack bpoldrack left a comment

Seems correct to me, but would prefer more explicit comment to help our future selves.

@@ -329,7 +328,9 @@ def _diff_ds(ds, fr, to, constant_refs, recursion_level, origpaths, untracked,
init=diff_state,
eval_availability=annexinfo in ('availability', 'all'),
ref=to)
if fr != to:
# if `fr` is None, we compare against a preinit state, and
# nothing needs to be done
Copy link
Member

@bpoldrack bpoldrack May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth being a bit more explicit why "nothing needs to be done" (we just got all we need to know, since current state of affairs is the diff to preinit). Took me a while to realize - diff is just not the easiest thing to have a quick look at.
That is if I'm not mistaken, of course. If I am wrong then it's even more important ;-)

Copy link
Member Author

@mih mih May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand. "preinit" is before the repository existed. That state is "nothing" and it is identical across all datasets. So in order to figure everything out about this state, nothing needs to be done. I don't know how to change that comment to address your question. Feel free to push a better comment.

kyleam
kyleam approved these changes May 18, 2020
Copy link
Collaborator

@kyleam kyleam left a comment

Thanks. I don't spot any issues with this. @bpoldrack, I'll leave the merge to you in case you plan on tweaking the comment.

@yarikoptic
Copy link
Member

yarikoptic commented May 18, 2020

Sorry for being anal and stating the obvious: without a test checking for the desired behavior, whatever is enhanced can get unenhanced in the next round of RFing or BFing and we would not even mention.

@mih
Copy link
Member Author

mih commented May 18, 2020

Sorry for being anal and stating the obvious: without a test checking for the desired behavior, whatever is enhanced can get unenhanced in the next round of RFing or BFing and we would not even mention.

That is a universal truth. I nevertheless see no point in testing (with unittests) that a particular change still improves the performance. I don't even know how to do that. But I am sure that I have a bunch of issue to work on before I want to learn about it ;-)

@yarikoptic
Copy link
Member

yarikoptic commented May 18, 2020

I don't even know how to do that

If you point me to a test which exercises this specific case, I can try: mock diff and see that it isn't called with parameters we expect it to not be called with

@mih
Copy link
Member Author

mih commented May 18, 2020

I cannot follow. This code change has the only purpose to achieve the identical result, faster. So if you want to test the effectiveness of the change, you need to test performance difference.

@yarikoptic
Copy link
Member

yarikoptic commented May 19, 2020

I cannot follow. This code change has the only purpose to achieve the identical result, faster. So if you want to test the effectiveness of the change, you need to test performance difference.

Title of the PR says "ENH: Avoid needless calls with diff(fr=None, annex=!None)", so by "avoid" I assumed the change would cause some diff invocations should no longer happen whenever they were happening before. And that is what I thought to test.

@kyleam
Copy link
Collaborator

kyleam commented May 19, 2020

Title of the PR says "ENH: Avoid needless calls with diff(fr=None, annex=!None)", so by "avoid" I assumed the change would cause some diff invocations should no longer happen whenever they were happening before.

A needless call to get_content_annexinfo is avoided. So, you could mock that and assert that it's called N times rather than N+1. I personally don't like the idea of testing these internal details as a proxy for performance. However, you seem to have strong preference here, so I'd say let's add the test and then we can drop it later if it becomes a headache.

kyleam added a commit to mih/datalad that referenced this pull request May 19, 2020
fa00823 (ENH: Avoid needless calls with diff(fr=None, annex=!None))
reduced two get_content_annexinfo() calls to one.  Add a test checking
the number of calls in hopes of catching a change that reverts
fa00823.

Ref: datalad#4544 (comment)
@kyleam
Copy link
Collaborator

kyleam commented May 19, 2020

However, you seem to have strong preference here, so I'd say let's add the test and then we can drop it later if it becomes a headache.

I've pushed a test that I hope covers what you had in mind.

@mih
Copy link
Member Author

mih commented May 19, 2020

If anyhow possible, please make it a standalone test. We already know that push isn't yet doing what it should. There is near certainty that internals will change. We also know that performance isn't good. Whatever this test will achieve, it is not likely that it will last till 0.14. It would be good to not have to disentangle it from an unrelated test.

@kyleam
Copy link
Collaborator

kyleam commented May 19, 2020

@mih A few minutes ago I pushed a test in 6fbf52b. It's small and separate, so I don't think it should be too problematic.

@mih
Copy link
Member Author

mih commented May 19, 2020

I saw it just now. Thx @kyleam that works for me.

kyleam added a commit to mih/datalad that referenced this pull request May 19, 2020
fa00823 (ENH: Avoid needless calls with diff(fr=None, annex=!None))
reduced two get_content_annexinfo() calls to one.  Add a test checking
the number of calls in hopes of catching a change that reverts
fa00823.

Ref: datalad#4544 (comment)
fa00823 (ENH: Avoid needless calls with diff(fr=None, annex=!None))
reduced two get_content_annexinfo() calls to one.  Add a test checking
the number of calls in hopes of catching a change that reverts
fa00823.

Ref: datalad#4544 (comment)
Copy link
Member

@yarikoptic yarikoptic left a comment

Thank you @kyleam!

@mih mih merged commit 3bc1038 into datalad:master May 20, 2020
4 checks passed
@mih mih deleted the enh-diff branch May 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok OP considers this work done, and requests review/merge performance Improve performance of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants