Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: push: Correct reported estimate of transfer size #4821

Merged
merged 1 commit into from Aug 20, 2020

Conversation

kyleam
Copy link
Collaborator

@kyleam kyleam commented Aug 12, 2020

When calculating the size of annex data to be transferred (which is
used as the total in the progress bar), avoid counting the bytes of a
key more than once when multiple files point to the same key.

Fixes #4800.

@codecov
Copy link

codecov bot commented Aug 12, 2020

Codecov Report

Merging #4821 into maint will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##            maint    #4821   +/-   ##
=======================================
  Coverage   89.65%   89.65%           
=======================================
  Files         288      288           
  Lines       40351    40371   +20     
=======================================
+ Hits        36176    36196   +20     
  Misses       4175     4175           
Impacted Files Coverage Δ
datalad/core/distributed/push.py 89.58% <100.00%> (+0.50%) ⬆️
datalad/core/distributed/tests/test_push.py 97.92% <100.00%> (+0.06%) ⬆️
datalad/support/tests/test_cookies.py 86.66% <0.00%> (-13.34%) ⬇️
datalad/support/tests/test_annexrepo.py 96.58% <0.00%> (-0.01%) ⬇️
datalad/distribution/create_sibling.py 85.59% <0.00%> (+0.56%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58cd9bf...96ea343. Read the comment docs.

@kyleam kyleam marked this pull request as draft Aug 14, 2020
To copy annex data, push() writes the paths of interest to a temporary
file and then feeds that to 'git annex copy --batch ...'.  Some of
these paths may point to the same keys, which results in a few issues.

  * The reported size of the transfer is inaccurate because bytes are
    counted per-path rather than per-key.

  * Unless --force is 'checkdatapresent' or 'all', the result records
    do not (may not?) include the extra same-key files in the output
    because 'git annex copy' doesn't emit redundant files when --fast
    is specified (at least in some cases).

  * When multiple jobs are requested, 'git annex copy' fires up
    workers for these redundant files whose content will be
    transferred by another worker.  (I think redundant _transfer_ is
    guarded against.)

Avoid these issues by tracking seen keys and redundant files when
preparing the input for 'git annex copy'.  This allows us to pass only
non-redundant files, to calculate the total transfer size correctly,
and to consistently yield results for redundant files by tacking on
notneeded results after the main copy results.

Fixes datalad#4800.
@kyleam kyleam marked this pull request as ready for review Aug 15, 2020
Copy link
Member

@yarikoptic yarikoptic left a comment

I reviewed the last commit and didn't realize that those comments posted as is without staging into review. In general I approve but IMHO could be tuned up according to the comments

@kyleam kyleam requested a review from mih Aug 17, 2020
mih
mih approved these changes Aug 18, 2020
Copy link
Member

@mih mih left a comment

LGTM. Thx much!

@kyleam
Copy link
Collaborator Author

kyleam commented Aug 18, 2020

@yarikoptic:

Could have used defaultdict to start with which IMHO is cleaner

I'd like to stick with the OrderedDict here so that the records are reported in a reliable order across the python versions we support. (dicts preserve insertion order starting with 3.6. At first it was marked as an implementation detail, but now it's guaranteed.) I don't consider setdefault as particularly dirty.

I would've avoided hand crafting a message and just yielded them all adjusting the path in the record yourself in above for j in res['stdout_json']: loop. All files are equal so I don't even see why some need special message here. A debug log message might though come handy to discover that transfer had "duplicate" keys

I prefer the current approach. I think it accurately reflects which files we asked annex to transfer and which we filtered out. It also avoids "had dups?" lookups for each transferred item.

Given you've said you're okay with it as is, I'll plan to leave it be.

@kyleam kyleam merged commit 0ad7bc7 into datalad:maint Aug 20, 2020
4 checks passed
@kyleam kyleam deleted the push-dedup-bytes branch Aug 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants