Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: Bypass submodule add when adding subdatasets #4793

Merged
merged 3 commits into from Sep 1, 2020

Conversation

mih
Copy link
Member

@mih mih commented Aug 2, 2020

Instead go manual, avoiding various checks, because we already
know inside save that a particular directory is considered untracked
by Git.

This approach is substantially faster. Already with 50 subdatasets
the speed-up is 2x.

Quick'n'dirty benchmark:

% rm -rf * .git* .datalad; datalad create . && \
  ( for i in $(seq 50); do datalad create $i; done ) && \
  time datalad save && datalad status

Current master

datalad save  3.70s user 1.50s system 103% cpu 4.998 total

This PR

datalad save  1.68s user 1.02s system 108% cpu 2.499 total

It may well be that it needs more safety checks that would eat up that
margin. But in the context of
#2880 (comment)
(save thousands of new subdatasets) even much smaller benefits would
still translate to substantial absolute speed-ups.

Thx again to @kyleam for digging out that update-index trick.

TODO:

  • consolidate config and .gitmodules manipulation into a single action for all added submodules at once
  • RF helpers into documented standalone functions with tests

@mih mih added the performance Improve performance of an existing feature label Aug 2, 2020
@mih mih force-pushed the opt-subdsadd branch 3 times, most recently from dd37109 to 925c5dc Compare August 3, 2020 07:27
@mih
Copy link
Member Author

mih commented Aug 3, 2020

This works now. The only test failure is:

======================================================================
1649ERROR: datalad.distribution.tests.test_install.test_install_recursive
1650----------------------------------------------------------------------
1651Traceback (most recent call last):
1652  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/nose/case.py", line 134, in run
1653    self.runTest(result)
1654  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/nose/case.py", line 152, in runTest
1655    test(result)
1656AssertionError: Test was too slow (took 31.0307s, threshold was 30.0000s)

it is unclear to me how it can get slower (I have not yet observed a scenario where things got slower), but there are some optimization that can still be done.

@mih
Copy link
Member Author

mih commented Aug 3, 2020

Update: Consolidation on applying all .gitmodules and .git/config changes at once speeds things up by another 40%, leading to a 3x overall speed-up for 50 subdatasets. That factor also holds for 500 subdatasets (15.2s vs. 54.7s).

@mih mih force-pushed the opt-subdsadd branch 2 times, most recently from 3ecb4b2 to 3cf42d8 Compare August 3, 2020 13:50
Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks like a good direction to go.

# prevent inner unquotes quotes
v = v.replace('"', '\\"')
v = '"{}"'.format(v)
return v
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find man git-config confusing on this point, but I think " needs to be escaped in a value even if the value isn't in quotes.

demo
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
git version
git init

git config a.b 'with"quote'
tail -n2 .git/config
git config a.b

sed -i 's/\\//' .git/config
tail -n2 .git/config
git config a.b
git version 2.28.0.203.gce1f2e0ef1
Initialized empty Git repository in /tmp/dl-zNwW32g/.git/
[a]
	b = with\"quote
with"quote
[a]
	b = with"quote
fatal: bad config line 7 in file .git/config

For this sort of helper, I think it'd be very useful to pull it out and have dedicated unit tests with a range of values, confirming that they can be read back with git config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx! Will change accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done now.

write_submodule_config(
gmf, i['rpath'], gmprops)
write_submodule_config(
gcf, i['rpath'], dict(active='true', url=i['url']))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this switch to writing the .git/config entries on our own, there's a change in how the url is represented: before it was an absolute path and now it is relative. I think that should be fine and for our purposes probably preferable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I had the same feeling of "should be better". Related: #4025 (comment)

Copy link
Member Author

@mih mih Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case the issue from #3538 remains here, because something earlier makes the path absolute:

% dl create raw
% dl create bids
% cd bids
bids % datalad install -d . -s ../raw sourcedata --reckless
bids % cat .gitmodules
[submodule "sourcedata"]
        path = sourcedata
        url = /tmp/d1/bids/../raw
        datalad-id = ca96f420-2f90-4f9c-8c46-0e6213dcfd99

datalad/support/gitrepo.py Outdated Show resolved Hide resolved
Instead go manual, avoiding various checks, because we already
know inside `save` that a particular directory is considered untracked
by Git.

This approach is substantially faster. Already with 50 subdatasets
the speed-up is 2x.

Quick'n'dirty benchmark:

```
% rm -rf * .git* .datalad; datalad create . && \
  ( for i in $(seq 50); do datalad create $i; done ) && \
  time datalad save && datalad status

Current master

```
datalad save  3.70s user 1.50s system 103% cpu 4.998 total
```

This PR

```
datalad save  1.68s user 1.02s system 108% cpu 2.499 total
```

It may well be that it needs more safety checks that would eat up that
margin. But in the context of
datalad#2880 (comment)
(save thousands of new subdatasets) even much smaller benefits would
still translate to substantial absolute speed-ups.
When saving many new subdatasets this is substantially faster than
incremental changes via normal manipulators. In particular the straight
unchecked addition of text to the config files, rather than a
variable-wise manuipulation makes a big difference in the case of a
large number of registered subdatasets.
@mih
Copy link
Member Author

mih commented Sep 1, 2020

Pushed update, rebased on current master

@codecov
Copy link

codecov bot commented Sep 1, 2020

Codecov Report

Merging #4793 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4793      +/-   ##
==========================================
- Coverage   89.69%   89.69%   -0.01%     
==========================================
  Files         289      289              
  Lines       40480    40527      +47     
==========================================
+ Hits        36310    36352      +42     
- Misses       4170     4175       +5     
Impacted Files Coverage Δ
datalad/config.py 96.72% <100.00%> (+0.13%) ⬆️
datalad/support/gitrepo.py 89.40% <100.00%> (+0.17%) ⬆️
datalad/tests/test_config.py 100.00% <100.00%> (ø)
datalad/downloaders/tests/test_http.py 89.92% <0.00%> (-1.23%) ⬇️
datalad/downloaders/http.py 84.55% <0.00%> (-0.39%) ⬇️
datalad/downloaders/base.py 78.57% <0.00%> (-0.36%) ⬇️
datalad/interface/tests/test_unlock.py 97.93% <0.00%> (+2.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b416b4...c1a0cbf. Read the comment docs.

@mih
Copy link
Member Author

mih commented Sep 1, 2020

Related: #4853

@mih mih added the merge-if-ok OP considers this work done, and requests review/merge label Sep 1, 2020
Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

@kyleam kyleam merged commit 023bfc2 into datalad:master Sep 1, 2020
@mih
Copy link
Member Author

mih commented Sep 1, 2020

Thx all!

@mih mih deleted the opt-subdsadd branch September 1, 2020 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok OP considers this work done, and requests review/merge performance Improve performance of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants