Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: Half the number of for-each-ref calls in GitRepo.is_with_annex() #3768

Merged
merged 2 commits into from
Oct 11, 2019

Conversation

mih
Copy link
Member

@mih mih commented Oct 11, 2019

This addresses some aspect of gh-3766

It also removes an unused feature to limit the check to remote branches.
Even the crawler doesn't use it.

It also removes an explicit check for a direct mode related branch name,
that doesn't seem to be relevant, even for direct mode repos, and
@bpoldrack oldrack also had no memory on when exactly this could happen. Given
that we do not support it anymore, I decided to remove the additional
complexity.

Here is the updated profile (compare to #3766 (comment))

image

       before           after         ratio
     [6ad8d02f]       [91bc32fa]
     <bm/merge-target>       <bm/pr>   
       1.75±0.07s       1.46±0.05s    ~0.83  api.supers.time_createadd
-      2.78±0.06s       1.89±0.04s     0.68  api.supers.time_createadd_to_dataset
       10.2±0.06s       7.55±0.02s    ~0.74  api.supers.time_installr
-        286±10ms          208±7ms     0.73  api.supers.time_ls
-      3.08±0.07s       2.17±0.01s     0.70  api.supers.time_ls_recursive
-      3.23±0.03s       2.37±0.02s     0.73  api.supers.time_ls_recursive_long_all
-        965±10ms          873±7ms     0.90  api.supers.time_status
-      2.01±0.05s       1.65±0.03s     0.82  api.supers.time_status_recursive
-         250±7ms          156±3ms     0.62  api.supers.time_subdatasets
-        795±20ms         537±10ms     0.68  api.supers.time_subdatasets_recursive
-         220±2ms          150±3ms     0.68  api.supers.time_subdatasets_recursive_first
       7.75±0.08s        4.92±0.1s    ~0.64  api.supers.time_uninstall
-        890±20ms         756±10ms     0.85  api.testds.time_create_test_dataset1
-      5.03±0.03s       4.31±0.05s     0.86  api.testds.time_create_test_dataset2x2
      2.22±0.06ms       2.28±0.1ms     1.03  core.runner.time_echo
      2.43±0.04ms      2.36±0.04ms     0.97  core.runner.time_echo_gitrunner
-            1.03             0.86     0.83  core.runner.track_overhead_100ms
+           63.58            85.15     1.34  core.runner.track_overhead_echo
-            6.55            -1.46    -0.22  core.runner.track_overhead_heavyout
-             3.9             3.22     0.83  core.runner.track_overhead_heavyout_online_process
-           28.16            25.04     0.89  core.runner.track_overhead_heavyout_online_through
         620±20ms         607±10ms     0.98  core.startup.time_help_np
          148±2ms          143±5ms     0.97  core.startup.time_import
          559±6ms         534±20ms     0.96  core.startup.time_import_api
       17.3±0.6ms         17.4±1ms     1.00  repo.gitrepo.time_get_content_info
       7.85±0.4ms       8.12±0.5ms     1.03  support.path.get_parent_paths.time_allsubmods_toplevel
       6.66±0.2ms       7.51±0.5ms    ~1.13  support.path.get_parent_paths.time_allsubmods_toplevel_only
         349±10ns         352±20ns     1.01  support.path.get_parent_paths.time_no_submods
       5.95±0.2ms       5.90±0.3ms     0.99  support.path.get_parent_paths.time_one_submod_subdir
       6.01±0.2ms       6.30±0.5ms     1.05  support.path.get_parent_paths.time_one_submod_toplevel
        33.9±0.2s       32.7±0.07s     0.97  usecases.study_forrest.time_make_studyforrest_mockup

This addresses some aspect of dataladgh-3766

It also removes an unused feature to limit the check to remote branches.
Even the crawler doesn't use it.

It also removes an explicit check for a direct mode related branch name,
that doesn't seem to be relevant, even for direct mode repos, and
@bproldrack also had no memory on when exactly this could happen. Given
that we do not support it anymore, I decided to remove the additional
complexity.
@codecov
Copy link

codecov bot commented Oct 11, 2019

Codecov Report

Merging #3768 into master will increase coverage by 0.11%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3768      +/-   ##
==========================================
+ Coverage   80.67%   80.78%   +0.11%     
==========================================
  Files         273      273              
  Lines       35962    35962              
==========================================
+ Hits        29011    29051      +40     
+ Misses       6951     6911      -40
Impacted Files Coverage Δ
datalad/support/gitrepo.py 84.18% <100%> (ø) ⬆️
datalad/support/network.py 81.48% <0%> (-0.24%) ⬇️
datalad/downloaders/tests/test_http.py 60.29% <0%> (+2.2%) ⬆️
datalad/interface/run_procedure.py 86.95% <0%> (+19.87%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ad8d02...e2f2f61. Read the comment docs.

@mih
Copy link
Member Author

mih commented Oct 11, 2019

New profile, after second OPT

image

@yarikoptic
Copy link
Member

yarikoptic commented Oct 11, 2019

I would be more interested in benchmarks, but it seems their setup is broken after moving over to github actions

@mih
Copy link
Member Author

mih commented Oct 11, 2019

I would be more interested in benchmarks, but it seems their setup is broken after moving over to github actions

Please share what leads you to this conclusion. Benchmarks are copied above, the last commit hasn't brought significant change on top of that.

Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both these commits look like clear improvements to me (and move the benchmarks closer to the original state). FWIW here are my local benchmarks of this PR against the commit before the for_each_ref_ merge

PR vs 481fcf9
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7
·· Installing 481fcf94 <pre-foreachref> into virtualenv-py3.7.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For DataLad commit 481fcf94 <pre-foreachref>:
[  0.00%] ·· Benchmarking virtualenv-py3.7
[  8.33%] ··· Setting up common.py:123                                                                 ok
[  8.33%] ··· Running (api.supers.time_subdatasets--)...
[ 33.33%] ··· api.supers.time_subdatasets                                                     19.7±0.06ms
[ 41.67%] ··· api.supers.time_subdatasets_recursive^C
[ 41.67%] ·· Interrupted.
(datalad) [hylob: datalad]% asv run -b subdatasets && asv compare -m hylob pre-foreachref mih/opt-foreachref
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7
·· Installing e2f2f616 <pull/origin/3768> into virtualenv-py3.7.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For DataLad commit e2f2f616 <pull/origin/3768>:
[  0.00%] ·· Benchmarking virtualenv-py3.7
[  8.33%] ··· Setting up common.py:123                                                                 ok
[  8.33%] ··· Running (api.supers.time_subdatasets--)...
[ 33.33%] ··· api.supers.time_subdatasets                                                      33.6±0.2ms
[ 41.67%] ··· api.supers.time_subdatasets_recursive                                               169±4ms
[ 50.00%] ··· api.supers.time_subdatasets_recursive_first                                      33.4±0.2ms
[ 50.00%] · For DataLad commit 481fcf94 <pre-foreachref>:
[ 50.00%] ·· Building for virtualenv-py3.7..
[ 50.00%] ·· Benchmarking virtualenv-py3.7
[ 58.33%] ··· Setting up common.py:123                                                                 ok
[ 58.33%] ··· Running (api.supers.time_subdatasets--)...
[ 83.33%] ··· api.supers.time_subdatasets                                                      19.8±0.2ms
[ 91.67%] ··· api.supers.time_subdatasets_recursive                                              114±10ms
[100.00%] ··· api.supers.time_subdatasets_recursive_first                                     19.3±0.08ms

All benchmarks:

       before           after         ratio
     [481fcf94]       [e2f2f616]
     <pre-foreachref>       <pull/origin/3768>
+      19.8±0.2ms       33.6±0.2ms     1.69  api.supers.time_subdatasets
+        114±10ms          169±4ms     1.47  api.supers.time_subdatasets_recursive
+     19.3±0.08ms       33.4±0.2ms     1.73  api.supers.time_subdatasets_recursive_first

"""
return any(
b['refname:strip=2'] == 'git-annex' or b['refname:strip=2'].endswith('/git-annex')
for b in self.for_each_ref_(fields='refname:strip=2', pattern=['refs/heads', 'refs/remotes'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't care about the actual output, this could be further simplified to

diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index 2cee19aee..71a6d84ae 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -1000,8 +1000,7 @@ def is_with_annex(self):
         """Report if GitRepo (assumed) has (remotes with) a git-annex branch
         """
         return any(
-            b['refname:strip=2'] == 'git-annex' or b['refname:strip=2'].endswith('/git-annex')
-            for b in self.for_each_ref_(fields='refname:strip=2', pattern=['refs/heads', 'refs/remotes'])
+            self.for_each_ref_(fields='refname', pattern=['refs/*/git-annex'])
         )
 
     @classmethod

We're offloading more of the processing to git, so that shouldn't hurt performance, but, at least with our benchmarks, I didn't see any improvements with this change.

@kyleam
Copy link
Contributor

kyleam commented Oct 11, 2019

appveyor failure is the sporadic hanging of test_autoenabled_remote_msg (gh-3617).

@yarikoptic
Copy link
Member

Please share what leads you to this conclusion. Benchmarks are copied above, the last commit hasn't brought significant change on top of that.

oh, they are indeed there. I have concluded that because when I looked on the phone I did

  • switch to desktop mode (no Actions in mobile view)
  • go to actions for that run
  • switch to benchmarks run
  • click on "Run benchmarks" expandable
  • see only
Installing collected packages: asv
Successfully installed asv-0.4.1
· No information stored about machine 'fv-az88'. I know about nothing.
  

I will now ask you some questions about this machine to identify it in the benchmarks.

1. machine:  A unique name to identify this machine in the results.
   May be anything, as long as it is unique across all the machines
   used to benchmark this project.  NOTE: If changed from the default,
   it will no longer match the hostname of this machine, and you may
   need to explicitly use the --machine argument to asv.

which suggested that something was off, and it was not obvious that I need to scroll down for a bit. (on the phone things are small). I will just keep in mind that later on. Sorry for the noise.

@mih
Copy link
Member Author

mih commented Oct 11, 2019

Thx! Merging

@mih mih merged commit 2ce07ba into datalad:master Oct 11, 2019
@mih mih deleted the opt-foreachref branch October 11, 2019 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants