Generator-style runner with application in BatchedCommand and AnnexRepo #6244
Add a new runner that supports timeouts and generator based subprocess stdout and stderr data reporting.
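For orientation, here is a minimal, self-contained sketch of what timeout-aware, generator-based reporting of subprocess stdout and stderr can look like in plain Python. It is illustrative only and does not reflect datalad's actual runner API; the function name run_generator and the (source, data) tuple format are made up. Reader threads (rather than select/poll on the pipes) keep the sketch portable to Windows, where selecting on pipes is not supported.

```python
# Illustrative sketch only -- not datalad's actual runner API. A generator runs
# a subprocess, reader threads forward stdout/stderr chunks into a queue, and
# the consumer receives (source, data) tuples; a "timeout" marker is yielded
# whenever no output arrives within `timeout` seconds.
import queue
import subprocess
import threading


def run_generator(cmd, timeout=1.0):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    q = queue.Queue()

    def reader(stream, name):
        # Forward raw chunks from the pipe; signal stream exhaustion with None.
        for chunk in iter(lambda: stream.read1(1024), b""):
            q.put((name, chunk))
        q.put((name, None))

    for stream, name in ((proc.stdout, "stdout"), (proc.stderr, "stderr")):
        threading.Thread(target=reader, args=(stream, name), daemon=True).start()

    open_streams = {"stdout", "stderr"}
    while open_streams:
        try:
            name, data = q.get(timeout=timeout)
        except queue.Empty:
            # Nothing arrived in time: report the timeout instead of blocking.
            yield ("timeout", None)
            continue
        if data is None:
            open_streams.discard(name)
        else:
            yield (name, data)
    proc.wait()


if __name__ == "__main__":
    # Assumes an `echo` executable on PATH (any command works).
    for source, data in run_generator(["echo", "hello"]):
        print(source, data)
```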
mhhh the Windows tests seem to time out...
Yes, they do. Thanks. :-) I am currently investigating.
Codecov Report
@@ Coverage Diff @@
## master #6244 +/- ##
===========================================
- Coverage 89.66% 60.07% -29.60%
===========================================
Files 323 143 -180
Lines 41930 20322 -21608
===========================================
- Hits 37597 12208 -25389
- Misses 4333 8114 +3781
Continue to review full report at Codecov.
The shiniest green!
FWIW -- for
Thanks, fixed
Monster PR ... in every sense! ;-)
This was my first complete pass over the code. I left a bunch of notes. If I got things right, we have:
- pretty complete test coverage of the runner updates
- generator-capability usage in AnnexRepo.call_annex_items_(), which implies usage in
  - export_archive_ora
  - cfg_noannex
  - call_annex_oneline()
  - (s|g)et_preferred_content()
  - (s|g)et_group_wanted()
  - copy_file
  - aggregate
  - unannex() (pretty much unused)
  - get_annexed_files() (tests only)
  - get_contentlocation() (archive special remote)
- more-or-less comprehensive tests for a BatchedAnnexCommand-style usage, without actually hooking it into the mainline code - via BatchedAnnex:
  - find()
  - add_url_to_file()
  - drop_key()
  - whereis()
  - info()
  - is_available()
  - get_metadata()
Worth keeping an eye on the benchmarks. The current sample has a bunch of near-threshold items (in both directions). None of which seem closely related to the above functionality list:
2.26±0.04s 2.60±0.03s ~1.15 api.supers.time_ls_recursive_long_all
1.23±0.07s 1.35±0.05s 1.10 api.supers.time_status
369±40ms 333±30ms ~0.90 api.supers.time_uninstall
180±10ms 162±8ms ~0.90 core.startup.time_import
3.46±0.3ms 4.15±0.3ms ~1.20 core.witlessrunner.time_echo_gitrunner_fullcapture
606±40ms 549±30ms ~0.91 plugins.addurls.addurls1.time_addurls('*')
17.6±2ms 16.0±0.8ms ~0.91 repo.gitrepo.time_get_content_info
7.67±0.5ms 6.35±0.3ms ~0.83 support.path.get_parent_paths.time_allsubmods_toplevel_only
I will now proceed with more hands-on testing. Please let me know if my picture from above is missing key points or got them wrong. Thx!
```python
remaining_content = line_splitter.finish_processing()
if remaining_content is not None:
    yield remaining_content + os.linesep
```
Note to self: This entire diff block feels like a pattern that we are likely to see repeated frequently. It probably makes sense to pull it out of here.
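As an illustration of pulling this pattern out, here is a self-contained sketch of a reusable helper. Only finish_processing() appears in the diff above; the SimpleLineSplitter class, its process() method, and yield_lines() are made-up stand-ins, not datalad's actual line-splitting code.

```python
# Hypothetical helper that factors out the "flush the splitter at the end"
# pattern shown above; all names here are illustrative stand-ins.
import os
from typing import Iterable, Iterator, Optional


class SimpleLineSplitter:
    def __init__(self) -> None:
        self._buffer = ""

    def process(self, data: str) -> Iterable[str]:
        # Return complete lines, keep a trailing partial line buffered.
        self._buffer += data
        *lines, self._buffer = self._buffer.split(os.linesep)
        return lines

    def finish_processing(self) -> Optional[str]:
        # Hand back whatever is left without a line ending, if anything.
        remaining, self._buffer = self._buffer, ""
        return remaining or None


def yield_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Yield newline-terminated lines from arbitrarily split string chunks."""
    splitter = SimpleLineSplitter()
    for chunk in chunks:
        for line in splitter.process(chunk):
            yield line + os.linesep
    remaining_content = splitter.finish_processing()
    if remaining_content is not None:
        yield remaining_content + os.linesep


if __name__ == "__main__":
    print(list(yield_lines(["a" + os.linesep + "b", "c"])))  # ['a\n', 'bc\n'] on POSIX
```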
@christian-monch call_annex_items_() and call_git_items_() were pretty much identical methods before this change. It would make sense to approach that one in this PR too. I think it would give a better perspective on the comment I gave above re the placement of this pattern.
Moreover, hooking call_git_items_() to the generator runner would expose it to (in addition):
- drop
- file_has_content()
- is_under_annex()
- remove()
- get_branch_commits_()
- get_special_remotes()
- diffstatus()
- status
- diff
- get_staged_paths()
- get_branch_commits()
- get_gitattributes()
- for_each_ref()
- clone
- get
- is_with_annex()
- get_(remote_)branches()
- get_tags()
- call_git_oneline()
- update
Taken together, this is a vast chunk of datalad functionality, with -- from my POV -- rather limited additional effort. Via status and diff a good number of other high-level commands are coming in too. It would give us a very concrete idea on how this performs in practice across a wide range of scenarios. WDYT?
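To make the consolidation idea concrete, here is a hypothetical sketch of what a shared items-generator behind both call paths could look like. The names (_run_lines, _call_items_) and the plain-subprocess stand-in for the generator-style runner are assumptions, not datalad's actual API.

```python
# Hypothetical sketch of one shared items-generator behind both call paths.
import subprocess
from typing import Iterable, Iterator, List


def _run_lines(cmd: List[str]) -> Iterator[str]:
    # Stand-in runner: yield decoded stdout lines as the process produces them.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()


def _call_items_(base_cmd: List[str], args: Iterable[str]) -> Iterator[str]:
    """Shared core: run a command and yield one item per output line."""
    yield from _run_lines(list(base_cmd) + list(args))


def call_git_items_(args: Iterable[str]) -> Iterator[str]:
    return _call_items_(["git"], args)


def call_annex_items_(args: Iterable[str]) -> Iterator[str]:
    return _call_items_(["git", "annex"], args)


if __name__ == "__main__":
    # Assumes git on PATH.
    for item in call_git_items_(["--version"]):
        print(item)
```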
- more-or-less comprehensive tests for a BatchedAnnexCommand-style usage, without actually hooking it into the mainline code.
Maybe I misunderstood your comment, but BatchedAnnex is based on BatchedCommand, which in turn uses the new runner. So all BatchedAnnex-operations that are executed are eventually processed by the new runner.
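For readers who have not looked at the batch machinery, a toy model of this layering (illustrative only; the real BatchedCommand drives its process through the new runner rather than through subprocess directly, and none of these class names are datalad's):

```python
# Toy model of the layering: BatchedAnnex builds on BatchedCommand, which keeps
# one long-running process and exchanges one request/response line at a time.
import subprocess
from typing import List


class BatchedCommandModel:
    def __init__(self, cmd: List[str]) -> None:
        self.proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def __call__(self, line: str) -> str:
        # Send one request line, read one response line.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")


class BatchedAnnexModel(BatchedCommandModel):
    def __init__(self, annex_args: List[str]) -> None:
        # Same mechanics, just a `git annex ... --batch` process underneath.
        super().__init__(["git", "annex"] + list(annex_args) + ["--batch"])
```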
Worth keeping an eye on the benchmarks. The current sample has a bunch of near-threshold items (in both directions). None of which seem closely related to the above functionality list
Indeed. Looking at the benchmarks was actually one reason to rework the generator-runner approach and remove the "blocking OS-threads". I am surprised by the 1.10 to 1.20 factors and have not seen such large regressions in the previous runs. Usually, times were between 0.98 and 1.02. There might be two reasons for the fluctuations in time:
- the benchmark is always run against the latest master and this branch is not regularly rebased, so non-PR related code also contributes to the result.
- different loads on the test systems.
@christian-monch call_annex_items_() and call_git_items_() were pretty much identical methods before this change. It would make sense to approach that one in this PR too. I think it would give a better perspective on the comment I gave above re the placement of this pattern.
I can certainly do it in this PR. I just want to avoid the situation where we are again piling up so many commits that the PR becomes too monstrous. ;-) Also, I am not a datalad power user, therefore I might not detect problems that are not caught by the tests. It might be useful to have more people working with this PR before increasing its size too much. Having said that, I agree that it makes a lot of sense to convert call_git_items_(), and I will do that.
Maybe I misunderstood your comment, but BatchedAnnex is based on BatchedCommand, which in turn uses the new runner. So all BatchedAnnex-operations that are executed are eventually processed by the new runner.
No, you did not misunderstand. I misread! Thx for pointing this out. I will update the usage overview...
Update: I corrected the list above! It doesn't make a huge difference, but my initial utilization estimation was certainly incorrect. Thx for pointing this out!
re benchmarks: I see ~20% difference report flukes quite frequently. I just wanted to leave a note that the last report at that time had these stats. Nothing to worry about for now, I think.
This PR provides a new threaded runner implementation that supports timeouts and generator-based subprocess-communication, which means stdout and stderr data are available from a generator. Our standard protocol-approach is also supported with generators, i.e. every data packet from stdout and stderr is passed through an instance of WitlessProtocol (or a subclass), by calling WitlessProtocol.pipe_data_received().

This PR introduces the GeneratorMixIn-class and a re-implementation of BatchedCommand on top of the new runner.

The PR uses and demonstrates the generator-capabilities in AnnexRepo and in BatchedCommand; it covers:
- AnnexRepo._call_annex_items_(): to support the new _call_annex_items_(), the method AnnexRepo._call_annex() has been extended to support a generator-style runner, i.e. it uses a generator-style GitRunner.run_on_filelist_chunks_items_().
- BatchedCommand: now based on a generator-style runner, which consolidates our codebase and demonstrates the application of the generator-style runner in BatchedCommand.

Although there are quite a number of changes in this PR, the modifications are mainly encapsulated in the runner code and in BatchedCommand. Both have a slightly extended, backward compatible API. That means the new implementations are drop-in replacements. The "application"-code that changed is mostly in four dozen lines in datalad/support/annexrepo.py. Due to the structure of the existing code, large parts of AnnexRepo are not generator-based, i.e. they expect and return lists or tuples. This required "unwinding" the generator-style results in the functions that return to legacy code. I think this PR lays the groundwork for us to convert our code iteratively to "all the way" generator-based code (where it is appropriate).
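To visualise the protocol-plus-generator idea from the description, here is a simplified, self-contained model. It does not use datalad's actual classes; only the pipe_data_received() callback name is taken from the text above, and everything else (class and function names, the queue-based hand-off) is an assumption for illustration.

```python
# Simplified model of the described design: every stdout/stderr packet reaches
# pipe_data_received(), which forwards it to a queue drained by a generator.
import queue
import subprocess
import threading
from typing import Iterator, Tuple


class GeneratorMixInModel:
    def __init__(self) -> None:
        self._results = queue.Queue()

    def send_result(self, result) -> None:
        # Hand one item to the consuming generator.
        self._results.put(result)

    def results(self) -> Iterator:
        while True:
            item = self._results.get()
            if item is None:      # completion sentinel from the runner
                return
            yield item


class LineProtocolModel(GeneratorMixInModel):
    def pipe_data_received(self, fd: int, data: bytes) -> None:
        # fd 1 is stdout, fd 2 is stderr; forward each decoded packet.
        self.send_result((fd, data.decode()))


def run_with_protocol(cmd, protocol: LineProtocolModel) -> Iterator[Tuple[int, str]]:
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def pump(stream, fd):
        for chunk in iter(lambda: stream.read1(1024), b""):
            protocol.pipe_data_received(fd, chunk)

    threads = [threading.Thread(target=pump, args=(proc.stdout, 1)),
               threading.Thread(target=pump, args=(proc.stderr, 2))]
    for t in threads:
        t.start()

    def finalize():
        for t in threads:
            t.join()
        proc.wait()
        protocol.send_result(None)  # signal completion to the generator

    threading.Thread(target=finalize).start()
    return protocol.results()


if __name__ == "__main__":
    # Assumes an `echo` executable on PATH.
    for fd, text in run_with_protocol(["echo", "hi"], LineProtocolModel()):
        print(fd, text, end="")
```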