Generator-style runner with application in BatchedCommand and AnnexRepo #6244

Merged

Conversation

christian-monch (Contributor) commented Nov 24, 2021

This PR provides a new threaded runner implementation that supports timeouts and generator-based subprocess communication, which means stdout and stderr data are available through a generator. Our standard protocol approach is also supported in generator mode, i.e. every data chunk from stdout and stderr is passed through an instance of WitlessProtocol (or a subclass) by calling WitlessProtocol.pipe_data_received(). (A brief usage sketch follows the list below.)

This PR:

  • Allows using a queue.Queue() as stdin (in addition to the previously supported types), for convenient stdin writing.
  • Implements generator-style run() semantics if the provided protocol is derived from the GeneratorMixIn class.
  • Adds timeout capabilities for traffic on the output file descriptors and for process exit.
  • Adds basic protocol tests.
  • Adds tests for runner exceptions.
  • Adds documentation for the threaded runners.
  • Adds tests for BatchedCommand.

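For illustration, here is a minimal sketch of the generator-style usage described above. The protocol class is hypothetical, and the import paths and constructor signature are my assumptions, not code taken from this PR:

```python
from datalad.cmd import WitlessRunner
from datalad.runner.coreprotocols import StdOutErrCapture
from datalad.runner.protocol import GeneratorMixIn


class GeneratorStdOutErrProtocol(GeneratorMixIn, StdOutErrCapture):
    """Hypothetical protocol: deriving from GeneratorMixIn switches run() into generator mode."""

    def __init__(self, done_future=None, encoding=None):
        # Initialize both bases explicitly; the constructor signature is an
        # assumption and mirrors WitlessProtocol's.
        GeneratorMixIn.__init__(self)
        StdOutErrCapture.__init__(self, done_future, encoding)

    def pipe_data_received(self, fd, data):
        # Forward every stdout/stderr chunk to the consuming generator.
        self.send_result((fd, data))


runner = WitlessRunner()
# With a GeneratorMixIn-derived protocol, run() returns a generator that
# yields results while the subprocess is still running.
for fd, data in runner.run(['git', 'annex', 'version'],
                           protocol=GeneratorStdOutErrProtocol):
    print(fd, data.decode(errors='replace'))
```

A queue.Queue() instance could likewise be passed as stdin to feed the process incrementally, per the first bullet above; that is omitted here for brevity.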
The PR also uses and demonstrates the generator capabilities in AnnexRepo and in BatchedCommand. It:

  • Adds a GeneratorAnnexJsonProtocol to demonstrate how to send data to the generator.
  • Supports a generator-style runner in AnnexRepo._call_annex_items_(). To support the new _call_annex_items_(), the method AnnexRepo._call_annex() has been extended to support a generator-style runner, i.e. it uses the generator-style GitRunner.run_on_filelist_chunks_items_().
  • Implements BatchedCommand on top of a generator-style runner, which consolidates our codebase and demonstrates generator-style runner interaction (a short usage sketch follows this list).
  • Adds documentation for BatchedCommand.

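Since BatchedCommand keeps a backward-compatible interface, existing call sites should continue to look roughly like this (a sketch based on my reading of the current API; the `cat` example and the exact return values are assumptions):

```python
from datalad.cmd import BatchedCommand

# A long-running "batch" subprocess: one request line in, one response line out.
bc = BatchedCommand(['cat'])
print(bc('hello'))   # expected to print the echoed line: hello
print(bc('world'))   # expected to print: world
bc.close()           # terminate the batch process
```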
Although there are quite a number of changes in this PR, the modifications are mainly encapsulated in the runner code and in BatchedCommand. Both have a slightly extended, backward compatible API. That means the new implementations are drop-in replacements.

The "application" code that changed is mostly confined to about four dozen lines in datalad/support/annexrepo.py. Due to the structure of the existing code, large parts of AnnexRepo are not generator-based, i.e. they expect and return lists or tuples. This required "unwinding" the generator-style results in the functions that return to legacy code (see the short example below). I think this PR lays the groundwork for converting our code iteratively to "all the way" generator-based code (where appropriate).
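To make the "unwinding" concrete, the pattern looks roughly like this (a hypothetical method name for illustration; `_call_annex_items_()` is the generator-based helper mentioned above):

```python
def legacy_list_api(self, args):
    # Internals are generator-based now, but callers of this method still
    # expect a list, so the generator is materialized ("unwound") here.
    return list(self._call_annex_items_(args))
```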

This PR is the cleaned-up, improved, and optimized version of PR #6125, which had become convoluted. All reviewer comments on PR #6125 have been considered, and their resolution is part of this PR.

christian-monch added the semver-documentation (Changes only affect the documentation, no impact on version), semver-internal (Changes only affect the internal API), and semver-minor (Increment the minor version when merged) labels on Nov 24, 2021
christian-monch requested review from mih, yarikoptic and bpoldrack and removed the review request for yarikoptic on November 24, 2021 19:29
christian-monch changed the title from "enhance the runner with timeouts and generators" to "Generator-style runner with application in BatchedCommand and AnnexRepo" on Nov 24, 2021
Commit: Add a new runner that supports timeouts and generator-based subprocess stdout and stderr data reporting.
adswa (Member) commented Nov 25, 2021

mhhh the windows tests seem to timeout...

christian-monch (Contributor, Author) commented
> mhhh the windows tests seem to timeout...

Yes, they do. Thanks. :-) I am currently investigating.

codecov bot commented Nov 25, 2021

Codecov Report

Merging #6244 (fb8a4c4) into master (6b4d556) will decrease coverage by 29.59%.
The diff coverage is 86.15%.


@@             Coverage Diff             @@
##           master    #6244       +/-   ##
===========================================
- Coverage   89.66%   60.07%   -29.60%     
===========================================
  Files         323      143      -180     
  Lines       41930    20322    -21608     
===========================================
- Hits        37597    12208    -25389     
- Misses       4333     8114     +3781     
Impacted Files Coverage Δ
datalad/runner/utils.py 80.00% <77.77%> (-20.00%) ⬇️
datalad/cmd.py 73.51% <79.51%> (-18.25%) ⬇️
datalad/runner/gitrunner.py 82.89% <80.00%> (-7.39%) ⬇️
datalad/runner/runnerthreads.py 86.44% <86.44%> (ø)
datalad/runner/nonasyncrunner.py 90.55% <90.08%> (-9.45%) ⬇️
datalad/runner/protocol.py 91.17% <90.90%> (-8.83%) ⬇️
datalad/support/annexrepo.py 85.68% <94.11%> (-5.44%) ⬇️
datalad/runner/exception.py 92.30% <100.00%> (-3.78%) ⬇️
datalad/runner/runner.py 100.00% <100.00%> (ø)
datalad/local/rerun.py 13.39% <0.00%> (-83.01%) ⬇️
... and 291 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mih (Member) commented Nov 26, 2021

The shiniest green!

yarikoptic (Member) commented
FWIW -- for the semver- labels it is better to choose just one here (semver-minor), since they decide in which section of the changelog this PR should be described. Not sure what should happen when there are multiple labels (would it get duplicated in 3 sections?)

christian-monch removed the semver-internal (Changes only affect the internal API) and semver-documentation (Changes only affect the documentation, no impact on version) labels on Nov 30, 2021
christian-monch (Contributor, Author) commented
> FWIW -- for the semver- labels it is better to choose just one here (semver-minor), since they decide in which section of the changelog this PR should be described. Not sure what should happen when there are multiple labels (would it get duplicated in 3 sections?)

Thanks, fixed

mih (Member) left a comment

Monster PR ... in every sense! ;-)

This was my first complete pass over the code. I left a bunch of notes. If I got things right, we have:

  • pretty complete test coverage of the runner updates
  • generator-capability usage in AnnexRepo.call_annex_items_(), which implies usage in
    • export_archive_ora
    • cfg_noannex
    • call_annex_oneline()
      • (s|g)et_preferred_content()
      • (s|g)et_group_wanted()
      • copy_file
      • aggregate
    • unannex() (pretty much unused)
    • get_annexed_files() (tests only)
    • get_contentlocation() (archive special remote)
  • more-or-less comprehensive tests for a BatchedAnnexCommand-style usage, without actually hooking it into the mainline code.
  • via BatchedAnnex:
    • find()
    • add_url_to_file()
    • drop_key()
    • whereis()
    • info()
    • is_available()
    • get_metadata()

Worth keeping an eye on the benchmarks. The current sample has a bunch of near-threshold items (in both directions), none of which seem closely related to the functionality list above:

2.26±0.04s   2.60±0.03s   ~1.15 api.supers.time_ls_recursive_long_all
1.23±0.07s   1.35±0.05s    1.10 api.supers.time_status
369±40ms     333±30ms     ~0.90 api.supers.time_uninstall
180±10ms     162±8ms      ~0.90 core.startup.time_import
3.46±0.3ms   4.15±0.3ms   ~1.20 core.witlessrunner.time_echo_gitrunner_fullcapture
606±40ms     549±30ms     ~0.91 plugins.addurls.addurls1.time_addurls('*')
17.6±2ms     16.0±0.8ms   ~0.91 repo.gitrepo.time_get_content_info
7.67±0.5ms   6.35±0.3ms   ~0.83 support.path.get_parent_paths.time_allsubmods_toplevel_only

I will now proceed with more hands-on testing. Please let me know if my picture from above is missing key points or got them wrong. Thx!


Diff excerpt under discussion:

# Flush any content still buffered in the line splitter (a final line
# without a trailing separator) and yield it with a separator appended.
remaining_content = line_splitter.finish_processing()
if remaining_content is not None:
    yield remaining_content + os.linesep
A Member commented:
Note to self: This entire diff block feels like a pattern that we are likely to see repeated frequently. It probably makes sense to pull it out of here.
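For illustration, the repeated pattern could be pulled out into a small helper along these lines (a hypothetical sketch, not code from this PR; the `process()`/`finish_processing()` calls reflect my reading of datalad's LineSplitter and should be treated as assumptions):

```python
import os
from typing import Generator, Iterable


def splitter_lines(chunks: Iterable[str], line_splitter) -> Generator[str, None, None]:
    """Turn a stream of data chunks into lines, flushing any trailing remainder."""
    for chunk in chunks:
        # process() is assumed to return the complete lines found in the chunk.
        for line in line_splitter.process(chunk):
            yield line + os.linesep
    # Emit whatever is still buffered after the last chunk was processed.
    remaining_content = line_splitter.finish_processing()
    if remaining_content is not None:
        yield remaining_content + os.linesep
```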

A Member commented:
@christian-monch call_annex_items_() and call_git_items_() were pretty much identical methods before this change. It would make sense to approach that one in this PR too. I think it would give a better perspective on the comment I gave above re the placement of this pattern.

Moreover, hooking call_git_items_() up to the generator runner would additionally expose it to:

  • drop
  • file_has_content()
  • is_under_annex()
  • remove()
  • get_branch_commits_()
  • get_special_remotes()
  • diffstatus()
    • status
    • diff
  • get_staged_paths()
  • get_branch_commits()
  • get_gitattributes()
  • for_each_ref()
    • clone
    • get
    • is_with_annex()
    • get_(remote_)branches()
    • get_tags()
  • call_git_oneline()
    • update

Taken together, this is a vast chunk of datalad functionality, with -- from my POV -- rather limited additional effort. Via status and diff a good number of other high-level commands are coming in too. It would give us a very concrete idea on how this performs in practice across a wide range of scenarios. WDYT?

christian-monch (Contributor, Author) commented Dec 2, 2021

> • more-or-less comprehensive tests for a BatchedAnnexCommand-style usage, without actually hooking it into the mainline code.

Maybe I misunderstood your comment, but BatchedAnnex is based on BatchedCommand, which in turn uses the new runner. So all BatchedAnnex operations that are executed are eventually processed by the new runner.

> Worth keeping an eye on the benchmarks. The current sample has a bunch of near-threshold items (in both directions), none of which seem closely related to the functionality list above.

Indeed. Looking at the benchmarks was actually one reason to rework the generator-runner approach and remove the "blocking OS-threads". I am surprised by the 1.10 to 1.20 factors and have not seen such large regressions in previous runs. Usually, the factors were between 0.98 and 1.02. There might be two reasons for the fluctuations:

  1. the benchmark is always run against the latest master and this branch is not regularly rebased, so non-PR related code also contributes to the result.
  2. different loads on the test systems.

> @christian-monch call_annex_items_() and call_git_items_() were pretty much identical methods before this change. It would make sense to approach that one in this PR too. I think it would give a better perspective on the comment I gave above re the placement of this pattern.

I can certainly do it in this PR. I just want to avoid the situation where we are again piling up so many commits that the PR becomes too monstrous. ;-) Also, I am not a datalad power user, so I might not detect problems that are not caught by the tests. It might be useful to have more people working with this PR before increasing its size too much. Having said that, I agree that it makes a lot of sense to convert call_git_items_(), and I will do that.

mih (Member) commented Dec 2, 2021

> Maybe I misunderstood your comment, but BatchedAnnex is based on BatchedCommand, which in turn uses the new runner. So all BatchedAnnex operations that are executed are eventually processed by the new runner.

No, you did not misunderstand. I misread! Thx for pointing this out. I will update the usage overview...

Update: I corrected the list above! It doesn't make a huge difference, but my initial utilization estimation was certainly incorrect. Thx for pointing this out!

A Member commented:
re benchmarks: I see ~20% difference report flukes quite frequently. I just wanted to leave a note that the last report at that time had these stats. Nothing to worry about for now, I think.

Two review threads on docs/source/design/threaded_runner.rst were resolved (one marked outdated).
Co-authored-by: Michael Hanke <michael.hanke@gmail.com>
christian-monch merged commit 6fae464 into datalad:master on Dec 2, 2021
christian-monch deleted the enh-generator-runner branch on December 2, 2021 14:30
Labels
semver-minor (Increment the minor version when merged)
4 participants