Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RF: Use WitlessRunner for config manager #4699

Merged
merged 6 commits into from Sep 1, 2020
Merged

Conversation

mih
Copy link
Member

@mih mih commented Jul 13, 2020

Includes 109ac04 from #4703 that exposed a critical flaw in GitWitlessRunner and a limitation in WitlessRunner

@yarikoptic
Copy link
Member

Do you expect any positive or negative effect on run time? config reading is in the core of all operations, that is why I wonder. According to benchmarks run on github actions -- no effect either way pretty much.

@mih
Copy link
Member Author

mih commented Jul 14, 2020

I have no strong expectations either way. I have two motivations: 1. It would be nice to have all code use a single runner. 2. I want to find the suspected bug reflected in #4698 by exercising this runner extremely often, on something very simple.

@mih
Copy link
Member Author

mih commented Jul 14, 2020

I have long suspected it, but never saw it that clearly. Here is running datalad.tests.test_config.test_bare as of 7ab8955, once individually, and once as part of the entire test module. The test is altered to print some key info:

diff --git a/datalad/tests/test_config.py b/datalad/tests/test_config.py
index 7a078cef0..1d9fefe1d 100644
--- a/datalad/tests/test_config.py
+++ b/datalad/tests/test_config.py
@@ -520,6 +520,10 @@ def test_bare(path):
     # can we handle a bare repo?
     gr = GitRepo(path, create=True, bare=True)
     # any sensible (and also our CI) test environment(s) should have this
+    print('CFG', gr.config._store)
+    print('RUNNERCWD', gr.config._runner.cwd)
+    print('REPOPATH', gr.path)
+    print('CFGFILES', gr.config._cfgfiles)
     assert_in('user.name', gr.config)
     # not set something that wasn't there
     obscure_key = 'sec.reallyobscurename!@@.key'

Standalone output (test passes):

CFG {'user.name': 'DataLad Tester', 'user.email': 'test@example.com', 'core.repositoryformatversion': '0', 'core.filemode': 'true', 'core.bare': 'true', 'datalad.datasets.topurl': 'http://datasets-tests.datalad.org/', 'datalad.log.level': '100'}
RUNNERCWD /tmp/datalad_temp_test_bare61yq6hus
REPOPATH /tmp/datalad_temp_test_bare61yq6hus
CFGFILES {'/tmp/datalad_temp_test_bare61yq6hus/.git/config', '/tmp/datalad_temp_rp6ifvn3/.gitconfig', '/tmp/datalad_temp_test_bare61yq6hus/.datalad/config', '/tmp/datalad_temp_test_bare61yq6hus/config'}

Moduletest output (test fails):

CFG {'datalad.datasets.topurl': 'http://datasets-tests.datalad.org/', 'datalad.log.level': '100', 'datalad.crazy.cfg': 'impossibletoguess', 'datalad.crazy.override': 'fromenv'}
RUNNERCWD /tmp/datalad_temp_test_bareddv6au6o
REPOPATH /tmp/datalad_temp_test_bareddv6au6o
CFGFILES {'/tmp/datalad_temp_test_bareddv6au6o/.git/config', '/tmp/datalad_temp_test_bareddv6au6o/.datalad/config'}

what we are seeing here is the state of a ConfigManager instance that doesn't even exist in the context of that specific test. It has residuals from various other tests (e.g. test_from_env, test_something).

I believe that this has at least partly to do with the runner (WitlessRunner in this case) keeping a copy of the environment that was present when it was initialized. It is unclear to me how exactly configuration can leak into other config manager instances, and whether this is due to some change in this PR. However, I have seen this kind of magic change repeatedly in other contexts, and I am afraid it was already present in some form, somewhere, before this PR.

Update: It reads different config files, updated the output above

@bpoldrack
Copy link
Member

I suspect it as nothing to do with the runner itself, but with our flyweights and with_tempfile. *Repo and ConfigManager should be new instances on a new path and we assume this throughout the test battery. However, with_tempfile does clean up after a test, which means, that the very same path is available for another tempfile creation later on. If those decorators therefore manage to create the same temporary repository location in several tests, the respective *Repo/ConfigManager instance might still exist and when it is accessed, re-evaluation would figure: Thing is still valid, since there's "still" a repo at that path.

Just a theory by now, but it seems more likely to me ATM.

@mih
Copy link
Member Author

mih commented Jul 14, 2020

Push the fix from #4703 -- works locally.

@bpoldrack
Copy link
Member

Push the fix from #4703 -- works locally.

I'm not convinced. It's plausible that this fix mitigates the issue, of course (and that fix might be desirable either way). But why does it have an effect in case of - say datalad.tests.test_config.test_bare? It should be a brandnew instance in that test. The fact that it isn't is the underlying issue from my POV.

Do you want to try, whether I'm right or should I dig into it independently?

@mih
Copy link
Member Author

mih commented Jul 14, 2020

@bpoldrack Independent is better. I do not debate that there could be more issues (in all likelyhood there are), but #4703 definitely fixes one of them.

@bpoldrack
Copy link
Member

Ok. Tested and it looks like I'm wrong. Ran the entire test battery w/ a patch to record the paths with_tempfile is delivering. Slightly more than 100k different ones, not a single double usage of a such a path. Since it should be rather random this isn't exactly proving, but it seems unlikely that I was on the right track.

However, still not clear why it's the same instance, so need to keep digging.

@bpoldrack
Copy link
Member

I'm trying to reproduce the issues you pointed to in previous post about how datalad.tests.test_config.test_bare w/o the env fix for WitlessRunner in order to find the underlying problems I suspect to be there.
However, looking into it this:

what we are seeing here is the state of a ConfigManager instance that doesn't even exist in the context of that specific test. It has residuals from various other tests (e.g. test_from_env, test_something).

appears to be an incorrect assessment, I think. Picking up those configs from test_from_env, test_something actually is correct behavior of ConfigManager. It's those tests that do the wrong thing. They don't temporarily patch the environment for their own purpose but change the environment of the running process. They should actually use some with patch ... instead of assigning to os.environ.

@mih
Copy link
Member Author

mih commented Jul 14, 2020

appears to be an incorrect assessment, I think. Picking up those configs from test_from_env, test_something actually is correct behavior of ConfigManager. It's those tests that do the wrong thing. They don't temporarily patch the environment for their own purpose but change the environment of the running process. They should actually use some with patch ... instead of assigning to os.environ.

Sounds reasonable ... and orthogonal to this PR. Do you want to open a separate one?

@bpoldrack
Copy link
Member

Sounds reasonable ... and orthogonal to this PR. Do you want to open a separate one?

#4707

kyleam added a commit to kyleam/datalad that referenced this pull request Aug 26, 2020
run_async_cmd() waits for a Future that is marked as done in
WitlessProtocol.process_exited() and then collects the output---which
was received via WitlessProtocol.pipe_data_received()---and returns
it.  However, if the command exits quickly, process_exited() may be
called before pipe_data_received(), in which case run_async_cmd()
returns empty output.  Some of our calls to git seem fast enough to
trigger this issue (dataladgh-4773).

This appears to be a bug in asyncio [1] and can be triggered using a
Protocol class that is identical to one in the Python documentation
[2] aside from using a faster command.  Even if this is resolved
upstream, though, we need a workaround for the Python versions that we
support.

Attempt to reliably capture the output by waiting for
`transport._wait()`.  While there may be more proper ways to solve
this, it's the only workaround I've been able to come up.  The
following suggests that it might do the trick:

  * create_subprocess_{exec,shell} don't seem to suffer from this
    problem.  The Process object they return has a .wait method that
    waits for `transport._wait`.  However, those functions use a very
    different protocol than WitlessProtocol, so it's possible that
    that wait isn't the key difference.  Still, it suggests it's a
    safe thing to do.

  * An open PR, dataladgh-4699, switches the ConfigManager over to using the
    WitlessRunner and is triggering a spread of failures that is
    plausibly due to losing output of the quick 'git config' calls.
    Merging the change here into that PR resulted in a passing Travis
    build [3].

  * With the script posted at the Python bug report, I've been unable
    to trigger any dropped output on two runs with 10000 iterations if
    I add `await asyncio.ensure_future(transport._wait())`.  Without
    that line, I've consistently been able to trigger it with many
    fewer iterations.

  * In a reproducer script (included in this PR) that involves Datalad
    and `git commit -- non-existing-file` (inspired by dataladgh-4773), I've
    been unable to trigger the error in hundreds of iterations with
    the changes in this series.  Before these changes, I've been able
    to consistently trigger it using fewer than 100 iterations.

This reverts 2955ba1 (TST: Skip known failure in
test_AnnexRepo_commit, 2020-08-04), as it should resolve the source of
that flaky test failure.

[1] Filed at https://bugs.python.org/issue41594
[2] https://docs.python.org/3/library/asyncio-protocol.html#loop-subprocess-exec-and-subprocessprotocol
[3] https://travis-ci.org/github/datalad/datalad/builds/721449200

Closes datalad#4773
@mih
Copy link
Member Author

mih commented Aug 27, 2020

Rebased after merge of #4835

@mih
Copy link
Member Author

mih commented Aug 27, 2020

All good on linux now! Wonderful!

Windows test failure point to the fact that WitlessRunner operates with system encoding by default, whereas the old Runner forces utf8. I will prepare a change that makes this flexible in WitlessRunner and configure it to be always utf8 when interacting with git-config

… default

This makes it possible to force UTF-8 encoding for a WitlessProtocol,
like the old Runner is doing it for its processing.
…bscure

On Windows we would have some codepage being default, but the content of
a git-config setting can easily use UTF-8 encoding (and it is somewhat
safe to assume that UTF-8 is the correct encoding for git-config
output).
@yarikoptic
Copy link
Member

Note: benchmarks seems to suggest up to 20% penalty

@mih
Copy link
Member Author

mih commented Aug 27, 2020

Note: benchmarks seems to suggest up to 20% penalty

I have learned to largely ignore the benchmarks. However, I also see a 5% slowdown (stddev=3%) in the the config manager test runtime.

Please keep in mind that the focus of this PR is WitlessRunner not ConfigManager. Exposing such differences is the aim.

@codecov
Copy link

codecov bot commented Aug 27, 2020

Codecov Report

Merging #4699 into master will decrease coverage by 0.01%.
The diff coverage is 96.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4699      +/-   ##
==========================================
- Coverage   89.69%   89.68%   -0.02%     
==========================================
  Files         289      289              
  Lines       40480    40481       +1     
==========================================
- Hits        36310    36305       -5     
- Misses       4170     4176       +6     
Impacted Files Coverage Δ
datalad/support/external_versions.py 96.12% <93.75%> (ø)
datalad/cmd.py 91.90% <100.00%> (-0.38%) ⬇️
datalad/config.py 96.60% <100.00%> (+0.01%) ⬆️
datalad/support/tests/test_external_versions.py 91.41% <100.00%> (ø)
datalad/downloaders/tests/test_http.py 89.92% <0.00%> (-1.23%) ⬇️
datalad/downloaders/http.py 84.55% <0.00%> (-0.39%) ⬇️
datalad/downloaders/base.py 78.57% <0.00%> (-0.36%) ⬇️
datalad/support/gitrepo.py 89.30% <0.00%> (+0.07%) ⬆️
datalad/interface/tests/test_unlock.py 97.93% <0.00%> (+2.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b416b4...f380d55. Read the comment docs.

@mih
Copy link
Member Author

mih commented Sep 1, 2020

Merged present master to incorporate recent ConfigManager optimizations.

@mih
Copy link
Member Author

mih commented Sep 1, 2020

As they will likely be of interest, here are the benchmarks:

2020-09-01T09:49:09.7917807Z All benchmarks:
2020-09-01T09:49:09.7918442Z 
2020-09-01T09:49:09.7918827Z        before           after         ratio
2020-09-01T09:49:09.7919523Z      [7b416b40]       [72bf70ca]
2020-09-01T09:49:09.7920374Z      <bm/merge-target>       <bm/pr>   
2020-09-01T09:49:09.7921006Z        1.30±0.01s       1.32±0.02s     1.01  api.supers.time_createadd
2020-09-01T09:49:09.7921765Z        1.27±0.01s       1.28±0.01s     1.01  api.supers.time_createadd_to_dataset
2020-09-01T09:49:09.7922146Z            failed           failed      n/a  api.supers.time_diff
2020-09-01T09:49:09.7922604Z            failed           failed      n/a  api.supers.time_diff_recursive
2020-09-01T09:49:09.7923147Z        6.40±0.01s       6.43±0.02s     1.00  api.supers.time_installr
2020-09-01T09:49:09.7923701Z           223±3ms          226±3ms     1.01  api.supers.time_ls
2020-09-01T09:49:09.7924257Z        2.14±0.02s       2.17±0.01s     1.01  api.supers.time_ls_recursive
2020-09-01T09:49:09.7924831Z        2.34±0.02s       2.35±0.03s     1.00  api.supers.time_ls_recursive_long_all
2020-09-01T09:49:09.7925223Z            failed           failed      n/a  api.supers.time_remove
2020-09-01T09:49:09.7925736Z        1.05±0.02s       1.07±0.02s     1.02  api.supers.time_status
2020-09-01T09:49:09.7926305Z        1.80±0.01s       1.82±0.03s     1.01  api.supers.time_status_recursive
2020-09-01T09:49:09.7927032Z          97.2±2ms         98.7±3ms     1.02  api.supers.time_subdatasets
2020-09-01T09:49:09.7927941Z           520±3ms         537±20ms     1.03  api.supers.time_subdatasets_recursive
2020-09-01T09:49:09.7928537Z           100±2ms          100±5ms     1.00  api.supers.time_subdatasets_recursive_first
2020-09-01T09:49:09.7929126Z        4.01±0.03s       4.09±0.05s     1.02  api.supers.time_uninstall
2020-09-01T09:49:09.7929709Z           640±9ms          641±4ms     1.00  api.testds.time_create_test_dataset1
2020-09-01T09:49:09.7930447Z        3.92±0.02s       3.88±0.01s     0.99  api.testds.time_create_test_dataset2x2
2020-09-01T09:49:09.7931257Z       2.15±0.04ms       2.11±0.2ms     0.98  core.runner.time_echo
2020-09-01T09:49:09.7931913Z       2.31±0.08ms       2.39±0.2ms     1.04  core.runner.time_echo_gitrunner
2020-09-01T09:49:09.7932684Z          795±20ms          782±7ms     0.98  core.startup.time_help_np
2020-09-01T09:49:09.7933263Z           141±3ms          138±4ms     0.98  core.startup.time_import
2020-09-01T09:49:09.7933844Z           525±8ms          527±5ms     1.00  core.startup.time_import_api
2020-09-01T09:49:09.7934590Z          32.8±1ms       34.3±0.9ms     1.05  repo.gitrepo.time_get_content_info
2020-09-01T09:49:09.7935171Z        7.34±0.7ms       7.05±0.8ms     0.96  support.path.get_parent_paths.time_allsubmods_toplevel
2020-09-01T09:49:09.7935795Z        6.25±0.3ms       6.11±0.3ms     0.98  support.path.get_parent_paths.time_allsubmods_toplevel_only
2020-09-01T09:49:09.7936399Z           295±8ns         333±40ns    ~1.13  support.path.get_parent_paths.time_no_submods
2020-09-01T09:49:09.7937084Z        5.48±0.2ms       5.31±0.3ms     0.97  support.path.get_parent_paths.time_one_submod_subdir
2020-09-01T09:49:09.7937688Z        5.45±0.3ms       5.43±0.2ms     1.00  support.path.get_parent_paths.time_one_submod_toplevel
2020-09-01T09:49:09.7938281Z        31.8±0.09s        32.4±0.2s     1.02  usecases.study_forrest.time_make_studyforrest_mockup

Looks "same enough" to me.

@mih mih added the merge-if-ok OP considers this work done, and requests review/merge label Sep 1, 2020
Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't see anything wrong with this one.

Copy link
Collaborator

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mih:

I have two motivations: 1. It would be nice to have all code use a single runner. 2. I want to find the suspected bug reflected in #4698 by exercising this runner extremely often, on something very simple.

I very much agree with one, and two is no longer relevant, so I'm in favor of merging this.

@mih
Copy link
Member Author

mih commented Sep 1, 2020

Wonderful, thx for the feedback!

@mih mih merged commit a5ed5f5 into datalad:master Sep 1, 2020
2 checks passed
@mih mih deleted the rf-witlessconfig branch September 1, 2020 14:48
@kyleam kyleam mentioned this pull request Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok OP considers this work done, and requests review/merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants