
Surface fetch/pull/push kill_after_timeout and reset default to None #1340

Merged
merged 17 commits into from
Sep 18, 2021

Conversation

sroet
Contributor

@sroet sroet commented Sep 10, 2021

(hopefully) closes #1339
This PR, as of e058c4c:

  • renames timeout to kill_after_timeout and sets its default to None instead of 10 s
  • surfaces kill_after_timeout to the user-facing functions (Remote.pull/push/fetch)
  • refactors AutoInterrupt so we can ._terminate() the underlying process and grab status codes from terminated processes
  • adds a class variable AutoInterrupt._status_code_if_terminate; if non-zero, it overrides any status code from the process in _terminate. This can be used to prevent race conditions in tests, as described in this comment

Some history of the commits (while trying a different strategy):

aafb300 changes the default timeout to 60 s
1d26515 allows push, pull and fetch keyword timeout that will be propagated down
febd4fe alters some tests to use the old 10 s timeout for fetch and pull
4113d01 also alters a test call to pull to add a timeout of 10s
c55a8e3 fixes the test suite
7df33f3 sets the default timeout to None

If the current default is too loose, users can now supply their own timeout.
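The terminate-on-timeout mechanism behind kill_after_timeout can be sketched with the standard library alone; the variable name mirrors the new keyword, but this is an illustration, not GitPython's actual code path:

```python
import subprocess
import sys

kill_after_timeout = 0.2  # seconds, analogous to the new keyword

# a stand-in for a stalled git process
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
try:
    status = proc.wait(timeout=kill_after_timeout)
except subprocess.TimeoutExpired:
    proc.terminate()      # ask the process to stop (SIGTERM on POSIX)
    status = proc.wait()  # reap it; a non-zero status surfaces the failure
```

With the default of None, no timeout fires and the wait simply blocks until the process finishes on its own.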

@sroet sroet changed the title Update fetch timeout [WIP] Update fetch timeout Sep 10, 2021
@sroet
Contributor Author

sroet commented Sep 10, 2021

Alright, the test suite runs locally and gives the following coverage:

---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                                Stmts   Miss  Cover
-------------------------------------------------------
git/__init__.py                        42      8    81%
git/cmd.py                            525     98    81%
git/compat.py                          46     15    67%
git/config.py                         439     48    89%
git/db.py                              27      1    96%
git/diff.py                           268     10    96%
git/exc.py                             63      3    95%
git/index/__init__.py                   2      0   100%
git/index/base.py                     498     32    94%
git/index/fun.py                      179      7    96%
git/index/typ.py                       68      3    96%
git/index/util.py                      50      5    90%
git/objects/__init__.py                13      0   100%
git/objects/base.py                    91     10    89%
git/objects/blob.py                    17      0   100%
git/objects/commit.py                 276     28    90%
git/objects/fun.py                    101      5    95%
git/objects/submodule/__init__.py       0      0   100%
git/objects/submodule/base.py         583     46    92%
git/objects/submodule/root.py         137     12    91%
git/objects/submodule/util.py          51      9    82%
git/objects/tag.py                     47     12    74%
git/objects/tree.py                   181     73    60%
git/objects/util.py                   232     32    86%
git/refs/__init__.py                    6      0   100%
git/refs/head.py                       98      6    94%
git/refs/log.py                       150     10    93%
git/refs/reference.py                  55      3    95%
git/refs/remote.py                     32      2    94%
git/refs/symbolic.py                  338     18    95%
git/refs/tag.py                        47      6    87%
git/remote.py                         427     55    87%
git/repo/__init__.py                    1      0   100%
git/repo/base.py                      552     36    93%
git/repo/fun.py                       201     20    90%
git/types.py                           44     10    77%
git/util.py                           539    121    78%
-------------------------------------------------------
TOTAL                                6426    744    88%


Results (64.31s):
     378 passed
      42 skipped

@sroet sroet changed the title [WIP] Update fetch timeout Update fetch timeout Sep 10, 2021
@sroet sroet changed the title Update fetch timeout Surface fetch/pull/push timeout and set default to 60 s Sep 10, 2021
@Byron Byron requested a review from Yobmod September 10, 2021 14:32
@Byron Byron added this to the v3.1.24 - Bugfixes milestone Sep 10, 2021
@Byron
Member

Byron commented Sep 10, 2021

Thanks a lot for the initiative. With the offending release yanked, I believe we gain some time to wait for @Yobmod to chime in on this.

My preference, if I understand what's happening correctly, is to set the default timeout to None and not alter the previous behaviour, as the code as-is can't tell whether it's just a long-running process or a stalled one. Thus I would prefer offering this up to callers only.

@Yobmod
Contributor

Yobmod commented Sep 10, 2021

Hi, I think setting the timeout default to None works better, rather than a larger number.
Thread.join(None) would then wait as long as needed (I think?), but if people want a timeout, it's still there to use.

Or it could be removed altogether.
It got added when the tests started hanging in a 3.10 beta. I couldn't find the problem, but it doesn't happen with release candidates any more.
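Thread.join(None) does indeed block until the thread finishes, while a numeric timeout returns early and leaves the thread running — a quick standard-library check (illustrative only):

```python
import threading
import time

t = threading.Thread(target=time.sleep, args=(0.5,))
t.start()

t.join(timeout=0.05)         # returns after ~0.05 s without the thread finishing
was_alive = t.is_alive()     # True: the timeout expired, the thread kept running

t.join(timeout=None)         # None means: block until the thread is done
finished = not t.is_alive()  # True
```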

@dwhswenson

FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion between cmd.execute_kwargs['kill_after_timeout'] and this timeout. As a user, without a careful read of the docs I would expect the behavior of remote.fetch(timeout=60) to be that of 'kill_after_timeout'.

@sroet sroet changed the title Surface fetch/pull/push timeout and set default to 60 s Surface fetch/pull/push timeout and reset default to None Sep 10, 2021
@sroet
Contributor Author

sroet commented Sep 10, 2021

@Yobmod, @dwhswenson ; I have reverted the default to None, please let me know what you think.

@Byron, I will not be able to push to this branch for the next ~60 hours, so feel free to push to it if you want this finished up before Monday.

@Byron
Member

Byron commented Sep 11, 2021

@sroet Thanks for your help with this and the incredibly quick turnaround, and to everyone for sharing their thoughts to make this better.

FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion between cmd.execute_kwargs['kill_after_timeout'] and this timeout. As a user, without a careful read of the docs I would expect the behavior of remote.fetch(timeout=60) to be that of 'kill_after_timeout'.

To me it seems that kill_after_timeout could have been used to handle this all along, as killing the process the threads are reading from should break their pipes and thus stop the read-loops, allowing them to stop naturally. Since we have tests for this, maybe you could test this hypothesis. Indeed it would be great to avoid two timeout parameters that have the same intention, but only one of them seems to solve the issue at its root.
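The hypothesis — that killing the process breaks the pipes and lets the reader threads finish naturally — can be checked with a plain subprocess (a sketch, not GitPython code):

```python
import subprocess
import sys

# the child prints lines forever, like a chatty git process
child = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import time\nwhile True: print('progress'); time.sleep(0.05)"],
    stdout=subprocess.PIPE,
)
first = child.stdout.readline()   # data flows while the child is alive

child.terminate()
child.wait()

child.stdout.read()               # drain whatever was still buffered
at_eof = child.stdout.readline()  # b'': EOF, so a read loop exits naturally
```

Once the writing end of the pipe is gone, readline() returns an empty bytestring, which is exactly the condition a read-loop can use to stop.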

The actual process termination happens here and it looks like it will succeed on non-windows but might still fail to kill stalled processes on windows. To my mind, that's acceptable and I'd rather wait for users to report issues due to this than to keep the timeout parameter around just for the case the git process isn't responding to signals.

@sroet
Contributor Author

sroet commented Sep 13, 2021

This took a bit more work than expected, but it should now work with the kill_after_timeout keyword for the user-facing functions.

The actual process termination happens here and it looks like it will succeed on non-windows but might still fail to kill stalled processes on windows. To my mind, that's acceptable and I'd rather wait for users to report issues due to this than to keep the timeout parameter around just for the case the git process isn't responding to signals.

That did not really work out, as most user-facing functions (fetch, pull, and push) set as_process=True, so that code path is never hit; instead this if statement is hit, which returns an AutoInterrupt object. Looking at that class, it already knows how to kill underlying processes (in its __del__ method), so I factored that code out into a private function _terminate (and fixed wait for the case self.proc is None).

To me it seems that kill_after_timeout could have been used to handle this all along, as killing the process the threads are reading from should break their pipes and thus stop the read-loops, allowing them to stop naturally.

The reading threads had no way of handling the closing of the streams, so I added a check for the exception.

It now raises like this:

>>> a = git.Repo('.')
>>> a.remotes.origin.fetch(kill_after_timeout=0.01)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sander/github_files/GitPython/git/remote.py", line 868, in fetch
    kill_after_timeout=kill_after_timeout)
  File "/home/sander/github_files/GitPython/git/remote.py", line 732, in _get_fetch_info_from_stderr
    proc.wait(stderr=stderr_text)
  File "/home/sander/github_files/GitPython/git/cmd.py", line 490, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(-15)
  cmdline: git fetch -v origin
  stderr: 'error: process killed because it timed out. kill_after_timeout=0.01 seconds'

When writing the tests, I saw that the error handling code for push does something different than that for fetch and pull:

  • fetch/pull always error out if there are any error lines in stderr (they do not care whether there is output, as stdout is piped to None)
  • push only errors out if there are error lines in stderr and no output on stdout
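The two policies can be summarized in a small sketch (hypothetical helper names, not GitPython's actual functions):

```python
def fetch_or_pull_fails(stderr_lines, stdout_lines):
    # fetch/pull: any error line is fatal; stdout is piped to None anyway
    return bool(stderr_lines)

def push_fails(stderr_lines, stdout_lines):
    # push: error lines are only fatal when nothing reached stdout
    return bool(stderr_lines) and not stdout_lines
```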

@Byron Is this mismatch intentional/wanted?

I will update the initial post to indicate the current status.

@sroet sroet changed the title Surface fetch/pull/push timeout and reset default to None Surface fetch/pull/push kill_after_timeout and reset default to None Sep 13, 2021
@Byron
Member

Byron commented Sep 13, 2021

Thanks a million for all the effort you are putting in!

@Byron Is this mismatch intentional/wanted?

It's based on observing how these processes usually work and communicate errors. However, if it feels out-of-place or wrong, there is no reason not to improve it while you are here. It wouldn't be the first time that long-standing issues in the code-base somehow never surfaced. While I encourage these fixes, I also warn that the test-suite might not catch all the subtle behaviour either, so it's probably better to leave as much unchanged as possible while fixing the issue at hand.

Note that CI is failing because of the linter and white-space related errors. It's strange that it asks me to approve running CI on this PR over and over again and I hope it will just run when the fix is pushed.

@dwhswenson

It's strange that it asks me to approve running CI on this PR over and over again and I hope it will just run when the fix is pushed.

See https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/, in particular:

Once a collaborator with write access has reviewed the changes, they can approve the workflows to run for the current commit only. If the pull request is updated to a new commit, a new approval will be required.

IMO, that last is a bad choice by GitHub. I understand the desire to avoid abuse by crypto-miners, but there should be an option for the approval to be for the duration of the PR.

@Byron
Member

Byron commented Sep 14, 2021

Thanks for the heads-up, I wasn't aware of this change.

[…] but there should be an option for the approval to be for the duration of the PR.

Couldn't agree more. Here it seems they are over-steering towards abuse-avoidance and trading away the convenience of using their platform. Instead, if legitimate PRs turn out to become abusive later, that's something I'd happily report - after it happened.
Maybe additional options will be implemented some day, I do have hope.

@sroet
Contributor Author

sroet commented Sep 14, 2021

@Byron, sorry, I should have checked mypy and flake8 locally. They now return:

(gitpy) ***@***:~/github_files/GitPython$ flake8
0
(gitpy) ***@***:~/github_files/GitPython$ mypy -p git
Success: no issues found in 37 source files

@sroet
Contributor Author

sroet commented Sep 14, 2021

Looking at the test failure for 9678226, it seems the CI environment fetches too quickly after doing a pull to reliably trigger the case; doing only a pull should still hit most code paths...

@sroet
Contributor Author

sroet commented Sep 14, 2021

Hmm, it started failing locally with timeout=0, so I reset it to a small value. Hopefully it is small enough to let CI fail out; otherwise I would like a reviewer to check whether the test also fails on their local system...

@sroet
Contributor Author

sroet commented Sep 14, 2021

A bit of info on why the error is not raised but still shows up in the log as:

WARNING  git.remote:remote.py:734 Error lines received while fetching: error: process killed because it timed out. kill_after_timeout=0 seconds

This is because there is a race condition: in the time between the main thread triggering the timeout (which produces the log line) and telling the process to terminate (which should trigger the error by setting a non-zero status), the process can complete correctly. In this case, I decided to let the code recover instead of erroring out.
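The race can be reproduced with a plain subprocess: if the process exits in the window between the timeout firing and the terminate call, poll() already reports success (illustrative sketch, not GitPython code):

```python
import subprocess
import sys
import time

proc = subprocess.Popen([sys.executable, "-c", "print('fetched')"])
# pretend the join() timeout has already fired, and we are slow to react
time.sleep(1.0)
exit_code = proc.poll()  # 0: the process already finished on its own
if exit_code is None:    # only terminate if it is actually still running
    proc.terminate()
    exit_code = proc.wait()
```

Recovery here means trusting the clean exit code instead of forcing a failure status onto a process that in fact succeeded.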

@Byron
Member

Byron commented Sep 14, 2021

It's interesting that CI still fails even though the timeout is small. Can there be a 'too small' due to the same race elsewhere?

Maybe the CI is onto something here and indicates some fixable issue, and for now I'd be inclined to believe that.

@sroet
Contributor Author

sroet commented Sep 14, 2021

@Byron the issue is that the CI Python thread is too slow compared to the git process; I can force the CI behaviour locally by altering the lines:

    for t in threads:                                                           
        t.join(timeout=kill_after_timeout)                                      
        if t.is_alive():                                                        
            if isinstance(process, Git.AutoInterrupt):                                                                          
                process._terminate()                                            

to

    for t in threads:                                                           
        t.join(timeout=kill_after_timeout)                                      
        if t.is_alive():                                                        
            if isinstance(process, Git.AutoInterrupt):                          
                #TODO debug code                                                
                import time                                                     
                time.sleep(1)                                                   
                process._terminate()                                            

What happens is that inside the AutoInterrupt._terminate() function the following lines exist:

            # did the process finish already so we have a return code ?         
            try:                                                                
                if proc.poll() is not None:                                     
                    self.status = proc.poll()                                   
                    return None   

which allows the process status to be 0 even if the timeout is triggered initially.

and even later it still does:

            # try to kill it                                                    
            try:                                                                
                proc.terminate()                                                
                self.status = proc.wait()    # ensure process goes away    

This also allows the process to terminate gracefully if the kill signal is not fast enough.

Now, I can force-set self.status to -15 inside the _terminate function, which makes sure we always trigger the error whenever _terminate() has been called, even if the process completed between Python triggering a timeout and actually getting around to killing the process.

@Byron is that (not allowing the process to recover) the preferred solution here?

[EDIT]: addition: the GitCommandError is only raised (inside wait()) if AutoInterrupt.status != 0

@Byron
Member

Byron commented Sep 14, 2021

@Byron is that (not allowing the process to recover) the preferred solution here?

A robust implementation is preferred any time, and it seems like a system that can recover from changing states is the better option.

Ultimately CI should work and ideally work for the next few years without spurious failures due to races.

@sroet
Contributor Author

sroet commented Sep 15, 2021

Ultimately CI should work and ideally work for the next few years without spurious failures due to races.

Added a class variable to AutoInterrupt that sets the status code inside _terminate if it is non-zero, and updated the tests to use it. This makes the test pass locally, even with the sleep(1). So I am pretty confident it catches the race condition in tests while still allowing the robust implementation; please have another look
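A minimal sketch of how such a class-level override removes the race from tests (a simplified stand-in, not GitPython's real AutoInterrupt class):

```python
class FakeAutoInterrupt:
    # class variable: when non-zero, _terminate() reports this status
    # unconditionally, even if the process managed to finish cleanly
    _status_code_if_terminate = 0

    def __init__(self):
        self.status = None

    def _terminate(self, polled_status):
        # use the override if set, otherwise trust the process's own code
        self.status = self._status_code_if_terminate or polled_status

    def wait(self):
        # mirrors the behaviour described above: only a non-zero status raises
        if self.status != 0:
            raise RuntimeError(f"exit code({self.status})")


class DeterministicAutoInterrupt(FakeAutoInterrupt):
    # tests pin the outcome so a fast-exiting process cannot mask the timeout
    _status_code_if_terminate = -15


prod = FakeAutoInterrupt()
prod._terminate(polled_status=0)  # the process slipped in a clean exit
prod.wait()                       # no error: recovery is allowed in production

det = DeterministicAutoInterrupt()
det._terminate(polled_status=0)   # same clean exit, but the override wins
```

With the override at -15, det.wait() raises even when the real process slipped in a clean exit code, which is what makes the test deterministic.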

@Byron Byron self-requested a review September 15, 2021 23:41
Member

@Byron Byron left a comment


Fantastic, it looks like this PR is ready to be merged. I'd love to have another reviewer though, maybe @Yobmod could take a final look?

I'd be happy to try a new release when merged, too.

@Byron Byron merged commit aa50766 into gitpython-developers:main Sep 18, 2021
@per1234

per1234 commented Sep 20, 2021

Maybe additional options will be implemented some day, I do have hope.

@Byron they already have been:
https://github.blog/changelog/2021-07-01-github-actions-new-settings-for-maintainers/
Switching to the "Require approval for first-time contributors who are new to GitHub" setting has almost completely eliminated the need to approve workflows in the repositories I maintain.

@Byron
Member

Byron commented Sep 20, 2021

[Screenshot: the new GitHub Actions approval settings, taken 2021-09-20]

Thank you sooo much! That drove me crazy 🤦‍♂️

@sroet sroet deleted the update_fetch_timeout branch September 20, 2021 12:31
@sroet
Contributor Author

sroet commented Sep 20, 2021

Thank you sooo much! That drove me crazy

Can imagine, thanks for dealing with the CI on this one!


Successfully merging this pull request may close these issues.

default thread timeout too short for CI