
Surface fetch/pull/push kill_after_timeout and reset default to None #1340

Merged
merged 17 commits into from
Sep 18, 2021

Conversation

sroet
Contributor

@sroet sroet commented Sep 10, 2021

(hopefully) closes #1339
This PR, as of e058c4c:

  • renames timeout to kill_after_timeout and sets its default to None instead of 10 s
  • surfaces kill_after_timeout to the user-facing functions (Remote.pull/push/fetch)
  • refactors AutoInterrupt so we can ._terminate() the underlying process and grab status codes from terminated processes
  • adds a class variable AutoInterrupt._status_code_if_terminate; if non-zero, it overrides any status code from the process in _terminate. This can be used to prevent race conditions in tests, as described in this comment

Some history of the commits (while trying a different strategy):

aafb300 changes the default timeout to 60 s
1d26515 allows push, pull and fetch keyword timeout that will be propagated down
febd4fe alters some tests to use the old 10 s timeout for fetch and pull
4113d01 also alters a test call to pull to add a timeout of 10s
c55a8e3 fixes the test suite
7df33f3 sets the default timeout to None

If the current default is too loose, users can now supply their own timeout.
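The terminate-on-timeout mechanism behind kill_after_timeout can be sketched with the standard library alone; the variable name mirrors the new keyword, but this is an illustration, not GitPython's actual code path:

```python
import subprocess
import sys

kill_after_timeout = 0.2  # seconds, analogous to the new keyword

# a stand-in for a stalled git process
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
try:
    status = proc.wait(timeout=kill_after_timeout)
except subprocess.TimeoutExpired:
    proc.terminate()      # ask the process to stop (SIGTERM on POSIX)
    status = proc.wait()  # reap it; a non-zero status surfaces the failure
```

With the default of None, no timeout fires and the wait simply blocks until the process finishes on its own.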

@sroet sroet changed the title Update fetch timeout [WIP] Update fetch timeout Sep 10, 2021
@sroet
Contributor Author

sroet commented Sep 10, 2021

Alright, the test suite runs locally and gives the following coverage:

---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                                Stmts   Miss  Cover
-------------------------------------------------------
git/__init__.py                        42      8    81%
git/cmd.py                            525     98    81%
git/compat.py                          46     15    67%
git/config.py                         439     48    89%
git/db.py                              27      1    96%
git/diff.py                           268     10    96%
git/exc.py                             63      3    95%
git/index/__init__.py                   2      0   100%
git/index/base.py                     498     32    94%
git/index/fun.py                      179      7    96%
git/index/typ.py                       68      3    96%
git/index/util.py                      50      5    90%
git/objects/__init__.py                13      0   100%
git/objects/base.py                    91     10    89%
git/objects/blob.py                    17      0   100%
git/objects/commit.py                 276     28    90%
git/objects/fun.py                    101      5    95%
git/objects/submodule/__init__.py       0      0   100%
git/objects/submodule/base.py         583     46    92%
git/objects/submodule/root.py         137     12    91%
git/objects/submodule/util.py          51      9    82%
git/objects/tag.py                     47     12    74%
git/objects/tree.py                   181     73    60%
git/objects/util.py                   232     32    86%
git/refs/__init__.py                    6      0   100%
git/refs/head.py                       98      6    94%
git/refs/log.py                       150     10    93%
git/refs/reference.py                  55      3    95%
git/refs/remote.py                     32      2    94%
git/refs/symbolic.py                  338     18    95%
git/refs/tag.py                        47      6    87%
git/remote.py                         427     55    87%
git/repo/__init__.py                    1      0   100%
git/repo/base.py                      552     36    93%
git/repo/fun.py                       201     20    90%
git/types.py                           44     10    77%
git/util.py                           539    121    78%
-------------------------------------------------------
TOTAL                                6426    744    88%


Results (64.31s):
     378 passed
      42 skipped

@sroet sroet changed the title [WIP] Update fetch timeout Update fetch timeout Sep 10, 2021
@sroet sroet changed the title Update fetch timeout Surface fetch/pull/push timeout and set default to 60 s Sep 10, 2021
@Byron Byron requested a review from Yobmod September 10, 2021 14:32
@Byron Byron added this to the v3.1.24 - Bugfixes milestone Sep 10, 2021
@Byron
Member

Byron commented Sep 10, 2021

Thanks a lot for the initiative. With the offending release yanked, I believe we gain some time to wait for @Yobmod to chime in on this.

My preference, if I understand what's happening correctly, is to set the default timeout to None and not alter the previous behaviour, as the code as-is can't tell whether it's just a long-running process or a stalled one. Thus I would prefer offering this up to callers only.

@Yobmod
Contributor

Yobmod commented Sep 10, 2021

Hi, I think setting the timeout default to None works better, rather than a larger number.
Thread.join(None) would then wait as long as needed (I think?), but if people want a timeout, it's still there to use.

Or it could be removed altogether.
It got added when the tests started hanging in a 3.10 beta. I couldn't find the problem, but it doesn't happen with release candidates any more.
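Thread.join(None) does indeed block until the thread finishes, while a numeric timeout returns early and leaves the thread running — a quick standard-library check (illustrative only):

```python
import threading
import time

t = threading.Thread(target=time.sleep, args=(0.5,))
t.start()

t.join(timeout=0.05)         # returns after ~0.05 s without the thread finishing
was_alive = t.is_alive()     # True: the timeout expired, the thread kept running

t.join(timeout=None)         # None means: block until the thread is done
finished = not t.is_alive()  # True
```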

@dwhswenson

FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion between cmd.execute_kwargs['kill_after_timeout'] and this timeout. As a user, without a careful read of the docs I would expect the behavior of remote.fetch(timeout=60) to be that of 'kill_after_timeout'.

@sroet sroet changed the title Surface fetch/pull/push timeout and set default to 60 s Surface fetch/pull/push timeout and reset default to None Sep 10, 2021
@sroet
Contributor Author

sroet commented Sep 10, 2021

@Yobmod, @dwhswenson ; I have reverted the default to None, please let me know what you think.

@Byron, I will not be able to push to this branch for the next ~60 hours, so feel free to push to it if you want this finished up before Monday.

@Byron
Member

Byron commented Sep 11, 2021

@sroet Thanks for your help with this and the incredibly quick turnaround, and to everyone for sharing their thoughts to make this better.

FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion between cmd.execute_kwargs['kill_after_timeout'] and this timeout. As a user, without a careful read of the docs I would expect the behavior of remote.fetch(timeout=60) to be that of 'kill_after_timeout'.

To me it seems that kill_after_timeout could have been used to handle this all along, as killing the process the threads are reading from should break their pipes and thus stop the read-loops, allowing them to stop naturally. Since we have tests for this, maybe you could test this hypothesis. Indeed it would be great to avoid two timeout parameters that have the same intention, but only one of them seems to solve the issue at its root.
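The hypothesis — that killing the process breaks the pipes and lets the reader threads finish naturally — can be checked with a plain subprocess (a sketch, not GitPython code):

```python
import subprocess
import sys

# the child prints lines forever, like a chatty git process
child = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import time\nwhile True: print('progress'); time.sleep(0.05)"],
    stdout=subprocess.PIPE,
)
first = child.stdout.readline()   # data flows while the child is alive

child.terminate()
child.wait()

child.stdout.read()               # drain whatever was still buffered
at_eof = child.stdout.readline()  # b'': EOF, so a read loop exits naturally
```

Once the writing end of the pipe is gone, readline() returns an empty bytestring, which is exactly the condition a read-loop can use to stop.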

The actual process termination happens here and it looks like it will succeed on non-windows but might still fail to kill stalled processes on windows. To my mind, that's acceptable and I'd rather wait for users to report issues due to this than to keep the timeout parameter around just for the case the git process isn't responding to signals.

@sroet
Contributor Author

sroet commented Sep 13, 2021

This took a bit more work than expected, but it should now work with the kill_after_timeout keyword for the user-facing functions.

The actual process termination happens here and it looks like it will succeed on non-windows but might still fail to kill stalled processes on windows. To my mind, that's acceptable and I'd rather wait for users to report issues due to this than to keep the timeout parameter around just for the case the git process isn't responding to signals.

That did not really work out, as most user-facing functions (fetch, pull, and push) set as_process=True, so that code path is never hit; instead this if statement is hit, which returns an AutoInterrupt object. Looking at that class, it already knows how to kill underlying processes (in its __del__ method), so I factored that code out into a private function _terminate (and fixed wait for the case self.proc is None).

To me it seems that kill_after_timeout could have been used to handle this all along, as killing the process the threads are reading from should break their pipes and thus stop the read-loops, allowing them to stop naturally.

The reading threads had no way of handling the closing of the streams, so I added a check for the exception.

It now raises like this:

>>> a = git.Repo('.')
>>> a.remotes.origin.fetch(kill_after_timeout=0.01)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sander/github_files/GitPython/git/remote.py", line 868, in fetch
    kill_after_timeout=kill_after_timeout)
  File "/home/sander/github_files/GitPython/git/remote.py", line 732, in _get_fetch_info_from_stderr
    proc.wait(stderr=stderr_text)
  File "/home/sander/github_files/GitPython/git/cmd.py", line 490, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(-15)
  cmdline: git fetch -v origin
  stderr: 'error: process killed because it timed out. kill_after_timeout=0.01 seconds'

When writing the tests, I saw that the error handling code for push does something different than that for fetch and pull:

  • fetch/pull always error out if there are any error lines in stderr (they do not care whether there is output, as stdout is piped to None)
  • push only errors out if there are error lines in stderr and no output on stdout
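The two policies can be summarized in a small sketch (hypothetical helper names, not GitPython's actual functions):

```python
def fetch_or_pull_fails(stderr_lines, stdout_lines):
    # fetch/pull: any error line is fatal; stdout is piped to None anyway
    return bool(stderr_lines)

def push_fails(stderr_lines, stdout_lines):
    # push: error lines are only fatal when nothing reached stdout
    return bool(stderr_lines) and not stdout_lines
```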

@Byron Is this mismatch intentional/wanted?

I will update the initial post to indicate the current status.

@sroet sroet changed the title Surface fetch/pull/push timeout and reset default to None Surface fetch/pull/push kill_after_timeout and reset default to None Sep 13, 2021
@Byron
Member

Byron commented Sep 13, 2021

Thanks a million for all the effort you are putting in!

@Byron Is this mismatch intentional/wanted?

It's based on observing how these processes usually work and communicate errors. However, if it feels out-of-place or wrong, there is no reason not to improve it while you are here. It wouldn't be the first time that long-standing issues in the code-base somehow never surfaced. While I encourage these fixes, I also warn that the test-suite might not catch all the subtle behaviour either, so it's probably better to leave as much unchanged as possible while fixing the issue at hand.

Note that CI is failing because of the linter and white-space related errors. It's strange that it asks me to approve running CI on this PR over and over again and I hope it will just run when the fix is pushed.

@dwhswenson

It's strange that it asks me to approve running CI on this PR over and over again and I hope it will just run when the fix is pushed.

See https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/, in particular:

Once a collaborator with write access has reviewed the changes, they can approve the workflows to run for the current commit only. If the pull request is updated to a new commit, a new approval will be required.

IMO, that last is a bad choice by GitHub. I understand the desire to avoid abuse by crypto-miners, but there should be an option for the approval to be for the duration of the PR.

@Byron
Member

Byron commented Sep 14, 2021

Thanks for the heads-up, I wasn't aware of this change.

[…] but there should be an option for the approval to be for the duration of the PR.

Couldn't agree more. Here it seems they are over-steering towards abuse-avoidance and trading away the convenience of using their platform. Instead, if legitimate PRs turn out to become abusive later, that's something I'd happily report - after it happened.
Maybe additional options will be implemented some day, I do have hope.

@sroet
Contributor Author

sroet commented Sep 14, 2021

@Byron, sorry, I should have checked mypy and flake8 locally. They now return:

(gitpy) ***@***:~/github_files/GitPython$ flake8
0
(gitpy) ***@***:~/github_files/GitPython$ mypy -p git
Success: no issues found in 37 source files

@sroet
Contributor Author

sroet commented Sep 14, 2021

Looking at the test failure for 9678226, it seems the CI environment fetches too quickly after doing a pull to reliably trigger the case; doing only a pull should still hit most code paths...

@sroet
Contributor Author

sroet commented Sep 14, 2021

Hmm, it started failing locally with timeout=0, so I reset it to a small value. Hopefully it is small enough to let CI fail out; otherwise I would like a reviewer to check whether the test also fails on their local system...

@sroet
Contributor Author

sroet commented Sep 14, 2021

A bit of info on why the error is not raised but still shows up in the log as:

WARNING  git.remote:remote.py:734 Error lines received while fetching: error: process killed because it timed out. kill_after_timeout=0 seconds

This is because there is a race condition: in the time between the main thread triggering the timeout (which produces the log line) and telling the process to terminate (which should trigger the error by setting a non-zero status), the process can complete correctly. In this case, I decided to let the code recover instead of erroring out.
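The race can be reproduced with a plain subprocess: if the process exits in the window between the timeout firing and the terminate call, poll() already reports success (illustrative sketch, not GitPython code):

```python
import subprocess
import sys
import time

proc = subprocess.Popen([sys.executable, "-c", "print('fetched')"])
# pretend the join() timeout has already fired, and we are slow to react
time.sleep(1.0)
exit_code = proc.poll()  # 0: the process already finished on its own
if exit_code is None:    # only terminate if it is actually still running
    proc.terminate()
    exit_code = proc.wait()
```

Recovery here means trusting the clean exit code instead of forcing a failure status onto a process that in fact succeeded.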

@Byron
Member

Byron commented Sep 14, 2021

It's interesting that CI still fails even though the timeout is small. Can there be a 'too small' due to the same race elsewhere?

Maybe the CI is onto something here and indicates some fixable issue, and for now I'd be inclined to believe that.

@sroet
Contributor Author

sroet commented Sep 14, 2021

@Byron the issue is that the CI Python thread is too slow compared to the git process; I can force the CI behaviour locally by altering the lines:

    for t in threads:                                                           
        t.join(timeout=kill_after_timeout)                                      
        if t.is_alive():                                                        
            if isinstance(process, Git.AutoInterrupt):                                                                          
                process._terminate()                                            

to

    for t in threads:                                                           
        t.join(timeout=kill_after_timeout)                                      
        if t.is_alive():                                                        
            if isinstance(process, Git.AutoInterrupt):                          
                #TODO debug code                                                
                import time                                                     
                time.sleep(1)                                                   
                process._terminate()                                            

What happens is that inside the AutoInterrupt._terminate() function the following lines exist:

            # did the process finish already so we have a return code ?         
            try:                                                                
                if proc.poll() is not None:                                     
                    self.status = proc.poll()                                   
                    return None   

which allows the process status to be 0 even if the timeout is triggered initially.

and even later it still does:

            # try to kill it                                                    
            try:                                                                
                proc.terminate()                                                
                self.status = proc.wait()    # ensure process goes away    

This also allows the process to terminate gracefully if the kill signal is not fast enough.

Now, I can force-set self.status to -15 inside the _terminate function, which makes sure we always trigger the error whenever _terminate() has been called, even if the process completed between Python triggering a timeout and actually getting around to killing the process.

@Byron is that (not allowing the process to recover) the preferred solution here?

[EDIT]: addition: the GitCommandError is only raised (inside wait()) if AutoInterrupt.status != 0

@Byron
Member

Byron commented Sep 14, 2021

@Byron is that (not allowing the process to recover) the preferred solution here?

A robust implementation is preferred any time, and it seems like a system that can recover from changing states is the better option.

Ultimately CI should work and ideally work for the next few years without spurious failures due to races.

@sroet
Contributor Author

sroet commented Sep 15, 2021

Ultimately CI should work and ideally work for the next few years without spurious failures due to races.

Added a class variable to AutoInterrupt that sets the status code inside _terminate if it is non-zero, and updated the tests to use it. This makes the test pass locally, even with the sleep(1). So I am pretty confident it catches the race condition in tests while still allowing the robust implementation; please have another look
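A minimal sketch of how such a class-level override removes the race from tests (a simplified stand-in, not GitPython's real AutoInterrupt class):

```python
class FakeAutoInterrupt:
    # class variable: when non-zero, _terminate() reports this status
    # unconditionally, even if the process managed to finish cleanly
    _status_code_if_terminate = 0

    def __init__(self):
        self.status = None

    def _terminate(self, polled_status):
        # use the override if set, otherwise trust the process's own code
        self.status = self._status_code_if_terminate or polled_status

    def wait(self):
        # mirrors the behaviour described above: only a non-zero status raises
        if self.status != 0:
            raise RuntimeError(f"exit code({self.status})")


class DeterministicAutoInterrupt(FakeAutoInterrupt):
    # tests pin the outcome so a fast-exiting process cannot mask the timeout
    _status_code_if_terminate = -15


prod = FakeAutoInterrupt()
prod._terminate(polled_status=0)  # the process slipped in a clean exit
prod.wait()                       # no error: recovery is allowed in production

det = DeterministicAutoInterrupt()
det._terminate(polled_status=0)   # same clean exit, but the override wins
```

With the override at -15, det.wait() raises even when the real process slipped in a clean exit code, which is what makes the test deterministic.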

@Byron Byron self-requested a review September 15, 2021 23:41
Member

@Byron Byron left a comment


Fantastic, it looks like this PR is ready to be merged. I'd love to have another reviewer though, maybe @Yobmod could take a final look?

I'd be happy to try a new release when merged, too.

@Byron Byron merged commit aa50766 into gitpython-developers:main Sep 18, 2021
@per1234

per1234 commented Sep 20, 2021

Maybe additional options will be implemented some day, I do have hope.

@Byron they already have been:
https://github.blog/changelog/2021-07-01-github-actions-new-settings-for-maintainers/
Switching to the "Require approval for first-time contributors who are new to GitHub" setting has almost completely eliminated the need to approve workflows in the repositories I maintain.

@Byron
Member

Byron commented Sep 20, 2021

[Screenshot: the new GitHub Actions approval settings, taken 2021-09-20]

Thank you sooo much! That drove me crazy 🤦‍♂️

@sroet sroet deleted the update_fetch_timeout branch September 20, 2021 12:31
@sroet
Contributor Author

sroet commented Sep 20, 2021

Thank you sooo much! That drove me crazy

Can imagine, thanks for dealing with the CI on this one!


Successfully merging this pull request may close these issues.

default thread timeout too short for CI