Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download error can cause stalling and error looping #3563

Open
rezan opened this issue Apr 3, 2024 · 4 comments
Open

Download error can cause stalling and error looping #3563

rezan opened this issue Apr 3, 2024 · 4 comments
Assignees

Comments

@rezan
Copy link
Contributor

rezan commented Apr 3, 2024

Ran into a situation where a single download error caused an error loop and hung cvmfs2. The cause are these Backoff() calls in VerifyAndFinalize() which sleep between retries:

Backoff(info);

This is all done in the libcurl event loop in MainDownload(). When this happens and a single download sleeps during its retry, all downloads stall out due to the event loop not being run. This then causes all kinds of new errors like timeouts and TooSlow errors. These then trigger more Backoffs() and you are potentially stuck in a forever fail/retry loop depending on the conditions.

Possibly related to #3432 and #3378.

@vvolkl
Copy link
Contributor

vvolkl commented Apr 4, 2024

Thanks for this report! With regards to the connection to the linked issues: Did the cvmfs2 process still respond to SIGKILL when you encountered this bug?

@rezan
Copy link
Contributor Author

rezan commented Apr 4, 2024

Hmm, good question, I didn't test that. However, I can confirm my user process moved into a <defunct> state and couldn't be killed. I think thats cause it was stuck in a read() syscall. I can reproduce this pretty easily, so I will test the cvmfs2 process later and report back.

@rezan
Copy link
Contributor Author

rezan commented Apr 4, 2024

I was able to kill the process. I was then able to re-mount and recover. So no issues there. Note that to trigger this I just have to generate a decent amount of CVMFS load and then kill a proxy server. Thats enough to get 1 thread into a Backoff() and the stall/retry/fail loop starts.

@HereThereBeDragons
Copy link
Contributor

HereThereBeDragons commented Apr 15, 2024

for replication:
have at least 1 private proxy, set the number of retries very high (CVMFS_MAX_RETRIES=1000)

  1. start some downloads
  2. kill the proxy --> you should see that it goes into backoff and replies with "too slowly"
  3. reenable the proxy --> see that you are still in the backoff and never recover

Fix is in #3556 but only for the new introduced parallel decompression as this puts the backoff outside the sequential handling of curl callbacks

Edit:
It does not seem to be the same as #3432 and #3378 - but it should be investigated if #3556 might fix those problems anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants