broker: reduce tbon.tcp_user_timeout #4632
Problem: the default time to wait for a TCP ACK from a peer broker before disconnecting is too long. The system default is around 20m, which is a very long time for a free range batch job to be hung up after losing a non-critical node. Use 20s as the default. This can still be overridden via TOML config and broker command line. Update sharness test that checked for a default of 0 (system).
Problem: the default tbon.tcp_user_timeout has changed, but the man page still says it uses the system default. Update man page.
I had forgotten (and was reminded by CI) that this attribute is not supported on el7, and my change didn't account for that. Repushed with minor tweaks so that the logic for ensuring it's an error to tune this when it's not tunable still works.
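For readers following along, the fallback described above might look roughly like the sketch below. This is not the code from this PR: `choose_tcp_user_timeout` and `DEFAULT_TIMEOUT_MS` are hypothetical names, and treating "nothing configured" as -1 is an assumption made for illustration. The only real interface used is the `TCP_USER_TIMEOUT` macro from `<netinet/tcp.h>`, which is absent on platforms that lack the option.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* 20s, the new default proposed in this PR (name is hypothetical) */
#define DEFAULT_TIMEOUT_MS 20000

/* Hypothetical helper: choose the effective timeout in milliseconds.
 * requested_ms is the user's setting, or -1 if nothing was configured.
 * Returns 0 on success, -1 with errno set if the user tried to tune
 * the value on a platform where it is not tunable.
 */
int choose_tcp_user_timeout (long requested_ms, unsigned int *result_ms)
{
#ifdef TCP_USER_TIMEOUT
    /* Tunable: honor the user's value, else apply our own default. */
    *result_ms = requested_ms < 0 ? DEFAULT_TIMEOUT_MS
                                  : (unsigned int)requested_ms;
    return 0;
#else
    /* Not tunable: any explicit setting is an error (assumed policy). */
    if (requested_ms >= 0) {
        errno = ENOTSUP;
        return -1;
    }
    *result_ms = 0; /* 0 means "use the system default" */
    return 0;
#endif
}
```

On a platform without the option, the default silently stays at 0 (system) while an explicit setting fails, which is the behavior the comment above describes.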
LGTM!
Codecov Report
@@            Coverage Diff             @@
##           master    #4632      +/-   ##
==========================================
+ Coverage   83.39%   83.41%   +0.01%
==========================================
  Files         411      411
  Lines       68792    68796       +4
==========================================
+ Hits        57371    57388      +17
+ Misses      11421    11408      -13
Restarting bionic builder that failed here:
hopefully just a temporary github issue
Thanks, I'll set MWP.
Problem: a batch job takes around 20m to notice that a peer has been turned off.
The tbon.tcp_user_timeout attribute can be used to tune this value in the system instance, and we encourage that. But batch jobs get the system default. Since the system default is probably never suitable for flux, set our own default. Let's try 20s.
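As background on what is being tuned: on Linux, the per-socket limit on how long transmitted data may remain unacknowledged is the `TCP_USER_TIMEOUT` socket option, set with `setsockopt()` as an `unsigned int` number of milliseconds, where 0 selects the system default. The sketch below shows that call in isolation. The wrapper name `set_tcp_user_timeout` is hypothetical, and how the broker actually passes the value down to its sockets is not shown in this PR, so treat this as an illustration of the kernel interface only.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical wrapper: limit how long sent data may stay unacknowledged
 * before the kernel drops the connection.  timeout_ms == 0 selects the
 * system default.  Returns 0 on success, -1 with errno set on failure.
 */
int set_tcp_user_timeout (int fd, unsigned int timeout_ms)
{
#ifdef TCP_USER_TIMEOUT
    return setsockopt (fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                       &timeout_ms, sizeof (timeout_ms));
#else
    /* Option not available on this platform. */
    (void)fd;
    (void)timeout_ms;
    errno = ENOTSUP;
    return -1;
#endif
}
```

With the 20s default proposed here, the equivalent call on a connected socket would be `set_tcp_user_timeout (fd, 20000)`.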