Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: reduce tbon.tcp_user_timeout #4632

Merged
merged 2 commits into from Oct 1, 2022

Conversation

garlick
Copy link
Member

@garlick garlick commented Oct 1, 2022

Problem: a batch job takes around 20m to notice that a peer has been turned off.

The tbon.tcp_user_timeout can be used to tune this value in the system instance, and we encourage that. But batch jobs get the system default.

Since the system default is probably not suitable for flux ever, set our own default. Let's try 20s.

Problem: the default time to wait for a TCP ACK from a peer
broker before disconnecting is too long.  The system default
is around 20m, which is a very long time for a free range
batch job to be hung up after losing a non-critical node.

Use 20s as the default.  This can still be overridden via TOML
config and broker command line.

Update sharness test that checked for a default of 0 (system).
Problem: the default tbon.tcp_user_timeout has changed, but
the man page still says it uses the system default.

Update man page.
@garlick
Copy link
Member Author

garlick commented Oct 1, 2022

I had forgotten (and was reminded by CI) that this attribute is not supported on el7, and my change didn't account for that. Repushed with minor tweaks so that the logic for ensuring it's an error to tune this when it's not tunable still works.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@codecov
Copy link

codecov bot commented Oct 1, 2022

Codecov Report

Merging #4632 (c11a610) into master (eb5faf5) will increase coverage by 0.01%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master    #4632      +/-   ##
==========================================
+ Coverage   83.39%   83.41%   +0.01%     
==========================================
  Files         411      411              
  Lines       68792    68796       +4     
==========================================
+ Hits        57371    57388      +17     
+ Misses      11421    11408      -13     
Impacted Files Coverage Δ
src/broker/overlay.c 86.41% <87.50%> (+0.16%) ⬆️
src/modules/job-exec/job-exec.c 76.41% <0.00%> (-0.17%) ⬇️
src/cmd/flux-mini.py 92.61% <0.00%> (+0.24%) ⬆️
src/common/libsdprocess/sdprocess.c 69.76% <0.00%> (+0.62%) ⬆️
src/common/libpmi/pmi2.c 89.87% <0.00%> (+1.26%) ⬆️
src/common/libpmi/simple_client.c 87.21% <0.00%> (+1.82%) ⬆️

@garlick
Copy link
Member Author

garlick commented Oct 1, 2022

Restarting bionic builder that failed here:

2022-10-01T14:27:31.2221869Z #7 0.679 Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
2022-10-01T14:27:31.2222855Z #7 0.681 Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.111.133|:443... connected.
2022-10-01T14:27:36.2786536Z #7 0.686 HTTP request sent, awaiting response... 503 Egress is over the account limit.
2022-10-01T14:27:36.2789099Z #7 5.857 2022-10-01 14:27:36 ERROR 503: Egress is over the account limit..

hopefully just a temporary github issue

@garlick
Copy link
Member Author

garlick commented Oct 1, 2022

Thanks, I'll set MWP.

@mergify mergify bot merged commit 656a7e4 into flux-framework:master Oct 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants