Avoid thundering herd when SoftwareProcess rebinds its sensors #1318
Applies a random delay of up to 10s before an entity calls connectSensors(). Without the delay, all entities attempt to reconnect at once, which regularly causes SSH "connection refused" errors if several of the entities are running on the same machine. This does not guarantee that the problem is avoided, but it makes it much less likely.
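To illustrate the idea, here is a minimal sketch of a randomized delay in plain Java. It is not the code from this PR: the class name RebindJitterSketch, the helper names, and the use of Thread.sleep are assumptions made for illustration, and the Runnable only stands in for the entity's connectSensors() call.

```java
import java.util.Random;

// Hypothetical sketch: names here are illustrative, not Brooklyn's API.
class RebindJitterSketch {
    /** Upper bound on the delay, matching the 10s mentioned in the description. */
    static final long MAX_DELAY_MILLIS = 10_000L;

    /** Returns a delay in [0, maxMillis), or zero when maxMillis is not positive. */
    static long randomDelayMillis(Random random, long maxMillis) {
        if (maxMillis <= 0) {
            return 0L;
        }
        // nextDouble() is in [0.0, 1.0), so the result is always below maxMillis.
        return (long) (random.nextDouble() * maxMillis);
    }

    /** Sleeps for a random delay, then runs the action standing in for connectSensors(). */
    static void connectAfterRandomDelay(Random random, long maxMillis, Runnable connectSensors)
            throws InterruptedException {
        Thread.sleep(randomDelayMillis(random, maxMillis));
        connectSensors.run();
    }

    public static void main(String[] args) throws InterruptedException {
        Random random = new Random();
        // Each entity draws its own delay, so reconnect attempts are spread out.
        for (int i = 0; i < 5; i++) {
            System.out.println("entity " + i + " would wait "
                    + randomDelayMillis(random, MAX_DELAY_MILLIS) + " ms");
        }
        // Short bound here so the demo finishes quickly.
        connectAfterRandomDelay(random, 50L, () -> System.out.println("sensors connected"));
    }
}
```

Because each entity draws its delay independently, two entities can still pick nearly the same moment, which is why the delay reduces the chance of refused connections rather than eliminating it.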
Brooklyn Central » brooklyn #2058 FAILURE
Think it's a buildhive issue:
@@ -54,7 +56,9 @@

    private static final SoftwareProcessDriverLifecycleEffectorTasks LIFECYCLE_TASKS =
            new SoftwareProcessDriverLifecycleEffectorTasks();

    private static final long MAX_REBIND_SENSOR_CONNECT_DELAY = Duration.TEN_SECONDS.toMilliseconds();
Can we make this a ConfigKey instead? Either a Long or Duration, and also a Boolean to enable and disable the feature.
Set config key to null or to == Duration.ZERO to connect sensors immediately.
@grkvlt I've altered it so that there's one ConfigKey that defaults to Duration.TEN_SECONDS. If the key is set to null or to a value == Duration.ZERO then no delay occurs. Acceptable?
@sjcorbett Yep, that's what I was thinking of. Love the description of the issue!
So, I didn't realise thundering herd was the actual name of the problem. The wiki page also references the sleeping barber problem, and of course we also have the Byzantine generals. Distributed computing is fun!
Brooklyn Central » brooklyn #2063 SUCCESS
Avoid thundering herd when SoftwareProcess rebinds its sensors
Merged 🐄 🐄 🐄 🐄
Applies a random delay of up to 10s before a rebinding SoftwareProcess entity calls connectSensors(). Without the delay, all entities attempt to reconnect at once, which reliably causes SSH "connection refused" errors if several of the entities are running on the same machine. This does not guarantee that connections are never refused, but it makes refusals much less likely.
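The configurable behaviour agreed in review (a single maximum-delay setting that defaults to ten seconds, with null or zero meaning "connect immediately") could look roughly like the sketch below. It uses java.time.Duration as a stand-in for Brooklyn's own Duration class and does not use a real ConfigKey; the class and method names are hypothetical, and treating a negative value as "no delay" is an extra assumption not discussed in the thread.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: java.time.Duration stands in for Brooklyn's Duration.
class RebindDelayConfigSketch {
    /** Default maximum delay, mirroring the ten-second default described in review. */
    static final Duration DEFAULT_MAX_DELAY = Duration.ofSeconds(10);

    /** A null or zero maximum disables the delay; negative is also treated as disabled (assumption). */
    static boolean isDelayEnabled(Duration configuredMax) {
        return configuredMax != null && !configuredMax.isZero() && !configuredMax.isNegative();
    }

    /** Picks a delay in [0, max) milliseconds, or zero when the delay is disabled. */
    static long chooseDelayMillis(Duration configuredMax) {
        if (!isDelayEnabled(configuredMax)) {
            return 0L;
        }
        long maxMillis = configuredMax.toMillis();
        if (maxMillis <= 0) {
            // A sub-millisecond maximum rounds down to no delay.
            return 0L;
        }
        return ThreadLocalRandom.current().nextLong(maxMillis);
    }

    public static void main(String[] args) {
        System.out.println("default: " + chooseDelayMillis(DEFAULT_MAX_DELAY) + " ms");
        System.out.println("null:    " + chooseDelayMillis(null) + " ms");
        System.out.println("zero:    " + chooseDelayMillis(Duration.ZERO) + " ms");
    }
}
```

Folding the on/off switch into the duration itself avoids a second Boolean setting: one value both enables the feature and bounds the delay.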