Avoid thundering herd when SoftwareProcess rebinds its sensors #1318

Merged
merged 2 commits into from Apr 11, 2014

Conversation

sjcorbett
Member

Applies a random delay of up to 10s before a rebinding SoftwareProcess entity calls connectSensors(). Without the delay, all entities attempt to reconnect at once, which reliably causes SSH "connection refused" errors if several of the entities are running on the same machine. The delay does not guarantee that connections won't be refused, but it makes refusals much less likely.

Applies a random delay of up to 10s before an entity calls
connectSensors(). Without the delay all entities attempt to reconnect
at once, which regularly causes SSH connection refused errors if
several of the entities are running on the same machine. This doesn't
guarantee to avoid the problem but makes it much less likely.
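
The random delay described in this change can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: the class `RebindJitterSketch`, the methods `pickDelayMillis` and `rebindThenConnect`, and the use of a plain `Runnable` in place of `connectSensors()` are all assumptions made for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

public class RebindJitterSketch {
    // Upper bound on the delay, matching the 10s mentioned in the description.
    static final long MAX_DELAY_MILLIS = 10_000L;

    // Picks a delay uniformly from [0, maxMillis]; a non-positive bound means no delay.
    static long pickDelayMillis(long maxMillis) {
        if (maxMillis <= 0) {
            return 0L;
        }
        return ThreadLocalRandom.current().nextLong(maxMillis + 1);
    }

    // Hypothetical stand-in for the rebind step: wait a random time, then connect.
    // Each entity picks its own delay, so reconnections are spread over the window
    // instead of all hitting the SSH server at the same instant.
    static void rebindThenConnect(Runnable connectSensors, long maxMillis)
            throws InterruptedException {
        Thread.sleep(pickDelayMillis(maxMillis));
        connectSensors.run();
    }

    public static void main(String[] args) throws InterruptedException {
        // A small bound keeps the demo quick; the change itself uses up to 10s.
        rebindThenConnect(() -> System.out.println("sensors connected"), 50L);
    }
}
```

Spreading the calls over a window only lowers the chance of many simultaneous connections; it does not cap concurrency, which matches the caveat in the description.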
@buildhive

Brooklyn Central » brooklyn #2058 FAILURE
Looks like there's a problem with this pull request

@sjcorbett
Member Author

Think it's a buildhive issue:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running TestSuite
Configuring TestNG with: TestNG652Configurator
2014-04-10 09:59:33,987 WARN  Deprecated use of old-style location construction for brooklyn.location.basic.SshMachineLocation; instead use LocationManager().createLocation(spec)
2014-04-10 09:59:34,237 WARN  Deprecated use of old-style location construction for brooklyn.location.basic.SshMachineLocation; instead use LocationManager().createLocation(spec)
2014-04-10 09:59:35,626 WARN  Deprecated use of old-style entity construction for brooklyn.test.entity.TestApplicationImpl; instead use EntityManager().createEntity(spec)
2014-04-10 09:59:39,129 INFO  TESTNG RUNNING: Suite: "Command line test" containing "26" Tests (config: null)
2014-04-10 09:59:39,132 INFO  BrooklynLeakListener.onStart attempting to terminate all extant ManagementContexts: name=Command line test; includedGroups=[]; excludedGroups=[Integration, Acceptance, Live, WIP]; suiteName=Command line suite; outDir=/scratch/jenkins/workspace/brooklyncentral/brooklyn/software/webapp/target/surefire-reports/Command line suite
2014-04-10 09:59:39,275 INFO  TESTNG INVOKING CONFIGURATION: "Command line test" - @BeforeMethod brooklyn.entity.proxy.AbstractControllerTest.setUp()
ERROR: Failed to parse POMs
hudson.remoting.ChannelClosedException: channel is already closed
    at hudson.remoting.Channel.send(Channel.java:524)
    at hudson.remoting.Request.call(Request.java:129)
    at hudson.remoting.Channel.call(Channel.java:722)
    at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:167)
    at $Proxy54.isAlive(Unknown Source)
    at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:930)
    at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:161)
    at hudson.maven.MavenModuleSetBuild$RunnerImpl.doRun(MavenModuleSetBuild.java:793)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:566)
    at hudson.model.Run.execute(Run.java:1665)
    at hudson.model.Run.run(Run.java:1612)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:479)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:246)
Caused by: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2553)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1296)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
    at hudson.remoting.Command.readFrom(Command.java:92)
    at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
Looks like the node went offline during the build. Check the slave log for the details.

FATAL: channel is already closed
hudson.remoting.ChannelClosedException: channel is already closed
    at hudson.remoting.Channel.send(Channel.java:524)
    at hudson.remoting.Request.call(Request.java:129)
    at hudson.remoting.Channel.call(Channel.java:722)
    at hudson.Launcher$RemoteLauncher.kill(Launcher.java:887)
    at hudson.Launcher$1.kill(Launcher.java:688)
    at hudson.Launcher$2.kill(Launcher.java:745)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:589)
    at hudson.model.Run.execute(Run.java:1665)
    at hudson.model.Run.run(Run.java:1612)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:479)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:246)
Caused by: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2553)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1296)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
    at hudson.remoting.Command.readFrom(Command.java:92)
    at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

@@ -54,7 +56,9 @@

private static final SoftwareProcessDriverLifecycleEffectorTasks LIFECYCLE_TASKS =
new SoftwareProcessDriverLifecycleEffectorTasks();


private static final long MAX_REBIND_SENSOR_CONNECT_DELAY = Duration.TEN_SECONDS.toMilliseconds();
Member

Can we make this a ConfigKey instead? Either a Long or a Duration, and also a Boolean to enable and disable the feature.

Set config key to null or to == Duration.ZERO to connect sensors
immediately.
@sjcorbett
Member Author

@grkvlt I've altered it so that there's one ConfigKey that defaults to Duration.TEN_SECONDS. If the key is set to null or to a value == Duration.ZERO then no delay occurs. Acceptable?
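
The null-or-zero rule described in this comment could look something like the sketch below. It is only an illustration under stated assumptions: `java.time.Duration` stands in for Brooklyn's own Duration class, the class and method names are hypothetical, and treating a negative value as "no delay" is an extra assumption that the comment does not mention.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class RebindDelayConfigSketch {
    // Default maximum delay, mirroring the ten-second default described above.
    static final Duration DEFAULT_MAX_DELAY = Duration.ofSeconds(10);

    // The delay is skipped when the configured value is null or zero.
    // Negative values are also treated as "disabled" (an assumption of this sketch).
    static boolean delayEnabled(Duration configured) {
        return configured != null && !configured.isZero() && !configured.isNegative();
    }

    // Returns a random delay in [0, configured] milliseconds, or 0 when disabled.
    static long delayMillisFor(Duration configured) {
        if (!delayEnabled(configured)) {
            return 0L;
        }
        return ThreadLocalRandom.current().nextLong(configured.toMillis() + 1);
    }

    public static void main(String[] args) {
        System.out.println("default: " + delayMillisFor(DEFAULT_MAX_DELAY) + " ms");
        System.out.println("null:    " + delayMillisFor(null) + " ms");
        System.out.println("zero:    " + delayMillisFor(Duration.ZERO) + " ms");
    }
}
```

Using a single key for both the bound and the on/off switch avoids a separate Boolean: setting the value to null or zero is the "off" state.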

@grkvlt
Member

grkvlt commented Apr 10, 2014

@sjcorbett Yep, that's what I was thinking of. Love the description of the issue!

@grkvlt grkvlt added this to the v 0.7.0 milestone Apr 10, 2014
@grkvlt grkvlt self-assigned this Apr 10, 2014
@grkvlt
Member

grkvlt commented Apr 10, 2014

So, I didn't realise "thundering herd" was the actual name of the problem. The wiki page also references the sleeping barber problem, and of course we also have the Byzantine generals. Distributed computing is fun!

@buildhive

Brooklyn Central » brooklyn #2063 SUCCESS
This pull request looks good

grkvlt pushed a commit that referenced this pull request Apr 11, 2014
Avoid thundering herd when SoftwareProcess rebinds its sensors
@grkvlt grkvlt merged commit 175a6ec into brooklyncentral:master Apr 11, 2014
@grkvlt
Member

grkvlt commented Apr 11, 2014

Merged 🐄 🐄 🐄 🐄

@sjcorbett sjcorbett deleted the ssh-connection-refactor branch May 8, 2014 11:19

3 participants