
Stress / long running test of cluster, see #2786 #940

Merged
merged 2 commits into master from wip-2786-cluster-stress-patriknw on Jan 8, 2013

6 participants

@patriknw
Akka Project member

This test is intended to be used as a long running stress test of cluster related features.

@viktorklang viktorklang commented on an outdated diff Dec 12, 2012
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
+import akka.routing.FromConfig
+import akka.testkit._
+import akka.routing.CurrentRoutees
+import akka.routing.RouterRoutees
+import akka.actor.PoisonPill
+
+object StressMultiJvmSpec extends MultiNodeConfig {
+
+ // Note that this test uses default configuration,
+ // not MultiNodeClusterSpec.clusterConfig
+ commonConfig(ConfigFactory.parseString("""
+ akka.test.cluster-stress-spec {
+ nr-of-nodes-factor = 1
+ nr-of-nodes = 13
+ nr-of-seed-nodes = 3
+ nr-of-nodes-joining-to-seed-initally = 2
@viktorklang
Akka Project member
viktorklang added a note Dec 12, 2012

If we make nr-of-nodes a section we can cut down on the boilerplate a bit
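
A quick sketch of what viktorklang seems to be suggesting: group the node counts under one nr-of-nodes section so the common prefix is not repeated. The key names below are illustrative and not necessarily what the PR ended up using.

import com.typesafe.config.ConfigFactory

object SectionedConfigSketch extends App {
  val cfg = ConfigFactory.parseString("""
    akka.test.cluster-stress-spec.nr-of-nodes {
      factor = 1
      total = 13
      seed = 3
      joining-to-seed-initially = 2
    }
    """)
  println(cfg.getInt("akka.test.cluster-stress-spec.nr-of-nodes.seed")) // 3
}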

@akka-ci
Akka Project member
akka-ci commented Dec 12, 2012

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/197/

@akka-ci
Akka Project member
akka-ci commented Dec 12, 2012

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/197/

@viktorklang
Akka Project member

Great stuff Patrik!

@patriknw
Akka Project member

pushed today's work

@akka-ci
Akka Project member
akka-ci commented Dec 13, 2012

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/202/

@akka-ci
Akka Project member
akka-ci commented Dec 13, 2012

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/202/

@patriknw
Akka Project member

Added some more stuff, especially the interesting join/remove repeat test.
This could be the first version of this test, which could go into master, and we can add tickets for more stuff.
For example, I would like us to add something that tests a large actor tree hierarchy (creating and using many actors).
Then there is the throttling and message-dropping part, but that can't be done until we have reliable system messages.

By the way, I'm running this test periodically in the moxie cluster.

@rkuhn
Akka Project member
rkuhn commented Dec 14, 2012
@akka-ci
Akka Project member
akka-ci commented Dec 14, 2012

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/207/

@akka-ci
Akka Project member
akka-ci commented Dec 14, 2012

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/207/

@patriknw
Akka Project member

Added the test of many actors, also in a tree structure.
Squashed commits.

@bantonsson bantonsson commented on the diff Dec 17, 2012
...-cluster/src/main/scala/akka/cluster/ClusterJmx.scala
@@ -140,7 +140,6 @@ private[akka] class ClusterJmx(cluster: Cluster, log: LoggingAdapter) {
* Unregisters the cluster JMX MBean from MBean server.
*/
def unregisterMBean(): Unit = {
- clusterView.close()
@bantonsson
Akka Project member
bantonsson added a note Dec 17, 2012

Nice catch.

@rkuhn rkuhn and 1 other commented on an outdated diff Dec 17, 2012
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
+
+ }
+ }
+
+ /**
+ * Master of routers
+ */
+ class Master(settings: StressMultiJvmSpec.Settings, batchInterval: FiniteDuration, tree: Boolean) extends Actor {
+ val workers = context.actorOf(Props[Worker].withRouter(FromConfig), "workers")
+ val payload = Array.fill(settings.payloadSize)(ThreadLocalRandom.current.nextInt(127).toByte)
+ val retryTimeout = 5.seconds.dilated(context.system)
+ var idCounter = 0L
+ def nextId(): JobId = {
+ idCounter += 1
+ idCounter
+ }
@rkuhn
Akka Project member
rkuhn added a note Dec 17, 2012

I usually use val idCounter = Iterator from 0

@patriknw
Akka Project member
patriknw added a note Dec 18, 2012

nice, will try that here
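
A small runnable sketch of the idiom rkuhn mentions; note that Iterator from 0 yields Int, while the PR's JobId is a Long, so something like Iterator.iterate(0L)(_ + 1) would be the Long-typed equivalent. This snippet is illustrative, not the code that ended up in the PR.

object IdCounterExample extends App {
  // Iterator as the id source instead of a mutable var, per rkuhn's suggestion
  val idCounter = Iterator from 0
  def nextId(): Int = idCounter.next()
  println(Seq.fill(3)(nextId())) // prints List(0, 1, 2)
}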

@akka-ci
Akka Project member
akka-ci commented Dec 17, 2012

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/218/

@akka-ci
Akka Project member
akka-ci commented Dec 17, 2012

jenkins job akka-pr-validator: Failed - https://jenkins.akka.io/job/akka-pr-validator/218/

@rkuhn rkuhn and 1 other commented on an outdated diff Dec 18, 2012
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
+      case Begin ⇒
+        startTime = System.nanoTime
+        self ! SendBatch
+        context.become(working)
+      case RetryTick ⇒
+    }
+
+    def working: Receive = {
+      case Ack(id) ⇒
+        outstanding -= id
+        ackCounter += 1
+        if (outstanding.size == settings.workBatchSize / 2)
+          if (batchInterval == Duration.Zero) self ! SendBatch
+          else context.system.scheduler.scheduleOnce(batchInterval, self, SendBatch)
+      case SendBatch ⇒
+        if (outstanding.size < settings.workBatchSize)
@rkuhn
Akka Project member
rkuhn added a note Dec 18, 2012

it is not obvious how this could ever be false; maybe explain the mechanics of this test driver in its class comment

@patriknw
Akka Project member
patriknw added a note Dec 18, 2012

ok, will do; it's a simple form of flow control:
the scheduled SendBatch messages are ignored if not enough acks have been received

@patriknw
Akka Project member
patriknw added a note Dec 20, 2012

thought some more about this, you are right, that was a safety net that was not needed; removed it, thanks

@rkuhn rkuhn commented on the diff Dec 18, 2012
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
+        tree forward (idx, SimpleJob(id, payload))
+        context.become(treeWorker(tree))
+    }
+
+    def treeWorker(tree: ActorRef): Receive = {
+      case SimpleJob(id, payload) ⇒ sender ! Ack(id)
+      case TreeJob(id, payload, idx, _, _) ⇒
+        tree forward (idx, SimpleJob(id, payload))
+    }
+  }
+
+  class TreeNode(level: Int, width: Int) extends Actor {
+    require(level >= 1)
+    def createChild(): Actor = if (level == 1) new Leaf else new TreeNode(level - 1, width)
+    val indexedChildren =
+      0 until width map { i ⇒ context.actorOf(Props(createChild()), name = i.toString) } toVector
@rkuhn
Akka Project member
rkuhn added a note Dec 18, 2012

that makes each of the trees local-only (since these Props would not be serializable); just clarifying my understanding that only the top Worker router is going to be spread out across the cluster

@patriknw
Akka Project member
patriknw added a note Dec 18, 2012

yes, that was the plan, and not that unrealistic
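
A hedged sketch (Akka 2.1-era API; class bodies simplified, not the PR code) of the split being confirmed here: the top-level "workers" router is resolved FromConfig, so it can be deployed across the cluster purely via configuration, whereas Props(new TreeNode(...)) closes over a locally constructed instance, is not serializable, and therefore each worker's tree stays on its own node.

import akka.actor.{ Actor, Props }
import akka.routing.FromConfig

class Worker extends Actor {
  def receive = { case job ⇒ sender ! job }
}

class TreeNode(level: Int, width: Int) extends Actor {
  def receive = { case job ⇒ sender ! job } // the real spec fans out to child actors
}

class Master extends Actor {
  val workers = context.actorOf(Props[Worker].withRouter(FromConfig), "workers") // cluster-deployable
  val tree = context.actorOf(Props(new TreeNode(2, 5)), "tree")                  // always local
  def receive = { case job ⇒ workers forward job }
}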

@rkuhn rkuhn and 1 other commented on an outdated diff Dec 18, 2012
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
+ if (deadline.isOverdue) previous
+ else {
+ val t = title + " round " + counter
+ runOn(roles.take(usedRoles): _*) {
+ phiObserver ! Reset
+ }
+ createResultAggregator(t, expectedResults = usedRoles, includeInHistory = true)
+ val next = within(loopDuration) {
+ reportResult {
+ val previousAddress = previous map { Cluster(_).selfAddress }
+ val next =
+ if (activeRoles contains myself) {
+ val sys = ActorSystem(system.name, system.settings.config)
+ muteLog(sys)
+ Cluster(sys).joinSeedNodes(seedNodes.toIndexedSeq map address)
+ previous foreach { _.shutdown() }
@rkuhn
Akka Project member
rkuhn added a note Dec 18, 2012

why shutdown after start of successor?

@patriknw
Akka Project member
patriknw added a note Dec 18, 2012

For each iteration I want to start one member and shutdown one, therefore I shutdown the one that was started in the previous iteration.
Do you see a problem with that? Is there a better way?

@rkuhn
Akka Project member
rkuhn added a note Dec 18, 2012
@patriknw
Akka Project member
patriknw added a note Dec 18, 2012

yes, that would look better, thanks
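
A minimal sketch (not the PR code; the Akka 2.1-era ActorSystem.shutdown() API is assumed) of the per-round mechanics patriknw describes: every round starts one fresh member that joins via the seed nodes, and the member started in the previous round is then shut down, so the cluster size stays constant while membership churns.

import scala.collection.immutable
import akka.actor.{ ActorSystem, Address }
import akka.cluster.Cluster

object JoinRemoveRoundSketch {
  def nextRound(previous: Option[ActorSystem],
                seedNodes: immutable.IndexedSeq[Address]): Option[ActorSystem] = {
    val sys = ActorSystem("ClusterSystem") // the real test reuses the spec's config
    Cluster(sys).joinSeedNodes(seedNodes)
    previous foreach { _.shutdown() }      // retire the member from the previous round
    Some(sys)
  }
}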

@rkuhn
Akka Project member
rkuhn commented Dec 18, 2012

I’m unable to understand each line at this point, but it looks good nevertheless.

@akka-ci
Akka Project member
akka-ci commented Dec 20, 2012

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/242/

@akka-ci
Akka Project member
akka-ci commented Dec 20, 2012

jenkins job akka-pr-validator: Failed - https://jenkins.akka.io/job/akka-pr-validator/242/

@rkuhn
Akka Project member
rkuhn commented Dec 20, 2012

The last commit looks reasonable, but I don’t completely get it. It looks like your change basically makes the heartbeat sender aware of a removed node a little earlier than before, but why should the sender care?

@bantonsson bantonsson and 2 others commented on an outdated diff Dec 20, 2012
...er/src/main/scala/akka/cluster/ClusterHeartbeat.scala
@@ -113,6 +113,7 @@ private[cluster] final class ClusterHeartbeatSender extends Actor with ActorLogg
case HeartbeatTick ⇒ heartbeat()
case s: CurrentClusterState ⇒ reset(s)
case MemberUnreachable(m) ⇒ removeMember(m)
+ case MemberDowned(m) ⇒ removeMember(m)
@bantonsson
Akka Project member
bantonsson added a note Dec 20, 2012

So it looks like we added back the member (two lines down) when we received a MemberDowned before this fix. Is that correct?

@patriknw
Akka Project member
patriknw added a note Dec 20, 2012

@bantonsson That is correct!
@rkuhn the problem is that we never go from Down to Removed, which will be fixed in another existing ticket

@rkuhn
Akka Project member
rkuhn added a note Dec 20, 2012

ah, okay, good to know

@patriknw
Akka Project member
patriknw added a note Dec 20, 2012

The first test run on Jenkins was successful, which doesn't mean that it's fixed, but it's a good start anyway.
I'll keep it running.
I'll also fix the review comments.

@akka-ci
Akka Project member
akka-ci commented Dec 20, 2012

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/246/

@akka-ci
Akka Project member
akka-ci commented Dec 20, 2012

jenkins job akka-pr-validator: Failed - https://jenkins.akka.io/job/akka-pr-validator/246/

@bantonsson bantonsson commented on the diff Dec 21, 2012
...luster/src/main/scala/akka/cluster/ClusterEvent.scala
// buffer up the MemberEvents waiting for convergence
memberEvents ++= diffMemberEvents(oldGossip, newGossip)
// if we have convergence then publish the MemberEvents and possibly a LeaderChanged
if (newGossip.convergence) {
val previousConvergedGossip = latestConvergedGossip
latestConvergedGossip = newGossip
- memberEvents foreach publish
+        memberEvents foreach { event ⇒
+          event match {
+            case m @ (MemberDowned(_) | MemberRemoved(_)) ⇒
+              // TODO MemberDowned match should probably be covered by MemberRemoved, see ticket #2788
+              // but right now we don't change Downed to Removed
+              publish(event)
+              // notify DeathWatch about downed node
+              publish(AddressTerminated(m.member.address))
+            case _ ⇒ publish(event)
+          }
+        }
@bantonsson
Akka Project member
bantonsson added a note Dec 21, 2012

Nice catch! Sorry about the broken merge.

@patriknw
Akka Project member
patriknw added a note Dec 21, 2012

np, it was a mess with forward/back porting, and this change was not back ported

@patriknw
Akka Project member
patriknw added a note Dec 21, 2012

now we also have a test that actually catches it :)

@patriknw
Akka Project member

Alright, I have been running this (with new remoting) a few times on jenkins.akka.io, and its failure rate is ~50%, so we should not merge it now. I don't have more time today to investigate the failures. They are here: https://jenkins.akka.io:8498/job/akka-multi-node-repeat2/

Apart from the test the commit contains fixes for 3 issues:

  • Avoid adding back members when downed in ClusterHeartbeatSender
  • Avoid duplicate close of ClusterReadView
  • Add back the publish of AddressTerminated when MemberDowned/Removed; it was lost in the merge of "publish on convergence"
@drewhk
Akka Project member
drewhk commented Dec 21, 2012

Which ones are the failures?

If you find something fishy, please send it to me and I'll try to investigate.

@patriknw
Akka Project member

look at build 111 and thereafter
https://jenkins.akka.io:8498/job/akka-multi-node-repeat2/

that job is weird, it shows blue bullets even though it fails, so you have to look at the log

@akka-ci
Akka Project member
akka-ci commented Jan 4, 2013

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/271/

@akka-ci
Akka Project member
akka-ci commented Jan 4, 2013

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/271/

patriknw added some commits Dec 12, 2012
@patriknw patriknw Stress / long running test of cluster, see #2786
* akka.cluster.StressSpec
* Configurable number of nodes and duration for each step
* Report metrics and phi periodically to see progress
* Configurable payload size
* Test of various join and remove scenarios
* Test of watch
* Exercise supervision
* Report cluster stats
* Test with many actors in tree structure

Apart from the test this commit also solves some issues:

* Avoid adding back members when downed in ClusterHeartbeatSender
* Avoid duplicate close of ClusterReadView
* Add back the publish of AddressTerminated when MemberDowned/Removed
  it was lost in merge of "publish on convergence", see #2779
f147f4d
@patriknw patriknw Remove LargeClusterSpec, superseded by StressSpec, see #2786 46d376b
@patriknw
Akka Project member
patriknw commented Jan 8, 2013

Merging! Remove LargeClusterSpec, since StressSpec can be used for that as well.

@patriknw patriknw merged commit a0cb4b3 into master Jan 8, 2013
@patriknw patriknw deleted the wip-2786-cluster-stress-patriknw branch Jan 8, 2013