Stress / long running test of cluster, see #2786 #940

Merged
merged 2 commits into from

6 participants

@patriknw
Akka Project member

This test is intended to be used as long running stress test of cluster related features.

@viktorklang viktorklang commented on an outdated diff
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
((20 lines not shown))
+import akka.routing.FromConfig
+import akka.testkit._
+import akka.routing.CurrentRoutees
+import akka.routing.RouterRoutees
+import akka.actor.PoisonPill
+
+object StressMultiJvmSpec extends MultiNodeConfig {
+
+ // Note that this test uses default configuration,
+ // not MultiNodeClusterSpec.clusterConfig
+ commonConfig(ConfigFactory.parseString("""
+ akka.test.cluster-stress-spec {
+ nr-of-nodes-factor = 1
+ nr-of-nodes = 13
+ nr-of-seed-nodes = 3
+ nr-of-nodes-joining-to-seed-initally = 2
@viktorklang Akka Project member

If we make nr-of-nodes a section we can cut down on the boilerplate a bit

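A rough sketch of that suggestion (the grouped key names are hypothetical, not part of this PR) could pull the counts together under one nr-of-nodes section:

    commonConfig(ConfigFactory.parseString("""
      akka.test.cluster-stress-spec {
        nr-of-nodes {
          factor = 1
          total = 13
          seed = 3
          joining-to-seed-initially = 2
        }
      }
      """))
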
@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/197/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/197/

@viktorklang
Akka Project member

Great stuff Patrik!

@patriknw
Akka Project member

pushed today's work

@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/202/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/202/

@patriknw
Akka Project member

Added some more stuff, especially the interesting join/remove repeat test.
This could be the first version of the test to go into master, and we can add tickets for more things later.
For example, I would like to add something that exercises a large actor tree hierarchy (creating and using many actors).
Then there is the throttling and message-dropping part, but that can't be done until we have reliable system messages.

By the way, I'm running this test periodically in the moxie cluster.

@rkuhn
Akka Project member
@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/207/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/207/

@patriknw
Akka Project member

Added the test of many actors, also in a tree structure.
Squashed commits.

@bantonsson bantonsson commented on the diff
...-cluster/src/main/scala/akka/cluster/ClusterJmx.scala
@@ -140,7 +140,6 @@ private[akka] class ClusterJmx(cluster: Cluster, log: LoggingAdapter) {
* Unregisters the cluster JMX MBean from MBean server.
*/
def unregisterMBean(): Unit = {
- clusterView.close()
@bantonsson Akka Project member

Nice catch.

@rkuhn rkuhn commented on an outdated diff
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
((369 lines not shown))
+
+ }
+ }
+
+ /**
+ * Master of routers
+ */
+ class Master(settings: StressMultiJvmSpec.Settings, batchInterval: FiniteDuration, tree: Boolean) extends Actor {
+ val workers = context.actorOf(Props[Worker].withRouter(FromConfig), "workers")
+ val payload = Array.fill(settings.payloadSize)(ThreadLocalRandom.current.nextInt(127).toByte)
+ val retryTimeout = 5.seconds.dilated(context.system)
+ var idCounter = 0L
+ def nextId(): JobId = {
+ idCounter += 1
+ idCounter
+ }
@rkuhn Akka Project member
rkuhn added a note

I usually use val idCounter = Iterator from 0

@patriknw Akka Project member

nice, will try that here

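A minimal sketch of that suggestion (assuming JobId is an alias for Long, as the 0L initializer in the diff above suggests):

    val idCounter = Iterator from 0
    def nextId(): JobId = idCounter.next().toLong
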
@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/218/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Failed - https://jenkins.akka.io/job/akka-pr-validator/218/

@rkuhn rkuhn commented on an outdated diff
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
((399 lines not shown))
+ case Begin ⇒
+ startTime = System.nanoTime
+ self ! SendBatch
+ context.become(working)
+ case RetryTick ⇒
+ }
+
+ def working: Receive = {
+ case Ack(id) ⇒
+ outstanding -= id
+ ackCounter += 1
+ if (outstanding.size == settings.workBatchSize / 2)
+ if (batchInterval == Duration.Zero) self ! SendBatch
+ else context.system.scheduler.scheduleOnce(batchInterval, self, SendBatch)
+ case SendBatch ⇒
+ if (outstanding.size < settings.workBatchSize)
@rkuhn Akka Project member
rkuhn added a note

it is not obvious how this could ever be false; maybe explain the mechanics of this test driver in its class comment

@patriknw Akka Project member

ok, will do, it's a simple form of flow control:
the scheduled SendBatch messages will be ignored if not enough acks have been received

@patriknw Akka Project member

thought some more about this, you are right, that was a safety net that was not needed; removed, thanks

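For reference, the class comment rkuhn asked for could read roughly like this (illustrative wording only, based on the quoted code and the explanation above, not the final text in the PR):

    /**
     * Master of routers. Sends batches of SimpleJob to the workers router and keeps
     * track of the outstanding (not yet acknowledged) job ids. When half of the
     * current batch has been acked it sends (or schedules, if batchInterval > 0)
     * the next SendBatch to itself, which gives a simple form of flow control.
     */
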
@rkuhn rkuhn commented on the diff
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
((475 lines not shown))
+ tree forward (idx, SimpleJob(id, payload))
+ context.become(treeWorker(tree))
+ }
+
+ def treeWorker(tree: ActorRef): Receive = {
+ case SimpleJob(id, payload) ⇒ sender ! Ack(id)
+ case TreeJob(id, payload, idx, _, _) ⇒
+ tree forward (idx, SimpleJob(id, payload))
+ }
+ }
+
+ class TreeNode(level: Int, width: Int) extends Actor {
+ require(level >= 1)
+ def createChild(): Actor = if (level == 1) new Leaf else new TreeNode(level - 1, width)
+ val indexedChildren =
+ 0 until width map { i ⇒ context.actorOf(Props(createChild()), name = i.toString) } toVector
@rkuhn Akka Project member
rkuhn added a note

that makes each of the trees local-only (since these Props would not be serializable); just clarifying my understanding that only the top Worker router is going to be spread out across the cluster

@patriknw Akka Project member

yes, that was the plan, and not that unrealistic

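For illustration, this is the kind of Akka 2.1 deployment section that FromConfig would pick up for the top-level "workers" router, spreading only that router across the cluster while the tree children stay local; the path and numbers here are hypothetical, the spec's actual settings live in its own config:

    akka.actor.deployment {
      /master/workers {
        router = round-robin
        nr-of-instances = 100
        cluster {
          enabled = on
          max-nr-of-instances-per-node = 10
          allow-local-routees = on
        }
      }
    }
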
@rkuhn rkuhn commented on an outdated diff
...ter/src/multi-jvm/scala/akka/cluster/StressSpec.scala
((790 lines not shown))
+ if (deadline.isOverdue) previous
+ else {
+ val t = title + " round " + counter
+ runOn(roles.take(usedRoles): _*) {
+ phiObserver ! Reset
+ }
+ createResultAggregator(t, expectedResults = usedRoles, includeInHistory = true)
+ val next = within(loopDuration) {
+ reportResult {
+ val previousAddress = previous map { Cluster(_).selfAddress }
+ val next =
+ if (activeRoles contains myself) {
+ val sys = ActorSystem(system.name, system.settings.config)
+ muteLog(sys)
+ Cluster(sys).joinSeedNodes(seedNodes.toIndexedSeq map address)
+ previous foreach { _.shutdown() }
@rkuhn Akka Project member
rkuhn added a note

why shutdown after start of successor?

@patriknw Akka Project member

For each iteration I want to start one member and shutdown one, therefore I shutdown the one that was started in the previous iteration.
Do you see a problem with that? Is there a better way?

@rkuhn Akka Project member
@patriknw Akka Project member

yes, that would look better, thanks

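For context, a simplified sketch of the per-round mechanics being discussed (names taken from the quoted diff, heavily condensed):

    // each round: start a fresh ActorSystem that joins the seed nodes,
    // then shut down the system that was started in the previous round
    def joinRemoveRound(previous: Option[ActorSystem]): Option[ActorSystem] = {
      val next = ActorSystem(system.name, system.settings.config)
      Cluster(next).joinSeedNodes(seedNodes.toIndexedSeq map address)
      previous foreach { _.shutdown() }
      Some(next)
    }
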
@rkuhn
Akka Project member

I’m unable to understand each line at this point, but it looks good nevertheless.

@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/242/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Failed - https://jenkins.akka.io/job/akka-pr-validator/242/

@rkuhn
Akka Project member

The last commit looks reasonable, but I don’t completely get it. It looks like your change basically makes the heartbeat sender aware of a removed node a little earlier than before, but why should the sender care?

@bantonsson bantonsson commented on an outdated diff
...er/src/main/scala/akka/cluster/ClusterHeartbeat.scala
@@ -113,6 +113,7 @@ private[cluster] final class ClusterHeartbeatSender extends Actor with ActorLogg
case HeartbeatTick ⇒ heartbeat()
case s: CurrentClusterState ⇒ reset(s)
case MemberUnreachable(m) ⇒ removeMember(m)
+ case MemberDowned(m) ⇒ removeMember(m)
@bantonsson Akka Project member

So it looks like we added back the member (two lines down) when we received a MemberDowned before this fix. Is that correct?

@patriknw Akka Project member

@bantonsson That is correct!
@rkuhn the problem is that we never go from Down to Removed, which will be fixed in another existing ticket

@rkuhn Akka Project member
rkuhn added a note

ah, okay, good to know

@patriknw Akka Project member

first test run on jenkins was successful, which doesn't mean that it's fixed, but it's a good start anyway
I'll keep it running
I'll also fix the review comments

@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/246/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Failed - https://jenkins.akka.io/job/akka-pr-validator/246/

@bantonsson bantonsson commented on the diff
...luster/src/main/scala/akka/cluster/ClusterEvent.scala
((10 lines not shown))
// buffer up the MemberEvents waiting for convergence
memberEvents ++= diffMemberEvents(oldGossip, newGossip)
// if we have convergence then publish the MemberEvents and possibly a LeaderChanged
if (newGossip.convergence) {
val previousConvergedGossip = latestConvergedGossip
latestConvergedGossip = newGossip
- memberEvents foreach publish
+ memberEvents foreach { event ⇒
+ event match {
+ case m @ (MemberDowned(_) | MemberRemoved(_)) ⇒
+ // TODO MemberDowned match should probably be covered by MemberRemoved, see ticket #2788
+ // but right now we don't change Downed to Removed
+ publish(event)
+ // notify DeathWatch about downed node
+ publish(AddressTerminated(m.member.address))
+ case _ ⇒ publish(event)
+ }
+ }
@bantonsson Akka Project member

Nice catch! Sorry about the broken merge.

@patriknw Akka Project member

np, it was a mess with forward/back porting, and this change was not back ported

@patriknw Akka Project member

now we also have a test that actually catches it :)

@patriknw
Akka Project member

Alright, I have been running this (with new remoting) a few times on jenkins.akka.io, and the failure rate is ~50%, so we should not merge it now. I don't have more time today to investigate the failures. They are here: https://jenkins.akka.io:8498/job/akka-multi-node-repeat2/

Apart from the test the commit contains fixes for 3 issues:

  • Avoid adding back members when downed in ClusterHeartbeatSender
  • Avoid duplicate close of ClusterReadView
  • Add back the publish of AddressTerminated when MemberDowned/Removed; it was lost in the merge of "publish on convergence"
@drewhk
Akka Project member

Which ones are the failures?

If you find something fishy, please send it to me, I'll try to investigate.

@patriknw
Akka Project member

look at build 111 and thereafter
https://jenkins.akka.io:8498/job/akka-multi-node-repeat2/

that job is weird: it shows blue bullets even though it fails, so you have to look at the log

@akka-ci
Akka Project member

Started jenkins job akka-pr-validator at https://jenkins.akka.io/job/akka-pr-validator/271/

@akka-ci
Akka Project member

jenkins job akka-pr-validator: Success - https://jenkins.akka.io/job/akka-pr-validator/271/

patriknw added some commits
@patriknw patriknw Stress / long running test of cluster, see #2786
* akka.cluster.StressSpec
* Configurable number of nodes and duration for each step
* Report metrics and phi periodically to see progress
* Configurable payload size
* Test of various join and remove scenarios
* Test of watch
* Exercise supervision
* Report cluster stats
* Test with many actors in tree structure

Apart from the test this commit also solves some issues:

* Avoid adding back members when downed in ClusterHeartbeatSender
* Avoid duplicate close of ClusterReadView
* Add back the publish of AddressTerminated when MemberDowned/Removed
  it was lost in merge of "publish on convergence", see #2779
f147f4d
@patriknw patriknw Remove LargeClusterSpec, superseded by StressSpec, see #2786 46d376b
@patriknw
Akka Project member

Merging! Remove LargeClusterSpec, since StressSpec can be used for that as well.

@patriknw patriknw merged commit a0cb4b3 into master
@patriknw patriknw deleted the wip-2786-cluster-stress-patriknw branch
Commits on Jan 7, 2013
  1. @patriknw

    Stress / long running test of cluster, see #2786

    patriknw committed
    * akka.cluster.StressSpec
    * Configurable number of nodes and duration for each step
    * Report metrics and phi periodically to see progress
    * Configurable payload size
    * Test of various join and remove scenarios
    * Test of watch
    * Exercise supervision
    * Report cluster stats
    * Test with many actors in tree structure
    
    Apart from the test this commit also solves some issues:
    
    * Avoid adding back members when downed in ClusterHeartbeatSender
    * Avoid duplicate close of ClusterReadView
    * Add back the publish of AddressTerminated when MemberDowned/Removed
      it was lost in merge of "publish on convergence", see #2779
Commits on Jan 8, 2013
  1. @patriknw

     Remove LargeClusterSpec, superseded by StressSpec, see #2786

     patriknw committed