KAFKA-5028: convert kafka controller to a single-threaded event queue model #2816

onurkaraman · 2017-04-06T05:37:31Z

The goal of this ticket is to improve controller maintainability by simplifying the controller's concurrency semantics. The controller code has a lot of shared state between several threads using several concurrency primitives. This makes the code hard to reason about.

This ticket proposes we convert the controller to a single-threaded event queue model. We add a new controller thread which processes events held in an event queue. Note that this does not mean we get rid of all threads used by the controller. We merely delegate all work that interacts with controller local state to this single thread. With only a single thread accessing and modifying the controller local state, we no longer need to worry about concurrent access, which means we can get rid of the various concurrency primitives used throughout the controller.

Performance is expected to match existing behavior since the bulk of the existing controller work today already happens sequentially in the ZkClient’s single ZkEventThread.
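
For readers skimming the thread, here is a minimal sketch of the model described above. It is not the PR's code: `Controller`, `BrokerAdded`, `BrokerRemoved`, and the `Stop` sentinel are names invented for illustration, and only `ControllerEvent.process()` mirrors what the diff shows. The point is that any thread may enqueue, but only one thread touches the local state.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

object EventQueueSketch {
  // Hypothetical stand-in for the PR's ControllerEvent type.
  trait ControllerEvent { def process(): Unit }

  class Controller {
    private val queue = new LinkedBlockingQueue[ControllerEvent]()
    // Only the event thread reads or writes this set, so it needs no lock.
    private val liveBrokers = mutable.Set.empty[Int]

    case class BrokerAdded(id: Int) extends ControllerEvent {
      override def process(): Unit = { liveBrokers += id; () }
    }
    case class BrokerRemoved(id: Int) extends ControllerEvent {
      override def process(): Unit = { liveBrokers -= id; () }
    }
    // Sentinel invented for this sketch so the loop can terminate.
    case object Stop extends ControllerEvent {
      override def process(): Unit = ()
    }

    private val eventThread = new Thread(new Runnable {
      override def run(): Unit = {
        var event = queue.take()
        while (event != Stop) {
          event.process()
          event = queue.take()
        }
      }
    }, "controller-event-thread")

    def start(): Unit = eventThread.start()

    // Safe to call from any thread (ZK listeners, schedulers, callbacks).
    def enqueue(event: ControllerEvent): Unit = queue.put(event)

    // Stops the loop, waits for the thread, then reads the state safely.
    def stopAndSnapshot(): Set[Int] = {
      queue.put(Stop)
      eventThread.join()
      liveBrokers.toSet
    }
  }

  def demo(): Set[Int] = {
    val controller = new Controller
    controller.start()
    controller.enqueue(controller.BrokerAdded(1))
    controller.enqueue(controller.BrokerAdded(2))
    controller.enqueue(controller.BrokerAdded(3))
    controller.enqueue(controller.BrokerRemoved(2))
    controller.stopAndSnapshot()
  }
}
```

Because the queue is FIFO and has a single consumer, the events in `demo()` are applied in enqueue order, leaving brokers 1 and 3.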

onurkaraman · 2017-04-06T05:38:06Z

This PR passed the system tests:
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/826/

asfbot · 2017-04-06T07:13:23Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/2782/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-06T07:15:06Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/2778/
Test PASSed (JDK 7 and Scala 2.10).

asfbot · 2017-04-06T07:51:27Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/2778/
Test PASSed (JDK 8 and Scala 2.12).

junrao

@onurkaraman : Thanks for the patch. A few quick comments.

junrao · 2017-04-07T21:16:06Z

core/src/main/scala/kafka/controller/KafkaController.scala

+  private val partitionModificationsListeners: mutable.Map[String, PartitionModificationsListener] = mutable.Map.empty
+  private val partitionReassignmentListener = new PartitionReassignmentListener(this, controllerEventQueue)
+  private val preferredReplicaElectionListener = new PreferredReplicaElectionListener(this, controllerEventQueue)
+  private val isrChangeNotificationListener = new IsrChangeNotificationListener(this, controllerEventQueue)


There are quite a few listeners. The first thing that they all do is read the current data from ZK and figure out the changes. Could we at least share that part of the code for all listeners?

Sharing this logic at the listener layer would complicate the concurrency semantics since figuring out what has changed would require looking at controller local state while the listener is being executed from the ZkEventThread. This is what this PR is trying to avoid.

Even if we push the shared logic to the ControllerEvent layer, it doesn't seem like we'd actually simplify things since each listener reads and observes what's changed differently.

junrao · 2017-04-07T21:16:22Z

core/src/main/scala/kafka/controller/KafkaController.scala

      info("Broker %d is ready to serve as the new controller with epoch %d".format(config.brokerId, epoch))
      maybeTriggerPartitionReassignment()
      maybeTriggerPreferredReplicaElection()
+      info("starting the controller scheduler")
+      kafkaScheduler.startup()
+      kafkaScheduler.schedule("controller-metric-task", () => controllerEventQueue.put(UpdateMetrics), period = 10, unit = TimeUnit.SECONDS)


Do we need to update metrics in a scheduler? It seems that those metrics can only change after the processing of each event. If so, we can just update the metrics at the end of each event processing.

Converting UpdateMetrics from an event to a method run at the end of every event uncovered a bug in existing code:
If a topic is created while the replica set is offline, the partition would be defined in partitionReplicaAssignment but not in partitionLeadershipInfo. Computing the PreferredReplicaImbalanceCount during this time will result in NoSuchElementException when looking up the partition on the partitionLeadershipInfo.

newGauge(
  "PreferredReplicaImbalanceCount",
  new Gauge[Int] {
    def value(): Int = {
      inLock(controllerContext.controllerLock) {
        if (!isActive)
          0
        else
          controllerContext.partitionReplicaAssignment.count {
            case (topicPartition, replicas) =>
              (controllerContext.partitionLeadershipInfo(topicPartition).leaderAndIsr.leader != replicas.head
                && (!deleteTopicManager.isTopicQueuedUpForDeletion(topicPartition.topic))
              )
          }
      }
    }
  }
)

This actually breaks tests like AdminTest.testBasicPreferredReplicaElection which creates a topic before starting the cluster.

I think the fix could be to only use partitions that are in both partitionReplicaAssignment and partitionLeadershipInfo when computing PreferredReplicaImbalanceCount.

Good catch. The approach sounds good.
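
A rough sketch of the fix proposed above, counting only partitions that appear in both maps. This is an illustration under simplifying assumptions, not the patch itself: the maps are plain immutable `Map`s, the leader is a bare broker id rather than a `leaderAndIsr` wrapper, and `TopicAndPartition` and the function signature are invented here.

```scala
object ImbalanceCountSketch {
  // Simplified stand-in for Kafka's topic/partition key.
  case class TopicAndPartition(topic: String, partition: Int)

  def preferredReplicaImbalanceCount(
      replicaAssignment: Map[TopicAndPartition, Seq[Int]],
      leaders: Map[TopicAndPartition, Int],
      topicsQueuedForDeletion: Set[String]): Int =
    replicaAssignment.count { case (tp, replicas) =>
      // leaders.get returns None for partitions without leadership info,
      // so those are skipped instead of throwing NoSuchElementException.
      leaders.get(tp).exists { leader =>
        replicas.headOption.exists(_ != leader) &&
          !topicsQueuedForDeletion.contains(tp.topic)
      }
    }

  def demo(): Int = {
    val assignment = Map(
      TopicAndPartition("imbalanced", 0) -> Seq(1, 2),
      TopicAndPartition("balanced", 0) -> Seq(1, 2),
      TopicAndPartition("no-leader-yet", 0) -> Seq(1, 2),
      TopicAndPartition("deleting", 0) -> Seq(1, 2))
    val leaders = Map(
      TopicAndPartition("imbalanced", 0) -> 2,
      TopicAndPartition("balanced", 0) -> 1,
      TopicAndPartition("deleting", 0) -> 2)
    preferredReplicaImbalanceCount(assignment, leaders, Set("deleting"))
  }
}
```

In `demo()` only the `imbalanced` partition is counted: `no-leader-yet` has no leadership entry, `deleting` is queued for deletion, and `balanced` is led by its preferred replica.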

junrao · 2017-04-07T21:16:40Z

core/src/main/scala/kafka/controller/KafkaController.scala

+      kafkaScheduler.startup()
+      kafkaScheduler.schedule("controller-metric-task", () => controllerEventQueue.put(UpdateMetrics), period = 10, unit = TimeUnit.SECONDS)
+      if (config.deleteTopicEnable) {
+        kafkaScheduler.schedule("topic-deletion-progress-check-task", () => controllerEventQueue.put(TopicDeletionProgressCheck),


Hmm, adding a scheduled event to the queue just for checking seems like overkill. Would it be better to just do this check in the main event loop? This applies to other scheduled tasks like leader balancing. If we do this, we can get rid of the scheduler thread completely.

It seems that we can probably get rid of this scheduled thread since the topic deletion check only needs to be triggered on other events that we are already tracking.

metric updates are now done after every event in the main thread.

junrao · 2017-04-07T21:16:57Z

core/src/main/scala/kafka/controller/KafkaController.scala

-      info("Controller startup complete")
-    }
+    controllerEventQueue.put(Startup)
+    controllerEventQueue.put(Elect)


Do we need to do this through the queue? It seems the event thread could just do these two steps at the beginning.

Possible, but I'd rather we try to stick to the following pattern if we can:
all actions that modify controller state should be done as a ControllerEvent processed by the ControllerThread.

The issue is that once the Startup event is enqueued, additional events could be added to the queue by the ZK event thread. Those new events could in theory show up before the Elect event. So, if there is something that we really want the ControllerThread to complete at the beginning, the safest thing is probably to do that in the thread.

I ended up merging the actions into Startup to prevent interleavings.

junrao · 2017-04-07T21:17:08Z

core/src/main/scala/kafka/controller/KafkaController.scala

-      }
+    def handleNewSession(): Unit = {
+      controllerEventQueue.put(Resign(getControllerID()))
+      controllerEventQueue.put(Elect)


Hmm, instead of doing this in 2 separate events, could we just do resign and elect in a single event? Ditto in ControllerChangeListener.handleDataDeleted().

I think it's possible but would prefer to not do it in this patch.

Here are the places where Resign and Elect are added to the queue:

KafkaController.startup adds Elect to the queue.

SessionExpirationListener.handleNewSession adds Resign and Elect to the queue.

ControllerChangeListener.handleDataChange adds Resign to the queue.

ControllerChangeListener.handleDataDeleted adds Resign and Elect to the queue.

I tried refactoring such that the above maintains the same behavior as today's KafkaController.startup, SessionExpirationListener.handleNewSession, LeaderChangeListener.handleDataChange, and LeaderChangeListener.handleDataDeleted. I'd like to try to maintain existing behavior in this patch wherever possible to minimize regressions.

If we merge the two into a single event, we'd end up doing some actions that would work but might not make sense:

KafkaController.startup would unnecessarily clear all state even though state is guaranteed to already be empty.

ControllerChangeListener.handleDataChange would run the election algorithm even though handleDataChange getting triggered should indicate that a new controller has already been elected without znode deletion (setData on /controller could have been used). It's not clear to me if there is some merit to doing this, as later controller changes would hopefully get picked up by ZkClient and enqueued.

We want to be a bit careful with adding the two events separately. If another event is enqueued between the two events, we may miss processing that event since the controller is not ready.

That's an interesting point.

In the current PR, only the ZkClient's ZkEventThread enqueues Resign and Elect. This happens in ControllerChangeListener.handleDataDeleted and SessionExpirationListener.handleNewSession. Since ZkEventThread can't interleave itself, this leaves us with only two types of threads that can interleave an event between a Resign and Elect:

the scheduler can interleave an UpdateMetrics, TopicDeletionProgressCheck, or AutoPreferredReplicaLeaderElection.

the RequestSendThreads can interleave a TopicDeletionStopReplicaResult from the StopReplicaRequest callback.

I'm not too worried about missing one of the above events interleaved between resignation and election since an elected controller should be able to pick up where things left off by reading zookeeper state.

I am worried however about processing interleaved events while not the controller. For instance, I just noticed a bug in the PR where TopicDeletionProgressCheck and TopicDeletionStopReplicaResult don't first check isActive. I'll update the PR now.

Updated the PR which adds the isActive checks to TopicDeletionProgressCheck and TopicDeletionStopReplicaResult.

A simple way to solve this issue is to change the queue element to be a list of ControllerEvents. That way, one can add multiple events atomically.

Yes this would solve the issue.

Personally, I'd rather we just stick to a thread that processes a queue of events rather than a queue of queues. It seems simpler to reason about.

If interleaves are still a concern, we can either replace Elect and Resign with the merged equivalent or we can just keep Elect and Resign and add another event type that does both.
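
To make the last option concrete, here is a small sketch of keeping Resign and Elect while adding a third event that does both. Everything in it is hypothetical: the `ResignAndElect` name, the string log, and the `drain()` helper exist only for this illustration. Since the combined work is a single queue element, no other producer can slip an event between the two steps.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

object ResignAndElectSketch {
  trait ControllerEvent { def process(): Unit }

  class Controller {
    val log = mutable.ArrayBuffer.empty[String]
    val queue = new LinkedBlockingQueue[ControllerEvent]()

    private def resign(): Unit = { log += "resign"; () }
    private def elect(): Unit = { log += "elect"; () }

    case object Resign extends ControllerEvent {
      override def process(): Unit = resign()
    }
    case object Elect extends ControllerEvent {
      override def process(): Unit = elect()
    }
    // Hypothetical combined event: both steps run back to back on the
    // event thread, so nothing can be processed in between.
    case object ResignAndElect extends ControllerEvent {
      override def process(): Unit = { resign(); elect() }
    }
    case object TopicDeletionProgressCheck extends ControllerEvent {
      override def process(): Unit = { log += "deletion-check"; () }
    }

    // Processes everything currently queued, in FIFO order.
    def drain(): Unit = {
      var event = queue.poll()
      while (event != null) {
        event.process()
        event = queue.poll()
      }
    }
  }

  def demo(): List[String] = {
    val controller = new Controller
    controller.queue.put(controller.ResignAndElect)
    controller.queue.put(controller.TopicDeletionProgressCheck)
    controller.drain()
    controller.log.toList
  }
}
```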

asfbot · 2017-04-09T19:37:27Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/2847/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-09T19:40:42Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/2843/
Test PASSed (JDK 7 and Scala 2.10).

asfbot · 2017-04-09T20:05:41Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/2843/
Test FAILed (JDK 8 and Scala 2.12).

ijuma

Thanks for the PR. Great that we're starting to fix issues in the Controller. I had a quick look and left a couple of simple comments.

One important question: what's our testing strategy for these changes? How do we ensure that we are not regressing?

ijuma · 2017-04-11T12:10:57Z

core/src/main/scala/kafka/controller/KafkaController.scala

-      }
+    def handleNewSession(): Unit = {
+      controllerEventQueue.put(Resign(getControllerID()))
+      controllerEventQueue.put(Elect)


A simple way to solve this issue is to change the queue element to be a list of ControllerEvents. That way, one can add multiple events atomically.

ijuma · 2017-04-11T12:12:23Z

core/src/main/scala/kafka/controller/KafkaController.scala

+
+  case class PartitionModifications(topic: String) extends ControllerEvent {
+    override def process(): Unit = {
+      if (!isActive) return


Is there a reason not to do the !isActive check before we call this method? We did something like that in the previous listener classes.

Yup there is a reason. I also wanted to extract this check but there were two issues with doing so:

most but not all event types should do the isActive check (think Startup, Elect, Resign).

different event types handle the isActive check differently. Most simply return if not active, while others such as the ControlledShutdown prepare a response stating that the controller has moved.

Because of these differences, I couldn't for example naively extract the check out to the ControllerEventThread's doWork. Some options I had considered but ultimately decided against:

pattern match in the ControllerEventThread's doWork checking event types and conditionally call isActive.

categorize the event types into those that should have the isActive check and those that shouldn't. With this categorization, we can let the ControllerEventThread's doWork do the pattern match based on these new types and conditionally run the check.

categorize the event types into those that should have the isActive check and those that shouldn't. With this categorization, we can let these new parent types do the check themselves upfront.

I essentially decided against all of these due to the unique behavior of the ControlledShutdown event.

onurkaraman · 2017-04-11T16:45:26Z

@ijuma in terms of testing, I'm relying on existing unit, integration, and system tests.

All our PRs already show unit/integration test results, and I provided a link to a passing system test I ran when I first opened the PR.

asfbot · 2017-04-12T06:40:12Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/2904/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-12T06:47:43Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/2899/
Test PASSed (JDK 7 and Scala 2.10).

asfbot · 2017-04-12T06:50:37Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/2900/
Test PASSed (JDK 8 and Scala 2.12).

onurkaraman · 2017-04-14T02:07:41Z

Regarding the earlier comment on testing, I made a separate ticket and PR that adds controller integration tests:
ticket: https://issues.apache.org/jira/browse/KAFKA-5069
PR: #2853

I put the test in a separate PR with the intent of having the integration tests checked in before this PR so we can test for regressions when switching over to the single-threaded event queue model.

asfbot · 2017-04-19T01:42:44Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3006/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-19T02:05:49Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3001/
Test PASSed (JDK 7 and Scala 2.10).

asfbot · 2017-04-19T02:19:07Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3002/
Test PASSed (JDK 8 and Scala 2.12).

junrao

@onurkaraman : Thanks for rebasing. A few more comments.

junrao · 2017-04-19T00:51:28Z

core/src/main/scala/kafka/controller/KafkaController.scala

-        autoRebalanceScheduler.schedule("partition-rebalance-thread", checkAndTriggerPartitionRebalance,
-          5, config.leaderImbalanceCheckIntervalSeconds.toLong, TimeUnit.SECONDS)
+        kafkaScheduler.schedule("auto-leader-rebalance-task", () => controllerEventQueue.put(AutoPreferredReplicaLeaderElection),
+          delay = 5, period = config.leaderImbalanceCheckIntervalSeconds.toLong, TimeUnit.SECONDS)


There are a couple of issues with modeling the leader balancing tasks as new events in the queue. (1) If the controller event thread is busy for some reason, more than one rebalance event could be queued up, which adds unnecessary load to the event thread. (2) Auto leader balancing is a performance optimization, but is not critical. If there are other real ZK events (e.g., broker down), we want to be able to process those more critical events before leader balancing. If an auto leader balancing event is already in the queue, it will be a bit hard to take it out when a more important event is enqueued.

An alternative approach is to just do the periodic check in the ControllerEventThread.
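
One possible shape for that alternative, sketched under assumptions: the event thread waits on the queue with a timeout and runs the periodic check itself only when no real event arrived before the deadline. `EventLoop`, the injected `nowMs` clock, and the `periodicCheck` callback are invented for this sketch and are not in the PR. Note the trade-off: a permanently busy queue would postpone the check indefinitely.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable

object PeriodicCheckSketch {
  trait ControllerEvent { def process(): Unit }

  class EventLoop(intervalMs: Long, periodicCheck: () => Unit) {
    val queue = new LinkedBlockingQueue[ControllerEvent]()
    private var nextCheckMs = intervalMs

    // One loop iteration. The clock is passed in to keep the sketch
    // deterministic; a real loop would read the system clock.
    def doWork(nowMs: Long): Unit = {
      val waitMs = math.max(0L, nextCheckMs - nowMs)
      // Real events win: poll returns a queued event immediately, even
      // when the periodic check is already due.
      val event = queue.poll(waitMs, TimeUnit.MILLISECONDS)
      if (event != null) event.process()
      else {
        // Timed out with nothing queued, so the check runs exactly once
        // and no backlog of rebalance events can build up.
        periodicCheck()
        nextCheckMs = nowMs + waitMs + intervalMs
      }
    }
  }

  def demo(): List[String] = {
    val log = mutable.ArrayBuffer.empty[String]
    val loop = new EventLoop(1000L, () => { log += "rebalance-check"; () })
    loop.queue.put(new ControllerEvent {
      override def process(): Unit = { log += "broker-change"; () }
    })
    loop.doWork(nowMs = 1000L) // check is due, but the queued event goes first
    loop.doWork(nowMs = 1000L) // queue is empty, so the check runs now
    log.toList
  }
}
```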

junrao · 2017-04-19T01:48:48Z

core/src/main/scala/kafka/controller/KafkaController.scala

+      kafkaScheduler.startup()
+      kafkaScheduler.schedule("controller-metric-task", () => controllerEventQueue.put(UpdateMetrics), period = 10, unit = TimeUnit.SECONDS)
+      if (config.deleteTopicEnable) {
+        kafkaScheduler.schedule("topic-deletion-progress-check-task", () => controllerEventQueue.put(TopicDeletionProgressCheck),


It seems that we can probably get rid of this scheduled thread since the topic deletion check only needs to be triggered on other events that we are already tracking.

junrao · 2017-04-19T01:49:03Z

core/src/main/scala/kafka/controller/KafkaController.scala

+    override def doWork(): Unit = {
+      val controllerEvent = controllerEventQueue.take()
+      try {
+        controllerEvent.process()


It's probably better to check if the controller is active here before processing each event, instead of doing that check in every event. The only exception is if the event is reelecting the controller.

I'm also a bit uncomfortable with manually doing the check in every event. It seems easy to introduce problems when adding new events or changing existing ones. If reelecting the controller is really the only case that needs special treatment, then Jun's suggestion makes sense.

Otherwise, we could have a processIfActive method in ControllerEvent and by default process would do the isActive check, but process could be overridden if necessary.

It's not quite just reelection.

ControlledShutdown, Startup, ControllerChange, and Reelect all have custom behavior, either not checking isActive at all or doing something special like providing a callback with a Failure containing a ControllerMovedException in the case of ControlledShutdown.
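
A sketch of how the processIfActive idea could coexist with the exceptions listed above. This is an assumption-laden illustration, not the PR's design: the default `process()` guards on `isActive`, while `ControlledShutdown` overrides it to fail the callback. The `IllegalStateException` stands in for `ControllerMovedException`, and the callback type is simplified.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

object IsActiveCheckSketch {
  class Controller(var isActive: Boolean) {
    val log = mutable.ArrayBuffer.empty[String]

    trait ControllerEvent {
      // Default behavior: silently skip the event when not the controller.
      def process(): Unit = if (isActive) processIfActive()
      protected def processIfActive(): Unit = ()
    }

    // Typical event: only implements the guarded hook.
    case class PartitionModifications(topic: String) extends ControllerEvent {
      override protected def processIfActive(): Unit = {
        log += s"partition-modifications:$topic"
        ()
      }
    }

    // Special case: overrides process() so an inactive controller still
    // answers the caller instead of dropping the request.
    case class ControlledShutdown(brokerId: Int, callback: Try[Set[Int]] => Unit)
        extends ControllerEvent {
      override def process(): Unit =
        if (!isActive)
          // Stand-in for a Failure containing ControllerMovedException.
          callback(Failure(new IllegalStateException("controller moved")))
        else
          callback(Success(Set.empty[Int]))
    }
  }

  def demo(): (Int, Boolean) = {
    val controller = new Controller(isActive = false)
    var result: Option[Try[Set[Int]]] = None
    controller.PartitionModifications("t").process()
    controller.ControlledShutdown(1, r => result = Some(r)).process()
    (controller.log.size, result.exists(_.isFailure))
  }
}
```

With an inactive controller, the ordinary event leaves the log empty while the shutdown request still receives a failure.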

junrao · 2017-04-19T01:49:50Z

core/src/main/scala/kafka/controller/KafkaController.scala

+    override def process(): Unit = {
+      registerSessionExpirationListener()
+      registerControllerChangeListener()
+      isRunning = true


Now that we are moving to a single threaded model, it seems we don't really need isRunning. If the ControllerEventThread is running, it implies isRunning is true.

You're right. I just removed isRunning.

junrao · 2017-04-19T01:50:07Z

core/src/main/scala/kafka/controller/TopicDeletionManager.scala

-class TopicDeletionManager(controller: KafkaController,
-                           initialTopicsToBeDeleted: Set[String] = Set.empty,
-                           initialTopicsIneligibleForDeletion: Set[String] = Set.empty) extends Logging {
+class TopicDeletionManager(controller: KafkaController, controllerEventQueue: LinkedBlockingQueue[ControllerEvent], initialTopicsToBeDeleted: Set[String] = Set.empty, initialTopicsIneligibleForDeletion: Set[String] = Set.empty) extends Logging {


We need to adjust the comments before the class accordingly. For example, 3.3 is probably no longer valid since preferred replica election can't be mixed with topic deletion any more because of the single threaded model.

junrao · 2017-04-19T15:30:08Z

core/src/main/scala/kafka/controller/KafkaController.scala

+class TopicDeletionListener(protected val controller: KafkaController, controllerEventQueue: LinkedBlockingQueue[ControllerEvent]) extends IZkChildListener with Logging {
+  override def handleChildChange(parentPath: String, currentChilds: java.util.List[String]): Unit = {
+    import scala.collection.JavaConverters._
+    controllerEventQueue.put(controller.TopicDeletion(currentChilds.asScala.toSet))


Hmm, there is a slight change of behavior here. Earlier, if a topic deletion is initiated, it will be processed immediately. Now, the topic deletion will only be processed when the scheduler adds a topic deletion event. This means that topic deletion could be delayed by up to 5 seconds, which is a degradation.

I updated the PR to actually start doing topic deletion work in the TopicDeletion event instead of waiting for the scheduled event to make progress.

junrao · 2017-04-19T15:41:39Z

core/src/main/scala/kafka/controller/KafkaController.scala

-        info("ZK expired, but the current controller id %d is the same as this broker id, skip re-elect".format(config.brokerId))
-      }
+    def handleNewSession(): Unit = {
+      controllerEventQueue.put(Resign(getControllerID()))


Hmm, in all other listeners, we pass in controllerEventQueue. Here, we access controllerEventQueue directly. Does SessionExpirationListener need to be a local Class?

Nope. I just moved it out.

junrao · 2017-04-19T15:56:07Z

core/src/main/scala/kafka/controller/KafkaController.scala

-      info("Controller startup complete")
-    }
+    controllerEventQueue.put(Startup)
+    controllerEventQueue.put(Elect)


The issue is that once the Startup event is enqueued, additional events could be added to the queue by the ZK event thread. Those new events could in theory show up before the Elect event. So, if there is something that we really want the ControllerThread to complete at the beginning, the safest thing is probably to do that in the thread.

junrao

@onurkaraman : Thanks for the patch. A few more comments.

junrao · 2017-04-19T23:23:18Z

core/src/main/scala/kafka/controller/KafkaController.scala

@@ -658,13 +580,13 @@ class KafkaController(val config: KafkaConfig, zkUtils: ZkUtils, val brokerState
    info("Starting preferred replica leader election for partitions %s".format(partitions.mkString(",")))
    try {
      controllerContext.partitionsUndergoingPreferredReplicaElection ++= partitions
-      deleteTopicManager.markTopicIneligibleForDeletion(partitions.map(_.topic))
+      topicDeletionManager.markTopicIneligibleForDeletion(partitions.map(_.topic))


Now that we are moving to a single threaded model, it seems this can be simplified. We know the preferred leader election can't interleave with topic deletion. So, the only thing we need to do is to avoid balancing topics that are pending deletion.

junrao · 2017-04-19T23:25:23Z

core/src/main/scala/kafka/controller/KafkaController.scala

+/**
+  * This is the zookeeper listener that triggers all the state transitions for a replica
+  */
+class BrokerChangeListener(protected val controller: KafkaController, controllerEventQueue: LinkedBlockingQueue[ControllerEvent]) extends IZkChildListener with Logging {


I am a bit worried about passing along the controllerEventQueue to each of the listeners since we may lose track of who can add new events to the queue. Since we are passing in KafkaController anyway, we could probably expose a public method like addToControllerEventQueue() in Controller and avoid passing in controllerEventQueue. The listener will call addToControllerEventQueue() to enqueue events. This way, it's much easier to track who is calling addToControllerEventQueue().
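
Roughly what that could look like, as a sketch only. `addToControllerEventQueue` is the name suggested above; the trimmed-down `Controller`, the `BrokerChange` event, and `pendingEvents` are assumptions made for the example. The queue stays private, so every producer has to go through one method.

```scala
import java.util.concurrent.LinkedBlockingQueue

object EnqueueMethodSketch {
  trait ControllerEvent { def process(): Unit }

  class Controller {
    // Private: listeners can no longer hold a reference to the queue.
    private val controllerEventQueue = new LinkedBlockingQueue[ControllerEvent]()

    // Single entry point, easy to find all callers of.
    def addToControllerEventQueue(event: ControllerEvent): Unit =
      controllerEventQueue.put(event)

    // Exposed only so the sketch can be checked.
    def pendingEvents: Int = controllerEventQueue.size

    case class BrokerChange(brokerIds: Set[Int]) extends ControllerEvent {
      override def process(): Unit = ()
    }
  }

  // The listener receives only the controller, not the queue.
  class BrokerChangeListener(controller: Controller) {
    def handleChildChange(parentPath: String, currentChilds: java.util.List[String]): Unit = {
      import scala.collection.JavaConverters._
      val brokerIds = currentChilds.asScala.map(_.toInt).toSet
      controller.addToControllerEventQueue(controller.BrokerChange(brokerIds))
    }
  }

  def demo(): Int = {
    val controller = new Controller
    val listener = new BrokerChangeListener(controller)
    listener.handleChildChange("/brokers/ids", java.util.Arrays.asList("1", "2"))
    controller.pendingEvents
  }
}
```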

junrao · 2017-04-19T23:43:11Z

core/src/main/scala/kafka/controller/KafkaController.scala

+              controllerContext.partitionsUndergoingPreferredReplicaElection.map(_.topic).contains(topic)
+            val partitionReassignmentInProgress =
+              controllerContext.partitionsBeingReassigned.keySet.map(_.topic).contains(topic)
+            if (preferredReplicaElectionInProgress || partitionReassignmentInProgress)


Again, since we know preferred leader election can't be interleaving, there is no need to check preferredReplicaElectionInProgress.

In general, we probably don't need to maintain ControllerContext.partitionsUndergoingPreferredReplicaElection. It's intended to track partitions undergoing preferred replica election. In a single threaded model, we don't really need that information in ControllerContext.

There's one place where having ControllerContext.partitionsUndergoingPreferredReplicaElection still kind of helps: onControllerFailover.

The current behavior is to first load all zookeeper state into ControllerContext in initializeControllerContext and only later trigger in-progress actions like partition reassignment, preferred replica election, and topic deletion. Without being in ControllerContext, we'd need to pass partitionsUndergoingPreferredReplicaElection back out from initializeControllerContext into maybeTriggerPreferredReplicaElection, which is a bit awkward.

I do like all of the potential simplifications we can get regarding preferred replica election as you state in a few of the comments, but I'd rather we introduce those changes in a separate patch.

Hmm, is there a particular reason that you want to do the simplification in a separate patch? Sometimes we defer cleanups to avoid the rebasing overhead. However, in this case, no one else is touching the controller code. So this shouldn't be an issue. Also, could you preserve the commit history when updating the patch? That will make it easier to review the delta changes.

Again it's to minimize risk. I'd like to keep this PR as much as possible just a mechanical refactoring of existing code and to leave other changes to a separate PR.

Ok, we can do these in a followup patch if you think it's more convenient. I just don't want us to forget about those simplification improvements.

Let's make sure we file subtasks under the umbrella JIRA for things that we decide to do in follow-up PRs. That way, we can make sure we don't forget them.

junrao · 2017-04-20T00:00:18Z

core/src/main/scala/kafka/controller/TopicDeletionManager.scala

@@ -178,7 +148,6 @@ class TopicDeletionManager(controller: KafkaController,
          .format(replicasThatFailedToDelete.mkString(","), topics))
        controller.replicaStateMachine.handleStateChanges(replicasThatFailedToDelete, ReplicaDeletionIneligible)
        markTopicIneligibleForDeletion(topics)
-        resumeTopicDeletionThread()


The comment before markTopicIneligibleForDeletion() needs to be adjusted since we don't need to call that method due to preferred leader election.

junrao · 2017-04-20T00:30:27Z

core/src/main/scala/kafka/controller/KafkaController.scala

+    // shutdown partition state machine
+    partitionStateMachine.shutdown()
+    deregisterTopicChangeListener()
+    partitionModificationsListeners.keys.foreach(deregisterPartitionModificationsListener)


Instead of maintaining partitionModificationsListeners, could we just do the deregistration based on ControllerContext.allTopics like what we did in controllerFailover()?

We need to maintain some sort of mapping from path to PartitionModificationsListener objects somewhere, or at the very least some way of associating an existing IZkDataListener instance to its corresponding path because ZkClient's unsubscribe variants take both the path and corresponding listener instance. For example:

public void unsubscribeDataChanges(String path, IZkDataListener dataListener)

I've considered defining wrapper listener classes comprised of a zkclient, raw zkclient listener, and path such that you can just call MyWrapperListener.subscribe()/unsubscribe() and it'll internally call zkClient.subscribeDataChanges(path, rawZkClientListener) but decided to leave it out of this patch to minimize the changes.

Ok, thanks for the explanation. We can leave the code as it is then.
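
For the record, the wrapper idea mentioned above might look something like the following. This is a sketch with stand-ins: `FakeZkClient` and `DataListener` replace the real ZkClient and `IZkDataListener` so the example has no dependencies, and `WrapperListener` is a hypothetical name. Only the `subscribeDataChanges`/`unsubscribeDataChanges(path, listener)` call shape is taken from the discussion.

```scala
import scala.collection.mutable

object ListenerWrapperSketch {
  // Stand-in for IZkDataListener.
  trait DataListener

  // Stand-in exposing the same (path, listener) call shape as ZkClient.
  class FakeZkClient {
    val subscriptions = mutable.Set.empty[(String, DataListener)]
    def subscribeDataChanges(path: String, listener: DataListener): Unit = {
      subscriptions += ((path, listener))
      ()
    }
    def unsubscribeDataChanges(path: String, listener: DataListener): Unit = {
      subscriptions -= ((path, listener))
      ()
    }
  }

  // Bundles client, path, and raw listener so callers need no side map
  // from path to listener instance in order to unsubscribe.
  class WrapperListener(zkClient: FakeZkClient, path: String, raw: DataListener) {
    def subscribe(): Unit = zkClient.subscribeDataChanges(path, raw)
    def unsubscribe(): Unit = zkClient.unsubscribeDataChanges(path, raw)
  }

  def demo(): (Int, Int) = {
    val zkClient = new FakeZkClient
    val wrapper = new WrapperListener(zkClient, "/brokers/topics/t", new DataListener {})
    wrapper.subscribe()
    val whileSubscribed = zkClient.subscriptions.size
    wrapper.unsubscribe()
    (whileSubscribed, zkClient.subscriptions.size)
  }
}
```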

junrao · 2017-04-20T00:49:27Z

core/src/main/scala/kafka/controller/KafkaController.scala

+    deregisterTopicDeletionListener()
+    // shutdown replica state machine
+    replicaStateMachine.shutdown()
+    deregisterBrokerChangeListener()


Could we de-register all listeners together after line 291?

junrao · 2017-04-20T01:03:51Z

core/src/main/scala/kafka/server/KafkaApis.scala

+          Errors.NONE, partitionsRemaining)
+        requestChannel.sendResponse(new Response(request, new RequestOrResponseSend(request.connectionId, controlledShutdownResponse)))
+      } else {
+        request.requestObj.handleError(controlledShutdownResult.failed.get, requestChannel, request)


request.requestObj can just be controlledShutdownRequest?

asfbot · 2017-04-20T22:10:21Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3073/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-20T22:22:52Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3068/
Test FAILed (JDK 7 and Scala 2.10).

asfbot · 2017-04-20T22:29:48Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3069/
Test FAILed (JDK 8 and Scala 2.12).

onurkaraman · 2017-04-20T22:45:12Z

retest this please

asfbot · 2017-04-20T23:39:38Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3080/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-21T00:00:27Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3075/
Test FAILed (JDK 7 and Scala 2.10).

onurkaraman · 2017-04-26T00:36:01Z

This PR passed the system tests after rebasing:
https://jenkins.confluent.io/job/system-test-kafka-branch-builder-2/274/

junrao

@onurkaraman : Thanks for the patch. A few more comments. Some of them can be addressed in a followup patch if you prefer. Also, could you rebase?

junrao · 2017-04-26T21:30:33Z

core/src/main/scala/kafka/controller/KafkaController.scala

-    controller.sendUpdateMetadataRequest(liveBrokers, topicAndPartitions)
+  case object AutoPreferredReplicaLeaderElection extends ControllerEvent {
+    override def process(): Unit = {
+      scheduleAutoLeaderRebalanceTask(delay = config.leaderImbalanceCheckIntervalSeconds, unit = TimeUnit.SECONDS)


Hmm, if the controller is not active, we don't want to schedule the leader balancing task right? So we probably want to schedule this after checkAndTriggerPartitionRebalance() is done.

Since we aren't letting the scheduler periodically inject the event into the queue, any single manual event injection we miss means that there will be no later injection until we become re-elected as controller. This is another reason why I wanted the scheduler to do the periodic event injections.

We risk skipping the next AutoPreferredReplicaLeaderElection if we naively put it after checkAndTriggerPartitionRebalance(), as checkAndTriggerPartitionRebalance() could throw an exception.

Putting the schedule line before the checkAndTriggerPartitionRebalance() prevents this from happening.

We can alternatively wrap the call to checkAndTriggerPartitionRebalance() in a try/finally and put the schedule call in the finally block as well.

I'm open to doing the try/finally, keeping the code as is, or reverting the logic to just let the scheduler periodically inject the event.

Doing this in a finally clause sounds good.
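
A minimal sketch of the try/finally variant agreed on here, with assumptions called out: the scheduler is replaced by a counter, and the `Controller` class and `isActive` flag are simplified stand-ins. The next run is scheduled even if the rebalance check throws, and nothing is scheduled when the controller is not active.

```scala
import scala.util.Try

object RescheduleSketch {
  class Controller(checkAndTriggerPartitionRebalance: () => Unit) {
    var isActive = true
    // Counts scheduling calls; stands in for the real scheduler.
    var scheduledCount = 0
    private def scheduleAutoLeaderRebalanceTask(): Unit = scheduledCount += 1

    def processAutoPreferredReplicaLeaderElection(): Unit = {
      if (isActive) {
        try checkAndTriggerPartitionRebalance()
        // Runs whether or not the check threw, so the chain of manually
        // injected events is never broken by an exception.
        finally scheduleAutoLeaderRebalanceTask()
      }
    }
  }

  def demo(): (Int, Int) = {
    val failing = new Controller(() => throw new IllegalStateException("boom"))
    Try(failing.processAutoPreferredReplicaLeaderElection())

    val inactive = new Controller(() => ())
    inactive.isActive = false
    inactive.processAutoPreferredReplicaLeaderElection()

    (failing.scheduledCount, inactive.scheduledCount)
  }
}
```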

junrao · 2017-04-26T21:39:36Z

core/src/main/scala/kafka/controller/KafkaController.scala

-    this.logIdent = "[SessionExpirationListener on " + config.brokerId + "], "
+  private def checkAndTriggerPartitionRebalance(): Unit = {
+    trace("checking need to trigger partition rebalance")
+    // get all the active brokers


This comment is not accurate.

removed it.

junrao · 2017-04-26T21:58:38Z

core/src/main/scala/kafka/controller/KafkaController.scala

+     * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
+     */
+    if(activeControllerId.get() != -1) {
+      debug("Broker %d has been elected as leader, so stopping the election process.".format(activeControllerId.get()))


Now that we don't have a generic ZookeeperLeaderElector, the log can be changed to "elected as the controller" to make it clear.

junrao · 2017-04-26T21:58:41Z

core/src/main/scala/kafka/controller/KafkaController.scala

+                                                      controllerContext.zkUtils.zkConnection.getZookeeper,
+                                                      controllerContext.zkUtils.isSecure)
+      zkCheckedEphemeral.create()
+      info(config.brokerId + " successfully elected as leader")


Same as the above. Probably change to "elected as the controller". Same on lines 1582 and 1585.

junrao · 2017-04-26T23:08:06Z

core/src/main/scala/kafka/controller/KafkaController.scala

+      if (wasActiveBeforeChange && !isActive) {
+        onControllerResignation()
+      }
+      elect()


Hmm, this logic has changed slightly from before. Previously, the code was as follows. So, if this broker is the controller after reading the controller id from ZK, we actually skip elect().

    if (controllerElector.getControllerID() != config.brokerId) {
      onControllerResignation()
      inLock(controllerContext.controllerLock) {
        controllerElector.elect
      }
    } else {
      // This can happen when there are multiple consecutive session expiration and handleNewSession() are called multiple
      // times. The first call may already register the controller path using the newest ZK session. Therefore, the
      // controller path will exist in subsequent calls to handleNewSession().
      info("ZK expired, but the current controller id %d is the same as this broker id, skip re-elect".format(config.brokerId))
    }
  }
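
To make the difference concrete, here is a rough model of the two code paths quoted in this comment. It is only a sketch: `ReelectSketch`, `oldSteps` and `newSteps` are hypothetical names, and the returned strings merely echo the method names in the snippets. Note that the elect() snippet quoted earlier in this review logs "stopping the election process" when activeControllerId is not -1, so the extra call may be harmless, but that is an assumption not confirmed here.

```scala
// Hypothetical, simplified models of the two code paths quoted above.
object ReelectSketch {
  // Old handleNewSession(): skip the election when ZK already names this broker.
  def oldSteps(controllerIdInZk: Int, brokerId: Int): List[String] =
    if (controllerIdInZk != brokerId) List("onControllerResignation", "elect")
    else List("skip re-elect")

  // New path: resign only if the role was lost, then always call elect().
  def newSteps(wasActiveBeforeChange: Boolean, isActive: Boolean): List[String] = {
    val resign =
      if (wasActiveBeforeChange && !isActive) List("onControllerResignation")
      else Nil
    resign :+ "elect"
  }
}
```

In the case junrao describes (this broker is still the controller), the old model yields only "skip re-elect" while the new one still reaches "elect".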

junrao · 2017-04-26T23:35:29Z

core/src/main/scala/kafka/controller/KafkaController.scala

      partitionStateMachine.handleStateChanges(partitions, OnlinePartition, preferredReplicaPartitionLeaderSelector)
    } catch {
      case e: Throwable => error("Error completing preferred replica leader election for partitions %s".format(partitions.mkString(",")), e)
    } finally {
      removePartitionsFromPreferredReplicaElection(partitions, isTriggeredByAutoRebalance)
-      deleteTopicManager.resumeDeletionForTopics(partitions.map(_.topic))
+      topicDeletionManager.resumeDeletionForTopics(partitions.map(_.topic))


Since the preferred leader balancing won't be run concurrently now, it seems that we don't need to resume topic deletion after rebalance since that won't change the failure state of a replica.

junrao · 2017-04-26T23:52:30Z

core/src/main/scala/kafka/controller/KafkaController.scala

    }
  }
+
+  case class TopicDeletionStopReplicaResult(stopReplicaResponseObj: AbstractResponse, replicaId: Int) extends ControllerEvent {


TopicDeletionStopReplicaResult => TopicDeletionStopReplicaResultEvent?

None of the other events end in "Event" so to stay consistent, I'd rather keep it as is.

junrao · 2017-04-27T00:10:36Z

core/src/main/scala/kafka/controller/TopicDeletionManager.scala

- *    (though this is not strictly required since it holds the controller lock for the entire duration from start to end)
+  *   3.1 broker hosting one of the replicas for that topic goes down
+  *   3.2 partition reassignment for partitions of that topic is in progress
+  *   3.3 preferred replica election for partitions of that topic is in progress


3.3 is no longer true since a preferred replica election can't be in progress when a topic deletion event is being handled. Ditto in 4.2.
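
The reasoning behind this comment is that events are handled one at a time by a single thread, so one event cannot be in progress while another is being processed. A toy version of that model, assuming nothing about the actual classes in the patch (`Event`, `EventLoop` and the event names below are hypothetical), might look like this:

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical names; this is not the controller's actual event API.
trait Event { def process(): Unit }

final class EventLoop {
  private val queue = new LinkedBlockingQueue[Event]()
  // Sentinel event that tells the loop thread to exit.
  private object Stop extends Event { def process(): Unit = () }

  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) {
        val e = queue.take() // blocks until an event is available
        if (e eq Stop) running = false else e.process()
      }
    }
  }, "event-loop")

  def start(): Unit = thread.start()
  def put(e: Event): Unit = queue.put(e)
  def shutdown(): Unit = { queue.put(Stop); thread.join() }
}

object EventLoopDemo {
  def run(): List[String] = {
    // Mutated only by the event-loop thread; read here only after join().
    val log = ArrayBuffer.empty[String]
    def event(name: String): Event = new Event {
      def process(): Unit = { log += s"$name:start"; log += s"$name:end" }
    }
    val loop = new EventLoop
    loop.start()
    loop.put(event("preferredReplicaElection"))
    loop.put(event("topicDeletion"))
    loop.shutdown()
    log.toList
  }
}
```

Because both events run on the same thread in FIFO order, the first event's start and end entries are always logged before the second event begins, and the shared log needs no lock.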

junrao · 2017-04-27T00:18:15Z

core/src/main/scala/kafka/controller/TopicDeletionManager.scala

@@ -156,7 +128,7 @@ class TopicDeletionManager(controller: KafkaController,
      val topicsToResumeDeletion = topics & topicsToBeDeleted
      if(topicsToResumeDeletion.nonEmpty) {
        topicsIneligibleForDeletion --= topicsToResumeDeletion
-        resumeTopicDeletionThread()
+        resumeDeletions()


This can be done in a followup patch, but it doesn't seem that resumeDeletionForTopics() and resumeDeletionForTopics() need to be called due to preferred leader election.

asfbot · 2017-04-27T14:19:07Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/3209/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-04-27T15:17:27Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/3219/
Test PASSed (JDK 8 and Scala 2.11).

asfbot · 2017-04-27T15:19:47Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/3212/
Test PASSed (JDK 7 and Scala 2.10).

junrao · 2017-04-27T15:50:11Z

@onurkaraman : Thanks for the patch. LGTM. Let's address the remaining minor comments in a followup patch.

ijuma · 2017-04-27T16:00:05Z

Yay :)

guozhangwang · 2017-05-11T04:25:54Z

@onurkaraman This is a great improvement! I'm wondering if we can make a pass over the open JIRAs related to locking issues on the controller side and mark all of them resolved in the next release :)

onurkaraman · 2017-05-11T05:50:33Z

@guozhangwang sounds good.

junrao reviewed Apr 7, 2017

onurkaraman force-pushed the KAFKA-5028 branch from d74cad8 to de223ab on April 9, 2017 18:45

onurkaraman mentioned this pull request Apr 10, 2017: MINOR: Make LeaderAndIsr immutable case class. #2731 (Closed)

ijuma reviewed Apr 11, 2017

onurkaraman force-pushed the KAFKA-5028 branch from de223ab to 3db3042 on April 12, 2017 05:38

onurkaraman force-pushed the KAFKA-5028 branch from 3db3042 to 2bbe99c on April 19, 2017 00:45

junrao reviewed Apr 19, 2017

junrao reviewed Apr 20, 2017

onurkaraman force-pushed the KAFKA-5028 branch from 2bbe99c to 49f4ab9 on April 20, 2017 21:14

junrao reviewed Apr 27, 2017

onurkaraman added 19 commits April 26, 2017 19:50

001bfd2 remove isRunning variable from KafkaController
41a53e5 move SessionExpirationListener out of KafkaController
d0824f1 move topic deletion out of the scheduler
c8fa139 do not expose controllerEventQueue
8974d3d do not resume deletions upon TopicDeletionManager shutdown
cbd2b2b merge Startup and Elect events into Startup. convert Resign to ControllerChange. merge Resign and Elect into Reelect.
b71c6d6 AutoPreferredReplicaLeaderElection should have a fixed delay between runs
895b1d6 move UpdateMetrics to a method which is run after every event
3b24ca7 fix format exception when printing out activeControllerId
d11b921 fix NoSuchElementException associated with moving UpdateMetrics to a method
6ea12b5 cleanup TopicDeletionManager javadocs
0bceb20 remove protected val from the zkclient listeners
cf0882e handleControlledShutdownRequest should pattern match on the Try as well as directly use the controlledShutdownRequest for handling errors
68c9fc2 remove redundant scala package prefixes
7ef498a remove extraneous call to resume topic deletion on the TopicDeletionManager at the end of preferred replica leader election
4888e4a remove inaccurate comment
84e3822 cleanup log statements and javadocs
60d7ec6 move the chained schedule for AutoPreferredReplicaLeaderElection to the end in a finally block

onurkaraman force-pushed the KAFKA-5028 branch from b1bee0e to 60d7ec6 on April 27, 2017 04:52

asfgit closed this in bb663d0 Apr 27, 2017
