KAFKA-6688. The Trogdor coordinator should track task statuses #4737

cmccabe · 2018-03-20T05:17:26Z

No description provided.

cmccabe · 2018-03-20T16:18:09Z

The test failure is in kafka.api.ConsumerBounceTest.testCloseDuringRebalance and is not related to this change.

apovzner · 2018-03-20T17:21:47Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/NodeManager.java

@@ -192,6 +191,9 @@ public void run() {
                    // agents going down?
                    return;
                }
+                if (log.isTraceEnabled()) {


Do we need this if, since we are not computing any of the arguments we pass to log.trace?

agentStatus#toString is pretty heavy. If you have a dozen workers running, it will serialize information about all of them into a string.
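To make the trade-off concrete, here is a minimal sketch of why an enabled-check can matter when the log argument is built at the call site. TinyLogger and FakeAgentStatus are hypothetical stand-ins, not Trogdor or SLF4J classes. With SLF4J's parameterized logging, an object passed as a {} argument is only formatted when the level is enabled, so the guard mostly pays off when the call site itself produces the string.

```java
public class GuardedTraceDemo {
    // Hypothetical stand-in for a logger; only the enabled check matters here.
    public static final class TinyLogger {
        private final boolean traceEnabled;

        public TinyLogger(boolean traceEnabled) {
            this.traceEnabled = traceEnabled;
        }

        public boolean isTraceEnabled() {
            return traceEnabled;
        }

        public void trace(String message) {
            if (traceEnabled) {
                System.out.println("TRACE " + message);
            }
        }
    }

    // Hypothetical stand-in for a status object whose string form is expensive.
    public static final class FakeAgentStatus {
        int serializations = 0;

        public String toJson() {
            serializations++;
            return "{\"workers\":12}";
        }
    }

    // The message string is built before trace() is called, so the
    // serialization cost is paid even when trace logging is off.
    public static int logUnguarded(TinyLogger log, FakeAgentStatus status) {
        log.trace("status: " + status.toJson());
        return status.serializations;
    }

    // The guard skips the serialization entirely when trace logging is off.
    public static int logGuarded(TinyLogger log, FakeAgentStatus status) {
        if (log.isTraceEnabled()) {
            log.trace("status: " + status.toJson());
        }
        return status.serializations;
    }

    public static void main(String[] args) {
        TinyLogger off = new TinyLogger(false);
        System.out.println(logUnguarded(off, new FakeAgentStatus())); // prints 1
        System.out.println(logGuarded(off, new FakeAgentStatus()));   // prints 0
    }
}
```

Both methods return the number of serializations performed, which makes the saved work visible without timing anything.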

apovzner · 2018-03-20T17:24:44Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/NodeManager.java

+                        }
+                        // Notify the TaskManager if the worker state has changed.
+                        if (!worker.state.equals(state)) {
+                            log.info("{}: WATEREMLON: worker state changed to {}", node.name(), state);


What's "WATEREMLON"?

Sorry, I forgot to take that out. Let me fix up these log messages.

@cmccabe Looks like the log entry hasn't been updated.

Fixed. I wasn't able to search for it because it was misspelled 😞

apovzner · 2018-03-20T18:40:50Z

tools/src/main/java/org/apache/kafka/trogdor/workload/ProduceBenchWorker.java

+        StatusData update() {
+            Histogram.Summary summary = histogram.summarize(percentiles);
+            StatusData statusData = new StatusData(summary.numSamples(), summary.average(),
+                summary.percentiles().get(0).value(),


It would be great to make this a bit more robust with respect to adding or removing percentiles from Histogram. But I see we already do this in other places. Shall I file a JIRA to improve that?

Hmm, Histogram is internally synchronized, so there should be no conflict. Maybe I misunderstood the question.

Oh, I meant that we explicitly get the first three values from percentiles.

Actually, it's fine, but I think we should put a comment in the StatusUpdater constructor saying that anyone changing the percentiles also needs to keep them consistent with the JSON properties in StatusData and with how we construct it. The reason is that I just noticed that we actually create these percentiles: this.percentiles = new float[] {0.50f, 0.95f, 0.99f}; but StatusData has @JsonProperty("p90LatencyMs") int p90latencyMs instead of p95.

Good catch, @apovzner. I fixed the discrepancy. I also moved the array into the StatusUpdater class and added a comment.
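One way to rule out this kind of drift, sketched under assumptions, is to declare each percentile next to the field name it feeds, so the two cannot be changed independently. The enum, the helper methods, and the field names p50LatencyMs, p95LatencyMs and p99LatencyMs are illustrative only. This is not the code that was merged, which kept the array in StatusUpdater with a comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PercentileFieldsDemo {
    // Hypothetical: each percentile carries the JSON field name it populates.
    public enum LatencyPercentile {
        P50(0.50f, "p50LatencyMs"),
        P95(0.95f, "p95LatencyMs"),
        P99(0.99f, "p99LatencyMs");

        final float fraction;
        final String jsonName;

        LatencyPercentile(float fraction, String jsonName) {
            this.fraction = fraction;
            this.jsonName = jsonName;
        }
    }

    // The array handed to the histogram is derived from the enum,
    // so it always has the same order and length as the field names.
    public static float[] fractions() {
        LatencyPercentile[] all = LatencyPercentile.values();
        float[] result = new float[all.length];
        for (int i = 0; i < all.length; i++) {
            result[i] = all[i].fraction;
        }
        return result;
    }

    // Pairs summarized values (in fractions() order) with their field names.
    public static Map<String, Integer> toStatusFields(int[] values) {
        LatencyPercentile[] all = LatencyPercentile.values();
        if (values.length != all.length) {
            throw new IllegalArgumentException(
                "expected " + all.length + " values, got " + values.length);
        }
        Map<String, Integer> fields = new LinkedHashMap<>();
        for (int i = 0; i < all.length; i++) {
            fields.put(all[i].jsonName, values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints {p50LatencyMs=3, p95LatencyMs=17, p99LatencyMs=42}
        System.out.println(toStatusFields(new int[] {3, 17, 42}));
    }
}
```

The length check turns a mismatch between the summarized values and the declared percentiles into an immediate error rather than a silently mislabeled field.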

cmccabe · 2018-03-22T18:24:41Z

retest this please

rajinisivaram

@cmccabe Thanks for the PR. Looks good. Left a few minor comments. There is also an outstanding question from @apovzner about the histogram. Once they are addressed, I can merge this.

rajinisivaram · 2018-04-05T22:37:48Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/NodeManager.java

+                        }
+                        // Notify the TaskManager if the worker state has changed.
+                        if (!worker.state.equals(state)) {
+                            log.info("{}: WATEREMLON: worker state changed to {}", node.name(), state);


@cmccabe Looks like the log entry hasn't been updated.

rajinisivaram · 2018-04-05T22:37:59Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/NodeManager.java

+                            worker.state = state;
+                            taskManager.updateWorkerState(node.name(), worker.id, state);
+                        } else {
+                            log.info("{}: WATEREMLON: worker state was {}, is now {}", node.name(), worker.state, state);


Same as before, update log line?

rajinisivaram · 2018-04-05T22:39:07Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/TaskManager.java

+                task.error.isEmpty() ? "(none)" : task.error);
+        } else if (task.state == ManagedTaskState.RUNNING) {
+            TreeSet<String> activeWorkers = task.activeWorkers();
+            log.info("Node {} stopped.  Stopping task {} on worker(s): {}",


Missing nodeName in the log line?

apovzner · 2018-04-05T23:05:25Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/TaskManager.java

+            for (String workerName : activeWorkers) {
+                nodeManagers.get(workerName).stopWorker(task.id);
+            }
+        }


It looks like before moving this code into a separate method, we handled the PENDING ManagedTaskState. Or is this state not possible when we are in this method? Maybe we should check and throw an exception in that case? Also, what if the task is in the STOPPING state?

> It looks like before moving this code into a separate method, we handled the PENDING ManagedTaskState. Or is this state not possible when we are in this method?

Yes, PENDING is impossible here because at that point the worker hasn't been started.

> Also, what if the task is in the STOPPING state?

There's no additional action needed in that case.

The only transition here is that if one worker fails while a task is RUNNING, the task will transition into STOPPING and the other workers will be stopped. There is nothing to do if the task isn't RUNNING.

Actually, come to think of it, we should probably not start stopping the other tasks unless the first worker stopped with an error.
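The transition described in this thread can be sketched as follows. This is a simplified, hypothetical model (the Task class, the onWorkerStopped method, and the state enum are illustrative), not the TaskManager code from this PR. It assumes that a clean worker stop leaves the task state alone, and that only a stop with an error while the task is RUNNING moves it to STOPPING and stops the remaining workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class WorkerStopDemo {
    public enum TaskState { PENDING, RUNNING, STOPPING, DONE }

    // Hypothetical, simplified task record.
    public static final class Task {
        public TaskState state = TaskState.RUNNING;
        public final TreeSet<String> activeWorkers = new TreeSet<>();
        public String error = "";
    }

    // Handles one worker stopping. Returns the names of the workers that
    // should now be asked to stop as a consequence.
    public static List<String> onWorkerStopped(Task task, String nodeName, String workerError) {
        task.activeWorkers.remove(nodeName);
        List<String> toStop = new ArrayList<>();
        if (task.state != TaskState.RUNNING) {
            // PENDING cannot occur (no worker has started yet), and a
            // STOPPING task needs no additional action.
            return toStop;
        }
        if (workerError.isEmpty()) {
            // A clean stop does not trigger stopping the other workers.
            return toStop;
        }
        task.error = workerError;
        task.state = TaskState.STOPPING;
        toStop.addAll(task.activeWorkers);
        return toStop;
    }

    public static void main(String[] args) {
        Task task = new Task();
        task.activeWorkers.add("node01");
        task.activeWorkers.add("node02");
        task.activeWorkers.add("node03");
        System.out.println(onWorkerStopped(task, "node02", ""));    // prints []
        System.out.println(onWorkerStopped(task, "node01", "baz")); // prints [node03]
        System.out.println(task.state);                             // prints STOPPING
    }
}
```

Returning the list of workers to stop, rather than stopping them inside the method, keeps the state transition easy to test in isolation.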

rajinisivaram · 2018-04-06T08:59:12Z

tools/src/main/java/org/apache/kafka/trogdor/coordinator/TaskManager.java

-            log.info("Node {} stopped.  Stopping task {} on worker(s): {}",
-                task.id, Utils.join(activeWorkers, ", "));
+            log.info("Node {} stopped with error {}.  Stopping task {} on worker(s): {}",
+                nodeName, task.id, Utils.join(activeWorkers, ", "));


Do we want task.error in the log entry? The new format string has a placeholder for the error, but no matching argument is passed.

apovzner · 2018-04-06T18:49:06Z

Looks like my changes conflicted with yours, but there is also a unit test failure (before the conflict happened):

17:02:00 org.apache.kafka.trogdor.agent.AgentTest > testWorkerCompletions FAILED
17:02:00     java.lang.AssertionError: Condition not met within timeout 15000. Timed out waiting for expected workers {"bar":{"id":"bar","workerState":{"state":"RUNNING","spec":{"class":"org.apache.kafka.trogdor.task.SampleTaskSpec","startMs":0,"durationMs":900000,"nodeToExitMs":{"node01":2},"error":"baz"},"startedMs":0,"status":"started"}},"foo":{"id":"foo","workerState":{"state":"DONE","spec":{"class":"org.apache.kafka.trogdor.task.SampleTaskSpec","startMs":0,"durationMs":900000,"nodeToExitMs":{"node01":1}},"startedMs":0,"doneMs":1,"status":"halted"}}}
17:02:00         at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:276)
17:02:00         at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:254)
17:02:00         at org.apache.kafka.trogdor.common.ExpectedTasks.waitFor(ExpectedTasks.java:176)
17:02:00         at org.apache.kafka.trogdor.agent.AgentTest.testWorkerCompletions(AgentTest.java:239)
17:02:00

cmccabe · 2018-04-06T21:44:59Z

Rebased and fixed failing test

rajinisivaram

@cmccabe Thanks for the updates, LGTM. Will merge after builds complete.

apovzner · 2018-04-07T00:42:47Z

LGTM

rajinisivaram · 2018-04-08T08:34:14Z

@cmccabe Thanks for the PR. Build failure is unrelated, merging to trunk.

…e#4737) Reviewers: Anna Povzner <anna@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>

cmccabe force-pushed the KAFKA-6688 branch from 3440071 to 1f8f12e Compare March 20, 2018 05:19

cmccabe force-pushed the KAFKA-6688 branch from 1f8f12e to 4b81790 Compare March 20, 2018 16:55

apovzner reviewed Mar 20, 2018

cmccabe force-pushed the KAFKA-6688 branch from c03ff58 to 594cf1d Compare March 26, 2018 16:29

cmccabe mentioned this pull request Apr 5, 2018

KAFKA-6696 Trogdor should support destroying tasks #4759

Merged

rajinisivaram reviewed Apr 5, 2018

apovzner reviewed Apr 5, 2018

cmccabe force-pushed the KAFKA-6688 branch 2 times, most recently from 6c8c9cc to b30161a Compare April 6, 2018 05:21

rajinisivaram reviewed Apr 6, 2018

KAFKA-6688. The Trogdor coordinator should track task statuses

774a39f

cmccabe force-pushed the KAFKA-6688 branch from b298463 to 774a39f Compare April 6, 2018 21:44

rajinisivaram approved these changes Apr 6, 2018

rajinisivaram merged commit 40183e3 into apache:trunk Apr 8, 2018

ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018

KAFKA-6688. The Trogdor coordinator should track task statuses (apach…

27c4dd7

…e#4737) Reviewers: Anna Povzner <anna@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>

cmccabe deleted the KAFKA-6688 branch May 20, 2019 19:07
