
[FLINK-1489] Fixes blocking scheduleOrUpdateConsumers message calls #378

Closed

Conversation

tillrohrmann
Contributor

Replaces the blocking calls with futures which, in case of an exception, let the respective task fail. Furthermore, the PartitionInfos are buffered on the JobManager in case some of the consumers have not yet been scheduled. Once the state of a consumer has switched to RUNNING, all buffered partition infos are sent to it.
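The buffering idea in the description can be sketched as follows. This is a minimal illustration with made-up names (`PartitionInfoBuffer`, `scheduleOrUpdateConsumers`, `switchToRunning` are simplified stand-ins, not the actual Flink runtime classes): partition infos arriving while the consumer is not yet RUNNING are queued on the JobManager side and flushed once the state transition happens.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of buffering PartitionInfos until the consumer is RUNNING.
// Names are illustrative; the real logic lives in Execution/PartialPartitionInfo.
class PartitionInfoBuffer {
    enum State { CREATED, RUNNING }

    private volatile State consumerState = State.CREATED;
    private final Queue<String> pendingPartitionInfos = new ConcurrentLinkedQueue<>();
    private final Queue<String> sentPartitionInfos = new ConcurrentLinkedQueue<>();

    void scheduleOrUpdateConsumers(String partitionInfo) {
        if (consumerState == State.RUNNING) {
            send(partitionInfo);                      // consumer ready: forward immediately
        } else {
            pendingPartitionInfos.add(partitionInfo); // buffer on the JobManager
        }
    }

    void switchToRunning() {
        consumerState = State.RUNNING;
        // Flush everything buffered while the consumer was being scheduled.
        // poll() hands each buffered element to exactly one caller.
        String info;
        while ((info = pendingPartitionInfos.poll()) != null) {
            send(info);
        }
    }

    private void send(String partitionInfo) {
        sentPartitionInfos.add(partitionInfo);        // stand-in for the UpdateTask message
    }

    Queue<String> sent() {
        return sentPartitionInfos;
    }
}
```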

@uce
Contributor

uce commented Feb 10, 2015

Very nice. I will have a detailed look later.

@zentol Can you also test it with the Python API? I think you initially noticed the problem.


// double check to resolve race conditions
if (consumerVertex.getExecutionState() == RUNNING) {
    consumerVertex.sendPartitionInfos();
}
Contributor

Just to verify: the double check & send relies on the fact that update messages at the task manager are idempotent, right?

Contributor Author

The UpdateTask messages are idempotent in the BufferReader. But my intention was not to send any UpdateTask messages twice. The ConcurrentLinkedQueue should make sure that every element is only dequeued once.
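The property relied on here can be demonstrated in isolation. This hypothetical demo shows that `ConcurrentLinkedQueue.poll()` removes the head atomically, so even when several threads race to drain the queue (analogous to the double check and the RUNNING callback racing), each element is dequeued exactly once, i.e. no buffered info is sent twice and none is lost.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Demo: concurrent drains of a ConcurrentLinkedQueue dequeue each element
// exactly once, because poll() atomically removes the head.
class DequeueOnceDemo {
    static int drainConcurrently(int elements, int threads) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < elements; i++) {
            queue.add(i);
        }

        AtomicInteger dequeued = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                // Each successful poll() is a unique element; count it.
                while (queue.poll() != null) {
                    dequeued.incrementAndGet();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Equals `elements`: nothing duplicated, nothing lost.
        return dequeued.get();
    }
}
```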

Contributor

Yeah, true. :-)

@uce
Contributor

uce commented Feb 10, 2015

Looks good to me. +1

We chatted about batching update task calls. Did you realize a problem with it or can we open an "improvement" issue for it?

@tillrohrmann
Contributor Author

You're right. At the moment there is no aggregation of messages. I'll add it.

@uce
Contributor

uce commented Feb 11, 2015

There is a problem: https://travis-ci.org/apache/flink/jobs/50215407

java.lang.IllegalStateException: Consumer state is FINISHED but was expected to be RUNNING.
    at org.apache.flink.runtime.deployment.PartialPartitionInfo.createPartitionInfo(PartialPartitionInfo.java:81)
    at org.apache.flink.runtime.executiongraph.Execution.sendPartitionInfos(Execution.java:581)
    at org.apache.flink.runtime.executiongraph.Execution.switchToRunning(Execution.java:654)
    at org.apache.flink.runtime.executiongraph.Execution.access$100(Execution.java:88)
    at org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:336)
    at akka.dispatch.OnComplete.internal(Future.scala:247)
    at akka.dispatch.OnComplete.internal(Future.scala:244)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

…s with asynchronous futures. Buffers PartitionInfos at the JobManager in case that the respective consumer has not been scheduled.

Conflicts:
	flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala

Adds TaskUpdate message aggregation before sending the messages to the TaskManagers
@tillrohrmann
Contributor Author

I added the UpdateTask message aggregation. I also had to rework the PartitionInfo creation to make it work with the concurrent task updates. This requires another review of the code before we can merge it.
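The aggregation idea discussed above can be sketched like this. Instead of sending one actor message per partition info, the updates are grouped by target TaskManager so that each manager receives a single batched message. The names (`UpdateTaskBatcher`, `batchByTaskManager`) are illustrative assumptions, not the actual Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of UpdateTask message aggregation: group updates by the
// TaskManager they target, yielding one batched message per manager.
class UpdateTaskBatcher {
    /** Each update is a {taskManagerId, partitionInfo} pair. */
    static Map<String, List<String>> batchByTaskManager(List<String[]> updates) {
        Map<String, List<String>> batches = new HashMap<>();
        for (String[] update : updates) {
            batches.computeIfAbsent(update[0], k -> new ArrayList<>())
                   .add(update[1]);
        }
        // One map entry per TaskManager -> one message per TaskManager.
        return batches;
    }
}
```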

@rmetzger
Contributor

Cool. I'm testing this PR on a cluster now.

@rmetzger
Contributor

The job that was previously failing is fixed with this change.

We should merge this change ASAP, because it's kinda impossible right now to seriously use Flink 0.9-SNAPSHOT without it.

@asfgit asfgit closed this in aedbacf Feb 11, 2015
marthavk pushed a commit to marthavk/flink that referenced this pull request Jun 9, 2015

This closes apache#378
@tillrohrmann tillrohrmann deleted the fixScheduleOrUpdateConsumers branch September 16, 2015 13:06