
[FLINK-5048] [Kafka Consumer] Change thread model of FlinkKafkaConsumer to better handle shutdown/interrupt situations #2789

Closed
wants to merge 2 commits

Conversation

StephanEwen (Contributor)

NOTE: Only the second commit is relevant; the first commit merely prepares for it by cleaning up some code in the Flink Kafka Consumers for 0.9 and 0.10.

Rationale

Prior to this commit, the FlinkKafkaConsumer (0.9 / 0.10) spawned a separate thread that operates Kafka's consumer. That thread was shielded from interrupts, because the Kafka Consumer does not handle thread interrupts well.

Since that thread was also the thread that emitted records, it would block in the network stack (backpressure) or in chained operators. The latter case led to situations where cancellation became very slow, unless that thread was interrupted (which it could not be).

Core changes

This commit changes the thread model:

  • A spawned consumer thread polls a batch of records from the KafkaConsumer and pushes the batch into a size-one blocking queue.
  • The main thread of the task pulls the record batches from the blocking queue and emits the records.

The "batches" are the fetch batches from Kafka's consumer, there is no additional buffering or so that would impact latency.

The thread-to-thread handover of the record batches is handled by a class Handover, which is essentially a size-one blocking queue with the additional ability to gracefully wake up the consumer thread when the main thread decides to shut down. That way, no interrupts are needed on the KafkaConsumerThread.
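A minimal sketch of that handover idea (simplified for illustration, not the exact class from this PR; the actual class additionally forwards errors from the consumer thread to the main thread):

import java.io.Closeable;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.errors.WakeupException;

/**
 * Sketch: a size-one blocking queue between the consumer thread (producer side)
 * and the task's main thread (consumer side), with graceful wakeup and close.
 */
public final class Handover implements Closeable {

    private final Object lock = new Object();

    private ConsumerRecords<byte[], byte[]> next;
    private boolean producerWakeup;
    private boolean closed;

    /** Consumer thread: hands over a batch, blocking while the previous one is untaken. */
    public void produce(ConsumerRecords<byte[], byte[]> records) throws InterruptedException {
        synchronized (lock) {
            while (next != null && !producerWakeup && !closed) {
                lock.wait();
            }
            if (closed) {
                throw new ClosedException();
            }
            if (producerWakeup) {
                producerWakeup = false;
                throw new WakeupException();
            }
            next = records;
            lock.notifyAll();
        }
    }

    /** Main task thread: takes the next batch, blocking until one is available. */
    public ConsumerRecords<byte[], byte[]> pollNext() throws InterruptedException {
        synchronized (lock) {
            while (next == null && !closed) {
                lock.wait();
            }
            if (closed) {
                throw new ClosedException();
            }
            ConsumerRecords<byte[], byte[]> records = next;
            next = null;
            lock.notifyAll();
            return records;
        }
    }

    /** Gracefully wakes up a consumer thread blocked in produce(). */
    public void wakeupProducer() {
        synchronized (lock) {
            producerWakeup = true;
            lock.notifyAll();
        }
    }

    /** Idempotent: wakes up both sides; further produce()/pollNext() calls fail fast. */
    @Override
    public void close() {
        synchronized (lock) {
            closed = true;
            lock.notifyAll();
        }
    }

    public static final class ClosedException extends IllegalStateException {}
}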

This also pulls the KafkaConsumerThread out of the fetcher class, for some code cleanup (scope simplifications).
The method calls whose signatures changed between Kafka 0.9 and 0.10 are handled via a "call bridge", which reduces the code changes needed in the fetchers for each method that has to be adapted.
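Sketched roughly, the bridge looks like this (simplified; the actual classes live in the version-specific connector modules):

import java.util.List;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/**
 * Sketch of the call bridge. Compiled against the Kafka 0.9 client, the call
 * below binds to assign(List). A subclass in the 0.10 module contains the
 * identical-looking source line, but compiled against the 0.10 client it binds
 * to assign(Collection). The fetcher only ever talks to the bridge, so it
 * needs no version-specific code.
 */
public class KafkaConsumerCallBridge {

    public void assignPartitions(KafkaConsumer<?, ?> consumer, List<TopicPartition> topicPartitions) throws Exception {
        consumer.assign(topicPartitions);
    }
}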

Tests

This adjusts some tests, but it removes the "short retention IT cases" for the Kafka 0.9 and 0.10 consumers.
While that type of test makes sense for the 0.8 consumer, for the newer ones these tests exercise purely Kafka code and no Flink code.

In addition, they are virtually impossible to run stably and fast, because they rely on an artificial slowdown in the KafkaConsumer threads. That type of unhealthy interference is exactly what this patch prevents ;-)

…er to better handle shutdown/interrupt situations

Prior to this commit, the FlinkKafkaConsumer (0.9 / 0.10) spawns a separate thread that operates Kafka's consumer.
That thread was shielded from interrupts, because the Kafka Consumer does not handle thread interrupts well.
Since that thread was also the thread that emitted records, it would block in the network stack (backpressure) or in chained operators.
The latter case led to situations where cancellation became very slow, unless that thread was interrupted (which it could not be).

This commit changes the thread model:
  - A spawned consumer thread polls a batch of records from the KafkaConsumer and pushes the
    batch of records into a blocking queue (size one)
  - The main thread of the task will pull the record batches from the blocking queue and
    emit the records.
@tzulitai (Contributor) left a comment:

For now I've only skimmed through the changes, and I really like them overall. I think the solution solves shutting down the KafkaConsumer on cancellation quite elegantly. The pre-hotfix code cleanups in the 1st commit seem good to me too.

Only some very minor comments for the first review. I'd like to look into the PR more over the weekend, especially the tests, which I haven't looked at yet, and do some double checks on the cancellation flow. I would probably also need to check that some of the previous behaviours related to offset committing / offset state initialization aren't broken due to the re-scope to the new KafkaConsumerThread.

public class KafkaConsumerThread extends Thread {

/** Logger for this consumer */
final Logger log;
Contributor:

Private?

Contributor Author:

I left this package-private, because it is accessed by the nested class for the commit callback.
If I make it private, the compiler has to inject a bridge method.

I guess making it private is correct, though; it better documents how it should be used.
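For illustration, a tiny hypothetical example of the accessor issue (names made up; on compilers before Java 11's nestmates, javac emits a synthetic accessor for private members touched from a nested class):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Outer {

    // If this field were private, javac would emit a synthetic package-private
    // accessor method so the nested class below could read it.
    final Logger log = LoggerFactory.getLogger(Outer.class);

    // hypothetical nested class, standing in for the commit callback
    class CommitCallback {
        void onComplete() {
            log.info("offsets committed"); // access from the nested class
        }
    }
}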


import static org.apache.flink.util.Preconditions.checkNotNull;

public final class Handover implements Closeable {
Contributor:

Would be great if this class has some Javadoc too ;)

Contributor Author:

agreed :-)

}

@VisibleForTesting
Object getLock() {
Contributor:

This method doesn't seem to be used, even in the tests.


private ConsumerRecords<byte[], byte[]> next;
private Throwable error;
private boolean wakeup;
@tzulitai (Contributor), Nov 11, 2016:

Can we perhaps rename this to producerWakeup? Since it only affects the producer side of the handover, the renaming would make it less confusing.


// ------------------------------------------------------------------------

public static final class ClosedException extends IllegalStateException {
Contributor:

Not really sure if extending IllegalStateException is good here, because ClosedException will be rethrown to the fetcher even on a normal call to Handover#close().

I understand it's there to make the cancellation process faster, but a normal close() after poll() was called doesn't seem like an illegal state to me.


/** Reference to the proxy, forwarding exceptions from the fetch thread to the main thread */
private volatile ExceptionProxy errorHandler;
/** The thread that runs the proper KafkaConsumer and hand the record batches to this fetcher */
Contributor:

nit: 'proper' confused me a bit at first. Perhaps 'actual'?

@tzulitai (Contributor) left a comment:

Great work Stephan. I've given the changes a second pass, and they look good to me.

The only major suggestions I have are about how we let the new consumer thread / fetcher thread call close() on the handover, and whether or not we should suppress the fetcher from throwing Handover.ClosedException further (left comments inline).

Perhaps we should also wait for @rmetzger to review, as he might notice catches from the original thread model design that I have overlooked.

* This indirection is necessary, because Kafka broke binary compatibility between 0.9 and 0.10,
* changing {@code assign(List)} to {@code assign(Collection)}.
*
* Because of that, we need to two versions whose compiled code goes against different method signatures.
Contributor:

nit: we need "to" two versions <-- redundant "to".

public class KafkaConsumerCallBridge010 extends KafkaConsumerCallBridge {

@Override
public void assignPartitions(KafkaConsumer<?, ?> consumer, List<TopicPartition> topicPartitions) throws Exception {
Contributor:

Do the type parameters for the key / value of KafkaConsumer need to be generic? It seems like we will only be using <byte[], byte[]> anyway.

Contributor Author:

I think generic is nice, because for this method the key/value types do not matter. That way it is more future-proof.

import org.junit.Test;

@SuppressWarnings("serial")
public class KafkaShortRetention010ITCase extends KafkaShortRetentionTestBase {
Contributor:

+1 to remove these tests for 0.9+ connectors.

Contributor:

I think with this removal, we can also completely remove the runFailOnAutoOffsetResetNone() from the KafkaShortRetentionTestBase.

The 0.8 connector runs runFailOnAutoOffsetResetNoneEager() instead of runFailOnAutoOffsetResetNone(). I think this is what we should actually also be doing for the 0.9+ connectors, testing only the eager version, because that's a Flink-specific behaviour (just pointing this out; we can add this as a separate future task, as it probably requires some work on 0.9+).

@@ -143,133 +123,26 @@ public Kafka09Fetcher(

@Override
public void runFetchLoop() throws Exception {
Contributor:

We will be throwing all exceptions, even if it's a Handover.ClosedException, correct?

I wonder if it makes sense to suppress Handover.ClosedException so that it does not reach the main task thread, and to only restore the interruption state that follows cancel()? So basically, we catch InterruptedException over the whole runFetchLoop() scope.

This matches the previous exception-passing behaviour: when cancel() was called on the fetcher, we wouldn't throw any other exceptions, only restore the interruption state to the main task thread.

Contributor Author:

To be safe, I think the ClosedExceptions should be re-thrown, as should all others.
Just in case we overlook something and the consumer thread closes the handover by itself. Any abnormal termination of the fetch loop should result in an exception - that is the safest we can do.

Contributor:

Ok, I agree to be safe.

Also, I just realized that "end of stream" shouldn't lead to the ClosedException, only "cancellation", "fetcher error", "consumer error", and (hopefully not) any other stuff we overlooked will. So, basically, like what you said, only abnormal terminations. In that case, let's keep it this way.
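In sketch form, the agreed-upon loop structure (simplified; 'running', 'handover', 'consumerThread', and emitRecord(...) stand in for the fetcher's actual members):

@Override
public void runFetchLoop() throws Exception {
    // Start the thread that runs the actual KafkaConsumer and fills the handover.
    consumerThread.start();
    try {
        while (running) {
            // Blocks until the consumer thread hands over the next batch.
            // On abnormal termination (cancellation, fetcher error, consumer error)
            // this throws Handover.ClosedException, which is deliberately rethrown
            // rather than suppressed; a clean end of stream exits via 'running'.
            ConsumerRecords<byte[], byte[]> records = handover.pollNext();

            for (ConsumerRecord<byte[], byte[]> record : records) {
                emitRecord(record);
            }
        }
    }
    finally {
        // make sure the consumer thread terminates in all cases
        consumerThread.shutdown();
    }
}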

running = false;

// this wakes up the consumer if it is blocked handing over records
handover.close();
Contributor:

Can we actually call handover.wakeupProducer() here, and call handover.close() in the finally clause of the run() loop?

I don't think it really matters that much in our case, but IMO this way the cancellation flow between the fetcher loop and the consumer thread will be clearer.

My thinking is that only the KafkaConsumerThread actually calls close() on the handover, which immediately rethrows a Handover.ClosedException to the fetcher thread on blocking handover.pollNext() calls. The fetcher thread only calls shutdown() on the KafkaConsumerThread, either on cancellation (in which case pollNext() can still immediately throw either a Handover.ClosedException or an InterruptedException, depending on which arrives first) or on a normal clean exit.

Contributor Author:

I followed this partly, but kept an eager call to handover.close() just to make the consumer thread cancellation doubly safe.
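Roughly, the resulting cancellation path (a sketch with assumed member names, following the discussion above):

public void cancel() {
    // stop the main fetch loop
    running = false;

    // Eager close: wakes up the consumer thread if it is blocked handing over
    // records, and makes further pollNext() calls fail fast. Since close() is
    // idempotent, a second close() from the consumer thread's cleanup is harmless.
    handover.close();

    // Shut down the consumer thread, which also needs to break out of a
    // blocking KafkaConsumer.poll(), typically via KafkaConsumer.wakeup().
    consumerThread.shutdown();
}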

@@ -323,8 +329,154 @@ else if (partition.topic().equals("another")) {

// check that there were no errors in the fetcher
final Throwable caughtError = error.get();
if (caughtError != null) {
if (caughtError != null && !(caughtError instanceof Handover.ClosedException)) {
Contributor:

Perhaps we should suppress the fetcher from throwing Handover.ClosedException, as it doesn't really make sense to the main thread. Please see my comments above.

Properties consumerProps = new Properties();
consumerProps.putAll(standardProps);
consumerProps.putAll(secureProps);
consumerProps.setProperty("fetch.message.max.bytes", "100");
Contributor:

I think we shouldn't be setting fetch.message.max.bytes here. The config key for this setting has changed across Kafka versions (for 0.9+ it's max.partition.fetch.bytes). The version-specific standardProps already set values for this config.

So the original props, which contain only standardProps and secureProps, should be enough for the test to work.

Contributor Author:

Leftover from getting the "short retention" tests to run with the modified source. Will undo.

public void runAutoOffsetResetTest() throws Exception {
final String topic = "auto-offset-reset-test";

final int parallelism = 1;
final int elementsPerPartition = 50000;

Properties tprops = new Properties();
tprops.setProperty("retention.ms", "250");
tprops.setProperty("retention.ms", "100");
Contributor:

Is this change necessary?

runProducerConsumerTest(500, 2, 2);
}

private void runProducerConsumerTest(int numRecords, int maxProducerDelay, int maxConsumerDelay) throws Exception {
Contributor:

nit: Can we move this private method down to the bottom of the file? Not entirely necessary, just that I have a preference for keeping private methods after the public ones.

// ------------------------------------------------------------------------

@SuppressWarnings("unchecked")
static ConsumerRecords<byte[], byte[]> createTestRecords() {
Contributor:

Might as well make this private.

@StephanEwen (Contributor Author)

Thanks for the review, @tzulitai

I will go ahead and merge this, addressing the comments.

@StephanEwen (Contributor Author)

I would actually like to not change how/when handover.close() is called. It is (probably) called more often than necessary, but since it is an idempotent operation, that does not matter.

The code is designed to lead to the quickest wakeup/termination possible in all cases:

  • cancellation
  • end of stream
  • error in the fetcher
  • error in the consumer

Also note that errors and close() do not overwrite each other, so it is fine if one is called after the other.

Also, both the fetcher and the KafkaConsumerThread are written to be self-contained and encapsulate all the necessary logic. That means they do not rely on each other to call handover.close() in any situation - that makes the design more robust.

@tzulitai (Contributor)

Also, both the fetcher and the KafkaConsumerThread are written to be self-contained and encapsulate all the necessary logic. That means they do not rely on each other to call handover.close() in any situation - that makes the design more robust.

I think that makes sense. My suggestion would definitely make the fetcher thread rely on the KafkaConsumerThread to make the correct calls.
Agree to keep it as is :)

@StephanEwen (Contributor Author)

Manually merged in a66e7ad
