
Conversation

@AHeise (Contributor) commented Sep 5, 2021

What is the purpose of the change

The producer can leak if it is interrupted during cancellation while trying to abort transactions in an orderly fashion.

Brief change log

  • Harden KafkaSinkITCase and activate FLIP-147
  • Fix committer bugs with active FLIP-147 (duplicate SinkWriter#prepareCommit)
  • Make closing of committer and Kafka writer/committer more reliable by not failing on first exception
  • 3 smaller fixes in KafkaWriter/Committer.

Verifying this change

Added test with concurrent checkpoints to KafkaWriterITCase.

Most fixes cover issues exposed by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot (Collaborator) commented Sep 5, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ddc241c (Sun Sep 05 21:44:18 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot (Collaborator) commented Sep 5, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@XComp (Contributor) left a comment

Thanks @AHeise for your contribution. I added some minor comments below.

Comment on lines 374 to 375
+ '\''
+ ", inTransaction="
@XComp (Contributor):

Suggested change (diff):
-        + '\''
-        + ", inTransaction="
+        + "', inTransaction="

nit

@AHeise (Contributor, Author):

Strange - it's autogenerated.
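
For context, a hedged sketch of the kind of IDE-generated toString() that yields this concatenation pattern (the class and field names are assumptions for illustration):

@Override
public String toString() {
    return "KafkaWriterState{"
            + "transactionalIdPrefix='"
            + transactionalIdPrefix
            + '\''   // closes the quoted prefix value
            + ", inTransaction="
            + inTransaction
            + '}';
}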

} catch (ProducerFencedException e) {
// initTransaction has been called on this transaction before
LOG.error(
"Transactions {} timed out or was overridden and data has been potentially lost.",
@XComp (Contributor):

Suggested change (diff):
-        "Transactions {} timed out or was overridden and data has been potentially lost.",
+        "Transaction {} timed out or was overridden and data has been potentially lost.",

nit

"Transaction {} was previously canceled and data has been lost.",
committable,
e);
} catch (Exception e) {
@XComp (Contributor):

Just to be sure we don't miss anything here: Why don't we handle Throwable here anymore?

@AHeise (Contributor, Author):

This was an unintended change, but now I intend to keep it. I don't think we can or should handle Errors here (think of OOM).
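
To illustrate the reasoning, a minimal sketch (placeholder names, not the actual KafkaCommitter code) of why narrowing the catch from Throwable to Exception is the safer default:

try {
    producer.commitTransaction();
} catch (Exception e) {
    // Recoverable and unknown failures are logged and surfaced as commit failures.
    LOG.error("Transaction ({}) encountered error and data has been lost.", committable, e);
}
// Deliberately no catch (Throwable t): a JVM Error such as OutOfMemoryError
// cannot be handled meaningfully here and should fail the task instead.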


  this.kafkaWriterState = new KafkaWriterState(transactionalIdPrefix);
- this.lastCheckpointId = sinkInitContext.getRestoredCheckpointId().orElse(-1);
+ this.lastCheckpointId = sinkInitContext.getRestoredCheckpointId().orElse(0);
@XComp (Contributor):

We change it to 0 because checkpoint ID counting starts at 1. That knowledge was a bit hidden in StandaloneCheckpointIDCounter:34, ZooKeeperCheckpointIDCounter:86, and KubernetesCheckpointIDCounter:167. Can't we move this value into a single place like CheckpointConfig.DEFAULT_INITIAL_CHECKPOINT_ID and use it here as CheckpointConfig.DEFAULT_INITIAL_CHECKPOINT_ID - 1? This would make the intention of the literal 0 more obvious.

@XComp (Contributor):

Or should we use CheckpointConfig.DEFAULT_CHECKPOINT_ID_OF_IGNORED_IN_FLIGHT_DATA instead? 🤔

@AHeise (Contributor, Author):

It's a good idea to introduce a constant but CheckpointConfig is probably not the correct place.

@AHeise (Contributor, Author):

I added it to CheckpointIDCounter, which is non-public, but we should probably expose it in some public class if we intend to commit to the default of 1 (which I guess quite a few users rely on).
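
For illustration, a hedged sketch of how the writer initialization could then read, assuming the constant added to CheckpointIDCounter is named INITIAL_CHECKPOINT_ID and equals 1:

// "No checkpoint restored" is expressed as the ID just before the first one.
this.lastCheckpointId =
        sinkInitContext
                .getRestoredCheckpointId()
                .orElse(CheckpointIDCounter.INITIAL_CHECKPOINT_ID - 1);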

- if (exception != null && producerAsyncException == null) {
-     producerAsyncException = exception;
+ if (exception != null) {
+     mailboxExecutor.execute(
@XComp (Contributor):

Can you briefly explain what throwing a RuntimeException on the mailbox executor would trigger?

@AHeise (Contributor, Author):

This throws the exception in the main task thread, leading to regular failover. The callback potentially runs in a different thread.

@XComp (Contributor):

Thanks for the clarification
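
A hedged sketch of the pattern being discussed (class and message names are assumptions, not the verbatim KafkaWriter code):

// The Kafka producer invokes this callback from its I/O thread. Enqueuing a
// throwing mail surfaces the exception in the main task thread, which then
// goes through the regular failover path.
class WriterCallback implements org.apache.kafka.clients.producer.Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            mailboxExecutor.execute(
                    () -> {
                        throw new RuntimeException("Failed to send data to Kafka", exception);
                    },
                    "Failed to send data to Kafka");
        }
    }
}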

committable,
e);
}
recyclable.ifPresent(Recyclable::close);
@XComp (Contributor):

Why can we close the producer if we might want to retry committing the Committables? 🤔

@XComp (Contributor):

Could we add a test for the closing to KafkaCommitterTest?

@AHeise (Contributor, Author) commented Sep 6, 2021:

> Why can we close the producer if we might want to retry committing the Committables? 🤔

We are not retrying them at this point; the retriable path is short-circuited with continue.

@XComp (Contributor):

True. IMHO, the code is hard to understand here because of the little continue being hidden in the catch blocks. But I cannot come up with a quick fix. ¯\_(ツ)_/¯
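
To make the control flow easier to follow, a hedged sketch of the loop shape (simplified; not the verbatim KafkaCommitter code):

for (KafkaCommittable committable : committables) {
    try {
        producer.commitTransaction();
    } catch (RetriableException e) {
        // Retriable path: short-circuit with continue and keep the producer
        // open so the commit can be attempted again later.
        retryableCommittables.add(committable);
        continue;
    } catch (ProducerFencedException e) {
        LOG.error("Transaction ({}) was fenced; data may have been lost.", committable, e);
    }
    // Only reached on the success and non-retriable paths, where no retry is
    // pending and the pooled producer can safely be closed.
    recyclable.ifPresent(Recyclable::close);
}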

Comment on lines 51 to 54
/** Causes a permanent error by misconfiguration. */
@Test
public void testRetryCommittableOnFatalError() throws IOException {
final KafkaCommitter committer = new KafkaCommitter(new Properties());
@XComp (Contributor):

Suggested change (diff):
-    /** Causes a permanent error by misconfiguration. */
-    @Test
-    public void testRetryCommittableOnFatalError() throws IOException {
-        final KafkaCommitter committer = new KafkaCommitter(new Properties());
+    @Test
+    public void testRetryCommittableOnFatalError() throws IOException {
+        // Causes a permanent error by misconfiguration.
+        final KafkaCommitter committer = new KafkaCommitter(new Properties());

nit: the comment actually describes the behaviour caused by the new Properties() parameter value. It doesn't describe the test itself, which is totally fine because the test method name is descriptive enough.

final KafkaCommitter committer = new KafkaCommitter(new Properties());
final short epoch = 0;
public void testRetryCommittableOnRetriableError() throws IOException {
final KafkaCommitter committer = new KafkaCommitter(getProperties());
@XComp (Contributor):

Suggested change (diff):
-        final KafkaCommitter committer = new KafkaCommitter(getProperties());
+        // causes a network error by inactive broker
+        final KafkaCommitter committer = new KafkaCommitter(getProperties());

nit: I would remove the test method comment. The method name is descriptive enough to show that we're testing the retry here. Why the configuration returned by getProperties() causes the expected behavior is valuable, though, and should be added to the respective line.
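
For context, a hedged sketch of what such a getProperties() helper might look like; the unreachable broker address is an assumption for illustration:

// Hypothetical helper: pointing bootstrap.servers at a port nothing listens on
// makes commitTransaction fail with a retriable network error.
private static Properties getProperties() {
    Properties properties = new Properties();
    properties.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "localhost:1");
    properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transactional-id");
    return properties;
}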

@Parameterized.Parameter public int run;

@Parameterized.Parameters
public static Set<Integer> getConfigurations() {
@XComp (Contributor):

Would it make sense to add a comment here describing the intention of run? run is never used in any of the methods. I guess the intention is that we want to loop over the tests for some reason? getConfigurations is a quite generic term to describe the intention. Maybe there's a better name (like getTestRunCounter()) to describe what this method returns?

@AHeise (Contributor, Author):

You should really start reviewing by commit :p This is just a tmp commit for testing.

@XComp (Contributor):

fair point 👍

}

@Test
public void testRecoveryWithExactlyOnceGuaranteeAndConcurrentCheckpoints() throws Exception {
@XComp (Contributor):

I would have expected this test to fail without your changes, considering that we ran into issues during the release testing in FLINK-23850 (resulting in FLINK-24151). But reverting all your changes and only applying the KafkaSinkITCase diff didn't result in any failure on my local machine after 80 runs of testRecoveryWithExactlyOnceGuaranteeAndConcurrentCheckpoints.

Is this expected? 🤔

@AHeise (Contributor, Author):

Yes, I ran all tests 100 times before and it didn't fail once. You just have different timings on a completely overloaded AZP.

@XComp (Contributor) commented Sep 6, 2021

Running the FLINK-23850 job on this PR's codebase results in the following error while data is constantly being added to the Kafka input topic:

2021-09-06 13:07:45
java.lang.RuntimeException: Failed to send data to Kafka
	at org.apache.flink.connector.kafka.sink.KafkaWriter.lambda$null$0(KafkaWriter.java:131)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:789)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:741)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.kafka.shaded.org.apache.kafka.common.errors.UnknownProducerIdException: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producerId are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.

@XComp (Contributor) commented Sep 6, 2021

FLINK-23850 job:

        Configuration config = new Configuration();
        config.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(20, 2000));
        env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        env.setParallelism(6);

        final StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        tableEnv.createTable("T1",
                TableDescriptor.forConnector("kafka")
                        .schema(Schema.newBuilder()
                                .column("pk", DataTypes.STRING().notNull())
                                .column("x", DataTypes.STRING().notNull())
                                .build())
                        .option("topic", "flink-23850-in1")
                        .option("properties.bootstrap.servers", FLINK23850Utils.BOOTSTRAP_SERVERS)
                        .option("value.format", "csv")
                        .option("scan.startup.mode", "earliest-offset")
                        .build());

        final Table resultTable =
                tableEnv.sqlQuery(
                        "SELECT "
                                + "T1.pk, "
                                + "'asd', "
                                + "'foo', "
                                + "'bar' "
                                + "FROM T1");

        tableEnv.createTable("T4",
                TableDescriptor.forConnector("kafka")
                        .schema(Schema.newBuilder()
                                .column("pk", DataTypes.STRING().notNull())
                                .column("some_calculated_value", DataTypes.STRING())
                                .column("pk1", DataTypes.STRING())
                                .column("pk2", DataTypes.STRING())
                                .build())
                        .option("topic", "flink-23850-out")
                        .option("properties.bootstrap.servers", FLINK23850Utils.BOOTSTRAP_SERVERS)
                        .option("value.format", "csv")
                        .option("sink.delivery-guarantee", "exactly-once")
                        .option("sink.transactional-id-prefix", "flink-23850")
                        .option("scan.startup.mode", "earliest-offset")
                        .build());

        resultTable.executeInsert("T4");

@XComp (Contributor) commented Sep 6, 2021

Here is a tar archive containing the DEBUG logs of both runs and the Flink cluster configuration.
FLINK-24131-2.tar.gz

@XComp (Contributor) left a comment

Nothing else to add. LGTM 👍

The local test using the FLINK-23850 setup is also passing. An upgrade of the Kafka server to 2.7.1 was necessary. It appears that we ran into issues with Kafka Server 2.4.1 due to some bug on the Kafka side.

Arvid Heise added 11 commits September 7, 2021 07:59
… called twice for final commit.

With FLIP-147 (final checkpoint) and the respective opt-in option, a sink would invoke prepareCommit twice for the final commit.
This also aligns the transaction ids with checkpoint ids starting at 1.
Removed pending records as it doesn't add anything to KafkaWriter#flush (same post-condition as per JavaDoc) but introduces instabilities because of concurrency.
The metric can only be registered once and should simply take the currentProducer to calculate.
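
A hedged sketch of the gauge fix described in the last commit (names assumed; the point is a single registration whose supplier reads through the current producer):

// Registered exactly once; recreating the producer after a checkpoint must not
// trigger a second registration. computeSendTime is a hypothetical helper.
metricGroup.gauge("currentSendTime", () -> computeSendTime(currentProducer));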
@AHeise AHeise merged commit 85684e4 into apache:master Sep 7, 2021
@AHeise AHeise deleted the FLINK-24131 branch September 7, 2021 10:20
