
Conversation

@AHeise (Contributor) commented Sep 5, 2021

What is the purpose of the change

The producer can leak if it is interrupted during cancellation while trying to abort transactions in an orderly fashion.

Brief change log

  • Harden KafkaSinkITCase and activate FLIP-147
  • Fix committer bugs with active FLIP-147 (duplicate SinkWriter#prepareCommit)
  • Make closing of committer and Kafka writer/committer more reliable by not failing on first exception
  • 3 smaller fixes in KafkaWriter/Committer.

Verifying this change

Added test with concurrent checkpoints to KafkaWriterITCase.

Most fixes cover issues exposed by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot (Collaborator) commented Sep 5, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ddc241c (Sun Sep 05 21:44:18 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot (Collaborator) commented Sep 5, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@XComp (Contributor) left a comment

Thanks @AHeise for your contribution. I added some minor comments below.

Comment on lines 374 to 375
+ '\''
+ ", inTransaction="
@XComp (Contributor):

Suggested change (diff):
-        + '\''
-        + ", inTransaction="
+        + "', inTransaction="

nit

@AHeise (Contributor, Author):

Strange - it's autogenerated.
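
For context, a hedged sketch of the kind of IDE-generated toString() that yields this concatenation pattern (the class and field names are assumptions for illustration):

@Override
public String toString() {
    return "KafkaWriterState{"
            + "transactionalIdPrefix='"
            + transactionalIdPrefix
            + '\''   // closes the quoted prefix value
            + ", inTransaction="
            + inTransaction
            + '}';
}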

} catch (ProducerFencedException e) {
// initTransaction has been called on this transaction before
LOG.error(
"Transactions {} timed out or was overridden and data has been potentially lost.",
@XComp (Contributor):

Suggested change (diff):
-        "Transactions {} timed out or was overridden and data has been potentially lost.",
+        "Transaction {} timed out or was overridden and data has been potentially lost.",

nit

"Transaction {} was previously canceled and data has been lost.",
committable,
e);
} catch (Exception e) {
@XComp (Contributor):

Just to be sure we don't miss anything here: Why don't we handle Throwable here anymore?

@AHeise (Contributor, Author):

This was an unintended change, but now I intend to keep it. I don't think we can or should handle Errors here (think of OOM).
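
To illustrate the reasoning, a minimal sketch (placeholder names, not the actual KafkaCommitter code) of why narrowing the catch from Throwable to Exception is the safer default:

try {
    producer.commitTransaction();
} catch (Exception e) {
    // Recoverable and unknown failures are logged and surfaced as commit failures.
    LOG.error("Transaction ({}) encountered error and data has been lost.", committable, e);
}
// Deliberately no catch (Throwable t): a JVM Error such as OutOfMemoryError
// cannot be handled meaningfully here and should fail the task instead.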


  this.kafkaWriterState = new KafkaWriterState(transactionalIdPrefix);
- this.lastCheckpointId = sinkInitContext.getRestoredCheckpointId().orElse(-1);
+ this.lastCheckpointId = sinkInitContext.getRestoredCheckpointId().orElse(0);
@XComp (Contributor):

We change it to 0 because checkpoint ID counting starts at 1. That knowledge was a bit hidden in StandaloneCheckpointIDCounter:34, ZooKeeperCheckpointIDCounter:86, and KubernetesCheckpointIDCounter:167. Can't we move this value into a single place like CheckpointConfig.DEFAULT_INITIAL_CHECKPOINT_ID and use it here as CheckpointConfig.DEFAULT_INITIAL_CHECKPOINT_ID - 1? This would make the intention of the literal 0 more obvious.

@XComp (Contributor):

Or should we use CheckpointConfig.DEFAULT_CHECKPOINT_ID_OF_IGNORED_IN_FLIGHT_DATA instead? 🤔

@AHeise (Contributor, Author):

It's a good idea to introduce a constant but CheckpointConfig is probably not the correct place.

@AHeise (Contributor, Author):

I added it to CheckpointIDCounter, which is non-public, but we should probably expose it in some public class if we intend to commit to the default of 1 (which I guess quite a few users rely on).
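
For illustration, a hedged sketch of how the writer initialization could then read, assuming the constant added to CheckpointIDCounter is named INITIAL_CHECKPOINT_ID and equals 1:

// "No checkpoint restored" is expressed as the ID just before the first one.
this.lastCheckpointId =
        sinkInitContext
                .getRestoredCheckpointId()
                .orElse(CheckpointIDCounter.INITIAL_CHECKPOINT_ID - 1);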

- if (exception != null && producerAsyncException == null) {
-     producerAsyncException = exception;
+ if (exception != null) {
+     mailboxExecutor.execute(
@XComp (Contributor):

Can you briefly explain what throwing a RuntimeException on the mailbox executor would trigger?

@AHeise (Contributor, Author):

This throws the exception in the main task thread, leading to regular failover. The callback potentially runs in a different thread.

@XComp (Contributor):

Thanks for the clarification
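
A hedged sketch of the pattern being discussed (class and message names are assumptions, not the verbatim KafkaWriter code):

// The Kafka producer invokes this callback from its I/O thread. Enqueuing a
// throwing mail surfaces the exception in the main task thread, which then
// goes through the regular failover path.
class WriterCallback implements org.apache.kafka.clients.producer.Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            mailboxExecutor.execute(
                    () -> {
                        throw new RuntimeException("Failed to send data to Kafka", exception);
                    },
                    "Failed to send data to Kafka");
        }
    }
}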

committable,
e);
}
recyclable.ifPresent(Recyclable::close);
@XComp (Contributor):

Why can we close the producer if we might want to retry committing the Committables? 🤔

@XComp (Contributor):

Could we add a test for the closing to KafkaCommitterTest?

@AHeise (Contributor, Author) commented Sep 6, 2021:

> Why can we close the producer if we might want to retry committing the Committables? 🤔

We are not retrying them at this point; the retriable path is short-circuited with continue.

@XComp (Contributor):

True. IMHO, the code is hard to understand here because of the little continue being hidden in the catch blocks. But I cannot come up with a quick fix. ¯\_(ツ)_/¯
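
To make the control flow easier to follow, a hedged sketch of the loop shape (simplified; not the verbatim KafkaCommitter code):

for (KafkaCommittable committable : committables) {
    try {
        producer.commitTransaction();
    } catch (RetriableException e) {
        // Retriable path: short-circuit with continue and keep the producer
        // open so the commit can be attempted again later.
        retryableCommittables.add(committable);
        continue;
    } catch (ProducerFencedException e) {
        LOG.error("Transaction ({}) was fenced; data may have been lost.", committable, e);
    }
    // Only reached on the success and non-retriable paths, where no retry is
    // pending and the pooled producer can safely be closed.
    recyclable.ifPresent(Recyclable::close);
}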

Comment on lines 51 to 54
/** Causes a permanent error by misconfiguration. */
@Test
public void testRetryCommittableOnFatalError() throws IOException {
final KafkaCommitter committer = new KafkaCommitter(new Properties());
@XComp (Contributor):

Suggested change (diff):
-    /** Causes a permanent error by misconfiguration. */
-    @Test
-    public void testRetryCommittableOnFatalError() throws IOException {
-        final KafkaCommitter committer = new KafkaCommitter(new Properties());
+    @Test
+    public void testRetryCommittableOnFatalError() throws IOException {
+        // Causes a permanent error by misconfiguration.
+        final KafkaCommitter committer = new KafkaCommitter(new Properties());

nit: the comment actually describes the behaviour caused by the new Properties() parameter value. It doesn't describe the test itself, which is totally fine because the test method name is descriptive enough.

final KafkaCommitter committer = new KafkaCommitter(new Properties());
final short epoch = 0;
public void testRetryCommittableOnRetriableError() throws IOException {
final KafkaCommitter committer = new KafkaCommitter(getProperties());
@XComp (Contributor):

Suggested change (diff):
-        final KafkaCommitter committer = new KafkaCommitter(getProperties());
+        // causes a network error by inactive broker
+        final KafkaCommitter committer = new KafkaCommitter(getProperties());

nit: I would remove the test method comment. The method name is descriptive enough to show that we're testing the retry here. Why the configuration returned by getProperties() causes the expected behavior is valuable, though, and should be added to the respective line.
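
For context, a hedged sketch of what such a getProperties() helper might look like; the unreachable broker address is an assumption for illustration:

// Hypothetical helper: pointing bootstrap.servers at a port nothing listens on
// makes commitTransaction fail with a retriable network error.
private static Properties getProperties() {
    Properties properties = new Properties();
    properties.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "localhost:1");
    properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transactional-id");
    return properties;
}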

@Parameterized.Parameter public int run;

@Parameterized.Parameters
public static Set<Integer> getConfigurations() {
@XComp (Contributor):

Would it make sense to add a comment here describing the intention of run? run is never used in any of the methods. I guess the intention is that we want to loop over the tests for some reason? getConfigurations is a quite generic term to describe the intention. Maybe there's a better name (like getTestRunCounter()) to describe what this method returns?

@AHeise (Contributor, Author):

You should really start reviewing by commit :p This is just a tmp commit for testing.

@XComp (Contributor):

fair point 👍

}

@Test
public void testRecoveryWithExactlyOnceGuaranteeAndConcurrentCheckpoints() throws Exception {
@XComp (Contributor):

I would have expected this test to fail without your changes, considering that we ran into issues during the release testing in FLINK-23850 (resulting in FLINK-24151). But reverting all your changes and only applying the KafkaSinkITCase diff didn't result in any failure on my local machine after 80 runs of testRecoveryWithExactlyOnceGuaranteeAndConcurrentCheckpoints.

Is this expected? 🤔

@AHeise (Contributor, Author):

Yes, I ran all tests 100 times before and it didn't fail once. You just have different timings on a completely overloaded AZP.

@XComp (Contributor) commented Sep 6, 2021

Running the FLINK-23850 job on this PR's codebase results in the following error while data is constantly being added to the Kafka input topic:

2021-09-06 13:07:45
java.lang.RuntimeException: Failed to send data to Kafka
	at org.apache.flink.connector.kafka.sink.KafkaWriter.lambda$null$0(KafkaWriter.java:131)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:789)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:741)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.kafka.shaded.org.apache.kafka.common.errors.UnknownProducerIdException: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producerId are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.

@XComp (Contributor) commented Sep 6, 2021

FLINK-23850 job:

        Configuration config = new Configuration();
        config.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(20, 2000));
        env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        env.setParallelism(6);

        final StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        tableEnv.createTable("T1",
                TableDescriptor.forConnector("kafka")
                        .schema(Schema.newBuilder()
                                .column("pk", DataTypes.STRING().notNull())
                                .column("x", DataTypes.STRING().notNull())
                                .build())
                        .option("topic", "flink-23850-in1")
                        .option("properties.bootstrap.servers", FLINK23850Utils.BOOTSTRAP_SERVERS)
                        .option("value.format", "csv")
                        .option("scan.startup.mode", "earliest-offset")
                        .build());

        final Table resultTable =
                tableEnv.sqlQuery(
                        "SELECT "
                                + "T1.pk, "
                                + "'asd', "
                                + "'foo', "
                                + "'bar' "
                                + "FROM T1");

        tableEnv.createTable("T4",
                TableDescriptor.forConnector("kafka")
                        .schema(Schema.newBuilder()
                                .column("pk", DataTypes.STRING().notNull())
                                .column("some_calculated_value", DataTypes.STRING())
                                .column("pk1", DataTypes.STRING())
                                .column("pk2", DataTypes.STRING())
                                .build())
                        .option("topic", "flink-23850-out")
                        .option("properties.bootstrap.servers", FLINK23850Utils.BOOTSTRAP_SERVERS)
                        .option("value.format", "csv")
                        .option("sink.delivery-guarantee", "exactly-once")
                        .option("sink.transactional-id-prefix", "flink-23850")
                        .option("scan.startup.mode", "earliest-offset")
                        .build());

        resultTable.executeInsert("T4");

@XComp (Contributor) commented Sep 6, 2021

Here is a tar archive containing the DEBUG logs of both runs and the Flink cluster configuration.
FLINK-24131-2.tar.gz

@XComp (Contributor) left a comment

Nothing else to add. LGTM 👍

The local test using the FLINK-23850 setup is also passing. An upgrade of the Kafka server to 2.7.1 was necessary. It appears that we ran into issues with Kafka Server 2.4.1 due to some bug on the Kafka side.

Arvid Heise added 11 commits September 7, 2021 07:59
… called twice for final commit.

With FLIP-147 (final checkpoint) and the respective opt-in option, a sink would invoke prepareCommit twice for the final commit.
This also aligns the transaction ids with checkpoint ids starting at 1.
Removed pending records as it doesn't add anything to KafkaWriter#flush (same post-condition as per JavaDoc) but introduces instabilities because of concurrency.
The metric can only be registered once and should simply take the currentProducer to calculate.
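
A hedged sketch of the gauge fix described in the last commit (names assumed; the point is a single registration whose supplier reads through the current producer):

// Registered exactly once; recreating the producer after a checkpoint must not
// trigger a second registration. computeSendTime is a hypothetical helper.
metricGroup.gauge("currentSendTime", () -> computeSendTime(currentProducer));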
@AHeise AHeise merged commit 85684e4 into apache:master Sep 7, 2021
@AHeise AHeise deleted the FLINK-24131 branch September 7, 2021 10:20
