KAFKA-12474: Handle failure to write new session keys gracefully #10396

Merged: 2 commits merged into apache:trunk on Apr 1, 2021

C0urante
Contributor

Jira

If a distributed worker fails to write (or read back) a new session key to/from the config topic, it dies.

This fix softens the blow by instead restarting the herder tick loop and forcing a read to the end of the config topic, retrying that read until it succeeds.

At this point, if the worker was able to successfully write a new session key in its first attempt, it will have read that key back from the config topic and will not write a new key during the next tick iteration. If it was not able to write that key at all, it will try again to write a new key (if it is still the leader).
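
The recovery behavior described above can be sketched roughly as follows. This is a simplified model and not the actual `DistributedHerder` code: `ConfigLog`, `Herder`, `FlakyConfigLog`, and the `needsReadToEnd` flag are hypothetical names made up for illustration, and the sketch assumes a key is only needed while none is known (the real herder also rotates keys over time).

```java
class SessionKeyRetrySketch {

    /** Hypothetical stand-in for the config topic; not the real Connect interface. */
    interface ConfigLog {
        void writeSessionKey(String key) throws Exception; // may fail transiently
        void readToEnd() throws Exception;                  // may fail transiently
        String latestSessionKey();                          // last key read back, or null
    }

    /** Simplified herder: one call to tick() models one iteration of the tick loop. */
    static final class Herder {
        private final ConfigLog configLog;
        private boolean needsReadToEnd = false; // set after a failed write or read-back
        private String sessionKey = null;

        Herder(ConfigLog configLog) {
            this.configLog = configLog;
        }

        String sessionKey() {
            return sessionKey;
        }

        /** Returns true if the iteration completed, false if a failure cut it short. */
        boolean tick(String candidateKey, boolean isLeader) {
            if (needsReadToEnd) {
                try {
                    configLog.readToEnd();
                    sessionKey = configLog.latestSessionKey();
                    needsReadToEnd = false;
                } catch (Exception e) {
                    return false; // still behind; retry the read on the next tick
                }
            }
            // Assumption: a new key is needed only while none is known.
            if (sessionKey != null || !isLeader) {
                return true;
            }
            try {
                configLog.writeSessionKey(candidateKey);
                configLog.readToEnd();
                sessionKey = configLog.latestSessionKey();
                return true;
            } catch (Exception e) {
                // Instead of dying, force a read to the end before possibly retrying.
                needsReadToEnd = true;
                return false;
            }
        }
    }

    /** In-memory ConfigLog that fails a scripted number of writes and reads. */
    static final class FlakyConfigLog implements ConfigLog {
        private int writeFailuresLeft;
        private int readFailuresLeft;
        private String stored = null;  // what has been written to the "topic"
        private String visible = null; // what the worker has read back so far
        int successfulWrites = 0;

        FlakyConfigLog(int writeFailures, int readFailures) {
            this.writeFailuresLeft = writeFailures;
            this.readFailuresLeft = readFailures;
        }

        @Override
        public void writeSessionKey(String key) throws Exception {
            if (writeFailuresLeft > 0) {
                writeFailuresLeft--;
                throw new Exception("simulated write failure");
            }
            stored = key;
            successfulWrites++;
        }

        @Override
        public void readToEnd() throws Exception {
            if (readFailuresLeft > 0) {
                readFailuresLeft--;
                throw new Exception("simulated read failure");
            }
            visible = stored;
        }

        @Override
        public String latestSessionKey() {
            return visible;
        }
    }

    public static void main(String[] args) {
        // Write succeeds, but the first read-back fails.
        FlakyConfigLog log = new FlakyConfigLog(0, 1);
        Herder herder = new Herder(log);
        System.out.println(herder.tick("key-1", true));
        System.out.println(herder.tick("key-2", true));
        System.out.println(herder.sessionKey() + " after " + log.successfulWrites + " write(s)");
    }
}
```

In the `main` scenario the write lands but the read-back fails, so the second tick picks up the already-written key and does not write again. Constructing `FlakyConfigLog(1, 0)` instead models the other case, where the write itself fails and the leader retries with a new key on the next tick.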

Verified with new unit tests for both cases (failure to write, failure to read back after write).

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante
Contributor Author

@gharris1727 @ncliang either of you care to take a look?

Contributor

@gharris1727 gharris1727 left a comment

Thanks for the fix @C0urante!

A nice targeted change, an informative test, and another failure mode accounted for.

LGTM!

Contributor

@rhauch rhauch left a comment

Thanks, @C0urante. I like the simplicity of this fix, and the two new unit tests. One minor question below, but otherwise looks good.

            now
    ));
} catch (Exception e) {
    log.warn("Failed to write new session key to config topic; forcing a read to the end of the config topic before possibly retrying");
Contributor

Is this worthy of a warning message rather than an info-level message, especially if we think the herder can automatically recover from typical causes of this (e.g., transient network issues, transient broker issues, etc.)?

Contributor Author

Good point; considering the other WARN- and ERROR-level messages that get emitted with this exact code path, it should be fine to downgrade this to INFO.

@rhauch
Contributor

rhauch commented Apr 1, 2021

Not sure why the build results are not showing up, but the build passed on JDK 8 and ARM, and failed unrelated tests on JDK 15. https://ci-builds.apache.org/job/Kafka/job/kafka-pr/view/change-requests/job/PR-10396/

Contributor

@rhauch rhauch left a comment

Thanks, @C0urante. LGTM.

@rhauch rhauch merged commit aea059a into apache:trunk Apr 1, 2021
@C0urante C0urante deleted the kafka-12474 branch April 1, 2021 17:40
rhauch pushed a commit that referenced this pull request Apr 1, 2021

If a distributed worker fails to write (or read back) a new session key to/from the config topic, it dies. This fix softens the blow a bit by instead restarting the herder tick loop anew and forcing a read to the end of the config topic until the worker is able to successfully read to the end.

At this point, if the worker was able to successfully write a new session key in its first attempt, it will have read that key back from the config topic and will not write a new key during the next tick iteration. If it was not able to write that key at all, it will try again to write a new key (if it is still the leader).

Verified with new unit tests for both cases (failure to write, failure to read back after write).

Author: Chris Egerton <chrise@confluent.io>
Reviewers: Greg Harris <gregh@confluent.io>, Randall Hauch <rhauch@gmail.com>
ijuma added a commit to ijuma/kafka that referenced this pull request Apr 4, 2021
…e-allocations-lz4

* apache-github/trunk: (243 commits)
  KAFKA-12590: Remove deprecated kafka.security.auth.Authorizer, SimpleAclAuthorizer and related classes in 3.0 (apache#10450)
  KAFKA-3968: fsync the parent directory of a segment file when the file is created (apache#10405)
  KAFKA-12283: disable flaky testMultipleWorkersRejoining to stabilize build (apache#10408)
  MINOR: remove KTable.to from the docs (apache#10464)
  MONOR: Remove redudant LocalLogManager (apache#10325)
  MINOR: support ImplicitLinkedHashCollection#sort (apache#10456)
  KAFKA-12587 Remove KafkaPrincipal#fromString for 3.0 (apache#10447)
  KAFKA-12426: Missing logic to create partition.metadata files in RaftReplicaManager (apache#10282)
  MINOR: Improve reproducability of raft simulation tests (apache#10422)
  KAFKA-12474: Handle failure to write new session keys gracefully (apache#10396)
  KAFKA-12593: Fix Apache License headers (apache#10452)
  MINOR: Fix typo in MirrorMaker v2 documentation (apache#10433)
  KAFKA-12600: Remove deprecated config value `default` for client config `client.dns.lookup` (apache#10458)
  KAFKA-12952: Remove deprecated LogConfig.Compact (apache#10451)
  Initial commit (apache#10454)
  KAFKA-12575: Eliminate Log.isLogDirOffline boolean attribute (apache#10430)
  KAFKA-8405; Remove deprecated `kafka-preferred-replica-election` command (apache#10443)
  MINOR: Fix docs for end-to-end record latency metrics (apache#10449)
  MINOR Replaced File with Path in LogSegmentData. (apache#10424)
  KAFKA-12583: Upgrade netty to 4.1.62.Final
  ...