KAFKA-14307; Controller time-based snapshots #12761
Conversation
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java
queue.scheduleDeferred(event.name,
    new EarliestDeadlineFunction(time.nanoseconds() + delayNs), event);
}

void handleSnapshotFinished(Optional<Exception> exception) {
    if (exception.isPresent()) {
        log.error("Error while generating snapshot {}", generator.lastContainedLogOffset(), exception.get());
If there is a snapshot failure, will we wait for the next interval before retrying? I wonder if we need a metric or something to get marked when this happens?
If there is a snapshot failure, will we wait for the next interval before retrying?
Yeah. The snapshot counters committedBytesSinceLastSnapshot and oldestCommittedLogOnlyAppendTimestamp are updated assuming that snapshot generation cannot fail.
We could add another reason ("snapshot failure") that triggers a snapshot immediately irrespective of the counters. I am concerned that this event/task may starve other controller events. Maybe it is better to just rely on timed snapshots and NoOpRecord to trigger another snapshot in the case of failures.
If there is a snapshot failure, will we wait for the next interval before retrying?
Let me check the KIPs but I don't think we define this metric. I can add it and update one of the KIPs if we agree to it here.
How about this for the controller:
kafka.controller:type=KafkaController,name=MetadataSnapshotGenerationErrors
Incremented anytime the controller fails to generate a snapshot. Reset to zero anytime the controller restarts or a snapshot is successfully generated.
And this for the brokers:
kafka.server:type=broker-metadata-metrics,name=snapshot-generation-errors
Incremented anytime the broker fails to generate a snapshot. Reset to zero anytime the broker restarts or a snapshot is successfully generated.
@hachikuji do you mind if I implement this in a future PR? I created this issue: https://issues.apache.org/jira/browse/KAFKA-14403
Doing it separately sounds fine.
kafka.controller:type=KafkaController,name=MetadataSnapshotGenerationErrors
I am concerned that with a counter that resets at restart, we might have an ever so small likelihood of never incrementing the metric even with the log moving and no snapshots generated e.g. if we keep crashing while generating a snapshot. Ditto for the broker.
What do you think about having a kafka.controller:type=KafkaController,name=LastSnapshotTime metric?
This metric would just emit the timestamp (Unix epoch) of when the last snapshot was successfully written.
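As a sketch of that gauge idea, here is a minimal holder for the proposed LastSnapshotTime value. The class and method names are hypothetical and this is not Kafka's metrics API, just an illustration of the semantics being proposed:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not Kafka's metrics API: a holder for the value that
// a LastSnapshotTime gauge would emit.
class LastSnapshotTimeGauge {
    // -1 means no snapshot has been written since this process started.
    private final AtomicLong lastSnapshotTimeMs = new AtomicLong(-1L);

    // Called once a snapshot is successfully written.
    void recordSnapshot(long wallClockMs) {
        lastSnapshotTimeMs.set(wallClockMs);
    }

    // Value the gauge would emit: epoch millis of the last successful snapshot.
    long value() {
        return lastSnapshotTimeMs.get();
    }
}
```

One property worth noting: an operator can alert when "now minus this value" exceeds the configured snapshot interval, which also covers the crash-loop case raised above where an error counter might never get incremented.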
Okay. Let's move this discussion to the DISCUSSION thread in Kafka dev. I'll send a message next week.
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java
@@ -986,6 +1012,13 @@ public void handleCommit(BatchReader<ApiMessageAndVersion> reader) {
    batch.appendTimestamp(),
    committedBytesSinceLastSnapshot + batch.sizeInBytes()
);

if (offset >= raftClient.latestSnapshotId().map(OffsetAndEpoch::offset).orElse(0L)) {
    oldestCommittedLogOnlyAppendTimestamp = Math.min(
Why "LogOnlyAppendTimestamp" instead of "LogAppendTimestamp"?
It is the oldest timestamp in the log that is not included in a snapshot. I was thinking that oldestCommittedLogAppendTimestamp could mislead the reader if they don't read the description of that variable:
/**
* Timestamp for the oldest record that was committed but not included in a snapshot.
*/
private long oldestCommittedLogOnlyAppendTimestamp = Long.MAX_VALUE;
nit: Maybe overkill, but how about something even more verbose like oldestNonSnapshottedLogAppendTimestamp.
The current name still had me read the description.
I was thinking of a similar name when I was implementing this feature but I felt it was too verbose. I don't have a strong opinion here so I can change it to whatever the reader prefers.
I do think oldestNonSnapshottedLogAppendTimestamp is a little clearer. Perhaps the fact that it is an append timestamp is already clear from context and we could use oldestNonSnapshottedTimestamp? I also don't feel strongly though.
I picked oldestNonSnapshottedTimestamp. Updated the PR.
metadata/src/main/java/org/apache/kafka/metadata/util/SnapshotReason.java
Thanks for the changes @jsancio ! Left a few minor comments.
// Delete every in-memory snapshot up to the committed offset. They are not needed since this
// snapshot generation finished.
snapshotRegistry.deleteSnapshotsUpTo(lastCommittedOffset);
Question: Is this OK to do even if the snapshot never succeeded? I think the answer would be yes, but just confirming.
Yes. The controller needs to keep in-memory snapshots older than the last committed offset because a snapshot may be iterating over those timelined values. Once we know that there are no pending snapshots (either because generation succeeded or because it failed) the controller can delete in-memory snapshots up to the committed offset.
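A minimal sketch of the gating described here, with simplified stand-in types (the real SnapshotRegistry API differs; this only illustrates why deletion must wait for in-flight generation):

```java
import java.util.TreeMap;

// Hypothetical, simplified stand-in for the controller's timeline snapshot
// registry. Real Kafka types and method names differ.
class InMemorySnapshotRegistry {
    private final TreeMap<Long, String> snapshots = new TreeMap<>();
    private boolean snapshotInProgress = false;

    void addSnapshot(long offset, String state) {
        snapshots.put(offset, state);
    }

    void setSnapshotInProgress(boolean inProgress) {
        snapshotInProgress = inProgress;
    }

    // Delete in-memory snapshots strictly below the committed offset, but only
    // when no on-disk snapshot generator may still be iterating over them.
    void maybeDeleteSnapshotsUpTo(long committedOffset) {
        if (!snapshotInProgress) {
            snapshots.headMap(committedOffset, false).clear();
        }
    }

    int size() {
        return snapshots.size();
    }
}
```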
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java
committedBytesSinceLastSnapshot = 0;
if (!snapshotReasons.isEmpty()) {
    if (!isActiveController()) {
        // The active controller creates in-memory snapshot every time an uncommitted
Nit:
It would be a little easier to read if the comment was rephrased to lead with what is happening here:
// The standby controllers do not create in-memory snapshots every time a
// batch gets appended, so we create it now.
I reordered the sentences and improved the wording. I kept both sentences because I think they are important for anyone interested in understanding when the controller generates and deletes in-memory snapshots.
val MetadataSnapshotMaxIntervalMsDoc = "This is the maximum number of milliseconds to wait to generate a snapshot " +
  "if there are committed records in the log that are not included in the latest snapshot. A value of zero disables " +
  s"time based snapshot generation. The default value is ${Defaults.MetadataSnapshotMaxIntervalMs}. To generate " +
  s"snapshots based on the number of metadata bytes, see the <code>$MetadataSnapshotMaxNewRecordBytesProp</code> " +
Instead of just referring to the other configuration, I was thinking we could mention that snapshots will be taken when either the interval is reached or the max bytes limit is reached.
Done. Added a sentence to both descriptions explaining this.
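For reference, the two configurations discussed in this thread could appear together in a properties file as follows; the values shown are illustrative, not recommendations:

```properties
# Generate a snapshot once this many bytes of metadata records have been
# committed since the last snapshot (20 MiB here)...
metadata.log.max.record.bytes.between.snapshots=20971520
# ...or once committed records have gone unsnapshotted for this long
# (one hour here), whichever threshold is crossed first.
# A value of 0 disables time-based snapshot generation.
metadata.log.max.snapshot.interval.ms=3600000
```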
        cancelNextGenerateSnapshot();
    }
} else {
    /* Skip snapshot generation if there is a snapshot in progress.
nit: I think this would be a little clearer if the if check is inverted:
if (snapshotGeneratorManager.snapshotInProgress()) {
    /* Skip snapshot generation if there is a snapshot in progress.
    ...
} else {
Done.
// The snapshot counters for size-based and time-based snapshots could have changed to cause a new
// snapshot to get generated.
maybeGenerateSnapshot();
In case there was a failure, does it make sense to back off before retrying?
The quorum controller resets the size-based (committedBytesSinceLastSnapshot) and time-based (oldestNonSnapshottedTimestamp) variables when it starts a snapshot. If the snapshot fails, these variables have already been reset at the point generation started.
This acts as a throttle. Snapshot generation will be triggered at most as often as described in metadata.log.max.record.bytes.between.snapshots and metadata.log.max.snapshot.interval.ms even if the snapshot happens to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
Implement time based snapshot for the controller. The general strategy for this feature is that the controller will use the record-batch's append time to determine if a snapshot should be generated. If the oldest record that has been committed but is not included in the latest snapshot is older than `metadata.log.max.snapshot.interval.ms`, the controller will trigger a snapshot immediately. This is useful in case the controller was offline for more than `metadata.log.max.snapshot.interval.ms` milliseconds.

If the oldest record that has been committed but is not included in the latest snapshot is NOT older than `metadata.log.max.snapshot.interval.ms`, the controller will schedule a `maybeGenerateSnapshot` deferred task.

It is possible that when the controller wants to generate a new snapshot, either because of time or number of bytes, the controller is currently generating a snapshot. In this case the `SnapshotGeneratorManager` was changed so that it checks and potentially triggers another snapshot when the currently in-progress snapshot finishes.

To better support this feature the following additional changes were made:
1. The configuration `metadata.log.max.snapshot.interval.ms` was added to `KafkaConfig` with a default value of one hour.
2. `RaftClient` was extended to return the latest snapshot id. This snapshot id is used to determine if a given record is included in a snapshot.
3. Improve the `SnapshotReason` type to support the inclusion of values in the message.

Reviewers: Jason Gustafson <jason@confluent.io>, Niket Goel <niket-goel@users.noreply.github.com>