
KAFKA-6530: Use actual first offset of message set when rolling log segment #4660

Merged: 10 commits into apache:trunk on Mar 17, 2018

Conversation

@dhruvilshah3 (Contributor) commented Mar 7, 2018

Use the exact first offset of the message set when rolling a log segment. This is possible for message format V2 and above without any performance penalty, because the first offset is stored in the batch header. This augments the fix made in KAFKA-4451 so that the heuristic is no longer needed for V2 and later messages.
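
Conceptually, the base-offset choice when rolling looks something like the sketch below (names are illustrative, not the actual Log.roll signature; the fallback shown is the conservative estimate along the lines of the KAFKA-4451 fix):

// Sketch only: choose the base offset for a newly rolled segment.
// For magic >= 2 the batch header carries the exact base offset of the message
// set; for older formats only the last offset is reliably known without
// decompression, so a conservative estimate keeps the relative offsets stored
// in the index within Int range.
def rollBaseOffset(magic: Byte, batchBaseOffset: Long, maxOffsetInMessages: Long): Long =
  if (magic >= 2) batchBaseOffset
  else maxOffsetInMessages - Int.MaxValue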

Added unit tests to simulate cases where a segment needs to roll because of overflow in the index offsets, and verified that the new segment created in these cases uses the exact first offset instead of the previous heuristic.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji left a comment

Thanks for the patch! Left a couple comments.

@@ -83,7 +84,8 @@ case class LogAppendInfo(var firstOffset: Long,
targetCodec: CompressionCodec,
shallowCount: Int,
validBytes: Int,
-offsetsMonotonic: Boolean)
+offsetsMonotonic: Boolean,
+hasAccurateFirstOffset: Boolean)
@hachikuji Mar 8, 2018

I think this is reasonable, but it feels a bit odd to have one parameter indicating whether or not we can trust another parameter, right? The fact that we abuse firstOffset in the first place is a big source of confusion in the code, so I'm wondering if it would be better to replace it with an Option to clearly express the fact that we may or may not have it. Then in places where we need to use an offset, we can write firstOffset.getOrElse(lastOffset), which is more explicit and less likely to cause confusion.

It may even be useful to have separate objects, LeaderLogAppendInfo which is guaranteed to have the first offset, and ReplicaLogAppendInfo, which may or may not have it.
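
For illustration, the Option-based shape might look like this (a simplified stand-in with made-up names, not the real LogAppendInfo, which has many more fields):

// Simplified stand-in for LogAppendInfo, just to show the Option-based API.
case class AppendInfoSketch(firstOffset: Option[Long], lastOffset: Long) {
  // Callers that just need some offset fall back to the last offset, which is always known.
  def firstOrLastOffset: Long = firstOffset.getOrElse(lastOffset)
}

// Leader append: the broker assigns the offsets, so the first offset is known.
val leaderAppend = AppendInfoSketch(firstOffset = Some(100L), lastOffset = 104L)
// Follower append of an old-format batch: only the last offset is known.
val followerAppend = AppendInfoSketch(firstOffset = None, lastOffset = 104L)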

if (batch.magic >= RecordBatch.MAGIC_VALUE_V2) {
firstOffset = batch.baseOffset
hasAccurateFirstOffset = true
} else {

If the magic is v1 and below, then whether or not we have an accurate first offset depends on whether this is a leader append or a replica append. For a leader append, we have the first offset because we are the one who assigns it. This logic happens in Log.append after we have validated the data.
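
Roughly, the distinction could be expressed like this (hand-wavy sketch; the parameter names are made up for illustration):

// Sketch: how the first offset of a batch could be determined.
// On the leader the broker assigns offsets itself during validation, so the first
// offset is known even for old message formats; on a follower with magic < 2 only
// the last offset of each batch is available without decompressing the records.
def firstOffsetOf(magic: Byte, baseOffset: Long, assignedFirstOffset: => Long,
                  isLeaderAppend: Boolean): Option[Long] =
  if (magic >= 2) Some(baseOffset)                    // exact value from the V2 batch header
  else if (isLeaderAppend) Some(assignedFirstOffset)  // we assigned it ourselves
  else None                                           // old-format batch on a follower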

@hachikuji left a comment

Thanks for the updates. Left a few more comments.

offsetsMonotonic: Boolean) {
def firstOffset_= (firstOffset: Long) {_firstOffset = Some(firstOffset)}
def firstOffset: Long = _firstOffset.get
def hasAccurateFirstOffset: Boolean = _firstOffset.isDefined

nit: could this just be hasFirstOffset? Or perhaps we could just expose the option as maybeFirstOffset or something like that.

@dhruvilshah3 (Contributor, Author)

Changed to hasFirstOffset

var lastOffset = -1L
var sourceCodec: CompressionCodec = NoCompressionCodec
var monotonic = true
var maxTimestamp = RecordBatch.NO_TIMESTAMP
var offsetOfMaxTimestamp = -1L
var hasAccurateFirstOffset = false

I guess we don't need this anymore?

@dhruvilshah3 (Contributor, Author)

Removed

if (firstOffset < 0)
firstOffset = if (batch.magic >= RecordBatch.MAGIC_VALUE_V2) batch.baseOffset else batch.lastOffset
// Also indicate whether we have the accurate first offset or not
if (firstOffset == (Some(-1L)))

I think you can do firstOffset.contains(-1L)
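
For reference, Option.contains compares directly against the wrapped value:

val firstOffset: Option[Long] = Some(-1L)
firstOffset.contains(-1L)           // true, same check as firstOffset == Some(-1L)
(None: Option[Long]).contains(-1L)  // false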

@dhruvilshah3 (Contributor, Author)

Done

@@ -748,7 +748,7 @@ class ReplicaManager(val config: KafkaConfig,
}

val numAppendedMessages =
-if (info.firstOffset == -1L || info.lastOffset == -1L)
+if (!info.hasAccurateFirstOffset || info.firstOffset == -1L || info.lastOffset == -1L)

Might be a little better encapsulation if we move this logic into a method in LogAppendInfo.
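
For example, something along these lines (a hypothetical method shown standalone; the actual field set differs):

// Sketch of the encapsulated "how many messages were appended" check.
def numAppendedMessages(firstOffset: Long, lastOffset: Long,
                        hasAccurateFirstOffset: Boolean): Long =
  if (!hasAccurateFirstOffset || firstOffset == -1L || lastOffset == -1L) 0
  else lastOffset - firstOffset + 1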

@dhruvilshah3 (Contributor, Author)

Done

@@ -760,7 +760,7 @@ class ReplicaManager(val config: KafkaConfig,
brokerTopicStats.allTopicsStats.messagesInRate.mark(numAppendedMessages)

trace("%d bytes written to log %s-%d beginning at offset %d and ending at offset %d"
-.format(records.sizeInBytes, topicPartition.topic, topicPartition.partition, info.firstOffset, info.lastOffset))
+.format(records.sizeInBytes, topicPartition.topic, topicPartition.partition, info.firstOrLastOffset, info.lastOffset))

This is the leader path, so I think we should have a first offset. Are we trying to guard the case where there were no messages appended?

@dhruvilshah3 (Contributor, Author)

In case there are no messages appended, I think firstOffset would be -1. Changed this to firstOffset instead of firstOrLast.

segmentBaseOffset = segment.baseOffset,
relativePositionInSegment = segment.size)

-segment.append(firstOffset = appendInfo.firstOffset,
+segment.append(firstOffset = appendInfo.firstOrLastOffset,

As far as I can tell, the first offset here is used only for insertion into the offset index and for logging. I cannot think of a good reason why we should prefer to use the first offset over the last offset. Since the replicas have to use the last offset for the old message format anyway, I wonder if we should try to be consistent. It would also make the logging less confusing.

@dhruvilshah3 (Contributor, Author)

Done

@@ -762,7 +766,7 @@ class Log(@volatile var dir: File,
updateFirstUnstableOffset()

trace("Appended message set to log %s with first offset: %d, next offset: %d, and messages: %s"

Maybe we should just use the accurate last offset instead of a potentially inaccurate first offset? Or perhaps we could print both of them, but use the Option for the first offset?

Also nit: can we change this log message to use string interpolation?

@dhruvilshah3 (Contributor, Author)

Done

@@ -859,12 +863,13 @@ class Log(@volatile var dir: File,
private def analyzeAndValidateRecords(records: MemoryRecords, isFromClient: Boolean): LogAppendInfo = {
var shallowMessageCount = 0
var validBytesCount = 0
-var firstOffset = -1L
+var firstOffset: Option[Long] = Some(-1L)

Hmm... I would have expected we'd use None

@dhruvilshah3 (Contributor, Author)

This is really a placeholder for "firstOffset has not been initialized yet, so go initialize when you see the first message".

I see. Would it be a little more natural to let firstOffset be a Long here, and convert it to an Option when we construct LogAppendInfo? I guess I'm a little concerned about Some(-1) somehow leaking into the rest of the code.
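
That is, keep the local variable a plain Long while scanning and only wrap it at construction time, roughly:

var firstOffset = -1L  // plain Long while iterating over the batches
// ... assign firstOffset when the first batch is seen ...
val maybeFirstOffset: Option[Long] =
  if (firstOffset < 0) None else Some(firstOffset)  // value passed to LogAppendInfo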

@@ -1086,7 +1086,7 @@ class LogCleanerTest extends JUnitSuite {
val end = 2
val offsetSeq = Seq(0L, 7206178L)
writeToLog(log, (start until end) zip (start until end), offsetSeq)
-cleaner.buildOffsetMap(log, start, end, map, new CleanerStats())
+cleaner.buildOffsetMap(log, start, 7206178L + 1L, map, new CleanerStats())

Perhaps we can choose better names for start and end to avoid the confusion. Maybe keyStart and keyEnd or something like that?

@dhruvilshah3 (Contributor, Author)

Done

@dhruvilshah3 (Contributor, Author)

@hachikuji I think I addressed all review comments. Please take a look when you get a chance.

@hachikuji left a comment

Thanks for the updates. This is looking good, just a few more small comments.

trace("Appended message set to log %s with first offset: %d, next offset: %d, and messages: %s"
.format(this.name, appendInfo.firstOffset, nextOffsetMetadata.messageOffset, validRecords))
trace(s"Appended message set to log ${this.name} with last offset: ${appendInfo.lastOffset}, " +
s"first offset: ${if (appendInfo.hasFirstOffset) Some(appendInfo.firstOffset) else None}, " +

Seems we could just use firstOffset? Then we wouldn't need hasFirstOffset any longer.

largestTimestamp: Long,
shallowOffsetOfMaxTimestamp: Long,
records: MemoryRecords): Unit = {
if (records.sizeInBytes > 0) {
trace("Inserting %d bytes at offset %d at position %d with largest timestamp %d at shallow offset %d"
.format(records.sizeInBytes, firstOffset, log.sizeInBytes(), largestTimestamp, shallowOffsetOfMaxTimestamp))
trace(s"Inserting ${records.sizeInBytes} bytes at end_offset $largestOffset at position ${log.sizeInBytes} " +

nit: do we need the underscore in end_offset? Maybe a space would work? Same below.

@@ -47,19 +47,19 @@ import java.lang.{Long => JLong}
import java.util.regex.Pattern

object LogAppendInfo {
-val UnknownLogAppendInfo = LogAppendInfo(-1, -1, RecordBatch.NO_TIMESTAMP, -1L, RecordBatch.NO_TIMESTAMP, -1L,
+val UnknownLogAppendInfo = LogAppendInfo(Some(-1L), -1, RecordBatch.NO_TIMESTAMP, -1L, RecordBatch.NO_TIMESTAMP, -1L,
@hachikuji Mar 15, 2018

Could we use None instead so that we can always guarantee that the first offset, if present, is positive? As far as I can tell, outside of test cases, we just have a couple calls to firstOffset.get inside ReplicaManager that we would need to replace with getOrElse(-1).

I would also like to remove this UnknownLogAppendInfo since this sentinel pattern has proven error-prone, but we can leave that for future work.
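
The call-site change would be roughly (sketch):

// With firstOffset as Option[Long] and None in UnknownLogAppendInfo, the
// response path can fall back explicitly instead of calling .get:
val firstOffset: Option[Long] = None  // e.g. from an unknown/failed append
val baseOffsetForResponse: Long = firstOffset.getOrElse(-1L)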

@@ -92,7 +92,7 @@ object StressTestLog {
@volatile var offset = 0
override def work() {
val logAppendInfo = log.appendAsFollower(TestUtils.singletonRecords(offset.toString.getBytes))
-require(logAppendInfo.firstOffset == offset && logAppendInfo.lastOffset == offset)
+require(logAppendInfo.firstOrLastOffset == offset && logAppendInfo.lastOffset == offset)

Interesting that this only worked because we were writing single-message batches. I think the expectation might be a little clearer if instead of logAppendInfo.firstOrLastOffset == offset, we wrote logAppendInfo.firstOffset.forall(_ == offset). Then the check is valid even if we used batches with multiple messages.
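
For what it's worth, forall is vacuously true on None, so the check still passes when the first offset is unknown (old-format follower appends) but still catches a mismatch when it is known:

val offset = 0L
Some(0L).forall(_ == offset)              // true: first offset known and matches
(None: Option[Long]).forall(_ == offset)  // true: nothing to check
Some(5L).forall(_ == offset)              // false: a real mismatch is still caught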

maxTimestampInMessages = appendInfo.maxTimestamp,
maxOffsetInMessages = appendInfo.lastOffset)
val segment = maybeRoll(validRecords.sizeInBytes,
appendInfo)

nit: enough room on the previous line for this?

@@ -474,7 +474,7 @@ class ReplicaManager(val config: KafkaConfig,
topicPartition ->
ProducePartitionStatus(
result.info.lastOffset + 1, // required offset
-new PartitionResponse(result.error, result.info.firstOffset, result.info.logAppendTime, result.info.logStartOffset)) // response status
+new PartitionResponse(result.error, result.info.firstOffset.get, result.info.logAppendTime, result.info.logStartOffset)) // response status

Don't we need to replace this get with getOrElse(-1) since we could be getting the UnknownLogAppendInfo here? Same thing below.

@hachikuji

Note that the failing builds are likely a result of the bug mentioned in ReplicaManager.

@hachikuji left a comment

LGTM. Thanks for the patch!

@hachikuji commented Mar 17, 2018

The two failing tests seem unrelated. I tried to reproduce locally after rebasing the PR, but was unable to do so. Note I rebased and removed a couple unneeded printlns. I will merge after the new builds complete.

@hachikuji merged commit ae31ee6 into apache:trunk Mar 17, 2018
@dhruvilshah3 deleted the KAFKA-6530 branch May 11, 2018 07:29