KAFKA-4390: Replace MessageSet usage with client-side alternatives #2140
Conversation
ping @junrao @guozhangwang @ijuma @apurvam This patch modifies the server implementation to use the client-side
One quick note on naming. I've renamed the
Slightly related, slightly tangential: is there a specific reason why we put the new broker-specific Java classes under clients/? I'm talking about stuff like:
@onurkaraman Yeah, I've wondered a bit about that also. I'd be OK moving
@hachikuji, thanks for tackling this. About the naming question, I also found it a bit confusing how we sometimes have an offset and sometimes don't when talking about records. That is,
One thing to think about is whether this fits with the other
About having the classes in
Just noticed a couple of things when I scanned the PR. Will do a proper review later.
@@ -427,49 +427,49 @@ public ByteBuffer validate(Object item) {
        }
    };

-    public static final Type RECORDS = new Type() {
+    public static final Type LOGBUFFER = new Type() {
Is it intentional that this is `LOGBUFFER` instead of `LOG_BUFFER`? Same for the `toString` implementation.
@@ -123,7 +123,7 @@ public int timeout() {
        return timeout;
    }

-    public Map<TopicPartition, MemoryRecords> partitionRecords() {
+    public Map<TopicPartition, MemoryLogBuffer> partitionRecords() {
Should this method be renamed as well?
Ack. There are probably a few of these. I'll do another pass and try to find others.
@becketqin Would be nice to get your feedback also. Put on the coffee and lose yourself in code review!
Wow, a 5000 line change... I'll take a look this weekend...
@hachikuji I ran into some other stuff today and didn't finish reading the entire patch. Just some thoughts from reading the code so far. I think "Message" (and "Record" in the new clients) is a well-established concept in Kafka. It is indeed a little weird that
Not sure if there was any thinking on changing
I'll save my other comments until I've gone through a full pass of the code, in case some of them are not valid at all (I already found some...)
@becketqin, that's a fair point about the rename and introducing a new concept. I have similar concerns and was wondering how we could make things clearer without that. Your suggestion looks promising.
Does it make sense to separate the renaming from the actual task of this patch?
@becketqin Thanks for taking a look. I'm not sure I follow why you consider the renaming a conceptual change. The object works the same as before, but I felt the name fit closer to what the object actually represents, which is a range of bytes from the log. The name
The suggestion about
To give a bit more background, we're trying to generalize the concept of a message set so that 1) it uses a separate schema from individual messages, and 2) it's extended to uncompressed data. This allows us to amortize the cost of additional metadata which is invariant for the messages contained in a message set. I'm happy to provide some additional detail if you're interested (a KIP is on the way some time in the next few weeks).
@onurkaraman Yeah, we can do that if it makes this patch easier to get in. Let's see what others think. Sigh, any suggestion to reduce lines of code is likely to be popular with everyone except me.
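To make the amortization argument above concrete, here is a back-of-the-envelope sketch. The field widths below are purely illustrative (not the actual Kafka wire format): the point is only that metadata which is invariant across a message set can be written once per batch instead of once per message.

```java
// Illustrative only: field widths are assumed, not the real Kafka format.
public class BatchOverheadSketch {
    static final int PER_MSG_METADATA = 8 + 4 + 8; // offset + size + timestamp (assumed)
    static final int BATCH_METADATA = 8 + 4 + 8;   // base offset + size + max timestamp (assumed)

    // Old style: every message carries the full metadata.
    static int perMessageLayoutBytes(int numMessages, int payloadBytes) {
        return numMessages * (PER_MSG_METADATA + payloadBytes);
    }

    // New style: invariant metadata is written once per message set.
    static int batchLayoutBytes(int numMessages, int payloadBytes) {
        return BATCH_METADATA + numMessages * payloadBytes;
    }

    public static void main(String[] args) {
        int n = 100, payload = 50;
        System.out.println(perMessageLayoutBytes(n, payload)); // 100 * 70 = 7000
        System.out.println(batchLayoutBytes(n, payload));      // 20 + 5000 = 5020
    }
}
```

With these made-up numbers, batch-level metadata saves nearly 30% for small payloads; the savings shrink as payloads grow, which is why this mostly matters for high-throughput, small-message workloads.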
I ran system tests on the latest patch and everything looks good: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2016-11-22--001.1479824556--hachikuji--KAFKA4390--24dc7ed/report.html. I will probably continue to add some additional test cases, but I'll leave the rest as is pending further review comments.
@hachikuji : Will also take a look at the patch. Just a quick note: could you run some performance tests to make sure there is no regression?
@junrao Thanks for taking a look. Performance testing is next on my plate after I fill in some of the gaps in test coverage.
Force-pushed from f002ac6 to b646ef5
Update: I've begun performance testing. I'm seeing a substantial performance degradation on the consumer side. I'll update this PR when I know more.
I found the cause of the performance regression. When handling a fetch, we must read through the log to find the starting position of a given offset (starting from the position given by the index). To do so, we only need to read the offset and size, but one of my recent commits accidentally changed this behavior to unnecessarily read the full record. I've fixed this in the last commit and now it looks like performance is within 5% of trunk for the producer and consumer. Perhaps still on the slower side though, so I'll continue investigating.
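A minimal sketch of the cheap scan described above, assuming the entry layout discussed elsewhere in this thread (an 8-byte offset followed by a 4-byte size, then the record bytes). The class and method names here are hypothetical, not the actual FileRecords code; the point is that locating an offset only ever touches the 12-byte header and skips past each payload, never materializing a full record.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: scan for a target offset reading only entry headers.
public class OffsetScanSketch {
    static final int LOG_OVERHEAD = 8 + 4; // 8-byte offset + 4-byte size

    // Returns the byte position of the entry with the target offset, or -1.
    static int findPosition(ByteBuffer log, long targetOffset) {
        int pos = 0;
        while (pos + LOG_OVERHEAD <= log.limit()) {
            long offset = log.getLong(pos);
            int size = log.getInt(pos + 8);
            if (offset == targetOffset)
                return pos;
            pos += LOG_OVERHEAD + size; // skip the record payload entirely
        }
        return -1;
    }

    public static void main(String[] args) {
        ByteBuffer log = ByteBuffer.allocate(64);
        // entry 0: offset=0, 5-byte payload; entry 1: offset=1, 3-byte payload
        log.putLong(0L).putInt(5).put(new byte[5]);
        log.putLong(1L).putInt(3).put(new byte[3]);
        log.flip();
        System.out.println(findPosition(log, 1L)); // 17: second entry starts after 12 + 5 bytes
    }
}
```

The regression described in the comment amounts to replacing the `pos += LOG_OVERHEAD + size` skip with a full read of the `size` payload bytes on every iteration.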
@hachikuji : Thanks for the patch. Made a pass on this. Some of the issues that I pointed out seem to be pre-existing. We can decide whether to address them here or in follow-up jiras.
    public void setCreateTime(long timestamp) {
        Record record = record();
        if (record.magic() > 0) {
Should we assert record.magic() > 0?
    throw new KafkaException(String.format("Size of FileRecords %s has been truncated during write: old size %d, new size %d", file.getAbsolutePath(), size, newSize));

    long position = start + offset;
    long count = Math.min(length, this.size.get());
To be consistent, perhaps `this.size` and `this.channel` should just be `size` and `channel`?
    @Override
    public long writeTo(GatheringByteChannel channel, long position, int length) throws IOException {
        ByteBuffer dup = buffer.duplicate();
        dup.position(new Long(position).intValue());
It seems that some of the changes were lost during the rebase? For example, there was code in MemoryRecords for setting the buffer limit according to length, and for casting position to int instead of creating a Long object.
    for (LogEntry deepEntry : shallowEntry) {
        Record deepRecord = deepEntry.record();
        messagesRead += 1;
This seems to be an existing issue. For uncompressed messages, do we double count messagesRead since we already increased the count in line 98?
Ack. I'll fix this and the one below and update the test cases.
    if (writeOriginalEntry) {
        // There are no messages compacted out and no message format conversion, write the original message set back
        shallowEntry.writeTo(buffer);
        messagesRetained += 1;
This and line 144 don't seem to be correct. It seems that we should add the number of entries in retainedEntries?
@@ -152,7 +152,7 @@ class DelayedFetch(delayMs: Long,
    )

    val fetchPartitionData = logReadResults.map { case (tp, result) =>
-      tp -> FetchResponsePartitionData(result.errorCode, result.hw, result.info.messageSet)
+      tp -> FetchPartitionData(result.errorCode, result.hw, result.info.logBuffer)
unused import kafka.api.FetchResponsePartitionData
    messageTimestampType = TimestampType.LOG_APPEND_TIME,
    messageTimestampDiffMaxMs = 1000L)
    val validatedLogBuffer = validatedResults.validatedEntries
    assertEquals("message set size should not change", logBuffer.deepEntries.asScala.size, validatedLogBuffer.deepEntries.asScala.size)
"message set size" is a bit ambiguous. Perhaps we should say "number of messages"?
        }
    }

    /* check that offsets are assigned based on byte offset from the given base offset */
Not sure what's "byte offset".
@@ -70,7 +70,7 @@ class MessageCompressionTest extends JUnitSuite {
    testCompressSize(GZIPCompressionCodec, messages, 396)

    if(isSnappyAvailable)
-      testCompressSize(SnappyCompressionCodec, messages, 502)
+      testCompressSize(SnappyCompressionCodec, messages, 1063)
Hmm, why do we have to change the expected size?
It puzzled me for a while when writing this code why the size was coming out different only for snappy, but it turns out that we've overridden the block size in the client code, instead of using the default as was done for the server code.
Good catch. We probably don't want to change the buffer size in the server to be the same as the client. We may consider changing the client to be the same as the server. See KAFKA-3704 for details.
      convertAndVerify(v, Message.MagicValue_V0, Message.MagicValue_V1)
    } else if (v.magicValue == Message.MagicValue_V1) {
      convertAndVerify(v, Message.MagicValue_V1, Message.MagicValue_V0)
      if (v.codec == NoCompressionCodec) {
Is there a reason that we only test non-compressed message conversion now?
Hmm.. I think the test was broken or at least incomplete since `Message.toFormatVersion` only did shallow conversion. When I implemented this in the client code, I forbade shallow-only conversion because it results in bugs like the one we found in `LogCleaner`. We'll probably end up dropping this code after we remove `Message.toFormatVersion` as suggested above.
@junrao I really appreciate the thorough review. I've addressed the easier items and left a few replies. I'll get to the rest tomorrow. By the way, in the near future, I'd like to squash commits to make rebasing a bit easier. It hasn't been too much of a problem yet, but it will get harder with more iterations.
About the naming of
Also, I'd like to suggest we separate the renaming out of this PR for ease of reviewing, if it is still possible to revert it.
@guozhangwang
Force-pushed from 42b51b2 to fb79917
I've gone ahead and squashed commits. You can still find the old commit history here: https://github.com/hachikuji/kafka/tree/KAFKA-4390-UNSQUASHED.
Refer to this link for build results (access rights to CI server needed):
@hachikuji : Thanks for the latest patch. Just a few minor comments. Also, could you post the latest performance results? Assuming there is no degradation, the patch LGTM.
- * A binary format which consists of a 4 byte size, an 8 byte offset, and the record bytes. See {@link MemoryRecords}
- * for the in-memory representation.
+ * Interface for accessing the records contained in a log. The log itself is represented as a sequence of log entries.
+ * Each log entry consists of a 4 byte size, an 8 byte offset, and a "shallow" {@link Record record}. If
a 4 byte size, an 8 byte offset => an 8 byte offset, a 4 byte size of the record
    }

    /**
-     * Close this batch for no more appends
+     * Filter this log buffer into the provided ByteBuffer.
Is the reference to "log buffer" still valid?
    }

    /**
     * Get the records from this log buffer (note this requires "deep" iteration into the
Is the reference to "log buffer" still valid?
@junrao Thanks for the reviews. I did some testing this evening. I thought I was seeing some performance difference initially in the producer, but it seems within the variance of the test runs. If I were guessing from the results, I'd say the non-compressed path is a tad slower while the compressed path might be a tad faster, but don't put much weight behind either conclusion. In any case, the results seem close enough that I'd recommend merging now. Note that I did add one commit to address a couple minor cleanups and tighten up the iteration code a little.
    public int read() {
        if (!buffer.hasRemaining()) {
            return -1;
    private static class UnderlyingInputStream extends InputStream {
It's a bit annoying that we create so much indirection (DataLogInputStream -> ByteBufferInputStream -> UnderlyingInputStream -> ByteBuffer -> byte[]). In an ideal world, we would not bother with `InputStream` at all and would just operate at the `ByteBuffer` level. However, the gzip case is hard to do that way.
@ijuma Haha, yeah. One of the layers is sort of fake (`DataInputStream` should be a mixin), but the point is still valid.
@hachikuji : Thanks for the patch. LGTM. Just a couple of minor comments.
About the performance. We used to optimize the recompression path on the broker side by implementing a chained ByteBuffer list to avoid copying on buffer overflow during writes. With this patch, we lose that optimization and simply recopy the data into a bigger buffer on overflow. This will affect recompression performance when the estimated after-compression size is too low. Recompression can happen when (1) the producer is old, and (2) the broker compression codec differs from the producer's; both should be uncommon. So we can probably commit the patch as it is. If recompression performance becomes a problem, we can always optimize the buffer-expansion code in ByteBufferOutputStream later.
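For context on the trade-off described above, here is a minimal sketch of the expand-and-recopy strategy (this is not Kafka's actual ByteBufferOutputStream, just an illustration of the pattern): on overflow, the bytes written so far are copied into a larger buffer, whereas the old broker path chained buffers together to avoid that copy.

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch of a growable buffer-backed OutputStream.
public class GrowableBufferSketch extends OutputStream {
    private ByteBuffer buffer;

    GrowableBufferSketch(int initialCapacity) {
        buffer = ByteBuffer.allocate(initialCapacity);
    }

    @Override
    public void write(int b) {
        if (!buffer.hasRemaining()) {
            // Overflow: allocate a buffer twice the size and recopy everything.
            ByteBuffer bigger = ByteBuffer.allocate(buffer.capacity() * 2);
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
        buffer.put((byte) b);
    }

    int size() { return buffer.position(); }
    int capacity() { return buffer.capacity(); }
}
```

The cost of the recopy is paid only when the initial size estimate is too small, which matches the comment's point that a good after-compression size estimate makes the simpler strategy acceptable.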
    this.buffer = buffer;
    buffer.position(LOG_OVERHEAD);
    this.record = new Record(buffer.slice());
    buffer.position(OFFSET_OFFSET);
Do we need to change the position of buffer? Perhaps we could instead just change the position in the slice passed to Record().
Currently `Record` expects the position of the `ByteBuffer` to be at 0. I was tempted to change this assumption, but decided to leave it for now (it's a bit annoying to change all the accessors to assume relative positioning). We could accomplish the same result using `mark()` and `reset()` if that seems any better.
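The two options discussed here can be illustrated with plain NIO. The names and the 12-byte overhead below are assumed for illustration, not the actual LogEntry code: one path moves the parent buffer's position and restores it with `mark()`/`reset()`; the other takes a `slice()` at the record boundary so the record sees its bytes starting at position 0 and the parent is never disturbed.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch comparing mark()/reset() with slice().
public class SliceVsMarkSketch {
    static final int LOG_OVERHEAD = 12; // assumed: 8-byte offset + 4-byte size

    // Option 1: move the position to read, then restore it with mark()/reset().
    static int readThenRestore(ByteBuffer entry) {
        entry.mark();
        entry.position(LOG_OVERHEAD);
        entry.get();   // read one record byte
        entry.reset(); // position is back where it started
        return entry.position();
    }

    // Option 2: slice() at the record boundary; the slice has its own
    // independent position starting at 0, so the parent is untouched.
    static ByteBuffer recordSlice(ByteBuffer entry) {
        ByteBuffer dup = entry.duplicate(); // avoid disturbing the parent at all
        dup.position(LOG_OVERHEAD);
        return dup.slice();
    }

    public static void main(String[] args) {
        ByteBuffer entry = ByteBuffer.allocate(16);
        System.out.println(readThenRestore(entry));          // 0
        System.out.println(recordSlice(entry).remaining());  // 4 record bytes
    }
}
```

One caveat with `mark()`/`reset()`: the mark is discarded whenever the position is set to a value smaller than the mark, so it is only safe when reads move strictly forward.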
- * A binary format which consists of a 4 byte size, an 8 byte offset, and the record bytes. See {@link MemoryRecords}
- * for the in-memory representation.
+ * Interface for accessing the records contained in a log. The log itself is represented as a sequence of log entries.
+ * Each log entry consists of a 4 byte size, an 8 byte offset, a 4 byte record size, and a "shallow" {@link Record record}.
"a 4 byte size," needs to be removed.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jun Rao <junrao@gmail.com>
Closes apache#2140 from hachikuji/KAFKA4390
    long position = start + offset;
-   long count = Math.min(length, this.size - offset);
+   long count = Math.min(length, size.get());
Should this be `Math.min(length, size.get() - offset)`?
@dengziming : Thanks for the comment. This does seem like a bug. Would you be interested in submitting a separate PR to have this fixed?
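For anyone following along, a toy illustration of why the clamp needs to subtract the offset (the names here are hypothetical, not the FileRecords fields): if `size` bytes are readable in total and we start `offset` bytes in, only `size - offset` bytes remain, so clamping against `size` alone can over-count.

```java
// Hypothetical sketch of the clamp discussed in the comment above.
public class CountClampSketch {
    // Ignores the offset: can return more bytes than actually remain.
    static long brokenCount(long length, long size) {
        return Math.min(length, size);
    }

    // Accounts for the bytes already consumed before the start position.
    static long fixedCount(long length, long size, long offset) {
        return Math.min(length, size - offset);
    }

    public static void main(String[] args) {
        long size = 100, offset = 90, length = 50;
        System.out.println(brokenCount(length, size));        // 50: over-counts
        System.out.println(fixedCount(length, size, offset)); // 10: bytes remaining
    }
}
```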