Add crc32c-checksum verification on message header-payload #43

rdhabalia · 2016-09-27T23:27:44Z

Motivation

Currently, the client and broker compute the checksum over the entire message rather than over the payload data only.
Pulsar also uses the XXHashChecksum algorithm, whereas an SSE4.2 CRC32C checksum uses a dedicated machine instruction and is faster than XXHashChecksum.
If the client receives a checksum error from the broker, it keeps retrying with the same message instead of recomputing the checksum and failing the message when it is already corrupted.

Modifications

Replace XXHashChecksum with the SSE4.2 CRC32C checksum.
Compute the checksum on payload data only. To support this, add checksum fields (magicByte + checksum value) within the message command.
The client tries to recover if it receives a checksum error from the server, and fails the message if the message is already corrupted.
For now, checksum verification is disabled by default on the client-producer side.
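
As a rough illustration of the framing described in the modifications above, the sketch below writes a 2-byte magic marker and a 4-byte CRC32C in front of a message body and verifies it on the receiving side. It is not the code from this PR: it uses the JDK's `java.util.zip.CRC32C` and `java.nio.ByteBuffer` in place of the Circe SSE4.2 provider and Netty's `ByteBuf`, and the marker value `0x0e01` and all class and method names are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

final class ChecksumFrameSketch {
    // Assumed marker value; the PR only names the field magicCrc32c.
    static final short MAGIC_CRC32C = 0x0e01;

    /** CRC32C over data[offset, offset + length), truncated to 32 bits. */
    static int crc32c(byte[] data, int offset, int length) {
        CRC32C crc = new CRC32C();
        crc.update(data, offset, length);
        return (int) crc.getValue();
    }

    /** Builds [magic:2][checksum:4][body]. */
    static byte[] frame(byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + body.length);
        buf.putShort(MAGIC_CRC32C);
        buf.putInt(crc32c(body, 0, body.length));
        buf.put(body);
        return buf.array();
    }

    /** Returns true when the marker is present and the stored checksum matches the body. */
    static boolean verify(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        if (buf.remaining() < 6 || buf.getShort() != MAGIC_CRC32C) {
            return false; // this sketch treats a missing marker as a failure
        }
        int stored = buf.getInt();
        return stored == crc32c(frame, 6, frame.length - 6);
    }

    public static void main(String[] args) {
        byte[] framed = frame("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact: " + verify(framed));
        framed[framed.length - 1] ^= 0x01; // flip one bit of the body
        System.out.println("after bit flip: " + verify(framed));
    }
}
```

The marker lets a receiver tell frames with a checksum apart from frames without one, which matters while checksums are still disabled by default on the producer side.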

Result

The client and broker can use the SSE4.2 CRC32C checksum to identify message corruption.

rdhabalia · 2016-09-27T23:32:18Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+     */
+    protected boolean updateChecksumIfRequire(OpSendMsg op) {
+
+        if (op.cmd instanceof DoubleByteBuf) {


Here, to avoid creating a new ByteBuf, we modify the message's existing DoubleByteBuf with the newly computed checksum.
However, if a memory leak is detected during message creation, we create a SimpleLeakAwareByteBuf or AdvancedLeakAwareByteBuf (depending on the ResourceLeak level) instead of a DoubleByteBuf. So, should we keep this check?

yahoocla · 2016-09-27T23:32:20Z

CLA is valid!

merlimat · 2016-09-27T23:32:00Z

pulsar-common/src/main/proto/PulsarApi.proto

@@ -51,7 +51,7 @@ message MessageMetadata {
 	optional CompressionType compression = 8 [default = NONE];
 	optional uint32 uncompressed_size = 9 [default = 0];
 	// XXHash64 checksum of the original message payload
-	optional sfixed64 checksum = 10;
+	//optional sfixed64 checksum = 10;


Add a comment mentioning that this field was removed in favor of the header+payload checksum.

Sure, mentioned it in a comment.

merlimat · 2016-09-27T23:36:11Z

pulsar-common/src/main/java/com/yahoo/pulsar/common/api/Commands.java

@@ -406,18 +428,23 @@ private static ByteBuf serializeWithSize(BaseCommand.Builder cmdBuilder) {
        return buf;
    }

-    private static ByteBuf serializeCommandSendWithSize(BaseCommand.Builder cmdBuilder, MessageMetadata msgMetadata,
+    private static ByteBuf serializeCommandSendWithSize(BaseCommand.Builder cmdBuilder, boolean includeChecksum, MessageMetadata msgMetadata,


Prefer an enum value instead of boolean to make it easier to read where the function is called.

Sure. Added an enum ChecksumType instead of the boolean.
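
To make the readability argument concrete, here is a minimal sketch of an enum parameter replacing the boolean. Only the enum name `ChecksumType` comes from this thread; the constant names and the helper method are hypothetical.

```java
final class ChecksumTypeSketch {
    /** Constant names are assumed; the thread only names the enum itself. */
    enum ChecksumType { CRC32C, NONE }

    /** Extra header bytes the sender writes for the given checksum type. */
    static int checksumOverhead(ChecksumType type) {
        return type == ChecksumType.CRC32C ? 2 + 4 : 0; // magic (short) + checksum (int)
    }

    public static void main(String[] args) {
        // The call site now states its intent, unlike a bare `true` or `false`.
        System.out.println(checksumOverhead(ChecksumType.CRC32C));
        System.out.println(checksumOverhead(ChecksumType.NONE));
    }
}
```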

merlimat · 2016-09-27T23:37:06Z

pulsar-common/src/main/java/com/yahoo/pulsar/common/api/Commands.java

+            if (includeChecksum) {
+                headers.writeShort(magicCrc32c);
+                checksumReaderIndex = headers.writerIndex();
+                headers.writeZero(4); // write dummy checksum int to skip 4 bytes in write index


Can just move the writerIndex 4 bytes

Yes, changed it to skip 4 bytes by advancing the writerIndex.
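
The reserve-then-backfill pattern discussed here can be sketched with a plain `java.nio.ByteBuffer`, where `position(...)` plays the role of Netty's `writerIndex(...)` and the absolute `putInt(index, value)` plays the role of `setInt(index, value)`. The marker value and the class and method names are assumptions for illustration, and the JDK `CRC32C` class stands in for the checksum provider used in the PR.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

final class ReserveChecksumSlot {
    static final short MAGIC_CRC32C = 0x0e01; // assumed marker value

    /** Writes [magic:2][checksum:4][metadata], filling the checksum in last. */
    static ByteBuffer writeHeaders(byte[] metadata) {
        ByteBuffer headers = ByteBuffer.allocate(2 + 4 + metadata.length);
        headers.putShort(MAGIC_CRC32C);
        int checksumIndex = headers.position();  // remember where the checksum goes
        headers.position(checksumIndex + 4);     // skip 4 bytes instead of writing zeros
        int metadataStart = headers.position();
        headers.put(metadata);

        CRC32C crc = new CRC32C();
        crc.update(headers.array(), metadataStart, metadata.length);
        // Absolute write: fills the reserved slot without moving the position.
        headers.putInt(checksumIndex, (int) crc.getValue());
        headers.flip();
        return headers;
    }

    public static void main(String[] args) {
        ByteBuffer headers = writeHeaders(new byte[] {1, 2, 3});
        System.out.printf("frame bytes=%d checksum=0x%08x%n", headers.remaining(), headers.getInt(2));
    }
}
```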

merlimat · 2016-09-27T23:39:54Z

pulsar-common/src/main/java/com/yahoo/pulsar/common/api/Commands.java

        CommandSendError sendError = sendErrorBuilder.build();
        ByteBuf res = serializeWithSize(BaseCommand.newBuilder().setType(Type.SEND_ERROR).setSendError(sendError));
        sendErrorBuilder.recycle();
        sendError.recycle();
        return res;
    }
+
+
+    public static Long readChecksum(ByteBuf buffer) {


I'd prefer to have 2 methods:

boolean hasChecksum(ByteBuf headersAndPayload);
int readChecksum(ByteBuf headersAndPayload);

Introduced hasChecksum() again.
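
A minimal sketch of the two-method split suggested above, written against `java.nio.ByteBuffer` rather than Netty's `ByteBuf` so it runs without dependencies. The method names follow the review comment; the marker value and the error handling are assumptions.

```java
import java.nio.ByteBuffer;

final class ChecksumAccessors {
    static final short MAGIC_CRC32C = 0x0e01; // assumed marker value

    /** Peeks at the next two bytes without moving the read position. */
    static boolean hasChecksum(ByteBuffer headersAndPayload) {
        return headersAndPayload.remaining() >= 6
                && headersAndPayload.getShort(headersAndPayload.position()) == MAGIC_CRC32C;
    }

    /** Consumes the marker and the checksum; callers check hasChecksum() first. */
    static int readChecksum(ByteBuffer headersAndPayload) {
        if (!hasChecksum(headersAndPayload)) {
            throw new IllegalStateException("buffer does not start with a checksum");
        }
        headersAndPayload.getShort();       // skip the marker
        return headersAndPayload.getInt();  // primitive int, no nullable Long needed
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(6);
        buf.putShort(MAGIC_CRC32C).putInt(42).flip();
        if (hasChecksum(buf)) {
            System.out.println("checksum=" + readChecksum(buf));
        }
    }
}
```

Separating the presence test from the read avoids overloading a `null` return value as the "no checksum" signal.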

merlimat · 2016-09-27T23:50:25Z

pulsar-common/src/main/java/com/yahoo/pulsar/common/api/Commands.java

+                headers.retain();
+                payload.retain();
+                headers.readerIndex(metadataReaderIndex);
+                ByteBuf msgMetadataBuf = DoubleByteBuf.get(headers, payload);


Instead of using a DoubleByteBuf, wouldn't it be easier to compute the checksum incrementally?

Actually, there are two things.

First, the Circe Java implementation doesn't support incremental checksums:

public int resume(int current, long address, long length) { throw new UnsupportedOperationException(); }

Second, on the client side the payload is an UnpooledHeapByteBuf, so we can't use memoryAddress, and the Java implementation doesn't support incremental checksums over an array either.

After #44 gets merged, I guess we should make it work with incremental checksums for all kinds of buffers.

Refactored the code to avoid creating a new ByteBuf; it computes the checksum on the same DoubleByteBuf, so it doesn't need an incremental checksum either.
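
For readers unfamiliar with incremental checksums, the sketch below shows the idea using the JDK's `java.util.zip.CRC32C` rather than Circe: feeding the headers and then the payload into one checksum object gives the same value as checksumming their concatenation, so no combined buffer is needed. Class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

final class IncrementalChecksum {
    /** Checksum over two separate buffers without concatenating them. */
    static int incremental(byte[] headers, byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(headers, 0, headers.length);  // first chunk
        crc.update(payload, 0, payload.length);  // resumes from the running state
        return (int) crc.getValue();
    }

    /** Reference: checksum over an explicitly concatenated copy. */
    static int oneShot(byte[] headers, byte[] payload) {
        byte[] all = new byte[headers.length + payload.length];
        System.arraycopy(headers, 0, all, 0, headers.length);
        System.arraycopy(payload, 0, all, headers.length, payload.length);
        CRC32C crc = new CRC32C();
        crc.update(all, 0, all.length);
        return (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] headers = "metadata".getBytes(StandardCharsets.UTF_8);
        byte[] payload = "payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(incremental(headers, payload) == oneShot(headers, payload));
    }
}
```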

merlimat · 2016-09-27T23:53:39Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+     * @param op
+     * @return isUpdated: returns true only if checksum is updated in {@link OpSendMsg}
+     */
+    protected boolean updateChecksumIfRequire(OpSendMsg op) {


Method name is a bit awkward. What about verifyLocalBufferIsNotCorrupted() ?

Actually, this method not only verifies the checksum but also updates it if it doesn't match the newly computed checksum (see the test case and the caller).

Uhm, in any case it should not "update" the checksum. If the checksum doesn't match, we just need to return an error to the application.

Renamed the method to verifyLocalBufferIsNotCorrupted. It now only verifies the checksum: if the checksum differs, it fails the callback; otherwise we retry sending the message.
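
The decision agreed on in this thread can be sketched as follows: when the broker reports a checksum error, the client recomputes the checksum over its local copy and either resends (local copy intact, so corruption happened in transit) or fails the callback (local copy already corrupted). The `Action` enum and method name are hypothetical, and the JDK `CRC32C` class stands in for the checksum provider used in the PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

final class LocalVerificationSketch {
    enum Action { RESEND, FAIL_CALLBACK }

    static int crc32c(byte[] body) {
        CRC32C crc = new CRC32C();
        crc.update(body, 0, body.length);
        return (int) crc.getValue();
    }

    /** Decides what to do after the broker rejects a message with a checksum error. */
    static Action onBrokerChecksumError(int storedChecksum, byte[] localBody) {
        // Never rewrite the stored checksum: a mismatch means the local buffer is corrupted.
        return crc32c(localBody) == storedChecksum ? Action.RESEND : Action.FAIL_CALLBACK;
    }

    public static void main(String[] args) {
        byte[] body = "order-42".getBytes(StandardCharsets.UTF_8);
        int stored = crc32c(body);
        System.out.println(onBrokerChecksumError(stored, body));
        body[0] ^= 0x01; // simulate corruption of the local buffer
        System.out.println(onBrokerChecksumError(stored, body));
    }
}
```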

merlimat · 2016-09-27T23:54:26Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+                msg.resetReaderIndex();
+            }
+        } else {
+            log.warn("[{}] [{}] Memory leak detected while creating message with id {}", topic, producerName,


When can this happen? And why would that be a memory leak?

I have seen it in the past. I will try to reproduce it.

It seems the very first DoubleByteBuf object is created as a SimpleLeakAwareByteBuf or AdvancedLeakAwareByteBuf, depending on the ResourceLeak level. It can be reproduced by starting any test case with a breakpoint in Commands where we create the DoubleByteBuf; the very first time, it creates a LeakAwareByteBuf.

This is because of the following code in Netty:

if ((leakCheckCnt ++ & mask) == 0) { reportLeak(level); return new DefaultResourceLeak(obj); } else { return null; }

The very first time, leakCheckCnt is 0, so it reports a leak, which causes a LeakAwareByteBuf to be created.

The fix has been merged in Netty.

I don't think the leak is a false positive. The commit in Netty doesn't change the substance of the leak detector. By default, when the leak detection level is simple, it picks 1 out of 100 allocated buffers and instruments it for leak detection.

Whether you start picking the 1st buffer or the 101st, only changes when you're going to detect the leak.

Now that you've fixed the incremental crc part, we should get rid of the DoubleByteBuf here

Yes. With Netty leak detection, my only concern was that we might create a LeakAwareByteBuf, which would make the cast to DoubleByteBuf fail.

Even with the incremental checksum, we need the two buffers (b1, b2) that are wrapped by the created DoubleByteBuf, and to retrieve them we might have to cast to DoubleByteBuf. So we may not be able to get rid of DoubleByteBuf, right?

Actually, yes, we can avoid DoubleByteBuf.
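
The sampling check quoted earlier in this thread, `(leakCheckCnt++ & mask) == 0`, can be simulated in isolation to see which allocations get instrumented. This is a standalone sketch, not Netty code; the mask value 127 (a sampling interval of 128) is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class LeakSamplingSketch {
    /** Returns the indexes of the allocations that the counter-and-mask check would sample. */
    static List<Long> sampledAllocations(long mask, long allocations) {
        List<Long> sampled = new ArrayList<>();
        long leakCheckCnt = 0;
        for (long i = 0; i < allocations; i++) {
            if ((leakCheckCnt++ & mask) == 0) { // same shape as the quoted check
                sampled.add(i);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        // With mask 127, allocation 0 is sampled, then every 128th one after it.
        System.out.println(sampledAllocations(127, 300));
    }
}
```

Because the counter starts at 0, the very first allocation is always one of the sampled ones, which matches the observation that the first DoubleByteBuf comes back wrapped in a leak-aware buffer.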

merlimat · 2016-09-27T23:54:47Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+     * 
+     * @param op
+     */
+    private static void removeChecksum(OpSendMsg op) {


stripChecksum() ?

Sure. Renamed the method to stripChecksum().

merlimat · 2016-09-27T23:55:46Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+            temp.skipBytes(cmdSize);
+            boolean hasChecksum = Commands.readChecksum(temp) != null;
+
+            if (hasChecksum) {


if (!hasChecksum) { return; } // strip the checksum ....

sure. addressed this.

merlimat · 2016-09-27T23:56:27Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+        int msgBufSize = op.cmd.readableBytes();
+        // ByteBuf can't use readBytes() to same buffer as it always requires readerIndex < writerIndex while writing
+        // into buffer. So, creating new temp ByteBuf to copy data without checksum.
+        ByteBuf temp = op.cmd.alloc().buffer(msgBufSize, msgBufSize);


We should be able to avoid the copy

Modified the same buf to strip the checksum without creating a new buf.
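
One way to strip the checksum without a temporary copy is to slide the bytes that follow it over the 6 checksum bytes within the same backing array. The sketch below assumes a frame layout of [totalSize:4][cmdSize:4][cmd][magic:2][checksum:4][metadata and payload], a heap buffer starting at array offset 0, and a marker value of `0x0e01`; none of these details are confirmed by this thread, and the PR's actual implementation may differ.

```java
import java.nio.ByteBuffer;

final class StripChecksumSketch {
    static final short MAGIC_CRC32C = 0x0e01; // assumed marker value
    static final int CHECKSUM_SIZE = 2 + 4;   // magic (short) + checksum (int)

    /** Removes [magic][checksum] in place; returns the same buffer with a smaller limit. */
    static ByteBuffer stripChecksum(ByteBuffer frame) {
        int cmdSize = frame.getInt(4);
        int checksumStart = 4 + 4 + cmdSize;
        if (frame.limit() < checksumStart + CHECKSUM_SIZE
                || frame.getShort(checksumStart) != MAGIC_CRC32C) {
            return frame; // no checksum, nothing to strip
        }
        byte[] bytes = frame.array(); // heap buffers only in this sketch
        int restStart = checksumStart + CHECKSUM_SIZE;
        int restLength = frame.limit() - restStart;
        // arraycopy handles overlapping ranges within one array correctly.
        System.arraycopy(bytes, restStart, bytes, checksumStart, restLength);
        frame.putInt(0, frame.getInt(0) - CHECKSUM_SIZE); // fix the total size field
        frame.limit(frame.limit() - CHECKSUM_SIZE);
        return frame;
    }

    public static void main(String[] args) {
        ByteBuffer frame = ByteBuffer.allocate(19);
        frame.putInt(15).putInt(2).put(new byte[] {9, 9})
             .putShort(MAGIC_CRC32C).putInt(1234).put(new byte[] {1, 2, 3}).flip();
        System.out.println("before=" + frame.limit() + " after=" + stripChecksum(frame).limit());
    }
}
```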

rdhabalia · 2016-10-06T00:27:53Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+                    int checksum = readChecksum(headerFrame).intValue();
+                    // msg.readerIndex is already at header-payload index, Recompute checksum for headers-payload
+                    int metadataChecksum = computeChecksum(headerFrame);
+                    long computedChecksum = resumeChecksum(metadataChecksum, msg.getSecond());


Merged the change with the incremental checksum computation.

merlimat · 2016-10-06T03:52:31Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+            if (sequenceId == expectedSequenceId) {
+                boolean corrupted = !verifyLocalBufferIsNotCorrupted(op);
+                if (corrupted) {
+                    op.callback.sendComplete(


The op was just peeked from the queue but not actually removed here.

Yes, now removing and cleaning up the op from the queue after failing the callback.
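
The peek-versus-remove point can be illustrated with a plain `java.util.Queue`: `peek()` only inspects the head, so a corrupted op stays queued unless it is explicitly removed. The generic helper and its parameter names are hypothetical stand-ins for the producer's pending-message queue and callback.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Predicate;

final class PendingQueueSketch {
    /** Fails and dequeues the head op if it is corrupted; returns true when it was dropped. */
    static <T> boolean failHeadIfCorrupted(Queue<T> pending, Predicate<T> isCorrupted,
                                           Consumer<T> failCallback) {
        T op = pending.peek(); // inspects the head; the op is still in the queue
        if (op == null || !isCorrupted.test(op)) {
            return false;
        }
        failCallback.accept(op); // report the error to the application
        pending.remove();        // then actually take the op out of the queue
        return true;
    }

    public static void main(String[] args) {
        Queue<String> pending = new ArrayDeque<>();
        pending.add("msg-1");
        pending.add("msg-2");
        boolean dropped = failHeadIfCorrupted(pending, "msg-1"::equals,
                op -> System.out.println("failed " + op));
        System.out.println(dropped + " remaining=" + pending);
    }
}
```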

merlimat · 2016-10-06T20:46:42Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

@@ -81,6 +88,8 @@

    private static final AtomicLongFieldUpdater<ProducerImpl> msgIdGeneratorUpdater = AtomicLongFieldUpdater
            .newUpdater(ProducerImpl.class, "msgIdGenerator");
+    // it prevents client to compute checksum and adding into payload
+    private static boolean checksumEnabled = false;


Why do we need this flag? Shouldn't we always enable the checksum?

I think it was part of our rollout plan. We wanted to enable this feature in two phases: roll out the broker first and enable it on the client side later.

Client will already check for the v6 protocol version, right?

Addressed it.
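
A minimal sketch of gating on the negotiated protocol version instead of a static flag, as suggested above. Only the version number 6 comes from this thread; the constant and method names are hypothetical.

```java
final class ChecksumNegotiationSketch {
    // v6 is the protocol version mentioned in this thread; the constant name is assumed.
    static final int CHECKSUM_MIN_PROTOCOL_VERSION = 6;

    /** The client adds a checksum only when the broker it connected to can verify it. */
    static boolean shouldSendChecksum(int brokerProtocolVersion) {
        return brokerProtocolVersion >= CHECKSUM_MIN_PROTOCOL_VERSION;
    }

    public static void main(String[] args) {
        System.out.println(shouldSendChecksum(5));
        System.out.println(shouldSendChecksum(6));
    }
}
```

With a check like this, an upgraded client talking to an older broker simply omits the checksum, so a two-phase rollout does not need a separate switch.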

merlimat · 2016-10-07T21:07:02Z

pulsar-broker/src/main/java/com/yahoo/pulsar/broker/service/Producer.java

+                if (checksum == computedChecksum) {
+                    return true;
+                } else {
+                    log.error("[{}] [{}] Failed to verify checksum", topic, producerName);


We should be able to include the message id as well at this point.

Actually, at this point we haven't persisted the message yet, so we don't have a message id to log.

merlimat · 2016-10-07T21:32:49Z

pulsar-checksum/src/main/java/com/yahoo/pulsar/checksum/utils/Crc32cChecksum.java

    private final static IncrementalIntHash CRC32C_HASH;

    static {
        if (Sse42Crc32C.isSupported()) {
            CRC32C_HASH = new Crc32cSse42Provider().getIncrementalInt(CRC32C);
+            log.info("SSE4.2 CRC32C provider initialized");


Move this to debug level

Actually, it logs only once, when the broker starts. So can we keep it at INFO initially, to confirm that the broker loaded the library successfully and is not computing the checksum with the slower software algorithm?

Though, this will also print in the client lib logs.

Sure, changed the log level to debug.
However, the client lib should also print it only once, and I think it would be great for users to have transparency about which version (hw/sw) of the checksum the app is using.

sboobna · 2016-10-07T23:23:30Z

pulsar-client/src/main/java/com/yahoo/pulsar/client/impl/ProducerImpl.java

+            }
+        }
+        // close connection and let producer resend pending-messages
+        cnx.ctx().close();


Should we resend the messages without closing the connection?

Yes, actually we can call resendMessages(cnx); to resend without closing or disturbing the connection.

sboobna

👍

merlimat · 2016-10-11T16:34:50Z

@rdhabalia This looks good to go. Can you rebase to resolve the conflict in SimpleProducerConsumerTest.java ?

merlimat

👍

rdhabalia force-pushed the checksum branch 3 times, most recently from fe5bdb2 to dfe114e Compare September 28, 2016 01:46

rdhabalia added the type/enhancement label Sep 28, 2016

rdhabalia added this to the 1.15 milestone Sep 28, 2016

rdhabalia self-assigned this Sep 28, 2016

rdhabalia force-pushed the checksum branch 7 times, most recently from ffdec7c to 65388ee Compare September 29, 2016 18:17

rdhabalia force-pushed the checksum branch 8 times, most recently from f858b05 to 3b64d60 Compare October 6, 2016 00:24

rdhabalia force-pushed the checksum branch 2 times, most recently from 0dc4f28 to 02c6693 Compare October 6, 2016 19:10

rdhabalia force-pushed the checksum branch 2 times, most recently from cfd3480 to e78c8b9 Compare October 9, 2016 21:00

sboobna approved these changes Oct 10, 2016

Add crc32c-checksum verification on message header-payload

51cd914

rdhabalia force-pushed the checksum branch from e78c8b9 to 51cd914 Compare October 11, 2016 17:37

merlimat approved these changes Oct 11, 2016

merlimat merged commit 309d753 into apache:master Oct 11, 2016

rdhabalia deleted the checksum branch January 23, 2017 22:10
