Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810

Closed

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented May 30, 2024

Fixes #22601 #21892 #19460
This PR replaces #22760

Motivation

In Pulsar, there are multiple reported issues where the transferred output gets corrupted and fails with exceptions around invalid reader and writer index. One source of these issues are the ones which occur only when TLS is enabled between clients and Pulsar broker or between Pulsar broker and bookies.

In Pulsar, the sharing of ByteBuf instance happens in this case at least via the broker cache (RangeEntryCacheManagerImpl) and the pending reads manager (PendingReadsManager).

The SslHandler related issue was originally reported in Pulsar in 2018 with #2401 . The fix that time was #2464.
The ByteBuf .copy() method was used to copy the ByteBuf. There hasn't been a similar solution in Bookkeeper or Bookkeeper client to address corruption that is caused by Netty SslHandler.
One of the problems with .copy() is that it's unefficient. I have also created a PR to Netty to make SslHandler not mutate input buffers. The PR is netty/netty#14086 .

The Failed to peek sticky key from the message metadata java.lang.IllegalArgumentException: Invalid unknonwn tag type: 4 exceptions are caused by the SslHandler mutation issue between broker and bookies. It also corrupts the data that gets written to bookkeeper since bookkeeper doesn't check the checksum at writing time, only when it's retrieved from the storage.
java.lang.IndexOutOfBoundsException: readerIndex: 31215, writerIndex: 21324 (expected: 0 <= readerIndex <= writerIndex <= capacity(65536)) type of exceptions on the broker side are also symptoms of the same problem.

The root cause of such exceptions could also be different. A shared Netty ByteBuf must have at least have an independent view created with duplicate, slice or retainedDuplicate if the readerIndex is mutated.
The ByteBuf instance must also be properly shared in a thread safe way. Failing to do that could result in similar symptoms and this PR doesn't fix that.

Modifications

  • Remove the ByteBufPair.CopyingEncoder and make the ByteBufPair.Encoder suitable for both use cases
    • A .retainedSlice() ByteBuf needs to be passed to SslHandler so that it doesn't get mutated. A deep copy isn't required.
    • This solution is also more performant since there will be less memory copies.
  • Workaround an IntelliJ issue where it shows a red mark in PulsarChannelInitializer classes (casting to ChannelHandler fixes the issue).
  • Address the data corruption issue between broker and bookies by using .retainedSlice() for the input ByteBuf.
    • refactor the code to reduce duplication and unnecessary methods

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari
Copy link
Member Author

lhotari commented May 30, 2024

This PR doesn't yet address the root cause although I was already assuming that it does.

@lhotari
Copy link
Member Author

lhotari commented May 30, 2024

It's possible that netty/netty#14086 is needed to address the issues.
I haven't found a workaround to the problems yet.

Tested with with Pulsar v3.2.3 patched to use Netty 4.1.111.Final-SNAPSHOT (without macos or arm64 native libraries)
v3.2.3...lhotari:pulsar:lh-pulsar-3.2.3-netty-4.1.111.Final-SNAPSHOT

git clone -b lh-dont-mutate-sslhandler-input-buffers-by-default https://github.com/lhotari/netty
cd netty
mvn clean install -DskipTests -Dcheckstyle.skip=true
mvn -pl transport-native-unix-common,transport-native-epoll install -DskipTests

cd ..
git clone -b lh-pulsar-3.2.3-netty-4.1.111.Final-SNAPSHOT https://github.com/lhotari/pulsar
cd pulsar
mvn -Pcore-modules,-main clean install -DskipTests -Dspotbugs.skip=true -Dcheckstyle.skip=true
# this is where the distribution is:
cp distribution/server/target/apache-pulsar-3.2.3-bin.tar.gz ~

The reproducer https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env doesn't reproduce the problem with Netty PR 14086 changes.

@lhotari
Copy link
Member Author

lhotari commented Jun 1, 2024

Finally found the root cause of the TLS data corruption and IndexOutOfBounds issues: netty/netty#14086 (comment)

@lhotari
Copy link
Member Author

lhotari commented Jun 3, 2024

I'll resume this PR after Netty 4.1.111.Final has released with the required fixes.

@lhotari
Copy link
Member Author

lhotari commented Jun 11, 2024

Closing this PR since Netty 4.1.111.Final will address the problematic issue: #22892 . I'll create a separate PR to remove the obsolete CopyingEncoder once that is merged.

@lhotari lhotari closed this Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] parseMessageMetadata error when broker entry metadata enable with high loading
1 participant