[fix](brpc) disable SSL BIO buffer #64962

Open
luwei16 wants to merge 1 commit into apache:master from luwei16:luwei/fix-brpc-buffer
Conversation

@luwei16 luwei16 commented Jun 29, 2026

Contributor
Problem Summary:

Doris bundles brpc 1.4.0 as a third-party dependency. When mTLS is enabled,
brpc adds an extra buffered BIO layer after the SSL handshake. Large
meta-service get_rowset responses can expose a TLS write issue in this path:
SSL_write() may consume the plaintext successfully, but the subsequent
BIO_flush() can hit a non-fatal EAGAIN while encrypted bytes are still buffered.

brpc 1.4.0 does not surface that flush EAGAIN as SSL_ERROR_WANT_WRITE to
the outer KeepWrite/EPOLLOUT retry path. The server side may therefore
treat the write as complete while the BE receives an incomplete brpc
response frame and eventually times out.
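The failure mode described above can be illustrated with a toy model. This is a sketch under stated assumptions, not brpc or OpenSSL code: `ToySocket`, `ToyBufferedLayer`, `WriteOutcome`, and `write_ignoring_flush` are hypothetical names, the socket's remaining capacity stands in for send-queue backpressure, and a short send stands in for EAGAIN.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a non-blocking socket under backpressure.
struct ToySocket {
    std::size_t capacity;   // bytes the send queue still accepts
    std::string delivered;  // bytes that actually reached the peer

    // Accepts at most `capacity` bytes; a short count models EAGAIN.
    std::size_t send(const std::string& data) {
        std::size_t n = std::min(capacity, data.size());
        delivered.append(data, 0, n);
        capacity -= n;
        return n;
    }
};

// Hypothetical stand-in for a buffered layer between the TLS engine and the socket.
struct ToyBufferedLayer {
    ToySocket& sock;
    std::string pending;  // bytes accepted from the caller but not yet sent

    // Always reports success because the bytes are only queued.
    std::size_t write(const std::string& record) {
        pending += record;
        return record.size();
    }

    // Returns true only if the whole buffer was drained to the socket.
    bool flush() {
        std::size_t n = sock.send(pending);
        pending.erase(0, n);
        return pending.empty();
    }
};

struct WriteOutcome {
    std::size_t reported_written;    // what the caller believes was sent
    std::size_t actually_delivered;  // what the peer received
    std::size_t stranded;            // bytes left in the buffer
};

// Models a caller that never learns the flush was incomplete.
WriteOutcome write_ignoring_flush(std::size_t socket_capacity,
                                  const std::string& response) {
    ToySocket sock{socket_capacity, {}};
    ToyBufferedLayer layer{sock, {}};
    std::size_t reported = layer.write(response);
    (void)layer.flush();  // incomplete-flush signal is dropped here
    // Every byte is either delivered or stranded in the buffer.
    assert(sock.delivered.size() + layer.pending.size() == response.size());
    return {reported, sock.delivered.size(), layer.pending.size()};
}
```

Calling `write_ignoring_flush(4, "0123456789")` in this model reports 10 bytes written while only 4 reach the peer and 6 stay stranded, which mirrors the incomplete response frame seen by the BE.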

This is hard to reproduce locally because it depends on socket
backpressure during the buffered BIO flush, not just on response size.
In production, factors such as the Service/VIP path, the CNI, node load,
conntrack, and full send queues can make this timing window easier to hit.

This PR backports the upstream brpc approach of disabling AddBIOBuffer(...)
after the SSL handshake. Without this buffered BIO layer, SSL_write() can surface
SSL_ERROR_WANT_WRITE directly to brpc's existing KeepWrite/EPOLLOUT retry
mechanism. Increasing the timeout is not a real fix because the server side
may already have misjudged the write as complete.
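Why an unbuffered write path recovers can be sketched the same way. Again this is a toy model and not brpc's KeepWrite implementation: `WriteStatus`, `ToyBackpressuredSocket`, `RetryResult`, and `write_with_retry` are hypothetical names, `kWantWrite` stands in for SSL_ERROR_WANT_WRITE, and `become_writable` stands in for an EPOLLOUT notification after the send queue drains.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

enum class WriteStatus { kOk, kWantWrite };

// Hypothetical non-blocking socket that reports backpressure to its caller.
struct ToyBackpressuredSocket {
    std::size_t capacity;   // bytes the send queue still accepts
    std::string delivered;  // bytes that actually reached the peer

    // Sends as much as fits starting at `offset` and advances it.
    // Returns kWantWrite while part of `data` is still unsent.
    WriteStatus write(const std::string& data, std::size_t& offset) {
        std::size_t n = std::min(capacity, data.size() - offset);
        delivered.append(data, offset, n);
        capacity -= n;
        offset += n;
        return offset == data.size() ? WriteStatus::kOk
                                     : WriteStatus::kWantWrite;
    }

    // Models the send queue draining, followed by a writable event.
    void become_writable(std::size_t bytes) { capacity += bytes; }
};

struct RetryResult {
    std::size_t delivered;  // bytes that reached the peer
    int retries;            // number of writable events waited for
};

// Retries the write on every "writable" event until nothing is left.
RetryResult write_with_retry(std::size_t initial_capacity,
                             std::size_t drain_per_event,
                             const std::string& response) {
    assert(drain_per_event > 0);  // otherwise the loop could never finish
    ToyBackpressuredSocket sock{initial_capacity, {}};
    std::size_t offset = 0;
    int retries = 0;
    while (sock.write(response, offset) == WriteStatus::kWantWrite) {
        sock.become_writable(drain_per_event);
        ++retries;
    }
    return {sock.delivered.size(), retries};
}
```

In this model `write_with_retry(4, 3, "0123456789")` delivers all 10 bytes after 2 retries, because the short write is visible to the caller instead of being hidden inside a buffer.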

### Release note

Fix a possible meta-service get_rowset timeout with mTLS when brpc's
buffered TLS BIO fails to flush all encrypted bytes under socket
backpressure.

### Check List (For Author)

- Test: No need to test. Third-party brpc patch only; the failure
  depends on production socket backpressure timing.
- Behavior changed: Yes. TLS connections no longer use brpc's extra SSL
  buffered BIO layer, so WANT_WRITE is handled by the existing retry
  mechanism.
- Does this need documentation: No

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

@luwei16 luwei16 force-pushed the luwei/fix-brpc-buffer branch from 86bf13a to fbc0ea3 Compare June 29, 2026 12:22
luwei16 commented Jun 29, 2026

Contributor Author

run buildall

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Jun 29, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

luwei16 commented Jun 30, 2026

Contributor Author

run beut

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.1.x, dev/4.0.x, dev/4.1.x, reviewed
