tls: fix mutex deadlock risk and harden error/ALPN handling#11458
tls: fix mutex deadlock risk and harden error/ALPN handling#11458
Conversation
- correct TLS handshake error flow to avoid double SSL_get_error() misuse and improve failure reporting - free SSL_CTX on context allocation failure in tls_context_create() - harden ALPN setup: replace old value safely, validate token length, and fix temp buffer cleanup Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
📝 WalkthroughWalkthroughAdjusts TLS session null-checks and frees on allocation failure; adds strict ALPN token and wire-format validation and safer ALPN adoption; improves OpenSSL handshake error classification (WANT_READ/WANT_WRITE vs fatal), syscall/error inspection, and resource cleanup on failures. Changes
Sequence Diagram(s)(No sequence diagrams generated.) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8b2ff178c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
c8b2ff1 to
23fa833
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/tls/openssl.c`:
- Around line 261-278: The aggregate ALPN wire-format length is stored into a
signed char and can overflow when summing multiple tokens; add a bounds check on
wire_format_alpn_index (the aggregate length) before writing the next token and
before assigning wire_format_alpn[0] so the total never exceeds 255, and ensure
you treat the length as unsigned when passing to SSL_CTX_set_alpn_protos
(referencing wire_format_alpn, wire_format_alpn_index, alpn_token_length,
active_alpn[0], and SSL_CTX_set_alpn_protos); if adding the next token would
overflow (wire_format_alpn_index + alpn_token_length + 1 > 255) return an error,
free allocated buffers as currently done, and only then set the length byte and
call SSL_CTX_set_alpn_protos with the correct unsigned length.
🧹 Nitpick comments (1)
src/tls/openssl.c (1)
1355-1358: Pre-existing:ERR_error_string_nreceivesSSL_ERROR_*constant instead of an OpenSSL error code.The handshake function was correctly fixed to use
ERR_get_error()for error formatting (line 1580), buttls_net_readstill passes theSSL_get_error()return value (ret, anSSL_ERROR_SYSCALLconstant) toERR_error_string_n, which expects anunsigned longfromERR_get_error(). This produces meaningless error messages.Consider applying a similar fix as in the handshake path:
Suggested fix
else if (ret == SSL_ERROR_SYSCALL) { flb_errno(); - ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1); - flb_error("[tls] syscall error: %s", err_buf); + err_code = ERR_get_error(); + if (err_code != 0) { + ERR_error_string_n(err_code, err_buf, sizeof(err_buf)-1); + flb_error("[tls] syscall error: %s", err_buf); + } + else { + flb_error("[tls] syscall error: %s", strerror(errno)); + }
23fa833 to
edf947b
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tls/openssl.c (1)
1381-1394:⚠️ Potential issue | 🟡 MinorPre-existing:
ERR_error_string_nreceivesSSL_get_errorresult, not an OpenSSL error code.At line 1383,
retholds theSSL_get_error()return (e.g.,SSL_ERROR_SYSCALL = 5), butERR_error_string_nexpects anunsigned longfromERR_get_error(). This produces a meaningless error string. Thetls_net_writeSYSCALL path (line 1446) handles this correctly withERR_get_error().Since this PR is specifically hardening TLS error handling, consider fixing this path too:
Proposed fix
else if (ret == SSL_ERROR_SYSCALL) { flb_errno(); - ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1); - flb_error("[tls] syscall error: %s", err_buf); + err_code = ERR_get_error(); + if (err_code != 0) { + ERR_error_string_n(err_code, err_buf, sizeof(err_buf)-1); + flb_error("[tls] syscall error: %s", err_buf); + } + else { + flb_error("[tls] syscall error: %s", strerror(errno)); + }(Requires declaring
unsigned long err_codeintls_net_read.)
- Make tls_context_alpn_set() atomic:
keep previous ctx->alpn until new ALPN is fully parsed and applied.
- Preserve previous ALPN on allocation/validation/OpenSSL update failures.
- Fix mutex handling in tls_set_ciphers():
always unlock ctx->mutex before returning.
- Remove deadlock risk on cipher-list failure path.
- Fix wrong key-load error log variable (key_file vs crt_file).
- Fix tls_net_read() syscall error reporting:
use ERR_get_error() with ERR_error_string_n(), fallback to strerror(errno).
- Fix tls_net_write() syscall path:
avoid consuming OpenSSL error queue twice.
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
edf947b to
ff289c1
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tls/openssl.c (1)
1404-1417:⚠️ Potential issue | 🟠 Major
else if (ret < 0)is unreachable — SSL errors go unlogged.At line 1374,
retis reassigned toSSL_get_error()which returns values in the range 0–9. The conditionret < 0at line 1404 can never be true, making the new error-logging code at lines 1405–1413 dead code.As a result,
SSL_ERROR_SSL(value 1) andSSL_ERROR_ZERO_RETURN(value 6) both fall through to the bareelseat line 1415, which silently returns -1 without logging. The detailedERR_get_error()reporting was likely intended forSSL_ERROR_SSL.Proposed fix
- else if (ret < 0) { + else { err_code = ERR_get_error(); if (err_code != 0) { ERR_error_string_n(err_code, err_buf, sizeof(err_buf)-1); flb_error("[tls] error: %s", err_buf); } else { flb_error("[tls] error: %s", strerror(errno)); } - } - else { + ret = -1; }
|
@cosmo0920 before doing the backport we need more tests on this one |
This PR includes two TLS hardening fixes in OpenSSL usage:
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit