tls: fix mutex deadlock risk and harden error/ALPN handling #11458

Merged
edsiper merged 2 commits into master from tls-openssl-fixes on Feb 13, 2026

@edsiper edsiper commented Feb 12, 2026

This PR includes two TLS hardening fixes in OpenSSL usage:

  • src/tls/openssl.c: fixed mutex handling in tls_set_ciphers() to ensure ctx->mutex is always unlocked on all return paths, removing a potential deadlock on cipher-list failures; also corrected the key-load error log to reference the proper variable.
  • src/tls/flb_tls.c: improved TLS error flow and ALPN setup by tightening failure handling, preventing unsafe reuse/replacement behavior, validating token length, and fixing cleanup paths.
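The mutex fix in tls_set_ciphers() comes down to never returning while the lock is held. The following is a minimal sketch of that single-exit pattern, not the actual Fluent Bit code: `demo_ctx`, `demo_apply_cipher_list()` (a stand-in for SSL_CTX_set_cipher_list()) and the other `demo_` names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the TLS backend context; not the Fluent Bit struct. */
struct demo_ctx {
    pthread_mutex_t mutex;
    char ciphers[64];
};

/* Hypothetical stand-in for SSL_CTX_set_cipher_list(): 1 on success, 0 on failure. */
static int demo_apply_cipher_list(struct demo_ctx *ctx, const char *list)
{
    if (list == NULL || list[0] == '\0' || strlen(list) >= sizeof(ctx->ciphers)) {
        return 0;
    }
    strcpy(ctx->ciphers, list);
    return 1;
}

/* Returns 0 on success, -1 on failure; the mutex is released on both paths. */
static int demo_set_ciphers(struct demo_ctx *ctx, const char *list)
{
    int ret = 0;

    pthread_mutex_lock(&ctx->mutex);

    if (demo_apply_cipher_list(ctx, list) != 1) {
        /* Record the failure instead of returning with the lock held. */
        ret = -1;
    }

    pthread_mutex_unlock(&ctx->mutex);
    return ret;
}

/* Returns 1 if the mutex can be acquired right now, 0 if it is still held. */
static int demo_mutex_is_free(struct demo_ctx *ctx)
{
    if (pthread_mutex_trylock(&ctx->mutex) != 0) {
        return 0;
    }
    pthread_mutex_unlock(&ctx->mutex);
    return 1;
}
```

With an early `return -1` inside the locked region, the next caller of `demo_set_ciphers()` on the same context would block forever; routing every outcome through one unlock avoids that.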

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes
    • Safer TLS session creation with proper null-checks and guaranteed cleanup on error
    • ALPN validation and safer adoption to prevent invalid protocol configurations
    • More accurate TLS handshake error classification (recoverable vs fatal) and clearer diagnostics
    • Ensured TLS context/cipher and intermediate allocations are freed on failure
    • Corrected certificate/key error logging and more robust protocol parsing

  - correct TLS handshake error flow to avoid double SSL_get_error() misuse and improve failure reporting
  - free SSL_CTX on context allocation failure in tls_context_create()
  - harden ALPN setup: replace old value safely, validate token length, and fix temp buffer cleanup

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>

coderabbitai Bot commented Feb 12, 2026

📝 Walkthrough

Adjusts TLS session null-checks and frees on allocation failure; adds strict ALPN token and wire-format validation and safer ALPN adoption; improves OpenSSL handshake error classification (WANT_READ/WANT_WRITE vs fatal), syscall/error inspection, and resource cleanup on failures.

Changes

| Cohort / File(s) | Summary |
|---|---|
| TLS session management: src/tls/flb_tls.c | Validate session->ptr instead of session for NULL; log on failure and free allocated vhost and session before returning -1. |
| OpenSSL TLS implementation: src/tls/openssl.c | Enforce per-ALPN-token (≤255) and total wire-format bounds; build new_alpn and adopt into ctx->alpn only on success; client-side ALPN now uses SSL_CTX_set_alpn_protos with correct lengths; ensure cleanup of intermediate allocations and free SSL_CTX on tls_context allocation failure; correct key_file error logs; loop over protocol defs with NULL sentinel; check and return on SSL_CTX_set_cipher_list failure; replace direct return-code checks with SSL_get_error (ssl_error) and handle SSL_ERROR_SYSCALL via ERR_peek_error/verify results; set net_error on fatal syscall errors and distinguish WANT_READ/WANT_WRITE transitions. |

Sequence Diagram(s)

(No sequence diagrams generated.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

backport to v4.0.x

Suggested reviewers

  • cosmo0920
  • fujimotos

Poem

🐰 I nudge the bytes with careful paws,
ALPN trimmed to proper laws,
Handshakes listen, errors named,
Old allocations gently tamed,
Hoppity hop — the TLS is neat!

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (5 files, all content conflicts): cmake/libbacktrace.cmake, src/flb_processor.c, src/tls/flb_tls.c, src/tls/openssl.c, tests/runtime_shell/processor_conditional.sh. These conflicts must be resolved before merging into master. | Resolve conflicts locally and push changes to this branch. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing mutex deadlock risk and hardening error/ALPN handling in TLS code, which aligns with the changeset across both flb_tls.c and openssl.c files. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8b2ff178c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tls/openssl.c Outdated

edsiper commented Feb 12, 2026

@codex review

Codex Review: Didn't find any major issues. What shall we delve into next?

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/tls/openssl.c`:
- Around line 261-278: The aggregate ALPN wire-format length is stored into a
signed char and can overflow when summing multiple tokens; add a bounds check on
wire_format_alpn_index (the aggregate length) before writing the next token and
before assigning wire_format_alpn[0] so the total never exceeds 255, and ensure
you treat the length as unsigned when passing to SSL_CTX_set_alpn_protos
(referencing wire_format_alpn, wire_format_alpn_index, alpn_token_length,
active_alpn[0], and SSL_CTX_set_alpn_protos); if adding the next token would
overflow (wire_format_alpn_index + alpn_token_length + 1 > 255) return an error,
free allocated buffers as currently done, and only then set the length byte and
call SSL_CTX_set_alpn_protos with the correct unsigned length.
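The bounds check requested above can be sketched in isolation. This is an illustrative encoder under stated assumptions, not the code in src/tls/openssl.c: `demo_alpn_encode()` and `DEMO_ALPN_MAX` are hypothetical names, the input is assumed to be a comma-separated protocol list, and the 255-byte cap on the aggregate length is taken from the review comment. Each token is written as one unsigned length byte followed by the token bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ALPN_MAX 255   /* cap on both a single token and the total, per the review */

/*
 * Encode a comma-separated list such as "h2,http/1.1" as length-prefixed
 * tokens. Returns the number of bytes written to `out`, or -1 on error.
 * A single trailing comma is tolerated; empty tokens elsewhere are rejected.
 */
static int demo_alpn_encode(const char *list, unsigned char *out, size_t out_size)
{
    size_t index = 0;
    size_t token_length;
    const char *start = list;
    const char *end;

    if (list == NULL || out == NULL) {
        return -1;
    }

    while (*start != '\0') {
        end = strchr(start, ',');
        token_length = (end != NULL) ? (size_t) (end - start) : strlen(start);

        if (token_length == 0 || token_length > DEMO_ALPN_MAX) {
            return -1;
        }

        /* Aggregate check happens before anything is written. */
        if (index + token_length + 1 > DEMO_ALPN_MAX ||
            index + token_length + 1 > out_size) {
            return -1;
        }

        out[index++] = (unsigned char) token_length;
        memcpy(out + index, start, token_length);
        index += token_length;

        if (end == NULL) {
            break;
        }
        start = end + 1;
    }

    return (index > 0) ? (int) index : -1;
}
```

Keeping the buffer and its length bytes unsigned avoids the sign problem the comment describes, and checking `index + token_length + 1` before the write means a rejected list never leaves a partially overflowed buffer behind.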
🧹 Nitpick comments (1)
src/tls/openssl.c (1)

1355-1358: Pre-existing: ERR_error_string_n receives SSL_ERROR_* constant instead of an OpenSSL error code.

The handshake function was correctly fixed to use ERR_get_error() for error formatting (line 1580), but tls_net_read still passes the SSL_get_error() return value (ret, an SSL_ERROR_SYSCALL constant) to ERR_error_string_n, which expects an unsigned long from ERR_get_error(). This produces meaningless error messages.

Consider applying a similar fix as in the handshake path:

Suggested fix
         else if (ret == SSL_ERROR_SYSCALL) {
             flb_errno();
-            ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
-            flb_error("[tls] syscall error: %s", err_buf);
+            err_code = ERR_get_error();
+            if (err_code != 0) {
+                ERR_error_string_n(err_code, err_buf, sizeof(err_buf)-1);
+                flb_error("[tls] syscall error: %s", err_buf);
+            }
+            else {
+                flb_error("[tls] syscall error: %s", strerror(errno));
+            }

Comment thread src/tls/openssl.c

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/tls/openssl.c (1)

1381-1394: ⚠️ Potential issue | 🟡 Minor

Pre-existing: ERR_error_string_n receives SSL_get_error result, not an OpenSSL error code.

At line 1383, ret holds the SSL_get_error() return (e.g., SSL_ERROR_SYSCALL = 5), but ERR_error_string_n expects an unsigned long from ERR_get_error(). This produces a meaningless error string. The tls_net_write SYSCALL path (line 1446) handles this correctly with ERR_get_error().

Since this PR is specifically hardening TLS error handling, consider fixing this path too:

Proposed fix
         else if (ret == SSL_ERROR_SYSCALL) {
             flb_errno();
-            ERR_error_string_n(ret, err_buf, sizeof(err_buf)-1);
-            flb_error("[tls] syscall error: %s", err_buf);
+            err_code = ERR_get_error();
+            if (err_code != 0) {
+                ERR_error_string_n(err_code, err_buf, sizeof(err_buf)-1);
+                flb_error("[tls] syscall error: %s", err_buf);
+            }
+            else {
+                flb_error("[tls] syscall error: %s", strerror(errno));
+            }

(Requires declaring unsigned long err_code in tls_net_read.)
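The decision logic in the proposed fix can be shown without linking against OpenSSL. In this sketch `demo_describe_syscall_error()` is a hypothetical helper: `queued_code` stands for whatever ERR_get_error() returned (0 when the error queue is empty) and `saved_errno` for errno captured right after the failed call. The formatted hex string is only a placeholder for what ERR_error_string_n() would produce.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Fluent Bit.
 * Prefer the queued OpenSSL error code; fall back to errno when the queue is empty.
 */
static void demo_describe_syscall_error(unsigned long queued_code, int saved_errno,
                                        char *buf, size_t size)
{
    if (queued_code != 0) {
        /* Real code would call ERR_error_string_n(queued_code, buf, size) here. */
        snprintf(buf, size, "openssl error 0x%08lx", queued_code);
    }
    else {
        snprintf(buf, size, "%s", strerror(saved_errno));
    }
}
```

The point of the split is that the SSL_get_error() result only selects the branch; it is never itself passed to the string-formatting call.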

  - Make tls_context_alpn_set() atomic:
    keep previous ctx->alpn until new ALPN is fully parsed and applied.
  - Preserve previous ALPN on allocation/validation/OpenSSL update failures.
  - Fix mutex handling in tls_set_ciphers():
    always unlock ctx->mutex before returning.
  - Remove deadlock risk on cipher-list failure path.
  - Fix wrong key-load error log variable (key_file vs crt_file).
  - Fix tls_net_read() syscall error reporting:
    use ERR_get_error() with ERR_error_string_n(), fallback to strerror(errno).
  - Fix tls_net_write() syscall path:
    avoid consuming OpenSSL error queue twice.

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
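The first items of the commit message above, keeping the previous ctx->alpn until the new value is fully parsed, follow a build-then-swap pattern. A minimal sketch, assuming a context that owns a heap-allocated ALPN string; `demo_tls_ctx`, `demo_alpn_valid()` and `demo_alpn_set()` are hypothetical and the validation rule (no empty tokens) is chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context; only the field relevant to the sketch is shown. */
struct demo_tls_ctx {
    char *alpn;
};

/* Illustrative rule: non-empty list with no empty comma-separated tokens. */
static int demo_alpn_valid(const char *alpn)
{
    size_t i;
    size_t token_length = 0;

    for (i = 0; alpn[i] != '\0'; i++) {
        if (alpn[i] == ',') {
            if (token_length == 0) {
                return 0;
            }
            token_length = 0;
        }
        else {
            token_length++;
        }
    }
    return token_length > 0;
}

/* Returns 0 on success; on any failure ctx->alpn keeps its previous value. */
static int demo_alpn_set(struct demo_tls_ctx *ctx, const char *alpn)
{
    char *new_alpn;
    size_t length;

    if (ctx == NULL || alpn == NULL || !demo_alpn_valid(alpn)) {
        return -1;
    }

    length = strlen(alpn);
    new_alpn = malloc(length + 1);
    if (new_alpn == NULL) {
        return -1;
    }
    memcpy(new_alpn, alpn, length + 1);

    /* Commit point: only now is the old value released and replaced. */
    free(ctx->alpn);
    ctx->alpn = new_alpn;
    return 0;
}
```

Because the old pointer is freed only after every fallible step has succeeded, a rejected update cannot leave the context with a dangling or half-built ALPN value.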

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/tls/openssl.c (1)

1404-1417: ⚠️ Potential issue | 🟠 Major

else if (ret < 0) is unreachable — SSL errors go unlogged.

At line 1374, ret is reassigned to the result of SSL_get_error(), which is always a non-negative SSL_ERROR_* code. The condition ret < 0 at line 1404 can therefore never be true, making the new error-logging code at lines 1405–1413 dead code.

As a result, SSL_ERROR_SSL (value 1) and SSL_ERROR_ZERO_RETURN (value 6) both fall through to the bare else at line 1415, which silently returns -1 without logging. The detailed ERR_get_error() reporting was likely intended for SSL_ERROR_SSL.

Proposed fix
-        else if (ret < 0) {
+        else {
             err_code = ERR_get_error();
 
             if (err_code != 0) {
                 ERR_error_string_n(err_code, err_buf, sizeof(err_buf)-1);
                 flb_error("[tls] error: %s", err_buf);
             }
             else {
                 flb_error("[tls] error: %s", strerror(errno));
             }
-        }
-        else {
+
             ret = -1;
         }
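The reachability problem is easier to see in a reduced form. The sketch below is an assumption-laden model, not the tls_net_read() code: the `DEMO_SSL_ERROR_*` constants mirror the numeric values of OpenSSL's SSL_ERROR_* codes, while `DEMO_RETRY_READ`, `DEMO_RETRY_WRITE`, `demo_logged` and `demo_map_read_error()` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Local copies of the relevant SSL_ERROR_* values; all are non-negative. */
enum {
    DEMO_SSL_ERROR_SSL         = 1,
    DEMO_SSL_ERROR_WANT_READ   = 2,
    DEMO_SSL_ERROR_WANT_WRITE  = 3,
    DEMO_SSL_ERROR_SYSCALL     = 5,
    DEMO_SSL_ERROR_ZERO_RETURN = 6
};

/* Hypothetical return codes for the caller. */
#define DEMO_RETRY_READ   -2
#define DEMO_RETRY_WRITE  -3

/* Counts how many fatal errors were reported, so the behavior is observable. */
static int demo_logged = 0;

static int demo_map_read_error(int ssl_error)
{
    if (ssl_error == DEMO_SSL_ERROR_WANT_READ) {
        return DEMO_RETRY_READ;
    }
    else if (ssl_error == DEMO_SSL_ERROR_WANT_WRITE) {
        return DEMO_RETRY_WRITE;
    }
    else {
        /*
         * Catch-all branch. A test such as `ssl_error < 0` placed here would
         * never match, because every code above is zero or positive.
         */
        fprintf(stderr, "[demo] tls error, code=%d\n", ssl_error);
        demo_logged++;
        return -1;
    }
}
```

With the catch-all in place, `DEMO_SSL_ERROR_SSL` and `DEMO_SSL_ERROR_ZERO_RETURN` both produce a log line before returning -1, which is the behavior the proposed fix aims for.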

@edsiper edsiper merged commit efc80cf into master Feb 13, 2026
67 of 69 checks passed
@edsiper edsiper deleted the tls-openssl-fixes branch February 13, 2026 16:06

edsiper commented Feb 13, 2026

@cosmo0920 before doing the backport we need more tests on this one
