http_server: serialize worker teardown to prevent race conditions #11845

Merged: edsiper merged 9 commits into master from cosmo0920-handle-rigidly-http-connections-teardown-and-initialize on May 27, 2026

@cosmo0920 cosmo0920 commented May 26, 2026

In our CI environment, http_server-related tasks have been failing intermittently (flaky) lately.
So, we need to fix these failures.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Improved thread-safety for downstream registration/removal by adding proper synchronization to avoid races during startup/shutdown.
    • Hardened HTTP server worker lifecycle: atomic shutdown signaling, safer startup/shutdown ordering, and more robust cleanup to prevent race conditions and leaks.
  • Tests

    • Added a test ensuring worker exit callbacks run on the same thread that performed worker initialization.
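The thread-affinity check described in the Tests item can be illustrated with a minimal sketch. It assumes a hypothetical `worker_record` struct and hypothetical `on_worker_init` / `on_worker_exit` callbacks; these are stand-ins for illustration, not the actual test context in tests/internal/http_server.c. Only `pthread_self()` and `pthread_equal()` are taken from the description above.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical per-worker record; not the Fluent Bit test context. */
struct worker_record {
    pthread_t init_thread;    /* thread that ran the init callback */
    int       init_count;
    int       exit_count;
    int       mismatch_count; /* exit callbacks seen on another thread */
};

/* Hypothetical init callback: remember which thread initialized us. */
static void on_worker_init(struct worker_record *rec)
{
    rec->init_thread = pthread_self();
    rec->init_count++;
}

/* Hypothetical exit callback: verify we are still on the init thread. */
static void on_worker_exit(struct worker_record *rec)
{
    if (!pthread_equal(rec->init_thread, pthread_self())) {
        rec->mismatch_count++;
    }
    rec->exit_count++;
}

static void *worker_main(void *arg)
{
    struct worker_record *rec = arg;

    on_worker_init(rec);
    /* a real worker would run its event loop here */
    on_worker_exit(rec);
    return NULL;
}

/* Runs one worker to completion; returns the mismatch count, or -1 on error. */
static int run_worker_once(struct worker_record *rec)
{
    pthread_t tid;

    memset(rec, 0, sizeof(*rec));
    if (pthread_create(&tid, NULL, worker_main, rec) != 0) {
        return -1;
    }
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    /* pthread_join makes the worker's writes visible to the caller */
    return rec->mismatch_count;
}
```

If the exit callback were instead invoked from the thread that joins the worker, `mismatch_count` would become non-zero, which is the kind of regression such a test is meant to catch.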

coderabbitai Bot commented May 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR protects downstream list mutations with a config mutex, switches HTTP worker shutdown signaling to atomics, hardens worker context init/cleanup and runtime start/stop ordering, and adds tests ensuring worker init/exit callbacks run on the originating worker threads.

Changes

Worker and downstream thread safety

Layer / File(s): Summary

  • Downstream registration thread safety (src/flb_downstream.c): flb_downstream_setup now conditionally registers downstreams only when config is non-NULL and wraps mk_list_add/removal with config->collectors_mutex to protect concurrent modification.
  • Worker thread atomic shutdown mechanism (src/http_server/flb_http_server.c): Adds #include <cfl/cfl_atomic.h>, changes worker shutdown flag to uint64_t, and replaces direct flag reads/writes with cfl_atomic_load/cfl_atomic_store.
  • Worker lifecycle init, TLS and runtime ordering (src/http_server/flb_http_server.c): flb_http_server_worker_context_reset now returns int with error handling; worker startup initializes engine/event-loop and DNS context earlier; worker loop termination uses atomic load; cb_worker_exit is invoked only when server is running; introduces flb_http_server_runtime_stop with join and teardown and defers publishing runtime/session state until reset completes.
  • Worker thread safety validation tests (tests/internal/http_server.c): Test context records per-worker init thread and server pointer; exit callback verifies pthread_equal() with init thread. Adds test_http_server_worker_exit_runs_on_worker_thread() asserting init/exit counts and zero thread-mismatch.
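The atomic shutdown signaling summarized above can be sketched as follows. This is an illustration under stated assumptions: it uses C11 `<stdatomic.h>` as a stand-in for the `cfl_atomic_load` / `cfl_atomic_store` calls named in the walkthrough (their exact signatures are not shown in this PR page), and the `worker` struct and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical worker; C11 atomics stand in for cfl_atomic here. */
struct worker {
    pthread_t        thread;
    _Atomic uint64_t should_stop; /* 0 = keep running, 1 = shut down */
    uint64_t         iterations;  /* written only by the worker thread */
};

static void *worker_loop(void *arg)
{
    struct worker *w = arg;

    /* Atomic load: no data race with the store done by the stopper. */
    while (atomic_load(&w->should_stop) == 0) {
        w->iterations++;
        sched_yield(); /* placeholder for one event-loop iteration */
    }
    return NULL;
}

/* Starts the worker, signals shutdown, and joins it. Returns 0 on success. */
static int start_and_stop_worker(struct worker *w)
{
    atomic_init(&w->should_stop, 0);
    w->iterations = 0;

    if (pthread_create(&w->thread, NULL, worker_loop, w) != 0) {
        return -1;
    }

    /* Signal shutdown atomically, then wait before any teardown. */
    atomic_store(&w->should_stop, 1);
    if (pthread_join(w->thread, NULL) != 0) {
        return -1;
    }

    /* Only after the join is it safe to release worker-owned resources. */
    return 0;
}
```

Joining before teardown is the part that matters for the crashes described in this PR: a flag alone tells the worker to stop, but only the join guarantees it has stopped touching shared state.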

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

backport to v4.1.x, backport to v4.2.x

Suggested reviewers

  • edsiper
  • niedbalski
  • patrick-stephens

Poem

🐇 Threads that hop where moonbeams thread,
Mutexes snug the paths they tread,
Atomics whisper, "time to end",
Callbacks land where they began, my friend,
Tests hum softly — all is well again.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Title check: ✅ Passed. The title accurately summarizes the main objective of the changeset: serializing worker teardown to prevent race conditions in the http_server component.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@cosmo0920 cosmo0920 requested a deployment to pr May 26, 2026 10:48 — with GitHub Actions Abandoned
@cosmo0920 cosmo0920 force-pushed the cosmo0920-handle-rigidly-http-connections-teardown-and-initialize branch from 7e9d916 to c4fd85e on May 26, 2026 11:05
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
This commit ensures session->runtime is assigned immediately upon allocation
so that any startup failures will properly route through flb_http_server_runtime_stop,
which guarantees complete worker teardown and prevents SIGSEGVs
caused by background workers executing against freed structures.

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920 cosmo0920 force-pushed the cosmo0920-handle-rigidly-http-connections-teardown-and-initialize branch from 8f0b40a to 162133f on May 27, 2026 08:59
@cosmo0920 cosmo0920 force-pushed the cosmo0920-handle-rigidly-http-connections-teardown-and-initialize branch from 162133f to 6903912 on May 27, 2026 09:27
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccdb841c53

Comment thread src/http_server/flb_http_server.c
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/flb_downstream.c`:
- Around line 169-171: The insertion into config->downstreams is protected by
pthread_mutex_lock(&config->collectors_mutex) but the removal in
flb_downstream_destroy() is not; wrap the list removal in
flb_downstream_destroy() with the same
pthread_mutex_lock(&config->collectors_mutex) /
pthread_mutex_unlock(&config->collectors_mutex) pairing so
mk_list_del(&stream->base._head, &config->downstreams) is serialized with
mk_list_add (use the exact config->collectors_mutex and mk_list_del call to
modify the deregistration path).
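
The symmetry this comment asks for (the same mutex around both the add and the delete) can be sketched in isolation. The `registry` and `node` types below are hypothetical stand-ins for `config->downstreams` and the `mk_list` entries; this is not Fluent Bit code, only a minimal illustration of serializing registration and deregistration with one lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define THREADS 4
#define ROUNDS  10000

/* Hypothetical intrusive list node, standing in for an mk_list entry. */
struct node {
    struct node *prev;
    struct node *next;
};

/* Hypothetical registry, standing in for config + collectors_mutex. */
struct registry {
    pthread_mutex_t lock;
    struct node     head;  /* circular sentinel */
    int             count;
};

static void registry_init(struct registry *r)
{
    pthread_mutex_init(&r->lock, NULL);
    r->head.prev = &r->head;
    r->head.next = &r->head;
    r->count = 0;
}

/* Registration path: append under the lock. */
static void registry_add(struct registry *r, struct node *n)
{
    pthread_mutex_lock(&r->lock);
    n->prev = r->head.prev;
    n->next = &r->head;
    r->head.prev->next = n;
    r->head.prev = n;
    r->count++;
    pthread_mutex_unlock(&r->lock);
}

/* Deregistration path: unlink under the SAME lock. */
static void registry_del(struct registry *r, struct node *n)
{
    pthread_mutex_lock(&r->lock);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = NULL;
    n->next = NULL;
    r->count--;
    pthread_mutex_unlock(&r->lock);
}

static void *churn(void *arg)
{
    struct registry *r = arg;
    struct node n;
    int i;

    for (i = 0; i < ROUNDS; i++) {
        registry_add(r, &n);
        registry_del(r, &n);
    }
    return NULL;
}

/* Returns the final entry count (expected 0), or -1 if the list is corrupt. */
static int run_registry_demo(void)
{
    struct registry r;
    pthread_t t[THREADS];
    int started = 0;
    int i;
    int result;

    registry_init(&r);
    for (i = 0; i < THREADS; i++) {
        if (pthread_create(&t[i], NULL, churn, &r) != 0) {
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
    }
    result = (r.head.next == &r.head && r.head.prev == &r.head) ? r.count : -1;
    pthread_mutex_destroy(&r.lock);
    return result;
}
```

If `registry_del` skipped the lock, concurrent add and delete operations could interleave their pointer updates and leave the list inconsistent, which is the hazard the comment points at.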

In `@src/http_server/flb_http_server.c`:
- Around line 679-686: Do not publish session->runtime until all per-worker
slots are fully initialized: allocate runtime and runtime->workers, then for
each slot initialize its pthread_mutex_t and pthread_cond_t (loop over
runtime->workers), and only after every init succeeds set runtime->worker_count
and finally assign session->runtime = runtime; if any per-slot init fails, undo
previously initialized mutex/cond, free runtime->workers and runtime, and keep
session->runtime NULL so flb_http_server_runtime_stop() won't see a
partially-initialized array; refer to symbols runtime, runtime->workers,
runtime->worker_count, session->runtime and flb_http_server_runtime_stop() when
making the change.
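
The "initialize everything locally, roll back on failure, publish last" ordering requested here can be sketched as follows. All types and names (`slot`, `runtime`, `session`, `session_start`, `session_stop`, and the `fail_at` fault-injection parameter) are hypothetical and exist only for illustration; the real structures in flb_http_server.c are not shown on this page.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-worker slot. */
struct slot {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

struct runtime {
    struct slot *workers;
    size_t       worker_count;
};

struct session {
    struct runtime *runtime; /* NULL until fully initialized */
};

/*
 * Builds the runtime in a local variable and publishes it only on success.
 * fail_at is a demo-only fault injector: slot index at which init "fails"
 * (pass n to disable). Returns 0 on success, -1 on failure.
 */
static int session_start(struct session *s, size_t n, size_t fail_at)
{
    struct runtime *rt;
    size_t i;

    if (n == 0) {
        return -1;
    }
    rt = calloc(1, sizeof(*rt));
    if (rt == NULL) {
        return -1;
    }
    rt->workers = calloc(n, sizeof(*rt->workers));
    if (rt->workers == NULL) {
        free(rt);
        return -1;
    }

    for (i = 0; i < n; i++) {
        if (i == fail_at) {
            break; /* simulated init failure */
        }
        if (pthread_mutex_init(&rt->workers[i].lock, NULL) != 0) {
            break;
        }
        if (pthread_cond_init(&rt->workers[i].cond, NULL) != 0) {
            pthread_mutex_destroy(&rt->workers[i].lock);
            break;
        }
    }

    if (i < n) {
        /* Undo only the slots that were fully initialized. */
        while (i > 0) {
            i--;
            pthread_cond_destroy(&rt->workers[i].cond);
            pthread_mutex_destroy(&rt->workers[i].lock);
        }
        free(rt->workers);
        free(rt);
        return -1; /* s->runtime is left untouched (NULL) */
    }

    rt->worker_count = n;
    s->runtime = rt; /* publish as the final step */
    return 0;
}

/* Tears down a fully initialized runtime; safe to call when none exists. */
static void session_stop(struct session *s)
{
    struct runtime *rt = s->runtime;
    size_t i;

    if (rt == NULL) {
        return;
    }
    s->runtime = NULL;
    for (i = 0; i < rt->worker_count; i++) {
        pthread_cond_destroy(&rt->workers[i].cond);
        pthread_mutex_destroy(&rt->workers[i].lock);
    }
    free(rt->workers);
    free(rt);
}
```

In this single-threaded sketch a plain pointer assignment is enough to publish the runtime. If another thread can call the stop path concurrently, the assignment would additionally need to be ordered with respect to that reader, for example by performing it under a lock.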

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35bed195-f827-4ff4-be5e-f693109e2ee8

📥 Commits

Reviewing files that changed from the base of the PR and between f6126eb and 0863b57.

📒 Files selected for processing (3)
  • src/flb_downstream.c
  • src/http_server/flb_http_server.c
  • tests/internal/http_server.c

Comment thread src/flb_downstream.c
Comment thread src/http_server/flb_http_server.c Outdated
cosmo0920 added 4 commits May 27, 2026 22:45
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/http_server/flb_http_server.c`:
- Around line 730-731: session->runtime is being published before startup
finishes, allowing flb_http_server_stop() -> flb_http_server_runtime_stop() to
free runtime->workers while the startup loop is still using it; delay assigning
session->runtime until after runtime and runtime->workers are fully initialized
and any startup iteration over runtime->workers is complete. Specifically,
perform all initialization that touches runtime and runtime->workers in the
local runtime variable (and iterate/complete startup work there), then set
session->runtime = runtime as the final step so
flb_http_server_stop()/flb_http_server_runtime_stop() cannot observe a
partially-initialized runtime.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66af687b-fd5d-406e-8586-043467db206f

📥 Commits

Reviewing files that changed from the base of the PR and between 0863b57 and bca3afb.

📒 Files selected for processing (3)
  • src/flb_downstream.c
  • src/http_server/flb_http_server.c
  • tests/internal/http_server.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/internal/http_server.c

Comment thread src/http_server/flb_http_server.c Outdated
cosmo0920 added 2 commits May 27, 2026 22:57
…nished

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>