http_server: serialize worker teardown to prevent race conditions#11845
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR protects downstream list mutations with a config mutex, switches HTTP worker shutdown signaling to atomics, hardens worker context init/cleanup and runtime start/stop ordering, and adds tests ensuring worker init/exit callbacks run on the originating worker threads. ChangesWorker and downstream thread safety
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
7e9d916 to
c4fd85e
Compare
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
c4fd85e to
4285111
Compare
This commit ensures session->runtime is assigned immediately upon allocation so that any startup failures will properly route through flb_http_server_runtime_stop, which guarantees complete worker teardown and prevents SIGSEGVs caused by background workers executing against freed structures. Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
b12e4b3 to
777d5d9
Compare
8f0b40a to
162133f
Compare
162133f to
6903912
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ccdb841c53
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/flb_downstream.c`:
- Around line 169-171: The insertion into config->downstreams is protected by
pthread_mutex_lock(&config->collectors_mutex) but the removal in
flb_downstream_destroy() is not; wrap the list removal in
flb_downstream_destroy() with the same
pthread_mutex_lock(&config->collectors_mutex) /
pthread_mutex_unlock(&config->collectors_mutex) pairing so
mk_list_del(&stream->base._head, &config->downstreams) is serialized with
mk_list_add (use the exact config->collectors_mutex and mk_list_del call to
modify the deregistration path).
In `@src/http_server/flb_http_server.c`:
- Around line 679-686: Do not publish session->runtime until all per-worker
slots are fully initialized: allocate runtime and runtime->workers, then for
each slot initialize its pthread_mutex_t and pthread_cond_t (loop over
runtime->workers), and only after every init succeeds set runtime->worker_count
and finally assign session->runtime = runtime; if any per-slot init fails, undo
previously initialized mutex/cond, free runtime->workers and runtime, and keep
session->runtime NULL so flb_http_server_runtime_stop() won't see a
partially-initialized array; refer to symbols runtime, runtime->workers,
runtime->worker_count, session->runtime and flb_http_server_runtime_stop() when
making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 35bed195-f827-4ff4-be5e-f693109e2ee8
📒 Files selected for processing (3)
src/flb_downstream.csrc/http_server/flb_http_server.ctests/internal/http_server.c
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/http_server/flb_http_server.c`:
- Around line 730-731: session->runtime is being published before startup
finishes, allowing flb_http_server_stop() -> flb_http_server_runtime_stop() to
free runtime->workers while the startup loop is still using it; delay assigning
session->runtime until after runtime and runtime->workers are fully initialized
and any startup iteration over runtime->workers is complete. Specifically,
perform all initialization that touches runtime and runtime->workers in the
local runtime variable (and iterate/complete startup work there), then set
session->runtime = runtime as the final step so
flb_http_server_stop()/flb_http_server_runtime_stop() cannot observe a
partially-initialized runtime.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 66af687b-fd5d-406e-8586-043467db206f
📒 Files selected for processing (3)
src/flb_downstream.csrc/http_server/flb_http_server.ctests/internal/http_server.c
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/internal/http_server.c
…nished Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
In our Ci environment, http_server related tasks are sometimes failed with flaky in these days.
So, we need to plug these failures.
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Bug Fixes
Tests