[release-1.12] Fix race condition in app health and actors/workflow initialization #6972
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
/test-sdk-all
/ok-to-test

Dapr SDK Java test
Commit ref: 7f14294
❌ Java SDK tests failed
Please check the logs for details on the error.

Dapr SDK Go test
Commit ref: 7f14294
✅ Go SDK tests passed

Dapr E2E test
Commit ref: 7f14294
✅ Build succeeded for linux/amd64
✅ Infrastructure deployed
✅ Build succeeded for windows/amd64
❌ Tests failed on windows/amd64
Please check the logs for details on the error.
✅ Tests succeeded on linux/amd64

Dapr SDK Python test
Commit ref: 7f14294
✅ Python SDK tests passed

Dapr SDK JS test
Commit ref: 7f14294
✅ JS SDK tests passed
@ItalyPaleAle Can you add an integration test to prove this works? We can test by running daprd (with a placement address configured), checking that healthz fails, then starting placement and waiting for the healthz endpoint to return healthy.
No, this is not what this PR does.
The issue this fixes is different:
The fix makes it so calls like the one above will block until the actor runtime is initialized, whether successfully or not. I am not sure how to even write a test for this, since testing for the absence of race conditions is always a hell of a problem.
@ItalyPaleAle Please cherry-pick #6974, which will pass on this branch. It doesn't test workflows, as we don't have any infra for them in integration tests yet, but it at least covers actors.
… complete Co-Authored-By: joshvanl <me@joshvanl.dev> Signed-off-by: joshvanl <me@joshvanl.dev>
Thank you, merged this
Codecov Report
Attention:
Additional details and impacted files
@@ Coverage Diff @@
## release-1.12 #6972 +/- ##
================================================
- Coverage 64.84% 64.65% -0.19%
================================================
Files 228 230 +2
Lines 20848 20847 -1
================================================
- Hits 13518 13478 -40
- Misses 6203 6236 +33
- Partials 1127 1133 +6
LGTM
…nitialization (dapr#6972)

* Fix race condition in app health and actors/workflow initialization
  Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
* Adds integration test to prove actors respond before initilization is complete
  Co-Authored-By: joshvanl <me@joshvanl.dev>
  Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: joshvanl <me@joshvanl.dev>
Co-authored-by: joshvanl <me@joshvanl.dev>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>
…o connection. (#6977)

* [release-1.12] Ensure sentry certificate expiry metric is served when provided (#6973)
  * Ensure sentry certificate expiry metric is served when provided
    Signed-off-by: joshvanl <me@joshvanl.dev>
  * Fix spelling of test func `Expiry`
    Signed-off-by: joshvanl <me@joshvanl.dev>
  * Give metric test case a more appropriate name
    Signed-off-by: joshvanl <me@joshvanl.dev>
  * Linting
    Signed-off-by: joshvanl <me@joshvanl.dev>
  ---------
  Signed-off-by: joshvanl <me@joshvanl.dev>
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* [release-1.12] Fix race condition in app health and actors/workflow initialization (#6972)
  * Fix race condition in app health and actors/workflow initialization
    Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
  * Adds integration test to prove actors respond before initilization is complete
    Co-Authored-By: joshvanl <me@joshvanl.dev>
    Signed-off-by: joshvanl <me@joshvanl.dev>
  ---------
  Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
  Signed-off-by: joshvanl <me@joshvanl.dev>
  Co-authored-by: joshvanl <me@joshvanl.dev>
  Co-authored-by: Artur Souza <asouza.pro@gmail.com>
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* Fix issue where gRPC app channel cannot recover from no connection.
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* [1.12] Adds integration test for slow app startup with service invocation
  Signed-off-by: joshvanl <me@joshvanl.dev>
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* Keep default gRPC backoff setting.
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* Fix slowapp invocation test to compensate for race between test and runtime health checks.
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* [1.12] integration: Change daprd default log level to info (#6980)
  Using debug as the default log level for daprd is not ideal as it is quite noisy, and causes issues on slow CI runners which are not able to handle the volume of IO.
  Should improve the stability of the integration tests on slow CI runners.
  Signed-off-by: joshvanl <me@joshvanl.dev>
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* Fix lint.
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>
* Update tests/integration/suite/daprd/serviceinvocation/grpc/slowappstartup.go
  Co-authored-by: Josh van Leeuwen <me@joshvanl.dev>
  Signed-off-by: Artur Souza <asouza.pro@gmail.com>

---------

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Josh van Leeuwen <me@joshvanl.dev>
Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
Fixes the issue discovered in #6968 (comment)
Dapr reports in /healthz (and /healthz/outbound) a ready status when the API servers are ready, but the actor runtime (and thus workflow) is initialized asynchronously.
This means that apps that try to invoke actors or use workflows right after Dapr reports ready can fail. We experienced this error in the E2E tests in #6968, where a refactoring of unrelated code made the race condition appear.
This PR fixes the issue by making all actor and workflow APIs (including those managed by the WF engine) block while the actor runtime is initialized (successfully or not).
I validated this fix by adding a `time.Sleep(10 * time.Second)` at the start of `runtime.appHealthReadyInit`