Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-1.12] Fix race condition in app health and actors/workflow initialization #6972

Merged
merged 3 commits into from Sep 27, 2023

Conversation

ItalyPaleAle
Copy link
Contributor

Fixes the issue discovered in #6968 (comment)

Dapr reports in /healthz (and /healthz/outbound) a ready status when the API servers are ready, but the actor runtime (and thus workflow) is initialized asynchronously.

This means that apps that try to invoke actors or use workflow fail. We've experienced this error in E2E tests in #6968, where a refactoring on unrelated things made the race condition appear.

This PR fixes the issue by making all actor and workflow APIs (including those managed by the WF engine) block while the actor runtime is initialized (successfully or not).

I validated this fix by adding a time.Sleep(10 * time.Second) at the start of runtime.appHealthReadyInit

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle requested review from a team as code owners September 27, 2023 16:38
@ItalyPaleAle
Copy link
Contributor Author

/test-sdk-all

@ItalyPaleAle
Copy link
Contributor Author

/ok-to-test

@mukundansundar mukundansundar mentioned this pull request Sep 27, 2023
@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 27, 2023

Dapr SDK Java test

🔗 Link to Action run

Commit ref: 7f14294

❌ Java SDK tests failed

Please check the logs for details on the error.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 27, 2023

Dapr SDK Go test

🔗 Link to Action run

Commit ref: 7f14294

✅ Go SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 27, 2023

Dapr E2E test

🔗 Link to Action run

Commit ref: 7f14294

✅ Build succeeded for linux/amd64

  • Image tag: dapre2e888e959df3l
  • Test image tag: dapre2e888e959df3l

✅ Infrastructure deployed

Cluster Resource group name Azure region
Linux Dapr-E2E-dapre2e888e959df3l westus3
Windows Dapr-E2E-dapre2e888e959df3w westus3
Linux/arm64 Dapr-E2E-dapre2e888e959df3la eastus

✅ Build succeeded for windows/amd64

  • Image tag: dapre2e888e959df3w
  • Test image tag: dapre2e888e959df3w

❌ Tests failed on windows/amd64

Please check the logs for details on the error.

✅ Tests succeeded on linux/amd64

  • Image tag: dapre2e888e959df3l
  • Test image tag: dapre2e888e959df3l

@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 27, 2023

Dapr SDK Python test

🔗 Link to Action run

Commit ref: 7f14294

✅ Python SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Sep 27, 2023

Dapr SDK JS test

🔗 Link to Action run

Commit ref: 7f14294

✅ JS SDK tests passed

Copy link
Contributor

@JoshVanL JoshVanL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ItalyPaleAle Can you add an integration test to prove this works? We can test by running daprd (with a placement address configured), testing that healthz fails, then starting placement and wait for the healthz endpoint to return healthy.

@ItalyPaleAle
Copy link
Contributor Author

@ItalyPaleAle Can you add an integration test to prove this works? We can test by running daprd (with a placement address configured), testing that healthz fails, then starting placement and wait for the healthz endpoint to return healthy.

No, this is not what this PR does.

  1. Healthz hasn't changed
  2. Also this isn't impacted by whether placement is configured or not. If the actor runtime can't start (because there's no placement configured, or because there's no actor state store, or because the app doens't respond to /dapr/configuration), healthz still returns true and the actors runtime is considered "initialized" (even though is inactive)

The issue this fixes is different:

  1. Apps that use actors (and workflow) are requested to wait for /healthz to report success. See
    await daprClient.WaitForSidecarAsync();
  2. Actors are initialized async in the meanwhile
  3. The app tries to invoke actors or workflow (example:
    var startResponse = await daprClient.StartWorkflowAsync(
    )
    • If the initialization of actors is done, this succeeds
    • If not, it fails

The fix makes it so calls like the one above will block until actors is initialized, whether successfully or not.

I am not sure how to even write a test for this, since testing for absence of race conditions is always a hell of a problem.

@JoshVanL
Copy link
Contributor

JoshVanL commented Sep 27, 2023

@ItalyPaleAle Please cherry pick #6974 which will pass on this branch. It doesn't test workflows as we don't have any infra for them in integration tests yet but at least covers actors.

… complete

Co-Authored-By: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
@ItalyPaleAle
Copy link
Contributor Author

@ItalyPaleAle Please cherry pick #6974 which will pass on this branch. It doesn't test workflows as we don't have any infra for them in integration tests yet but at least covers actors.

Thank you, merged this

@codecov
Copy link

codecov bot commented Sep 27, 2023

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (9514d96) 64.84% compared to head (82f9ad9) 64.65%.

Additional details and impacted files
@@               Coverage Diff                @@
##           release-1.12    #6972      +/-   ##
================================================
- Coverage         64.84%   64.65%   -0.19%     
================================================
  Files               228      230       +2     
  Lines             20848    20847       -1     
================================================
- Hits              13518    13478      -40     
- Misses             6203     6236      +33     
- Partials           1127     1133       +6     
Files Coverage Δ
pkg/runtime/config.go 30.00% <100.00%> (+0.35%) ⬆️
pkg/grpc/api.go 72.91% <93.33%> (+<0.01%) ⬆️
pkg/grpc/universalapi/api_workflow.go 80.47% <85.71%> (+0.22%) ⬆️
pkg/http/api.go 77.95% <93.75%> (-0.35%) ⬇️
pkg/runtime/wfengine/wfengine.go 59.74% <91.66%> (+2.59%) ⬆️
pkg/grpc/universalapi/universalapi.go 33.33% <33.33%> (ø)
pkg/grpc/universalapi/api_actors.go 40.00% <40.00%> (ø)
pkg/runtime/runtime.go 73.14% <58.82%> (+0.13%) ⬆️

... and 7 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@artursouza artursouza merged commit 11c0f2e into dapr:release-1.12 Sep 27, 2023
4 checks passed
Copy link
Member

@artursouza artursouza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

artursouza added a commit to artursouza/dapr that referenced this pull request Sep 29, 2023
…nitialization (dapr#6972)

* Fix race condition in app health and actors/workflow initialization

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Adds integration test to prove actors respond before initilization is complete

Co-Authored-By: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: joshvanl <me@joshvanl.dev>
Co-authored-by: joshvanl <me@joshvanl.dev>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>
yaron2 pushed a commit that referenced this pull request Sep 29, 2023
…o connection. (#6977)

* [release-1.12] Ensure sentry certificate expiry metric is served when provided (#6973)

* Ensure sentry certificate expiry metric is served when provided

Signed-off-by: joshvanl <me@joshvanl.dev>

* Fix spelling of test func `Expiry`

Signed-off-by: joshvanl <me@joshvanl.dev>

* Give metric test case a more appropriate name

Signed-off-by: joshvanl <me@joshvanl.dev>

* Linting

Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* [release-1.12] Fix race condition in app health and actors/workflow initialization (#6972)

* Fix race condition in app health and actors/workflow initialization

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Adds integration test to prove actors respond before initilization is complete

Co-Authored-By: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: joshvanl <me@joshvanl.dev>
Co-authored-by: joshvanl <me@joshvanl.dev>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* Fix issue where gRPC app channel cannot recover from no connection.

Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* [1.12] Adds integration test for slow app startup with service
invocation

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* Keep default gRPC backoff setting.

Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* Fix slowapp invocation test to compensate for race between test and runtime health checks.

Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* [1.12] integration: Change daprd default log level to info (#6980)

Using debug as the default log level for daprd is not ideal as it
is quite noisy, and causes issues on slow CI runners which are not able
to handle the volume of IO. Should improve the stability of the
integration tests on slow CI runners.

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* Fix lint.

Signed-off-by: Artur Souza <asouza.pro@gmail.com>

* Update tests/integration/suite/daprd/serviceinvocation/grpc/slowappstartup.go

Co-authored-by: Josh van Leeuwen <me@joshvanl.dev>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>

---------

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: Artur Souza <asouza.pro@gmail.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Josh van Leeuwen <me@joshvanl.dev>
Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle mentioned this pull request Oct 2, 2023
7 tasks
@artursouza artursouza added this to the v1.12 milestone Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants