When Workers are Enabled Webserver Health Check Takes a Long Time to Respond #2517

Closed
RobertKeyser opened this issue Feb 6, 2023 · 10 comments · Fixed by #3884 or #3898
Labels: bug (Something isn't working)

Comments

@RobertKeyser (Contributor) commented Feb 6, 2023

Bug Description

When workers are enabled, the /health route on the webserver takes > 1 second to respond. I tried with 1, 5, and 10 workers and didn't notice any significant difference in the time it took.

Results of 10 trials each:

Worker Count | Min Time (s) | Median Time (s) | Max Time (s)
1            | 1.13         | 2.48            | 4.29
5            | 1.17         | 2.84            | 4.84
10           | 1.1          | 1.735           | 4.28

Steps to Reproduce

  1. Set up workers
  2. Check the webserver's /health route (see the timing example below)
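
A minimal way to gather these timings, assuming the webserver is reachable at http://localhost:8080 (adjust the host and port for your deployment):

# assumes the webserver is reachable locally; adjust the URL for your deployment
for i in {1..10}; do curl -o /dev/null -s -w 'Total: %{time_total}s\n' http://localhost:8080/health; done

curl's %{time_total} reports the full request time in seconds, so values consistently above 1 will exceed the default Kubernetes probe timeout.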

Expected behavior

< 1 second HTTP response time

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

  • Version:
  • OS:
  • Python Version:
  • Docker Version:

Additional context

Kubernetes' default timeout for probes is 1 second, which means that with these response times Kubernetes will kill the pods unless the liveness probe's timeoutSeconds is set to roughly 5 seconds or more.
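
As a stopgap, the probe timeout can be raised on the deployment. This is only a sketch, not a fix for the underlying slowness; the deployment name, namespace, and container index below are assumptions about the setup, and it assumes a liveness probe is already defined on that container:

# deployment name/namespace below are placeholders for your Fides deployment
kubectl patch deployment fides -n fides --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5}]'

Equivalently, livenessProbe.timeoutSeconds can be set to 5 (or higher) directly in the pod spec.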

@ThomasLaPiana (Contributor)

@RobertKeyser I'm asking this to explicitly confirm: is this only a problem when workers/Celery are enabled?

@RobertKeyser (Contributor, Author)

That's when I started noticing it, but I can't confirm with any certainty that it's related. I'm still seeing slowness, though.

@RobertKeyser (Contributor, Author)

I deployed an instance of Fides with 0 workers and it's blazing fast. The instances of Fides with workers are still sluggish, with response times of ~4 seconds.

@ThomasLaPiana (Contributor)

@RobertKeyser is it only the health endpoint, or other endpoints as well? It sounds like it's only the health check, which I think makes this lower priority.

@daveqnet (Contributor) commented Aug 3, 2023

Re-opening this issue, as unfortunately I can still reproduce it on builds that include #3884. Here's an example of the latency experienced on a staging environment where the fides webserver is deployed with a worker.

$ for i in {1..10}; do curl -o /dev/null -s -w 'Total: %{time_total}s\n' https://fides-nightly.redacted.example.com/health; done
Total: 1.619734s
Total: 7.887261s
Total: 3.163243s
Total: 1.537107s
Total: 1.567872s
Total: 9.011004s
Total: 1.655995s
Total: 1.616508s
Total: 2.597901s
Total: 8.119707s

When workers are eliminated, the issue disappears. The API response time drops to about 500ms on average.

@ThomasLaPiana (Contributor)

I find the inconsistency here really interesting; it makes it less obvious what the issue might be.

@daveqnet (Contributor)

I've tested an alpha image based on #3898 and can confirm that the latency issue is gone.

$ for i in {1..10}; do curl -o /dev/null -s -w 'Total: %{time_total}s\n' https://fides-sandbox.redacted.example.com/health; done
Total: 0.589513s
Total: 0.481228s
Total: 0.469443s
Total: 0.599608s
Total: 0.485188s
Total: 0.483681s
Total: 0.496315s
Total: 0.484560s
Total: 0.497816s
Total: 0.510221s

I will close this issue and get the above PR merged to main, but only after creating a follow-on issue to investigate the real root cause.

@Roger-Ethyca

@daveqnet Is it ok to close this out?

@daveqnet (Contributor)

> @daveqnet Is it ok to close this out?

@Roger-Ethyca, yes indeed, as long as your tests are passing now that #3898 is merged to main?

@Roger-Ethyca

moving to done
