When Workers are Enabled Webserver Health Check Takes a Long Time to Respond #2517

Closed
RobertKeyser opened this issue Feb 6, 2023 · 10 comments · Fixed by #3884 or #3898
Labels: bug (Something isn't working)

Comments

@RobertKeyser (Contributor) commented Feb 6, 2023

Bug Description

When workers are enabled, the /health route on the webserver takes > 1 second to respond. I tried with 1, 5, and 10 workers and didn't notice any significant difference in the time it took.

Results of 10 trials each:

Worker Count | Min Time (s) | Median Time (s) | Max Time (s)
1            | 1.13         | 2.48            | 4.29
5            | 1.17         | 2.84            | 4.84
10           | 1.1          | 1.735           | 4.28

Steps to Reproduce

  1. Set up workers
  2. Check the webserver's /health route (see the timing example below)
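
A minimal way to gather these timings, assuming the webserver is reachable at http://localhost:8080 (adjust the host and port for your deployment):

# assumes the webserver is reachable locally; adjust the URL for your deployment
for i in {1..10}; do curl -o /dev/null -s -w 'Total: %{time_total}s\n' http://localhost:8080/health; done

curl's %{time_total} reports the full request time in seconds, so values consistently above 1 will exceed the default Kubernetes probe timeout.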

Expected behavior

< 1 second HTTP response time

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

  • Version:
  • OS:
  • Python Version:
  • Docker Version:

Additional context

Kubernetes' default timeout for probes is 1 second, which means that with these response times Kubernetes will kill the pods unless the liveness probe's timeoutSeconds is set to roughly 5 seconds or more.
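
As a stopgap, the probe timeout can be raised on the deployment. This is only a sketch, not a fix for the underlying slowness; the deployment name, namespace, and container index below are assumptions about the setup, and it assumes a liveness probe is already defined on that container:

# deployment name/namespace below are placeholders for your Fides deployment
kubectl patch deployment fides -n fides --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5}]'

Equivalently, livenessProbe.timeoutSeconds can be set to 5 (or higher) directly in the pod spec.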

@ThomasLaPiana (Contributor)

@RobertKeyser I'm asking this to explicitly confirm: is this only a problem when workers/Celery are enabled?

@RobertKeyser (Contributor, Author)

That's when I started noticing it, but I can't confirm with any certainty that it's related. I'm still seeing slowness, though.

@RobertKeyser (Contributor, Author)

I deployed an instance of Fides with 0 workers and it's blazing fast. The instances of Fides with workers are still sluggish, with response times of ~4 seconds.

@ThomasLaPiana (Contributor)

@RobertKeyser is it only the health endpoint, or other endpoints as well? It sounds like it's only the health check, which I think makes this lower priority.

@daveqnet (Contributor) commented Aug 3, 2023

Re-opening this issue, as unfortunately I can still reproduce it on builds that include #3884. Here's an example of the latency experienced on a staging environment where the fides webserver is deployed with a worker.

$ for i in {1..10}; do curl -o /dev/null -s -w 'Total: %{time_total}s\n' https://fides-nightly.redacted.example.com/health; done
Total: 1.619734s
Total: 7.887261s
Total: 3.163243s
Total: 1.537107s
Total: 1.567872s
Total: 9.011004s
Total: 1.655995s
Total: 1.616508s
Total: 2.597901s
Total: 8.119707s

When workers are eliminated, the issue disappears. The API response time drops to about 500ms on average.

@ThomasLaPiana (Contributor)

I find the inconsistency here really interesting; it makes it less obvious what the issue might be.

@daveqnet (Contributor)

I've tested an alpha image based on #3898 and can confirm that the latency issue is gone.

$ for i in {1..10}; do curl -o /dev/null -s -w 'Total: %{time_total}s\n' https://fides-sandbox.redacted.example.com/health; done
Total: 0.589513s
Total: 0.481228s
Total: 0.469443s
Total: 0.599608s
Total: 0.485188s
Total: 0.483681s
Total: 0.496315s
Total: 0.484560s
Total: 0.497816s
Total: 0.510221s

I will close this issue and get the above PR merged to main, but only after creating a follow-on issue to investigate the real root cause.

@Roger-Ethyca

@daveqnet Is it ok to close this out?

@daveqnet (Contributor)

> @daveqnet Is it ok to close this out?

@Roger-Ethyca, yes indeed, as long as your tests are passing now that #3898 is merged to main?

@Roger-Ethyca

moving to done
