GIL prometheus metrics are misleading #8557

Closed · crusaderky opened this issue Mar 6, 2024 · 5 comments · Fixed by #8558
Labels: bug, diagnostics, p2, performance

@crusaderky (Collaborator)

I've run a benchmark on a 2-worker cluster that simulates these use cases:

  1. a C extension repeatedly blocks the GIL on one worker in 1-second blocks
  2. a C extension repeatedly blocks the GIL on one worker in 4-second blocks
  3. a C extension repeatedly blocks the GIL on one worker in 15-second blocks
  4. a hot pure-Python loop repeatedly blocks the GIL on one worker in 15-second blocks

In all cases, the Bokeh dashboard displays the GIL metric more or less constantly at 50%. This could be improved, as it hides the disparity between the workers (IMHO something like min+median+max would be better), but otherwise it is accurate.

The same metric on Prometheus (which, on Coiled, scrapes every 5 seconds), however, is misleading.
In all cases, I would expect the metric for the affected worker to read 100%. Instead it reads:

  1. 72%
  2. 20~25%
  3. ~7% (failed to scrape most times)
  4. 162%

In the first three cases, this gives the misleading impression that the GIL is not a problem.
The last case makes no sense to me: how can GIL contention be more than 100%? (Again, the Bokeh dashboard shows a 50% cluster average, or 100% on the affected worker.)

[screenshot: Prometheus GIL contention readings per worker]

This uses Coiled, but a plain LocalCluster will yield the same results (assuming you have Prometheus scraping it; see the sketch after the reproducer for checking a worker's /metrics endpoint directly).

Reproducer

import coiled
import dask
import distributed
import ctypes
from time import time, sleep

cluster = coiled.Cluster(n_workers=2)
client = distributed.Client(cluster)
a, b = client.has_what()  # the addresses of the two workers


def extension_hogs_gil(seconds):
    # ctypes.PyDLL does not release the GIL around the call,
    # so libc.sleep() blocks the GIL for the whole duration
    libc = ctypes.PyDLL('libc.so.6')
    libc.sleep(seconds)


def pure_python_hogs_gil(seconds):
    # A hot pure-Python loop keeps the GIL busy; the clock is only
    # checked every 10,000 iterations to keep the loop tight
    deadline = time() + seconds
    fib0, fib1 = 0, 1
    n = 2
    while True:
        if n % 10_000 == 0 and time() > deadline:
            return n
        fib0, fib1 = fib1, fib0 + fib1
        n += 1


# Case 1: 1-second GIL blocks, back to back, for ~120 seconds
for _ in range(120):
    client.submit(extension_hogs_gil, 1, pure=False, workers=[a]).result()
sleep(10)

# Case 2: 4-second GIL blocks
for _ in range(120 // 4):
    client.submit(extension_hogs_gil, 4, pure=False, workers=[a]).result()
sleep(10)

# Case 3: 15-second GIL blocks
for _ in range(120 // 15):
    client.submit(extension_hogs_gil, 15, pure=False, workers=[a]).result()
sleep(10)

# Case 4: 15-second pure-Python GIL hogs
for _ in range(120 // 15):
    client.submit(pure_python_hogs_gil, 15, pure=False, workers=[a]).result()
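
For completeness, the scraped values can also be checked directly against a worker's /metrics endpoint on a plain LocalCluster, without a Prometheus server. This is only a sketch under a few assumptions: distributed serves Prometheus metrics on the dashboard HTTP server at /metrics when prometheus_client is installed, and the exact name of the GIL metric may differ between versions, so the snippet simply greps for "gil".

import urllib.request

import distributed

cluster = distributed.LocalCluster(n_workers=2, threads_per_worker=1)
client = distributed.Client(cluster)

for addr, info in client.scheduler_info()["workers"].items():
    # Each worker's dashboard HTTP server also exposes Prometheus metrics
    url = f"http://{info['host']}:{info['services']['dashboard']}/metrics"
    body = urllib.request.urlopen(url).read().decode()
    gil_lines = [
        line for line in body.splitlines()
        if "gil" in line and not line.startswith("#")
    ]
    print(addr, gil_lines)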
@mrocklin (Member) commented Mar 6, 2024

cc @ntabris not to do anything here, but because I think you've also found this annoying in the past.

@crusaderky (Collaborator, Author)

CC @fjetter @milesgranger

crusaderky changed the title from "GIL prometheus metrics are misleading when a C extension blocks the GIL" to "GIL prometheus metrics are misleading" on Mar 6, 2024
@fjetter (Member) commented Mar 7, 2024

I suspect the way this is aggregated here

if self.monitor_gil_contention:
    self._last_gil_contention = self._gilknocker.contention_metric
    self._cumulative_gil_contention += self._last_gil_contention
    result["gil_contention"] = self._last_gil_contention
    self._gilknocker.reset_contention_metric()

is not totally correct, or is being interpreted poorly.

I suspect that if the interpreter stands still for a minute, this will report 1 exactly once, and the next tick will then report something harmless. We won't get an update during that time, the actual signal vanishes quickly, and it becomes indistinguishable from a brief but strong contention.

However, I'm just guessing since I don't know what contention_metric reports exactly.
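
For illustration only, here is a toy model of that guess, with made-up numbers and no gilknocker involved: a monitor that reports the contention of the window since its previous tick (and then resets), sampled by a scraper that only reads the most recent value every 5 seconds. A 60-second GIL block followed by 60 idle seconds has a true contention of 50%, yet the average of the scraped values comes out below 10%.

TICK = 0.5    # monitor update interval (s)
SCRAPE = 5.0  # Prometheus scrape interval (s)
BLOCK = 60.0  # the interpreter stands still for 60 s, then idles for 60 s

# The monitor cannot tick while the GIL is held; its first tick after the
# block reports ~1.0 for the whole blocked window, then ~0.0 per tick.
samples = [(BLOCK, 1.0)]
t = BLOCK + TICK
while t <= 2 * BLOCK:
    samples.append((t, 0.0))
    t += TICK

# The scraper only sees the most recent sample at each scrape time
# (or nothing at all while the worker is unresponsive).
scraped = []
for i in range(1, int(2 * BLOCK / SCRAPE) + 1):
    ts = i * SCRAPE
    visible = [v for tv, v in samples if tv <= ts]
    if visible:
        scraped.append(visible[-1])

print("true average contention:", BLOCK / (2 * BLOCK))             # 0.5
print("average of scraped values:", sum(scraped) / len(scraped))   # ~0.08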

crusaderky self-assigned this on Mar 7, 2024
crusaderky added the bug, performance, diagnostics, and p2 labels and removed the needs triage and p3 labels on Mar 7, 2024
@crusaderky (Collaborator, Author) commented Mar 7, 2024

The raw XY plot from the system_monitor looks much better.
The PR to fix prometheus is straightforward.

[screenshot: raw gil_contention samples from the system_monitor over the four phases]

I think there's an additional problem: those points at zero during the second and third phase don't seem right to me. The iteration is to hold the GIL for 15s -> wait ~0.02s for the next task -> hold the GIL for 15s again, so I would expect everything to be close to 1.

I suspect the GIL was acquired by my hog function between the sampling of time() and the reading of the gilknocker metric, or maybe between fetching the measure and the reset call.
I can't see a way around it, short of a new API in gilknocker itself that calculates the cumulative metric internally while holding the GIL.

ts gil_contention
302.295367 0.001318
317.488106 0.999928
317.650887 0.999930
332.678581 0.000000
347.862917 0.999928
363.056614 0.999929
363.149318 0.000000
378.244867 0.999928
393.435291 0.999927
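
For reference, a sketch of one direction such a Prometheus-side fix could take (this is an assumption, not necessarily what #8558 does, and the collector and metric names below are made up): weight the per-window fraction by the window length, accumulate it into a running total of GIL-contended seconds, and export that as a Counter, so the scraper can compute the average over its own interval with rate() instead of sampling a per-tick gauge.

from time import monotonic
from prometheus_client.core import CounterMetricFamily


class GILContentionCollector:
    # Hypothetical collector; only contention_metric and
    # reset_contention_metric are taken from the snippet quoted above.

    def __init__(self, knocker):
        self._knocker = knocker
        self._last_tick = monotonic()
        self._contention_seconds = 0.0  # monotonically increasing

    def tick(self):
        # Called from the periodic monitor: weight the per-window fraction
        # by the window length before accumulating, then reset the window.
        now = monotonic()
        window = now - self._last_tick
        self._last_tick = now
        self._contention_seconds += self._knocker.contention_metric * window
        self._knocker.reset_contention_metric()

    def collect(self):
        # In PromQL, rate(dask_worker_gil_contention_seconds_total[1m]) then
        # yields the average fraction of time the GIL was contended over the
        # last minute, regardless of how scrapes align with monitor ticks.
        yield CounterMetricFamily(
            "dask_worker_gil_contention_seconds_total",
            "Cumulative seconds during which the GIL was contended",
            value=self._contention_seconds,
        )

The key property is that a counter never loses information between scrapes: whatever contention accumulated since the previous scrape shows up in the difference, regardless of how the scrape interval lines up with monitor ticks.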
