GIL prometheus metrics are misleading #8557

Closed · crusaderky opened this issue Mar 6, 2024 · 5 comments · Fixed by #8558
Labels: bug, diagnostics, p2, performance

@crusaderky (Collaborator)

I've run a benchmark on a 2-worker cluster that simulates these use cases:

  1. a C extension repeatedly blocks the GIL on one worker in 1-second blocks
  2. a C extension repeatedly blocks the GIL on one worker in 4-second blocks
  3. a C extension repeatedly blocks the GIL on one worker in 15-second blocks
  4. a hot pure-Python loop repeatedly blocks the GIL on one worker in 15-second blocks

In all cases, the Bokeh dashboard displays the GIL metric more or less constantly at 50%. This could be improved, as it hides the disparity between the workers (IMHO something like min+median+max would be better), but otherwise it is accurate.

The same metric on Prometheus (which, on Coiled, scrapes every 5 seconds), however, is misleading.
In all cases, I would expect the metric for the affected worker to read 100%. Instead it reads:

  1. 72%
  2. 20~25%
  3. ~7% (failed to scrape most times)
  4. 162%

In the first three cases, this gives the misleading impression that the GIL is not a problem.
The last case makes no sense to me: how can GIL contention be more than 100%? (Again, the Bokeh dashboard shows a 50% cluster average, or 100% on the affected worker.)

[screenshot: Prometheus GIL contention readings per worker]

This uses Coiled, but a plain LocalCluster will yield the same results (assuming you have Prometheus scraping it; see the sketch after the reproducer for checking a worker's /metrics endpoint directly).

Reproducer

import coiled
import dask
import distributed
import ctypes
from time import time, sleep

cluster = coiled.Cluster(n_workers=2)
client = distributed.Client(cluster)
a, b = client.has_what()  # the addresses of the two workers


def extension_hogs_gil(seconds):
    # ctypes.PyDLL does not release the GIL around the call,
    # so libc.sleep() blocks the GIL for the whole duration
    libc = ctypes.PyDLL('libc.so.6')
    libc.sleep(seconds)


def pure_python_hogs_gil(seconds):
    # A hot pure-Python loop keeps the GIL busy; the clock is only
    # checked every 10,000 iterations to keep the loop tight
    deadline = time() + seconds
    fib0, fib1 = 0, 1
    n = 2
    while True:
        if n % 10_000 == 0 and time() > deadline:
            return n
        fib0, fib1 = fib1, fib0 + fib1
        n += 1


# Case 1: 1-second GIL blocks, back to back, for ~120 seconds
for _ in range(120):
    client.submit(extension_hogs_gil, 1, pure=False, workers=[a]).result()
sleep(10)

# Case 2: 4-second GIL blocks
for _ in range(120 // 4):
    client.submit(extension_hogs_gil, 4, pure=False, workers=[a]).result()
sleep(10)

# Case 3: 15-second GIL blocks
for _ in range(120 // 15):
    client.submit(extension_hogs_gil, 15, pure=False, workers=[a]).result()
sleep(10)

# Case 4: 15-second pure-Python GIL hogs
for _ in range(120 // 15):
    client.submit(pure_python_hogs_gil, 15, pure=False, workers=[a]).result()
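
For completeness, the scraped values can also be checked directly against a worker's /metrics endpoint on a plain LocalCluster, without a Prometheus server. This is only a sketch under a few assumptions: distributed serves Prometheus metrics on the dashboard HTTP server at /metrics when prometheus_client is installed, and the exact name of the GIL metric may differ between versions, so the snippet simply greps for "gil".

import urllib.request

import distributed

cluster = distributed.LocalCluster(n_workers=2, threads_per_worker=1)
client = distributed.Client(cluster)

for addr, info in client.scheduler_info()["workers"].items():
    # Each worker's dashboard HTTP server also exposes Prometheus metrics
    url = f"http://{info['host']}:{info['services']['dashboard']}/metrics"
    body = urllib.request.urlopen(url).read().decode()
    gil_lines = [
        line for line in body.splitlines()
        if "gil" in line and not line.startswith("#")
    ]
    print(addr, gil_lines)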
@mrocklin (Member) commented Mar 6, 2024

cc @ntabris not to do anything here, but because I think you've also found this annoying in the past.

@crusaderky (Collaborator, Author)

CC @fjetter @milesgranger

crusaderky changed the title from "GIL prometheus metrics are misleading when a C extension blocks the GIL" to "GIL prometheus metrics are misleading" on Mar 6, 2024
@fjetter (Member) commented Mar 7, 2024

I suspect the way this is aggregated here

if self.monitor_gil_contention:
    self._last_gil_contention = self._gilknocker.contention_metric
    self._cumulative_gil_contention += self._last_gil_contention
    result["gil_contention"] = self._last_gil_contention
    self._gilknocker.reset_contention_metric()

is not totally correct, or is being interpreted poorly.

I suspect that if the interpreter stands still for a minute, this will report 1 exactly once, and the next tick will then report something harmless. We won't get an update during that time, the actual signal vanishes quickly, and it becomes indistinguishable from a brief but strong contention.

However, I'm just guessing since I don't know what contention_metric reports exactly.
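
For illustration only, here is a toy model of that guess, with made-up numbers and no gilknocker involved: a monitor that reports the contention of the window since its previous tick (and then resets), sampled by a scraper that only reads the most recent value every 5 seconds. A 60-second GIL block followed by 60 idle seconds has a true contention of 50%, yet the average of the scraped values comes out below 10%.

TICK = 0.5    # monitor update interval (s)
SCRAPE = 5.0  # Prometheus scrape interval (s)
BLOCK = 60.0  # the interpreter stands still for 60 s, then idles for 60 s

# The monitor cannot tick while the GIL is held; its first tick after the
# block reports ~1.0 for the whole blocked window, then ~0.0 per tick.
samples = [(BLOCK, 1.0)]
t = BLOCK + TICK
while t <= 2 * BLOCK:
    samples.append((t, 0.0))
    t += TICK

# The scraper only sees the most recent sample at each scrape time
# (or nothing at all while the worker is unresponsive).
scraped = []
for i in range(1, int(2 * BLOCK / SCRAPE) + 1):
    ts = i * SCRAPE
    visible = [v for tv, v in samples if tv <= ts]
    if visible:
        scraped.append(visible[-1])

print("true average contention:", BLOCK / (2 * BLOCK))             # 0.5
print("average of scraped values:", sum(scraped) / len(scraped))   # ~0.08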

crusaderky self-assigned this on Mar 7, 2024
crusaderky added the bug, performance, diagnostics, and p2 labels and removed the needs triage and p3 labels on Mar 7, 2024
@crusaderky (Collaborator, Author) commented Mar 7, 2024

The raw XY plot from the system_monitor looks much better.
The PR to fix prometheus is straightforward.

[screenshot: raw gil_contention samples from the system_monitor over the four phases]

I think there's an additional problem: those points at zero during the second and third phase don't seem right to me. The iteration is to hold the GIL for 15s -> wait ~0.02s for the next task -> hold the GIL for 15s again, so I would expect everything to be close to 1.

I suspect the GIL was acquired by my hog function between the sampling of time() and the reading of the gilknocker metric, or maybe between fetching the measure and the reset call.
I can't see a way around it, short of a new API in gilknocker itself that calculates the cumulative metric internally while holding the GIL.

ts gil_contention
302.295367 0.001318
317.488106 0.999928
317.650887 0.999930
332.678581 0.000000
347.862917 0.999928
363.056614 0.999929
363.149318 0.000000
378.244867 0.999928
393.435291 0.999927
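
For reference, a sketch of one direction such a Prometheus-side fix could take (this is an assumption, not necessarily what #8558 does, and the collector and metric names below are made up): weight the per-window fraction by the window length, accumulate it into a running total of GIL-contended seconds, and export that as a Counter, so the scraper can compute the average over its own interval with rate() instead of sampling a per-tick gauge.

from time import monotonic
from prometheus_client.core import CounterMetricFamily


class GILContentionCollector:
    # Hypothetical collector; only contention_metric and
    # reset_contention_metric are taken from the snippet quoted above.

    def __init__(self, knocker):
        self._knocker = knocker
        self._last_tick = monotonic()
        self._contention_seconds = 0.0  # monotonically increasing

    def tick(self):
        # Called from the periodic monitor: weight the per-window fraction
        # by the window length before accumulating, then reset the window.
        now = monotonic()
        window = now - self._last_tick
        self._last_tick = now
        self._contention_seconds += self._knocker.contention_metric * window
        self._knocker.reset_contention_metric()

    def collect(self):
        # In PromQL, rate(dask_worker_gil_contention_seconds_total[1m]) then
        # yields the average fraction of time the GIL was contended over the
        # last minute, regardless of how scrapes align with monitor ticks.
        yield CounterMetricFamily(
            "dask_worker_gil_contention_seconds_total",
            "Cumulative seconds during which the GIL was contended",
            value=self._contention_seconds,
        )

The key property is that a counter never loses information between scrapes: whatever contention accumulated since the previous scrape shows up in the difference, regardless of how the scrape interval lines up with monitor ticks.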
