Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weigh gilknocker Prometheus metric by duration #8558

Merged
merged 4 commits into from
Mar 8, 2024

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Mar 7, 2024

main this PR
dask_scheduler_gil_contention_total dask_scheduler_gil_contention_seconds_total
dask_worker_gil_contention_total dask_worker_gil_contention_seconds_total

stress test from #8557:

BEFORE

Screenshot from 2024-03-06 15-17-22

AFTER

Screenshot from 2024-03-07 16-25-14

CC @ntabris @milesgranger

raw_interval = dask.config.get(
"distributed.admin.system-monitor.gil.interval",
)
interval = parse_timedelta(raw_interval, default="us") * 1e6

interval = parse_timedelta(raw_interval) * 1e6
Copy link
Collaborator Author

@crusaderky crusaderky Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't interpret a raw 1 in the config (which violates the jsonschema) as 1us. Crash instead.

self._gilknocker.reset_contention_metric()
result["gil_contention"] = self._last_gil_contention = gil_contention
self.cumulative_gil_contention += duration * gil_contention
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change.

Copy link
Contributor

github-actions bot commented Mar 7, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  +    26      27 suites  +26   10h 2m 13s ⏱️ + 10h 1m 49s
 4 048 tests + 4 003   3 934 ✅ + 3 926    110 💤 +   73   4 ❌ + 4 
50 828 runs  +50 783  48 463 ✅ +48 455  2 334 💤 +2 297  31 ❌ +31 

For more details on these failures, see this check.

Results for commit f919669. ± Comparison against base commit e16a7af.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky marked this pull request as ready for review March 7, 2024 16:31
@crusaderky crusaderky requested a review from fjetter as a code owner March 7, 2024 16:31
docs/source/prometheus.rst Outdated Show resolved Hide resolved
@ntabris
Copy link
Contributor

ntabris commented Mar 7, 2024

Thanks, @crusaderky! It would be nice to be able to get more accurate GIL data via Prometheus.

@crusaderky crusaderky mentioned this pull request Mar 8, 2024
4 tasks
Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @crusaderky! Changes look good, assuming CI turns green.

Copy link
Contributor

@milesgranger milesgranger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good for me as well, sorry for the delayed review here!

@crusaderky crusaderky merged commit de166f6 into dask:main Mar 8, 2024
12 of 34 checks passed
@crusaderky crusaderky deleted the gilknocker branch March 8, 2024 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GIL prometheus metrics are misleading
4 participants