Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend prometheus metrics endpoint (#2792) #2833

Merged
merged 6 commits into from
Jul 16, 2019

Conversation

gsailer
Copy link
Contributor

@gsailer gsailer commented Jul 12, 2019

This PR introduces the following prometheus metrics:

  • scheduler
    • number of tasks that were received
    • number of unrunnable tasks
  • worker
    • tasks: stored, ready, waiting and serving
    • number of connections to other workers
    • number of worker threads
    • latency
    • median tick duration
    • median task duration
    • median transfer bandwidth

The requirement of crick for the digest-based metrics is handled by checking if crick is available.
If crick is not available there is a log message on loglevel info regarding the missing crick and the metrics, which require crick are not exposed.

Regarding what @mrocklin mentioned in the issue(#2792) about the digest metrics not being exposed individually. The problem here is that the metrics names would not be compliant with prometheus naming. Additionally expressive descriptions would not be available either if the metrics were not exposed individually.

@mrocklin
Copy link
Member

@jacobtomlinson would you mind reviewing this?

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems reasonable to me. Thanks for putting in the effort!

I've made a couple of comments but generally happy.

distributed/dashboard/worker_html.py Outdated Show resolved Hide resolved
# 'Number of connections currently open.',
# value=???,
# )
from prometheus_client.core import GaugeMetricFamily
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to importing here seems reasonable, but does this mean there is now a stray unused import somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the import was done at the init of PrometheusHandler when the PrometheusCollector is registered in the registry.
All imports which are there are currently still required in the PrometheusHandler, which was the only point the changed _PrometheusCollector class is used.
So there is no dangling import for prometheus_client anywhere in my opinion.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good. Thanks again!

@mrocklin
Copy link
Member

Merging this in. Thanks @sublinus ! Thanks also to @jacobtomlinson for the review.

Also, I notice that this is your first code contribution to this repository. Welcome!

@mrocklin mrocklin merged commit af64e07 into dask:master Jul 16, 2019
muammar added a commit to muammar/distributed that referenced this pull request Jul 18, 2019
* upstream/master: (33 commits)
  SpecCluster: move init logic into start (dask#2850)
  Dont reuse closed worker in get_worker (dask#2841)
  Add alternative SSHCluster implementation (dask#2827)
  Extend prometheus metrics endpoint (dask#2792) (dask#2833)
  Include type name in SpecCluster repr (dask#2834)
  Don't make False add-keys report to scheduler (dask#2421)
  Add Nanny to worker docs (dask#2826)
  Respect security configuration in LocalCluster (dask#2822)
  bump version to 2.1.0
  Fix typo that prevented error message (dask#2825)
  Remove dask-mpi (dask#2824)
  Updates to use update_graph in task journey docs (dask#2821)
  Fix Client repr with memory_info=None (dask#2816)
  Fix case where key, rather than TaskState, could end up in ts.waiting_on (dask#2819)
  Use Keyword-only arguments (dask#2814)
  Relax check for worker references in cluster context manager (dask#2813)
  Add HTTPS support for the dashboard (dask#2812)
  CLN: Use dask.utils.format_bytes (dask#2810)
  bump version to 2.0.1
  Add python_requires entry to setup.py (dask#2807)
  ...
@gsailer gsailer deleted the prometheus-metrics branch July 24, 2019 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants