Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fine performance metrics: decorate dask/dask #10084

Open
crusaderky opened this issue Mar 17, 2023 · 1 comment
Open

Fine performance metrics: decorate dask/dask #10084

crusaderky opened this issue Mar 17, 2023 · 1 comment
Labels
io needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. needs triage Needs a response from a contributor

Comments

@crusaderky
Copy link
Collaborator

crusaderky commented Mar 17, 2023

In the demo of dask/distributed#7586, you can spot "I/O" time in the collected metrics. This was possible by decorating I/O heavy dask/dask functions:

--- a/dask/dataframe/io/parquet/core.py
+++ b/dask/dataframe/io/parquet/core.py
@@ -27,6 +27,8 @@ from dask.highlevelgraph import HighLevelGraph
 from dask.layers import DataFrameIOLayer
 from dask.utils import apply, import_required, natural_sort_key, parse_bytes
 
+from distributed.metrics import context_meter
+
 __all__ = ("read_parquet", "to_parquet")
 
 NONE_LABEL = "__null_dask_index__"
@@ -158,6 +160,7 @@ class ToParquetFunctionWrapper:
             self.kwargs_pass,
         )
 
+    @context_meter.meter("thread-I/O")
     def __call__(self, df, block_index: tuple[int]):
         # Get partition index from block index tuple
         part_i = block_index[0]
@@ -643,6 +646,7 @@ def check_multi_support(engine):
     return hasattr(engine, "multi_support") and engine.multi_support()
 
 
+@context_meter.meter("thread-I/O")
 def read_parquet_part(
     fs, engine, meta, part, columns, index, use_nullable_dtypes, kwargs
 ):

without the above, I/O heavy tasks would be broadly classified as "thread-noncpu", which also includes GIL contention, higher load than there are physical CPUs, and context switch overhead.
We should decorate all functions in dask/dask that perform I/O as in the above example, with a trivial dummy contextmanager in case distributed is not installed.

@crusaderky crusaderky transferred this issue from dask/distributed Mar 17, 2023
@github-actions github-actions bot added the needs triage Needs a response from a contributor label Mar 17, 2023
@crusaderky crusaderky added the io label Mar 17, 2023
@fjetter
Copy link
Member

fjetter commented Apr 3, 2023

Generally I'm +1 for this but I suggest to hold off with this until we have a bit more experience with the metering stack and get a better feeling where this is useful.

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. needs triage Needs a response from a contributor
Projects
None yet
Development

No branches or pull requests

2 participants