As part of the work on dask-sql, there has been some demand for machine-readable logs of worker metrics (such as GPU utilization / memory usage) coupled with the tasks those workers are currently running or have recently run, along with additional task metadata such as when each task was scheduled, started, and completed. With this data readily available, it would be easier to diagnose why certain tasks are bottlenecks in a given computation by tracking what was happening on the worker while the task was running.
To give an idea of what might be wanted here, some RAPIDS folks have developed and are currently using dask-metrics for this purpose, which generates per-worker CSV files containing this information (with only GPU-relevant metrics). @jakirkham also suggested adding something like an "N slowest running tasks" table to the performance reports, although I think we would want the granular data as well.
For context, all of this information is readily available through the scheduler, though it would need to be merged together manually:
```python
cluster.scheduler.get_task_stream()                # task metadata, including the workers that ran each task
await cluster.scheduler.get_worker_monitor_info()  # timestamped worker metrics
```
Some options I've considered for this:
- Adding task metadata to the `SystemMonitor` or `WorkerState` metrics; not sure if/how this could be done, but it would make it easier to stream this data somewhere
- Adding a scheduler function to merge the existing task/worker metadata and return it in a machine readable format
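As a rough sketch of the second option: assuming task-stream records carry `worker` and `startstops` fields and the monitor info maps each worker to parallel lists keyed by metric name (including a `time` list of timestamps) — these field names are illustrative, not a confirmed scheduler API — the merge could look something like this:

```python
from bisect import bisect_left


def merge_task_metrics(task_stream, monitor_info):
    """Attach worker metric samples to each task's compute window.

    ``task_stream``: list of dicts with ``key``, ``worker`` and
    ``startstops`` entries (each with ``action``, ``start``, ``stop``).
    ``monitor_info``: dict mapping worker address to a dict of
    parallel lists, one of which is ``time`` (sorted timestamps).
    """
    merged = []
    for task in task_stream:
        worker = task["worker"]
        metrics = monitor_info.get(worker, {})
        times = metrics.get("time", [])
        for ss in task["startstops"]:
            if ss["action"] != "compute":
                continue
            # Select the metric samples falling inside the task's
            # compute window via binary search on the timestamps.
            lo = bisect_left(times, ss["start"])
            hi = bisect_left(times, ss["stop"])
            merged.append({
                "key": task["key"],
                "worker": worker,
                "start": ss["start"],
                "stop": ss["stop"],
                "samples": {
                    name: values[lo:hi]
                    for name, values in metrics.items()
                    if name != "time"
                },
            })
    return merged


# Hypothetical data mimicking the two scheduler calls above
task_stream = [{
    "key": "inc-123",
    "worker": "tcp://10.0.0.1:8786",
    "startstops": [{"action": "compute", "start": 2.0, "stop": 5.0}],
}]
monitor_info = {
    "tcp://10.0.0.1:8786": {
        "time": [1.0, 3.0, 4.0, 6.0],
        "gpu_utilization": [10, 80, 95, 15],
    }
}

rows = merge_task_metrics(task_stream, monitor_info)
print(rows[0]["samples"]["gpu_utilization"])  # → [80, 95]
```

The flat dicts this produces would serialize directly to CSV or JSON lines, which seems close to what dask-metrics already emits per worker.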
It would be nice to have some discussion on whether this is doable and worthwhile for troubleshooting performance in Distributed.
cc @randerzander