
Add nanny logs #2744

Merged
merged 3 commits into dask:master from TomAugspurger:nanny-logs on Jun 6, 2019

Conversation

TomAugspurger
Member

I had a need to collate logs from the cluster, and would like to include logs from the Nanny. This adds a method to get logs from the nanny as well.

This isn't quite ready.

  1. The keys in the returned dict are the address of the workers, not the nannies. Do we have a convention here? I assume we want the address of the nannies.
  2. I may be adding the handler to the distributed.nanny logger multiple times, resulting in duplicate logs. I need to verify what's going on.
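
For context, a rough usage sketch of the collation workflow this builds toward; the return shape noted in the comment is an assumption based on the existing worker-logs behaviour, not something settled by this PR:

from distributed import Client

client = Client(n_workers=2, threads_per_worker=1)

# Existing behaviour: recent log records from each worker, keyed by address.
worker_logs = client.get_worker_logs()   # e.g. {address: [(level, message), ...]}
for address, records in worker_logs.items():
    print(address, len(records), "records")

# This PR aims to expose the same thing for the Nanny processes.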

@mrocklin
Member

mrocklin commented Jun 3, 2019

Do we have a convention here?

Other functions like run and broadcast take a nanny= keyword argument. That might be appropriate here as well.
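
For reference, a minimal sketch of that existing convention (the os.getpid example is purely illustrative):

import os
from distributed import Client

client = Client(n_workers=2, threads_per_worker=1)

# nanny=True runs the function inside each Nanny process instead of the workers.
client.run(os.getpid, nanny=True)   # mapping of address -> nanny PID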

@mrocklin
Member

mrocklin commented Jun 4, 2019

So client.get_worker_logs(nanny=True)

In fact, I suspect that on the scheduler you could just pass through the nanny=nanny keyword to Scheduler.broadcast and everything would work out.
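
Roughly, the pass-through idea looks like this (illustrative only, not the actual diff in this PR; the method body and message shape are assumptions):

def get_worker_logs(self, n=None, workers=None, nanny=False):
    # Scheduler.broadcast already knows how to route a message to the Nanny
    # processes when nanny=True, so the flag can simply be forwarded.
    return self.sync(
        self.scheduler.broadcast,
        msg={"op": "get_logs", "n": n},
        workers=workers,
        nanny=nanny,
    )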

@TomAugspurger TomAugspurger marked this pull request as ready for review June 4, 2019 14:01
@TomAugspurger
Member Author

TomAugspurger commented Jun 4, 2019

Thanks @mrocklin, that works out quite nicely.

I updated the docstring for Client.run to note that workers should still be the worker addresses when you specify nanny=True, not the nanny addresses.
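
A small usage sketch of that distinction (the log-level function is made up for illustration):

import logging
from distributed import Client

client = Client(n_workers=2, threads_per_worker=1)

def bump_nanny_log_level():
    logging.getLogger("distributed.nanny").setLevel(logging.DEBUG)

# workers= still takes worker addresses, even though nanny=True means the
# function actually runs in the corresponding Nanny processes.
worker_addr = list(client.scheduler_info()["workers"])[0]
client.run(bump_nanny_log_level, workers=[worker_addr], nanny=True)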

@TomAugspurger
Member Author

As an aside, having timestamps in logs is quite helpful for collation across the cluster (assuming the clocks are synced well enough). What's our backwards compatibility story for configuration things like that?
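
One possibility, sketched under the assumption that the distributed.admin.log-format configuration key is the right knob (that key name is an assumption here, not something confirmed in this thread):

import dask

# Prepend a timestamp to the default log format. Whether already-running
# processes pick this up depends on when logging was initialized; putting it
# in the YAML config before startup is the safer route.
dask.config.set(
    {"distributed.admin.log-format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"}
)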

@TomAugspurger
Member Author

TomAugspurger commented Jun 4, 2019

Actually, this may not be ready. I thought I had sorted out the duplicate handler, but apparently not.

@mrocklin
Member

mrocklin commented Jun 4, 2019

Other than the possible duplicate handler issue the implementation here looks great to me.

@mrocklin
Member

mrocklin commented Jun 6, 2019

As an aside, having timestamps in logs is quite helpful for collation across the cluster (assuming the clocks are synced well enough). What's our backwards compatibility story for configuration things like that?

I don't have any particular thoughts here.

You may want to take a look here though:

@mrocklin
Member

mrocklin commented Jun 6, 2019

Actually, this may not be ready. I thought I had sorted out the duplicate handler, but apparently not.

Can you explain this a bit more? This seems to be the same behavior as before. Is this correct?

@TomAugspurger
Member Author

I'm not too sure what's going on.

In [1]: from distributed import Client
   ...: import logging
   ...:
   ...:
   ...: client = Client(n_workers=2, threads_per_worker=1)
   ...:
   ...: logger = logging.getLogger('distributed.nanny')
   ...: logger.handlers

Out[1]:
[<StreamHandler <stderr> (WARNING)>,
 <DequeHandler (NOTSET)>,
 <DequeHandler (NOTSET)>]

So if I get the distributed.nanny logger from my main process, I see one handler per Nanny.

Oh... but I just realized this is the same behavior as distributed.worker with processes=False.

In [2]: client = Client(n_workers=2, threads_per_worker=1, processes=False)
   ...:
   ...: logger = logging.getLogger('distributed.worker')
   ...: logger.handlers
Out[2]: [<DequeHandler (NOTSET)>, <DequeHandler (NOTSET)>]

so perhaps we're OK?
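
For what it's worth, a tiny standalone illustration (plain Python logging, unrelated to dask internals) of why multiple handlers on one logger can look like duplicated logs:

import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)

# Every handler attached to a logger receives every record, so two handlers
# on the same logger means each message is emitted twice.
logger.addHandler(logging.StreamHandler())
logger.addHandler(logging.StreamHandler())

logger.info("hello")  # printed twice, once per handler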

@mrocklin
Member

mrocklin commented Jun 6, 2019

Looks good to me. Merging.

@mrocklin mrocklin merged commit 587be8d into dask:master Jun 6, 2019
@TomAugspurger TomAugspurger deleted the nanny-logs branch June 7, 2019 02:47
muammar added a commit to muammar/distributed that referenced this pull request Jun 12, 2019
* upstream/master: (58 commits)
  Add unknown pytest markers (dask#2764)
  Delay lookup of allowed failures. (dask#2761)
  Change address -> worker in ColumnDataSource for nbytes plot (dask#2755)
  Remove module state in Prometheus Handlers (dask#2760)
  Add stress test for UCX (dask#2759)
  Add nanny logs (dask#2744)
  Move some of the adaptive logic into the scheduler (dask#2735)
  Add SpecCluster.new_worker_spec method (dask#2751)
  Worker dashboard fixes (dask#2747)
  Add async context managers to scheduler/worker classes (dask#2745)
  Fix the resource key representation before sending graphs (dask#2716) (dask#2733)
  Allow user to configure whether workers are daemon. (dask#2739)
  Pin pytest >=4 with pip in appveyor and python 3.5 (dask#2737)
  Add Experimental UCX Comm (dask#2591)
  Close nannies gracefully (dask#2731)
  add kwargs to progressbars (dask#2638)
  Add back LocalCluster.__repr__. (dask#2732)
  Move bokeh module to dashboard (dask#2724)
  Close clusters at exit (dask#2730)
  Add SchedulerPlugin TaskState example (dask#2622)
  ...
muammar added a commit to muammar/distributed that referenced this pull request Jul 18, 2019