Logging standardization - Contextual logging - Structured logging #4762

Open · fjetter opened this issue Apr 28, 2021 · 5 comments
fjetter commented Apr 28, 2021

Logging is often a crucial instrument for debugging, and we currently use several different mechanisms for it.

  1. Python stdlib logging for human-readable messages without contextual information
  2. Implementation-specific log calls which store logs in a deque together with some context information (see the sketch below for the rough shape of these entries), e.g.
  • Scheduler.log_event/Scheduler.events logs external stimuli from workers and clients as an event in a dictionary keyed by source
  • Scheduler.transition_log is exclusively used to log transitions in a semi-structured format (key, start, finish, recommendations, timestamp)
  • Worker.log is unstructured: part events, part transitions, sometimes with timestamps
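
For illustration, the deque-based entries look roughly like the following. This is a simplified sketch of the pattern, not the actual code in distributed, and the keys and values are made up:

```python
from collections import deque
import time

# Simplified sketch of the deque-based logging pattern described above;
# the real attributes live on Scheduler/Worker and are length-limited.
transition_log = deque(maxlen=100_000)
worker_log = deque(maxlen=100_000)

# Semi-structured transition entry: (key, start, finish, recommendations, timestamp)
transition_log.append(("inc-123", "processing", "memory", {}, time.time()))

# Unstructured worker entries: the shape varies by event type
worker_log.append(("inc-123", "compute-task", time.time()))
worker_log.append(("steal-request", "inc-456"))
```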

The problems I see with this approach are manifold:

  1. The internal deque logging has frequently been the cause of memory-related trouble, since the deques accumulate entries over time and users are often not aware of this. We need to artificially limit the number of retained log entries with options like transition-log-length, events-log-length, events-cleanup-delay, etc.
  2. Our internal logging is not in a standardized format. Mostly we log tuples whose order and length differ depending on what kind of event was logged (e.g. work stealing looks different from a transition, and external stimuli log entirely different information).
  3. Neither the stdlib logging nor the implementation-specific logic currently records enough context information (that's very subjective, of course). For instance, we know the module which created the log event but not which worker or which thread issued it, let alone in what context. Context could be as simple as the worker name, IP, or thread ID, but also application-specific things like the computation ID (Add Computation model to Scheduler #4613) of a transition (see also Capability to pull logs for one submitted dask job #4037 and Distributed request tracing #4718).
  4. The split into internal and stdlib logging means that getting all logs usually requires consolidating multiple sources. For instance, we need to collect stdout/stderr (or however your stdlib logger is configured) and scrape all workers and the scheduler, all in different formats.
  5. Machine readability is often not great. For a simple query like "give me all events belonging to a key" we have specialized functions like story (reproduced below), and we would need to write a specialized function for every possible query; a sketch after this list shows how structured records would simplify this.
    def story(self, *keys):
        keys = [key.key if isinstance(key, TaskState) else key for key in keys]
        return [
            msg
            for msg in self.log
            if any(key in msg for key in keys)
            or any(
                key in c
                for key in keys
                for c in msg
                if isinstance(c, (tuple, list, set))
            )
        ]
  6. Our internal logging is ephemeral by design, and this is neither optional nor configurable.
  7. The internal logging cannot be filtered by level.
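
To make point 5 concrete: if every record were a flat dictionary with named fields, the specialized story function above would collapse into a generic filter, and queries by worker, level, or time range would work the same way. A minimal sketch with hypothetical field names:

```python
# Hypothetical structured records; the field names and values are illustrative only.
log = [
    {"event": "transition", "key": "inc-123", "start": "released", "finish": "waiting"},
    {"event": "steal-request", "key": "inc-456", "victim": "tcp://10.0.0.7:40000"},
    {"event": "transition", "key": "inc-123", "start": "waiting", "finish": "processing"},
]

def story(log, **filters):
    """Return all records whose fields match the given filters."""
    return [
        record
        for record in log
        if all(record.get(field) == value for field, value in filters.items())
    ]

print(story(log, key="inc-123"))          # all events for one key
print(story(log, event="steal-request"))  # all work-stealing events
```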

Most, if not all, of the issues described above could be addressed by custom solutions. Instead of building this all ourselves, however, we could also turn to libraries which do a great job of encapsulating it in easy-to-use APIs. One library I am decently familiar with, and which is quite popular, is structlog; I was wondering whether this is something we are interested in.
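
To make this concrete, here is a minimal sketch of what context binding and machine-readable output with structlog could look like. The worker address, keys, and field names are made up, and this is not a proposal for a specific API:

```python
import threading

import structlog

# Render every record as one JSON document with a timestamp and level,
# so scheduler and worker logs share a single machine-readable format.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

# Bind context once; every subsequent message carries these fields.
log = structlog.get_logger().bind(
    worker="tcp://10.0.0.5:39481",  # made-up address
    thread=threading.get_ident(),
)

log.info("transition", key="inc-123", start="processing", finish="memory")
log.info("steal-request", key="inc-456", victim="tcp://10.0.0.7:40000")
```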

fjetter commented Apr 30, 2021

I was made aware of https://eliot.readthedocs.io/en/stable/introduction.html and https://github.com/Delgan/loguru. Happy to hear about more libraries and/or to get feedback on the ones already mentioned.

quasiben commented May 4, 2021

Configuring the logging is also challenging. And while our docs demonstrate how to log to disk, it's a little buried (https://docs.dask.org/en/latest/debugging.html#logs) and it's not obvious what to change.

xref: #3669

mrocklin commented May 4, 2021

cc @itamarst

fjetter changed the title from "Contextual and/or structured logging" to "Logging standardization - Contextual logging - Structured logging" on Aug 17, 2021
charlesbluca commented:
> And while our docs demonstrate how to log to disk, it's a little buried:
> https://docs.dask.org/en/latest/debugging.html#logs and not obvious what to change

While this is prone to change based on this discussion, would it be worthwhile to give more information on the log config options in the configuration reference? I'm not sure how heavily trafficked that page is, but I recall going there looking for log config options (such as the ability to output to a file) and assuming they didn't exist because I didn't see any listed there.

itamarst commented Sep 1, 2021

structlog gives you structured logging, which is a lot better than plain strings of text. The problem is that's all it is: messages at particular points in time (loguru is the same). A request ID included in all messages will help you trace causality somewhat, until you hit recursion and everything becomes a mess.

Eliot is fundamentally different: it gives you causality, and a notion of actions that start and end. The output is a tree of actions (or really a forest of actions).

  • What you want from logs is causality. "A caused B, which caused C and D" is much more powerful than "A, B, C, and D happened in that order", especially in a concurrent system, and extra-especially in a distributed system where timestamps may not match up as well as one might hope.
  • Additionally, having a notion of the "start" and "end" of an action lets you get performance information that would be much harder to obtain with point-in-time structured logging; you end up having to emulate it manually, so you may as well have it from the start. This is also, crucially, performance information that sampling profilers often miss (e.g. "f(12) is slow but f(0) is fast"). A minimal sketch of action-based logging follows these bullets.
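
To illustrate the action model, here is a minimal sketch with Eliot. The action types and fields are made up for illustration and this is not the existing Dask integration:

```python
import sys

from eliot import start_action, to_file

# Write the structured log to stdout; every message carries a task_uuid and
# a task_level, which is what lets tooling such as eliot-tree reconstruct
# the tree of actions.
to_file(sys.stdout)

# The key and action types here are made up.
with start_action(action_type="scheduler:transition", key="inc-123"):
    # Nested actions are recorded as children of the enclosing action,
    # so the output encodes "A caused B" rather than just "A then B".
    with start_action(action_type="worker:execute", key="inc-123"):
        pass  # run the task here

# Each block logs a start message on entry and an end message (success or
# failure) on exit, so durations can be recovered from the timestamps.
```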

See https://pythonspeed.com/articles/logging-for-scientific-computing/; I gave a Dask variant of this talk at the summit earlier this year, though I'm not sure if the video is available.

Eliot is one way to do this, and it has built-in support for Dask Distributed: https://eliot.readthedocs.io/en/stable/scientific-computing.html

Another alternative is OpenTelemetry, which is attractive because there is a bunch of existing tooling for it: many SaaS platforms and tracing systems support it.
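
For comparison, a minimal sketch using the OpenTelemetry Python API and SDK with a console exporter; the span names and attributes are made up:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; in practice this would be an exporter
# pointed at whatever tracing backend is in use.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("distributed.example")

# Nested spans record parent/child relationships plus start and end times,
# which is the same causality-plus-duration information described above.
with tracer.start_as_current_span("scheduler.transition") as span:
    span.set_attribute("dask.key", "inc-123")  # made-up attribute name
    with tracer.start_as_current_span("worker.execute"):
        pass  # run the task here
```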

Bigger-picture perspective: if Dask Distributed has a good tracing/logging setup, and users are encouraged to use the same framework, users get to see logs that connect not just their own logic but also how the distributed system is scheduling everything, which is probably useful for performance optimization.
