Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spans skeleton #7862

Merged
merged 2 commits into from May 31, 2023
Merged

Spans skeleton #7862

merged 2 commits into from May 31, 2023

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented May 26, 2023

In scope

  • Define spans on the client
  • Define a default span when no spans are explicitly set by the user
  • third party extensions can navigate spans top-down or bottom-up

Out of scope

  • As of this PR, you can trace a task back to its span, but not yet the other way around (without a full scan of Scheduler.tasks)
  • spans don't contain any useful aggregated information of their own
  • once a task is forgotten, you lose all the information

@crusaderky crusaderky requested a review from fjetter as a code owner May 26, 2023 17:58

@contextmanager
def span(*tags: str) -> Iterator[None]:
"""Tag group of tasks to be part of a certain group, called a span.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring is not rendered anywhere at the moment.
We should wait until we have something actually useful before we advertise it in sphinx.

@crusaderky
Copy link
Collaborator Author

CC @fjetter @hendrikmakait @ntabris

@crusaderky crusaderky self-assigned this May 26, 2023
@crusaderky crusaderky removed the request for review from fjetter May 26, 2023 18:13
@github-actions
Copy link
Contributor

github-actions bot commented May 26, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       20 files  ±  0         20 suites  ±0   11h 29m 15s ⏱️ + 11m 52s
  3 648 tests +  5    3 537 ✔️ +  3     108 💤 ±0  3 +2 
35 270 runs  +50  33 504 ✔️ +48  1 763 💤 ±0  3 +2 

For more details on these failures, see this check.

Results for commit 2098b9a. ± Comparison against base commit 3286d2f.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky marked this pull request as draft May 26, 2023 21:31
@crusaderky crusaderky marked this pull request as ready for review May 27, 2023 00:04
@crusaderky crusaderky force-pushed the spans branch 2 times, most recently from 7d31b76 to c9398a7 Compare May 30, 2023 11:18
Comment on lines 28 to 34
def assert_span(d, *expect):
span_id = (c.id, *expect)
assert s.tasks[d.key].annotations["span"] == span_id
assert a.state.tasks[d.key].annotations["span"] == span_id
assert ext.spans[span_id].id == span_id

assert_span(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that every task gets a span defined by the client ID?

If so, my sense is that multi-client workloads are rare, and we're probably already able to attribute the client ID. My inclination would be to drop Client ID from spans, which leaves the common case of no spans unchanged / clean and without annotations, which seems good to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that every task gets a span defined by the client ID?

Yes, there's always a root span if the user does not explicitly call span() on the client workflow, and it's identified by the client ID.

Last week, @fjetter, @hendrikmakait, @ntabris and I reached consensus in this direction during a high-bandwidth design discussion. The conclusion was that adding an annotation to all tasks is cheap and is a small cost to pay to support for multi-tenancy. The overhead becomes even smaller if the user actually uses the span() context manager, as this adds just a single element to a tuple.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conclusion was that adding an annotation to all tasks is cheap and is a small cost to pay to support for multi-tenancy.

Can I ask to learn more about the benefits of multi-tenancy that were discussed? I appreciate that the cost is small. I'm confused about the benefit, it seems very small to me.

Adding an annotation to every task that includes the client ID doesn't make much sense to me, especially not in support of a new feature that hasn't demonstrated that it will get much use. I'd love to learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I'd like to propose that we drop the client id from the span annotations. I suspect that that won't have any bad effects (please correct me if I'm not thinking of something) and will also isolate the effects of the proposed change to only when the new feature is used.

As an example, I can imagine changes in the future which turn off certain optimizations if any annotations are present, just because we don't know how to handle them. In this case (which seems not unlikely to me) it would have low impact if annotations are rare (as they are today) but would have very high impact if this change goes through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against turning the implicit span creation by clients into a config value that's off by default and assuming a global span instead. If we found a way to attach tasks to an implicit client span without annotations, even better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main point is that we want to support overlapping spans to tag independent workloads. This also means that we want to handle workloads from different clients independently.

Setting one default span per client was almost an afterthought to me: When discussing how to pick default spans, we decided to drop any heuristics and simply use static defaults. (This is the important decision.) Using one default span per client felt like a better default on a multi-tenant setup as it would allow users to always differentiate between independent work triggered by different clients (scheduled jobs/users) without any/much perceived cost. I still think having such a feature would be nice, but I personally don't care if that's hidden behind some flag to avoid some cost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-client workloads are exceedingly rare. I recommend that we don't worry about them until they come up (I'll bet lots of beers that they don't come up). If they're the only motivation behind adding annotations to all tasks then, to me, the choice is clear that we don't think much about multi-client workloads and also get to not add annotations in the common case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me, spans can still be set explicitly in multi-client environments to handle workloads independently and whether or not we add annotations in the common case is essentially an implementation detail to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed client ID.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crusaderky . I appreciate it.

@crusaderky
Copy link
Collaborator Author

This is again ready for review and merge

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crusaderky for getting this off the ground! The code looks good to me.

Outside the scope of this PR, I'd like to ideate on the representation of spans and how we communicate them to the scheduler. I think we should move our representation closer to something like the trace/span specification of OpenTelemetry (https://opentelemetry.io/docs/concepts/signals/traces/). In particular, I see value in relying on UUIDs to differentiate between different "runs" of the same span (in particular, now that we've decided against creating a default span per client) and using a richer representation that would allow us to add additional attributes. This is probably best done in a high-bandwidth conversation.

distributed/spans.py Outdated Show resolved Hide resolved
@hendrikmakait hendrikmakait merged commit 7926ea6 into dask:main May 31, 2023
23 of 28 checks passed
@crusaderky crusaderky deleted the spans branch May 31, 2023 12:00
@crusaderky
Copy link
Collaborator Author

In particular, I see value in relying on UUIDs to differentiate between different "runs" of the same span (in particular, now that we've decided against creating a default span per client)

Sounds like update_graph should prepend a unique ID to the span annotations.
...which is almost exactly what I just removed with the client id...

@mrocklin
Copy link
Member

Sounds like update_graph should prepend a unique ID to the span annotations.
...which is almost exactly what I just removed with the client id...

Two possible differences:

  1. We wouldn't add the ID to tasks without a span (thus leaving the common case the same as it was before this PR)
  2. We might want to differentiate between similarly named spans within the same cluster (this is what I took from Hendrik's comment above)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants