Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove stringification #8083

Merged
merged 12 commits into from Aug 24, 2023
Merged

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Aug 8, 2023

Problem

The distributed scheduler is stringifying keys as a first step when receiving a graph. This is nice because we can rely on keys to be of a single type in the scheduler instead of dealing with a generic Hashable.

However, it is also introducing weird artifact of us parsing these str keys again, for example in our P2P extension we have to analyze these strings to infer whether we're dealing with a specific kind (the barrier task)

The stringification is also known to be a performance problem at graph submission time. See #7998 which lists stringify (and subsequent key_splits) to be listed as 12% CPU time each (i.e. in total 24%) of an update_graph of a large graph >1-2MM tasks

From a complexity perspective I would love to get rid of stringification (and possible restrict key types to str and tuple, not everything hashable)

Changes

This PR removes the stringification of keys. Strings will be stored and handled in their raw form on the scheduler. This is a breaking change for plugins with transition hooks.

This PR also restricts keys to the following types: bytes, float, int, str, or tuples of these. This is enforced by validate_key

def validate_key(k):
"""Validate a key as received on a stream."""
if isinstance(k, tuple):
for e in k:
validate_key(e)
elif not isinstance(k, (bytes, int, float, str)):
raise TypeError(f"Unexpected key type {type(k)} (value: {k!r})")
and is a breaking change.

Migrating

If you run into a TypeError: TypeError: Unexpected key type..., please adjust your keys to match the types allowed by validate_key.

If you require strings to be keys, consider whether you achieve your goals by using the raw form of the keys, or to transform the keys to their previous string format using dask.utils.stringify as needed.

Comment on lines 278 to 280
class CustomFile(zict.File):
def _safe_key(self, key):
return super()._safe_key(stringify(key))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place, so far, that truly requires a str but I believe this should be handled on this abstraction level and not force the entire system to use str

@fjetter
Copy link
Member Author

fjetter commented Aug 8, 2023

This would be a breaking change for plugins so this should make a difference somehow to be worth doing.

@madsbk I believe you've been interested in this kind of thing? I also wonder what would break on the Rapids side of things if we went through with this or if there is something else I'm forgetting

@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       21 files  ±  0         21 suites  ±0   11h 37m 58s ⏱️ + 23m 53s
  3 779 tests ±  0    3 668 ✔️ +  3     107 💤  -   1  4  - 2 
36 531 runs  +12  34 724 ✔️ +27  1 803 💤  - 12  4  - 3 

For more details on these failures, see this check.

Results for commit d4798e8. ± Comparison against base commit 03ea2e1.

♻️ This comment has been updated with latest results.

@madsbk
Copy link
Contributor

madsbk commented Aug 10, 2023

I like this, what started as an simplification of keys is now a significant source of complexity!

It might break downstream projects that manually stringify or de-stringify graph keys but it shouldn't be much work to fix.

@fjetter
Copy link
Member Author

fjetter commented Aug 14, 2023

I did some profiling on memory usage on the scheduler and the way we're doing stringification is a major driver for scheduler memory usage (this is relevant for large graphs)

@fjetter
Copy link
Member Author

fjetter commented Aug 14, 2023

In it's current version, spill heavy workloads are not looking great on benchmarking. This is quite obvious since the spilling now needs to handle stringification as well. This manifests in wall time but also in memory usage for workloads that rely on spilling to keep memory down.

We're not tracking scheduler memory so improvements here are not visible (see also #7998 (comment))

https://github.com/coiled/benchmarks/suites/15067327751/artifacts/860670472

image

@@ -116,7 +116,7 @@ def rearrange_by_column_p2p(
f"p2p requires all column names to be str, found: {unsupported}",
)

name = f"shuffle-p2p-{token}"
name = f"shuffle_p2p-{token}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting change I noticed. The previous shuffle-p2p would be improperly split such that the prefix is shuffle instead of shuffle-p2p (due to the -)

@fjetter fjetter changed the title [RFC / WIP] Remove stringification [RFC] Remove stringification Aug 14, 2023
@fjetter fjetter marked this pull request as ready for review August 15, 2023 11:10
@fjetter fjetter changed the title [RFC] Remove stringification Remove stringification Aug 15, 2023
@fjetter
Copy link
Member Author

fjetter commented Aug 15, 2023

I'm rerunning benchmarks. I suspect that the spilling regressions should be less severe now.

In the example I profiled in #7998 (comment) this PR reduced the memory footprint on the scheduler by about 25% and reduced blocked event loops durations to about half (still too long but this is incremental progress).

@mrocklin
Copy link
Member

mrocklin commented Aug 15, 2023 via email

@@ -39,7 +39,7 @@ dependencies = [
"toolz >= 0.10.0",
"tornado >= 6.0.4",
"urllib3 >= 1.24.3",
"zict >= 2.2.0",
"zict >= 3.0.0",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lower zict versions raise for whatever reason if I pass a str key. I don't see a reason to not upgrade, though.

Maybe this issue is related to the perf regression I'm seeing but I'm not super motivated debugging spilling perf (it's just 10-20% not an order of magnitute)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very surprised about the regression in spilling. Particularly in the uncompressible test I would have expected disk I/O to be 2 to 3 orders of magnitude slower than the stringification.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we open an issue to investigate that and move forward with this PR?

@fjetter
Copy link
Member Author

fjetter commented Aug 15, 2023

New benchmarks look almost identical. A marginal speedup for task heavy stuff like shuffle[tasks] but a mild perf hit and memory hit for test_spilling (memory hit because we're slower at spilling for whatever reason).

https://github.com/coiled/benchmarks/suites/15104615297/artifacts/863239323

@hendrikmakait hendrikmakait self-requested a review August 15, 2023 16:57
@fjetter
Copy link
Member Author

fjetter commented Aug 15, 2023

Just to say this one more time:

This will be a breaking change for plugins but I believe it will generally be a welcome change. It will also only affect plugins that actually inspect tasks/keys which is probably only a small fraction of users.

I do believe this is worth it.

cc @jacobtomlinson @jrbourbeau

@fjetter
Copy link
Member Author

fjetter commented Aug 15, 2023

  • Introduce again a validate_key. We should not allow anything that's hashable. That'd be a bit crazy

@hendrikmakait
Copy link
Member

  • Introduce again a validate_key. We should not allow anything that's hashable. That'd be a bit crazy

Has this been addressed? I can't see any commits after this comment.

distributed/tests/test_client.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/utils_comm.py Show resolved Hide resolved
@mrocklin
Copy link
Member

This will be a breaking change for plugins but I believe it will generally be a welcome change

+1 to breaking things and moving fast

@fjetter
Copy link
Member Author

fjetter commented Aug 16, 2023

got all tests working again ( think)

@fjetter
Copy link
Member Author

fjetter commented Aug 17, 2023

Since there is a release scheduled for tomorrow, I suggest to merge this after the release to give folks an opportunity to catch potential breakage.

Generally speaking, I'm not very concerned since I suspect that users who run into compatibility issues are few and can easily fix this on their side (worst case, they stringify themselves).

@jacobtomlinson
Copy link
Member

This will be a breaking change for plugins

@fjetter It would be great to write some small description of the change and how to migrate. This could just be an edit to the top comment here that gives affected users some actionable information when they navigate here from the changelog. We could tweet it also.

typ = type(k)
if typ is not str and typ is not bytes:
raise TypeError(f"Unexpected key type {typ} (value: {k!r})")
if not isinstance(k, (bytes, int, float, str, tuple)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should recursively validate the elements of k if k is a tuple.

@hendrikmakait
Copy link
Member

@jacobtomlinson: I've updated the PR description. Is there anything we should add?

@jacobtomlinson
Copy link
Member

That's perfect thanks @hendrikmakait

Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I expect a lot of type annotations, particularly in scheduler.py and worker_state_machine.py, are now wrong as they explicitly call for str.
  • I expect a lot of Hashable annotations can now be tightened

Updating all type annotations is a substantial piece of work so it's best left to a separate PR.

As mentioned above, I'd rather disallow recursive tuples. This can also be left to a future PR.

In the whole documentation, I found a single casual mention of stringification - I'm a bit surprised of it.

Documentation and annotations in dask/dask need to be updated; e.g.
https://docs.dask.org/en/stable/spec.html#definitions
This can also be done as a follow-up.

distributed/recreate_tasks.py Outdated Show resolved Hide resolved
distributed/recreate_tasks.py Outdated Show resolved Hide resolved
@gen_cluster(client=True)
async def test_tuple_keys(c, s, a, b):
x = dask.delayed(inc)(1, dask_key_name=("x", 1))
y = dask.delayed(inc)(x, dask_key_name=("y", 1))
future = c.compute(y)
assert (await future) == 3
z = dask.delayed(inc)(y, dask_key_name=("z", MyHashable(1, 2)))
with pytest.raises(TypeError, match="key"):
await c.compute(z)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather disallow nested tuples too. For starters, it would make type annotations more robust (AFAIK type annotations don't support recursion).

raise TypeError(f"Unexpected key type {typ} (value: {k!r})")
if isinstance(k, tuple):
for e in k:
validate_key(e)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather disallow nested tuples

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this function feels like it should be in dask/dask.

I'm aware that dask/dask can run on top of schedulers other than dask/distributed, but I'd much rather not have a tighter restriction on dask/distributed than in dask/dask.

@crusaderky
Copy link
Collaborator

This PR made dask/dask CI hang on interpreter teardown: dask/dask#10468

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants