Remove stringification #8083

fjetter · 2023-08-08T12:45:26Z

Problem

The distributed scheduler is stringifying keys as a first step when receiving a graph. This is nice because we can rely on keys to be of a single type in the scheduler instead of dealing with a generic Hashable.

However, it also introduces a weird artifact: we end up parsing these str keys again. For example, in our P2P extension we have to analyze these strings to infer whether we're dealing with a specific kind of task (the barrier task).

Stringification is also a known performance problem at graph submission time. See #7998, which lists stringify and the subsequent key_split calls at 12% CPU time each (24% in total) of an update_graph for a large graph (>1-2MM tasks).

From a complexity perspective I would love to get rid of stringification (and possibly restrict key types to str and tuple, not everything hashable).

Changes

This PR removes the stringification of keys. Keys will be stored and handled in their raw form on the scheduler. This is a breaking change for plugins with transition hooks.

This PR also restricts keys to the following types: bytes, float, int, str, or tuples of these. This is enforced by validate_key

distributed/distributed/utils.py

Lines 942 to 948 in 75b8055

    
    def validate_key(k):
        """Validate a key as received on a stream."""
        if isinstance(k, tuple):
            for e in k:
                validate_key(e)
        elif not isinstance(k, (bytes, int, float, str)):
            raise TypeError(f"Unexpected key type {type(k)} (value: {k!r})")

and is a breaking change.
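To make the accepted key types concrete, here is a small sketch that exercises the validate_key function quoted above. The is_valid_key wrapper and the sample keys are illustrative additions, not part of the PR.

```python
def validate_key(k):
    """Validate a key as received on a stream (copied from the PR)."""
    if isinstance(k, tuple):
        for e in k:
            validate_key(e)
    elif not isinstance(k, (bytes, int, float, str)):
        raise TypeError(f"Unexpected key type {type(k)} (value: {k!r})")


def is_valid_key(k):
    """Hypothetical convenience wrapper: True if validate_key accepts k."""
    try:
        validate_key(k)
    except TypeError:
        return False
    return True


# Accepted: the scalar types and (possibly nested) tuples of them.
accepted = ["x", b"x", 1, 1.5, ("x", 1), ("x", (1, 2.0))]
# Rejected: any other hashable, even when it hides inside a tuple.
rejected = [None, frozenset({1}), ("x", None), ("x", frozenset({1}))]

for key in accepted:
    assert is_valid_key(key)
for key in rejected:
    assert not is_valid_key(key)
```

Note that the check is recursive, so a single unsupported element anywhere in a tuple key makes the whole key invalid.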

Migrating

If you run into a TypeError: Unexpected key type..., please adjust your keys to match the types allowed by validate_key.

If you require keys to be strings, consider whether you can achieve your goals using the raw form of the keys, or transform the keys to their previous string format using dask.utils.stringify as needed.
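As a rough illustration of the second option, the sketch below shows a plugin-like class whose transition hook converts raw keys back to strings only where it needs them. KeyLoggingPlugin and legacy_str are hypothetical names; legacy_str is a simplified stand-in for dask.utils.stringify that only covers plain keys, and real code would call the dask helper instead.

```python
def legacy_str(key):
    """Simplified, hypothetical stand-in for the old stringification.

    str and bytes keys pass through unchanged; anything else becomes str(key).
    """
    if isinstance(key, (str, bytes)):
        return key
    return str(key)


class KeyLoggingPlugin:
    """Hypothetical plugin-like class with a transition hook."""

    def __init__(self):
        self.seen = []

    def transition(self, key, start, finish, *args, **kwargs):
        # After this PR the hook receives the raw key, e.g. ("x", 1),
        # so any string handling has to be done explicitly here.
        self.seen.append((legacy_str(key), start, finish))


plugin = KeyLoggingPlugin()
plugin.transition(("x", 1), "waiting", "processing")
plugin.transition("y-123", "processing", "memory")
print(plugin.seen)
```

Where possible, working with the raw key directly (for example matching on key[0] of a tuple key) avoids the string round trip entirely.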

fjetter · 2023-08-08T13:44:39Z

distributed/spill.py

+class CustomFile(zict.File):
+    def _safe_key(self, key):
+        return super()._safe_key(stringify(key))


This is the only place, so far, that truly requires a str, but I believe this should be handled at this abstraction level rather than forcing the entire system to use str.
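The idea of converting keys only at the storage boundary can be sketched with a plain mapping wrapper. StrKeyStore is a hypothetical class backed by a dict, not the zict.File subclass from the diff, and the collision check is an illustrative addition (two distinct raw keys such as 1 and "1" share one string form).

```python
from collections.abc import MutableMapping


class StrKeyStore(MutableMapping):
    """Hypothetical wrapper: callers use raw keys, the backend sees str keys."""

    def __init__(self, backend=None):
        self._backend = {} if backend is None else backend
        self._raw = {}  # str form -> raw key, so iteration yields raw keys

    @staticmethod
    def _safe_key(key):
        # Only this layer knows that the backend requires str.
        return key if isinstance(key, str) else str(key)

    def _lookup(self, key):
        skey = self._safe_key(key)
        if skey not in self._raw or self._raw[skey] != key:
            raise KeyError(key)
        return skey

    def __setitem__(self, key, value):
        skey = self._safe_key(key)
        if skey in self._raw and self._raw[skey] != key:
            raise KeyError(f"str collision: {self._raw[skey]!r} vs {key!r}")
        self._raw[skey] = key
        self._backend[skey] = value

    def __getitem__(self, key):
        return self._backend[self._lookup(key)]

    def __delitem__(self, key):
        skey = self._lookup(key)
        del self._raw[skey]
        del self._backend[skey]

    def __iter__(self):
        return iter(self._raw.values())

    def __len__(self):
        return len(self._raw)


store = StrKeyStore()
store[("x", 1)] = 10
store["y"] = 20
print(sorted(store._backend))  # the backend only ever sees str keys
print(list(store))             # callers get their raw keys back
```

The rest of the system never sees the string form, which is the design choice the comment above argues for.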

fjetter · 2023-08-08T14:02:43Z

This would be a breaking change for plugins, so it should deliver a noticeable benefit to be worth doing.

@madsbk I believe you've been interested in this kind of thing? I also wonder what would break on the Rapids side of things if we went through with this or if there is something else I'm forgetting

github-actions · 2023-08-08T15:02:30Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0 | 21 suites ±0 | 11h 37m 58s ⏱️ +23m 53s
3 779 tests ±0 | 3 668 ✔️ +3 | 107 💤 -1 | 4 ❌ -2
36 531 runs +12 | 34 724 ✔️ +27 | 1 803 💤 -12 | 4 ❌ -3

For more details on these failures, see this check.

Results for commit d4798e8. ± Comparison against base commit 03ea2e1.

♻️ This comment has been updated with latest results.

madsbk · 2023-08-10T07:52:34Z

I like this. What started as a simplification of keys is now a significant source of complexity!

It might break downstream projects that manually stringify or de-stringify graph keys but it shouldn't be much work to fix.

fjetter · 2023-08-14T11:43:47Z

I did some profiling on memory usage on the scheduler and the way we're doing stringification is a major driver for scheduler memory usage (this is relevant for large graphs)

fjetter · 2023-08-14T11:51:18Z

In its current version, spill-heavy workloads are not looking great in benchmarks. This is not surprising, since the spilling now needs to handle stringification as well. This manifests in wall time but also in memory usage for workloads that rely on spilling to keep memory down.

We're not tracking scheduler memory, so improvements here are not visible (see also #7998 (comment)).

https://github.com/coiled/benchmarks/suites/15067327751/artifacts/860670472

fjetter · 2023-08-14T12:18:35Z

distributed/shuffle/_shuffle.py

@@ -116,7 +116,7 @@ def rearrange_by_column_p2p(
            f"p2p requires all column names to be str, found: {unsupported}",
        )

-    name = f"shuffle-p2p-{token}"
+    name = f"shuffle_p2p-{token}"


This is an interesting change I noticed. The previous shuffle-p2p would be improperly split such that the prefix is shuffle instead of shuffle-p2p (due to the -)
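The effect can be reproduced with a simplified re-implementation of the prefix rule. key_split_sketch is a hypothetical helper, not dask's key_split; it assumes the rule is roughly "keep leading dash-separated words while they are purely alphabetic and do not look like an 8-character hex token", and it ignores the other cases the real function handles.

```python
import re

# Lowercase a-f runs; used to spot 8-character hex-looking tokens.
_HEX = re.compile(r"[a-f]+")


def key_split_sketch(key):
    """Return the name prefix of a str key under the simplified rule."""
    words = key.split("-")
    result = words[0]
    for word in words[1:]:
        looks_like_token = len(word) == 8 and _HEX.fullmatch(word) is not None
        if word.isalpha() and not looks_like_token:
            result += "-" + word
        else:
            # First non-alphabetic word ends the prefix; "p2p" stops here.
            break
    return result


print(key_split_sketch("shuffle-p2p-0a1b2c"))  # shuffle
print(key_split_sketch("shuffle_p2p-0a1b2c"))  # shuffle_p2p
```

Because "p2p" contains a digit, the dash-separated form loses it from the prefix, while the underscore form keeps the whole name as the first word.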

fjetter · 2023-08-15T11:13:14Z

I'm rerunning benchmarks. I suspect that the spilling regressions should be less severe now.

In the example I profiled in #7998 (comment), this PR reduced the memory footprint on the scheduler by about 25% and reduced the duration of blocked event loops to about half (still too long, but this is incremental progress).

mrocklin · 2023-08-15T13:15:00Z

That feels like really good progress :)


fjetter · 2023-08-15T16:52:53Z

pyproject.toml

@@ -39,7 +39,7 @@ dependencies = [
    "toolz >= 0.10.0",
    "tornado >= 6.0.4",
    "urllib3 >= 1.24.3",
-    "zict >= 2.2.0",
+    "zict >= 3.0.0",


Lower zict versions raise for whatever reason if I pass a str key. I don't see a reason not to upgrade, though.

Maybe this issue is related to the perf regression I'm seeing, but I'm not super motivated to debug spilling perf (it's just 10-20%, not an order of magnitude).

I'm very surprised about the regression in spilling. Particularly in the uncompressible test I would have expected disk I/O to be 2 to 3 orders of magnitude slower than the stringification.

Shall we open an issue to investigate that and move forward with this PR?

fjetter · 2023-08-15T16:53:24Z

New benchmarks look almost identical. There is a marginal speedup for task-heavy workloads like shuffle[tasks], but a mild perf hit and memory hit for test_spilling (the memory hit is because we're slower at spilling for whatever reason).

https://github.com/coiled/benchmarks/suites/15104615297/artifacts/863239323

fjetter · 2023-08-15T16:59:22Z

Just to say this one more time:

This will be a breaking change for plugins but I believe it will generally be a welcome change. It will also only affect plugins that actually inspect tasks/keys which is probably only a small fraction of users.

I do believe this is worth it.

cc @jacobtomlinson @jrbourbeau

fjetter · 2023-08-15T17:01:15Z

Introduce again a validate_key. We should not allow anything that's hashable. That'd be a bit crazy

hendrikmakait · 2023-08-16T10:58:22Z

Introduce again a validate_key. We should not allow anything that's hashable. That'd be a bit crazy

Has this been addressed? I can't see any commits after this comment.


mrocklin · 2023-08-16T13:11:29Z

This will be a breaking change for plugins but I believe it will generally be a welcome change

+1 to breaking things and moving fast

fjetter · 2023-08-16T15:14:44Z

Got all tests working again (I think).

fjetter · 2023-08-17T11:41:06Z

Since there is a release scheduled for tomorrow, I suggest merging this after the release to give folks an opportunity to catch potential breakage.

Generally speaking, I'm not very concerned since I suspect that users who run into compatibility issues are few and can easily fix this on their side (worst case, they stringify themselves).

jacobtomlinson · 2023-08-21T09:30:34Z

This will be a breaking change for plugins

@fjetter It would be great to write some small description of the change and how to migrate. This could just be an edit to the top comment here that gives affected users some actionable information when they navigate here from the changelog. We could tweet it also.

hendrikmakait · 2023-08-24T07:11:26Z

distributed/utils.py

-    typ = type(k)
-    if typ is not str and typ is not bytes:
-        raise TypeError(f"Unexpected key type {typ} (value: {k!r})")
+    if not isinstance(k, (bytes, int, float, str, tuple)):


I think this should recursively validate the elements of k if k is a tuple.

hendrikmakait · 2023-08-24T09:58:16Z

@jacobtomlinson: I've updated the PR description. Is there anything we should add?

jacobtomlinson · 2023-08-24T10:11:04Z

That's perfect, thanks @hendrikmakait

crusaderky

I expect a lot of type annotations, particularly in scheduler.py and worker_state_machine.py, are now wrong as they explicitly call for str.
I expect a lot of Hashable annotations can now be tightened.

Updating all type annotations is a substantial piece of work so it's best left to a separate PR.

As mentioned above, I'd rather disallow recursive tuples. This can also be left to a future PR.

In the whole documentation, I found a single casual mention of stringification, which surprises me a bit.

Documentation and annotations in dask/dask need to be updated; e.g.
https://docs.dask.org/en/stable/spec.html#definitions
This can also be done as a follow-up.

crusaderky · 2023-08-24T11:55:04Z

distributed/tests/test_client.py

 @gen_cluster(client=True)
 async def test_tuple_keys(c, s, a, b):
    x = dask.delayed(inc)(1, dask_key_name=("x", 1))
    y = dask.delayed(inc)(x, dask_key_name=("y", 1))
    future = c.compute(y)
    assert (await future) == 3
+    z = dask.delayed(inc)(y, dask_key_name=("z", MyHashable(1, 2)))
+    with pytest.raises(TypeError, match="key"):
+        await c.compute(z)


I would rather disallow nested tuples too. For starters, it would make type annotations more robust (AFAIK type annotations don't support recursion).

crusaderky · 2023-08-24T12:00:09Z

distributed/utils.py

-        raise TypeError(f"Unexpected key type {typ} (value: {k!r})")
+    if isinstance(k, tuple):
+        for e in k:
+            validate_key(e)


I'd rather disallow nested tuples

Also this function feels like it should be in dask/dask.

I'm aware that dask/dask can run on top of schedulers other than dask/distributed, but I'd much rather not have a tighter restriction on dask/distributed than in dask/dask.
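For comparison, a stricter variant that rejects nested tuples could look like the sketch below. validate_key_flat, SCALAR_KEY_TYPES, and the Key alias are hypothetical names for illustration; this is not what the PR merged, which validates tuples recursively.

```python
from typing import Tuple, Union

# With flat tuples the key type can be spelled out without a recursive alias.
ScalarKey = Union[bytes, int, float, str]
Key = Union[ScalarKey, Tuple[ScalarKey, ...]]

SCALAR_KEY_TYPES = (bytes, int, float, str)


def validate_key_flat(k):
    """Hypothetical stricter check: tuples may contain scalars only."""
    if isinstance(k, tuple):
        for e in k:
            if not isinstance(e, SCALAR_KEY_TYPES):
                raise TypeError(
                    f"Unexpected key element type {type(e)} (value: {e!r})"
                )
    elif not isinstance(k, SCALAR_KEY_TYPES):
        raise TypeError(f"Unexpected key type {type(k)} (value: {k!r})")


validate_key_flat(("x", 1))  # flat tuple: accepted

try:
    validate_key_flat(("x", (1, 2)))  # nested tuple: rejected here
except TypeError as exc:
    print(exc)
```

The trade-off is that keys such as ("x", (1, 2)), which the merged recursive check accepts, would have to be flattened by the caller.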

crusaderky · 2023-08-29T19:17:59Z

This PR made dask/dask CI hang on interpreter teardown: dask/dask#10468

fjetter mentioned this pull request Aug 10, 2023

Remove dumps_task #8067

Merged

fjetter force-pushed the remove_stringification2 branch from 0c14690 to 7e163d4 Compare August 12, 2023 13:24

fjetter changed the title ~~[RFC / WIP] Remove stringification~~ [RFC] Remove stringification Aug 14, 2023

fjetter force-pushed the remove_stringification2 branch from 2333463 to 95bccf3 Compare August 15, 2023 11:09

fjetter marked this pull request as ready for review August 15, 2023 11:10

fjetter requested a review from jacobtomlinson as a code owner August 15, 2023 11:10

fjetter changed the title ~~[RFC] Remove stringification~~ Remove stringification Aug 15, 2023

hendrikmakait self-requested a review August 15, 2023 16:57

fjetter added 2 commits August 16, 2023 15:59

Remove stringification

9853f53

Fix test_recreate_task_collection

b2f2919

fjetter force-pushed the remove_stringification2 branch from 6b7e9fa to b2f2919 Compare August 16, 2023 13:59

fix test_decide_worker_coschedule_order_neighbors

327162d

fjetter added 2 commits August 16, 2023 18:45

allow hashable

5c4f10e

not

8e0f7b9

be specific about types

31465cb

hendrikmakait added 2 commits August 24, 2023 08:50

Merge branch 'main' into remove_stringification2

085d803

Allow bytes

5f99bf2

Improved key validation

75b8055

Update docs

c0821c5

crusaderky approved these changes Aug 24, 2023

crusaderky added 2 commits August 24, 2023 13:17

Update distributed/recreate_tasks.py

12b2093

Update distributed/recreate_tasks.py

d4798e8

hendrikmakait merged commit 22eb33a into dask:main Aug 24, 2023
24 of 29 checks passed

This was referenced Aug 24, 2023

Stricter data type for dask keys dask/dask#10461

Closed

Update type annotations for dask keys #8130

Closed

pentschev mentioned this pull request Aug 25, 2023

Tests failing after Distributed upstream changes rapidsai/dask-cuda#1224

Closed

hendrikmakait mentioned this pull request Aug 28, 2023

Huge increase in average memory in test_spilling[release] coiled/benchmarks#964

Open

crusaderky mentioned this pull request Aug 29, 2023

'remove stringification' causes dask/dask CI to hang dask/dask#10468

Closed

crusaderky added a commit to crusaderky/dask that referenced this pull request Aug 29, 2023

Pin to dask/distributed#8083

bcb854c

crusaderky mentioned this pull request Aug 30, 2023

Remove support for np.int64 in keys dask/dask#10483

Merged

rjzamora mentioned this pull request Sep 25, 2023

JIT graph building dask/dask#10518

Open

crusaderky mentioned this pull request Oct 2, 2023

Test against str collisions in the SpillBuffer #8226

Merged

hendrikmakait mentioned this pull request Feb 28, 2024

cluster_dump roundtrips break non-stringified keys #8540

Open
