New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove stringification #8083
Remove stringification #8083
Conversation
distributed/spill.py
Outdated
class CustomFile(zict.File): | ||
def _safe_key(self, key): | ||
return super()._safe_key(stringify(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only place, so far, that truly requires a str but I believe this should be handled on this abstraction level and not force the entire system to use str
This would be a breaking change for plugins so this should make a difference somehow to be worth doing. @madsbk I believe you've been interested in this kind of thing? I also wonder what would break on the Rapids side of things if we went through with this or if there is something else I'm forgetting |
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 21 files ± 0 21 suites ±0 11h 37m 58s ⏱️ + 23m 53s For more details on these failures, see this check. Results for commit d4798e8. ± Comparison against base commit 03ea2e1. ♻️ This comment has been updated with latest results. |
I like this, what started as an simplification of keys is now a significant source of complexity! It might break downstream projects that manually stringify or de-stringify graph keys but it shouldn't be much work to fix. |
0c14690
to
7e163d4
Compare
I did some profiling on memory usage on the scheduler and the way we're doing stringification is a major driver for scheduler memory usage (this is relevant for large graphs) |
In it's current version, spill heavy workloads are not looking great on benchmarking. This is quite obvious since the spilling now needs to handle stringification as well. This manifests in wall time but also in memory usage for workloads that rely on spilling to keep memory down. We're not tracking scheduler memory so improvements here are not visible (see also #7998 (comment)) https://github.com/coiled/benchmarks/suites/15067327751/artifacts/860670472 |
@@ -116,7 +116,7 @@ def rearrange_by_column_p2p( | |||
f"p2p requires all column names to be str, found: {unsupported}", | |||
) | |||
|
|||
name = f"shuffle-p2p-{token}" | |||
name = f"shuffle_p2p-{token}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting change I noticed. The previous shuffle-p2p
would be improperly split such that the prefix is shuffle
instead of shuffle-p2p
(due to the -
)
2333463
to
95bccf3
Compare
I'm rerunning benchmarks. I suspect that the spilling regressions should be less severe now. In the example I profiled in #7998 (comment) this PR reduced the memory footprint on the scheduler by about 25% and reduced blocked event loops durations to about half (still too long but this is incremental progress). |
That feels like really good progress :)
…On Tue, Aug 15, 2023 at 6:13 AM Florian Jetter ***@***.***> wrote:
I'm rerunning benchmarks. I suspect that the spilling regressions should
be less severe now.
In the example I profiled in #7998 (comment)
<#7998 (comment)>
this PR reduced the memory footprint on the scheduler by about 25% and
reduced blocked event loops durations to about half (still too long but
this is incremental progress).
—
Reply to this email directly, view it on GitHub
<#8083 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTCIPBXS6MR3HXK6MCDXVNKVNANCNFSM6AAAAAA3INOHTQ>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
@@ -39,7 +39,7 @@ dependencies = [ | |||
"toolz >= 0.10.0", | |||
"tornado >= 6.0.4", | |||
"urllib3 >= 1.24.3", | |||
"zict >= 2.2.0", | |||
"zict >= 3.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lower zict versions raise for whatever reason if I pass a str key. I don't see a reason to not upgrade, though.
Maybe this issue is related to the perf regression I'm seeing but I'm not super motivated debugging spilling perf (it's just 10-20% not an order of magnitute)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very surprised about the regression in spilling. Particularly in the uncompressible test I would have expected disk I/O to be 2 to 3 orders of magnitude slower than the stringification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we open an issue to investigate that and move forward with this PR?
New benchmarks look almost identical. A marginal speedup for task heavy stuff like shuffle[tasks] but a mild perf hit and memory hit for test_spilling (memory hit because we're slower at spilling for whatever reason). https://github.com/coiled/benchmarks/suites/15104615297/artifacts/863239323 |
Just to say this one more time: This will be a breaking change for plugins but I believe it will generally be a welcome change. It will also only affect plugins that actually inspect tasks/keys which is probably only a small fraction of users. I do believe this is worth it. |
|
Has this been addressed? I can't see any commits after this comment. |
+1 to breaking things and moving fast |
6b7e9fa
to
b2f2919
Compare
got all tests working again ( think) |
Since there is a release scheduled for tomorrow, I suggest to merge this after the release to give folks an opportunity to catch potential breakage. Generally speaking, I'm not very concerned since I suspect that users who run into compatibility issues are few and can easily fix this on their side (worst case, they stringify themselves). |
@fjetter It would be great to write some small description of the change and how to migrate. This could just be an edit to the top comment here that gives affected users some actionable information when they navigate here from the changelog. We could tweet it also. |
distributed/utils.py
Outdated
typ = type(k) | ||
if typ is not str and typ is not bytes: | ||
raise TypeError(f"Unexpected key type {typ} (value: {k!r})") | ||
if not isinstance(k, (bytes, int, float, str, tuple)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should recursively validate the elements of k
if k
is a tuple
.
@jacobtomlinson: I've updated the PR description. Is there anything we should add? |
That's perfect thanks @hendrikmakait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I expect a lot of type annotations, particularly in scheduler.py and worker_state_machine.py, are now wrong as they explicitly call for str.
- I expect a lot of Hashable annotations can now be tightened
Updating all type annotations is a substantial piece of work so it's best left to a separate PR.
As mentioned above, I'd rather disallow recursive tuples. This can also be left to a future PR.
In the whole documentation, I found a single casual mention of stringification - I'm a bit surprised of it.
Documentation and annotations in dask/dask need to be updated; e.g.
https://docs.dask.org/en/stable/spec.html#definitions
This can also be done as a follow-up.
@gen_cluster(client=True) | ||
async def test_tuple_keys(c, s, a, b): | ||
x = dask.delayed(inc)(1, dask_key_name=("x", 1)) | ||
y = dask.delayed(inc)(x, dask_key_name=("y", 1)) | ||
future = c.compute(y) | ||
assert (await future) == 3 | ||
z = dask.delayed(inc)(y, dask_key_name=("z", MyHashable(1, 2))) | ||
with pytest.raises(TypeError, match="key"): | ||
await c.compute(z) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather disallow nested tuples too. For starters, it would make type annotations more robust (AFAIK type annotations don't support recursion).
raise TypeError(f"Unexpected key type {typ} (value: {k!r})") | ||
if isinstance(k, tuple): | ||
for e in k: | ||
validate_key(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather disallow nested tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this function feels like it should be in dask/dask.
I'm aware that dask/dask can run on top of schedulers other than dask/distributed, but I'd much rather not have a tighter restriction on dask/distributed than in dask/dask.
This PR made dask/dask CI hang on interpreter teardown: dask/dask#10468 |
Problem
The distributed scheduler is stringifying keys as a first step when receiving a graph. This is nice because we can rely on keys to be of a single type in the scheduler instead of dealing with a generic
Hashable
.However, it is also introducing weird artifact of us parsing these str keys again, for example in our P2P extension we have to analyze these strings to infer whether we're dealing with a specific kind (the barrier task)
The stringification is also known to be a performance problem at graph submission time. See #7998 which lists stringify (and subsequent key_splits) to be listed as 12% CPU time each (i.e. in total 24%) of an
update_graph
of a large graph >1-2MM tasksFrom a complexity perspective I would love to get rid of stringification (and possible restrict key types to str and tuple, not everything hashable)
Changes
This PR removes the stringification of keys. Strings will be stored and handled in their raw form on the scheduler. This is a breaking change for plugins with transition hooks.
This PR also restricts keys to the following types:
bytes
,float
,int
,str
, ortuple
s of these. This is enforced byvalidate_key
distributed/distributed/utils.py
Lines 942 to 948 in 75b8055
Migrating
If you run into a
TypeError: TypeError: Unexpected key type...
, please adjust your keys to match the types allowed byvalidate_key
.If you require strings to be keys, consider whether you achieve your goals by using the raw form of the keys, or to transform the keys to their previous string format using
dask.utils.stringify
as needed.