Conversation
…ontains uniform, hashable datatypes. Otherwise defaults to recursively serializing with pickle
Can one of the admins verify this patch? Admins can comment
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 &nbsp; 15 suites ±0 &nbsp; 6h 20m 2s ⏱️ +4m 47s

For more details on these failures, see this check.

Results for commit 886108d. ± Comparison against base commit 1fd07f0.
cc: @madsbk -- Wondering if you would mind taking a look at this PR. Also interested in your thoughts on dropping the requirement to use pickle with protocol=4. xref: rapidsai/dask-cuda#746
cc: @ian-r-rose |
```python
if is_dask_collection(first_val) or typename(type(first_val)) not in [
    "str",
    "int",
    "float",
]:
```
```diff
-if is_dask_collection(first_val) or typename(type(first_val)) not in [
-    "str",
-    "int",
-    "float",
-]:
+if is_dask_collection(first_val) or isinstance(first_val, (str, int, float)):
```
Trying to use `isinstance(first_val, (str, int, float))` results in test failures in shuffle.

Additionally, the current approach seems more consistent with the convention of using `type(x)`, which is the preferred approach in `serialize.py`. I could see replacing it with `type(x) in [int, float]` if that makes more sense...

Thoughts?
This makes me a bit nervous.

AFAIK, the use of `type(x)` is only here to avoid sub-classes of `list`, `set`, `tuple`, and `dict` being converted to their base class by msgpack. By disabling `iterate_collection` and using `Dispatch` on the whole collection, we make sure to preserve the sub-classes.

Do you have a reproducer for the shuffle failing?

PS: I think this is another good reason why we need to re-design and simplify serialization in Dask.
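For readers following along, here is a minimal illustration (not code from this PR) of the subclass-preservation concern described above: rebuilding a collection from its iterated elements loses the subclass, which is why an exact-type check can matter where `isinstance` cannot distinguish the two.

```python
# A hypothetical user subclass of list, as might be handed to Dask.
class MyList(list):
    pass

x = MyList([1, 2, 3])

# isinstance matches the subclass, so an isinstance-based check cannot
# tell MyList apart from a plain list:
assert isinstance(x, list)
# ...but an exact type check can:
assert type(x) is not list

# Recursing into the collection and rebuilding with the base constructor
# (as msgpack-style iteration does) silently drops the subclass:
rebuilt = list(x)
assert type(rebuilt) is list
assert not isinstance(rebuilt, MyList)
```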
```diff
 and iterate_collection is True
 or type(x) is dict
-and iterate_collection is True
+and iterate_collection
```
Any reason for this semantically?
Nothing other than readability.
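A small aside on the `is True` question above: for genuine boolean flags the two spellings are equivalent, but they differ for truthy non-`bool` values. A hypothetical illustration (not code from this PR):

```python
# With a real bool, `flag` and `flag is True` agree:
flag = True
assert flag
assert flag is True

# With a truthy non-bool they diverge: `is True` only matches the bool
# singleton, which is why plain truthiness is the conventional spelling
# unless you specifically want to exclude non-bool values.
count = 1
assert count                     # truthy
assert (count is True) is False  # but not the singleton True
```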
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
…lect fact that type(x.data) is being checked, not type(x)
…ributed into dispatch_some_lists
@madsbk -- Can we drop the requirement to use
jrbourbeau left a comment
cc @rjzamora @jakirkham if either of you get a chance to look at this. FYI for others: NVIDIA folks are off Sep 15-16 for a company-wide holiday, so it may be next week before these pings are seen.
Yes
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Closes #6368 (Future submit()/result() takes very long time to complete with 8MB Python object)

Related to #6940 (Don't force recursing into python collections when serializing objects)
- [x] Tests added / passed
- [x] Passes `pre-commit run --all-files`

Using the reproducer from @adbreind in #6368 as a starting point for:

On current `main`, I get:

```
Run time (in seconds) for 5 runs is: [13.804176807403564, 13.784174680709839, 13.835507869720459, 13.706598997116089, 13.749552011489868], and mean runtime: 13.776002073287964 seconds
```

While on this branch, I get:

```
Run time (in seconds) for 5 runs is: [0.15462398529052734, 0.1298370361328125, 0.1314990520477295, 0.13030385971069336, 0.12935090065002441], and mean runtime: 0.13512296676635743 seconds
```

What is happening?
When serializing collections, we prefer to use pickle and recurse into the collections, serializing each object in the collection separately. This decision was motivated by the Blockwise-IO work described by @rjzamora. While it makes sense, it also has the unfortunate consequence of making it expensive to serialize collections in general.
Here we create a `Dispatch()` method for lists that converts a list to a NumPy array, which can then be serialized. We add `infer_if_recurse_to_serialize_list`. Now that lists can be serialized either recursively using `pickle` or with `dask_serialize`, we offload the decision about whether to `iterate_collection` to `infer_if_recurse_to_serialize_list`.

We also need to handle the case where a `Serialize` object must itself be serialized. To handle this, we add an `iterate_collection` attribute.
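As a rough sketch of the idea, a one-shot serializer for uniform numeric lists can pack the whole list into a single typed buffer instead of recursing element by element. The names below are illustrative only, not the PR's actual implementation; the PR converts to a NumPy array, while this sketch uses the stdlib `array` module to stay dependency-free:

```python
import array

def serialize_number_list(x):
    # Pack the uniform numeric list into one contiguous buffer of doubles;
    # the header carries what is needed to rebuild it on the other side.
    arr = array.array("d", x)
    header = {"typecode": arr.typecode, "length": len(arr)}
    frames = [arr.tobytes()]
    return header, frames

def deserialize_number_list(header, frames):
    arr = array.array(header["typecode"])
    arr.frombytes(frames[0])
    return list(arr)

header, frames = serialize_number_list([1.0, 2.0, 3.0])
roundtripped = deserialize_number_list(header, frames)
```

The header/frames shape mirrors the convention used by Dask's serialization layer, where metadata travels separately from the raw bytes.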