Improve type reconciliation for P2P #8332
Conversation
distributed/shuffle/_arrow.py (Outdated)
```diff
-# First version that supports concatenating extension arrays (apache/arrow#14463)
-minversion = "12.0.0"
+# First version to implement type promotion for pa.concat_tables (apache/arrow#36846)
+minversion = "14.0.0"
```
This version came out on Nov 1, 2023. Users are unlikely to have it installed in their existing environments, so this PR will cause some user pain by forcing them either to upgrade or to manually fall back to task-based shuffling.
Can we do this on a best-effort basis instead of effectively pinning it? There are many users out there who are not affected by this problem. It would be nice if this weren't a breaking change for them.
Yeah, I am worried about raising the default minimum to 14.0.0 as well.
Just pushed a more permissive approach (it's actually more permissive than what we had before). I just didn't want to go through the hassle unless others think that being more backward-compatible is worth it here.
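For readers following along, a minimal sketch of how such best-effort version gating could look. This is illustrative only, not the PR's actual code; `concat_tables_compat` and the constants are hypothetical names, and only `pa.concat_tables` and its `promote_options` keyword (pyarrow >= 14) are real API:

```python
from packaging.version import Version

import pyarrow as pa

MINVERSION = "12.0.0"  # first version supporting extension-array concat
PROMOTION_MINVERSION = "14.0.0"  # first version with concat_tables type promotion


def concat_tables_compat(tables: list[pa.Table]) -> pa.Table:
    if Version(pa.__version__) >= Version(PROMOTION_MINVERSION):
        # pyarrow >= 14 can promote mismatched-but-compatible column types
        return pa.concat_tables(tables, promote_options="permissive")
    # Older pyarrow: concatenation requires identical schemas,
    # so reconciliation has to happen before we get here.
    return pa.concat_tables(tables)
```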
```python
for column, dtype in meta.dtypes.items():
    actual = df[column].dtype
    if actual == dtype:
        continue
```
Interestingly, the dtypes of the `_partitions` column don't match up (`int64` vs. `uint64`). @phofl, can pandas handle this conversion as a zero-copy operation?
This is a NumPy question, but I am pretty sure that this is not possible without a copy.
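As an aside, a minimal check of this in NumPy/pandas (illustrative, not part of the PR): a zero-copy *reinterpretation* is possible via `ndarray.view`, but a value-preserving cast allocates a new buffer.

```python
import numpy as np
import pandas as pd

a = np.arange(5, dtype="int64")

# Reinterpreting the buffer is zero-copy, but it changes the meaning of
# negative values instead of converting them:
view = a.view("uint64")
print(np.shares_memory(a, view))  # True

# A value-preserving cast copies:
cast = a.astype("uint64")
print(np.shares_memory(a, cast))  # False

# pandas' astype goes through the same machinery, so it copies too:
s = pd.Series(a)
print(np.shares_memory(s.to_numpy(), s.astype("uint64").to_numpy()))  # False
```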
```python
def make_partition(i):
    """Return mismatched column types for every other partition"""
    if i % 2 == 1:
        return pd.DataFrame({"a": np.random.random(10), "b": [True] * 10})
```
There's a regression hidden here: Previously, we could reconcile `bool` and `float` partitions. Arrow considers these irreconcilable. (Personally, I agree.)
Has this problem been reported as such or was this just a dummy unit test?
IIRC, this was a dummy example to test reconciliation. I could easily see a scenario where someone applies a "boolean" UDF that fails to cast (some) numerical values to `bool`, though.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0  27 suites ±0  14h 11m 51s ⏱️ +27m 22s

For more details on these failures and errors, see this check.

Results for commit 69376b1. ± Comparison against base commit 0dc9e88.

This pull request removes 11 and adds 13 tests. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
```diff
@@ -306,8 +306,6 @@ def split_by_worker(

-from dask.dataframe.dispatch import to_pyarrow_table_dispatch

-df = df.astype(meta.dtypes, copy=False)
```
🎉
```python
df.b = df.b.astype("category")
shuffled = df.shuffle("a", shuffle="p2p")
result, expected = await c.compute([shuffled, df], sync=True)
dd.assert_eq(result, expected, check_categorical=False)
```
There's one caveat here: the expected `categories_dtype` of column `b` is `string`, but after the shuffle we end up with `object`. There's not much we can do about this at the moment; `meta` also has `object`.
Related: dask/dask#6242
Thanks for implementing the backwards compat.
Assuming dask/dask#10622 is accepted, we'll enter a deprecation cycle and can hopefully drop the older pyarrow versions soon.
```python
{
    # Extension types
    f"col{next(counter)}": pd.array(
        [pd.Period("2022-01-01", freq="D") + i for i in range(100)],
```
Non-blocking: I think you can use `period_range` and `interval_range` instead, which would be better here.
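A quick sketch of the suggested replacement (illustrative; only the `period_range` half is shown):

```python
import pandas as pd

# List-comprehension construction, as in the test above:
arr = pd.array([pd.Period("2022-01-01", freq="D") + i for i in range(100)])

# Equivalent construction with period_range, as suggested:
arr2 = pd.period_range(start="2022-01-01", periods=100, freq="D").array

assert (arr == arr2).all()
```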
This PR does not address #8310 and dask/dask#10014. Due to its dependency on Arrow, I doubt that the current P2P shuffle implementation will be able to gracefully handle these cases without a significant increase in complexity. To improve the user experience in those cases, we may want to add a better exception. I'd like to leave that to a future PR.
`pre-commit run --all-files`