Integrate dask-expr and make CI happy #980

milesgranger · 2024-02-28T13:08:36Z

First commit is only skipping tests that fail currently w/o dask-expr, can be moved to it's own PR if desired. It skips tests that fail from mostly deprecations/future warnings, and some other things due to outdated code from updated dependencies.

other commits are adding dask-expr logic and marking tests that currently fail using skipif dask-expr is enabled.

dask_ml/ensemble/_blockwise.py

dask_ml/linear_model/utils.py

milesgranger · 2024-03-04T11:46:44Z

tests/preprocessing/test_data.py

+    mark = pytest.mark.xfail(
+        DASK_EXPR_ENABLED,
+        reason="dask-expr: NotImplementedError in assert_eq_df(res_df.iloc[:, 1:], frame, check_dtype=False)",
+    )
+
+    @pytest.mark.parametrize("daskify", [pytest.param(True, marks=mark), False])
    def test_df_transform_index(self, daskify):


@phofl

traceback

___________________________ TestPolynomialFeatures.test_df_transform_index[True] ___________________________ self = <tests.preprocessing.test_data.TestPolynomialFeatures object at 0x7f0c5bf00bb0>, daskify = True @pytest.mark.parametrize("daskify", [pytest.param(True, marks=mark), False]) def test_df_transform_index(self, daskify): frame = copy(df) if not daskify: frame = frame.compute() frame = frame.sample(frac=1.0) res_df = dpp.PolynomialFeatures( preserve_dataframe=True, degree=1 ).fit_transform(frame) > assert_eq_df(res_df.iloc[:, 1:], frame, check_dtype=False) tests/preprocessing/test_data.py:649: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <dask_expr._indexing.ILocIndexer object at 0x7f0c691fb400> key = (slice(None, None, None), slice(1, None, None)) def __getitem__(self, key): msg = ( "'DataFrame.iloc' only supports selecting columns. " "It must be used like 'df.iloc[:, column_indexer]'." ) if not isinstance(key, tuple): raise NotImplementedError(msg) if len(key) > 2: raise ValueError("Too many indexers") iindexer, cindexer = key if iindexer != slice(None): raise NotImplementedError(msg) if len(self.obj.columns) == len(set(self.obj.columns)): col_names = self.obj.columns[cindexer] if not is_scalar(col_names): col_names = list(col_names) return new_collection(Projection(self.obj, col_names)) else: > raise NotImplementedError E NotImplementedError ../../mambaforge/envs/dask-ml/lib/python3.10/site-packages/dask_expr/_indexing.py:51: NotImplementedError

does the DataFRame have duplicated columns? Then yes, we don't care for now

Ah, okay the output res_df has duplicate columns, but seems expected as that's what the iloc is for, but you're saying b/c of that we don't care right now because it's not implemented.

(Pdb) p frame.columns Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19'], dtype='object') (Pdb) p res_df.columns Index(['1', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19'], dtype='object')

milesgranger · 2024-03-04T11:50:01Z

tests/preprocessing/test_label.py

+        if DASK_EXPR_ENABLED and daskify == "unknown" and categories != ["b", "a"]:
+            with pytest.raises(AssertionError, match="Arrays are not equal"):
+                np.testing.assert_array_equal(a.classes_, categories)
+            return


@phofl

traceback

def test_categorical(self, categories, transformed, daskify, ordered): cat = pd.Series( ["a", "b", "a"], dtype=pd.api.types.CategoricalDtype(categories=categories, ordered=ordered), ) if daskify: cat = dd.from_pandas(cat, npartitions=2) transformed = da.from_array(transformed, chunks=(2, 1)) if daskify == "unknown": cat = cat.cat.as_unknown() a = dpp.LabelEncoder().fit(cat) if daskify != "unknown": assert a.dtype_ == cat.dtype > np.testing.assert_array_equal(a.classes_, categories) tests/preprocessing/test_label.py:125: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (<built-in function eq>, array(['b', 'a'], dtype=object), ['a', 'b']) kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True} @wraps(func) def inner(*args, **kwds): with self._recreate_cm(): > return func(*args, **kwds) E AssertionError: E Arrays are not equal E E Mismatched elements: 2 / 2 (100%) E x: array(['b', 'a'], dtype=object) E y: array(['a', 'b'], dtype='<U1') ../../mambaforge/envs/dask-ml/lib/python3.10/contextlib.py:79: AssertionError

This could be legit, could you look into why the order is not correct?

milesgranger · 2024-03-04T11:50:36Z

tests/test_parallel_post_fit.py

+    if not DASK_EXPR_ENABLED:
+        # dask-expr cannot set _meta
+        dd_X._meta = pd.DataFrame({"c_0": [5]})
+
+        # Failure when not proving predict_meta
+        # because of value dependent model
+        wrap = ParallelPostFit(base)
+        with pytest.raises(ValueError):
+            wrap.predict(dd_X)


@phofl I kinda assume we can safely ignore this test for dask-expr? Fails b/c we cannot set _meta.

Hm this is not great since it seems like the behaviour actually depends on meta?

Is there a 'good' way to modify meta? Otherwise I can try mocking it here.

lets chat about this in the standup

okay lets move ahead then

tests/test_pca.py

tests/test_utils.py

tests/test_pca.py

dask_ml/linear_model/utils.py

tests/ensemble/test_blockwise.py

tests/feature_extraction/test_text.py

tests/preprocessing/test_data.py

…warning

dask_ml/linear_model/utils.py

phofl · 2024-03-05T15:31:20Z

I can't merge, don't know who can

@fjetter probably?

TomAugspurger · 2024-03-08T12:03:53Z

I can as well :) Thanks for handling this!

milesgranger mentioned this pull request Feb 28, 2024

Add fix for dask bag to_dataframe for dask-expr dask/dask#10965

Closed

3 tasks

milesgranger force-pushed the milesgranger/dask-expr-integration-fixes branch 2 times, most recently from bb68436 to 2481557 Compare February 28, 2024 13:29

phofl reviewed Feb 28, 2024

View reviewed changes

dask_ml/ensemble/_blockwise.py Outdated Show resolved Hide resolved

phofl reviewed Feb 28, 2024

View reviewed changes

dask_ml/linear_model/utils.py Outdated Show resolved Hide resolved

milesgranger force-pushed the milesgranger/dask-expr-integration-fixes branch 4 times, most recently from 569cedc to d7d91b3 Compare February 29, 2024 13:05

Skip failing tests and fix lint and docs build

bd5760f

milesgranger force-pushed the milesgranger/dask-expr-integration-fixes branch 2 times, most recently from 6240910 to 400446b Compare March 4, 2024 09:50

milesgranger added 2 commits March 4, 2024 11:57

Attempt get dask-expr happy

401293b

Use git tip for dask-expr and dask for now

963567e

milesgranger force-pushed the milesgranger/dask-expr-integration-fixes branch from d488738 to 963567e Compare March 4, 2024 10:57

milesgranger changed the title ~~[WIP] Integrate dask-expr~~ Integrate dask-expr and make CI happy Mar 4, 2024

milesgranger commented Mar 4, 2024

View reviewed changes

milesgranger requested a review from phofl March 4, 2024 11:52