Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved support for pyarrow strings #10000

Merged
merged 60 commits into from
Mar 8, 2023

Conversation

@j-bennet j-bennet marked this pull request as draft February 24, 2023 23:55
@j-bennet j-bennet changed the title Tmporarily turn on pyarrow strings to look for CI failures. Temporarily turn on pyarrow strings to look for CI failures. Feb 24, 2023
@github-actions github-actions bot added the array label Mar 1, 2023
@j-bennet j-bennet force-pushed the j-bennet/pyarrow-ci-failures branch from 14201f7 to 37b6d81 Compare March 1, 2023 17:07
dask/array/core.py Outdated Show resolved Hide resolved
dask/array/utils.py Outdated Show resolved Hide resolved
dask/utils_test.py Outdated Show resolved Hide resolved
@github-actions github-actions bot removed the array label Mar 1, 2023
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dask/partd#64 should fix some of the shuffle-related serialization errors. You might try temporarily pointing CI environments at that PR to see what tests start passing

@j-bennet j-bennet force-pushed the j-bennet/pyarrow-ci-failures branch from e96e2bf to 7e65678 Compare March 2, 2023 21:42
@j-bennet
Copy link
Contributor Author

j-bennet commented Mar 2, 2023

dask/partd#64 should fix some of the shuffle-related serialization errors. You might try temporarily pointing CI environments at that PR to see what tests start passing

@jrbourbeau Thanks, I added this to install script. I couldn't add it to the conda env file because of version conflict - when installed from git, partd version is 1.0, and we have 1.2.0 in requirements.

conftest.py Outdated Show resolved Hide resolved
dask/bag/tests/test_bag.py Outdated Show resolved Hide resolved
dask/bag/tests/test_bag.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_csv.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_demo.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_arithmetics_reduction.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
dask/dataframe/utils.py Outdated Show resolved Hide resolved
@j-bennet j-bennet force-pushed the j-bennet/pyarrow-ci-failures branch 4 times, most recently from 889a4fb to 43a6d96 Compare March 5, 2023 00:09
@j-bennet j-bennet changed the title Temporarily turn on pyarrow strings to look for CI failures. Initial support for pyarrow strings Mar 6, 2023
@j-bennet j-bennet force-pushed the j-bennet/pyarrow-ci-failures branch from 2b5eed4 to 6ca705b Compare March 6, 2023 17:43
@j-bennet
Copy link
Contributor Author

j-bennet commented Mar 6, 2023

This PR is using a couple of different ways to handle pyarrow string failures.

  1. For some tests, using pyarrow strings is undesirable or unnecessary:
    • sometimes we explicitly need object dtype, for example: to store lists and explode
    • tests that check for a specific number of graph layers
    • tests that don't have upstream support for pyarrow strings, and it's not clear when support will be added (storing to hdf, spark compatibility)

These tests use a fixture called disable_pyarrow_strings, defined in conftest.py: @pytest.mark.usefixtures("disable_pyarrow_strings"). Mostly there's also a comment saying why.

  1. Tests that have corresponding issues opened in upstream. Marked with pytest.mark.xfail(pyarrow_strings_enabled(), reason="[link to issue]".

  2. Tests that fail with pyarrow strings, but there's no corresponding upstream issue:

    • fastparquet tests are expected to fail with pyarrow strings enabled in config, because we raise a warning
    • some reductions are not implemented for pyarrow strings in upstream, such as all, sum and mean.

Those are marked with @xfail_with_pyarrow_strings, a shared mark defined in dask.tests.__init__.py.

dask/bag/tests/test_bag.py Outdated Show resolved Hide resolved
continuous_integration/scripts/install.sh Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_dataframe.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_dataframe.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_dataframe.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_rolling.py Outdated Show resolved Hide resolved
@j-bennet
Copy link
Contributor Author

j-bennet commented Mar 7, 2023

I simplified how the failures are handled. We have two different types now:

  1. Marked with xfail_with_pyarrow_strings. These break with pyarrow strings. Some may be dask problems, some may be fixed upstream, we may have to get back to them.
  2. Marked with skip_with_pyarrow_strings. These don't make sense with pyarrow strings and there's no point to run them.

@j-bennet j-bennet marked this pull request as ready for review March 7, 2023 18:09
@j-bennet j-bennet force-pushed the j-bennet/pyarrow-ci-failures branch from 1c1b692 to b295ef9 Compare March 7, 2023 18:51
@jrbourbeau jrbourbeau changed the title Initial support for pyarrow strings Improved support for pyarrow strings Mar 8, 2023
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work here @j-bennet!

Comment on lines -100 to -102
# pyarrow dtypes currently fail, so we allow continuing on error for that specific build
# TODO: Remove the `continue-on-error` line below once tests are all passing
continue-on-error: ${{ matrix.extra == 'pyarrow' }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@jrbourbeau jrbourbeau merged commit 15ba4b3 into dask:main Mar 8, 2023
@mrocklin
Copy link
Member

mrocklin commented Mar 8, 2023

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants