
get_default_shuffle_method raises if pyarrow is outdated #10496

Merged

Conversation

@hendrikmakait (Member) commented Sep 7, 2023

@hendrikmakait marked this pull request as ready for review on September 8, 2023, 12:52
@jrbourbeau changed the title from "get_default_shuffle_method raises if pyarrow is outdated" to "[DNM] get_default_shuffle_method raises if pyarrow is outdated" on Sep 8, 2023
@jrbourbeau (Member):
Marking as DNM for now since the distributed PR needs to go in first

@hendrikmakait we're now getting a FAILED dask/tests/test_distributed.py::test_bag_groupby_default - RuntimeError: P2P shuffling requires pyarrow>=7.0.0 failure here. Maybe this will go away once the distributed PR is merged (I haven't looked into it at all). Just wanted to double-check whether additional changes are still needed.

@hendrikmakait (Member, Author) commented Sep 8, 2023

@jrbourbeau: I checked test_distributed.py locally, and the following tests will start failing because the installed pyarrow version does not meet the minimum requirement for p2p:

  • test_fused_blockwise_dataframe_merge
  • test_map_partitions_df_input

There are two ways to fix this:

  1. Increase the installed version to pyarrow>=12.0.0
  2. Force the tests to use task-based shuffling

From what I understand, we want to test those lower versions, so I'll take a quick stab at (2). [EDIT]: Done ✅
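
(For context, forcing task-based shuffling in a test typically comes down to a config override. Below is a minimal sketch of option (2), not the exact change in this PR; the config key and the test body are illustrative:)

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Pin the shuffle method so the code under test never selects p2p,
    # regardless of the installed pyarrow version.
    with dask.config.set({"dataframe.shuffle.method": "tasks"}):
        df = pd.DataFrame({"x": [1, 2, 3] * 10, "y": range(30)})
        ddf = dd.from_pandas(df, npartitions=4)
        result = ddf.merge(ddf, on="x").compute()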

Inline review on the changed code:

    # P2P shuffling is not implemented for Bags, so an outdated pyarrow is
    # not a problem here; fall back to task-based shuffling instead.
    try:
        shuffle = get_default_shuffle_method()
    except ImportError:
        shuffle = "tasks"
@hendrikmakait (Member, Author):

pyarrow does not match p2p requirements, but we don't care.

Member:

Wait, I'm confused. I wouldn't expect get_default_shuffle_method to ever raise

@hendrikmakait (Member, Author):

See dask/distributed#8157 (comment)

The idea is that we now require a fairly recent version of pyarrow, so instead of silently falling back to tasks (or requiring the new pyarrow for all of distributed), we fail early to alert users who have pyarrow installed but below the minimum version. Please defer to @fjetter to hash this out; he's taking over the PR while I'm out.
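
(A minimal sketch of the raising behavior described here, assuming importlib.metadata and packaging are available; the version bound is illustrative, and the distributed-client detection the real function also performs is omitted:)

    from importlib.metadata import PackageNotFoundError, version
    from packaging.version import Version

    MINIMUM_PYARROW = Version("12.0.0")  # illustrative bound, not the exact one

    def default_shuffle_method_sketch() -> str:
        try:
            installed = Version(version("pyarrow"))
        except PackageNotFoundError:
            # pyarrow is absent entirely: quietly fall back to tasks
            return "tasks"
        if installed < MINIMUM_PYARROW:
            # pyarrow is installed but outdated: fail loudly so users either
            # upgrade or explicitly opt into task-based shuffling
            raise ImportError(
                f"P2P shuffling requires pyarrow>={MINIMUM_PYARROW}, "
                f"found {installed}"
            )
        return "p2p"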

@hendrikmakait (Member, Author):

The more troublesome issue is that get_default_shuffle_method is shared between dataframe and bag.

Member:

@jrbourbeau is this a problem for you? I want to avoid users accidentally falling back to tasks if they have an old pyarrow installed. This exception is raised early, and they can fix it either by upgrading pyarrow or by setting the shuffle config to tasks.
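
(The two remedies, sketched; the exact config key name is assumed:)

    # Remedy 1: upgrade pyarrow, e.g. pip install --upgrade "pyarrow>=12.0.0"
    # Remedy 2: explicitly opt out of p2p shuffling via the config:
    import dask

    dask.config.set({"dataframe.shuffle.method": "tasks"})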

Member:

@jrbourbeau I will move forward with this now since you didn't reply in a little over a week. I'd like this to sit in main for at least a day before we release

Member:

Sounds good if CI is happy

@hendrikmakait (Member, Author):
When using the sibling branch on CI, tests are green: f9a11b2

@fjetter (Member) commented Sep 14, 2023

Merged main again. This should also show that the changes are compatible with dask/distributed after merging dask/distributed#8157.

Once the build is done (and green), we can merge.

@fjetter changed the title from "[DNM] get_default_shuffle_method raises if pyarrow is outdated" back to "get_default_shuffle_method raises if pyarrow is outdated" on Sep 14, 2023
@fjetter (Member) commented Sep 14, 2023

There are plenty of failures on gpuCI, but I think they are unrelated. cc @dask/gpu

https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/4831/CUDA_VER=11.5,LINUX_VER=ubuntu18.04,PYTHON_VER=3.9,RAPIDS_VER=23.10/testReport/junit/dask.dataframe.io.tests/test_io/test_gpu_from_pandas_npartitions_duplicates/

dask/dataframe/io/tests/test_io.py:353: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/dask/lib/python3.9/site-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/dataframe.py:728: in __init__
    self._init_from_dict_like(
/opt/conda/envs/dask/lib/python3.9/site-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/dataframe.py:925: in _init_from_dict_like
    keys, values, lengths = zip(
/opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/dataframe.py:931: in <genexpr>
    vc := as_column(v, nan_as_null=nan_as_null),
/opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/column/column.py:2470: in as_column
    data = as_column(
/opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/column/column.py:1996: in as_column
    col = ColumnBase.from_arrow(arbitrary)
/opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/column/column.py:379: in from_arrow
    result = libcudf.interop.from_arrow(data)[0]
/opt/conda/envs/dask/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: Fatal CUDA error encountered at: /opt/conda/conda-bld/work/cpp/src/bitmask/null_mask.cu:93: 2 cudaErrorMemoryAllocation out of memory

@fjetter (Member) commented Sep 14, 2023

I will move forward with merging this since otherwise the dask/distributed repo might also break. If the gpuCI failure turns out to be related, we will have to investigate more closely and possibly revert the distributed PR.

@fjetter merged commit 3f76495 into dask:main on Sep 14, 2023 (23 of 24 checks passed)