get_default_shuffle_method raises if pyarrow is outdated #10496
Conversation
@hendrikmakait we're getting a …
@jrbourbeau: I checked …
There are two ways to fix this: …
From what I understand, we want to test those lower versions, so I'll take a quick stab at (2.). [EDIT]: Done ✅
    # Not implemented for Bags
    try:
        shuffle = get_default_shuffle_method()
    except ImportError:
        # Assumed fallback: pyarrow is too old for p2p, so use the task-based shuffle
        shuffle = "tasks"
pyarrow does not match p2p requirements, but we don't care.
Wait, I'm confused. I wouldn't expect get_default_shuffle_method to ever raise.
See dask/distributed#8157 (comment). The idea is that we now require a pretty recent version of pyarrow, so instead of silently falling back to tasks (or requiring pyarrow for all of distributed), we fail early to alert users who have pyarrow installed but below the minimum version. Please defer to @fjetter to hash this out; he's taking over the PR while I'm out.
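To illustrate the behavior being described, here is a minimal sketch of such a version gate. The function name, minimum version, and error message are assumptions for illustration, not the actual dask source:

```python
from packaging.version import parse

# Hypothetical minimum for illustration; the real floor is defined
# by dask/distributed's p2p shuffle requirements.
MIN_PYARROW = parse("7.0.0")


def check_pyarrow_for_p2p() -> bool:
    """Return True if p2p shuffling is usable; raise if pyarrow is too old."""
    try:
        import pyarrow
    except ImportError:
        # No pyarrow at all: p2p is simply unavailable, no error raised.
        return False
    if parse(pyarrow.__version__) < MIN_PYARROW:
        # pyarrow is installed but outdated: fail loudly instead of
        # silently degrading to the task-based shuffle.
        raise ImportError(
            f"p2p shuffling requires pyarrow>={MIN_PYARROW}, "
            f"found {pyarrow.__version__}"
        )
    return True
```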
The more troublesome issue is that get_default_shuffle_method gets shared between dataframe and bag.
@jrbourbeau is this a problem for you? I want to avoid users accidentally falling back to tasks if they have an old pyarrow installed. This exception is raised early, and they can fix it either by upgrading pyarrow or by setting the shuffle config to tasks.
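For reference, setting that config in code looks roughly like this; the config key is an assumption based on dask's dataframe config naming, so verify it against your dask version:

```python
import dask

# Assumed config key for selecting the shuffle implementation.
# Scoped via context manager; call dask.config.set(...) alone to set globally.
with dask.config.set({"dataframe.shuffle.method": "tasks"}):
    pass  # shuffle-heavy operations here use the task-based method
```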
@jrbourbeau I will move forward with this now since it's been a little over a week without a reply. I'd like this to sit in main for at least a day before we release.
Sounds good if CI is happy
When using the sibling branch on CI, tests are green: f9a11b2
Merged main again. This should also show that the changes are compatible with dask/distributed after merging dask/distributed#8157. Once the build is done (and green), we can merge.
There are plenty of failures on gpu CI, but I think they're unrelated (cc @dask/gpu):

    dask/dataframe/io/tests/test_io.py:353:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    /opt/conda/envs/dask/lib/python3.9/site-packages/nvtx/nvtx.py:101: in inner
        result = func(*args, **kwargs)
    /opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/dataframe.py:728: in __init__
        self._init_from_dict_like(
    /opt/conda/envs/dask/lib/python3.9/site-packages/nvtx/nvtx.py:101: in inner
        result = func(*args, **kwargs)
    /opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/dataframe.py:925: in _init_from_dict_like
        keys, values, lengths = zip(
    /opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/dataframe.py:931: in <genexpr>
        vc := as_column(v, nan_as_null=nan_as_null),
    /opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/column/column.py:2470: in as_column
        data = as_column(
    /opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/column/column.py:1996: in as_column
        col = ColumnBase.from_arrow(arbitrary)
    /opt/conda/envs/dask/lib/python3.9/site-packages/cudf/core/column/column.py:379: in from_arrow
        result = libcudf.interop.from_arrow(data)[0]
    /opt/conda/envs/dask/lib/python3.9/contextlib.py:79: in inner
        return func(*args, **kwds)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    >   ???
    E   RuntimeError: Fatal CUDA error encountered at: /opt/conda/conda-bld/work/cpp/src/bitmask/null_mask.cu:93: 2 cudaErrorMemoryAllocation out of memory
I will move forward with merging this since otherwise the dask/distributed repo might also break. If the gpuCI failure is related, we will have to investigate it more closely and possibly revert the distributed PR.