Applying tuple to a dask dataframe without pyarrow installed gives a column of tuples, as expected. With pyarrow installed, the same operation gives a column of strings instead.
The problem can be reproduced by the following commands in the console:
$ pyenv deactivate
$ pyenv virtualenv --clear 3.10.12 tuple10 # create a clean environment
$ pyenv activate tuple10
$ pip install dask[dataframe]==2024.1.1
$ python tuple_test.py # we expect a tuple to be the result
d
0 <class 'tuple'>
$ pip install pyarrow
$ python tuple_test.py # but with pyarrow we get a string instead
d
0 <class 'str'>
Thanks for reporting your issue. This behavior is expected: in 2023.7.1, we added a feature that casts object columns to the string[pyarrow] dtype by default when pyarrow is installed. Most users use the object dtype for strings, and this cast gives them a major performance gain (both memory and runtime).
You can disable this behavior with dask.config.set({"dataframe.convert-string": False}).