applying tuple with pyarrow #10881

SurkynRik · 2024-02-01T15:20:45Z

When applying tuple row-wise to a Dask DataFrame without pyarrow installed, the result is a column of tuples, as expected. With pyarrow installed, the same code produces a column of strings instead.

The problem can be reproduced by the following commands in the console:

$ pyenv deactivate
$ pyenv virtualenv --clear 3.10.12 tuple10 # create a clean environment
$ pyenv activate tuple10
$ pip install dask[dataframe]==2024.1.1
$ python tuple_test.py # we expect a tuple to be the result
                 d
0  <class 'tuple'>
$ pip install pyarrow
$ python tuple_test.py # but with pyarrow we get a string instead
               d
0  <class 'str'>

where tuple_test.py contains:

import dask.dataframe as dd
import pandas as pd


def apply_tuple_on_two_cols(
    counts_df: dd.DataFrame,
):
    counts_df["d"] = counts_df[["b", "c"]].apply(
        tuple, axis=1, meta=pd.Series(dtype=object)
    )
    counts_df["d"] = counts_df["d"].apply(
        type,
        meta=pd.Series(dtype=object),
    )
    return counts_df[["d"]]


def test_tuple_application():
    counts = dd.from_pandas(
        pd.DataFrame({"a": ["1"], "b": ["2"], "c": [3]}), npartitions=1
    )
    result = apply_tuple_on_two_cols(counts)
    print(result.compute())


if __name__ == "__main__":
    test_tuple_application()

Environment:

Dask version: 2024.1.1
Pyarrow version: 15.0.0
Python version: 3.10.12
Operating System: Ubuntu 22.04
Install method (conda, pip, source): pip

hendrikmakait · 2024-02-02T08:36:48Z

Thanks for reporting your issue. This behavior is expected: in 2023.7.1, we added a feature that casts object columns to the string[pyarrow] dtype by default. Most users use the object dtype for strings, and this cast translates into a major performance gain (both memory and runtime) for them.

You can disable this behavior with dask.config.set({"dataframe.convert-string": False}).

hendrikmakait · 2024-02-02T08:38:49Z

See #10631 for a related discussion.

github-actions bot added the "needs triage" label (Needs a response from a contributor) · Feb 1, 2024

hendrikmakait added the "convert-string" label and removed the "needs triage" label · Feb 2, 2024
