Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

applying tuple with pyarrow #10881

Open
SurkynRik opened this issue Feb 1, 2024 · 2 comments
Open

applying tuple with pyarrow #10881

SurkynRik opened this issue Feb 1, 2024 · 2 comments

Comments

@SurkynRik
Copy link

SurkynRik commented Feb 1, 2024

When applying tuple to a dask dataframe without pyarrow installed, it gives a column with tuples as expected. If instead we apply it with pyarrow installed, we get string dtypes instead.

The problem can be reproduced by the following commands in the console:

$ pyenv deactivate
$ pyenv virtualenv --clear 3.10.12 tuple10 # create a clear environment
$ pyenv activate tuple10
$ pip install dask[dataframe]==2024.1.1
$ python tuple_test.py # we expect a tuple to be the result
                 d
0  <class 'tuple'>
$ pip install pyarrow
$ python tuple_test.py # but with pyarrow we get a string instead
               d
0  <class 'str'>

with tuple_test.py

import dask.dataframe as dd
import pandas as pd


def apply_tuple_on_two_cols(
    counts_df: dd.DataFrame,
):
    counts_df["d"] = counts_df[["b", "c"]].apply(
        tuple, axis=1, meta=pd.Series(dtype=object)
    )
    counts_df["d"] = counts_df["d"].apply(
        type,
        meta=pd.Series(dtype=object),
    )
    return counts_df[["d"]]


def test_tuple_application():
    counts = dd.from_pandas(
        pd.DataFrame({"a": ["1"], "b": ["2"], "c": [3]}), npartitions=1
    )
    result = apply_tuple_on_two_cols(counts)
    print(result.compute())


if __name__ == "__main__":
    test_tuple_application()

Environment:

  • Dask version:2024.1.1
  • Pyarrow version: 15.0.0
  • Python version:3.10.12
  • Operating System:Ubuntu 22.04
  • Install method (conda, pip, source):pip
@github-actions github-actions bot added the needs triage Needs a response from a contributor label Feb 1, 2024
@hendrikmakait
Copy link
Member

Thanks for reporting your issue, this behavior is expected: In 2023.7.1, we added a feature that by casts object columns to the pyarrow[string] dtype by default. Most users are using the object dtype for strings and this cast translates to a major performance gain (both memory and runtime) for them.

You can disable this behavior with dask.config.set({"dataframe.convert-string": False}).

@hendrikmakait hendrikmakait added convert-string and removed needs triage Needs a response from a contributor labels Feb 2, 2024
@hendrikmakait
Copy link
Member

See #10631 for a related discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants