UnicodeDecodeError when using a Dataframe with byte data and pandas 2 #10951

Closed
danmar3 opened this issue Feb 26, 2024 · 2 comments
Labels: needs triage (needs a response from a contributor)

danmar3 commented Feb 26, 2024

Describe the issue:
DataFrames with columns containing data of type 'bytes' raise a UnicodeDecodeError. This happens when pandas 2 is installed.

Minimal Complete Verifiable Example:

import pickle

import pandas as pd
import dask.dataframe as dd

# Column 'b' holds raw bytes objects (pickled integers), not text.
data = pd.DataFrame({
    'a': pd.Series([1, 2, 3, 4]),
    'b': pd.Series([pickle.dumps(vi) for vi in [1, 2, 3, 4]]),
})

data_dd = dd.from_pandas(data, npartitions=2)
# Raises UnicodeDecodeError when pandas 2 is installed.
data_dd.map_partitions(lambda x: x['b']).head(2)

This code throws the following error:

...
File env/lib/python3.10/site-packages/dask/base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
...
File lib.pyx:720, in pandas._libs.lib.ensure_string_array()

File lib.pyx:813, in pandas._libs.lib.ensure_string_array()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
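
One possible reading of this message, sketched with the standard library only: a pickle payload written with protocol 2 or later begins with the byte 0x80, which is not a valid first byte in UTF-8, so any attempt to treat the values in column 'b' as text fails at position 0. Whether pandas takes exactly this decoding path internally is an assumption; the snippet only shows that decoding such bytes produces the same kind of error.

```python
import pickle

# Same kind of value as stored in column 'b' of the example.
payload = pickle.dumps(1)

# Pickle protocol 2 and later start the stream with the PROTO opcode, 0x80.
first_byte = payload[0]

# Try to interpret the pickled bytes as UTF-8 text and capture the error.
try:
    payload.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(hex(first_byte))
print(message)
```

This suggests the failure comes from the bytes being handled as strings, not from the pickled data being corrupt.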

Anything else we need to know?:
The code only fails on pandas 2. When using pandas 1.5.3 there is no error.

Environment:

  • Dask version: 2024.2.1
  • Python version: 3.10.12
  • Operating System: wsl2
  • Install method (conda, pip, source): pip
  • Pandas: 2.2.1

github-actions bot added the needs triage label Feb 26, 2024

danmar3 commented Feb 26, 2024

I just found out this is related to #10631 and #10139

Adding dask.config.set({"dataframe.convert-string": False}) fixes this issue.

We can close this if you think this is redundant with the other threads.


phofl commented Apr 4, 2024

Yep, I think closing this makes sense. Thanks for your report and for digging in, though.

phofl closed this as completed Apr 4, 2024