
Use upstream pandas pickling protocol for pyarrow string series #9613

Closed
3 tasks
ian-r-rose opened this issue Oct 31, 2022 · 4 comments · Fixed by #9740
Labels: dataframe, needs attention

Comments

@ian-r-rose (Collaborator)

(Say that ten times fast)

The new(ish) pandas string[pyarrow] extension dtype is an important thing for Dask to support well (cf. #5720, #9590), as it allows for faster string operations with less memory and less potential GIL contention. However, there has been a long-standing bug in pyarrow whereby a slice of a string[pyarrow] series serializes the full backing buffer rather than a slice of it.

In #9024 @jcrist worked around this by registering a custom copyreg implementation for string[pyarrow] arrays, which has been fairly effective. However, this implementation is problematic: it effectively monkey-patches code that does not belong to Dask, and presents a maintenance burden.
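As a general illustration of the mechanism (a toy stand-in, not Dask's actual code from #9024), copyreg lets a library override pickling for a class it does not own by registering a reducer in the global dispatch table — here used to serialize only the visible slice instead of the full backing data:

```python
import copyreg
import pickle

class ThirdPartyArray:
    """Stand-in for a class we don't control (e.g. a pyarrow-backed array)."""
    def __init__(self, data, start=0, stop=None):
        self.data = data
        self.start = start
        self.stop = len(data) if stop is None else stop

def reduce_sliced(arr):
    # Serialize only the visible slice, not the full backing buffer.
    visible = arr.data[arr.start:arr.stop]
    return (ThirdPartyArray, (visible,))

# Register without touching ThirdPartyArray itself; pickle consults
# copyreg.dispatch_table when no per-instance reducer applies.
copyreg.pickle(ThirdPartyArray, reduce_sliced)

big = ThirdPartyArray(list(range(1_000_000)), start=0, stop=10)
restored = pickle.loads(pickle.dumps(big))
print(len(restored.data))  # 10 — only the slice round-trips
```

This is also why the approach is fragile: the registration lives in global state, and a later upstream `__reduce__` on the real class can be shadowed by it.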

A few days ago @mroeschke landed a fix in pandas (pandas-dev/pandas#49078) for the same issue. As yet it is unreleased, but it's my understanding that it will be in pandas 2.0. When it is released, Dask's implementation will shadow the pandas one. Some investigation of the implementations is in a gist here: https://gist.github.com/ian-r-rose/41d5199412154faf1eff5a2df2e8b94e

This issue is a reminder to defer to pandas' implementation once it is released. There are a few things to be done:

  • Defer to pandas' implementation once it is available.
  • Decide whether to delete Dask's implementation entirely, or keep it around for backwards compatibility with older versions of pandas.
  • Possibly upstream any performance improvements in Dask's implementation
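The first two bullets amount to a version gate at import time. A minimal sketch, assuming the fix ships in pandas 2.0 (the function name is hypothetical, not Dask's actual API):

```python
def needs_dask_workaround(pandas_version: str) -> bool:
    """Return True when the installed pandas predates the upstream
    pickling fix (assumed here to ship in pandas 2.0)."""
    major = int(pandas_version.split(".")[0])
    return major < 2

# Dask could register its copyreg workaround only when this is True,
# keeping backwards compatibility with older pandas versions.
print(needs_dask_workaround("1.5.2"))  # True
print(needs_dask_workaround("2.0.0"))  # False
```

In practice one would compare against `pandas.__version__` at import time and skip the copyreg registration entirely on new-enough pandas.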

The last point is a small wrinkle: it appears that the Dask implementation is slightly faster than the pandas one (by ~5%-10%). Some investigation that I did with dask main and a pandas 2.0 nightly:

import copyreg
import contextlib
import pickle
import random
import string

import dask
import dask.dataframe as dd  # Trigger Dask's copyreg
import pandas

print(pandas.__version__)

# Context manager to temporarily remove the dask copyreg reducer.
# (Only restore the entry if one existed; re-inserting None would
# break subsequent pickling of the class.)
@contextlib.contextmanager
def disable_copyreg(klass):
    disp = copyreg.dispatch_table.pop(klass, None)
    try:
        yield
    finally:
        if disp is not None:
            copyreg.dispatch_table[klass] = disp

# Create ~500 MiB of sample string data
s = pandas.Series(
    [
        "".join(random.choices(string.ascii_letters, k=random.randint(100, 1000)))
        for _ in range(1_000_000)
    ],
    dtype="string[pyarrow]",
    name="data",
)

# Which approach is faster?

# Pandas implementation
with disable_copyreg(pandas.arrays.ArrowStringArray):
    %timeit pickle.dumps(s)    # ~550 ms
    
# Dask implementation
%timeit pickle.dumps(s)  # ~500 ms
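Outside IPython, the same comparison can be made with the stdlib timeit module instead of the %timeit magic; a small helper (illustrative, not from the issue):

```python
import pickle
import timeit

def bench_dumps(obj, repeat=5, number=1):
    """Best-of-N wall-clock time for pickle.dumps(obj), in seconds."""
    return min(timeit.repeat(lambda: pickle.dumps(obj), repeat=repeat, number=number))

# With the series above, one would call bench_dumps(s) both inside and
# outside the disable_copyreg context; here, a cheap stdlib object
# serves as a smoke test.
elapsed = bench_dumps(list(range(1000)))
print(elapsed >= 0.0)  # True
```

Taking the minimum over repeats (rather than the mean) is the usual way to reduce noise from background load when comparing two implementations.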

Now, to me that difference isn't hugely important, and the pandas implementation is significantly simpler, so I'm inclined to just go with that. But I thought that @mroeschke might be interested to look at the above.

@mroeschke (Contributor)

As yet it is unreleased, but it's my understanding that it will be in pandas 2.0.

Correct. Additionally, in pandas 2.0 we bumped our minimum pyarrow version to 6.0, FWIW.

Could you also point me to the __reduce__ (or relevant pickle) operation in Dask? Curious to see if there are any tricks pandas could use :)

@ian-r-rose (Collaborator, Author)

Could you also point me to the __reduce__ (or relevant pickle) operation in Dask? Curious to see if there are any tricks pandas could use :)

Yes, the Dask implementation is here: https://github.com/dask/dask/blob/main/dask/dataframe/_pyarrow_compat.py

@mroeschke (Contributor) commented Nov 1, 2022

Thanks for sharing the implementation. Yeah, your implementation really digs into the buffers.

Maybe one barrier to upstreaming this into pandas is that it appears specific to string types, while pandas-dev/pandas#49078 applies to pickling an ArrowExtensionArray with any pyarrow dtype.

@ian-r-rose (Collaborator, Author)

Yes, given the greater generality and simplicity of your implementation, I'm leaning towards just using that. (We could even update the implementation here while we wait for pandas 2.0.)

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Dec 5, 2022
@jrbourbeau jrbourbeau self-assigned this Dec 9, 2022