
Use upstream pandas pickling protocol for pyarrow string series #9613

Closed
3 tasks
ian-r-rose opened this issue Oct 31, 2022 · 4 comments · Fixed by #9740
Labels: dataframe, needs attention

Comments

@ian-r-rose (Collaborator)

(Say that ten times fast)

The new(ish) pandas string[pyarrow] extension dtype is an important thing for Dask to support well (cf. #5720, #9590), as it allows for faster string operations with less memory and less potential GIL contention. However, there has been a long-standing bug in pyarrow whereby a slice of a string[pyarrow] series serializes the full backing buffer rather than a slice of it.

In #9024 @jcrist worked around this by registering a custom copyreg implementation for string[pyarrow] arrays, which has been fairly effective. However, this implementation is problematic: it effectively monkey-patches code that does not belong to Dask, and presents a maintenance burden.
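As a general illustration of the mechanism (a toy stand-in, not Dask's actual code from #9024), copyreg lets a library override pickling for a class it does not own by registering a reducer in the global dispatch table — here used to serialize only the visible slice instead of the full backing data:

```python
import copyreg
import pickle

class ThirdPartyArray:
    """Stand-in for a class we don't control (e.g. a pyarrow-backed array)."""
    def __init__(self, data, start=0, stop=None):
        self.data = data
        self.start = start
        self.stop = len(data) if stop is None else stop

def reduce_sliced(arr):
    # Serialize only the visible slice, not the full backing buffer.
    visible = arr.data[arr.start:arr.stop]
    return (ThirdPartyArray, (visible,))

# Register without touching ThirdPartyArray itself; pickle consults
# copyreg.dispatch_table when no per-instance reducer applies.
copyreg.pickle(ThirdPartyArray, reduce_sliced)

big = ThirdPartyArray(list(range(1_000_000)), start=0, stop=10)
restored = pickle.loads(pickle.dumps(big))
print(len(restored.data))  # 10 — only the slice round-trips
```

This is also why the approach is fragile: the registration lives in global state, and a later upstream `__reduce__` on the real class can be shadowed by it.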

A few days ago @mroeschke landed a fix in pandas (pandas-dev/pandas#49078) for the same issue. As yet it is unreleased, but it's my understanding that it will be in pandas 2.0. When it is released, Dask's implementation will shadow the pandas one. Some investigation of the implementations is in a gist here: https://gist.github.com/ian-r-rose/41d5199412154faf1eff5a2df2e8b94e

This issue is a reminder to defer to pandas' implementation once it is released. There are a few things to be done:

  • Defer to pandas' implementation once it is available.
  • Decide whether to delete Dask's implementation entirely, or keep it around for backwards compatibility with older versions of pandas.
  • Possibly upstream any performance improvements in Dask's implementation
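The first two bullets amount to a version gate at import time. A minimal sketch, assuming the fix ships in pandas 2.0 (the function name is hypothetical, not Dask's actual API):

```python
def needs_dask_workaround(pandas_version: str) -> bool:
    """Return True when the installed pandas predates the upstream
    pickling fix (assumed here to ship in pandas 2.0)."""
    major = int(pandas_version.split(".")[0])
    return major < 2

# Dask could register its copyreg workaround only when this is True,
# keeping backwards compatibility with older pandas versions.
print(needs_dask_workaround("1.5.2"))  # True
print(needs_dask_workaround("2.0.0"))  # False
```

In practice one would compare against `pandas.__version__` at import time and skip the copyreg registration entirely on new-enough pandas.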

The last point is a small wrinkle: it appears that the Dask implementation is slightly faster than the pandas one (by ~5%-10%). Some investigation that I did with dask main and a pandas 2.0 nightly:

import copyreg
import contextlib
import pickle
import random
import string

import dask
import dask.dataframe as dd  # Trigger Dask's copyreg
import pandas

print(pandas.__version__)

# Context manager to temporarily remove the dask copyreg reducer.
# (Only restore the entry if one existed; re-inserting None would
# break subsequent pickling of the class.)
@contextlib.contextmanager
def disable_copyreg(klass):
    disp = copyreg.dispatch_table.pop(klass, None)
    try:
        yield
    finally:
        if disp is not None:
            copyreg.dispatch_table[klass] = disp

# Create ~500 MiB of sample string data
s = pandas.Series(
    [
        "".join(random.choices(string.ascii_letters, k=random.randint(100, 1000)))
        for _ in range(1_000_000)
    ],
    dtype="string[pyarrow]",
    name="data",
)

# Which approach is faster?

# Pandas implementation
with disable_copyreg(pandas.arrays.ArrowStringArray):
    %timeit pickle.dumps(s)    # ~550 ms
    
# Dask implementation
%timeit pickle.dumps(s)  # ~500 ms
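Outside IPython, the same comparison can be made with the stdlib timeit module instead of the %timeit magic; a small helper (illustrative, not from the issue):

```python
import pickle
import timeit

def bench_dumps(obj, repeat=5, number=1):
    """Best-of-N wall-clock time for pickle.dumps(obj), in seconds."""
    return min(timeit.repeat(lambda: pickle.dumps(obj), repeat=repeat, number=number))

# With the series above, one would call bench_dumps(s) both inside and
# outside the disable_copyreg context; here, a cheap stdlib object
# serves as a smoke test.
elapsed = bench_dumps(list(range(1000)))
print(elapsed >= 0.0)  # True
```

Taking the minimum over repeats (rather than the mean) is the usual way to reduce noise from background load when comparing two implementations.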

Now, to me that difference isn't hugely important, and the pandas implementation is significantly simpler, so I'm inclined to just go with that. But I thought that @mroeschke might be interested to look at the above.

@mroeschke (Contributor)

As yet it is unreleased, but it's my understanding that it will be in pandas 2.0.

Correct. Additionally, in pandas 2.0 we bumped our minimum pyarrow version to 6.0, FWIW.

Could you also point me to the __reduce__ (or relevant pickle) operation in Dask? Curious to see if there are any tricks pandas could use :)

@ian-r-rose (Collaborator, Author)

Could you also point me to the __reduce__ (or relevant pickle) operation in Dask? Curious to see if there are any tricks pandas could use :)

Yes, the Dask implementation is here: https://github.com/dask/dask/blob/main/dask/dataframe/_pyarrow_compat.py

@mroeschke (Contributor) commented Nov 1, 2022

Thanks for sharing the implementation. Yeah, your implementation really digs into the buffers.

Maybe one barrier to upstreaming this into pandas is that it appears specific to string types, while pandas-dev/pandas#49078 applies to pickling an ArrowExtensionArray with any pyarrow dtype.

@ian-r-rose (Collaborator, Author)

Yes, given the greater generality and simplicity of your implementation, I'm leaning towards just using that. (We could even update the implementation here while we wait for pandas 2.0.)

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Dec 5, 2022
@jrbourbeau jrbourbeau self-assigned this Dec 9, 2022