Use upstream pandas pickling protocol for pyarrow string series #9613
Labels
dataframe
needs attention
It's been a while since this was pushed on. Needs attention from the owner or a maintainer.
(Say that ten times fast)
The new(ish) pandas
string[pyarrow]
extension dtype is an important thing for Dask to support well (cf. #5720, #9590), as it allows for faster string operations in less memory and with less possible GIL contention. However, there has been a long-standing bug in pyarrow whereby a slice of astring[pyarrow]
series serializes the full backing buffer rather than slice of it.In #9024 @jcrist worked around this by registering a custom
copyreg
implementation forpyarrow[string]
arrays, which has been fairly effective. However, this implementation is problematic, as it effectively monkey-patches some code that does not belong to Dask, and presents a maintenance burden.A few days ago @mroeschke landed a fix in pandas (pandas-dev/pandas#49078) for the same issue. As yet it is unreleased, but it's my understanding that it will be in pandas 2.0. When it is released, Dask's implementation will shadow the pandas one. Some investigation of the implementations are in a gist here: https://gist.github.com/ian-r-rose/41d5199412154faf1eff5a2df2e8b94e
This issue is a reminder to defer to pandas' implementation once it is released. There are a few things to be done :
The last point is a small wrinkle: it appears that the Dask implementation is slightly faster than the pandas one (by ~5%-10%). Some investigation that I did with dask
main
and a pandas 2.0 nightly:Now, to me that difference isn't hugely important, and the pandas implementation is significantly simpler, so I'm given to just go with that. But I thought that @mroeschke might be interested to look at the above.
The text was updated successfully, but these errors were encountered: