Fix tokenize for pandas extension arrays#5813
Conversation
Maybe ignore the StringArray / BooleanArray stuff for now. I'll split that into its own PR, and leave this one just for tokenize.
I've limited these changes to just the tokenization things. I'll follow up with support for String and Boolean dtypes. I made dask/dask-benchmarks#32 with benchmarks. We are slower in several cases, though that's unavoidable. Previously we were falling back to
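For context, the masked extension arrays backing the nullable dtypes (e.g. `Int64`, and the `BooleanArray` mentioned above) carry a data buffer plus a boolean validity mask, so a deterministic token has to cover both. A minimal stdlib-only sketch of that idea (`tokenize_masked` is a hypothetical helper, not dask's actual code path):

```python
import hashlib

def tokenize_masked(values, mask, dtype_name):
    # Hypothetical sketch: fold the dtype name, the data, and the
    # validity mask into one stable hash, so equal arrays always get
    # equal tokens and a changed mask changes the token.
    h = hashlib.md5()
    h.update(dtype_name.encode())
    h.update(repr(tuple(values)).encode())
    h.update(repr(tuple(mask)).encode())
    return h.hexdigest()
```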
We do get a nice speedup for tokenizing a MultiIndex by avoiding allocating the ndarray of tuples:

```python
# master
In [3]: idx = pd.MultiIndex.from_product([['a', 'b', 'c', 'd'], list(range(1000))])
In [4]: %timeit tokenize(idx)
618 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# This PR
In [6]: idx = pd.MultiIndex.from_product([['a', 'b', 'c', 'd'], list(range(1000))])
In [7]: %timeit tokenize(idx)
105 µs ± 5.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
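The speedup comes from hashing the MultiIndex's levels and integer codes directly instead of materializing the ndarray of tuples. A rough stdlib-only sketch of that idea (`tokenize_multiindex` is a hypothetical helper, not the PR's actual implementation):

```python
import hashlib

def tokenize_multiindex(levels, codes):
    # Hypothetical sketch: hash each level and each codes sequence
    # separately, so we never build the n_rows tuples that
    # MultiIndex.values would allocate.
    h = hashlib.md5()
    for level in levels:
        h.update(repr(tuple(level)).encode())
    for code in codes:
        h.update(repr(tuple(code)).encode())
    return h.hexdigest()
```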
Apologies for rushing this, but I'm going to self-merge so that I can make the PR for string / boolean that depends on this.
Thanks for handling this, Tom.
…On Tue, Jan 21, 2020 at 12:56 PM Tom Augspurger wrote:
Merged #5813 into master.
This adds support for pandas' new arrays.
It turned up an issue with our tokenization of some extension dtypes: on master, we don't get a deterministic token for some of them. I've done a bit of work here, but still need to check performance (hence the TODO).
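The determinism problem is the usual one: falling back to something identity-based (like the default `repr`, which embeds a memory address) gives different tokens for equal objects. A toy sketch of the fix, normalizing an extension-like type to plain data before hashing (all names here are illustrative, not dask's actual API):

```python
import hashlib
from functools import singledispatch

@singledispatch
def normalize(obj):
    # Fallback: repr() may embed object identity, so tokens built
    # from it are not deterministic across equal objects.
    return repr(obj)

class CategoricalLike:
    # Toy stand-in for an extension array: integer codes + categories.
    def __init__(self, codes, categories):
        self.codes = list(codes)
        self.categories = list(categories)

@normalize.register(CategoricalLike)
def _(obj):
    # Deterministic: reduce to plain data, ignoring object identity.
    return ("CategoricalLike", tuple(obj.codes), tuple(obj.categories))

def tokenize(obj):
    return hashlib.md5(repr(normalize(obj)).encode()).hexdigest()
```

With the registration, two equal `CategoricalLike` instances hash to the same token; without it, the default `repr` fallback would generally differ per instance.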