Fix tokenize for pandas extension arrays by TomAugspurger · Pull Request #5813 · dask/dask

TomAugspurger · 2020-01-21T16:05:32Z

This adds support for pandas' new extension arrays.

It turned up an issue with our tokenization of some extension dtypes. On master, we don't get a deterministic token for some dtypes. I've done a bit of work here, but still need to check performance (hence the TODO).

TomAugspurger · 2020-01-21T16:07:25Z

Maybe ignore the StringArray / BooleanArray stuff for now. I'll split that into its own PR, and leave this one just for tokenize.

TomAugspurger · 2020-01-21T18:20:58Z

I've limited these changes to just the tokenization fixes. I'll follow up with support for the String and Boolean dtypes.

I made dask/dask-benchmarks#32 with benchmarks. We are slower in several cases, though that's unavoidable. Previously we were falling back to normalize_object, which just called uuid.uuid4().hex and so wasn't deterministic. Now that we're actually getting deterministic tokens, we have to hash the values, which takes time.

# master
[ 50.00%] ··· tokenize.TimeTokenizePandas.time_tokenize
[ 50.00%] ··· ===================== ========= ==========
              --                         as_series
              --------------------- --------------------
                      dtype            True     False
              ===================== ========= ==========
                      period         316±0μs   53.8±0μs
                  datetime64[ns]     342±0μs   444±0μs
               datetime64[ns, CET]   287±0μs   274±0μs
                       int           402±0μs   196±0μs
                     category        582±0μs   148±0μs
                      sparse         267±0μs   55.3±0μs
                       Int           279±0μs   52.5±0μs
                      string         331±0μs   50.5±0μs
                     boolean         314±0μs   57.5±0μs
              ===================== ========= ==========

# This branch
[ 50.00%] ··· tokenize.TimeTokenizePandas.time_tokenize
[ 50.00%] ··· ===================== ========= ==========
              --                         as_series
              --------------------- --------------------
                      dtype            True     False
              ===================== ========= ==========
                      period         346±0μs   51.8±0μs
                  datetime64[ns]     352±0μs   227±0μs
               datetime64[ns, CET]   295±0μs   217±0μs
                       int           381±0μs   238±0μs
                     category        641±0μs   154±0μs
                      sparse         291±0μs   54.5±0μs
                       Int           265±0μs   63.3±0μs
                      string         375±0μs   80.2±0μs
                     boolean         379±0μs   108±0μs
              ===================== ========= ==========
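
To make the trade-off above concrete, here is a minimal standard-library sketch of the two strategies. It is not dask's implementation: `token_fallback` and `token_hashed` are hypothetical names, and hashing the `repr` of each value is an assumption made only for illustration.

```python
import hashlib
import uuid


def token_fallback(obj):
    # Hypothetical stand-in for the old fallback: a random UUID that
    # ignores the values, so it is cheap but differs on every call.
    return uuid.uuid4().hex


def token_hashed(values):
    # Hypothetical stand-in for the new behavior: walk every value and
    # feed it into a hash, so equal inputs give equal tokens. The cost
    # grows with the number of values.
    h = hashlib.sha256()
    for v in values:
        h.update(type(v).__name__.encode())
        h.update(repr(v).encode())
        h.update(b"\x00")  # separator so adjacent values cannot run together
    return h.hexdigest()


data = ["a", None, "b"]
same = list(data)

fallback_stable = token_fallback(data) == token_fallback(same)
hashed_stable = token_hashed(data) == token_hashed(same)
```

The random token costs the same regardless of input size, while the hashed token has to touch every element, which is consistent with the slowdowns in the string and boolean rows.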

TomAugspurger · 2020-01-21T19:38:49Z

We do get a nice speedup for tokenizing a MultiIndex by avoiding allocating the ndarray of tuples from MultiIndex.values.

# master
In [3]: idx = pd.MultiIndex.from_product([['a', 'b', 'c', 'd'], list(range(1000))])

In [4]: %timeit tokenize(idx)
618 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# This PR
In [6]: idx = pd.MultiIndex.from_product([['a', 'b', 'c', 'd'], list(range(1000))])

In [7]: %timeit tokenize(idx)
105 µs ± 5.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
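
The idea behind the speedup can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the helper names are hypothetical, and hashing the levels and codes with `hashlib` stands in for whatever dask does internally. It relies only on `MultiIndex.values`, `MultiIndex.levels`, `MultiIndex.codes`, and `MultiIndex.names`.

```python
import hashlib

import numpy as np
import pandas as pd


def token_from_tuples(idx):
    # Baseline: idx.values allocates an object ndarray holding one
    # Python tuple per row before anything can be hashed.
    h = hashlib.sha256()
    for row in idx.values:
        h.update(repr(row).encode())
    return h.hexdigest()


def token_from_levels_and_codes(idx):
    # Sketch: hash the unique labels of each level plus the integer
    # codes that index into them, without building per-row tuples.
    h = hashlib.sha256()
    h.update(repr(list(idx.names)).encode())
    for level in idx.levels:
        h.update(repr(level.tolist()).encode())
    for codes in idx.codes:
        arr = np.asarray(codes)
        h.update(arr.dtype.str.encode())
        h.update(arr.tobytes())
    return h.hexdigest()


idx = pd.MultiIndex.from_product([["a", "b", "c", "d"], list(range(1000))])
same = pd.MultiIndex.from_product([["a", "b", "c", "d"], list(range(1000))])
other = pd.MultiIndex.from_product([["a", "b", "c", "e"], list(range(1000))])

stable = token_from_levels_and_codes(idx) == token_from_levels_and_codes(same)
distinct = token_from_levels_and_codes(idx) != token_from_levels_and_codes(other)
```

For this index the levels hold 4 and 1000 labels and the codes are two integer arrays of length 4000, so the second function never creates the 4000 tuples that the first one iterates over.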

TomAugspurger · 2020-01-21T20:56:12Z

Apologies for rushing this, but I'm going to self-merge so that I can open the PR for string / boolean that depends on this.

mrocklin · 2020-01-21T20:57:52Z

Thanks for handling this Tom


TomAugspurger added 2 commits January 21, 2020 09:40

Support pandas nullable string and boolean

62ce0b0

tokenize

362798e

TomAugspurger changed the title ~~[WIP] Support pandas 1.0's StringArray and BooleanArray~~ Fix tokenize for pandas extension arrays Jan 21, 2020

TomAugspurger added 2 commits January 21, 2020 12:09

Merge remote-tracking branch 'upstream/master' into string-dtype

0da8c9a

fixups

daed098

TomAugspurger added 2 commits January 21, 2020 13:00

skip reason

0deadb3

multiindex

4bf9ef9

TomAugspurger added 2 commits January 21, 2020 14:11

fixup

676b5ff

fixup

aca8795

TomAugspurger merged commit 04d8658 into dask:master Jan 21, 2020

TomAugspurger deleted the string-dtype branch January 21, 2020 20:56

TomAugspurger merged 8 commits into dask:master from TomAugspurger:string-dtype
