Skip to content

Fix tokenize for pandas extension arrays#5813

Merged
TomAugspurger merged 8 commits intodask:masterfrom
TomAugspurger:string-dtype
Jan 21, 2020
Merged

Fix tokenize for pandas extension arrays#5813
TomAugspurger merged 8 commits intodask:masterfrom
TomAugspurger:string-dtype

Conversation

@TomAugspurger
Copy link
Copy Markdown
Member

This adds support for pandas new arrays.

It turned up an issue with our tokenization of some extension dtypes. On master, we don't get a deterministic token for some dtypes. I've done a bit of work here, but still need to check performance (hence the TODO).

@TomAugspurger
Copy link
Copy Markdown
Member Author

Maybe ignore the StringArray / BooleanArray stuff for now. I'll split that into its own PR, and leave this one just for tokenize.

@TomAugspurger TomAugspurger changed the title [WIP] Support pandas 1.0's StringArray and BooleanArray Fix tokenize for pandas extension arrays Jan 21, 2020
@TomAugspurger
Copy link
Copy Markdown
Member Author

I've limited these changes to just the tokenization things. Will followup with support for String and Boolean dtypes.

I made dask/dask-benchmarks#32 with benchmarks. We are slower in several cases, though that's unavoidable. Previously we were falling back to normalize_object, which just called uuid.uuid4().hex, which wasn't deterministic. Now that we're actually getting deterministic tokens we have to hash the values, which takes time.

# master
[ 50.00%] ··· tokenize.TimeTokenizePandas.time_tokenize
[ 50.00%] ··· ===================== ========= ==========
              --                         as_series
              --------------------- --------------------
                      dtype            True     False
              ===================== ========= ==========
                      period         316±0μs   53.8±0μs
                  datetime64[ns]     342±0μs   444±0μs
               datetime64[ns, CET]   287±0μs   274±0μs
                       int           402±0μs   196±0μs
                     category        582±0μs   148±0μs
                      sparse         267±0μs   55.3±0μs
                       Int           279±0μs   52.5±0μs
                      string         331±0μs   50.5±0μs
                     boolean         314±0μs   57.5±0μs
              ===================== ========= ==========

# This branch
[ 50.00%] ··· tokenize.TimeTokenizePandas.time_tokenize
[ 50.00%] ··· ===================== ========= ==========
              --                         as_series
              --------------------- --------------------
                      dtype            True     False
              ===================== ========= ==========
                      period         346±0μs   51.8±0μs
                  datetime64[ns]     352±0μs   227±0μs
               datetime64[ns, CET]   295±0μs   217±0μs
                       int           381±0μs   238±0μs
                     category        641±0μs   154±0μs
                      sparse         291±0μs   54.5±0μs
                       Int           265±0μs   63.3±0μs
                      string         375±0μs   80.2±0μs
                     boolean         379±0μs   108±0μs
              ===================== ========= ==========

@TomAugspurger
Copy link
Copy Markdown
Member Author

We do get a nice speedup for tokenizing a MultiIndex by avoiding allocating the ndarray of tuples from MultiIndex.values.

# master
In [3]: idx = pd.MultiIndex.from_product([['a', 'b', 'c', 'd'], list(range(1000))])

In [4]: %timeit tokenize(idx)
618 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# This PR
In [6]: idx = pd.MultiIndex.from_product([['a', 'b', 'c', 'd'], list(range(1000))])

In [7]: %timeit tokenize(idx)
105 µs ± 5.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@TomAugspurger
Copy link
Copy Markdown
Member Author

Apologies for rushing this, but going to self-merge so that I can make the PR for string / boolean that depends on this.

@TomAugspurger TomAugspurger merged commit 04d8658 into dask:master Jan 21, 2020
@TomAugspurger TomAugspurger deleted the string-dtype branch January 21, 2020 20:56
@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Jan 21, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants