
[C++][Python] value_counts extremely slow for chunked DictionaryArray #37055

Closed
randolf-scholz opened this issue Aug 7, 2023 · 9 comments · Fixed by #38394

Comments

@randolf-scholz

Describe the bug, including details regarding any error messages, version, and platform.

I have a large dataset (>100M rows) with a dictionary[int32,string] column (ChunkedArray) and noticed that compute.value_counts is extremely slow for this column, compared to other columns.

table[col].value_counts() is 10x-100x slower than table[col].combine_chunks().value_counts() in this case.
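For reference, a minimal synthetic sketch of that comparison (the chunk count, chunk size, and label cardinality below are made up for illustration, not taken from the actual dataset):

import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical data: 1,000 chunks of 1,000 rows each, 100 distinct labels,
# encoded as dictionary<int32, string>.
chunks = [
    pa.array([f"label_{i % 100}" for i in range(1_000)]).dictionary_encode()
    for _ in range(1_000)
]
col = pa.chunked_array(chunks)

# %timeit pc.value_counts(col)                    # slow path reported here
# %timeit col.combine_chunks().value_counts()     # much faster workaround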

Component(s)

C++, Python

@assignUser
Member

I assume this is with pyarrow 12?

@randolf-scholz
Author

Yes, 12.0.1.

@pitrou
Member

pitrou commented Aug 22, 2023

cc @js8544 @felipecrv

@js8544
Collaborator

js8544 commented Aug 22, 2023

Ah, I had done some research on this issue but forgot to post my findings. I think @rok's comment here and the discussion here explain it well. We can optimize it by first computing it over each chunk and then hash-aggregating the results. However, I don't think we can directly call hash aggregate functions from compute kernels without depending on acero?

cc @westonpace Can you confirm?

@westonpace
Member

I'm not entirely sure I understand the goal. The aggregate operations do have standalone Python bindings. For example:

>>> import pyarrow as pa
>>> x = pa.chunked_array([[1, 2, 3, 4, 5], [6, 7, 8, 9]])
>>> import pyarrow.compute as pc
>>> pc.sum(x)
<pyarrow.Int64Scalar: 45>

However, the individual parts (the partial aggregate function (Consume) and the final aggregate function (Finalize)) cannot be called from Python individually. So, for example, it is not possible to create a streaming aggregator in Python.

However, in this case, you might be able to get away with something like this:

import pyarrow as pa
import pyarrow.compute as pc

x = pa.chunked_array([[1, 2, 3, 4, 5], [6, 7, 8, 9]])
y = pa.chunked_array([[1, 1, 2, 2, 3], [4, 4]])

x_counts = pc.value_counts(x)
y_counts = pc.value_counts(y)

x_batch = pa.RecordBatch.from_struct_array(x_counts)
y_batch = pa.RecordBatch.from_struct_array(y_counts)

table = pa.Table.from_batches([x_batch, y_batch])

counts = table.group_by("values").aggregate([("counts", "sum")])

I'm not sure if it will be faster or not.

@js8544
Collaborator

js8544 commented Aug 24, 2023

I'm not entirely sure I understand the goal.

Sorry, I wasn't clear enough. As discussed here, there are two ways to implement the value_counts kernel for Dictionary inputs. The current implementation uses the first approach, but we want to switch to the second for better performance. However, we would need to call hash_sum within the value_counts kernel. There used to be an internal::GroupBy available, but I'm not sure whether that's still possible after the refactoring. To be clear, I'm talking about the kernel implementation in C++, not user code in Python.
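A rough user-level sketch of that second approach in Python (per-chunk value_counts followed by a hash aggregation), just to illustrate the idea; the actual change would live inside the C++ kernel, and the decode-to-string cast here is only so the group-by key is an ordinary column:

import pyarrow as pa
import pyarrow.compute as pc

col = pa.chunked_array([
    pa.array(["a", "b", "a"]).dictionary_encode(),
    pa.array(["b", "c"]).dictionary_encode(),
])

batches = []
for chunk in col.chunks:
    counts = pc.value_counts(chunk)  # struct<values, counts> for this chunk
    values = counts.field("values").cast(pa.string())  # decode dictionary values
    batches.append(
        pa.RecordBatch.from_arrays([values, counts.field("counts")],
                                   names=["values", "counts"])
    )

# Merge the per-chunk counts with a hash aggregation over the decoded values.
merged = (pa.Table.from_batches(batches)
            .group_by("values")
            .aggregate([("counts", "sum")]))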

@js8544
Collaborator

js8544 commented Oct 20, 2023

Hi @randolf-scholz, do you remember how many chunks are in your ChunkedArray? I'm optimizing this kernel and would like to reproduce your case.

@randolf-scholz
Author

randolf-scholz commented Oct 20, 2023

@js8544 The dataset in question was the table "hosp/labevents.csv" from the MIMIC-IV dataset: https://physionet.org/content/mimiciv/2.2/.

I changed my own preprocessing, so it doesn't really affect me anymore, but I was able to reproduce it in pyarrow 13:

  1. Read the CSV file, parsing the "value" column as dictionary[int32, string]
  2. %timeit table["value"].value_counts(): 10.5 s ± 102 ms (on desktop, was worse on laptop with fewer cores)
  3. %timeit table["value"].combine_chunks().value_counts(): 1.29 s ± 12.9 ms
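A sketch of these steps (the path assumes the MIMIC-IV file has already been downloaded locally, and the timeit harness is just one way to measure; the numbers above came from %timeit):

import pyarrow as pa
import pyarrow.csv as pacsv
from timeit import timeit

convert_options = pacsv.ConvertOptions(
    column_types={"value": pa.dictionary(pa.int32(), pa.string())}
)
table = pacsv.read_csv("hosp/labevents.csv", convert_options=convert_options)

n = 3
chunked = timeit(lambda: table["value"].value_counts(), number=n)
combined = timeit(lambda: table["value"].combine_chunks().value_counts(), number=n)
print(f"chunked: {chunked / n:.2f} s, combined: {combined / n:.2f} s")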

The stats of the data are:

  • length: 118,171,367
  • null_count: 19,803,023 (~17%)
  • num_chunks: 13095
  • num_unique: 39160
  • binary entropy (non-null): 9.48 bits
  • normalized entropy: 62%

@js8544
Collaborator

js8544 commented Oct 23, 2023

Thanks! Since the original file requires registration and some other verification processes, I downloaded a demo file with about 100K rows. Nevertheless, I was able to optimize value_counts() to the same level as combine_chunks().value_counts():

# Before
1.04 ms ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # value_counts()
625 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # combine_chunks().value_counts()
# After
642 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # value_counts()
610 µs ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # combine_chunks().value_counts()

I'll write a formal C++ benchmark to further verify and send a PR shortly.

felipecrv added a commit that referenced this issue Dec 23, 2023
…38394)

### Rationale for this change

When merging dictionaries across chunks, the hash kernels unnecessarily unify the existing dictionary, dragging down the performance.

### What changes are included in this PR?

Reuse the dictionary unifier across chunks.

### Are these changes tested?

Yes, with a new benchmark for dictionary chunked arrays.

### Are there any user-facing changes?

No. 

* Closes: #37055

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@felipecrv felipecrv added this to the 15.0.0 milestone Dec 23, 2023
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024