
[C++][Python] value_counts extremely slow for chunked DictionaryArray #37055

Closed
randolf-scholz opened this issue Aug 7, 2023 · 9 comments · Fixed by #38394

Comments

@randolf-scholz

Describe the bug, including details regarding any error messages, version, and platform.

I have a large dataset (>100M rows) with a dictionary[int32,string] column (ChunkedArray) and noticed that compute.value_counts is extremely slow for this column, compared to other columns.

table[col].value_counts() is 10x-100x slower than table[col].combine_chunks().value_counts() in this case.
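For reference, a minimal synthetic sketch of that comparison (the chunk count, chunk size, and label cardinality below are made up for illustration, not taken from the actual dataset):

import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical data: 1,000 chunks of 1,000 rows each, 100 distinct labels,
# encoded as dictionary<int32, string>.
chunks = [
    pa.array([f"label_{i % 100}" for i in range(1_000)]).dictionary_encode()
    for _ in range(1_000)
]
col = pa.chunked_array(chunks)

# %timeit pc.value_counts(col)                    # slow path reported here
# %timeit col.combine_chunks().value_counts()     # much faster workaround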

Component(s)

C++, Python

@assignUser
Member

I assume this is with pyarrow 12?

@randolf-scholz
Author

Yes, 12.0.1.

@pitrou
Member

pitrou commented Aug 22, 2023

cc @js8544 @felipecrv

@js8544
Collaborator

js8544 commented Aug 22, 2023

Ah, I had done some research on this issue but forgot to post my findings. I think @rok's comment here and the discussion here explain it well. We can optimize it by first computing it over each chunk and then hash-aggregating the results. However, I don't think we can directly call hash aggregate functions from compute kernels without depending on acero?

cc @westonpace Can you confirm?

@westonpace
Member

I'm not entirely sure I understand the goal. The aggregate operations do have standalone Python bindings. For example:

>>> import pyarrow as pa
>>> x = pa.chunked_array([[1, 2, 3, 4, 5], [6, 7, 8, 9]])
>>> import pyarrow.compute as pc
>>> pc.sum(x)
<pyarrow.Int64Scalar: 45>

However, the individual parts (the partial aggregate function (Consume) and the final aggregate function (Finalize)) cannot be called from Python individually. So, for example, it is not possible to create a streaming aggregator in Python.

However, in this case, you might be able to get away with something like this:

import pyarrow as pa
import pyarrow.compute as pc

x = pa.chunked_array([[1, 2, 3, 4, 5], [6, 7, 8, 9]])
y = pa.chunked_array([[1, 1, 2, 2, 3], [4, 4]])

x_counts = pc.value_counts(x)
y_counts = pc.value_counts(y)

x_batch = pa.RecordBatch.from_struct_array(x_counts)
y_batch = pa.RecordBatch.from_struct_array(y_counts)

table = pa.Table.from_batches([x_batch, y_batch])

counts = table.group_by("values").aggregate([("counts", "sum")])

I'm not sure if it will be faster or not.

@js8544
Collaborator

js8544 commented Aug 24, 2023

I'm not entirely sure I understand the goal.

Sorry, I wasn't clear enough. As discussed here, there are two ways to implement the value_counts kernel for Dictionary inputs. The current implementation uses the first approach, but we want to switch to the second for better performance. However, we would need to call hash_sum within the value_counts kernel. There used to be an internal::GroupBy available, but I'm not sure whether that's still possible after the refactoring. To be clear, I'm talking about the kernel implementation in C++, not user code in Python.
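A rough user-level sketch of that second approach in Python (per-chunk value_counts followed by a hash aggregation), just to illustrate the idea; the actual change would live inside the C++ kernel, and the decode-to-string cast here is only so the group-by key is an ordinary column:

import pyarrow as pa
import pyarrow.compute as pc

col = pa.chunked_array([
    pa.array(["a", "b", "a"]).dictionary_encode(),
    pa.array(["b", "c"]).dictionary_encode(),
])

batches = []
for chunk in col.chunks:
    counts = pc.value_counts(chunk)  # struct<values, counts> for this chunk
    values = counts.field("values").cast(pa.string())  # decode dictionary values
    batches.append(
        pa.RecordBatch.from_arrays([values, counts.field("counts")],
                                   names=["values", "counts"])
    )

# Merge the per-chunk counts with a hash aggregation over the decoded values.
merged = (pa.Table.from_batches(batches)
            .group_by("values")
            .aggregate([("counts", "sum")]))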

@js8544
Collaborator

js8544 commented Oct 20, 2023

Hi @randolf-scholz, do you remember how many chunks are in your ChunkedArray? I'm optimizing this kernel and would like to reproduce your case.

@randolf-scholz
Author

randolf-scholz commented Oct 20, 2023

@js8544 The dataset in question was the table "hosp/labevents.csv" from the MIMIC-IV dataset: https://physionet.org/content/mimiciv/2.2/.

I changed my own preprocessing, so it doesn't really affect me anymore, but I was able to reproduce it in pyarrow 13:

  1. Read the CSV file, parsing the "value" column as dictionary[int32, string]
  2. %timeit table["value"].value_counts(): 10.5 s ± 102 ms (on desktop, was worse on laptop with fewer cores)
  3. %timeit table["value"].combine_chunks().value_counts(): 1.29 s ± 12.9 ms
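A sketch of these steps (the path assumes the MIMIC-IV file has already been downloaded locally, and the timeit harness is just one way to measure; the numbers above came from %timeit):

import pyarrow as pa
import pyarrow.csv as pacsv
from timeit import timeit

convert_options = pacsv.ConvertOptions(
    column_types={"value": pa.dictionary(pa.int32(), pa.string())}
)
table = pacsv.read_csv("hosp/labevents.csv", convert_options=convert_options)

n = 3
chunked = timeit(lambda: table["value"].value_counts(), number=n)
combined = timeit(lambda: table["value"].combine_chunks().value_counts(), number=n)
print(f"chunked: {chunked / n:.2f} s, combined: {combined / n:.2f} s")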

The stats of the data are:

  • length: 118,171,367
  • null_count: 19,803,023 (~17%)
  • num_chunks: 13095
  • num_unique: 39160
  • binary entropy (non-null): 9.48 bits
  • normalized entropy: 62%

@js8544
Collaborator

js8544 commented Oct 23, 2023

Thanks! Since the original file requires registration and some other verification processes, I downloaded a demo file with about 100K rows. Nevertheless, I was able to optimize value_counts() to the same level as combine_chunks().value_counts():

# Before
1.04 ms ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # value_counts()
625 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # combine_chunks().value_counts()
# After
642 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # value_counts()
610 µs ± 2.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # combine_chunks().value_counts()

I'll write a formal C++ benchmark to further verify and send a PR shortly.

felipecrv added a commit that referenced this issue Dec 23, 2023
…38394)

### Rationale for this change

When merging dictionaries across chunks, the hash kernels unnecessarily unify the existing dictionary, dragging down the performance.

### What changes are included in this PR?

Reuse the dictionary unifier across chunks.

### Are these changes tested?

Yes, with a new benchmark for dictionary chunked arrays.

### Are there any user-facing changes?

No. 

* Closes: #37055

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@felipecrv felipecrv added this to the 15.0.0 milestone Dec 23, 2023
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024