
GH-37055: [C++] Optimize hash kernels for Dictionary ChunkedArrays #38394

Merged · 6 commits · Dec 23, 2023

Conversation


@js8544 js8544 commented Oct 23, 2023

Rationale for this change

When merging dictionaries across chunks, the hash kernels unnecessarily unify the existing dictionary on every chunk, dragging down performance.

What changes are included in this PR?

Reuse the dictionary unifier across chunks.

Are these changes tested?

Yes, with a new benchmark for dictionary chunked arrays.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #37055 has been automatically assigned in GitHub to PR creator.


js8544 commented Oct 23, 2023

Benchmark result: https://gist.github.com/js8544/d64f289313c814a2df6a87a945a0b382#file-pr38394-txt
The improvement is significant when the number of unique values is large, i.e. when the dictionary is large.

There is also a significant improvement in the user's Python case: #37055 (comment)


js8544 commented Oct 23, 2023

It was mentioned in #9683 (comment) that we could compute the result of each chunk and then merge them with hash_sum. However, hash aggregate functions have been moved to Acero, and it's less than ideal for compute kernels to depend on Acero because it sits a level higher in the dependency tree.

This PR saves calls to the dictionary unifier; we could optimize further by speeding up the unification process itself. That will be done once we have a faster hash table: #38372.


js8544 commented Dec 9, 2023

@felipecrv Do you mind having a look at this?

@felipecrv

> @felipecrv Do you mind having a look at this?

Soon!

Diff under review:

-  ARROW_CHECK_OK(unifier->Unify(*arr_dict, &transpose_map));
-  ARROW_CHECK_OK(unifier->GetResult(&out_dict_type, &out_dict));
+  RETURN_NOT_OK(dictionary_unifier_->Unify(*arr_dict, &transpose_map));
+  RETURN_NOT_OK(dictionary_unifier_->GetResult(&out_dict_type, &out_dict));

   dictionary_ = out_dict;

Do we even need dictionary_ to be a member variable now? Wouldn't it suffice to perform a single DictionaryUnifier::GetResult call at the end?

@felipecrv felipecrv Dec 15, 2023


Documentation for DictionaryUnifier::GetResult says the unifier can't be re-used after a call to GetResult [1].

My suggestion (that I think will work well):

Rename dictionary_ to first_dictionary_ and change Append(arr) to transition through this state machine on each call:

// --------------------------------------------------------------------------------------------------------------
//  Current State                                     Next State
// --------------------------------------------------------------------------------------------------------------
//  !first_dictionary_ && !dictionary_unifier_   -->  first_dictionary_ = arr_dict_
//                                                    UNCHANGED dictionary_unifier_
// --------------------------------------------------------------------------------------------------------------
//   first_dictionary_ && !dictionary_unifier_   -->  if !first_dictionary_.Equals(arr_dict) then
//                                                       dictionary_unifier_ = unify(first_dictionary_, arr_dict)
//                                                       first_dictionary_ = nullptr
//                                                    else
//                                                       UNCHANGED first_dictionary_, dictionary_unifier_
//                                                    end
// --------------------------------------------------------------------------------------------------------------
//                          dictionary_unifier_    -->  dictionary_unifier_ = unify(dictionary_unifier_, arr_dict)
// --------------------------------------------------------------------------------------------------------------

You will then have to re-think how dictionary_value_type and dictionary work below.

[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/array_dict.h#L169


@js8544 to clarify, dictionary_unifier_ = unify(dictionary_unifier_, arr_dict) means something like unifier->unify(arr_dict) in the actual code :)

@js8544 js8544 force-pushed the jinshang/dict_hash_using_agg branch from 001a846 to f09ab5c Compare December 16, 2023 04:03
@felipecrv felipecrv left a comment


I don't see any changes other than the rebase.


js8544 commented Dec 19, 2023

> I don't see any changes other than the rebase.

Still working on it :)


js8544 commented Dec 22, 2023

@felipecrv Updated as you suggested.


@felipecrv felipecrv left a comment


Thank you!

Review threads on cpp/src/arrow/compute/kernels/vector_hash.cc resolved.
@felipecrv

@js8544 I won't merge my own suggestions before you have a chance to reply and apply them yourself. If you're OK with them and CI passes with them applied I will merge.

js8544 and others added 3 commits December 23, 2023 11:47
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@felipecrv

The macOS build is failing for some unknown reason, and this branch was rebased very recently, so I'm merging.

@felipecrv felipecrv merged commit ec41209 into apache:main Dec 23, 2023
32 of 34 checks passed

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ec41209.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive from unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024

Successfully merging this pull request may close these issues:

[C++][Python] value_counts extremely slow for chunked DictionaryArray (#37055)