ARROW-16807: [C++][R] count distinct incorrectly merges state #13583

Merged: 11 commits merged into apache:master on Jul 16, 2022

Contributor

@drin drin commented Jul 12, 2022

This addresses a bug where the count_distinct function simply added counts when merging state. The correct logic would be to return the number of distinct elements after both states have been merged.
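
To illustrate the difference, here is a minimal sketch that uses `std::unordered_set` as a stand-in for the kernel state. The names `DistinctState`, `NaiveMergedCount`, and `MergedDistinctCount` are hypothetical and are not part of Arrow; the sketch only shows why summing per-chunk distinct counts over-counts values that appear in more than one chunk.

```cpp
#include <cstdint>
#include <unordered_set>

// Stand-in for per-chunk count_distinct state (an assumption for illustration;
// the real state is backed by a MemoTable).
using DistinctState = std::unordered_set<int64_t>;

// Buggy merge: adds the two distinct counts, so shared values are counted twice.
inline int64_t NaiveMergedCount(const DistinctState& a, const DistinctState& b) {
  return static_cast<int64_t>(a.size() + b.size());
}

// Correct merge: take the union of both states, then count.
inline int64_t MergedDistinctCount(const DistinctState& a, const DistinctState& b) {
  DistinctState merged = a;
  merged.insert(b.begin(), b.end());
  return static_cast<int64_t>(merged.size());
}
```

For chunks holding {1, 2, 3} and {2, 3, 4}, the naive version reports 6 while the union-based version reports 4.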

State for count_distinct is backed by a MemoTable, which is then backed by a HashTable. To properly merge state, this PR adds 2 functions to each MemoTable: MaybeInsert and MergeTable. The MaybeInsert function handles simplified logic for inserting an element into the MemoTable. The MergeTable function handles iteration over elements in the MemoTable to be merged.
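
As a rough sketch of the merge described above, the toy class below keeps a value-to-index map and merges another table by iterating over its values and inserting each one. `ToyMemoTable` and its members are hypothetical; Arrow's MemoTable classes have a different interface, and this is not the code added by the PR.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy memo table: maps each distinct value to a dense index in insertion order.
// Hypothetical stand-in; not Arrow's MemoTable interface.
class ToyMemoTable {
 public:
  // Returns the index of `value`, inserting it first if it is not present.
  int32_t GetOrInsert(int64_t value) {
    auto it = index_.find(value);
    if (it != index_.end()) return it->second;
    const int32_t new_index = static_cast<int32_t>(values_.size());
    index_.emplace(value, new_index);
    values_.push_back(value);
    return new_index;
  }

  // Inserts every value of `other` into this table; values already present
  // keep their existing index and do not grow the table.
  void MergeTable(const ToyMemoTable& other) {
    for (int64_t value : other.values_) GetOrInsert(value);
  }

  int32_t size() const { return static_cast<int32_t>(values_.size()); }

 private:
  std::unordered_map<int64_t, int32_t> index_;
  std::vector<int64_t> values_;
};
```

After merging, `size()` on the destination table gives the distinct count across both inputs, which is the quantity count_distinct should report.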

This PR also adds an R test and a C++ test. The R test mirrors what was provided in ARROW-16807. The C++ test, AllChunkedArrayTypesWithNulls, mirrors another C++ test, AllArrayTypesWithNulls, but uses chunked arrays for test data.

drin added 5 commits July 12, 2022 08:47

The bug described in ARROW-16807 is that when merging states for count
distinct, the non_null counts are simply added (no overload that I saw).
The fix was to change this into a proper merge of the data in the MemoTable.

There are 3 derived classes from MemoTable. This adds a `MaybeInsert`
and `MergeTable` function to each to support the merge logic.

This adds a unittest to run count_distinct on chunked arrays, which were
incorrectly handled before.

This is a test to reproduce the issue seen in ARROW-16807, which calls
`count_distinct` (via `n_distinct`) on a dataset with many chunks. If
`summarize` is called before `collect`, then count_distinct merges
distinct counts across chunks incorrectly. Calling collect before
summarize does not expose the bug.

Added an extra assert to be sure that collect and summarize produce the
same results when commuted.


@drin
Contributor Author

drin commented Jul 12, 2022

I just realized that this introduces (or maybe just exposes) a bug when calling this function on scalar inputs. If the input is a scalar, non_nulls is incremented without changing state. To address this "correctly," the code path for scalar inputs should also update the state by using GetOrInsert as is done for the code path for vector inputs.

I am working on figuring this out, given that the compiler has type conversion issues when attempting to just call GetOrInsert with a scalar reference.
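
The scalar problem described in this comment can be sketched as follows. `ToyCountDistinctState` and its methods are hypothetical names, and `std::unordered_set` stands in for the memo table; the point is only that the scalar path has to record the value in the state, just like the vector path, so that the count can be derived from the state in both cases.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical state for a count-distinct aggregation over int64 values.
struct ToyCountDistinctState {
  std::unordered_set<int64_t> seen;  // stand-in for the memo table
  int64_t non_nulls = 0;

  // Vector path: record every non-null value, then derive the count.
  void ConsumeArray(const std::vector<std::optional<int64_t>>& values) {
    for (const auto& v : values) {
      if (v.has_value()) seen.insert(*v);
    }
    non_nulls = static_cast<int64_t>(seen.size());
  }

  // Scalar path: must also record the value rather than only bumping the
  // counter; otherwise a repeated scalar would be counted more than once.
  void ConsumeScalar(const std::optional<int64_t>& value) {
    if (value.has_value()) seen.insert(*value);
    non_nulls = static_cast<int64_t>(seen.size());
  }
};
```

With this shape, consuming a scalar whose value was already seen leaves `non_nulls` unchanged.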

@cyb70289 cyb70289 changed the title ARROW-16807: count distinct incorrectly merges state ARROW-16807: [C++][R] count distinct incorrectly merges state Jul 13, 2022
r/tests/testthat/test-dplyr-summarize.R (outdated, resolved)
cpp/src/arrow/util/hashing.h (outdated, resolved)
drin added 2 commits July 13, 2022 16:38
The path for count_distinct with scalar inputs didn't update state, and
instead only added a single count. To fix this, we use MaybeInsert and
UnboxScalar to insert the new value. The non_null count can then be set
the same way for both vector and scalar inputs.

The UnboxScalar implementation that has `enable_if_has_string_view` is
also enabled for Decimal128 and Decimal256 types. Usually the
implementation templated for these types would be called, but the
BinaryMemoTable only has a type for BinaryTypes. This means that for
count_distinct with Decimal columns, the UnboxScalar implementation that
gets called is the one for the string_view type.
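
The idea of treating a fixed-width decimal as a byte string can be sketched as below. This is an illustration under assumptions, not Arrow's UnboxScalar: `ToyDecimal128`, `AsBytesView`, `ToyBinaryMemo`, and `InsertDecimal` are hypothetical names, and a `std::unordered_set<std::string>` stands in for a memo table that only understands binary keys.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical 128-bit fixed-width value, standing in for a decimal scalar.
struct ToyDecimal128 {
  std::array<uint8_t, 16> bytes;
};

// View the raw bytes of the value without copying them.
inline std::string_view AsBytesView(const ToyDecimal128& value) {
  return std::string_view(reinterpret_cast<const char*>(value.bytes.data()),
                          value.bytes.size());
}

// Binary-keyed set: stand-in for a memo table that only accepts byte strings.
using ToyBinaryMemo = std::unordered_set<std::string>;

// Inserts the decimal by its byte representation; equal values hash equally.
inline void InsertDecimal(ToyBinaryMemo* memo, const ToyDecimal128& value) {
  memo->emplace(AsBytesView(value));
}
```

Two decimals with identical bytes collapse to one entry, so distinct counting works even though the table never sees a decimal type.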
@drin
Contributor Author

drin commented Jul 13, 2022

@wesm @westonpace previous pair of commits makes this work for scalar inputs. I tried putting some of the conditional logic ("if type is decimal") in CountDistinctImpl but it ended up being difficult to satisfy the compiler.

the current change to UnboxScalar tries to essentially use is_decimal, but then I realized that there isn't an easy way to call the appropriate view function unless I know which decimal type I have. Additionally, the UnboxScalar implementations for Decimal types get the value, which I think would be difficult to downcast/change back to string_view.

all in all, this was the only approach I could get to work and it seemed semi reasonable. Feedback would be much appreciated though.

@wesm
Member

wesm commented Jul 14, 2022

@drin I pushed a simpler solution for the UnboxScalar change and removed MaybeInsert. We should probably open a follow up Jira to improve error handling in hash table merges but I don't think it's necessary to do that in this PR

@drin
Contributor Author

drin commented Jul 14, 2022

I made a JIRA to track error handling improvements for hash table merge: ARROW-17074

I specifically made the priority minor, but feel free to change it (I don't know the protocol for priority assignment)

@drin drin marked this pull request as ready for review July 14, 2022 16:15
@drin drin requested a review from cyb70289 July 14, 2022 16:16
Contributor

@cyb70289 cyb70289 left a comment

Looks great. Only some minor comments.
Thanks @drin !

cpp/src/arrow/util/hashing.h (outdated, resolved; 5 threads)
@drin
Contributor Author

drin commented Jul 15, 2022

thanks for the reviews @cyb70289 and Wes!

@drin drin requested a review from cyb70289 July 15, 2022 16:53
Member

@westonpace westonpace left a comment

This looks good from my perspective. Thanks for adding the tests. I'll let others confirm before merging.

@wesm
Member

wesm commented Jul 16, 2022

@cyb70289 these CI failures don't seem related — R has some lint failures and there are these otel protobuf linkage issues. I'll go ahead to merge this

@wesm wesm merged commit af4db77 into apache:master Jul 16, 2022
@nealrichardson
Contributor

R issues are likely a combination of transient upstream issues and stuff that has been fixed in master. If things persist I'll follow up.

@ursabot

ursabot commented Jul 17, 2022

Benchmark runs are scheduled for baseline = afc6840 and contender = af4db77. af4db77 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.18% ⬆️0.03%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.36% ⬆️0.14%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] af4db773 ec2-t3-xlarge-us-east-2
[Failed] af4db773 test-mac-arm
[Failed] af4db773 ursa-i9-9960x
[Finished] af4db773 ursa-thinkcentre-m75q
[Finished] afc6840c ec2-t3-xlarge-us-east-2
[Failed] afc6840c test-mac-arm
[Failed] afc6840c ursa-i9-9960x
[Finished] afc6840c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@drin drin deleted the ARROW-16807-count-distinct branch July 18, 2022 19:36
kou pushed a commit that referenced this pull request Feb 20, 2023
…Hub issue numbers (#34260)

Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature.

Issue numbers have been rewritten based on the following correspondence.
Also, the pkgdown settings have been changed and updated to link to GitHub.

I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly.

---
ARROW-6338	#5198
ARROW-6364	#5201
ARROW-6323	#5169
ARROW-6278	#5141
ARROW-6360	#5329
ARROW-6533	#5450
ARROW-6348	#5223
ARROW-6337	#5399
ARROW-10850	#9128
ARROW-10624	#9092
ARROW-10386	#8549
ARROW-6994	#23308
ARROW-12774	#10320
ARROW-12670	#10287
ARROW-16828	#13484
ARROW-14989	#13482
ARROW-16977	#13514
ARROW-13404	#10999
ARROW-16887	#13601
ARROW-15906	#13206
ARROW-15280	#13171
ARROW-16144	#13183
ARROW-16511	#13105
ARROW-16085	#13088
ARROW-16715	#13555
ARROW-16268	#13550
ARROW-16700	#13518
ARROW-16807	#13583
ARROW-16871	#13517
ARROW-16415	#13190
ARROW-14821	#12154
ARROW-16439	#13174
ARROW-16394	#13118
ARROW-16516	#13163
ARROW-16395	#13627
ARROW-14848	#12589
ARROW-16407	#13196
ARROW-16653	#13506
ARROW-14575	#13160
ARROW-15271	#13170
ARROW-16703	#13650
ARROW-16444	#13397
ARROW-15016	#13541
ARROW-16776	#13563
ARROW-15622	#13090
ARROW-18131	#14484
ARROW-18305	#14581
ARROW-18285	#14615
* Closes: #33631

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…to GitHub issue numbers (apache#34260)