Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Merging Parquet stats can preserve distinct counts when either side is zero #37014

Closed
sfc-gh-ebrossard opened this issue Aug 3, 2023 · 0 comments · Fixed by #37016

Comments

@sfc-gh-ebrossard
Copy link
Contributor

Describe the enhancement requested

// Clear has_distinct_count_ as distinct count cannot be merged.
is more conservative than necessary. While in general we can't preserve distinct counts, since we don't know the set of overlapping values, we can preserve distinct counts if either this or other has a distinct count of 0.

I'm preparing a PR, assuming this sounds reasonable to fix.

Component(s)

C++, Parquet

sfc-gh-ebrossard added a commit to sfc-gh-ebrossard/arrow that referenced this issue Aug 3, 2023
@sfc-gh-ebrossard sfc-gh-ebrossard changed the title [C++] Merging Parquet stats can preserve distinct counts when either side is zero [C++][Parquet] Merging Parquet stats can preserve distinct counts when either side is zero Aug 3, 2023
@pitrou pitrou added this to the 14.0.0 milestone Aug 31, 2023
pitrou added a commit that referenced this issue Aug 31, 2023
…erging stats (#37016)

### Rationale for this change

Statistics merging is unnecessarily conservative with respect to preserving distinct counts. If either side is 0, the merge stats can preserve the distinct count, instead of clearing it.

### What changes are included in this PR?

Distinct count is preserved if either the lhs or rhs has a distinct count of 0 when merging stats.

### Are these changes tested?

Yes, added a unit test.

### Are there any user-facing changes?

I don't think this change to stats is considered to be user-facing.
* Closes: #37014

Lead-authored-by: Elliott Brossard <elliott.brossard@snowflake.com>
Co-authored-by: Elliott Brossard <64754120+sfc-gh-ebrossard@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…when merging stats (apache#37016)

### Rationale for this change

Statistics merging is unnecessarily conservative with respect to preserving distinct counts. If either side is 0, the merge stats can preserve the distinct count, instead of clearing it.

### What changes are included in this PR?

Distinct count is preserved if either the lhs or rhs has a distinct count of 0 when merging stats.

### Are these changes tested?

Yes, added a unit test.

### Are there any user-facing changes?

I don't think this change to stats is considered to be user-facing.
* Closes: apache#37014

Lead-authored-by: Elliott Brossard <elliott.brossard@snowflake.com>
Co-authored-by: Elliott Brossard <64754120+sfc-gh-ebrossard@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…when merging stats (apache#37016)

### Rationale for this change

Statistics merging is unnecessarily conservative with respect to preserving distinct counts. If either side is 0, the merge stats can preserve the distinct count, instead of clearing it.

### What changes are included in this PR?

Distinct count is preserved if either the lhs or rhs has a distinct count of 0 when merging stats.

### Are these changes tested?

Yes, added a unit test.

### Are there any user-facing changes?

I don't think this change to stats is considered to be user-facing.
* Closes: apache#37014

Lead-authored-by: Elliott Brossard <elliott.brossard@snowflake.com>
Co-authored-by: Elliott Brossard <64754120+sfc-gh-ebrossard@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment