Aldrin Montana / @drin: @westonpace found a bug (47dd2ec). I will pick this up tomorrow and will try to add some extra unit tests and check for other bugs in the area.
When reading from Parquet files with multiple row groups, `count_distinct` (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results. If the file is stored as a single row group, results are correct. When grouped, results are also correct.
I can reproduce this in Python as well using the same file and `pyarrow.compute.count_distinct`.

This seems likely to be the same problem as in this StackOverflow question, which is working from ORC files: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE
Memory:
Allocator jemalloc
Current 37.25 Kb
Max 925.42 Kb
Runtime:
SIMD Level none
Detected SIMD Level none
Build:
C++ Library Version 9.0.0-SNAPSHOT
C++ Compiler AppleClang
C++ Compiler Version 13.1.6.13160021
Git ID d9d7894
Reporter: Edward Visel / @alistaire47
Assignee: Aldrin Montana / @drin
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-16807. Please see the migration documentation for further details.