GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization #35989
Some notes:
Another problem is that:

    template <typename DType>
    static std::shared_ptr<Statistics> MakeTypedColumnStats(
        const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
      // If ColumnOrder is defined, return max_value and min_value
      if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
        return MakeStatistics<DType>(
            descr, metadata.statistics.min_value, metadata.statistics.max_value,
            metadata.num_values - metadata.statistics.null_count,
            metadata.statistics.null_count, metadata.statistics.distinct_count,
            metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
            metadata.statistics.__isset.null_count,
            metadata.statistics.__isset.distinct_count);
      }
      // Default behavior
      return MakeStatistics<DType>(
          descr, metadata.statistics.min, metadata.statistics.max,
          metadata.num_values - metadata.statistics.null_count,
          metadata.statistics.null_count, metadata.statistics.distinct_count,
          metadata.statistics.__isset.max || metadata.statistics.__isset.min,
          metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
    }

For the Arrow write file, this is ok, however, when And when
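To make the branch in the quoted function easier to reason about, here is a minimal, self-contained sketch of the same selection logic. The types `ThriftStatsSketch` and `DecodedStatsSketch` and the function `DecodeStatsSketch` are hypothetical stand-ins invented for illustration; they are not the real `parquet::format` or `parquet::Statistics` types, and the sketch only mirrors what the snippet above shows.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for the Thrift-generated statistics
// struct. NOT the real parquet::format::Statistics type.
struct ThriftStatsSketch {
  std::string min, max;              // legacy fields (default branch)
  std::string min_value, max_value;  // fields read under TYPE_DEFINED_ORDER
  int64_t null_count = 0;
  bool isset_min = false, isset_max = false;
  bool isset_min_value = false, isset_max_value = false;
};

// Hypothetical result type holding what the quoted code passes on.
struct DecodedStatsSketch {
  std::string min, max;
  int64_t num_non_null = 0;
  bool has_min_max = false;
};

// Mirrors the two branches of MakeTypedColumnStats as quoted above.
inline DecodedStatsSketch DecodeStatsSketch(const ThriftStatsSketch& s,
                                            int64_t num_values,
                                            bool type_defined_order) {
  // Assumption of this sketch: null_count never exceeds num_values.
  assert(s.null_count >= 0 && s.null_count <= num_values);
  DecodedStatsSketch out;
  // The non-null count is derived, exactly as in the quoted code.
  out.num_non_null = num_values - s.null_count;
  if (type_defined_order) {
    out.min = s.min_value;
    out.max = s.max_value;
    // Logical OR: true as soon as either field of the pair is set.
    out.has_min_max = s.isset_max_value || s.isset_min_value;
  } else {
    out.min = s.min;
    out.max = s.max;
    out.has_min_max = s.isset_max || s.isset_min;
  }
  return out;
}
```

One thing the sketch makes visible: because the flag is computed with `||`, `has_min_max` is true even when only one of the two fields was actually written, and the non-null count is only meaningful if `null_count` itself was set.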
On reflection, the
@mapleFU How was this data produced?
(and, yes,
From here: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47327471
#35989 (comment) @pitrou @wgtmac Should we try to fix this? when
No idea why this failed...
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
LGTM. Thanks!
Gently ping @pitrou
LGTM in general, just two minor comments
cpp/src/parquet/statistics.cc
    }
    // num_values_ is reliable and it means number of non-null values.
    s.all_null_value = num_values_ == 0;
    // FIXME(mwish): distinct count is not encoded for now.
Can you open an issue for this? FIXMEs in the code will be forgotten...
Created. But actually I have found that a lot of issues get forgotten too. #36505
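For readers following this review thread, the flagged lines can be sketched in isolation as below. `EncodedFlagsSketch` and `EncodeFlagsSketch` are hypothetical names used only for illustration; they are not part of the Parquet C++ API. Setting `has_distinct_count` to false is an assumption taken from the FIXME and from the PR description, not a copy of the real implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the flags discussed in the review thread.
struct EncodedFlagsSketch {
  bool all_null_value = false;
  bool has_distinct_count = false;
};

// num_non_null_values plays the role of num_values_ in the reviewed code:
// the number of non-null values seen so far.
inline EncodedFlagsSketch EncodeFlagsSketch(int64_t num_non_null_values) {
  assert(num_non_null_values >= 0);
  EncodedFlagsSketch s;
  // Zero non-null values is treated as "all values are null".
  s.all_null_value = (num_non_null_values == 0);
  // Assumption: the distinct count is not encoded, so the flag stays false.
  s.has_distinct_count = false;
  return s;
}
```

Note that under this rule a column chunk with no values at all is also reported as all-null, since the check looks only at the non-null count.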
Conbench analyzed the 6 benchmark runs on commit
There were 4 benchmark results indicating a performance regression:
The full Conbench report has more details.
Rationale for this change
What changes are included in this PR?
has_distinct_count_ = false, and add some comments
Are these changes tested?
Yes
Are there any user-facing changes?
No
has
flag #34351