GH-34351: [C++][Parquet] Statistic: tiny optimization #34355
Hmm. The problem can also be seen here: https://github.com/apache/arrow/pull/34054/files#r1118029417
After going through this piece of code, I found that the current impl is OK, because we mostly only use statistics on the writer side. (In fact, as of now, we can assume that:
For reader:
Currently, a writer will not have a bug on merging. But if a reader checks
@@ -495,9 +495,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
bool has_null_count, bool has_distinct_count, MemoryPool* pool)
: TypedStatisticsImpl(descr, pool) {
TypedStatisticsImpl::IncrementNumValues(num_values);
// Currently, `has_null_count` argument is not used.
// Internal has_null_count_ would always be true.
IMO, if `has_null_count` is not used we should remove it to reduce confusion and future misuse. Otherwise, we should respect it from the input argument.
I agree with you, but I'd like to hear others' opinions.
IIUC, this class is only used in the writer and we always write the null count in statistics? If so, then I agree we can remove this.
After a weird patch, this class is also used in the reader, but ndv is not used.
What is ndv?
Sorry for the confusion. NDV means the number of distinct values. Loading statistics was introduced in `ColumnChunkMetaData`, and it's weird that:
- The writer will never set ndv.
- The reader should read ndv.
The behavior here is tricky.
bool has_null_count_ = false;
// Currently, has_distinct_count_ would not be encoded
It seems that `statistics_` already has a set of `hasXXX` variables. Can we reuse those directly?
class PARQUET_EXPORT EncodedStatistics {
std::shared_ptr<std::string> max_, min_;
bool is_signed_ = false;
public:
EncodedStatistics()
: max_(std::make_shared<std::string>()), min_(std::make_shared<std::string>()) {}
int64_t null_count = 0;
int64_t distinct_count = 0;
bool has_min = false;
bool has_max = false;
bool has_null_count = false;
bool has_distinct_count = false;
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
}

void TestMergeEmpty() {
It would be better to add a case where either side does not have valid stats (min/max/null_count/distinct_count), meaning that the merged stats are dropped.
Sure, I will add them.
OK!
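For illustration, the merge rule requested above (a field survives only if both sides carry it) can be sketched as follows. This is a minimal sketch, not the Parquet C++ implementation: `SimpleStats` and `Merge` are hypothetical names, the value type is fixed to `int64_t`, and the special case of merging statistics that have seen no values at all is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified statistics holder (NOT parquet::Statistics).
struct SimpleStats {
  bool has_min_max = false;
  int64_t min = 0;
  int64_t max = 0;
  bool has_null_count = false;
  int64_t null_count = 0;
};

// Merge `other` into `total`. A field stays valid only when both sides
// have it; otherwise the merged field is dropped by clearing its flag.
inline void Merge(SimpleStats& total, const SimpleStats& other) {
  if (total.has_min_max && other.has_min_max) {
    total.min = std::min(total.min, other.min);
    total.max = std::max(total.max, other.max);
  } else {
    total.has_min_max = false;
  }
  if (total.has_null_count && other.has_null_count) {
    total.null_count += other.null_count;
  } else {
    total.has_null_count = false;
  }
}
```

A test in the spirit of the request would merge one side with a null count into one without, then assert that the merged `has_null_count` is false while min/max are still combined.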
@pitrou @emkornfield I need some ideas about the code here. The problem is that:
This patch is necessary but has not been reviewed for a long time. I'll close it and recreate it this weekend.
Rationale for this change
This patch makes some tiny optimizations to Parquet C++ Statistics. It does:
- `std::string`: assume a case like this: after c2 is set, c1 would be set too. So I use `std::string` here.
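The c1/c2 case mentioned above is not shown in full on this page. As a hedged illustration of that kind of aliasing, the sketch below assumes a struct whose value lives behind a `std::shared_ptr<std::string>`, as `min_` and `max_` do in the quoted `EncodedStatistics`. `SharedStats`, `ValueStats`, and `MaxOfOriginalAfterCopyIsSet` are hypothetical names, not Arrow APIs.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical struct that stores its value behind a shared_ptr.
struct SharedStats {
  std::shared_ptr<std::string> max_ = std::make_shared<std::string>();
  void set_max(const std::string& v) { *max_ = v; }
  const std::string& max() const { return *max_; }
};

// Same interface, but the string is stored by value.
struct ValueStats {
  std::string max_;
  void set_max(const std::string& v) { max_ = v; }
  const std::string& max() const { return max_; }
};

// Copies a default-constructed object, mutates the copy, and returns
// what the ORIGINAL object reports afterwards.
template <typename Stats>
std::string MaxOfOriginalAfterCopyIsSet() {
  Stats c1;
  Stats c2 = c1;      // SharedStats: copies the pointer; ValueStats: copies the string
  c2.set_max("zzz");  // written through c2 only
  return c1.max();
}
```

With `SharedStats`, the function returns `"zzz"` because both objects point at the same string; with `ValueStats`, it returns an empty string because the copy owns its own value.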
What changes are included in this PR?
As described above.
Are these changes tested?
Tested by unit tests.
Are there any user-facing changes?
No