GH-34351: [C++][Parquet] Statistic: tiny optimization #34355

mapleFU · 2023-02-26T04:40:15Z

Rationale for this change

This patch makes a few small improvements to Parquet C++ Statistics:

Store min and max in EncodedStatistics as std::string instead of std::shared_ptr<std::string>. Consider a case like this:

EncodedStatistics c1;
// do some operations
EncodedStatistics c2 = c1;
c2.set_max("dasdasdassd");

After c2.set_max() is called, the max of c1 changes too, because the copy shares the same underlying string. Storing a plain std::string avoids this.

Force the ndv count to be set during merging, and add some comments.
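The aliasing problem described above can be reproduced without Parquet. The following is a minimal sketch; SharedStats, ValueStats and CopyAliasesOriginal are hypothetical stand-ins written for illustration, not the real EncodedStatistics. It assumes the setter writes through the pointer, so a copied object still points at the same buffer.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the shared_ptr-based layout: a copy shares one buffer.
struct SharedStats {
  std::shared_ptr<std::string> max_ = std::make_shared<std::string>();
  void set_max(const std::string& v) { *max_ = v; }  // writes through the pointer
  const std::string& max() const { return *max_; }
};

// Hypothetical stand-in for the std::string-based layout: a copy owns its bytes.
struct ValueStats {
  std::string max_;
  void set_max(const std::string& v) { max_ = v; }
  const std::string& max() const { return max_; }
};

// Returns true if mutating a copy also changes the object it was copied from.
template <typename Stats>
bool CopyAliasesOriginal() {
  Stats c1;
  c1.set_max("a");
  Stats c2 = c1;              // copy construction
  c2.set_max("dasdasdassd");  // mutate only the copy
  return c1.max() == "dasdasdassd";
}

// Small usage check: only the shared_ptr layout leaks the write into c1.
inline void CheckAliasing() {
  assert(CopyAliasesOriginal<SharedStats>());
  assert(!CopyAliasesOriginal<ValueStats>());
}
```

With value members, the compiler-generated copy constructor is already a deep copy, so no custom copy logic is needed.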

What changes are included in this PR?

As described above.

Are these changes tested?

Tested by unit tests.

Are there any user-facing changes?

No

Closes: [C++][Parquet] Statistics Merge ignore setting has flag #34351

github-actions · 2023-02-26T04:40:39Z

Closes: [C++][Parquet] Statistics Merge ignore setting has flag #34351

github-actions · 2023-02-26T04:40:41Z

⚠️ GitHub issue #34351 has been automatically assigned in GitHub to PR creator.

mapleFU · 2023-02-26T06:38:47Z

===============================================================================
Omission: parquet::RowGroupMetaData::Equals() isn't stable. [test: #==(TestParquetRowGroupMetadata)]
/Users/runner/work/arrow/arrow/c_glib/test/parquet/test-row-group-metadata.rb:48:in `block in <class:TestParquetRowGroupMetadata>'
===============================================================================
.........F
===============================================================================
Failure: test: #has_n_distinct_values?(TestParquetStatistics):
        @statistics.has_n_distinct_values?
        |           |
        |           false
        #<Parquet::Int32Statistics:0x7f86979871f8 ptr=0x7f8698c1c8d0>
/Users/runner/work/arrow/arrow/c_glib/test/parquet/test-statistics.rb:53:in `block in <class:TestParquetStatistics>'
     50:   end
     51: 
     52:   test("#has_n_distinct_values?") do
  => 53:     assert do
     54:       @statistics.has_n_distinct_values?
     55:     end
     56:   end
===============================================================================
..............O

Hmm. The same problem can also be seen here: https://github.com/apache/arrow/pull/34054/files#r1118029417

mapleFU · 2023-02-26T09:03:17Z

After going through this piece of code, I found that the current implementation is OK, because we mostly only use statistics on the writer side (in fact, I think Statistics would be better named StatisticsBuilder). But once ExtractStatisticsFromPageHeader or another reader path is involved, things get a bit more complex.

For now, we can assume that:

The writer can guarantee a correct null count (if it has no bugs).
Currently, I found that ndv is never collected. If a user collects ndv in page 1 but not in page 2, the merged ndv should be dropped.

For the reader:

When deserializing, the reader should assume that ndv and null_count may be unset (but currently it doesn't work like this).
Deserialized statistics must not call Merge or other mutating methods.

Currently, the writer has no bug when merging. But if a reader checks has_null_count or ndv, it will get the wrong result.
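The merge rule sketched in the comment above could look roughly like this. PageCounts and MergeCounts are hypothetical names, and std::optional stands in for the value plus has-flag pairs. The thread only says that ndv should be dropped when one side lacks it; dropping it when both sides have it is an extra assumption of this sketch, made because two distinct counts cannot be combined exactly without the underlying values.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical per-page counters; an empty optional means "not collected".
struct PageCounts {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
};

// Merges two pages. A counter survives only if the merged value is trustworthy.
inline PageCounts MergeCounts(const PageCounts& a, const PageCounts& b) {
  PageCounts merged;
  // Null counts are additive, but only when both sides actually recorded one.
  if (a.null_count && b.null_count) {
    merged.null_count = *a.null_count + *b.null_count;
  }
  // Assumption of this sketch: distinct_count is always left unset, since the
  // same value may occur in both pages and the counts are not additive.
  return merged;
}

// Small usage check of the rule above.
inline void CheckMerge() {
  PageCounts p1{int64_t{3}, int64_t{10}};
  PageCounts p2{int64_t{2}, std::nullopt};
  PageCounts m = MergeCounts(p1, p2);
  assert(m.null_count.has_value());
  assert(!m.distinct_count.has_value());
}
```

A reader that receives such a result can test the optional instead of trusting a count that was never written.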

wgtmac · 2023-02-27T01:42:40Z

cpp/src/parquet/statistics.cc

@@ -495,9 +495,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
                      bool has_null_count, bool has_distinct_count, MemoryPool* pool)
      : TypedStatisticsImpl(descr, pool) {
    TypedStatisticsImpl::IncrementNumValues(num_values);
+    // Currently, `has_null_count` argument is not used.
+    // Internal has_null_count_ would always be true.


IMO, if has_null_count is not used, we should remove it to reduce confusion and prevent future misuse. Otherwise, we should respect the input argument.

mapleFU · 2023-02-27

I agree with you, but I need others' opinions.

wjones127 · 2023-03-27

IIUC, this class is only used in the writer and we always write the null count in statistics? If so, then I agree we can remove this.

mapleFU · 2023-03-28

After a weird patch, this class is also used in the reader, but ndv is not used.

wjones127 · 2023-03-28

What is ndv?

mapleFU · 2023-03-28

Sorry for the confusion. NDV means the number of distinct values. Loading statistics was introduced in ColumnChunkMetaData, and it's weird that:

The writer never sets ndv.

The reader should read ndv.

The behavior here is tricky.

wgtmac · 2023-02-27T01:47:12Z

cpp/src/parquet/statistics.cc

  bool has_null_count_ = false;
+  // Currently, has_distinct_count_ would not be encoded


It seems that statistics_ already has a set of hasXXX variables. Can we reuse those directly?

class PARQUET_EXPORT EncodedStatistics {
  std::shared_ptr<std::string> max_, min_;
  bool is_signed_ = false;

 public:
  EncodedStatistics()
      : max_(std::make_shared<std::string>()), min_(std::make_shared<std::string>()) {}

  int64_t null_count = 0;
  int64_t distinct_count = 0;

  bool has_min = false;
  bool has_max = false;
  bool has_null_count = false;
  bool has_distinct_count = false;
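One way to read the suggestion above is to keep the flags in a single place and have the owning class forward to them. This is a sketch under that reading only: EncodedStatsLike and StatsHolder are hypothetical, reduced types, not the real EncodedStatistics or TypedStatisticsImpl.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduced copy of the count fields and flags quoted above.
struct EncodedStatsLike {
  int64_t null_count = 0;
  int64_t distinct_count = 0;
  bool has_null_count = false;
  bool has_distinct_count = false;
};

// Hypothetical owner that keeps no has_* members of its own.
class StatsHolder {
 public:
  // The accessors read the flags stored in statistics_, so the two can never disagree.
  bool HasNullCount() const { return statistics_.has_null_count; }
  bool HasDistinctCount() const { return statistics_.has_distinct_count; }

  // Setting a value and raising its flag happen together.
  void SetNullCount(int64_t n) {
    statistics_.null_count = n;
    statistics_.has_null_count = true;
  }

  const EncodedStatsLike& encoded() const { return statistics_; }

 private:
  EncodedStatsLike statistics_;
};

// Small usage check: flags start false and change only through the setter.
inline void CheckFlags() {
  StatsHolder h;
  assert(!h.HasNullCount());
  h.SetNullCount(4);
  assert(h.HasNullCount());
  assert(!h.HasDistinctCount());
}
```

The trade-off is that every reader of the flags goes through the encoded struct, which removes the chance of a stale duplicate flag.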

wgtmac · 2023-02-27T01:51:20Z

cpp/src/parquet/statistics_test.cc

@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
    ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
  }

+  void TestMergeEmpty() {


It would be better to add a case where either side does not have valid stats (min/max/null_count/distinct_count), meaning that the merged stats are dropped.

mapleFU · 2023-02-27

Sure, I will add them.

mapleFU · 2023-02-27T06:19:15Z

@kou I guess the Ruby test should not expect ndv to be set, because the current Parquet writer never writes a valid ndv. Let me set it to false later.

As @wgtmac says, maybe it's better to make it "false".

kou · 2023-02-27T07:17:46Z

OK!

mapleFU · 2023-03-09T08:42:44Z

@pitrou @emkornfield I need some ideas about the code here. The problem is:

When writing statistics, the code is OK.
Statistics::Make creates statistics from EncodedStatistics for reading, and then the code here becomes a disaster.

mapleFU · 2023-05-31T15:15:54Z

This patch is necessary but has not been reviewed for a long time. I'll close it and recreate it this weekend.

Statistic: tiny optimization

5f4262d

mapleFU requested a review from wjones127 as a code owner February 26, 2023 04:40

github-actions bot added Component: C++ Component: Parquet labels Feb 26, 2023

kou mentioned this pull request Feb 26, 2023

GH-34053: [C++][Parquet] Write parquet page index #34054

Merged

[Update] revert min-max for float, and adding copying test

ac1f294

[Update] Refine syntax for ndv

7fd3941

wgtmac reviewed Feb 27, 2023

mapleFU added 2 commits March 9, 2023 16:09

Merge branch 'main' into parquet/statistics-tiny-optimize

4e95cdd

update stats

f94c5b3

github-actions bot added the awaiting review Awaiting review label Mar 9, 2023

mapleFU marked this pull request as draft March 9, 2023 08:42

github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 27, 2023

mapleFU closed this May 31, 2023
