Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34351: [C++][Parquet] Statistic: tiny optimization #34355

Closed
wants to merge 5 commits into from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Feb 26, 2023

Rationale for this change

This patch does some tiny optimizations on Parquet C++ Statistics. It does:

  1. For min-max, using std::string. Because assume the case like that:
EncodedStatistics c1;
// do some operations
EncodedStatistics c2 = c1;
c2.set_max("dasdasdassd");

After c2 set, c1 would be set too. So I use std::string here.

  1. Force setting ndv count during merging, and add some comments

What changes are included in this PR?

As we talked above.

Are these changes tested?

Test by unittest.

Are there any user-facing changes?

No

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #34351 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Feb 26, 2023

===============================================================================
Omission: parquet::RowGroupMetaData::Equals() isn't stable. [test: #==(TestParquetRowGroupMetadata)]
/Users/runner/work/arrow/arrow/c_glib/test/parquet/test-row-group-metadata.rb:48:in `block in <class:TestParquetRowGroupMetadata>'
===============================================================================
.........F
===============================================================================
Failure: test: #has_n_distinct_values?(TestParquetStatistics):
        @statistics.has_n_distinct_values?
        |           |
        |           false
        #<Parquet::Int32Statistics:0x7f86979871f8 ptr=0x7f8698c1c8d0>
/Users/runner/work/arrow/arrow/c_glib/test/parquet/test-statistics.rb:53:in `block in <class:TestParquetStatistics>'
     50:   end
     51: 
     52:   test("#has_n_distinct_values?") do
  => 53:     assert do
     54:       @statistics.has_n_distinct_values?
     55:     end
     56:   end
===============================================================================
..............O

emmmm. The problem can is also seen here: https://github.com/apache/arrow/pull/34054/files#r1118029417

@mapleFU
Copy link
Member Author

mapleFU commented Feb 26, 2023

After go through the piece of code, I found that current impl is ok, because we mostly only use statistics on writer(In fact, Statistics should better be renamed to StatisticsBuilder, I think) . But when ExtractStatisticsFromPageHeader or other reader part is in, things will get a bit more complex.

As for now, we can assume that:

  1. Writer can assure that if has right null-count ( if it not has any bugs )
  2. Currently I found that ndv is never collected. If a user collect ndv in page1, but not collect ndv in page 2, it should be abandon.

For reader:

  1. When deserialize, reader should assume that ndv and null_count can be unset ( but currently, it doesn't work like this)
  2. Deserialized statistics can not call merge or other mutation methods

Currently, a writer will not has bug on merging. But if a reader, checks has_null_count or ndv, it will get the wrong result

@@ -495,9 +495,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
bool has_null_count, bool has_distinct_count, MemoryPool* pool)
: TypedStatisticsImpl(descr, pool) {
TypedStatisticsImpl::IncrementNumValues(num_values);
// Currently, `has_null_count` argument is not used.
// Internal has_null_count_ would always be true.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if has_null_count is not used we should remove it to reduce confusion and future misuse. Otherwise, we should respect it from the input argument.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, but I need other's opinion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this class is only used in the writer and we always write the null count in statistics? If so, then I agree we can remove this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a weird patch, this class is used in reader, but ndv is not used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is ndv?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for make misunderstanding. NDV means num-different values. Load statistics is introduced in ColumnChunkMetaData, and it's weird that:

  1. Writer will never set ndv.
  2. Reader should read ndv.

The behavior here is trickey

cpp/src/parquet/statistics.cc Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
bool has_null_count_ = false;
// Currently, has_distinct_count_ would not be encoded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that statistics_ already has a set of hasXXX variables. Can we reuse those directly?

class PARQUET_EXPORT EncodedStatistics {
  std::shared_ptr<std::string> max_, min_;
  bool is_signed_ = false;

 public:
  EncodedStatistics()
      : max_(std::make_shared<std::string>()), min_(std::make_shared<std::string>()) {}


  int64_t null_count = 0;
  int64_t distinct_count = 0;

  bool has_min = false;
  bool has_max = false;
  bool has_null_count = false;
  bool has_distinct_count = false;

@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
}

void TestMergeEmpty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to add a case where either side does not have a valid stats (min/max/null_count/distinct_count) meaning that the merged stats is dropped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will add them

@mapleFU
Copy link
Member Author

mapleFU commented Feb 27, 2023

@kou I guess ruby test should not set has ndv, because current parquet writer will never put valid ndv. Let me set it to false later.

As @wgtmac says, maybe it's better to make it "false"

@kou
Copy link
Member

kou commented Feb 27, 2023

OK!

@github-actions github-actions bot added the awaiting review Awaiting review label Mar 9, 2023
@mapleFU
Copy link
Member Author

mapleFU commented Mar 9, 2023

@pitrou @emkornfield I need some idea about the code here. The problem is that:

  1. When writing statistics, the code is ok.
  2. Statistics::Make would create statistics using EncodedStats for read, then the code here would be a disaster

@mapleFU mapleFU marked this pull request as draft March 9, 2023 08:42
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 27, 2023
@mapleFU
Copy link
Member Author

mapleFU commented May 31, 2023

This patch is neccessary but not visited for a long time. I'll close and recreate this this weekend.

@mapleFU mapleFU closed this May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Statistics Merge ignore setting has flag
4 participants