Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization #35989

Merged
merged 30 commits into from
Jul 6, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jun 8, 2023

Rationale for this change

What changes are included in this PR?

  1. This patch does some tiny optimizations on Parquet C++ Statistics. It does:
For min-max, using std::string. Because assume the case like that:
EncodedStatistics c1;
// do some operations
EncodedStatistics c2 = c1;
c2.set_max("dasdasdassd");
After c2 set, c1 would be set too. So I use std::string here.
  1. Force clear ndv count during merging, and set has_distinct_count_ = false, and add some comments
  2. Add some specification in Statistics API

Are these changes tested?

Yes

Are there any user-facing changes?

No

@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from bc598dd to fcda54d Compare June 8, 2023 05:10
@mapleFU
Copy link
Member Author

mapleFU commented Jun 8, 2023

Some notes:

  1. Builder of Statistics will always have has_null_count_, and ignore has_distinct_count_ (not build when building EncodedStatistics). has_min_max_ will be false if page is all nulls or nulls and NaN
  2. Reader will able to set distinct_count and has_distinct_count. Which would be valid, but when it call Encoded, the status will not be included in EncodedStatistics

@mapleFU
Copy link
Member Author

mapleFU commented Jun 8, 2023

Another problem is that:

template <typename DType>
static std::shared_ptr<Statistics> MakeTypedColumnStats(
    const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
  // If ColumnOrder is defined, return max_value and min_value
  if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
    return MakeStatistics<DType>(
        descr, metadata.statistics.min_value, metadata.statistics.max_value,
        metadata.num_values - metadata.statistics.null_count,
        metadata.statistics.null_count, metadata.statistics.distinct_count,
        metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
        metadata.statistics.__isset.null_count,
        metadata.statistics.__isset.distinct_count);
  }
  // Default behavior
  return MakeStatistics<DType>(
      descr, metadata.statistics.min, metadata.statistics.max,
      metadata.num_values - metadata.statistics.null_count,
      metadata.statistics.null_count, metadata.statistics.distinct_count,
      metadata.statistics.__isset.max || metadata.statistics.__isset.min,
      metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
}

For the Arrow write file, this is ok, however, when !metadata.statistics.__isset.null_count, the num_values would be wrong, because it's metadata.num_values - metadata.statistics.null_count. Currently I don't know how to fix it.

And when max_rep_level > 1 and has null, I guess num_value would be so confusing...

@mapleFU
Copy link
Member Author

mapleFU commented Jun 8, 2023

cc @wgtmac @pitrou would you mind take a look? Since Statistics is released, I'm afraid my change will break some invariant...

cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.h Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 8, 2023
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from 30100b0 to 8262868 Compare June 17, 2023 06:36
@mapleFU
Copy link
Member Author

mapleFU commented Jun 18, 2023

data = [1, 2, 2, None, 4], type = DataType(uint16), physical_type = 'INT32'
min_value = 1, max_value = 4, null_count = 1, num_values = 4, distinct_count = 0

After my reflect, the distinct_count need to be fixed. Can you confirm that @pitrou ?

@pitrou
Copy link
Member

pitrou commented Jun 19, 2023

@mapleFU How was this data produced?

@pitrou
Copy link
Member

pitrou commented Jun 19, 2023

(and, yes, distinct_count = 0 is obviously wrong here :-))

@mapleFU
Copy link
Member Author

mapleFU commented Jun 19, 2023

From here: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47327471
Let me draft the fix the testing here...

@mapleFU
Copy link
Member Author

mapleFU commented Jun 20, 2023

#35989 (comment) @pitrou @wgtmac Should we trying to fix this?

when !metadata.statistics.__isset.null_count, the num_values would be wrong

@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from 57e0a02 to e90f209 Compare June 29, 2023 15:56
@mapleFU mapleFU requested a review from pitrou July 4, 2023 15:29
@mapleFU
Copy link
Member Author

mapleFU commented Jul 4, 2023

[ RUN      ] TwoLevelCacheWithExpirationTest.CleanupPeriodOk
/arrow/cpp/src/parquet/encryption/two_level_cache_with_expiration_test.cc:118: Failure
Expected equality of these values:
  lifetime2->size()
    Which is: 0
  2
[  FAILED  ] TwoLevelCacheWithExpirationTest.CleanupPeriodOk (525 ms)

No idea why this is failed...

cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from 96cd715 to 9099ae0 Compare July 4, 2023 16:04
Co-authored-by: Gang Wu <ustcwg@gmail.com>
@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from 9099ae0 to 0f86ba0 Compare July 4, 2023 16:07
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from b068767 to b1e2222 Compare July 5, 2023 14:26
Co-authored-by: Gang Wu <ustcwg@gmail.com>
@mapleFU mapleFU force-pushed the parquet/tiny-optimize-statistics branch from b1e2222 to e4d8003 Compare July 5, 2023 14:28
@wgtmac wgtmac changed the title GH-34351: [C++][Parquet] Statistic: tiny optimization and specification GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization Jul 5, 2023
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@mapleFU
Copy link
Member Author

mapleFU commented Jul 6, 2023

Gently ping @pitrou

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, just two minor comments

}
// num_values_ is reliable and it means number of non-null values.
s.all_null_value = num_values_ == 0;
// FIXME(mwish): distinct count is not encoded for now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open an issue for this? FIXMEs in the code will be forgotten...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created. But actually I found a lot of issues be forgotten #36505

cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
@pitrou pitrou merged commit 60fdc25 into apache:main Jul 6, 2023
30 of 31 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jul 6, 2023
@mapleFU mapleFU deleted the parquet/tiny-optimize-statistics branch July 6, 2023 17:22
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 60fdc255.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Statistics Merge ignore setting has flag
3 participants