[C++][Parquet] Statistics Merge ignore setting has flag #34351

Closed
mapleFU opened this issue Feb 25, 2023 · 1 comment · Fixed by #35989

Comments

@mapleFU
Member

mapleFU commented Feb 25, 2023

Describe the bug, including details regarding any error messages, version, and platform.

In src/parquet/statistics.cc:

   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
     if (other.HasNullCount()) {
-      this->statistics_.null_count += other.null_count();
+      this->IncrementNullCount(other.null_count());
     }
     if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+      this->IncrementDistinctCount(other.distinct_count());
     }
     if (other.HasMinMax()) {

The original code updates the counts directly and so skips setting the corresponding has flags (such as has_null_count_).
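To make the difference concrete, here is a minimal sketch using a toy class. `ToyStatistics`, `MergeDirect`, `MergeViaHelper`, and `DemoMergeFlag` are hypothetical names invented for illustration. The sketch assumes that the increment helper both adds to the count and marks it as present; it is not the Arrow implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a statistics object (not the Arrow class).
class ToyStatistics {
 public:
  bool HasNullCount() const { return has_null_count_; }
  int64_t null_count() const { return null_count_; }

  // Assumed behavior: the helper updates the value and marks it as present.
  void IncrementNullCount(int64_t n) {
    null_count_ += n;
    has_null_count_ = true;
  }

  // Merge variant that touches the field directly and leaves the flag alone.
  void MergeDirect(const ToyStatistics& other) {
    if (other.HasNullCount()) null_count_ += other.null_count();
  }

  // Merge variant that goes through the helper, so the flag is set as well.
  void MergeViaHelper(const ToyStatistics& other) {
    if (other.HasNullCount()) IncrementNullCount(other.null_count());
  }

 private:
  int64_t null_count_ = 0;
  bool has_null_count_ = false;
};

// Small self-check; call it from main() to run it.
inline void DemoMergeFlag() {
  ToyStatistics source;
  source.IncrementNullCount(3);

  ToyStatistics direct;
  direct.MergeDirect(source);
  assert(direct.null_count() == 3);
  assert(!direct.HasNullCount());  // value merged, but still reported as absent

  ToyStatistics via_helper;
  via_helper.MergeViaHelper(source);
  assert(via_helper.null_count() == 3);
  assert(via_helper.HasNullCount());  // value merged and reported as present
}
```

In this toy model, a reader that checks `HasNullCount()` before using the value would silently drop the count merged by `MergeDirect`.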

Component(s)

C++, Parquet

@mapleFU
Member Author

mapleFU commented Feb 25, 2023

The semantics here are probably tricky. Our Statistics will only ever have has_null_count_ set to true. I will survey the parquet-mr implementation.

pitrou added a commit that referenced this issue Jul 6, 2023
…y optimization (#35989)

### Rationale for this change

### What changes are included in this PR?

This patch makes some small optimizations to Parquet C++ Statistics:

1. For min-max, use `std::string`. Consider a case like this:

```
EncodedStatistics c1;
// do some operations
EncodedStatistics c2 = c1;
c2.set_max("dasdasdassd");
```

   In this case, setting `c2` would also set `c1`, so `std::string` is used here.

2. Force-clear the NDV (distinct value) count during merging, set `has_distinct_count_ = false`, and add some comments
3. Add some specification to the Statistics API
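
The first two items of the list above can be sketched with toy types. Everything here is hypothetical: `SharedMax`, `ValueMax`, `ToyCounts`, and `DemoStatisticsChanges` are invented for illustration. The sketch assumes the aliasing came from a field held through a shared pointer, which the commit message does not state. It also assumes the distinct count is dropped on merge because distinct counts of two chunks cannot simply be added when their values may overlap.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical type: max is held through a shared pointer, so copies alias it.
struct SharedMax {
  std::shared_ptr<std::string> max = std::make_shared<std::string>();
  void set_max(const std::string& v) { *max = v; }
};

// Hypothetical type: max is a plain std::string, so copies are independent.
struct ValueMax {
  std::string max;
  void set_max(const std::string& v) { max = v; }
};

// Hypothetical type: merging discards the distinct count instead of summing it.
struct ToyCounts {
  int64_t distinct_count = 0;
  bool has_distinct_count = false;

  void Merge(const ToyCounts& /*other*/) {
    // The merged distinct count is treated as unknown.
    distinct_count = 0;
    has_distinct_count = false;
  }
};

// Small self-check; call it from main() to run it.
inline void DemoStatisticsChanges() {
  SharedMax s1;
  SharedMax s2 = s1;  // copies the pointer, not the string
  s2.set_max("dasdasdassd");
  assert(*s1.max == "dasdasdassd");  // s1 changed as a side effect

  ValueMax v1;
  ValueMax v2 = v1;  // copies the string itself
  v2.set_max("dasdasdassd");
  assert(v1.max.empty());  // v1 is untouched

  ToyCounts a;
  a.distinct_count = 5;
  a.has_distinct_count = true;
  ToyCounts b;
  b.distinct_count = 7;
  b.has_distinct_count = true;
  a.Merge(b);
  assert(!a.has_distinct_count);  // merged result reports no distinct count
}
```

Holding the string by value costs a copy when the object is copied, but it removes the surprising shared state between copies.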

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: #34351

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 13.0.0 milestone Jul 6, 2023