Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Fix parsing stats from min_value/max_value #34138

Closed
wgtmac opened this issue Feb 11, 2023 · 0 comments · Fixed by #34112
Closed

[C++][Parquet] Fix parsing stats from min_value/max_value #34138

wgtmac opened this issue Feb 11, 2023 · 0 comments · Fixed by #34112

Comments

@wgtmac
Copy link
Member

wgtmac commented Feb 11, 2023

Describe the bug, including details regarding any error messages, version, and platform.

The code below does not check and read from stats.min_value/max_value. If reading from a parquet file where the stats apply min_value/max_value, we are unable to read any column statistics at all.

// Extracts encoded statistics from V1 and V2 data page headers
template <typename H>
EncodedStatistics ExtractStatsFromHeader(const H& header) {
  EncodedStatistics page_statistics;
  if (!header.__isset.statistics) {
    return page_statistics;
  }
  const format::Statistics& stats = header.statistics;
  if (stats.__isset.max) {
    page_statistics.set_max(stats.max);
  }
  if (stats.__isset.min) {
    page_statistics.set_min(stats.min);
  }
  if (stats.__isset.null_count) {
    page_statistics.set_null_count(stats.null_count);
  }
  if (stats.__isset.distinct_count) {
    page_statistics.set_distinct_count(stats.distinct_count);
  }
  return page_statistics;
}

Component(s)

C++, Parquet

wjones127 pushed a commit that referenced this issue Feb 17, 2023
…34112)

### Rationale for this change

The code below does not read from stats.min_value/max_value at all.
```cpp
// Extracts encoded statistics from V1 and V2 data page headers
template <typename H>
EncodedStatistics ExtractStatsFromHeader(const H& header) {
  EncodedStatistics page_statistics;
  if (!header.__isset.statistics) {
    return page_statistics;
  }
  const format::Statistics& stats = header.statistics;
  if (stats.__isset.max) {
    page_statistics.set_max(stats.max);
  }
  if (stats.__isset.min) {
    page_statistics.set_min(stats.min);
  }
  if (stats.__isset.null_count) {
    page_statistics.set_null_count(stats.null_count);
  }
  if (stats.__isset.distinct_count) {
    page_statistics.set_distinct_count(stats.distinct_count);
  }
  return page_statistics;
}
```

### What changes are included in this PR?

Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats.

### Are these changes tested?

Some test cases fail after the fix. Fixed them to make sure it is covered.

### Are there any user-facing changes?
No.

* Closes: #34138

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 12.0.0 milestone Feb 17, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
…alue (apache#34112)

### Rationale for this change

The code below does not read from stats.min_value/max_value at all.
```cpp
// Extracts encoded statistics from V1 and V2 data page headers
template <typename H>
EncodedStatistics ExtractStatsFromHeader(const H& header) {
  EncodedStatistics page_statistics;
  if (!header.__isset.statistics) {
    return page_statistics;
  }
  const format::Statistics& stats = header.statistics;
  if (stats.__isset.max) {
    page_statistics.set_max(stats.max);
  }
  if (stats.__isset.min) {
    page_statistics.set_min(stats.min);
  }
  if (stats.__isset.null_count) {
    page_statistics.set_null_count(stats.null_count);
  }
  if (stats.__isset.distinct_count) {
    page_statistics.set_distinct_count(stats.distinct_count);
  }
  return page_statistics;
}
```

### What changes are included in this PR?

Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats.

### Are these changes tested?

Some test cases fail after the fix. Fixed them to make sure it is covered.

### Are there any user-facing changes?
No.

* Closes: apache#34138

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
…alue (apache#34112)

### Rationale for this change

The code below does not read from stats.min_value/max_value at all.
```cpp
// Extracts encoded statistics from V1 and V2 data page headers
template <typename H>
EncodedStatistics ExtractStatsFromHeader(const H& header) {
  EncodedStatistics page_statistics;
  if (!header.__isset.statistics) {
    return page_statistics;
  }
  const format::Statistics& stats = header.statistics;
  if (stats.__isset.max) {
    page_statistics.set_max(stats.max);
  }
  if (stats.__isset.min) {
    page_statistics.set_min(stats.min);
  }
  if (stats.__isset.null_count) {
    page_statistics.set_null_count(stats.null_count);
  }
  if (stats.__isset.distinct_count) {
    page_statistics.set_distinct_count(stats.distinct_count);
  }
  return page_statistics;
}
```

### What changes are included in this PR?

Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats.

### Are these changes tested?

Some test cases fail after the fix. Fixed them to make sure it is covered.

### Are there any user-facing changes?
No.

* Closes: apache#34138

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants