
[C++][Parquet] Allow computing distinct_count across multiple ColumnChunkMetaData #38877

Open
tcrasset opened this issue Nov 24, 2023 · 5 comments


@tcrasset

Describe the enhancement requested

As far as I can tell from reading the related pull requests #37016 and #35989, the current implementation can't compute distinct counts across chunks, because it cannot know whether a distinct value in one ColumnChunkMetaData has already appeared in another chunk.

The chunks would need some way of knowing which distinct values have already been encountered, so that they are not counted multiple times.

I'm opening this issue to discuss whether implementing this feature is possible at all and, if it is, how it could be done.

If successful, I'd be available to write the implementation.
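To make the problem concrete, here is a small Python sketch (illustrative only, not Arrow code) showing why per-chunk distinct *counts* cannot be combined, while per-chunk distinct *value sets* can:

```python
# Illustrative sketch: distinct counts from individual chunks are not
# mergeable, because a value shared by two chunks gets counted twice.
def merge_distinct_counts(counts):
    # Wrong in general: overcounts values that appear in several chunks.
    return sum(counts)

def merge_distinct_sets(value_sets):
    # Correct: union the per-chunk value sets, then count once.
    merged = set()
    for s in value_sets:
        merged |= s
    return len(merged)

chunks = [{"a", "x"}, {"b", "x"}]
print(merge_distinct_counts([len(s) for s in chunks]))  # 4 (overcounts "x")
print(merge_distinct_sets(chunks))                      # 3
```

This is exactly why the chunks would need to exchange the values themselves (or a summary of them), not just the counts.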

Component(s)

C++

@tcrasset
Author

@mapleFU I thought this would be a more appropriate place to continue the discussion we had here.

You suggested that a good first implementation could be done in DictEncoder.
I'm not well versed in the implementation details of the pyarrow library, but if I'm looking at this right, we are using parquet::TypedColumnWriterImpl::WriteArrow() to write the data into the Parquet file format, so this is the method I should look at. Is that correct?

@mapleFU
Member

mapleFU commented Nov 24, 2023

Some notes:

  1. Parquet uses Statistics [1] to store the distinct_count; it is an optional field in Thrift. Statistics can occur in both PageHeader and ColumnChunkMetaData. I think it would be hard to maintain a distinct_count per PageHeader, so it is probably only feasible to store a "ColumnChunk"-level distinct count.
  2. As for "across multiple ColumnChunkMetaData": the Statistics in fact only cover one column chunk. We cannot regard them as a whole-file distinct count.
  3. We may need to survey how other implementations handle distinct_count during writing.

As I said, in DictEncoder, if the user chooses dictionary encoding, there will be a dictionary of the non-null values. So, after writing a ColumnChunk, it is possible to get the distinct_count from the dictionary. Other encoders currently don't maintain a dictionary, so it is simply impossible to get a distinct_count there.
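As a rough illustration of that point, here is a toy Python sketch (not Arrow's actual DictEncoder) showing how dictionary encoding yields the distinct count of non-null values as a by-product, since every new value gets its own dictionary slot:

```python
# Toy dictionary encoder: maps each distinct non-null value to an index.
# The dictionary size IS the distinct count of the non-null values.
def dict_encode(values):
    dictionary = {}            # value -> dictionary index
    indices = []               # encoded output (None marks a null)
    for v in values:
        if v is None:
            indices.append(None)   # nulls are not dictionary entries
            continue
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return dictionary, indices

dictionary, indices = dict_encode(["a", "x", None, "a"])
distinct_count = len(dictionary)   # 2 -- "a" and "x"
```

For other encodings there is no such structure to consult, which matches the point above that the count would otherwise have to be tracked separately.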

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244

@mapleFU
Member

mapleFU commented Nov 24, 2023

  EncodedStatistics GetChunkStatistics() override {
    EncodedStatistics result;
+   if (chunk_statistics_ != nullptr && current_dict_encoder_ != nullptr &&
+       !fallback_) {
+     chunk_statistics_->SetDistinctCount(current_dict_encoder_->num_entries());
+   }
    if (chunk_statistics_) result = chunk_statistics_->Encode();
    return result;
  }

A likely implementation might look like the code above (SetDistinctCount doesn't exist in the current code; the count has to be set before chunk_statistics_->Encode() is called, and only when chunk_statistics_ is non-null).

Also, maybe we can check whether parquet-mr and arrow-rs/parquet write the distinct_count, and follow their style.

@tcrasset
Author

Thank you for your notes @mapleFU.

I need a clarification though:

As for "across multiple ColumnChunkMetaData": the Statistics in fact only cover one column chunk. We cannot regard them as a whole-file distinct count.

As I understand it from the spec, a file consists of one or more RowGroups, each containing one ColumnChunk per column.

I understand we cannot regard it as a whole-file distinct count (as in the distinct count of all the columns combined), but is it a per-column distinct count or a per-column-chunk distinct count? You seem to say it's per column chunk, but I want to be sure I understand correctly.

+-------+--------+
| col_1 | col_2  |
+-------+--------+
| a     | b      |
| x     | b      |
================== Row group boundary
| b     | d      |
| x     | d      |
+-------+--------+

Here we have 4 column chunks.

It is


<Column col_1 Chunk 1 + Column Metadata> --> distinct_count = 2 ("a", "x")
<Column col_2 Chunk 1 + Column Metadata> --> distinct_count = 1 ("b")
<Column col_1 Chunk 2 + Column Metadata> --> distinct_count = 2 ("b", "x")
<Column col_2 Chunk 2 + Column Metadata> --> distinct_count = 1 ("d")

then, right? But the actual distinct count of col_1 is 3, so we cannot add them up.
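A quick sketch checking the numbers in the table above (plain Python, just mirroring the example data):

```python
# The four column chunks from the example table above.
col_1_chunk_1 = ["a", "x"]
col_2_chunk_1 = ["b", "b"]
col_1_chunk_2 = ["b", "x"]
col_2_chunk_2 = ["d", "d"]

# Per-chunk distinct counts, as a ColumnChunk-level statistic would report.
per_chunk = {name: len(set(vals)) for name, vals in [
    ("col_1/1", col_1_chunk_1), ("col_2/1", col_2_chunk_1),
    ("col_1/2", col_1_chunk_2), ("col_2/2", col_2_chunk_2)]}
# {'col_1/1': 2, 'col_2/1': 1, 'col_1/2': 2, 'col_2/2': 1}

# True per-column distinct count of col_1 across both row groups:
col_1_total = len(set(col_1_chunk_1) | set(col_1_chunk_2))  # 3, not 2 + 2
```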

@mapleFU
Member

mapleFU commented Nov 24, 2023

Yeah. Actually, if you want file-level stats, currently you can probably only maintain the count yourself and write it as a self-defined stat in the file's key-value metadata: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1125

This interface might be a bit hard to use. Also, the current Parquet writer implementation doesn't maintain a global dictionary, so there is no global distinct_count to draw on.
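A minimal sketch of that workaround, in plain Python (the class and the metadata key are purely illustrative, not part of any Arrow API): the application tracks the distinct values itself across row groups and emits the final count as a self-defined key/value metadata entry.

```python
# Hypothetical application-side tracker for a file-level distinct count.
# The metadata key "my.app.distinct_count" is made up for this example.
class FileLevelDistinctTracker:
    def __init__(self):
        self.seen = set()      # acts as a "global dictionary" for one column

    def observe_row_group(self, column_values):
        # Record the non-null values of this column in one row group.
        self.seen.update(v for v in column_values if v is not None)

    def key_value_metadata(self):
        # Key/value metadata entries are string -> string in Parquet.
        return {"my.app.distinct_count": str(len(self.seen))}

tracker = FileLevelDistinctTracker()
tracker.observe_row_group(["a", "x"])   # first row group of col_1
tracker.observe_row_group(["b", "x"])   # second row group of col_1
print(tracker.key_value_metadata())     # {'my.app.distinct_count': '3'}
```

Readers would of course have to know about this custom key; nothing in the Parquet spec gives it meaning.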
