
[C++][Parquet] Allow computing distinct_count across multiple ColumnChunkMetaData #38877

Open
tcrasset opened this issue Nov 24, 2023 · 5 comments


@tcrasset

Describe the enhancement requested

As far as I can tell from reading the related pull requests #37016 and #35989, the current implementation can't compute distinct counts across chunks, because it cannot know whether a distinct value in one ColumnChunkMetaData has already appeared in another chunk.

The chunks would need some way of knowing which distinct values have already been encountered, so that they are not counted multiple times.

I'm opening this issue to discuss whether implementing this feature is possible at all and, if it is, how it could be done.

If successful, I'd be available to write the implementation.
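To make the problem concrete, here is a small Python sketch (illustrative only, not Arrow code) showing why per-chunk distinct *counts* cannot be combined, while per-chunk distinct *value sets* can:

```python
# Illustrative sketch: distinct counts from individual chunks are not
# mergeable, because a value shared by two chunks gets counted twice.
def merge_distinct_counts(counts):
    # Wrong in general: overcounts values that appear in several chunks.
    return sum(counts)

def merge_distinct_sets(value_sets):
    # Correct: union the per-chunk value sets, then count once.
    merged = set()
    for s in value_sets:
        merged |= s
    return len(merged)

chunks = [{"a", "x"}, {"b", "x"}]
print(merge_distinct_counts([len(s) for s in chunks]))  # 4 (overcounts "x")
print(merge_distinct_sets(chunks))                      # 3
```

This is exactly why the chunks would need to exchange the values themselves (or a summary of them), not just the counts.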

Component(s)

C++

@tcrasset
Author

@mapleFU I thought this would be a more appropriate place to continue the discussion we had here.

You suggested that a good first implementation could be done in DictEncoder.
I'm not well versed in the implementation details of the pyarrow library, but if I'm looking at this right, we are using parquet::TypedColumnWriterImpl::WriteArrow() to write the data into the Parquet file format, so this is the method I should look at. Is that correct?

@mapleFU
Member

mapleFU commented Nov 24, 2023

Some notes:

  1. Parquet uses Statistics [1] to store the distinct_count; it is an optional field in Thrift. Statistics can occur in both PageHeader and ColumnChunkMetaData. I think it would be hard to maintain a distinct_count per PageHeader, so it is probably only feasible to store a "ColumnChunk"-level distinct count.
  2. As for "across multiple ColumnChunkMetaData": the Statistics in fact only cover one column chunk. We cannot regard them as a whole-file distinct count.
  3. We may need to survey how other implementations handle distinct_count during writing.

As I said, in DictEncoder, if the user chooses dictionary encoding, there will be a dictionary of the non-null values. So, after writing a ColumnChunk, it is possible to get the distinct_count from the dictionary. Other encoders currently don't maintain a dictionary, so it is simply impossible to get a distinct_count there.
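As a rough illustration of that point, here is a toy Python sketch (not Arrow's actual DictEncoder) showing how dictionary encoding yields the distinct count of non-null values as a by-product, since every new value gets its own dictionary slot:

```python
# Toy dictionary encoder: maps each distinct non-null value to an index.
# The dictionary size IS the distinct count of the non-null values.
def dict_encode(values):
    dictionary = {}            # value -> dictionary index
    indices = []               # encoded output (None marks a null)
    for v in values:
        if v is None:
            indices.append(None)   # nulls are not dictionary entries
            continue
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return dictionary, indices

dictionary, indices = dict_encode(["a", "x", None, "a"])
distinct_count = len(dictionary)   # 2 -- "a" and "x"
```

For other encodings there is no such structure to consult, which matches the point above that the count would otherwise have to be tracked separately.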

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244

@mapleFU
Member

mapleFU commented Nov 24, 2023

  EncodedStatistics GetChunkStatistics() override {
    EncodedStatistics result;
+   if (chunk_statistics_ != nullptr && current_dict_encoder_ != nullptr &&
+       !fallback_) {
+     chunk_statistics_->SetDistinctCount(current_dict_encoder_->num_entries());
+   }
    if (chunk_statistics_) result = chunk_statistics_->Encode();
    return result;
  }

A likely implementation might look like the code above (SetDistinctCount doesn't exist in the current code; the count has to be set before chunk_statistics_->Encode() is called, and only when chunk_statistics_ is non-null).

Also, maybe we can check whether parquet-mr and arrow-rs/parquet write the distinct_count, and follow their style.

@tcrasset
Author

Thank you for your notes @mapleFU.

I need a clarification though:

As for "across multiple ColumnChunkMetaData": the Statistics in fact only cover one column chunk. We cannot regard them as a whole-file distinct count.

As I understand it from the spec, a file consists of one or more RowGroups, each containing one ColumnChunk per column.

I understand we cannot regard it as a whole-file distinct count (as in the distinct count of all the columns combined), but is it a per-column distinct count or a per-column-chunk distinct count? You seem to say it's per column chunk, but I want to be sure I understand correctly.

+-------+--------+
| col_1 | col_2  |
+-------+--------+
| a     | b      |
| x     | b      |
================== Row group boundary
| b     | d      |
| x     | d      |
+-------+--------+

Here we have 4 column chunks.

It is


<Column col_1 Chunk 1 + Column Metadata> --> distinct_count = 2 ("a", "x")
<Column col_2 Chunk 1 + Column Metadata> --> distinct_count = 1 ("b")
<Column col_1 Chunk 2 + Column Metadata> --> distinct_count = 2 ("b", "x")
<Column col_2 Chunk 2 + Column Metadata> --> distinct_count = 1 ("d")

then, right? But the actual distinct count of col_1 is 3, so we cannot add them up.
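A quick sketch checking the numbers in the table above (plain Python, just mirroring the example data):

```python
# The four column chunks from the example table above.
col_1_chunk_1 = ["a", "x"]
col_2_chunk_1 = ["b", "b"]
col_1_chunk_2 = ["b", "x"]
col_2_chunk_2 = ["d", "d"]

# Per-chunk distinct counts, as a ColumnChunk-level statistic would report.
per_chunk = {name: len(set(vals)) for name, vals in [
    ("col_1/1", col_1_chunk_1), ("col_2/1", col_2_chunk_1),
    ("col_1/2", col_1_chunk_2), ("col_2/2", col_2_chunk_2)]}
# {'col_1/1': 2, 'col_2/1': 1, 'col_1/2': 2, 'col_2/2': 1}

# True per-column distinct count of col_1 across both row groups:
col_1_total = len(set(col_1_chunk_1) | set(col_1_chunk_2))  # 3, not 2 + 2
```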

@mapleFU
Member

mapleFU commented Nov 24, 2023

Yeah. Actually, if you want file-level stats, currently you can probably only maintain the count yourself and write it as a self-defined stat in the file's key-value metadata: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1125

This interface might be a bit hard to use. Also, the current Parquet writer implementation doesn't maintain a global dictionary, so there is no global distinct_count to draw on.
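A minimal sketch of that workaround, in plain Python (the class and the metadata key are purely illustrative, not part of any Arrow API): the application tracks the distinct values itself across row groups and emits the final count as a self-defined key/value metadata entry.

```python
# Hypothetical application-side tracker for a file-level distinct count.
# The metadata key "my.app.distinct_count" is made up for this example.
class FileLevelDistinctTracker:
    def __init__(self):
        self.seen = set()      # acts as a "global dictionary" for one column

    def observe_row_group(self, column_values):
        # Record the non-null values of this column in one row group.
        self.seen.update(v for v in column_values if v is not None)

    def key_value_metadata(self):
        # Key/value metadata entries are string -> string in Parquet.
        return {"my.app.distinct_count": str(len(self.seen))}

tracker = FileLevelDistinctTracker()
tracker.observe_row_group(["a", "x"])   # first row group of col_1
tracker.observe_row_group(["b", "x"])   # second row group of col_1
print(tracker.key_value_metadata())     # {'my.app.distinct_count': '3'}
```

Readers would of course have to know about this custom key; nothing in the Parquet spec gives it meaning.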
