[C++][Parquet] Allow computing distinct_count across multiple ColumnChunkMetadata #38877
Comments
@mapleFU I thought this issue would be a more appropriate place to continue the discussion we had here. You suggested that a good first implementation could be done in |
Some notes:
As I said in [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244 |
  EncodedStatistics GetChunkStatistics() override {
    EncodedStatistics result;
    if (chunk_statistics_) result = chunk_statistics_->Encode();
+   if (current_dict_encoder_ != nullptr && !fallback_) {
+     chunk_statistics_->SetDistinctCount(current_dict_encoder_->num_entries());
+   }
    return result;
  }
A likely implementation might be the code above ( Also maybe we can find whether |
Thank you for your notes @mapleFU. I need a clarification though:
As I understand it from the spec, a file consists of one or more RowGroups, which contain one or more ColumnChunks. I understand we cannot regard it as a whole file distinct count (as in the distinct count of all the columns combined), but is it a per-column distinct count, or a per-column-chunk distinct count? You seem to say it's a per-column-chunk, but I want to be sure I understand correctly.
Here we have 4 column chunks. It is
then, right? But the actual distinct count of col_1 is 3, so we cannot add them up. |
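To make the point above concrete, here is a minimal sketch of why per-chunk distinct counts cannot simply be summed. The helper names and the sample values are hypothetical (they are not the example from the original comment and not part of the Arrow/Parquet API); the sketch only assumes that each chunk of a column can be viewed as a list of values.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical representation: the values of one column chunk
// (one chunk per row group for the same column).
using Chunk = std::vector<int32_t>;

// Distinct count of a single chunk, computed in isolation.
int64_t DistinctInChunk(const Chunk& chunk) {
  std::set<int32_t> seen(chunk.begin(), chunk.end());
  return static_cast<int64_t>(seen.size());
}

// Naive column-level count: the sum of per-chunk distinct counts.
// Values shared by several chunks are counted once per chunk.
int64_t SumOfChunkDistincts(const std::vector<Chunk>& chunks) {
  int64_t total = 0;
  for (const auto& chunk : chunks) total += DistinctInChunk(chunk);
  return total;
}

// Actual column-level count: distinct values over the union of all chunks.
int64_t DistinctAcrossChunks(const std::vector<Chunk>& chunks) {
  std::set<int32_t> seen;
  for (const auto& chunk : chunks) seen.insert(chunk.begin(), chunk.end());
  return static_cast<int64_t>(seen.size());
}
```

With two made-up chunks `{1, 2, 2}` and `{2, 3}`, each chunk has 2 distinct values, so the sum is 4, while the column as a whole only has 3, because the value `2` appears in both chunks.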
Yeah, actually, if you want file-level stats, currently maybe you can only try to maintain the count and write the This interface might be a bit hard to use. Also, the Parquet writer implementation currently doesn't maintain a global dictionary for a global distinct_count. |
Describe the enhancement requested
As far as I can tell from reading the related pull requests #37016 and #35989, the current implementation can't compute distinct counts across chunks, because it cannot know whether a distinct value in one ColumnChunkMetaData has already appeared in another chunk. The chunks would need some way to know which distinct values have already been encountered, so that those values are not counted multiple times.
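One possible shape for such shared state is sketched below, purely as an illustration. `DistinctTracker` is a hypothetical helper, not an existing Arrow/Parquet class, and it assumes values can be compared in some canonical form (plain strings here). It keeps one set per chunk and one set for the whole column, so both counts stay exact.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical helper (not part of the Arrow/Parquet API): tracks distinct
// values for the current chunk and for the whole column at the same time.
class DistinctTracker {
 public:
  // Record one value written to the current chunk.
  void Add(const std::string& value) {
    chunk_seen_.insert(value);
    column_seen_.insert(value);
  }

  // Close the current chunk: return its own distinct count and reset the
  // per-chunk state. The column-level set is kept across chunks.
  int64_t FinishChunk() {
    const auto count = static_cast<int64_t>(chunk_seen_.size());
    chunk_seen_.clear();
    return count;
  }

  // Distinct values seen so far across all chunks of the column.
  int64_t column_distinct_count() const {
    return static_cast<int64_t>(column_seen_.size());
  }

 private:
  std::unordered_set<std::string> chunk_seen_;
  std::unordered_set<std::string> column_seen_;
};
```

The trade-off in this sketch is memory: the column-level set grows with the cardinality of the column, which may be why a writer would not want to maintain it by default.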
I'm opening this issue to discuss if implementing this feature is at all possible, and if it is, to see how it could be done.
If successful, I'd be available to write the implementation.
Component(s)
C++