Skip to content

PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field#6370

Closed
omega-bigstream wants to merge 4 commits intoapache:masterfrom
omega-bigstream:PARQUET-1780
Closed

PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field#6370
omega-bigstream wants to merge 4 commits intoapache:masterfrom
omega-bigstream:PARQUET-1780

Conversation

@omega-bigstream
Copy link
Contributor

This is to solve the issue PARQUET-1780:
ColumnMetadata.encoding_stats field is empty in parquet-cpp implementation.
This leads to metadata mismatches between 2 parquet files generated by cpp and scala(parquet-mr).

encoding_stat is a vector of PageEncodingStats.
PageEncodingStats has three attributes:

  • page_type: (data or dict)
  • encoding: encoding of the page
  • count:number of pages of this type with this encoding

From above first to can be extracted from available information. But for count I have to create a add some attributes to exisiting classes.
Modifications:
For the class SerializedPageWriter, added following two attributes.
int32_t num_dict_pages_;
std::pair<int32_t, int32_t> num_data_pages_; (first: number of un-encoded pages,
second:number of encoded pages )

@github-actions
Copy link

github-actions bot commented Feb 6, 2020

@pitrou
Copy link
Member

pitrou commented Feb 6, 2020

This PR doesn't seem correct. You want to create a distinct PageEncodingStats for each (PageType, Encoding) pair. How you achieve that with only three counters I don't understand.

@wesm
Copy link
Member

wesm commented Feb 6, 2020

Definitely will need some unit tests for this to verify correctness

@omega-bigstream
Copy link
Contributor Author

@pitrou , Yes you are correct. My initial idea was that counting number of encoded and un-encoded pages will be enough. That is why I used std::pair, using std::map will address this issue. I will fix it.

@wesm , I will add some unit test to verify the correctness.

Thanks

@wesm
Copy link
Member

wesm commented Feb 21, 2020

I just squashed this branch to fix the bad merges. Please use git fetch omega-gamage && git reset --hard omega-gamage/PARQUET-1780 and not git merge or it will be in a bad state again

omega-bigstream and others added 4 commits March 2, 2020 16:28
Author: Omega Gamage <omega@bigstream.co>
Date:   Tue Feb 18 14:23:08 2020 +0530

    used std::map instead of std::unordered_map to store num_data_pages

commit 6822fc6509e0b2e18e909b8a650605a6773a7df8
Author: Omega Gamage <omega@bigstream.co>
Date:   Mon Feb 17 11:45:03 2020 +0530

    remove default arguments for page number counts in ColumnChunkMetaDataBuilder::Finish

commit 49e1861e0395541c0b6e6376c5d270b3f57dfae4
Author: Omega Gamage <omega@bigstream.co>
Date:   Fri Feb 14 16:22:41 2020 +0530

    Added the class PageEncodingStats to types.h

commit 9dc5b6b40dfd8cb747469748078246a52fe97935
Merge: 9b776f3b6 d65a71a9b
Author: Omega Gamage <omega@bigstream.co>
Date:   Wed Feb 12 12:25:50 2020 +0530

    resolved merge conflicts

commit 9b776f3b6049a512f8825daa03cba98e23142290
Author: Omega Gamage <omega@bigstream.co>
Date:   Thu Feb 6 16:12:31 2020 +0530

    PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field

    Fixed lint errors

    Use std::map to store datapage count

    Added unit test to test encoding_stats

commit d65a71a9b67826c39d900814ffea43620f49bc93
Author: Omega Gamage <omega@bigstream.co>
Date:   Thu Feb 6 20:11:09 2020 +0530

    Fixed lint errors

commit 053ce4d4c018e9f7e498c9fbc2bc1ad1ee7aa4ea
Author: Omega Gamage <omega@bigstream.co>
Date:   Thu Feb 6 16:12:31 2020 +0530

    PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I took care of the last couple small items. Will merge once the builds pass

@wesm
Copy link
Member

wesm commented Mar 3, 2020

Merging. The Appveyor failure is https://issues.apache.org/jira/browse/ARROW-7992

@wesm wesm closed this in b4acb0b Mar 3, 2020
@wesm
Copy link
Member

wesm commented Mar 3, 2020

@omega-gamage thanks for the work on this. Can you confirm that I assigned this issue to the right person in https://issues.apache.org/jira/browse/PARQUET-1780 (as opposed to someone else with nearly your name)?

@omega-bigstream
Copy link
Contributor Author

@wesm . Yes Issue was assigned to me. That is my correct profile.
Thank you very much for the support and valid comments. Looking forward to do ,ore open source contributions. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments