PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field #6370
omega-bigstream wants to merge 4 commits into apache:master from
|
This PR doesn't seem correct. You want to create a distinct |
|
Definitely will need some unit tests for this to verify correctness |
|
I just squashed this branch to fix the bad merges. Please use |
Author: Omega Gamage <omega@bigstream.co>
Date: Tue Feb 18 14:23:08 2020 +0530
used std::map instead of std::unordered_map to store num_data_pages
commit 6822fc6509e0b2e18e909b8a650605a6773a7df8
Author: Omega Gamage <omega@bigstream.co>
Date: Mon Feb 17 11:45:03 2020 +0530
remove default arguments for page number counts in ColumnChunkMetaDataBuilder::Finish
commit 49e1861e0395541c0b6e6376c5d270b3f57dfae4
Author: Omega Gamage <omega@bigstream.co>
Date: Fri Feb 14 16:22:41 2020 +0530
Added the class PageEncodingStats to types.h
commit 9dc5b6b40dfd8cb747469748078246a52fe97935
Merge: 9b776f3b6 d65a71a9b
Author: Omega Gamage <omega@bigstream.co>
Date: Wed Feb 12 12:25:50 2020 +0530
resolved merge conflicts
commit 9b776f3b6049a512f8825daa03cba98e23142290
Author: Omega Gamage <omega@bigstream.co>
Date: Thu Feb 6 16:12:31 2020 +0530
PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field
Fixed lint errors
Use std::map to store datapage count
Added unit test to test encoding_stats
commit d65a71a9b67826c39d900814ffea43620f49bc93
Author: Omega Gamage <omega@bigstream.co>
Date: Thu Feb 6 20:11:09 2020 +0530
Fixed lint errors
commit 053ce4d4c018e9f7e498c9fbc2bc1ad1ee7aa4ea
Author: Omega Gamage <omega@bigstream.co>
Date: Thu Feb 6 16:12:31 2020 +0530
PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field
wesm left a comment
+1. I took care of the last couple small items. Will merge once the builds pass
|
Merging. The Appveyor failure is https://issues.apache.org/jira/browse/ARROW-7992 |
|
@omega-gamage thanks for the work on this. Can you confirm that I assigned this issue to the right person in https://issues.apache.org/jira/browse/PARQUET-1780 (as opposed to someone else with nearly your name)? |
|
@wesm Yes, the issue was assigned to me. That is my correct profile. |
This PR addresses PARQUET-1780:
The ColumnMetadata.encoding_stats field is empty in the parquet-cpp implementation.
This leads to metadata mismatches between two Parquet files generated by the C++ and Scala (parquet-mr) implementations.
encoding_stats is a vector of PageEncodingStats.
PageEncodingStats has three attributes: page_type, encoding, and count.
Of these, the first two can be extracted from information that is already available. For count, I had to add some attributes to existing classes.
Modifications:
For the class SerializedPageWriter, I added the following two attributes:
int32_t num_dict_pages_;
std::pair<int32_t, int32_t> num_data_pages_; (first: number of un-encoded pages, second: number of encoded pages)
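To illustrate how these two counters could be turned into encoding_stats entries, here is a rough, self-contained sketch. It is not the code in this PR: PageCounter and BuildEncodingStats are hypothetical names, the enums are cut-down stand-ins for the Parquet types, and the mapping of "un-encoded" to PLAIN and "encoded" to PLAIN_DICTIONARY (including for the dictionary page itself) is an assumption made only for the example.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Cut-down stand-ins for the Parquet enums (only the values used here).
enum class PageType { DATA_PAGE, DICTIONARY_PAGE };
enum class Encoding { PLAIN, PLAIN_DICTIONARY };

// The three attributes of PageEncodingStats described above.
struct PageEncodingStats {
  PageType page_type;
  Encoding encoding;
  int32_t count;
};

// Hypothetical holder for the two counters added to SerializedPageWriter.
struct PageCounter {
  int32_t num_dict_pages = 0;
  // first: number of un-encoded data pages, second: number of encoded data pages
  std::pair<int32_t, int32_t> num_data_pages{0, 0};
};

// Emits one entry per (page type, encoding) combination that actually occurred.
// Assumption: un-encoded pages are PLAIN, encoded pages are PLAIN_DICTIONARY.
std::vector<PageEncodingStats> BuildEncodingStats(const PageCounter& counter) {
  std::vector<PageEncodingStats> stats;
  if (counter.num_dict_pages > 0) {
    stats.push_back({PageType::DICTIONARY_PAGE, Encoding::PLAIN_DICTIONARY,
                     counter.num_dict_pages});
  }
  if (counter.num_data_pages.first > 0) {
    stats.push_back({PageType::DATA_PAGE, Encoding::PLAIN,
                     counter.num_data_pages.first});
  }
  if (counter.num_data_pages.second > 0) {
    stats.push_back({PageType::DATA_PAGE, Encoding::PLAIN_DICTIONARY,
                     counter.num_data_pages.second});
  }
  return stats;
}
```

In this sketch, combinations with a zero count are skipped so that the resulting vector only lists pages that were actually written. Whether the real writer should do the same is a design choice that would need to be checked against the parquet-mr output.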