Closed
Description
Hi All,
Parquet files generated using 0.10.1 and 0.10.2 cannot be read by Apache Impala. The read fails with the following error:
"metadata is corrupt. Dictionary page (offset=2943) must come before any data pages (offset=2943)"
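This error appears to be raised when col_chunk.meta_data.dictionary_page_offset >= col_start, where col_start is the column's data page offset. In the message above both offsets are 2943, so the check fails on equality. The relevant validation code in Impala: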
```cpp
Status ParquetMetadataUtils::ValidateColumnOffsets(const string& filename,
    int64_t file_length, const parquet::RowGroup& row_group) {
  for (int i = 0; i < row_group.columns.size(); ++i) {
    const parquet::ColumnChunk& col_chunk = row_group.columns[i];
    RETURN_IF_ERROR(ValidateOffsetInFile(filename, i, file_length,
        col_chunk.meta_data.data_page_offset, "data page offset"));
    int64_t col_start = col_chunk.meta_data.data_page_offset;
    // The file format requires that if a dictionary page exists, it be before data pages.
    if (col_chunk.meta_data.__isset.dictionary_page_offset) {
      RETURN_IF_ERROR(ValidateOffsetInFile(filename, i, file_length,
          col_chunk.meta_data.dictionary_page_offset, "dictionary page offset"));
      if (col_chunk.meta_data.dictionary_page_offset >= col_start) {
        return Status(Substitute("Parquet file '$0': metadata is corrupt. Dictionary "
            "page (offset=$1) must come before any data pages (offset=$2).",
            filename, col_chunk.meta_data.dictionary_page_offset, col_start));
      }
```
https://lists.apache.org/thread/0mmlmt02hgb2btlr9hg1n2fs01dylskl
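To make the failing condition concrete, here is a minimal standalone sketch of the same ordering check, reduced to the three fields involved. It is not Impala's code: the `ColumnMeta` struct, the `OffsetCheck` enum, and both function names are hypothetical, and the in-file bounds test (`0 <= offset < file_length`) is only an assumption about what `ValidateOffsetInFile` verifies.

```cpp
#include <cstdint>

// Hypothetical stand-in for the column chunk metadata fields used by the check.
// This is not the Thrift-generated parquet::ColumnMetaData type.
struct ColumnMeta {
  std::int64_t data_page_offset;
  bool has_dictionary_page_offset;      // mirrors __isset.dictionary_page_offset
  std::int64_t dictionary_page_offset;  // meaningful only when the flag is true
};

// Hypothetical result codes for this sketch.
enum class OffsetCheck { kOk, kOutOfFile, kDictionaryNotBeforeData };

// Assumption: an offset is valid if it lies inside the file.
constexpr bool OffsetInFile(std::int64_t offset, std::int64_t file_length) {
  return offset >= 0 && offset < file_length;
}

// Same ordering rule as the quoted snippet: when a dictionary page offset is
// set, it must be strictly smaller than the data page offset.
constexpr OffsetCheck ValidateOffsets(const ColumnMeta& meta,
                                      std::int64_t file_length) {
  if (!OffsetInFile(meta.data_page_offset, file_length)) {
    return OffsetCheck::kOutOfFile;
  }
  if (meta.has_dictionary_page_offset) {
    if (!OffsetInFile(meta.dictionary_page_offset, file_length)) {
      return OffsetCheck::kOutOfFile;
    }
    // Equal offsets are rejected too, which is the case reported here.
    if (meta.dictionary_page_offset >= meta.data_page_offset) {
      return OffsetCheck::kDictionaryNotBeforeData;
    }
  }
  return OffsetCheck::kOk;
}
```

With the values from the error message (both offsets equal to 2943, and any file length larger than that), `ValidateOffsets` returns `kDictionaryNotBeforeData`. A column chunk whose dictionary page offset is smaller than its data page offset, or one with no dictionary page offset set at all, returns `kOk`.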
Downgrading to 0.10.0 works perfectly.
Regards
GP