Summary of Parquet reader Issues #9560

Open
8dukongjian opened this issue Apr 22, 2024 · 30 comments
Labels: bug (Something isn't working), parquet, triage (Newly created issue that needs attention)


@8dukongjian
Contributor

8dukongjian commented Apr 22, 2024

Bug description

Using the test method provided by @qqibrow in #7478, four compression formats (GZIP, SNAPPY, LZO, and UNCOMPRESSED) and two Parquet versions (V1 and V2) were tested, for a total of eight test scenarios. The problems discovered by these tests are summarized in the table below.
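For reference, a minimal sketch of how such a compression × data-page-version matrix can be generated. This is not the harness from #7478, and PyArrow is only an assumption here; note that PyArrow/Arrow C++ has no LZO codec, so the LZO files would typically come from parquet-mr (e.g. Hive or Spark).

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Simple data; the attached .zip files are the authoritative repro inputs.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["a", "b", "c"], type=pa.string()),
})

# "NONE" is UNCOMPRESSED; LZO is omitted because Arrow C++ does not ship that codec.
for codec in ["NONE", "SNAPPY", "GZIP"]:
    # data_page_version selects Parquet data page V1 vs V2.
    for page_version in ["1.0", "2.0"]:
        pq.write_table(
            table,
            f"test_{codec.lower()}_v{page_version[0]}.parquet",
            compression=codec,
            data_page_version=page_version,
        )
```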

Bugs

| Seq | Error | Compression formats | Parquet version | Issue | PR | File to reproduce | State | Owner |
|---|---|---|---|---|---|---|---|---|
| 1 | Empty collection in array or map leads to incorrect results in Parquet reader | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | #7776 | #9187 | | Merged | jaystarshot |
| 2 | Parsing complex type errors, e.g. `ARRAY<STRUCT<test:string>>` is parsed into `ARRAY<string>` | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | #9242 | #9533 | array_struct.zip | Merged | chliang71, qqibrow |
| 3 | children size should not be larger than 2 | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | | #9533 | children.zip | Merged | chliang71, qqibrow |
| 4 | `presetNullsConsumed_ == presetNullsSize_` | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | #9238 | #9728 | presetNulls.zip | Under repair | hitarth |
| 5 | ColumnMetaData does not exist for schema Id | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | #9239 | #9223 | ColumnMetaData.zip | Co | makagonov |
| 6 | Decompression failed, decompressedSize is not equal to remainingOutputSize | LZO | V1 and V2 | #9618 | #10123 | lzo.zip | Merged | majetideepak |
| 7 | For raw decompression, compressedLength should be greater than zero | SNAPPY | V2 | None | #10121 | snappy.zip | Merged | majetideepak |
| 8 | Null pointer (missing decoder for TINYINT) | GZIP, SNAPPY, LZO, UNCOMPRESSED | V2 | None | None | array.zip | Not repaired | hitarth |
| 10 | Core dump in StringColumnReader::processFilter (missing decoder for VARBINARY on FLBA) | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 | #9757 | #9887 | str_gzip.zip | Merged | majetideepak |
| 11 | Parquet PageReader incorrectly skips rep/def levels when the max values are 0 | All | V2 | #9924 | #9939 | | Merged | yingsu00 |
| 12 | Parquet reader: can't read Parquet file with no column indexes (Map backward compatibility) | All | V1 and V2 | #9463 | | | In progress | majetideepak |
| 13 | Suspected bug related to read with Mutation (possibly Iceberg) | All | V1 and V2 | #8973 | | | In triage | yingsu00 |
| 14 | Velox Parquet scan fails when selecting the row-index column before a data column | All | V1 and V2 | #9867 | Fix makeScanSpec wrong index issue if skip rowindex column (#9866) | | Merged | gaoyangxiaozhu |
| 15 | Tokenizer cannot parse column name with a space in it (special chars in column names) | All | V1 and V2 | #10348 | | | Under repair | yingsu00 |
| 16 | Support "parquet_use_column_names" = false in Velox (special chars in column names) | All | V1 and V2 | #10388 | #10085 | | Under repair | agrawalreetika, yingsu00 |
| 17 | Ubuntu build failing when the VELOX_ENABLE_PARQUET flag is enabled (build issue) | All | V1 and V2 | #10323 | | | Under triage | |
| 18 | Parquet reader: can't read null map row in a single-line file | All | V1 and V2 | #10510 | Remove fast path for all null in filter | | Under repair | yma11 |
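For bugs 1–3 above, the triggering shapes are nested types: empty or null collections and an array of structs. A hedged sketch of data that exercises those shapes, assuming PyArrow (the attached array_struct.zip and children.zip remain the authoritative repros):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Empty/null arrays and maps (bug 1) plus ARRAY<STRUCT<test:string>> (bugs 2-3).
table = pa.table({
    "arr": pa.array([[1, 2], [], None], type=pa.list_(pa.int32())),
    "m": pa.array([[("k", 1)], [], None], type=pa.map_(pa.string(), pa.int32())),
    "arr_struct": pa.array(
        [[{"test": "x"}], [], None],
        type=pa.list_(pa.struct([("test", pa.string())])),
    ),
})
pq.write_table(table, "nested_repro.parquet")
```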

File to Reproduce:

- int96.zip
- timestamp_mills.zip
- timestamp_micros.zip
- array_struct.zip
- children.zip
- presetNulls.zip
- ColumnMetaData.zip
- lzo.zip
- delta_byte_array.zip
- rle.zip
- array.zip
- snappy.zip

System information

None

Relevant logs

No response

The more complete feature request list is in #9767.
Feature requests (backup):

| Seq | Error | Compression formats | Parquet version | Issue | PR | File to reproduce | State | Owner |
|---|---|---|---|---|---|---|---|---|
| 1 | Reading INT96 is not supported | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | dictionary encoding: #4680; plain encoding: oap-project#456 | | int96.zip | Development | ? |
| 2 | Reading INT64 (TIMESTAMP_MICROS) is not supported | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | #8325 | | timestamp_micros.zip | Development | mskapilks |
| 3 | Reading INT64 (TIMESTAMP_MILLIS) is not supported | GZIP, SNAPPY, LZO, UNCOMPRESSED | V1 and V2 | #8325 | | timestamp_mills.zip | Development | mskapilks |
| 4 | Encoding not supported yet: DELTA_BYTE_ARRAY | GZIP, SNAPPY, LZO, UNCOMPRESSED | V2 | None | None | delta_byte_array.zip | Not started | yingsu00 |
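For rows 1–3 above, files with the unsupported timestamp physical types can be produced roughly like this, assuming PyArrow (the attached .zip files may have been written by other writers such as Spark or parquet-mr):

```python
import datetime
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ts": pa.array([datetime.datetime(2024, 4, 22, 12, 0, 0)], type=pa.timestamp("us")),
})

# INT96 timestamps (the legacy Spark/Hive representation).
pq.write_table(table, "int96.parquet", use_deprecated_int96_timestamps=True)

# INT64 TIMESTAMP_MILLIS and TIMESTAMP_MICROS.
pq.write_table(table, "timestamp_millis.parquet", coerce_timestamps="ms",
               allow_truncated_timestamps=True)
pq.write_table(table, "timestamp_micros.parquet", coerce_timestamps="us")
```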
@8dukongjian added the bug (Something isn't working) and triage (Newly created issue that needs attention) labels on Apr 22, 2024
@xumingming
Contributor

@8dukongjian Very informative, thanks! I'd suggest submitting an issue for each problem you found, with a detailed description and a reproducible case, and using this issue as an umbrella issue to track all of them.

@8dukongjian
Contributor Author

> @8dukongjian Very informative, thanks! I'd suggest submitting an issue for each problem you found, with a detailed description and a reproducible case, and using this issue as an umbrella issue to track all of them.

Thanks, good suggestions. I will work on it later.

@mbasmanova changed the title from "Summary of Bugs in Velox Parquet Reader" to "Summary of bugs in Parquet reader" on Apr 22, 2024
@mbasmanova
Contributor

@8dukongjian Thank you for the summary. Do you plan to work on fixing these?

CC: @FelixYBW @majetideepak

@FelixYBW
Contributor

Thank you! It's what we need. @yma11 can add more test results from parquet-mr later.

@yma11
Contributor

yma11 commented Apr 23, 2024

> Thank you! It's what we need. @yma11 can add more test results from parquet-mr later.

Currently issue #9463 is open for the failure found by parquet-mr. I can add an additional table here when more new failures are found.

@qqibrow
Collaborator

qqibrow commented Apr 23, 2024

Thanks! It looks like there are more issues related to encoding support that were not covered before.

@aditi-pandit
Collaborator

@yzhang1991

@yingsu00
Collaborator

@8dukongjian Thanks, Sitao. I'll triage them myself tomorrow, and then we will go over the list in this week's Parquet sync meeting on Friday at 1pm. Please feel free to join.

@yingsu00
Collaborator

> Thanks! It looks like there are more issues related to encoding support that were not covered before.

Hi @qqibrow, are you talking about the V2 encodings or the existing ones?

@liujiayi771
Contributor

@yingsu00 Are there any plans to support Parquet v2 encoding?

@yingsu00
Collaborator

> @yingsu00 Are there any plans to support Parquet v2 encoding?

Yes, I do have a plan to support the v2 encodings, but it will be in the second half of the year. Do you need it urgently?

@ethanyzhang

@yingsu00 So I have a silly question: Reetika tried to write Hive tables in Prestissimo, and the Parquet file has format version 2.6 when I inspect it with parquet-tools. So that's not Parquet v2? Where should I look if I want to tell the version of a Parquet file?

@ethanyzhang

Or maybe that file didn’t use any Parquet v2 specific features so Prestissimo can still read it…

@yingsu00
Collaborator

@yzhang1991 I think the 2.x version refers to the DataPage version, and does not necessarily mean the encodings of the data are all V2. The writer could just be encoding the data in V1 encodings.

We usually say the following encodings are V2 encodings, see https://parquet.apache.org/docs/file-format/data-pages/encodings/

- 5 DELTA_BINARY_PACKED
- 6 DELTA_LENGTH_BYTE_ARRAY
- 7 DELTA_BYTE_ARRAY
- 9 BYTE_STREAM_SPLIT

Currently the Velox Parquet reader can read both V1 and V2 DataPage headers, but of these encodings it only supports 9 (BYTE_STREAM_SPLIT), added in the PR "Support BYTE_STREAM_SPLIT encoding in native Parquet reader". Support for 5, 6, and 7 still needs to be added.
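For example, one way to see what a file actually uses, independent of the format version shown by parquet-tools (a sketch assuming PyArrow; the file name is a placeholder):

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("some_file.parquet").metadata
print(md.format_version)  # e.g. "2.6" -- the file format version, not the encodings

# The encodings actually used per column chunk tell you whether V2 encodings appear.
rg = md.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings)  # e.g. ('PLAIN', 'RLE', 'RLE_DICTIONARY')
```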

@liujiayi771
Contributor

> Yes, I do have a plan to support the v2 encodings, but it will be in the second half of the year. Do you need it urgently?

@yingsu00 Yes, we need to use DELTA_BYTE_ARRAY encoding.
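A test file with that encoding can be produced roughly like this (a sketch assuming PyArrow; requesting a specific encoding requires disabling dictionary encoding for the column):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": pa.array(["alpha", "alphabet", "beta"], type=pa.string())})
pq.write_table(
    table,
    "delta_byte_array.parquet",
    use_dictionary=False,                       # required when forcing an encoding
    column_encoding={"s": "DELTA_BYTE_ARRAY"},  # one of the V2 encodings listed above
)
```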

@yingsu00
Collaborator

@8dukongjian I took the liberty of editing the table and adding a "Seq" column to help us quickly identify the issues.

Also created the issue "Parquet reading failed to decompress LZO files", and @nmahadevuni will take a look.

@yingsu00
Collaborator

> Yes, I do have a plan to support the v2 encodings, but it will be in the second half of the year. Do you need it urgently?
>
> @yingsu00 Yes, we need to use DELTA_BYTE_ARRAY encoding.

Ok, we'll prioritize this encoding then.

@8dukongjian
Contributor Author

> @8dukongjian I took the liberty of editing the table and adding a "Seq" column to help us quickly identify the issues.
>
> Also created the issue "Parquet reading failed to decompress LZO files", and @nmahadevuni will take a look.

Thanks, now the table is clearer.

@qqibrow
Collaborator

qqibrow commented Apr 30, 2024

@8dukongjian Does your team have bandwidth to take bug no. 9? Currently no one is working on it.

@yingsu00 changed the title from "Summary of bugs in Parquet reader" to "Summary of Parquet reader Issues" on May 2, 2024
@yingsu00
Collaborator

yingsu00 commented May 2, 2024

@qqibrow @8dukongjian I will take a look at bug no. 9

@qqibrow
Collaborator

qqibrow commented May 2, 2024

@yingsu00 thanks. I haven't had time for a detailed check. I am wondering whether `dictionaryIdDecoder_ = std::make_unique<RleBpDataDecoder>(` or some of it can be reused.

@yingsu00
Collaborator

yingsu00 commented May 6, 2024

@qqibrow We need a BooleanRleBpDataDecoder. There are a number of other types/decoders that need to be added. I'll create a separate issue for it.

@hitarth
Collaborator

hitarth commented May 6, 2024

Took a quick look at bug 8 above, "Null pointer". It seems to be due to the unsupported DELTA_BINARY_PACKED encoding, as mentioned in the comment above.

@yma11
Contributor

yma11 commented May 7, 2024

@qqibrow @yingsu00 One core dump issue was found in the Parquet read fuzzer test. I added it to the bug list with the file provided; please also help take a look. Thanks! cc @FelixYBW.

@qqibrow
Collaborator

qqibrow commented May 8, 2024

@yma11 Thanks! Could you create an issue and share the stack trace and the file to reproduce there? Also, are you going to work on it?

@yma11
Contributor

yma11 commented May 9, 2024

> @yma11 Thanks! Could you create an issue and share the stack trace and the file to reproduce there? Also, are you going to work on it?

#9757 is created for tracking. So do we need to create an issue for each failure? I thought we would keep everything in this consolidated place. I don't have the bandwidth to fix it for now; I will first focus on finding more issues by leveraging the Parquet fuzzer test. Could you help with it?

@FelixYBW
Contributor

FelixYBW commented May 9, 2024

> #9757 is created for tracking. So do we need to create an issue for each failure? I thought we would keep everything in this consolidated place. I don't have the bandwidth to fix it for now; I will first focus on finding more issues by leveraging the Parquet fuzzer test. Could you help with it?

Let's create one issue for each failure reason and document the failure message clearly. That at least makes it easy to know whether an issue we hit during a Gluten run is already in the list.

@8dukongjian
Contributor Author

> @yma11 Thanks! Could you create an issue and share the stack trace and the file to reproduce there? Also, are you going to work on it?
>
> #9757 is created for tracking. So do we need to create an issue for each failure? I thought we would keep everything in this consolidated place. I don't have the bandwidth to fix it for now; I will first focus on finding more issues by leveraging the Parquet fuzzer test. Could you help with it?

Thanks, I updated #9757 in the table.

facebook-github-bot pushed a commit that referenced this issue Jun 25, 2024
Summary:
Resolves the compression issue 7 here #9560

Pull Request resolved: #10121

Reviewed By: Yuhta

Differential Revision: D59012499

Pulled By: bikramSingh91

fbshipit-source-id: 804464fa0f22e7311650c23d48aa0a75fcd9f601
@weixiuli

weixiuli commented Jul 5, 2024

@8dukongjian #10395
