
Fail to read 2-level structure Parquet #16520

@hudi-bot

Description


If I have `spark.hadoop.parquet.avro.write-old-list-structure` explicitly set to `false` - the only way to be able to write nulls inside arrays - Hudi starts to write Parquet files with the following schema inside:

```
required group internal_list (LIST) {
  repeated group list {
    required int64 element;
  }
}
```
 
But if I had some files produced before setting `spark.hadoop.parquet.avro.write-old-list-structure` to `false`, they have the following schema inside:

```
required group internal_list (LIST) {
  repeated int64 array;
}
```
 
And Hudi 0.14.x (at least) fails to read records from such a file, failing with the exception:

```
Caused by: java.lang.RuntimeException: Null-value for required field:
```

Even though the contents of the arrays are not null (in fact they cannot be null, since Avro requires `spark.hadoop.parquet.avro.write-old-list-structure` = `false` to write `null`s).
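For context, this is how the writer option above is typically supplied (a minimal sketch; the session setup is illustrative, only the config key and value come from the report):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-list-structure")
  // Makes parquet-avro write the 3-level LIST structure shown above
  // and thereby allows null elements inside arrays.
  .config("spark.hadoop.parquet.avro.write-old-list-structure", "false")
  .getOrCreate()
```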
### Expected behavior

Taken from Hudi 0.12.1 (not sure what exactly broke this since then):

If I have a file with the 2-level structure and an update arrives (no matter whether it has nulls inside the array or not - both produce the same result) with `spark.hadoop.parquet.avro.write-old-list-structure` set to `false` - the file is overwritten into the 3-level structure. (This fails in 0.14.1.)

If I have the 3-level structure with nulls and an update comes (no matter with or without nulls) - it is read and written correctly.

A simple reproduction of the issue can be found here:
https://github.com/VitoMakarevich/hudi-issue-014
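The failing write path boils down to roughly the following (a hedged sketch, not the exact linked repro: the table name, key fields, and column names are hypothetical; only the array-with-null shape and the session config come from the report):

```scala
import spark.implicits._

// An array column containing a null element - writable only when
// parquet.avro.write-old-list-structure=false is set on the session.
val df = Seq((1, Seq[java.lang.Long](1L, null, 3L))).toDF("id", "internal_list")

df.write.format("hudi")
  .option("hoodie.table.name", "list_repro")                // hypothetical
  .option("hoodie.datasource.write.recordkey.field", "id")  // hypothetical
  .option("hoodie.datasource.write.precombine.field", "id") // hypothetical
  .mode("append")
  .save("/tmp/list_repro")

// Updating a table whose earlier files were written with the old
// 2-level list structure is what triggers the
// "Null-value for required field" failure on 0.14.x.
```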

Most likely the problem appeared after Hudi made some changes such that values from the Hadoop conf started to propagate into the Reader instance (likely they were not propagated before).

