Description
If I have `spark.hadoop.parquet.avro.write-old-list-structure` explicitly set to `false` - the only way to be able to write nulls inside arrays - Hudi starts to write Parquet files with the following schema inside:
```
required group internal_list (LIST) {
  repeated group list {
    required int64 element;
  }
}
```
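For reference, this setting is passed through the Spark session's Hadoop configuration (prefixed with `spark.hadoop.`); a sketch in PySpark, with an illustrative app name - parquet-avro refuses to write nullable array elements unless the old list structure is disabled:

```python
# Sketch (PySpark): enable the modern 3-level list structure so that
# nullable array elements can be written by parquet-avro.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-write")  # illustrative name
         .config("spark.hadoop.parquet.avro.write-old-list-structure", "false")
         .getOrCreate())
```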
But if I have some files produced before setting `spark.hadoop.parquet.avro.write-old-list-structure` to `false`, they have the following schema inside:
```
required group internal_list (LIST) {
  repeated int64 array;
}
```
And Hudi 0.14.x, at least, fails to read records from such a file, failing with the exception:

`Caused by: java.lang.RuntimeException: Null-value for required field: `

This happens even though the contents of the arrays are not null (in fact they cannot be null, since parquet-avro requires `spark.hadoop.parquet.avro.write-old-list-structure` = `false` to write `null`s).
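For context, the Parquet format spec defines backward-compatibility rules that a reader should apply to tell the legacy 2-level list layout apart from the modern 3-level one; a reader that unconditionally assumes the 3-level layout will look for a nested `element` field that the legacy file does not have. Below is a minimal sketch of those rules - not Hudi's actual code; the `Field` class is a made-up stand-in for a Parquet schema node:

```python
# Sketch of the parquet-format backward-compatibility rules for LIST types:
# a reader must inspect the repeated child of a LIST group to decide whether
# the file uses the legacy 2-level layout or the modern 3-level layout.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    repetition: str                      # "required" | "optional" | "repeated"
    type: str = "group"                  # a primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)


def is_legacy_two_level(list_group: Field) -> bool:
    """Decide the layout of a group annotated as LIST, per parquet-format rules."""
    repeated = list_group.children[0]    # the single repeated child
    if repeated.type != "group":
        return True                      # repeated primitive -> legacy 2-level
    if len(repeated.children) > 1:
        return True                      # repeated group with several fields -> 2-level
    if repeated.name == "array" or repeated.name.endswith("_tuple"):
        return True                      # names reserved for the legacy layout
    return False                         # otherwise: modern 3-level layout


# The old-style file from this report: repeated int64 array
old_style = Field("internal_list", "required", "group",
                  [Field("array", "repeated", "int64")])

# The new-style file: repeated group list { required int64 element; }
new_style = Field("internal_list", "required", "group",
                  [Field("list", "repeated", "group",
                         [Field("element", "required", "int64")])])

print(is_legacy_two_level(old_style))   # True  -> the repeated field is the element
print(is_legacy_two_level(new_style))   # False -> the element is nested under "list"
```

A reader that skips these checks and treats the legacy file as 3-level finds no `element` value at all, which matches the "Null-value for required field" failure above.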
### Expected behavior
Taken from Hudi 0.12.1 (not sure what exactly broke this):
- If I have a file with the 2-level structure and an update arrives (no matter whether the arrays contain nulls or not - both produce the same result) with `spark.hadoop.parquet.avro.write-old-list-structure` = `false` - it is rewritten into the 3-level structure (this fails in 0.14.1).
- If I have the 3-level structure with nulls and an update arrives (with or without nulls) - it is read and written correctly.
A simple reproduction of the issue can be found here:
[https://github.com/VitoMakarevich/hudi-issue-014]
Most likely, the problem appeared after Hudi changed how values from the Hadoop conf propagate into the Reader instance (they were likely not propagated before).
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-7874
- Type: Bug