
Hudi 0.5.2 inability save complex type with nullable = true [SUPPORT] #1550

@badion


Currently we are working with Hudi 0.5.0 and AWS Glue. Everything works fine for .parquet and COW mode, with complex types in the data and different nullable options.

After switching to Hudi 0.5.2, we started facing issues related to:

#1406

Spark application fails while writing Dataframe into Hudi table when using complex types like:

{
   "city":[
      {
         "name":"some_name",
         "index":"some_index"
      }
   ]
}

with nullable = true set on these fields. Up to the point of saving, everything is fine, and we can see the complete DataFrame:

+----------------------------+
|city                        |
|[[some_name, some_index]]   |
+----------------------------+
root
 |-- city: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- index: string (nullable = true)

Note that all simple types save into the Hudi table fine, as do complex types with nullable = false.
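Since complex types with nullable = false are reported to save correctly, one possible stopgap is to rewrite the schema so that only the complex-typed fields and elements are marked non-nullable. The following is a minimal sketch, not a verified fix: it works on the dict form of a Spark schema (the shape returned by `StructType.jsonValue()`), the helper names `harden_type` and `harden_field` are hypothetical, and it is only safe if the data really contains no nulls in those positions.

```python
# Sketch only: marks struct/array/map fields and elements as non-nullable
# in a Spark schema expressed as a JSON-style dict. Helper names are
# hypothetical and not part of Spark or Hudi.
COMPLEX_TYPES = {"struct", "array", "map"}


def is_complex(type_json):
    # Primitive types are plain strings such as "string"; complex ones are dicts.
    return isinstance(type_json, dict) and type_json.get("type") in COMPLEX_TYPES


def harden_type(type_json):
    """Return a copy of a type with nested complex members made non-nullable."""
    if not is_complex(type_json):
        return type_json
    out = dict(type_json)
    kind = out["type"]
    if kind == "struct":
        out["fields"] = [harden_field(f) for f in out["fields"]]
    elif kind == "array":
        out["elementType"] = harden_type(out["elementType"])
        if is_complex(out["elementType"]):
            out["containsNull"] = False
    elif kind == "map":
        out["valueType"] = harden_type(out["valueType"])
        if is_complex(out["valueType"]):
            out["valueContainsNull"] = False
    return out


def harden_field(field_json):
    """Return a copy of a struct field; complex-typed fields become non-nullable."""
    out = dict(field_json)
    out["type"] = harden_type(out["type"])
    if is_complex(out["type"]):
        out["nullable"] = False
    return out


# The schema from this report, in dict form.
city_schema = {
    "type": "struct",
    "fields": [{
        "name": "city",
        "nullable": True,
        "metadata": {},
        "type": {
            "type": "array",
            "containsNull": True,
            "elementType": {
                "type": "struct",
                "fields": [
                    {"name": "name", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "index", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
        },
    }],
}

hardened = harden_type(city_schema)
```

The hardened dict could then be turned back into a schema with `StructType.fromJson(hardened)` and applied to the existing rows, for example via `spark.createDataFrame(df.rdd, new_schema)`. Whether this avoids the failure on 0.5.2 has not been confirmed here; it only mirrors the nullable = false case described above.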

Steps to reproduce the behavior:

from pathlib import Path

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import ArrayType, StringType, StructField, StructType


def write_table(df, options, mode, output_dir):
    # Write the DataFrame as a Hudi table at output_dir.
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(output_dir)


spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4") \
    .appName('nested_type_hudi') \
    .enableHiveSupport() \
    .getOrCreate()

PROJECT_PATH = str(Path(__file__).parent)

input_data = """{"city":[{"name":"some_name","index":"some_index"}]}"""

# Array of structs; the array, its elements, and every leaf field are nullable.
schema = StructType([
        StructField('city', ArrayType(StructType([StructField('name', StringType(), True),
                                                  StructField('index', StringType(), True)]), True), True)
    ])

options = {
        'hoodie.table.name': "nested_hierarchy_example",
        'hoodie.datasource.write.precombine.field': "object_ts",
        'hoodie.datasource.write.recordkey.field': "recordkey"
    }

# Add the precombine and record key columns that the Hudi options refer to.
nested_hierarchy_df = spark.read.schema(schema).json(spark.sparkContext.parallelize([input_data])) \
        .withColumn('object_ts', lit(123)) \
        .withColumn('recordkey', lit('abc'))

write_table(nested_hierarchy_df, options, 'append', f'file://{PROJECT_PATH}/test_data/nested_output')

Expected behavior
The Hudi table should be saved successfully in Parquet format with complex-type fields that have nullable = true. Hudi 0.5.0 works fine with all varieties of complex types and nullable fields.

Local/AWS Glue 1.0:

  • Language: Python 3.7.5
  • Hudi version : 0.5.2
  • Spark version : 2.4.4(tried locally)/2.4.3(tried on AWS Glue)
  • Hive version : Not applicable
  • Hadoop version : 2.8.5
  • Storage (HDFS/S3/GCS..) : S3
  • Running on Docker? (yes/no) : no

Stacktrace

java.io.IOException: Could not create payload for class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
	at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:125)
	at org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:178)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:102)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:99)

...

Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"city","namespace":"hoodie.nested_hierarchy_example.nested_hierarchy_example_record","fields":[{"name":"name","type":["string","null"]},{"name":"index","type":["string","null"]}]},"null"]: {"name": "some_name", "index": "some_index"}
	at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
	at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)

...

Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class 
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
	at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:122)
	... 28 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
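One way to read the UnresolvedUnionException above: the datum printed in the message has exactly the fields of the record branch, yet it is still rejected. Avro picks the union branch for a record by the full name (namespace plus name) of the datum's own schema, so a record built against a schema with a different namespace would not match. The sketch below is a simplified Python model of that lookup, not Avro's code, and the mismatched namespace is a made-up value for illustration; the trace does not show the datum's actual schema.

```python
# Simplified model of name-based union resolution for records.
# This is an illustration, not Avro's implementation.
def full_name(schema):
    """Join namespace and name the way Avro forms a record's full name."""
    namespace = schema.get("namespace")
    return f"{namespace}.{schema['name']}" if namespace else schema["name"]


def resolve_union(union, datum_schema):
    """Return the index of the record branch whose full name matches the datum's schema."""
    wanted = full_name(datum_schema)
    for index, branch in enumerate(union):
        if (isinstance(branch, dict)
                and branch.get("type") == "record"
                and full_name(branch) == wanted):
            return index
    raise LookupError(f"Not in union: {wanted}")


# Union taken from the exception message (record fields omitted for brevity).
union = [
    {
        "type": "record",
        "name": "city",
        "namespace": "hoodie.nested_hierarchy_example.nested_hierarchy_example_record",
    },
    "null",
]

# Same record name and namespace as the union branch: resolves.
matching = {
    "type": "record",
    "name": "city",
    "namespace": "hoodie.nested_hierarchy_example.nested_hierarchy_example_record",
}

# Hypothetical: same record name under a different namespace (assumed value).
mismatched = {
    "type": "record",
    "name": "city",
    "namespace": "hoodie.nested_hierarchy_example",
}
```

Calling `resolve_union(union, matching)` finds the record branch, while `resolve_union(union, mismatched)` raises, even though both describe a record named `city`. If this reading applies, the interesting question is which schema the nested record was created with before it was written.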

Is this already a known issue for Hudi versions after 0.5.0?
Is there a workaround that would allow us to upgrade to 0.5.2?
