Description
Currently we are working with Hudi 0.5.0 and AWS Glue; everything works fine for .parquet and COW mode, with complex types in the data and different nullable options.
After switching to Hudi 0.5.2, we started facing the following issue:
the Spark application fails while writing a DataFrame into a Hudi table when using complex types like:
{
"city":[
{
"name":"some_name",
"index":"some_index"
}
]
}
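For clarity, the payload above parses to a one-element array of structs, and neither field is actually null (a quick pure-Python check):

```python
import json

input_data = '{"city":[{"name":"some_name","index":"some_index"}]}'
record = json.loads(input_data)

# One city entry with both string fields present:
assert record["city"] == [{"name": "some_name", "index": "some_index"}]
```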
with nullable = true set for these fields. Up to the moment of saving, everything is fine and we are able to see the complete DataFrame:
+---------------------------+
|city                       |
+---------------------------+
|[[some_name, some_index]]  |
+---------------------------+
root
|-- city: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- index: string (nullable = true)
Note that all simple types save into the Hudi table without issue, as do these complex types when nullable = false.
Steps to reproduce the behavior:
from pathlib import Path

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4") \
    .appName('nested_type_hudi') \
    .enableHiveSupport() \
    .getOrCreate()

PROJECT_PATH = str(Path(__file__).parent)

input_data = """{"city":[{"name":"some_name","index":"some_index"}]}"""

schema = StructType([
    StructField('city', ArrayType(StructType([StructField('name', StringType(), True),
                                              StructField('index', StringType(), True)]), True), True)
])

options = {
    'hoodie.table.name': "nested_hierarchy_example",
    'hoodie.datasource.write.precombine.field': "object_ts",
    'hoodie.datasource.write.recordkey.field': "recordkey"
}


def write_table(df, options, mode, output_dir):
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(output_dir)


nested_hierarchy_df = spark.read.schema(schema).json(spark.sparkContext.parallelize([input_data])) \
    .withColumn('object_ts', lit(123)) \
    .withColumn('recordkey', lit('abc'))

write_table(nested_hierarchy_df, options, 'append', f'file://{PROJECT_PATH}/test_data/nested_output')
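Since complex types with nullable = false do save correctly, one possible workaround (a sketch only, not verified against Hudi 0.5.2) is to re-apply the same schema with nullability switched off before writing, e.g. via spark.createDataFrame(nested_hierarchy_df.rdd, StructType.fromJson(strict)). The helper below operates on Spark's JSON schema representation (the dict returned by DataFrame.schema.jsonValue()) in pure Python; the helper name make_non_nullable is ours.

```python
import json


def make_non_nullable(node):
    """Recursively clear nullable/containsNull flags in a Spark schema
    expressed as the dict produced by StructType.jsonValue()."""
    if isinstance(node, dict):
        return {
            key: (False if key in ('nullable', 'containsNull')
                  else make_non_nullable(value))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [make_non_nullable(item) for item in node]
    return node


# The schema from the reproduction above, as Spark serializes it:
schema_json = {
    "type": "struct",
    "fields": [{
        "name": "city",
        "type": {
            "type": "array",
            "elementType": {
                "type": "struct",
                "fields": [
                    {"name": "name", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "index", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
            "containsNull": True,
        },
        "nullable": True,
        "metadata": {},
    }],
}

strict = make_non_nullable(schema_json)
```

The resulting dict can be turned back into a schema with StructType.fromJson(strict); whether the rewritten DataFrame then writes cleanly through Hudi 0.5.2 is exactly what we have not been able to confirm, so treat this as a starting point only.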
Expected behavior
The Hudi table should be saved successfully in parquet format with complex type fields that have nullable = true. Hudi 0.5.0 works fine with the full variety of complex types and nullable fields.
Environment description (Local / AWS Glue 1.0):
- Language: Python 3.7.5
- Hudi version : 0.5.2
- Spark version : 2.4.4(tried locally)/2.4.3(tried on AWS Glue)
- Hive version : Not applicable
- Hadoop version : 2.8.5
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Stacktrace
java.io.IOException: Could not create payload for class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:125)
at org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:178)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:102)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:99)
...
Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"city","namespace":"hoodie.nested_hierarchy_example.nested_hierarchy_example_record","fields":[{"name":"name","type":["string","null"]},{"name":"index","type":["string","null"]}]},"null"]: {"name": "some_name", "index": "some_index"}
at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
...
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
at org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:122)
... 28 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
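One detail worth noting from the UnresolvedUnionException: the union places the generated record branch first and "null" second, and the failing datum is printed as a plain map rather than an Avro record, which suggests the converted value is not matching either branch. The snippet below just parses the union from the log verbatim to make the branch order visible (pure Python, illustration only):

```python
import json

# The union type from the UnresolvedUnionException above, verbatim:
union = json.loads('''[{"type":"record","name":"city",
  "namespace":"hoodie.nested_hierarchy_example.nested_hierarchy_example_record",
  "fields":[{"name":"name","type":["string","null"]},
            {"name":"index","type":["string","null"]}]},"null"]''')

# Branch 0 is the generated record type, branch 1 is "null"; the datum
# {"name": "some_name", "index": "some_index"} resolved to neither.
```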
Is this already a known issue for Hudi versions greater than 0.5.0?
Is there a workaround that would allow us to upgrade to 0.5.2?