[SUPPORT] Hudi: how to upsert non-null array data to an existing column with an array of nulls (optional binary)? java.lang.ClassCastException: optional binary element (UTF8) is not a group #5701
Comments
@nsivabalan @n3nash @umehrot2 Kindly suggest what should be done in this use case; we have been stuck on this issue for a month now.
This resulted in the below schema for this column in the Hudi table:

```
root
 |-- NWDepStatus: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

The new incoming record schema for the same column is as below. This record is meant to be saved via upsert with value:

```json
{
"id": 1,
"NWDepCount": 0,
"NWDepStatus": [
{
"ClassId": "metric.DepStatus",
"Id": 21,
"Name": "MyNW_3",
"ObjectType": "metric.DepStatus",
"Status": "NA"
},
{
"ClassId": "metric.DepStatus",
"Id": 22,
"Name": "MyNW2",
"ObjectType": "metric.DepStatus",
"Status": "NA"
}
]
}
```

This results in the schema below, which is different from the existing schema saved in the Hudi table:

```
root
 |-- NWDepStatus: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ClassId: string (nullable = true)
 |    |    |-- Id: long (nullable = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- ObjectType: string (nullable = true)
 |    |    |-- Status: string (nullable = true)
```

I even tried altering the existing column schema before writing the new records, making it match the new records' schema (a non-empty array schema, retaining the nulls in it), but with no success (see the sketch after the sample values below). The existing column values look like:

```
+------------------------+
|NWDepStatus             |
+------------------------+
|null                    |
|null                    |
+------------------------+
```
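For reference, that schema-altering attempt looked roughly like the sketch below: it replaces the null-only `array<string>` column with a typed null matching the incoming records' element struct. This is only an approximation of what we tried; `existing_df` and the field list are illustrative, not the exact job code.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

# Target type mirroring the incoming records' element struct.
dep_status_type = ArrayType(StructType([
    StructField("ClassId", StringType(), True),
    StructField("Id", LongType(), True),
    StructField("Name", StringType(), True),
    StructField("ObjectType", StringType(), True),
    StructField("Status", StringType(), True),
]), containsNull=True)

# Replace the null-only array<string> column with a typed null so the
# existing rows' schema lines up with the incoming records.
existing_df = existing_df.withColumn(
    "NWDepStatus", F.lit(None).cast(dep_status_type))
```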
The configs are as follows:

```python
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'MdTimestamp',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.table.name': 'hudi-table',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': args['database_name'],
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.table': 'hudi' + prefix.replace("/", "_").lower(),
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://' + args['curated_bucket'] + '/hudi' + prefix,
    'hoodie.parquet.small.file.limit': '134217728'  # 1,024 * 1,024 * 128 = 134,217,728 (128 MB)
}
unpartitionDataConfig = {
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
}
initLoadConfig = {
    'hoodie.bulkinsert.shuffle.parallelism': 68,
    'hoodie.datasource.write.operation': 'bulk_insert'
}
incrementalConfig = {
    'hoodie.upsert.shuffle.parallelism': 68,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 10
}
```

I checked issue #2265 and the fix #2927, but even with the configs given there as the solution it is not working and fails with the same error.
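For completeness, the dicts above are merged and handed to the writer along these lines. This is a rough sketch only: the actual job passes these through the AWS Glue connector, so the plain Spark writer call below is an approximation, and `df` is an assumed name for the incoming batch.

```python
# Rough sketch of how the config dicts above are combined for an upsert.
combined_conf = {**commonConfig, **unpartitionDataConfig, **incrementalConfig}

(df.write
   .format('org.apache.hudi')
   .options(**combined_conf)
   .mode('append')
   .save(combined_conf['path']))
```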
Small update: I tried to drop the column with nulls during the upsert.
Please suggest/correct what I am doing wrong here.
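In case it helps to clarify, the drop attempt was along these lines (a sketch of one way to drop all-null columns before the write; `batch_df` is an assumed name):

```python
from pyspark.sql import functions as F

# F.count() counts only non-null values, so a zero count means the
# column holds nothing but nulls in this batch.
non_null_counts = batch_df.select(
    [F.count(F.col(c)).alias(c) for c in batch_df.columns]).first()

all_null_cols = [c for c in batch_df.columns if non_null_counts[c] == 0]
batch_df = batch_df.drop(*all_null_cols)
```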
Are you able to try Spark 3.2, which has a major Parquet upgrade to 1.12?
Thanks for getting back, @xushiyan. AWS Glue supports Spark 3.1, but I suppose with

@gtwuser For the first bulk insert, are values of
I have opened a similar issue throwing the same exception during an update; it has a spark-shell example in it.
@phillycoder It looks like we have a workaround in the other issue.
Looks like it is an open issue of the Parquet format itself that has not yet been resolved: https://issues.apache.org/jira/browse/PARQUET-1681
We need to upgrade parquet-avro once the above issues are fixed.
Describe the problem you faced
We are trying to update an existing column `col1` which has the schema of an empty array, by default inferred as `array<string>`. Perhaps the issue is that the new incoming records have data in this existing column `col1`, i.e. an array of non-null values. While upserting, the write throws the error `optional binary element (UTF8) is not a group`. We don't have any predefined schema for these records; everything is inferred by default, so during the initial insert this column `col1` gets the `array<string>` schema. But since the new incoming records carry non-null, non-empty array values, upserting them into this column fails the upsert operation.

In short, this issue comes up whenever we try to update the schema of a column from `array<string>` to `array<struct<...>>` or `array<array<...>>`. Kindly let me know if there is a workaround or solution for it.
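The inference gap is easy to see in isolation. Below is a minimal sketch (the JSON payloads are illustrative; `col1` is the column discussed above): an empty array gives Spark nothing to inspect, so the element type defaults to string, while a populated batch infers a struct element.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A batch containing only an empty array: with no element to inspect,
# col1 is inferred as array<string>.
first = spark.read.json(spark.sparkContext.parallelize(
    ['{"id": 1, "col1": []}']))
first.printSchema()
# |-- col1: array (nullable = true)
# |    |-- element: string (containsNull = true)

# A later batch with populated values: col1 is inferred as
# array<struct<...>>, which no longer matches the table's schema.
second = spark.read.json(spark.sparkContext.parallelize(
    ['{"id": 1, "col1": [{"Id": 21, "Status": "NA"}]}']))
second.printSchema()
# |-- col1: array (nullable = true)
# |    |-- element: struct (containsNull = true)
```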
To Reproduce
Steps to reproduce the behavior:
1. Insert records which have a column whose only values are empty arrays.
2. Upsert records with non-empty array values for the same column.

Expected behavior

The expected behaviour would be to upgrade the schema of columns which had the default schema for an empty array (i.e. `array<string>`) to the schema of the newly received non-empty array values. That is, upgrade an array-based column from the default `array<string>` to the more complex schema of the data which the non-empty array holds.
Environment Description
AWS Glue 3.0
Hudi version : 0.10.1
Spark version : 3.1.2
Running on Docker? (yes/no) : no, we are running Glue jobs using PySpark
Stacktrace

```
java.lang.ClassCastException: optional binary element (UTF8) is not a group
```