AWS: Make the error message of ALTER TABLE RENAME TO compatible with the error for Glue Data Catalog #3448
…Data Catalog compatible with the SparkSQL error message like 'UnsupportedOperationException'
Question: If you add If that resolves the issue, I'd personally prefer that as that is already part of the Iceberg library and it's one less difference to have to think about (it's needed to use Hive at all anyway so I think that might be the issue). Relevant docs: https://iceberg.apache.org/hive/#table-property-configuration Thanks for the code link to where you found those. I'm not well versed enough in Glue to make a comment on this, but I'd check if |
@kbendick thanks for the comment. I somehow overlooked this PR, my bad. I believe we have merged the correct fix you made in #3468, so we can close this one, @tomtongue. For anyone with the same line of thought: there have been a few attempts to add this in the past. However, as you can see, the input and output formats and SerDes set here are really just "hacks" to make Hive happy. They are not the correct information for the table, and adding them would just mislead users. By not setting these values, we are asking people to follow the right way to use an Iceberg catalog in the Hive engine; Glue here is just one implementation of the Iceberg catalog interface, similar to any other implementation out there. I don't want to create this backdoor for GlueCatalog alone simply on the argument that Glue is "Hive compatible", as in fact it is not Hive 2 and 3 compatible, especially in the write path.
Thanks for your suggestions and comments on this, @kbendick @jackye1995. The reason for the change is that renaming a Glue Data Catalog table with SparkSQL itself (not through Iceberg renaming) needs at least the input/output format and serdelib in the StorageDescriptor part. The Glue Data Catalog doesn't support renaming a table, so if we try renaming a table whose input/output format and serdelib are filled in, we get the java.lang.UnsupportedOperationException error. I totally agree with your comments on this change: the problem is not critical, and the change might not be flexible for the future. However, the current error message is also a bit misleading, and I will think about a better solution. Closing this. Thanks for the kind discussion.
Thank you for your contribution @tomtongue, even if ultimately it wasn’t the right direction, as @jackye1995 mentioned. I agree with Jack’s assessment that we should stick to using the Iceberg catalogs as they’re intended and avoid any hacks that might confuse other users. While the error message for Glue users would arguably be a bit clearer (though still an error), people reading the Iceberg code and people looking to contribute to Iceberg could get very confused by this unnecessary SerDe information. As Jack mentioned, it might also confuse Glue users into thinking that Glue is fully Hive compatible (which I can’t speak to personally, so I defer to Jack’s expertise in this area). Thanks for taking the time to submit a patch and for your overall interest in Iceberg. While this patch wasn’t right for the project, we’d absolutely love to have more contributions from you in the future!
Changes
Make the error message of ALTER TABLE RENAME TO for the Glue Data Catalog compatible with the java.lang.UnsupportedOperationException error shown by Spark 3.1.1/Glue 3.0, by adding the input/output format and SerdeLib in the create table operation.
Current situation
Currently, when running ALTER TABLE RENAME TO for an Iceberg table in the Glue Data Catalog via SparkSQL in Glue 3.0 (Spark 3.1.1), the following error message is shown in the logs.
Running script in Glue:
This error occurs because the table has no input format: the input format is not added to the Glue Data Catalog table when an Iceberg table is created by the SparkSQL DDL.
Similar errors occur if there's no output format or serdelib in the Glue Data Catalog.
These error messages are not the expected ones.
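To make the cause concrete, here is a minimal sketch that reports which of the three fields are missing from a table entry. It assumes the entry is a plain dict shaped like the "Table" object of a Glue GetTable response (StorageDescriptor.InputFormat, StorageDescriptor.OutputFormat, StorageDescriptor.SerdeInfo.SerializationLibrary). The function name, table name, and bucket are hypothetical, and no AWS call is made.

```python
from typing import Any, Dict, List


def missing_storage_fields(table: Dict[str, Any]) -> List[str]:
    """Return the storage descriptor fields that are absent or empty.

    `table` is assumed to look like the "Table" object of a Glue
    GetTable response; this helper is illustrative, not part of Iceberg.
    """
    sd = table.get("StorageDescriptor") or {}
    serde = sd.get("SerdeInfo") or {}
    checks = {
        "InputFormat": sd.get("InputFormat"),
        "OutputFormat": sd.get("OutputFormat"),
        "SerdeInfo.SerializationLibrary": serde.get("SerializationLibrary"),
    }
    # Keep only the fields whose value is missing or empty.
    return [name for name, value in checks.items() if not value]


# Hypothetical entry resembling a table created without format information.
iceberg_like_table = {
    "Name": "example_table",
    "StorageDescriptor": {"Location": "s3://example-bucket/example_table"},
    "Parameters": {"table_type": "ICEBERG"},
}

print(missing_storage_fields(iceberg_like_table))
# → ['InputFormat', 'OutputFormat', 'SerdeInfo.SerializationLibrary']
```

A table for which this list is empty corresponds to the "correctly filled in" case described under Expected result.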
Expected result
If the table information is correctly filled in (for example, by a Glue Crawler), we get the following error message. (As you know, the Glue Data Catalog currently doesn't support ALTER TABLE RENAME TO via SparkSQL. I understand Iceberg can handle this query by dropping and re-creating the table.)
After changes
After adding the input/output format and serdelib to the Iceberg table in the Glue Data Catalog, the error message is shown as follows:
The result of the GetTable API is:
Why were these input/output formats and serdelib selected?
The values for the input/output format and serdelib are taken from https://github.com/apache/iceberg/blob/master/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L404, because these values are used for the Glue Catalog. (If I misunderstand, please correct me.)
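As a rough illustration of what the proposal amounts to, the sketch below fills placeholder format values into a storage descriptor dict. The three class names are an assumption: they are what I recall the non-Hive-engine branch of HiveTableOperations setting, and they should be verified against the linked source before being relied on. The helper name is hypothetical, and this is not the PR's actual Java change.

```python
import copy
from typing import Any, Dict

# Assumption: placeholder values recalled from HiveTableOperations
# (non-Hive-engine branch); verify against the linked source.
PLACEHOLDER_INPUT_FORMAT = "org.apache.hadoop.mapred.FileInputFormat"
PLACEHOLDER_OUTPUT_FORMAT = "org.apache.hadoop.mapred.FileOutputFormat"
PLACEHOLDER_SERDE_LIB = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"


def with_placeholder_formats(storage_descriptor: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the descriptor with absent format keys filled in.

    Keys that are already present are left untouched; the input dict
    is not modified.
    """
    sd = copy.deepcopy(storage_descriptor)
    sd.setdefault("InputFormat", PLACEHOLDER_INPUT_FORMAT)
    sd.setdefault("OutputFormat", PLACEHOLDER_OUTPUT_FORMAT)
    serde = sd.setdefault("SerdeInfo", {})
    serde.setdefault("SerializationLibrary", PLACEHOLDER_SERDE_LIB)
    return sd


# Hypothetical descriptor that only carries a location.
original = {"Location": "s3://example-bucket/example_table"}
filled = with_placeholder_formats(original)
print(filled["InputFormat"])
print(filled["SerdeInfo"]["SerializationLibrary"])
```

These values do not describe how Iceberg actually reads or writes the table's data, which is the concern the reviewers raised in the conversation.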
Best regards,
Tom