tomtongue
Contributor

@tomtongue tomtongue commented Nov 2, 2021

Changes

Make the error raised by ALTER TABLE RENAME TO on Glue Data Catalog tables match the java.lang.UnsupportedOperationException that Spark 3.1.1/Glue 3.0 normally reports, by setting the input/output format and SerDe library when a table is created.

Current situation

Currently, running ALTER TABLE RENAME TO through SparkSQL in Glue 3.0 (Spark 3.1.1) against an Iceberg table in the Glue Data Catalog writes the following error message to the logs.

// Example error:
Exception in User Class: org.apache.spark.sql.AnalysisException : org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table iceberg_1635860355. StorageDescriptor#InputFormat cannot be null for table: iceberg_1635860355 (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

Running script in Glue:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.errors.CallSite
import org.apache.spark.{SparkContext, SparkConf}
import scala.collection.JavaConverters._
import java.time.Instant

object GlueApp {
    def main(sysArgs: Array[String]) {
        val sc: SparkContext = new SparkContext()
        val gc: GlueContext = new GlueContext(sc)
        val spark = gc.getSparkSession
     
        // CREATE ICEBERG TABLE
        val table = s"iceberg_${Instant.now.getEpochSecond}"
        val ddl = s"""
            CREATE TABLE glue_catalog.db_name.$table(id bigint, data string) USING iceberg
        """
        spark.sql(ddl)

        // ALTER TABLE RENAME TO by SparkSQL itself: the statement has no glue_catalog
        // prefix, so it is NOT an ALTER TABLE query handled by Iceberg.
        val renamedTable = table + "_rename"
        println("Running ALTER TABLE query.")
        spark.sql(s"ALTER TABLE db_name.$table RENAME TO db_name.$renamedTable")
    }
}

This error occurs because the Glue Data Catalog table has no input format: the SparkSQL DDL does not add one when it creates an Iceberg table.

Similar errors occur if the output format or SerDe library is missing from the Glue Data Catalog table.

These are not the error messages users should see.

Expected result

If the table information is filled in correctly (for example, by a Glue crawler), we get the following error message instead. (As you know, the Glue Data Catalog currently doesn't support ALTER TABLE RENAME TO through SparkSQL. As I understand it, Iceberg can handle this query by dropping and re-creating the table.)

// Expected error message if a table in Glue Data Catalog has input/output format and serdelib.
Exception in User Class: org.apache.spark.sql.AnalysisException : java.lang.UnsupportedOperationException: Table rename is not supported
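
To make the two outcomes concrete, here is a minimal sketch of the behavior described above. It is a hypothetical model, not Glue or Spark code: the StorageDescriptor case class, the RenameModel object, and the field names used in the messages for the output format and SerDe library are assumptions for illustration. Only the InputFormat message and the "Table rename is not supported" message come from the logs.

```scala
// Hypothetical model of the storage descriptor fields discussed in this PR.
final case class StorageDescriptor(
    inputFormat: Option[String],
    outputFormat: Option[String],
    serdeLib: Option[String]
)

object RenameModel {
  // Returns the error a SparkSQL (non-Iceberg) rename would surface.
  // Assumption: the first missing field is the one reported.
  def renameError(table: String, sd: StorageDescriptor): String = {
    val missing = List(
      "InputFormat" -> sd.inputFormat,
      "OutputFormat" -> sd.outputFormat, // message wording assumed by analogy
      "SerdeInfo" -> sd.serdeLib         // message wording assumed by analogy
    ).collectFirst { case (field, None) => field }

    missing match {
      case Some(field) =>
        s"Unable to fetch table $table. StorageDescriptor#$field cannot be null for table: $table"
      case None =>
        // Descriptor is complete, so the rename itself is rejected.
        "Table rename is not supported"
    }
  }
}
```

With an empty descriptor the model reports the misleading InputFormat error; with all three fields set it reports the rename error that users would expect.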

After changes

After adding the input/output format and SerDe library to the Iceberg table in the Glue Data Catalog, the error message is as follows:

Exception in User Class: org.apache.spark.sql.AnalysisException : java.lang.UnsupportedOperationException: Table rename is not supported

The GetTable API then returns:

{
    "Table": {
        "Name": "iceberg_1635861451",
        "DatabaseName": "db_name",
        "CreateTime": 1635861458.0,
        "UpdateTime": 1635861458.0,
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "id",
                    "Type": "bigint",
                    "Parameters": {
                        "iceberg.field.id": "1",
                        "iceberg.field.optional": "true",
                        "iceberg.field.type.string": "bigint",
                        "iceberg.field.type.typeid": "LONG",
                        "iceberg.field.usage": "schema-column"
                    }
                },
                {
                    "Name": "data",
                    "Type": "string",
                    "Parameters": {
                        "iceberg.field.id": "2",
                        "iceberg.field.optional": "true",
                        "iceberg.field.type.string": "string",
                        "iceberg.field.type.typeid": "STRING",
                        "iceberg.field.usage": "schema-column"
                    }
                }
            ],
            "Location": "s3://bucket/db_name.db/iceberg_1635861451",
            "InputFormat": "org.apache.hadoop.mapred.FileInputFormat",
            "OutputFormat": "org.apache.hadoop.mapred.FileOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            },
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://bucket/db_name.db/iceberg_1635861451/metadata/00000-af1973c4-44f8-4a98-95a1-457b309a4f9d.metadata.json",
            "table_type": "ICEBERG"
        },
        "CreatedBy": "arn:aws:sts::account_id:assumed-role/role_name",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "account_id",
        "IsRowFilteringEnabled": false
    }
}

Why were these input/output formats and SerDe library selected?

The values for the input/output format and SerDe library are taken from https://github.com/apache/iceberg/blob/master/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L404, because I believe these values are the ones used for the Glue Catalog. (If I have misunderstood, please correct me.)
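
As a rough sketch of what the change fills in, the three class names below are copied from the GetTable output above. The IcebergGlueDefaults object and its withDefaults helper are hypothetical names used only for illustration; they are not part of Iceberg or the AWS SDK, and the descriptor is modeled as a plain Map rather than a real Glue type.

```scala
object IcebergGlueDefaults {
  // Values as they appear in the GetTable output in this PR description.
  val InputFormat: String = "org.apache.hadoop.mapred.FileInputFormat"
  val OutputFormat: String = "org.apache.hadoop.mapred.FileOutputFormat"
  val SerdeLib: String = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"

  // Hypothetical helper: adds the three defaults to a descriptor modeled as a Map.
  // Keys already present in `sd` win, so existing values are never overwritten.
  def withDefaults(sd: Map[String, String]): Map[String, String] =
    Map(
      "InputFormat" -> InputFormat,
      "OutputFormat" -> OutputFormat,
      "SerializationLibrary" -> SerdeLib
    ) ++ sd
}
```

For example, IcebergGlueDefaults.withDefaults(Map("Location" -> "s3://bucket/db_name.db/tbl")) keeps the location and adds the three entries.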

Best regards,
Tom

…Data Catalog compatible with the SparkSQL error message like 'UnsupportedOperationException'
@github-actions github-actions bot added the AWS label Nov 2, 2021
@tomtongue tomtongue changed the title Make the error message of ALTER TABLE RENAME TO by SparkSQL for Glue … AWS: Make the error message of ALTER TABLE RENAME TO compatible with the error for Glue Data Catalog Nov 2, 2021
@kbendick
Contributor

kbendick commented Nov 5, 2021

Question: If you add engine.hive.enabled = true as a table property via ALTER TABLE glue_catalog.db.tbl_name SET TBLPROPERTIES('engine.hive.enabled'='true'), does that resolve this issue?

If that resolves the issue, I'd personally prefer that as that is already part of the Iceberg library and it's one less difference to have to think about (it's needed to use Hive at all anyway so I think that might be the issue).

Relevant docs: https://iceberg.apache.org/hive/#table-property-configuration

Thanks for the code link to where you found those. I'm not well versed enough in Glue to make a comment on this, but I'd check if engine.hive.enabled works first and foremost.

@jackye1995
Contributor

jackye1995 commented Nov 5, 2021

@kbendick thanks for the comment, I somehow ignored this PR, my bad.

I believe we have merged the correct fix you made in #3468, so we can close this one @tomtongue

For anyone with the same line of thought, there have already been a few attempts to add this in the past. However, as you can see, the input and output formats and SerDe you set here are really just "hacks" to make Hive happy. They are not actually the correct information for the table, and adding them would just mislead users.

By not setting these values, we are trying to ask people to follow the right way to use an Iceberg catalog in the Hive engine, and Glue here is just an implementation of that Iceberg catalog interface similar to any other implementations out there. I don't want to create this backdoor for only GlueCatalog simply based on the argument that Glue is "Hive compatible", as in fact it is not Hive 2 and 3 compatible especially in the write path.

@tomtongue
Contributor Author

tomtongue commented Nov 5, 2021

Thanks for your suggestions and comments on this, @kbendick @jackye1995.
I did try the engine.hive.enabled option, but it didn't work; the rename still failed with org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table iceberg_1636005278. StorageDescriptor#InputFormat cannot be null for table: iceberg_1636005278 (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null). #3468 doesn't fix this either.

This is because renaming a Glue Data Catalog table through SparkSQL itself (not through Iceberg's rename) needs at least the input/output format and SerDe library in the StorageDescriptor.

The Glue Data Catalog doesn't support renaming a table, so renaming a table whose input/output format and SerDe library are filled in throws java.lang.UnsupportedOperationException: Table rename is not supported. I think this is the message users would expect.

I fully agree with your comments on this change: the problem is not critical, and the change might not be flexible enough for the future. However, the current error message is also a bit misleading, so I will think about a better solution.

Closing this. Thanks for your kind discussion.

@tomtongue tomtongue closed this Nov 5, 2021
@kbendick
Contributor

kbendick commented Nov 5, 2021

Thank you for your contribution @tomtongue, even if ultimately it wasn’t the right direction, as @jackye1995 mentioned. I agree with Jack’s assessment that we should stick to using the Iceberg catalogs as they’re intended and avoid any hacks that might confuse other users. While the error message for Glue users would arguably be a bit clearer (though still an error), people reading the Iceberg code and people looking to contribute to Iceberg could get very confused by this unnecessary SerDe information. As Jack mentioned, it might also mislead Glue users into thinking that Glue is fully Hive compatible (I can’t speak to that personally, but I do know we should defer to Jack’s expertise in this area).

Thanks for taking the time to submit a patch and for your overall interest in Iceberg. While this patch wasn’t right for the project, we’d absolutely love to have more contributions from you in the future!

@tomtongue tomtongue deleted the glue-catalog-format-and-serdelib branch October 26, 2023 01:50