[Spark] Add a new option to workaround incorrect schema automatically created in external catalog #4431
Conversation
@harperjiang Could you please review this PR? This is a continuation of #2310. The following are replies to your existing comments:

We need to run ALTER TABLE under any of those case conditions, so we need to keep the current placement.

This was great feedback. I reused the existing config and logic instead of introducing similar logic.

I compared the results of those test cases and noticed that the existing test cases did not capture the original issue. I tried to implement unit/integration test cases, but it was not easy, so I concluded it was better to add a test scenario document rather than build overly complex test cases that could introduce deep dependencies. Hope this works.
Tagging @dennyglee for visibility.

Thanks @moomindani - including @roeap for review - thanks!

Hi @roeap, could you please take a look at this?

Hey @moomindani - could you take a look at the conflicts for this? @roeap and I will find time to review this next week as well.

@moomindani @dennyglee - taking a look now!

Resolved the conflict with the other refactoring commit. Moved the code change to CreateDeltaTableLike.scala. Tested the new implementation and verified the same result.

@prakharjain09 could you please take a look at this PR?

@moomindani Could you clarify the current plan: is #2310 still targeted to be merged and this PR is a follow-up, or is this PR intended as a replacement of #2310?

@LukasRupprecht #2310 is not being worked on today; this PR is its replacement. We want to unblock these use cases by merging this PR instead.
```scala
if (conf.getConf(DeltaSQLConf.FORCE_ALTER_TABLE_DATA_SCHEMA)) {
  spark.sessionState.catalog.alterTableDataSchema(cleaned.identifier, cleaned.schema)
}
```
As @harperjiang mentioned on the previous PR, this shouldn't be necessary anymore if DELTA_UPDATE_CATALOG_ENABLED is set.
@moomindani I saw your comment on the previous PR that you've tried setting this config but it didn't work. Could you share the exact steps that you used to test whether the schema is correctly written to the catalog? If this is not working, then something is wrong (either with the test setup or the code) and we should fix it.
@LukasRupprecht Here are the details. Is this enough? Three different people tried the approach described in the previous PR, but no one succeeded, so I do not think this is a config issue on my side.
Environment
- AWS
- us-east-1 Region
- AWS Glue for Apache Spark (Glue version 5.0 / Apache Spark 3.5.4)
- Glue Data Catalog enabled (in the Glue job console, `Use Glue data catalog as the Hive metastore` is selected)
- Delta Lake 3.3.0 enabled (in the Glue job console, the job parameter `--datalake-formats delta` is provided)
Spark conf
```
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
--conf spark.databricks.delta.catalog.update.enabled=true
```
Script
```python
additional_options = {
    "path": "s3://<path_to_data>/database/table/",
    "write.parquet.compression-codec": "snappy",
}
df.write.format("delta").saveAsTable("database.table")
```
Steps
- Create a Glue job with the above configurations
- Run the job
- Check `database.table` in the Glue Data Catalog to see whether the schema turned out to be `col (array)` (one way to check this from within a session is sketched after this list)
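For reference, a minimal Scala sketch of that check, assuming a session (e.g. spark-shell) configured the same way as the job above and the placeholder `database.table` names from the script; it reads the table definition back through the session catalog, so the printed schema is what the metastore (here, the Glue Data Catalog) stores rather than what the Delta log contains:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Fetch the table definition as stored in the external catalog.
val stored = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("table", Some("database")))

// On affected setups this prints a single placeholder column such as `col: array<string>`
// instead of the DataFrame's real schema.
println(stored.schema.treeString)
```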
I think we need to figure out why the existing config is not working as expected before we add a new config that does the exact same thing.

To start, could you:
- Add a unit test to DeltaUpdateCatalogSuite that creates a table and checks that the schema is correct in the in-memory catalog used in the test when `DELTA_UPDATE_CATALOG_ENABLED` is set (there should already be tests like this in that suite, but we can create one from scratch just to verify).
- Add the same unit test again, but now with `DELTA_UPDATE_CATALOG_ENABLED` off and `FORCE_ALTER_TABLE_DATA_SCHEMA` on. This should result in the same schema being uploaded to the catalog as in the first unit test.

If the two unit tests indeed work the same, then we know `DELTA_UPDATE_CATALOG_ENABLED` works at least in the tests and we can start debugging things in your setup. (A rough sketch of what such tests could look like follows.)
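A rough sketch of the two requested tests, assuming DeltaUpdateCatalogSuite provides the usual Spark testing helpers (`sql`, `withSQLConf`, `withTable`) and that `FORCE_ALTER_TABLE_DATA_SCHEMA` is the conf entry added by this PR; names and assertions are illustrative only:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.delta.sources.DeltaSQLConf

test("schema is written to the catalog with DELTA_UPDATE_CATALOG_ENABLED") {
  withSQLConf(DeltaSQLConf.DELTA_UPDATE_CATALOG_ENABLED.key -> "true") {
    withTable("t1") {
      sql("CREATE TABLE t1 (id BIGINT, name STRING) USING delta")
      // Read the schema back from the catalog used by the test.
      val stored = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t1"))
      assert(stored.schema.fieldNames.toSeq == Seq("id", "name"))
    }
  }
}

test("schema is written to the catalog with FORCE_ALTER_TABLE_DATA_SCHEMA only") {
  withSQLConf(
      DeltaSQLConf.DELTA_UPDATE_CATALOG_ENABLED.key -> "false",
      DeltaSQLConf.FORCE_ALTER_TABLE_DATA_SCHEMA.key -> "true") {
    withTable("t2") {
      sql("CREATE TABLE t2 (id BIGINT, name STRING) USING delta")
      // Expect the same catalog schema as in the first test.
      val stored = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t2"))
      assert(stored.schema.fieldNames.toSeq == Seq("id", "name"))
    }
  }
}
```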
@LukasRupprecht I have added the two unit tests as you requested. Test 1 succeeded but Test 2 failed.

From this observation, it is clear that the two configs behave differently. This is also visible in the integration test results linked from the PR overview: https://docs.google.com/document/d/e/2PACX-1vRl4lWFyx1ASeYCUoj575fLM0dFKhXM5MekWS__NVnwJJESfMflGS71OKHoFjEoHFjmidLHeEXHoreb/pub.

To make this extremely clear, I pushed the commit including those unit tests (but it failed).

As far as I understand, this issue occurs only with external Hive catalogs (including the Glue Data Catalog), and DELTA_UPDATE_CATALOG_ENABLED is not enough.
Let me trace through what happens when DELTA_UPDATE_CATALOG_ENABLED is used:
Step 1: Delta Sets Schema in CatalogTable
```scala
// In cleanupTableDefinition()
if (conf.getConf(DeltaSQLConf.DELTA_UPDATE_CATALOG_ENABLED)) {
  table.copy(
    schema = truncatedSchema,  // ← Delta sets the schema here
    properties = UpdateCatalog.updatedProperties(snapshot),
    storage = storageProps,    // ← This contains Delta-specific storage info
    tracksPartitionsInCatalog = true)
}
```
Step 2: Spark Tries to Create Table in External Catalog
```scala
// In updateCatalog()
spark.sessionState.catalog.createTable(cleaned, ignoreIfExists = false, validateLocation = false)
```
Step 3: Hive Metastore Processes the Table Definition
When Spark calls createTable() on an external Hive-compatible catalog, here's what happens:
- SerDe Detection: Hive looks at the table's storage format to determine the appropriate SerDe
- Delta Format Check: Hive sees Delta-specific storage properties but doesn't recognize Delta as a valid SerDe
- Fallback Behavior: Since Delta isn't a registered Hive SerDe, Hive falls back to default behavior
- Schema Stripping: The default behavior is to ignore/strip the provided schema and use an empty schema
Step 4: The Schema Gets Lost
```scala
// What Delta sends to Hive:
CatalogTable(
  schema = StructType(Array(
    StructField("id", LongType),
    StructField("name", StringType),
    StructField("value", DoubleType)
  )),
  storage = CatalogStorageFormat(
    locationUri = Some("s3://bucket/path"),
    inputFormat = Some("org.apache.hadoop.mapred.SequenceFileInputFormat"),           // Delta-specific
    outputFormat = Some("org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat"),
    serde = Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")                // Default, not Delta-aware
  )
)

// What Hive actually stores:
CatalogTable(
  schema = StructType(Array()),         // ← EMPTY! Schema was stripped
  storage = CatalogStorageFormat(...)   // Storage info might be modified too
)
```
Why FORCE_ALTER_TABLE_DATA_SCHEMA Works
Your approach bypasses this SerDe validation entirely:
Step 1: Create Table with Empty Schema
```scala
// First, create the table normally (the schema gets stripped, as described above)
spark.sessionState.catalog.createTable(cleaned, ...)
// Result: the table exists in the catalog with an empty schema
```
Step 2: Force Schema Update
```scala
// Then, directly update the schema using Spark's catalog API
if (conf.getConf(DeltaSQLConf.FORCE_ALTER_TABLE_DATA_SCHEMA)) {
  spark.sessionState.catalog.alterTableDataSchema(cleaned.identifier, cleaned.schema)
}
```
Why This Works:
- `alterTableDataSchema()` is a direct catalog metadata operation
- It doesn't go through SerDe validation; it directly updates the catalog's metadata store
- It bypasses Hive's SerDe compatibility checks entirely
- The catalog accepts the schema update because it is a metadata-only operation (see the sketch below)
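A minimal sketch of that metadata-only path, assuming an existing `database.table` entry in a Hive-compatible metastore and an illustrative two-column schema; `SessionCatalog.alterTableDataSchema` rewrites only the data columns stored for the table, leaving the SerDe, location, and other storage details untouched:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val ident = TableIdentifier("table", Some("database"))
val deltaSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// Only the column metadata in the catalog changes; no SerDe validation is involved.
spark.sessionState.catalog.alterTableDataSchema(ident, deltaSchema)
```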
@moomindani thanks for the detailed explanation; we think that makes sense.

The only concern is that this solution is too HMS-specific. Can you add HMS to the conf name (something like HMS_FORCE_UPDATE_DATA_SCHEMA) and provide some context in the doc (e.g., that this is a problem known only to HMS)?
Sure, I will do that and update this PR shortly with a better explanation and config name.
Updated.
…_TABLE_DATA_SCHEMA behavior
- Test 1: Verifies DELTA_UPDATE_CATALOG_ENABLED stores the schema correctly in the catalog
- Test 2: Verifies FORCE_ALTER_TABLE_DATA_SCHEMA behavior when DELTA_UPDATE_CATALOG_ENABLED is disabled
- These tests help validate the different approaches to catalog schema management

Test name exceeded the 100-character limit - split the test name using string concatenation to comply with scalastyle rules
Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Description
This PR is a continuation of #2310.
- This PR resolves issue 1 in #1679.
- The PR changes how the catalog schema is saved in the Hive Metastore.
Technical details:
- Added a new boolean parameter `spark.databricks.delta.schema.forceAlterTableDataSchema` in `DeltaSqlConf`.
- When this parameter is true, in the `updateCatalog` function of the `CreateDeltaTableCommand` class, the table schema is updated after the table is created in the catalog, using the session catalog function `alterTableDataSchema` (a sketch of the config entry follows this list).
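A hedged sketch of how such an entry could look in `DeltaSQLConf`, following the existing `buildConf` pattern there (which prefixes keys with `spark.databricks.delta.`); the doc string and default are assumptions, and since the reviewers asked for an HMS-specific name, the final form may differ:

```scala
val FORCE_ALTER_TABLE_DATA_SCHEMA =
  buildConf("schema.forceAlterTableDataSchema")
    .doc("When true, issue an explicit ALTER TABLE (alterTableDataSchema) after a Delta " +
      "table is created in the catalog, so that external Hive-compatible catalogs store " +
      "the real data schema instead of a placeholder.")
    .booleanConf
    .createWithDefault(false)
```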
How was this patch tested?

See https://docs.google.com/document/d/e/2PACX-1vRl4lWFyx1ASeYCUoj575fLM0dFKhXM5MekWS__NVnwJJESfMflGS71OKHoFjEoHFjmidLHeEXHoreb/pub
Reason not to add unit tests
The new config and logic are designed to trigger ALTER TABLE.
Since we don't have a straightforward way to mock or spy on the catalog in this test environment, additional test cases are not added.
Reason not to add integration tests
The original issue occurs with external catalogs such as the AWS Glue Data Catalog.
Since we do not want to introduce an extra dependency into the integration tests just to test this patch, additional test cases are not added.
While we can't test directly against the AWS Glue Data Catalog in unit tests, I created the doc above to summarize the integration test results.
Reason not to cover the existing test in the previous PR
The previous PR #2310 had two test cases, but I verified that neither captured the original issue.
Does this PR introduce any user-facing changes?
Yes. This PR introduces a new configuration, `spark.databricks.delta.schema.forceAlterTableDataSchema`, to force ALTER TABLE when a Delta table is created in the catalog (a usage sketch follows).
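A hedged usage sketch of the new configuration, assuming a Scala session with the same Delta setup as the reproduction script above; `df` stands for the DataFrame being written and the S3 path is the placeholder from that script:

```scala
// Enable the proposed conf so that updateCatalog issues the extra ALTER TABLE.
spark.conf.set("spark.databricks.delta.schema.forceAlterTableDataSchema", "true")

// Create the table; the catalog entry should now carry the real schema.
df.write
  .format("delta")
  .option("path", "s3://<path_to_data>/database/table/")
  .saveAsTable("database.table")
```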