[HUDI-6866]When invalidate the table in the spark sql query cache, verify if the… #9425

Merged
zhangyue19921010 merged 1 commit into apache:master from empcl:master_hivesync_db_exists
Oct 23, 2023

@empcl
Contributor

@empcl empcl commented Aug 11, 2023

… hive-async database exists

Change Logs

When invalidating the table in the Spark SQL query cache, first verify that the hive-async database exists.

Impact

When invalidating the table in the Spark SQL query cache, first verify that the hive-async database exists.

Risk level (write none, low medium or high below)

none

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@empcl
Contributor Author

empcl commented Aug 11, 2023

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

- val qualifiedTableName = String.join(".", hoodieConfig.getStringOrDefault(HIVE_DATABASE), name)
- if (spark.catalog.tableExists(qualifiedTableName)) {
+ val syncDb = hoodieConfig.getStringOrDefault(HIVE_DATABASE)
+ val qualifiedTableName = String.join(".", syncDb, name)
Contributor

Reasonable, should we also take the default database name into consideration?

Contributor Author

Hello, let me first explain the background. When I use Spark's Hive sync with Spark 3.1 or below, the step that checks the synced Hive table fails because of the database. After reviewing the Spark source code, I found that in Spark 3.1, tableExists first requires that the database exist:

protected def requireDbExists(db: String): Unit = {
  if (!databaseExists(db)) {
    throw new NoSuchDatabaseException(db)
  }
}
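
To make the failure mode concrete, here is a minimal sketch in plain Scala that does not depend on Spark. `ToyCatalog`, `GuardedCheck`, and `safeTableExists` are hypothetical names invented for this illustration, and the exception class is a local stand-in, not Spark's. The sketch only models the behaviour described above: a table lookup that throws when the database is missing, and a guard that checks the database first.

```scala
// Local stand-in for the exception; NOT Spark's real class.
final class NoSuchDatabaseException(db: String)
  extends RuntimeException(s"Database '$db' not found")

// Hypothetical in-memory catalog: database name -> set of table names.
final class ToyCatalog(tables: Map[String, Set[String]]) {
  def databaseExists(db: String): Boolean = tables.contains(db)

  // Models the requireDbExists behaviour quoted above.
  private def requireDbExists(db: String): Unit =
    if (!databaseExists(db)) throw new NoSuchDatabaseException(db)

  // Throws instead of returning false when the database is missing.
  def tableExists(db: String, table: String): Boolean = {
    requireDbExists(db)
    tables(db).contains(table)
  }
}

object GuardedCheck {
  // && short-circuits, so tableExists is never called for a missing database.
  def safeTableExists(catalog: ToyCatalog, db: String, table: String): Boolean =
    catalog.databaseExists(db) && catalog.tableExists(db, table)

  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog(Map("default" -> Set("t1")))
    println(safeTableExists(catalog, "default", "t1"))    // true
    println(safeTableExists(catalog, "missing_db", "t1")) // false, no exception
    // The unguarded call fails for a missing database.
    println(scala.util.Try(catalog.tableExists("missing_db", "t1")).isFailure) // true
  }
}
```

The guard relies only on the short-circuit evaluation of `&&`, so it stays correct whether or not the underlying table check throws.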

- if (spark.catalog.tableExists(qualifiedTableName)) {
+ val syncDb = hoodieConfig.getStringOrDefault(HIVE_DATABASE)
+ val qualifiedTableName = String.join(".", syncDb, name)
+ if (spark.catalog.databaseExists(syncDb) && spark.catalog.tableExists(qualifiedTableName)) {
Contributor

spark.catalog.tableExists(qualifiedTableName) already includes the dbName when checking the table, so why do we need to check the database first?

Contributor Author

Hello, in Spark 3.1 and earlier versions, checking whether a table exists requires the database to exist. Here, however, the database has not been registered in the catalog in advance.

Contributor

OK, correct me if I'm wrong: this means that if I use default as the dbName for the check, it throws an error, so checking the database first is reasonable. But I have a question: in which scenario is the parameter META_SYNC_DATABASE_NAME not set? I think we need to fix that.

Contributor Author

Okay, I think what you're saying makes sense. Let me look into where the registration of databases and tables should be completed.

Contributor Author

@danny0405 @KnightChess Hello, the invalidate table operation should only be executed when Hive support is enabled via enableHiveSupport(), but sometimes we do not need to call enableHiveSupport(), for example in tests.
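
One way to express that condition is sketched below. It is not the code merged in this PR. `InvalidateIfHive`, `hiveSupportEnabled`, and `refreshIfPresent` are hypothetical names, and the sketch assumes that a session built with enableHiveSupport() reports "hive" for the `spark.sql.catalogImplementation` setting. It needs Spark on the classpath to compile.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration only; not Hudi's implementation.
object InvalidateIfHive {
  // Assumption: enableHiveSupport() sets this conf to "hive";
  // without it the value is "in-memory".
  def hiveSupportEnabled(spark: SparkSession): Boolean =
    spark.conf.getOption("spark.sql.catalogImplementation").contains("hive")

  // Refresh the cached table only when every precondition holds.
  // The database check comes before the table check, so the table
  // lookup is never attempted against a missing database.
  def refreshIfPresent(spark: SparkSession, db: String, table: String): Unit = {
    val qualified = s"$db.$table"
    if (hiveSupportEnabled(spark) &&
        spark.catalog.databaseExists(db) &&
        spark.catalog.tableExists(qualified)) {
      spark.catalog.refreshTable(qualified)
    }
  }
}
```

With a check like this, a test session created without Hive support would skip the refresh entirely instead of failing on an unregistered database.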

@empcl empcl changed the title from "When invalidate the table in the spark sql query cache, verify if the…" to "[HUDI-6866]When invalidate the table in the spark sql query cache, verify if the…" on Sep 15, 2023
@empcl
Contributor Author

empcl commented Sep 18, 2023

@yihua Hello, do you have time to take a look at this PR? Thanks.

Contributor

@zhangyue19921010 zhangyue19921010 left a comment

LGTM. Nice catch.

@zhangyue19921010 zhangyue19921010 merged commit fe010bb into apache:master Oct 23, 2023
nsivabalan pushed a commit that referenced this pull request Nov 21, 2023
… hive-async database exists (#9425)

Co-authored-by: chenlei677 <chenlei677@jd.com>

5 participants