[SPARK-29987][SQL] Add CatalogTable cache in SessionCatalog to improve performance #26627

Closed
ulysses-you wants to merge 9 commits into apache:master from ulysses-you:add-cache-in-catalog

Conversation

ulysses-you (Contributor) commented Nov 21, 2019

What changes were proposed in this pull request?

This PR is another idea to improve SQL performance. For more details, see 26603.

Here is the idea: executing a command plan is always simple and quick, but the session catalog may be called multiple times for the same table. We can cache the CatalogTable for a short time to speed up this case. This PR uses a time-based expiration policy for the CatalogTable cache.
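A minimal sketch of that idea, assuming a Guava cache keyed by QualifiedTableName; the field name and the cacheExpireSeconds value are illustrative, not the PR's actual code:

import java.util.concurrent.TimeUnit

import com.google.common.cache.{Cache, CacheBuilder}

// Entries expire a fixed interval after being written, so repeated lookups
// of the same table within that window are served from memory instead of
// hitting the external catalog again. `cacheExpireSeconds` stands in for
// the configuration value the PR introduces.
private val catalogTableCache: Cache[QualifiedTableName, CatalogTable] =
  CacheBuilder.newBuilder()
    .expireAfterWrite(cacheExpireSeconds, TimeUnit.SECONDS)
    .build[QualifiedTableName, CatalogTable]()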

Why are the changes needed?

Improve SQL performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test.

ulysses-you (Contributor, Author):

@rdblue @dongjoon-hyun Do you have time to review this?


private val validNameFormat = "([\\w_]+)".r

private val cachedCatalogTable = {
Contributor:

This is a cache, not a table. So the variable name would make more sense if it were catalogTableCache.

cachedCatalogTable.put(qtn, catalogTable)
}

private[sql] def invalidateAllCachedCatalogTable(): Unit = {
Contributor:

This should be plural: invalidateAllCachedCatalogTables

}
}

private[sql] def cacheCatalogTable(qtn: QualifiedTableName, catalogTable: CatalogTable): Unit = {
Contributor:

Cache lookups should accept a Callable so that populating the cache can be combined with the get operation (get or initialize).

Instead of cacheCatalogTable, this should be getOrCacheCatalogTable(qtn: QualifiedTableName, init: Callable[CatalogTable]) that calls catalogTableCache.get(qtn, init).

ulysses-you (Author):

OK, will fix these.

Reviewer:

Here getOrCacheCatalogTable is just a thin wrapper that does nothing by itself. I suggest moving the lambda parameter from the caller into this method; it is in fact part of the "get or cache" operation.
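A minimal sketch of the shape the reviewers are converging on, assuming the Guava-style cache above; Guava's Cache.get(key, callable) computes and stores the value on a miss, so the caller's loading lambda moves inside the method (the method and parameter names are illustrative):

import java.util.concurrent.Callable

// Sketch only: combine lookup and population in one call. On a cache miss,
// `loadTable` is evaluated once and the result is stored under `qtn`.
private[sql] def getOrCacheCatalogTable(
    qtn: QualifiedTableName)(loadTable: => CatalogTable): CatalogTable = {
  catalogTableCache.get(qtn, new Callable[CatalogTable] {
    override def call(): CatalogTable = loadTable
  })
}

A caller could then pass the underlying metastore lookup as the loader, for example getOrCacheCatalogTable(qtn)(externalCatalog.getTable(db, table)).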

dongjoon-hyun (Member):

ok to test


SparkQA commented Nov 21, 2019

Test build #114256 has finished for PR 26627 at commit 9d88db8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 22, 2019

Test build #114259 has finished for PR 26627 at commit 83edf2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 22, 2019

Test build #114260 has finished for PR 26627 at commit ba900c6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 22, 2019

Test build #114266 has finished for PR 26627 at commit e710324.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 22, 2019

Test build #114271 has finished for PR 26627 at commit 9d0ed7b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.internal()
.doc("The maximum expire seconds time of cache table catalog.")
.intConf
.checkValue(cacheSize => cacheSize >= 0 && cacheSize < 8,
konjac commented Nov 22, 2019:

cacheSize seems to have been copied from existing config items and needs rewording.

ulysses-you (Author):

OK

buildStaticConf("spark.sql.filesourceTableCatalogCacheExpireSeconds")
.internal()
.doc("The maximum expire seconds time of cache table catalog.")
.intConf
Reviewer:

What about timeConf?

ulysses-you (Author):

Good idea.
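For reference, a hedged sketch of what the timeConf variant might look like; the constant name, doc text, bound, and default are illustrative rather than the PR's final code:

import java.util.concurrent.TimeUnit

// Sketch only: timeConf parses values such as "30s" or "2min" and exposes
// the setting as a Long in the requested unit, which reads more naturally
// for an expiration interval than a bare intConf.
val FILESOURCE_TABLE_CATALOG_CACHE_EXPIRE_SECONDS =
  buildStaticConf("spark.sql.filesourceTableCatalogCacheExpireSeconds")
    .internal()
    .doc("How long a cached CatalogTable entry stays valid before it expires.")
    .timeConf(TimeUnit.SECONDS)
    .checkValue(_ >= 0, "The expire time must not be negative.")
    .createWithDefault(0L)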


SparkQA commented Nov 22, 2019

Test build #114291 has finished for PR 26627 at commit 8dc7f91.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -222,6 +228,7 @@ class SessionCatalog(
if (cascade && databaseExists(dbName)) {
listTables(dbName).foreach { t =>
invalidateCachedTable(QualifiedTableName(dbName, t.table))
Reviewer:

The names of these two methods, invalidateCachedTable and invalidateCachedCatalogTable, are really confusing. I suggest some rewording to make them more intuitive.

invalidateAllCachedCatalogTables()
}
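A sketch of one possible rewording that keeps the two invalidation paths clearly distinct; the names are suggestions, not the PR's final choice:

// Sketch only: one method drops a single entry, the other clears the whole
// CatalogTable cache; the plural name signals the broader scope.
private[sql] def invalidateCachedCatalogTable(qtn: QualifiedTableName): Unit = {
  catalogTableCache.invalidate(qtn)
}

private[sql] def invalidateAllCachedCatalogTables(): Unit = {
  catalogTableCache.invalidateAll()
}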

private[sql] def getCachedCatalogTable(qtn: QualifiedTableName): Option[CatalogTable] = {
Reviewer:

Add comments here, and to the other new methods as well.

}
}


ulysses-you (Author) commented Nov 22, 2019

It seems that StatisticsSuite uses the underlying method hiveClient.alterTable directly, so the cached CatalogTable goes stale and the test fails.


SparkQA commented Nov 22, 2019

Test build #114298 has finished for PR 26627 at commit 9bc6abd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 22, 2019

Test build #114301 has finished for PR 26627 at commit be169d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
