-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29947][SQL] Improve ResolveRelations performance #26589
Conversation
Test build #114049 has finished for PR 26589 at commit
|
cc @cloud-fan |
Could you resolve the conflicts, @wangyum ? So, the basic idea is caching the result with Map, isn't it? |
should this be a per-query thing? e.g. for each query, we only need to resolve a table once. |
for each query, we only need to resolve a table once in |
A table may appear several times in the query, e.g. self-join. Shall we handle it as well using the cache? |
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Yes. We can handle self-join using the cache. |
Test build #114384 has finished for PR 26589 at commit
|
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Test build #115362 has finished for PR 26589 at commit
|
case other => other | ||
} | ||
def apply(plan: LogicalPlan): LogicalPlan = { | ||
var logicalPlans = Map.empty[UnresolvedRelation, LogicalPlan] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use a mutable map while we're here? Also, can you add a short comment with the JIRA ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment to doc:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Lines 94 to 96 in 07e6b7f
* @param relationToLogicalPlanMaps The UnresolvedRelation to LogicalPlan mapping, this can ensure | |
* that the table is resolved only once if a table is used | |
* multiple times in a query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems okay
shall we put the map in |
Test build #115884 has finished for PR 26589 at commit
|
retest this please |
Test build #115886 has finished for PR 26589 at commit
|
retest this please |
Test build #115887 has finished for PR 26589 at commit
|
retest this please |
Test build #115893 has finished for PR 26589 at commit
|
Test build #115911 has finished for PR 26589 at commit
|
*/ | ||
case class AnalysisContext( | ||
defaultDatabase: Option[String] = None, | ||
nestedViewDepth: Int = 0) | ||
nestedViewDepth: Int = 0, | ||
relationToLogicalPlanMaps: mutable.Map[UnresolvedRelation, LogicalPlan] = mutable.Map.empty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how about just relationCache
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it's simpler to use Seq[String]
as key.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Show resolved
Hide resolved
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Test build #116051 has finished for PR 26589 at commit
|
Test build #116053 has finished for PR 26589 at commit
|
@@ -92,10 +92,13 @@ object FakeV2SessionCatalog extends TableCatalog { | |||
* views. | |||
* @param nestedViewDepth The nested depth in the view resolution, this enables us to limit the | |||
* depth of nested views. | |||
* @param relationCache The UnresolvedRelation to LogicalPlan mapping, this can ensure that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A mapping from qualified table names to resolved relations.
@@ -858,7 +861,12 @@ class Analyzer( | |||
} | |||
|
|||
case u: UnresolvedRelation => | |||
lookupRelation(u.multipartIdentifier).map(resolveViews).getOrElse(u) | |||
val relationCache = AnalysisContext.get.relationCache | |||
relationCache.getOrElse(u.tableName, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we apply the cache in lookupRelation
where we expand the name and get the qualified table name?
@@ -870,7 +874,8 @@ class Analyzer( | |||
case SessionCatalogAndIdentifier(catalog, ident) => | |||
CatalogV2Util.loadTable(catalog, ident).map { | |||
case v1Table: V1Table => | |||
v1SessionCatalog.getRelation(v1Table.v1Table) | |||
AnalysisContext.get.relationCache.getOrElseUpdate( | |||
v1Table.v1Table.qualifiedName, v1SessionCatalog.getRelation(v1Table.v1Table)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the key should be a fully qualified name, including catalog name. We may have different tables in different catalogs with the same name.
how about
val key = catalog.name +: ident.namespace :+ ident.name
relationCache.getOrElseUpdate(key, ...)
Test build #116069 has finished for PR 26589 at commit
|
Test build #116073 has finished for PR 26589 at commit
|
retest this please |
Test build #116085 has finished for PR 26589 at commit
|
thanks, merging to master! |
@@ -870,7 +874,9 @@ class Analyzer( | |||
case SessionCatalogAndIdentifier(catalog, ident) => | |||
CatalogV2Util.loadTable(catalog, ident).map { | |||
case v1Table: V1Table => | |||
v1SessionCatalog.getRelation(v1Table.v1Table) | |||
val key = catalog.name +: ident.namespace :+ ident.name | |||
AnalysisContext.get.relationCache.getOrElseUpdate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work. The table is already loaded and it's too late to use the cache. I've sent #27341 to fix it.
### What changes were proposed in this pull request? Fix a bug in #26589 , to make this feature work. ### Why are the changes needed? This feature doesn't work actually. ### Does this PR introduce any user-facing change? no ### How was this patch tested? new test Closes #27341 from cloud-fan/cache. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
*/ | ||
case class AnalysisContext( | ||
catalogAndNamespace: Seq[String] = Nil, | ||
nestedViewDepth: Int = 0) | ||
nestedViewDepth: Int = 0, | ||
relationCache: mutable.Map[Seq[String], LogicalPlan] = mutable.Map.empty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the anti-pattern from the style guide ... https://github.com/databricks/scala-style-guide#case-classes-and-immutability
… with fresh attribute IDs ### What changes were proposed in this pull request? This is a followup of #26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs. This PR is to eliminate this side effect. ### Why are the changes needed? There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #28717 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… with fresh attribute IDs ### What changes were proposed in this pull request? This is a followup of #26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs. This PR is to eliminate this side effect. ### Why are the changes needed? There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #28717 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dc0709f) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
It is very common for a SQL query to query a table more than once. For example:
This PR try to improve
ResolveTables
andResolveRelations
performance by reducing the connection times to Hive Metastore Server in such case.Why are the changes needed?
ResolveTables
andResolveRelations
performance.Does this PR introduce any user-facing change?
No.
How was this patch tested?
manual test.
After SPARK-29606 and before this PR:
After SPARK-29606 and after this PR: