[SPARK-19265][SQL] make table relation cache general and does not depend on hive #16621
Conversation
Test build #71522 has finished for PR 16621 at commit
Test build #71549 has finished for PR 16621 at commit
}

// Also invalidate the table relation cache.
How about keeping the previous comment here?
// refreshTable does not eagerly reload the cache. It just invalidates the cache.
// Next time when we use the table, it will be populated in the cache.
// Since we also cache ParquetRelations converted from Hive Parquet tables and
// adding converted ParquetRelations into the cache is not defined in the load function
// of the cache (instead, we add the cache entry in convertToParquetRelation),
// it is better to invalidate the cache here to avoid confusing warning logs from the
// cache loader (e.g. cannot find data source provider, which is only defined for
// data source tables).
Is it still valid?
After an offline discussion, I am fine to remove it. Thanks!
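The invalidate-then-lazily-repopulate behavior the old comment describes can be sketched with a toy cache (the names and string "plans" here are illustrative stand-ins, not Spark's real cache, which is keyed by `QualifiedTableName`):

```scala
import scala.collection.mutable

// Toy model of the relation cache: refreshTable only drops the entry;
// the next lookup repopulates it (no eager reload).
val cache = mutable.Map.empty[String, String]
var loads = 0

def lookup(table: String): String =
  cache.getOrElseUpdate(table, { loads += 1; s"plan($table)" })

def refreshTable(table: String): Unit = cache.remove(table)

lookup("t"); lookup("t")
assert(loads == 1)   // second lookup hits the cache
refreshTable("t")    // invalidate only
assert(loads == 1)   // nothing reloaded eagerly
lookup("t")
assert(loads == 2)   // repopulated on next use
```

This is why a stale entry never needs an explicit reload: the load happens on the next access.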
}

/** A fully qualified identifier for a table (i.e., database.tableName) */
case class QualifiedTableName(database: String, name: String)
We are having a case sensitivity issue here, right? Previously, we always lower-cased both the database and table names, since database and table names are not case sensitive.
If users don't change the case-sensitivity config at runtime, it will be OK.
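A minimal sketch of the lower-casing fix being discussed, assuming identifiers are normalized before they become cache keys (the helper name is illustrative, not Spark's API):

```scala
// Normalize identifiers before using them as cache keys, so lookups are
// case-insensitive regardless of how the user spells the names.
case class QualifiedTableName(database: String, name: String)

def cacheKey(db: String, table: String): QualifiedTableName =
  QualifiedTableName(db.toLowerCase, table.toLowerCase)

// "DB1"."Tbl" and "db1"."tbl" resolve to the same cache entry:
assert(cacheKey("DB1", "Tbl") == cacheKey("db1", "tbl"))
```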
@@ -386,7 +386,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
      case relation: CatalogRelation if DDLUtils.isHiveTable(relation.catalogTable) =>
        relation.catalogTable.identifier
    }
-   EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {
+   val tableRelation = df.sparkSession.table(tableIdentWithDB).queryExecution.analyzed
This change is to avoid overriding lookupRelation in HiveMetastoreCatalog, right?
Yeah, now lookupRelation will return SimpleCatalogRelation, and other analyzer rules will convert it to LogicalRelation or MetastoreRelation.
-   val relation = EliminateSubqueryAliases(
-     sessionState.catalog.lookupRelation(TableIdentifier(tableName)))
+   var relation: LogicalPlan = null
+   withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
I checked the test cases. ORC has the same issue, but the default value is false currently. Thus, I think we should set CONVERT_METASTORE_ORC to false too, in case we change the default value of CONVERT_METASTORE_ORC in the future.
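The suggestion amounts to pinning both conversion flags inside the test. As a rough sketch of the save-and-restore pattern behind `withSQLConf` (using a plain mutable map instead of the real SQLConf API):

```scala
// Hypothetical minimal model of withSQLConf: set the given keys, run the
// body, then restore the previous values (or remove keys that were unset).
val conf = scala.collection.mutable.Map("spark.sql.hive.convertMetastoreParquet" -> "true")

def withConf[T](pairs: (String, String)*)(body: => T): T = {
  val saved = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try body finally saved.foreach {
    case (k, Some(v)) => conf(k) = v
    case (k, None)    => conf.remove(k)
  }
}

withConf("spark.sql.hive.convertMetastoreParquet" -> "false",
         "spark.sql.hive.convertMetastoreOrc" -> "false") {
  assert(conf("spark.sql.hive.convertMetastoreParquet") == "false")
  assert(conf("spark.sql.hive.convertMetastoreOrc") == "false")
}
// Values are restored after the block.
assert(conf("spark.sql.hive.convertMetastoreParquet") == "true")
assert(!conf.contains("spark.sql.hive.convertMetastoreOrc"))
```

Pinning both keys keeps the test meaningful even if either conversion default flips later.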
val dataSource =
  DataSource(
    sparkSession,
    userSpecifiedSchema = Some(table.schema),
// In older versions (prior to 2.1) of Spark, the table schema can be empty and should be
// inferred at runtime. We should still support it.
Is it still valid?
good catch!
Could we rename
Can we do it later? We are going to merge
Test build #71567 has finished for PR 16621 at commit
Sure. No problem.
Test build #71572 has started for PR 16621 at commit
Test build #71574 has started for PR 16621 at commit
  } else {
    SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), None)
  }
} else {
-  SubqueryAlias(relationAlias, tempTables(table), Option(name))
+  SubqueryAlias(relationAlias, tempTables(table), None)
Should we keep the existing way? This was introduced for the EXPLAIN command of view. See the PR: #14657
The existing way is to set None, see https://github.com/apache/spark/pull/16621/files#diff-ca4533edbf148c89cc0c564ab6b0aeaaL75
This shows the evil of duplicated code: we have inconsistent behaviors with and without hive support. I think we should only set the table identifier for persisted views. @hvanhovell, is that true?
ping @hvanhovell again. : )
Sorry, I have been living under a rock for the past month or so.
This is not really needed anymore. Let's remove it.
Thank you!
@@ -1799,6 +1799,7 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
      .getTableMetadata(TableIdentifier("tbl")).storage.locationUri.get

    sql(s"ALTER TABLE tbl SET LOCATION '${dir.getCanonicalPath}'")
+   spark.catalog.refreshTable("tbl")
+1 :)
No more comments. It looks pretty good! Let's see whether all the test cases pass.
Test build #71576 has started for PR 16621 at commit |
    // TODO: improve `InMemoryCatalog` and remove this limitation.
    catalogTable = if (withHiveSupport) Some(table) else None)

LogicalRelation(dataSource.resolveRelation(), catalogTable = Some(table))
Note that previously we would set expectedOutputAttributes here, which was added by #15182. However, this doesn't work when the table schema needs to be inferred at runtime, and it turns out that we don't need to do it at all: AnalyzeColumnCommand now gets attributes from the resolved table relation plan, so it's fine for the rule FindDataSourceTable to change outputs during analysis.
cc @wzhfy
OK, then we can revert it without changing the analyze part.
@@ -1626,17 +1626,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    assert(d.size == d.distinct.size)
  }

-  test("SPARK-17625: data source table in InMemoryCatalog should guarantee output consistency") {
we don't need this test anymore, see https://github.com/apache/spark/pull/16621/files#r96577427
@@ -1322,4 +1322,26 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
    sparkSession.sparkContext.conf.set(DEBUG_MODE, previousValue)
  }
}

+  test("SPARK-18464: support old table which doesn't store schema in table properties") {
This test was removed in #16003, but I find it's still useful and is not covered by other tests, so I added it back.
After merging #16517, there are a few conflicts.
Test build #71583 has finished for PR 16621 at commit
 * A cache of qualified table name to table relation plan.
 */
val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
  // TODO: create a config instead of hardcoding 1000 here.
Hi, @cloud-fan.
Why not make this a config in this PR? It seems easy.
Yeah, it's easy, but I want to keep the code changes minimal so it's easier to review.
Yep. Sure~
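As a hypothetical sketch of what a size-bounded relation cache looks like, here is an access-ordered LinkedHashMap used for LRU eviction (the PR itself uses a Guava Cache with a hardcoded maximum size of 1000, which additionally handles concurrency):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

case class QualifiedTableName(database: String, name: String)

// LRU map capped at maxSize entries; the least recently accessed entry is
// evicted once the cap is exceeded.
class RelationCache[V <: AnyRef](maxSize: Int) {
  private val underlying =
    new JLinkedHashMap[QualifiedTableName, V](16, 0.75f, /* accessOrder = */ true) {
      override def removeEldestEntry(e: JMap.Entry[QualifiedTableName, V]): Boolean =
        size() > maxSize
    }

  def getOrElseUpdate(key: QualifiedTableName, compute: => V): V = {
    val cached = underlying.get(key)
    if (cached != null) cached
    else { val v = compute; underlying.put(key, v); v }
  }

  def invalidate(key: QualifiedTableName): Unit = underlying.remove(key)
  def size: Int = underlying.size()
}

val cache = new RelationCache[String](2)
cache.getOrElseUpdate(QualifiedTableName("db", "t1"), "plan1")
cache.getOrElseUpdate(QualifiedTableName("db", "t2"), "plan2")
cache.getOrElseUpdate(QualifiedTableName("db", "t3"), "plan3") // evicts t1
assert(cache.size == 2)
```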
LGTM
Thanks! Merging to master.
What changes were proposed in this pull request?

We have a table relation plan cache in `HiveMetastoreCatalog`, which caches a lot of things: file status, resolved data source, inferred schema, etc. However, it doesn't make sense to limit this cache to hive support; we should move it to the SQL core module so that users can use this cache without hive support. It can also reduce the size of `HiveMetastoreCatalog`, so that it's easier to remove it eventually.

Main changes:
1. Move the table relation cache to `SessionCatalog`.
2. `SessionCatalog.lookupRelation` will return `SimpleCatalogRelation`, and the analyzer will convert it to `LogicalRelation` or `MetastoreRelation` later, so `HiveSessionCatalog` doesn't need to override `lookupRelation` anymore.
3. `FindDataSourceTable` will read/write the table relation cache.

How was this patch tested?

Existing tests.