
[SPARK-19265][SQL] make table relation cache general and does not depend on hive #16621

Closed
wants to merge 1 commit into from

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

We have a table relation plan cache in HiveMetastoreCatalog, which caches a lot of things: file statuses, resolved data sources, inferred schemas, etc.

However, it doesn't make sense to tie this cache to Hive support; we should move it to the SQL core module so that users can use this cache without Hive support.

It also reduces the size of HiveMetastoreCatalog, making it easier to remove eventually.

Main changes:

  1. move the table relation cache to SessionCatalog
  2. SessionCatalog.lookupRelation will return SimpleCatalogRelation and the analyzer will convert it to LogicalRelation or MetastoreRelation later, then HiveSessionCatalog doesn't need to override lookupRelation anymore
  3. FindDataSourceTable will read/write the table relation cache.
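As a minimal sketch of how steps 2 and 3 fit together (using illustrative stand-in types, not the actual Spark classes): the catalog hands back a placeholder relation, and a later analyzer rule rewrites it into a resolved one.

```scala
// Illustrative stand-ins; the real Spark LogicalPlan hierarchy is much richer.
sealed trait LogicalPlan
case class SimpleCatalogRelation(db: String, table: String) extends LogicalPlan
case class ResolvedDataSourceRelation(db: String, table: String) extends LogicalPlan

// Analogue of the FindDataSourceTable rule: replace the placeholder
// relation produced by SessionCatalog.lookupRelation with a resolved one.
object FindDataSourceTableSketch {
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case SimpleCatalogRelation(db, t) => ResolvedDataSourceRelation(db, t)
    case other                        => other
  }
}
```

Because the placeholder is produced uniformly by SessionCatalog, HiveSessionCatalog no longer needs its own lookupRelation override; the Hive-specific conversion can live in a separate analyzer rule instead.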

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71522 has finished for PR 16621 at commit a7845cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71549 has finished for PR 16621 at commit 98a5483.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

}

// Also invalidate the table relation cache.
Member

How about keeping the previous comment here?

    // refreshTable does not eagerly reload the cache. It just invalidates the cache.
    // Next time when we use the table, it will be populated in the cache.
    // Since we also cache ParquetRelations converted from Hive Parquet tables and
    // adding converted ParquetRelations into the cache is not defined in the load function
    // of the cache (instead, we add the cache entry in convertToParquetRelation),
    // it is better to invalidate the cache here to avoid confusing warning logs from the
    // cache loader (e.g. cannot find data source provider, which is only defined for
    // data source tables).

Contributor Author

Is it still valid?

Member

After an offline discussion, I am fine to remove it. Thanks!
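The behavior the removed comment described — refreshTable drops the cache entry rather than eagerly reloading it — can be sketched as follows (class and method names here are illustrative, not Spark's):

```scala
import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch: invalidation only removes the entry; the relation
// is rebuilt lazily on the next lookup via the load function.
class LazyRelationCache[K, V](load: K => V) {
  private val cache = new ConcurrentHashMap[K, V]()
  var loads = 0 // counts how often the expensive load runs

  def get(k: K): V =
    cache.computeIfAbsent(k, (key: K) => { loads += 1; load(key) })

  def invalidate(k: K): Unit = cache.remove(k) // no eager reload
}
```

With this shape, entries added outside the load function (as the old Parquet-conversion path did) are no longer a special case once conversion happens in a dedicated analyzer rule.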

}

/** A fully qualified identifier for a table (i.e., database.tableName) */
case class QualifiedTableName(database: String, name: String)
Member

We have a case-sensitivity issue here, right? Previously, we always lower-cased both the database and table names, since database and table names are not case sensitive.

Contributor Author

If users don't change the case-sensitivity config at runtime, it will be OK.
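One way to make cache lookups robust to identifier casing, sketched here as an assumption rather than what the patch actually does, is to lower-case both parts when the key is built:

```scala
/** A fully qualified identifier for a table (i.e., database.tableName). */
case class QualifiedTableName(database: String, name: String)

// Hypothetical normalization helper: lower-case both parts so that cache
// lookups are case-insensitive, matching the previous lower-casing behavior.
def normalizedKey(db: String, table: String): QualifiedTableName =
  QualifiedTableName(db.toLowerCase, table.toLowerCase)
```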

@@ -386,7 +386,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
case relation: CatalogRelation if DDLUtils.isHiveTable(relation.catalogTable) =>
relation.catalogTable.identifier
}
EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {

val tableRelation = df.sparkSession.table(tableIdentWithDB).queryExecution.analyzed
Member

This change is to avoid overriding lookupRelation in HiveMetastoreCatalog, right?

Contributor Author

Yeah, now lookupRelation will return SimpleCatalogRelation, and other analyzer rules will convert it to LogicalRelation or MetastoreRelation.

val relation = EliminateSubqueryAliases(
sessionState.catalog.lookupRelation(TableIdentifier(tableName)))
var relation: LogicalPlan = null
withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
Member

I checked the test cases. ORC has the same issue, but its default value is currently false. Thus, I think we should set CONVERT_METASTORE_ORC to false too, in case we change the default value of CONVERT_METASTORE_ORC in the future.
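The suggestion amounts to pinning both conversion flags for the duration of the test. The helper below is a self-contained sketch over a plain map standing in for the session conf — the two config key strings are Spark's real keys, but the helper itself is illustrative:

```scala
import scala.collection.mutable

val conf = mutable.Map[String, String]() // stand-in for the session conf

// Set the given keys, run the body, then restore the previous values even
// if the body throws (mirroring the withSQLConf test helper's contract).
def withConf[T](pairs: (String, String)*)(body: => T): T = {
  val saved = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try body
  finally saved.foreach {
    case (k, Some(v)) => conf(k) = v
    case (k, None)    => conf.remove(k)
  }
}

// Pin both Parquet and ORC conversion off for the duration of the test:
withConf(
  "spark.sql.hive.convertMetastoreParquet" -> "false",
  "spark.sql.hive.convertMetastoreOrc"     -> "false") {
  // ... exercise the raw MetastoreRelation here ...
}
```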

val dataSource =
DataSource(
sparkSession,
userSpecifiedSchema = Some(table.schema),
Member

// In older version(prior to 2.1) of Spark, the table schema can be empty and should be
// inferred at runtime. We should still support it.

Is it still valid?

Contributor Author

good catch!

@gatorsmile (Member)

Could we rename SimpleCatalogRelation to UnresolvedCatalogRelation? The current name looks very confusing to me.

@cloud-fan (Contributor, Author)

Can we do it later? We are going to merge the CatalogRelation implementations and unify the table relation representations soon.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71567 has finished for PR 16621 at commit 919aaa2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

@gatorsmile (Member)

Sure. No problem.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71572 has started for PR 16621 at commit 2883c8b.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71574 has started for PR 16621 at commit bbccdae.

} else {
SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), None)
}
} else {
SubqueryAlias(relationAlias, tempTables(table), Option(name))
SubqueryAlias(relationAlias, tempTables(table), None)
Member

Should we keep the existing way? This was introduced for the EXPLAIN command of view. See the PR: #14657

cloud-fan (Contributor, Author) commented Jan 18, 2017

The existing way is to set None; see https://github.com/apache/spark/pull/16621/files#diff-ca4533edbf148c89cc0c564ab6b0aeaaL75

This shows the evil of duplicated code: we have inconsistent behaviors with and without Hive support. I think we should only set the table identifier for persisted views. @hvanhovell, is that true?

Member

ping @hvanhovell again. : )

Contributor

Sorry, I have been living under a rock for the past month or so.

This is not really needed anymore. Let's remove it.

Member

Thank you!

@@ -1799,6 +1799,7 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
.getTableMetadata(TableIdentifier("tbl")).storage.locationUri.get

sql(s"ALTER TABLE tbl SET LOCATION '${dir.getCanonicalPath}'")
spark.catalog.refreshTable("tbl")
Member

+1 :)

@gatorsmile (Member)

No more comments. It looks pretty good! Let's see whether all the test cases pass.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71576 has started for PR 16621 at commit d636389.

// TODO: improve `InMemoryCatalog` and remove this limitation.
catalogTable = if (withHiveSupport) Some(table) else None)

LogicalRelation(dataSource.resolveRelation(), catalogTable = Some(table))
Contributor Author

Note that previously we would set expectedOutputAttributes here, which was added by #15182.

However, this doesn't work when the table schema needs to be inferred at runtime, and it turns out that we don't need to do it at all. AnalyzeColumnCommand now gets attributes from the resolved table relation plan, so it's fine for the FindDataSourceTable rule to change outputs during analysis.

Contributor Author

cc @wzhfy

Contributor

ok, then we can revert it without changing the analyze part.

@@ -1626,17 +1626,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
assert(d.size == d.distinct.size)
}

test("SPARK-17625: data source table in InMemoryCatalog should guarantee output consistency") {
Contributor Author

We don't need this test anymore; see https://github.com/apache/spark/pull/16621/files#r96577427

@@ -1322,4 +1322,26 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
sparkSession.sparkContext.conf.set(DEBUG_MODE, previousValue)
}
}

test("SPARK-18464: support old table which doesn't store schema in table properties") {
Contributor Author

This test was removed in #16003, but I found it's still useful and not covered by other tests, so I added it back.

@gatorsmile (Member)

Merging #16517 introduced a few conflicts.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71583 has finished for PR 16621 at commit 096fcc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

/**
 * A cache of qualified table name to table relation plan.
 */
val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
// TODO: create a config instead of hardcode 1000 here.
Member

Hi, @cloud-fan.
Why not make this a config in this PR? It seems easy.

Contributor Author

Yeah, it's easy, but I want to minimize the code changes so the PR is easier to review.

Member

Yep. Sure~
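The hardcoded limit behaves like a size-bounded LRU cache: once 1000 entries are present, the least-recently-used one is evicted. A stdlib-only sketch of those semantics (the real tableRelationCache is a Guava-style Cache; the class below is illustrative):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Illustrative bounded LRU cache: once maxSize entries are present, the
// least-recently-accessed entry is evicted on the next insertion.
class BoundedCache[K, V](maxSize: Int = 1000) {
  private val underlying =
    new JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
      override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
        size() > maxSize
    }
  def get(k: K): Option[V] = Option(underlying.get(k))
  def put(k: K, v: V): Unit = underlying.put(k, v)
  def invalidate(k: K): Unit = underlying.remove(k)
}
```

Making the limit configurable (as the TODO suggests) would then be a matter of reading maxSize from a conf entry instead of hardcoding it.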

@gatorsmile (Member)

LGTM

@gatorsmile (Member)

Thanks! Merging to master.

@asfgit asfgit closed this in 2e62560 Jan 19, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16621 from cloud-fan/plan-cache.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16621 from cloud-fan/plan-cache.
6 participants