
[SPARK-19265][SQL] make table relation cache general and does not depend on hive #16621

Closed
wants to merge 1 commit into from

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

We have a table relation plan cache in HiveMetastoreCatalog, which caches a lot of things: file statuses, resolved data sources, inferred schemas, etc.

However, it doesn't make sense to tie this cache to Hive support; we should move it to the SQL core module so that users can use this cache without Hive support.

It also reduces the size of HiveMetastoreCatalog, making it easier to remove eventually.

Main changes:

  1. move the table relation cache to SessionCatalog
  2. SessionCatalog.lookupRelation will return SimpleCatalogRelation and the analyzer will convert it to LogicalRelation or MetastoreRelation later, then HiveSessionCatalog doesn't need to override lookupRelation anymore
  3. FindDataSourceTable will read/write the table relation cache.
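As a minimal sketch of how steps 2 and 3 fit together (using illustrative stand-in types, not the actual Spark classes): the catalog hands back a placeholder relation, and a later analyzer rule rewrites it into a resolved one.

```scala
// Illustrative stand-ins; the real Spark LogicalPlan hierarchy is much richer.
sealed trait LogicalPlan
case class SimpleCatalogRelation(db: String, table: String) extends LogicalPlan
case class ResolvedDataSourceRelation(db: String, table: String) extends LogicalPlan

// Analogue of the FindDataSourceTable rule: replace the placeholder
// relation produced by SessionCatalog.lookupRelation with a resolved one.
object FindDataSourceTableSketch {
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case SimpleCatalogRelation(db, t) => ResolvedDataSourceRelation(db, t)
    case other                        => other
  }
}
```

Because the placeholder is produced uniformly by SessionCatalog, HiveSessionCatalog no longer needs its own lookupRelation override; the Hive-specific conversion can live in a separate analyzer rule instead.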

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71522 has finished for PR 16621 at commit a7845cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71549 has finished for PR 16621 at commit 98a5483.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

}

// Also invalidate the table relation cache.
Member

How about keeping the previous comment here?

    // refreshTable does not eagerly reload the cache. It just invalidates the cache.
    // Next time when we use the table, it will be populated in the cache.
    // Since we also cache ParquetRelations converted from Hive Parquet tables and
    // adding converted ParquetRelations into the cache is not defined in the load function
    // of the cache (instead, we add the cache entry in convertToParquetRelation),
    // it is better to invalidate the cache here to avoid confusing warning logs from the
    // cache loader (e.g. cannot find data source provider, which is only defined for
    // data source tables).

Contributor Author

Is it still valid?

Member

After an offline discussion, I am fine to remove it. Thanks!
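The behavior the removed comment described — refreshTable drops the cache entry rather than eagerly reloading it — can be sketched as follows (class and method names here are illustrative, not Spark's):

```scala
import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch: invalidation only removes the entry; the relation
// is rebuilt lazily on the next lookup via the load function.
class LazyRelationCache[K, V](load: K => V) {
  private val cache = new ConcurrentHashMap[K, V]()
  var loads = 0 // counts how often the expensive load runs

  def get(k: K): V =
    cache.computeIfAbsent(k, (key: K) => { loads += 1; load(key) })

  def invalidate(k: K): Unit = cache.remove(k) // no eager reload
}
```

With this shape, entries added outside the load function (as the old Parquet-conversion path did) are no longer a special case once conversion happens in a dedicated analyzer rule.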

}

/** A fully qualified identifier for a table (i.e., database.tableName) */
case class QualifiedTableName(database: String, name: String)
Member

We have a case-sensitivity issue here, right? Previously, we always lower-cased both the database and table names, since database and table names are not case sensitive.

Contributor Author

If users don't change the case-sensitivity config at runtime, it will be OK.
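One way to make cache lookups robust to identifier casing, sketched here as an assumption rather than what the patch actually does, is to lower-case both parts when the key is built:

```scala
/** A fully qualified identifier for a table (i.e., database.tableName). */
case class QualifiedTableName(database: String, name: String)

// Hypothetical normalization helper: lower-case both parts so that cache
// lookups are case-insensitive, matching the previous lower-casing behavior.
def normalizedKey(db: String, table: String): QualifiedTableName =
  QualifiedTableName(db.toLowerCase, table.toLowerCase)
```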

@@ -386,7 +386,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
case relation: CatalogRelation if DDLUtils.isHiveTable(relation.catalogTable) =>
relation.catalogTable.identifier
}
EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match {

val tableRelation = df.sparkSession.table(tableIdentWithDB).queryExecution.analyzed
Member

This change is to avoid overriding lookupRelation in HiveMetastoreCatalog, right?

Contributor Author

Yeah, now lookupRelation will return SimpleCatalogRelation, and other analyzer rules will convert it to LogicalRelation or MetastoreRelation.

val relation = EliminateSubqueryAliases(
sessionState.catalog.lookupRelation(TableIdentifier(tableName)))
var relation: LogicalPlan = null
withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
Member

I checked the test cases. ORC has the same issue, but its default value is currently false. Thus, I think we should set CONVERT_METASTORE_ORC to false too, in case we change the default value of CONVERT_METASTORE_ORC in the future.
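The suggestion amounts to pinning both conversion flags for the duration of the test. The helper below is a self-contained sketch over a plain map standing in for the session conf — the two config key strings are Spark's real keys, but the helper itself is illustrative:

```scala
import scala.collection.mutable

val conf = mutable.Map[String, String]() // stand-in for the session conf

// Set the given keys, run the body, then restore the previous values even
// if the body throws (mirroring the withSQLConf test helper's contract).
def withConf[T](pairs: (String, String)*)(body: => T): T = {
  val saved = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try body
  finally saved.foreach {
    case (k, Some(v)) => conf(k) = v
    case (k, None)    => conf.remove(k)
  }
}

// Pin both Parquet and ORC conversion off for the duration of the test:
withConf(
  "spark.sql.hive.convertMetastoreParquet" -> "false",
  "spark.sql.hive.convertMetastoreOrc"     -> "false") {
  // ... exercise the raw MetastoreRelation here ...
}
```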

val dataSource =
DataSource(
sparkSession,
userSpecifiedSchema = Some(table.schema),
Member

// In older version(prior to 2.1) of Spark, the table schema can be empty and should be
// inferred at runtime. We should still support it.

Is it still valid?

Contributor Author

good catch!

@gatorsmile (Member)

Could we rename SimpleCatalogRelation to UnresolvedCatalogRelation? The current name looks very confusing to me.

@cloud-fan (Contributor, Author)

Can we do it later? We are going to merge the CatalogRelation implementations and unify the table relation representations soon.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71567 has finished for PR 16621 at commit 919aaa2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

@gatorsmile (Member)

Sure. No problem.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71572 has started for PR 16621 at commit 2883c8b.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71574 has started for PR 16621 at commit bbccdae.

} else {
SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), None)
}
} else {
SubqueryAlias(relationAlias, tempTables(table), Option(name))
SubqueryAlias(relationAlias, tempTables(table), None)
Member

Should we keep the existing way? This was introduced for the EXPLAIN command of view. See the PR: #14657

cloud-fan (Contributor, Author) commented Jan 18, 2017

The existing way is to set None; see https://github.com/apache/spark/pull/16621/files#diff-ca4533edbf148c89cc0c564ab6b0aeaaL75

This shows the evil of duplicated code: we have inconsistent behaviors with and without Hive support. I think we should only set the table identifier for persisted views. @hvanhovell, is that true?

Member

ping @hvanhovell again. : )

Contributor

Sorry, I have been living under a rock for the past month or so.

This is not really needed anymore. Let's remove it.

Member

Thank you!

@@ -1799,6 +1799,7 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
.getTableMetadata(TableIdentifier("tbl")).storage.locationUri.get

sql(s"ALTER TABLE tbl SET LOCATION '${dir.getCanonicalPath}'")
spark.catalog.refreshTable("tbl")
Member

+1 :)

@gatorsmile (Member)

No more comments. It looks pretty good! Let's see whether all the test cases pass.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71576 has started for PR 16621 at commit d636389.

// TODO: improve `InMemoryCatalog` and remove this limitation.
catalogTable = if (withHiveSupport) Some(table) else None)

LogicalRelation(dataSource.resolveRelation(), catalogTable = Some(table))
Contributor Author

Note that previously we would set expectedOutputAttributes here, which was added by #15182.

However, this doesn't work when the table schema needs to be inferred at runtime, and it turns out that we don't need to do it at all. AnalyzeColumnCommand now gets attributes from the resolved table relation plan, so it's fine for the FindDataSourceTable rule to change outputs during analysis.

Contributor Author

cc @wzhfy

Contributor

ok, then we can revert it without changing the analyze part.

@@ -1626,17 +1626,6 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
assert(d.size == d.distinct.size)
}

test("SPARK-17625: data source table in InMemoryCatalog should guarantee output consistency") {
Contributor Author

We don't need this test anymore; see https://github.com/apache/spark/pull/16621/files#r96577427

@@ -1322,4 +1322,26 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
sparkSession.sparkContext.conf.set(DEBUG_MODE, previousValue)
}
}

test("SPARK-18464: support old table which doesn't store schema in table properties") {
Contributor Author

This test was removed in #16003, but I found it's still useful and not covered by other tests, so I added it back.

@gatorsmile (Member)

Merging #16517 introduced a few conflicts.

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71583 has finished for PR 16621 at commit 096fcc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class QualifiedTableName(database: String, name: String)
  • class FindHiveSerdeTable(session: SparkSession) extends Rule[LogicalPlan]

/**
 * A cache of qualified table name to table relation plan.
 */
val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
// TODO: create a config instead of hardcode 1000 here.
Member

Hi, @cloud-fan.
Why not make this a config in this PR? It seems easy.

Contributor Author

Yeah, it's easy, but I want to minimize the code changes so the PR is easier to review.

Member

Yep. Sure~
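The hardcoded limit behaves like a size-bounded LRU cache: once 1000 entries are present, the least-recently-used one is evicted. A stdlib-only sketch of those semantics (the real tableRelationCache is a Guava-style Cache; the class below is illustrative):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Illustrative bounded LRU cache: once maxSize entries are present, the
// least-recently-accessed entry is evicted on the next insertion.
class BoundedCache[K, V](maxSize: Int = 1000) {
  private val underlying =
    new JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
      override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
        size() > maxSize
    }
  def get(k: K): Option[V] = Option(underlying.get(k))
  def put(k: K, v: V): Unit = underlying.put(k, v)
  def invalidate(k: K): Unit = underlying.remove(k)
}
```

Making the limit configurable (as the TODO suggests) would then be a matter of reading maxSize from a conf entry instead of hardcoding it.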

@gatorsmile (Member)

LGTM

@gatorsmile (Member)

Thanks! Merging to master.

@asfgit asfgit closed this in 2e62560 Jan 19, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16621 from cloud-fan/plan-cache.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16621 from cloud-fan/plan-cache.
6 participants