
[SPARK-28666] Support saveAsTable for V2 tables through Session Catalog #25402

Closed
wants to merge 16 commits

Conversation

@brkyvz (Contributor) commented Aug 9, 2019

What changes were proposed in this pull request?

We add support for saveAsTable through the V2SessionCatalog, so that V2 tables can plug into the existing DataFrameWriter.saveAsTable API to create and write tables through the session catalog.
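
A minimal usage sketch of what this enables, assuming a hypothetical data source named "exampleV2Source" whose provider implements the v2 TableProvider API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
val df = spark.range(10).toDF("id")

// With this change, saveAsTable on a v2 provider goes through the
// V2SessionCatalog rather than the v1 CreateTable code path.
df.write
  .format("exampleV2Source") // hypothetical v2 source name
  .mode("append")
  .saveAsTable("my_table")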

How was this patch tested?

Unit tests. Many Hive tests broke while ResolveTables was not working properly, so I believe the current set of tests is sufficient to cover the table resolution and read code paths.

@SparkQA commented Aug 9, 2019

Test build #108893 has finished for PR 25402 at commit 9781ae8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2019

Test build #108897 has finished for PR 25402 at commit 99ba64d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pvk2727 left a comment

looks good

val session = df.sparkSession
val useV1Sources =
A member commented:

duplicated code with save, possible to have a function?
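
One possible shape for the shared helper this comment asks for; the helper name, the conf key, and its placement inside DataFrameWriter are assumptions for illustration, not necessarily what the PR settles on:

import java.util.Locale

import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.sources.v2.TableProvider

// Hypothetical shared helper: both save() and saveAsTable() need to decide
// whether `source` resolves to a v2 TableProvider that is not forced back
// onto the v1 path by the fallback conf.
private def lookupV2Provider(): Option[TableProvider] = {
  val session = df.sparkSession
  val useV1Sources = session.sessionState.conf
    .getConfString("spark.sql.sources.write.useV1SourceList", "") // conf key assumed
    .toLowerCase(Locale.ROOT).split(",").map(_.trim)
  DataSource.lookupDataSource(source, session.sessionState.conf)
      .getConstructor().newInstance() match {
    case provider: TableProvider if !useV1Sources.contains(source.toLowerCase(Locale.ROOT)) =>
      Some(provider)
    case _ => None
  }
}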

import org.apache.spark.sql.execution.SQLExecution
import org.apache.spark.sql.execution.command.DDLUtils
import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource, DataSourceUtils, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2._
import org.apache.spark.sql.internal.SQLConf.PartitionOverwriteMode
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister}
import org.apache.spark.sql.sources.BaseRelation
A member commented:

nit: duplicated import?

@SparkQA commented Aug 12, 2019

Test build #108985 has finished for PR 25402 at commit f73feb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedDataSourceV2Relation(table: Table) extends LeafNode with NamedRelation

@brkyvz changed the title from [WIP][SPARK-28666] Support saveAsTable for V2 tables through Session Catalog to [SPARK-28666] Support saveAsTable for V2 tables through Session Catalog on Aug 12, 2019
import org.apache.spark.sql.catalyst.analysis.{CastSupport, UnresolvedAttribute}
import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogTable, CatalogTableType, CatalogUtils, UnresolvedCatalogRelation}
@brkyvz (Contributor, Author) commented:

dammit IntelliJ :(

@brkyvz (Contributor, Author) commented Aug 12, 2019

cc @cloud-fan @rdblue @jzhuge This is ready for review now.

@SparkQA commented Aug 12, 2019

Test build #108992 has finished for PR 25402 at commit aac9503.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogTableAsV2(v1Table: CatalogTable) extends UnresolvedTable

@SparkQA commented Aug 12, 2019

Test build #108996 has finished for PR 25402 at commit 428e82a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2019

Test build #108997 has finished for PR 25402 at commit d9f478d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2019

Test build #108998 has finished for PR 25402 at commit e489a16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -172,7 +173,7 @@ class V2SessionCatalog(sessionState: SessionState) extends TableCatalog {
/**
* An implementation of catalog v2 [[Table]] to expose v1 table metadata.
*/
case class CatalogTableAsV2(v1Table: CatalogTable) extends Table {
case class CatalogTableAsV2(v1Table: CatalogTable) extends UnresolvedTable {
A contributor commented:

If we move CatalogTableAsV2 to catalyst and rename it to UnresolvedTable, then we don't need to create an extra interface. what do you think? @brkyvz @rdblue

@brkyvz (author) replied:

CatalogTable is defined in sql unfortunately.

@cloud-fan (Contributor) replied on Aug 14, 2019:

It's defined in catalyst: org.apache.spark.sql.catalyst.catalog.CatalogTable in file sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

@brkyvz (author) replied:

Oh, I like that a lot more
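
A minimal sketch of the suggested move, assuming the v2 Table API of this era (name/schema/capabilities, with partitioning and properties having defaults); the body here is abbreviated and illustrative:

package org.apache.spark.sql.sources.v2.internal

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.sources.v2.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Living in catalyst next to CatalogTable means no extra interface is needed:
// the wrapper itself is the "unresolved table" marker the analyzer keys on.
case class UnresolvedTable(v1Table: CatalogTable) extends Table {
  override def name: String = v1Table.identifier.quotedString
  override def schema: StructType = v1Table.schema
  // No v2 capabilities: encountering this table in a v2 plan tells the
  // analyzer to fall back to the v1 code path.
  override def capabilities: util.Set[TableCapability] = Set.empty[TableCapability].asJava
}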

verifyTable(t1, df)

// Check that appends are by name
df.select('data, 'id).write.format(v2Format).mode("append").saveAsTable(t1)
@cloud-fan (Contributor) commented on Aug 13, 2019:

IIRC, in DS v1, saveAsTable fails if the table exists but its provider differs from the one specified in df.write.format. Do we have this check in the v2 code path?

@brkyvz (author) replied:

I'll add a test

@brkyvz (author) added:

Since the provider isn't necessarily exposed by the Table API, I'm not sure such a check is required or possible.
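
For reference, a hedged sketch of the v1 behavior under discussion (written as a ScalaTest body; table name and error message wording are assumptions):

import org.apache.spark.sql.AnalysisException

// v1 behavior: create the table with one provider, then append with another.
spark.range(10).write.format("parquet").saveAsTable("provider_check_t")
val e = intercept[AnalysisException] {
  spark.range(10).write.format("json").mode("append").saveAsTable("provider_check_t")
}
// The v1 check compares the existing table's provider against the format the
// user specified; the v2 Table API has no equivalent hook yet.
assert(e.getMessage.toLowerCase(java.util.Locale.ROOT).contains("format"))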

@SparkQA commented Aug 13, 2019

Test build #109056 has finished for PR 25402 at commit 762f873.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109065 has finished for PR 25402 at commit 06cf1f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maybeCatalog.orElse(sessionCatalog)
.flatMap(loadTable(_, ident))
.map(DataSourceV2Relation.create)
.getOrElse(u)
A contributor commented:

A +1 on this.
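
Reading the chain step by step (comments added for illustration, inferred from the snippet above):

maybeCatalog.orElse(sessionCatalog)   // prefer an explicit v2 catalog, else the v2 session catalog
  .flatMap(loadTable(_, ident))       // try to load the table; None if it does not exist
  .map(DataSourceV2Relation.create)   // wrap a loaded v2 Table in a DataSourceV2Relation
  .getOrElse(u)                       // otherwise keep the original unresolved relation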

@@ -493,8 +494,12 @@ class DataSourceV2SQLSuite extends QueryTest with SharedSQLContext with BeforeAn

sparkSession.sql(s"CREATE TABLE table_name USING parquet AS SELECT id, data FROM source")

// use the catalog name to force loading with the v2 catalog
checkAnswer(sparkSession.sql(s"TABLE session.table_name"), sparkSession.table("source"))
checkAnswer(sparkSession.sql(s"TABLE default.table_name"), sparkSession.table("source"))
@brkyvz (author) commented:

We can maintain this behavior, but I'd rather not, as the V2SessionCatalog can't properly handle views and such

@SparkQA commented Aug 14, 2019

Test build #109081 has finished for PR 25402 at commit 99ae156.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109080 has finished for PR 25402 at commit 0bd93ae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109084 has finished for PR 25402 at commit 673d95a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

val session = df.sparkSession
val provider = DataSource.lookupDataSource(source, session.sessionState.conf)
A contributor commented:

the provider here may not be the actual table provider, as saveAsTable can write to an existing table. Maybe we should always use v2 session catalog?

@brkyvz (author) replied on Aug 14, 2019:

That works for me, since the V2 code path will fall back to the V1 code path if it sees an UnresolvedTable.

@brkyvz (author) added:

Hmm, actually that causes issues if the table doesn't exist. Maybe we should use the statements instead of the logical plans?

A contributor replied:

+1 on using statements.

A contributor replied:

let's do it in a followup.

@SparkQA commented Aug 14, 2019

Test build #109102 has finished for PR 25402 at commit 673d95a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109114 has finished for PR 25402 at commit a70e726.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalystDataToAvro(
  • case class Milliseconds(child: Expression, timeZoneId: Option[String] = None)
  • case class Microseconds(child: Expression, timeZoneId: Option[String] = None)
  • case class IsoYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class Epoch(child: Expression, timeZoneId: Option[String] = None)
  • case class DeleteFromTable(
  • case class DeleteFromStatement(
  • case class DeleteFromTableExec(

@cloud-fan (Contributor) commented:

thanks, merging to master! Please address #25402 (comment) in a followup.

@cloud-fan closed this in 0526529 on Aug 15, 2019
@brkyvz (Contributor, Author) commented Aug 15, 2019

Thanks @cloud-fan!

@jzhuge (Member) commented Aug 24, 2019

@brkyvz The Scala source file sql/catalyst/src/main/java/org/apache/spark/sql/sources/v2/internal/UnresolvedTable.scala is in a src/main/java directory. Is this intended?
