
[SPARK-28666] Support saveAsTable for V2 tables through Session Catalog #25402

Closed
wants to merge 16 commits

Conversation

@brkyvz (Contributor) commented Aug 9, 2019

What changes were proposed in this pull request?

We add support for saveAsTable through the V2SessionCatalog, so that V2 tables can plug into the existing DataFrameWriter.saveAsTable API to create and write tables through the session catalog.
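
A minimal usage sketch of what this enables, assuming a hypothetical data source named "exampleV2Source" whose provider implements the v2 TableProvider API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
val df = spark.range(10).toDF("id")

// With this change, saveAsTable on a v2 provider goes through the
// V2SessionCatalog rather than the v1 CreateTable code path.
df.write
  .format("exampleV2Source") // hypothetical v2 source name
  .mode("append")
  .saveAsTable("my_table")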

How was this patch tested?

Unit tests. Many Hive tests broke while ResolveTables was not working properly, so I believe the current set of tests is sufficient to cover the table resolution and read code paths.

@SparkQA commented Aug 9, 2019

Test build #108893 has finished for PR 25402 at commit 9781ae8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2019

Test build #108897 has finished for PR 25402 at commit 99ba64d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pvk2727 left a comment

looks good

val session = df.sparkSession
val useV1Sources =
A member commented:

duplicated code with save, possible to have a function?
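
One possible shape for the shared helper this comment asks for; the helper name, the conf key, and its placement inside DataFrameWriter are assumptions for illustration, not necessarily what the PR settles on:

import java.util.Locale

import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.sources.v2.TableProvider

// Hypothetical shared helper: both save() and saveAsTable() need to decide
// whether `source` resolves to a v2 TableProvider that is not forced back
// onto the v1 path by the fallback conf.
private def lookupV2Provider(): Option[TableProvider] = {
  val session = df.sparkSession
  val useV1Sources = session.sessionState.conf
    .getConfString("spark.sql.sources.write.useV1SourceList", "") // conf key assumed
    .toLowerCase(Locale.ROOT).split(",").map(_.trim)
  DataSource.lookupDataSource(source, session.sessionState.conf)
      .getConstructor().newInstance() match {
    case provider: TableProvider if !useV1Sources.contains(source.toLowerCase(Locale.ROOT)) =>
      Some(provider)
    case _ => None
  }
}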

import org.apache.spark.sql.execution.SQLExecution
import org.apache.spark.sql.execution.command.DDLUtils
import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource, DataSourceUtils, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2._
import org.apache.spark.sql.internal.SQLConf.PartitionOverwriteMode
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister}
import org.apache.spark.sql.sources.BaseRelation
A member commented:

nit: duplicated import?

@SparkQA commented Aug 12, 2019

Test build #108985 has finished for PR 25402 at commit f73feb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedDataSourceV2Relation(table: Table) extends LeafNode with NamedRelation

@brkyvz changed the title from [WIP][SPARK-28666] Support saveAsTable for V2 tables through Session Catalog to [SPARK-28666] Support saveAsTable for V2 tables through Session Catalog on Aug 12, 2019
import org.apache.spark.sql.catalyst.analysis.{CastSupport, UnresolvedAttribute}
import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogTable, CatalogTableType, CatalogUtils, UnresolvedCatalogRelation}
@brkyvz (Contributor, Author) commented:

dammit IntelliJ :(

@brkyvz (Contributor, Author) commented Aug 12, 2019

cc @cloud-fan @rdblue @jzhuge This is ready for review now.

@SparkQA commented Aug 12, 2019

Test build #108992 has finished for PR 25402 at commit aac9503.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogTableAsV2(v1Table: CatalogTable) extends UnresolvedTable

@SparkQA commented Aug 12, 2019

Test build #108996 has finished for PR 25402 at commit 428e82a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2019

Test build #108997 has finished for PR 25402 at commit d9f478d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2019

Test build #108998 has finished for PR 25402 at commit e489a16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -172,7 +173,7 @@ class V2SessionCatalog(sessionState: SessionState) extends TableCatalog {
/**
* An implementation of catalog v2 [[Table]] to expose v1 table metadata.
*/
case class CatalogTableAsV2(v1Table: CatalogTable) extends Table {
case class CatalogTableAsV2(v1Table: CatalogTable) extends UnresolvedTable {
A contributor commented:

If we move CatalogTableAsV2 to catalyst and rename it to UnresolvedTable, then we don't need to create an extra interface. what do you think? @brkyvz @rdblue

@brkyvz (author) replied:

CatalogTable is defined in sql unfortunately.

@cloud-fan (Contributor) replied on Aug 14, 2019:

It's defined in catalyst: org.apache.spark.sql.catalyst.catalog.CatalogTable in file sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

@brkyvz (author) replied:

Oh, I like that a lot more
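
A minimal sketch of the suggested move, assuming the v2 Table API of this era (name/schema/capabilities, with partitioning and properties having defaults); the body here is abbreviated and illustrative:

package org.apache.spark.sql.sources.v2.internal

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.sources.v2.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Living in catalyst next to CatalogTable means no extra interface is needed:
// the wrapper itself is the "unresolved table" marker the analyzer keys on.
case class UnresolvedTable(v1Table: CatalogTable) extends Table {
  override def name: String = v1Table.identifier.quotedString
  override def schema: StructType = v1Table.schema
  // No v2 capabilities: encountering this table in a v2 plan tells the
  // analyzer to fall back to the v1 code path.
  override def capabilities: util.Set[TableCapability] = Set.empty[TableCapability].asJava
}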

verifyTable(t1, df)

// Check that appends are by name
df.select('data, 'id).write.format(v2Format).mode("append").saveAsTable(t1)
@cloud-fan (Contributor) commented on Aug 13, 2019:

IIRC, in DS v1, saveAsTable fails if the table exists but its provider differs from the one specified in df.write.format. Do we have this check in the v2 code path?

@brkyvz (author) replied:

I'll add a test

@brkyvz (author) added:

Since the provider isn't necessarily exposed by the Table API, I'm not sure such a check is required or possible.
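
For reference, a hedged sketch of the v1 behavior under discussion (written as a ScalaTest body; table name and error message wording are assumptions):

import org.apache.spark.sql.AnalysisException

// v1 behavior: create the table with one provider, then append with another.
spark.range(10).write.format("parquet").saveAsTable("provider_check_t")
val e = intercept[AnalysisException] {
  spark.range(10).write.format("json").mode("append").saveAsTable("provider_check_t")
}
// The v1 check compares the existing table's provider against the format the
// user specified; the v2 Table API has no equivalent hook yet.
assert(e.getMessage.toLowerCase(java.util.Locale.ROOT).contains("format"))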

@SparkQA commented Aug 13, 2019

Test build #109056 has finished for PR 25402 at commit 762f873.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109065 has finished for PR 25402 at commit 06cf1f9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maybeCatalog.orElse(sessionCatalog)
.flatMap(loadTable(_, ident))
.map(DataSourceV2Relation.create)
.getOrElse(u)
A contributor commented:

A +1 on this.
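
Reading the chain step by step (comments added for illustration, inferred from the snippet above):

maybeCatalog.orElse(sessionCatalog)   // prefer an explicit v2 catalog, else the v2 session catalog
  .flatMap(loadTable(_, ident))       // try to load the table; None if it does not exist
  .map(DataSourceV2Relation.create)   // wrap a loaded v2 Table in a DataSourceV2Relation
  .getOrElse(u)                       // otherwise keep the original unresolved relation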

@@ -493,8 +494,12 @@ class DataSourceV2SQLSuite extends QueryTest with SharedSQLContext with BeforeAn

sparkSession.sql(s"CREATE TABLE table_name USING parquet AS SELECT id, data FROM source")

// use the catalog name to force loading with the v2 catalog
checkAnswer(sparkSession.sql(s"TABLE session.table_name"), sparkSession.table("source"))
checkAnswer(sparkSession.sql(s"TABLE default.table_name"), sparkSession.table("source"))
@brkyvz (author) commented:

We can maintain this behavior, but I'd rather not, as the V2SessionCatalog can't properly handle views and such

@SparkQA commented Aug 14, 2019

Test build #109081 has finished for PR 25402 at commit 99ae156.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109080 has finished for PR 25402 at commit 0bd93ae.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109084 has finished for PR 25402 at commit 673d95a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

val session = df.sparkSession
val provider = DataSource.lookupDataSource(source, session.sessionState.conf)
A contributor commented:

the provider here may not be the actual table provider, as saveAsTable can write to an existing table. Maybe we should always use v2 session catalog?

@brkyvz (author) replied on Aug 14, 2019:

That works for me, since the V2 code path will fall back to the V1 code path if it sees an UnresolvedTable.

@brkyvz (author) added:

Hmm, actually that causes issues if the table doesn't exist. Maybe we should use the statements instead of the logical plans?

A contributor replied:

+1 on using statements.

A contributor replied:

let's do it in a followup.

@SparkQA commented Aug 14, 2019

Test build #109102 has finished for PR 25402 at commit 673d95a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 14, 2019

Test build #109114 has finished for PR 25402 at commit a70e726.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalystDataToAvro(
  • case class Milliseconds(child: Expression, timeZoneId: Option[String] = None)
  • case class Microseconds(child: Expression, timeZoneId: Option[String] = None)
  • case class IsoYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class Epoch(child: Expression, timeZoneId: Option[String] = None)
  • case class DeleteFromTable(
  • case class DeleteFromStatement(
  • case class DeleteFromTableExec(

@cloud-fan (Contributor) commented:

thanks, merging to master! Please address #25402 (comment) in a followup.

@cloud-fan closed this in 0526529 on Aug 15, 2019
@brkyvz (Contributor, Author) commented Aug 15, 2019

Thanks @cloud-fan!

@jzhuge (Member) commented Aug 24, 2019

@brkyvz The Scala source file sql/catalyst/src/main/java/org/apache/spark/sql/sources/v2/internal/UnresolvedTable.scala is in a src/main/java directory. Is this intended?
