
[SPARK-19058][SQL] fix partition related behaviors with DataFrameWriter.saveAsTable #16460

Closed
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:append

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

When we append data to a partitioned table with DataFrameWriter.saveAsTable, there are 2 issues:

  1. it doesn't work when a partition has a custom location.
  2. it recovers all of the table's partitions, which is unnecessary for an append.

This PR fixes them by moving the special partition handling code from DataSourceAnalysis to InsertIntoHadoopFsRelationCommand, so that the DataFrameWriter.saveAsTable code path can also benefit from it.
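To make the scenario concrete, here is a minimal sketch of the affected usage (table name, schema, and path are illustrative, not taken from the PR):

```scala
import spark.implicits._

// A partitioned table where one partition has a custom location.
spark.sql("CREATE TABLE t (a INT, b INT) USING parquet PARTITIONED BY (b)")
spark.sql("ALTER TABLE t ADD PARTITION (b = 1) LOCATION '/tmp/custom_b1'")

// Before this PR, the appended row for b = 1 did not go to the custom
// location, and the append triggered a recovery of all partitions.
Seq((2, 1)).toDF("a", "b").write.mode("append").saveAsTable("t")
```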

How was this patch tested?

Newly added regression tests.

@cloud-fan (Contributor, Author)

cc @ericl @gatorsmile @yhuai

@SparkQA commented Jan 3, 2017

Test build #70815 has finished for PR 16460 at commit 7f4f360.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      options = options,
      query = data.logicalPlan,
      mode = mode,
-     catalogTable = catalogTable)
+     catalogTable = catalogTable,
+     fileIndex = fileIndex)
Contributor

Should we be issuing refresh table instead of refreshing the index directly?

cloud-fan (Contributor, Author)

I was following the original behavior: https://github.com/apache/spark/pull/16460/files#diff-d99813bd5bbc18277e4090475e4944cfL240

Besides, it's hard to issue refresh table in DataSourceAnalysis: the table could be a temp view, and the CatalogTable in LogicalRelation could be empty, so we lose the table identifier and cannot issue refresh table.
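Sketched, the constraint looks like this (`logicalRelation` and `fileIndex` stand in for the values available at that point in the analyzer; this is illustrative, not the PR's code):

```scala
// If the relation came from a temp view it may carry no CatalogTable,
// so there is no table identifier to issue REFRESH TABLE against.
logicalRelation.catalogTable match {
  case Some(table) =>
    spark.catalog.refreshTable(table.identifier.quotedString)
  case None =>
    fileIndex.refresh() // no identifier: refresh the FileIndex directly
}
```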

-    createTable(tableIdent)
+    createTable(tableIdentWithDB)
+    // Refresh the cache of the table in the catalog.
+    catalog.refreshTable(tableIdentWithDB)
Contributor

Is this already done by the insertion command?

cloud-fan (Contributor, Author) commented Jan 4, 2017

I moved it from the insertion command to here, as we only need to refresh the table for overwrite. For append, we only need to refresh the FileIndex.
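A rough sketch of the split being described, with names following the diff (the mode handling is illustrative, not the exact merged code):

```scala
import org.apache.spark.sql.SaveMode

// Overwrite may change the table's partition set, so refresh the whole
// table in the session catalog; append only adds files, so refreshing
// the cached FileIndex is enough.
mode match {
  case SaveMode.Overwrite =>
    sparkSession.sessionState.catalog.refreshTable(tableIdentWithDB)
  case SaveMode.Append =>
    fileIndex.foreach(_.refresh())
  case _ => // nothing cached to refresh for other modes
}
```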

@@ -74,12 +69,29 @@ case class InsertIntoHadoopFsRelationCommand(
     val fs = outputPath.getFileSystem(hadoopConf)
     val qualifiedOutputPath = outputPath.makeQualified(fs.getUri, fs.getWorkingDirectory)

+    val partitionsTrackedByCatalog = catalogTable.isDefined &&
+      catalogTable.get.partitionColumnNames.nonEmpty &&
+      catalogTable.get.tracksPartitionsInCatalog
Member

Also check sparkSession.sessionState.conf.manageFilesourcePartitions?
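The proposed guard, sketched against the condition from the diff (this is the reviewer's suggestion, not the merged code):

```scala
val partitionsTrackedByCatalog =
  sparkSession.sessionState.conf.manageFilesourcePartitions &&
    catalogTable.isDefined &&
    catalogTable.get.partitionColumnNames.nonEmpty &&
    catalogTable.get.tracksPartitionsInCatalog
```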

cloud-fan (Contributor, Author)

This is something I want to check with @ericl. What if users create a table with partition management, then turn it off, and read this table? If we treat it as a normal table, the data in custom partition paths will be ignored.

I think we should respect the partition management flag as of when the table was created, not when the table is read.

Contributor

Hm, in other parts of the code we assume that the feature is completely disabled when the flag is off. This is probably needed since there is no way to revert a table otherwise.

cloud-fan (Contributor, Author)

Do you mean we should completely ignore the partition information in the metastore when the flag is off, so that we should also ignore the data in custom partition paths?

@ericl (Contributor) commented Jan 4, 2017

Yeah, I think we should revert to the 2.0 behavior, as if querying the table from 2.0.

    fs: FileSystem,
    qualifiedOutputPath: Path,
    partitions: Seq[CatalogTablePartition]): Map[TablePartitionSpec, String] = {
  val table = catalogTable.get
Member

Shall we pass catalogTable as a function param? .get looks a little bit risky.

cloud-fan (Contributor, Author)

Yea, good idea.
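Sketched, the agreed refactor would look like this (the helper name and stubbed body are illustrative; parameter types follow the snippet above):

```scala
// Receive the CatalogTable directly instead of calling catalogTable.get
// inside the helper.
private def getCustomPartitionLocations(
    fs: FileSystem,
    table: CatalogTable,
    qualifiedOutputPath: Path,
    partitions: Seq[CatalogTablePartition]): Map[TablePartitionSpec, String] = {
  // the original body goes here, now referring to `table`
  // (stubbed out in this sketch)
  Map.empty
}
```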

partitionColumns = columns,
bucketSpec = bucketSpec,
fileFormat = format,
refreshFunction = _ => Unit, // No existing table needs to be refreshed.
Member

Previously, in this case, we did not call refreshPartitionsCallback. After this PR, we always refresh it. Is my understanding right?

How did it work without this PR's changes? Does that mean we just relied on Hive to implicitly call AlterTableAddPartitionCommand/createPartition when the table does not yet exist?

cloud-fan (Contributor, Author)

Previously, we did not refresh anything here, but we will repair the partitions in CreateDataSourceTableAsSelectCommand. After this PR, we only repair the partitions in CreateDataSourceTableAsSelectCommand when we are creating a new table.
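A hedged sketch of the post-PR behavior just described (`tableIsNew` is an illustrative flag, not a variable from the real code):

```scala
// In CreateDataSourceTableAsSelectCommand: recover partitions only when
// CTAS actually created a new table, not when appending to an existing one.
if (tableIsNew && table.partitionColumnNames.nonEmpty) {
  // equivalent to ALTER TABLE ... RECOVER PARTITIONS / MSCK REPAIR TABLE
  sparkSession.sql(s"ALTER TABLE ${table.identifier} RECOVER PARTITIONS")
}
```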

Member

But datasource.write is also called by DataFrameWriter's save(), so it is not covered by CreateDataSourceTableAsSelectCommand.

Member

nvm, it does not store the metadata in the catalog.

@SparkQA commented Jan 4, 2017

Test build #70879 has finished for PR 16460 at commit b7f2cce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

catalogTable.get.tracksPartitionsInCatalog

var initialMatchingPartitions: Seq[TablePartitionSpec] = Nil
var customPartitionLocations: Map[TablePartitionSpec, String] = Map.empty
Member

It sounds like we do not need to use var for initialMatchingPartitions and customPartitionLocations.

cloud-fan (Contributor, Author)

Yea, it's true, but then the code may look ugly, e.g.

val (longVariableNameXXXX: LongTypeNameXXX, longVariableNameXXXX: LongTypeNameXXX) = {
  ...
}
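For comparison, the val version with the names from the diff would look roughly like this (the catalog call is illustrative):

```scala
// val-style alternative: bind both results at once via tuple destructuring.
val (initialMatchingPartitions, customPartitionLocations) =
  if (partitionsTrackedByCatalog) {
    val matching = sparkSession.sessionState.catalog
      .listPartitions(catalogTable.get.identifier)
    (matching.map(_.spec),
      getCustomPartitionLocations(fs, catalogTable.get, qualifiedOutputPath, matching))
  } else {
    (Nil, Map.empty[TablePartitionSpec, String])
  }
```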

@gatorsmile (Member)

LGTM except one minor comment.

@cloud-fan (Contributor, Author)

cc @ericl any more comments on this PR?

@ericl (Contributor) commented Jan 5, 2017

looks good

@cloud-fan (Contributor, Author)

thanks for the review, merging to master!

@asfgit closed this in 30345c4 on Jan 5, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017
…er.saveAsTable
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…er.saveAsTable