
[SPARK-19092] [SQL] Save() API of DataFrameWriter should not scan all the saved files #16481

Closed
gatorsmile wants to merge 5 commits into master from gatorsmile:saveFileScan

Conversation

gatorsmile (Member)

What changes were proposed in this pull request?

DataFrameWriter's save() API performs an unnecessary full filesystem scan of the files it has just written. Since save() is the most basic/core API in DataFrameWriter, this scan should be avoided (a toy sketch of the idea follows below).

The related PR: #16090

How was this patch tested?

Updated the existing test cases.
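
To make the intent concrete, here is a minimal toy model of the early-exit idea. It is not the Spark code path: every name except isForWriteOnly is invented for illustration, and it models the "no relation" result with Option, which is only one of the alternatives debated in the review comments below (the patch as posted returns null).

```scala
// Toy model only: shows why skipping relation resolution avoids the file scan.
object WriteOnlySketch {
  final case class Relation(schema: Seq[String])

  // Stand-in for resolveRelation(): building a relation lists every saved file.
  def buildRelation(path: String, schema: Seq[String]): Relation = {
    println(s"listing all files under $path")   // the scan save() should not trigger
    Relation(schema)
  }

  // Stand-in for DataSource.write: plain save() passes isForWriteOnly = true,
  // gets no relation back, and therefore never triggers the listing.
  def write(path: String, schema: Seq[String], isForWriteOnly: Boolean): Option[Relation] = {
    // ... the data itself would be written out here ...
    if (isForWriteOnly) None
    else Some(buildRelation(path, schema))       // CTAS-style callers still need it
  }

  def main(args: Array[String]): Unit = {
    write("/tmp/out", Seq("id"), isForWriteOnly = true)    // no listing
    write("/tmp/out", Seq("id"), isForWriteOnly = false)   // prints the listing line
  }
}
```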

// check the cache hit, we use the metric of METRIC_FILES_DISCOVERED and
// METRIC_PARALLEL_LISTING_JOB_COUNT to check this, while the lock take effect,
// only one thread can really do the build, so the listing job count is 2, the other
// one is cache.load func. Also METRIC_FILES_DISCOVERED is $partition_num * 2
Member Author:

This comment is not accurate. The extra counts are from the save API call in setupPartitionedHiveTable.

copy(userSpecifiedSchema = Some(data.schema.asNullable)).resolveRelation()
if (isForWriteOnly) {
  // Exit earlier and return null
  null
Member Author:

I do not know whether returning null is ok here. This is based on a similar early-exit solution used in getOrInferFileFormatSchema.

Contributor:

Maybe we can change it to return an Option?
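
As a reference point for this suggestion, a minimal sketch of what the Option shape buys; the helper below is hypothetical, not Spark API, and takes the expensive resolution by-name so it never runs for write-only callers.

```scala
// Hypothetical helper, for illustration only.
def relationAfterWrite[R](isForWriteOnly: Boolean)(resolve: => R): Option[R] =
  if (isForWriteOnly) None else Some(resolve)

// Write-only path: the expensive block is never evaluated.
val none = relationAfterWrite(isForWriteOnly = true) {
  sys.error("would list all saved files")
}
assert(none.isEmpty)
```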

@gatorsmile (Member Author)

cc @ericl @cloud-fan

@cloud-fan (Contributor)

I think it's time to think about why DataSource.write should return BaseRelation. It seems that we only use it in CreateDataSourceTableAsSelect, to get the schema of the written data, only for nullability changes. Can we avoid doing this? Can we figure out the nullability changes more concisely?
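
One way to read that question, as a sketch: if the only difference between the input schema and the written data's schema really is nullability, the schema needed by CreateDataSourceTableAsSelect could be derived from the DataFrame that was written, with no relation resolved and hence no file listing. The helper below is illustrative, not Spark source, and only relaxes top-level fields.

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Illustrative only: derive the "written" schema by relaxing nullability,
// instead of re-resolving the relation after the write.
def relaxNullability(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))

val input   = StructType(Seq(StructField("id", LongType, nullable = false)))
val written = relaxNullability(input)   // same fields, all marked nullable
```

Whether that assumption holds for arbitrary external data sources is exactly what the following comments debate.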

spark.range(scale).selectExpr("id as fieldOne", "id as partCol1", "id as partCol2").write
  .partitionBy("partCol1", "partCol2")
  .mode("overwrite")
  .parquet(dir.getAbsolutePath)

if (clearMetricsBeforeCreate) {
Contributor:

Nice.

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70946 has finished for PR 16481 at commit 5d38f09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

gatorsmile commented Jan 6, 2017

@cloud-fan I gave that direction a try, but I am afraid it might break external data sources that extend CreatableRelationProvider. I am not sure what they might return in this case; there could be more changes than just the nullability.
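
For context, the contract being discussed is roughly the following (paraphrased from memory of org.apache.spark.sql.sources in Spark 2.x; check the actual source for the authoritative definition). Nothing in it forces the returned relation's schema to match the written DataFrame, which is exactly the worry here.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.BaseRelation

// Paraphrase of the external-source write contract (Spark 2.x).
trait CreatableRelationProvider {
  // Saves `data` and returns a relation for what was saved; the returned
  // schema is not required by the contract to equal data.schema.
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}
```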

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70956 has finished for PR 16481 at commit 2a8ce0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Replace the schema with that of the DataFrame we just wrote out to avoid re-inferring it.
copy(userSpecifiedSchema = Some(data.schema.asNullable)).resolveRelation()
if (isForWriteOnly) {
  // Exit earlier and return null
Contributor:

I'd remove "and return null"

Member Author:

Sure

@SparkQA

SparkQA commented Jan 8, 2017

Test build #71026 has finished for PR 16481 at commit 7d3eefb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
  throw new AnalysisException("Cannot save interval data type into external storage.")
}

providingClass.newInstance() match {
  case dataSource: CreatableRelationProvider =>
-   dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)
+   Some(dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data))
Contributor:

It would be really weird if CreatableRelationProvider.createRelation could return a relation whose schema differs from that of the written data. Is it safe to assume the schema won't change? cc @marmbrus @yhuai @liancheng

@cenyuhai (Contributor) Jan 11, 2017:

Maybe we can add a parameter here and let the user choose true or false, with the default being not to refresh the schema.

* Writes the given [[DataFrame]] out to this [[DataSource]].
*
* @param isForWriteOnly Whether to just write the data without returning a [[BaseRelation]].
*/
def write(
Contributor:

Let's create a new write method that returns Unit, and rename this write to writeAndRead, which should be removed eventually.

Member Author:

Sure. Will do it.
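
In outline, the requested split might look like the sketch below. This is a shape-only sketch, not the final patch: parameter lists are abbreviated and runWriteJob is a hypothetical placeholder for the shared write logic.

```scala
// Sketch only; signatures abbreviated, runWriteJob is a placeholder.

// New method for the plain save() path: write and return nothing,
// so no relation is resolved and no files are listed.
def write(mode: SaveMode, data: DataFrame): Unit =
  runWriteJob(mode, data)

// Renamed from the old write; kept only for CreateDataSourceTableAsSelect
// and meant to be removed eventually.
def writeAndRead(mode: SaveMode, data: DataFrame): BaseRelation = {
  runWriteJob(mode, data)
  copy(userSpecifiedSchema = Some(data.schema.asNullable)).resolveRelation()
}
```

A later follow-up (SPARK-22753, referenced at the bottom of this thread) removed writeAndRead entirely.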

mode = mode,
catalogTable = catalogTable,
fileIndex = fileIndex)
sparkSession.sessionState.executePlan(plan).toRdd
@gatorsmile (Member Author) Jan 12, 2017:

To the reviewers: the code in writeInFileFormat is copied from the FileFormat case of the original write function.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71279 has finished for PR 16481 at commit 111025f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master!

It conflicts with branch-2.1; can you send a new PR? Thanks.

asfgit closed this in 3356b8b on Jan 13, 2017
@cloud-fan (Contributor)

I'll update JIRA once the service is back.

@gatorsmile (Member Author)

Sure, will do it.

asfgit pushed a commit that referenced this pull request Jan 16, 2017
… not scan all the saved files #16481

### What changes were proposed in this pull request?

#### This PR is to backport #16481 to Spark 2.1
---
`DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan of the files it has just written. Since save() is the most basic/core API in `DataFrameWriter`, this scan should be avoided.

### How was this patch tested?
Added and modified the test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16588 from gatorsmile/backport-19092.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…the saved files

### What changes were proposed in this pull request?
`DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan of the files it has just written. Since save() is the most basic/core API in `DataFrameWriter`, this scan should be avoided.

The related PR: apache#16090

### How was this patch tested?
Updated the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#16481 from gatorsmile/saveFileScan.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…the saved files

asfgit pushed a commit that referenced this pull request Dec 15, 2017
## What changes were proposed in this pull request?

As discussed in #16481 and #18975 (comment), the BaseRelation returned by `dataSource.writeAndRead` is currently only used in `CreateDataSourceTableAsSelect`, and planForWriting and writeAndRead share some common code paths.
In this patch I removed the writeAndRead function and added the getRelation function, which is only used in `CreateDataSourceTableAsSelectCommand` when saving data to a non-existing table.

## How was this patch tested?

Existing UT

Author: Yuanjian Li <xyliyuanjian@gmail.com>

Closes #19941 from xuanyuanking/SPARK-22753.