
[SPARK-19092] [SQL] Save() API of DataFrameWriter should not scan all the saved files #16481

Closed
gatorsmile wants to merge 5 commits into master from gatorsmile:saveFileScan

Conversation

gatorsmile (Member)

What changes were proposed in this pull request?

DataFrameWriter's save() API performs an unnecessary full filesystem scan of the files it has just written. Since save() is the most basic/core API in DataFrameWriter, this scan should be avoided (a toy sketch of the idea follows below).

The related PR: #16090

How was this patch tested?

Updated the existing test cases.
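
To make the intent concrete, here is a minimal toy model of the early-exit idea. It is not the Spark code path: every name except isForWriteOnly is invented for illustration, and it models the "no relation" result with Option, which is only one of the alternatives debated in the review comments below (the patch as posted returns null).

```scala
// Toy model only: shows why skipping relation resolution avoids the file scan.
object WriteOnlySketch {
  final case class Relation(schema: Seq[String])

  // Stand-in for resolveRelation(): building a relation lists every saved file.
  def buildRelation(path: String, schema: Seq[String]): Relation = {
    println(s"listing all files under $path")   // the scan save() should not trigger
    Relation(schema)
  }

  // Stand-in for DataSource.write: plain save() passes isForWriteOnly = true,
  // gets no relation back, and therefore never triggers the listing.
  def write(path: String, schema: Seq[String], isForWriteOnly: Boolean): Option[Relation] = {
    // ... the data itself would be written out here ...
    if (isForWriteOnly) None
    else Some(buildRelation(path, schema))       // CTAS-style callers still need it
  }

  def main(args: Array[String]): Unit = {
    write("/tmp/out", Seq("id"), isForWriteOnly = true)    // no listing
    write("/tmp/out", Seq("id"), isForWriteOnly = false)   // prints the listing line
  }
}
```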

// check the cache hit, we use the metric of METRIC_FILES_DISCOVERED and
// METRIC_PARALLEL_LISTING_JOB_COUNT to check this, while the lock take effect,
// only one thread can really do the build, so the listing job count is 2, the other
// one is cache.load func. Also METRIC_FILES_DISCOVERED is $partition_num * 2
Member Author:

This comment is not accurate. The extra counts are from the save API call in setupPartitionedHiveTable.

copy(userSpecifiedSchema = Some(data.schema.asNullable)).resolveRelation()
if (isForWriteOnly) {
  // Exit earlier and return null
  null
Member Author:

I do not know whether returning null is ok here. This is based on a similar early-exit solution used in getOrInferFileFormatSchema.

Contributor:

Maybe we can change it to return an Option?
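
As a reference point for this suggestion, a minimal sketch of what the Option shape buys; the helper below is hypothetical, not Spark API, and takes the expensive resolution by-name so it never runs for write-only callers.

```scala
// Hypothetical helper, for illustration only.
def relationAfterWrite[R](isForWriteOnly: Boolean)(resolve: => R): Option[R] =
  if (isForWriteOnly) None else Some(resolve)

// Write-only path: the expensive block is never evaluated.
val none = relationAfterWrite(isForWriteOnly = true) {
  sys.error("would list all saved files")
}
assert(none.isEmpty)
```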

@gatorsmile (Member Author)

cc @ericl @cloud-fan

@cloud-fan (Contributor)

I think it's time to think about why DataSource.write should return BaseRelation. It seems that we only use it in CreateDataSourceTableAsSelect, to get the schema of the written data, only for nullability changes. Can we avoid doing this? Can we figure out the nullability changes more concisely?
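
One way to read that question, as a sketch: if the only difference between the input schema and the written data's schema really is nullability, the schema needed by CreateDataSourceTableAsSelect could be derived from the DataFrame that was written, with no relation resolved and hence no file listing. The helper below is illustrative, not Spark source, and only relaxes top-level fields.

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Illustrative only: derive the "written" schema by relaxing nullability,
// instead of re-resolving the relation after the write.
def relaxNullability(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))

val input   = StructType(Seq(StructField("id", LongType, nullable = false)))
val written = relaxNullability(input)   // same fields, all marked nullable
```

Whether that assumption holds for arbitrary external data sources is exactly what the following comments debate.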

spark.range(scale).selectExpr("id as fieldOne", "id as partCol1", "id as partCol2").write
  .partitionBy("partCol1", "partCol2")
  .mode("overwrite")
  .parquet(dir.getAbsolutePath)

if (clearMetricsBeforeCreate) {
Contributor:

Nice.

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70946 has finished for PR 16481 at commit 5d38f09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

gatorsmile commented Jan 6, 2017

@cloud-fan I gave that direction a try, but I am afraid it might break external data sources that extend CreatableRelationProvider. I am not sure what they might return in this case; there could be more changes than just the nullability.
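
For context, the contract being discussed is roughly the following (paraphrased from memory of org.apache.spark.sql.sources in Spark 2.x; check the actual source for the authoritative definition). Nothing in it forces the returned relation's schema to match the written DataFrame, which is exactly the worry here.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.BaseRelation

// Paraphrase of the external-source write contract (Spark 2.x).
trait CreatableRelationProvider {
  // Saves `data` and returns a relation for what was saved; the returned
  // schema is not required by the contract to equal data.schema.
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}
```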

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70956 has finished for PR 16481 at commit 2a8ce0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Replace the schema with that of the DataFrame we just wrote out to avoid re-inferring it.
copy(userSpecifiedSchema = Some(data.schema.asNullable)).resolveRelation()
if (isForWriteOnly) {
  // Exit earlier and return null
Contributor:

I'd remove "and return null"

Member Author:

Sure

@SparkQA

SparkQA commented Jan 8, 2017

Test build #71026 has finished for PR 16481 at commit 7d3eefb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
  throw new AnalysisException("Cannot save interval data type into external storage.")
}

providingClass.newInstance() match {
  case dataSource: CreatableRelationProvider =>
-   dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)
+   Some(dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data))
Contributor:

It would be really weird if CreatableRelationProvider.createRelation could return a relation whose schema differs from that of the written data. Is it safe to assume the schema won't change? cc @marmbrus @yhuai @liancheng

@cenyuhai (Contributor) Jan 11, 2017:

Maybe we can add a parameter here and let the user choose true or false, with the default being not to refresh the schema.

* Writes the given [[DataFrame]] out to this [[DataSource]].
*
* @param isForWriteOnly Whether to just write the data without returning a [[BaseRelation]].
*/
def write(
Contributor:

Let's create a new write method that returns Unit, and rename this write to writeAndRead, which should be removed eventually.

Member Author:

Sure. Will do it.
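
In outline, the requested split might look like the sketch below. This is a shape-only sketch, not the final patch: parameter lists are abbreviated and runWriteJob is a hypothetical placeholder for the shared write logic.

```scala
// Sketch only; signatures abbreviated, runWriteJob is a placeholder.

// New method for the plain save() path: write and return nothing,
// so no relation is resolved and no files are listed.
def write(mode: SaveMode, data: DataFrame): Unit =
  runWriteJob(mode, data)

// Renamed from the old write; kept only for CreateDataSourceTableAsSelect
// and meant to be removed eventually.
def writeAndRead(mode: SaveMode, data: DataFrame): BaseRelation = {
  runWriteJob(mode, data)
  copy(userSpecifiedSchema = Some(data.schema.asNullable)).resolveRelation()
}
```

A later follow-up (SPARK-22753, referenced at the bottom of this thread) removed writeAndRead entirely.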

mode = mode,
catalogTable = catalogTable,
fileIndex = fileIndex)
sparkSession.sessionState.executePlan(plan).toRdd
@gatorsmile (Member Author) Jan 12, 2017:

To the reviewers: the code in writeInFileFormat is copied from the FileFormat case of the original write function.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71279 has finished for PR 16481 at commit 111025f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master!

It conflicts with branch-2.1; can you send a new PR? Thanks.

asfgit closed this in 3356b8b on Jan 13, 2017
@cloud-fan (Contributor)

I'll update JIRA once the service is back.

@gatorsmile (Member Author)

Sure, will do it.

asfgit pushed a commit that referenced this pull request Jan 16, 2017
… not scan all the saved files #16481

### What changes were proposed in this pull request?

#### This PR is to backport #16481 to Spark 2.1
---
`DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan of the files it has just written. Since save() is the most basic/core API in `DataFrameWriter`, this scan should be avoided.

### How was this patch tested?
Added and modified the test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16588 from gatorsmile/backport-19092.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…the saved files

### What changes were proposed in this pull request?
`DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) performs an unnecessary full filesystem scan of the files it has just written. Since save() is the most basic/core API in `DataFrameWriter`, this scan should be avoided.

The related PR: apache#16090

### How was this patch tested?
Updated the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#16481 from gatorsmile/saveFileScan.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…the saved files

asfgit pushed a commit that referenced this pull request Dec 15, 2017
## What changes were proposed in this pull request?

As discussed in #16481 and #18975 (comment), the BaseRelation returned by `dataSource.writeAndRead` is currently only used in `CreateDataSourceTableAsSelect`, and planForWriting and writeAndRead share some common code paths.
In this patch I removed the writeAndRead function and added the getRelation function, which is only used in `CreateDataSourceTableAsSelectCommand` when saving data to a non-existing table.

## How was this patch tested?

Existing UT

Author: Yuanjian Li <xyliyuanjian@gmail.com>

Closes #19941 from xuanyuanking/SPARK-22753.