[SPARK-23896][SQL] Improve PartitioningAwareFileIndex #21004

Closed
wants to merge 13 commits into
base: master
from

4 participants
@gengliangwang
Contributor

gengliangwang commented Apr 8, 2018

What changes were proposed in this pull request?

Currently PartitioningAwareFileIndex accepts an optional parameter userPartitionSchema. If provided, it will combine the inferred partition schema with the parameter.

However,

  1. to get userPartitionSchema, we need to combine inferred partition schema with userSpecifiedSchema
  2. to get the inferred partition schema, we have to create a temporary file index.

Only after that, a final version of PartitioningAwareFileIndex can be created.

This can be improved by passing userSpecifiedSchema to PartitioningAwareFileIndex.

With the improvement, we can reduce redundant code and avoid parsing the file partition twice.
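The idea can be illustrated with a minimal, self-contained sketch. It is not the code from this patch: `Field`, `Schema`, `inferPartitionSchema` and `resolvePartitionSchema` are hypothetical, simplified stand-ins for Spark's schema types and partition inference, and the case-insensitive name match is an assumption. It only shows how a file index that receives `userSpecifiedSchema` directly can infer the partition columns once and then prefer the user's types, without a temporary index.

```scala
// Hypothetical, simplified stand-ins for Spark's StructField / StructType.
final case class Field(name: String, dataType: String)

final case class Schema(fields: Seq[Field]) {
  // Assumption: column names are matched case-insensitively.
  def find(name: String): Option[Field] =
    fields.find(_.name.equalsIgnoreCase(name))
}

object PartitionSchemaSketch {

  /** Infers partition columns from `key=value` path segments (types default to "string"). */
  def inferPartitionSchema(paths: Seq[String]): Schema = {
    val names = paths
      .flatMap(_.split("/").toSeq)   // split each path into segments
      .filter(_.contains("="))       // keep only `key=value` segments
      .map(_.takeWhile(_ != '='))    // keep the key
      .distinct
    Schema(names.map(Field(_, "string")))
  }

  /**
   * Keeps the inferred partition columns and their order, but uses the
   * user-specified field whenever the user schema names the same column.
   */
  def resolvePartitionSchema(
      inferred: Schema,
      userSpecifiedSchema: Option[Schema]): Schema =
    userSpecifiedSchema match {
      case Some(user) =>
        Schema(inferred.fields.map(f => user.find(f.name).getOrElse(f)))
      case None =>
        inferred
    }
}
```

With paths such as `/data/year=2018/month=4/part-0.parquet` and a user schema that declares `year` as `int`, the resolved partition schema keeps both `year` and `month`, with `year` typed by the user and `month` left as inferred.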

How was this patch tested?

Unit test


SparkQA Apr 8, 2018

Test build #89034 has finished for PR 21004 at commit 35aff24.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 9, 2018

Test build #89044 has finished for PR 21004 at commit 10536a6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang Apr 9, 2018

Contributor

retest this please.


SparkQA Apr 9, 2018

Test build #89049 has finished for PR 21004 at commit 10536a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang commented Apr 10, 2018


SparkQA Apr 11, 2018

Test build #89215 has finished for PR 21004 at commit d12efab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang Apr 11, 2018

Contributor

retest this please.


SparkQA Apr 12, 2018

Test build #89224 has finished for PR 21004 at commit d12efab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 12, 2018

Test build #89233 has finished for PR 21004 at commit 43f6b77.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang Apr 12, 2018

Contributor

retest this please.


cloud-fan Apr 12, 2018

Contributor

retest this please


SparkQA Apr 12, 2018

Test build #89244 has finished for PR 21004 at commit 43f6b77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 12, 2018

Test build #89245 has finished for PR 21004 at commit 630fb8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 12, 2018

Test build #89259 has finished for PR 21004 at commit 91946a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 12, 2018

Test build #89272 has finished for PR 21004 at commit d871ea8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 12, 2018

Test build #89273 has finished for PR 21004 at commit e9b6e90.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang Apr 12, 2018

Contributor

retest this please.


SparkQA Apr 12, 2018

Test build #89277 has finished for PR 21004 at commit 60d5b6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 12, 2018

Test build #89288 has finished for PR 21004 at commit 60d5b6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang Apr 12, 2018

Contributor

retest this please.


@@ -552,6 +523,40 @@ case class DataSource(
        sys.error(s"${providingClass.getCanonicalName} does not allow create table as select.")
    }
  }
  /** Returns an [[InMemoryFileIndex]] that can be used to get partition schema and file list. */
  private def createInMemoryFileIndex(globbedPaths: Seq[Path]): InMemoryFileIndex = {


cloud-fan Apr 13, 2018

Contributor

this can be def createInMemoryFileIndex(checkEmptyGlobPath: Boolean)


cloud-fan Apr 13, 2018

Contributor

and we can merge checkAndGlobPathIfNecessary and createInMemoryFileIndex


gengliangwang Apr 13, 2018

Contributor

No, we can't. In some cases we need to check the glob files but don't need to create an InMemoryFileIndex.
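A rough sketch of why the two steps might stay separate, under stated assumptions: `checkPaths`, `createIndex` and `IndexSketch` are hypothetical names, not Spark's `checkAndGlobPathIfNecessary` or `createInMemoryFileIndex`, and the existence check is injected as a function so the example needs no file system. A caller that only needs validation can call the first helper alone; a caller that also needs an index composes the two.

```scala
// Hypothetical stand-in for a file index; the real one is Spark's InMemoryFileIndex.
final case class IndexSketch(rootPaths: Seq[String])

object GlobCheckSketch {

  /** Validates the paths and returns them unchanged; throws on the first missing path. */
  def checkPaths(paths: Seq[String], exists: String => Boolean): Seq[String] = {
    paths.find(p => !exists(p)).foreach { missing =>
      throw new IllegalArgumentException(s"Path does not exist: $missing")
    }
    paths
  }

  /** Builds an index from paths that have already been checked. */
  def createIndex(checkedPaths: Seq[String]): IndexSketch =
    IndexSketch(checkedPaths)
}
```

Merging both helpers into one would force every validation-only caller to pay for building an index it never uses.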


SparkQA Apr 13, 2018

Test build #89306 has finished for PR 21004 at commit 60d5b6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA Apr 13, 2018

Test build #89315 has finished for PR 21004 at commit 12ac191.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


gengliangwang Apr 13, 2018

Contributor

retest this please.


SparkQA Apr 13, 2018

Test build #89319 has finished for PR 21004 at commit 12ac191.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


cloud-fan Apr 13, 2018

Contributor

thanks, merging to master!


@asfgit asfgit closed this in 4dfd746 Apr 13, 2018


HyukjinKwon Apr 14, 2018

Member

(Next time, let's avoid a PR title that just says "improvement".)


mgaido91 added a commit to mgaido91/spark that referenced this pull request Apr 16, 2018

[SPARK-23896][SQL] Improve PartitioningAwareFileIndex
## What changes were proposed in this pull request?

Currently `PartitioningAwareFileIndex` accepts an optional parameter `userPartitionSchema`. If provided, it will combine the inferred partition schema with the parameter.

However,
1. to get `userPartitionSchema`, we need to combine inferred partition schema with `userSpecifiedSchema`
2. to get the inferred partition schema, we have to create a temporary file index.

Only after that, a final version of `PartitioningAwareFileIndex` can be created.

This can be improved by passing `userSpecifiedSchema` to `PartitioningAwareFileIndex`.

With the improvement, we can reduce redundant code and avoid parsing the file partition twice.
## How was this patch tested?
Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes apache#21004 from gengliangwang/PartitioningAwareFileIndex.

pepinoflo added a commit to pepinoflo/spark that referenced this pull request May 15, 2018
