
[SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat #14627

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Aug 13, 2016

What changes were proposed in this pull request?

This PR removes the duplicated file-path checks in the data sources implementing `FileFormat` and prevents the ORC data source from attempting to list files twice.

#14585 fixes a problem with partition column names containing `_`, and the issue itself is resolved correctly. However, the data sources implementing `FileFormat` seem to validate the paths redundantly. Judging from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I believe we do not need to perform this check twice.

Currently, these paths appear to be filtered out in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` should always receive leaf files. For example, running the code below:

```scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```

yields the paths below, which contain no directories but only valid data files:

```bash
/tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```

to `FileFormat.inferSchema`.
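The leaf-file filtering mentioned above can be sketched roughly as follows. This is a simplified, self-contained illustration of the `shouldFilterOut` rule, not the exact predicate in `PartitioningAwareFileIndex`, whose details may differ between Spark versions:

```scala
// Simplified sketch of Spark's leaf-file filtering rule (illustration only;
// the real predicate lives in PartitioningAwareFileIndex.shouldFilterOut).
object PathFilterSketch {
  def shouldFilterOut(pathName: String): Boolean = {
    // Exclude hidden files and names starting with "_", unless the name
    // contains "=" (a partition directory such as "_col=0").
    val exclude =
      (pathName.startsWith("_") && !pathName.contains("=")) ||
        pathName.startsWith(".")
    // Summary metadata files are still meaningful and are kept.
    val include =
      pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
    exclude && !include
  }
}
```

Under this rule, markers like `_SUCCESS` and hidden files are filtered out, while partition directories such as `_col=0` and ordinary data files pass through, which is why only leaf data files reach `FileFormat.inferSchema`.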

How was this patch tested?

A unit test was added in `HadoopFsRelationTest`; related existing tests also cover this change.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16975][SQL] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC Aug 13, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Aug 13, 2016

FYI, in most data sources, schema inference would either read files twice or fail (I have not tested this yet) if directories were allowed in the `Seq[FileStatus]` passed to `FileFormat.inferSchema`.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC in schema inference Aug 13, 2016
@dongjoon-hyun
Member

dongjoon-hyun commented Aug 13, 2016

Hi, @HyukjinKwon .
Your test code seems to pass even without your change. Could you confirm that first?

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 13, 2016

@dongjoon-hyun Thanks for taking a look! Actually, I intended to test your fix across all data sources, just to make sure the issue in SPARK-16975 is resolved for all of them (your fix already seems correct).

As this is a clean-up that removes the duplicated logic (a follow-up rather than a bug fix), I believe the existing tests should cover it.

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 13, 2016

I am happy to add more tests, but I am a bit confused about what I should test. Maybe a test for `HadoopFsRelation.listLeafFiles` to make sure directories and invalid paths are not included in the return value? Could I ask for some advice?

@dongjoon-hyun
Member

Then it's not about fixing new or remaining bugs from the previous one (SPARK-16975). I see!

@HyukjinKwon
Member Author

Yes, it is not. I am sorry for the confusion. Your fix seems perfectly fine and correct; this is just a clean-up to remove the duplicated logic.

BTW, I am also okay with removing the test added here if you think the existing tests added in your PR are enough!

@HyukjinKwon
Member Author

BTW, I was waiting for the Jenkins tests before cc'ing anyone. However, since you are already here (which I appreciate), I would be grateful if you could take a look.

@dongjoon-hyun
Member

Only committers know what is needed in Spark; we are just contributors sending pull requests as proposals. You can do anything in your PR. I'm not against you or this PR. Good luck! :)

@HyukjinKwon
Member Author

Ah, yes thank you!

@SparkQA

SparkQA commented Aug 13, 2016

Test build #63714 has finished for PR 14627 at commit 3fa597c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Could you take a look, please, @rxin and @liancheng?

@HyukjinKwon
Member Author

This PR removes duplicated logic which, I believe, does not serve as a safeguard, because the checks are inconsistent with each other. I would appreciate it if you both, @rxin and @liancheng, could take a look.

@HyukjinKwon
Member Author

gentle ping @liancheng

@SparkQA

SparkQA commented Oct 16, 2016

Test build #67027 has finished for PR 14627 at commit bd14038.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@liancheng If you are uncertain about the change in `OrcFileOperator.scala`, I will definitely remove it from this PR.

@HyukjinKwon
Member Author

(gentle ping @liancheng)

@HyukjinKwon
Member Author

(I minimised the changes here to make the review easier.)

@SparkQA

SparkQA commented Nov 19, 2016

Test build #68887 has finished for PR 14627 at commit b68ce0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC in schema inference [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat Nov 23, 2016
@SparkQA

SparkQA commented Dec 22, 2016

Test build #70520 has finished for PR 14627 at commit 47c835d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 22, 2016

Thanks - merging in master.

@rxin
Contributor

rxin commented Dec 22, 2016

Actually there is a conflict with branch-2.1. Does this fix any bug? If not we don't need to merge it in 2.1.

@asfgit asfgit closed this in 76622c6 Dec 22, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Dec 23, 2016

@rxin, it does not fix any bug; it just removes the duplicated logic. I will open a separate JIRA in such cases in the future to prevent confusion. Thank you.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 24, 2016
…ata sources implementing FileFormat

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14627 from HyukjinKwon/SPARK-16975.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ata sources implementing FileFormat
@HyukjinKwon HyukjinKwon deleted the SPARK-16975 branch January 2, 2018 03:43