
[SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat #14627

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Aug 13, 2016

What changes were proposed in this pull request?

This PR removes the duplicated file-path checks in the data sources implementing `FileFormat` and prevents the ORC data source from attempting to list files twice.

#14585 fixes a problem with partition column names containing `_`, and the issue itself is resolved correctly. However, the data sources implementing `FileFormat` seem to validate the paths redundantly. Judging from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I believe we do not need to perform this check twice.

Currently, these paths appear to be filtered out in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` should always receive leaf files. For example, running the code below:

```scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```

yields the paths below, which contain no directories but only valid data files:

```bash
/tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```

to `FileFormat.inferSchema`.
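The leaf-file filtering mentioned above can be sketched roughly as follows. This is a simplified, self-contained illustration of the `shouldFilterOut` rule, not the exact predicate in `PartitioningAwareFileIndex`, whose details may differ between Spark versions:

```scala
// Simplified sketch of Spark's leaf-file filtering rule (illustration only;
// the real predicate lives in PartitioningAwareFileIndex.shouldFilterOut).
object PathFilterSketch {
  def shouldFilterOut(pathName: String): Boolean = {
    // Exclude hidden files and names starting with "_", unless the name
    // contains "=" (a partition directory such as "_col=0").
    val exclude =
      (pathName.startsWith("_") && !pathName.contains("=")) ||
        pathName.startsWith(".")
    // Summary metadata files are still meaningful and are kept.
    val include =
      pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
    exclude && !include
  }
}
```

Under this rule, markers like `_SUCCESS` and hidden files are filtered out, while partition directories such as `_col=0` and ordinary data files pass through, which is why only leaf data files reach `FileFormat.inferSchema`.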

How was this patch tested?

A unit test was added in `HadoopFsRelationTest`; related existing tests also cover this change.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16975][SQL] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC Aug 13, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Aug 13, 2016

FYI, in most data sources, schema inference would either read files twice or fail (I have not tested this yet) if directories were allowed in the `Seq[FileStatus]` passed to `FileFormat.inferSchema`.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC in schema inference Aug 13, 2016
@dongjoon-hyun
Member

dongjoon-hyun commented Aug 13, 2016

Hi, @HyukjinKwon .
Your test code seems to pass even without your change. Could you confirm that first?

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 13, 2016

@dongjoon-hyun Thanks for taking a look! Actually, I intended to test your fix across all data sources, just to make sure the issue in SPARK-16975 is resolved for all of them (your fix already seems correct).

As this is a clean-up that removes the duplicated logic (a follow-up rather than a bug fix), I believe the existing tests should cover it.

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 13, 2016

I am happy to add more tests, but I am a bit confused about what I should test. Maybe a test for `HadoopFsRelation.listLeafFiles` to make sure directories and invalid paths are not included in the return value? Could I ask for some advice?

@dongjoon-hyun
Member

Then it's not about fixing new or remaining bugs from the previous one (SPARK-16975). I see!

@HyukjinKwon
Member Author

Yes, it is not. I am sorry for the confusion. Your fix seems perfectly fine and correct; this is just a clean-up to remove the duplicated logic.

BTW, I am also okay with removing the test added here if you think the existing tests added in your PR are enough!

@HyukjinKwon
Member Author

BTW, I was waiting for the Jenkins tests before cc'ing anyone. However, since you are already here (which I appreciate), I would be grateful if you could take a look.

@dongjoon-hyun
Member

Only committers know what is needed in Spark; we are just contributors sending pull requests as proposals. You can do anything in your PR. I'm not against you or this PR. Good luck! :)

@HyukjinKwon
Member Author

Ah, yes thank you!

@SparkQA

SparkQA commented Aug 13, 2016

Test build #63714 has finished for PR 14627 at commit 3fa597c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Could you take a look, please, @rxin and @liancheng?

@HyukjinKwon
Member Author

This PR removes duplicated logic which, I believe, does not serve as a safeguard, because the checks are inconsistent with each other. I would appreciate it if you both, @rxin and @liancheng, could take a look.

@HyukjinKwon
Member Author

gentle ping @liancheng

@SparkQA

SparkQA commented Oct 16, 2016

Test build #67027 has finished for PR 14627 at commit bd14038.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@liancheng If you are uncertain about the change in `OrcFileOperator.scala`, I will definitely remove it from this PR.

@HyukjinKwon
Member Author

(gentle ping @liancheng)

@HyukjinKwon
Member Author

(I minimised the changes here to make the review easier.)

@SparkQA

SparkQA commented Nov 19, 2016

Test build #68887 has finished for PR 14627 at commit b68ce0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat and prevent to attempt to list twice in ORC in schema inference [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat Nov 23, 2016
@SparkQA

SparkQA commented Dec 22, 2016

Test build #70520 has finished for PR 14627 at commit 47c835d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 22, 2016

Thanks - merging in master.

@rxin
Contributor

rxin commented Dec 22, 2016

Actually there is a conflict with branch-2.1. Does this fix any bug? If not we don't need to merge it in 2.1.

@asfgit asfgit closed this in 76622c6 Dec 22, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Dec 23, 2016

@rxin, it does not fix any bug; it just removes the duplicated logic. I will open a separate JIRA in such cases in the future to prevent confusion. Thank you.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 24, 2016
…ata sources implementing FileFormat

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14627 from HyukjinKwon/SPARK-16975.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ata sources implementing FileFormat
@HyukjinKwon HyukjinKwon deleted the SPARK-16975 branch January 2, 2018 03:43