[SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat #14627
FYI, in most data sources, it would try to read duplicately or fail to infer the schema (haven't tested this yet) if directories are allowed as |
Hi, @HyukjinKwon . |
@dongjoon-hyun Thanks for taking a look! Actually, I intended to test your fix for all data sources just to make sure the issue in As it is a clean-up that removes duplicated logic (not a bug fix but a follow-up), I believe existing tests should cover this. |
I am happy to add some more tests, but I am a bit confused about what I should test. Maybe a test for |
Then, it's not about fixing new or remaining bugs of the previous one (SPARK-16975). I see! |
Yes, it is not. I am sorry for the confusion. Your fix seems perfectly fine and correct; this is just a clean-up to get rid of duplicated logic. BTW, I am also okay with removing the test added here if you think the existing test added in your PR is enough! |
BTW, I am waiting for the Jenkins tests before cc'ing anyone. However, as you are already here (I appreciate it), I would appreciate it if you took a look. |
Only committers know what is needed at Spark. We are just contributors who send pull requests as proposals. You can do anything in your PR. I'm not against you or this PR. Good luck! :) |
Ah, yes thank you! |
Test build #63714 has finished for PR 14627 at commit
|
Could you take a look please @rxin and @liancheng ? |
This PR removes duplicated logic, which I guess is not a safeguard because the checks are inconsistent. I would appreciate it if you both, @rxin and @liancheng, took a look. |
gentle ping @liancheng |
3fa597c to bd14038
Test build #67027 has finished for PR 14627 at commit
|
@liancheng If you are uncertain of |
(gentle ping @liancheng) |
bd14038 to 08d07be
(I minimised the changes here to make the review easier) |
Test build #68887 has finished for PR 14627 at commit
|
b68ce0c to 47c835d
Test build #70520 has finished for PR 14627 at commit
|
Thanks - merging in master. |
Actually there is a conflict with branch-2.1. Does this fix any bug? If not we don't need to merge it in 2.1. |
@rxin, it does not fix any bug; it just gets rid of duplicated logic. In the future I will try to open a separate JIRA in cases like this to prevent confusion. Thank you. |
…ata sources implementing FileFormat

## What changes were proposed in this pull request?

This PR cleans up duplicated checks of file paths in the data sources implementing `FileFormat` and prevents the ORC data source from attempting to list files twice.

apache#14585 handles a problem with partition column names that have a leading `_`, and that issue itself is resolved correctly. However, the data sources implementing `FileFormat` appear to validate the paths redundantly. Judging from the comment `// TODO: Move filtering.` in `CSVFileFormat`, I believe we do not need to check them again.

Currently, this filtering appears to happen in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` will always receive leaf files. For example, running the code below:

```scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```

passes the paths below, which contain no directories but only valid data files:

```bash
/tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```

to `FileFormat.inferSchema`.

## How was this patch tested?

Unit test added in `HadoopFsRelationTest` and related existing tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#14627 from HyukjinKwon/SPARK-16975.
## What changes were proposed in this pull request?

This PR cleans up duplicated checks of file paths in the data sources implementing `FileFormat` and prevents the ORC data source from attempting to list files twice.

#14585 handles a problem with partition column names that have a leading `_`, and that issue itself is resolved correctly. However, the data sources implementing `FileFormat` appear to validate the paths redundantly. Judging from the comment `// TODO: Move filtering.` in `CSVFileFormat`, I believe we do not need to check them again.

Currently, this filtering appears to happen in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` will always receive leaf files. For example, writing data partitioned by a column named `_locality_code` to `/tmp/parquet` and reading it back with `spark.read.parquet("/tmp/parquet")` passes only valid data files, without directories, to `FileFormat.inferSchema`.

## How was this patch tested?

Unit test added in `HadoopFsRelationTest` and related existing tests.
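
As a rough illustration of the kind of name-based filtering discussed in this PR, here is a minimal sketch in plain Scala. It is not Spark's implementation: `PathFilterSketch`, `isHiddenName`, and `keepPath` are hypothetical names, the rule itself is an assumption (names starting with `_` or `.` are treated as hidden, except `_`-prefixed names containing `=`, which are assumed to be partition directories), and the file names in `main` are made up.

```scala
// Illustrative sketch only: these names are hypothetical, not Spark's API.
object PathFilterSketch {

  // Assumed rule: a name starting with "_" or "." is hidden, unless it starts
  // with "_" and contains "=", which we take to be a partition directory
  // such as "_locality_code=0".
  def isHiddenName(name: String): Boolean =
    (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")

  // Keep a path only if none of its non-empty segments is hidden.
  def keepPath(path: String): Boolean =
    path.split("/").filter(_.nonEmpty).forall(segment => !isHiddenName(segment))

  def main(args: Array[String]): Unit = {
    // Made-up candidate paths, for illustration only.
    val candidates = Seq(
      "/tmp/parquet/_SUCCESS",
      "/tmp/parquet/_locality_code=0/part-r-00000.snappy.parquet",
      "/tmp/parquet/.part-r-00000.crc"
    )
    // Print only the paths that survive the filter.
    candidates.filter(keepPath).foreach(println)
  }
}
```

If filtering of this kind is applied once, when files are listed, then a consumer such as `FileFormat.inferSchema` only sees data files and has no need to repeat the check, which is the clean-up this PR argues for.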