Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13207][SQL][BRANCH-1.6] Make partitioning discovery ignore _SUCCESS files. #11697

Closed
wants to merge 1 commit into from
Closed

Conversation

yhuai
Copy link
Contributor

@yhuai yhuai commented Mar 14, 2016

If a _SUCCESS appears in the inner partitioning dir, partition discovery will treat that _SUCCESS file as a data file. Then, partition discovery will fail because it finds that the dir structure is not valid. We should ignore those _SUCCESS files.

In future, it is better to ignore all files/dirs starting with _ or .. This PR does not make this change. I am thinking about making this change simple, so we can consider of getting it in branch 1.6.

To ignore all files/dirs starting with _ or `, the main change is to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by parquet) and data files, which requires more changes.

https://issues.apache.org/jira/browse/SPARK-13207

If a _SUCCESS appears in the inner partitioning dir, partition discovery will treat that _SUCCESS file as a data file. Then, partition discovery will fail because it finds that the dir structure is not valid. We should ignore those `_SUCCESS` files.

In future, it is better to ignore all files/dirs starting with `_` or `.`. This PR does not make this change. I am thinking about making this change simple, so we can consider of getting it in branch 1.6.

To ignore all files/dirs starting with `_` or `, the main change is to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by parquet) and data files, which requires more changes.

https://issues.apache.org/jira/browse/SPARK-13207

Author: Yin Huai <yhuai@databricks.com>

Closes #11088 from yhuai/SPARK-13207.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
@yhuai
Copy link
Contributor Author

yhuai commented Mar 14, 2016

@davies How about we also get this in branch 1.6?

@davies
Copy link
Contributor

davies commented Mar 14, 2016

+1, we should.

@davies
Copy link
Contributor

davies commented Mar 14, 2016

LGTM

@SparkQA
Copy link

SparkQA commented Mar 14, 2016

Test build #53080 has finished for PR 11697 at commit 02908f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Mar 15, 2016

I've merged this in branch-1.6.

asfgit pushed a commit that referenced this pull request Mar 15, 2016
…CCESS files.

If a _SUCCESS appears in the inner partitioning dir, partition discovery will treat that _SUCCESS file as a data file. Then, partition discovery will fail because it finds that the dir structure is not valid. We should ignore those `_SUCCESS` files.

In future, it is better to ignore all files/dirs starting with `_` or `.`. This PR does not make this change. I am thinking about making this change simple, so we can consider of getting it in branch 1.6.

To ignore all files/dirs starting with `_` or `, the main change is to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by parquet) and data files, which requires more changes.

https://issues.apache.org/jira/browse/SPARK-13207

Author: Yin Huai <yhuai@databricks.com>

Closes #11697 from yhuai/SPARK13207_branch16.
@rxin
Copy link
Contributor

rxin commented Mar 15, 2016

(You'd need to close this)

@yhuai yhuai closed this Mar 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants