
[SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files #24483

Closed
wants to merge 5 commits

Conversation

mengxr (Contributor) commented Apr 29, 2019

What changes were proposed in this pull request?

If a file is too big (>2GB), we should fail fast and not try to read the file.
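
For illustration, a minimal sketch of that fail-fast idea (Scala; `status` is assumed to be the Hadoop FileStatus of the file being read and `maxLength` the configured limit, so the names are illustrative rather than a quote of the patch):

import org.apache.spark.SparkException

// Check the file's reported length before reading any bytes; a file larger than
// the limit fails immediately instead of failing mid-read or exhausting task memory.
if (status.getLen > maxLength) {
  throw new SparkException(
    s"The length of ${status.getPath} is ${status.getLen}, " +
      s"which exceeds the configured maximum of $maxLength bytes.")
}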

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

mengxr (Contributor, Author) commented Apr 29, 2019

private[binaryfile]
val CONF_TEST_BINARY_FILE_MAX_LENGTH = "spark.test.data.source.binaryFile.maxLength"
/** An internal conf for testing max length. */
private[binaryfile] val TEST_BINARY_FILE_MAX_LENGTH = SQLConf
Contributor: We usually put all conf entries in SQLConf.

Contributor (author): Even for an internal conf used only for tests?

Contributor: Ah, if it's testing only, no.

SparkQA commented Apr 29, 2019

Test build #104983 has finished for PR 24483 at commit 11ff2cc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor, Author) commented Apr 29, 2019

test this please

SparkQA commented Apr 29, 2019

Test build #104998 has finished for PR 24483 at commit 1577966.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #104997 has finished for PR 24483 at commit 11ff2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor, Author) commented Apr 29, 2019

@cloud-fan It seems I have to register the conf to verify its default value is INT_MAX. I moved the conf definition to SQLConf.
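
For reference, a conf entry registered in SQLConf typically looks roughly like the following (a sketch; the doc text and exact builder chain are illustrative, not a quote of the patch):

val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf("spark.sql.sources.binaryFile.maxLength")
  .internal()
  .doc("The maximum file length the binary file data source will read. " +
    "Files longer than this cause the source to fail fast instead of reading them.")
  .intConf
  .createWithDefault(Int.MaxValue)

// Registering the entry is what lets a test check the default directly, e.g.:
// assert(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.defaultValue.contains(Int.MaxValue))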

SparkQA commented Apr 29, 2019

Test build #105002 has started for PR 24483 at commit f182606.

shaneknapp (Contributor): test this please

private[sql]
val CONF_SOURCES_BINARY_FILE_MAX_LENGTH = "spark.sql.sources.binaryFile.maxLength"
private[sql]
val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf(CONF_SOURCES_BINARY_FILE_MAX_LENGTH)
Member: Nit: I think we can follow the other SQLConf entries here by putting the conf key directly into buildConf, without assigning it to a variable first. Also, we can remove the private[sql]:

val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf("spark.sql.sources.binaryFile.maxLength")...

We can still set the conf using its key, SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key.
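
For example (illustrative; withSQLConf is the usual Spark SQL test helper and the value here is arbitrary), a test could override the limit through the key:

withSQLConf(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key -> "8") {
  // exercise the binary file source with an 8-byte limit
}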

Contributor (author): Done.

@@ -99,6 +101,7 @@ class BinaryFileFormat extends FileFormat with DataSourceRegister {
val binaryFileSourceOptions = new BinaryFileSourceOptions(options)
val pathGlobPattern = binaryFileSourceOptions.pathGlobFilter
val filterFuncs = filters.map(filter => createFilterFunction(filter))
val maxLength = sparkSession.conf.get(SOURCES_BINARY_FILE_MAX_LENGTH)
Member: Nit: we can define a method in SQLConf, like SQLConf.maxRecordsPerFile.
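
A hypothetical accessor following that pattern (not adopted in this PR) might look like this inside SQLConf:

// Hypothetical convenience getter, analogous to SQLConf.maxRecordsPerFile.
def binaryFileMaxLength: Int = getConf(SOURCES_BINARY_FILE_MAX_LENGTH)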

Contributor (author): The logic is not general enough to be applied outside the binary file data source.

@@ -115,6 +118,11 @@ class BinaryFileFormat extends FileFormat with DataSourceRegister {
case (MODIFICATION_TIME, i) =>
writer.write(i, DateTimeUtils.fromMillis(status.getModificationTime))
case (CONTENT, i) =>
if (status.getLen > maxLength) {
Member: I think we can move this check to line 113.

Contributor (author): I don't get it. The conf is there to prevent reading very large files that we know will fail. Users can still use the data source if they don't need the content column.
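
In other words, a metadata-only read should still work for oversized files, since the length check guards only the content column. A hypothetical usage sketch (the path is illustrative):

// Selecting only metadata columns never materializes file contents, so files above
// the limit can still be listed. `spark` is an existing SparkSession.
val df = spark.read.format("binaryFile").load("/data/blobs")
df.select("path", "length", "modificationTime").show()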

Member: I see. I am actually OK with either way.

QueryTest.checkAnswer(readContent(), expected)
}
// Disable read. If the implementation attempts to read, the exception would be different.
file.setReadable(false)
Member: Seems the test can still pass without this line. Maybe we can remove it?

Contributor (author): If we still set the max to content.length, the test will fail. This is to ensure we don't even attempt to read the file when it is too big.
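
A rough sketch of that test shape (a fragment; `file`, `content`, and `readContent()` come from the surrounding suite, and the exact exception message is implementation-specific):

// Make the file unreadable so any real read attempt would surface a different error.
file.setReadable(false)
// Set the limit below the file length; the source should fail on the length check
// alone, which proves no read was attempted.
withSQLConf(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key -> (content.length - 1).toString) {
  intercept[SparkException] { readContent() }
}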

gengliangwang (Member) left a comment: LGTM

SparkQA commented Apr 29, 2019

Test build #105003 has finished for PR 24483 at commit f182606.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #105004 has finished for PR 24483 at commit 0d6f92c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor, Author) commented Apr 29, 2019

Merged into master. Thanks for the review!

asfgit closed this in 618d6bf on Apr 29, 2019
HyukjinKwon (Member): Late LGTM too :)

lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
…to read very large files

If a file is too big (>2GB), we should fail fast and not try to read the file.

Closes apache#24483 from mengxr/SPARK-27588.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>