[SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files #24483
Conversation
private[binaryfile]
val CONF_TEST_BINARY_FILE_MAX_LENGTH = "spark.test.data.source.binaryFile.maxLength"
/** An internal conf for testing max length. */
private[binaryfile] val TEST_BINARY_FILE_MAX_LENGTH = SQLConf
we usually put all conf entries in SQLConf.
Even for an internal conf used for tests?
ah if it's testing only, no.
Test build #104983 has finished for PR 24483 at commit
test this please
Test build #104998 has finished for PR 24483 at commit
Test build #104997 has finished for PR 24483 at commit
This reverts commit 1577966.
@cloud-fan It seems I have to register the conf to verify its default value is INT_MAX. I moved the conf definition to SQLConf.
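For context, the kind of check that requires registration looks roughly like this (a sketch, not Spark's actual test code; it assumes the entry ends up in SQLConf under its final name SOURCES_BINARY_FILE_MAX_LENGTH):

// Sketch: a registered ConfigEntry exposes its default, so a test can assert on it.
import org.apache.spark.sql.internal.SQLConf

assert(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.defaultValue == Some(Int.MaxValue))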
Test build #105002 has started for PR 24483 at commit
test this please
private[sql]
val CONF_SOURCES_BINARY_FILE_MAX_LENGTH = "spark.sql.sources.binaryFile.maxLength"
private[sql]
val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf(CONF_SOURCES_BINARY_FILE_MAX_LENGTH)
Nit: I think we can follow the other SQLConf entries here by putting the conf key directly into buildConf without assigning it to a separate variable, and we can remove the private[sql]:
val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf("spark.sql.sources.binaryFile.maxLength")...
Where the key is needed, we can still get it via SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key.
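Spelled out, the suggested shape would be roughly the following (a sketch based on this comment; the doc string is illustrative, not the committed wording):

// In SQLConf: the key is inlined into buildConf, so no separate CONF_* constant is needed.
val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf("spark.sql.sources.binaryFile.maxLength")
  .internal()
  .doc("The max length of a file that the binary file data source will read. " +
    "If a file's length exceeds this value, reading its content fails fast.")
  .intConf
  .createWithDefault(Int.MaxValue)

Call sites and tests can then reference SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key when they need the raw key string.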
Done.
@@ -99,6 +101,7 @@ class BinaryFileFormat extends FileFormat with DataSourceRegister {
     val binaryFileSourceOptions = new BinaryFileSourceOptions(options)
     val pathGlobPattern = binaryFileSourceOptions.pathGlobFilter
     val filterFuncs = filters.map(filter => createFilterFunction(filter))
+    val maxLength = sparkSession.conf.get(SOURCES_BINARY_FILE_MAX_LENGTH)
Nit: we can define a method in SQLConf, like SQLConf.maxRecordsPerFile.
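Such an accessor would be a one-liner (a sketch of the suggestion; the method name is hypothetical, mirroring the existing maxRecordsPerFile pattern):

// Hypothetical convenience accessor in SQLConf.
def binaryFileMaxLength: Int = getConf(SOURCES_BINARY_FILE_MAX_LENGTH)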
? The logic is not general enough to be applied outside the binary file data source.
@@ -115,6 +118,11 @@ class BinaryFileFormat extends FileFormat with DataSourceRegister {
       case (MODIFICATION_TIME, i) =>
         writer.write(i, DateTimeUtils.fromMillis(status.getModificationTime))
       case (CONTENT, i) =>
+        if (status.getLen > maxLength) {
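The truncated guard above continues along these lines (a sketch; the exact error message wording is an assumption):

// Fail fast before any read: if the file exceeds the configured cap,
// throw rather than attempt to load its content into a byte array.
if (status.getLen > maxLength) {
  throw new SparkException(
    s"The length of ${status.getPath} is ${status.getLen}, " +
      s"which exceeds the max length allowed: $maxLength.")
}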
I think we can move this to line 113.
I don't get it. The conf is to prevent reading very large files that we are sure will fail. Users can still use the data source if they don't need the content column.
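For example, a metadata-only query never triggers the check (a usage sketch; the path is illustrative):

// Only metadata columns are selected, so no file content is materialized
// and oversized files do not cause a failure.
val meta = spark.read.format("binaryFile")
  .load("/path/to/blobs")
  .select("path", "length", "modificationTime")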
I see. I am actually OK with either way.
    QueryTest.checkAnswer(readContent(), expected)
  }
  // Disable read. If the implementation attempts to read, the exception would be different.
  file.setReadable(false)
Seems the test can still pass without this line. Maybe we can remove it?
If we still set the max to content.length, the test will fail. This is to ensure we don't even attempt to read the file if it is too big.
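Putting the two together, the scenario the test covers is roughly this (a sketch reusing readContent() and content from the snippet above; the exact assertions and error message may differ from the committed test):

file.setReadable(false)  // any real read would now fail with a permission error
withSQLConf(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key -> (content.length - 1).toString) {
  val e = intercept[SparkException] {
    readContent().collect()
  }
  // The error must come from the fail-fast length check, not from opening the file.
  assert(e.getMessage.contains("exceeds the max length allowed"))
}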
LGTM
Test build #105003 has finished for PR 24483 at commit
Test build #105004 has finished for PR 24483 at commit
Merged into master. Thanks for the review!
Late LGTM too :)
[SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files. If a file is too big (>2GB), we should fail fast and not try to read the file. Closes apache#24483 from mengxr/SPARK-27588. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>
What changes were proposed in this pull request?
If a file is too big (>2GB), we should fail fast and not try to read the file.
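The 2GB bound comes from the JVM: the content column is a single byte array, which cannot exceed Int.MaxValue bytes. The cap can also be lowered explicitly (a usage sketch; the path and limit are illustrative):

// Cap content reads at 512MB; any larger file makes the scan fail fast.
spark.conf.set("spark.sql.sources.binaryFile.maxLength", (512 * 1024 * 1024).toString)
val df = spark.read.format("binaryFile").load("/data/blobs")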
How was this patch tested?
Unit tests.