
[SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files #24483

Closed
wants to merge 5 commits

Conversation

mengxr (Contributor) commented Apr 29, 2019

What changes were proposed in this pull request?

If a file is too big (>2GB), we should fail fast and not try to read the file.
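
For illustration, a minimal sketch of that fail-fast idea (Scala; `status` is assumed to be the Hadoop FileStatus of the file being read and `maxLength` the configured limit, so the names are illustrative rather than a quote of the patch):

import org.apache.spark.SparkException

// Check the file's reported length before reading any bytes; a file larger than
// the limit fails immediately instead of failing mid-read or exhausting task memory.
if (status.getLen > maxLength) {
  throw new SparkException(
    s"The length of ${status.getPath} is ${status.getLen}, " +
      s"which exceeds the configured maximum of $maxLength bytes.")
}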

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

mengxr (Contributor, Author) commented Apr 29, 2019

private[binaryfile]
val CONF_TEST_BINARY_FILE_MAX_LENGTH = "spark.test.data.source.binaryFile.maxLength"
/** An internal conf for testing max length. */
private[binaryfile] val TEST_BINARY_FILE_MAX_LENGTH = SQLConf
Contributor: We usually put all conf entries in SQLConf.

Contributor (author): Even for an internal conf used only for tests?

Contributor: Ah, if it's testing only, no.

SparkQA commented Apr 29, 2019

Test build #104983 has finished for PR 24483 at commit 11ff2cc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor, Author) commented Apr 29, 2019

test this please

SparkQA commented Apr 29, 2019

Test build #104998 has finished for PR 24483 at commit 1577966.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #104997 has finished for PR 24483 at commit 11ff2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor, Author) commented Apr 29, 2019

@cloud-fan It seems I have to register the conf to verify its default value is INT_MAX. I moved the conf definition to SQLConf.
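
For reference, a conf entry registered in SQLConf typically looks roughly like the following (a sketch; the doc text and exact builder chain are illustrative, not a quote of the patch):

val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf("spark.sql.sources.binaryFile.maxLength")
  .internal()
  .doc("The maximum file length the binary file data source will read. " +
    "Files longer than this cause the source to fail fast instead of reading them.")
  .intConf
  .createWithDefault(Int.MaxValue)

// Registering the entry is what lets a test check the default directly, e.g.:
// assert(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.defaultValue.contains(Int.MaxValue))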

SparkQA commented Apr 29, 2019

Test build #105002 has started for PR 24483 at commit f182606.

shaneknapp (Contributor): test this please

private[sql]
val CONF_SOURCES_BINARY_FILE_MAX_LENGTH = "spark.sql.sources.binaryFile.maxLength"
private[sql]
val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf(CONF_SOURCES_BINARY_FILE_MAX_LENGTH)
Member: Nit: I think we can follow the other SQLConf entries here by putting the conf key directly into buildConf, without assigning it to a variable first. Also, we can remove the private[sql]:

val SOURCES_BINARY_FILE_MAX_LENGTH = buildConf("spark.sql.sources.binaryFile.maxLength")...

We can still set the conf using its key, SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key.
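
For example (illustrative; withSQLConf is the usual Spark SQL test helper and the value here is arbitrary), a test could override the limit through the key:

withSQLConf(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key -> "8") {
  // exercise the binary file source with an 8-byte limit
}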

Contributor (author): Done.

@@ -99,6 +101,7 @@ class BinaryFileFormat extends FileFormat with DataSourceRegister {
val binaryFileSourceOptions = new BinaryFileSourceOptions(options)
val pathGlobPattern = binaryFileSourceOptions.pathGlobFilter
val filterFuncs = filters.map(filter => createFilterFunction(filter))
val maxLength = sparkSession.conf.get(SOURCES_BINARY_FILE_MAX_LENGTH)
Member: Nit: we can define a method in SQLConf, like SQLConf.maxRecordsPerFile.
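
A hypothetical accessor following that pattern (not adopted in this PR) might look like this inside SQLConf:

// Hypothetical convenience getter, analogous to SQLConf.maxRecordsPerFile.
def binaryFileMaxLength: Int = getConf(SOURCES_BINARY_FILE_MAX_LENGTH)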

Contributor (author): The logic is not general enough to be applied outside the binary file data source.

@@ -115,6 +118,11 @@ class BinaryFileFormat extends FileFormat with DataSourceRegister {
case (MODIFICATION_TIME, i) =>
writer.write(i, DateTimeUtils.fromMillis(status.getModificationTime))
case (CONTENT, i) =>
if (status.getLen > maxLength) {
Member: I think we can move this check to line 113.

Contributor (author): I don't get it. The conf is there to prevent reading very large files that we know will fail. Users can still use the data source if they don't need the content column.
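
In other words, a metadata-only read should still work for oversized files, since the length check guards only the content column. A hypothetical usage sketch (the path is illustrative):

// Selecting only metadata columns never materializes file contents, so files above
// the limit can still be listed. `spark` is an existing SparkSession.
val df = spark.read.format("binaryFile").load("/data/blobs")
df.select("path", "length", "modificationTime").show()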

Member: I see. I am actually OK with either way.

QueryTest.checkAnswer(readContent(), expected)
}
// Disable read. If the implementation attempts to read, the exception would be different.
file.setReadable(false)
Member: Seems the test can still pass without this line. Maybe we can remove it?

Contributor (author): If we still set the max to content.length, the test will fail. This is to ensure we don't even attempt to read the file when it is too big.
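
A rough sketch of that test shape (a fragment; `file`, `content`, and `readContent()` come from the surrounding suite, and the exact exception message is implementation-specific):

// Make the file unreadable so any real read attempt would surface a different error.
file.setReadable(false)
// Set the limit below the file length; the source should fail on the length check
// alone, which proves no read was attempted.
withSQLConf(SQLConf.SOURCES_BINARY_FILE_MAX_LENGTH.key -> (content.length - 1).toString) {
  intercept[SparkException] { readContent() }
}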

gengliangwang (Member) left a comment: LGTM

SparkQA commented Apr 29, 2019

Test build #105003 has finished for PR 24483 at commit f182606.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 29, 2019

Test build #105004 has finished for PR 24483 at commit 0d6f92c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor, Author) commented Apr 29, 2019

Merged into master. Thanks for the review!

asfgit closed this in 618d6bf on Apr 29, 2019
HyukjinKwon (Member): Late LGTM too :)

lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
…to read very large files

If a file is too big (>2GB), we should fail fast and not try to read the file.

Closes apache#24483 from mengxr/SPARK-27588.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>