
[SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet #16474

Closed

Conversation

@viirya (Member) commented Jan 5, 2017

What changes were proposed in this pull request?

We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and doesn't work for Parquet:

  1. We only ignore corrupt files in `FileScanRDD`. In fact, we begin reading those files as early as schema inference, and for a corrupt file we can't read the schema, so the program fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
  2. In `FileScanRDD`, we assume the files are only read once we start consuming the iterator. However, the files may be read before that, in which case the `ignoreCorruptFiles` config doesn't work either.

This patch targets the Parquet datasource. If this direction is OK, we can address the same issue for other datasources like ORC.

Two main changes in this patch:

  1. Replace `ParquetFileReader.readAllFootersInParallel` with our own logic for reading footers in a multi-threaded manner.

    We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`.

  2. In `FileScanRDD`, also ignore corrupt files when calling `readFunction` to obtain the iterator.

One thing to note:

We read the schema from the Parquet file's footer. The footer-reading method `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, when it can't successfully read the footer (see https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470), so this patch catches `RuntimeException`. One concern is that this might also shadow runtime exceptions unrelated to corrupt files.
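To make that trade-off concrete, here is a minimal sketch of guarding a single footer read under that behavior; `readFooterIgnoringCorruption` is a hypothetical helper name, not code from this patch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.{Footer, ParquetFileReader}

// Hypothetical helper: read one footer, treating RuntimeException as corruption.
def readFooterIgnoringCorruption(
    conf: Configuration,
    currentFile: FileStatus,
    ignoreCorruptFiles: Boolean): Option[Footer] = {
  try {
    // Only the schema is needed, so row group metadata is skipped.
    Some(new Footer(currentFile.getPath,
      ParquetFileReader.readFooter(conf, currentFile, SKIP_ROW_GROUPS)))
  } catch {
    // readFooter throws RuntimeException (not IOException) on a bad footer,
    // so this is what must be caught; the trade-off is that unrelated
    // runtime errors may be swallowed too.
    case _: RuntimeException if ignoreCorruptFiles => None
  }
}
```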

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA commented Jan 5, 2017

Test build #70903 has finished for PR 16474 at commit 586b347.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
      // Reads footers in multi-threaded manner within each task
      val footers =
-       ParquetFileReader.readAllFootersInParallel(
-         serializedConf.value, fakeFileStatuses.asJava, skipRowGroups).asScala
+       ParquetFileFormat.readParquetFootersInParallel(
```
Contributor:

what's happening to readAllFootersInParallel?

Member Author:

We can't make it ignore corrupt files: it either reads all footers successfully or fails completely if even one footer is corrupt.

Contributor:

Can we add some unit tests for readParquetFootersInParallel?

Member Author:

ok.

```scala
    partFiles: Seq[FileStatus],
    ignoreCorruptFiles: Boolean): Seq[Footer] = {
  val footers = partFiles.map { currentFile =>
    new Callable[Option[Footer]]() {
```
Contributor:

this seems pretty convoluted. can we just use parallel collections to do this?

Member Author:

ok. let me try to change this to parallel collections.
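For reference, a rough sketch of what a parallel-collections version might look like; the method shape, the pool size of 8, and the error handling here are illustrative assumptions rather than the final patch:

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.{Footer, ParquetFileReader}

def readParquetFootersInParallel(
    conf: Configuration,
    partFiles: Seq[FileStatus],
    ignoreCorruptFiles: Boolean): Seq[Footer] = {
  val parFiles = partFiles.par
  // Bound the parallelism explicitly instead of relying on the global pool.
  parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
  parFiles.flatMap { currentFile =>
    try {
      // Skip row groups: only the schema in the footer is needed here.
      Some(new Footer(currentFile.getPath,
        ParquetFileReader.readFooter(conf, currentFile, SKIP_ROW_GROUPS)))
    } catch {
      // Corrupt footers surface as RuntimeException; drop the file instead
      // of failing the whole job when ignoreCorruptFiles is set.
      case _: RuntimeException if ignoreCorruptFiles => None
    }
  }.seq
}
```

Compared to the `Callable`-based version, the parallel collection handles thread management, so the per-file logic reduces to a `flatMap` over `Option`.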

@viirya force-pushed the fix-ignorecorrupted-parquet-files branch from d6878e1 to 6b562eb on January 9, 2017 03:23
@SparkQA commented Jan 9, 2017

Test build #71053 has finished for PR 16474 at commit d6878e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2017

Test build #71055 has finished for PR 16474 at commit 6b562eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jan 12, 2017

ping @rxin, I've addressed the previous comments. Can you review again?

```scala
  // E.g., vectorized Parquet reader.
  readFunction(currentFile)
} catch {
  case e @ (_: RuntimeException | _: IOException) =>
```
Contributor:

shall we also check the error message? Otherwise catching RuntimeException may swallow other unexpected exceptions.

Member Author:

yeah, I have this concern too in the PR description.

One problem is that the error message varies across data sources. Listing all the error messages here doesn't look like a good idea.

```scala
} catch {
  case e @ (_: RuntimeException | _: IOException) =>
    logWarning(s"Skipped the rest content in the corrupted file: $currentFile", e)
    null
```
Contributor:

return Iterator.empty
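That is, under this suggestion the catch block in the excerpt above would become something like (sketch, reusing the names from the excerpt):

```scala
} catch {
  case e @ (_: RuntimeException | _: IOException) =>
    logWarning(s"Skipped the rest content in the corrupted file: $currentFile", e)
    // An empty iterator is safer than null: callers can consume it directly
    // without a null check.
    Iterator.empty
}
```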

```scala
try {
  // The readFunction may read files before consuming the iterator.
  // E.g., vectorized Parquet reader.
  readFunction(currentFile)
```
Contributor:

is it possible to make readFunction guarantee that data reading happens only after the first Iterator.next?

Member Author:

I think it is hard to guarantee this because readFunction comes from the individual data source. Even if we can modify the current data sources, we may not be able to prevent other data sources from doing this.
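Since the guard can't live inside each data source's `readFunction`, it has to wrap the call site. A self-contained sketch of that shape, where `openIgnoringCorruption` is a hypothetical helper and `String` stands in for `FileScanRDD`'s actual `PartitionedFile` type:

```scala
import java.io.IOException

// Sketch: wrap the eager readFunction call itself, not just the consumption
// of the iterator it returns.
def openIgnoringCorruption[T](
    currentFile: String,                       // stands in for PartitionedFile
    readFunction: String => Iterator[T],
    ignoreCorruptFiles: Boolean): Iterator[T] = {
  if (!ignoreCorruptFiles) {
    readFunction(currentFile)
  } else {
    try {
      // The readFunction may read the file before the iterator is consumed,
      // e.g. a vectorized reader that opens the file and footer eagerly.
      readFunction(currentFile)
    } catch {
      case e @ (_: RuntimeException | _: IOException) =>
        // In FileScanRDD this would be logWarning(...); println keeps the
        // sketch self-contained.
        println(s"Skipped the rest content in the corrupted file: $currentFile ($e)")
        Iterator.empty
    }
  }
}
```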

@cloud-fan (Contributor):

LGTM

@SparkQA commented Jan 16, 2017

Test build #71417 has finished for PR 16474 at commit 261e1b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 16, 2017
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.

(cherry picked from commit 61e48f5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan (Contributor):

thanks, merging to master/2.1!

@asfgit closed this in 61e48f5 on Jan 16, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
@viirya deleted the fix-ignorecorrupted-parquet-files branch on December 27, 2023 18:20