[SPARK-50483][CORE][SQL][3.5] BlockMissingException should be thrown even if ignoreCorruptFiles is enabled #49105

wangyum · 2024-12-07T16:17:12Z

What changes were proposed in this pull request?

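BlockMissingException extends IOException. When a BlockMissingException occurs and ignoreCorruptFiles is enabled, the exception is handled like a corrupt-file error: the current task may read no data and still be marked as successful (code: https://github.com/apache/spark/blob/0d045db8d15d0aeb0f54a1557fd360363e77ed42/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L271-L273). This causes data quality issues.
Generally speaking, a BlockMissingException indicates a system issue, not a file corruption issue. Therefore, BlockMissingException should be thrown even if ignoreCorruptFiles is enabled.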
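
The handling described above can be sketched as follows. This is a minimal Python model of the control flow, not Spark's actual Scala code: `IOExceptionStandIn`, `BlockMissingStandIn`, and `read_split` are hypothetical names standing in for `java.io.IOException`, `org.apache.hadoop.hdfs.BlockMissingException`, and the file-reading loop. It illustrates the more specific exception being re-raised before the general ignoreCorruptFiles handler can swallow it.

```python
import logging


class IOExceptionStandIn(Exception):
    """Hypothetical stand-in for java.io.IOException."""


class BlockMissingStandIn(IOExceptionStandIn):
    """Hypothetical stand-in for org.apache.hadoop.hdfs.BlockMissingException."""


def read_split(read, ignore_corrupt_files):
    """Collect rows from the iterator returned by read().

    Models the proposed behavior: a missing block always fails the task,
    while other I/O errors are skipped only when ignore_corrupt_files is set.
    """
    rows = []
    try:
        for row in read():
            rows.append(row)
    except BlockMissingStandIn:
        # Subclass handler comes first: a missing block is a system issue,
        # so it propagates regardless of ignore_corrupt_files.
        raise
    except IOExceptionStandIn as e:
        if not ignore_corrupt_files:
            raise
        # Mirrors the WARN message: keep what was read, skip the rest.
        logging.warning("Skipped the rest of the content in the corrupted file: %s", e)
    return rows
```

Without the first `except` clause, the second one would also catch `BlockMissingStandIn` (it is a subclass), and the call would return an empty or partial result as if the read had succeeded.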

Related error message:

24/11/29 01:56:00 WARN FileScanRDD: Skipped the rest of the content in the corrupted file: path: viewfs://hadoop-cluster/path/to/data/part-00320-7915e327-3214-4585-a44e-f9c58e362b43.c000.snappy.parquet, range: 191727616-281354675, partition values: [empty row]
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-169998034-10.210.23.11-1507067630530:blk_83565156183_82548880660 file/path/to/data/part-00320-7915e327-3214-4585-a44e-f9c58e362b43.c000.snappy.parquet No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.209.145.174:50010,DS-c7c0a172-5ffa-4f90-bfb5-717fb1e9ecf2,DISK] DatanodeInfoWithStorage[10.3.22.142:50010,DS-a1ba9ac9-dc92-4131-a2c2-9f7d03b97caf,DISK] DatanodeInfoWithStorage[10.209.146.156:50010,DS-71d8ae97-15d3-454e-a715-d9490e184989,DISK] Dead nodes:  DatanodeInfoWithStorage[10.209.146.156:50010,DS-71d8ae97-15d3-454e-a715-d9490e184989,DISK] DatanodeInfoWithStorage[10.209.145.174:50010,DS-c7c0a172-5ffa-4f90-bfb5-717fb1e9ecf2,DISK] DatanodeInfoWithStorage[10.3.22.142:50010,DS-a1ba9ac9-dc92-4131-a2c2-9f7d03b97caf,DISK]
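
The message above lists the same three DataNodes under both "Block locations" and "Dead nodes", which is what makes this look like a system issue rather than a corrupt file. The short Python check below confirms that from the text of the message. It is only a sketch: the regular expression assumes the `DatanodeInfoWithStorage[host:port,...]` layout seen in this one log line, and `datanodes` is a hypothetical helper name.

```python
import re

# Excerpt copied from the BlockMissingException message above.
MESSAGE = (
    "No live nodes contain current block Block locations: "
    "DatanodeInfoWithStorage[10.209.145.174:50010,DS-c7c0a172-5ffa-4f90-bfb5-717fb1e9ecf2,DISK] "
    "DatanodeInfoWithStorage[10.3.22.142:50010,DS-a1ba9ac9-dc92-4131-a2c2-9f7d03b97caf,DISK] "
    "DatanodeInfoWithStorage[10.209.146.156:50010,DS-71d8ae97-15d3-454e-a715-d9490e184989,DISK] "
    "Dead nodes:  "
    "DatanodeInfoWithStorage[10.209.146.156:50010,DS-71d8ae97-15d3-454e-a715-d9490e184989,DISK] "
    "DatanodeInfoWithStorage[10.209.145.174:50010,DS-c7c0a172-5ffa-4f90-bfb5-717fb1e9ecf2,DISK] "
    "DatanodeInfoWithStorage[10.3.22.142:50010,DS-a1ba9ac9-dc92-4131-a2c2-9f7d03b97caf,DISK]"
)


def datanodes(text):
    """Return the set of host:port entries found in text."""
    return set(re.findall(r"DatanodeInfoWithStorage\[([^,\]]+),", text))


# Split the message into the replica list and the dead-node list.
locations_part, dead_part = MESSAGE.split("Dead nodes:")
locations = datanodes(locations_part)
dead = datanodes(dead_part)

# True when every replica location is also reported as dead.
all_replicas_dead = locations <= dead
```

Here `all_replicas_dead` is true: no replica of the block was reachable, so the file content itself was never shown to be corrupt.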

Why are the changes needed?

To avoid data quality issues when a BlockMissingException occurs while ignoreCorruptFiles is enabled.

Does this PR introduce any user-facing change?

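Yes. With ignoreCorruptFiles enabled, a BlockMissingException is now thrown instead of being skipped as a corrupt file.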

How was this patch tested?

Manual test.

Was this patch authored or co-authored using generative AI tooling?

No.

…if ignoreCorruptFiles is enabled

### What changes were proposed in this pull request?

`BlockMissingException` extends from `IOException`. When `BlockMissingException` occurs and ignoreCorruptFiles is enabled, the current task may not get any data and will be marked as successful([code](https://github.com/apache/spark/blob/0d045db8d15d0aeb0f54a1557fd360363e77ed42/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L271-L273)). This will cause data quality issues.

Generally speaking, `BlockMissingException` is a system issue, not a file corruption issue. Therefore, `BlockMissingException` should be thrown even if ignoreCorruptFiles is enabled.

Related error message:
```
24/11/29 01:56:00 WARN FileScanRDD: Skipped the rest of the content in the corrupted file: path: viewfs://hadoop-cluster/path/to/data/part-00320-7915e327-3214-4585-a44e-f9c58e362b43.c000.snappy.parquet, range: 191727616-281354675, partition values: [empty row]
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-169998034-10.210.23.11-1507067630530:blk_83565156183_82548880660 file/path/to/data/part-00320-7915e327-3214-4585-a44e-f9c58e362b43.c000.snappy.parquet No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.209.145.174:50010,DS-c7c0a172-5ffa-4f90-bfb5-717fb1e9ecf2,DISK] DatanodeInfoWithStorage[10.3.22.142:50010,DS-a1ba9ac9-dc92-4131-a2c2-9f7d03b97caf,DISK] DatanodeInfoWithStorage[10.209.146.156:50010,DS-71d8ae97-15d3-454e-a715-d9490e184989,DISK] Dead nodes: DatanodeInfoWithStorage[10.209.146.156:50010,DS-71d8ae97-15d3-454e-a715-d9490e184989,DISK] DatanodeInfoWithStorage[10.209.145.174:50010,DS-c7c0a172-5ffa-4f90-bfb5-717fb1e9ecf2,DISK] DatanodeInfoWithStorage[10.3.22.142:50010,DS-a1ba9ac9-dc92-4131-a2c2-9f7d03b97caf,DISK]
```
![image](https://github.com/user-attachments/assets/e040ce9d-1a0e-44eb-bd03-4cd7a9fff80f)

### Why are the changes needed?

Avoid data issue if ignoreCorruptFiles is enabled when `BlockMissingException` occurred.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manual test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49089 from wangyum/SPARK-50483.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 934a387)

dongjoon-hyun

+1, LGTM (Pending CIs). Thank you, @wangyum .

dongjoon-hyun · 2024-12-07T17:04:45Z

cc @LuciferYang

LuciferYang · 2024-12-07T17:37:59Z

Thank you for letting me know @dongjoon-hyun

LuciferYang

LGTM, pending tests

…even if ignoreCorruptFiles is enabled

Closes #49105 from wangyum/SPARK-50483-branch-3.5.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2024-12-08T04:01:03Z

Merged to branch-3.5 for Apache Spark 3.5.4.

dongjoon-hyun · 2024-12-08T04:23:02Z

To @LuciferYang , please let me know if you have any blocker or need my help during your Spark 3.5.4 RC1 preparation step. AFAIK, this is the last one we have been waiting for.

LuciferYang · 2024-12-08T05:12:45Z

Yes, I think this should be the last one. Thank you very much for helping resolve many issues over the past few days. @dongjoon-hyun

wangyum added 2 commits December 7, 2024 23:51

github-actions bot added SQL CORE labels Dec 7, 2024

fix

cad39b6

github-actions bot added the AVRO label Dec 7, 2024

dongjoon-hyun approved these changes Dec 7, 2024

LuciferYang approved these changes Dec 7, 2024

dongjoon-hyun closed this Dec 8, 2024

wangyum deleted the SPARK-50483-branch-3.5 branch December 8, 2024 06:14
