Skip to content

[SPARK-49094][SQL] Fix ignoreCorruptFiles non-functioning for hive orc impl with mergeSchema off#47583

Closed
yaooqinn wants to merge 4 commits intoapache:masterfrom
yaooqinn:SPARK-49094
Closed

[SPARK-49094][SQL] Fix ignoreCorruptFiles non-functioning for hive orc impl with mergeSchema off#47583
yaooqinn wants to merge 4 commits intoapache:masterfrom
yaooqinn:SPARK-49094

Conversation

@yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Aug 2, 2024

What changes were proposed in this pull request?

ignoreCorruptFiles now applies to all file data sources except for hive orc implementation with mergeSchema off

Why are the changes needed?

bugfix

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Aug 2, 2024
@yaooqinn
Copy link
Member Author

yaooqinn commented Aug 2, 2024

cc @cloud-fan @dongjoon-hyun @HyukjinKwon thanks in advance

}
}

test("SPARK-49094: ignoreCorruptFiles works for hive orc") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we mention mergeSchema?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dongjoon-hyun , addressed

SQLConf.IGNORE_CORRUPT_FILES.key -> "false",
SQLConf.ORC_IMPLEMENTATION.key -> "hive") {
checkAnswer(spark.read
.option("mergeSchema", value = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be complete, could you add another test case, SQLConf.ORC_SCHEMA_MERGING_ENABLED.key -> "false" without this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

         Seq(true, false).foreach { mergeSchema =>
            checkAnswer(spark.read
              .option("mergeSchema", value = mergeSchema)
              .option("ignoreCorruptFiles", value = true)
              .orc(basePath), Row(0L, 1))
          }

Using mergeSchema w/ true & false can also achieve your request. I will address this in this way.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, no, it's different because the previous code doesn't care of OrcOption. I expect the exact requested test case, @yaooqinn .

Copy link
Member Author

@yaooqinn yaooqinn Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous code doesn't care of OrcOption

mergeSchema is actually from OrcOption for both if and else code branches, see

https://github.com/apache/spark/pull/47583/files#diff-9954c15e0525a72c38dd8be7eaff4ba6c2a071b216ecea2524419d49ab4f92fcR68-R69

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I realized the source of our disagreement. I overlooked the outermost withSQLConf wrapper, which is redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does everything seem to be okay? The test now tests both mergeSchema on and off.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing this. If we add one more test case, this PR looks good to me.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New code looks cleaner and correct. Thank you. Yes, I thought that redundant part was your intention of test coverage. We are all good. :)

@dongjoon-hyun
Copy link
Member

BTW, I believe we can convert SPARK-49094 issue to Bug instead of Improvement.
If you want, you can switch your JIRA issue type and backport this to branch-3.5 for your 3.5.2 release as the release manager.

@yaooqinn
Copy link
Member Author

yaooqinn commented Aug 2, 2024

It makes sense to me. I will bring this to master/3.5/3.4

dongjoon-hyun pushed a commit that referenced this pull request Aug 3, 2024
…c impl with mergeSchema off

### What changes were proposed in this pull request?

ignoreCorruptFiles now applies to all file data sources except for hive orc implementation with mergeSchema off

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47583 from yaooqinn/SPARK-49094.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 6631abc)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Aug 3, 2024
…c impl with mergeSchema off

### What changes were proposed in this pull request?

ignoreCorruptFiles now applies to all file data sources except for hive orc implementation with mergeSchema off

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47583 from yaooqinn/SPARK-49094.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 6631abc)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Thank you, @yaooqinn .
Merged to master/3.5/3.4.

@cloud-fan
Copy link
Contributor

late LGTM

@yaooqinn yaooqinn deleted the SPARK-49094 branch August 5, 2024 07:11
@cloud-fan
Copy link
Contributor

BTW, let's be more mindful of the behavior change definition. Even if it's a good and safe user-facing change, we should still document it in the Does this PR introduce any user-facing change section. For example, this PR can make queries that failed with corrupted files to run successfully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants