[SPARK-46848] XML: Enhance XML bad record handling with partial results support #44875

Closed
wants to merge 14 commits into from

Conversation

shujingyang-db
Contributor

What changes were proposed in this pull request?

Enhance XML bad record handling with partial results support

Why are the changes needed?

Without partial results, we get an empty result for the whole record if the parser throws an exception.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 25, 2024
try {
  kvPairs += (key -> convertTo(c.getData, valueType))
} catch {
  case partialValueException: PartialValueException =>
Contributor

convertTo is not going to throw PartialValueException


test("return partial results for bad records") {
  withTempDir { dir =>
    val xmlBadRecord1 =
Contributor

Add a scenario with a nested struct that has attributes and a valueTag.
Correct data:
<ROW><struct attr="3.0">4.0<b>5.0</b></struct></ROW>

XML data with mismatches:

<ROW><struct attr="mismatch1">4.0<b>5.0</b></struct></ROW>
<ROW><struct attr="3.0">mismatch2<b>5.0</b></struct></ROW>
<ROW><struct attr="3.0">4.0<b>mismatch3</b></struct></ROW>
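
To make the requested scenario concrete, here is a minimal sketch that reads the reviewer's rows (written as well-formed XML, with quoted attributes and a closing `</struct>`) and converts the attribute, the value tag, and `<b>` independently. It uses the JDK DOM parser, not Spark's XML reader. `NestedStructScenarioSketch`, `StructFields`, and `parseRow` are hypothetical names, and using `None` for a mismatched field is only an assumption about what a partial result could look like; the thread does not state the expected Spark output.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import scala.util.Try

object NestedStructScenarioSketch {
  // Hypothetical holder: None marks a field whose text could not be read as a Double.
  final case class StructFields(attr: Option[Double], value: Option[Double], b: Option[Double])

  def parseRow(xml: String): StructFields = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
    val struct = doc.getElementsByTagName("struct").item(0).asInstanceOf[org.w3c.dom.Element]
    val bElem = struct.getElementsByTagName("b").item(0)

    // Each field is converted on its own, so one mismatch does not affect the others.
    def toDouble(s: String): Option[Double] = Try(s.trim.toDouble).toOption

    StructFields(
      attr = toDouble(struct.getAttribute("attr")),
      // The value tag is the text node that precedes the <b> child.
      value = toDouble(struct.getFirstChild.getNodeValue),
      b = toDouble(bElem.getTextContent))
  }
}
```

In this sketch each of the three mismatch rows loses exactly one field while the other two are kept, which is the property the requested test would need to check against the real parser.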

@@ -429,6 +451,7 @@ class StaxXmlParser(
      case e: SparkUpgradeException => throw e
      case NonFatal(e) =>
        badRecordException = badRecordException.orElse(Some(e))
        StaxXmlParserUtils.skipChildren(parser)
Contributor

Will this skip valid entities?

Contributor Author

It will not. It only skips the inner elements.
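
As an illustration of the answer above, here is a minimal sketch of depth-counted skipping with the JDK's `javax.xml.stream` event API. It is not the `StaxXmlParserUtils.skipChildren` implementation; `SkipChildrenSketch` and `siblingsAfterSkippingFirst` are hypothetical names, and the behavior shown is an assumption about what "only skips the inner elements" means: the reader stops right after the end tag of the element being skipped, so its siblings are still visited.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}
import scala.collection.mutable.ArrayBuffer

object SkipChildrenSketch {
  // Assumes the start tag of the element was just consumed.
  // Consumes events up to and including the matching end tag.
  def skipChildren(reader: XMLEventReader): Unit = {
    var depth = 1
    while (depth > 0 && reader.hasNext) {
      val event = reader.nextEvent()
      if (event.isStartElement) depth += 1
      else if (event.isEndElement) depth -= 1
    }
  }

  // Skips the subtree of the root's first child and returns the names of
  // the start tags that are still visited afterwards.
  def siblingsAfterSkippingFirst(xml: String): List[String] = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    val seen = ArrayBuffer.empty[String]
    var rootSeen = false
    var skipped = false
    while (reader.hasNext) {
      val event = reader.nextEvent()
      if (event.isStartElement) {
        val name = event.asStartElement.getName.getLocalPart
        if (!rootSeen) rootSeen = true
        else if (!skipped) { skipChildren(reader); skipped = true }
        else seen += name
      }
    }
    seen.toList
  }
}
```

For example, `SkipChildrenSketch.siblingsAfterSkippingFirst("<ROW><a><x>1</x></a><b>2</b><c>3</c></ROW>")` skips `<a>` together with its nested `<x>`, and the loop then continues with `<b>` and `<c>`.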

shujingyang-db and others added 13 commits January 24, 2024 18:02
…StaxXmlParser.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala
…rces/xml/XmlSuite.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
Contributor

@sandip-db sandip-db left a comment

With all the try..catch, the code has become difficult to read.

case (f, v) =>
nameToIndex.get(f).foreach { row(_) = v }
try {
convertAttributes(rootAttributes, schema).toSeq.foreach {
Contributor

If the first attribute has a type mismatch, convertAttributes will drop the subsequent ones.
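
One way to address this concern is to convert each attribute on its own and remember only the first failure. The following is a minimal sketch under that assumption, not the PR's code: `AttributeConversionSketch`, `PartialAttributes`, and `convertAttributesPartially` are hypothetical names, and the signature (plain strings plus a map of converter functions) is a simplification that does not match Spark's `convertAttributes`.

```scala
import scala.util.{Failure, Success, Try}

object AttributeConversionSketch {
  // Hypothetical result type: the converted values plus the first failure, if any.
  final case class PartialAttributes(values: Map[String, Any], firstError: Option[Throwable])

  def convertAttributesPartially(
      attributes: Seq[(String, String)],
      converters: Map[String, String => Any]): PartialAttributes = {
    var firstError: Option[Throwable] = None
    val values = attributes.flatMap { case (name, raw) =>
      // Attributes without a converter (not in the schema) are ignored.
      converters.get(name).flatMap { convert =>
        Try(convert(raw)) match {
          case Success(v) => Some(name -> v)
          case Failure(e) =>
            // Remember only the first failure and keep converting the rest.
            firstError = firstError.orElse(Some(e))
            None
        }
      }
    }.toMap
    PartialAttributes(values, firstError)
  }
}
```

With the input `Seq("a" -> "oops", "b" -> "2.5")` and `Double` converters for both names, the sketch keeps `b` and reports the `NumberFormatException` raised for `a`, instead of dropping every attribute after the first mismatch.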

row.update(index, partialValueException.partialResult)
}


Contributor

Suggested change


github-actions bot commented May 6, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 6, 2024
@github-actions github-actions bot closed this May 7, 2024