[SPARK-46848] XML: Enhance XML bad record handling with partial results support #44875
try {
  kvPairs += (key -> convertTo(c.getData, valueType))
} catch {
  case partialValueException: PartialValueException =>
convertTo is not going to throw PartialValueException.
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala
test("return partial results for bad records") {
  withTempDir { dir =>
    val xmlBadRecord1 =
Add a scenario with a nested struct object that has attributes and a valueTag:
Correct data:
<ROW><struct attr="3.0">4.0<b>5.0</b></struct></ROW>
XML data with mismatches:
<ROW><struct attr="mismatch1">4.0<b>5.0</b></struct></ROW>
<ROW><struct attr="3.0">mismatch2<b>5.0</b></struct></ROW>
<ROW><struct attr="3.0">4.0<b>mismatch3</b></struct></ROW>
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
@@ -429,6 +451,7 @@ class StaxXmlParser(
    case e: SparkUpgradeException => throw e
    case NonFatal(e) =>
      badRecordException = badRecordException.orElse(Some(e))
      StaxXmlParserUtils.skipChildren(parser)
Will this skip valid entities?
It will not. It only skips the inner elements.
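To make the "only skips the inner elements" point concrete, here is a minimal sketch using the JDK's StAX event reader. The helpers skipRestOfElement and nextStartElementName are hypothetical names for illustration and are not Spark's StaxXmlParserUtils.skipChildren; the sketch assumes the start tag of the failing element has already been consumed.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}

object SkipChildrenSketch {
  // Hypothetical helper, not Spark's implementation.
  // Precondition: the StartElement of the element to skip was already consumed.
  // Consumes events until the matching EndElement, tracking nesting depth.
  def skipRestOfElement(reader: XMLEventReader): Unit = {
    var depth = 1
    while (depth > 0 && reader.hasNext) {
      val event = reader.nextEvent()
      if (event.isStartElement) depth += 1
      else if (event.isEndElement) depth -= 1
    }
  }

  // Advances to the next StartElement and returns its local name, if any.
  def nextStartElementName(reader: XMLEventReader): Option[String] = {
    while (reader.hasNext) {
      val event = reader.nextEvent()
      if (event.isStartElement) {
        return Some(event.asStartElement.getName.getLocalPart)
      }
    }
    None
  }

  // Skips the <bad> element and its children, then reads the sibling after it.
  def demo(): (Option[String], Option[String]) = {
    val xml = "<ROW><bad><x>1</x><y>2</y></bad><good>3</good></ROW>"
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    nextStartElementName(reader)               // consumes <ROW>
    val skipped = nextStartElementName(reader) // consumes <bad>
    skipRestOfElement(reader)                  // drops <x>, <y> and </bad>
    val next = nextStartElementName(reader)    // sibling <good> is still available
    (skipped, next)
  }
}
```

In this sketch the sibling element after the skipped one is still read, which is the behavior the reply describes.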
…StaxXmlParser.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala
…rces/xml/XmlSuite.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…rces/xml/XmlSuite.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
With all the try..catch, the code has become difficult to read.
case (f, v) =>
  nameToIndex.get(f).foreach { row(_) = v }
try {
  convertAttributes(rootAttributes, schema).toSeq.foreach {
If the first attribute has a type mismatch, convertAttributes will drop the subsequent ones.
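One way to avoid losing the later attributes is to convert each attribute on its own and remember only the first failure. The sketch below is plain Scala under stated assumptions: toDouble stands in for a typed converter and convertAll is a hypothetical name; neither is Spark's convertAttributes or convertTo.

```scala
import scala.util.{Failure, Success, Try}

object AttributeConversionSketch {
  // Hypothetical stand-in for a typed field converter; not Spark's convertTo.
  def toDouble(raw: String): Double = raw.trim.toDouble

  // Converts every attribute independently, so one failure cannot discard the rest.
  // Returns the attributes that converted plus the first error seen, if any.
  def convertAll(attrs: Seq[(String, String)]): (Map[String, Double], Option[Throwable]) = {
    var firstError: Option[Throwable] = None
    val converted = attrs.flatMap { case (name, raw) =>
      Try(toDouble(raw)) match {
        case Success(value) => Some(name -> value)
        case Failure(e) =>
          // Keep only the first failure, mirroring badRecordException.orElse(Some(e)).
          firstError = firstError.orElse(Some(e))
          None
      }
    }.toMap
    (converted, firstError)
  }
}
```

With input Seq("a" -> "mismatch", "b" -> "2.0", "c" -> "3.5"), the attributes b and c are still converted even though a fails first.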
    row.update(index, partialValueException.partialResult)
}
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Enhance XML bad record handling with partial results support
Why are the changes needed?
Without this change, we get an empty result for the whole record if the parser throws an exception.
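The idea of partial results can be sketched as an exception that carries whatever was parsed before the failure, so the caller can fall back to it. PartialRowException, parseRow, and parseLeniently below are hypothetical names used only for illustration; they are not the classes in this PR.

```scala
import scala.util.control.NonFatal

object PartialResultSketch {
  // Hypothetical exception type modeled on the idea in this PR; not Spark's class.
  final case class PartialRowException(partialRow: Vector[Option[Double]], cause: Throwable)
    extends Exception(cause)

  // Parses every field; if any field fails, throws with the fields that did parse.
  def parseRow(fields: Vector[String]): Vector[Option[Double]] = {
    var firstError: Option[Throwable] = None
    val row = fields.map { raw =>
      try Some(raw.toDouble)
      catch {
        case NonFatal(e) =>
          firstError = firstError.orElse(Some(e))
          None // the bad field becomes a null-like slot
      }
    }
    firstError.foreach(e => throw PartialRowException(row, e))
    row
  }

  // Caller-side handling: use the partial row instead of an empty one.
  def parseLeniently(fields: Vector[String]): Vector[Option[Double]] =
    try parseRow(fields)
    catch { case PartialRowException(partial, _) => partial }
}
```

For Vector("3.0", "mismatch", "5.0"), parseLeniently keeps the first and third values and leaves only the mismatched field empty.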
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No