-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-46382][SQL] XML: Capture values interspersed between elements #44318
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
Outdated
Show resolved
Hide resolved
val df = spark.read.format("xml") | ||
.option("rowTag", "ROW") | ||
.option("multiLine", "true") | ||
.load(getTestResourcePath(resDir + "values-simple.xml")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simple XML data can be embedded in test suite itself like: using spark.createDataset or writing to a temp file
addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false) | ||
} else { | ||
row | ||
} | ||
} | ||
case (_: Characters, _: StringType) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is parser.next
not required here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to move the next event. currentStructureAsString
will move the parser pointer.
case _: EndElement => | ||
// It couldn't be an array of value tags | ||
// as the opening tag is immediately followed by a closing tag. | ||
if (isEmptyString(c)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets not allow any whitespace values for valueTag
.
if (isEmptyString(c)) { | |
if (!c.isWhiteSpace) { |
} | ||
case _ => | ||
val row = convertObject(parser, st) | ||
if (!isEmptyString(c)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!isEmptyString(c)) { | |
if (!c.isWhiteSpace) { |
@@ -159,7 +159,7 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean) | |||
parser.peek match { | |||
case _: EndElement => NullType | |||
case _: StartElement => inferObject(parser) | |||
case c: Characters if c.isWhiteSpace => | |||
case c: Characters if isEmptyString(c) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case c: Characters if isEmptyString(c) => | |
case c: Characters if c.isWhiteSpace => |
@@ -171,16 +171,18 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean) | |||
case _: EndElement => StringType | |||
case _ => inferField(parser) | |||
} | |||
case c: Characters if !c.isWhiteSpace => | |||
// what about new line character | |||
case c: Characters if !isEmptyString(c) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case, can't we return inferObject(parser)
?
In inferObject(parser)
, the case for StructType
can be updated to "unnest" StructType
with just valueTag
.
Without this, there is lot of code duplication logic for valueTag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO inferObject can't do this. This branch handles both primitive types and nested objects. If we return inferObject(parser)
, the primitive types will be inferred as a structFields of valueTag
@@ -1145,18 +1145,18 @@ class XmlSuite extends QueryTest with SharedSparkSession { | |||
.option("inferSchema", true) | |||
.xml(getTestResourcePath(resDir + "mixed_children.xml")) | |||
val mixedRow = mixedDF.head() | |||
assert(mixedRow.getAs[Row](0).toSeq === Seq(" lorem ")) | |||
assert(mixedRow.getString(1) === " ipsum ") | |||
assert(mixedRow.getAs[Row](0) === Row(List("issue", "text ignored"), "lorem")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update text ignored
with something else.
@@ -1729,9 +1729,15 @@ class XmlSuite extends QueryTest with SharedSparkSession { | |||
val TAG_NAME = "tag" | |||
val VALUETAG_NAME = "_VALUE" | |||
val schema = buildSchema( | |||
field(VALUETAG_NAME), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the fields were rearranged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We sort the field name in ascending order. The _VALUE
comes before _attr
sql/core/src/test/resources/test-data/xml-resources/values-simple.xml
Outdated
Show resolved
Hide resolved
…StaxXmlParser.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…ple.xml Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala
val indexOpt = getFieldNameToIndex(st).get(options.valueTag) | ||
indexOpt match { | ||
case Some(index) => | ||
convertTo(c.getData, st.fields(index).dataType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I get that. It looks like the assumption is that convertField
will either return a Row
or a singleton valueTag with just value.
case _ => | ||
val row = convertObject(parser, st) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this handle values separated by comment or cdata? If so, we don't need case _: EndElement
above.
<ROW>
<a> 1 <!--this is a comment--> 2 </a>
</ROW>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up! I added some test cases for comments. We still need this branch asconvertObject
cannot handle value tag.
} | ||
case _ => | ||
val row = convertObject(parser, st) | ||
if (!c.isWhiteSpace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to document the behavior of whitespaces for valueTag. Also, the following scenarios, which contain whitespaces with quotes:
<ROW><a>" "</a></ROW>
<ROW><b>" "<c>1</c></b></ROW>
<ROW><d><e attr=" "></e></d></ROW>
case _ => | ||
val row = convertObject(parser, st) | ||
if (!c.isWhiteSpace) { | ||
addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why addToTail
is false here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because in this case, we encounter the interspersed value first and then the nested objects. We want to make sure that the value tag appears before the nested objects
valueTagType: DataType): DataType = { | ||
(objectType, valueTagType) match { | ||
case (st: StructType, _) => | ||
// TODO(shujing): case sensitive? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while the case for valueTag is unlikely to change, its better to add case sensitivity logic to it to make it consistent with other fields. Can be a separate PR. Not a high prio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for answering this question! I create a Jira ticket for it
index, | ||
ArrayType(compatibleType(st(index).dataType, valueTagType))) | ||
case Some(index) => | ||
updateStructField(st, index, compatibleType(st(index).dataType, valueTagType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't st(index).dataType
will be of ArrayType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this branch handles this case of array type. If it's an array, we will merge the element types.
case c: Characters if !c.isWhiteSpace => | ||
val characterType = inferFrom(c.getData) | ||
parser.nextEvent() | ||
addOrUpdateType(options.valueTag, characterType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test case for this scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A valueTag that locates after a closing tag in the inner element and before the closing tag in the outer element will cover this scenario.
<a>
value2
<b>1</b>
value3
</a>
We covered this case in the most our test cases
index, | ||
ArrayType(compatibleType(st(index).dataType, valueTagType))) | ||
case Some(index) => | ||
updateStructField(st, index, compatibleType(st(index).dataType, valueTagType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add this test case scenario where Array<LongType>
is updated to Array<DoubleType>
:
<ROW>
<a>
1
<b>2</b>
3
<b>4</b>
5.0
</a>
</ROW>
Merged to master. |
What changes were proposed in this pull request?
In XML, elements typically consist of a name and a value, with the value enclosed between the opening and closing tags. But XML also allows to include arbitrary values interspersed between these elements. To address this, we provide an option named
valueTags
, which is enabled by default, to capture these values. Consider the following example:In this example,
<a>
,<b>
, and<c>
are named elements with their respective values enclosed within tags. There are arbitrary values value1 value2 value3 interspersed between the elements. Please note that there can be multiple occurrences of values in a single element (i.e. there are value2, value3 in the element<b>
)We should parse the values between tags into the valueTags field. If there are multiple occurrences of value tags, the value tag field will be converted to an array type.
We will simplify the handling of value tags in a follow-up PR.
inferObject
method, eliminating the need for processing ininferField
. This implies that when we encounter non-whitespace characters, we can invokeinferObject
. For structures with a single primitive field, we'll simplify them into primitive types.inferAndCheckEndElement
function will be updated to align with this approach. If we encounter an opening tag in the first place, we will peek at the next element of the closing tag. If not, we will stop at the closing tag right away.Why are the changes needed?
We should parse the values otherwise there would be data loss
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Unit test
Was this patch authored or co-authored using generative AI tooling?
No