Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46382][SQL] XML: Capture values interspersed between elements #44318

Closed
wants to merge 22 commits into from

Conversation

shujingyang-db
Copy link
Contributor

@shujingyang-db shujingyang-db commented Dec 12, 2023

What changes were proposed in this pull request?

In XML, elements typically consist of a name and a value, with the value enclosed between the opening and closing tags. But XML also allows to include arbitrary values interspersed between these elements. To address this, we provide an option named valueTags, which is enabled by default, to capture these values. Consider the following example:

<ROW>
    <a>1</a>
  value1
  <b>
    value2
    <c>2</c>
    value3
  </b>
</ROW>

In this example, <a>, <b>, and <c> are named elements with their respective values enclosed within tags. There are arbitrary values value1 value2 value3 interspersed between the elements. Please note that there can be multiple occurrences of values in a single element (i.e. there are value2, value3 in the element <b>)

We should parse the values between tags into the valueTags field. If there are multiple occurrences of value tags, the value tag field will be converted to an array type.

We will simplify the handling of value tags in a follow-up PR.

  1. As value tags only exist in structure data, their handling will be confined to the inferObject method, eliminating the need for processing in inferField. This implies that when we encounter non-whitespace characters, we can invoke inferObject. For structures with a single primitive field, we'll simplify them into primitive types.
  2. The inferAndCheckEndElement function will be updated to align with this approach. If we encounter an opening tag in the first place, we will peek at the next element of the closing tag. If not, we will stop at the closing tag right away.

Why are the changes needed?

We should parse the values otherwise there would be data loss

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Dec 12, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-46382] XML: Capture values interspersed between elements [SPARK-46382][SQL] XML: Capture values interspersed between elements Dec 13, 2023
val df = spark.read.format("xml")
.option("rowTag", "ROW")
.option("multiLine", "true")
.load(getTestResourcePath(resDir + "values-simple.xml"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple XML data can be embedded in test suite itself like: using spark.createDataset or writing to a temp file

addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
} else {
row
}
}
case (_: Characters, _: StringType) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is parser.next not required here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to move the next event. currentStructureAsString will move the parser pointer.

case _: EndElement =>
// It couldn't be an array of value tags
// as the opening tag is immediately followed by a closing tag.
if (isEmptyString(c)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not allow any whitespace values for valueTag.

Suggested change
if (isEmptyString(c)) {
if (!c.isWhiteSpace) {

}
case _ =>
val row = convertObject(parser, st)
if (!isEmptyString(c)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!isEmptyString(c)) {
if (!c.isWhiteSpace) {

@@ -159,7 +159,7 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
parser.peek match {
case _: EndElement => NullType
case _: StartElement => inferObject(parser)
case c: Characters if c.isWhiteSpace =>
case c: Characters if isEmptyString(c) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case c: Characters if isEmptyString(c) =>
case c: Characters if c.isWhiteSpace =>

@@ -171,16 +171,18 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
case _: EndElement => StringType
case _ => inferField(parser)
}
case c: Characters if !c.isWhiteSpace =>
// what about new line character
case c: Characters if !isEmptyString(c) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case, can't we return inferObject(parser)?
In inferObject(parser), the case for StructType can be updated to "unnest" StructType with just valueTag.
Without this, there is lot of code duplication logic for valueTag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO inferObject can't do this. This branch handles both primitive types and nested objects. If we return inferObject(parser), the primitive types will be inferred as a structFields of valueTag

@@ -1145,18 +1145,18 @@ class XmlSuite extends QueryTest with SharedSparkSession {
.option("inferSchema", true)
.xml(getTestResourcePath(resDir + "mixed_children.xml"))
val mixedRow = mixedDF.head()
assert(mixedRow.getAs[Row](0).toSeq === Seq(" lorem "))
assert(mixedRow.getString(1) === " ipsum ")
assert(mixedRow.getAs[Row](0) === Row(List("issue", "text ignored"), "lorem"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update text ignored with something else.

@@ -1729,9 +1729,15 @@ class XmlSuite extends QueryTest with SharedSparkSession {
val TAG_NAME = "tag"
val VALUETAG_NAME = "_VALUE"
val schema = buildSchema(
field(VALUETAG_NAME),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the fields were rearranged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We sort the field name in ascending order. The _VALUE comes before _attr

shujingyang-db and others added 10 commits December 18, 2023 21:46
…StaxXmlParser.scala

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
…ple.xml

Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala
val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
indexOpt match {
case Some(index) =>
convertTo(c.getData, st.fields(index).dataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I get that. It looks like the assumption is that convertField will either return a Row or a singleton valueTag with just value.

Comment on lines +237 to +238
case _ =>
val row = convertObject(parser, st)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this handle values separated by comment or cdata? If so, we don't need case _: EndElement above.

<ROW>
  <a> 1 <!--this is a comment--> 2 </a>
</ROW>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up! I added some test cases for comments. We still need this branch asconvertObject cannot handle value tag.

}
case _ =>
val row = convertObject(parser, st)
if (!c.isWhiteSpace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to document the behavior of whitespaces for valueTag. Also, the following scenarios, which contain whitespaces with quotes:

<ROW><a>" "</a></ROW>
<ROW><b>" "<c>1</c></b></ROW>
<ROW><d><e attr=" "></e></d></ROW>

case _ =>
val row = convertObject(parser, st)
if (!c.isWhiteSpace) {
addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why addToTail is false here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because in this case, we encounter the interspersed value first and then the nested objects. We want to make sure that the value tag appears before the nested objects

valueTagType: DataType): DataType = {
(objectType, valueTagType) match {
case (st: StructType, _) =>
// TODO(shujing): case sensitive?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while the case for valueTag is unlikely to change, its better to add case sensitivity logic to it to make it consistent with other fields. Can be a separate PR. Not a high prio.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering this question! I create a Jira ticket for it

index,
ArrayType(compatibleType(st(index).dataType, valueTagType)))
case Some(index) =>
updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't st(index).dataType will be of ArrayType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this branch handles this case of array type. If it's an array, we will merge the element types.

case c: Characters if !c.isWhiteSpace =>
val characterType = inferFrom(c.getData)
parser.nextEvent()
addOrUpdateType(options.valueTag, characterType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test case for this scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valueTag that locates after a closing tag in the inner element and before the closing tag in the outer element will cover this scenario.

<a>
    value2
    <b>1</b>
    value3
</a>

We covered this case in the most our test cases

index,
ArrayType(compatibleType(st(index).dataType, valueTagType)))
case Some(index) =>
updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add this test case scenario where Array<LongType> is updated to Array<DoubleType>:

<ROW>
  <a>
    1
    <b>2</b>
    3
    <b>4</b>
    5.0
  </a>
</ROW>

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants