[SPARK-46382][SQL] XML: Capture values interspersed between elements #44318

shujingyang-db · 2023-12-12T19:04:45Z

What changes were proposed in this pull request?

In XML, an element typically consists of a name and a value, with the value enclosed between the opening and closing tags. XML also allows arbitrary values to be interspersed between these elements. To address this, we provide an option named valueTag, which is enabled by default, to capture these values. Consider the following example:

<ROW>
  <a>1</a>
  value1
  <b>
    value2
    <c>2</c>
    value3
  </b>
</ROW>

In this example, <a>, <b>, and <c> are named elements with their respective values enclosed within tags. The arbitrary values value1, value2, and value3 are interspersed between the elements. Note that a single element can contain multiple such values (for example, value2 and value3 both occur in the element <b>).

We should parse the values between tags into the valueTag field. If a value tag occurs more than once in an element, the valueTag field is converted to an array type.
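To make the intended behavior concrete, here is a minimal sketch that uses only the JDK StAX API, not Spark's parser. The object `InterspersedValues` and its `collect` method are hypothetical names introduced for illustration. The sketch records every non-whitespace text run under the name of the element that directly contains it, so an element with several runs ends up with a multi-element sequence, loosely mirroring the array conversion described above.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.{Characters, EndElement, StartElement}
import scala.collection.mutable

// Illustrative only: this is NOT the StaxXmlParser implementation.
object InterspersedValues {
  /** Maps each element name to the non-whitespace text found directly inside it. */
  def collect(xml: String): Map[String, Seq[String]] = {
    val factory = XMLInputFactory.newInstance()
    // Merge adjacent text chunks so that each text run arrives as one event.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLEventReader(new StringReader(xml))
    val values = mutable.LinkedHashMap.empty[String, Vector[String]]
    var open = List.empty[String] // names of the currently open elements, innermost first
    while (reader.hasNext) {
      reader.nextEvent() match {
        case s: StartElement => open = s.getName.getLocalPart :: open
        case _: EndElement => open = open.tail
        case c: Characters if !c.isWhiteSpace && open.nonEmpty =>
          val owner = open.head
          values(owner) = values.getOrElse(owner, Vector.empty) :+ c.getData.trim
        case _ => () // skip whitespace-only text, comments, document start/end
      }
    }
    reader.close()
    values.toMap
  }
}
```

Applied to the example document, `collect` would return two entries for `b` (value2 and value3) and one for `ROW` (value1). The sketch keys by element name rather than by path and does not distinguish a leaf value such as the `1` in `<a>1</a>` from an interspersed value; the real parser makes that distinction through the schema.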

We will simplify the handling of value tags in a follow-up PR.

As value tags only exist in structured data, their handling will be confined to the inferObject method, eliminating the need for processing in inferField. This implies that when we encounter non-whitespace characters, we can invoke inferObject. Structures with a single primitive field will be simplified into primitive types.
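The planned simplification of single-field structures can be sketched with a toy type model. Everything here (`UnnestSketch`, `ToyType`, `simplify`) is hypothetical and stands in for Spark's `DataType` hierarchy; the field name `_VALUE` is the value tag name used in the test suite. The idea is that a struct whose only field is the value tag, and whose value is not itself a struct, collapses to that field's type.

```scala
// Illustrative toy model; not XmlInferSchema code.
object UnnestSketch {
  sealed trait ToyType
  case object ToyLong extends ToyType
  case object ToyString extends ToyType
  final case class ToyStruct(fields: Map[String, ToyType]) extends ToyType

  val ValueTag = "_VALUE"

  /** Collapses a struct that holds only a primitive value tag into that primitive type. */
  def simplify(t: ToyType): ToyType = t match {
    case ToyStruct(fields) if fields.size == 1 && fields.contains(ValueTag) =>
      fields(ValueTag) match {
        case _: ToyStruct => t      // nested struct: leave untouched
        case primitive => primitive // e.g. struct<_VALUE: long> becomes long
      }
    case other => other
  }
}
```

A struct that has other fields next to the value tag is left as-is, which corresponds to the case where an element has both child elements and interspersed text.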

The inferAndCheckEndElement function will be updated to align with this approach. If we encounter an opening tag in the first place, we will peek at the next element of the closing tag. If not, we will stop at the closing tag right away.

Why are the changes needed?

We should parse these values; otherwise, data would be lost.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

sandip-db · 2023-12-14T22:41:02Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

+    val df = spark.read.format("xml")
+      .option("rowTag", "ROW")
+      .option("multiLine", "true")
+      .load(getTestResourcePath(resDir + "values-simple.xml"))


Simple XML data can be embedded in the test suite itself, for example by using spark.createDataset or by writing to a temp file.

sandip-db · 2023-12-14T22:48:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)
+            } else {
+              row
+            }
        }
      case (_: Characters, _: StringType) =>


Is parser.next not required here?

We don't need to move the next event. currentStructureAsString will move the parser pointer.
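For readers less familiar with StAX, the distinction between looking at an event and consuming it can be shown with the JDK's `XMLEventReader` alone. `PeekDemo` is a hypothetical name, and the sketch says nothing about how `currentStructureAsString` is written; it only shows that `peek` leaves the cursor in place, so whichever code consumes the event next still sees it.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory

// Illustrative only: demonstrates peek vs. nextEvent on a plain StAX reader.
object PeekDemo {
  /** Returns the event type seen by peek and the type of the event consumed right after. */
  def peekThenNext(xml: String): (Int, Int) = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    reader.nextEvent()                             // consume the start-of-document event
    val peeked = reader.peek().getEventType        // look ahead without advancing
    val consumed = reader.nextEvent().getEventType // this call advances the reader
    reader.close()
    (peeked, consumed)
  }
}
```

Both values in the returned pair are the same event type, because the peeked event is exactly the one that the following `nextEvent` call hands out.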

sandip-db · 2023-12-14T23:39:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+          case _: EndElement =>
+            // It couldn't be an array of value tags
+            // as the opening tag is immediately followed by a closing tag.
+            if (isEmptyString(c)) {


Let's not allow any whitespace values for valueTag.

Suggested change

if (isEmptyString(c)) {

if (!c.isWhiteSpace) {

sandip-db · 2023-12-14T23:40:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!isEmptyString(c)) {


Suggested change

if (!isEmptyString(c)) {

if (!c.isWhiteSpace) {

sandip-db · 2023-12-15T07:02:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

@@ -159,7 +159,7 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
    parser.peek match {
      case _: EndElement => NullType
      case _: StartElement => inferObject(parser)
-      case c: Characters if c.isWhiteSpace =>
+      case c: Characters if isEmptyString(c) =>


Suggested change

case c: Characters if isEmptyString(c) =>

case c: Characters if c.isWhiteSpace =>

sandip-db · 2023-12-15T07:35:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

@@ -171,16 +171,18 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
          case _: EndElement => StringType
          case _ => inferField(parser)
        }
-      case c: Characters if !c.isWhiteSpace =>
+      // what about new line character
+      case c: Characters if !isEmptyString(c) =>


For this case, can't we return inferObject(parser)?
In inferObject(parser), the case for StructType can be updated to "unnest" a StructType with just valueTag.
Without this, there is a lot of duplicated logic for valueTag.

IMHO inferObject can't do this. This branch handles both primitive types and nested objects. If we return inferObject(parser), the primitive types will be inferred as a StructType with a valueTag field.

sandip-db · 2023-12-15T07:40:00Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

@@ -1145,18 +1145,18 @@ class XmlSuite extends QueryTest with SharedSparkSession {
      .option("inferSchema", true)
      .xml(getTestResourcePath(resDir + "mixed_children.xml"))
    val mixedRow = mixedDF.head()
-    assert(mixedRow.getAs[Row](0).toSeq === Seq(" lorem "))
-    assert(mixedRow.getString(1) === " ipsum ")
+    assert(mixedRow.getAs[Row](0) === Row(List("issue", "text ignored"), "lorem"))


Replace "text ignored" with something else.

sandip-db · 2023-12-15T07:42:04Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

@@ -1729,9 +1729,15 @@ class XmlSuite extends QueryTest with SharedSparkSession {
    val TAG_NAME = "tag"
    val VALUETAG_NAME = "_VALUE"
    val schema = buildSchema(
+      field(VALUETAG_NAME),


Why were the fields rearranged?

We sort the field names in ascending order, so _VALUE comes before _attr.
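The ordering follows from plain string comparison, which a short check can confirm. This is a sketch that assumes the field names are compared with Scala's default `String` ordering; `FieldOrder` is a hypothetical name, and `tag` and `_VALUE` are the names from the test above.

```scala
// Illustrative only: default String ordering compares UTF-16 code units,
// so the uppercase 'V' (86) sorts before the lowercase 'a' (97).
object FieldOrder {
  val sortedNames: Seq[String] = Seq("tag", "_attr", "_VALUE").sorted
}
```

With this ordering the underscore-prefixed names come before `tag`, and `_VALUE` precedes `_attr` because uppercase letters have lower code points than lowercase ones.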

sql/core/src/test/resources/test-data/xml-resources/values-simple.xml

sandip-db · 2023-12-21T16:51:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+            val indexOpt = getFieldNameToIndex(st).get(options.valueTag)
+            indexOpt match {
+              case Some(index) =>
+                convertTo(c.getData, st.fields(index).dataType)


Yes, I get that. It looks like the assumption is that convertField will return either a Row or a singleton valueTag with just the value.

sandip-db · 2023-12-21T17:03:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+          case _ =>
+            val row = convertObject(parser, st)


Will this handle values separated by a comment or CDATA? If so, we don't need the case _: EndElement branch above.

<ROW> <a> 1  2 </a> </ROW>

Thanks for bringing this up! I added some test cases for comments. We still need this branch, as convertObject cannot handle the value tag.

sandip-db · 2023-12-21T17:08:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+            }
+          case _ =>
+            val row = convertObject(parser, st)
+            if (!c.isWhiteSpace) {


We need to document how whitespace is handled for valueTag. We should also cover the following scenarios, which contain whitespace inside quotes:

<ROW><a>" "</a></ROW>
<ROW>" "<c>1</c></ROW>
<ROW><d><e attr=" "></e></d></ROW>
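The quoted-whitespace scenarios differ from plain whitespace at the StAX level, which the following sketch illustrates with the JDK parser only. `WhitespaceCheck` and `textEvents` are hypothetical names. The sketch pairs each text event with the result of `Characters.isWhiteSpace`, the check used in the branch under review; it does not cover the attribute case, since attribute values are not delivered as `Characters` events.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.Characters

// Illustrative only: inspects how StAX classifies text events.
object WhitespaceCheck {
  /** Returns every text event as (data, isWhiteSpace). */
  def textEvents(xml: String): Seq[(String, Boolean)] = {
    val factory = XMLInputFactory.newInstance()
    // Merge adjacent text chunks so that each text run arrives as one event.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val reader = factory.createXMLEventReader(new StringReader(xml))
    val out = Vector.newBuilder[(String, Boolean)]
    while (reader.hasNext) {
      reader.nextEvent() match {
        case c: Characters => out += ((c.getData, c.isWhiteSpace))
        case _ => () // only text events are of interest here
      }
    }
    reader.close()
    out.result()
  }
}
```

For `<a>" "</a>` the text event is reported as non-whitespace, because the quote characters are ordinary text. A check based on `isWhiteSpace` therefore treats such content as a value, while an element containing only spaces is skipped.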

sandip-db · 2023-12-21T17:31:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+          case _ =>
+            val row = convertObject(parser, st)
+            if (!c.isWhiteSpace) {
+              addOrUpdate(row.toSeq(st).toArray, st, options.valueTag, c.getData, addToTail = false)


Why is addToTail false here?

In this case, we encounter the interspersed value first and then the nested objects. We want to make sure that the value tag appears before the nested objects.
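The ordering concern can be sketched in isolation. `ValueOrder.addValue` is a hypothetical stand-in for the array branch of `addOrUpdate`, reduced to strings: the leading value is read first, the rest of the element is converted afterwards, and only then is the leading value merged in, so it has to go to the front to preserve document order.

```scala
// Illustrative only: a simplified stand-in for the array update in addOrUpdate.
object ValueOrder {
  /** Adds a value to the end of the collected values, or to the front when addToTail is false. */
  def addValue(existing: Vector[String], value: String, addToTail: Boolean): Vector[String] =
    if (addToTail) existing :+ value else value +: existing
}
```

For the element `<b>` in the PR description, value2 is seen before the nested `<c>`, but the remainder of the element (including value3) is converted first. Prepending value2 afterwards yields the values in document order, whereas appending would reverse them.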

sandip-db · 2023-12-21T18:19:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+      valueTagType: DataType): DataType = {
+    (objectType, valueTagType) match {
+      case (st: StructType, _) =>
+        // TODO(shujing): case sensitive?


While the case for valueTag is unlikely to change, it's better to add case-sensitivity logic to it so that it is consistent with other fields. This can be a separate PR; it's not a high priority.

Thanks for answering this question! I created a Jira ticket for it.

sandip-db · 2023-12-21T18:23:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+              index,
+              ArrayType(compatibleType(st(index).dataType, valueTagType)))
+          case Some(index) =>
+            updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))


Won't st(index).dataType be an ArrayType?

Yes, this branch handles the array type case. If it's an array, we merge the element types.

sandip-db · 2023-12-21T19:02:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+        case c: Characters if !c.isWhiteSpace =>
+          val characterType = inferFrom(c.getData)
+          parser.nextEvent()
+          addOrUpdateType(options.valueTag, characterType)


Is there a test case for this scenario?

A valueTag located after the closing tag of an inner element and before the closing tag of the outer element covers this scenario.

<a> value2 1 value3 </a>

We cover this case in most of our test cases.

sandip-db · 2023-12-21T19:03:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala

+              index,
+              ArrayType(compatibleType(st(index).dataType, valueTagType)))
+          case Some(index) =>
+            updateStructField(st, index, compatibleType(st(index).dataType, valueTagType))


Let's add a test case for the scenario where Array<LongType> is updated to Array<DoubleType>:

<ROW> <a> 1 2 3 4 5.0 </a> </ROW>
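The widening requested here can be traced with a toy model of the two branches shown in the diff. All names (`TypeWidening`, `ToyType`, `compatible`, `addValueType`, `inferAll`) are hypothetical and only approximate `compatibleType` and the value tag update logic: the first value yields a scalar type, a second value wraps it in an array, and later values are merged into the array's element type.

```scala
import scala.util.Try

// Illustrative toy model; not XmlInferSchema code.
object TypeWidening {
  sealed trait ToyType
  case object ToyLong extends ToyType
  case object ToyDouble extends ToyType
  case object ToyString extends ToyType
  final case class ToyArray(element: ToyType) extends ToyType

  /** Infers the narrowest toy type that can hold the given text. */
  def inferFrom(s: String): ToyType =
    if (Try(s.trim.toLong).isSuccess) ToyLong
    else if (Try(s.trim.toDouble).isSuccess) ToyDouble
    else ToyString

  /** Merges two types, widening long to double and falling back to string. */
  def compatible(a: ToyType, b: ToyType): ToyType = (a, b) match {
    case (x, y) if x == y => x
    case (ToyArray(x), ToyArray(y)) => ToyArray(compatible(x, y))
    case (ToyArray(x), y) => ToyArray(compatible(x, y)) // merge into the element type
    case (x, ToyArray(y)) => ToyArray(compatible(x, y))
    case (ToyLong, ToyDouble) | (ToyDouble, ToyLong) => ToyDouble
    case _ => ToyString
  }

  /** Folds one more value tag occurrence into the type inferred so far. */
  def addValueType(existing: Option[ToyType], next: ToyType): ToyType = existing match {
    case None => next
    case Some(arr: ToyArray) => compatible(arr, next)
    case Some(scalar) => ToyArray(compatible(scalar, next)) // second occurrence: become an array
  }

  def inferAll(values: Seq[String]): Option[ToyType] =
    values.foldLeft(Option.empty[ToyType])((acc, v) => Some(addValueType(acc, inferFrom(v))))
}
```

Feeding the values 1, 2, 3, 4 and 5.0 through `inferAll` first produces an array of longs and then widens the element type to double on the last value, which is the transition the requested test is meant to exercise.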

HyukjinKwon · 2023-12-29T00:57:16Z

Merged to master.

shujingyang-db added 5 commits December 11, 2023 15:48

init

52bb404

Merge remote-tracking branch 'spark/master' into capture-values

68c5eac

revert format

4f63617

fix

815859f

rm todo

4be69e3

github-actions bot added the SQL label Dec 12, 2023

pkg

02193a8

HyukjinKwon changed the title ~~[SPARK-46382] XML: Capture values interspersed between elements~~ [SPARK-46382][SQL] XML: Capture values interspersed between elements Dec 13, 2023

sandip-db suggested changes Dec 15, 2023

shujingyang-db and others added 10 commits December 18, 2023 21:46

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/…

0e79565

…StaxXmlParser.scala Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>

Update sql/core/src/test/resources/test-data/xml-resources/values-sim…

f60a758

…ple.xml Co-authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>

whitespace

4f3acc0

whitespace

3de09f4

fix test case

775052a

deeply nested

306cbe6

inline xml

2b1fc93

tailrec

bc89b57

Merge remote-tracking branch 'spark/master' into capture-values

cc27944

# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlSuite.scala

nit

6599147

shujingyang-db requested a review from sandip-db December 19, 2023 19:34

ignoreSurroundingSpaces

0fa042d

sandip-db suggested changes Dec 21, 2023

shujingyang-db added 2 commits December 27, 2023 10:13

Merge branch 'master' of github.com:apache/spark into capture-values

51cdf54

test

dcae962

shujingyang-db requested a review from sandip-db December 27, 2023 19:13

shujingyang-db added 3 commits December 27, 2023 17:35

comments

0eb8aeb

whitespace with quotes

a5c3fbc

fix whitespace

32bd9fe

sandip-db approved these changes Dec 29, 2023

HyukjinKwon approved these changes Dec 29, 2023

HyukjinKwon closed this in 4ec63be Dec 29, 2023
