
[SPARK-44751][SQL] XML FileFormat Interface implementation #42462

Closed
wants to merge 12 commits into apache:master from sandip-db:xml-file-format-master

Conversation

sandip-db
Contributor

What changes were proposed in this pull request?

This is the second PR related to the built-in XML data source implementation (jira: https://issues.apache.org/jira/browse/SPARK-44751).
The previous PR (apache#41832) ported the spark-xml package.
This PR addresses the following:

  • Implement the FileFormat interface
  • Address the review comments in the previous XML PR (apache#41832)
  • Move from_xml and schema_of_xml to sql/functions
  • Move ".xml" to DataFrameReader/DataFrameWriter (see the usage sketch after this list)
  • Remove old APIs like XmlRelation, XmlReader, etc.
  • StaxXmlParser changes:
    • Use FailureSafeParser
    • Convert 'Row' usage to 'InternalRow'
    • Convert String to UTF8String
    • Handle MapData and ArrayData for MapType and ArrayType respectively
    • Use TimestampFormatter to parse timestamp
    • Use DateFormatter to parse date
  • StaxXmlGenerator changes:
    • Convert 'Row' usage to 'InternalRow'
    • Handle UTF8String for StringType
    • Handle MapData and ArrayData for MapType and ArrayType respectively
    • Use TimestampFormatter to format timestamp
    • Use DateFormatter to format date
  • Update XML tests to reflect the above changes
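
For orientation, a minimal usage sketch of the relocated user-facing API (the paths, sample data, and rowTag option value are illustrative assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_xml}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().appName("xml-sketch").getOrCreate()

// Batch read/write through the new DataFrameReader/DataFrameWriter.xml hooks.
// "rowTag" names the XML element that maps to one row (a spark-xml option).
val people = spark.read.option("rowTag", "person").xml("/tmp/people.xml")
people.write.option("rowTag", "person").xml("/tmp/people_out")

// Column-level parsing with from_xml, now in sql/functions.
val schema = new StructType().add("name", StringType).add("age", IntegerType)
val raw = spark.createDataFrame(Seq(Tuple1("<person><name>a</name><age>1</age></person>"))).toDF("xml")
val parsed = raw.select(from_xml(col("xml"), schema, Map.empty[String, String]).as("person"))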

Why are the changes needed?

These changes are required to bring the XML data source to parity with CSV and JSON and to support features like streaming, which requires the FileFormat interface to be implemented.
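
For illustration, a sketch of the streaming path that the FileFormat interface unlocks (assumes the SparkSession from the sketch above; the path and option values are assumptions, and streaming sources need an explicit schema, as with CSV/JSON):

import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Structured Streaming over a directory of XML files; this only works once
// the source implements the FileFormat interface.
val streamSchema = new StructType().add("name", StringType).add("age", IntegerType)
val stream = spark.readStream
  .schema(streamSchema)          // streams require a user-supplied schema
  .option("rowTag", "person")    // assumed spark-xml option
  .format("xml")
  .load("/tmp/people_stream")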

Does this PR introduce any user-facing change?

Yes, it adds support for XML data source.

How was this patch tested?

  • Ran all the XML unit tests.
  • GitHub Actions

Member

@yaooqinn left a comment

Sorry for missing the dev vote thread. In this case, IP clearance is necessary for bringing databricks/spark-xml to an upstream project of ASF. Steps we may follow:

  1. Verify license dependencies of databricks/spark-xml.
  2. Determine if SGA, CCLA/ICLA is necessary for IP attribution.
  3. Ensure any new PMC/Committer is updated to align with the project.
  4. Conduct a private@s.a.o mail vote.
  5. Update the Incubator IP clearance page, e.g. skywalking-rocketbot-ui, pulsar-manager.
  6. Notify the Incubator via mail.
  7. Contact infra and follow up.

@HyukjinKwon
Member

HyukjinKwon commented Aug 12, 2023

License-wise there is no problem because I wrote them. It's Apache License 2.0. I filed a CCLA/ICLA when I became a committer.

Through SPIP, we have reached a lazy consensus including PMC votes.

IP clearance is for an external project, but this is really a plugin with a very small codebase. We haven't done that for Avro, CSV, or cloudpickle, for example, where the codebase was really small and we reviewed them line by line.

In addition, we're NOT just porting it as is but releasing a modified version per Spark's internal interfaces, which will allow a lot of features such as partitioned tables. TBH it's more work and code to modify them than just porting it.

@yaooqinn
Member

Thanks for the explanation @HyukjinKwon. I'm OK with it if we already have precedents like Avro and CSV.

@@ -589,6 +589,11 @@
"<errors>"


This is great! I was thinking of upgrading the XML reader with Data Source V2 before, but was really stopped by the refactoring work involved. Thanks for adding it to the Spark mainline to unify the interfaces and catch up with the main changes.

Contributor Author

Thanks for reviewing this PR.


You're welcome! This is great!

- Add stub for xml expressions in spark connect
- Add exception for xml expressions in sql/functions and pyspark
- Corresponding test fixes
@github-actions bot added the BUILD label Aug 17, 2023
@@ -83,21 +86,21 @@ private[xml] object StaxXmlGenerator {

def writeElement(dt: DataType, v: Any, options: XmlOptions): Unit = (dt, v) match {
  case (_, null) | (NullType, _) => writer.writeCharacters(options.nullValue)
  case (StringType, v: UTF8String) => writer.writeCharacters(v.toString)
  case (StringType, v: String) => writer.writeCharacters(v)
  case (TimestampType, v: Timestamp) =>
Member

@HyukjinKwon Aug 21, 2023

I think you can remove this, and add case (DecimalType(), v: java.math.BigDecimal) => writer.writeCharacters(v.toString) (see also JacksonGenerator and which types are being handled).

BTW, we should also add type support for TimestampNTZType, YearMonthIntervalType and DayTimeIntervalType. But they are orthogonal and can be done separately.

Contributor Author

@sandip-db Aug 21, 2023

Most StringType values arrive here as UTF8String.

Also, JacksonGenerator supports DecimalType, and I was planning to add support for DecimalType in a followup. Is that not required?
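
For reference, a tiny standalone sketch of the suggested decimal handling, using the JDK's StAX writer directly (whether values arrive as java.math.BigDecimal or Spark's internal Decimal after the InternalRow migration is an assumption to verify):

import java.io.StringWriter
import javax.xml.stream.XMLOutputFactory

// Write a decimal's plain string form as element text, mirroring the
// suggested case (DecimalType(), v: java.math.BigDecimal) branch.
val sw = new StringWriter()
val writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw)
writer.writeStartElement("price")
writer.writeCharacters(new java.math.BigDecimal("10.50").toString)
writer.writeEndElement()
writer.flush()
println(sw.toString) // prints: <price>10.50</price>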

Member

@HyukjinKwon left a comment

Looks pretty good

@sandip-db
Contributor Author

> Thanks for the explanation @HyukjinKwon. I'm OK with it if we already have precedents like Avro and CSV.

@yaooqinn @HyukjinKwon has addressed your concern. Can you please approve?

* {@code fieldA [[data1], [data2]]}
*
* would produce an XML file below. {@code <fieldA> <item>data1</item> </fieldA> <fieldA>
* <item>data2</item> </fieldA>}
Contributor

is this the fixed version?

Contributor Author

Yes. Changed:
{@code fieldA [[data1, data2]]}
to
{@code fieldA [[data1], [data2]]}

* @since 4.0.0
*/
// scalastyle:on line.size.limit
def from_xml(e: Column, schema: StructType, options: Map[String, String]): Column =
Contributor

The schema parameter can be a StructType or a Column, and the options parameter can be a Scala or Java map, or omitted. This means we need 6 overloads of from_xml.

Is it really worth it? I know we did the same thing for from_json, but this is really convoluted.

How about something like

TextParsingFunction.newBuilder()
  .withSchema(...) // It has multiple overloads
  .withOptions(...) // It has multiple overloads
  .xml() // returns a Column

Anyway, it's unrelated to this PR. We can do it later. cc @HyukjinKwon
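
For reference, the six overloads implied by that matrix would look roughly like the following (a sketch of the signatures under discussion, not the final API):

// 2 schema forms (StructType, Column) x 3 options forms (none, Scala map,
// Java map) = 6 overloads of from_xml.
def from_xml(e: Column, schema: StructType): Column
def from_xml(e: Column, schema: StructType, options: Map[String, String]): Column
def from_xml(e: Column, schema: StructType, options: java.util.Map[String, String]): Column
def from_xml(e: Column, schema: Column): Column
def from_xml(e: Column, schema: Column, options: Map[String, String]): Column
def from_xml(e: Column, schema: Column, options: java.util.Map[String, String]): Column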

Member

@HyukjinKwon Aug 22, 2023

I think we can remove the (Scala-specific) signature with a Scala map for now. For the same reason, we don't have a Scala-specific version of to_csv, etc.

Contributor Author

from_csv has just two overloads. I can trim the from_xml overloads too. Let me know.

Member

Let's remove the signature with the Scala map for now, in a followup.

copy(timeZoneId = Option(timeZoneId))
}
override def nullSafeEval(xml: Any): Any = xml match {
case arr: GenericArrayData =>
Contributor

Why do we match this case if the handling is exactly the same as case arr: ArrayData?

Contributor Author

You are right. It shouldn't be there. Can I address this in a follow-up?
https://issues.apache.org/jira/browse/SPARK-44810

Contributor

sure
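
For context, GenericArrayData is a subclass of ArrayData, so a single ArrayData case already matches it; a minimal standalone sketch (class names are from Spark's org.apache.spark.sql.catalyst.util package):

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

// A GenericArrayData instance is matched by `case arr: ArrayData`, so a
// preceding GenericArrayData case with an identical body is redundant.
val arr: Any = new GenericArrayData(Array[Any](1, 2, 3))
arr match {
  case a: ArrayData => println(s"matched ArrayData with ${a.numElements()} elements")
}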

@HyukjinKwon
Member

Let me get this in first because @sandip-db seems to be working on another followup. Let's address them in a followup.

@HyukjinKwon
Member

The tests passed.

Merged to master.

xmlInputFactory.createFilteredReader(eventReader, filter)
}

// Jackson parsers can be ranked according to their performance:
Member

Let's also update the docs

s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
})

// SPARK-39731: Enables the backward compatible parsing behavior.
Member

Here too

* @since
*/
// scalastyle:on line.size.limit
def from_xml(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr {
Member

TODOs:

  • Scala and Python implementation for Spark Connect
  • Python and R implementation


*
* @since 4.0.0
*/
def xml(path: String): DataFrame = format("xml").load(path)

*
* @since 4.0.0
*/
def xml(path: String): DataFrame = {

*
* @since 4.0.0
*/
def xml(path: String): Unit = {

* @since 4.0.0
*/
// scalastyle:on line.size.limit
def schema_of_xml(xml: Column, options: java.util.Map[String, String]): Column = {
Contributor

Shall we at least have an overload with Scala options?

Member

Actually this is the same as schema_of_json. I suggested having only the Java map one for now, to avoid having too many overloaded versions.
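
A minimal usage sketch of the Java-map overload (the sample XML and the excludeAttribute option are illustrative assumptions):

import org.apache.spark.sql.functions.{lit, schema_of_xml}
import scala.jdk.CollectionConverters._

// Infer a schema from a sample XML literal; options are passed as a
// java.util.Map, mirroring schema_of_json.
val schemaCol = schema_of_xml(
  lit("<person><name>a</name><age>1</age></person>"),
  Map("excludeAttribute" -> "true").asJava)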

Contributor

@cloud-fan left a comment

The FileFormat integration part LGTM. I assume the parsing code is the same as before.

valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
Closes apache#42462 from sandip-db/xml-file-format-master.

Authored-by: Sandip Agarwala <131817656+sandip-db@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024