[SPARK-49062][SQL] Migrate XML to File Data Source V2 #47539

wayneguow · 2024-07-30T18:39:47Z

What changes were proposed in this pull request?

This PR migrates XML to File Data Source V2.

Why are the changes needed?

Add v2 support for XML.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA and adapt XmlSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

wayneguow · 2024-07-31T06:01:04Z

cc @HyukjinKwon @cloud-fan thanks.

HyukjinKwon · 2024-08-02T00:48:31Z

cc @sandip-db WDYT?

sandip-db · 2024-08-02T18:53:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/xml/XmlScan.scala

+  }
+
+  override def hashCode(): Int = super.hashCode()
+}


Any reason to not override getMetaData like json/csv?

Since there are no PushedFilters, we just keep the parent class's default method.
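
To illustrate the reply above, here is a minimal sketch of why a scan with no pushed filters can keep the inherited metadata method. The types below (`ScanModel`, `JsonLikeScan`, `XmlLikeScan`) are hypothetical stand-ins written for this example, not the real Spark classes, and the metadata keys are assumptions.

```scala
object MetadataSketch {
  // Simplified stand-in for a parent scan class; NOT the real Spark API.
  trait ScanModel {
    def format: String
    // Default metadata reported by the parent.
    def getMetaData(): Map[String, String] = Map("Format" -> format)
  }

  // JSON/CSV-style scan: it tracks pushed filters, so it overrides
  // getMetaData to report them in addition to the parent's entries.
  final case class JsonLikeScan(pushedFilters: Seq[String]) extends ScanModel {
    val format = "json"
    override def getMetaData(): Map[String, String] =
      super.getMetaData() ++
        Map("PushedFilters" -> pushedFilters.mkString("[", ", ", "]"))
  }

  // XML-style scan as described in the reply: nothing is pushed down,
  // so there is nothing extra to report and no override is needed.
  final case class XmlLikeScan() extends ScanModel {
    val format = "xml"
  }
}
```

In this model the XML-style scan reports only what the parent reports, which matches the reasoning that an override would add nothing until pushdown exists.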

sandip-db · 2024-08-02T19:02:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/xml/XmlScanBuilder.scala

+      partitionFilters,
+      dataFilters)
+  }
+}


Any reason to not override pushDataFilters like json/csv?

Because we do not currently support filter pushdown for XML. That is a separate topic and can be handled in another PR, so for now we use the parent class's default method.
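
The effect of relying on the default can be sketched as follows. This is a simplified model under stated assumptions: `ScanBuilderModel`, `JsonLikeScanBuilder`, `XmlLikeScanBuilder`, and `Filter` are hypothetical names for this example, and the assumption that the parent's default pushes nothing down is part of the model rather than a quote from the Spark source.

```scala
object PushDownSketch {
  // Hypothetical filter type: only the referenced column matters here.
  final case class Filter(column: String)

  // Stand-in for a parent scan builder; NOT the real Spark class.
  class ScanBuilderModel {
    // Assumed default: push nothing down, so every filter is
    // evaluated by the engine after the scan.
    def pushDataFilters(filters: Seq[Filter]): Seq[Filter] = Seq.empty
  }

  // JSON/CSV-style builder: pushes down the filters it can handle,
  // guarded by a feature flag.
  class JsonLikeScanBuilder(dataColumns: Set[String], pushDownEnabled: Boolean)
      extends ScanBuilderModel {
    override def pushDataFilters(filters: Seq[Filter]): Seq[Filter] =
      if (pushDownEnabled) filters.filter(f => dataColumns.contains(f.column))
      else Seq.empty
  }

  // XML-style builder as described in the reply: no override yet.
  class XmlLikeScanBuilder extends ScanBuilderModel
}
```

Leaving the override out keeps results correct, since unpushed filters are still applied after the scan; it only forgoes the optimization.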

sandip-db · 2024-08-02T19:17:22Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/xml/XmlDataSourceV2.scala

+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+class XmlDataSourceV2 extends FileDataSourceV2 {


Thanks for submitting this PR. I see that it is essentially a copy of json V2 refactored for XML. I pointed out a few differences in the PR compared to json. Are there any other deviations?

Yes, I referred to JSON and other data source implementations that already have v1 and v2. Except for the pushdown part, the rest should be consistent.

sandip-db · 2024-08-02T19:18:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

          if (provider.equalsIgnoreCase("xml") && sources.size == 2) {
            val externalSource = sources.filterNot(_.getClass.getName
-              .startsWith("org.apache.spark.sql.execution.datasources.xml.XmlFileFormat")
+              .startsWith("org.apache.spark.sql.execution.datasources.v2.xml.XmlDataSourceV2")


Is the discovered source XmlDataSourceV2 irrespective of whether xml is included in the USE_V1_SOURCE_LIST?

Yes, ServiceLoader uses the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file to load data sources. Whether it's v1 or v2, it's the same here.

I understand your concern. When USE_V1_SOURCE_LIST includes XML, we eventually fall back to the default FileFormat here:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

Line 111 in 9e35d04

case f: FileDataSourceV2 => f.fallbackFileFormat
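
The fallback described in this thread can be modelled roughly as below. This is a sketch, not the actual `DataSource.scala` logic: `FileSourceV2Model`, `resolve`, and the `Resolved` result types are hypothetical, and the comma-separated, case-insensitive parsing of the V1 list is an assumption made for the example.

```scala
object ProviderLookupSketch {
  sealed trait Resolved
  final case class UseV2(className: String) extends Resolved
  final case class UseV1(fileFormat: String) extends Resolved

  // Hypothetical model of a file-based V2 source that knows its V1 fallback.
  final case class FileSourceV2Model(
      shortName: String,
      className: String,
      fallbackFileFormat: String)

  // The same registered class is discovered either way; the V1 list only
  // decides whether it is used directly or replaced by its fallback.
  def resolve(source: FileSourceV2Model, useV1SourceList: String): Resolved = {
    val v1Names = useV1SourceList
      .toLowerCase(java.util.Locale.ROOT)
      .split(",")
      .map(_.trim)
      .toSet
    if (v1Names.contains(source.shortName)) UseV1(source.fallbackFileFormat)
    else UseV2(source.className)
  }
}
```

For example, with a model of the XML source whose fallback is `XmlFileFormat`, a V1 list containing `xml` resolves to the V1 file format, and a list without it resolves to the V2 class.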

sandip-db · 2024-08-02T19:20:34Z

cc @sandip-db WDYT?

It looks ok to me. Although, I have not come across anyone asking for it.

wayneguow · 2024-08-03T06:54:43Z

cc @sandip-db WDYT?

It looks ok to me. Although, I have not come across anyone asking for it.

Well, the reason I proposed this PR is that other common data sources have v1 and v2 implementations, so I want to add v2 for XML. WDYT, @cloud-fan? We would appreciate your help when you have time.

github-actions · 2024-11-23T00:24:47Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

xml v2

776b4e8

github-actions bot added the SQL label Jul 30, 2024

reset

f8c9512

wayneguow marked this pull request as ready for review July 31, 2024 01:16

sandip-db reviewed Aug 2, 2024

github-actions bot added the Stale label Nov 23, 2024

github-actions bot closed this Nov 24, 2024

wayneguow deleted the xml_v2 branch February 11, 2025 04:27
