Skip to content

Conversation

@wayneguow
Copy link
Contributor

What changes were proposed in this pull request?

This PR aims to Migrate XML to File Data Source V2.

Why are the changes needed?

Add v2 support for XML.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA and transform XmlSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 30, 2024
@wayneguow wayneguow marked this pull request as ready for review July 31, 2024 01:16
@wayneguow
Copy link
Contributor Author

cc @HyukjinKwon @cloud-fan thanks.

@HyukjinKwon
Copy link
Member

cc @sandip-db WDYT?

}

override def hashCode(): Int = super.hashCode()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not override getMetaData like json/csv?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no PushedFilters, just keep the default method of the parent class.

partitionFilters,
dataFilters)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not override pushDataFilters like json/csv?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we currently do not support push down for xml, this other topic for xml and we can separate another pr. Therefore, we can use the default method of the parent class currently.

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class XmlDataSourceV2 extends FileDataSourceV2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR. I see that it is essentially a copy of json V2 and refactored for XML. I pointed a few differences in the PR compared to json. Are there any other deviations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I referred to JSON and other other data source implementations that already have v1 and v2. Except for the push down part, the left part should be consistent.

if (provider.equalsIgnoreCase("xml") && sources.size == 2) {
val externalSource = sources.filterNot(_.getClass.getName
.startsWith("org.apache.spark.sql.execution.datasources.xml.XmlFileFormat")
.startsWith("org.apache.spark.sql.execution.datasources.v2.xml.XmlDataSourceV2")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the discovered source XmlDataSourceV2 irrespective of whether xml is included in the USE_V1_SOURCE_LIST?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ServiceLoader uses META-INF.services/org.apache.spark.sql.sources.DataSourceRegister file to load data source. Whether it's v1 or v2 it's the same here.

I can get your doubts. When the USE_V1_SOURCE_LIST includes XML, we finally back to using the default FileFormat here.

case f: FileDataSourceV2 => f.fallbackFileFormat

@sandip-db
Copy link
Contributor

cc @sandip-db WDYT?

It looks ok to me. Although, I have not come across anyone asking for it.

@wayneguow
Copy link
Contributor Author

cc @sandip-db WDYT?

It looks ok to me. Although, I have not come across anyone asking for it.

Well, the reason why I proposed this PR is that other common data sources have v1 and v2 implementations, so I want to add v2 for xml. WDYT @cloud-fan , we need your help when you have time.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 23, 2024
@github-actions github-actions bot closed this Nov 24, 2024
@wayneguow wayneguow deleted the xml_v2 branch February 11, 2025 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants