[SPARK-49062][SQL] Migrate XML to File Data Source V2 #47539
cc @HyukjinKwon @cloud-fan thanks.
cc @sandip-db WDYT?
| } | ||
|
|
||
| override def hashCode(): Int = super.hashCode() | ||
| } |
Any reason to not override getMetaData like json/csv?
Since XML has no pushed filters, I just kept the default method of the parent class.
| partitionFilters, | ||
| dataFilters) | ||
| } | ||
| } |
Any reason to not override pushDataFilters like json/csv?
Because we currently do not support push down for XML. That is a separate topic and can be handled in another PR, so for now we can use the default method of the parent class.
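To illustrate what relying on the parent's default means here, a minimal self-contained sketch follows. All names in it (`SketchFilter`, `SketchScanBuilder`, `JsonLikeBuilder`, `XmlLikeBuilder`) are hypothetical stand-ins rather than Spark classes, and it assumes only that the parent's default hook pushes nothing down.

```scala
// Hypothetical stand-in for a data filter; not a Spark class.
final case class SketchFilter(column: String)

// Hypothetical parent builder: by default, no data filter is pushed down.
abstract class SketchScanBuilder {
  protected def pushDataFilters(filters: Seq[SketchFilter]): Seq[SketchFilter] = Seq.empty

  // Returns the filters that the source accepted for push down.
  def pushed(filters: Seq[SketchFilter]): Seq[SketchFilter] = pushDataFilters(filters)
}

// A JSON/CSV-like builder overrides the hook and accepts the filters it is given.
class JsonLikeBuilder extends SketchScanBuilder {
  override protected def pushDataFilters(filters: Seq[SketchFilter]): Seq[SketchFilter] = filters
}

// An XML-like builder keeps the inherited default, so nothing is pushed down.
class XmlLikeBuilder extends SketchScanBuilder
```

With this shape, adding push down for XML later would only require overriding the hook in the XML builder.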
| import org.apache.spark.sql.types.StructType | ||
| import org.apache.spark.sql.util.CaseInsensitiveStringMap | ||
|
|
||
| class XmlDataSourceV2 extends FileDataSourceV2 { |
Thanks for submitting this PR. I see that it is essentially a copy of the JSON V2 source refactored for XML. I pointed out a few differences in the PR compared to JSON. Are there any other deviations?
Yes, I referred to JSON and other data source implementations that already have both v1 and v2. Except for the push down part, the rest should be consistent.
| if (provider.equalsIgnoreCase("xml") && sources.size == 2) { | ||
| val externalSource = sources.filterNot(_.getClass.getName | ||
| .startsWith("org.apache.spark.sql.execution.datasources.xml.XmlFileFormat") | ||
| .startsWith("org.apache.spark.sql.execution.datasources.v2.xml.XmlDataSourceV2") |
Is the discovered source XmlDataSourceV2 irrespective of whether xml is included in the USE_V1_SOURCE_LIST?
Yes, ServiceLoader uses the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file to load data sources. It is the same here whether the source is v1 or v2.
I understand your concern. When USE_V1_SOURCE_LIST includes XML, we ultimately fall back to the default FileFormat here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
Line 111 in 9e35d04
| case f: FileDataSourceV2 => f.fallbackFileFormat |
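The fallback described above can be sketched roughly as follows. This is a self-contained toy model, not Spark's actual code: `FallbackSketch`, `SketchV2Source`, and `resolve` are hypothetical names, and the returned strings merely stand in for the V2 source and its V1 `FileFormat`.

```scala
object FallbackSketch {
  // Hypothetical model of a discovered V2 file source and its V1 fallback.
  final case class SketchV2Source(shortName: String, v2Name: String, fallbackFileFormat: String)

  // Returns the name of the implementation that would be used.
  // `useV1SourceList` mimics a comma-separated config value such as "xml,json".
  def resolve(source: SketchV2Source, useV1SourceList: String): String = {
    val v1Names = useV1SourceList.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet
    if (v1Names.contains(source.shortName.toLowerCase)) source.fallbackFileFormat
    else source.v2Name
  }

  // The same discovered source is used in both cases; only the list differs.
  val xml: SketchV2Source = SketchV2Source("xml", "XmlDataSourceV2", "XmlFileFormat")
}
```

In this model the discovered source is always the V2 one, and the V1 list only decides whether its fallback format is used instead.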
It looks OK to me, although I have not come across anyone asking for it.
Well, the reason why I proposed this PR is that other common data sources have v1 and v2 implementations, so I want to add v2 for |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR aims to migrate XML to File Data Source V2.
Why are the changes needed?
Add v2 support for XML.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass GA and transform XmlSuite.
Was this patch authored or co-authored using generative AI tooling?
No.