[SPARK-44752][SQL] XML: Update Spark Docs #43350
Conversation
You might need to check it after building the docs as described in https://github.com/apache/spark/tree/master/docs
Yeah, thanks. I need this; I searched for this for a long time before but couldn't find out how to build and preview locally.
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala
I don't know how to build the docs locally so that I can preview the HTML.
Please refer to https://github.com/apache/spark/blob/master/docs/README.md
examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
@laglangyue Thanks for putting this together. Please address the outstanding comments to bring this to closure.
Thank you very much for your review; it was meticulous and rigorous. I have built the docs locally and run the Scala and Java examples; there were some delays in the process due to Java 17, but it looks good. I did not run the Python example because I have not used PySpark yet. Additionally, I found that CI reports some license-check issues for people.xml, and I don't know how to fix them. @HyukjinKwon @sandip-db
For people.xml, maybe you can reference #40249
LGTM. Approved assuming the examples work as expected. There are still some GitHub Actions failures due to lint issues, etc. that need to be fixed.
It seems that XML is not yet supported in PySpark. I modeled an XML example on the JSON one, but when I tried it in PySpark it ultimately failed.
Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame` row. The `option()` function can be used to customize reading or writing behavior, such as the handling of XML attributes, XSD validation, compression, and so on.
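To make the `rowTag` semantics concrete without a Spark cluster, here is a hypothetical sketch using only Python's standard library (not Spark code, and the data is made up): it selects every element named by `rowTag` and turns each one into a row-like dict, which is conceptually what the XML reader does with `DataFrame` rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch: mimic how rowTag selects the repeated XML
# element that maps to one row. This is not the Spark implementation.
doc = """
<people>
  <person><name>Alice</name><age>29</age></person>
  <person><name>Bob</name><age>41</age></person>
</people>
"""

def rows_for_tag(xml_text, row_tag):
    root = ET.fromstring(xml_text)
    # Every <row_tag> element becomes one "row": a dict of child tag -> text.
    return [{child.tag: child.text for child in elem}
            for elem in root.iter(row_tag)]

rows = rows_for_tag(doc, "person")
print(rows)  # [{'name': 'Alice', 'age': '29'}, {'name': 'Bob', 'age': '41'}]
```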
It seems that not all of the XML read APIs need the `rowTag` option.
Please refer to https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1
Initially, per `org.apache.spark.sql.catalyst.xml.XmlOptions`, `DEFAULT_ROW_TAG` is `ROW`,
and per @sandip-db the option will be made required in the future. Refer to the JIRA:
https://issues.apache.org/jira/browse/SPARK-45562
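As a hypothetical illustration of the default-row-tag behavior discussed above, a reader sketched in plain Python might fall back to a `ROW` default when no `rowTag` is given (the constant name mirrors `DEFAULT_ROW_TAG`; the reader itself is a stdlib sketch, not Spark's implementation):

```python
import xml.etree.ElementTree as ET

DEFAULT_ROW_TAG = "ROW"  # mirrors the default mentioned above; sketch only

def read_rows(xml_text, row_tag=None):
    # Fall back to the default tag when the caller does not pass one.
    tag = row_tag or DEFAULT_ROW_TAG
    root = ET.fromstring(xml_text)
    return [{c.tag: c.text for c in e} for e in root.iter(tag)]

doc = "<data><ROW><id>1</id></ROW><ROW><id>2</id></ROW></data>"
print(read_rows(doc))  # [{'id': '1'}, {'id': '2'}]
```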
@beliefer `rowTag` is ignored by `from_xml`, `schema_of_xml` and `xml(xmlDataset: Dataset[String])`. Each of these APIs assumes a single XML record that maps to a single `Row`.
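To illustrate why `rowTag` is irrelevant for those single-record APIs, here is a hypothetical stdlib sketch: when each input string is already exactly one record, the parser simply maps that one element to one row, so there is no repeated element to select.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch: like the single-record APIs above, treat each
# input string as exactly one XML record mapping to a single row.
def parse_single_record(xml_record):
    elem = ET.fromstring(xml_record)
    return {child.tag: child.text for child in elem}

row = parse_single_record("<person><name>Alice</name><age>29</age></person>")
print(row)  # {'name': 'Alice', 'age': '29'}
```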
If we make the `rowTag` option required everywhere in the future, please ignore the comment I mentioned.
Yes, I am working on `DataFrameReader.xml`. For now, change ".xml" in your Python code to
./build/mvn -pl :spark-sql_2.13 clean compile
I just completed a successful run of
Merged to master.
What changes were proposed in this pull request?
Add XML data source examples and a documentation page. See https://issues.apache.org/jira/browse/SPARK-44752
Why are the changes needed?
The XML data source is basically supported, but the XML examples and documentation page are not yet available.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Commented out the methods of the other data sources and clicked "Run" in IntelliJ IDEA to run the XML example.
Was this patch authored or co-authored using generative AI tooling?
No. I wrote it myself, modeled on the existing JSON and CSV examples.