[SPARK-44752][SQL] XML: Update Spark Docs #43350

Closed
wants to merge 3 commits into from

@laglangyue laglangyue commented Oct 12, 2023

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-44752

Why are the changes needed?

Basic support for the XML data source is in place, but the XML examples and documentation page are not yet available.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Commented out the methods for the other data sources, then ran the examples with 'Run' in IntelliJ IDEA.

Was this patch authored or co-authored using generative AI tooling?

I wrote it by imitating the existing JSON and CSV docs and examples.

@HyukjinKwon changed the title from "[SPARK-44752] XML: Update Spark Docs" to "[SPARK-44752][SQL] XML: Update Spark Docs" on Oct 12, 2023
@HyukjinKwon
Member

You might need to check after building the docs as described in https://github.com/apache/spark/tree/master/docs

@laglangyue
Author

You might need to check after building the docs as described in https://github.com/apache/spark/tree/master/docs

Yeah, thanks. I needed this. I had searched for a long time but could not find how to build and preview the docs locally.

@laglangyue changed the title from "[SPARK-44752][SQL] XML: Update Spark Docs" to "[WIP][SPARK-44752][SQL] XML: Update Spark Docs" on Oct 13, 2023
@github-actions github-actions bot removed the BUILD label Oct 13, 2023
@laglangyue
Author

I don't know how to build docs locally so that I can preview HTML
@HyukjinKwon @sandip-db
Thank you very much for your help

@beliefer
Contributor

Thank you very much for your help

Please refer to https://github.com/apache/spark/blob/master/docs/README.md

Contributor

@sandip-db sandip-db left a comment

@laglangyue Thanks for putting this together. Please address the outstanding comments to get this to closure.

@laglangyue
Author

Thank you very much for your review; you were meticulous and rigorous. I have built the docs locally and run the Scala and Java examples. Java 17 caused some delays along the way, but the results look good. I did not run the Python example because I have not used PySpark yet. I also found that the license check in CI reports an issue for people.xml, and I don't know how to fix it. @HyukjinKwon @sandip-db

@beliefer
Contributor

For people.xml, maybe you can refer to #40249

Contributor

@sandip-db sandip-db left a comment

LGTM. Approved assuming the examples work as expected. There are still some github action failures due to lint issues, etc. that need to be fixed.

@laglangyue changed the title from "[WIP][SPARK-44752][SQL] XML: Update Spark Docs" to "[SPARK-44752][SQL] XML: Update Spark Docs" on Oct 18, 2023
@laglangyue
Author

It seems that XML is not yet supported in PySpark. I wrote an XML example modeled on the JSON one, but it failed when I ran it with PySpark.
@sandip-db

limitations under the License.
---

Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame` row. The `option()` function can be used to customize the behavior of reading or writing, such as controlling the handling of XML attributes, XSD validation, compression, and so on.

Author

Originally, per org.apache.spark.sql.catalyst.xml.XmlOptions, DEFAULT_ROW_TAG is ROW,
and, per @sandip-db, the option will be made required in the future. See the JIRA:
https://issues.apache.org/jira/browse/SPARK-45562

Contributor

@beliefer rowTag is ignored by from_xml, schema_of_xml and xml(xmlDataset: Dataset[String]). Each of these APIs assumes a single XML record that maps to a single Row.

Contributor

@beliefer beliefer Oct 20, 2023

If we make the rowTag option required everywhere in the future, please ignore my earlier comment.
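
To make the rowTag discussion above concrete, here is a small plain-Python illustration of what it means for a row tag to select the elements that become rows. It uses the standard library's xml.etree.ElementTree, not Spark, so it is only a sketch of the idea and not how the XML data source is implemented. The sample data, the function name `rows_for_tag`, and the `attribute_prefix` parameter are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element and attribute names are made up.
SAMPLE_XML = """
<people>
  <person id="1"><name>Alice</name><age>30</age></person>
  <person id="2"><name>Bob</name><age>25</age></person>
</people>
"""


def rows_for_tag(xml_text, row_tag, attribute_prefix="_"):
    """Return one dict per element named `row_tag`.

    Attributes are stored under attribute_prefix + name, and direct child
    elements under their tag name. Values stay strings (no type inference).
    """
    root = ET.fromstring(xml_text.strip())
    rows = []
    for element in root.iter(row_tag):
        # Attributes of the row element become prefixed fields.
        row = {attribute_prefix + key: value for key, value in element.attrib.items()}
        # Each direct child element becomes a field named after its tag.
        for child in element:
            row[child.tag] = child.text
        rows.append(row)
    return rows


rows = rows_for_tag(SAMPLE_XML, "person")
for row in rows:
    print(row)
```

Each `<person>` element yields one row here, which is the role the row tag plays when reading a file. APIs that take a single XML record per input have no need for such a selector, which matches the comment above about from_xml and schema_of_xml.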

@sandip-db
Contributor

It seems that XML is not yet supported in PySpark. I imitated JSON and wrote an example of XML, but I tried PySpark and failed in the end. @sandip-db

Yes, I am working on DataFrameReader.xml. For now, change ".xml" in your Python code to .format("xml").load
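
The suggested workaround might look roughly like the sketch below. The helper name `read_xml` and its parameters are hypothetical; only `format`, `option`, and `load` are existing DataFrameReader methods. It assumes a Spark build that already includes the built-in XML data source.

```python
def read_xml(spark, path, row_tag):
    """Read XML through the generic reader API instead of a `.xml(...)` shortcut.

    `spark` is expected to be an existing SparkSession; `row_tag` names the
    XML element that should map to one DataFrame row.
    """
    return (
        spark.read
        .format("xml")              # select the XML data source by name
        .option("rowTag", row_tag)  # element that maps to a row
        .load(path)
    )
```

With a real session this would be called as `read_xml(spark, "examples/src/main/resources/people.xml", "person")`. The path comes from this PR's file list; the row tag `person` is an assumption about the file's contents.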

@laglangyue
Author

./build/mvn -pl :spark-sql_2.13 clean compile
[image]
It seems the call to the XmlOptions constructor is ambiguous.
@sandip-db

@sandip-db
Contributor

sandip-db commented Oct 20, 2023

./build/mvn -pl :spark-sql_2.13 clean compile... it seems the construction method of XmlOptions is ambiguous

I just completed a successful run of ./build/mvn -DskipTests clean package.
It looks like mvn in your case is picking up stale dependencies.

@laglangyue laglangyue force-pushed the xml_example_doc branch 2 times, most recently from e2eef57 to bcb2920 Compare October 23, 2023 08:57
@HyukjinKwon
Member

Merged to master.

@laglangyue laglangyue deleted the xml_example_doc branch October 31, 2023 04:44