[SPARK-44752][SQL] XML: Update Spark Docs #43350
Conversation
You might need to check it after building the docs as described in https://github.com/apache/spark/tree/master/docs
Yeah, thanks. I need this; I searched for this for a long time before but couldn't find out how to build and preview locally.
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala
I don't know how to build the docs locally so that I can preview the HTML.
Please refer to https://github.com/apache/spark/blob/master/docs/README.md
examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
@laglangyue Thanks for putting this together. Please address the outstanding comments to bring this to closure.
Thank you very much for your review; it was meticulous and rigorous. I have built the docs locally and run the Scala and Java examples; there were some delays in the process due to Java 17, but it looks good. I did not run the Python example because I have not used PySpark yet. Additionally, I found that CI reports some license-check issues for people.xml, and I don't know how to fix them. @HyukjinKwon @sandip-db
For people.xml, maybe you can reference #40249
LGTM. Approved assuming the examples work as expected. There are still some GitHub Actions failures due to lint issues, etc. that need to be fixed.
It seems that XML is not yet supported in PySpark. I modeled an XML example on the JSON one, but when I tried it in PySpark it ultimately failed.
Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, and `dataframe.write().xml("path")` to write to an XML file. When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame` row. The `option()` function can be used to customize reading or writing behavior, such as the handling of XML attributes, XSD validation, compression, and so on.
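To make the `rowTag` semantics concrete without a Spark cluster, here is a hypothetical sketch using only Python's standard library (not Spark code, and the data is made up): it selects every element named by `rowTag` and turns each one into a row-like dict, which is conceptually what the XML reader does with `DataFrame` rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch: mimic how rowTag selects the repeated XML
# element that maps to one row. This is not the Spark implementation.
doc = """
<people>
  <person><name>Alice</name><age>29</age></person>
  <person><name>Bob</name><age>41</age></person>
</people>
"""

def rows_for_tag(xml_text, row_tag):
    root = ET.fromstring(xml_text)
    # Every <row_tag> element becomes one "row": a dict of child tag -> text.
    return [{child.tag: child.text for child in elem}
            for elem in root.iter(row_tag)]

rows = rows_for_tag(doc, "person")
print(rows)  # [{'name': 'Alice', 'age': '29'}, {'name': 'Bob', 'age': '41'}]
```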
It seems that not all of the XML read APIs need the `rowTag` option.
Please refer to https://github.com/apache/spark/blob/7057952f6bc2c5cf97dd408effd1b18bee1cb8f4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L579C1-L579C1
Initially, per `org.apache.spark.sql.catalyst.xml.XmlOptions`, `DEFAULT_ROW_TAG` is `ROW`,
and per @sandip-db the option will be made required in the future. Refer to the JIRA:
https://issues.apache.org/jira/browse/SPARK-45562
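As a hypothetical illustration of the default-row-tag behavior discussed above, a reader sketched in plain Python might fall back to a `ROW` default when no `rowTag` is given (the constant name mirrors `DEFAULT_ROW_TAG`; the reader itself is a stdlib sketch, not Spark's implementation):

```python
import xml.etree.ElementTree as ET

DEFAULT_ROW_TAG = "ROW"  # mirrors the default mentioned above; sketch only

def read_rows(xml_text, row_tag=None):
    # Fall back to the default tag when the caller does not pass one.
    tag = row_tag or DEFAULT_ROW_TAG
    root = ET.fromstring(xml_text)
    return [{c.tag: c.text for c in e} for e in root.iter(tag)]

doc = "<data><ROW><id>1</id></ROW><ROW><id>2</id></ROW></data>"
print(read_rows(doc))  # [{'id': '1'}, {'id': '2'}]
```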
@beliefer `rowTag` is ignored by `from_xml`, `schema_of_xml` and `xml(xmlDataset: Dataset[String])`. Each of these APIs assumes a single XML record that maps to a single `Row`.
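To illustrate why `rowTag` is irrelevant for those single-record APIs, here is a hypothetical stdlib sketch: when each input string is already exactly one record, the parser simply maps that one element to one row, so there is no repeated element to select.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch: like the single-record APIs above, treat each
# input string as exactly one XML record mapping to a single row.
def parse_single_record(xml_record):
    elem = ET.fromstring(xml_record)
    return {child.tag: child.text for child in elem}

row = parse_single_record("<person><name>Alice</name><age>29</age></person>")
print(row)  # {'name': 'Alice', 'age': '29'}
```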
If we make the `rowTag` option required everywhere in the future, please ignore the comment I mentioned.
Yes, I am working on `DataFrameReader.xml`. For now, change ".xml" in your Python code to
./build/mvn -pl :spark-sql_2.13 clean compile
I just completed a successful run of
Merged to master.
What changes were proposed in this pull request?
Add XML data source examples and a documentation page. See https://issues.apache.org/jira/browse/SPARK-44752
Why are the changes needed?
The XML data source is basically supported, but the XML examples and documentation page are not yet available.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Commented out the methods of the other data sources and clicked "Run" in IntelliJ IDEA to run the XML example.
Was this patch authored or co-authored using generative AI tooling?
No. I wrote it myself, modeled on the existing JSON and CSV examples.