Support for namespaces declared at the ancestor level? #74
Comments
@alreadyexists I see. Yes, this library does not currently handle namespaces, and I think what you suggest sounds reasonable. Above all, I think this library should be able to read and parse the XML you just provided with
I could reproduce the bug you hit by running the following:

    val testFile = "path-for-xml"
    sqlContext.xmlFile(testFile, rowTag = "Topic").show()

The console output was:

    11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id=""> <catid>1</catid> </Topic>
    root
Just to make this clear, this was also suggested by @davemoyers in #39 (comment).
#74 This is a workaround for reading XML files with namespaces. Currently it ignores namespaces, but we might need to handle them through options or in some other way. This PR makes the library able to read such an XML file rather than dropping its rows as malformed, as below:
```bash
11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id=""> <catid>1</catid> </Topic>
root
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #75 from HyukjinKwon/ISSUE-74-namespace.
I merged PR #75. So it is now possible to read such XML files, but this library still does not handle namespaces, so I will leave this issue open.
I'd be interested to know whether you have any thoughts on how to add namespace support, at least to some degree. In our case I think we know beforehand which namespaces will show up, so maybe I could try adding options to standardize them as they are loaded (a kind of lookup to avoid conflicts, or namespaces with different references across different documents).
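As a rough illustration of the "known namespaces" idea, the sketch below wraps a row fragment in a synthetic root element that declares a user-supplied prefix-to-URI lookup, so a namespace-aware StAX parser can resolve the prefixes. Everything here is hypothetical and not part of this library: the `KnownNamespaces` object, the `wrapper` element, and the `http://example.org/r` URI are placeholders chosen for the example.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object KnownNamespaces {
  // Hypothetical user-supplied lookup: prefix -> namespace URI (placeholder URI).
  val known: Map[String, String] = Map("r" -> "http://example.org/r")

  // Wrap a row fragment in a synthetic root that declares every known prefix.
  // Assumes the URIs contain no characters that need XML escaping.
  def wrap(fragment: String, ns: Map[String, String]): String = {
    val decls = ns.map { case (p, uri) => "xmlns:" + p + "=\"" + uri + "\"" }.mkString(" ")
    "<wrapper " + decls + ">" + fragment + "</wrapper>"
  }

  // Parse the wrapped fragment with a namespace-aware reader and return
  // (namespace URI, local name) for each attribute on the row element.
  def rowAttributes(fragment: String, ns: Map[String, String]): List[(String, String)] = {
    val factory = XMLInputFactory.newInstance() // namespace-aware by default
    val reader = factory.createXMLStreamReader(new StringReader(wrap(fragment, ns)))
    var depth = 0
    var attrs = List.empty[(String, String)]
    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          depth += 1
          // Depth 1 is the synthetic wrapper, depth 2 is the row element.
          if (depth == 2) {
            attrs = (0 until reader.getAttributeCount).toList.map { i =>
              (reader.getAttributeNamespace(i), reader.getAttributeLocalName(i))
            }
          }
        case XMLStreamConstants.END_ELEMENT => depth -= 1
        case _ => // ignore text, comments, etc.
      }
    }
    reader.close()
    attrs
  }
}
```

Because attributes come back keyed by namespace URI rather than by prefix, documents that bind different prefixes to the same URI could be mapped to one standardized name. This sketch does not handle default namespaces or prefixes missing from the lookup.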
@spiu Sorry for the late reply. I would appreciate it if you went ahead. If there is a proper way to standardize namespaces, it would be great to add such an option. One thing I am a bit worried about is, basically
Hello, I have XML that uses namespaces and looks like the following snippet. Does Spark-XML support parsing this XML? ... Thanks ` <PDBx:datablock datablockName="1GPE" PDBx:atom_siteCategory
Hi, we have the problem that attribute values are lost when parsing the XML. E.g. parsing
Hello! I need some clarification on applicability.
Suppose I'd like to extract Topic entries from the DMOZ RDF dump (http://rdf.dmoz.org/). WLOG, it goes like this:
The Topic attribute id refers to a namespace r declared in the RDF scope. If I'm getting the whole thing right, on Spark, when looking for rowtag entries within a particular partition, there is no way to tell that these namespaces have been defined in the outer scope, in a way, globally for the whole Spark cluster. This is why I'm getting the 'Malformed row' warning in the log on this example (caused by javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,16] Message: http://www.w3.org/TR/1999/REC-xml-names-19990114#AttributePrefixUnbound?Topic&r:id&r).
First of all, am I correct?
Second: is there any workaround here? I'd really enjoy reading that RDF dump. Maybe the parser could accept some static definitions provided by the user?
Thanks.
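The failure described above can be reproduced outside Spark with the JDK's StAX API alone. The sketch below parses the same row fragment twice: once namespace-aware, where the `r` prefix has no visible declaration, and once with namespace processing switched off through the standard `XMLInputFactory.IS_NAMESPACE_AWARE` property. The `UnboundPrefixDemo` object and its helper are hypothetical names for illustration, and switching off namespace awareness is only one possible way to ignore namespaces, not necessarily how this library does it.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.util.Try

object UnboundPrefixDemo {
  // A row fragment as a partition might see it: the `r` prefix is used,
  // but its xmlns:r declaration sits on an ancestor that is not included.
  val fragment: String = """<Topic r:id=""> <catid>1</catid> </Topic>"""

  // Walk the whole fragment with StAX and collect element names in order.
  // Any parse error is captured as a Failure instead of being thrown.
  def elementNames(xml: String, namespaceAware: Boolean): Try[List[String]] = Try {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(
      XMLInputFactory.IS_NAMESPACE_AWARE,
      java.lang.Boolean.valueOf(namespaceAware))
    val reader = factory.createXMLStreamReader(new StringReader(xml))
    val names = List.newBuilder[String]
    while (reader.hasNext) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        names += reader.getLocalName
      }
    }
    reader.close()
    names.result()
  }
}
```

With the JDK's built-in StAX implementation, `UnboundPrefixDemo.elementNames(UnboundPrefixDemo.fragment, namespaceAware = true)` is expected to be a `Failure` wrapping an `XMLStreamException`, while the same call with `namespaceAware = false` should succeed. The trade-off is that prefixed names are then treated as opaque strings, so two prefixes bound to the same URI are no longer recognized as equivalent.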