Support for namespaces declared at the ancestor level? #74

alreadyexists-voodoo · 2016-01-26T08:19:56Z

Hello! I need some clarification on applicability.
Suppose I'd like to extract Topic entries from the DMOZ RDF dump (http://rdf.dmoz.org/).
Without loss of generality, it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <!-- Generated at 2016-01-24 00:05:51 EST from DMOZ 2.0 -->
  <Topic r:id="">
    <catid>1</catid>
  </Topic>
</RDF>

The Topic attribute id refers to a namespace r declared in the RDF scope. If I understand correctly, when Spark looks for rowTag entries within a particular partition, there is no way to tell that these namespaces were defined in the outer scope, that is, globally for the whole Spark cluster. This is why I'm getting a 'Malformed row' warning in the log for this example (caused by javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,16] Message: http://www.w3.org/TR/1999/REC-xml-names-19990114#AttributePrefixUnbound?Topic&r:id&r)

First of all, am I correct?
Second, is there any workaround? I'd really like to read that RDF dump. Maybe the parser could accept some static namespace definitions provided by the user?
Thanks.
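
For readers who want to see the failure outside Spark, here is a minimal sketch in Python. It is not spark-xml code: the standard-library ElementTree parser stands in for the JVM StAX parser named in the error message, and the constant names are illustrative. It parses the whole document first, then only the Topic fragment, which is roughly what a per-row reader would see.

```python
import xml.etree.ElementTree as ET

# The sample document from the report, with the namespace declarations on the root.
FULL_DOC = (
    '<RDF xmlns:r="http://www.w3.org/TR/RDF/" '
    'xmlns:d="http://purl.org/dc/elements/1.0/" '
    'xmlns="http://dmoz.org/rdf/">'
    '<Topic r:id=""><catid>1</catid></Topic>'
    '</RDF>'
)

# The same row cut out on its own: the xmlns:r declaration is no longer in scope.
ROW_FRAGMENT = '<Topic r:id=""><catid>1</catid></Topic>'

# Parsing the full document works; ElementTree expands prefixes to '{uri}local'.
root = ET.fromstring(FULL_DOC)
topic = root[0]

# Parsing the fragment alone fails because the prefix 'r' is not bound.
try:
    ET.fromstring(ROW_FRAGMENT)
    fragment_error = None
except ET.ParseError as exc:
    fragment_error = str(exc)

print(topic.tag)
print(fragment_error)
```

The fragment is well-formed apart from the missing declaration, which is consistent with the explanation in the question: the prefix is only meaningful together with the ancestor that declares it.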


HyukjinKwon · 2016-01-27T01:54:41Z

@alreadyexists I see. Yes, this library currently does not handle namespaces. Your suggestion also sounds reasonable to me. Above all, I think this library should be able to read and parse the XML you provided with rowTag set to Topic. I will look into this more closely when I have some time.

HyukjinKwon · 2016-01-27T02:27:37Z

I could reproduce the bug you hit by running the following:

val testFile = "path-for-xml"
sqlContext.xmlFile(testFile, rowTag = "Topic").show()

The console output was:

11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id="">        <catid>1</catid>    </Topic>
root

HyukjinKwon · 2016-01-27T02:35:34Z

To be clear, this was also suggested by @davemoyers in #39 (comment).

#74

This is a workaround to read XML files with namespaces. Currently, this ignores namespaces, but we might need to handle them through options or other means.

This PR makes this library able to read such an XML file rather than dropping the rows as malformed, as below:

```bash
11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id=""> <catid>1</catid> </Topic>
root
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #75 from HyukjinKwon/ISSUE-74-namespace.

HyukjinKwon · 2016-01-27T04:46:05Z

I merged PR #75. It is now possible to read such XML files, but the library still does not actually handle namespaces, so I will leave this issue open.
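
As an illustration of what "ignoring namespaces while reading" can mean in general, the sketch below parses the row fragment with namespace processing turned off. This is not the code from PR #75; it uses Python's standard `xml.parsers.expat` module, and the function name `parse_ignoring_namespaces` is made up for the example. With namespace processing off, `r:id` is treated as an ordinary attribute name that happens to contain a colon.

```python
import xml.parsers.expat

ROW_FRAGMENT = '<Topic r:id=""><catid>1</catid></Topic>'


def parse_ignoring_namespaces(xml_text):
    """Return (element name, attributes) pairs, parsed without namespace processing."""
    seen = []
    # ParserCreate() without a namespace_separator argument leaves expat's
    # namespace processing disabled, so unbound prefixes are not an error.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(xml_text, True)
    return seen


elements = parse_ignoring_namespaces(ROW_FRAGMENT)
print(elements)
```

The trade-off matches the comment above: the row becomes readable, but the prefix is kept only as text and is never resolved to a namespace URI.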

spiu · 2016-02-16T09:53:57Z

I'd be interested in knowing whether you have any thoughts on how to add namespace support (at least to some degree).

In our case, I think we know beforehand which namespaces will show up, so maybe I could try adding options to standardize them as they are loaded (a kind of lookup to avoid conflicts, or prefixes that refer to different namespaces across different documents).
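
One possible reading of the lookup idea above, sketched in Python with the standard library rather than as a spark-xml option. The table `CANONICAL_PREFIXES`, the helper names, and the two sample documents are all hypothetical. The sketch assumes the namespace declarations are visible to the parser (here, by parsing whole documents), maps each namespace URI to one agreed prefix, and so gives the same column-style name no matter which prefix a given document used.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied lookup: namespace URI -> canonical prefix.
CANONICAL_PREFIXES = {
    "http://www.w3.org/TR/RDF/": "rdf",
    "http://dmoz.org/rdf/": "dmoz",
}


def canonical_name(expanded_name, lookup):
    """Turn ElementTree's '{uri}local' form into 'prefix:local'.

    Names without a namespace, or with a URI missing from the lookup,
    are returned unchanged.
    """
    if not expanded_name.startswith("{"):
        return expanded_name
    uri, local = expanded_name[1:].split("}", 1)
    prefix = lookup.get(uri)
    return f"{prefix}:{local}" if prefix else expanded_name


def topic_names(xml_text):
    """Return the canonical tag and sorted canonical attribute names of the first child."""
    topic = ET.fromstring(xml_text)[0]
    tag = canonical_name(topic.tag, CANONICAL_PREFIXES)
    attrs = sorted(canonical_name(key, CANONICAL_PREFIXES) for key in topic.attrib)
    return tag, attrs


# Two documents that use different prefixes for the same namespaces.
doc_a = '<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/"><Topic r:id=""/></RDF>'
doc_b = '<x:RDF xmlns:q="http://www.w3.org/TR/RDF/" xmlns:x="http://dmoz.org/rdf/"><x:Topic q:id=""/></x:RDF>'

print(topic_names(doc_a))
print(topic_names(doc_b))
```

Keying the lookup by URI rather than by prefix is the design choice that avoids conflicts: prefixes are local to a document, while the URI identifies the namespace.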

HyukjinKwon · 2016-03-07T04:50:55Z

@spiu Sorry for the late reply. I would appreciate it if you went ahead. If there is a proper way to standardize namespaces, it would be great to add such an option.

One thing I am a bit worried about is that XmlInputFormat reads one sub-XML chunk per row, delimited by rowTag, and then parses each chunk as a complete XML document. Because of this, it is difficult to access the namespaces declared at the root element. It may be possible somehow, but I have not come up with a simple, good idea for this yet.
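
A sketch of the "static definitions provided by the user" idea raised at the top of the thread, applied to the per-chunk parsing described above. It is written in Python with the standard library and is not part of spark-xml; the function `parse_row_with_namespaces`, the `wrapper` element, and the `DMOZ_NAMESPACES` mapping are assumptions made for the example. Each row chunk is wrapped in a synthetic root that carries the user-supplied declarations, so the chunk can be parsed as a complete, namespace-aware document without access to the real root element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def parse_row_with_namespaces(row_xml, namespaces):
    """Parse one row chunk inside a synthetic root that declares `namespaces`.

    `namespaces` maps prefix -> URI; the empty-string prefix stands for the
    default namespace. The chunk must not start with an XML declaration.
    """
    declarations = " ".join(
        f"xmlns:{prefix}={quoteattr(uri)}" if prefix else f"xmlns={quoteattr(uri)}"
        for prefix, uri in namespaces.items()
    )
    wrapped = f"<wrapper {declarations}>{row_xml}</wrapper>"
    # The row element is the only child of the synthetic root.
    return ET.fromstring(wrapped)[0]


# Hypothetical user-supplied declarations, copied from the DMOZ root element.
DMOZ_NAMESPACES = {
    "r": "http://www.w3.org/TR/RDF/",
    "d": "http://purl.org/dc/elements/1.0/",
    "": "http://dmoz.org/rdf/",
}

topic = parse_row_with_namespaces('<Topic r:id=""><catid>1</catid></Topic>', DMOZ_NAMESPACES)
print(topic.tag, topic.attrib)
```

This only helps when the declarations are known up front and are the same for every row; prefixes declared on intermediate ancestors, or redefined partway through a file, would still be missed.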

ayat-khairy · 2017-07-05T13:14:33Z

Hello,

I have XML that uses namespaces and looks like the following snippet. Does Spark-XML support parsing this XML? ... Thanks

`

<PDBx:datablock datablockName="1GPE"
    xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://pdbml.pdb.org/schema/pdbx-v40.xsd pdbx-v40.xsd">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:B_iso_or_equiv>23.83</PDBx:B_iso_or_equiv>
      <PDBx:Cartn_x>28.000</PDBx:Cartn_x>
      <PDBx:Cartn_y>7.940</PDBx:Cartn_y>
      <PDBx:Cartn_z>-19.207</PDBx:Cartn_z>
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:auth_atom_id>N</PDBx:auth_atom_id>
      <PDBx:auth_comp_id>TYR</PDBx:auth_comp_id>
      <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
      <PDBx:group_PDB>ATOM</PDBx:group_PDB>
      <PDBx:label_alt_id xsi:nil="true" />
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
      <PDBx:label_atom_id>N</PDBx:label_atom_id>
      <PDBx:label_comp_id>TYR</PDBx:label_comp_id>
      <PDBx:label_entity_id>1</PDBx:label_entity_id>
      <PDBx:label_seq_id>1</PDBx:label_seq_id>
      <PDBx:occupancy>1.00</PDBx:occupancy>
      <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
      <PDBx:type_symbol>N</PDBx:type_symbol>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
</PDBx:datablock>
`

reneschroeder0000 · 2018-10-05T10:33:11Z

Hi,

We have the problem that attribute values are lost when parsing the XML. For example, when parsing
<element xsi:type="abc" type="xyz"></element>, only one of the two type attributes is recognized, so either "abc" or "xyz" is returned depending on the order. Any suggestions on how to fix this behavior?
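
The collision described above can be illustrated without Spark. The sketch below uses Python's standard library and is not a description of spark-xml internals; `strip_namespace` is a hypothetical helper that mimics dropping the prefix, and the `xmlns:xsi` declaration is added so the element parses on its own. With qualified names the two attributes stay distinct; once the namespace part is dropped they share one key and one value overwrites the other.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
element = ET.fromstring(
    f'<element xmlns:xsi="{XSI}" xsi:type="abc" type="xyz"></element>'
)

# Namespace-aware view: the keys are '{uri}type' and 'type', so both survive.
qualified = dict(element.attrib)


def strip_namespace(name):
    """Drop the '{uri}' part of an expanded name, keeping only the local name."""
    return name.rsplit("}", 1)[-1]


# Prefix-ignoring view: both attributes collapse onto the key 'type',
# and whichever one is processed last wins.
stripped = {}
for key, value in element.attrib.items():
    stripped[strip_namespace(key)] = value

print(qualified)
print(stripped)
```

A reader that wants both values has to keep something that distinguishes them in the key, for example the prefix (`xsi:type`) or the namespace URI.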

HyukjinKwon added bug enhancement labels Jan 27, 2016

HyukjinKwon mentioned this issue Jan 27, 2016

Ignore namespaces while reading. #75

Closed

HyukjinKwon mentioned this issue Sep 9, 2016

Perhaps spark-xml should ignore prefixes? #168

Closed

HyukjinKwon mentioned this issue Dec 22, 2018

How to write with xml tag at the first line of file? #278

Closed

srowen closed this as completed Jun 3, 2022
