Support for namespaces declared at the ancestor level? #74

Closed
alreadyexists-voodoo opened this issue Jan 26, 2016 · 8 comments

Comments

@alreadyexists-voodoo

Hello! I need some clarification on whether this library applies to my case.
Suppose I'd like to extract Topic entries from the DMOZ RDF dump (http://rdf.dmoz.org/).
Without loss of generality, it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <!-- Generated at 2016-01-24 00:05:51 EST from DMOZ 2.0 -->
  <Topic r:id="">
    <catid>1</catid>
  </Topic>
</RDF>

The Topic attribute id uses the namespace prefix r, which is declared on the enclosing RDF element. If I understand this correctly, when Spark looks for rowTag entries within a particular partition, it has no way to tell that these namespaces were declared in the outer scope and so apply, in a way, globally across the whole Spark cluster. This is why I'm getting a 'Malformed row' warning in the log for this example (caused by javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,16] Message: http://www.w3.org/TR/1999/REC-xml-names-19990114#AttributePrefixUnbound?Topic&r:id&r)

First of all, am I correct?
Second: is there any workaround here? I'd really like to be able to read that RDF dump. Maybe the parser could accept some static namespace definitions provided by the user?
Thanks.
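
For anyone wanting to see the failure outside Spark, here is a minimal sketch using only the JDK's StAX API. It assumes that each row is handed to a namespace-aware parser as a standalone chunk, which is what the error above suggests; `UnboundPrefixDemo` and `parseError` are illustrative names, not part of this library.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamException}

object UnboundPrefixDemo {
  // Reads every event of `xml` with a namespace-aware StAX parser.
  // Returns None on success, or Some(message) if the parser rejects the input.
  def parseError(xml: String): Option[String] = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, java.lang.Boolean.TRUE)
    try {
      val reader = factory.createXMLStreamReader(new StringReader(xml))
      try {
        while (reader.hasNext) reader.next()
      } finally reader.close()
      None
    } catch {
      case e: XMLStreamException => Some(e.getMessage)
    }
  }

  // The row chunk alone: the prefix r is used but never declared.
  val row = """<Topic r:id=""><catid>1</catid></Topic>"""

  // The same row inside its root element, where r is declared.
  val whole =
    """<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">""" + row + "</RDF>"
}
```

`UnboundPrefixDemo.parseError(UnboundPrefixDemo.row)` should come back with an error message, while the same call on `UnboundPrefixDemo.whole` should return `None`, since the declaration of r is in scope there.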

@HyukjinKwon
Member

@alreadyexists I see. Yes, this library currently does not handle namespaces, and your suggestion sounds reasonable. Above all, I think this library should be able to read and parse the XML you provided with rowTag set to Topic. I will look into this more closely when I have some time.

@HyukjinKwon
Member

I could reproduce the bug you hit by running the following:

val testFile = "path-for-xml"
sqlContext.xmlFile(testFile, rowTag = "Topic").show()

The console output was

11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id="">        <catid>1</catid>    </Topic>
root

@HyukjinKwon
Member

Just to make this clear, this was also suggested by @davemoyers, #39 (comment).

HyukjinKwon added a commit that referenced this issue Jan 27, 2016
#74

This is a workaround to read XML files with namespaces.
Currently, this ignores namespaces, but we might need to handle them through options or other means.

This PR makes this library able to read such an XML file rather than dropping the rows as malformed, as below:

```bash
11:25:32.517 WARN com.databricks.spark.xml.util.InferSchema$: Dropping malformed row: <Topic r:id="">        <catid>1</catid>    </Topic>
root
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #75 from HyukjinKwon/ISSUE-74-namespace.
@HyukjinKwon
Member

I merged PR #75. So it is now possible to read such XML files, but this library still does not deal with namespaces. I will therefore leave this issue open.
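
To illustrate what "ignoring namespaces" can mean at the parser level, here is a minimal sketch with the JDK's StAX API. It is an assumption that the workaround amounts to switching off namespace awareness; the thread does not say how PR #75 does it, and `NamespaceUnawareDemo` is an illustrative name only.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object NamespaceUnawareDemo {
  // Counts the start tags in `xml` with a parser that does not resolve prefixes,
  // so an undeclared prefix such as r is not treated as an error.
  def countStartElements(xml: String): Int = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, java.lang.Boolean.FALSE)
    val reader = factory.createXMLStreamReader(new StringReader(xml))
    try {
      var count = 0
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) count += 1
      }
      count
    } finally reader.close()
  }

  // The row chunk from the issue, with no declaration for the prefix r.
  val row = """<Topic r:id=""><catid>1</catid></Topic>"""
}
```

With namespace awareness off, `NamespaceUnawareDemo.countStartElements(NamespaceUnawareDemo.row)` should read the chunk to the end and count both Topic and catid, at the cost of losing any information about which namespace a prefix refers to.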

@spiu

spiu commented Feb 16, 2016

I'd be interested in knowing if you had any thoughts on how to add namespace support (or at least to some degree)?

In our case I think we know beforehand which namespaces will show up, so maybe I could try adding options to standardize them as they are loaded (a kind of lookup to avoid conflicts, or prefixes that refer to different namespaces across different documents).

@HyukjinKwon
Member

@spiu Sorry for the late reply. I would appreciate it if you went ahead. If there is a proper way to standardize namespaces, it would be great to add such an option.

One thing I am a bit worried about is that XmlInputFormat reads the sub-XML chunk for each row by rowTag and then parses each chunk as a complete XML document. Done this way, it is difficult to access the namespaces declared at the root element. It may be possible somehow, but I have not come up with a simple, good idea for this yet.
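
One possible direction, sketched here only as an idea and not as anything this library implements: if the user supplies the prefix-to-URI mappings up front, each row chunk could be wrapped in a synthetic root element that declares them before it is parsed. The names `WrapRowDemo`, `wrapWithNamespaces`, and the `wrapper` element are hypothetical, and the sketch assumes the URIs contain no characters that need XML escaping. It also covers prefixed declarations only, not a default `xmlns`.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object WrapRowDemo {
  // Hypothetical helper: wraps one row chunk in a synthetic root element
  // that declares the user-supplied prefixes.
  def wrapWithNamespaces(row: String, namespaces: Map[String, String]): String = {
    val decls = namespaces.map { case (prefix, uri) =>
      " xmlns:" + prefix + "=\"" + uri + "\""
    }.mkString
    "<wrapper" + decls + ">" + row + "</wrapper>"
  }

  // Namespace URI of the first attribute on the first element named `element`
  // that has attributes, read with a namespace-aware parser.
  def attributeNamespace(xml: String, element: String): Option[String] = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, java.lang.Boolean.TRUE)
    val reader = factory.createXMLStreamReader(new StringReader(xml))
    try {
      var found: Option[String] = None
      while (found.isEmpty && reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT &&
            reader.getLocalName == element && reader.getAttributeCount > 0) {
          found = Option(reader.getAttributeNamespace(0))
        }
      }
      found
    } finally reader.close()
  }

  val row = """<Topic r:id=""><catid>1</catid></Topic>"""
  val wrapped = wrapWithNamespaces(row, Map("r" -> "http://www.w3.org/TR/RDF/"))
}
```

Here `WrapRowDemo.attributeNamespace(WrapRowDemo.wrapped, "Topic")` should resolve r:id to the supplied URI. The trade-off is that the declarations come from the user rather than from the document, so they can drift from what the root element actually says.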

@ayat-khairy

ayat-khairy commented Jul 5, 2017

Hello,

I have XML that uses namespaces and looks like the following snippet. Does Spark-XML support parsing this XML? ... Thanks

<PDBx:datablock datablockName="1GPE"
    xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://pdbml.pdb.org/schema/pdbx-v40.xsd pdbx-v40.xsd">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:B_iso_or_equiv>23.83</PDBx:B_iso_or_equiv>
      <PDBx:Cartn_x>28.000</PDBx:Cartn_x>
      <PDBx:Cartn_y>7.940</PDBx:Cartn_y>
      <PDBx:Cartn_z>-19.207</PDBx:Cartn_z>
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:auth_atom_id>N</PDBx:auth_atom_id>
      <PDBx:auth_comp_id>TYR</PDBx:auth_comp_id>
      <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
      <PDBx:group_PDB>ATOM</PDBx:group_PDB>
      <PDBx:label_alt_id xsi:nil="true" />
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
      <PDBx:label_atom_id>N</PDBx:label_atom_id>
      <PDBx:label_comp_id>TYR</PDBx:label_comp_id>
      <PDBx:label_entity_id>1</PDBx:label_entity_id>
      <PDBx:label_seq_id>1</PDBx:label_seq_id>
      <PDBx:occupancy>1.00</PDBx:occupancy>
      <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
      <PDBx:type_symbol>N</PDBx:type_symbol>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
</PDBx:datablock>

@reneschroeder0000

Hi,

we have a problem where attribute values get lost when parsing the XML. For example, when parsing
<element xsi:type="abc" type="xyz"></element>, only one of the type attributes is recognized, so either "abc" or "xyz" is returned depending on the order. Any suggestions on how to fix this behavior?
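
The symptom is consistent with attributes being keyed by local name only, although that is an assumption about the cause and not something confirmed in this thread. A minimal sketch with the JDK's StAX API shows the collision and one way to avoid it by keeping the prefix in the key; `AttributeKeyDemo` and its members are illustrative names, and the xsi declaration is added here so the snippet parses on its own.

```scala
import java.io.StringReader
import javax.xml.namespace.QName
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object AttributeKeyDemo {
  val xml =
    """<element xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="abc" type="xyz"></element>"""

  // Collects the root element's attributes into a map, naming each one with `key`.
  // If two attributes map to the same key, the later one overwrites the earlier.
  def attributes(key: QName => String): Map[String, String] = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, java.lang.Boolean.TRUE)
    val reader = factory.createXMLStreamReader(new StringReader(xml))
    try {
      // Advance to the root element's start tag.
      while (reader.next() != XMLStreamConstants.START_ELEMENT) {}
      (0 until reader.getAttributeCount).map { i =>
        key(reader.getAttributeName(i)) -> reader.getAttributeValue(i)
      }.toMap
    } finally reader.close()
  }

  // Keyed by local name: xsi:type and type both become "type".
  val byLocalName: Map[String, String] = attributes(_.getLocalPart)

  // Keyed by prefix plus local name: the two attributes stay distinct.
  val byQualifiedName: Map[String, String] = attributes { q =>
    Option(q.getPrefix).filter(_.nonEmpty) match {
      case Some(prefix) => prefix + ":" + q.getLocalPart
      case None         => q.getLocalPart
    }
  }
}
```

`byLocalName` ends up with a single entry, while `byQualifiedName` keeps both `xsi:type` and `type`. Keying by prefix is only safe when prefixes are used consistently across documents; keying by namespace URI would be stricter.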
