Improve namespace handling in XML extraction #30

aaccomazzi · 2016-12-01T18:53:13Z

Currently, the XPath expressions used to extract fulltext do not take namespaces into account, so they fail to extract the body of an XML document that uses a namespaced element such as <ja:body>.

One way to fix this is to build up a list of namespaced XPaths as we find them, e.g. for the case above //{http://www.elsevier.com/xml/ja/schema}body, or to attempt a wildcard match such as //*[local-name()="body"].
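A minimal sketch of the two options, using only the Python standard library so it runs anywhere. This is an illustration, not the project's extractor: the sample document and the `body_text` helper are hypothetical, and only the namespace URI is taken from the issue. `xml.etree.ElementTree` does not support `local-name()`, so the wildcard case uses its `{*}` form (Python 3.8+) as the closest equivalent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the namespace URI comes from the issue text.
SAMPLE = (
    '<article xmlns:ja="http://www.elsevier.com/xml/ja/schema">'
    "<ja:body>Full text goes here.</ja:body>"
    "</article>"
)

JA_NS = "http://www.elsevier.com/xml/ja/schema"


def body_text(root, paths):
    """Return the text of the first element matched by any path, else None."""
    for path in paths:
        element = root.find(path)
        if element is not None:
            return "".join(element.itertext()).strip()
    return None


root = ET.fromstring(SAMPLE)

# A namespace-unaware path misses the namespaced element.
plain = body_text(root, [".//body"])

# Option 1: explicit namespaced path in Clark notation ({uri}localname).
explicit = body_text(root, [".//{%s}body" % JA_NS])

# Option 2: wildcard namespace, matching "body" in any namespace or in none.
wildcard = body_text(root, [".//{*}body"])
```

The explicit list is precise but has to grow with every new publisher schema; the wildcard needs no maintenance but will also match an unrelated `body` element from another vocabulary.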

- Removed the example using BeautifulSoup4 directly, since this strategy does not allow the use of XPath (contrary to using BeautifulSoup4 via lxml)
- Added support for BeautifulSoup4 parsers used via lxml: html.parser, html5lib, lxml-html and lxml-xml
- Added support for using lxml parsers directly (without BeautifulSoup4): lxml-html and lxml-xml
- Remove special elements using regex, as needed by the multiple supported parsers: CDATA, comments and processing instructions
- Save and restore body tags in the content when using parsers that remove them and wrap the full content in new <html><body></body></html> tags
- Remove namespaces for parsers that expand the namespace prefixes into namespaces in tags and attributes
- Remove namespace prefixes for parsers that do not expand the namespace prefixes into namespaces in tags and attributes (this should close adsabs#30)
- When searching for attributes, if the META_CONTENT was specified with a namespace_prefix, remove the prefix if the attribute was not found
- Iterate through all the XML parsers, in order of preference, until one succeeds in extracting some fulltext
- If all parsers fail on the XML content, use a basic HTML parser as a last resort
- Simplified StandardElsevierExtractorXML, since StandardExtractorXML is now more robust and supports multiple parsers
- Modified unit tests to test the extraction for all the supported parsers
- Use the method extract_string instead of xpath in unit tests for XML documents, given that this method does some text cleaning
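The namespace-removal step in the commit message above can be pictured roughly as follows. This is a sketch under assumptions, not the code merged in #88: it uses the standard-library `xml.etree.ElementTree` (a parser that expands prefixes into `{uri}name` form), and the `strip_namespaces` helper and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Drop the {uri} part from every tag and attribute name, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
        # Copy the keys first, because the mapping is modified while looping.
        for name in list(element.attrib):
            if name.startswith("{"):
                element.attrib[name.split("}", 1)[1]] = element.attrib.pop(name)
    return root


# Hypothetical sample document with a namespaced element and attribute.
SAMPLE = (
    '<article xmlns:ja="http://www.elsevier.com/xml/ja/schema">'
    '<ja:body ja:id="b1">Full text goes here.</ja:body>'
    "</article>"
)

root = strip_namespaces(ET.fromstring(SAMPLE))

# After stripping, a plain namespace-unaware path finds the element.
body = root.find(".//body")
```

Stripping namespaces once up front lets the existing namespace-unaware XPath list keep working unchanged, at the cost of losing the distinction between same-named elements from different vocabularies.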

marblestation mentioned this issue Sep 3, 2019

Support lxml 4.4.1 and added parsers #88

Merged

krisbukovi closed this as completed in #88 Sep 5, 2019
