Improve namespace handling in XML extraction #30

Closed
aaccomazzi opened this issue Dec 1, 2016 · 0 comments · Fixed by #88

Comments

@aaccomazzi
Member

Currently the XPath expressions used to extract fulltext do not take namespaces into account, so they fail to extract the body of an XML document that uses a namespaced element such as <ja:body>.

One way to fix this is to build up a list of namespaced XPaths as we find them, e.g. for the case above //{http://www.elsevier.com/xml/ja/schema}body, or to attempt a wildcard match such as //*[local-name()="body"].
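
The two options above can be compared side by side. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree rather than the project's own extraction code, the sample document is made up for illustration (only the namespace URI comes from this issue), and the `{*}` wildcard (Python 3.8+) stands in for the `local-name()` XPath, which ElementTree does not support but which a full XPath engine such as lxml's `xpath()` would accept.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the issue; everything else in SAMPLE is hypothetical.
JA_NS = "http://www.elsevier.com/xml/ja/schema"

SAMPLE = """<article xmlns:ja="http://www.elsevier.com/xml/ja/schema">
  <ja:head><ja:title>Example</ja:title></ja:head>
  <ja:body>Full text goes here.</ja:body>
</article>"""

root = ET.fromstring(SAMPLE)

# A plain, un-namespaced path does not match <ja:body> at all.
plain = root.find(".//body")

# Option 1: name the namespace explicitly in Clark notation ({uri}localname).
qualified = root.find(".//{%s}body" % JA_NS)

# Option 2: match "body" in any namespace (or none) with the {*} wildcard.
wildcard = root.find(".//{*}body")

print("plain path found:", plain is not None)
print("qualified path text:", qualified.text)
print("wildcard path text:", wildcard.text)
```

The explicit form is stricter but needs one entry per publisher namespace; the wildcard form needs no list but will also match a body element from any unrelated namespace.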

marblestation added a commit to marblestation/ADSfulltext that referenced this issue Sep 1, 2019
- Removed example using BeautifulSoup4 directly since this strategy does
  not allow the use of XPath (contrary to using BeautifulSoup4 via lxml)
- Added support for BeautifulSoup4 parsers used via lxml:
  html.parser, html5lib, lxml-html and lxml-xml
- Added support for using direct lxml parsers (without BeautifulSoup4):
  lxml-html and lxml-xml
- Remove special elements using regex as needed by the multiple supported
  parsers: CDATA, comments and processing instructions
- Save and restore body tags in content when using parsers that remove them
  and wrap the full content in new <html><body></body></html> tags
- Remove namespaces for parsers that expand the namespace prefixes into
  namespaces in tags and attributes
- Remove namespace prefixes for parsers that do not expand the namespace
  prefixes into namespaces in tags and attributes (this should close adsabs#30)
- When searching for attributes, if the META_CONTENT was specified with
  a namespace_prefix, remove it if it was not found
- Iterate through all the XML parsers, in order of preference, until one
  succeeds in extracting some fulltext
- If all parsers fail with the XML content, use a basic HTML one as a last
  resort
- Simplified StandardElsevierExtractorXML since now StandardExtractorXML
  is more robust supporting multiple parsers
- Modified unit tests to test the extraction for all the supported parsers
- Use method extract_string instead of xpath in unit tests for XML documents,
  given that that method does some text cleaning
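
The namespace-stripping and parser-fallback steps in the commit message above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code from the commit: it uses the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup4, and the names `strip_namespaces`, `parse_strict`, `parse_dropping_prefixes` and `first_successful` are hypothetical helpers invented for this example.

```python
import re
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Drop the '{uri}' part from every tag and attribute name, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
        for name in list(elem.attrib):
            if "}" in name:
                elem.attrib[name.split("}", 1)[1]] = elem.attrib.pop(name)
    return root


def parse_strict(text):
    # Hypothetical preferred parser: expands declared prefixes into
    # '{uri}name' tags, which are then stripped back to local names.
    return strip_namespaces(ET.fromstring(text))


def parse_dropping_prefixes(text):
    # Hypothetical fallback: remove 'prefix:' from tag names with a regex,
    # so documents with undeclared prefixes can still be parsed.
    return ET.fromstring(re.sub(r"(</?)[A-Za-z_][\w.-]*:", r"\1", text))


def first_successful(text, parsers, path=".//body"):
    """Try each parser in order; return the first non-empty extracted text."""
    for parse in parsers:
        try:
            node = parse(text).find(path)
        except ET.ParseError:
            continue  # this parser cannot handle the input; try the next one
        if node is not None:
            body = "".join(node.itertext()).strip()
            if body:
                return body
    return None


PARSERS = [parse_strict, parse_dropping_prefixes]

declared = '<doc xmlns:ja="http://www.elsevier.com/xml/ja/schema"><ja:body>A</ja:body></doc>'
undeclared = "<doc><ja:body>Hello <b>world</b></ja:body></doc>"

print(first_successful(declared, PARSERS))
print(first_successful(undeclared, PARSERS))
```

Once tags are reduced to local names, a single un-namespaced path such as `.//body` works for every parser in the list. Note that stripping can make two attributes from different namespaces collide on the same local name; this sketch simply lets the later one win.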