Currently the XPath expressions used to extract fulltext do not take namespaces into account, so they fail to extract the body of an XML document that uses a namespaced element such as <ja:body>.
One way to fix this is to maintain a list of namespaced XPaths as we find them, e.g. //{http://www.elsevier.com/xml/ja/schema}body for the case above; another is to attempt a wildcard match such as //*[local-name()="body"].
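A minimal sketch of the two approaches follows. It uses Python's standard-library `xml.etree.ElementTree` rather than the project's own extractor code, which is an assumption made to keep the example self-contained. The sample document is hypothetical; only the namespace URI comes from the case above. ElementTree does not implement `local-name()`, so the wildcard is shown with its `{*}` syntax (Python 3.8+) and with a manual comparison of local names.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the issue text.
ELSEVIER_JA = "http://www.elsevier.com/xml/ja/schema"

# Hypothetical sample document, for illustration only.
SAMPLE = (
    '<article xmlns:ja="http://www.elsevier.com/xml/ja/schema">'
    "<ja:body><p>Full text goes here.</p></ja:body>"
    "</article>"
)

root = ET.fromstring(SAMPLE)

# 1. A path without a namespace does not match <ja:body>.
plain = root.findall(".//body")

# 2. Namespace-qualified path: the "{uri}name" form from the issue.
qualified = root.findall(".//{%s}body" % ELSEVIER_JA)

# 3. Wildcard namespace: matches "body" in any namespace or in none (Python 3.8+).
wildcard = root.findall(".//{*}body")

# 4. Manual equivalent of a local-name() test: compare the part after "}".
by_local_name = [el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "body"]

body_text = "".join(qualified[0].itertext())
```

The qualified path only matches namespaces that are already known, while the wildcard forms also match any `body` element from an unrelated namespace, so the choice is a trade-off between precision and coverage.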
marblestation added a commit to marblestation/ADSfulltext that referenced this issue on Sep 1, 2019:
- Removed example using BeautifulSoup4 directly since this strategy does
not allow the use of XPath (contrary to using BeautifulSoup4 via lxml)
- Added support for BeautifulSoup4 parsers used via lxml:
html.parser, html5lib, lxml-html and lxml-xml
- Added support for using direct lxml parsers (without BeautifulSoup4):
lxml-html and lxml-xml
- Remove special elements using regex as needed by the multiple supported
parsers: CDATA, comments and processing instructions
- Save and restore body tags in content when using parsers that remove them
and wrap the full content in new <html><body></body></html> tags
- Remove namespaces for parsers that expand the namespace prefixes into
namespaces in tags and attributes
- Remove namespace prefixes for parsers that do not expand the namespace
prefixes into namespaces in tags and attributes (this should close adsabs#30)
- When searching for attributes, if the META_CONTENT was specified with
a namespace_prefix, remove it if it was not found
- Iterate through all the XML parsers until one succeeds in extracting some
fulltext; the iteration is done in order of preferred parser
- If all parsers fail with the XML content, use a basic HTML one as last
resort
- Simplified StandardElsevierExtractorXML since now StandardExtractorXML
is more robust supporting multiple parsers
- Modified unit tests to test the extraction for all the supported parsers
- Use method extract_string instead of xpath in unit tests for XML documents,
given that that method does some text cleaning
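
Two of the ideas in this commit message, removing namespaces from tags and attributes and trying parsers in order until one yields fulltext, can be sketched as below. This is not the code from the commit: it uses the standard-library `xml.etree.ElementTree` instead of lxml or BeautifulSoup4, and the names `strip_namespaces`, `body_text_strict`, `body_text_no_ns`, `extract_first` and `PARSERS`, as well as the sample document, are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Drop the '{uri}' part from every tag and attribute name, in place."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
        for name in list(el.attrib):
            if name.startswith("{"):
                el.attrib[name.split("}", 1)[1]] = el.attrib.pop(name)
    return root


def body_text_strict(content):
    """Hypothetical parser 1: plain path, no namespace handling."""
    body = ET.fromstring(content).find(".//body")
    return "".join(body.itertext()).strip() if body is not None else ""


def body_text_no_ns(content):
    """Hypothetical parser 2: strip namespaces, then use the same path."""
    body = strip_namespaces(ET.fromstring(content)).find(".//body")
    return "".join(body.itertext()).strip() if body is not None else ""


def extract_first(content, parsers):
    """Try each (name, function) pair in order; return the first non-empty result."""
    for name, parse in parsers:
        try:
            text = parse(content)
        except ET.ParseError:
            continue  # this parser could not read the document; try the next one
        if text:
            return name, text
    return None, ""


# Order expresses preference: the strictest parser is tried first.
PARSERS = [("strict", body_text_strict), ("no-namespaces", body_text_no_ns)]

# Hypothetical sample document, for illustration only.
SAMPLE = (
    '<article xmlns:ja="http://www.elsevier.com/xml/ja/schema">'
    "<ja:body><p>Full text goes here.</p></ja:body>"
    "</article>"
)

used_parser, fulltext = extract_first(SAMPLE, PARSERS)
```

Here the strict parser finds no `body` element and returns an empty string, so the loop falls through to the namespace-stripping parser. Stripping namespaces keeps the extraction paths simple, at the cost of no longer distinguishing elements that share a local name across namespaces.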