Remove nonstandard parsing #106

krisbukovi · 2020-01-08T20:41:33Z

This addresses issue #101


coveralls · 2020-01-08T20:44:21Z

Coverage decreased (-0.2%) to 81.944% when pulling 27b7e85 on krisbukovi:remove_nonstandard_parsing into 1d21087 on adsabs:master.

marblestation

I think that in the description you meant to link this PR to #101 and not #104, right? I have added some suggestions and comments, let me know what you think.

marblestation · 2020-01-14T08:17:46Z

adsft/extraction.py

+                        fulltext = u" ".join(map(unicode.strip, map(unicode, elements[0].itertext())))
+                        fulltext = TextCleaner(text=fulltext).run(decode=False, translate=True, normalise=True, trim=True)


The fulltext variable is never used after these two lines. In the previous version of the code, it was used to decide whether the full text extraction had succeeded. Either we remove these two lines or we use the outcome with something like:

if fulltext: ft_found = True

And we remove the ft_found = True in line 625. But if we do the latter, then when we get a legitimately empty body, the code will go ahead and try the next parser. That's what we were trying to avoid, so we could change the condition to:

if isinstance(fulltext, str): ft_found = True

Does this sound reasonable?
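The difference between the two conditions can be sketched as follows. This is a minimal illustration, not the code in adsft/extraction.py: the helper names `found_by_truthiness` and `found_by_type` are hypothetical, and the sketch uses Python 3's `str`. The code under review is Python 2 (it calls `unicode`), so if the cleaned text is a `unicode` object it would not pass `isinstance(fulltext, str)` there, and the equivalent check would likely be against `basestring`.

```python
def found_by_truthiness(fulltext):
    # Hypothetical helper: mirrors `if fulltext: ft_found = True`.
    # An empty string is falsy, so a legitimately empty body counts as "not found".
    return bool(fulltext)


def found_by_type(fulltext):
    # Hypothetical helper: mirrors `if isinstance(fulltext, str): ft_found = True`.
    # Any string, including an empty one, counts as "found".
    return isinstance(fulltext, str)


# Compare both checks on a non-empty body, an empty body, and a missing body.
for candidate in ("some body text", "", None):
    print(repr(candidate), found_by_truthiness(candidate), found_by_type(candidate))
```

An empty string fails the truthiness check but passes the type check, so only the second form stops the code from falling through to the next parser on an empty body.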

krisbukovi

Yes, it sounds reasonable. I'm leaning towards removing the two lines unless you say otherwise.

marblestation · 2020-01-14T08:29:09Z

adsft/extraction.py

-            logger.warn('Parsing XML in non-standard way')
-            parsed_xml = lxml.html.document_fromstring(self.raw_xml.encode('utf-8'))
+            else:
+                logger.debug("The parser '{}' failed".format(parser_name))


Instead of succeeded or failed, could we say something like this?

"The parser '{}' did not extract any of the following fields '{}'".format(parser_name, ", ".join(META_CONTENT[self.meta_name].keys()))

And:

"The parser '{}' succeeded extracting the following fields '{}'".format(parser_name, ", ".join(content_names))

Where content_names would be an array you can construct in the loop. But then, to be complete, we would need to remove these lines:

if ft_found: break

This would give us a quick first diagnostic instead of a generic success/fail message, and hopefully it might help us in our investigations.

krisbukovi

I'll add it.

Just to note, this will change the logic so that we will loop through all parts of the fulltext instead of stopping when we find the body (which includes an empty body). This makes sense since we are no longer using the body to define success.

If we find we don't need these logs down the line, we could break when element_found = True. This would also be a change in logic.
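The logic change discussed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not the merged code: `FIELDS` is a hypothetical stand-in for `META_CONTENT[self.meta_name]`, `extract_fields` and the parser name `"example-parser"` are made up, and a plain dict stands in for the parsed XML document.

```python
import logging

logger = logging.getLogger("extraction_sketch")

# Hypothetical stand-in for META_CONTENT[self.meta_name]:
# field name -> keys to look for in the parsed document.
FIELDS = {
    "fulltext": ["body"],
    "acknowledgements": ["ack"],
    "dataset": ["named-content"],
}


def extract_fields(parser_name, parsed_doc, fields=FIELDS):
    """Check every field for one parser, with no early break on the body."""
    content_names = []
    for content_name, keys in fields.items():
        # A field counts as found when one of its keys is present,
        # even if the extracted text is empty.
        if any(key in parsed_doc for key in keys):
            content_names.append(content_name)

    if content_names:
        logger.debug(
            "The parser '{}' succeeded extracting the following fields '{}'".format(
                parser_name, ", ".join(content_names)
            )
        )
    else:
        logger.debug(
            "The parser '{}' did not extract any of the following fields '{}'".format(
                parser_name, ", ".join(fields.keys())
            )
        )
    return content_names


# An empty body still counts as extracted, and the loop goes on to later fields.
found = extract_fields("example-parser", {"body": "", "ack": "Thanks to the referee."})
```

Because the loop no longer stops at the body, the returned list doubles as the success signal: a non-empty list means the parser extracted at least one field.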

marblestation

Good! Let's go ahead with this PR, feel free to merge.

krisbukovi added 5 commits January 8, 2020 14:13

remove non-standard backup parsing

6b3584d

remove non-standard backup parsing

46467e8

align with recent merge

c070c91

remove non-standard backup parsing

e157cf9

Merge branch 'remove_nonstandard_parsing' of https://github.com/krisb…

330d213

…ukovi/ADSfulltext into remove_nonstandard_parsing

krisbukovi requested a review from marblestation January 9, 2020 15:05

krisbukovi self-assigned this Jan 9, 2020

marblestation requested changes Jan 14, 2020


modify logs, remove fulltext var from success check

27b7e85

krisbukovi requested a review from marblestation January 14, 2020 17:35

marblestation approved these changes Jan 15, 2020


krisbukovi merged commit b602c2f into adsabs:master Jan 15, 2020

krisbukovi mentioned this pull request Mar 23, 2020

Some parsers are only extracting the acknowledgements from XML files #123

Open
