Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove nonstandard parsing #106

Merged
merged 6 commits into from
Jan 15, 2020

Conversation

krisbukovi
Copy link
Contributor

@krisbukovi krisbukovi commented Jan 8, 2020

This addresses issue #101

@coveralls
Copy link

coveralls commented Jan 8, 2020

Coverage Status

Coverage decreased (-0.2%) to 81.944% when pulling 27b7e85 on krisbukovi:remove_nonstandard_parsing into 1d21087 on adsabs:master.

@krisbukovi krisbukovi self-assigned this Jan 9, 2020
Copy link
Contributor

@marblestation marblestation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in the description you meant to link this PR to #101 and not #104, right? I have added some suggestions and comments, let me know what you think.

Comment on lines 626 to 627
fulltext = u" ".join(map(unicode.strip, map(unicode, elements[0].itertext())))
fulltext = TextCleaner(text=fulltext).run(decode=False, translate=True, normalise=True, trim=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fulltext variable is never used after these two lines, in the previous version of the code that was used to decide that the full text extraction was successful. Either we remove these two lines or we use the outcome with something like:

if fulltext:
    ft_found = True

And we remove the ft_found = True in line 625. But if we do the latter, then when we get legit empty bodies, the code will go ahead and try the next parser. That's what we were trying to avoid, so we could change the condition to:

if isinstance(fulltext, str):
    ft_found = True

Does this sound reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it sounds reasonable. I'm leaning towards removing the two lines unless you say otherwise.

logger.warn('Parsing XML in non-standard way')
parsed_xml = lxml.html.document_fromstring(self.raw_xml.encode('utf-8'))
else:
logger.debug("The parser '{}' failed".format(parser_name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of succeeded or failed, could we say something like this?

"The parser '{}' did not extract any of the following fields '{}'".format(parser_name, ", ".join(META_CONTENT[self.meta_name].keys()))

And:

"The parser '{}' succeeded extracting the following fields '{}'".format(parser_name, ", ".join(content_names))

Where content_names would be an array you can construct in the loop. But then, to be complete, we would need to remove these lines:

if ft_found:
    break

This would give us a quick first diagnostic instead of a generic success/fail message, and hopefully it might help us in our investigations.

Copy link
Contributor Author

@krisbukovi krisbukovi Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it.

Just to note, this will change the logic so that we will loop through all parts of the fulltext instead of stopping when we find the body (which includes an empty body). This makes sense since we are no longer using the body to define success.

If we find we don't need these logs down the line, we could break when element_found = True. This would also be a change in logic.

Copy link
Contributor

@marblestation marblestation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! Let's go ahead with this PR, feel free to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants