-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove nonstandard parsing #106
Remove nonstandard parsing #106
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adsft/extraction.py
Outdated
fulltext = u" ".join(map(unicode.strip, map(unicode, elements[0].itertext()))) | ||
fulltext = TextCleaner(text=fulltext).run(decode=False, translate=True, normalise=True, trim=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fulltext
variable is never used after these two lines, in the previous version of the code that was used to decide that the full text extraction was successful. Either we remove these two lines or we use the outcome with something like:
if fulltext:
ft_found = True
And we remove the ft_found = True
in line 625. But if we do the latter, then when we get legit empty bodies, the code will go ahead and try the next parser. That's what we were trying to avoid, so we could change the condition to:
if isinstance(fulltext, str):
ft_found = True
Does this sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it sounds reasonable. I'm leaning towards removing the two lines unless you say otherwise.
adsft/extraction.py
Outdated
logger.warn('Parsing XML in non-standard way') | ||
parsed_xml = lxml.html.document_fromstring(self.raw_xml.encode('utf-8')) | ||
else: | ||
logger.debug("The parser '{}' failed".format(parser_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of succeeded or failed, could we say something like this?
"The parser '{}' did not extract any of the following fields '{}'".format(parser_name, ", ".join(META_CONTENT[self.meta_name].keys()))
And:
"The parser '{}' succeeded extracting the following fields '{}'".format(parser_name, ", ".join(content_names))
Where content_names
would be an array you can construct in the loop. But then, to be complete, we would need to remove these lines:
if ft_found:
break
This would give us a quick first diagnostic instead of a generic success/fail message, and hopefully it might help us in our investigations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add it.
Just to note, this will change the logic so that we will loop through all parts of the fulltext instead of stopping when we find the body (which includes an empty body). This makes sense since we are no longer using the body to define success.
If we find we don't need these logs down the line, we could break when element_found = True
. This would also be a change in logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good! Let's go ahead with this PR, feel free to merge.
This addresses issue #101