Library is redirecting stderr to /dev/null upon every call #31

Closed
dmoklaf opened this issue Nov 21, 2020 · 10 comments

Comments

@dmoklaf
Contributor

dmoklaf commented Nov 21, 2020

If the readability fallback is activated, the Trafilatura library redirects stderr to /dev/null on every call:

with open(os.devnull, 'w') as devnull:

In programs that involve other libraries, this causes a host of side effects. For example, generating a chart with seaborn imports IPython (a dependency of seaborn), which checks stdin, stdout and stderr upon initialization and crashes because stderr is /dev/null. I see other side effects in other libraries as well, including disappearing logs (e.g. when logging settings are modified after calls to Trafilatura).

This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability suggests it no longer does that and only emits proper log records.

Consequently, this redirect may be removed (to be tested).
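
The "disappearing logs" side effect can be reproduced in isolation. The sketch below is not Trafilatura's code; it assumes the redirect is done with `contextlib.redirect_stderr` around the opened `os.devnull` handle, which the snippet above does not show. Any `logging.StreamHandler` created while the redirect is active binds to the devnull handle and keeps it after the block ends, by which point the handle is already closed.

```python
import contextlib
import logging
import os
import sys


def handler_created_during_redirect():
    """Create a logging handler while stderr points at os.devnull."""
    with open(os.devnull, "w") as devnull:
        # Assumed redirect mechanism; the issue only shows the open() line.
        with contextlib.redirect_stderr(devnull):
            # StreamHandler() with no argument binds to the current sys.stderr.
            handler = logging.StreamHandler()
    return handler


handler = handler_created_during_redirect()
stream = handler.stream

# The handler still points at the devnull handle, which is closed by now,
# so anything logged through it never reaches the real stderr.
print("bound to devnull:", stream.name == os.devnull)
print("stream closed:", stream.closed)
print("differs from current sys.stderr:", stream is not sys.stderr)
```

The same mechanism applies to any code that caches `sys.stderr` during the call, which is consistent with the side effects described above.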

@dmoklaf
Contributor Author

dmoklaf commented Nov 22, 2020

It appears that a "MUFFLE_FLAG" allows this behavior to be controlled externally, so I used that. However, since readability seems clean now (although I didn't test that), it may be better to remove all that code.

@dmoklaf
Contributor Author

dmoklaf commented Nov 22, 2020

I tested with the MUFFLE_FLAG active on more than 100,000 documents (very diverse, from around 10,000 websites) and readability printed nothing to stderr. It should therefore be possible to remove all this code. I can contribute a PR if that's relevant.
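
For anyone who wants to repeat this kind of check, here is a minimal sketch of one way to do it (not necessarily how the test above was run). It captures whatever a call writes to Python's `sys.stderr` and returns it alongside the result. The helper `captured_stderr` and the stand-in functions `noisy` and `quiet` are hypothetical names used for illustration only. Note that this only sees writes going through `sys.stderr`; output written directly to file descriptor 2 by C code would not be captured.

```python
import contextlib
import io
import sys


def captured_stderr(func, *args, **kwargs):
    """Call func and return (result, text written to sys.stderr)."""
    buffer = io.StringIO()
    # The redirect is scoped to this call and undone on exit.
    with contextlib.redirect_stderr(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy(text):
    # Hypothetical stand-in for a library call that prints to stderr.
    print("warning: something", file=sys.stderr)
    return text.upper()


def quiet(text):
    # Hypothetical stand-in for a library call that stays silent.
    return text.upper()


noisy_result, noisy_output = captured_stderr(noisy, "a")
quiet_result, quiet_output = captured_stderr(quiet, "a")
print(repr(noisy_output), repr(quiet_output))
```

Running the extraction over a corpus and collecting every non-empty capture would give a list of documents that still trigger stderr output.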

@adbar
Owner

adbar commented Nov 23, 2020

I think you're right: the problem I intended to solve with these lines no longer exists. I removed this behavior in effbccf.

Since we're working on the handling of external modules: do you know how to modify the internal subclass LXMLDocument so that the output doesn't have to be converted from a string back to an LXML tree? It could save processing time, but I'm not sure how to do it:

return html.fromstring(doc.summary(html_partial=True), parser=HTML_PARSER)

@dmoklaf
Contributor Author

dmoklaf commented Nov 23, 2020

That's a good point! Reading their code, it seems that readability.Document.get_clean_html is exactly the method to override. The comment there explicitly mentions your use case.

@adbar
Owner

adbar commented Nov 23, 2020

Yes, that seems to be the right way to go, but I'm not sure how to do it.

@dmoklaf
Contributor Author

dmoklaf commented Nov 23, 2020

Reading the code, I guess this override in your LXMLDocument class should do the trick (not tested):

def get_clean_html(self):
    return self.html

@adbar
Owner

adbar commented Nov 23, 2020

It doesn't work: calling doc.summary(html_partial=True) on such a modified LXMLDocument class somehow returns empty or unusable trees.

@dmoklaf
Contributor Author

dmoklaf commented Nov 24, 2020

Hmm, I don't know this part at all. I would study the current get_clean_html code (it calls two functions) to understand what they do and how to get something usable out of it.

@adbar
Owner

adbar commented Nov 24, 2020

Thanks, I made a separate issue (#37) for this and will now close this one.

@adbar adbar closed this as completed Nov 24, 2020
@dmoklaf
Contributor Author

dmoklaf commented Nov 24, 2020

Thanks!
