Library is redirecting stderr to /dev/null upon every call #31
Comments
It appears that a "MUFFLE_FLAG" allows this behavior to be controlled externally, so I used that. However, since readability seems clean now (though I haven't tested that), it may be better to remove all that code. |
I tested with the MUFFLE_FLAG active for more than 100,000 documents (very diverse, from around 10,000 websites) and readability printed out nothing on stderr. Therefore it might be possible to remove all this code. I can contribute a PR if that's relevant |
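A check along these lines can be scripted with the standard library. This is only a rough sketch, not the script used for the test above: `quiet_extractor` and `noisy_extractor` are hypothetical stand-ins for the readability-based extraction, and `captured_stderr` is a helper name made up for illustration.

```python
import contextlib
import io
import sys


def captured_stderr(func, *args, **kwargs):
    """Call func and return whatever it wrote to sys.stderr as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stderr(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def quiet_extractor(html):
    # Hypothetical stand-in for an extraction call that stays silent.
    return html.strip()


def noisy_extractor(html):
    # Hypothetical stand-in that misbehaves by printing to stderr.
    print("warning: something odd", file=sys.stderr)
    return html.strip()


documents = ["<p>one</p>", "<p>two</p>"]

# Collect every document for which the call wrote anything to stderr.
quiet_leaks = [d for d in documents if captured_stderr(quiet_extractor, d)]
noisy_leaks = [d for d in documents if captured_stderr(noisy_extractor, d)]
```

Note that `contextlib.redirect_stderr` only sees writes that go through Python's `sys.stderr`; output written directly to file descriptor 2 by C code would not be captured this way.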
I think you're right: the problem I intended to solve with these lines no longer exists, so I removed this behavior in effbccf. Since we're working on the handling of external modules: do you know how to modify the internal subclass?
trafilatura/trafilatura/external.py
Line 56 in 3b4cb19
|
That's a good point! Reading their code, it seems that readability.Document.get_clean_html is exactly the method to override. The comment there explicitly mentions your use case. |
Yes, it seems the right way to go, but I'm not sure how to do it. |
Reading the code, I guess this override in your LXMLDocument class should do the trick (not tested):
|
It doesn't work, calling |
Hmmm, I don't know this part at all. I would study the current get_clean_html code (it calls 2 functions) to understand what they do and get something usable out of that. |
Thanks, I made a separate issue (#37) for this and will now close this one. |
Thanks! |
If the readability fallback is activated, the Trafilatura library redirects stderr to /dev/null upon every call:
trafilatura/trafilatura/external.py
Line 63 in a56fb3e
In programs that involve other libraries, this causes a host of side effects. For example, generating a chart with seaborn imports IPython (a dependency of seaborn), which checks stdin, stdout and stderr upon initialization and crashes because stderr is /dev/null. I have seen other side effects in other libraries as well, including disappearing logs (e.g. when logging settings are modified after calls to Trafilatura).
This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability seems to indicate that it no longer does that and only emits proper logs.
Consequently, this redirect may be removed (to be tested).
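If some muffling still turned out to be needed, one option would be to swap stderr only for the duration of the call and restore it afterwards, instead of leaving it pointed at /dev/null. A minimal sketch with the standard library's `contextlib.redirect_stderr`; `noisy_call` and `call_muffled` are hypothetical names used for illustration and are not part of Trafilatura or readability:

```python
import contextlib
import os
import sys


def noisy_call():
    # Hypothetical stand-in for a library call that writes to stderr.
    print("unwanted message", file=sys.stderr)
    return "result"


def call_muffled(func, *args, **kwargs):
    """Run func with sys.stderr pointed at os.devnull, then restore it."""
    with open(os.devnull, "w") as devnull:
        with contextlib.redirect_stderr(devnull):
            return func(*args, **kwargs)


stderr_before = sys.stderr
value = call_muffled(noisy_call)

# After the call, sys.stderr is the original object again,
# so code that inspects it later sees the real stream.
stderr_restored = sys.stderr is stderr_before
```

This only affects writes made through Python's `sys.stderr` while the context manager is active. Since it temporarily replaces a process-wide object, it is not a good fit for code running in several threads at once.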