-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplified and faster encoding.py & doc building #138
Conversation
adbar
commented
Feb 7, 2020
- code cleaning: regexes + LookupError unnecessary as fix_charset() returns a default value
- faster encoding detection with cchardet, made optional in encoding.py and setup.py
The conflict comes from setup.py, please review it |
|
@@ -155,7 +157,10 @@ def _html(self, force=False): | |||
return self.html | |||
|
|||
def _parse(self, input): | |||
doc, self.encoding = build_doc(input) | |||
if isinstance(input, (_ElementTree, HtmlElement)): | |||
doc = input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option doesn't set self.encoding which was set in another branch.
Users might be relying on it.
So please add
self.endoding = 'utf-8'
to match the expectations in the best way and we'll publish it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you have no deepcopy as we have discussed.
Any thoughts on this?
._parse method could be run several times, devastating the document after the first iteration and thus reducing the found subtree quality.
See
python-readability/readability/readability.py
Line 219 in 4dcd6f9
self._html(True) |
python-readability/readability/readability.py
Line 197 in 4dcd6f9
return shorten_title(self._html(True)) |
and etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the encoding attribute, for the deepcopy
issue I'm not sure what to do, passing already existing trees is experimental anyway
@buriy Could you please accept or amend the PR? All tests pass and we could keep the |