Simplified and faster encoding.py & doc building #138

adbar · 2020-02-07T13:51:58Z

code cleaning: regexes + LookupError unnecessary as fix_charset() returns a default value
faster encoding detection with cchardet, made optional in encoding.py and setup.py
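
The optional dependency described above could be wired up roughly as follows. This is a minimal sketch, not the code from this PR: the helper name `detect_encoding`, the `chardet_impl` alias, and the behaviour when neither library is installed are assumptions made for illustration. Both `cchardet.detect()` and `chardet.detect()` accept bytes and return a dict with an `encoding` key.

```python
# Prefer the faster C implementation when installed; fall back to pure-Python chardet.
try:
    import cchardet as chardet_impl
except ImportError:
    try:
        import chardet as chardet_impl
    except ImportError:
        chardet_impl = None  # assumption for this sketch: neither library is available


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: guess the encoding of `data`, or return `default`."""
    if chardet_impl is None or not data:
        return default
    # Both detectors may report None when they cannot decide.
    guess = chardet_impl.detect(data).get("encoding")
    return guess or default
```

Because the import is wrapped in `try`/`except ImportError`, cchardet only needs to be listed as an optional extra in setup.py rather than a hard requirement.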

adbar · 2020-02-07T13:54:20Z

The conflict comes from setup.py, please review it

readability/encoding.py

adbar · 2020-02-19T16:31:49Z

added bypass of doc_build

buriy · 2020-02-19T17:10:35Z

readability/readability.py

@@ -155,7 +157,10 @@ def _html(self, force=False):
        return self.html

    def _parse(self, input):
-        doc, self.encoding = build_doc(input)
+        if isinstance(input, (_ElementTree, HtmlElement)):
+            doc = input


This branch doesn't set self.encoding, which the other branch does.
Users might be relying on it.
So please add
self.encoding = 'utf-8'
to best match expectations, and then we'll publish it.

Also, there is no deepcopy, as we discussed.
Any thoughts on this?
The ._parse method can run several times. Without a copy, the first run mutates the passed-in document, so later runs start from a damaged tree and the quality of the subtree found goes down.
See

python-readability/readability/readability.py

Line 219 in 4dcd6f9

self._html(True)

and

python-readability/readability/readability.py

Line 197 in 4dcd6f9

return shorten_title(self._html(True))

and so on.

adbar · 2020-02-19

I added the encoding attribute. As for the deepcopy issue, I'm not sure what to do; passing already existing trees is experimental anyway.
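
The deepcopy concern raised in the review can be illustrated with a small sketch. It is not the library's code: `Doc` is a hypothetical, stripped-down stand-in for the real document class, and the standard library's `xml.etree.ElementTree` is used in place of lxml so the example runs without extra dependencies. The idea is that `_parse` copies a pre-parsed tree before handing it on, so a forced re-parse always starts from the untouched original.

```python
import copy
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch


class Doc:
    """Hypothetical, stripped-down stand-in for readability's document class."""

    def __init__(self, input):
        self.input = input
        self.encoding = None
        self.html = None

    def _parse(self, input):
        if isinstance(input, ET.Element):
            # Work on a copy so repeated parses start from the untouched tree.
            doc = copy.deepcopy(input)
            self.encoding = "utf-8"  # default suggested in the review
        else:
            doc = ET.fromstring(input)
            self.encoding = "utf-8"  # assumption: the real code detects this
        return doc

    def _html(self, force=False):
        if force or self.html is None:
            self.html = self._parse(self.input)
        return self.html


tree = ET.fromstring("<html><body><p>keep</p><div>ads</div></body></html>")
doc = Doc(tree)
working = doc._html(True)
body = working.find("body")
body.remove(body.find("div"))  # destructive clean-up on the working copy
fresh = doc._html(True)        # forced re-parse copies the original again
```

The trade-off is one extra copy of the tree per forced parse, which may matter for large documents.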

adbar · 2021-09-14T12:33:02Z

@buriy Could you please accept or amend the PR? All tests pass and we could keep the deepcopy improvement for later.

adbar added 2 commits February 7, 2020 14:23

cleaned code

6872c94

add cchardet for speed

c2916a1

Merge branch 'master' into master

769f3ef

buriy reviewed Feb 7, 2020

readability/encoding.py (resolved)

adbar added 2 commits February 19, 2020 17:28

bypass doc building if input is already a parsed object

6429899

Merge branch 'master' of https://github.com/adbar/python-readability

4dcd6f9

adbar changed the title ~~Simplified and faster encoding.py~~ Simplified and faster encoding.py & doc building Feb 19, 2020

adbar mentioned this pull request Feb 19, 2020

Pass LXML object straight to readability? #140

Open

buriy reviewed Feb 19, 2020

add encoding attribute to parsed tree

e121fae

buriy merged commit 1415c30 into buriy:master Sep 14, 2021
