Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplified and faster encoding.py & doc building #138

Merged
merged 6 commits into from
Sep 14, 2021
Merged

Conversation

adbar
Copy link

@adbar adbar commented Feb 7, 2020

  • code cleaning: regexes + LookupError unnecessary as fix_charset() returns a default value
  • faster encoding detection with cchardet, made optional in encoding.py and setup.py

@adbar
Copy link
Author

adbar commented Feb 7, 2020

The conflict comes from setup.py, please review it

@adbar
Copy link
Author

adbar commented Feb 19, 2020

  • added bypass of doc_build

@adbar adbar changed the title Simplified and faster encoding.py Simplified and faster encoding.py & doc building Feb 19, 2020
@@ -155,7 +157,10 @@ def _html(self, force=False):
return self.html

def _parse(self, input):
doc, self.encoding = build_doc(input)
if isinstance(input, (_ElementTree, HtmlElement)):
doc = input
Copy link
Owner

@buriy buriy Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option doesn't set self.encoding which was set in another branch.
Users might be relying on it.
So please add
self.endoding = 'utf-8'
to match the expectations in the best way and we'll publish it then.

Copy link
Owner

@buriy buriy Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you have no deepcopy as we have discussed.
Any thoughts on this?
._parse method could be run several times, devastating the document after the first iteration and thus reducing the found subtree quality.
See

and
return shorten_title(self._html(True))

and etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the encoding attribute, for the deepcopy issue I'm not sure what to do, passing already existing trees is experimental anyway

@adbar
Copy link
Author

adbar commented Sep 14, 2021

@buriy Could you please accept or amend the PR? All tests pass and we could keep the deepcopy improvement for later.

@buriy buriy merged commit 1415c30 into buriy:master Sep 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants