Fix encoding of UTF-8 domain names #2628

timobrembeck · 2024-01-30T22:37:20Z

Short description

This is a somewhat ugly workaround, but I doubt that the underlying problem will be fixed in lxml or libxml2 anytime soon.

Proposed changes

Use unicode in lxml's tostring() method
Fix urlencoded domain names
No longer encode html entities in TinyMCE
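
To illustrate the first item, here is a minimal sketch of what serializing to unicode changes. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without dependencies; lxml's tostring() accepts the same encoding="unicode" argument, but this is not the code from this pull request, and the example URL is made up.

```python
import xml.etree.ElementTree as ET

# Made-up link with a non-ASCII domain name.
link = ET.Element("a", href="https://müller.example/")
link.text = "Müller"

# The default serialization targets ASCII bytes, so non-ASCII
# characters are written as numeric character references.
as_ascii = ET.tostring(link)

# encoding="unicode" returns a str and keeps the characters as they are.
as_text = ET.tostring(link, encoding="unicode")

print(as_ascii)
print(as_text)
```

The first call returns bytes containing character references, the second a str containing the original characters.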

Side effects

The page content now contains UTF-8 characters instead of HTML entities, but I think most modern browsers should be able to display them without problems.

Resolved issues

Fixes: #2274


codeclimate · 2024-01-30T22:37:43Z

Code Climate has analyzed commit 01fef50 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 78.9% (50% is the threshold).

This pull request will bring the total coverage in the repository to 81.8% (-0.2% change).


david-venhoff

Thanks, looks good to me!
I am a little worried that the regex could produce false positives, but even if that happens it should be fine, since it only removes the encoding of Unicode characters.
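
A rough sketch of the kind of regex-based fix discussed here, assuming the problem shows up as percent-encoded bytes in the host part of an href. The names HREF_HOST and fix_domain_encoding are hypothetical and the pattern is a simplification; the actual code lives in integreat_cms/cms/utils/linkcheck_utils.py and may differ.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: scheme plus host of an http(s) URL in an href.
# The host group stops at the first "/", "?", "#" or closing quote.
HREF_HOST = re.compile(r'(href="https?://)([^/"?#]+)')


def fix_domain_encoding(html: str) -> str:
    """Decode percent-escapes in the host part of every matched href."""
    return HREF_HOST.sub(lambda m: m.group(1) + unquote(m.group(2)), html)


broken = '<a href="https://m%C3%BCller.example/a%20b">Müller</a>'
fixed = fix_domain_encoding(broken)
print(fixed)

# A host without percent-escapes passes through unchanged.
plain = '<a href="https://example.org/x%20y">Example</a>'
print(fix_domain_encoding(plain) == plain)
```

In this sketch the path keeps its escapes because only the host group is decoded, and a match on a host that needs no fixing is a no-op, which is in line with the point that false positives should be harmless.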

integreat_cms/cms/utils/linkcheck_utils.py

Co-authored-by: David Venhoff <david.venhoff@tuerantuer.org>

MizukiTemma

🎉 Cool!!!!!! 🎉

Use unicode in lxml's tostring() method

bb66ffb

timobrembeck requested a review from a team as a code owner January 30, 2024 22:37

david-venhoff approved these changes Jan 31, 2024


integreat_cms/cms/utils/linkcheck_utils.py (outdated, resolved)

Fix urlencoded domain names

e8f8ef4

Co-authored-by: David Venhoff <david.venhoff@tuerantuer.org>

timobrembeck force-pushed the bugfix/content-encoding branch 2 times, most recently from 8f9acc4 to 01fef50 January 31, 2024 14:50

No longer encode html entities in TinyMCE

01fef50

Co-authored-by: David Venhoff <david.venhoff@tuerantuer.org>

MizukiTemma approved these changes Jan 31, 2024


timobrembeck merged commit e3cc3cb into develop Jan 31, 2024
5 checks passed

timobrembeck deleted the bugfix/content-encoding branch January 31, 2024 17:16

timobrembeck mentioned this pull request Feb 10, 2024

Not possible to replace utf-8 domain links in broken link checker #2646

Closed
