Fix encoding of UTF-8 domain names #2628

Merged 3 commits into develop from bugfix/content-encoding on Jan 31, 2024

timobrembeck
Member

Short description

This is a somewhat ugly workaround, but I doubt that the underlying problem will be fixed in lxml or libxml2 anytime soon.

Proposed changes

  • Use unicode as the output encoding in lxml's tostring() method
  • Fix URL-encoded domain names
  • No longer encode HTML entities in TinyMCE
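
As a rough illustration of the second point, decoding a percent-encoded host name could look like the sketch below. The pattern `URL_HOST_RE` and the function `fix_urlencoded_domains` are hypothetical names chosen for this example and are not taken from the PR's diff; only the standard library is used.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: the scheme, then the host part up to the first
# "/", "?", "#", quote, angle bracket or whitespace character.
URL_HOST_RE = re.compile(r"(?P<scheme>https?://)(?P<host>[^/?#\s\"'<>]+)")


def fix_urlencoded_domains(html: str) -> str:
    """Decode percent-encoded characters in the host part of URLs only."""

    def _decode(match: re.Match) -> str:
        # Only the host is unquoted; path and query stay untouched.
        return match.group("scheme") + unquote(match.group("host"))

    return URL_HOST_RE.sub(_decode, html)


broken = '<a href="https://m%C3%BCnchen.de/stadt%20info">München</a>'
print(fix_urlencoded_domains(broken))
# → <a href="https://münchen.de/stadt%20info">München</a>
```

Restricting the decoding to the host keeps legitimate percent-encoding in paths and query strings intact.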

Side effects

  • The page content now contains UTF-8 characters instead of HTML entities, but I think most modern browsers should be able to display them without problems.
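
To make the side effect concrete, here is a small sketch using the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without extra dependencies; the PR itself changes lxml calls). Serializing to ASCII bytes yields character references, while `encoding="unicode"` keeps the characters as they are.

```python
import xml.etree.ElementTree as ET

# Stand-in for lxml: build a tiny element containing a non-ASCII character.
paragraph = ET.Element("p")
paragraph.text = "München"

# Default serialization is ASCII bytes, so "ü" becomes a character reference.
ascii_bytes = ET.tostring(paragraph)

# With encoding="unicode" the result is a str with the character kept as-is.
unicode_str = ET.tostring(paragraph, encoding="unicode")

print(ascii_bytes)   # b'<p>M&#252;nchen</p>'
print(unicode_str)   # <p>München</p>
```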

Resolved issues

Fixes: #2274


Pull Request Review Guidelines

@timobrembeck timobrembeck requested a review from a team as a code owner January 30, 2024 22:37

codeclimate bot commented Jan 30, 2024

Code Climate has analyzed commit 01fef50 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 78.9% (50% is the threshold).

This pull request will bring the total coverage in the repository to 81.8% (-0.2% change).

View more on Code Climate.

Member

@david-venhoff david-venhoff left a comment

Thanks, looks good to me!
I am a little worried that the regex might produce some false positives, but even if that happens it should be fine, since it only removes the encoding of Unicode characters.

integreat_cms/cms/utils/linkcheck_utils.py (outdated, resolved)
Co-authored-by: David Venhoff <david.venhoff@tuerantuer.org>
@timobrembeck timobrembeck force-pushed the bugfix/content-encoding branch 2 times, most recently from 8f9acc4 to 01fef50 Compare January 31, 2024 14:50
Member

@MizukiTemma MizukiTemma left a comment

🎉 Cool!!!!!! 🎉

@timobrembeck timobrembeck merged commit e3cc3cb into develop Jan 31, 2024
5 checks passed
@timobrembeck timobrembeck deleted the bugfix/content-encoding branch January 31, 2024 17:16
Development

Successfully merging this pull request may close these issues.

UTF-8 encoded ULRs in Broken Link checker
3 participants