IDN SITE_URL is not converted to Punycode #1644

Closed
gerritsangel opened this issue Mar 26, 2015 · 14 comments

@gerritsangel

The SITE_URL is not converted correctly to Punycode. For example, when initialising a new blog and entering:

Site URL [http://getnikola.com/]: http://exämple.com/täst/
the following line is written to conf.py:
SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"

The correct behaviour would be to convert the domain name to Punycode:
SITE_URL = "http://xn--exmple-cua.com/t\u00e4st/"

As a result, Firefox (for example) throws an error when clicking on the logo.

I guess that (only) the domain part needs to be isolated from SITE_URL and then converted with "exämple.com".encode("idna") to xn--exmple-cua.com.

Nikola should also keep in mind that the user may edit SITE_URL in conf.py by hand and write the IDN without Punycode, for example:
SITE_URL = "http://exämple.com/täst"
Therefore the Punycode conversion is best applied at build time, not in the blog init.
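
A minimal sketch of the conversion suggested above, assuming Python 3 and its built-in "idna" codec. Only the hostname goes through the codec; the path is left alone. The variable names are illustrative and not taken from Nikola.

```python
# Sketch only: convert a hostname (not a full URL) to its Punycode form.
hostname = "exämple.com"

# The "idna" codec encodes each dot-separated label and returns ASCII bytes.
ascii_hostname = hostname.encode("idna").decode("ascii")

print(ascii_hostname)
```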

@ralsina
Member

ralsina commented Mar 26, 2015

Interesting. It looks easy-ish :-)

@Kwpolska Kwpolska added this to the v7.3.2 milestone Mar 27, 2015
@Kwpolska
Member

SITE_URL = "http://exämple.com/täst"

This is equivalent to:

SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"

To solve this, we could just urlsplit(), encode the domain part and urljoin() it back.
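
The split/encode/rejoin idea could look roughly like the sketch below. It assumes Python 3 and uses urlunsplit() rather than urljoin() to reassemble the URL, since urlunsplit() is the inverse of urlsplit(). The helper name idna_site_url is hypothetical and is not the code from the actual fix.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_site_url(url):
    """Return url with only its hostname converted to Punycode (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.hostname:
        # Nothing to convert, e.g. a relative URL.
        return url

    # Encode the hostname; pure-ASCII hostnames pass through unchanged.
    netloc = parts.hostname.encode("idna").decode("ascii")

    # Re-attach the optional port and userinfo that urlsplit() separated out.
    if parts.port is not None:
        netloc = "{0}:{1}".format(netloc, parts.port)
    if parts.username:
        userinfo = parts.username
        if parts.password:
            userinfo = "{0}:{1}".format(userinfo, parts.password)
        netloc = "{0}@{1}".format(userinfo, netloc)

    # Path, query and fragment are passed through untouched.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(idna_site_url("http://exämple.com/täst/"))
```

Note that urlsplit() lowercases the hostname, which is harmless for domain names.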

@Kwpolska
Member

PS. the issue is caused by a dumb algorithm (in lxml?) that is handling links like it’s 1999:

<a href="http://%D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.%D1%80%D1%84/">broken</a>

This is the problem. Firefox wouldn’t mind the punycode form, and it also wouldn’t mind the real unicode:

<a href="http://xn--d1abbgf6aiiy.xn--p1ai/">works</a>
<a href="http://президент.рф/">works too</a>

On a side note, Chrome supports the percent-escaped link; IE and Safari fail on it, just like Firefox.

test: https://dl.dropboxusercontent.com/u/1933476/IDN.html

This is not a statement of support for the Russian Federation
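
For reference, the two host forms shown in the links above can be reproduced with the Python 3 standard library. This is only an illustration of how the strings relate; it is not how lxml or Nikola produce them.

```python
from urllib.parse import quote

host = "президент.рф"

# Percent-escaping the UTF-8 bytes of the hostname: the form that breaks in most browsers.
percent_escaped = quote(host)

# IDNA/Punycode encoding of the hostname: the form that works.
punycode = host.encode("idna").decode("ascii")

print(percent_escaped)
print(punycode)
```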

@gerritsangel
Author

As a side note, maybe it would be good to write the variables in conf.py directly in UTF-8. Escaping everything reduces readability to 0 and is not necessary, because conf.py's encoding is declared as UTF-8 either way.
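
A quick check, assuming Python 3 and a UTF-8 encoded source file, that the escaped and the literal spelling are the same string. Only the readability of conf.py differs.

```python
# Both spellings produce an identical str object value.
escaped = "http://ex\u00e4mple.com/t\u00e4st/"
literal = "http://exämple.com/täst/"

print(escaped == literal)
```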

Kwpolska added a commit that referenced this issue Mar 31, 2015
Requires UTF-8 input on Python 2.

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@Kwpolska
Member

Done in fb3a7db. Requires UTF-8 input for this to work.

@gerritsangel
Author

Found that the bug has a slightly larger impact when using Isso as a comment system. "script src" in the output html file will be incorrect and then the comment file is not loaded, and comments don't work.

Solution/workaround: Same as above, write the Domain in Punycode in COMMENT_SYSTEM_ID.
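
A sketch of that workaround as a conf.py fragment. The /isso/ path is a made-up example, and the exact value depends on where the Isso instance is hosted.

```python
# conf.py fragment (sketch): give the Isso URL with the domain already in Punycode.
COMMENT_SYSTEM = "isso"
# Hypothetical instance URL; "exämple.com" written as "xn--exmple-cua.com".
COMMENT_SYSTEM_ID = "http://xn--exmple-cua.com/isso/"
```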

@Kwpolska
Member

Kwpolska commented Apr 7, 2015

It looks like the best solution would be to fix things in nikola init and require Punycode in the config file (refuse to build if Unicode is found in the domain).
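
A build-time check along those lines might look like the sketch below, assuming Python 3. The function name check_site_url and the error message are hypothetical and not Nikola's actual implementation.

```python
from urllib.parse import urlsplit


def check_site_url(site_url):
    """Raise ValueError if the domain of site_url is not plain ASCII (hypothetical check)."""
    hostname = urlsplit(site_url).hostname or ""
    try:
        hostname.encode("ascii")
    except UnicodeEncodeError:
        # Offer the Punycode form so the user can paste it into conf.py.
        suggestion = hostname.encode("idna").decode("ascii")
        raise ValueError(
            "SITE_URL domain must be given in Punycode: use {0!r} instead of {1!r}".format(
                suggestion, hostname
            )
        )


# Unicode in the path is fine; only the domain is checked.
check_site_url("http://xn--exmple-cua.com/täst/")
```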

@gerritsangel
Author

Well, in my (humble) opinion, it would be best if Nikola did not convert the UTF-8 to anything: no Punycoding, no escaping. All current web browsers should understand URLs like http://президент.рф/президент.html, so I don't see the need to convert this to http://xn--d1abbgf6aiiy.xn--p1ai/%D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.html.

It would greatly improve readability if no conversions were made. First, in conf.py: if you actually want to read what you have entered in SITE_URL, the non-Punycode form is better. Readability of the HTML source code is, of course, not a problem, but unconverted URLs would be easier to read there as well (and save some bytes :D)

... but of course I understand that this may need an extreme overhaul.

@Kwpolska
Member

Kwpolska commented Apr 7, 2015

We can’t fix this on our own. lxml or one of their upstreams can — talk to the appropriate vendor if you want this fixed nicely.

@gerritsangel
Author

Ah ok, sorry for the misunderstanding :) I'll try it.

@ralsina
Member

ralsina commented Apr 24, 2015

@Kwpolska so, if I understand this correctly there's nothing more we can do? Close it?

@Kwpolska
Member

@ralsina Possible solutions include:

(a) trying to get the link replacer to fix this (which will probably not fix everything);
(b) fixing things in nikola init and warning users/failing to build if Unicode characters are encountered in SITE_URL.

Which one do we choose?

@ralsina
Member

ralsina commented Apr 24, 2015

I'd say b) which looks much easier.

@Kwpolska Kwpolska self-assigned this Apr 24, 2015
@Kwpolska
Member

I tried to fix it with (a) and I failed. Not only did the aforementioned isso src links blow up, it also looks like the URL replacer does not touch the logo link and many others.

But, we could leave the patch in for when people want to link to IDN domain names and have Unicode input.

Fix in #1668.

Kwpolska added a commit that referenced this issue Apr 24, 2015
fix #1644 -- work around issues with IDNs