IDN SITE_URL is not converted to Punycode #1644

gerritsangel · 2015-03-26T20:07:23Z

The SITE_URL is not converted to Punycode correctly. For example, when initialising a new blog and entering:

Site URL [http://getnikola.com/]: http://exämple.com/täst/

conf.py ends up containing:
SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"

The domain name should instead be converted to Punycode:
SITE_URL = "http://xn--exmple-cua.com/t\u00e4st/"

As a result, Firefox (for example) throws an error when the logo is clicked.

I guess that only the domain part needs to be isolated from SITE_URL and then converted: "exämple.com".encode("idna") gives xn--exmple-cua.com.

Nikola should also keep in mind that the user may edit SITE_URL in conf.py directly and write the IDN without Punycode, for example:
SITE_URL = "http://exämple.com/täst"
Therefore the Punycode conversion is best applied while building, not during blog init.

ralsina · 2015-03-26T20:18:49Z

Interesting. It looks easy-ish :-)

Kwpolska · 2015-03-27T13:23:46Z

SITE_URL = "http://exämple.com/täst"

This is equivalent to:

SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"

To solve this, we could just urlsplit(), encode the domain part and urljoin() it back.
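
A minimal sketch of that idea, assuming Python 3 and the standard library only. The helper name punycode_site_url is hypothetical and not part of Nikola. urlunsplit() is used to reassemble the URL because it is the direct inverse of urlsplit(), and any userinfo in the URL is dropped for brevity:

```python
from urllib.parse import urlsplit, urlunsplit


def punycode_site_url(url):
    """Return url with only its host converted to Punycode (IDNA)."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        # No network location (e.g. a relative URL): nothing to convert.
        return url
    # Python's built-in "idna" codec encodes each dot-separated label.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = "{0}:{1}".format(netloc, parts.port)
    # Scheme, path, query and fragment are passed through unchanged.
    return urlunsplit(parts._replace(netloc=netloc))


print(punycode_site_url("http://exämple.com/täst/"))
# http://xn--exmple-cua.com/täst/
```

The path is left as Unicode on purpose, since only the domain part is at issue here.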

Kwpolska · 2015-03-27T13:51:18Z

PS. the issue is caused by a dumb algorithm (in lxml?) that is handling links like it’s 1999:

<a href="http://%D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.%D1%80%D1%84/">broken</a>

This is the problem. Firefox wouldn’t mind the punycode form, and it also wouldn’t mind the real unicode:

<a href="http://xn--d1abbgf6aiiy.xn--p1ai/">works</a>
<a href="http://президент.рф/">works too</a>

On a side note, Chrome supports the percent-escaped link; IE and Safari fail as well.
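
The broken and the working host forms above can be reproduced with the Python 3 standard library. This only illustrates the two encodings; it is not the code path that actually produces the broken link:

```python
from urllib.parse import quote

host = "президент.рф"

# Percent-escaping the UTF-8 bytes of the host gives the form seen in the
# broken link.
percent_escaped = quote(host)

# The built-in IDNA codec gives the Punycode form that browsers accept.
punycode = host.encode("idna").decode("ascii")

print("http://{0}/".format(percent_escaped))
print("http://{0}/".format(punycode))
```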

test: https://dl.dropboxusercontent.com/u/1933476/IDN.html

This is not a statement of support for the Russian Federation

gerritsangel · 2015-03-31T16:24:56Z

As a side note, maybe it would be good to write the variables in conf.py directly in UTF-8. Escaping everything reduces readability to 0 and is not necessary, because conf.py's encoding is declared as utf8 either way.

Requires UTF-8 input on Python 2. Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Kwpolska · 2015-03-31T17:42:14Z

Done in fb3a7db. Requires UTF-8 input for this to work.

gerritsangel · 2015-04-07T15:13:37Z

I found that the bug has a slightly larger impact when using Isso as a comment system: the "script src" in the output HTML file is incorrect, so the comment script file is not loaded and comments don't work.

Solution/workaround: same as above, write the domain in Punycode in COMMENT_SYSTEM_ID.

Kwpolska · 2015-04-07T15:25:23Z

It looks like the best solution would be to fix things in nikola init and require Punycode in the config file (refuse to build if Unicode is found in the domain).
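
A rough sketch of such a check, assuming Python 3. The function name require_ascii_domain and the error message are hypothetical; how Nikola actually reports configuration errors is not shown here:

```python
from urllib.parse import urlsplit


def require_ascii_domain(site_url):
    """Raise ValueError if the host part of site_url is not pure ASCII."""
    host = urlsplit(site_url).hostname or ""
    try:
        host.encode("ascii")
    except UnicodeEncodeError:
        # Suggest the Punycode spelling so the user can fix conf.py by hand.
        suggestion = host.encode("idna").decode("ascii")
        raise ValueError(
            "SITE_URL must use a Punycode domain: write {0!r} instead of {1!r}"
            .format(suggestion, host)
        )
    return site_url
```

Only the domain is checked; Unicode in the path is deliberately allowed through.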

gerritsangel · 2015-04-07T16:08:39Z

Well, in my (humble) opinion, it would be best if Nikola did not convert the UTF-8 to anything: no Punycoding, no escaping. All current web browsers should understand URLs like http://президент.рф/президент.html. Therefore I don't see the necessity to convert this to http://xn--d1abbgf6aiiy.xn--p1ai/%D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.html.

It would greatly improve readability if no conversions were made. First, in conf.py: if you actually want to read what you have entered in SITE_URL, non-Punycode is better. The HTML source code is, of course, less of a problem, but it would be easier to read as well (and save some bytes :D)

... but of course I understand that this may need an extensive overhaul.

Kwpolska · 2015-04-07T16:25:20Z

We can’t fix this on our own. lxml or one of their upstreams can — talk to the appropriate vendor if you want this fixed nicely.

gerritsangel · 2015-04-07T16:30:13Z

Ah ok, sorry for the misunderstanding :) I'll try it.

ralsina · 2015-04-24T15:44:15Z

@Kwpolska so, if I understand this correctly there's nothing more we can do? Close it?

Kwpolska · 2015-04-24T15:48:37Z

@ralsina Possible solutions include:

(a) trying to get the link replacer to fix this (which will probably not fix everything);
(b) fixing things in nikola init and warning users/failing to build if Unicode characters are encountered in SITE_URL.

Which one do we choose?

ralsina · 2015-04-24T15:52:59Z

I'd say b) which looks much easier.

Kwpolska · 2015-04-24T17:12:31Z

I tried to fix it with (a) and I failed. Not only did the aforementioned isso src links blow up, it also looks like the URL replacer does not touch the logo link and many others.

But, we could leave the patch in for when people want to link to IDN domain names and have Unicode input.

Fix in #1668.

fix #1644 -- work around issues with IDNs

Kwpolska added bug minor labels Mar 27, 2015

Kwpolska added this to the v7.3.2 milestone Mar 27, 2015

Kwpolska added a commit that referenced this issue Mar 31, 2015

Unicode nikola init output (via #1644)

fb3a7db

Requires UTF-8 input on Python 2. Signed-off-by: Chris Warrick <kwpolska@gmail.com>

Kwpolska self-assigned this Apr 24, 2015

Kwpolska added the PR exists label Apr 24, 2015

Kwpolska closed this as completed in ea4ff18 Apr 24, 2015

Kwpolska added a commit that referenced this issue Apr 24, 2015

Merge pull request #1668 from getnikola/punycode

57d46d2

fix #1644 -- work around issues with IDNs
