Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Use encodelink() everywhere #2037

Merged
merged 5 commits into from Sep 10, 2015

Conversation

Projects
None yet
3 participants
Contributor

da2x commented Sep 8, 2015

Replaces the pull request suggested in #2005.

utils.encodelink(str) takes a string and NFC normalizes it, encodes the netloc to IDN punycode, and encodes the path to percent-encoding (RFC-3986 / RFC-3987). Schema, query, and fragment components of the address are also NFC normalized but are otherwise untouched. The function can be called multiple times non-destructively.

@masayuko, would you please test this? Set USE_SLUGIFY = False to support tags and categories. No other option required. Please also test nikola check -l (Nikola’s link checker.)

da2x added some commits Sep 8, 2015

Contributor

da2x commented Sep 8, 2015

OK, after 3 hours I’ve concluded that py27 support for check is either a) unfixable or b) beyond my ability to fix. Too many Python internals are hardcoded to be ascii only. Anything that traverses the filesystem is a nightmare. I can cobble together something that runs on py27 excluding py3 and everything that isn’t Linux. py3 works fine as-is now.

Would it be okay ti throw an error and die if a user tries using check with non-ascii characters in anything older than py32? "It’s cool that you want to use UTF-8 paths, but you can’t have nice things in py2. Please upgrade to Python 3 to check your files and links."

Contributor

da2x commented Sep 9, 2015

Everything is awesome. Should work*. Does it work for you, @masayuko?

* OSX may still explode because of it’s unique NFD preference, seems to work fine, but really it shouldn’t have.

Owner

Kwpolska commented Sep 9, 2015

We wouldn’t mind not supporting Python 2 here.

Contributor

da2x commented Sep 9, 2015

This is ready to merge with py27 and py3 support. Someone want to sanity check it? @Kwpolska? @masayuko?

Contributor

msnoigrs commented Sep 10, 2015

Sorry for late reply. I am checking it now. I'll report back later.
It seems that NFKC is better than NFD.
(refer to nameprep rfc3491 stringprep rfc3454)

Contributor

msnoigrs commented Sep 10, 2015

OK. I have just tested in several ways. It works for me.
As mentioned above, it should use NFKC to normalize in util.encodelink().
I have encountered a issue that fail to create sitemap. When the words of author, category, tag, etc end with period(.), It will be occured due to follow line:

path = (path.replace(os.sep, '/') + '/').replace('./', '')

in nikola/plugins/task/sitemap/init.py.
It was fixed in my PR code #2005.

Contributor

da2x commented Sep 10, 2015

NFKC is destructive where as NFC is not. NFKC offers no advantages here as far as I know. Emailing discussions on W3C’s list recommended NFC over NFKC, so I’m sticking with that. (I couldn’t be bothered to dig up the link again.) NFD and NFKD is out of the question because they can’t be used reliably to produce the same output. Besides, everyone but Apple is using NFC everywhere for everything.

Regarding the issue with periods in the end of paths, could you make a separate issue about that?

da2x added a commit that referenced this pull request Sep 10, 2015

Merge pull request #2037 from getnikola/encodelinks
Use encodelink() everywhere

@da2x da2x merged commit ef6af2b into master Sep 10, 2015

2 of 3 checks passed

continuous-integration/travis-ci/push The Travis CI build failed
Details
continuous-integration/appveyor AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

da2x added a commit that referenced this pull request Sep 10, 2015

Contributor

msnoigrs commented Sep 11, 2015

The netloc should follow nameprep RFC3491, RFC3454. so prefer NFKC.
Examples of japanese idn include converting half-width kana between full-width kana.
In actual operation, half-width kana cannot be used as idn because half-width kana convert to full-width kana by NFKC.
Kana is the major character in Japanese and It has several expressions in Unicode.

ハ (U+FF8A) half-width
ハ (U+30CF) full-width
ハ (U+FF8A) -> NFC -> ハ (U+FF8A) <-- disallow in idn
ハ (U+FF8A) -> NFKC -> ハ (U+30CF)

パ (U+FF8A U+FF9F) half-width and two code points
パ (U+30D1) full-width
パ (U+FF8A U+FF9F) -> NFC -> パ (U+FF8A U+FF9F) <-- disallow in idn
パ (U+FF8A U+FF9F) -> NFKC -> パ (U+30D1)

On the other hand, the path get used as the location name on a filesystem. so prefer NFC.

Contributor

msnoigrs commented Sep 11, 2015

I have just checked myself.

print('パ'.encode('idna'))
b'xn--odk'
print('パ'.encode('idna'))
b'xn--odk'

NFC is no problem. Please forget the above said.

@Kwpolska Kwpolska deleted the encodelinks branch Dec 24, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment