Plugins: auto-anchor: use ASCII-only slug #349
Merged
Closes #348
For some reason, our Markdown interpreter (Kramdown) won't create anchor tags containing non-ASCII text. I don't know whether that's Kramdown's fault or whether something in the HTML5 spec requires it. This affects both our Japanese and Spanish translations (the latter only occasionally, when one of the words in the anchor text has an accented character).
The fix implemented in this PR is to (1) remove all non-ASCII accents (converting those characters to their nearest ASCII representation, e.g. õ to o) and (2) further remove all non-ASCII characters (e.g. all Japanese characters) before generating the anchor id ("slug"). This works, but it has two implications for non-Latin character sets:
If all of the text in the bolded, italicized, or linkified part of a bullet is non-Latin, no anchor will be generated (because there's no text left after the non-ASCII characters are removed). This doesn't seem to happen often in the Japanese translations because there are usually technical terms like "LN" or "bech32" that are used unchanged in the translation.
Once all of the non-Latin text has been stripped, two or more bullets in the same article sometimes end up with the same anchor. E.g. after stripping non-ASCII, these produce the same anchor:
Our tests catch cases where two anchors have the same id, so this is easy to catch before publication. I found that it was easy to fix by just inserting a single-character comment in the bolded text to distinguish the duplicates (see PR for a few cases where I had to do this).
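To make the two-step slug generation and its two side effects concrete, here is a minimal Python sketch. It is not this PR's plugin code: the function names `ascii_slug` and `duplicate_slugs`, the hyphen-joining rule, and the sample strings are all assumptions chosen for illustration.

```python
import re
import unicodedata
from collections import Counter


def ascii_slug(text: str) -> str:
    """Return an ASCII-only slug, or "" if no ASCII text survives.

    Hypothetical helper; the real plugin's slug rules may differ.
    """
    # Step 1: decompose accented characters so the accent becomes a
    # separate combining mark (e.g. "õ" becomes "o" + combining tilde).
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: drop every remaining non-ASCII character, which removes the
    # combining marks and any text with no ASCII base (e.g. Japanese).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Assumed convention: lowercase, runs of other characters become "-".
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def duplicate_slugs(titles):
    """Return the slugs shared by more than one title, ignoring empty slugs."""
    counts = Counter(slug for slug in map(ascii_slug, titles) if slug)
    return sorted(slug for slug, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(ascii_slug("Opção"))    # accents stripped
    print(ascii_slug("日本語"))    # nothing left, so no anchor
    print(duplicate_slugs(["LN の 改善", "LN の 提案"]))  # both reduce to "ln"
```

In this sketch, a title with no ASCII text yields an empty slug (so no anchor), and two titles that differ only in their non-ASCII text collide, which is the situation the duplicate-id test is there to catch.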
Overall, I think this is an adequate solution, although it's still unfortunate that we can't have localized anchors.
Testing: I tested by building the site with and without this PR's commit and then diffing the HTML. It looks to me like this change is purely additive: it only adds anchors (and HTML comments) in places where there were no anchors before. No existing anchors are changed.
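The "purely additive" check could also be automated rather than read off a diff. The sketch below is one possible approach, assuming anchors are emitted as `id` attributes in the built HTML; `IdCollector`, `anchor_ids`, and `anchor_changes` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute seen in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def anchor_ids(html: str) -> set:
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


def anchor_changes(before_html: str, after_html: str):
    """Return (added, removed) ids between two builds of the same page."""
    before, after = anchor_ids(before_html), anchor_ids(after_html)
    return after - before, before - after


if __name__ == "__main__":
    before = '<li id="ln-fix"><b>LN fix</b></li><li><b>bech32 日本語</b></li>'
    after = '<li id="ln-fix"><b>LN fix</b></li><li id="bech32"><b>bech32 日本語</b></li>'
    added, removed = anchor_changes(before, after)
    # An additive change has new ids but an empty "removed" set.
    print(added, removed)
```

An empty `removed` set for every page would confirm that no existing anchor was changed or dropped.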