CJK link processing causes Markdown links and HTML links to have different URLs #672

ptmkenny · 2020-12-14T12:15:53Z

The Commonmark spec currently results in the Japanese text in Markdown links getting converted. This is causing problems for me because I have Markdown links and HTML links on the same page, and my pa11y accessibility tests are failing with the Markdown links because they expect the Japanese text but instead are getting the converted text.

[Link](https://www.example.com/テレビ)

<a href="https://www.example.com/テレビ">Link</a>

This results in the following HTML:

<p><a href="https://www.example.com/%E3%83%86%E3%83%AC%E3%83%93">Link</a></p>

<p><a href="https://www.example.com/テレビ">Link</a></p>

In this case, I do not think the text should be converted. It's perfectly valid HTML to leave the links "as is", so no additional processing should be applied.

The text was updated successfully, but these errors were encountered:

wooorm · 2020-12-14T13:00:29Z

The link from markdown still works though, right? You can open both and they go to the same place. Sounds like an issue in pa11y to me

alerque · 2020-12-14T13:12:30Z

The links are equivalent, but there should not be reason to munge the URL to a different encoding (and code style) if it is valid as-is.

ptmkenny · 2020-12-14T13:16:28Z

Right, both URLs resolve to the same page, but it complicates any post processing that might be done after markdown.

For example, if I have a static site and I grep for a Japanese link, now I have to search for two patterns, the original Japanese and the converted one.

jgm · 2020-12-14T20:06:30Z

This isn't an issue with the spec, because the spec doesn't mandate that the URLs be escaped this way. That's up to the renderer.

Crissov · 2020-12-15T10:00:33Z

According to Babelmark3, CM implementations disagree on this:

commonmark-java and flexmark-java do not percent-encode either URL
commonmark.js, league/commonmark, markdig, MD4C and pulldown-cmark encode the Markdown URL only (as described above); markdown-it safifies HTML beforehand
GFM encodes both URLs

jgm · 2020-12-15T15:03:59Z

Yes, this is just an issue which the spec leaves to implementations. (The spec is really concerned with parsing, not rendering. It's a bit awkward that we use HTML for the examples/tests, for this reason; the examples inevitably have to make decisions on rendering that go beyond what the spec demands.)

wooorm · 2020-12-15T15:07:54Z

@jgm I believe the idea of leaving that to compilers/generators/implementations is not in the spec, it would be very useful in my opinion to have that documented

jgm · 2020-12-15T16:29:52Z

The spec doesn't say "URLs must be encoded." It just says how to recognize something as a link destination, etc. But yes, we could add a note about this.

wooorm · 2020-12-15T16:40:13Z

My that was a reference to your earlier comment (“The spec is really concerned with parsing, not rendering. It's a bit awkward that we use HTML for the examples/tests, for this reason; the examples inevitably have to make decisions on rendering that go beyond what the spec demands.”), not with this specific issue of whether to encode URLs.
I believe I’ve heard you voice that idea several times on this issue tracker, and I think it’d be good to add it to the spec, because to me the spec currently reads as dictating that certain input MUST lead to certain output without any room to diverge.

jgm · 2020-12-15T18:35:20Z

Agreed. Perhaps add a sentence or two to this paragraph:

Since this document describes how Markdown is to be parsed into
an abstract syntax tree, it would have made sense to use an abstract
representation of the syntax tree instead of HTML. But HTML is capable
of representing the structural distinctions we need to make, and the
choice of HTML for the tests makes it possible to run the tests against
an implementation without writing an abstract syntax tree renderer.

is normative. See #672.

jgm · 2020-12-16T00:07:35Z

Let me know what you think about the text added in the above commit.

wooorm · 2021-02-05T15:19:37Z

Sorry. I think this looks goods! 👍

jgm added a commit that referenced this issue Dec 16, 2020

Add note clarifying that not every feature of HTML examples...

71b0bab

is normative. See #672.

ptmkenny mentioned this issue Apr 8, 2024

HeadingPermalinkExtension does not percent-encode CJK names, but commonmark does percent-encode CJK URLs thephpleague/commonmark#1021

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CJK link processing causes Markdown links and HTML links to have different URLs #672

CJK link processing causes Markdown links and HTML links to have different URLs #672

ptmkenny commented Dec 14, 2020

wooorm commented Dec 14, 2020

alerque commented Dec 14, 2020

ptmkenny commented Dec 14, 2020

jgm commented Dec 14, 2020

Crissov commented Dec 15, 2020

jgm commented Dec 15, 2020

wooorm commented Dec 15, 2020

jgm commented Dec 15, 2020 •

edited

Loading

wooorm commented Dec 15, 2020

jgm commented Dec 15, 2020

jgm commented Dec 16, 2020

wooorm commented Feb 5, 2021

CJK link processing causes Markdown links and HTML links to have different URLs #672

CJK link processing causes Markdown links and HTML links to have different URLs #672

Comments

ptmkenny commented Dec 14, 2020

wooorm commented Dec 14, 2020

alerque commented Dec 14, 2020

ptmkenny commented Dec 14, 2020

jgm commented Dec 14, 2020

Crissov commented Dec 15, 2020

jgm commented Dec 15, 2020

wooorm commented Dec 15, 2020

jgm commented Dec 15, 2020 • edited Loading

wooorm commented Dec 15, 2020

jgm commented Dec 15, 2020

jgm commented Dec 16, 2020

wooorm commented Feb 5, 2021

jgm commented Dec 15, 2020 •

edited

Loading