New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescaped left angle brackets in link destination #193

Open
ScottAbbey opened this Issue Apr 30, 2017 · 11 comments

Comments

Projects
None yet
4 participants
@ScottAbbey
Copy link

ScottAbbey commented Apr 30, 2017

The CommonMark spec has this to say about link destinations:

A link destination consists of either

a sequence of zero or more characters between an opening < and a closing > that contains no >spaces, line breaks, or unescaped < or > characters, or

Given the following input:

[a]

[a]: <te<st>

this implementation outputs:

<p><a href="te%3Cst">a</a></p>

However, the Javascript reference implementation outputs:

<p><a href="%3Cte%3Cst%3E">a</a></p>

Note the %3C characters surrounding te%3Cst are missing in the output from this implementation. They should be present, as this should be treated as the non-<...> type of link definition.

This also seems to affect inline link destinations.

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Apr 30, 2017

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Jul 13, 2017

The fix was wrong.

Spec says there are two types of link destination:

  1. a sequence of zero or more characters between an opening < and a
    closing > that contains no spaces, line breaks, or unescaped
    < or > characters, or

  2. a nonempty sequence of characters that does not include
    ASCII space or control characters, and includes parentheses
    only if (a) they are backslash-escaped or (b) they are part of
    a balanced pair of unescaped parentheses.

The point of this bug report was that <te<st> should be treated as a type 2 link destination, not as a type 1 destination. But the "fix" makes it a type 1 link destination. Expected behavior is to get a link with URL "%3Cte%3Cst%3E".

jgm added a commit that referenced this issue Jul 13, 2017

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Jul 13, 2017

I reverted the completely mistaken fix. Now we have a failing test.

What we need is a fix for manual_scan_link_url. In writing this, I previously assumed that we'd be able to tell at the outset whether we were scanning for a type 1 or type 2 URL. But this isn't right, if <te<st> should be a type 2 URL. Scanning gets a bit tricky if we can't tell till the end whether we have type 1 or type 2, because type 1 doesn't require parentheses to be balanced.

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Jul 13, 2017

Now I think I understand why I originally disallowed unescaped nested parens! This made scanning much easier.

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Jul 13, 2017

We could always have two scanners: first try with a type 1 (if < is the first character), and if that fails, try with type 2. This will mean backtracking, but only on unusual inputs, and only once, so it shouldn't cause performance issues.

@kivikakk

This comment has been minimized.

Copy link
Contributor

kivikakk commented Jul 14, 2017

That seems quite acceptable! Here's a pretty gross diff that does just that:

diff --git a/src/inlines.c b/src/inlines.c
index e30c2af..0500177 100644
--- a/src/inlines.c
+++ b/src/inlines.c
@@ -859,10 +859,14 @@ static bufsize_t manual_scan_link_url(cmark_chunk *input, bufsize_t offset) {
         i += 2;
       else if (cmark_isspace(input->data[i]))
         return -1;
-      else
+      else if (input->data[i] == '<') {
+        i = offset;
+        goto type2;
+      } else
         ++i;
     }
   } else {
+type2:
     while (i < input->len) {
       if (input->data[i] == '\\' &&
          i + 1 < input-> len &&

Is this the kind of thing you mean?

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Jul 14, 2017

@kivikakk

This comment has been minimized.

Copy link
Contributor

kivikakk commented Jul 17, 2017

#219 is a PR for a better fix.

@jgm jgm closed this in #219 Jul 25, 2017

@jgm jgm reopened this Jul 25, 2017

@jgm

This comment has been minimized.

Copy link
Member

jgm commented Jul 25, 2017

Reopening because we don't correctly handle this case:

[hi](<aaa)

which should be a link with URL "%3caaa", but is not parsed as a link at all.
There should be a regression test, or better a spec test, for this.

@mity

This comment has been minimized.

Copy link
Contributor

mity commented Aug 6, 2017

Note the PR #219 also does not fix the following case:

[a](<x>X)

The problem is that if the parsing of the whole link (or ref. def.) with link destination type 1 fails, parsing of whole link (or ref. def.) with the type 2 should be retried. Doing the retry only on the destination itself is not enough.

BTW, MD4C also fails on this.
EDIT: MD4C now has this fixed but my proposal below to simplify it still holds.

Although it is fixable, IMHO the question is whether it is worth it. Take this as my vote to ban the retrying completely in the specs. To be more precise, I vote for something as follows (bold text is new):

A link destination consists of either

  • a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

  • a nonempty sequence of characters that does not include ASCII space or control characters, does not begin with unescaped <, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)

This would allow the parser to decide how to parse by checking the first character of the maybe-link-destination string.

Rationale:

  1. Link URLs starting with < are rare so usage of it for type 2 is likely minimal.
  2. With such URL, users may need to escape < anyway even now if the URL also ends with >.
  3. Simplification of the implementation to the level which would make similar regressions unlikely.

Note the ambiguity implied in the point 2 is unfixable without change in specification because, with its current wording, <xxxx> simply fulfills both types of link destinations.

@kivikakk

This comment has been minimized.

Copy link
Contributor

kivikakk commented Jan 25, 2018

+1 to banning retrying. I can't think of many useful URIs that begin with <.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment