parentheses in links cannot be parsed by regex #248

Pomax · 2015-07-23T17:50:49Z

I was looking for a new markdown library because chjj/marked is built entirely on regex, which means it fails to parse URLs with parentheses in them (Wikipedia is full of those, for instance). However, looking at https://github.com/evilstreak/markdown-js/blob/master/src/parser.js#L179-202, it seems this code also uses regex for inline pattern extraction.

If that is the case, then this parser also can't deal with URLs like https://en.wikipedia.org/wiki/Set_(mathematics) (GitHub's parser gets URIs like these right), unless people manually replace the parentheses with percent-encoded %.. values, which is not a realistic solution. As long as regex is used, there is no solution to that problem, so this might be something to put in the README.md as a "warning: there are limitations to this parser. [...]", so that people don't suddenly run into this problem but can plan around it.
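To make the failure concrete, here is a minimal sketch. The pattern `naiveLink` and the helper `naiveExtractUrl` are assumptions made up for illustration; this is not the actual pattern used by marked or markdown-js, just the simplest "URL is everything up to the first closing parenthesis" approach.

```javascript
// Hypothetical naive link pattern: label in [...], URL is "anything up to the first )".
// NOT taken from marked or markdown-js; it only illustrates the class of problem.
const naiveLink = /\[([^\]]*)\]\(([^)]*)\)/;

// Hypothetical helper: return the captured URL, or null when there is no link.
function naiveExtractUrl(markdown) {
  const match = naiveLink.exec(markdown);
  return match ? match[2] : null;
}

const input = "[Set](https://en.wikipedia.org/wiki/Set_(mathematics))";
// The capture stops at the first ")", so the ")" that closes "(mathematics" is lost.
console.log(naiveExtractUrl(input));
```

With this pattern the captured URL ends in `Set_(mathematics`, which is a different (and broken) address.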


1kastner · 2015-07-23T21:31:08Z

Maybe I am just not getting your point, but can't we find a regex to match them all? Instead of only matching alphanumeric symbols, we could match all symbols except spaces.

You can play around with it here: https://regex101.com/#javascript. I found (http[\S]*) to be a short and simple solution.
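As a rough check, the pattern can be run over a Markdown link to see what it picks up. The sample text and the helper name `firstHttpMatch` are made up for illustration.

```javascript
// The pattern suggested above: "http" followed by any run of non-whitespace.
const httpRun = /(http[\S]*)/;

// Hypothetical helper, named here only for this example.
function firstHttpMatch(text) {
  const match = httpRun.exec(text);
  return match ? match[1] : null;
}

const sample = "see [Set](https://en.wikipedia.org/wiki/Set_(mathematics)) for details";
// The match runs up to the next whitespace character.
console.log(firstHttpMatch(sample));
```

The match keeps the URL's own parentheses, but because it runs to the next space it also takes the `)` that closes the Markdown link, and it would stop early on a URL that contains spaces.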

Pomax · 2015-07-24T16:34:49Z

The problem is catching the parentheses. For instance:

[text1](http://moo.com/this has (a link)) and this is text ) with a ) parens
[text2](http://moo.com/this has (a (reasonable) link)) and this is text ) with a ) parens

This should match http://moo.com/this has (a link) as the first URI, and http://moo.com/this has (a (reasonable) link) as the second URI, but those paired parentheses are a problem for regex: it can't do nested pair matching, because a regex has no way to keep track of how deeply nested it currently is.
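For comparison, a non-regex scanner handles the two example lines above by counting depth. This is only a sketch under assumptions: the function names `readBalancedDestination` and `extractLinkUrl` are hypothetical, it is not how markdown-js or GitHub's parser is implemented, and it assumes every parenthesis inside the URL is paired.

```javascript
// Hypothetical helper: given the index of the "(" that opens a link destination,
// return the text up to the matching ")", or null if it is never closed.
function readBalancedDestination(text, openIndex) {
  if (text[openIndex] !== "(") return null;
  let depth = 0;
  for (let i = openIndex; i < text.length; i++) {
    const ch = text[i];
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      // Depth returns to zero exactly at the ")" that pairs with the opening "(".
      if (depth === 0) return text.slice(openIndex + 1, i);
    }
  }
  return null; // ran out of input before the opening "(" was closed
}

// Hypothetical helper: find the first "](" and read the destination that follows it.
function extractLinkUrl(line) {
  const marker = line.indexOf("](");
  return marker === -1 ? null : readBalancedDestination(line, marker + 1);
}

const line1 = "[text1](http://moo.com/this has (a link)) and this is text ) with a ) parens";
const line2 = "[text2](http://moo.com/this has (a (reasonable) link)) and this is text ) with a ) parens";
console.log(extractLinkUrl(line1));
console.log(extractLinkUrl(line2));
```

The counter is the piece a plain regex lacks: the scanner stops at the `)` that brings the depth back to zero, and the stray `)` characters later in the line are never looked at.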

Writing a pattern so that any pairs (\([...]+\))? are matched would work for a single level of pairing, but ( and ) can also occur on their own in the URL, and now you have a properly hard problem.
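A sketch of what such a pair-aware pattern could look like, and where it gives up. The pattern `oneLevelLink` and the helper `oneLevelUrl` are assumptions written for this example, not something taken from either library.

```javascript
// Hypothetical pattern: the destination is a run of non-paren characters,
// optionally interleaved with (...) groups that contain no further parentheses.
const oneLevelLink = /\[([^\]]*)\]\(((?:[^()]|\([^()]*\))*)\)/;

// Hypothetical helper: return the captured URL, or null when the pattern fails.
function oneLevelUrl(markdown) {
  const match = oneLevelLink.exec(markdown);
  return match ? match[2] : null;
}

// One level of pairing is handled.
console.log(oneLevelUrl("[text1](http://moo.com/this has (a link)) and this is text ) with a ) parens"));
// Nested pairs are not: the pattern finds no match at all.
console.log(oneLevelUrl("[text2](http://moo.com/this has (a (reasonable) link)) and this is text ) with a ) parens"));
// A lone "(" in the URL also defeats it.
console.log(oneLevelUrl("[x](http://moo.com/a(b)"));
```

Each extra level of nesting would need another hand-written layer in the pattern, and unpaired parentheses are not covered by any number of layers.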

It's definitely worth adding that simple pattern match, but the README.md would need a note on how link parsing is guaranteed to fail for certain links, and how to mitigate that: use %... codes for the ( and ) characters if you have a truly problematic URL, with either nested parentheses or a single parenthesis without its matching counterpart.
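The mitigation could be sketched like this. The helper names `escapeParens` and `toMarkdownLink` are hypothetical; `%28` and `%29` are the percent-encodings of `(` and `)`. Note that the built-in `encodeURIComponent` leaves parentheses untouched, so it does not help here.

```javascript
// Hypothetical helper: percent-encode parentheses so a regex-based parser
// sees no "(" or ")" inside the link destination.
function escapeParens(url) {
  return url.replace(/\(/g, "%28").replace(/\)/g, "%29");
}

// Hypothetical helper: build a Markdown link with an escaped destination.
function toMarkdownLink(label, url) {
  return "[" + label + "](" + escapeParens(url) + ")";
}

console.log(toMarkdownLink("Set", "https://en.wikipedia.org/wiki/Set_(mathematics)"));
```

This only works when the links are generated or pre-processed by a tool; it does nothing for Markdown that people type by hand.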

1kastner · 2015-07-24T17:18:51Z

Ok, this explains it, you can't count with regex ;-)
Well, I am just a contributing user as well, so maybe just fork it and open a pull request for the repository administrators?

Pomax · 2015-07-24T17:38:59Z

yeah, thinking about that.

HappyStinson mentioned this issue Oct 30, 2018

Links with parentheses becomes unreachable microsoft/linkcheckermd#27

Open
