Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parentheses in links cannot be parsed by regex #248

Open
Pomax opened this issue Jul 23, 2015 · 4 comments
Open

parentheses in links cannot be parsed by regex #248

Pomax opened this issue Jul 23, 2015 · 4 comments

Comments

@Pomax
Copy link

Pomax commented Jul 23, 2015

I was looking for a new markdown library, since chjj/marked is built entirely using regex, which means it completely fails at parsing URLs with parentheses in it (wikipedia's full of those, for instance), but looking at https://github.com/evilstreak/markdown-js/blob/master/src/parser.js#L179-202 it seems like this code is also using regex for inline pattern extraction.

If that is the case, then this parser also can't deal with URLs like https://en.wikipedia.org/wiki/Set_(mathematics) (github's parser gets URIs like these right), unless people manually replace the parentheses with %.. values (which is not a realistic solution). And as long as regex are used, there is no solution to that problem, so this might be something that needs to be put in the README.md as a "warning: there are limitations to this parser. [...]" so that people don't suddenly run into this problem but can plan around it.

@1kastner
Copy link

I am not sure whether I am not just getting your point but can't we find a regex to match them all? Instead of just catching alphanumeric symbols we could decide to match all symbols expect spaces.

You can just play here: https://regex101.com/#javascript and I found this one (http[\S]*) as a short and simple solution.

@Pomax
Copy link
Author

Pomax commented Jul 24, 2015

the problem is catching the parenthesis. For instance:

[text1](http://moo.com/this has (a link)) and this is text ) with a ) parens
[text2](http://moo.com/this has (a (reasonable) link)) and this is text ) with a ) parens

This should match http://moo.com/this has (a link) as first URI, and http://moo.com/this has (a (reasonable) link) as second URI, but those paired parenthesis are a problem for regex, as they can't do nested pair matching (due to how regex works).

Writing a pattern so that any pairs (\([...]+\))? are matched would work, but ( and ) can also occur on their own in the URL, and now you have a properly hard problem.

It's definitely worth adding that simple pattern match, but the README.md would definitely need a booknote on how link parsing is guaranteed to fail for certain links, and how to mitigate that (using %... codes for the ( and ) characters if you have a truly problematic URL with either nested parens or only a single parens without its matching counterpart)

@1kastner
Copy link

Ok, this explains it, you can't count with regex ;-)
Well, I am just a contributing user as well, so maybe just fork it and put a pull request for the repository administrators?

@Pomax
Copy link
Author

Pomax commented Jul 24, 2015

yeah, thinking about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants