3 *s cause problem #118

Closed
ggbetz opened this issue Aug 6, 2019 · 4 comments

ggbetz commented Aug 6, 2019

Argdown cannot handle three stars "***":

[T]: Argdown doesn't like *** stars.

christianvoigt (Owner) commented Sep 19, 2019

This is a lexing error. It also occurs with underscores and can be reproduced with only two characters (**). The lexer interprets the first one as "italic start" and the second as "italic end". As a consequence, the parser finds no text between the two markers and reports an error.

@christianvoigt (Owner)

This issue turns out to be trickier than expected. As a context-free LL parser, the Argdown parser will always be stricter than most hand-built regex-based Markdown parsers. This is especially true for bold and italic text ranges. Exactly emulating the behavior of these Markdown parsers would make the parser much more complex and slower.

Italic/bold start and end markers are lexed without knowing anything about the larger context. The lexer simply looks at the preceding or following whitespace (and line breaks and punctuation) to decide how to interpret asterisks and underscores. If an underscore is directly followed by an alphanumeric character, then it is interpreted as the start of a bold/italic range. Otherwise, if it is preceded by a non-whitespace character, it is interpreted as the end of a range. This is a naive approach that works reasonably well within the limits of a context-free lexer.
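To make the heuristic above concrete, here is a minimal Python sketch of the rule exactly as it is worded in this comment. It is not the actual Argdown lexer (which is built on Chevrotain and also considers line breaks and punctuation); the function name `classify_marker` and the choice to treat the string boundaries as whitespace are assumptions made for illustration only.

```python
def classify_marker(text: str, i: int) -> str:
    """Classify the asterisk/underscore at text[i] by its direct neighbours.

    Hypothetical helper: it follows the two-step rule as described in prose,
    not the real Argdown token patterns.
    """
    # Assumption: the start and end of the input behave like whitespace.
    prev_char = text[i - 1] if i > 0 else " "
    next_char = text[i + 1] if i + 1 < len(text) else " "

    # Rule 1: directly followed by an alphanumeric character -> range start.
    if next_char.isalnum():
        return "start"
    # Rule 2: otherwise, preceded by a non-whitespace character -> range end.
    if not prev_char.isspace():
        return "end"
    # Neither rule applies: the character is ordinary text.
    return "text"
```

With this rule, the trailing asterisk in `Test.*` is classified as an end marker, because the function never checks whether a range was opened before it. That is the kind of context a lexer working this way does not have.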

However, the current behavior of the parser can still be improved in some cases, while in the other cases we can at least return a more informative error message. Let's look at three different cases, and how they will be treated in the next version:

Two asterisks surrounded by whitespace

Test ** Test.

Current behavior: The first asterisk is interpreted as the start of an italic range, the second one as the end of it. Because there is no text in between that is marked as italic, the parser returns an error.

New behavior: The parser will accept bold and italic ranges without any text in between the start and end tokens.

Three asterisks/underscores surrounded by whitespace

Test *** Test.

Current behavior: The first two asterisks are interpreted as the start of a bold range. The third asterisk is treated as a normal character because it is followed by a space (so it is not lexed as the start of an italic range). The parser returns an uninformative error message.

Desired behavior: A Markdown parser might parse these three asterisks as "normal" text without any bold or italic ranges. To achieve that, the Argdown lexer would have to know much more about the context than it currently does. At the moment I think that this is not worth the effort.

New behavior: The lexer will still interpret this as the beginning of a bold range but the parser will return a better error message. The user will be informed that she should use backslashes to escape the asterisks if she wishes to use them as "normal" text:

Test \*\*\* Test.

This is the recommended solution for all cases in which special characters should be ignored by the parser. Another example use case is an asterisk that indicates a footnote (Test.\*), which would normally be parsed as the end of an italic range.
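If text is generated programmatically before it is handed to Argdown, the escaping can be applied up front. The following Python sketch is only an illustration: `escape_markers` is a hypothetical helper, not part of Argdown, and it assumes the input is raw text that contains no intentional bold/italic markup and no existing backslash escapes.

```python
import re


def escape_markers(text: str) -> str:
    """Prefix every asterisk and underscore with a backslash.

    Hypothetical helper; it assumes the input holds no markup that
    should stay active and nothing that is already escaped.
    """
    # r"\\\1" expands to a literal backslash followed by the matched character.
    return re.sub(r"([*_])", r"\\\1", text)


print(escape_markers("Test *** Test."))  # prints: Test \*\*\* Test.
```

Running the helper twice on the same string would escape the backslashes' targets again, so it should be applied exactly once to unescaped input.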

Four asterisks/underscores surrounded by whitespace

This case is handled like two asterisks/underscores. The only difference is that the lexer will interpret the first two characters as the start of a bold range and the last two as the end of a bold range.

For now I think this is a good compromise that keeps the lexing process both simple and reasonably fault-tolerant.

bd82 commented Sep 25, 2019

Hello @christianvoigt

It is theoretically possible to have a lexer that depends on Parser context with Chevrotain.
It is not officially supported, however I have experimented with this. See:

Things to note:

  • You would probably also have to build your own lexer as the Chevrotain based Lexer consumes the whole input at once.

  • This may cause issues with Error Recovery flows which assume the ability to re-sync the token stream, however if Parser context is needed to Lex that makes re-syncing more complex...
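The general idea of a lexer that consults parser state can be sketched in a few lines of Python. This is not Chevrotain code and uses none of its API; the names `lex_on_demand`, `parse`, and `open_ranges` are hypothetical, and only single asterisks are handled. It merely illustrates why such a lexer has to produce tokens one at a time instead of consuming the whole input up front.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (token type, matched text)


def lex_on_demand(text: str, open_ranges: List[str]) -> Iterator[Token]:
    """Yield tokens lazily, reading parser-owned state before each decision.

    `open_ranges` is mutated by the consumer between tokens, so every
    asterisk is classified with knowledge of what is currently open.
    """
    i = 0
    while i < len(text):
        if text[i] == "*":
            if "italic" in open_ranges:
                yield ("ItalicEnd", "*")
            elif i + 1 < len(text) and text[i + 1].isalnum():
                yield ("ItalicStart", "*")
            else:
                # No open range and no following word: plain text.
                yield ("Text", "*")
            i += 1
        else:
            # Consume a run of ordinary characters up to the next asterisk.
            j = i
            while j < len(text) and text[j] != "*":
                j += 1
            yield ("Text", text[i:j])
            i = j


def parse(text: str) -> List[Token]:
    """Toy 'parser' that tracks open ranges and shares them with the lexer."""
    open_ranges: List[str] = []
    tokens: List[Token] = []
    for kind, value in lex_on_demand(text, open_ranges):
        if kind == "ItalicStart":
            open_ranges.append("italic")
        elif kind == "ItalicEnd":
            open_ranges.remove("italic")
        tokens.append((kind, value))
    return tokens
```

In this toy version, `parse("Test *** Test.")` yields the three asterisks as plain text tokens, because no range is open and none of them is followed by an alphanumeric character. The price is the coupling noted above: the token stream can no longer be computed, or re-synced after an error, independently of the parser.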

@christianvoigt (Owner)

Hi @bd82 thank you so much for your help, it sounds very exciting/mind-bending and I am always amazed by your level and speed of support.

Sadly, at the moment I only have time for maintenance tasks and I think the italic and bold ranges work well enough for now.
