3 *s cause problem #118

ggbetz · 2019-08-06T14:52:59Z

Argdown cannot handle three stars "***":

[T]: Argdown doesn't like *** stars.

christianvoigt · 2019-09-19T14:22:38Z

This is a lexing error. It also occurs with underscores and can already be reproduced with only two characters (**). The lexer interprets the first one as "italic start" and the second as "italic end". As a consequence, the parser finds no text between the two markers that could be marked as italic.

christianvoigt · 2019-09-25T12:37:24Z

This issue turns out to be trickier than expected. As a context-free LL parser, the Argdown parser will always be stricter than most hand-built regex-based Markdown parsers. This is especially true for bold and italic text ranges. Exactly emulating the behavior of these Markdown parsers would make the parser much more complex and slower.

Italic/bold start and end markers are lexed without knowing anything about the larger context. The lexer simply looks for preceding or subsequent whitespace (and line breaks and punctuation) to decide how to interpret asterisks and underscores. If an underscore is directly followed by an alphanumeric character, then it is interpreted as the start of a bold/italic range. Otherwise, if it is preceded by a non-whitespace character, it is interpreted as the end of a range. This is a naive approach that works reasonably well within the limits of a context-free lexer.
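To make the heuristic concrete, here is a minimal TypeScript sketch of the two rules from the paragraph above. It is not the actual Argdown lexer code: `classifyMarker`, `MarkerKind`, and the helper predicates are hypothetical names, and the sketch ignores the punctuation and multi-character handling that the real lexer performs (so it will not reproduce every behavior discussed in the cases below).

```typescript
// Hypothetical illustration only; not taken from the Argdown source.
type MarkerKind = "rangeStart" | "rangeEnd" | "plainText";

// Assumption: "alphanumeric" means ASCII letters and digits.
const isAlphanumeric = (ch: string | undefined): boolean =>
  ch !== undefined && /^[A-Za-z0-9]$/.test(ch);

// Assumption: the start/end of the input counts as whitespace.
const isWhitespace = (ch: string | undefined): boolean =>
  ch === undefined || /^\s$/.test(ch);

// Classify the marker character (* or _) at `index`, looking only at its
// direct neighbours and nothing else in the input.
function classifyMarker(input: string, index: number): MarkerKind {
  const prev = index > 0 ? input[index - 1] : undefined;
  const next = index + 1 < input.length ? input[index + 1] : undefined;

  // Rule 1: directly followed by an alphanumeric character -> range start.
  if (isAlphanumeric(next)) return "rangeStart";
  // Rule 2: otherwise, preceded by a non-whitespace character -> range end.
  if (!isWhitespace(prev)) return "rangeEnd";
  // Neither rule applies: treat the marker as ordinary text.
  return "plainText";
}
```

For example, in `a *b* c` the first asterisk is classified as a range start and the second as a range end, while the trailing asterisk in `Test.*` is classified as a range end because it follows a non-whitespace character.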

However, the current behavior of the parser can still be improved in some cases, while in the other cases we can at least return a more informative error message. Let's look at three different cases, and how they will be treated in the next version:

Two asterisks surrounded by whitespace

Test ** Test.

Current behavior: The first asterisk is interpreted as the start of an italic range, the second one as the end of it. Because there is no text in between that is marked as italic, the parser returns an error.

New behavior: The parser will accept bold and italic ranges without any text in between the start and end tokens.

Three asterisks/underscores surrounded by whitespace

Test *** Test.

Current behavior: The first two asterisks are interpreted as the start of a bold range. The third asterisk is treated as a normal character as it is followed by an empty space (so it is not lexed as the start of an italic range). The parser returns an uninformative error message.

Desired behavior: A Markdown parser might parse these three asterisks as "normal" text without any bold or italic ranges. To achieve that, the Argdown lexer would have to know much more about the context than it currently does. At the moment I think that this is not worth the effort.

New behavior: The lexer will still interpret this as the beginning of a bold range but the parser will return a better error message. The user will be informed that she should use backslashes to escape the asterisks if she wishes to use them as "normal" text:

Test \*\*\* Test.

This is the recommended solution for all cases in which special characters should be ignored by the parser. Another example use case is an asterisk that indicates a footnote (Test.\*), which would normally be parsed as the end of an italic range.
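If text is generated programmatically before being handed to Argdown, the escaping could be automated. The following TypeScript sketch is a hypothetical helper, not part of Argdown: `escapeEmphasisMarkers` is an invented name, and it assumes that underscores can be escaped with a backslash in the same way as asterisks. It prefixes every asterisk or underscore with a backslash unless one is already directly in front of it.

```typescript
// Hypothetical helper; not part of the Argdown API.
// Backslash-escapes every * and _ that is not already preceded by a backslash.
function escapeEmphasisMarkers(text: string): string {
  // (?<!\\) is a negative lookbehind: skip markers that follow a backslash.
  // In the replacement string, "$&" stands for the matched character.
  return text.replace(/(?<!\\)[*_]/g, "\\$&");
}
```

Applied to `Test *** Test.`, this yields `Test \*\*\* Test.`, and input that is already escaped is left unchanged. Note that it escapes all markers, so it is only suitable for text that should contain no bold or italic ranges at all.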

Four asterisks/underscores surrounded by whitespace

This case is handled like two asterisks/underscores. The only difference is that the lexer will interpret the first two characters as the start of a bold range and the last two as the end of that bold range.

For now I think this is a good compromise between keeping the lexing process as simple as possible and making it as fault-tolerant as possible.

bd82 · 2019-09-25T18:50:59Z

Hello @christianvoigt

It is theoretically possible to have a lexer that depends on Parser context with Chevrotain.
It is not officially supported, however I have experimented with this. See:

Things to note:

You would probably also have to build your own lexer as the Chevrotain based Lexer consumes the whole input at once.
This may cause issues with Error Recovery flows which assume the ability to re-sync the token stream, however if Parser context is needed to Lex that makes re-syncing more complex...

christianvoigt · 2019-10-21T21:15:50Z

Hi @bd82 thank you so much for your help, it sounds very exciting/mind-bending and I am always amazed by your level and speed of support.

Sadly, at the moment I only have time for maintenance tasks and I think the italic and bold ranges work well enough for now.

christianvoigt added the bug label Sep 13, 2019

christianvoigt mentioned this issue Sep 25, 2019

Whitespace after newline #117

Closed

christianvoigt closed this as completed in 77caffa Oct 21, 2019
