Parse inline markup during the parse phase (use recursive descent parser) #61

Open
mojavelinux opened this issue Jan 8, 2013 · 32 comments

Member

@mojavelinux mojavelinux commented Jan 8, 2013

Currently, the parsing of inline content happens during rendering. This limits the information you have in the document after parsing. Instead, Asciidoctor should parse all text extents into inline nodes during parsing.

This requires moving the substitutions from the rendering phase to the parsing phase.

It also means that each line in the buffer will become an array of inline nodes that represent the chunked text, toggling between plain text and elements like links and images.
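To make the "array of inline nodes" idea concrete, here is a minimal sketch in Python. The `Text` and `Link` classes, the `chunk_line` function, and the link pattern are all hypothetical names and simplifications made up for illustration; none of this is Asciidoctor's actual API or regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class Text:
    value: str


@dataclass
class Link:
    target: str
    text: str


# Hypothetical pattern: a bare URL optionally followed by [label].
LINK_RX = re.compile(r"(https?://[^\s\[\]]+)(?:\[([^\]]*)\])?")


def chunk_line(line):
    """Split one line into Text and Link nodes, in source order."""
    nodes = []
    pos = 0
    for m in LINK_RX.finditer(line):
        if m.start() > pos:
            # Plain text between the previous element and this link.
            nodes.append(Text(line[pos:m.start()]))
        target = m.group(1)
        # Fall back to the URL itself when the label is missing or empty.
        nodes.append(Link(target, m.group(2) or target))
        pos = m.end()
    if pos < len(line):
        nodes.append(Text(line[pos:]))
    return nodes
```

Calling `chunk_line("See https://asciidoctor.org[the site] for docs.")` yields a `Text`, a `Link`, and another `Text`, which is the toggling between plain text and elements described above.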

Since this has the potential to slow down a single pass parse/render, we may want to have a flag which controls the phase in which inline text is parsed.

Member Author

@mojavelinux mojavelinux commented Jan 8, 2013

Btw, this will not only make it easier to index the document, it will also make Asciidoctor capable of being its own API for a syntax highlighting engine. No need for a project like CodeRay to parse AsciiDoc. Instead, it can use Asciidoctor to get the structure down to the level of inline nodes and then format them accordingly. This is an alternative way to get output instead of using backend (render) templates.
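As a rough illustration of formatting inline nodes without going through backend templates, the sketch below walks a flat list of nodes and wraps each non-text node in an ANSI escape sequence. The `(kind, value)` tuple shape, the node kinds, and the `highlight` function are assumptions made for this example; they are not part of Asciidoctor.

```python
# Hypothetical mapping from node kind to an ANSI style code.
ANSI = {
    "strong": "\033[1m",    # bold
    "emphasis": "\033[3m",  # italic
    "link": "\033[4m",      # underline
}
RESET = "\033[0m"


def highlight(nodes):
    """Render a flat list of (kind, value) inline nodes as ANSI-styled text."""
    out = []
    for kind, value in nodes:
        if kind == "text":
            out.append(value)
        else:
            out.append(f"{ANSI[kind]}{value}{RESET}")
    return "".join(out)
```

A consumer such as a syntax highlighter would only need the node list, not the rendered output.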

@mojavelinux mojavelinux added this to the v2.0.0 milestone Jul 16, 2014
@mojavelinux mojavelinux self-assigned this Jul 16, 2014
@mojavelinux mojavelinux changed the title from "Parse inline markup during the parse phase" to "Parse inline markup during the parse phase (use recursive decent parser)" Jul 31, 2014
Member Author

@mojavelinux mojavelinux commented Sep 23, 2014

I think the strategy to take here is to start developing an inline parser in Asciidoctor alongside the existing streaming transformer. Once it's fully fleshed out, we can switch to it. But the benefit is that you start to get something to use that at least hits the major syntax sooner rather than later. In other words, we can roll it out gradually. I envision the inline parser to be something you can call on a given block. Keep in mind that not all blocks in Asciidoctor have parsed text, or the text is parsed differently, so it makes sense that it's available as an API (at least in the near term) on the node.

Member Author

@mojavelinux mojavelinux commented Nov 18, 2014

Use case to test for proper parsing once we switch over.

Use `/*` and `*/` for multiline comments.

@benignbala benignbala commented Oct 17, 2015

Should I have a go at this using ANTLR4 (https://rubygems.org/gems/antlr4/versions/0.9.2)? Thanks


@benignbala benignbala commented Oct 17, 2015

Or rather, shall we try it with ANTLR4 in Java in the asciidoctorj project?

Member Author

@mojavelinux mojavelinux commented Oct 17, 2015

There's a project set up with ANTLR for this experimentation. See https://github.com/asciidoctor/asciidoc-grammar-prototype

Member Author

@mojavelinux mojavelinux commented Feb 25, 2017

Test case:

The following should not be matched as emphasis (i.e., italics):

*_id* word_

@elextr elextr commented Feb 27, 2017

I provided advice to another project that was trying to develop an Asciidoc implementation in C++ for an embedded project. That project has now been dropped (the embedded project and therefore the Asciidoc implementation 😞 ), but here are some lessons for future implementers of this parser.

Using the traditional parser tools is problematic: they all assume context-free grammars, and Asciidoc is definitely not a CFG. Even the tools' various ways of handling contextual dependence did not seem sufficient for this case (YMMV, but don't waste too much time on it; so far nobody has made it work). The example in the post above is a simple context dependence: parsing the _id as italic should stop if it hits the end of any quotes it is nested in, and that's the context it depends on.
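The nesting rule can be sketched with a small hand-written recursive descent parser. This is a toy under stated assumptions: it only knows `*` (strong) and `_` (emphasis), ignores AsciiDoc's word-boundary rules for constrained quotes, and uses made-up names (`parse_inline`, `_parse_span`) and a made-up tuple node shape. An opening mark becomes a span only if its closer is found before any enclosing span closes; otherwise it is kept as literal text.

```python
MARKS = {"*": "strong", "_": "emphasis"}


def _flush(buf, nodes):
    """Move buffered characters into a text node."""
    if buf:
        nodes.append(("text", "".join(buf)))
        buf.clear()


def _parse_span(text, pos, open_marks):
    """Parse from pos until the innermost open mark closes.

    Returns (nodes, next_pos), or None if an enclosing mark closes first
    or the input ends while a mark is still open.
    """
    nodes, buf = [], []
    while pos < len(text):
        ch = text[pos]
        if open_marks and ch == open_marks[-1]:
            _flush(buf, nodes)
            return nodes, pos + 1        # innermost span closes here
        if ch in open_marks:
            return None                  # an outer span closes first: give up
        if ch in MARKS:
            result = _parse_span(text, pos + 1, open_marks + (ch,))
            if result is not None:
                children, pos = result
                _flush(buf, nodes)
                nodes.append((MARKS[ch], children))
                continue
        buf.append(ch)                   # literal character (or failed opener)
        pos += 1
    if open_marks:
        return None                      # unterminated span
    _flush(buf, nodes)
    return nodes, pos


def parse_inline(text):
    nodes, _ = _parse_span(text, 0, ())
    return nodes
```

With this sketch, `*_id* word_` parses as a strong span containing the literal text `_id`, followed by the literal text ` word_`, so no emphasis is produced. The backtracking here is naive and could be slow on adversarial input; a real implementation would need to bound it.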

The big context dependence is of course the different parsing between blocks (and changing that with the subs= option, since it changes the context dependence dynamically during parsing). So you have to recognise the block type or subs option early and apply it to the appropriate part of the input, even before you have parsed that input.

An alternative that was explored was to parse everything and then revert the parses that were not required back to text. Unfortunately, what is parsed affects how other things parse. For example, if quotes are not parsed because the opening is inside a macro target and the ending is outside, that will change if it is subsequently discovered that macros are not in the subs= list, e.g.:

[subs=-macros]
the italicised part of http:xxx__yyy[]__ frobnicates the foo bar
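The interaction can be reproduced with a toy single-pass scanner. Everything here is an assumption for illustration: the two regular expressions are simplified stand-ins, macros are given priority over quotes at each position, and this does not reflect the order in which Asciidoctor actually applies substitutions.

```python
import re

# Toy patterns, not Asciidoctor's real ones.
MACRO_RX = re.compile(r"http:([^\s\[]*)\[([^\]]*)\]")
EM_RX = re.compile(r"__(.+?)__")


def parse(text, subs):
    """Scan left to right, trying a macro first and then emphasis at each position."""
    nodes, buf, pos = [], [], 0

    def flush():
        if buf:
            nodes.append(("text", "".join(buf)))
            buf.clear()

    while pos < len(text):
        m = MACRO_RX.match(text, pos) if "macros" in subs else None
        kind = "macro"
        if m is None and "quotes" in subs:
            m = EM_RX.match(text, pos)
            kind = "emphasis"
        if m:
            flush()
            nodes.append((kind, m.group(1)))
            pos = m.end()
        else:
            buf.append(text[pos])
            pos += 1
    flush()
    return nodes
```

With macros enabled, the first `__` is swallowed by the macro target and no emphasis is found. With macros disabled, `__yyy[]__` becomes emphasis. Turning the macro node from the first parse back into text would not recover that emphasis, which is why the parse-then-revert approach falls short.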

It's not entirely bad news: the top-level structure of a document (sections, blocks, lists, etc.) is a CFG and defines most of the context for parsing the lower levels. So it seems possible to use a two-pass approach where the structure and attribute lists are parsed first, and then the block contents are parsed based on the context they specify.
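A minimal sketch of the two-pass idea follows. It assumes blocks are separated by blank lines, that only a leading `[subs=...]` attribute line matters, and that a value is either a list of names or a list of `-name` removals. The function names (`resolve_subs`, `parse_structure`, `parse_inline`) and the dict shape are hypothetical, and the inline parser is passed in as a callable so the sketch stays independent of any particular inline grammar.

```python
DEFAULT_SUBS = ("specialcharacters", "quotes", "attributes",
                "replacements", "macros", "post_replacements")


def resolve_subs(spec):
    """Turn a subs value such as '-macros' or 'quotes,macros' into a set."""
    names = [s.strip() for s in spec.strip('"').split(",") if s.strip()]
    if all(n.startswith("-") for n in names):
        # Only removals: start from the defaults and subtract.
        return set(DEFAULT_SUBS) - {n[1:] for n in names}
    return set(names)


def parse_structure(source):
    """Pass 1: split into blocks and record each block's subs context."""
    blocks = []
    for chunk in source.strip().split("\n\n"):
        lines = chunk.split("\n")
        subs = set(DEFAULT_SUBS)
        if lines[0].startswith("[subs=") and lines[0].endswith("]"):
            subs = resolve_subs(lines[0][len("[subs="):-1])
            lines = lines[1:]
        blocks.append({"subs": subs, "lines": lines})
    return blocks


def parse_inline(block, inline_parser):
    """Pass 2: parse each line using the context fixed in pass 1."""
    return [inline_parser(line, block["subs"]) for line in block["lines"]]
```

Because the subs set is known before any inline text is examined, the inline parser never has to guess or revert.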

That's where it was at when things stopped. Hope this will help future implementers avoid many of the same issues (I'm sure there are plenty still left for them to find).

Member Author

@mojavelinux mojavelinux commented Feb 27, 2017

Thanks for this input. I'll study it in detail.

Just to share the idea in my head, I am definitely not planning to implement it as a single parser / grammar. My idea was to do the selective parsing I am doing now, which is working fantastic, but switch to the parser / grammar for inline stuff where the context is fixed. The problem right now is only there. The rest of AsciiDoc is actually easy to parse using a line-based approach.

I haven't validated that idea yet, but from my experience, I'm very confident it's going to work.

Member Author

@mojavelinux mojavelinux commented Mar 15, 2018

On second look, I understand better what you're pointing out. Yes, I agree there are going to be differences in the edge cases. These changes in the syntax are going to have to happen if we expect to move past this problem. We have to evolve.


@elextr elextr commented Mar 16, 2018

Easy. You just use a different PEG for the two cases.

Let's see, six binary options for subs=, that's 64 combinations, so 64 parsers. I suppose that is manageable. There might be some sharing of common parts as well.
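The count is simply 2^6. As a quick check, the sketch below enumerates every on/off combination. The six names are an assumption (they are taken to be the substitutions in the normal group); the comment above only says "six binary options".

```python
from itertools import combinations

# Assumed names for the six substitutions; the thread only gives the count.
SUBS = ("specialcharacters", "quotes", "attributes",
        "replacements", "macros", "post_replacements")


def all_subs_combinations(subs=SUBS):
    """Return every subset of subs, from the empty set up to all of them."""
    return [frozenset(c)
            for r in range(len(subs) + 1)
            for c in combinations(subs, r)]
```

Each subset would correspond to one parser variant in the scheme described above.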

Not necessarily a traditional thing to do, but what parsing AsciiDoc calls for.

Indeed, Asciidoc is not a context free grammar, so traditional context free techniques should not be expected to work.

It's good to see that you understand that a simple PEG parser isn't the solution by itself. I guess I harp on it a bit because many previous posts on the topic by various people seem to suggest it's all you need, and it would be sad to see people wasting their time pursuing that path.


@s-leroux s-leroux commented Mar 16, 2018

@mojavelinux, @elextr Great to hear things are evolving concerning this issue.

> These changes in the syntax are going to have to happen if we expect to move past this problem. We have to evolve.

If I understand it well, we may have to expect syntax change in AsciiDoctor source document before we can see inline nodes parsed into the AST? Do you already have some ideas in mind? Or is there already another issue to discuss those changes?

Member Author

@mojavelinux mojavelinux commented Oct 28, 2018

> If I understand it well, we may have to expect syntax change in AsciiDoctor source document before we can see inline nodes parsed into the AST?

The goal is to minimize the impact as much as possible, but there may be slight differences at the edge cases. We won't really know until we get it fully working.

Member Author

@mojavelinux mojavelinux commented Oct 28, 2018

For those following along, @Mogztter has been working on an inline parser to put theory into practice, tease out unknowns, and move the conversation forward. I still owe him feedback, but perhaps others are interested in weighing in.

See https://github.com/Mogztter/asciidoctor-inline-parser


@vmassol vmassol commented Oct 28, 2018

Thanks for the heads up @mojavelinux

FTR, on the XWiki side we use Java, so we would need a Java version to try it out, provide feedback, or help.

Have a great weekend.


@laubai laubai commented Jan 10, 2019

Found my way here from #1678. We ran into this, and #2600, but while that one is fixed in the current version, this one is not yet, so wanted to provide some relevant test cases.

Hope this helps, apologies for noise if it doesn't.

master_adoc.txt

[EDIT] I should also note that it doesn't appear to matter whether underscores are paired or not - it's the dash that seems to be the issue.


@bripmccann bripmccann commented Apr 17, 2019

Typo in the title here ('descent' misspelled as 'decent')

9 participants