Extension support in libcmark #100
Hello, I always see "extensions" mentioned in discussions of CommonMark features (for example the discussion about tables).
Does libcmark itself support actual extensions, and if so, is there a guide on how to implement one, and an index of common extensions? Or are extensions purely conceptual additions to the specification, left to individual implementations?
I'm pretty sure I could just read the code and find out but
Thanks for all your work on CommonMark / pandoc!
Interesting, thanks. However, as far as I understand, some extensions will need to define rules at the parsing stage? For example, the table extension would need to define new node types, and set state such as IN_TABLE when the beginning of a table is encountered?
Yes, that's much harder to accomplish. (Though currently
which a filter can intercept, process, and replace with raw
I kind of want to use libcmark itself as it's extremely fast. Also, for my customization needs I currently use commonmark-py, as my tool is written in Python.
The API is already documented using CommonMark! libcmark
+++ Mathieu Duponchelle [Feb 04 16 11:13 ]:
Oh, one more thing by the way: adding gobject-introspection support to libcmark would be really great! I suppose you could do so by simply encapsulating data types as "boxed types", but a full port to GObject might make even more sense. I know this kind of deviates from the original subject, but if you're interested I can open a separate issue?
Sorry, I don't know what gobject is.
+++ Mathieu Duponchelle [Feb 04 16 11:27 ]:
You don't know GObject? We must be living on two very separate islands of the open-source archipelago!
You seem to be writing somewhat object-oriented C in this project (with new constructors and what not).
GObject is the most standard implementation of the object concept in C, I would strongly advise giving it a good look :)
gobject-introspection generates "gir files" based on C source code and annotated comments in the source. These files are then read at runtime by projects like pygobject, which makes the API transparently available from Python (with some caveats: for example, it's obviously impossible to "call" function macros from Python, but you can still write manual "overrides" files if you really want to).
I'd really love having access to the full extent of the CommonMark API from python, instead of the very limited wrapper that currently exists (cmarkpy only exposes "markdown_to_html"), and this solution is the best one I know from a maintenance perspective.
+++ Mathieu Duponchelle [Feb 04 16 13:04 ]:
That's undoubtedly true. I'm mostly a Haskell programmer;
It should be trivial to make the whole API available from
Alternatively you could probably use swig to generate
Understandable. GObject and GLib are very established projects though, known to build and work on all the platforms I can think of and then some.
Of course it isn't, but experience taught me that this solution was the superior one :)
I'll certainly clone the library and have a better look at its implementation in the near future, for now I won't mind if you close that issue, thanks for the feedback !
Reopening after reading the code while working on #103. I think I understand the parsing process a bit better now, and I'd be interested in your opinion on the best way to add "hooks" for extensions to plug into during the parsing process.
I think the first step needed would be to document the block parsing algorithm, and the cmark_parser structure.
My "high-level" understanding of the algorithm is as follows:
I might very well be wrong in some of my interpretations of the code, but I can kind of see how this process produces a valid AST in the end :)
However, the code in there still needs a lot of cleanup and refactoring to make it easily legible, for example:
I am also not really clear on how the backtracking works exactly.
I currently have a limited intuition of how the block parsing part of the process could be extended (I still haven't looked at either inlines or rendering). Basically, the extension would "register" a new block type, then be called during both the first and second parts of the process to determine whether the processed line could match a block type defined by the extension, and then whether this new block could be contained by the current container. The extension should also provide a way to make the check the other way around, i.e. whether a new block can be a child of its block type.
Am I correct in some of my assumptions here? When I manage to understand the process completely, and the proposed refactoring and documentation work is done, I'll be happy to help with designing and implementing an extension interface :)
There's a bit of high-level documentation of the parsing strategy here:
I'd certainly be receptive to PRs that improve readability, e.g. by making functions for some of the complicated tests. (As long as they don't significantly affect performance.)
Yep, that's pretty much what I tried to do, except better :)
My suggestions for now would be:
I may not find time to do this soon, but it really should be done in order to improve the overall maintainability of this specific chunk of code :)
Hello, so as promised in #104, I have implemented a first draft of "block parsing extensions".
Important disclaimer: the implementation linked below is not intended to be final, which is why I didn't make a pull request. This is initial work to help me figure out whether my understanding of the block parsing process was adequate, discover design issues that should be solved, and provide us with some talking points.
I have provided an example extension that parses simple piped tables.
To test it out, check out my branch, build it with the usual commands, create a file named "foo.md" containing:
You should see an html table, woohoo :)
Note that after writing all this old-school C, I would strongly recommend using GLib in libcmark. My current implementation is full of shortcuts which can and will fail given the chance; I did not want to distract myself while developing this draft, and if we agree on using it, GLib provides plenty of helper functions to solve these for us.
I will mention these shortcuts throughout my self-review.
The current implementation certainly leaks a little too, by the way.
I'm now going to go over the commits and detail the issues I've encountered and how they are currently solved. Please comment here and not on the commits themselves, so we can keep a trace of the process after future rebases :)
Extensions: first draft (MathieuDuponchelle@70fa8a0)
Making cmark link with "CMAKE_DL_LIBS"
Extensions need to be wrapped as plugins, as we want third parties to be able to implement their own extensions once an API is decided. I dumbed down http://eli.thegreenplace.net/2012/08/24/plugins-in-c for this purpose; ideally I'd like to use https://developer.gnome.org/glib/stable/glib-Dynamic-Loading-of-Modules.html instead.
I define LIBDIR at compile time in order to have a place to look for "standard extensions". In the future, the cmark executable could expose a command-line argument to discover plugins in other places, and "discover_plugins" could also look in a path defined by an environment variable.
Some fiddling will have to be done for the statically linked cmark I suppose.
Exposing internals for use by the plugins
I've made quite a mess in there; the symbols I exposed are:
Node type approach
The issue here is that all the standard block types have specific
I think a better solution would be along the lines of the GLib type system (https://developer.gnome.org/gobject/unstable/gobject-Type-Information.html), which permits dynamic registration of types at run time. However, I don't think we would want full-fledged objects for each block, and I'm not sure registering block types as "fundamental types" would make sense either, so that's still a pretty open question on my end.
For now, the extensions are discovered and stored in a simple list; a GHashTable would be more appropriate, but it's not really critical.
The user can then attach extensions to a given parser with "cmark_parser_register_extension". The name is probably off, but the approach seems pretty sane to me.
I've added some command-line switches to the cmark executable itself. Once again, naming might be off, but the solution is pretty straightforward: one can list extensions and define zero or more extensions to be used for a given run.
In the future, I think a nice feature would be to let CommonMark documents define which "extensions" they need to be parsed correctly. I think there's some discussion about this on talk.commonmark.org, but I haven't given the issue more thought.
The extension structure itself
While making blocks GObjects would IMO be pretty bad performance-wise, having extensions (and other object-like structures) derive from GObject would be my natural inclination if it were available, as I exposed what amounts to virtual methods on them.
I've documented the header file for the extensions at https://github.com/MathieuDuponchelle/cmark/blob/extensions/src/extension.h , and while incomplete (a virtual method letting the extension inform cmark whether a block can "accept lines" would be needed), I think the interface is OKish.
I pass the parent_container to these functions to allow for "vertical rules" a la setext headings; the table extension makes use of it to requalify a paragraph as a table header. It is not enough to let the extension create new blocks; it also needs to be able to modify previously opened ones.
It is for now up to the extension to advance the offset of the parser; I don't know whether this is too low-level or not.
My approach is to define a single
The issue I can think of is that for now, the internal state of the renderer is not taken into account at all. I don't have a good example of where this could be a problem yet, but a solution would be to add an extra argument to that function, carrying that state over somehow.
I tried to keep it as small as possible; the responsible extension and custom_data pointers seem mandatory to me.
This is pretty much all I can think of for now
Extensions: implement an example "core extension" (MathieuDuponchelle@ea58c70)
The extension is responsible for creating its own scanner. I used re2c for that purpose as it was available; nothing should prevent an implementation from using flex or anything else as a scanner.
I'm not particularly good at writing regular expressions, and the ones I did write are certainly full of holes, not really the issue here but feel free to correct me :)
As exposed in "Node type approach", the code is full of ugly strcmp calls to implement the "grammar" of a piped table (tables must start with a header, rows can contain cells, etc.).
Other than that, not much to say. I think the size of the code is not unreasonable, and there aren't that many low-level internals exposed.
Just a couple of notes and opinions from my side.
The code for extensions should live in separate repositories, not within the cmark repository. This requires that extensions use their own build system.
Loading extensions via
Most people embedding libcmark in other projects will be strongly against a runtime dependency like GLib (me included), especially regarding portability. There's really no need for that. Extensions can simply be based on a table of function pointers like in your draft. There's also no need for a hash table to store extensions. A linked list is perfectly fine for a small number of entries (maybe 1-3 in the typical case). Extensions should add no overhead or dependencies for people who don't need them.
The string APIs (buffer and maybe chunk) should probably be exposed to extension authors. ("Chunks" are basically an immutable string type that can point to string buffers it doesn't own.)
The hardest problem is how to integrate extensions with the block parsing algorithm. I don't know enough about these internals, so I can't comment on that.
If you make internal functions like
This reduces the size of diffs considerably.
Possibly. I made this to ease testing, but I have no strong opinions either way. A project like GStreamer does ship some "core plugins" (https://cgit.freedesktop.org/gstreamer/gstreamer/tree/plugins), for example, but the difference is pretty clear-cut there, as a "core plugin" has to be "media-agnostic" and high quality to be included. It does provide an official example of how to implement a plugin.
I'm aware of that :) Note that https://developer.gnome.org/glib/stable/glib-Dynamic-Loading-of-Modules.html would take care of portability.
GLib is extremely portable. With respect to "embedding", bundling software is a poor practice and shouldn't be a criterion here. Depending on GLib would make the project much more maintainable, reducing the amount of code and thus the surface for bugs.
Can you please be specific about your statement on portability?
I don't agree with that: allowing extensions to specify which nodes the block types they define can and cannot contain seems extremely important to me. We want users of the node API to be prevented from adding "tables" to "table cells", for example.
I have worked on #103 and #104 to gain some insight into the parsing algorithm, as this was indeed the actually hard problem to solve. I think my current solution is pretty adequate; a good exercise to ascertain that would be to try to implement all the core block types as extensions.
Indeed, that was dumb of me, I'll rework the patch tomorrow :)
Thanks for the feedback, cheers
I'm pretty swamped at the moment, so I only had time for a brief glance, but this looks promising to me!
I agree with @nwellnhof about avoiding a glib dependency.
One thing I'm not sure about is the inclusion of "rendering" in an extension.
So be it, even though more specific reasons would be nice :) This means we'll have to figure out and implement a system to make dynamic loading work on Windows too.
That was the "rendering" section of my admittedly long summary earlier. There aren't really that many ways out of this. The solution I adopted is the pragmatic one, which is:
Note that I only implemented the call to the render_block virtual method in the html renderer for now, but all the renderers should indeed implement this as well.
I chose this solution because it is the most flexible one. The only other solution would be to have something like pandoc, where for example CommonMark's core would define block types for a table, even though it recognizes no syntax to create one "natively". Extensions would then only have to create "conforming" tables, and the renderers would know how to render them natively.
The advantages of this approach are that:
The disadvantages are:
One could imagine a mix of these two solutions. At that point I think it needs to be your call, as it's a pretty fundamental design decision which will be important for the future of this API.
Another thing I'd like to mention is that I did this initial work in part just to get the ball rolling, as this is something I consider very important to have as soon as possible. If this triggers someone's interest, by all means grab my branch and continue the work; just let me know about your branch here so I can cherry-pick stuff / rebase on it.
@nwellnhof , I'm not exactly sure how best to go about the chunk API, to be honest. It is entirely composed of "forced-inline" functions, and I have no idea whether exposing such functions as API is good practice. Any clues?
Regarding strbuf, I've made some preliminary work to expose it, if you feel like having a look I've got a branch at https://github.com/MathieuDuponchelle/cmark/tree/expose_new_API
On 27/02/2016 05:52, Mathieu Duponchelle wrote:
Yes, exposing these inline functions would be bad practice. Not the inline
The major problem with the chunk API is that it makes no guarantees how long
See my inline comments (mostly documentation).
We'll also have to add a real constructor and destructor for API users.
@nwellnhof I addressed your review in separate commits on that branch to preserve your comments. Let me know if you're happy with the changes and I'll submit a pull request with a fixed-up commit; I'd rather have such changes merged ASAP to minimize the chance of conflicts.
Thanks for your time, I'll answer your comments here a bit later.
@nwellnhof I've just looked back at the extension, particularly _scan_at, which is a strict copy-paste from src/scanners.re, and indeed there is no reason to pass a chunk to the extension scanner at all. Note that this is also the case in src/scanners.re, but it doesn't really matter there.
I'm toying with the idea of cleaning up my patch completely (apart from the plugin loading abstraction), then trying to reimplement all the block parsing in core as "extensions". This would finish validating the API and would be a conceptually satisfying thing to do, but I'm unsure about the performance impact it may have. We'll see :)
I pushed a bunch of commits:
The two issues that I still see are plugin loading on Windows, which isn't a design issue but needs to be done, and, more importantly, the decision of whether the current design, where extensions implement and render their own block types, is the correct one.
The more I think about it, the more I feel that the alternative design, where libcmark would define block types (CMARK_NODE_TABLE, CMARK_NODE_WHATEVER) even though there aren't (yet) any standard syntax rules in the specification that can lead to their creation, would be the superior one.
The major issue with my current design is really the interaction of multiple extensions. I can imagine all sorts of issues arising from the fact that an extension is unable to tell whether a node type it implements can contain, or be contained by, a node type implemented by another extension. After thinking long and hard about it, I honestly think there's no general solution to this issue, and I'm not interested in a solution that "mostly works", as such solutions are already achievable by parsing and modifying the AST in a second phase.
I'll wait for feedback about this for the while :)
I haven't looked at your table parsing strategy in detail,
Two worries about this: (a) it limits extensions to things that can be
Suppose we introduce the notion of a "regular block element" and
If a block is not a regular block container, then the extension
Hey, I pushed a third branch that follows the alternative design I described here:
I think this approach should be considered very seriously; the drastically simplified implementation is, to me, a sign of things being right.
Conceptually, this makes syntax extensions (I renamed the structure) simple "syntax rule providers", which allows us to continue strictly enforcing a correct node hierarchy in the output.
I must admit I'm not sure I fully understood your solution. From what I understand, it still leaves room for "impossible things" to happen: generic classes of containers cannot be guaranteed to always be containable in one another. I must also admit I'm now a bit tired and may need more detailed examples to see the light. Also note that "syntax rule providers" can always resort to creating custom blocks, which already exist and, as far as I understand, overlap pretty significantly with what you're proposing; maybe a way forward could be to extend their API, but I'm not convinced.
The advantage of the approach I proposed is that while the specification of the input CommonMark format is (rightfully so) a slow process, with the goal of avoiding "syntax debt" as much as possible, there is no such limitation with respect to the output of libcmark: the formats it outputs are well-defined, and as such we shouldn't be overly afraid of "committing" ourselves here; there aren't 50 correct ways to output an HTML table.
I initially also perceived the idea of defining node types with no standard syntax rules for creating them as awkward, but then I looked at it from another angle: if you consider these node types as "nice things we will want to have in the future", then they're not awkward, as they match the specification process, which is to state a desired output and then work on the best syntax to produce it. They can actually be a valuable help in the process of defining the best syntax without committing to something immediately.
I'll look at the table parsing strategy link you shared, thanks. Note that, as I said initially, my goal with this extension for now is simply to validate the API, not to figure out the perfect syntax / parsing strategy :)
@jgm , I came up with a silly idea for the table parsing problem exposed at http://talk.commonmark.org/t/parsing-strategy-for-tables/2027/5 ; I'm pretty curious to know what you think of it :)
I'm adding inline syntax extensions along the same lines, my test case will be strikethrough with:
This will allow me to validate the interface with the most complex aspect of inline parsing, the delimiter stack logic.
My intention for the short term is to bundle (I know :) my branch of libcmark in https://github.com/hotdoc/hotdoc . I will extend inline parsing to support syntax which by definition shouldn't make it into the specification, gtk-doc's syntax; this should help validate the API further. I of course don't want to bundle a fork of cmark in the long term, that would be silly :)
And I just pushed a new set of commits to implement initial inline parsing extensibility, still on https://github.com/MathieuDuponchelle/cmark/commits/extensions_draft_3. Some other methods of the "subject" will need to be exposed for full extension functionality, but the current state is sufficient to parse strikethrough tags in an appropriate way (through the delimiter stack process).
ping :) I've now implemented the "gtk-doc" extension, a mix of inline and block extensions; you can check out the code at https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c . I think it shows the interface offered is pretty adequate, as it allowed me to implement a pretty complex case.
No worries, I know there's quite a lot of code in that diff, but that's also what makes me want to go ahead with this; I don't like resolving conflicts :) Fortunately we already merged the refactoring, which was bound to conflict a lot. By the way, my branch is rebased on latest master; it wasn't too difficult ^^