Using kramdown to mark up a digitized book, I've been converting the OCR-generated asterisks to underscores to improve the code's readability.
The :smart_quotes substitution algorithm breaks around text with underscores. For example:
renders both lines with closing quotes at both ends of the word. Asterisk markup works correctly, as does explicit html markup.
The handling of quotes is correct as per the kramdown spec. Other libraries like rubypants do the same as kramdown when given your test cases.
I recommend putting the underscores outside of the quotes.
So when a quotation starts with an italic word, the italics should overlap the front quote? Not how typesetting works.
Just curious: why does it work correctly for single asterisks, but not for underscores?
"_Single underscores_ do not work."
"*Single asterisks* work as they should."
'__Double underscores__ do not work.'
'**Double underscores** do not work.'
That's a good question! It should definitely work the same. rubypants shows the same, incorrect behavior...
Will have a look at it! And sorry for my dismissive behavior, should have investigated a bit more pedantically!
It seems like a little edge case unless you've actually had first-hand experience trying to write a book in it, I know :)
I've already started writing a new parser, with specs and discrete acceptance tests for specified features.
Do you have any particular technical goals for kramdown, such as speed or ability to handle very large input files, or is accuracy of all the edge cases your main goal? Would it be a problem if I were to write a comprehensive suite of explicit unit and integration specs, rather than the large suite of fixtures you use now?
The original goal was twofold:
Since the internal structure of the parsers and the converters doesn't really matter, I thought it would be better to provide test inputs and their respective outputs to ensure that changing the code doesn't invalidate the spec. Therefore the large suite of test cases.
You can use whatever you like and what seems best for you for testing your parser :)
I'm just taking a little while to tease apart all the integration tests you have here into individual specs. A few "small changes" to the current parser broke a surprising variety of them in a lot of uninterpretable ways, so single-responsibility tests will help me step carefully.
Could you say a few words about the overall architecture? My problem visualizing the whole is that it feels as though each parser is applied in the order they appear in the @block_parsers (and then @span_parsers) list. Is that the case? They're being serially applied in that order to each component, recursively over each branch found?
So as a side effect of this serial ordering, I don't need to worry (for architectural reasons) about accidentally applying a smart_quotes parser to an ERB block? Because the "only things left" to parse will be "extra raw" text that should have the quotes curled, dashes replaced and so forth?
The kramdown parser does its thing in multiple steps:
The block and span parsers are applied in a serial kind of way, in the order they appear in @block_parser and @span_parsers.
Each parser defines a regexp that is used to identify the block or span level element. Since some parsers use similar regexps (e.g. setext header and paragraphs, both need a line of text) the order of the parsers is important. Also, by arranging the parsers that are not order-dependent in a most-commonly-used way it is possible to squeeze some more speed out of the kramdown parser.
In case of the block parsers, they are just tested on the current position of the document in the order they appear and the first one wins.
In case of the span parsers, the start regexps for all spans parsers are stitched together into one huge regexp. A match for this regexp is searched for and then the span parsers are tested in the order they appear in the @span_parsers array. This is done to avoid scanning each and every character.
Regarding an ERB block parser: This depends on how you create the element. If you create a custom :erb element with the ERB text in the #value accessor, you have nothing to worry about. Text will only be parsed by the span level parser if you create an element of type :raw_text with the ERB text!
So, I have found a way to make your examples work without breaking my test suite :) So your presented use cases will work in the next release.
Thank you! I worked on it for quite a while and was elbow-deep into a full parser rewrite before I got stuck. So I'm glad you found an incremental solution.
Do you have the sense that, especially as Markdown flavors get more popular, the ability to extend and revise the pattern-matching is starting to trump the speed of the implementations?
Not really, no. There are also some fast implementations that are not regexp-based like the original Perl Markdown from Gruber but which are based on a PEG grammar that can more easily be modified.
I also have to admit that I don't immediately recognize what all the code of the kramdown parser does :) It takes me some time to understand some really complex parts (should have add more comments).