Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Incorrect smart_quote handling around underscore markup #57

Closed
Vaguery opened this Issue · 10 comments

2 participants

@Vaguery

Using kramdown to mark up a digitized book, I've been converting the OCR-generated asterisks to underscores to improve the code's readability.

The :smart_quotes substitution algorithm breaks around text with underscores. For example:

"_Hurrah!_"

'__Absolutely__.'

renders both lines with closing quotes at both ends of the word. Asterisk markup works correctly, as does explicit html markup.

@gettalong gettalong was assigned
@gettalong
Owner

The handling of quotes is correct as per the kramdown spec. Other libraries like rubypants do the same as kramdown when given your test cases.

I recommend putting the underscores outside of the quotes.

@gettalong gettalong closed this
@Vaguery

So when a quotation starts with an italic word, the italics should overlap the front quote? Not how typesetting works.

Just curious: why does it work correctly for single asterisks, but not for underscores?

"_Single underscores_ do not work."

"*Single asterisks* work as they should."

'__Double underscores__ do not work.'

'**Double underscores** do not work.'
@gettalong
Owner

That's a good question! It should definitely work the same. rubypants shows the same, incorrect behavior...

Will have a look at it! And sorry for my dismissive behavior, should have investigated a bit more pedantically!

@gettalong gettalong reopened this
@Vaguery

It seems like a little edge case unless you've actually had first-hand experience trying to write a book in it, I know :)

I've already started writing a new parser, with specs and discrete acceptance tests for specified features.

Do you have any particular technical goals for kramdown, such as speed or ability to handle very large input files, or is accuracy of all the edge cases your main goal? Would it be a problem if I were to write a comprehensive suite of explicit unit and integration specs, rather than the large suite of fixtures you use now?

@gettalong
Owner

The original goal was twofold:

  • to create a faster pure Ruby Markdown parser than Maruku and
  • to provide a syntax spec for the kramdown specific input so that the edge cases are all documented

Since the internal structure of the parsers and the converters doesn't really matter, I thought it would be better to provide test inputs and their respective outputs to ensure that changing the code doesn't invalidate the spec. Therefore the large suite of test cases.

You can use whatever you like and what seems best for you for testing your parser :)

@Vaguery

I'm just taking a little while to tease apart all the integration tests you have here into individual specs. A few "small changes" to the current parser broke a surprising variety of them in a lot of uninterpretable ways, so single-responsibility tests will help me step carefully.

Could you say a few words about the overall architecture? My problem visualizing the whole is that it feels as though each parser is applied in the order they appear in the @block_parsers (and then @span_parsers) list. Is that the case? They're being serially applied in that order to each component, recursively over each branch found?

So as a side effect of this serial ordering, I don't need to worry (for architectural reasons) about accidentally applying a smart_quotes parser to an ERB block? Because the "only things left" to parse will be "extra raw" text that should have the quotes curled, dashes replaced and so forth?

@gettalong
Owner

The kramdown parser does its thing in multiple steps:

  • First the document is parsed for block level elements, the big building blocks.
  • Then the created document tree is updated by parsing raw_text elements with the span level parser and applying the IALs.
  • The last step is replacing abbreviations with the correct kramdown Element object.

The block and span parsers are applied in a serial kind of way, in the order they appear in @block_parser and @span_parsers.

Each parser defines a regexp that is used to identify the block or span level element. Since some parsers use similar regexps (e.g. setext header and paragraphs, both need a line of text) the order of the parsers is important. Also, by arranging the parsers that are not order-dependent in a most-commonly-used way it is possible to squeeze some more speed out of the kramdown parser.

In case of the block parsers, they are just tested on the current position of the document in the order they appear and the first one wins.

In case of the span parsers, the start regexps for all spans parsers are stitched together into one huge regexp. A match for this regexp is searched for and then the span parsers are tested in the order they appear in the @span_parsers array. This is done to avoid scanning each and every character.

Regarding an ERB block parser: This depends on how you create the element. If you create a custom :erb element with the ERB text in the #value accessor, you have nothing to worry about. Text will only be parsed by the span level parser if you create an element of type :raw_text with the ERB text!

@gettalong
Owner

So, I have found a way to make your examples work without breaking my test suite :) So your presented use cases will work in the next release.

@gettalong gettalong closed this
@Vaguery

Thank you! I worked on it for quite a while and was elbow-deep into a full parser rewrite before I got stuck. So I'm glad you found an incremental solution.

Do you have the sense that, especially as Markdown flavors get more popular, the ability to extend and revise the pattern-matching is starting to trump the speed of the implementations?

@gettalong
Owner

Not really, no. There are also some fast implementations that are not regexp-based like the original Perl Markdown from Gruber but which are based on a PEG grammar that can more easily be modified.

I also have to admit that I don't immediately recognize what all the code of the kramdown parser does :) It takes me some time to understand some really complex parts (should have add more comments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.