Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block quote marker choice uniformity #466

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

aidantwoods
Copy link
Contributor

Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described here)

For what it's worth, this is one of the possible interpretations of the rule being overwritten – so it's not strictly new, just more specific ;)

Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described [here](commonmark#460 (comment)))
@aidantwoods
Copy link
Contributor Author

The line:

If some lines in Ls are blank lines, then trailing whitespace after a block quote marker may be omitted without affecting the interpretation of which block quote marker was prepended.

has the intent of making the following equivalent (both take interpretation of using marker type (a))

(() in place of trailing space)

> I would like to visit a castle in north scotland, next year.
>
> But my home is my castle
> I would like to visit a castle in north scotland, next year.
>
> But my home is my castle

It could probably be rephrased

very sorry typo
@ScottAbbey
Copy link
Contributor

Can you add some negative examples that implementations can test against? Since this would be tightening up current rules, they will probably need to be modified.

@aidantwoods
Copy link
Contributor Author

aidantwoods commented May 4, 2017

@ScottAbbey how does that look?

@mgeier
Copy link
Contributor

mgeier commented May 7, 2017

I think I don't really understand this new rule yet, but does it mean that (in the worst case) the whole block quote has to be parsed before it can be decided if the marker is >• or > and therefore before parsing the contents of the block quote can be started?

If yes, this would be a new limitation that hasn't been there before, and I'd have to rewrite my block parser.

@aidantwoods
Copy link
Contributor Author

aidantwoods commented May 7, 2017

I think I don't really understand this new rule yet, but does it mean that (in the worst case) the whole block quote has to be parsed before it can be decided if the marker is >• or > and therefore before parsing the contents of the block quote can be started?

Yes, that's correct.

Seeing as this is a new limitation (and may require rewrites in a lot of places) we could opt for a similar approach that is opportunistic instead of minimal overall marker (difference highlighted here).

Opportunistic being that instead of a uniform marker, we always continue using the previous marker (i.e. we use >• until > is used, and then we continue using > for the remainder of the block quote).

Personally I'd prefer the consistency of the current proposed approach (minimal overall marker) so that visual indent always matches the interpreted indent. Opportunistic would solve some ambiguity that we currently experience though, so it would at least be a step in the right direction. (I suppose the decision may come down to whether it should be easy to parse, or easy to write).

Whichever approach is chosen, it should achieve the method of parsing that is consistent across all implementations (at present, a parser could conform to spec by randomly allocating which marker to interpret as being prepended on each line of the same blockquote).

@mgeier
Copy link
Contributor

mgeier commented May 8, 2017

@aidantwoods Thanks for the answer!

This would indeed solve the parsing problem, but it would make for quite confusing and hard to explain behavior.

For example:

>•••••code1
>
>or
>
>•••••code2

In this case, code2 would have a leading space, although the number of spaces is the same as in the line with code1.
The problem here is that the "mistake" (not using a space before "or") leads to unexpected output somewhere else.
In the case of lists inside of blockquotes, the "mistake" and the unexpected output is at least at the same place.

Which leads me to my alternative suggestion:

We should declare >• as the blockquote marker. Using > without space would be strictly speaking an error, but since there are no errors in CommonMark, we could "tolerate" it, knowing that it may entail some problems.

The "quick reference" (http://commonmark.org/help/) already implies that there should be a space, and the tutorial (http://commonmark.org/help/tutorial/08-blockquotes.html) mentions an "optional" space. We could just drop the "optional" and everything would be fine.

In the actual spec, we could specify that >• is the "real" blockquote marker and if the space is missing, it is still treated as blockquote marker, but there may be unintended consequences in nested structures.

What about that?

IMHO this would have some similarity to how indentation is handled inside of fenced code blocks (assuming the fence itself is indented by N = 1 space):

If the leading code fence is indented N spaces, then up to N spaces of indentation are removed from each line of the content (if present). (If a content line is not indented, it is preserved unchanged. If it is indented less than N spaces, all of the indentation is removed.)

BTW, I'm a fan of the "readers perspective" (cf. #460 (comment)). IMHO It's much easier to understand if the spec says "remove >• from every line; if there is no space, just remove >" instead of talking about Ls and Bs and prepending stuff.

@aidantwoods
Copy link
Contributor Author

aidantwoods commented May 8, 2017

In the actual spec, we could specify that >• is the "real" blockquote marker and if the space is missing, it is still treated as blockquote marker, but there may be unintended consequences in nested structures.

This wouldn't affect my personal usage too much (I always use >• anyway), but we'd have to address the case where multiple nestings of a blockquote are opened using >, as I'd imagine that because this is quite a convenience it is possibly often used.

Also note that both minimal overall marker, and opportunistic marker approaches are tightening of the spec (so are valid already within the current spec), whereas dropping > as a marker would require changing the existing examples in spec (though switching to this approach would probably be an easy change for most implementations).

@mgeier
Copy link
Contributor

mgeier commented Aug 2, 2017

@aidantwoods I'm not talking about dropping > as a marker. I'm just saying that from a user's perspective, it should be considered an "error", but one that is "tolerated". It still should work, but it may have those unintended consequences with nested lists and probably other structures.

Multiple nested block quotes wouldn't be any different. The "official" usage would be >•>•>•, users can drop the spaces if they insist, but there might be problems with nested structures when "mixing" >• and >, just as in the case with a single block quote.

I don't think that any of the examples in the spec have to be changed, but probably the wording should be changed as to why those examples are allowed. And the text could contain a warning about possible problems when not using a space after >.

What I'm suggesting doesn't require any change in the reference implementations, it's just a different "interpretation" of the current behavior.

@aidantwoods
Copy link
Contributor Author

IMO if > is to be considered an error, it should just fail. Having it be silently tolerated (resulting in indentation misalignment) is confusing. Having it fail outright makes it clear that it should not be used. Removing grey areas is kind of the point in the spec, right?

If we want to keep it, we should address the problems in the current spec.

Btw, no examples in the spec need be changed with the proposed implementation (since it is contained in a subset of possible interpretations of the current spec). ;)

@mgeier
Copy link
Contributor

mgeier commented Aug 4, 2017

Having it be silently tolerated (resulting in indentation misalignment) is confusing.

I agree, but this might be the lesser evil.

And it might actually happen very rarely (but I have no data to support that).

Having it fail outright makes it clear that it should not be used.

But this would break backwards compatibility.

And as long as there are no lists (with ) inside the block quote, it's perfectly fine to be used (but I would still not recommend it in tutorials).

Removing grey areas is kind of the point in the spec, right?

I agree, but I consider none of the discussed alternatives "grey areas".
They are all well defined and unambiguous (the wording might still be improved), it's just that each one has their own set of disadvantages ...

@aidantwoods
Copy link
Contributor Author

I think the only way to resolve this may be to try and gauge some consensus from the community.

Couple points though:

Having it fail outright makes it clear that it should not be used.

But this would break backwards compatibility.

The spec is too permissive at present (it does not define a unique mapping), so we necessarily break BC with something by fixing it. It is an unavoidable consequence.

I think my order of preference for fix is:

  1. The changes currently proposed in this PR to find the minimal marker.
  2. Drop support for the short marker, and hard fail
  3. Drop official support for the short marker and have implementations be able to choose whether to tentatively support it

I think that dropping the short marker is perhaps a reasonable solution, but I worry that allowing implementations to decide what to do goes against the aim of the spec (that you have guarantees that the same markdown will look the same when going through different compliant parsers). Spec should take a hard line on things when possible IMO

@jgm
Copy link
Member

jgm commented Sep 10, 2018

Statistics on a large corpus of markdown files from GitHub: about 0.8% use the > without a space for block quotes. That is a small enough percentage that it option 2 seems kind of tempting.

Is this a fourth option?

  1. Use the first line to determine what marker will be used for the blockquote. So if the first line begins with > + SPACE, and you have a subsequent line that starts with > without the SPACE, that line will just contain a literal > character.

This option avoids having to scan the whole block quote -- which I really want to avoid. It is a kind of mix of 1 and 2, but I think it would give good results for all real-world cases.

@aidantwoods
Copy link
Contributor Author

I like this idea, especially so since it addresses uniformity while not breaking the >>> foo case and consistent uses of the shorter marker. It's also a nice bonus not requiring major parsing strategy changes to meet the spec ;-)
I think we might have to change the wording of the spec to accommodate this parsing in a way that no longer completely uses the writer's perspective: we'd be explaining how we interpret the marker on the first line instead of a description about prepending the same marker to the beginning of some existing lines.

i.e. (with representing a space in the following)

>•foo
>bar

has a singular interpretation with this new rule, but the writer may have started with

•foo
bar

and prepended > to both lines (the same marker on each line).
Alternatively they may have started with

foo
>bar

and prepended >• to the first line and left the second as continuation text.

This new rule assumes the latter case, and the only way to explain why is to say how the marker on the first line is interpreted (reader's perspective). This could be as simple as saying something like: "if there is ambiguity as to whether > or >• was prepended to the first line then >• will be assumed". This would be in addition to the requirement already added in this PR that enforces that the same block quote marker must be prepended to all lines (so that we may deduce the choice of marker on the remaining lines).

@jgm
Copy link
Member

jgm commented Sep 11, 2018

Good point. How about this.

A type (a) block quote marker is a > followed by a space or a newline.

A type (b) block quote marker is a > not followed by a space or a newline.

If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (a) block quote markers to each line in L forms a block quote with content Bs.

If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (b) block quote markers to each line in L forms a block quote with content Bs.

If we have

•foo
bar

then there's no way to prepend type (b) block quote markers to each line, since a > character on the first line would be followed by a space.

@jgm
Copy link
Member

jgm commented Sep 11, 2018

Actually, we probably need something more like this:

A type (a) block quote marker is a > followed by a space.

A type (b) block quote marker is a > not followed by a space or a newline line ending.

A type (c) block quote marker is a > followed by an empty line (spaces plus a newline) a line ending.

If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (a) or (c) block quote markers to each line in L forms a block quote with content Bs.

If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (b) or (c) block quote markers to each line in L forms a block quote with content Bs.

The revision allows for:

>a
>b
>
>c

@jgm
Copy link
Member

jgm commented Sep 12, 2018

To be explicit, this proposal will parse

> a
>b

as a block quote with contents a\n>b (the > will be treated as a regular character). The Consecutiveness rule prevents parsing it as two block quotes with different kinds of markers.

@digitalmoksha
Copy link

This proposal sounds reasonable. My only question, what about blank lines? For example, I've used quite a lot

> This the first blockquote
>
> This is the second blockquote (note there is no space after the `>` on the second line

@jgm
Copy link
Member

jgm commented Sep 12, 2018

Blank lines are covered. According to this proposal, you can use either (type a and c) or (type b and c) markers. Type c markers are the ones followed by a newline.

@jgm
Copy link
Member

jgm commented Sep 12, 2018

To be explicit, this proposal will parse

a
b

as a block quote with contents a\n>b (the > will be treated as a regular character). The
Consecutiveness rule prevents parsing it as two block quotes with different kinds of markers.

Hm. But really, I don't think the spec is all too clear about why this is a block quote with content a\n>b rather than, say, a regular paragraph with content > a followed by a block quote with content b. (Or even a regular paragraph with content > a\n>b -- depending on how we interpret the laziness condition, which seems circular.)

I am getting tempted to scrap the whole descriptive spec idea, and describe a parsing algorithm (similar to the HTML5 spec). It should be possible to describe the general block parsing strategy, in terms of a set of parameters (block start matcher, block continuation matcher, can it interrupt a paragraph?, does it allow laziness, etc.). Descriptions of individual block-level elements could then fill in these parameters.

@mgeier
Copy link
Contributor

mgeier commented Sep 12, 2018

@jgm Hooray! It's nice to hear that you finally are thinking about dropping this "writer's perspective" thing with its overly complicated and hard-to-understand rules. I'm very much in favor to switch to a spec that describes some kind of "abstract" parser. Implementers of actual parsers don't have to do exactly the same steps, but they should do something that behaves "as if" those steps would have been done (similar to the "as if" rule in C++: https://en.cppreference.com/w/cpp/language/as_if).

@mgeier
Copy link
Contributor

mgeier commented Sep 12, 2018

So the rule for a block-quote parser would be to remove one or two characters from the beginning of each line and yield the rest of the line for further block processing. This issue discusses whether that should indeed be exactly one or rather two characters or some mixture of both.

I'll try to summarize the above suggestions, starting with the numbering used by @aidantwoods in #466 (comment).
I think that really all options have some disadvantages, we'll have to choose the lesser of a few evils ...

option 1: find the minimal marker

disadvantage: The block cannot be parsed line-by-line anymore.

option 2: only long marker

A > character without following space would turn into a literal > character.

A special rule would have to be added to handle a > followed by a line break.

disadvantage: This would break backwards compatibility with Gruber's Markdown. Some old documents would suddenly have a few stray literal > characters in their block quotes.

Question: should this still work for nested block-quote markers without spaces?

>>> what shall we do?
>>
>> let's go home
>
> OK

Should this still be allowed? (I tend towards yes)

option 3: discourage short marker

This is basically the same as the current state, except the documentation is changed to strongly suggest using a space.
See #466 (comment).
See option 4 for further disadvantages.

option 4: either long or short marker

Remove space, except none is available.
This is the situation in the current spec.

disadvantage: breaks lists if list item is written without space.

option 5: change marker adaptively

Remove space from each line until there is no space. Afterwards don't remove spaces.
Suggested in #466 (comment).

Disadvantage: Behavior changes within one block quote, quite confusing.

option 6: only short marker

This wasn't really proposed above, but I think it would also be an option.

One could still use spaces, but those would count as additional indentation in the nested blocks.
Since additional spaces are allowed in all blocks (up to 3 spaces), this should work surprisingly well.
This would not break lists.

Disadvantage: Some indented code blocks would suddenly get a leading space where there wasn't one before.
OTOH, it would make the rule for indented code blocks less confusing, because the indentation would be exactly 4 spaces, as everywhere else.

I would still use spaces in all tutorials and examples. I think spaces look much better, but that doesn't mean that the block-quote parser has to remove them.


I hope I didn't forget some option that was mentioned above. Feel free to add further options and comments!

@jgm
Copy link
Member

jgm commented Sep 12, 2018 via email

@mgeier
Copy link
Contributor

mgeier commented Sep 12, 2018

Which situation is more seldom?

No spaces are used (this breaks in option 2):

>text

List item without space (this breaks in option 2, 3 and 4):

>1.•text
>
>•••more text

Indented code block with 5 spaces (this grows an unwanted leading space in option 6):

>•••••code

Spaces followed by no spaces followed again by spaces (this behaves strangely in option 5 and has unwanted leading spaces in option 6)

>•••••code1
>
>or
>
>•••••code2

Are there other situation I've forgotten?

Leaving option 1 aside, I have the feeling that option 5 leads to the least breakage, but option 6 is easier to understand. Options 2, 3 and 4 lead to far more breakage.

@ScottAbbey
Copy link
Contributor

option 6: only short marker

This wasn't really proposed above, but I think it would also be an option.

One could still use spaces, but those would count as additional indentation in the nested blocks.
Since additional spaces are allowed in all blocks (up to 3 spaces), this should work surprisingly well.
This would not break lists.

Statistics on a large corpus of markdown files from GitHub: about 0.8% use the > without a space for block quotes. That is a small enough percentage that it option 2 seems kind of tempting.

Are you able to craft a query on the large corpus of markdown files to see what effect "option 6: only short marker" would have? This should mostly affect situations where someone has started their line with the > followed by four spaces? What else will it break?

@jgm
Copy link
Member

jgm commented Sep 12, 2018

Option 6 is interesting. I guess you're right, it would only affect cases where the block quote encloses indented content: indented code blocks and lists. And it shouldn't greatly affect lists, because it would change the indentation of every line evenly.

I tried modifying the cmark parser so it doesn't consume the space after >, and only one spec test case failed:

Example 97 (lines 1734-1745) Fenced code blocks
> ```
> aaa

bbb

--- expected HTML
+++ actual HTML
@@ -1,5 +1,5 @@
 <blockquote>
-<pre><code>aaa
+<pre><code> aaa
 </code></pre>
 </blockquote>
 <p>bbb</p>

I'll look at the corpus with this in mind.

@jgm
Copy link
Member

jgm commented Sep 13, 2018

In one small part of the corpus (4901 files):

  • 2 files had > followed by 4 spaces (or tab) and then a word character.
  • 11 files had > followed by indented code (5 spaces or tab + space) and then a word character.

I tried a few others; this seems fairly typical. So to estimate, with option 6,

  • 0.2% would be affected by having an extra space indent added to their code blocks
  • 0.04% would be affected by having blockquoted paragraphs turn into blockquoted code blocks.

@mgeier
Copy link
Contributor

mgeier commented Sep 13, 2018

@jgm Thanks for checking this out! I would be really interested in those 2 cases that have > plus 4 spaces. I'm wondering how this can happen in reality (other than in a continuation line of a list item or of a plain paragraph). Can you provide those 2 cases?

The test failure above is actually a bug (and not a problem with option 6). If only the > characters are removed, the remaining fenced code block has a space before the fence. This space has to be remove from the following code lines, according to the spec.

BTW, I actually forgot an option in the list. It was suggested by @jgm in #466 (comment), right below the original 3 options.
I didn't skip this on purpose, I just overlooked it, so let me repeat it here (it was originally number 4, but I've ruined that):

option 7: first line decides

This sounds indeed quite practical, but I don't know how this is supposed to work with nested > markers. For example:

>>•nested text
>
>•text

Is the space in the third line supposed to be removed?

What about:

> text
>
>> what now?

Is the last line supposed to have literal >> characters?

@jgm
Copy link
Member

jgm commented Sep 13, 2018

I would be really interested in those 2 cases that have > plus 4 spaces. I'm wondering how this can happen in reality (other than in a continuation line of a list item or of a plain paragraph)

It currently gets parsed as a regular paragraph in a block quote. The block quote marker consumes the first space, leaving three -- and a paragraph can have a three-space indent.

The test failure above is actually a bug (and not a problem with option 6). If only the > characters are removed, the remaining fenced code block has a space before the fence. This space has to be remove from the following code lines, according to the spec.

That's right.

As for option 4:

> text
>
>> what now?

Thinking about this in terms of the current spec, and not the "remove markers" idea (which is foreign to it), this would be parsed as a blockquote containing a paragraph "text" followed by a paragraph containing ">> what now?". The uniformity requirement would prohibit parsing the whole thing as one block quote, and the consecutiveness rule would forbid having two distinct block quotes without an intervening blank line. Admittedly this is something you might see in the wild, and this proposed interpretation is not going to be the expected one.

So, perhaps that's a point in favor of the simple option 6 -- just letting the > character, with no following space, be the marker.

@jgm
Copy link
Member

jgm commented Sep 13, 2018

Here's the main thing that keeps me from embracing option 6. If you start with

This is code:

    code

and put it in a block quote with nice spacing:

> This is code:
>
>     code

Then you should, in my opinion, get a block quote with the same content as the original.

@digitalmoksha
Copy link

^ totally agree with that - adding the blockquote with space should not change how the text inside the blockquote looks when rendered (inside a blockquote).

Statistics on a large corpus of markdown files from GitHub: about 0.8% use the > without a space for block quotes. That is a small enough percentage that it option 2 seems kind of tempting.

Just ran into this recently with a set of files that got converted from RedCarpet to CommonMark. We changed some of these where we ran across it, but since it didn't break the final rendering, not all of them were changed.

Not saying they shouldn't be eventually, just many still exist as >Note: something something something

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants