Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heading node should "eat" the following whitespace and newline #55

Open
Rua opened this issue Dec 15, 2013 · 8 comments
Open

Heading node should "eat" the following whitespace and newline #55

Rua opened this issue Dec 15, 2013 · 8 comments

Comments

@Rua
Copy link

Rua commented Dec 15, 2013

When headings are parsed, as far as I know, the wiki software requires that nothing except whitespace be on the line following the heading. So

==head==

Will be parsed as a heading by the wiki, but

==head== foo

is interpreted as raw text.

mwparserfromhell will emit a separate text node containing the following newline and any preceding spaces. But it's possible to remove this node, which then results in a parse tree that can't actually exist in the wiki: a Heading node without a following Text node beginning with a newline. The following newline and the preceding whitespace should really be implicit in the heading, so it should be "eaten" by the Heading node, rather than be converted into a separate text node. Maybe any whitespace should be preserved, but if so, it should be possible to strip it from the Heading node.

The node following the heading should be the first node on the next line, not the newline. If any non-whitespace intervenes between the heading and the newline, mwparserfromhell should not emit a heading at all but should parse it as inline text (possibly containing templates and such), just like the wiki software does.

@earwig
Copy link
Owner

earwig commented Dec 15, 2013

I see what you're saying. Good suggestion; I'll get on it at some point soon.

@ghost ghost assigned earwig Dec 15, 2013
@earwig
Copy link
Owner

earwig commented May 26, 2014

The real problem here is that, while ==head== foo is always disallowed by MediaWiki, ==head== {{foo}} is only disallowed if {{foo}} contains text (and is not just adding a comment or something). mwparser has no way to detect this.

@davidswinegar
Copy link
Contributor

I think this brings up the problem of comments as whitespace as well - from what I can tell MediaWiki parses <!-- comment -->=heading= as an HTML comment and a heading, but mwparserfromhell parses the heading as text because there isn't a leading newline. I also think (but am not sure) that this means that MediaWiki could interpret {{template}}=heading= with =heading= as a heading, but only if {{template}} contains a comment with no whitespace outside of it. I've been running into these issues when trying to parse tables, and they're really obscure situations but I'm not sure how to handle them yet.

@earwig
Copy link
Owner

earwig commented Jul 16, 2014

Yes, that seems correct. MediaWiki works by first substituting templates and removing HTML comments before it converts headings into real <h2>... etc tags. Thus, if a template on the same line as a heading is empty, the heading will still be parsed correctly (and furthermore, this is necessary for MediaWiki to parse headings located inside templates).

Since mwparser works in a fundamentally different way and we can never determine what the parse tree should truly look like, I think it's better to be safe than sorry with regards to determining whether something is a real heading or not. To that effect, I think foo ==head== and ==head== foo should always be disallowed (since they are always invalid), but {{foo}} ==head== and ==head== {{foo}} should be allowed since they only might be invalid. <!-- foo -->==head== is never invalid, so the fact that mwparser doesn't treat it correctly is a bug.

@davidswinegar
Copy link
Contributor

That sounds like a good strategy (though {{foo}} ==head== would also fail because of the extra whitespace, only {{foo}}==head== is valid). But with reference to the actual issue, I think its pretty much impossible to stop users from creating an incorrect parse tree if you allow them to delete and insert arbitrary nodes, though this seems like a situation in which it's easy to do so accidentally. I suppose it might be possible to create some way of including a meta-node or other way of tracking dependencies on these kind of whitespace/newline characters? That would probably make the API a lot more complicated.

@yuvipanda
Copy link
Contributor

This also causes problems when encountering things like:

= ∞*b  then that implies a/b = ∞*0.  A similar proof could be done with the

in the middle of a paragraph. This is interpreted as a heading 1 when it should be continuation of text.

That's from https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse/Questions/Archive_296&action=edit. The proposed solution in #55 (comment) should cover this

@yuvipanda
Copy link
Contributor

@lahwaacz
Copy link
Contributor

Also note that the amount of blank lines after a heading does not matter, e.g.

== foo ==





bar

is parsed by MediaWiki as

<h2><span class="mw-headline" id="foo">foo</span></h2>
<p>bar
</p>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants