Heading node should "eat" the following whitespace and newline #55

Rua · 2013-12-15T22:04:43Z

As far as I know, when the wiki software parses headings, it requires that nothing except whitespace follow the heading on the same line. So

==head==

will be parsed as a heading by the wiki, but

==head== foo

is interpreted as raw text.
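The rule described above can be sketched as a single-line check. This is only an illustration: `is_heading_line` and `HEADING_LINE` are hypothetical names, not part of mwparserfromhell or MediaWiki, and the pattern assumes one to six equals signs on each side and ignores templates and comments.

```python
import re

# Hypothetical helper for illustration; not part of mwparserfromhell's API.
# Assumes 1-6 equals signs on each side and ignores templates and comments.
HEADING_LINE = re.compile(r"^(={1,6})(.+?)(={1,6})[ \t]*$")


def is_heading_line(line: str) -> bool:
    """Return True if the line looks like a heading with only whitespace after it."""
    return HEADING_LINE.match(line) is not None


print(is_heading_line("==head=="))      # heading on its own line
print(is_heading_line("==head==   "))   # trailing spaces are still allowed
print(is_heading_line("==head== foo"))  # trailing text, so not a heading
```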

mwparserfromhell emits a separate text node containing the newline after the heading and any spaces before that newline. But it's possible to remove this node, which results in a parse tree that can't actually exist in the wiki: a Heading node without a following Text node beginning with a newline. The trailing whitespace and the newline should really be implicit in the heading, so they should be "eaten" by the Heading node rather than converted into a separate text node. Maybe the whitespace should be preserved, but if so, it should be possible to strip it from the Heading node.

The node following the heading should be the first node on the next line, not the newline. If any non-whitespace intervenes between the heading and the newline, mwparserfromhell should not emit a heading at all but should parse it as inline text (possibly containing templates and such), just like the wiki software does.
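A toy model of this proposal, to make the intended tree shape concrete. It is not how mwparserfromhell is implemented: the `Heading` dataclass, its `trailing` field, and `parse` are hypothetical, and the sketch assumes balanced equals signs and no templates or comments.

```python
import re
from dataclasses import dataclass


@dataclass
class Heading:
    """Hypothetical node type: the heading owns its trailing whitespace/newline."""
    level: int
    title: str
    trailing: str  # spaces, tabs and the newline "eaten" by the heading

    def __str__(self) -> str:
        marks = "=" * self.level
        return f"{marks}{self.title}{marks}{self.trailing}"


# A heading line: balanced "=" runs, then only spaces/tabs up to the newline
# (or the end of the text). Anything else on the line means "not a heading".
HEADING = re.compile(r"^(=+)(.+?)\1([ \t]*(?:\n|\Z))", re.MULTILINE)


def parse(text: str) -> list:
    """Split text into plain strings and Heading nodes."""
    nodes, pos = [], 0
    for m in HEADING.finditer(text):
        if m.start() > pos:
            nodes.append(text[pos:m.start()])
        nodes.append(Heading(len(m.group(1)), m.group(2), m.group(3)))
        pos = m.end()
    if pos < len(text):
        nodes.append(text[pos:])
    return nodes


nodes = parse("intro\n==head==  \nbody\n")
print(nodes)
```

In this model the node after the heading is the first content of the next line, and a caller who wants to normalize the whitespace could set `trailing` to `"\n"` without being able to delete the newline itself.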


earwig · 2013-12-15T22:44:49Z

I see what you're saying. Good suggestion; I'll get on it at some point soon.

earwig · 2014-05-26T08:00:35Z

The real problem here is that, while ==head== foo is always disallowed by MediaWiki, ==head== {{foo}} is only disallowed if {{foo}} contains text (and is not just adding a comment or something). mwparser has no way to detect this.

davidswinegar · 2014-07-16T18:14:55Z

I think this brings up the problem of comments as whitespace as well. From what I can tell, MediaWiki parses an HTML comment immediately followed by a heading on the same line (e.g. <!-- comment -->=heading=) as an HTML comment and a heading, but mwparserfromhell parses the heading as text because there isn't a leading newline. I also think (but am not sure) that this means MediaWiki could interpret the =heading= in {{template}}=heading= as a heading, but only if {{template}} contains a comment with no whitespace outside of it. I've been running into these issues when trying to parse tables. They're really obscure situations, but I'm not sure how to handle them yet.

earwig · 2014-07-16T18:46:31Z

Yes, that seems correct. MediaWiki works by first substituting templates and removing HTML comments before it converts headings into real <h2>... etc tags. Thus, if a template on the same line as a heading is empty, the heading will still be parsed correctly (and furthermore, this is necessary for MediaWiki to parse headings located inside templates).
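A rough sketch of that ordering, to show why the result depends on what a template expands to. Everything here is an assumption made for illustration: the `templates` dictionary stands in for real template expansion, only parameterless, non-nested templates are handled, and unknown templates are treated as empty.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TEMPLATE = re.compile(r"\{\{([^{}|]+)\}\}")  # parameterless, non-nested only
HEADING_LINE = re.compile(r"^(={1,6})(.+?)(={1,6})[ \t]*$")


def is_heading_after_expansion(line: str, templates: dict) -> bool:
    """Expand templates, drop comments, then test the line as a heading."""
    # Step 1: substitute templates (unknown names expand to nothing here).
    expanded = TEMPLATE.sub(
        lambda m: templates.get(m.group(1).strip(), ""), line
    )
    # Step 2: remove HTML comments.
    expanded = COMMENT.sub("", expanded)
    # Step 3: only now decide whether the line is a heading.
    return HEADING_LINE.match(expanded) is not None


line = "==head== {{foo}}"
print(is_heading_after_expansion(line, {"foo": "<!-- note -->"}))
print(is_heading_after_expansion(line, {"foo": "some text"}))
```

A parser that never sees the template's contents cannot run step 1, which is why the same source line can go either way.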

Since mwparser works in a fundamentally different way and we can never determine what the parse tree should truly look like, I think it's better to be safe than sorry with regard to determining whether something is a real heading or not. To that effect, I think foo ==head== and ==head== foo should always be disallowed (since they are always invalid), but {{foo}} ==head== and ==head== {{foo}} should be allowed since they only might be invalid. A heading directly preceded by a comment (e.g. <!-- comment -->==head==) is never invalid, so the fact that mwparser doesn't treat it correctly is a bug.
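The policy above can be approximated without expanding anything: treat comments and templates around the heading as ignorable, and reject any other surrounding text. This is a sketch under assumptions, not mwparserfromhell's tokenizer: `may_be_heading` is a hypothetical name and only non-nested templates are recognized.

```python
import re

# Things that might expand to nothing: comments and (non-nested) templates.
IGNORABLE = r"(?:<!--.*?-->|\{\{[^{}]*\}\})"

# Optional ignorable items, the heading itself, optional ignorable items.
HEADING_LINE = re.compile(
    r"^(?:" + IGNORABLE + r"[ \t]*)*"
    r"(={1,6})(.+?)(={1,6})[ \t]*"
    r"(?:" + IGNORABLE + r"[ \t]*)*$"
)


def may_be_heading(line: str) -> bool:
    """True if the line is, or might be after expansion, a heading."""
    return HEADING_LINE.match(line) is not None


for sample in (
    "foo ==head==",
    "==head== foo",
    "{{foo}} ==head==",
    "==head== {{foo}}",
    "<!-- comment -->==head==",
):
    print(repr(sample), may_be_heading(sample))
```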

davidswinegar · 2014-07-16T19:04:25Z

That sounds like a good strategy (though {{foo}} ==head== would also fail because of the extra whitespace; only {{foo}}==head== is valid). But with reference to the actual issue, I think it's pretty much impossible to stop users from creating an incorrect parse tree if you allow them to delete and insert arbitrary nodes, though this seems like a situation in which it's easy to do so accidentally. I suppose it might be possible to include a meta-node or some other way of tracking dependencies on these kinds of whitespace/newline characters? That would probably make the API a lot more complicated.

yuvipanda · 2015-12-19T23:32:44Z

This also causes problems when encountering things like:

= ∞*b  then that implies a/b = ∞*0.  A similar proof could be done with the

in the middle of a paragraph. This is interpreted as a level-1 heading when it should be a continuation of the text.

That's from https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse/Questions/Archive_296&action=edit. The proposed solution in #55 (comment) should cover this

yuvipanda · 2015-12-19T23:56:17Z

I've 'fixed' it for now with https://en.wikipedia.org/w/index.php?title=Wikipedia%3ATeahouse%2FQuestions%2FArchive_296&type=revision&diff=695965028&oldid=655041713

lahwaacz · 2016-06-28T07:12:16Z

Also note that the number of blank lines after a heading does not matter, e.g.

== foo ==





bar

is parsed by MediaWiki as

<h2><span class="mw-headline" id="foo">foo</span></h2>
<p>bar
</p>
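One way a heading could absorb those blank lines is to let its pattern consume every whitespace-only line that follows. This is a minimal sketch with a hypothetical `split_heading` helper; it assumes balanced equals signs and does not attempt to reproduce MediaWiki's HTML output.

```python
import re

# Heading line, its own newline, then any number of whitespace-only lines.
HEADING_BLOCK = re.compile(
    r"^(={1,6})(.+?)\1[ \t]*(?:\n|\Z)((?:[ \t]*\n)*)"
)


def split_heading(text: str):
    """Return (level, title, rest) if text starts with a heading, else None."""
    m = HEADING_BLOCK.match(text)
    if m is None:
        return None
    return len(m.group(1)), m.group(2).strip(), text[m.end():]


print(split_heading("== foo ==\n\n\n\n\n\nbar"))
print(split_heading("== foo ==\nbar"))
```

Both calls yield the same level, title and remaining text, which mirrors the observation that the blank lines make no difference to the result.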

ghost assigned earwig Dec 15, 2013

davidswinegar mentioned this issue Jul 21, 2014

Implement MediaWiki table parsing #81

Merged

earwig modified the milestones: version 1.0, version 0.4 May 23, 2015

earwig added priority: mid and removed priority: high labels May 23, 2015

yuvipanda mentioned this issue Dec 20, 2015

Strange nesting when discussion has nested headings mediawiki-utilities/python-mwchatter#12

Closed

lahwaacz mentioned this issue Jun 4, 2017

rewrite and extend Caveats #180

Merged

roysmith mentioned this issue Feb 6, 2021

Heading node should strip leading and trailing whitespace #265

Open

DaxServer mentioned this issue Apr 4, 2021

Text node inside Template params shall strip leading and trailing whitespaces #266

Closed

earwig mentioned this issue Apr 25, 2021

Possible way to delete a node? #268

Closed

RheingoldRiver mentioned this issue Apr 25, 2021

Remove whitespace as well when removing a node #269

Open
