
Bug: massive memory usage for parsing a revision #286

Open
geohci opened this issue Jun 6, 2022 · 4 comments

geohci commented Jun 6, 2022

We've been parsing lots of historical Wikipedia revisions and so are running into weird edge cases. One of these is the revision below, which somehow seems to escape the recursion-depth limit and consumes a massive amount of memory when parsed. Obviously a patch would be great, but I'm also happy to hear any ideas for easily killing the parsing before its memory usage gets out of hand, without knowing the problematic revision IDs in advance (one possible approach is sketched after the replication code below).

Details:

The wikitext itself isn't particularly long, but it's a bunch of HTML tags all nested within links etc.: https://simple.wikipedia.org/w/index.php?title=2000s_(decade)&action=edit&oldid=69784

Code to replicate -- it very quickly consumed all available memory and was killed on PAWS (3 GB RAM max), but I'm told it gets up to about 30 GB if that much is available:

import mwapi
import mwparserfromhell

session = mwapi.Session('https://simple.wikipedia.org', user_agent='<contact info>')

def get_wikitext(revision_id, wiki='simple'):
    # Fetch the raw wikitext of a single revision from the MediaWiki API.
    # (Note: the wiki parameter is unused here; the session above is
    # hard-coded to simple.wikipedia.org.)
    return (session.get(
        action='query',
        prop='revisions',
        rvprop='content',
        rvslots='*',
        revids=revision_id,
        formatversion='2'
    ))['query']['pages'][0]['revisions'][0]['slots']['main']['content']

bad_wikitext = get_wikitext(69784)
print(len(bad_wikitext))  # 80360 -- not especially long
wikicode = mwparserfromhell.parse(bad_wikitext)  # blows up
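
A minimal sketch of one way to kill the parse before memory gets out of hand, as asked above: run it in a child process with a hard address-space cap. This is not from the thread, assumes a POSIX system (resource.setrlimit is unavailable on Windows), and reuses bad_wikitext from the snippet above:

import multiprocessing
import resource

import mwparserfromhell

def parse_with_cap(text, max_bytes=2 * 1024**3):
    # Cap this (child) process's address space; allocations past the cap
    # fail, which usually surfaces in Python as a MemoryError.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    try:
        mwparserfromhell.parse(text)
        print('parsed OK')
    except MemoryError:
        print('parse aborted: memory cap hit')

if __name__ == '__main__':
    p = multiprocessing.Process(target=parse_with_cap, args=(bad_wikitext,))
    p.start()
    p.join(timeout=120)  # also bound wall-clock time
    if p.is_alive():     # still running after the timeout: kill it
        p.terminate()
        p.join()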
lahwaacz (Contributor) commented

Related to #40

Is there some unclosed tag or wiki markup in that revision?

geohci (Author) commented Jun 27, 2022

> Is there some unclosed tag or wiki markup in that revision?

Thanks for chiming in. Yes, I do think that's what is going on, but my understanding is that the max recursion depth (code) should prevent these unclosed tags from consuming so much memory? Perhaps it's because the wikitext has so many of them in this example (>1000 <big> tags and only about 300 closing </big> tags)? So in addition to the max recursion depth (which I assume is per item), should there also be a global max?

I guess that hints at my cheap fix too: check the number of opening tags against the number of closing tags and make sure they aren't too uneven before parsing (a rough sketch follows below). But that's not an easy check to add before parsing and has all sorts of edge cases (e.g., some tags don't have equivalent closing tags).
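
For illustration, here's a rough sketch of that pre-parse balance check. It's a hypothetical helper (not from this thread) that only counts opening vs. closing occurrences per tag name and skips a small set of void tags, so it's a heuristic at best:

import re
from collections import Counter

VOID_TAGS = {'br', 'hr'}  # tags with no closing pair (deliberately incomplete)

def tags_roughly_balanced(text, slack=100):
    # Count <tag ...> openers and </tag> closers per tag name.
    opens = Counter(t.lower() for t in re.findall(r'<([A-Za-z][\w-]*)', text))
    closes = Counter(t.lower() for t in re.findall(r'</([A-Za-z][\w-]*)', text))
    return all(opens[t] - closes[t] <= slack
               for t in opens if t not in VOID_TAGS)

A revision like the one above (>1000 <big> vs. ~300 </big>) fails this check, so it could be skipped or routed to a memory-capped worker instead of being parsed directly.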

earwig (Owner) commented Jun 30, 2022

I guess the runaway memory usage comes from our cache of which combinations of tag locations are known to be invalid. We're probably storing that in a very inefficient way and doing unnecessary backtracking.

Thanks for the bug report and sorry for being unresponsive lately. I'll think more about this.
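
As a hedged illustration of that backtracking explanation (not from the thread), a tiny script can probe how peak parse memory scales with the number of unclosed tags. Note that tracemalloc only sees allocations made through Python's memory APIs, so with the C tokenizer the numbers may understate real usage:

import tracemalloc

import mwparserfromhell

for n in (5, 10, 20, 40):
    text = '<big>' * n + 'x'  # n unclosed tags, as in the bad revision
    tracemalloc.start()
    mwparserfromhell.parse(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'{n} unclosed tags -> peak {peak / 1024:.0f} KiB')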

geohci (Author) commented Jun 30, 2022

That explanation makes sense. So decreasing the recursion depth would probably help, but it's a blunt solution for something that's really about how to manage the possible combinations in these complicated, unclosed-tag situations. Thanks in advance for taking a look!
