
Bug: massive memory usage for parsing a revision #286

Open
geohci opened this issue Jun 6, 2022 · 4 comments

geohci commented Jun 6, 2022

We've been parsing lots of historical Wikipedia revisions and so are running into weird edge cases. One of these is the revision below, which somehow seems to escape the recursion-depth limit and consumes a massive amount of memory when parsed. Obviously a patch would be great, but I'm also happy to hear any ideas for easily killing the parsing before its memory usage gets out of hand, without knowing the problematic revision IDs in advance (one possible approach is sketched after the replication code below).

Details:

The wikitext itself isn't particularly long, but it's a bunch of HTML tags all nested within links etc.: https://simple.wikipedia.org/w/index.php?title=2000s_(decade)&action=edit&oldid=69784

Code to replicate -- it very quickly consumed all available memory and was killed on PAWS (3 GB RAM max), but I'm told it gets up to about 30 GB if that much is available:

import mwapi
import mwparserfromhell

session = mwapi.Session('https://simple.wikipedia.org', user_agent='<contact info>')

def get_wikitext(revision_id, wiki='simple'):
    # Fetch the raw wikitext of a single revision from the MediaWiki API.
    # (Note: the wiki parameter is unused here; the session above is
    # hard-coded to simple.wikipedia.org.)
    return (session.get(
        action='query',
        prop='revisions',
        rvprop='content',
        rvslots='*',
        revids=revision_id,
        formatversion='2'
    ))['query']['pages'][0]['revisions'][0]['slots']['main']['content']

bad_wikitext = get_wikitext(69784)
print(len(bad_wikitext))  # 80360 -- not especially long
wikicode = mwparserfromhell.parse(bad_wikitext)  # blows up
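
A minimal sketch of one way to kill the parse before memory gets out of hand, as asked above: run it in a child process with a hard address-space cap. This is not from the thread, assumes a POSIX system (resource.setrlimit is unavailable on Windows), and reuses bad_wikitext from the snippet above:

import multiprocessing
import resource

import mwparserfromhell

def parse_with_cap(text, max_bytes=2 * 1024**3):
    # Cap this (child) process's address space; allocations past the cap
    # fail, which usually surfaces in Python as a MemoryError.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    try:
        mwparserfromhell.parse(text)
        print('parsed OK')
    except MemoryError:
        print('parse aborted: memory cap hit')

if __name__ == '__main__':
    p = multiprocessing.Process(target=parse_with_cap, args=(bad_wikitext,))
    p.start()
    p.join(timeout=120)  # also bound wall-clock time
    if p.is_alive():     # still running after the timeout: kill it
        p.terminate()
        p.join()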
lahwaacz (Contributor) commented

Related to #40

Is there some unclosed tag or wiki markup in that revision?

geohci (Author) commented Jun 27, 2022

> Is there some unclosed tag or wiki markup in that revision?

Thanks for chiming in. Yes, I do think that's what is going on, but my understanding is that the max recursion depth (code) should prevent these unclosed tags from consuming so much memory? Perhaps it's because the wikitext has so many of them in this example (>1000 <big> tags and only about 300 closing </big> tags)? So in addition to the max recursion depth (which I assume is per item), should there also be a global max?

I guess that hints at my cheap fix too: check the number of opening tags against the number of closing tags and make sure they aren't too uneven before parsing (a rough sketch follows below). But that's not an easy check to add before parsing and has all sorts of edge cases (e.g., some tags don't have equivalent closing tags).
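
For illustration, here's a rough sketch of that pre-parse balance check. It's a hypothetical helper (not from this thread) that only counts opening vs. closing occurrences per tag name and skips a small set of void tags, so it's a heuristic at best:

import re
from collections import Counter

VOID_TAGS = {'br', 'hr'}  # tags with no closing pair (deliberately incomplete)

def tags_roughly_balanced(text, slack=100):
    # Count <tag ...> openers and </tag> closers per tag name.
    opens = Counter(t.lower() for t in re.findall(r'<([A-Za-z][\w-]*)', text))
    closes = Counter(t.lower() for t in re.findall(r'</([A-Za-z][\w-]*)', text))
    return all(opens[t] - closes[t] <= slack
               for t in opens if t not in VOID_TAGS)

A revision like the one above (>1000 <big> vs. ~300 </big>) fails this check, so it could be skipped or routed to a memory-capped worker instead of being parsed directly.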

earwig (Owner) commented Jun 30, 2022

I guess the runaway memory usage comes from our cache of which combinations of tag locations are known to be invalid. We're probably storing that in a very inefficient way and doing unnecessary backtracking.

Thanks for the bug report and sorry for being unresponsive lately. I'll think more about this.
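
As a hedged illustration of that backtracking explanation (not from the thread), a tiny script can probe how peak parse memory scales with the number of unclosed tags. Note that tracemalloc only sees allocations made through Python's memory APIs, so with the C tokenizer the numbers may understate real usage:

import tracemalloc

import mwparserfromhell

for n in (5, 10, 20, 40):
    text = '<big>' * n + 'x'  # n unclosed tags, as in the bad revision
    tracemalloc.start()
    mwparserfromhell.parse(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'{n} unclosed tags -> peak {peak / 1024:.0f} KiB')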

geohci (Author) commented Jun 30, 2022

That explanation makes sense. So decreasing the recursion depth would probably help, but it's a blunt solution for something that's really about how to manage the possible combinations in these complicated, unclosed-tag situations. Thanks in advance for taking a look!
