Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with long lines and repeated indexOf('\n') searches #298

Closed
MikeWeller opened this issue Aug 10, 2021 · 5 comments
Closed
Labels
bug Something isn't working

Comments

@MikeWeller
Copy link

Describe the bug
While parsing long lines with no line breaks, an excessive amount of time is spent in the lexer performing repeated this.buffer.indexOf('\n') operations.

We are parsing some very large (1MB+) lines of minified YAML containing the "flow" style of arrays/maps. This is taking far too long than it should.

To Reproduce

    function reproduceSlowParsing() {
        let s = '- [';
        for (let i = 0; i < 100000; ++i) {
            s += '{},"",';
        }
        s+= ']\n';

        const document = parseDocument(s);
    }

Expected behaviour
This should parse in just a second or so (on my machine). It actually takes 28s.

Versions (please complete the following information):

  • Chromium 88.0.4324.206, V8 8.8
  • git-describe is v2.0.0-7

Additional context

If I run the above code and profile with the Chrome profiler I see ~28s spent in 'parseDocument':

yaml-slow-parse-aggregated-time-parseDocument-edited

The majority of this time is spent in parseFlowCollection

yaml-slow-parse-aggregated-time-parseFlowCollection-edited

Specifically we spent a lot of time in two indexOf calls in getLine and parseQuotedScalar:

Note: the lines are offset by one due to some issue on my end

yaml-slow-parse-getLine-indexOf-edited

yaml-slow-parse-parseQuotedScalar-indexOf-edited

What appears to be happening is that every time we lex one of the {} flow collections, we are searching for the final newline on the line which is millions of characters ahead. We do this for every flow collection on the line and don't cache the result of the search. This results in many expensive indexOf searches.

I hacked some memoization/caching of the newline search into the code (I won't share because it was very ugly and just to verify the performance issue) and after my changes the total parse time goes to just ~1s (from 28s):

AFTER-yaml-slow-parse-aggregated-time-parseDocument-edit

And the time spent in 'parseFlowCollection' is now just 177ms (down from 26s):

AFTER-yaml-slow-parse-aggregated-time-parseFlowCollection-edited

I'm guessing there are other places in the code that may have similar indexOf type searches, but I happened to hit this due to parseFlowCollection.

I'm not really sure the best solution to this, but I'm happy to work on a fix if we can agree on a best approach/design/whatever. Thanks!

@MikeWeller MikeWeller added the bug Something isn't working label Aug 10, 2021
@eemeli
Copy link
Owner

eemeli commented Aug 11, 2021

Thank you! That's a beautifully investigated and reported bug.

I'm pretty sure that the lexer's getLine() and parseQuotedScalar() are the only places where there's a buffer.indexOf() search that doesn't end up consuming the scanned chars.

So as a first step, I added caching to getLine() and stopped parseQuotedScalar() from looking past its end quote; those together appear to drop the lexing time to be about 20% of the parseDocument(), which seems appropriate.

@MikeWeller Would you be able to verify those results on your machine by testing the current head of the master branch?

@MikeWeller
Copy link
Author

@MikeWeller Would you be able to verify those results on your machine by testing the current head of the master branch?

I'll test it out soon. One thing to be careful of:

  private setNext(state: State) {
    this.buffer = this.buffer.substring(this.pos)
    this.pos = 0
    this.next = state
    return null
  }

I think the cached lineEndPos needs to be adjusted here since we substring and reset the position back to 0. But I might be wrong.

@MikeWeller
Copy link
Author

@MikeWeller Would you be able to verify those results on your machine by testing the current head of the master branch?

I can confirm these changes show a similar improvement to what I reported with my "hacks".

@eemeli
Copy link
Owner

eemeli commented Aug 11, 2021

I think the cached lineEndPos needs to be adjusted here since we substring and reset the position back to 0. But I might be wrong.

Good catch. Took me a while to figure out why the stream tests weren't complaining about this, and it's because lex() always gets called first before lineEndPos is next accessed, and it does the reset. Still, makes sense to add the reset there as well.

@MikeWeller, please close this issue if you think that the performance issue is solved by these changes?

@MikeWeller
Copy link
Author

Problem solved, thanks for the quick turnaround!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants