Performance issue with long lines and repeated `indexOf('\n')` searches #298

MikeWeller · 2021-08-10T14:53:21Z

Describe the bug
While parsing long lines with no line breaks, an excessive amount of time is spent in the lexer performing repeated this.buffer.indexOf('\n') operations.

We are parsing some very large (1MB+) lines of minified YAML containing the "flow" style of arrays/maps. This is taking far too long than it should.

To Reproduce

    function reproduceSlowParsing() {
        let s = '- [';
        for (let i = 0; i < 100000; ++i) {
            s += '{},"",';
        }
        s+= ']\n';

        const document = parseDocument(s);
    }

Expected behaviour
This should parse in just a second or so (on my machine). It actually takes 28s.

Versions (please complete the following information):

Chromium 88.0.4324.206, V8 8.8
git-describe is v2.0.0-7

Additional context

If I run the above code and profile with the Chrome profiler I see ~28s spent in 'parseDocument':

The majority of this time is spent in parseFlowCollection

Specifically we spent a lot of time in two indexOf calls in getLine and parseQuotedScalar:

Note: the lines are offset by one due to some issue on my end

What appears to be happening is that every time we lex one of the {} flow collections, we are searching for the final newline on the line which is millions of characters ahead. We do this for every flow collection on the line and don't cache the result of the search. This results in many expensive indexOf searches.

I hacked some memoization/caching of the newline search into the code (I won't share because it was very ugly and just to verify the performance issue) and after my changes the total parse time goes to just ~1s (from 28s):

And the time spent in 'parseFlowCollection' is now just 177ms (down from 26s):

I'm guessing there are other places in the code that may have similar indexOf type searches, but I happened to hit this due to parseFlowCollection.

I'm not really sure the best solution to this, but I'm happy to work on a fix if we can agree on a best approach/design/whatever. Thanks!

The text was updated successfully, but these errors were encountered:

eemeli · 2021-08-11T11:10:42Z

Thank you! That's a beautifully investigated and reported bug.

I'm pretty sure that the lexer's getLine() and parseQuotedScalar() are the only places where there's a buffer.indexOf() search that doesn't end up consuming the scanned chars.

So as a first step, I added caching to getLine() and stopped parseQuotedScalar() from looking past its end quote; those together appear to drop the lexing time to be about 20% of the parseDocument(), which seems appropriate.

@MikeWeller Would you be able to verify those results on your machine by testing the current head of the master branch?

MikeWeller · 2021-08-11T12:56:35Z

@MikeWeller Would you be able to verify those results on your machine by testing the current head of the master branch?

I'll test it out soon. One thing to be careful of:

  private setNext(state: State) {
    this.buffer = this.buffer.substring(this.pos)
    this.pos = 0
    this.next = state
    return null
  }

I think the cached lineEndPos needs to be adjusted here since we substring and reset the position back to 0. But I might be wrong.

MikeWeller · 2021-08-11T13:23:42Z

@MikeWeller Would you be able to verify those results on your machine by testing the current head of the master branch?

I can confirm these changes show a similar improvement to what I reported with my "hacks".

eemeli · 2021-08-11T14:08:40Z

I think the cached lineEndPos needs to be adjusted here since we substring and reset the position back to 0. But I might be wrong.

Good catch. Took me a while to figure out why the stream tests weren't complaining about this, and it's because lex() always gets called first before lineEndPos is next accessed, and it does the reset. Still, makes sense to add the reset there as well.

@MikeWeller, please close this issue if you think that the performance issue is solved by these changes?

MikeWeller · 2021-08-11T15:03:59Z

Problem solved, thanks for the quick turnaround!

MikeWeller added the bug Something isn't working label Aug 10, 2021

eemeli added a commit that referenced this issue Aug 11, 2021

fix(lexer): Cache line end positions (#298)

b5e279b

eemeli added a commit that referenced this issue Aug 11, 2021

fix(lexer): Also reset line end positions in setNext() (#298)

1654765

MikeWeller closed this as completed Aug 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue with long lines and repeated `indexOf('\n')` searches #298

Performance issue with long lines and repeated `indexOf('\n')` searches #298

MikeWeller commented Aug 10, 2021

eemeli commented Aug 11, 2021

MikeWeller commented Aug 11, 2021

MikeWeller commented Aug 11, 2021

eemeli commented Aug 11, 2021

MikeWeller commented Aug 11, 2021

Performance issue with long lines and repeated indexOf('\n') searches #298

Performance issue with long lines and repeated indexOf('\n') searches #298

Comments

MikeWeller commented Aug 10, 2021

eemeli commented Aug 11, 2021

MikeWeller commented Aug 11, 2021

MikeWeller commented Aug 11, 2021

eemeli commented Aug 11, 2021

MikeWeller commented Aug 11, 2021

Performance issue with long lines and repeated `indexOf('\n')` searches #298

Performance issue with long lines and repeated `indexOf('\n')` searches #298