Skip to content
This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

Parsing and tokenizing large views thrashes CPU, Memory & GC #635

Closed
SimonOrdo opened this issue Dec 9, 2015 · 23 comments
Closed

Parsing and tokenizing large views thrashes CPU, Memory & GC #635

SimonOrdo opened this issue Dec 9, 2015 · 23 comments
Assignees
Milestone

Comments

@SimonOrdo
Copy link

I'm running RC1-Final. Pages are served fine, but with a large view, the server gets so bogged down, apparently trying to parse all the tags, that the server complains and I end up with HTTP errors at times.

In this instance, I'm serving a view with no razor "@" tags and no kind of inline code. Aside from two "section" tags, it's purely HTML and Javascript. The view defines several data tables and is 188kb in size. Here is what the start and end of requests look like in the VS performance tools (red arrows are start of request and end of request):

image

Memory snapshot before the page request
image

Memory snapshot during the page request
image

Memory snapshot after the page request
image

Any ideas how to mitigate this beyond "write smaller views with fewer tags to parse" ?

@rynowak
Copy link
Member

rynowak commented Dec 9, 2015

/cc @Eilon this is exactly what we were discussing.

@rynowak
Copy link
Member

rynowak commented Dec 9, 2015

The proposed fixes here:

Rename ISymbol to IToken - because these are tokens not symbols** :trollface: Consider using an abstract base class here because there's going to plenty of functionality that's language-agnostic.

HTML-span based tokens should have a pointer to the source and a range (start, end) of the document. We can't afford to create copies of the HTML regions of the page as they are numerous, arbitrarily large and don't have a ton of duplication. If we can't do this without buffering (because of concurrency, multiple passes on the input), when we might want to substring here, but with de-duplication.

For CSharp-span based tokens, they will be extremely repetitive. We should dedupe these as well.

We should also consider creating a token-stream data structure, and storing in syntax nodes the range of tokens, rather than an explicit list.

For building of spans, it's the wrong approach to ToString() and then write it out. The span should hold on to lower-level constructs like tokens or nodes, and ask those items to write themselves out to a TextWriter. https://github.com/aspnet/Razor/blob/dev/src/Microsoft.AspNet.Razor/Chunks/ChunkTreeBuilder.cs#L144

There's some further reading here about Roslyn's implementation:
http://ericlippert.com/2012/06/08/red-green-trees/
https://roslyn.codeplex.com/discussions/541953

The red-green tree discussion is potentially interesting if we want to do incremental parsing for editor scenarios, but in general we need to solve a different problem. They have a lot of text that's punctuation, or very repetitive.

@Eilon Eilon added this to the 4.0.0-rc2 milestone Dec 9, 2015
@rynowak
Copy link
Member

rynowak commented Dec 29, 2015

Updated with some more info

@Eilon
Copy link
Member

Eilon commented Dec 30, 2015

@rynowak how related is this to #614?

@rynowak
Copy link
Member

rynowak commented Dec 30, 2015

@rynowak how related is this to #614?

They are both razor bugs we need to fix 👍 These are both issues you encounter with trying to use large razor pages that are usually completely blocking.

@Eilon
Copy link
Member

Eilon commented Dec 30, 2015

Cool. Nicholas will take care of these!

@dougbu
Copy link
Member

dougbu commented Jan 3, 2016

For building of spans, it's the wrong approach to ToString() and then write it out

Not sure whether this is part of the same bug but Razor makes extensive use of ToString(). Mostly this extra complication improves the debugging experience, eases testing, or supports tracing. But CSharpCodeWriter writes out literals char-by-char, calling ToString() on each. Would be much more efficient for the underlying CodeWriter to expose a Write(char) method.

@NTaylorMullen
Copy link
Member

@rynowak had a great suggestion to optimize our usage of HtmlSymbols and take a more Roslyn approach when creating them (caching them). When working on the issue I discovered that Razor's SyntaxTree has gaps in it. Meaning you could have a SyntaxTreeNode spanning 0-10 of the document and then its sibling spanning 12-20 (leaving a gap 10-12). In order to cache HtmlSymbols I had to decouple them from their positions and errors and instead rely on their parents positions. I had to decide between two approaches:

  1. Change the Razor parser to not create gaps in the SyntaxTree. This would also make the SyntaxTree more accurate.
  2. Have the parser knowingly create SyntaxTreeNodes with a specified position (and leave the gaps).

The problem with 2 is with how the parser is written today logic code does not always know when/if it's creating a Span. Sooo to make it create Spans with specific positions it'd be a muchhhh larger re-design.

So I decided to attempt 1. In my branch I was able to get Razor into a runnable state for mainline scenarios and profiled cached HtmlSymbols vs non-cached for MSN's home page (very large). In the end the cached HtmlSymbols resulted in a 4% reduction in allocations.

Overall the change is 100% positive and enforces Razor's data structures to be more accurate; however, with RTM around the corner, optimizing HtmlSymbols for a 4% gain would be a very risky change! I had to touch nearly every part of the parser and would need to touch more if I were to fix all the edge case scenarios in my branch. Assuming I fixed every test in the Razor repo I'd be confident in our runtime / design time tests coverage. However, we lack very thorough integration tests, ones that'd be required for a drastic parser change. Also, tooling has expectations on the format of the SyntaxTree that we'd be invalidating.

@rynowak and I talked and we suggest pushing this out to post RTM, BUT should definitely be done because it is an extremely positive/enforcing change. The 4% that we see now would also become more apparent/larger as other parts of the parser are optimized.

/cc @Eilon

@rynowak
Copy link
Member

rynowak commented Jan 14, 2016

RTM, BUT should definitely be done because it is an extremely positive/enforcing change. The 4% that we see now would also become more apparent/larger as other parts of the parser are optimized.

To add more to this, one of the best reasons to do this change in the future is that it will fix bugs we have and prevent us from re-introducing those bugs.

Agreed that this is probably more of a change that we want to take right now given the time it would take us to react to any bugs we'd introduce and the difficulty of testing the impact on editor/design-time.

@SimonOrdo
Copy link
Author

@NTaylorMullen , @rynowak I see that this was merged with another issue. Can you clarify for me... will large views (like those with datatables) still have trouble being parsed as described in this original post or has that part of the issue been resolved

@rynowak
Copy link
Member

rynowak commented Jan 14, 2016

will large views (like those with datatables) still have trouble being parsed as described in this original post

We're still working on this. Expect to see some concrete improvement in the next week or so.

All of the discussion here has fundamentally been about strategies to address the performance issues of parsing large pages. We investigated one avenue to address the issue and decided not to take that particular fix for this release.

@Eilon
Copy link
Member

Eilon commented Jan 14, 2016

@rynowak from your previous comment it wasn't clear to me if you want this for RC2, RTM, or post-RTM?

@rynowak
Copy link
Member

rynowak commented Jan 14, 2016

@Eilon - we should do the proposed design change for the tokenizer post RTM.

@rynowak
Copy link
Member

rynowak commented Jan 14, 2016

@NTaylorMullen and I are breaking up the items and working through them.

Here's a baseline (15 iterations doing codegen on MSN.cshtml) CoreCLR x64:

image

Partial fix - 6a4a954 - Removing copies of empty RazorError[] - This is about 50mb

image

NTaylorMullen added a commit that referenced this issue Jan 15, 2016
- The Equals operators were boxing the symbol types like crazy, added an abstract `SymbolTypeEquals` to avoid this.

#635
NTaylorMullen added a commit that referenced this issue Jan 15, 2016
- The Equals operators were boxing the symbol types like crazy, added an abstract `SymbolTypeEquals` to avoid this.

#635
@rynowak
Copy link
Member

rynowak commented Jan 15, 2016

ffdf5d2

@rynowak
Copy link
Member

rynowak commented Jan 15, 2016

95ea4cc

@NTaylorMullen
Copy link
Member

7a0a9c6

@rynowak
Copy link
Member

rynowak commented Jan 15, 2016

08fef95

@Eilon
Copy link
Member

Eilon commented Jan 15, 2016

Ok so do whatever we need now via this issue, and log a separate issue for whatever we want to do post-RTM.

@rynowak
Copy link
Member

rynowak commented Jan 15, 2016

315d79f

@NTaylorMullen
Copy link
Member

1ce9180

@rynowak
Copy link
Member

rynowak commented Jan 20, 2016

e68c55a

@NTaylorMullen
Copy link
Member

We were able to reduce allocations by 42% and have filed the additional follow up issues to reduce it further:

#674
https://github.com/aspnet/Razor/issues/675
https://github.com/aspnet/Razor/issues/676

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants