Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Text.Json incremental source generator performance issues #68353

Closed
Tracked by #77019
eiriktsarpalis opened this issue Apr 21, 2022 · 26 comments · Fixed by #86616
Closed
Tracked by #77019

System.Text.Json incremental source generator performance issues #68353

eiriktsarpalis opened this issue Apr 21, 2022 · 26 comments · Fixed by #86616
Assignees
Labels
area-System.Text.Json source-generator Indicates an issue with a source generator feature tenet-performance Performance related issue
Milestone

Comments

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Apr 21, 2022

The System.Text.Json.SourceGeneration.Roslyn4.0.Tests project defines ~35 large JsonSerializerContext classes, resulting in approximately 3500 generated source files. The size of the project has had a detrimental effect on performance, both when building but especially when editing the project: on my primary dev machine the IDE essentially goes unresponsive when attempting to make any change in the project.

I ran a performance investigation and came up with the following findings:

  1. There appear to be performance issues inherent in Roslyn ingesting such big numbers of generated source files. On my machine a build will max out all 20 cores of my CPU and consume 4-7 GB of memory. It can be isolated to making a large number of calls to the SourceProductionContext.AddSource method and is not related to the source generator itself.
  2. Our incremental source generator implementation will regenerate every single source file on every keystroke. This is attributable to a combination of two factors:
  3. The source generator is not honoring the CancellationToken passed by Roslyn. Fixed in Make json source generation cancellable #69332

I'm not sure if 1) could be improved, however it definitely seems like 2) is a performance bug in our source generator implementation that we should try to fix. Not sure how it could be done without some refactoring of the Parser class.

cc @layomia @ericstj @eerhardt

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Apr 21, 2022
@ghost
Copy link

ghost commented Apr 21, 2022

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.

Issue Details

The System.Text.Json.SourceGeneration.Roslyn4.0.Tests project defines ~35 large JsonSerializerContext classes, resulting in approximately 3500 generated source files. The size of the project has had a detrimental effect on performance, both when building but especially when editing the project: on my primary dev machine the IDE essentially goes unresponsive when attempting to make any change in the project.

I ran a performance investigation and came up with the following findings:

  1. There appear to be performance issues inherent in Roslyn ingesting such big numbers of generated source files. On my machine a build will max out all 20 cores of my CPU and consume 4-7 GB of memory. It can be isolated to making a large number of calls to the SourceProductionContext.AddSource method and is not related to the source generator itself.
  2. Our incremental source generator implementation will regenerate every single source file on every keystroke. This is attributable to a combination of two factors:
  3. The source generator is not honoring the CancellationToken passed by Roslyn.

I'm not sure if 1) could be improved, however it definitely seems like 2) is a performance bug in our source generator implementation that we should try to fix. Not sure how it could be done without some refactoring of the Parser class.

cc @layomia @ericstj @eerhardt

Author: eiriktsarpalis
Assignees: -
Labels:

area-System.Text.Json, untriaged

Milestone: -

@eiriktsarpalis
Copy link
Member Author

cc @chsienki

@eiriktsarpalis
Copy link
Member Author

eiriktsarpalis commented Apr 22, 2022

Here's a naive workaround that improves performance when editing the project in Visual Studio. It avoids regenerating sources for unchanged context classes by avoiding Collect() and invoking the source generator separately for each newly updated class. It also employs a custom equality comparer that ignores the Compilation component so that arbitrary changes do not trigger source generation.

As it stands though the workaround suffers from a couple of issues:

  1. Invoking the source generator independently for every source class is suboptimal in batch compilation. I was looking for a variant of IncrementValuesProvider.Collect() that only surfaces the newly updated values but such functionality doesn't seem to exist.
  2. The source generator will only trigger if the context classes themselves are modified, but not if any of their JsonSerializableAttribute dependencies are. A refinement of this might need to resolve the syntax nodes from the attribute symbols but I'm not sure if and how this might be possible.

@ericstj
Copy link
Member

ericstj commented Apr 23, 2022

We clearly have a lot more to learn about building these. Do we have the right inputs and APIs available to make our source-regeneration incremental? I know @eiriktsarpalis and @sharwell looked at this before and made improvements.

One thing to be careful of -- which I noticed for our own SLN -- is that it seemed to be loading the "first" source generator for JSON that it hit which was sometimes the old non-incremental one. It wouldn't load the incremental version since it had the exact same name. That's an internal only problem since customers would see only the one appropriate for their compiler version.

@layomia
Copy link
Contributor

layomia commented Apr 28, 2022

[With this comment I'm assuming that generating less files might have less impact on memory and cores, even if the material content generated remains the same]

It can be isolated to making a large number of calls to the SourceProductionContext.AddSource method and is not related to the source generator itself.

I recall @eerhardt gave feedback about reducing the number of files we output. Today we generate individual files for the following:

  1. the template context implementation including ctors, intialization of source-generated options, helpers to get converters
  2. the JsonSerializerContext.GetTypeInfo implementation
  3. JsonEncodedText-typed members for property names used for fast-path deserialization
  4. unique files for every type specified on the context, including types nested in their object graphs (this can be a lot)

Since we've identified this as a perf bottleneck, we should consider combining these files (especially the first 3). Unique files for each type is nice for debugging source-generated code for a single type, so it might be helpful to define a file-generation strategy, say based on an msbuild property.

@layomia
Copy link
Contributor

layomia commented Apr 28, 2022

Issues 2 & 3 should be looked into and addressed.

One thing to be careful of -- which I noticed for our own SLN -- is that it seemed to be loading the "first" source generator for JSON that it hit which was sometimes the old non-incremental one. It wouldn't load the incremental version since it had the exact same name. That's an internal only problem since customers would see only the one appropriate for their compiler version.

Definitely worth looking into resolving this.

@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Apr 28, 2022
@layomia layomia added this to the 7.0.0 milestone Apr 28, 2022
@sharwell
Copy link
Member

sharwell commented Apr 28, 2022

It also employs a custom equality comparer that ignores the Compilation component

Never do this. This action instructs the command line compiler that it is allowed to produce inaccurate outputs in an actual .dll or .exe, and will allow a production machine to publish/sign/ship that invalid output.

@sharwell
Copy link
Member

will regenerate every single source file on every keystroke

There were several Roslyn bugs that were causing this to occur when it shouldn't. Roslyn's design is to not regenerate sources on every keystroke regardless of the implementation of the source generator itself, and either 17.2 or 17.3 has been updated to better reflect this desired behavior.

@eiriktsarpalis
Copy link
Member Author

One thing to be careful of -- which I noticed for our own SLN -- is that it seemed to be loading the "first" source generator for JSON that it hit which was sometimes the old non-incremental one. It wouldn't load the incremental version since it had the exact same name. That's an internal only problem since customers would see only the one appropriate for their compiler version.

Definitely worth looking into resolving this.

FWIW I deleted the Roslyn3.1 projects locally to ensure no interference and can confirm that the performance problems persist even with just the 4.0 projects present.

@eiriktsarpalis
Copy link
Member Author

It also employs a custom equality comparer that ignores the Compilation component

Never do this. This action instructs the command line compiler that it is allowed to produce inaccurate outputs in an actual .dll or .exe, and will allow a production machine to publish/sign/ship that invalid output.

That's interesting, I would have assumed there were no incremental caching concerns when building from the command line. I can see how this can cause the source generator to fire less frequently than is desirable in interactive scenaria, but I don't think it should be capable of passing unsound inputs to the source generator.

@eiriktsarpalis
Copy link
Member Author

will regenerate every single source file on every keystroke

There were several Roslyn bugs that were causing this to occur when it shouldn't. Roslyn's design is to not regenerate sources on every keystroke regardless of the implementation of the source generator itself, and either 17.2 or 17.3 has been updated to better reflect this desired behavior.

Hmmm, I tried attaching the Roslyn process from the latest VS 17.3 to the debugger and it would seem like every edit still triggers a new compilation.

@sharwell
Copy link
Member

sharwell commented Apr 29, 2022

That's interesting, I would have assumed there were no incremental caching concerns when building from the command line.

It's also possible to enable the incremental cache for VBCSCompiler, with a long-term goal to improve compilation performance by reducing the number of cases where a source generator needs to run.

Hmmm, I tried attaching the Roslyn process from the latest VS 17.3 to the debugger and it would seem like every edit still triggers a new compilation.

Edits can trigger new compilations, but it's no longer tied to the inner typing loop and should operate on longer delays. Continued typing should automatically increase the delay.

Some (non-default) user settings can override this behavior and increase the frequency the generators run.

Under a debugger, it may be hard to tell the difference between the 17.1 and 17.3 behavior. However, if you watch the ETW events to see the total number of generator executions/updates, you should see a significant reduction in the numbers during active typing scenarios.

@eiriktsarpalis eiriktsarpalis added the tenet-performance Performance related issue label May 3, 2022
@CyrusNajmabadi
Copy link
Member

This is definitely an issue with the latest versions of roslyn. Locally, i'm measuring the json generator using 75% of all the CPU in devenv and our external server. This impacts all features that need accurate semantics. For example, lightbulbs, diagnostics, etc. i've measured tens of seconds of delay introduced by this.

@CyrusNajmabadi
Copy link
Member

#69332 Helps address:

  1. The source generator is not honoring the CancellationToken passed by Roslyn.

@eiriktsarpalis
Copy link
Member Author

Moving to 8.0.0 as we won't have time to address this in .NET 7

@eiriktsarpalis eiriktsarpalis added the source-generator Indicates an issue with a source generator feature label Mar 9, 2023
@weifenluo
Copy link

I'm developing a similar source generator as System.Text.Json, to generate a custom type descriptor for each type.

I'm a bit shocked with Roslyn "incremental" source generation, it needs significant change to make it real "incremental":

  1. First of all, SourceProductionContext.AddSource, which is the final step of any source generator, is not incremental. Only values of IncrementalValueProvider<T> and IncrementalValuesProvider<T> during transformations can be cached, the transformations, however, are normally cheap operations;
  2. Roslyn is implemented via red and green nodes: You, the user, only ever see the red tree; the green tree is an implementation detail. We need the green tree node to detect whether it has been changed. For example, we cannot cache INamedTypeSymbol, because it is created on demand and varies from every Compilation object. The green tree node must be exposed publicly to write incremental source generator.

So the viable solution would be using RegisterImplementationSourceOutput instead of RegisterSourceOutput, which declares that the source produced has no semantic impact on user code from the point of view of code analysis - we can workaround this by defining types into a separate project/assembly. I'm not sure STJ can go this approach though.

@CyrusNajmabadi
Copy link
Member

First of all, SourceProductionContext.AddSource, which is the final step of any source generator, is not incremental

Correct. That's because a good incremental generator will rarely ever get to that step after the pipeline has run once. With most users edits/changes, the pipeline will stop early, so we can use the texts/trees from the prior generation as is.

We need the green tree node to detect whether it has been changed

As you said, it's an implementation detail. Exposing it would lock us into that impl detail forever.

For example, we cannot cache INamedTypeSymbol, because it is created on demand and varies from every Compilation object.

You should extract from the INamedType the data your generator actually needs, and use that subset of data as the value your IVP produces. That way, if that subset doesn't change, the pipeline will stop.

@weifenluo
Copy link

First of all, SourceProductionContext.AddSource, which is the final step of any source generator, is not incremental

Correct. That's because a good incremental generator will rarely ever get to that step after the pipeline has run once. With most users edits/changes, the pipeline will stop early, so we can use the texts/trees from the prior generation as is.

We need the green tree node to detect whether it has been changed

As you said, it's an implementation detail. Exposing it would lock us into that impl detail forever.

For example, we cannot cache INamedTypeSymbol, because it is created on demand and varies from every Compilation object.

You should extract from the INamedType the data your generator actually needs, and use that subset of data as the value your IVP produces. That way, if that subset doesn't change, the pipeline will stop.

So if I understand correctly, the pipeline can be short-circuited, it's not incremental.

Let's say we have 100 classes type1...type100 defined in the project, the first run will generate source code file 1...100, each depends on INamedTypeSymbol1...INamedTypeSymbol100, and public members of each INamedTypeSymbol.

If an edit in IDE triggers no change of generated 100 files at all, the pipeline can stop before actually generating the files. This is good improvement compares to V1 ISourceGenerator; however, if an edit in IDE triggers a change which will affect output source file5, all 100 files will be re-generated all together. IMO this is not incremental - an incremental source generator will only need to generate file5, leaving others untouched. I believe current pipeline does not support this, a real incremental source generator API cannot be that simple, that's why I was bit shocked when I first saw it.

Now let's get to pipeline short-circuit. Theoretically we can extract from the INamedTypeSymbol the data which generator actually needs, and use that subset of data as the value your IVP produces. Practically we're diffing almost the whole tree for almost every key stroke in IDE. The IDE compiler is true incremental, but also very complex, because it only needs to update a very small portion of green tree nodes for every change; however, the source generator has a different story, it receives every snapshot of Compilation object, and since we don't know where the edit happened, we have to diff the entire tree for every pipeline run. This will create a lot, if not all, red nodes, which is anti-pattern of Roslyn API, assuming only a very small portion of red nodes need to be created on demand.

@CyrusNajmabadi
Copy link
Member

the pipeline can be short-circuited, it's not incremental.

Short circuiting is incremental. This is the same way that incremental parsing works. The portions to be redone are computed, and the system stops once it discovers that redoing work would produce the same results as before. At that point, the old results are carried forward.

@CyrusNajmabadi
Copy link
Member

I believe current pipeline does not support this

It does support this. Just produce the same model values for nodes 1-4 and 6-100 and we won't do the following steps after that. We'll only recompute the source for model 5 and reuse the rest.

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Mar 17, 2023

we have to diff the entire tree for every pipeline run.

No, you won't have to do this. The syntaxtree instances themselves are identical. So, if they are your model value (or part of your model), they will be equal across edits outside of them. So no need to diff them at all. And, if you even did diff, you'd just do the pointer check.

You only need to diff the trees that are actually different (i.e. don't have reference equality).

@CyrusNajmabadi
Copy link
Member

we have to diff the entire tree for every pipeline run. This will create a lot, if not all, red nodes, which is anti-pattern of Roslyn API, assuming only a very small portion of red nodes need to be created on demand.

You only need the red node tree for the current file. And that's not a problem because literally every feature (including the compiler) needs it. So it's going to be computed.

If you are that worried though, you can use IsIncrementallyEquivalent on the nodes as you walk the tree to avoid going into red nodes we can prove are identical.

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Mar 17, 2023

Note: if you'd like to see a demonstration of this in action, I recommend looking at the impl in roslyn of ForAttributeWithMetadataName and the simple name helper it calls. You'll see how it is able to just reanalyze only a particular tree when it changes to build intermediary state. And it then uses that state to determine what even to examine elsewhere in the entire project.

https://github.com/dotnet/roslyn/blob/e47b15b2485bee978bafb5c7dcce99d8bdc2f8fd/src/Compilers/Core/Portable/SourceGeneration/Nodes/SyntaxValueProvider_ForAttributeWithSimpleName.cs#L65

@weifenluo
Copy link

@CyrusNajmabadi

Thanks a lot for all your explanations! I dug a little further into Roslyn source code repo, you're right, the pipeline is incremental.

If you are that worried though, you can use IsIncrementallyEquivalent on the nodes as you walk the tree to avoid going into red nodes we can prove are identical.

I can't find IsIncrementallyEquivalent anywhere, neither API documentation nor search in the source code repo. I do worry about the excessive red node creation, because the scenario I posted is greatly simplified for demonstration purpose only.

@CyrusNajmabadi
Copy link
Member

Sorry, it's called IsIncrementallyIdenticalTo. :)

@CyrusNajmabadi
Copy link
Member

because the scenario I posted is greatly simplified for demonstration purpose only.

We'd need to speak practically about what you're trying to do to to be able to make any determinations about the issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.Json source-generator Indicates an issue with a source generator feature tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants