Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only format source generated file on demand #49685

Open
YairHalberstadt opened this issue Dec 1, 2020 · 32 comments
Open

Only format source generated file on demand #49685

YairHalberstadt opened this issue Dec 1, 2020 · 32 comments

Comments

@YairHalberstadt
Copy link
Contributor

This is probably more of a VS issue than a roslyn one, but it will demand some coordination between the two so I'll post it here.

I maintain the StrongInject Source Generator.

When generating the code, I generate everything on a single line, without any superfluous whitespace.

I then call CSharpSyntaxTree.ParseText(SourceText.From(file, Encoding.UTF8)).GetRoot().NormalizeWhitespace().SyntaxTree.GetText(); to format the generated code.

Well over 2/3ds of the time in my Source Generator is spent just on this, even though StrongInject is a very computation intensive SG.

However in the majority of cases this is completely unnecessary, as the user will never see the code.

Instead it would be useful if when adding a file to a compilation via the SourceGenerator I could set an unformatted flag. Then when VS displays the file to the user it will format it for me. I imagine this would lead to extremely large savings in a lot of SG projects.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Dec 1, 2020
@jaredpar
Copy link
Member

jaredpar commented Dec 1, 2020

This is an interesting question. Essentially how do we balance speed in source generators vs. displaying the text nicely in the IDE when customers do step through it. Think it's probably more on the IDE side though so moving there for now.

@jinujoseph jinujoseph added Concept-Continuous Improvement and removed untriaged Issues and PRs which have not yet been triaged by a lead labels Dec 1, 2020
@jinujoseph
Copy link
Contributor

cc @jasonmalinowski , @chsienki

@jinujoseph jinujoseph added this to the Backlog milestone Dec 1, 2020
@jasonmalinowski
Copy link
Member

I do wonder what the expensive part is in that dance -- is the intermediate parse or the normalization? And if it's the normalization is there some low hanging fruit we can fix there?

@chsienki: as it stands @YairHalberstadt is having to do two text reparses: the initial parse to get a tree to pass to NormalizeWhitespace, and then there's a second parse of the tree he had to convert back to. If the "add text" API had a "please format it" option, then we could do a parse, and do NormalizeWhitespace, and presume internally that NormalizeWhitespace doesn't a valid tree into an invalid one. (If it does, that's our own bug.) Even if we can't reduce the cost of the normalization (or the need to do it entirely), at least we can reduce a parse here.

@jaredpar: the tricky bit here is debugging: if the compiled version is the unformatted code, then the spans embedded in the PDB won't be right.

@AArnott also reported this internally a few weeks ago.

@sharwell
Copy link
Member

sharwell commented Dec 1, 2020

Do we have a feedback ticket with a performance trace to make sure NormalizeWhitespace isn't doing more work than necessary?

@YairHalberstadt
Copy link
Contributor Author

It's almost all NormalizeWhitespace, although the other parts aren't cheap either.

Also doing the double parse creates a lot of garbage.

@jasonmalinowski
Copy link
Member

@sharwell: @AArnott did share an ETL internally. I can fully imagine NormalizeWhitespace was never carefully optimized.

@YairHalberstadt
Copy link
Contributor Author

If this is only formatted on demand by VS, it would also allow us to use the simplify APIs.

Currently I generate fully qualified code for everything (including global::), to remove any chance of a conflict. However users would much rather see the file with imports added and names simplified. That API is both expensive, and only available in the IDE layer, so not possible to do in a source generator.

@CyrusNajmabadi
Copy link
Member

I'd rather we just fixup NormalizeWhitespace to be fast. I don't see why it would need to be slow.

@CyrusNajmabadi
Copy link
Member

Do we have a feedback ticket with a performance trace to make sure NormalizeWhitespace isn't doing more work than necessary?

My understanding of NW (from conversations over hte years) is taht it's completely unoptimized. So every token is visited and every token is rewritten no matter what. Though, in yair's case, that just might what needs to happen as he's just emitting one long string.

@sharwell
Copy link
Member

sharwell commented Dec 2, 2020

If someone submits a feedback ticket with a performance trace I'll be all over it 😄

@sharwell
Copy link
Member

sharwell commented Dec 2, 2020

Jason sent me a trace. First thing I notice is #48568 will help this. I'll send a PR shortly with additional improvements.

@sharwell
Copy link
Member

sharwell commented Dec 2, 2020

I submitted #49758 to implement a targeted improvement, and filed #49759 to address one of the more significant sources of performance overhead later. With both of those implemented, the next issue is finding a way to avoid calling GetNextToken so many times during the normalization process.

@stefanloerwald
Copy link

As much as I love performance improvements to NormalizeWhitespace: shouldn't we avoid computations in SGs as much as possible to keep their performance high? I would expect it to be a tiny fraction of generated source files that will ever be seen by a human, so operations that only do cosmetic changes to the generated code seem badly placed when they're happening essentially on every keystroke rather than just when the user actually wants to see the code.

@AArnott
Copy link
Contributor

AArnott commented Dec 3, 2020

@stefanloerwald We should normalize anyway. Imagine the debugging experience and exception stack traces. When an exception is thrown and shows a line number from a generated source file, do you want it to always say it's on "line 1"? If I were the one investigating a failure, I would certainly want a meaningful line number shown there.
Beyond stack traces, the pdb will only match against your unformatted source file. So if the IDE were to format it just as the document opened, the debugger would reject the file. While that might be a solvable problem, it would involve several teams and be quite expensive. If we can make NormalizeWhitespace decently fast, it's a better use of resources and ends up with VS shipping more meaningful features.

@stefanloerwald
Copy link

Good points, @AArnott, thanks for elaborating. Although of course not everything has to be in just one line, even if you don't normalize whitespace ;-)

@YairHalberstadt
Copy link
Contributor Author

Although of course not everything has to be in just one line, even if you don't normalize whitespace ;-)

Yep. The complexity of indenting my generated code is much higher than that of adding new lines.

@sharwell
Copy link
Member

sharwell commented Dec 3, 2020

The overhead can be nearly eliminated by making NormalizeWhitespace operate one token "ahead" of where it does today (i.e. instead of calling GetNextToken it just reads a field storing the previous token), in combination with #49759. I'm not a fan of different content for different build scenarios because it breaks debugging and all editor features. If we were to optimize a fast-path here, it would be the creation of a skeleton document (member signatures present, but the implementations all throw null) so IntelliSense understands the available members quickly.

@YairHalberstadt
Copy link
Contributor Author

@sharwell

The overhead can be nearly eliminated by making NormalizeWhitespace operate one token "ahead" of where it does today (i.e. instead of calling GetNextToken it just reads a field storing the previous token), in combination with #49759.

I'm currently giving this a go.

Currently NormalizeWhitespace uses GetNextToken() to calculate trailing trivia. Since CSharpSyntaxRewriter modifies the same token as it is currently visiting, this will require one of the following:

  1. NormalizeWhitespace moves such trailingTrivia to LeadingTrivia of the next token, with some special casing for the final token.
  2. We separately use InternalSyntax.GreenNode.EnumerateNodes to calculate the NextToken, without having to create a Red tree.
  3. Something else? I'm not sure what exactly - do you have any ideas?

@sharwell
Copy link
Member

@YairHalberstadt I'm not sure what the final form here would look like. I don't think either of the proposed changes will be particularly easy.

@Eli-Black-Work
Copy link

I'd like to follow up on this, since it's been a year 🙂

We have the same use case as @YairHalberstadt: We've written a source generator that outputs C# via string interpolation (e.g. $@"string {varName} = ""{stringValue}""), and we'd like to format the source before writing it.

@InflexCZE
Copy link

Same use case for us. Any improvements there would be greatly appreciated.
Regarding the possible optimizations of NormalizeWhitespace, there are few hot candidates:

GC -> 30%
SyntaxNormalizer.GetDeclarationDepth -> 10%
GreenNode.CreateList -> 5%
ChildSyntaxList.ItemInternal -> 3%

dotTrace sampling profile follows:
image
image
image

@CyrusNajmabadi
Copy link
Member

This is currently on the backlog. Recommendation would be to write out the new added source using just a string-builder, or invest in PRs that can improve things here. SyntaxNormalizer is very heavyweight though and, imo, not suitable for use in generators.

@Eli-Black-Work
Copy link

@AArnott Do you think it'd be worth filing a separate issue with your suggestion from #49685 (comment) (code generated by source generators should automatically be normalized)?

I fear that excellent suggestion will be lost among the other comments in this issue 🙂

@CyrusNajmabadi
Copy link
Member

automatic normalization absolutely cannot happen. It breaks semantics. Normalization doesn't preserve the original meaning of code, and thus must absolutely only happen in an opt-in fashion.

@Eli-Black-Work
Copy link

@CyrusNajmabadi Example? I'm having a hard time visualizing.

@CyrusNajmabadi
Copy link
Member

Foo();

//...
void Foo([CallerLineNumber] int i = 0);

Move 'Foo' and you change the meaning of the code and what is emitted. There are tools today that use these sorts of locations to make decisions. Changing the code around can change that and violate expectations they have for how code will be laid out (e.g. putting multiple things on the same line that may not have been before).

SGs are not a quick-and-dirty system that tries to infer what the user wants. It's a plain API: give us the exact source you want us to compile and we will treat that exactly as is.

If you want normalization, then just normalize the code yourself. Then you are stated explicitly that that is fine for your domain. But it leaves the compiler out of the job of having to say that it will both manipulate your code, but also completely preserve its meaning.

@AArnott
Copy link
Contributor

AArnott commented Nov 17, 2022

I think normalization will always be a perf hit and should therefore always be opt in. But I am curious when whitespace normalization would ever change the semantics of the code. @CyrusNajmabadi in your example, where would normalization move Foo such that semantics would change? Certainly rearranging code relative to each other can cause a problem, but how does adding/removing whitespace break it?
I'm excluding the case where whitespace is mandatory in order to avoid build breaks, since emitting a syntax tree with no trivia at all tends to not compile, yet adding whitespace tends to be what the developer intended.

@CyrusNajmabadi
Copy link
Member

I consider a "break" to be any scenario where the compiler emits different IL or metadata. In that event what is in the final dll is not the same as what would be there if no normalization had happened.

The above example demonstrates one case where such a break would occur. Location of code is reflected in the IL we emit, and I know if at least two cases where SG authors use that information. E.g. they use the calling file/line info to look up data in a map that the SG has prepopulated. So if the code moves, that will no longer work.

@AArnott
Copy link
Contributor

AArnott commented Nov 17, 2022

File/line info is in the PDB though. That's not IL or metadata. Are you only talking about moving source code invalidating where the SG expected to find the code references in the PDB? Or are you saying that adding whitespace changes IL or metadata in the dll itself?

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Nov 17, 2022

File/line info is in the PDB thoug

No. It's also in IL. See my example from above. This is a technique already being used.

Or are you saying that adding whitespace changes IL or metadata in the dll itself?

Yup. That's what I'm saying.

@Eli-Black-Work
Copy link

@CyrusNajmabadi Would you mind expounding on those two cases where SG authors use file/line info to look up data in a map that the SG has prepopulated? 🙂 I'm having a bit of trouble following.

@Eli-Black-Work
Copy link

Eli-Black-Work commented Feb 6, 2023

Foo();

//...
void Foo([CallerLineNumber] int i = 0);

Move 'Foo' and you change the meaning of the code and what is emitted. ...

@CyrusNajmabadi I just gave this another read. Now I understand your argument about how automatically formatting SG output could change the generated IL 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests