Observing FEC compiled code is 25 percent slower in some cases #211
Notes regarding FEC:
Side-notes regarding compilation - still may be relevant at the end:
Stats regarding expressions used in #205:
I tested changing the constant declaration threshold to 10, so that no variables are created for expressions that have fewer than 10 constants in FEC. The gains were within the measurement error margin.
@Havunen So maybe we are hitting some code size or stack exhaustion limits. The question is how those things are addressed by Linq.
Effectively, they are reusing the vars if possible. In addition, there is some stack analysis going on: https://github.com/microsoft/referencesource/blob/3b1eaf5203992df69de44c783a3eda37d3d4cd10/System.Core/Microsoft/Scripting/Compiler/LambdaCompiler.cs#L237
The results:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |----------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
| MsDI | 3.854 us | 0.0309 us | 0.0258 us | 1.00 | 0.00 | 0.9460 | - | - | 4.37 KB |
| DryIoc | 1.699 us | 0.0030 us | 0.0028 us | 0.44 | 0.00 | 0.6409 | - | - | 2.96 KB |
| DryIoc_WithoutFEC | 16.344 us | 0.1252 us | 0.1045 us | 4.24 | 0.05 | 1.0681 | - | - | 4.93 KB |

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------------------------- |----------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
| MsDI | 3.852 us | 0.0255 us | 0.0239 us | 1.00 | 0.00 | 0.9460 | - | - | 4.37 KB |
| DryIoc_WebRequestScoped | 2.128 us | 0.0113 us | 0.0106 us | 0.55 | 0.00 | 0.6866 | - | - | 3.17 KB |
| DryIoc_WebRequestScoped_WithoutFEC | 17.713 us | 0.1377 us | 0.1288 us | 4.60 | 0.04 | 1.0986 | - | - | 5.14 KB |
I've tried other variations with closure variables and everything became slower than before.
OK, more on variables. At the moment FEC uses an all-or-nothing policy when deciding what to do with closure elements. A better approach would be to store an element in a local variable only when it is used 2+ times. If an element is used once, it is fine to load it directly from the closure. This will save us more instructions and more local variables, right? When examining the expression we may count the closure element usages; the counters may reside in a separate array (or growing list) parallel to the constants. Then we may use the same array to store the variable indexes: given that the variable with zero index always contains the closure array, an index > 1 will point to the corresponding variable. Likely we can reuse the same index array between many nested lambda compilations.
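The counting idea above could be sketched roughly as follows. This is not FEC's actual code: `ConstantUsageCounter` and `AssignLocalVariableIndexes` are hypothetical names, and the convention that `0` means "no local, load from the closure" is an assumption of this sketch. It only relies on the standard `System.Linq.Expressions.ExpressionVisitor`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Hypothetical helper for illustration only, not part of FEC.
sealed class ConstantUsageCounter : ExpressionVisitor
{
    // Parallel lists: Constants[i] was seen UsageCounts[i] times.
    public readonly List<object> Constants = new List<object>();
    public readonly List<int> UsageCounts = new List<int>();

    protected override Expression VisitConstant(ConstantExpression node)
    {
        // Assumption: primitives and strings are emitted inline,
        // so only other non-null values would go through the closure.
        var value = node.Value;
        if (value != null && !node.Type.IsPrimitive && node.Type != typeof(string))
        {
            var index = IndexOfReference(value);
            if (index == -1)
            {
                Constants.Add(value);
                UsageCounts.Add(1);
            }
            else
            {
                UsageCounts[index]++;
            }
        }
        return node;
    }

    private int IndexOfReference(object value)
    {
        for (var i = 0; i < Constants.Count; i++)
            if (ReferenceEquals(Constants[i], value))
                return i;
        return -1;
    }

    // Turns usage counts into local variable indexes.
    // 0 = used once, load from the closure array in place;
    // 1 and above = index of the local that caches the element
    // (local 0 is assumed to hold the closure array itself).
    public int[] AssignLocalVariableIndexes()
    {
        var indexes = new int[UsageCounts.Count];
        var nextLocal = 1;
        for (var i = 0; i < indexes.Length; i++)
            indexes[i] = UsageCounts[i] >= 2 ? nextLocal++ : 0;
        return indexes;
    }
}
```

With an expression that references one object twice and another object once, visiting it and calling `AssignLocalVariableIndexes()` would give a local only to the first object, so the second one is loaded from the closure where it is used.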
I have added the improvement with the variables but it did not change the results :/ |
I am planning to release v4.1 as-is - you may prefer to use |
Hey, this issue might have something to do with multithreading. I noticed that when I run LoadTest using (without FEC, without interpretation)
Single Thread: But when I run the load test using (using FEC, without interpretation) ResolveAllControllersOnce of 156 controllers is done in 0,2247802 seconds
So it seems to me that FEC is 3X faster when it is operating in a single thread, but under heavy parallel load it performs worse than the withoutFastExpressionCompiler option. Edit: this measurement needs to be verified, I might have accidentally included delegate compilation time...
Another thing I have tried to investigate: could the new version of FEC be hitting some DynamicMethod boundary-call security check? If I understood correctly, the previous version of FEC used to compile one big expression, and the new one compiles multiple small ones? Or maybe those compiled expressions (dynamic methods) calling each other are not optimized by the JIT?
This may be an important observation. Hope it provides some clue.
No, FEC always compiled the nested lambdas separately, because I don't know other suitable ways to create dynamic delegates at the moment.
Yeah, this is suspicious... You are not using Tiered Compilation in the JIT, right? I think it started with .NET Core 3, so asking just in case. Otherwise, we need to confirm this hypothesis somehow; I will try to google it.
We may try this. I remember trying it a long time ago without visible changes, but maybe now it will make a difference.
The interesting thing about the nested lambdas here for DryIoc scoped services is that they are not needed. They just help to factor the lazy evaluation into a concise method call. Consider how it is done NOW in pseudocode:

    r => new A(
        b: r.CurrentScope.GetOrAdd(42, rr => new B(
            c: rr.CurrentScope.GetOrAdd(43, _ => new C())
        ))
    );

But what it actually does, in pseudocode:

    r => new A(
        b: r.CurrentScope.TryGet(42) ?? {
            lock(entryForB) {
                var b = new B(c: r.CurrentScope.TryGet(43) ?? {
                    lock(entryForC) {
                        var c = new C();
                        r.CurrentScope.Add(43, c);
                        return c;
                    }
                });
                r.CurrentScope.Add(42, b);
                return b;
            }
        }
    )

I was trying to represent it that way for v4.1 but stopped midway because of the rising complexity and lack of time. In addition, the overall expression size grows, but maybe that can be tackled by other means, or maybe it is a lesser problem than the nested lambdas...
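To make the point about the lambda being only a lazy-evaluation wrapper concrete, here is a minimal sketch of a scope with a `GetOrAdd` that takes a factory delegate. `ScopeSketch` and its `Entry` type are hypothetical and are not DryIoc's actual scope implementation; the sketch assumes a per-entry lock as in the pseudocode above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical minimal scope for illustration, not DryIoc's Scope.
sealed class ScopeSketch
{
    private sealed class Entry
    {
        public object Value;
    }

    private readonly ConcurrentDictionary<int, Entry> _entries =
        new ConcurrentDictionary<int, Entry>();

    // The factory delegate exists only to defer creation until we know
    // the value is missing. This is the role of the nested lambda.
    public object GetOrAdd(int id, Func<ScopeSketch, object> factory)
    {
        var entry = _entries.GetOrAdd(id, _ => new Entry());

        // Fast path: the value was already created.
        var value = Volatile.Read(ref entry.Value);
        if (value != null)
            return value;

        // Slow path: create the value once under the per-entry lock.
        lock (entry)
        {
            if (entry.Value == null)
                Volatile.Write(ref entry.Value, factory(this));
            return entry.Value;
        }
    }
}
```

A nested call such as `scope.GetOrAdd(42, s => new B(s.GetOrAdd(43, _ => new C())))` takes the lock for entry 43 while holding the lock for entry 42, which mirrors the nesting in the inlined pseudocode.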
This is throwing the exception that "... operation is invalid for DynamicMethod" :( |
Some more info about From the article, I got the impression the DynamicMethod comes pre-jitted. hmmm |
Hi @Havunen, what if, instead of passing the nested lambda into the Scope methods, I pass the expression itself? So instead of:

    r => new A(
        b: r.CurrentScope.GetOrAdd(42, rr => new B(
            c: rr.CurrentScope.GetOrAdd(43, _ => new C())
        )),
        c: r.CurrentScope.GetOrAdd(43, _ => new C())
    );

It will become:

    var cRef = Constant(Ref.Of<object>((Expression<FactoryDelegate>)(
        r => new C())));
    var bRef = Constant(Ref.Of<object>((Expression<FactoryDelegate>)(
        r => new B(c: r.CurrentScope.GetOrAdd(43, cRef)))));
    r => new A(
        b: r.CurrentScope.GetOrAdd(42, bRef),
        c: r.CurrentScope.GetOrAdd(43, cRef)
    );

Then inside the

    //...
    lock (entry)
    {
        // Compile the expression on first use and keep the delegate in the shared Ref
        if (facRef.Value is Expression<FactoryDelegate>)
            facRef.Swap(x => x is Expression<FactoryDelegate> e ? (object)e.CompileFast() : x);
        entry.Value = ((FactoryDelegate)facRef.Value).Invoke(resolver);
    }
    //...

More details:
We should test to get the idea of improvement or regression. |
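The compile-on-first-use idea could be prototyped in isolation along these lines. This is a sketch under stated assumptions: `FactoryRef` is a hypothetical stand-in for the `Ref<object>` constant, `Func<object>` stands in for `FactoryDelegate`, and the standard `Expression<T>.Compile()` stands in for `CompileFast()` so the example has no FEC dependency.

```csharp
using System;
using System.Linq.Expressions;
using System.Threading;

// Hypothetical stand-in for the shared Ref constant in the proposal.
sealed class FactoryRef
{
    // Holds either the Expression<Func<object>> or the compiled Func<object>.
    private object _value;

    public FactoryRef(Expression<Func<object>> expression)
    {
        _value = expression;
    }

    public bool IsCompiled => Volatile.Read(ref _value) is Func<object>;

    public object Invoke()
    {
        var current = Volatile.Read(ref _value);
        if (current is Expression<Func<object>> expression)
        {
            // Stand-in for CompileFast(): compile on first use.
            var compiled = expression.Compile();

            // Publish the delegate. If another thread published first,
            // CompareExchange returns that thread's delegate and we use it.
            var previous = Interlocked.CompareExchange(ref _value, compiled, current);
            current = ReferenceEquals(previous, current) ? compiled : previous;
        }
        return ((Func<object>)current).Invoke();
    }
}
```

Because the compiled delegate replaces the expression inside the shared reference, any other expression holding the same `FactoryRef` would pick up the compiled delegate on its next call, which is the caching effect the proposal is after.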
Some benchmark results in a branch
We can use an interpretation approach: Then create the lambda expression with added parameters, compile it and Invoke with passed values. OR given the FEC support for providing non-passed parameters directly with closure, pass them and the collected constants when compiling |
There is an important thing about using the Ref expression: the compiled expression will be stored in the Ref constant and therefore in the expression cache per container. So resolving another service with the cached sub-graph expression will allow using the compiled delegate even in the interpretation phase! It needs to be tested though.
The other important thing we need to consider is the code generation scenario. In this case hiding the expression in the Ref is not suitable for generation. So we need conditional logic and a different generated expression output, which adds to the complexity. So I would stop with this approach for now and look into other alternatives, like sharing the closure between nested lambdas and avoiding constant collection for the closure when compiling.
I think in the original code compilation time was excluded from the measurements. Timing started after each root was resolved twice, so it should not include compilation, right?
@Havunen Sorry, what measurements are you referring to?
Invocation of compiled DynamicMethod delegate was/is ~25% slower compared to Linq compiled method |
Got it now. Yes, the 3rd time should just invoke the cached delegate compiled on the 2nd resolve. Yes, it was, and you say it is still the case now, right? What about the single-threaded vs multi-threaded cases, do they differ?
I have not actually tested this since 4.2.0, need to verify it again.
Not sure, need to re-validate if this is the case |
Closing this thread. Let's start the new one if the issue arises. |
Found in #205