Observing FEC compiled code is 25 percent slower in some cases #211

dadhi · 2020-01-16T05:48:33Z

Found in #205

dadhi · 2020-01-16T05:59:16Z

Notes regarding FEC:

Previous FEC version (V2) was using object-based closure for the small number of the closed constants - the new version (V3) is using array-based closure for everything to simplify implementation. AFAIK, for manually composed expression Linq version uses array-based closure and for the hoisted expression it uses object-based closure. So we should be on par with manually composed expression, right?
FEC V3 pre-loads closure constants into variables and in small FEC benchmarks it speeds up things and improves memory usage. Not sure about the Linq strategy.
FEC V3 emits nested lambdas differently in many aspects and FEC benchmarks display a big performance difference, FEC may produce 100X faster delegate. Assuming that 4.1.0 Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object. #205 intensively uses the nested lambdas - the final result is controversial at least.

Side-notes regarding compilation - still may be relevant at the end:

In 4.1.0 Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object. #205 FEC shows 4x faster compilation results - this is expected.
It seems that Linq version probes the Stack capacity and offloads the part of compilation to the different thread if Stack is not enough.
FEC compiles the same nested lambda expression only once, therefore it tracks the compiled lambdas on the root expression level.
FEC creates its own closure instance for the nested lambda with only members that required for this lambda. An alternative approach would be to create a single closure with all members for all nested lambdas.

Stats regarding expressions used in #205:

TBD...

Havunen · 2020-01-17T06:59:53Z

I tested changing constant declaration threshold to 10. So that no variables are created for those expression which has less than 10 constants in FEC. The gains were in the error margin of measurement.

dadhi · 2020-01-17T07:12:51Z

@Havunen
Thanks for putting it out of the way.
I've run a smaller benchmark from #44 and for both Scoped and InWebRequest reuse and found that FEC is 10x faster than without FEC, those are ~40 dependencies on 4 levels.

So maybe we are hitting some code size or stack exostiveness limits. The question is how those things are addressed by Linq.

dadhi · 2020-01-17T09:09:57Z

Found this code: https://github.com/microsoft/referencesource/blob/3b1eaf5203992df69de44c783a3eda37d3d4cd10/System.Core/Microsoft/Scripting/Compiler/LambdaCompiler.cs#L258

Effectively they reusing the vars if possible.

In addition, there is some stack analysis is going on https://github.com/microsoft/referencesource/blob/3b1eaf5203992df69de44c783a3eda37d3d4cd10/System.Core/Microsoft/Scripting/Compiler/LambdaCompiler.cs#L237

dadhi · 2020-01-18T19:37:07Z

I've run a smaller benchmark from #44 and for both Scoped and InWebRequest reuse and found that FEC is 10x faster than without FEC, those are ~40 dependencies on 4 levels.

The results:

|            Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |----------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|              MsDI |  3.854 us | 0.0309 us | 0.0258 us |  1.00 |    0.00 | 0.9460 |     - |     - |   4.37 KB |
|            DryIoc |  1.699 us | 0.0030 us | 0.0028 us |  0.44 |    0.00 | 0.6409 |     - |     - |   2.96 KB |
| DryIoc_WithoutFEC | 16.344 us | 0.1252 us | 0.1045 us |  4.24 |    0.05 | 1.0681 |     - |     - |   4.93 KB |
                
|                             Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------------------------- |----------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|                               MsDI |  3.852 us | 0.0255 us | 0.0239 us |  1.00 |    0.00 | 0.9460 |     - |     - |   4.37 KB |
|            DryIoc_WebRequestScoped |  2.128 us | 0.0113 us | 0.0106 us |  0.55 |    0.00 | 0.6866 |     - |     - |   3.17 KB |
| DryIoc_WebRequestScoped_WithoutFEC | 17.713 us | 0.1377 us | 0.1288 us |  4.60 |    0.04 | 1.0986 |     - |     - |   5.14 KB |

dadhi · 2020-01-18T20:23:53Z

I've tried other variations with closure variables and everything become more slower than before.
So given the results above this is a very interesting case.. no ideas now.

dadhi · 2020-01-18T20:51:02Z

Ok, more on variables.

At the moment FEC uses all or nothing policy when deciding what to do with closure elements. But more ideal way would be to store element in local variable only when it used 2+ times. If element is used once then it is fine to load it from the closure. This will safe us more instructions and more local variables, right?

When examining the expression we may count of closure element usages - the counters may reside in the separate array (or growing list) to constants. Then we may use the same array to store the variable indexes, given that the variable with zero index is always contain a closure array, the index > 1 will point to the corresponding variable.

Likely we can reuse the same index array between many nested lambdas compilations.

dadhi · 2020-01-20T09:28:13Z

I have added the improvement with the variables but it did not change the results :/

dadhi · 2020-01-20T15:07:31Z

I am planning to release v4.1 as-is - you may prefer to use WithoutFastExpressionCompiler. Then will try to tackle this case in the next version.

Havunen · 2020-04-02T19:50:28Z

Hey

This issue might have something to do with multi threading

I noticed that when I run LoadTest using (without FEC, without interpretation)

            var container = new Container(rules => rules
                .WithoutFastExpressionCompiler()
                .WithoutUseInterpretation()
                .With(FactoryMethod.ConstructorWithResolvableArguments))
                .WithWebApi(config);

Single Thread:
ResolveAllControllersOnce of 156 controllers is done in 0,2131989 seconds
ResolveAllControllersOnce of 156 controllers is done in 17,9537902 seconds

But when I run load test using (using FEC, without interpretation)

ResolveAllControllersOnce of 156 controllers is done in 0,2247802 seconds
ResolveAllControllersOnce of 156 controllers is done in 5,0540016 seconds

            var container = new Container(rules => rules
                .WithoutUseInterpretation()
                .With(FactoryMethod.ConstructorWithResolvableArguments))
                .WithWebApi(config);

So it seems to me that FEC is 3X faster when its operating in single thread, but under heavy parallel load its performing worse than withoutFastExpressionCompiler option

edit: This measurement needs to be verified, I might have accidentally included delegate compilation time...

Havunen · 2020-04-02T20:05:15Z

Another thing which I have tried to investigate could it be possible that the new version of FEC is hitting some dynamicMethod boundary call security check? If I understood correctly the previous version of FEC used to compile one big Expression, and now the new one has multiple small ones?

Or maybe those compiled expressions ( dynamic methods ) calling each other are not optimized by JIT ?

dadhi · 2020-04-03T06:44:57Z

So it seems to me that FEC is 3X faster when its operating in single thread, but under heavy parallel load its performing worse than withoutFastExpressionCompiler option

This may be an important observation. Hope it provides some clue.

If I understood correctly the previous version of FEC used to compile one big Expression, and now the new one has multiple small ones?

No, FEC always compiled the nested lambdas separately, because I don't know other suitable ways to create dynamic delegates at the moment.

Older version constructed a closure at runtime time. Newer version constructs only when required otherwise it is compiled time.
Older FEC did not track the same nested lambda exressions and may compile them twice, etc., new one compiles only once.
Plus new one adds other optimizations for closure representation, loading constants to vars (when it makes sense) - but you know this, right?

dadhi · 2020-04-03T06:49:21Z

Or maybe those compiled expressions ( dynamic methods ) calling each other are not optimized by JIT ?

Yeah, this is suspicious...

You are not using Tiered Compilation in Jit, right? I think this was started from the .Net Core 3 - so asking just in case.

Otherwise, we need to confirm this hypothesis somehow, will try to google it.

dadhi · 2020-04-03T07:07:13Z

Otherwise, we need to confirm this hypothesis somehow, will try to google it.

We may try this System.Runtime.CompilerServices.RuntimeHelpers.PrepareMethod(method.MethodHandle);

I remember trying it a long time ago without visible changes but maybe now it will make a difference.

dadhi · 2020-04-03T07:21:45Z

The interesting thing with the nested lambdas here for DryIoc scoped services, that they are not needed. They just help to refactor the lazy evaluation in a concise method call.

Consider how it is done NOW in pseudocode:

r => new A(
	b: r.CurrentScope.GetOrAdd(42, rr => new B(
		c: rr.CurrentScope.GetOrAdd(43, _ => new C())
	)    
);

But what it does actually in pseudocode:

r => new A(
	b: r.CurrentScope.TryGet(42) ?? { 
		lock(entryForB) { 
			var b = new B(c: r.CurrentScope.TryGet(43) ?? {
				lock(entryForC) {
					var c = new C();
					r.CurrentScope.Add(43, c);
					return c;
				}
			}
			r.CurrentScope.Add(42, b);
			return b; 
		} 
	}
)

I was trying to represent it that way for v4.1 but stopped in between because of rising complexity and lack of the time. In addition, the overall expression size grows but maybe it can be tackled by other means, or maybe it is a lesser problem than the nested lambda...

dadhi · 2020-04-03T11:03:40Z

System.Runtime.CompilerServices.RuntimeHelpers.PrepareMethod(method.MethodHandle);

This is throwing the exception that "... operation is invalid for DynamicMethod" :(

dadhi · 2020-04-03T11:08:44Z

Some more info about DynamicMethod handler https://www.codeproject.com/Articles/37549/CLR-Injection-Runtime-Method-Replacer

From the article, I got the impression the DynamicMethod comes pre-jitted. hmmm

dadhi · 2020-04-10T10:02:10Z

Hi @Havunen,

What-if

Instead of passing the nested lambda into the Scope methods, I will pass the expression itself.

So instead of:

r => new A(
	b: r.CurrentScope.GetOrAdd(42, rr => new B(
		c: rr.CurrentScope.GetOrAdd(43, _ => new C())
	),
    c: r.CurrentScope.GetOrAdd(43, _ => new C())
);

It will become:

var cRef = Contant(Ref.Of<object>((Expression<FactoryDelegate>)(
    r => new C())));
var bRef = Contant(Ref.Of<object>((Expression<FactoryDelegate>)(
    r => new B(c: rr.CurrentScope.GetOrAdd(43, cRef))))));

r => new A(
	b: r.CurrentScope.GetOrAdd(42, bRef),
    c: r.CurrentScope.GetOrAdd(43, cRef)
);

Then inside the GetOrAdd we will have:

//...
lock (entry) 
{
    if (facRef.Value is Expression<FactoryDelegate> expr) 
        facRef.Swap(x => x is Expression<FactoryDelegate> e ? e.CompileFast() : x);        
    
    entry.Value = ((FactoryDelegate)facRef.Value).Invoke(resolver);
}
//...

More details:

Usage of the Ref will ensure the caching of compiled delegate, and as long as we're using the same shared Ref Constant (via Expression cache) - we will ensure compilation to be done once.
We are ignoring the case with Func<TArgs..., TService> expression where the nested lambdas should use the closed TArgs from the upper lambda. But we can specialize for this.
Interpretation mode won't change - should just learn to extract the expression from the Constant(Ref.Of(expr)). Funny though, that in some cases it may be compiled already - so it should take advantage of it.

We should test to get the idea of improvement or regression.

dadhi · 2020-05-06T05:08:06Z

Some benchmark results in a branch

DryIoc/playground/Playground/RealisticUnitOfWorkBenchmark.cs

Line 1057 in 1b4f3ac

## Branch 258

dadhi · 2020-05-06T05:13:49Z

We are ignoring the case with Func<TArgs..., TService> expression where the nested lambdas should use the closed TArgs from the upper lambda. But we can specialize for this.

We can use an interpretation approach:
Pass the parameter values (Request.InputArgs) array plus the matching ParameterExpression array to lookup the value from the lambda expression.

Then create the lambda expression with added parameters, compile it and Invoke with passed values.

OR given the FEC support for providing non-passed parameters directly with closure, pass them and the collected constants when compiling ..WithPreCreatedClosure().

dadhi · 2020-05-10T20:16:49Z

There is an important thing about using Ref expression - the compiled expression will be stored in Ref constant and therefore in expression cache per container. So resolving another service with the cached sub-graph expression will allow to the compiled delegate even in the interpretation phase! lt need to be tested though.

dadhi · 2020-05-23T10:11:53Z

The other important thing that we need to consider the code generation scanario. In this case hiding the expression in the Ref is not suitable for generation. So we need the condional logic and a different generated expression output, which adds to Complexety.

So I would stop with this approach for now and look into other alternatives, like sharing the closure between nested lambdas and avoiding constant collection for closure when compiling.

Havunen · 2020-05-23T13:28:17Z

I think in the original code compilation time was excluded from the measurements. Timing started after each root was resolved twice so it should not include compilation right?

dadhi · 2020-05-23T16:20:09Z

@Havunen Soory, what measurements are you referring to?

Havunen · 2020-05-25T06:54:17Z

Invocation of compiled DynamicMethod delegate was/is ~25% slower compared to Linq compiled method

dadhi · 2020-05-25T07:06:31Z

I think in the original code compilation time was excluded from the measurements. Timing started after each root was resolved twice so it should not include compilation right?

Got it now. Yes, the 3rd time should just invoke the cached delegate compiled on the 2nd resolve.

Yes, it was and you say it still the case now, right?

What about single-threaded vs multi-threaded case, do they differ?

Havunen · 2020-05-25T07:52:46Z

Yes, it was and you say it still the case now, right?

I have not actually tested this since 4.2.0, need to verify it again.

What about single-threaded vs multi-threaded case, do they differ?

Not sure, need to re-validate if this is the case

dadhi · 2021-04-28T06:14:03Z

Closing this thread. Let's start the new one if the issue arises.

dadhi added bug Something isn't working help wanted Extra attention is needed labels Jan 16, 2020

dadhi mentioned this issue Jan 17, 2020

Reuse the local variables where possible dadhi/FastExpressionCompiler#241

Open

dadhi mentioned this issue Apr 3, 2020

Try pre-Jit nested lambdas dadhi/FastExpressionCompiler#247

Closed

dadhi mentioned this issue Apr 15, 2020

Improve the case with splitting the big object graph #265

Closed

dadhi mentioned this issue May 5, 2020

Support compilation without closure collection phase for expression with nested lambdas dadhi/FastExpressionCompiler#250

Open

dadhi added investigation and removed bug Something isn't working labels Oct 28, 2020

dadhi closed this as completed Apr 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Observing FEC compiled code is 25 percent slower in some cases #211

Observing FEC compiled code is 25 percent slower in some cases #211

dadhi commented Jan 16, 2020

dadhi commented Jan 16, 2020 •

edited

Loading

Havunen commented Jan 17, 2020 •

edited

Loading

dadhi commented Jan 17, 2020

dadhi commented Jan 17, 2020

dadhi commented Jan 18, 2020

dadhi commented Jan 18, 2020

dadhi commented Jan 18, 2020 •

edited

Loading

dadhi commented Jan 20, 2020 •

edited

Loading

dadhi commented Jan 20, 2020

Havunen commented Apr 2, 2020 •

edited

Loading

Havunen commented Apr 2, 2020 •

edited

Loading

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 10, 2020 •

edited

Loading

dadhi commented May 6, 2020

dadhi commented May 6, 2020 •

edited

Loading

dadhi commented May 10, 2020

dadhi commented May 23, 2020

Havunen commented May 23, 2020

dadhi commented May 23, 2020

Havunen commented May 25, 2020

dadhi commented May 25, 2020 •

edited

Loading

Havunen commented May 25, 2020

dadhi commented Apr 28, 2021

Observing FEC compiled code is 25 percent slower in some cases #211

Observing FEC compiled code is 25 percent slower in some cases #211

Comments

dadhi commented Jan 16, 2020

dadhi commented Jan 16, 2020 • edited Loading

Havunen commented Jan 17, 2020 • edited Loading

dadhi commented Jan 17, 2020

dadhi commented Jan 17, 2020

dadhi commented Jan 18, 2020

dadhi commented Jan 18, 2020

dadhi commented Jan 18, 2020 • edited Loading

dadhi commented Jan 20, 2020 • edited Loading

dadhi commented Jan 20, 2020

Havunen commented Apr 2, 2020 • edited Loading

Havunen commented Apr 2, 2020 • edited Loading

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 3, 2020

dadhi commented Apr 10, 2020 • edited Loading

dadhi commented May 6, 2020

dadhi commented May 6, 2020 • edited Loading

dadhi commented May 10, 2020

dadhi commented May 23, 2020

Havunen commented May 23, 2020

dadhi commented May 23, 2020

Havunen commented May 25, 2020

dadhi commented May 25, 2020 • edited Loading

Havunen commented May 25, 2020

dadhi commented Apr 28, 2021

dadhi commented Jan 16, 2020 •

edited

Loading

Havunen commented Jan 17, 2020 •

edited

Loading

dadhi commented Jan 18, 2020 •

edited

Loading

dadhi commented Jan 20, 2020 •

edited

Loading

Havunen commented Apr 2, 2020 •

edited

Loading

Havunen commented Apr 2, 2020 •

edited

Loading

dadhi commented Apr 10, 2020 •

edited

Loading

dadhi commented May 6, 2020 •

edited

Loading

dadhi commented May 25, 2020 •

edited

Loading