Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observing FEC compiled code is 25 percent slower in some cases #211

Closed
dadhi opened this issue Jan 16, 2020 · 28 comments
Closed

Observing FEC compiled code is 25 percent slower in some cases #211

dadhi opened this issue Jan 16, 2020 · 28 comments
Labels
help wanted Extra attention is needed investigation

Comments

@dadhi
Copy link
Owner

dadhi commented Jan 16, 2020

Found in #205

@dadhi dadhi added bug Something isn't working help wanted Extra attention is needed labels Jan 16, 2020
@dadhi
Copy link
Owner Author

dadhi commented Jan 16, 2020

Notes regarding FEC:

  1. Previous FEC version (V2) was using object-based closure for the small number of the closed constants - the new version (V3) is using array-based closure for everything to simplify implementation. AFAIK, for manually composed expression Linq version uses array-based closure and for the hoisted expression it uses object-based closure. So we should be on par with manually composed expression, right?
  2. FEC V3 pre-loads closure constants into variables and in small FEC benchmarks it speeds up things and improves memory usage. Not sure about the Linq strategy.
  3. FEC V3 emits nested lambdas differently in many aspects and FEC benchmarks display a big performance difference, FEC may produce 100X faster delegate. Assuming that 4.1.0 Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object. #205 intensively uses the nested lambdas - the final result is controversial at least.

Side-notes regarding compilation - still may be relevant at the end:

  • In 4.1.0 Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object. #205 FEC shows 4x faster compilation results - this is expected.
  • It seems that Linq version probes the Stack capacity and offloads the part of compilation to the different thread if Stack is not enough.
  • FEC compiles the same nested lambda expression only once, therefore it tracks the compiled lambdas on the root expression level.
  • FEC creates its own closure instance for the nested lambda with only members that required for this lambda. An alternative approach would be to create a single closure with all members for all nested lambdas.

Stats regarding expressions used in #205:

  • TBD...

@Havunen
Copy link
Contributor

Havunen commented Jan 17, 2020

I tested changing constant declaration threshold to 10. So that no variables are created for those expression which has less than 10 constants in FEC. The gains were in the error margin of measurement.

@dadhi
Copy link
Owner Author

dadhi commented Jan 17, 2020

@Havunen
Thanks for putting it out of the way.
I've run a smaller benchmark from #44 and for both Scoped and InWebRequest reuse and found that FEC is 10x faster than without FEC, those are ~40 dependencies on 4 levels.

So maybe we are hitting some code size or stack exostiveness limits. The question is how those things are addressed by Linq.

@dadhi
Copy link
Owner Author

dadhi commented Jan 18, 2020

I've run a smaller benchmark from #44 and for both Scoped and InWebRequest reuse and found that FEC is 10x faster than without FEC, those are ~40 dependencies on 4 levels.

The results:

|            Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------ |----------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|              MsDI |  3.854 us | 0.0309 us | 0.0258 us |  1.00 |    0.00 | 0.9460 |     - |     - |   4.37 KB |
|            DryIoc |  1.699 us | 0.0030 us | 0.0028 us |  0.44 |    0.00 | 0.6409 |     - |     - |   2.96 KB |
| DryIoc_WithoutFEC | 16.344 us | 0.1252 us | 0.1045 us |  4.24 |    0.05 | 1.0681 |     - |     - |   4.93 KB |
                
|                             Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------------------------- |----------:|----------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|                               MsDI |  3.852 us | 0.0255 us | 0.0239 us |  1.00 |    0.00 | 0.9460 |     - |     - |   4.37 KB |
|            DryIoc_WebRequestScoped |  2.128 us | 0.0113 us | 0.0106 us |  0.55 |    0.00 | 0.6866 |     - |     - |   3.17 KB |
| DryIoc_WebRequestScoped_WithoutFEC | 17.713 us | 0.1377 us | 0.1288 us |  4.60 |    0.04 | 1.0986 |     - |     - |   5.14 KB |

@dadhi
Copy link
Owner Author

dadhi commented Jan 18, 2020

I've tried other variations with closure variables and everything become more slower than before.
So given the results above this is a very interesting case.. no ideas now.

@dadhi
Copy link
Owner Author

dadhi commented Jan 18, 2020

Ok, more on variables.

At the moment FEC uses all or nothing policy when deciding what to do with closure elements. But more ideal way would be to store element in local variable only when it used 2+ times. If element is used once then it is fine to load it from the closure. This will safe us more instructions and more local variables, right?

When examining the expression we may count of closure element usages - the counters may reside in the separate array (or growing list) to constants. Then we may use the same array to store the variable indexes, given that the variable with zero index is always contain a closure array, the index > 1 will point to the corresponding variable.

Likely we can reuse the same index array between many nested lambdas compilations.

@dadhi
Copy link
Owner Author

dadhi commented Jan 20, 2020

I have added the improvement with the variables but it did not change the results :/

@dadhi
Copy link
Owner Author

dadhi commented Jan 20, 2020

I am planning to release v4.1 as-is - you may prefer to use WithoutFastExpressionCompiler. Then will try to tackle this case in the next version.

@Havunen
Copy link
Contributor

Havunen commented Apr 2, 2020

Hey

This issue might have something to do with multi threading

I noticed that when I run LoadTest using (without FEC, without interpretation)

            var container = new Container(rules => rules
                .WithoutFastExpressionCompiler()
                .WithoutUseInterpretation()
                .With(FactoryMethod.ConstructorWithResolvableArguments))
                .WithWebApi(config);

Single Thread:
ResolveAllControllersOnce of 156 controllers is done in 0,2131989 seconds
ResolveAllControllersOnce of 156 controllers is done in 17,9537902 seconds

But when I run load test using (using FEC, without interpretation)

ResolveAllControllersOnce of 156 controllers is done in 0,2247802 seconds
ResolveAllControllersOnce of 156 controllers is done in 5,0540016 seconds

            var container = new Container(rules => rules
                .WithoutUseInterpretation()
                .With(FactoryMethod.ConstructorWithResolvableArguments))
                .WithWebApi(config);

So it seems to me that FEC is 3X faster when its operating in single thread, but under heavy parallel load its performing worse than withoutFastExpressionCompiler option

edit: This measurement needs to be verified, I might have accidentally included delegate compilation time...

@Havunen
Copy link
Contributor

Havunen commented Apr 2, 2020

Another thing which I have tried to investigate could it be possible that the new version of FEC is hitting some dynamicMethod boundary call security check? If I understood correctly the previous version of FEC used to compile one big Expression, and now the new one has multiple small ones?

Or maybe those compiled expressions ( dynamic methods ) calling each other are not optimized by JIT ?

@dadhi
Copy link
Owner Author

dadhi commented Apr 3, 2020

So it seems to me that FEC is 3X faster when its operating in single thread, but under heavy parallel load its performing worse than withoutFastExpressionCompiler option

This may be an important observation. Hope it provides some clue.

If I understood correctly the previous version of FEC used to compile one big Expression, and now the new one has multiple small ones?

No, FEC always compiled the nested lambdas separately, because I don't know other suitable ways to create dynamic delegates at the moment.

  • Older version constructed a closure at runtime time. Newer version constructs only when required otherwise it is compiled time.
  • Older FEC did not track the same nested lambda exressions and may compile them twice, etc., new one compiles only once.
  • Plus new one adds other optimizations for closure representation, loading constants to vars (when it makes sense) - but you know this, right?

@dadhi
Copy link
Owner Author

dadhi commented Apr 3, 2020

Or maybe those compiled expressions ( dynamic methods ) calling each other are not optimized by JIT ?

Yeah, this is suspicious...

You are not using Tiered Compilation in Jit, right? I think this was started from the .Net Core 3 - so asking just in case.

Otherwise, we need to confirm this hypothesis somehow, will try to google it.

@dadhi
Copy link
Owner Author

dadhi commented Apr 3, 2020

Otherwise, we need to confirm this hypothesis somehow, will try to google it.

We may try this System.Runtime.CompilerServices.RuntimeHelpers.PrepareMethod(method.MethodHandle);

I remember trying it a long time ago without visible changes but maybe now it will make a difference.

@dadhi
Copy link
Owner Author

dadhi commented Apr 3, 2020

The interesting thing with the nested lambdas here for DryIoc scoped services, that they are not needed. They just help to refactor the lazy evaluation in a concise method call.

Consider how it is done NOW in pseudocode:

r => new A(
	b: r.CurrentScope.GetOrAdd(42, rr => new B(
		c: rr.CurrentScope.GetOrAdd(43, _ => new C())
	)    
);

But what it does actually in pseudocode:

r => new A(
	b: r.CurrentScope.TryGet(42) ?? { 
		lock(entryForB) { 
			var b = new B(c: r.CurrentScope.TryGet(43) ?? {
				lock(entryForC) {
					var c = new C();
					r.CurrentScope.Add(43, c);
					return c;
				}
			}
			r.CurrentScope.Add(42, b);
			return b; 
		} 
	}
)

I was trying to represent it that way for v4.1 but stopped in between because of rising complexity and lack of the time. In addition, the overall expression size grows but maybe it can be tackled by other means, or maybe it is a lesser problem than the nested lambda...

@dadhi
Copy link
Owner Author

dadhi commented Apr 3, 2020

System.Runtime.CompilerServices.RuntimeHelpers.PrepareMethod(method.MethodHandle);

This is throwing the exception that "... operation is invalid for DynamicMethod" :(

@dadhi
Copy link
Owner Author

dadhi commented Apr 3, 2020

Some more info about DynamicMethod handler https://www.codeproject.com/Articles/37549/CLR-Injection-Runtime-Method-Replacer

From the article, I got the impression the DynamicMethod comes pre-jitted. hmmm

@dadhi
Copy link
Owner Author

dadhi commented Apr 10, 2020

Hi @Havunen,

What-if

Instead of passing the nested lambda into the Scope methods, I will pass the expression itself.

So instead of:

r => new A(
	b: r.CurrentScope.GetOrAdd(42, rr => new B(
		c: rr.CurrentScope.GetOrAdd(43, _ => new C())
	),
    c: r.CurrentScope.GetOrAdd(43, _ => new C())
);

It will become:

var cRef = Contant(Ref.Of<object>((Expression<FactoryDelegate>)(
    r => new C())));
var bRef = Contant(Ref.Of<object>((Expression<FactoryDelegate>)(
    r => new B(c: rr.CurrentScope.GetOrAdd(43, cRef))))));

r => new A(
	b: r.CurrentScope.GetOrAdd(42, bRef),
    c: r.CurrentScope.GetOrAdd(43, cRef)
);

Then inside the GetOrAdd we will have:

//...
lock (entry) 
{
    if (facRef.Value is Expression<FactoryDelegate> expr) 
        facRef.Swap(x => x is Expression<FactoryDelegate> e ? e.CompileFast() : x);        
    
    entry.Value = ((FactoryDelegate)facRef.Value).Invoke(resolver);
}
//...

More details:

  • Usage of the Ref will ensure the caching of compiled delegate, and as long as we're using the same shared Ref Constant (via Expression cache) - we will ensure compilation to be done once.
  • We are ignoring the case with Func<TArgs..., TService> expression where the nested lambdas should use the closed TArgs from the upper lambda. But we can specialize for this.
  • Interpretation mode won't change - should just learn to extract the expression from the Constant(Ref.Of(expr)). Funny though, that in some cases it may be compiled already - so it should take advantage of it.

We should test to get the idea of improvement or regression.

@dadhi
Copy link
Owner Author

dadhi commented May 6, 2020

Some benchmark results in a branch

@dadhi
Copy link
Owner Author

dadhi commented May 6, 2020

We are ignoring the case with Func<TArgs..., TService> expression where the nested lambdas should use the closed TArgs from the upper lambda. But we can specialize for this.

We can use an interpretation approach:
Pass the parameter values (Request.InputArgs) array plus the matching ParameterExpression array to lookup the value from the lambda expression.

Then create the lambda expression with added parameters, compile it and Invoke with passed values.

OR given the FEC support for providing non-passed parameters directly with closure, pass them and the collected constants when compiling ..WithPreCreatedClosure().

@dadhi
Copy link
Owner Author

dadhi commented May 10, 2020

There is an important thing about using Ref expression - the compiled expression will be stored in Ref constant and therefore in expression cache per container. So resolving another service with the cached sub-graph expression will allow to the compiled delegate even in the interpretation phase! lt need to be tested though.

@dadhi
Copy link
Owner Author

dadhi commented May 23, 2020

The other important thing that we need to consider the code generation scanario. In this case hiding the expression in the Ref is not suitable for generation. So we need the condional logic and a different generated expression output, which adds to Complexety.

So I would stop with this approach for now and look into other alternatives, like sharing the closure between nested lambdas and avoiding constant collection for closure when compiling.

@Havunen
Copy link
Contributor

Havunen commented May 23, 2020

I think in the original code compilation time was excluded from the measurements. Timing started after each root was resolved twice so it should not include compilation right?

@dadhi
Copy link
Owner Author

dadhi commented May 23, 2020

@Havunen Soory, what measurements are you referring to?

@Havunen
Copy link
Contributor

Havunen commented May 25, 2020

Invocation of compiled DynamicMethod delegate was/is ~25% slower compared to Linq compiled method

@dadhi
Copy link
Owner Author

dadhi commented May 25, 2020

I think in the original code compilation time was excluded from the measurements. Timing started after each root was resolved twice so it should not include compilation right?

Got it now. Yes, the 3rd time should just invoke the cached delegate compiled on the 2nd resolve.

Yes, it was and you say it still the case now, right?

What about single-threaded vs multi-threaded case, do they differ?

@Havunen
Copy link
Contributor

Havunen commented May 25, 2020

Yes, it was and you say it still the case now, right?

I have not actually tested this since 4.2.0, need to verify it again.

What about single-threaded vs multi-threaded case, do they differ?

Not sure, need to re-validate if this is the case

@dadhi dadhi added investigation and removed bug Something isn't working labels Oct 28, 2020
@dadhi
Copy link
Owner Author

dadhi commented Apr 28, 2021

Closing this thread. Let's start the new one if the issue arises.

@dadhi dadhi closed this as completed Apr 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed investigation
Projects
None yet
Development

No branches or pull requests

2 participants