
Changed diagnosers flow, reduced heap allocations in Engine to 0 #277

Merged
merged 11 commits into from Oct 15, 2016

Conversation


@adamsitnik adamsitnik commented Oct 5, 2016

I wanted to implement a tail call diagnoser with my friends for #273, but we got stuck. We subscribed to a JIT ETW event but never received it, most probably because the method had already been jitted before we even started the ETW session. But I am not sure; it's a guess.

The diagnosers flow was quite complicated and not very flexible. So I have changed it a little.

How it works now (Auto-generated program.cs):

  1. The first thing to do is to notify the master process that diagnosers can attach.
  2. The JIT-based diagnosers can attach at this moment, and we are sure that no jitting has been done yet.
  3. When the Engine is created and prepared, we run Setup once and notify the diagnosers that this has happened. The GC diagnoser can attach at this moment and avoid the problem of including [Setup] allocations (#186: GC Diagnoser should not include allocations done by the Setup method).
  4. The Engine runs, but it does not call Setup and/or Cleanup anymore.
  5. We notify the diagnosers that we are going to call Cleanup, so for example the GC Diagnoser can detach at this moment and not include Cleanup allocations.
  6. Call Cleanup.
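The steps above could be sketched in the auto-generated Program.cs roughly as follows. This is only an illustration of the handshake; all the names here (HostSignal, Notify, Run* methods) are hypothetical stand-ins, not BDN's actual API:

```csharp
// Illustrative sketch of the new diagnosers handshake; all names are
// hypothetical stand-ins, not BenchmarkDotNet's real types.
using System;

enum HostSignal { BeforeAnythingElse, AfterSetup, BeforeCleanup }

static class Program
{
    // stand-in for the channel back to the master process
    static void Notify(HostSignal signal) => Console.WriteLine(signal);

    static void Main()
    {
        // 1-2: JIT-based diagnosers attach now, before anything is jitted
        Notify(HostSignal.BeforeAnythingElse);

        RunSetup();                        // 3: Setup runs exactly once, outside measurement
        Notify(HostSignal.AfterSetup);     //    GC diagnoser attaches here, skipping [Setup] allocations

        RunBenchmark();                    // 4: the engine no longer calls Setup/Cleanup itself

        Notify(HostSignal.BeforeCleanup);  // 5: diagnosers can detach before Cleanup allocates
        RunCleanup();                      // 6
    }

    static void RunSetup() { }
    static void RunBenchmark() { }
    static void RunCleanup() { }
}
```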

@AndreyAkinshin @mattwarren could you guys take a look at the code? I created a PR to start the discussion.

If you like this solution then I would like to continue the work:

  • Reduce the number of heap allocations in Engine to 0 to give better accuracy for the Memory Diagnoser.
  • Improve the allocated-bytes-per-operation calculation to get better accuracy, for example 100 bytes for the following benchmark:
[Benchmark]
public void New() => new byte[100];
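The per-operation figure could be derived along these lines (a sketch; the method and parameter names are made up for illustration, this is not BDN's API):

```csharp
// Sketch: allocated bytes per operation, after subtracting the engine's
// own fixed overhead. Names are illustrative, not BenchmarkDotNet's API.
static long AllocatedBytesPerOperation(long totalAllocatedBytes,
                                       long engineOverheadBytes,
                                       long operationCount)
    => (totalAllocatedBytes - engineOverheadBytes) / operationCount;
```

With the engine overhead driven to zero, the result for the `New()` benchmark above should converge on the 100 bytes of payload plus the fixed array overhead.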

@AndreyAkinshin what do you think about setup/cleanup changes I made?


AndreyAkinshin commented Oct 6, 2016

@adamsitnik, I like your changes; we are definitely on the right track. I will do a code review over the weekend.


adamsitnik commented Oct 7, 2016

I like your changes; we are definitely on the right track. I will do a code review over the weekend.

@AndreyAkinshin cool!

Now I am profiling the Engine to reduce the number of allocations. Currently, when I run a benchmark with an empty method (does nothing, 0 allocations) and TargetCount = 5000, the results are: 7 gen-0 GCs, 2,302.08 bytes allocated/op. Let's see how far I can get ;)


adamsitnik commented Oct 7, 2016

Ok, a little update. I was able to get 0 GCs for the empty method; the overhead went from 2000 bytes down to 20 bytes per target invocation. My current problem is that the Visual Studio profiler cannot tell me where the 20 bytes are ;) So I will have to try another tool.

@AndreyAkinshin Most of the changes so far were not ugly. The only ugly thing was pre-allocating the List&lt;Measurement&gt; that contains the results.

The breaking change is that results are no longer displayed during the run; they are all displayed at the end of the benchmark run. But I hope it's not a problem.


AndreyAkinshin commented Oct 7, 2016

The breaking change is that results are no longer displayed during the run; they are all displayed at the end of the benchmark run. But I hope it's not a problem.

Could we keep the old behavior? It's really nice to look at the results one by one during benchmarking. Moreover, it allows you to understand what's going on right now and to check how long a single iteration takes (so it's possible to detect mistakes in the benchmark design at an early stage).

P.S. Probably we have some allocations because of the Statistics instances. In that case, we could implement incremental logic.
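The incremental logic Andrey mentions could look like Welford's online algorithm, which updates mean and variance one measurement at a time with O(1) memory and no per-iteration allocations. A minimal sketch (not BDN's actual Statistics class):

```csharp
// Welford's online algorithm: constant memory, allocation-free updates.
public struct IncrementalStatistics
{
    private long n;
    private double mean, m2;

    public void Add(double value)
    {
        n++;
        double delta = value - mean;
        mean += delta / n;
        m2 += delta * (value - mean); // accumulates sum of squared deviations
    }

    public double Mean => mean;
    public double Variance => n > 1 ? m2 / (n - 1) : 0.0; // sample variance
}
```

Being a struct with two doubles and a long, it can live on the stack or be pre-allocated once, so printing incremental results would not cost anything per iteration.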


adamsitnik commented Oct 7, 2016

I have one idea: there is always a separate run when any diagnoser is attached, so I could print results immediately by default and delay the printing only in the diagnoser run.


AndreyAkinshin commented Oct 7, 2016

LGTM.


adamsitnik commented Oct 7, 2016

Ok, I updated the code.

My to-do list:

  • find the missing 18-22 bytes in the simplest scenario: no warmup, no pilot, fixed target count
  • analyse and improve the scenario with warmup and pilot


adamsitnik commented Oct 9, 2016

@AndreyAkinshin Ok, I updated the code once again. I am very close to getting it done ;)

I also tried to set InvocationCount to some value, but then realized it was not passed to the engine. Is this commit the correct fix?

The results are getting more stable and accurate, but the 100KB limitation of the ETW allocation event is still a problem.

Default Config + MemoryDiagnoser:
(screenshot of the benchmark results table omitted)

Keep in mind that a 10-byte array = 10 bytes for the bytes themselves, 8 + 8 bytes for the extra object fields, and 8 bytes for the reference to the array.


mattwarren commented Oct 10, 2016

The results are getting more stable and accurate, but 100KB limitation of ETW allocation event is still a problem.

Yeah that's always going to be a fundamental problem with ETW events, maybe it's time to look at using AppDomain.MonitoringTotalAllocatedMemorySize instead?


adamsitnik commented Oct 10, 2016

Yeah that's always going to be a fundamental problem with ETW events, maybe it's time to look at using AppDomain.MonitoringTotalAllocatedMemorySize instead?

I like this idea, I will do some PoC.
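Such a PoC might look like the sketch below (full .NET Framework 4.0+). Note the caveats: monitoring is a one-way, process-wide switch, the counter covers the whole AppDomain (so other threads' allocations are included), and it is not updated in real time, so it should be treated as approximate. The `runIterations` delegate is a hypothetical stand-in for the measured workload:

```csharp
// PoC sketch: AppDomain-wide allocation counter instead of ETW events.
// Caveats: MonitoringIsEnabled cannot be turned off again, the counter
// includes allocations from all threads in the AppDomain, and it is
// updated with some granularity rather than per allocation.
using System;

static long MeasureAllocations(Action runIterations)
{
    AppDomain.MonitoringIsEnabled = true; // one-way switch, per process

    long before = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
    runIterations(); // hypothetical stand-in for the benchmark loop
    return AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize - before;
}
```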


adamsitnik commented Oct 11, 2016

Ok, I am done. All possible heap allocations have been eliminated from Engine.Run().

Memory profiling results of the auto-generated benchmark runner process for this benchmark:

[Benchmark(Description = "new byte[10]")]
public byte[] TenBytes() => new byte[10];

(screenshot of the memory profiling results omitted)

Next steps:

  1. @AndreyAkinshin could you do the code review?
  2. When the code passes code review, I'll try to make the Memory Diagnostics a built-in part of BDN. For the classic toolchain I want to use AppDomain.MonitoringTotalAllocatedMemorySize as @mattwarren suggested; for Core I want to use GC.GetAllocatedBytesForCurrentThread, which got exposed 2 days ago.
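On Core the usage could mirror the AppDomain approach, with one important difference: the counter is per-thread, so allocations from other threads are excluded by design. A sketch, with `runIterations` again a hypothetical stand-in for the measured workload:

```csharp
// Sketch for .NET Core: per-thread allocated bytes around the workload.
// Allocations made on other threads are not counted.
using System;

static long MeasureAllocatedBytes(Action runIterations)
{
    long before = GC.GetAllocatedBytesForCurrentThread();
    runIterations(); // hypothetical stand-in for the benchmark loop
    return GC.GetAllocatedBytesForCurrentThread() - before;
}
```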


AndreyAkinshin commented Oct 11, 2016

@adamsitnik, great! Could you also fix the failing test? I will do the full review in a few days.


@AndreyAkinshin AndreyAkinshin left a comment

Awesome pull request. It needs some minor changes, and then it can be merged into master.

@@ -5,6 +5,9 @@ namespace BenchmarkDotNet.Samples.JIT
{
// See http://en.wikipedia.org/wiki/Inline_expansion
// See http://aakinshin.net/en/blog/dotnet/inlining-and-starg/
#if !CORE
[Diagnostics.Windows.Configs.InliningDiagnoserConfig]


{
// TODO: use Accuracy.RemoveOutliers
// TODO: check if resulted measurements are too small (like < 0.1ns)
double overhead = Idle == null ? 0.0 : new Statistics(Idle.Select(m => m.Nanoseconds)).Median;



double variance = 0;
for (int i = 0; i < measurements.Count; i++)
variance += Math.Pow(measurements[i].Nanoseconds - mean, 2) / (N - 1);


@@ -17,50 +19,87 @@ $AdditionalLogic$

namespace BenchmarkDotNet.Autogenerated
{
public class Program : global::$TargetTypeName$
public class Program



namespace BenchmarkDotNet.Diagnostics.Windows.Configs
{
public class InliningDiagnoserConfigAttribute : Attribute, IConfigSource



namespace BenchmarkDotNet.Diagnostics.Windows.Configs
{
public class MemoryDiagnoserConfigAttribute : Attribute, IConfigSource


// shouldn't be more than a few seconds. This increases the likelihood that
// all relevant events are processed by the collection thread by the time we
// are done with the benchmark.
Thread.Sleep(TimeSpan.FromSeconds(3));


logger.WriteLineInfo($"{benchmark.DisplayInfo}");
logger.WriteLineHeader(new string('-', 20));
Logger.WriteLine();
Logger.WriteLineHeader(new string('-', 20));


private void Blackhole<T>(T input) { }

[MethodImpl(MethodImplOptions.NoInlining)]
private unsafe void Blackhole(byte* input) { }


private IConfig CreateConfig(IDiagnoser diagnoser, int targetCount)
{
return ManualConfig.CreateEmpty()
.With(Job.Dry.WithLaunchCount(1).WithWarmupCount(1).WithTargetCount(targetCount).With(GcMode.Default.WithForce(false)))



adamsitnik commented Oct 11, 2016

@AndreyAkinshin Thanks! I updated the code.


var InterquartileRange = Q3 - Q1;
var LowerFence = Q1 - 1.5 * InterquartileRange;
var UpperFence = Q3 + 1.5 * InterquartileRange;



mattwarren commented Oct 12, 2016

@adamsitnik Great job, you've given the Diagnosers a long overdue and sorely needed fix/tidy-up. Thanks for doing that, and the new IDiagnoser API looks really nice.

Plus, the work on reducing BenchmarkDotNet's own memory allocations is really cool. I like this approach of forcing allocations to happen up-front.

Just one small thing I noticed: we now have quite a few warning messages when building the code (which don't seem to be there on master). Can some/all of these be removed or suppressed?


mattwarren commented Oct 12, 2016

Slightly tongue-in-cheek but also slightly serious: should we ban using System.Linq in certain parts of BenchmarkDotNet, like the Roslyn code guidelines do?

  • Avoid allocations in compiler hot paths:
    • Avoid LINQ.
    • Avoid using foreach over collections that do not have a struct enumerator.
    • Consider using an object pool. There are many usages of object pools in the compiler to see an example.

That always makes me laugh, as the people who designed/wrote LINQ are banned from using it (admittedly only in certain parts of their code).


adamsitnik commented Oct 12, 2016

@mattwarren Thanks! It was a lot of fun for me to track down even the smallest allocations. Btw, the first call to list.Sort allocated 1740 bytes!! That's insane.

As for LINQ, I was wondering if we could somehow ban it at the unit-test level.

Pseudocode: Assert.False(typeof(Engine).References.Any(reference => reference.Name == "System.Linq"))

Anyway, I will most probably check for heap allocations in the Engine before each BDN release.
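One way to approximate that pseudocode would be to inspect the module's type references with Mono.Cecil (an assumed dependency here; plain reflection only exposes assembly-level references, which is too coarse to catch System.Linq usage). The type name and namespaces below are illustrative:

```csharp
// Sketch of the LINQ-ban test using Mono.Cecil and xUnit (both assumed
// dependencies). Checks the whole module containing Engine, which is
// coarser than per-type analysis but catches any System.Linq reference.
using System.Linq; // using LINQ in the *test* itself is fine ;)
using Mono.Cecil;
using Xunit;

public class NoLinqInEngineTests
{
    [Fact]
    public void EngineModuleDoesNotReferenceSystemLinq()
    {
        var module = ModuleDefinition.ReadModule(
            typeof(Engine).Assembly.Location); // Engine: the BDN engine type

        Assert.False(module.GetTypeReferences()
            .Any(type => type.Namespace == "System.Linq"));
    }
}
```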


mattwarren commented Oct 12, 2016

Anyway I will most probably check for heap allocations in engine before each BDN release.

Maybe we should dog-food our own product and use BenchmarkDotNet to automate this for us? I.e. a unit test that runs on each build and fails if we allocate too much on certain code paths?


adamsitnik commented Oct 12, 2016

Maybe we should dog-food our own product and use BenchmarkDotNet to automate this for us? I.e. a unit test that runs on each build and fails if we allocate too much on certain code paths?

Good idea! I could cover that with integration tests.
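Such a regression test might look like the sketch below: run once to pay the JIT/warmup cost, then assert the second run allocates nothing. `CreateEngineWithEmptyBenchmark` is a hypothetical helper, and xUnit is assumed as the test framework:

```csharp
// Sketch of a dog-fooding allocation regression test. The helper method
// and the zero-byte threshold are illustrative, not existing BDN code.
using System;
using Xunit;

public class EngineAllocationTests
{
    [Fact]
    public void EngineRunDoesNotAllocateOnTheHeap()
    {
        var engine = CreateEngineWithEmptyBenchmark(); // hypothetical helper
        engine.Run(); // warmup: jitting and first-call caches may allocate

        long before = GC.GetAllocatedBytesForCurrentThread();
        engine.Run();
        long allocated = GC.GetAllocatedBytesForCurrentThread() - before;

        Assert.Equal(0, allocated);
    }
}
```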


adamsitnik commented Oct 14, 2016

@AndreyAkinshin I have pushed the last commit with the rename, and I added and verified the possibility of using a custom Engine.

Next week I should be able to implement the universal Memory Diagnoser, so it would be great if you could wait with 0.10.0 until then!


AndreyAkinshin commented Oct 15, 2016

@adamsitnik,

  1. Could we merge this PR into master right now, or should we wait for the universal Memory Diagnoser?
  2. I'm going to release v0.10.0 at the end of October.

@adamsitnik adamsitnik changed the title Changed diagnosers flow Changed diagnosers flow, reduced heap allocations in Engine to 0 Oct 15, 2016
@adamsitnik adamsitnik merged commit afa5865 into master Oct 15, 2016
2 checks passed
@adamsitnik adamsitnik deleted the diagnosersFlow branch Oct 15, 2016

adamsitnik commented Oct 15, 2016

@AndreyAkinshin I will do a separate PR with the universal Memory Diagnoser within the next week.
