
Conversation

@bitfaster (Owner) commented Nov 28, 2023

Hook up benchly to automate generating benchmark test results.

.NET 6 latency, now with correct time units:

[Image: BitFaster.Caching.Benchmarks LruJustGetOrAdd .NET 6.0 column chart]

Box plot for LFU:

This likely needs to be split out per job to show anything useful, but it is working.

[Image: BitFaster.Caching.Benchmarks LfuJustGetOrAdd box plot]

Sketch freq:

[Image: BitFaster.Caching.Benchmarks Lfu SketchFrequency column chart]

v0.6 bugs fixed:

  • Layout broken when using a parameterized test.
  • Legend should be hidden for single job type.

[Image: BitFaster.Caching.Benchmarks Lfu SketchFrequency column chart]

@coveralls commented Nov 28, 2023

Coverage: 99.159% (-0.08%) from 99.237% when pulling 71820f4 on users/alexpeck/benchly into 28e7203 on main.

@ben-manes

I got a new laptop last week and this is at 16 threads 😁

[image]

@bitfaster (Owner, Author)

Wow - super impressive! Is it an M3 Max?

I am overdue a round of tests on .NET 8 with dynamic PGO, but I doubt I will get close to that. I need a new laptop...

@ben-manes

haha, yes; the 14-core version. I hadn't upgraded my laptop since 2016 and the battery gave out, so it became hard to coerce it to power up. I didn't expect it to be that good!

@bitfaster (Owner, Author)

For reference, I can only just hit those numbers on a monster desktop with an Intel 13900K, with > 24 threads (I think I showed you this last year):

[image]

@bitfaster (Owner, Author)

That's an extremely impressive result for 14 cores/threads - in addition to the hardware, the Java runtime/JIT must be well sorted.

A few weeks ago I got an M2 Mac mini to test ARM - I just did a quick check, and with .NET 8 ConcurrentLfu hits about 80 million ops/sec for me. That's a pretty substantial gap. The M3 Max has 12 perf cores vs the M2's 4, but I'm still more than 10x slower.

@ben-manes

hmm.. that's interesting. I figured that since this is a CPU-bound workload where everything is in cache, the wider instruction decode engine might be giving it a large boost. On my 2016 laptop it was around 100M/s at 8 threads, iirc. I don't really think anything beyond a synthetic 60M/s is a visible benefit in the real world.

Maybe our benchmarks differ significantly (here is mine)? I use a zipfian distribution to simulate a realistic request distribution with hot entries. It should take care of JIT warmup; by returning the value it avoids dead code elimination (which would otherwise skew the results); and I tried to eliminate any benchmark overhead. What about yours?
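
For illustration, here is a minimal sketch (in C#, for consistency with the rest of the thread) of inverse-CDF zipfian sampling of the kind described above - keys are ranked by popularity and drawn with probability proportional to 1/rank^skew. The skew of 0.99, table size, and fixed seed are illustrative assumptions, not values from either harness:

```csharp
using System;

// A minimal sketch of inverse-CDF Zipf sampling, not either author's actual harness.
public sealed class ZipfSampler
{
    private readonly double[] cdf;
    private readonly Random random = new Random(42); // not thread-safe; use one instance per thread in a threaded benchmark

    public ZipfSampler(int n, double skew = 0.99)
    {
        cdf = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            sum += 1.0 / Math.Pow(i + 1, skew);
            cdf[i] = sum;
        }

        for (int i = 0; i < n; i++)
        {
            cdf[i] /= sum; // normalize so the last entry is 1.0
        }
    }

    // Returns a key in [0, n), with 0 the hottest.
    public int Next()
    {
        int index = Array.BinarySearch(cdf, random.NextDouble());
        return index < 0 ? ~index : index; // ~index is the insertion point when the value is not found
    }
}
```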

@bitfaster (Owner, Author)

I realized I was running a debug build 🤦, then ran release and it was slower for LFU but faster for other caches. When I disabled dynamic PGO it was better in release but still not as good as debug (which is super weird).

Then I tried with #496 where I was experimenting with different JIT hints a while back, and it boosted it up to 150 million/sec which is a bit more like it.

I will need to see how these JIT hints perform outside of a toy test program and on other platforms (I had smoke tested this on .NET 7 on Windows on an old Skylake CPU, where I think it made almost no difference - and concluded that since it would clash with the new tiered JIT I shouldn't bother). Perhaps this is a case where the new tiered JIT in .NET gets confused. Totally agree that such tests can be quite misleading, and PGO of a wonky test further muddies the waters.
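
As a hypothetical illustration of the kind of hints .NET exposes (not necessarily what #496 actually changes): AggressiveOptimization opts a method out of tiered compilation and dynamic PGO so it is compiled fully optimized up front, which is exactly why such hints can interact oddly with the tiered JIT.

```csharp
using System.Runtime.CompilerServices;

// Illustrative only - the method bodies are placeholders.
public static class JitHintExample
{
    // Ask the JIT to inline even where its heuristics would decline.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Hash(int x) => unchecked(x * 31 + 17);

    // Skip tiering/PGO for this method; compile fully optimized immediately.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static long SumHashes(int[] keys)
    {
        long sum = 0;
        foreach (int k in keys)
        {
            sum += Hash(k);
        }
        return sum;
    }
}
```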

I believe I generate a Zipf distribution, my test runner is here. The dead code elimination is an excellent point - that is not mitigated in my test. Now you have prompted me to do a more thorough job of it.

As a related aside, I have been gradually tweaking the code that schedules the drain thread in LFU lookups, and reduced the lookup latency micro bench by about 30% and increased throughput by about 10-12%. I figured out how to replicate what I think are equivalent half-fence reads/writes, but I think I ended up using a GetAcquire before entering the lock whereas I think you have GetOpaque. This type of thing is unfortunately quite obscure in .NET, whereas Java's VarHandle is way richer.
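
A minimal sketch of that acquire-read-before-lock pattern, with hypothetical names (drainStatus, maintenanceLock, Drain - not BitFaster.Caching's actual members). Volatile.Read is .NET's acquire read; there is no getOpaque equivalent, so the only weaker option is a plain unordered read:

```csharp
using System.Threading;

public sealed class DrainScheduler
{
    private const int Idle = 0;
    private const int Required = 1;

    private int drainStatus;
    private readonly object maintenanceLock = new object();

    public void MarkRequired() => Volatile.Write(ref drainStatus, Required);

    public void DrainIfNeeded()
    {
        // The acquire read keeps the lock off the fast path when no drain is pending.
        if (Volatile.Read(ref drainStatus) == Required)
        {
            lock (maintenanceLock)
            {
                if (Volatile.Read(ref drainStatus) == Required)
                {
                    Drain();
                    Volatile.Write(ref drainStatus, Idle);
                }
            }
        }
    }

    private void Drain() { /* process buffered reads/writes */ }
}
```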

@bitfaster (Owner, Author)

BenchmarkDotNet is the closest equivalent to the Java Microbenchmark Harness (JMH), but it doesn't support threaded tests. Your tests are very clean - it's nice you can specify threaded tests like that.
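
Because BenchmarkDotNet measures single-threaded invocations, a threaded throughput test has to be hand-rolled along these lines - a minimal sketch in which the getOrAdd delegate stands in for the cache lookup under test and all names are hypothetical:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ThroughputHarness
{
    public static double MeasureOpsPerSecond(Func<int, int> getOrAdd, int threads, int opsPerThread)
    {
        long blackhole = 0; // results are accumulated so the JIT cannot dead-code eliminate the lookups
        var tasks = new Task[threads];
        var sw = Stopwatch.StartNew();

        for (int t = 0; t < threads; t++)
        {
            tasks[t] = Task.Factory.StartNew(() =>
            {
                long local = 0;
                for (int i = 0; i < opsPerThread; i++)
                {
                    local += getOrAdd(i & 1023); // small hot key range; a Zipf sampler fits here
                }
                Interlocked.Add(ref blackhole, local);
            }, TaskCreationOptions.LongRunning);
        }

        Task.WaitAll(tasks);
        sw.Stop();

        return threads * (double)opsPerThread / sw.Elapsed.TotalSeconds;
    }
}
```

Usage against a hypothetical cache instance might look like MeasureOpsPerSecond(k => cache.GetOrAdd(k, x => x), 16, 10_000_000); a real harness would also run a warmup pass first so JIT compilation doesn't land inside the timed region.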

@bitfaster bitfaster merged commit 9441fa9 into main Dec 4, 2023
@bitfaster bitfaster deleted the users/alexpeck/benchly branch December 4, 2023 19:19
