Use benchly #518
Conversation
Wow - super impressive! Is it an M3 Max? I am overdue a round of tests on .NET8 with dynamic PGO, but I doubt I will get close to that. I need a new laptop...
haha, yes; the 14-core version. I hadn't upgraded my laptop since 2016 and the battery gave out, so it became hard to coerce it to power up. I didn't expect it to be that good!
That's an extremely impressive result for 14 cores/threads - in addition to the hardware, the Java runtime/JIT must be well sorted. A few weeks ago I got an M2 Mac mini to test ARM - I just did a quick check, and with .NET8 ConcurrentLfu hits about 80 million/sec for me. That's a pretty substantial gap. The M3 Max has 12 perf cores vs the M2's 4, but I'm still more than 10x slower.
hmm.. that's interesting. I figured that since this is a CPU-bound workload where everything is in cache, the wider instruction decode engine might be giving it a large boost. On my 2016 laptop it was around 100M/s at 8 threads, iirc. I don't really think anything beyond a synthetic 60M/s is a visible benefit in the real world. Maybe our benchmarks differ significantly (here is mine)? I use a zipfian distribution to simulate a realistic request distribution with hot entries. The harness should take care of JIT warmup; returning the value should avoid dead code elimination (which would otherwise skew the results); and I tried to eliminate any benchmark overhead. What about yours?
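
For illustration, a minimal JMH-style sketch of the mitigations described above - this is not the linked benchmark; the commons-math3 ZipfDistribution, the Caffeine cache under test, and all names here are stand-ins for whatever the real harness uses:

```java
import java.util.concurrent.TimeUnit;

import org.apache.commons.math3.distribution.ZipfDistribution;
import org.openjdk.jmh.annotations.*;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ZipfGetBenchmark {
  private static final int SIZE = 1 << 14;
  private static final int MASK = SIZE - 1;

  private Cache<Integer, Integer> cache;
  private Integer[] keys;

  @Setup
  public void setup() {
    // Pre-generate a zipf-distributed key stream so a few hot keys dominate,
    // approximating a realistic request distribution; pre-populate the cache
    // so the measured loop is all hits.
    var zipf = new ZipfDistribution(SIZE, 0.99);
    keys = new Integer[SIZE];
    cache = Caffeine.newBuilder().maximumSize(SIZE).build();
    for (int i = 0; i < SIZE; i++) {
      keys[i] = zipf.sample();
      cache.put(keys[i], keys[i]);
    }
  }

  @State(Scope.Thread)
  public static class Index {
    int next; // each thread walks the pre-generated key stream independently
  }

  @Benchmark @Threads(8)
  public Integer get(Index index) {
    // Returning the value makes JMH consume it in a Blackhole, so the JIT
    // cannot dead-code-eliminate the lookup; JMH's warmup iterations take
    // care of JIT warmup before measurement begins.
    return cache.getIfPresent(keys[index.next++ & MASK]);
  }
}
```

Pre-generating the key stream keeps the (slow) Zipf sampling out of the measured loop, so the benchmark times only the cache lookup itself.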
I realized I was running a debug build 🤦, then ran release and it was slower for LFU but faster for other caches. When I disabled dynamic PGO it was better in release, but still not as good as debug (which is super weird). Then I tried with #496, where I was experimenting with different JIT hints a while back, and it boosted it up to 150 million/sec, which is a bit more like it. I will need to see how these JIT hints perform outside of a toy test program and on other platforms (I had smoke tested this on .NET7 on Windows on an old Skylake CPU, where I think it made almost no difference, and concluded that since it would clash with the new tiered JIT I shouldn't bother). Perhaps this is a case where the new tiered JIT in .NET gets confused. Totally agree that such tests can be quite misleading, and PGO of a wonky test further muddies the waters. I believe I generate a Zipf distribution; my test runner is here. The dead code elimination is an excellent point - that is not mitigated in my test. Now you have prompted me to do a more thorough job of it. As a related aside, I have been gradually tweaking the code that schedules the drain thread in LFU lookups and reduced the lookup latency micro bench by about 30% / increased throughput by about 10-12%. I figured out how to replicate what I think are equivalent half-fence reads/writes, but I think I ended up using a GetAcquire before entering the lock whereas I think you have GetOpaque. This type of thing is unfortunately quite obscure in .NET, whereas Java's VarHandle is way richer.
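
For reference, a minimal sketch of the half-fence read/write distinction mentioned above, using Java's VarHandle (the richer API referred to); the class and field names are illustrative, not either library's actual drain-status code:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public final class DrainStatus {
  private static final VarHandle STATE;
  static {
    try {
      STATE = MethodHandles.lookup()
          .findVarHandle(DrainStatus.class, "state", int.class);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  @SuppressWarnings("unused")
  private volatile int state;

  /** Opaque read: coherent, but imposes no ordering on surrounding accesses. */
  int peek() {
    return (int) STATE.getOpaque(this);
  }

  /** Acquire read: later loads/stores cannot be reordered before it. */
  int check() {
    return (int) STATE.getAcquire(this);
  }

  /** Release write: earlier loads/stores cannot be reordered after it. */
  void publish(int value) {
    STATE.setRelease(this, value);
  }
}
```

On weakly ordered hardware such as the ARM chips discussed above, an opaque read can compile to a plain load while an acquire read needs a load-acquire (ldar on AArch64), which is presumably why the choice is visible in lookup microbenchmarks.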
BenchmarkDotNet is the closest equivalent to the Java Microbenchmark Harness, but it doesn't support threaded tests. Your tests are very clean - it's nice you can specify threaded tests like that.


Hook up benchly to automate generating benchmark test results.
.NET6 latency: (chart)
Now with correct time units: (chart)
Box plot for LFU: (chart)
This likely needs to be split out per job to be able to see anything useful, but it is working.
Sketch freq: (chart)
v0.6 bugs fixed: (chart)