Conversation

@PMunch (Contributor) commented May 15, 2018

Changed a couple of procedures into templates and made the code a bit cleaner by removing superfluous `result =` statements. Also switched to echoing without a newline, like the C++ version, and switched to taking in editable Nodes instead of having to return tuples of Nodes (similar to how C++ uses pointers).

This makes the Nim version faster than the C++ raw-pointer version when run with the markAndSweep GC, and with the standard GC it uses less memory than C++ raw pointers (at least in my testing, averaged over 100 runs), but it is a bit slower.

On the devel branch the markAndSweep GC also drops to ~800KiB of memory usage, but leaving it at 5MiB for now is fine if you don't want to try that.

EDIT: By the way, to enable LTO, compile like this:

```
nim c -d:release --gc:markAndSweep --passC:-flto main.nim
```

@frol (Owner) commented May 15, 2018

Well, this definitely goes beyond a "naive" implementation. Let's have it as a separate implementation.

@PMunch (Contributor, Author) commented May 15, 2018

I wouldn't really say it's that far from naïve; it's basically a port of the C++ version with the `var` stuff, and templates and macros are as everyday as procedures and loops in Nim 😃 The only things that aren't completely naïve are the `{.noInit.}` pragma and the `uint32` conversions.

But a separate implementation is fine. Should I just create a separate file in the Nim folder?

@mratsim commented May 15, 2018

@PMunch is there a perf difference between `proc {.inline.}` and templates here?

@frol (Owner) commented May 15, 2018

I am still using Nim 0.18 and here are my results:

| Setup | Time, seconds | Memory, MB |
| --- | --- | --- |
| Nim 0.18 `-d:release --passC:-flto` | 0.31 | 0.63 |
| Nim 0.18 `-d:release --gc:markAndSweep --passC:-flto` | 0.208 | 5 |
| C++ raw pointers | 0.211 | 0.38 |
| D (see #23) | 0.198 | 0.25 |
| Rust (raw pointers, WIP) | 0.218 | 11 (does not free at all at the moment) |

We are starting to have a tight competition on this edge :)

Please, put this solution in a separate file.

@PMunch (Contributor, Author) commented May 15, 2018

Nothing considerable:

- Templates: 0.225708346534653
- `{.inline.}`: 0.226656207920792

After all, they do pretty much the same thing. The only benefit of using templates is that Nim handles the inlining itself, so it can theoretically do more optimisations and you are guaranteed that they will be inlined (IIRC `{.inline.}` only hints to the C compiler that it should inline them).

@narimiran (Contributor) commented:
> Well, this definitely goes beyond "naive" implementation. Let's have it a separate implementation.

FWIW, I've looked at the code and the changes, and IMO this shouldn't be a separate implementation, but the (only) implementation used.

@mratsim commented May 15, 2018

I prefer the rule of using the least powerful construct available between procs/templates/macros.

1. `{.inline.}` procs are implemented with a C preprocessor macro, so they should be inlined and optimized the same by GCC/Clang.
2. Assuming:

   ```nim
   template split(orig: Node, value: int32, lower, equal, greater: var Node) =
     var equalGreater: Node
     splitBinary(orig, lower, equalGreater, value)
     splitBinary(equalGreater, equal, greater, value + 1)
   ```

   if `value` is the result of `(x div y) * exp(foo)`, the template will do:

   ```nim
   splitBinary(orig, lower, equalGreater, (x div y) * exp(foo))
   splitBinary(equalGreater, equal, greater, (x div y) * exp(foo) + 1)
   ```

   i.e. the computation is done twice (unless optimized by the compiler). If there are side effects (`echo x; (x div y) * exp(foo)`), they will even be executed twice.

@PMunch (Contributor, Author) commented May 15, 2018

Pfft, D using no D runtime and only the C stdlib is cheating 😃 My Nim version uses only standard Nim features and the garbage collector.

Just kidding, good job by the D guys to push it that far. It would be interesting to see if I could push Nim even further, but then it wouldn't really be Nim. The Nim code you have there is something you could expect to find in a library or the stdlib. In fact, save for the `stdout.write`, this is pure Nim which can run on the JS target as well. Performance isn't super great though (native JavaScript code: 1.340s, 49 MiB; Nim JavaScript code: 2.855s, 25 MiB).

@PMunch (Contributor, Author) commented May 15, 2018

That's true @mratsim, both have their pros and cons (maybe you wanted to run it twice?). In this case though the performance is about the same, templates appear to be slightly faster, but that could just be measuring inaccuracy.

@frol (Owner) commented May 15, 2018

> FWIW, I've looked at the code and the changes, and IMO this shouldn't be a separate implementation, but the (only) implementation used.

> The Nim code you have there is something you could expect to find in a library or the stdlib.

I completely agree that it is the code quality you would expect to see in the stdlib and the best libraries out there, but our original idea was to see where one could arrive in each language with an initial implementation (i.e. before you start profiling it, if that ever even happens).

Yet, given the huge interest in the topic (I have merged 22 PRs in the last 24 hours), I am going to create a separate table in the main README, where we can show off all the dirty tricks (I am going to move PyPy and the `--gc:*` tweaks there).

@edmundsmith (Contributor) commented:
The key speedup for D happens to be unrelated to no-RT/C stdlib; it's about making `merge` tail-recursive, similar to how `split` is tail-recursive. I changed `Node* merge(Node*, Node*)` to `void merge(Node*, Node*, ref Node* result)` and made the penultimate step (right before calling the next `merge`) assign to the `ref result` what was previously the returned value. This avoids having to build up a stack of returned values, and allows the compiler to transform `merge` into a single flat loop rather than a unary tree/list of calls.

@PMunch (Contributor, Author) commented May 15, 2018

Aah, I thought about doing that as well, but I couldn't be bothered. I might have a peek at your implementation and see if I can copy it over to Nim to get any further speedup.

@frol I get your point, and I agree that the `--gc:*` tweaks could go there. But as I said, the Nim implementation I have here is not something super esoteric. It's probably what I would've written as a first attempt if I were mindful of performance (not returning objects is always a good thing, and that is the primary change).

@frol (Owner) commented May 15, 2018

@PMunch Well, none of the other solutions (except C++, which is used as the "bare metal" implementation baseline) breaks the common API, and that is by design. If that changes, we will have to redo everything from scratch.

@narimiran (Contributor) commented:
> original idea was to see where one could arrive in each language with initial implementation (i.e. before you start profiling it)

From the readme:

> but then we still wanted to compare the results with idiomatic (a.k.a. common practices) implementations for C++ ("raw-pointers") and Rust ("idiomatic").

Well, the "initial implementation" hugely depends on the person writing the code.

Somebody less experienced would write something slow and/or memory-inefficient (been there, done that), while somebody more experienced would write something similar to @PMunch's version (call it the "idiomatic a.k.a. common practices" Nim version, if you will) right from the start.

I still think this is not "show off all the dirty tricks" version, but the regular one.

@frol (Owner) commented May 15, 2018

> I still think this is not "show off all the dirty tricks" version, but the regular one.

Please, let's have two versions:

1. One that has the same API as the others (even the C++ "shared_ptr" version does that, though many C++ devs would argue that raw pointers FTW) - that is the existing Nim version.
2. One that can be anything the Nim compiler will agree to compile - that is the proposed Nim version.

@frol (Owner) commented May 15, 2018

BTW, is there a way to get a statically linked executable with Nim?

@PMunch (Contributor, Author) commented May 15, 2018

Statically linked to what?

BTW, here is what dirty tricks look like in Nim: #32

@alaviss (Contributor) commented May 15, 2018

@frol Similar to how you do it in C: pass `-static` to the linker with `--passL:'-static'`. That should eliminate all dynamic links to the C stdlibs.

@frol (Owner) commented May 15, 2018

> Statically linked to what?

To all the runtimes (including libc), like the `-static` flag on GCC/Clang.

TocarIP and others added 10 commits May 15, 2018 16:32
Use `*Node` instead of `Node` (same as D/C++). This changes performance from:

```
real    0m3.840s
user    0m4.647s
sys     0m0.087s
```

to:

```
real    0m0.653s
user    0m0.709s
sys     0m0.014s
```

on my machine (Core i7-6700).
…ation options to speed Swift solution by 20%
* Make gitignore more local, and add .vs directory.
* (minor) reformat project file
* Don't use prerelease sdk; appears if anything to have slightly higher startup cost
* Allow using C# 7.1 features - particularly tuples.
* Inline/simplify
* Remove redundant namespace qualification
* Remove redundant initializers
* Use expression-bodied-constructor
* Use var
* Elide redundant arguments
* More var
* Use expression-bodied members for one-liners
* Use else-if to emphasize mutually exclusive options
* Use tuples rather than special purpose class
* Mark non-instantiable class static
* Update readme
@PMunch PMunch force-pushed the nim-faster-than-c branch from 1721be8 to 1c8b5bb Compare May 15, 2018 14:33
@PMunch
Copy link
Contributor Author

PMunch commented May 15, 2018

Sorry about that, I didn't mean for all those commits to end up there…

@PMunch
Copy link
Contributor Author

PMunch commented May 15, 2018

I updated the manual version to also use my optimisations, and moved my optimisations into `main_fast.nim`.

@frol frol merged commit a803558 into frol:master May 15, 2018
@frol (Owner) commented May 15, 2018

@PMunch Thank you!

@yglukhov commented May 16, 2018

@mratsim +1
@PMunch, replacing all templates with procs makes no difference for me. Sometimes procs are even faster :). @mratsim has already made a couple of observations that the generated asm is, in lots of cases, precisely the same for procs and templates. Changing those could give this sample more chances to become the idiomatic impl for this benchmark ;)

@PMunch (Contributor, Author) commented May 16, 2018

@yglukhov I think the problem was that it didn't follow the same pattern as the others, since it takes in `var` arguments instead of returning a compound type. If it's just the templates, I guess we could change it. I notice a very slight dip in performance, from 0.282s to 0.291s averaged over 1000 runs (minimums were 0.193s and 0.195s respectively).

@frol (Owner) commented May 16, 2018

> I think the problem was that it didn't follow the same pattern as the others since it takes in var arguments instead of returning a compound type.

That is exactly right.

@PMunch (Contributor, Author) commented May 16, 2018

Why is it Nim with `--gc:markAndSweep`, and not the fast or manual variant, that is in the "tunned" (I assume this was meant to be "tuned"?) section?

@frol (Owner) commented May 16, 2018

@PMunch I have just restructured the existing README and am re-running all the results at the moment. Bumped into Go: #42 (comment)
