
Haskell version should be more idiomatic, use proper tools & threading, use benchmarking #26

Merged
merged 13 commits into atemerev:master from bitemyapp:master on Feb 15, 2016

Conversation

@bitemyapp
Contributor

bitemyapp commented Feb 14, 2016

Fixes #17, #24, #25

@bitemyapp
Contributor

bitemyapp commented Feb 14, 2016

The new version (based on @bartavelle's solution) takes 51 ms instead of 2.5 seconds on my machine.
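(For context, a minimal sketch of what such a Strategies-based parallel version can look like; this is an assumed shape matching the sky/childnums cost centres profiled later in the thread, not the merged Parallel.hs itself.)

import Control.Parallel.Strategies (parMap, rpar)

-- Each node fans out to `children` subtrees; leaves return their
-- position and every level sums its children's results in parallel.
skynet :: Int -> Int -> Int
skynet levels children = sky levels 0
  where
    sky 0 position = position
    sky l position =
      sum (parMap rpar (\i -> sky (l - 1) (position * children + i))
                       [0 .. children - 1])

main :: IO ()
main = print (skynet 6 10)  -- expected: 499999500000

Built with -O2 -threaded and run with +RTS -N, per the ghc-options in the PR's cabal file.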

@bitemyapp
Contributor

bitemyapp commented Feb 14, 2016

@atemerev is there something blocking this PR?

@codygman

codygman commented Feb 14, 2016

@atemerev I'm looking forward to seeing this Haskell version in the benchmark. If there's anything blocking this PR, I can help as well!

@bitemyapp
Contributor

bitemyapp commented Feb 14, 2016

I added more results from running these on Ubuntu and tried to get at least basic confidence intervals for the non-Haskell implementations (I don't know of criterion equivalents for the other languages), so people can get a sense of the upper and lower bounds.
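(For reference, a minimal criterion harness looks like the sketch below; it assumes the criterion package and a simple sequential skynet, and is not the benchmark code from this PR.)

import Criterion.Main (bench, defaultMain, nf)

-- A sequential skynet used only as the benchmark subject.
skynet :: Int -> Int -> Int
skynet levels children = sky levels 0
  where
    sky 0 position = position
    sky l position =
      sum [sky (l - 1) (position * children + i) | i <- [0 .. children - 1]]

main :: IO ()
main = defaultMain [bench "skynet 6 10" (nf (skynet 6) 10)]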

@ChristopherKing42
Contributor

ChristopherKing42 commented Feb 15, 2016

The current version of Parallel.hs is wrong: it outputs 500000500000 instead of 499999500000. This is fixed in bitemyapp#1.
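(The two outputs differ by exactly 1000000, which is consistent with an off-by-one in the summed leaf positions; this is an inference from the numbers, not from the linked fix.)

main :: IO ()
main = do
  print (sum [0 .. 999999])   -- 499999500000, the expected result
  print (sum [1 .. 1000000])  -- 500000500000, the buggy output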

@Gabriel439

Gabriel439 commented Feb 15, 2016

I like this version because it shows the performance trade-offs of various points on the design spectrum.

@Rydgel

Rydgel commented Feb 15, 2016

I really want to see this merged as well.

cabal-version: >=1.10
executable skynet
ghc-options: -O2 -threaded -rtsopts

@ghost

ghost Feb 15, 2016

+1. Go gets an unfair advantage in current master since it's using channels, while Haskell simply uses waits. The proper comparison should be between languages using the same paradigm (and since Haskell can use several, that offers an advantage here).
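(For reference, a minimal sketch of the same-paradigm comparison being asked for: one forkIO thread per node with results flowing through a Chan, as in the Go version. This is illustrative code, not the version in this PR.)

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM)

-- One forkIO thread per node; results travel upward through Chans,
-- mirroring the goroutine/channel structure of the Go version.
skynet :: Chan Int -> Int -> Int -> Int -> IO ()
skynet out num size div'
  | size == 1 = writeChan out num
  | otherwise = do
      c <- newChan
      let sub = size `quot` div'
      forM_ [0 .. div' - 1] $ \i ->
        forkIO (skynet c (num + i * sub) sub div')
      xs <- replicateM div' (readChan c)
      writeChan out (sum xs)

main :: IO ()
main = do
  c <- newChan
  _ <- forkIO (skynet c 0 1000000 10)
  print =<< readChan c  -- expected: 499999500000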

version: 0.1.0.0
synopsis: Simple project template from stack
description: Please see README.md
homepage: http://github.com/bitemyapp/skynet#readme

@ChristopherKing42
Contributor

ChristopherKing42 Feb 15, 2016

Shouldn't this be @atemerev's repo? (I'm not sure what the conventions for GitHub and cabal are.)

@Gabriel439

Gabriel439 Feb 15, 2016

All of these fields are optional and can be removed. Only the name field is mandatory, IIRC.

@bitemyapp
Contributor

bitemyapp Feb 15, 2016

It's from the stack template, an oversight. I can remove it if it's an issue. I didn't expect it would get uploaded to Hackage, so it shouldn't have any relevance.
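(A stripped-down skynet.cabal along those lines might look like the sketch below; note that cabal also requires a version field for packaging, so treat the exact minimum as an assumption.)

name:           skynet
version:        0.1.0.0
build-type:     Simple
cabal-version:  >=1.10

executable skynet
  main-is:          Main.hs
  build-depends:    base
  default-language: Haskell2010
  ghc-options:      -O2 -threaded -rtsopts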

@ChristopherKing42
Contributor

ChristopherKing42 commented Feb 15, 2016

@wizzard0 is there anything else that needs to be done with this pull request?

atemerev added a commit that referenced this pull request Feb 15, 2016

Merge pull request #26 from bitemyapp/master
Haskell version should be more idiomatic, use proper tools & threading, use benchmarking

@atemerev atemerev merged commit 82f318a into atemerev:master Feb 15, 2016

@bitemyapp
Contributor

bitemyapp commented Feb 15, 2016

@atemerev Thank you :)

@oblitum

oblitum commented Feb 16, 2016

The chan version of this is returning a wrong result.

After analysing the parallel version, I'm contemplating whether a pure compile-time C++ version is doable and fair.

@oblitum

oblitum commented Feb 16, 2016

OK, this is essentially the same algorithm that is being used in the new Haskell "parallel" version, which is being unfairly compared to all the other non-parallel schemes, only ported to C++. And, hey, it's a compile-time version; isn't that also fair? I'm not bothering to construct a PR for this, except in case the owner is interested in merging it for a side-by-side comparison, just like Haskell is doing, because, hey, it's an idiomatic language feature! Why not 😄

#include <chrono>
#include <cstdint>
#include <iostream>

constexpr int64_t skynet(int64_t levels, int64_t children, int64_t position = 0) {
    if (levels == 0)
        return position;
    int64_t sum = 0;
    for (int64_t i = 0; i < children; ++i)
        sum += skynet(levels - 1, children, position * children + i);
    return sum;
}

int main() {
    using namespace std;
    using namespace std::chrono;

    auto start = high_resolution_clock::now();
    constexpr auto result = skynet(6, 10);  // remove constexpr to eval at runtime
    auto elapsed = high_resolution_clock::now() - start;

    cout << "Result: " << result << '\n';
    cout << "time    " << duration_cast<milliseconds>(elapsed).count() << " ms" << endl;
}

With clang, you should compile this with:

  • clang++ -fconstexpr-steps=10000000 -std=c++14 -O3 skynet.cpp -o skynet

With gcc, do this:

  • g++ -std=c++14 -O3 skynet.cpp -o skynet

Clang is smarter and compiles this in quite a short time compared to GCC. Compilation time with clang should be shorter than what haskell/stack takes to compile its stuff.

Time to run: 0 ms

Switching to the runtime version is easy, as I've commented in the code:

Time for the runtime version: 10 ms

In any case, this one has evolved like the Haskell version, and has deprecated not only concurrency/channels/CSP/actors/whatever, but also parallelism! As they say: "less is exponentially more".

@oblitum


oblitum commented Feb 16, 2016

cc #2, #11.

@bartavelle

bartavelle commented Feb 16, 2016

What makes you think the Haskell version is not parallel, or is not computing its result at runtime? The version I wrote, which is very close to what has been merged, does both!

@bartavelle

bartavelle commented Feb 16, 2016

Here is a profiling run without the benchmarking:

        Tue Feb 16 08:29 2016 Time and Allocation Profiling Report  (Final)

           skynet +RTS -p -RTS

        total time  =        0.12 secs   (125 ticks @ 1000 us, 1 processor)
        total alloc = 182,304,880 bytes  (excludes profiling overheads)

COST CENTRE  MODULE   %time %alloc

skynet.sky   Parallel  57.6   70.7
skynet.sky.\ Parallel  42.4   29.3


                                                               individual     inherited
COST CENTRE          MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                 MAIN                     60           0    0.0    0.0   100.0  100.0
 run                 Parallel                122           0    0.0    0.0     0.0    0.0
  skynet             Parallel                123           1    0.0    0.0     0.0    0.0
 CAF:main1           Main                    118           0    0.0    0.0     0.0    0.0
  main               Main                    120           1    0.0    0.0     0.0    0.0
 CAF:run1            Parallel                110           0    0.0    0.0     0.0    0.0
  run                Parallel                121           1    0.0    0.0     0.0    0.0
 CAF:run4            Parallel                109           0    0.0    0.0   100.0  100.0
  run                Parallel                124           0    0.0    0.0   100.0  100.0
   skynet            Parallel                125           0    0.0    0.0   100.0  100.0
    skynet.sky       Parallel                126     1111111   57.6   70.7   100.0  100.0
     skynet.sky.\    Parallel                130     1111110   42.4   29.3    42.4   29.3
 CAF:childnums_r4kW  Parallel                108           0    0.0    0.0     0.0    0.0
  run                Parallel                127           0    0.0    0.0     0.0    0.0
   skynet            Parallel                128           0    0.0    0.0     0.0    0.0
    skynet.childnums Parallel                129           1    0.0    0.0     0.0    0.0
 CAF:run2            Parallel                107           0    0.0    0.0     0.0    0.0
  run                Parallel                132           0    0.0    0.0     0.0    0.0
 CAF:run3            Parallel                106           0    0.0    0.0     0.0    0.0
  run                Parallel                131           0    0.0    0.0     0.0    0.0
 CAF                 Data.Time.Clock.UTC     105           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Handle.FD        100           0    0.0    0.0     0.0    0.0
 CAF                 GHC.Event.Thread         98           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Encoding          95           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Handle.Text       92           0    0.0    0.0     0.0    0.0
 CAF                 GHC.Conc.Signal          88           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Encoding.Iconv    85           0    0.0    0.0     0.0    0.0

You can see that the sky function is entered the correct number of times (1111111 entries: one per node of the 10-ary tree of depth 6). Also:

$ time .stack-work/install/x86_64-linux/lts-5.1/7.10.3/bin/skynet +RTS -N8 -p
Result: 499999500000 in 0.128098s
 +RTS -N8 -p  0,93s user 0,05s system 735% cpu 0,134 total

It is using all the cores on my laptop.
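(For anyone reproducing a report like this one: GHC profiles of this kind are typically produced with the standard profiling flags; the exact invocation below is a sketch and may differ under stack.)

$ ghc -O2 -threaded -rtsopts -prof -fprof-auto skynet.hs
$ ./skynet +RTS -N8 -p -RTS    # writes the skynet.prof report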

@oblitum

oblitum commented Feb 16, 2016

@bartavelle I didn't state it's not parallel; I stated it is, meaning that this turn in the implementation has dramatically changed what's being compared, whatever this bench is trying to compare. Comparing Go using channels and goroutines with this Haskell version, which is just trivial parallel code without any channels, makes as much sense as comparing the Haskell parallel version to the C++ compile-time/runtime version (I can't even state to what extent the compiler would unroll, vectorize, etc., the linear C++ runtime code, effectively making it parallel to some extent). Despite this, maybe it's still possible to shoehorn something like Boost.Coroutine2 into it, just for the sake of claiming to be using coroutines.

@bartavelle

bartavelle commented Feb 16, 2016

Ah sorry, I didn't get your point. I think you are right, but this is a benchmark game ;)

@oblitum

oblitum commented Feb 16, 2016

@bartavelle so, is there anyone left to beat 0? ;-)

@bitemyapp
Contributor

bitemyapp commented Feb 16, 2016

@oblitum I already rejected a PR to calculate it at compile time in Haskell on my fork. There's your "0". I wouldn't accept anything not parallel/concurrent either.

@oblitum

oblitum commented Feb 16, 2016

@bitemyapp you mean compile time?

@bitemyapp
Contributor

bitemyapp commented Feb 16, 2016

@oblitum It's 0142 in my timezone and not far off in yours. Slow down, son.

@oblitum

oblitum commented Feb 16, 2016

@bitemyapp I'm not into this bench anymore; I'd just like to point out that after this pull the README has been left inconsistent/misleading. The implementation has been reworked for parallelism, for which numbers were published, while the claim that it is "coroutine/channel" based has been left untouched. And, as said before, the internal chan version is also returning wrong results.

@jb55

jb55 commented Feb 17, 2016

👎 How is this in the spirit of the competition at all?

The main benchmark should be changed to use concurrency (forkIO), not tight-loop parallelism.

@Gabriel439

Gabriel439 commented Feb 17, 2016

I think the number to report should be from the version using concurrency (i.e. forkIO), but the other versions should still be retained to show better ways to solve the same problem.

@ghost

ghost commented May 29, 2016

@oblitum I disagree.

Just to be clear, Haskell lists are not iterators. Consider the following sequential Haskell version:

main :: IO ()
main = print $ sum [0..999999]

$ ghc -O2 slow.hs
$ time ./slow
499999500000

real 0m0.040s
user 0m0.038s
sys  0m0.004s

Let's compare that with a sequential Rust version:

fn main() {
    let sum: i64 = (0..1000000).fold(0, |sum, x| sum+x);
    println!("{}", sum);
}

$ rustc -O test.rs
$ time ./test
499999500000

real 0m0.003s
user 0m0.002s
sys  0m0.001s

You might be tempted to think that Haskell is slower. But it's not! Lists are not loops. Consider a fairer comparison:

summation :: Int -> Int -> Int
summation accum end
    | end == 0  = accum
    | otherwise = summation (accum+end) (end-1)

main :: IO ()
main = print $ summation 0 999999

$ ghc -O2 fast.hs
$ time ./fast
499999500000

real 0m0.003s
user 0m0.002s
sys  0m0.001s

The sequential Haskell version is pretty close to the sequential Rust version, despite the fact that the Haskell version also needed to spawn a garbage collector. You might get slightly better performance in a language without a garbage collector, such as Rust, but it's not going to be a huge difference, provided you properly optimize the code.

I do agree that rpar does work stealing, and it is slightly unfair to compare it to other languages which are not doing the same thing. However, I don't think the Rust version you wrote is equivalent, because it's using iterators instead of lists.

@oblitum

oblitum commented May 29, 2016

@siddhanathan 😄 thanks for your analysis. Sorry for not being fair to Haskell now; I guess we are even then?

To tell the truth, I just cared to use Haskell's parallel version as-is because it's what is present here in the repo. Whether lists are not loops, are lazily evaluated, etc. seems a bit far from the rpar topic, and a Haskell-specific optimization.

IIRC, Haskell offers mechanisms for non-lazily-evaluated lists, so I wonder whether just using one would lead to the same improvement you got. Even then it would still be different from Rust's iterator-over-range version, which should be reduced to a loop automatically by the compiler (I'm unsure when using Rayon, but it's a common expectation of the Rust compiler). In Rust I'm still working with the high-level concept of a list (a range and an iterator are not conceptually the same as a loop) while knowing the compiler is smart enough to turn it into a loop in machine code.
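(As a hedged illustration of that non-lazy-list idea, assuming the vector package: an unboxed vector fuses to a tight loop, much like Rust's range-based iterator.)

import qualified Data.Vector.Unboxed as V

-- Sums 0..999999 over an unboxed vector; stream fusion should compile
-- this down to an accumulating machine loop with no list at all.
main :: IO ()
main = print (V.sum (V.enumFromN (0 :: Int) 1000000))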

@ghost

ghost commented May 30, 2016

@oblitum I doubt the strictness (strict vs lazy) or the type of data structure (linked list vs array) would yield that sort of speedup. There's stream fusion, but that's a whole new topic.

I played around with the MVar code a little and managed to reduce the time it spent in garbage collection:

{-# LANGUAGE BangPatterns #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

loop :: Int -> Int -> Int -> Int -> Int -> IO Int
loop !accum num size div !i
    | i == 0 = return accum
    | otherwise = do
        c <- newEmptyMVar
        forkIO (skynet c subNum sizeDiv div)
        s <- takeMVar c
        loop (accum+s) num size div (i-1)
      where
        !subNum = num + (i-1) * sizeDiv
        !sizeDiv = size `quot` div

skynet :: MVar Int -> Int -> Int -> Int -> IO ()
skynet c num size div
    | size == 1 = putMVar c num
    | otherwise = do
        result <- loop 0 num size div div
        putMVar c result

main :: IO ()
main = do
    start <- getCurrentTime
    c <- newEmptyMVar
    forkIO (skynet c 0 1000000 10)
    result <- takeMVar c
    end <- getCurrentTime
    putStrLn $ concat [ "Result: "
                      , show result
                      , " in "
                      , show (diffUTCTime end start) ]

Definitely not the prettiest code, but it yields performance similar to the Go version:

$ go run skynet.go
Result: 499999500000 in 1221 ms.
$ ghc -O2 -threaded -rtsopts mvar.hs
$ ./mvar +RTS -N4
Result: 499999500000 in 1.343052s

It's still spending way too much time in garbage collection. I'm sure the numbers can improve further.

As @jb55 mentioned earlier, perhaps it's best to replace the current Haskell benchmark numbers with these.

@oblitum

oblitum commented May 30, 2016

@siddhanathan Yes, de facto. For example, the following Rust version, which creates a vector instead of using a range, is an order of magnitude slower, on par with the current Haskell version:

extern crate time;
extern crate rayon;

use time::PreciseTime;
use rayon::prelude::*;

type T = usize;

fn skynet(levels: T, children: T) -> T {
    fn sky(levels: T, children: T, position: T) -> T {
        let childnums: Vec<_> = (0..children).collect();
        match levels {
            0 => position,
            _ => childnums.par_iter()
                 .map(|cn| sky(levels - 1, children, position * children + cn)).sum()
        }
    }
    sky(levels, children, 0)
}

fn main() {
    let start = PreciseTime::now();
    let result = skynet(6, 10);
    let end = PreciseTime::now();
    println!("Result: {} in {} ms", result, start.to(end).num_milliseconds());
}

I'd like to have a short way of creating a stack-allocated, initialized array instead of using a vector, but still, I guess it would not change things much.

@wizzard0 wizzard0 referenced this pull request Aug 1, 2016

Open

Note out of date results #54
