
Rust implementation #95

Merged
merged 11 commits into main
Dec 9, 2021
Conversation

tom91136
Member

This PR adds a standalone Rust implementation of the BabelStream benchmark and partially addresses #78.

Supported program arguments and output format should be identical to those of the C++ version.
Parallelism is implemented using Rayon; a single-threaded version is also included but not currently used.

Support for platforms other than CPU will be added in a separate PR.
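
For a flavour of the Rayon approach, here's a minimal sketch of a STREAM-style triad kernel (names are illustrative, not the PR's actual code):

```rust
use rayon::prelude::*;

// Hypothetical triad kernel: a[i] = b[i] + scalar * c[i],
// split across Rayon's worker threads.
fn triad(a: &mut [f64], b: &[f64], c: &[f64], scalar: f64) {
    a.par_iter_mut()
        .zip(b.par_iter().zip(c.par_iter()))
        .for_each(|(a, (b, c))| *a = *b + scalar * *c);
}
```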

@tom91136 tom91136 requested a review from tomdeakin March 25, 2021 15:42
@tom91136 tom91136 self-assigned this Mar 25, 2021
Comment on lines 78 to 84
fn init_arrays(&mut self, init: (T, T, T));
fn copy(&mut self);
fn mul(&mut self);
fn add(&mut self);
fn triad(&mut self);
fn nstream(&mut self);
fn dot(&mut self) -> T;


It would be useful to have more comments, for example on these traits. What is the overall goal of the code, what does each of these tests accomplish, and what is the expected behaviour?

@@ -0,0 +1,413 @@
use std::fmt::{Debug, Display};


Consider adding an inner doc comment (//!) here to describe the goals and objectives of this CLI tool.

It helps an outsider get an overview.
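
For instance, something along these lines (wording is only a suggestion):

```rust
//! BabelStream in Rust: measures achievable main-memory bandwidth using
//! the classic STREAM kernels (copy, mul, add, triad, nstream, dot).
```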


@andy-thomason andy-thomason left a comment


Very nicely done. Good idiomatic Rust.

A bit more description would be handy, but good otherwise.

@tom91136
Member Author

Thanks @andy-thomason! Yep, I'll mirror the comments from the original C++ version.

@64

64 commented Mar 27, 2021

Consider passing -C target-cpu=native to rustc (this is similar to -march=native). You can do this via the build.rustflags option in a .cargo/config.toml file (see https://doc.rust-lang.org/cargo/reference/config.html).

EDIT: You may also want to run cargo fmt
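
For example, a minimal .cargo/config.toml:

```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```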

Comment on lines 207 to 209
let a = &self.a;
let b = &self.b;
(0..self.size).into_par_iter().fold(|| T::default(), |acc, i| acc + a[i] * b[i]).sum::<T>()

@64 64 Mar 27, 2021


You can write this a little more idiomatically as:

self.a.par_iter().zip(&self.b).map(|(&a, &b)| a * b).sum()

(although I'm not sure if lack of associativity will mess things up...)


As Matt says, sequential floating point operations perform quite
badly in standard Rust as there is no "fast math" option to ignore the associativity
constraints. This leads to summation loops not vectorising as the LLVM autovectoriser
will not break the constraint.

(see https://stackoverflow.com/questions/30863510/how-do-i-compile-with-ffast-math for example).

You can write less idiomatic code that sums chunks using the chunks_exact(..) iterator
which will vectorise, but it is a shame that you have to. I am working on some code transformation
tools that may help in this case as part of our extendr R extension project.

I would generally avoid using SIMD crates and intrinsics unless you really have to.
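
A rough sketch of the chunks_exact(..) approach for the dot kernel (lane width 8 is an arbitrary choice here, not anything from the PR):

```rust
// Accumulate into a fixed array of partial sums so LLVM can vectorise
// despite strict float associativity; then fold the lanes and the tail.
fn dot_chunked(a: &[f64], b: &[f64]) -> f64 {
    const W: usize = 8;
    let mut lanes = [0.0f64; W];
    for (ca, cb) in a.chunks_exact(W).zip(b.chunks_exact(W)) {
        for i in 0..W {
            lanes[i] += ca[i] * cb[i];
        }
    }
    // Handle the tail that chunks_exact leaves over.
    let tail: f64 = a
        .chunks_exact(W)
        .remainder()
        .iter()
        .zip(b.chunks_exact(W).remainder())
        .map(|(x, y)| x * y)
        .sum();
    lanes.iter().sum::<f64>() + tail
}
```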

@tom91136
Member Author

There's an ongoing issue with NUMA awareness. I'm currently looking at possible solutions; don't merge yet.

@andy-thomason

andy-thomason commented Mar 30, 2021

It would be interesting to see if Rayon gets NUMA support. They would need to split the thread pool (more a crossbeam thing).
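
As a side note, Rayon does let you hook worker start-up, so linear-id pinning (not NUMA-topology-aware) can be bolted on. A sketch, assuming the core_affinity crate (not code from this PR):

```rust
// Build a Rayon pool whose workers pin themselves to cores by linear id.
// This fixes thread placement only; it does not consider NUMA topology.
fn pinned_pool(n: usize) -> rayon::ThreadPool {
    let cores = core_affinity::get_core_ids().expect("cannot query core ids");
    rayon::ThreadPoolBuilder::new()
        .num_threads(n)
        .start_handler(move |tid| {
            core_affinity::set_for_current(cores[tid % cores.len()]);
        })
        .build()
        .expect("failed to build thread pool")
}
```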

Contributor

@tomdeakin tomdeakin left a comment


LGTM - let me know when you're happy.

Add rustfmt and use target-cpu=native
Add option for libc malloc, basic thread pinning, touch-free allocation
Split modules
Fix wrong nstream in plain_stream
@tom91136
Member Author

Sorry, it turns out I forgot that I had stashed a big chunk of work from a while ago containing a Crossbeam version and flags for pinning/malloc.
With those commits, the suggestions for rustfmt and target-cpu=native have also been applied, thanks @64!
@andy-thomason The new Crossbeam version uses mutable chunks for each thread; I wonder if there's a more idiomatic Rust way of doing it.

@tomdeakin I've cleaned everything up and added CI for it. You might want to skim through it again; there's a standalone README as well.

@andy-thomason

You might want to try out the good old thread pool + atomic variable scheduler.

The problem with crossbeam/rayon is that they tend to use disjoint sections of memory,
which puts a heavy load on the memory controller that is shared amongst all the threads.
They are also quite bulky, but very good when threads are blocked.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    let next_work_item = Arc::new(AtomicUsize::new(0));
    let chunk_size = 1024;
    let job_size = 1235678;

    let threads = (0..8)
        .map(|_tid| {
            let next_work_item = next_work_item.clone();
            std::thread::spawn(move || loop {
                let work_item = next_work_item.fetch_add(1, Ordering::Acquire);
                let imin = work_item * chunk_size;
                if imin > job_size {
                    break;
                }
                let imax = (imin + chunk_size).min(job_size);
                for _i in imin..imax {
                    // do something with each index in [imin, imax).
                }
            })
        })
        .collect::<Vec<_>>();

    for t in threads.into_iter() {
        t.join().unwrap();
    }
}


```shell
> rustup install nightly
> rustup default nightly # optional, this sets `+nightly` automatically for cargo calls later


Instead of setting nightly as a global default, you could recommend rustup override set nightly, which sets it for the current directory only.
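
i.e. (run inside the project directory):

```shell
> rustup override set nightly # pins nightly for this directory only
```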

@64

64 commented Jun 16, 2021

Hm, I thought Rust uses malloc by default anyway?

@tom91136
Member Author

> Hm, I thought Rust uses malloc by default anyway?

I must have been living under a rock! I thought Rust might still use jemalloc in certain cases (one of the crates brought in jemallocator; that's probably why I thought that). The malloc option/experiment is mainly there to match what the native C version would do, that is, to prevent Rust from touching that uninitialised memory in any way.
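
For context, the shape of such a malloc-backed allocator on nightly's allocator_api looks roughly like this (a sketch assuming the libc crate, not the PR's exact code; alignment handling is elided since malloc's guarantee suffices for f64 buffers):

```rust
#![feature(allocator_api)]
use std::alloc::{AllocError, Allocator, Layout};
use std::ptr::NonNull;

// Hand allocation straight to libc::malloc so Rust never zeroes or
// otherwise touches the pages before the benchmark threads do.
struct Malloc;

unsafe impl Allocator for Malloc {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let p = unsafe { libc::malloc(layout.size()) } as *mut u8;
        NonNull::new(p)
            .map(|p| NonNull::slice_from_raw_parts(p, layout.size()))
            .ok_or(AllocError)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, _layout: Layout) {
        libc::free(ptr.as_ptr() as *mut libc::c_void);
    }
}
```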

@tom91136
Member Author

> You might want to try out the good old thread pool + atomic variable scheduler. […]

I'll give this a go.

@tomdeakin tomdeakin added this to the v4.0 milestone Jun 22, 2021
@tomdeakin
Contributor

Reviewed, and will check the last suggestion.

@tom91136
Member Author

@andy-thomason If I understand correctly about using std::thread::spawn, my data must be in the form of some Arc<Mutex<T>>, and since we're working on the chunks, I came up with something like this:

    use std::sync::{Arc, Mutex};

    let threads = 2;

    let xs = vec![1, 2, 3, 4];
    let cs = xs
        .chunks(xs.len() / threads) // one chunk per thread
        .map(|x| Arc::new(Mutex::new(x.to_vec())))
        .collect::<Vec<_>>();

    let ts = (0..threads)
        .map(move |t| {
            let tc = Arc::clone(&cs[t]);
            std::thread::spawn(move || {
                let mut data = tc.lock().unwrap();
                for i in 0..data.len() {
                    data[i] = 0;
                }
            })
        })
        .collect::<Vec<_>>();

    for t in ts.into_iter() {
        t.join().unwrap();
    }

This seems quite verbose compared to crossbeam's scope, and there's an unavoidable heap allocation unless the vectors are 'static.
Does crossbeam::thread::scope actually have control over the memory that gets captured?
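
For comparison, a minimal sketch of the same zeroing loop with crossbeam's scoped threads (hypothetical code, not from the PR):

```rust
// Each worker borrows a disjoint &mut chunk directly: no Arc, no Mutex,
// and no 'static bound, because scope joins every thread before returning.
fn zero_in_parallel(xs: &mut [i32], threads: usize) {
    let chunk = ((xs.len() + threads - 1) / threads).max(1);
    crossbeam::thread::scope(|s| {
        for part in xs.chunks_mut(chunk) {
            s.spawn(move |_| {
                for x in part {
                    *x = 0;
                }
            });
        }
    })
    .unwrap();
}
```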

@andy-thomason

crossbeam::thread::scope uses pointers and unsafe impl Send/Sync internally. You can do the same
using the standard library, but if you are just starting with Rust then use crossbeam.

crossbeam::thread::scope is externally safe because the task terminates before the lifetime of the reference ends,
whereas a std::thread must be joined separately, so the reference may dangle after the spawn call.

You can't safely send a reference across a thread boundary, but you can share a Vec, for example,
or make your own wrapper which implements Send and Sync.
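
The wrapper idea looks roughly like this (a sketch; soundness rests entirely on the caller handing each thread a disjoint range and joining before the slice goes away):

```rust
// A raw-pointer view over a mutable slice, unsafely asserted to be Send
// so std::thread::spawn will accept it. Every use site must uphold the
// disjointness and lifetime guarantees by hand.
#[derive(Clone, Copy)]
struct SharedSlice(*mut f64, usize);

unsafe impl Send for SharedSlice {}

impl SharedSlice {
    // Caller must ensure no other thread aliases this range.
    unsafe fn as_mut(&self) -> &mut [f64] {
        std::slice::from_raw_parts_mut(self.0, self.1)
    }
}
```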

@andy-thomason

I've made an example of sharing a mutable slice with the standard library here:

https://github.com/atomicincrement/multithread-std/blob/main/src/main.rs

Check out the Rustonomicon

Note that if you care about NUMA and other things then you will need to work a bit harder than using rayon. But rayon/crossbeam are still very good generalised libraries.

Add BABELSTREAM_NUM_THREADS env option
Refactor driver
Bump dependencies
@tom91136
Member Author

tom91136 commented Dec 6, 2021

@andy-thomason Thanks for that, I was able to implement the unsafe version in the latest commit using your example.
I've rerun some of the benchmarks with different combinations of the new options (--init, --pin, etc.); here are the results on a dual-socket Xeon machine:

[chart: bandwidth results for the Arc, Crossbeam, and Unsafe implementations across --init/--pin combinations]

(--init corresponds to alloc in the chart)

The results here are very similar to what we're getting for Julia; the pinning doesn't consider the topology of the NUMA nodes and just pins threads based on a linear thread id. If we set the OMP version to do close placement, the results are similar to Rust or Julia.

There are some weird performance drops for Arc when --pin and --init are set; I'll have to look into that later.

In the end, what did the trick was a combination of manual thread pinning and leaving the Vec uninitialised:

// nightly allocator_api: reserve capacity without writing to the pages,
let mut xs = Vec::with_capacity_in(size, allocator);
unsafe {
    // then claim the length so the pinned benchmark threads do the first touch
    xs.set_len(size);
}

It's also quite interesting that the Unsafe implementation can achieve the same performance as the uninitialised Arc and Crossbeam implementations.

I think this is ready for merge unless there's anything that stands out (cc @andy-thomason @64).
We'll be doing more experiments on Rust's performance as this gets merged.

@tomdeakin tomdeakin merged commit 9ec3018 into main Dec 9, 2021