
Rust implementation #95

Merged
merged 11 commits into main
Dec 9, 2021
Conversation

tom91136
Member

This PR adds a standalone Rust implementation of the BabelStream benchmark and partially addresses #78.

Supported program arguments and output format should be identical to those of the C++ version.
Parallelism is implemented using Rayon; a single-threaded version is also included but not currently used.

Support for platforms other than CPU will be added in a separate PR.
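
For a flavour of the Rayon approach, here's a minimal sketch of a STREAM-style triad kernel (names are illustrative, not the PR's actual code):

```rust
use rayon::prelude::*;

// Hypothetical triad kernel: a[i] = b[i] + scalar * c[i],
// split across Rayon's worker threads.
fn triad(a: &mut [f64], b: &[f64], c: &[f64], scalar: f64) {
    a.par_iter_mut()
        .zip(b.par_iter().zip(c.par_iter()))
        .for_each(|(a, (b, c))| *a = *b + scalar * *c);
}
```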

@tom91136 tom91136 requested a review from tomdeakin March 25, 2021 15:42
@tom91136 tom91136 self-assigned this Mar 25, 2021
Comment on lines 78 to 84
fn init_arrays(&mut self, init: (T, T, T));
fn copy(&mut self);
fn mul(&mut self);
fn add(&mut self);
fn triad(&mut self);
fn nstream(&mut self);
fn dot(&mut self) -> T;


It would be useful to have more comments, for example on these traits. What is the overall goal of the code, what does each of these tests accomplish, and what is the expected behaviour?

@@ -0,0 +1,413 @@
use std::fmt::{Debug, Display};


Consider adding an inner doc comment (//!) here to describe the goals and objectives of this CLI tool.

It helps an outsider get an overview.
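
For instance, something along these lines (wording is only a suggestion):

```rust
//! BabelStream in Rust: measures achievable main-memory bandwidth using
//! the classic STREAM kernels (copy, mul, add, triad, nstream, dot).
```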


@andy-thomason andy-thomason left a comment


Very nicely done. Good idiomatic Rust.

A bit more description would be handy, but good otherwise.

@tom91136
Member Author

Thanks @andy-thomason! Yep, I'll mirror the comments from the original C++ version.

@64

64 commented Mar 27, 2021

Consider passing -C target-cpu=native to rustc (this is similar to -march=native). You can do this via the build.rustflags option in a .cargo/config.toml file (see https://doc.rust-lang.org/cargo/reference/config.html).

EDIT: You may also want to run cargo fmt
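
For example, a minimal .cargo/config.toml:

```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```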

Comment on lines 207 to 209
let a = &self.a;
let b = &self.b;
(0..self.size).into_par_iter().fold(|| T::default(), |acc, i| acc + a[i] * b[i]).sum::<T>()

@64 64 Mar 27, 2021


You can write this a little more idiomatically as:

self.a.par_iter().zip(&self.b).map(|(&a, &b)| a * b).sum()

(although I'm not sure if lack of associativity will mess things up...)


As Matt says, sequential floating point operations perform quite
badly in standard Rust as there is no "fast math" option to ignore the associativity
constraints. This leads to summation loops not vectorising as the LLVM autovectoriser
will not break the constraint.

(see https://stackoverflow.com/questions/30863510/how-do-i-compile-with-ffast-math for example).

You can write less idiomatic code that sums chunks using the chunks_exact(..) iterator
which will vectorise, but it is a shame that you have to. I am working on some code transformation
tools that may help in this case as part of our extendr R extension project.

I would generally avoid using SIMD crates and intrinsics unless you really have to.
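
A rough sketch of the chunks_exact(..) approach for the dot kernel (lane width 8 is an arbitrary choice here, not anything from the PR):

```rust
// Accumulate into a fixed array of partial sums so LLVM can vectorise
// despite strict float associativity; then fold the lanes and the tail.
fn dot_chunked(a: &[f64], b: &[f64]) -> f64 {
    const W: usize = 8;
    let mut lanes = [0.0f64; W];
    for (ca, cb) in a.chunks_exact(W).zip(b.chunks_exact(W)) {
        for i in 0..W {
            lanes[i] += ca[i] * cb[i];
        }
    }
    // Handle the tail that chunks_exact leaves over.
    let tail: f64 = a
        .chunks_exact(W)
        .remainder()
        .iter()
        .zip(b.chunks_exact(W).remainder())
        .map(|(x, y)| x * y)
        .sum();
    lanes.iter().sum::<f64>() + tail
}
```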

@tom91136
Member Author

There's an ongoing issue with NUMA awareness. I'm currently looking at possible solutions; don't merge yet.

@andy-thomason

andy-thomason commented Mar 30, 2021

It would be interesting to see if Rayon gets NUMA support. They would need to split the thread pool (more a crossbeam thing).
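
As a side note, Rayon does let you hook worker start-up, so linear-id pinning (not NUMA-topology-aware) can be bolted on. A sketch, assuming the core_affinity crate (not code from this PR):

```rust
// Build a Rayon pool whose workers pin themselves to cores by linear id.
// This fixes thread placement only; it does not consider NUMA topology.
fn pinned_pool(n: usize) -> rayon::ThreadPool {
    let cores = core_affinity::get_core_ids().expect("cannot query core ids");
    rayon::ThreadPoolBuilder::new()
        .num_threads(n)
        .start_handler(move |tid| {
            core_affinity::set_for_current(cores[tid % cores.len()]);
        })
        .build()
        .expect("failed to build thread pool")
}
```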

Contributor

@tomdeakin tomdeakin left a comment


LGTM - let me know when you're happy.

Add rustfmt and use target-cpu=native
Add option for libc malloc, basic thread pinning, touch-free allocation
Split modules
Fix wrong nstream in plain_stream
@tom91136
Member Author

Sorry, it turns out I forgot that I had stashed a big chunk of work from a while ago containing a Crossbeam version and flags for pinning/malloc.
With those commits, the suggestions for rustfmt and target-cpu=native have also been applied, thanks @64!
@andy-thomason The new Crossbeam version uses mutable chunks for each thread; I wonder if there's a more idiomatic Rust way of doing it.

@tomdeakin I've cleaned everything up and added CI for it. You might want to skim through it again; there's a standalone README as well.

@andy-thomason

You might want to try out the good old thread pool + atomic variable scheduler.

The problem with crossbeam/rayon is that they tend to use disjoint sections of memory,
which puts a heavy load on the memory controller that is shared amongst all the threads.
They are also quite bulky, but very good when threads are blocked.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    let next_work_item = Arc::new(AtomicUsize::new(0));
    let chunk_size = 1024;
    let job_size = 1235678;

    let threads = (0..8)
        .map(|_tid| {
            let next_work_item = next_work_item.clone();
            std::thread::spawn(move || loop {
                let work_item = next_work_item.fetch_add(1, Ordering::Acquire);
                let imin = work_item * chunk_size;
                if imin > job_size {
                    break;
                }
                let imax = (imin + chunk_size).min(job_size);
                for _i in imin..imax {
                    // do something with each index in [imin, imax).
                }
            })
        })
        .collect::<Vec<_>>();

    for t in threads.into_iter() {
        t.join().unwrap();
    }
}


```shell
> rustup install nightly
> rustup default nightly # optional, this sets `+nightly` automatically for cargo calls later


Instead of setting nightly as a global default, you could recommend rustup override set nightly, which sets it for the current directory only.
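
i.e. (run inside the project directory):

```shell
> rustup override set nightly # pins nightly for this directory only
```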

@64

64 commented Jun 16, 2021

Hm, I thought Rust uses malloc by default anyway?

@tom91136
Member Author

> Hm, I thought Rust uses malloc by default anyway?

I must have been living under a rock! I thought Rust might still use jemalloc in certain cases (one of the crates brought in jemallocator; that's probably why I thought that). The malloc option/experiment is mainly there to match what the native C version would do, that is, to prevent Rust from touching that uninitialised memory in any way.
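
For context, the shape of such a malloc-backed allocator on nightly's allocator_api looks roughly like this (a sketch assuming the libc crate, not the PR's exact code; alignment handling is elided since malloc's guarantee suffices for f64 buffers):

```rust
#![feature(allocator_api)]
use std::alloc::{AllocError, Allocator, Layout};
use std::ptr::NonNull;

// Hand allocation straight to libc::malloc so Rust never zeroes or
// otherwise touches the pages before the benchmark threads do.
struct Malloc;

unsafe impl Allocator for Malloc {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let p = unsafe { libc::malloc(layout.size()) } as *mut u8;
        NonNull::new(p)
            .map(|p| NonNull::slice_from_raw_parts(p, layout.size()))
            .ok_or(AllocError)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, _layout: Layout) {
        libc::free(ptr.as_ptr() as *mut libc::c_void);
    }
}
```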

@tom91136
Member Author

> You might want to try out the good old thread pool + atomic variable scheduler. […]

I'll give this a go.

@tomdeakin tomdeakin added this to the v4.0 milestone Jun 22, 2021
@tomdeakin
Contributor

Reviewed, and will check the last suggestion.

@tom91136
Member Author

@andy-thomason If I understand correctly about using std::thread::spawn, my data must be in the form of some Arc<Mutex<T>>, and since we're working on the chunks, I came up with something like this:

    use std::sync::{Arc, Mutex};

    let threads = 2;

    let xs = vec![1, 2, 3, 4];
    let cs = xs
        .chunks(xs.len() / threads) // one chunk per thread
        .map(|x| Arc::new(Mutex::new(x.to_vec())))
        .collect::<Vec<_>>();

    let ts = (0..threads)
        .map(move |t| {
            let tc = Arc::clone(&cs[t]);
            std::thread::spawn(move || {
                let mut data = tc.lock().unwrap();
                for i in 0..data.len() {
                    data[i] = 0;
                }
            })
        })
        .collect::<Vec<_>>();

    for t in ts.into_iter() {
        t.join().unwrap();
    }

This seems quite verbose compared to crossbeam's scope, and there's an unavoidable heap allocation unless the vectors are 'static.
Does crossbeam::thread::scope actually have control over the memory that gets captured?
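
For comparison, a minimal sketch of the same zeroing loop with crossbeam's scoped threads (hypothetical code, not from the PR):

```rust
// Each worker borrows a disjoint &mut chunk directly: no Arc, no Mutex,
// and no 'static bound, because scope joins every thread before returning.
fn zero_in_parallel(xs: &mut [i32], threads: usize) {
    let chunk = ((xs.len() + threads - 1) / threads).max(1);
    crossbeam::thread::scope(|s| {
        for part in xs.chunks_mut(chunk) {
            s.spawn(move |_| {
                for x in part {
                    *x = 0;
                }
            });
        }
    })
    .unwrap();
}
```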

@andy-thomason

crossbeam::thread::scope uses pointers and unsafe impl Send/Sync internally. You can do the same
using the standard library, but if you are just starting with Rust then use crossbeam.

crossbeam::thread::scope is externally safe because the task terminates before the lifetime of the reference ends,
whereas a std::thread must be joined separately, so the reference may dangle after the spawn call.

You can't safely send a reference across a thread boundary, but you can share a Vec, for example,
or make your own wrapper which implements Send and Sync.
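
The wrapper idea looks roughly like this (a sketch; soundness rests entirely on the caller handing each thread a disjoint range and joining before the slice goes away):

```rust
// A raw-pointer view over a mutable slice, unsafely asserted to be Send
// so std::thread::spawn will accept it. Every use site must uphold the
// disjointness and lifetime guarantees by hand.
#[derive(Clone, Copy)]
struct SharedSlice(*mut f64, usize);

unsafe impl Send for SharedSlice {}

impl SharedSlice {
    // Caller must ensure no other thread aliases this range.
    unsafe fn as_mut(&self) -> &mut [f64] {
        std::slice::from_raw_parts_mut(self.0, self.1)
    }
}
```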

@andy-thomason

I've made an example of sharing a mutable slice with the standard library here:

https://github.com/atomicincrement/multithread-std/blob/main/src/main.rs

Check out the Rustonomicon

Note that if you care about NUMA and other things then you will need to work a bit harder than using rayon. But rayon/crossbeam are still very good generalised libraries.

Add BABELSTREAM_NUM_THREADS env option
Refactor driver
Bump dependencies
@tom91136
Member Author

tom91136 commented Dec 6, 2021

@andy-thomason Thanks for that, I was able to implement the unsafe version in the latest commit using your example.
I've rerun some of the benchmarks with different combinations of the new options (--init, --pin, etc.); here are the results on a dual-socket Xeon machine:

[chart: bandwidth results for the Arc, Crossbeam, and Unsafe implementations across --init/--pin combinations]

(--init corresponds to alloc in the chart)

The results here are very similar to what we're getting for Julia; the pinning doesn't consider the topology of the NUMA nodes and just pins threads based on a linear thread id. If we set the OMP version to do close placement, the results are similar to Rust or Julia.

There are some weird performance drops for Arc when --pin and --init are set; I'll have to look into that later.

In the end, what did the trick was a combination of manual thread pinning and leaving the Vec uninitialised:

// nightly allocator_api: reserve capacity without writing to the pages,
let mut xs = Vec::with_capacity_in(size, allocator);
unsafe {
    // then claim the length so the pinned benchmark threads do the first touch
    xs.set_len(size);
}

It's also quite interesting that the Unsafe implementation can achieve the same performance as the uninitialised Arc and Crossbeam implementations.

I think this is ready for merge unless there's anything that stands out (cc @andy-thomason @64).
We'll be doing more experiments on Rust's performance as this gets merged.

@tomdeakin tomdeakin merged commit 9ec3018 into main Dec 9, 2021