Rust implementation #95
rust-stream/src/main.rs
fn init_arrays(&mut self, init: (T, T, T));
fn copy(&mut self);
fn mul(&mut self);
fn add(&mut self);
fn triad(&mut self);
fn nstream(&mut self);
fn dot(&mut self) -> T;
It would be useful to have more comments, for example on these traits. What is the overall goal of the code,
what each of these tests accomplishes, what is the expected behaviour?
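A minimal sketch of what documented kernels could look like. The trait name `Kernels`, the `SerialStream` type, and the per-kernel formulas are assumptions taken from how the C++ BabelStream kernels are usually described, not from this PR's code:

```rust
/// The six BabelStream kernels over three equally sized arrays `a`, `b`, `c`.
/// `Kernels` is a hypothetical name; the formulas are assumed to match the C++ version.
pub trait Kernels<T> {
    /// Fill `a`, `b` and `c` with the given initial values.
    fn init_arrays(&mut self, init: (T, T, T));
    /// Copy: `c[i] = a[i]`.
    fn copy(&mut self);
    /// Mul: `b[i] = scalar * c[i]`.
    fn mul(&mut self);
    /// Add: `c[i] = a[i] + b[i]`.
    fn add(&mut self);
    /// Triad: `a[i] = b[i] + scalar * c[i]`.
    fn triad(&mut self);
    /// Nstream: `a[i] += b[i] + scalar * c[i]`.
    fn nstream(&mut self);
    /// Dot: returns the sum of `a[i] * b[i]`.
    fn dot(&mut self) -> T;
}

/// Hypothetical single-threaded reference implementation.
pub struct SerialStream {
    a: Vec<f64>,
    b: Vec<f64>,
    c: Vec<f64>,
    scalar: f64,
}

impl SerialStream {
    /// Allocate three zeroed arrays of `size` elements.
    pub fn new(size: usize, scalar: f64) -> Self {
        Self { a: vec![0.0; size], b: vec![0.0; size], c: vec![0.0; size], scalar }
    }
}

impl Kernels<f64> for SerialStream {
    fn init_arrays(&mut self, init: (f64, f64, f64)) {
        self.a.fill(init.0);
        self.b.fill(init.1);
        self.c.fill(init.2);
    }
    fn copy(&mut self) {
        self.c.copy_from_slice(&self.a);
    }
    fn mul(&mut self) {
        for (b, c) in self.b.iter_mut().zip(&self.c) {
            *b = self.scalar * c;
        }
    }
    fn add(&mut self) {
        for ((c, a), b) in self.c.iter_mut().zip(&self.a).zip(&self.b) {
            *c = a + b;
        }
    }
    fn triad(&mut self) {
        for ((a, b), c) in self.a.iter_mut().zip(&self.b).zip(&self.c) {
            *a = b + self.scalar * c;
        }
    }
    fn nstream(&mut self) {
        for ((a, b), c) in self.a.iter_mut().zip(&self.b).zip(&self.c) {
            *a += b + self.scalar * c;
        }
    }
    fn dot(&mut self) -> f64 {
        self.a.iter().zip(&self.b).map(|(a, b)| a * b).sum()
    }
}
```

Stating the formula in each doc comment also gives a reader the expected behaviour to check results against.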
rust-stream/src/main.rs
@@ -0,0 +1,413 @@
use std::fmt::{Debug, Display};
Consider adding an inner doc comment (//!) here to describe the goals and objectives of this CLI tool.
As an outsider, it helps to get an overview.
Very nicely done. Good idiomatic Rust.
A bit more description would be handy, but good otherwise.
Thanks @andy-thomason! Yep, I'll mirror the original comments in the C++ version.
Consider passing EDIT: You may also want to run |
rust-stream/src/main.rs
let a = &self.a;
let b = &self.b;
(0..self.size).into_par_iter().fold(|| T::default(), |acc, i| acc + a[i] * b[i]).sum::<T>()
You can write this a little more idiomatically as:
self.a.par_iter().zip(&self.b).map(|(&a, &b)| a * b).sum()
(although I'm not sure if lack of associativity will mess things up...)
As Matt says, sequential floating point operations perform quite badly in standard Rust as there is no "fast math" option to ignore the associativity constraints. This leads to summation loops not vectorising, as the LLVM autovectoriser will not break the constraint (see https://stackoverflow.com/questions/30863510/how-do-i-compile-with-ffast-math for example).
You can write less idiomatic code that sums chunks using the chunks_exact(..) iterator, which will vectorise, but it is a shame that you have to. I am working on some code transformation tools that may help in this case as part of our extendr R extension project.
I would generally avoid using SIMD crates and intrinsics unless you really have to.
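A rough sketch of the `chunks_exact` approach described above, for illustration only. The function name `dot_chunked` and the lane count of 8 are assumptions; whether the loop actually vectorises depends on the compiler version and target flags:

```rust
/// Dot product that keeps one independent accumulator per lane.
/// Each lane is summed in a fixed order, so the compiler can map the lanes
/// onto SIMD registers without reassociating any floating point additions.
pub fn dot_chunked(a: &[f64], b: &[f64]) -> f64 {
    const LANES: usize = 8; // assumed width, tune for the target CPU
    assert_eq!(a.len(), b.len());

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled separately.
    let a_rem = a_chunks.remainder();
    let b_rem = b_chunks.remainder();

    let mut acc = [0.0f64; LANES];
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for lane in 0..LANES {
            acc[lane] += ca[lane] * cb[lane];
        }
    }

    let tail: f64 = a_rem.iter().zip(b_rem).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f64>() + tail
}
```

The result can differ from a strictly sequential sum in the last bits because the additions are grouped per lane, which is the trade-off the associativity constraint normally prevents the compiler from making on its own.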
There's an ongoing issue with NUMA awareness. Currently looking at possible solutions, don't merge yet.
It would be interesting to see if Rayon gets NUMA support. They would need to split the thread pool. (more a crossbeam thing).
# Conflicts: # README.md
LGTM - let me know when you're happy.
Add rustfmt and use target-cpu=native
Add option for libc malloc, basic thread pinning, touch-free allocation
Split modules
Fix wrong nstream in plain_stream
Sorry, turns out I forgot I'd stashed a big chunk of work which contains a Crossbeam version and flags for pinning/malloc from a while ago. @tomdeakin I've cleaned everything up and added CI for it. You might want to skim through it again, there's a standalone README as well.
You might want to try out the good old thread pool + atomic variable scheduler. The problem with crossbeam/rayon is that they tend to use disjoint sections of memory
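One possible reading of the "thread pool + atomic variable scheduler" idea, sketched with the standard library only. It assumes the suggestion means that workers claim chunk indices from a shared atomic counter; `dot_atomic_scheduler` and its parameters are hypothetical names, and `std::thread::scope` needs Rust 1.63 or later:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Dot product where each worker repeatedly claims the next unprocessed chunk
/// by incrementing a shared atomic counter.
pub fn dot_atomic_scheduler(a: &[f64], b: &[f64], threads: usize, chunk_len: usize) -> f64 {
    assert_eq!(a.len(), b.len());
    assert!(threads > 0 && chunk_len > 0);

    let n_chunks = (a.len() + chunk_len - 1) / chunk_len;
    let next_chunk = AtomicUsize::new(0);
    let next = &next_chunk; // shared reference that each worker copies

    thread::scope(|s| {
        let mut workers = Vec::with_capacity(threads);
        for _ in 0..threads {
            workers.push(s.spawn(move || {
                let mut partial = 0.0f64;
                loop {
                    // Claim a chunk index; Relaxed is enough because the counter
                    // only hands out work and does not publish any data.
                    let chunk = next.fetch_add(1, Ordering::Relaxed);
                    if chunk >= n_chunks {
                        break;
                    }
                    let start = chunk * chunk_len;
                    let end = (start + chunk_len).min(a.len());
                    partial += a[start..end]
                        .iter()
                        .zip(&b[start..end])
                        .map(|(x, y)| x * y)
                        .sum::<f64>();
                }
                partial
            }));
        }
        // Combine the per-worker partial sums.
        workers.into_iter().map(|w| w.join().unwrap()).sum::<f64>()
    })
}
```

Which chunks a given worker ends up with varies from run to run, so this sketch makes no attempt at NUMA-aware placement.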
rust-stream/README.md
```shell
> rustup install nightly
> rustup default nightly # optional, this sets `+nightly` automatically for cargo calls later
```
Instead of setting nightly as a global default, you could recommend rustup override set nightly,
which sets it for the current directory only.
Hm, I thought Rust uses malloc by default anyway?
I must have been living under a rock! I thought Rust may still use jemalloc in certain cases (one of the crates brought in
I'll give this a go.
Reviewed, and will check the last suggestion.
@andy-thomason If I understand correctly about using
use std::sync::{Arc, Mutex};

let threads = 2;
let xs = vec![1, 2, 3, 4];
// Copy each chunk into its own Arc<Mutex<Vec<_>>> so a thread can hold a handle to it.
let cs = xs
    .chunks(xs.len() / threads) // one chunk per thread
    .map(|x| Arc::new(Mutex::new(x.to_vec())))
    .collect::<Vec<_>>();
// Spawn every thread before joining any of them.
let ts = (0..threads).map(move |t| {
    let tc = Arc::clone(&cs[t]);
    std::thread::spawn(move || {
        let mut data = tc.lock().unwrap();
        for x in data.iter_mut() {
            *x = 0;
        }
    })
}).collect::<Vec<_>>();
for t in ts { t.join().unwrap(); }
Which seems to be quite verbose compared to crossbeam's
You can't safely send a reference across a thread boundary, but you can share a
I've made an example of sharing a mutable slice with the standard library here: https://github.com/atomicincrement/multithread-std/blob/main/src/main.rs
Check out the Rustonomicon
Note that if you care about NUMA and other things then you will need to work a bit harder
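For comparison, a hedged sketch of the same "zero a slice from several threads" task using `std::thread::scope` and `chunks_mut`. Scoped threads were stabilised in Rust 1.63, so this is an alternative that needs no `Arc`, `Mutex`, or `unsafe`, not what the linked example does; `zero_in_parallel` is a hypothetical name:

```rust
use std::thread;

/// Zero `xs` in place, giving each spawned thread one disjoint mutable chunk.
pub fn zero_in_parallel(xs: &mut [i32], threads: usize) {
    assert!(threads > 0);
    // Round up so that at most `threads` chunks are produced.
    let chunk_len = ((xs.len() + threads - 1) / threads).max(1);

    // The scope joins every spawned thread before returning, which is why the
    // borrowed chunks are allowed to cross the thread boundary.
    thread::scope(|s| {
        for chunk in xs.chunks_mut(chunk_len) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x = 0;
                }
            });
        }
    });
}
```

This still spawns fresh threads on every call and does nothing about pinning or NUMA placement, so it addresses only the verbosity concern.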
Add BABELSTREAM_NUM_THREADS env option
Refactor driver
Bump dependencies
@andy-thomason Thanks for that, I was able to implement the unsafe version in the latest commit using your example. (
The results here are very similar to what we're getting for Julia; the pinning doesn't consider the topology of the NUMA nodes and just pins them based on a linear thread id. If we set the OMP version to do
There are some weird performance drops for Arc when
In the end, what did the trick was a combination of manual thread pinning and leaving the
let mut xs = Vec::with_capacity_in(size, allocator);
unsafe {
    xs.set_len(size);
}
It's also quite interesting that the Unsafe implementation can achieve the same performance as the uninitialised Arc, Crossbeam, and Unsafe implementations. I think this is ready for merge unless there's anything that stands out (cc @andy-thomason @64).
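A sketch of how the same "allocate without touching, then let worker threads write first" idea might be expressed on stable Rust. It uses `Vec::with_capacity` and `spare_capacity_mut` instead of the nightly allocator API in the snippet above, and `first_touch_vec` is a hypothetical helper; it does no thread pinning, and actual page placement depends on the OS memory policy:

```rust
use std::mem::MaybeUninit;
use std::thread;

/// Allocate `size` elements without initialising them on the calling thread,
/// then let each worker write (and therefore first-touch) its own chunk.
pub fn first_touch_vec(size: usize, threads: usize, value: f64) -> Vec<f64> {
    assert!(threads > 0);
    let mut xs: Vec<f64> = Vec::with_capacity(size);
    let chunk_len = ((size + threads - 1) / threads).max(1);
    {
        // Uninitialised storage, exposed as MaybeUninit so it is never read.
        let spare: &mut [MaybeUninit<f64>] = &mut xs.spare_capacity_mut()[..size];
        thread::scope(|s| {
            for chunk in spare.chunks_mut(chunk_len) {
                s.spawn(move || {
                    for slot in chunk.iter_mut() {
                        slot.write(value);
                    }
                });
            }
        });
    }
    // SAFETY: all of the first `size` slots were written by the workers above,
    // and the scope has joined every worker before this point.
    unsafe { xs.set_len(size) };
    xs
}
```

Setting the length only after every slot has been written avoids exposing uninitialised `f64` values, which calling `set_len` straight after allocation would do.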
# Conflicts: # .github/workflows/main.yaml
# Conflicts: # README.md
This PR adds a standalone Rust implementation of the BabelStream benchmark and partially addresses #78.
Supported program arguments and output format should be identical to the C++ version.
Parallelism is implemented using Rayon; a single-threaded version is also implemented but not currently used.
Support for platforms other than CPU will be added in a separate PR.