Parallel queries #292

Merged · 22 commits · Sep 8, 2020

Conversation

GrantMoyer
Contributor

@GrantMoyer GrantMoyer commented Aug 22, 2020

Implements IntoParallelIterator for QueryBorrow for rayon integration. I didn't end up using hecs' built-in BatchedIter, since it doesn't really mesh well with rayon.

Implements a ParallelIterator trait on top of bevy_tasks, which operates on batches. Then implements ParallelIterator for bevy_ecs::query's BatchedIterator.

Still needs to be tested; an example is included.
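For readers skimming the thread, here is a minimal sketch of the batch-oriented design described above. The trait name, next_batch(), and the use of bevy_tasks' TaskPool::scope follow the discussion in this PR, but the exact signatures are an approximation, not the merged code:

```rust
use bevy_tasks::TaskPool;

// Sketch only: a parallel iterator hands out whole batches, where each batch
// is itself an ordinary serial Iterator that a worker thread can drain.
pub trait ParallelIterator<B>
where
    B: Iterator<Item = Self::Item> + Send,
    Self: Sized + Send,
{
    type Item;

    /// Returns the next batch of items, or None when the iterator is exhausted.
    fn next_batch(&mut self) -> Option<B>;

    /// Runs `f` on every item; each batch becomes one task on the pool.
    fn for_each<F>(mut self, pool: &TaskPool, f: F)
    where
        F: FnMut(Self::Item) + Send + Sync + Clone,
    {
        pool.scope(|scope| {
            while let Some(batch) = self.next_batch() {
                // Clone the closure so each task owns its own copy.
                let f = f.clone();
                scope.spawn(async move { batch.for_each(f) });
            }
        });
    }
}
```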

@GrantMoyer
Contributor Author

I've added a small example for parallel iteration. Running the example revealed some bugs in my implementation, which I've fixed. Now the example produces the expected output.

Obviously, running correctly for one example doesn't prove the ParallelIterator implementation is correct, but I've had enough chance to think about it now that I'm pretty confident it is.

Note that I've re-exported rayon from bevy_ecs (pub use) so that the ParallelIterator traits are available, but I'm not sure what the best practice is here.

@karroffel karroffel added A-ECS Entities, components, systems, and events C-Enhancement A new feature labels Aug 23, 2020
@GrantMoyer GrantMoyer marked this pull request as ready for review August 25, 2020 22:44
@GrantMoyer GrantMoyer marked this pull request as draft August 31, 2020 02:52
@GrantMoyer
Contributor Author

I'm reworking this to emulate the std::iter::Iterator API using bevy_tasks.

@GrantMoyer GrantMoyer marked this pull request as ready for review September 1, 2020 04:05
@GrantMoyer
Contributor Author

I think this is ready to merge.

@cart
Member

cart commented Sep 2, 2020

looking now

@cart
Member

cart commented Sep 2, 2020

@aclysma @lachlansneff no pressure (i'm pretty comfortable reviewing this), but as the authors of bevy_tasks I would welcome your input here.

@lachlansneff
Contributor

This is a very interesting PR. I appreciate how much work you've put into emulating the ParallelIterator trait with bevy_tasks. My only worry is that it's not quite the same as something like rayon: .map() won't run its closure on the task pool, so the performance characteristics aren't the same.

@GrantMoyer
Contributor Author

GrantMoyer commented Sep 2, 2020

I had that worry at first too, but actually, since it sends whole iterators to threads in the task pool, the mapped function is run on worker threads. You basically build up a computation description, then send it with batches of data to worker threads. That's also why the map function has to be Send, Sync and Clone.
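A rough sketch of what that means for map(), using the batch-oriented ParallelIterator trait sketched earlier in the thread (the adapter and field names here are assumptions, not the merged code): the closure is only recorded when map() is called, and a clone of it travels to a worker thread inside each batch, which is why it has to be Send + Sync + Clone.

```rust
// Sketch only: the map() adapter stores the closure and clones it into every
// batch; the cloned closure then runs on whichever worker executes that batch.
struct Map<P, F> {
    iter: P,
    f: F,
}

impl<B, U, P, F> ParallelIterator<std::iter::Map<B, F>> for Map<P, F>
where
    B: Iterator + Send,
    P: ParallelIterator<B, Item = B::Item>,
    F: FnMut(P::Item) -> U + Send + Sync + Clone,
{
    type Item = U;

    fn next_batch(&mut self) -> Option<std::iter::Map<B, F>> {
        // No work happens here; the batch plus a clone of `f` is what gets
        // sent to the task pool.
        self.iter.next_batch().map(|batch| batch.map(self.f.clone()))
    }
}
```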

@lachlansneff
Contributor

Ah, I see, you have to call iter_batched first which gives you an iterator over batches of a particular size. Have you done any benchmarking?
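For reference, usage might look roughly like this; ComputeTaskPool, the component types, and the batch size of 64 are illustrative assumptions rather than code from this PR:

```rust
use bevy::prelude::*;
use bevy::tasks::ComputeTaskPool;

struct Velocity(f32);
struct Acceleration(f32);

// Hypothetical system: iter() borrows the query, iter_batched(64) splits it
// into batches, and for_each() fans those batches out across the task pool.
fn acceleration_system(
    pool: Res<ComputeTaskPool>,
    mut query: Query<(&Acceleration, &mut Velocity)>,
) {
    query
        .iter()
        .iter_batched(64)
        .for_each(&pool, |(acceleration, mut velocity)| {
            velocity.0 += acceleration.0;
        });
}
```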

@GrantMoyer
Contributor Author

No benchmarking yet. I'd like to try it on the 50,000 boids example @TheJasonLessard was working on, though. I could also run a more synthetic benchmark.

@lachlansneff
Contributor

Keep me and @aclysma posted.

@aclysma
Contributor

aclysma commented Sep 2, 2020

This looks really neat. I'd love to see some practical uses of it in action. The thing is, I don't find myself using iterator-heavy functional style programming that much as a personal preference. So I'm probably not a great person to give an opinion on this! :)

@GrantMoyer
Contributor Author

GrantMoyer commented Sep 3, 2020

Benchmarks

I've added three very basic benchmarks to the repo. They use the Criterion crate and run with cargo bench. The three benchmarks are detailed below. System specs for the benchmarks:

  • Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  • L1d cache: 128 KiB
  • L1i cache: 128 KiB
  • L2 cache: 1 MiB
  • L3 cache: 6 MiB

Overhead of ParallelIterator

Benchmark description: for each element of a 10,000-element Vec, do nothing. ParallelIterators use a batch size of 100.

| Test                         | Average Duration |
| ---------------------------- | ---------------- |
| Iterator                     | 1.43 ns†         |
| ParallelIterator, 1 thread   | 35,760 ns        |
| ParallelIterator, 2 threads  | 45,060 ns        |
| ParallelIterator, 4 threads  | 58,769 ns        |
| ParallelIterator, 8 threads  | 63,104 ns        |
| ParallelIterator, 16 threads | 61,878 ns        |
| ParallelIterator, 32 threads | 62,084 ns        |

† I suspect rustc is optimizing this test out

for_each() of an expensive operation over a long Vec

Benchmark description: for each element of a 10000 element Vec, run a 10000 iteration loop. ParallelIterators use a batch size of 100.

| Test                         | Average Duration |
| ---------------------------- | ---------------- |
| Iterator                     | 135.50 ms        |
| ParallelIterator, 1 thread   | 136.07 ms        |
| ParallelIterator, 2 threads  | 76.61 ms         |
| ParallelIterator, 4 threads  | 48.38 ms         |
| ParallelIterator, 8 threads  | 29.99 ms         |
| ParallelIterator, 16 threads | 30.14 ms         |
| ParallelIterator, 32 threads | 30.22 ms         |

10 map()s of an expensive operation over a long Vec

Benchmark description: over a 10,000-element Vec, chain 10 map() calls, where each mapped function doubles its input 1,000 times. ParallelIterators use a batch size of 100.

| Test                         | Average Duration |
| ---------------------------- | ---------------- |
| Iterator                     | 146.95 ms        |
| ParallelIterator, 1 thread   | 146.85 ms        |
| ParallelIterator, 2 threads  | 81.17 ms         |
| ParallelIterator, 4 threads  | 51.20 ms         |
| ParallelIterator, 8 threads  | 33.83 ms         |
| ParallelIterator, 16 threads | 33.80 ms         |
| ParallelIterator, 32 threads | 33.96 ms         |
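For readers who haven't used Criterion, a benchmark of this shape looks roughly like the following; the function and benchmark names are assumptions, not the exact benchmarks added in this PR:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// A minimal sketch of a serial-baseline benchmark comparable to the
// "Iterator" rows above.
fn bench_overhead(c: &mut Criterion) {
    let v: Vec<usize> = (0..10_000).collect();
    c.bench_function("iter_simple", |b| {
        b.iter(|| {
            v.iter().for_each(|x| {
                // black_box keeps rustc from optimizing the loop away.
                criterion::black_box(x);
            });
        })
    });
}

criterion_group!(benches, bench_overhead);
criterion_main!(benches);
```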

@GrantMoyer
Contributor Author

I think Criterion may be useful for reproducibly benchmarking code throughout bevy, but I can revert adding the benchmarks to the repo if we don't want to use Criterion or store benchmarks in the repo in general.

@cart
Member

cart commented Sep 4, 2020

Yeah, adding benchmarks is definitely something we've been wanting to do for a while, and criterion has my preference here. Adding criterion benchmarks to this PR is definitely welcome.

I'll try to wrap up my review today.

Member

@cart cart left a comment

Fantastic work! I have no comments on the various iterator implementations; they all look correct and close to ideal.

Just a couple of superficial comments on the high-level APIs, then I think this is ready to go.

Eventually I'd love to find a way to make this work (in a way that doesn't tank performance):

query.par_iter(10).for_each(&pool, |a| a.do_thing())

But that's a separate issue that non-parallel queries suffer from too.

///
/// Useful for distributing work over a threadpool using the
/// ParallelIterator interface.
pub fn iter_batched<'q>(&'q mut self, batch_size: u32) -> BatchedIter<'q, 'w, Q> {
Member

I think it makes sense to rename this to par_iter() and rename BatchedIter to ParIter (or BatchedParIter). The returned iterator is only usable in a parallel context, so the rename better captures the intent of callers, improves discoverability, and helps differentiate this from the internal hecs BatchedIter (which implements Iterator).

"Processing entity {}",
i.fetch_add(1, atomic::Ordering::Relaxed)
);
thread::sleep(Duration::from_secs(1));
Member

I like that this helps illustrate "expensive" tasks. This example helped prove to me that this works as expected, but I think I would prefer it if the example was a bit more practical. I don't want newbies thinking that they need to throw thread::sleep or AtomicUsize in their parallel iterators.

Can we make this more like a "minimal example showing how to set up parallel execution"?

@GrantMoyer
Contributor Author

I renamed BatchedIter to ParIter and I changed the example to be more practical. The new example is much more beautiful, and also demonstrates .filter() nicely.

@lachlansneff
Contributor

@GrantMoyer: Do you think it'd be possible to automatically figure out the optimum batch size based on the number of task pools, so users can just call .par_iter() instead of .par_iter(size_of_batches)?

@aclysma
Contributor

aclysma commented Sep 6, 2020

I would avoid automatic batch sizing. It would be bad to split 32 extremely quick operations across 32 cores, for example. The end-user should make an active decision based on what the workload is like.
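To make the trade-off concrete, here is a purely illustrative heuristic of the kind a user might start from when picking a batch size by hand; nothing like this is part of the PR, and the constants are arbitrary:

```rust
// Illustrative only: aim for a few batches per worker so the pool can balance
// uneven work, but never let batches get so small that per-batch scheduling
// overhead dominates (the failure mode described above).
fn suggest_batch_size(items: usize, workers: usize, min_batch: usize) -> usize {
    (items / (workers * 4)).max(min_batch).max(1)
}

fn main() {
    // 32 trivial items on a 32-core machine: one batch covers everything,
    // so the work stays in a single task instead of 32 tiny ones.
    assert_eq!(suggest_batch_size(32, 32, 64), 64);
    // 10,000 items on 8 workers: roughly 4 batches per worker.
    assert_eq!(suggest_batch_size(10_000, 8, 64), 312);
}
```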

Contributor

@aclysma aclysma left a comment

Took a deeper look, I think the high-level questions I have are:

  • Quantity:
    • Do we want something like this at all? (Probably)
    • How much of it should we have? (20% of work that covers 80% of uses, or exhaustive?) I generally favor starting simple/minimal and growing it based on need and practical use rather than trying to anticipate all possible needs. This is maybe a little more than I personally would start with, but it's a matter of taste and this certainly doesn't seem over the top. :)
  • Quality:
    • Is the high-level design appropriate as a first-pass? (I think so)
    • Are the implementations correct for what they set out to do?
      • I'm not sure all of them necessarily "work" as parallel iterators
      • A short one-sentence summary, link to in-depth stdlib documentation, and short remarks on implementation/caveats would be helpful without being onerous to write. I wouldn't object to this being done in a follow-on PR.

///
/// Useful for distributing work over a threadpool using the
/// ParallelIterator interface.
pub fn par_iter<'q>(&'q mut self, batch_size: u32) -> ParIter<'q, 'w, Q> {
Contributor

#[inline] would make this more consistent with some of the other functions in this file

Member

It's worth pointing out that inline sometimes regresses performance, and iterators are especially weird in that respect. It's worth testing perf for every inline/non-inline decision made.

Contributor Author

I haven't profiled inline vs not inline here, so I'd prefer to let the compiler make the decision for now.

}

unsafe impl<'q, Q: HecsQuery> Send for Batch<'q, Q> {}
unsafe impl<'q, Q: HecsQuery> Sync for Batch<'q, Q> {}
Contributor

Intuitively I'm not sure why Batch needs to be Sync? When do two threads need to have access to the same batch?

Member

Yeah, let's remove these if we can.

Contributor Author

Turns out only unsafe impl Send for Batch {} is needed.

pub(crate) iter: Option<P>,
}

impl<B, P> ParallelIterator<B> for Fuse<P>
Contributor

I'm not sure about fuse. It seems like changing your batch size would change the output in surprising ways.

Contributor Author

Fuse just makes sure that once next_batch() returns None, it always returns None.

For flatten, I think it's important to the iter-like API. Flatten necessarily changes the batch size, but note that the batch size isn't always fixed to begin with. In particular, query::ParIter doesn't have a fixed batch size.
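For context, here is a sketch of what Fuse amounts to, using the field from the snippet above and the batch-oriented trait sketched earlier in the thread; the body is an assumption about the obvious implementation rather than a copy of the merged code:

```rust
// Sketch only: Fuse guarantees that once next_batch() has returned None, it
// keeps returning None, regardless of how the inner iterator behaves.
struct Fuse<P> {
    iter: Option<P>,
}

impl<B, P> ParallelIterator<B> for Fuse<P>
where
    B: Iterator<Item = P::Item> + Send,
    P: ParallelIterator<B>,
{
    type Item = P::Item;

    fn next_batch(&mut self) -> Option<B> {
        match self.iter.as_mut()?.next_batch() {
            Some(batch) => Some(batch),
            None => {
                // Drop the inner iterator so every later call short-circuits.
                self.iter = None;
                None
            }
        }
    }
}
```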


// TODO: Investigate optimizations for less copying
/// Note that find consumes the whole iterator, and return a Vec of all results
fn find<F>(mut self, pool: &TaskPool, f: F) -> Vec<Self::Item>
Contributor

is this the same as filter and collect?

Contributor Author

@GrantMoyer GrantMoyer Sep 6, 2020

Oops, yeah. It should probably be removed or changed to only return the first.

let mut pos = None;
for item in batch {
if newf(item) {
pos = pos.or(Some(len));
Contributor

should this break to short circuit?

Contributor Author

@GrantMoyer GrantMoyer Sep 6, 2020

It doesn't really make sense to short circuit, since the whole batch has already been consumed.

Contributor

I misread this, didn't catch the .or().

However, do we want to continue calling newf for every element if pos is not None?

Contributor Author

Good point. I'll remove the unnecessary calls to the predicate.
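A sketch of the change being agreed on here: stop calling the predicate once a match is found, but keep draining the batch, since the total batch length is still needed to offset positions in later batches. The helper name is made up for illustration and this is not the exact merged code:

```rust
// Scan one batch: return the index of the first match (if any) plus the
// batch's total length, without calling the predicate after a match is found.
fn scan_batch<I, F>(batch: I, mut newf: F) -> (Option<usize>, usize)
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    let mut pos = None;
    let mut len = 0;
    for item in batch {
        if pos.is_none() && newf(item) {
            pos = Some(len);
        }
        len += 1;
    }
    (pos, len)
}
```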

}

unsafe impl<'q, 'w, Q: HecsQuery> Send for ParIter<'q, 'w, Q> {}
unsafe impl<'q, 'w, Q: HecsQuery> Sync for ParIter<'q, 'w, Q> {}
Contributor

not sure about forcing sync for ParIter

Contributor Author

ParIter is BatchedIter from hecs, but with ParallelIterator implemented instead of Iterator. Hecs implements Sync for BatchedIter, but I haven't thought much about whether that's sound.

Member

If we can remove the impls and things still work, lets remove them 😄

Contributor Author

Removed.

(0, None)
}

fn count(mut self, pool: &TaskPool) -> usize {
Contributor

I'm a little concerned people are going to end up putting extremely small units of work into some of these, blasting them across threads, and ending up a lot slower because of it. "Some people might misuse it" isn't a great reason to not do something, but I feel kind of obligated to say it.

Contributor Author

@GrantMoyer GrantMoyer Sep 6, 2020

Maybe I should add advice to the docs saying not to use ParallelIterator without profiling your code, because it has significant overhead relative to some tasks.

Member

Yeah, a quick doc calling that out is a good idea. It's also worth adding a comment to the example, as many people learn using those.

Contributor Author

Ideally, I'd like the example to use much more than 128 sprites, since it's probably faster to just do the parallel_query example serially. However, for now, the limiting factor is how long it takes to draw a large number of sprites.

Contributor Author

I've added documentation in a few places calling out the overhead concerns with ParallelIterator.
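Hypothetical wording for such a note (not the text that was actually added):

```rust
// "ParallelIterator sends each batch to the task pool, which costs scheduling
//  and closure cloning per batch. For cheap per-item work a plain serial
//  iterator is usually faster; profile before parallelizing a query loop."
```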

}

// TODO: Investigate optimizations for less copying
fn partition<C, F>(mut self, pool: &TaskPool, f: F) -> (C, C)
Contributor

I definitely understand not wanting to duplicate the docs for the standard library, but a very short description for what it does and a link to the standard docs would be helpful. In fact, I think it's good to say a bit less so that the implementation-specific notes (for example all() not short-circuiting) are more likely to be seen

Contributor

One potential improvement would be to partition by batch instead of item and return an iterator for each set of batches. (I feel like it's something that could be done in a follow-up change although it does change behavior in a non-trivial way.)

Contributor Author

I've added short descriptions of each ParallelIterator method and linked each to the relevant Iterator docs.

For partition, I'd rather leave performance improvements for the future. I think small breaking changes are still acceptable.

});
}

// Bounce sprties outside the window
Member

nit: sprites

for _ in 0..128 {
commands
.spawn(SpriteComponents {
material: materials.add(texture_handle.into()),
Member

i want to encourage users to reuse materials when they can (as different materials require gpu state changes during rendering). can we hoist this above the for loop like this?

let material = materials.add(texture_handle.into());
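A sketch of the suggested hoist; the SpriteComponents fields other than material are omitted/assumed:

```rust
// Create the material once and reuse the handle for every sprite, so all 128
// entities share one GPU material instead of 128 identical copies.
let material = materials.add(texture_handle.into());
for _ in 0..128 {
    commands.spawn(SpriteComponents {
        material: material.clone(),
        ..Default::default()
    });
}
```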

@GrantMoyer
Contributor Author

I rebased on master to avoid a conflict with the recent ParallelExecutor improvement.

@cart
Member

cart commented Sep 7, 2020

CI failure was from a nightly clippy update that added new criteria. I've resolved the issues in master.

@cart
Member

cart commented Sep 8, 2020

Ok I think this is good to merge. I do share @aclysma's concern that we may have started with a "bigger" implementation than we should have, but I'm happy enough with each iterator impl that I'm comfortable merging this as-is.

Thanks for your hard work @GrantMoyer!

@cart cart merged commit 586303f into bevyengine:master Sep 8, 2020
mrk-its pushed a commit to mrk-its/bevy that referenced this pull request Oct 6, 2020
Add support for Parallel Queries