This repository has been archived by the owner on Apr 18, 2022. It is now read-only.

Implement thread-safe ECS #10

Closed
ebkalderon opened this issue Jan 19, 2016 · 84 comments

@ebkalderon
Member

With game state management mostly stable, the next step is to implement an entity-component-system framework in the engine. This framework must follow the pure ECS model, in which entities are just IDs and components are POD structs.

Though the ECS model is extremely flexible and easy to data-drive, it has its shortcomings:

  • Difficult to cache important data
  • Difficult to parallelize due to a heavy reliance on mutable state

In a useful Reddit post, tomaka describes an alternative model similar to MVC with some parallelism. I propose a combination of the two designs resulting in a thread-safe ECS.

(Quoted from the other announcements section in TWIA Issue 1)

Clearly, the World struct containing our entities and components is analogous to the model in MVC. But systems typically require mutable state even for strictly read-only operations, so they don’t fit into either views or controllers, nor are they easy to parallelize.

However, there are two possible ways to remedy this issue.

  1. Systems are split into two separate traits: Updaters and Viewers. The Viewers only require read access to the World, so they can run in parallel. Updaters request mutable access to the World and submit a list of changes they want to make to the content. These changes are applied serially according to a predetermined priority.
  2. Systems are kept as one unit (perhaps renamed to “Processors” for clarity), and they all read the World struct in parallel but can optionally submit changes to the World, which are applied serially once everyone has finished reading.
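
For concreteness, here is a minimal sketch of what option 2 might look like. The World, Change, and Processor names below are hypothetical illustrations, not a settled API:

// Hypothetical sketch of design 2: processors read the world in parallel
// (shown serially here for brevity) and queue up changes that are applied
// serially afterwards, in processor order.
struct World { /* entities and component storages */ }

enum Change {
    // e.g. MoveEntity(Entity, Position), DestroyEntity(Entity), ...
}

trait Processor {
    /// Read-only pass over the world; returns the changes to apply later.
    fn run(&mut self, world: &World) -> Vec<Change>;
}

fn tick(world: &mut World, processors: &mut [Box<dyn Processor>]) {
    // Read phase: every processor sees the same immutable world.
    let queued: Vec<Vec<Change>> = processors.iter_mut().map(|p| p.run(world)).collect();

    // Write phase: apply all queued changes serially, in submission order.
    for batch in queued {
        for change in batch {
            apply(world, change);
        }
    }
}

fn apply(_world: &mut World, _change: Change) {
    // Interpret one queued change against the world state.
}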

Experiments are ongoing on the ecs branch.

@ebkalderon added the type: feature, diff: normal, and pri: critical labels on Jan 19, 2016
@ebkalderon added this to the 1.0 milestone on Jan 19, 2016
@rphmeier

You may be interested in my (very early stages, very experimental) similar project here:
https://github.com/rphmeier/rust_toy_engine/tree/master/src/ecs

I've done quite a bit of thinking about parallelizing processors (as I call them) over the past few months, and I've arrived at the following conclusions:

  • Processors will often have a hierarchy of priority, where certain processors must be run before others.
  • Processors which do not conflict (i.e. do not access any of the same data) can be run concurrently with no issue. Most general processing can be split up into a "read" phase and a "write" phase.

From this I came up with the following design:

  • Internal game state (entity manager, component data blobs) is wrapped in an RwLock.
  • Every processor is passed a WorldHandle, which consists of a reference to the locked state as well as a handle to a thread pool.
  • The WorldHandle can be used to iterate over all entities with a given set of components either asynchronously or synchronously, with only read access to their data.
  • This iteration will internally produce a queue of actions, which can be consumed during the processor's write phase.

I am still in the middle of implementing the parallel iteration, as well as the MPSC queue, but the intent is to have a standard processor look like this:

// The particle component this uses.
struct Particle {
    pos: Position,
    effect: Effect,
    frames_left: u32,
}

impl Component for Particle {}

struct ParticleProcessor {
    graphics: Graphics,
}
impl Processor for ParticleProcessor {
    fn process<'a, 'b, C: Components + 'a>(&'a mut self, world: WorldHandle<'a, 'b, C>) {
        enum Action {
            // draw the effect and decrement the frame count
            Decrement(Entity),
            // destroy the entity.
            Destroy(Entity)
        }

        // read phase is here -- no writes will occur while this is processing.
        let actions = world.all_with::<Particle>().for_each_async(|e| {
            // entities here are guaranteed to have the particle component.
            let part = world.get_component::<Particle>(e).unwrap();
            if part.frames_left != 0 {
                Action::Decrement(e)
            } else {
                Action::Destroy(e)
            }
        });

        // write phase is here. get exclusive access to the world.
        let mut write = world.write();

        for action in actions {
            match action {
                Action::Decrement(e) => {
                    let mut part = write.get_mut_component::<Particle>(e).unwrap();
                    part.frames_left -= 1;
                    draw_particle(&mut self.graphics, part.pos, part.effect);
                }
                Action::Destroy(e) => {
                    write.destroy_entity(e);
                }
            }
        }
    }
}

From this example we can see that there are a few different degrees of parallelism available to processors. Some can perform their reading phase asynchronously and then are required to do some work on a specific thread (graphics processors generally come to mind). Some will be able to do neither phase in parallel. Hopefully most will be capable of running purely in parallel with other processors, locking only to get exclusive access to the state.

Although the example here is somewhat contrived, the main idea is that as much work as possible should be done during the read phase, deferring updates until the later write phase. Once a processor is ready to enter its write phase, it should be very quick to simply apply its updates to the state and finish up.
Actual execution of processors looks something like this:

let mut my_world = ...;

my_world.process(|ctxt| {
    // all processors executed in this group will execute purely asynchronously, except for
    // contention during their write phases.
    ctxt.process_group(|group| {
        group.process(&mut A);
        group.process(&mut B);
        group.process(&mut C);
    });
    // execution returns here only once all grouped processors have completely finished,
    // guaranteeing an order of events.

    // these processors may use some parallelism internally,
    // but are not able to be sent to other threads to execute.
    ctxt.process_sync(&mut D);
    ctxt.process_sync(&mut E);

    // E is only run once D is fully complete.
});

All this being said, it seems like your initial reaction has similar elements to mine, but it focuses on a priority of events rather than a priority of processors.

@ebkalderon
Member Author

Great information! Seems like our approaches are indeed pretty similar, for the most part. For further comparison, here is an academic paper called A Concurrent Component-Based Entity Architecture for Game Development (2015). It arrives at a similar conclusion to ours:

  1. Process: Equivalent to our "Processors".
  2. Past State: Refers to the read-only world state from the previous frame read by Processes.
  3. Future State: Refers to the modified world state being generated by the Processes during the current frame.

If possible, I prefer a lockless approach to parallelism, taking advantage of the Rust ownership system to catch data races at compile time, rather than at runtime with a mutex or similar locking data structure. However, I am open to using locks if concrete benchmarks prove better performance.

@rphmeier

@ebkalderon Thanks for the paper. I haven't seen it yet, but it seems promising. I am wary of the idea of using N "past" states and a "future" state -- this necessarily means copying the game state N times per frame to different points in memory. Of course, there are situations like networking where copying game state, or part of it, is necessary.

One issue I have with the paper is the scheduling aspect of processors. The author chooses to assign each processor purely to one thread, using a task-parallel model rather than a data-parallel one. A large portion of the scheduling section is concessions that processor time is unpredictable, and there is likely to be a decent amount of wasted CPU time due to that unpredictability. I favor a work-stealing approach, where parallelizable processors simply decide which data they need, how to process it, and let all the threads work as fast as they can in order to do that. There is no excess time spent on scheduling or waiting for overloaded threads to finish.

I also am not completely convinced that purely lock-free is even viable here: using your model, at some point the threads must contend for access to the MPSC queue which produces component updates. It might be possible to implement such a queue in a lock-free way, but consider that the costs of locking an unlocked mutex are vastly overestimated (often not much more than an atomic compare-and-swap).

The key to any multithreaded ECS is idempotence. The same state S0 passed through the same processors should always lead to an identical output state S1. My initial approach was similar to yours, where every processor produced some amount of changes to the state, to be applied later. I toyed with a concept of "commutative" and "non-commutative" changes which could be made to component data -- these are strangely similar to Rust's borrowing system:

  • you could either make N commutative updates to an instance of a component, or
  • you could have one single non-commutative update to the instance.

To break these rules would lead to a runtime error. The implementation of this design led to a few glaring issues:

  • defining updates to components at the type level was unreasonably verbose
  • it was difficult to defer changes until post-processing efficiently

One data model I toyed with for complete concurrency was giving every component type its own data buffer, rather than storing all an entity's component data together in a block. If you protect each of these buffers with a mutex of its own, processors which do not access the same component data can interleave with zero contention. The caveat is that the entity manager is a resource which many processors will contend for access to, so it became the bottleneck.
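
A rough sketch of that per-component-buffer layout, with purely illustrative names (this is not from any actual implementation):

use std::sync::Mutex;

struct Position { x: f32, y: f32 }
struct Color(u8, u8, u8);

// Each component type gets its own independently locked buffer, so two
// processors touching disjoint component types never contend.
struct ComponentBuffer<T> {
    data: Mutex<Vec<Option<T>>>, // indexed by entity id; None = not attached
}

struct World {
    positions: ComponentBuffer<Position>,
    colors: ComponentBuffer<Color>,
    // The entity manager itself still needs shared protection, which is
    // exactly the bottleneck described above.
}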

I am curious what your mental model for the types of updates is currently. How does the user specify which component types the world manages and what order updates should be applied in? What type do updates have? Where are they stored? How are they iterated over to be applied?

@ebkalderon
Member Author

Thanks for the response! Just to clarify, there is no copying of the world being done; there is only the existing world state and the future world state being computed by the processors. The reason I contend that my proposal could be made lock-free is that it is task-parallel (AFAIK, a performance disadvantage) rather than data-parallel, with discrete parallel-read and serial-write stages, where the processors' writing priority is determined by the order in which they were passed into the Application through a builder pattern.

That said, I appreciate you explaining your position. The data-parallel design you described seems sound and probably more efficient (concrete benchmarks would have to be made to compare both to be sure). I welcome more feedback on both approaches, or other alternatives, from other participants before the final decision is made.

@HeroesGrave

Avoid trying to handle process scheduling. If any kind of scheduling is required by the final design, provide the tools to allow the user of the library to do it.

They'll be the ones that are able to properly benchmark different schedules in order to get the best performance for their particular game.

@kvark
Member

kvark commented Feb 2, 2016

@HeroesGrave nice to see you here!
I don't think your advice applies 100% in this context. I understand why you'd want to avoid scheduling in ecs-rs, since it's purely an ECS library. Amethyst, on the other hand, is a game engine, so it might as well take the responsibility of scheduling the systems.

@mangecoeur

Some thoughts from my attempts to parallelize a Flocking simulation:

  • If two systems affect the same entity properties (e.g. position) they must run in a deterministic order, and so cannot be run in parallel on the same entity. If one system needs access to all entities to do its thing, that precludes any other system from running in parallel, and it limits the amount of parallelism the current system itself can use, because it must avoid changing the world that it is using as the input for calculating its next update.
  • In the flocking example, the acceleration vector for each entity depends on the locations and velocities of its neighbours - if you iterate-and-write, you change positions mid-update, which messes with the simulation. You need access to all entities in order to calculate the update for each entity. One crude solution is to have two copies of the entities and switch between them - one is read-only and the other write-only, and you 'blit' them for each update (see the sketch after this list). You can then parallelize as much as you like, since you know you only ever write to one boid at a time and you can have shared reads. You could even simulate several frames in advance if you like and invalidate-on-input. You can optimize using spatial data structures, which could also use parallel builders.
  • From the flocking example, I feel that parallelism should be left up to the System implementations, because they know much more about what they are trying to do (flocking, physics, etc) and therefore can be much smarter about how (or if) to run it in parallel - including doing fancy things like pushing physics work to the GPU.
  • The job of the ECS would be to make writing parallel systems easier and more memory efficient. For example, for flocking you only need access to the position and velocity vectors. The ECS could let you "borrow" these from the World and they would have much lower memory footprint than copies of the whole world would.
  • In addition, this would enable parallelism for Systems that are guaranteed not to affect the same entity components. For instance, one system could affect only the "Color" components of the World's Entities, while another affects the "Position" components. Using Rust's borrowing system, it could guarantee that nothing is trying to write to both in parallel.
  • End result: ECS runs Systems in parallel that can be statically guaranteed not to tread on each others toes. Systems that cannot run in parallel run one after the other, but internally can use memory-efficient access to borrowed properties to parallelize their internal workings.
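
As a concrete illustration of the two-copy "blit" idea from the second bullet, here is a minimal double-buffered update step; Boid and the steering math are placeholders:

#[derive(Clone, Copy)]
struct Boid { pos: [f32; 2], vel: [f32; 2] }

fn steer(_me: Boid, _all: &[Boid]) -> [f32; 2] {
    [0.0, 0.0] // separation/alignment/cohesion rules would go here
}

// `current` is read-only this frame, `next` is write-only; because every
// write goes to a distinct slot of `next`, the loop body could be run in
// parallel with shared reads of `current`.
fn step(current: &[Boid], next: &mut [Boid]) {
    for (i, out) in next.iter_mut().enumerate() {
        let me = current[i];
        let accel = steer(me, current); // may safely read all *old* positions
        out.vel = [me.vel[0] + accel[0], me.vel[1] + accel[1]];
        out.pos = [me.pos[0] + out.vel[0], me.pos[1] + out.vel[1]];
    }
}

fn main() {
    let mut read = vec![Boid { pos: [0.0; 2], vel: [1.0, 0.0] }; 1000];
    let mut write = read.clone();
    for _ in 0..60 {
        step(&read, &mut write);
        std::mem::swap(&mut read, &mut write); // the "blit": swap the buffers
    }
}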

@Oflor
Contributor

Oflor commented Feb 9, 2016

Isn't there a way to implement systems which do not affect (write to) the same components?

@ebkalderon
Member Author

@Oflor Honestly, that is going to be difficult. Most notably: the physics processor, the particle processor, and the animation processor all modify entities' transform data. One could break down the components into ever-more granular and specific pieces, but this introduces a host of unnecessary complexities (having multiple transform components, syncing duplicate data between them, etc).

I personally prefer to establish and enforce strict order dependencies between processors, either similar to @mangecoeur's or @rphmeier's approaches, or according to my original proposal.

@ebkalderon
Member Author

The user-facing API design of the ECS has been drafted on the ECS Design page on the wiki. Please leave your feedback here or on Gitter!

Edit: Updated link.

@mangecoeur

Just had a look at the proposed usage. One thing I'm not keen on is that the proposed way of adding Processors doesn't make it sufficiently explicit that they will run in the order they are added, so that something like this:

.with(Input::new())
.with(Physics::new())

will not necessarily have the same result as this:

.with(Physics::new())
.with(Input::new())

Since there was some emphasis on making everything predictable and deterministic, I think it's important that this is really clear.

Some alternative suggestions

  • More explicit naming might help, like .append_processor(Physics::new()).
  • Create a vector of processors or processor references and add them, something like:
processors = vec!{Physics::new(), Input::new()}
world.add_processors(processors)
  • Create a list of processors using a macro (since vec! might not work as desired)
processors = processors_group!{Physics::new(), Input::new()}
world.add_processors(processors)

@mangecoeur

Also a question: how are Processors associated with the Components that they need to process?

@lschmierer
Contributor

Since there was some emphasis on making everything predictable and deterministic, I think it's important that this is really clear.

Since we want to run our ECS multi-threaded, will this order be used for the sequential writes to the world, while reading always happens simultaneously?

Or do we want to control which systems are run in parallel?
If so, I would suggest something like this:

SimulationBuilder::new()
    .append(Rendering::new())
    .append_parallel(Physics::new(), Ui::new())
    .finalize()

Also a question: how are Processors associated with the Components that they need to process?

I think processors simply get a reference to the world and query the components they need to access.

Proposal for the API of the world:

impl World {
    /// Creates a new empty world.
    pub fn new() -> World;

    /// Creates a new entity in the world and returns a handle to it.
    pub fn create_entity(&mut self) -> Entity;

    /// Destroys a given entity and removes its components.
    pub fn destroy_entity(&mut self, entity: Entity);

    /// Attaches a component to an entity.
    pub fn insert_component<T: Any>(&mut self, entity: Entity, comp: T);

    /// Removes a component from an entity.
    pub fn remove_component<T: Any>(&mut self, entity: Entity);

    /// Gets a component of an entity.
    pub fn component<T: Any>(&self, entity: Entity) -> Option<&T>;

    pub fn component_mut<T: Any>(&mut self, entity: Entity) -> Option<&mut T>;

    /// Iterate over components by type.
    pub fn components1<A: Any>(&self) -> impl Iterator<Item = (Entity, &A)>;
    pub fn components2<A: Any, B: Any>(&self) -> impl Iterator<Item = (Entity, &A, &B)>;
    pub fn components3<A: Any, B: Any, C: Any>(&self) -> impl Iterator<Item = (Entity, &A, &B, &C)>;
    // some more...

    pub fn components1_mut<A: Any>(&mut self) -> impl Iterator<Item = (Entity, &mut A)>;
    pub fn components2_mut<A: Any, B: Any>(&mut self) -> impl Iterator<Item = (Entity, &mut A, &mut B)>;
    pub fn components3_mut<A: Any, B: Any, C: Any>(&mut self) -> impl Iterator<Item = (Entity, &mut A, &mut B, &mut C)>;
    // some more...

}

@kvark
Member

kvark commented Mar 10, 2016

(sorry if I missed something in this thread, I haven't studied it all the way through)

ctxt.process_group(|group| {
        group.process(&mut A);
        group.process(&mut B);
        group.process(&mut C);
    });

and:

    .append_parallel(Physics::new(), Ui::new())

Both of these approaches enforce a fork-join model at the API level, which imposes a tight upper bound on the performance you can squeeze out of a parallel system, since it chooses a rather tiny subset of all possible execution graphs. For example, what if Ui doesn't need to wait for the previous operation while Physics does?

Optimally, each system just specifies the dependencies it needs before it can start, and then some sort of scheduler figures out when to start it. We did some experiments with @csherratt on fibe-rs; you could use its example for inspiration:

    let ha = task(move |_| {print!("Hello, ")}).start(&mut front);
    task(move |_| {println!("World!")}).after(ha.signal()).start(&mut front);

I think processors simply get a reference to the world and query the components they need to access.

If you don't explicitly expose which components are being read, and which are modified per system, you don't provide the scheduler an opportunity to figure out the proper timeline for their execution.

One could specify the systems directly as dependencies, or build the dependency graph judging by which components are getting modified and read. The latter would work well if you could actually enforce that these and only these components are what each system receives for processing.
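
One hedged way to encode the latter at the type level (illustrative names only, not a worked-out design): each system declares which component storages it reads and which it writes, so the compiler keeps it from touching anything else, and a scheduler could inspect the declarations to build the dependency graph.

trait System {
    type Reads;  // component storages this system only reads
    type Writes; // component storages this system mutates

    fn run(&mut self, reads: &Self::Reads, writes: &mut Self::Writes);
}

struct Velocity(f32, f32);
struct Position(f32, f32);

struct Movement;

impl System for Movement {
    type Reads = Vec<Velocity>;
    type Writes = Vec<Position>;

    // The signature alone guarantees Movement reads velocities and writes
    // positions, and nothing else.
    fn run(&mut self, vels: &Self::Reads, positions: &mut Self::Writes) {
        for (pos, vel) in positions.iter_mut().zip(vels) {
            pos.0 += vel.0;
            pos.1 += vel.1;
        }
    }
}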

@Oflor
Contributor

Oflor commented Mar 11, 2016

Parallelizing over systems might be overkill if we already parallelize over entities. There are going to be more entities than systems (if not, we should be able to run it all in a single thread anyway).

Specifying accessed components requires dealing with trait-objects and TypeIds (I think), and that is already less efficient than compile-time system managers.

@mangecoeur

Specifying accessed components requires dealing with trait-objects and TypeIds (I think), and that is already less efficient than compile-time system managers.

Wouldn't this require that Components be defined by implementing a trait? That would make it difficult to create game objects by loading their definitions from declarative data files (we would be forced to define all entities in Rust source code).

If you don't explicitly expose which components are being read, and which are modified per system, you don't provide the scheduler an opportunity to figure out the proper timeline for their execution.

I think this is important - even if it does mean messing with TypeIds, it opens up a lot of doors for speeding up execution, which should offset the extra complexity.

Another thought - it may be possible to perform these checks at compile time using compiler plugins. Declarative game object definitions could be read in at compile time (instead of at runtime) to allow these checks to be made. It may even be possible to build re-usable machinery that can either perform the checks at runtime or at compile time, so that while you prototype you can edit entities dynamically without waiting for a re-compile every time, but when you move to release you could bake everything in. Just an idea...

@kvark
Member

kvark commented Mar 11, 2016

@Oflor

Parallelizing over systems might be overkill if we already parallelize over entities. There are going to be more entities than systems (if not, we should be able to run it all in a single thread anyway).

I haven't considered that. At first sight, parallel entity processing would involve a lot of synchronization overhead. Like, you'd need some kind of dynamic sync before accessing each entity, as opposed to just starting a system. Another major problem would be cache contention - if components are close to each other in memory and one of them gets modified by one thread, access to all other components in that cache line will be delayed significantly for other threads... In other words, I'm not convinced it's worth it. Parallelizing systems just seems like an easier task, and it gives rather good performance - that's what Bungie and Naughty Dog are doing, IIRC.

Specifying accessed components requires dealing with trait-objects and TypeIds (I think), and that is already less efficient than compile-time system managers.

I'm pretty sure it should be possible to specify at compile time too. @mangecoeur proposed one way, via compiler plugins, but I'd also like to explore non-plugin approaches.

@ghost

ghost commented Mar 14, 2016

// Create the entity eid with the components Foo{}/Bar{}
let eid = ENTITY.with(Foo{}).with(Bar{}).create(&mut world);

// Update the component Baz
eid.with(Baz{}).update(&mut world);

The advantage of a builder API like the one shown above is that it does not lock the world object while you are building the list of components that you are modifying. This lets functions like this exist:

// do some update 
fn update(eid: Entity) -> EntityWith<(Foo,)> { eid.with(Foo{}) }

// in parallel update all entities and write the results into `world`
// not sure if this would actually work this elegantly (haven't got to trying it yet)
entities.parallel_map(update).write(&mut world);

Accessing the data could be done with queries on the world object.

// select every entity from world that has a Foo attached to it
for (eid, foo) in world.select::<(Foo,)>() {

}

// select every entity from world where Foo & Bar are attached to it
for (eid, (foo, bar)) in world.select::<(Foo, Bar)>() {

}

// select every entity from world where Foo & Bar are attached and 
// it is a member of the entity set
for (eid, (foo, bar)) in world.select::<(Foo, Bar)>().only(&entity_set) {

}

The biggest problem with these queries (or any iterator for that matter) is that you can't actually modify the world object since it is immutably borrowed by the select.

The solution I have always reverted to is making it cheap to clone world and then modify the cloned copy. I did this through copy-on-write, but I'm not sure it is a great general solution as it has performance penalties.

Alternatively, systems could work in parallel with past copies and write their updates to a changelist. This can be annoying, as you can't see the writes you have made until they are committed to the world, but I know it can be made to work.

@ebkalderon
Member Author

@csherratt We were initially going to go with the change list model, but eventually decided against it because of suspected latency. However, we haven't done any concrete profiling to confirm this. Have you tested this and/or the locking method before? If so, what were your conclusions?

@rphmeier

@csherratt's approach sounds really similar to the basic "queries" I've been imagining while experimenting. I can try and work a little on the implementation, to see how feasible they really are. If user-specified storage of component data is allowed (e.g. positional data structures), they can also compose with custom filters associated with that storage to get the entities desired with maximum efficiency.

The main issue with cloning the world is that it can get extremely expensive when you've got a lot of entities, and a lot of data to go with them. Even if it were as simple as a raw memcpy, you'd probably end up with each thread making one copy for itself. With a lot of data, you'd end up wasting a lot of time just copying it each frame. This would probably dwarf the cost of having a change list, walking it, and applying changes. I would strongly advocate taking a lock-based approach, since mutexes without contention are practically free. This does mean that whatever way we schedule processors, it should prioritize minimizing contention over data.

@kvark I do think you're right that having an implicitly fork-join execution model limits the space of execution graphs drastically. However, it also works simply to minimize contention over components, and the subset of execution graphs which it does allow are generally efficient. The alternate approach is what you said above: have a signal-based approach where processors wait to be "kickstarted" by their dependencies. This does mean that each processor needs to be aware of not only its dependencies, but also its successors. Can we easily encode this information without much boilerplate?

@ghost

ghost commented Mar 15, 2016

You could define a system based on its inputs and what it modifies. The scheduler knowing this is useful, as @kvark has said.

So, where does this lead? If a system is defined by what its input is, the system is pending on those inputs becoming stable before it can process. If a system modifies one or more components, it is both waiting on those components to be stable and producing a new version of them.

A Future is an exact match for this model. A system runs when all of its component futures are ready, and when it completes, a number of futures are marked as ready and other systems that depend on those components run. If two systems are run back to back and don't depend on output from one another, they run in parallel. If a system does depend on a component from another, it implicitly waits until the upstream system has finished running before it runs itself.

The ordering follows the order in software in the case of a conflict.

let mut world = World{};
loop {
  // run is a magical function that in practice would probably take more boilerplate than this.
  // I think it is possible to write, but have not attempted to yet.
  world = run(world /* world is moved here */, system_a);
  world = run(world, system_b);
  world = run(world, system_c);
  world = run(world, system_d);
}

@rphmeier about the copying, I said it would be copy-on-write, meaning that when you do an insert or delete the data-structure only copies the nodes that are shared with a read-only copy. It's not free, as the overhead is pushed to the writer (in the form of checks), but it is very convenient to work with since cloning is just bumping an atomic reference count.

I'm not a fan of locking. It's not because the locking itself is expensive. There are two problems

  1. Multiple writers wanting to update the same component have non-deterministic ordering.
  2. Writers wait behind readers, stalling their progress. If Foo is shared with 10 systems and one system needs to modify Foo, it has to wait for every other system with a read lock on Foo to release it before it can make progress. It just takes one slow system to stall everything. With the futures approach only writers can stall readers, which is still a problem, but there are fewer writers than readers.

@kvark
Member

kvark commented Mar 15, 2016

@csherratt Are you thinking about the Future as a concept (based on Pulse), or usable as is from the standard library? So somewhere we'd have each component storage associated with a Future, which gets passed to the next system that reads/writes this component. If a system modifies it, it replaces the Future with the new one, so that the next system will need to wait. Do I understand your idea correctly?

I also realized that the scheduling problem can be very much simplified. Sorry if this is obvious to someone, but I feel it is important to mention explicitly - we don't have to implement a full-featured scheduler for the ECS systems. Instead, we follow the order in which the user specifies the systems and the components they modify. Each system either waits or executes in parallel, depending on whether or not its input components are locked for writing and whether its output components are still being read.
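
A toy illustration of that rule, using one RwLock per component storage: a system blocks exactly while its input components are locked for writing or its output components are still being read. (Nothing here is a proposed Amethyst API, just the locking behaviour.)

use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let positions = Arc::new(RwLock::new(vec![0.0f32; 1024]));
    let velocities = Arc::new(RwLock::new(vec![1.0f32; 1024]));

    let (p, v) = (positions.clone(), velocities.clone());
    let movement = thread::spawn(move || {
        // Writer: waits until no reader still holds `positions`.
        let mut pos = p.write().unwrap();
        let vel = v.read().unwrap();
        for (p, v) in pos.iter_mut().zip(vel.iter()) {
            *p += *v;
        }
    });

    let p = positions.clone();
    let render = thread::spawn(move || {
        // Reader: any number of these can overlap with each other.
        let pos = p.read().unwrap();
        let _moved = pos.iter().filter(|p| **p > 0.0).count();
    });

    movement.join().unwrap();
    render.join().unwrap();
}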

This is very simple for the user: they don't have to think about it, but they can still fine-tune the execution order according to their specific requirements that the scheduler wouldn't know about anyway ;)

Going back to @csherratt's run function. Why does it need to receive the world by move? Also, I'd expect it to not be a freestanding function, but instead be called as scheduler.run(...).

So, given that the scheduling problem is addressed by relaxing the order and using Future for synchronization, I see only one big fat problem left: how do we specify the in/out components for systems such that this information is 1) passed at run-time to the scheduler, and 2) directly enforces the system implementation to only work with these components (at compile time, going back to the trait System topic)?

@ghost

ghost commented Mar 15, 2016

Yes, I was thinking of something like Future via Pulse, specifically something I called a SharedFuture. Basically a future that gives a read-only view of the data once it is ready, so multiple people can wait on it.

@kvark there is no problem with passing in &mut World and having the system modify it by swapping the systems it cares about. Just posted an example of the first thing that came to mind.

The one tricky area I foresee is enumeration of entities that are being created. It might be good enough to just use a mutex protected object. We could do the same thing as the futures of course.

@doppioslash

@kvark Actors are about getting as parallel and distributed as possible. The principles for achieving maximum parallelisation are foundational to Erlang and OTP.

"Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang."

Basically, following those principles is a way to get code to naturally run on as many CPUs as you have available. Some of them are already compatible with Rust's principles, like immutability by default.
It may not be trivial to devise an Entity Component system compatible with those principles, but it's likely to give whoever manages it big parallelisation advantages compared to everyone else.
The difficulty once you have Actors is synchronising them, which is what the Virding talk I linked is about.

@kvark
Member

kvark commented Mar 28, 2016

@doppioslash so if we try to categorize the actors approach into the ECS wiki, will it go as "message passing" solution for the component updates? Or somewhere else?

@doppioslash

@kvark I think "message passing" is correct.
What do you think, @OvermindDL1 ?

@OvermindDL1

@kvark @doppioslash Correct, ECS and Actor are two different paradigms; however, they can be used in tandem, they are not incompatible. Specifically, what I would do is have an Actor system for all Entities, but have the Entities themselves manage their own components. By keeping the components internal, you can have the actors process things themselves, fully async, and you can pack the components with the entity definition in memory as well. This is what Erlang does for an Actor, and it allows the Erlang garbage collector to be blazing fast: either it collects within a single Actor (very rare, since an Actor holds everything on its 'stack'; as functions are popped or TCO is used, things simply become unseen even though they still exist, and if the GC runs it is generally only because the Actor grew a very deep stack it no longer uses, so some of the tail memory is freed, which you can force via the hibernate command to fully compact the Actor), or the GC 'runs' when an Actor dies by just freeing the entire Actor memory (technically holding on to it to allocate another Actor within). The Erlang VM (BEAM) is blazing fast for Actors due to its design (which I know quite well at the JIT/HiPE level, so I can describe anything if you need), and I've yet to see another language implement it as fast or as safe. However, the BEAM is fully dynamic, i.e. not typed, and that harms its efficiency (about on par with Lua for math; not great, but not bad, just not 'C' speed, although its interaction points of ports, NIFs, and others make up for that where needed). Still, the overall programming design is a great thing to model, and based on what I have been learning about Rust, I think an Actor model could actually be implemented in it properly, unlike in Java and C++ where the holes utterly cause it to fail in the corner cases.

'Within' the Actor you would allocate the component data on its 'stack', and it could keep a cache on that same Actor 'stack' of which systems need to run within it. Sadly this means the function calls are inverted if the Actor itself runs them, though you could invert it internally so that the systems run on the internal Actor data. If the Actors remain tightly packed (say, an on-creation component allocation internally with externally linked dynamic components, or rebuilt on the fly, whichever makes more sense) then you could still get most of the cache efficiency as well, though honestly I would say do not even bother with this pre-optimization; the point of Actors is to run comparatively rarely by only responding to messages, so just have them send messages. Basically, say you have an Actor: it does not know where it is, it does not know its position. The Actor model is modeled after real life. If you are floating in space you have no clue where you are, not even on Earth; you have to use reference points, such as the light reflected from Earth's surface, to see your location. Same with Actors: they do not hold their position themselves; they would query (send a message to) the 'World' they are in (an octree perhaps, or however it wants to partition them internally; maybe it queries the physics engine for where it itself is). They only hold state data that is internal to them. Their Position is not internal data, just like it is not for you: it is all relative, and you figure out your location from prior knowledge, looking at your surroundings, etc. The nice thing about this design is that, say you have a game server, your Actors could be running across a whole bank of servers. You do have to design the game a little differently: there is no such thing as lockstep (which I personally think is an anti-design anyway), and messages take non-trivial time (usually microseconds, practically immeasurable, but a message is only received when the receiver is next scheduled; that might be immediate, or a second away if it is on a busy server across the world). It is very easy to program for once you learn it, and it has great advantages.

So for this, 'message passing' is probably the most accurate category, though still not fully accurate: the internal ECS could still use any of the other approaches, since it runs 'internally to' an Actor, while the Actors communicate via messages.

I just got home after a long day so I hope the above is understandable, I think I need a nap. ^.^

@OvermindDL1

I do want to make clear, imagine an Actor being implemented underneath via something kind of like this (in pseudo-C++):

struct ActorData {
    ActorPid id;
    AtomicQueue messageStack;
    GrowableStack stack;
};

Erlang's version has some (a lot of) extra fields for optimization purposes, but that is the basics of it.

The ActorPid would be, say, a struct of two 64-bit integers: a 'node' id and a 'self' id, the node being the ID of the machine it is running on and the self id being the actual ID of the Actor. In Erlang they work kind of like the Handles I mentioned previously: the index IDs are reused, but there is a use counter so a stale handle cannot re-access a recycled slot.

If you send a message to an Actor pid, the message is appended to the end of the AtomicQueue and the sender is done with it. Every Erlang function takes an internal hidden pointer to its position in its stack, which is just untyped memory. The variables in Erlang are untyped, but the memory itself is typed: doing something like anInt = 42 makes a name (not really a variable; it is immutable and just points to a spot in the stack, at the end of the 'active' part when the name was defined, unless the JIT optimizes it out, which it often does), and the memory it points to is typed. An integer like the above will be 32/64 bits (depending on platform width) with the first few bits defining the type (there are only a few built-in types, arranged that way for efficiency; a 0 integer is actually a full 32/64-bit 0 integer in memory). Obviously in Rust I would not recommend that, as Rust is actually typed and that provides a great boon.

So the scheduler is running (one per CPU core) and it sees that the Actor that received the message has a non-empty message queue, so it calls the frozen function (in Erlang, a receive call) and the Actor grabs the message at the front of the queue and processes it until it pauses again by calling receive. The OTP framework wraps the raw receive calls (which the JIT splits into distinct function calls internally anyway) into a thing with callbacks, the most basic of which is the GenServer. A GenServer defines an init callback (called on Actor creation, so it can set up initial state or load itself from a database or whatever), a set of handle_call methods (called when a message is received that wants a return message), a set of handle_cast methods (called with a message but sending no reply, though it can reply later if it wants), and a set of handle_info methods (which handle messages that were passed to the Actor outside the OTP framework; the OTP message is just a wrapper around the base message, and the OTP style is very simple, yet very powerful), as well as a terminate method (called when one of the other functions requests its death). As Erlang/BEAM/Elixir is functional and immutable, it carries the state around in an argument and returns the mutated state (which is then passed to the other methods later), and if the Actor wants to die it returns a value to stop itself, which then calls terminate.

If an Actor terminates uncleanly (crashes, divides by zero, whatever), terminate will not be called; the Actor is outright killed (all others keep going) and the Supervisor it was created under sees the reason why it died and usually recreates it, if you have it do so. If too many die too fast, then the Supervisor will (if your settings tell it to) kill itself too, also killing all Actors under its supervision, so that 'its' own Supervisor can recreate it and it can recreate the others with fresh, good state from the DB or whatever. That tree of Supervisors going down to basic GenServers (on which other things are built, like GenEvents, GenFSMs, etc.) defines the Supervision Tree of the entire Application (of which there can be multiple running within an entire project). When a Supervisor creates an Actor it can create it locally on the same Node (by default), on another Node, ephemerally, or in a few other ways, but it knows about them, watches them, and sets up Monitors with the Schedulers so it knows when they die (or when an entire Node dies, e.g. hardware failure), etc.

Basically, a simple Elixir Actor without OTP would be (:something is an Atom in Elixir, like my Atoms in C++ though a bit more powerful, think of them as a global Enum for quick integral matching):

def getOtherThing() do
  receive do
    {:wait, anything} -> # wait for anything and return it!
      anything
  after
    5000 -> # Nothing received after 5 seconds, how sad...  I could leave this out to wait forever...
      :nothing
  end
end

def myLoop(num) do
  receive do
    {:get, toPid} -> # Someone requested what my number is, let's tell them!
      send(toPid, {:theNum, num}) # Sending them a message of a tuple with the atom :theNum and my number
      myLoop(num) # Tail call, no stack growth

    {:set, newNum} -> # Someone wants to set my number, sure!
      myLoop(newNum) # Tail call, no stack growth

    {:set_get_old, toPid, newNum} -> # Someone wants to set my number and get the old one sent back to them, sure!
      send(toPid, {:theNum, num})
      myLoop(newNum) # Tail call, no stack growth

    :wait_for_another_thing -> # Ooo, someone says they will send me something else, then ignore it
      getOtherThing() # This has an embedded receive that will return the next message of the :wait type
      myLoop(num) # Tail call, no stack growth

    :stop -> # Aww, someone wants me to die...  Oh well...  :-(
      # just return, do nothing... you have to return 'something' in Elixir, so I'll just do the atom :stop
      :stop # Tail, exiting, this Actor will die

  after
    1000 ->
      IO.puts "Nothing happened for a second, I'm bored..." # The IO system can even redirect output if you want to another Node, say you connected a shell into the system to introspect it
      myLoop(num) # Tail call, no stack growth
  end
end

You could use it like:

boop = spawn(fn -> myLoop(5) end) # Spawn a new Actor with the initial arg of 5, returns the PID
send(boop, {:get, self()}) # Sends a message to boop; boop should respond fairly immediately, and you can dump all pending messages on self() in the shell by doing:
flush() # returns the number that boop sent back to us
send(boop, {:set, 10}) # Have boop use the number 10 now!
send(boop, :erghblergh) # boop will now have this message stuck in its message inbox, since no receive clause handles it; should have added a catch-all case to receive to log and dump it, OTP does that for you!
send(boop, :stop) # boop will now die when it processes this message a few microseconds later
send(boop, :anything) # This goes nowhere, you get no response, no confirmation; if I had set up a Monitor on boop then I would have got a Monitor :down message when boop died

With OTP, the above as a GenServer would be more like this:

defmodule Boop do
  use GenServer # this brings in a set of methods and such that you can override below, technically optional if you define them all, but good for documentation purposes anyway

  def start_link(startNum) do # The 'start_link' function can be named whatever, this is just normal OTP style
    GenServer.start_link(__MODULE__, startNum) # This calls the start_link on GenServer, what it does is that whatever process called this method will get an automatic monitor
      # setup for the process that GenServer is also creating (and returning the pid of), it will use this module 'Boop' as the callback module
      # When the GenServer code of this Actor (it has its own receive loop) receives a message then it will call this callback module
      # The callback functions are resolved at JIT/Actor creation time so no indirect calls are done that would waste time
  end

  def init(startNum) do # The argument is the second argument given to GenServer.start_link, can be whatever you want
    {:ok, startNum} # Return a tuple of {:ok, state} to start your GenServer actor, or return an {:error, msg} to die, few other options too
  end

  def handle_call(:get, _from, num) do
    {:reply, num, num} # return the number to the caller (second tuple entry) and as the state again (third)
  end

  # ... more handle_call clauses here ...

  def handle_cast({:set, newNum}, _num) do
    {:noreply, newNum} # change the state to the new number
  end

  # ... more handle_cast clauses here ...

  def handle_cast(:stop, _num) do
    {:stop, :normal, terminationStateIfAny} # Stop this GenServer with a :normal reason (can give anything else for a failure message) and pass in any data as the state for a terminate function if it exists
  end
end

And more (though for simple Actor data containers like the above, Elixir has a GenServer sub-type called an Agent; the above is similar to how an Agent is implemented). The raw BEAM 'receive' stuff would not be so easy in Rust, as I've not seen a way for Rust to break up a single function into multiple callable parts, but the OTP style is perfectly doable in Rust, which is good, as raw receive should almost never be used.

And for an example of a BEAM webserver, take Phoenix: they've got it to over 2 million simultaneous websocket connections, and sending a message to all of those connections (a Wikipedia article, used due to its size) took 2-4 seconds on a single machine. BEAM is, for example, what WhatsApp uses to host their messaging system on a single server for all users. This is not really a show-off of BEAM itself, as BEAM still has plenty more optimizations that can be done to it, but rather shows what the OTP style is capable of. Even if they did not have a single 24-core machine with 96 gigs of RAM (which is still 40% unloaded at their busiest times), they could have scaled out onto more computers, spooled up more Phoenix nodes, and created a Mesh where everything can communicate transparently. BEAM scales not only up onto more cores but also out onto more machines (great in case of hardware failure!), not because the VM is so good (although it is very well made), but because of the OTP programming principles.

If you want arbitrary scaling abilities across an arbitrary number of connections with an arbitrary number of running data containers (like a Game Entity) that can each be running different code, the OTP style is really the way to go.

Think of it this way: each Actor in an Actor system is like its own little running system process (just a lot more lightweight). You can program using OOP or ECS or whatever 'inside' of it, but all communication between Actors is done via messaging. Actors are just a higher-level system view; you can still use ECS within them, or whatever is appropriate for the problem domain.

Also sorry for another long post. I get a bit rambly when tired. If you have specific questions I am much shorter at answering those. ^.^

@HeroesGrave

I've been messing around with an approach using clone-on-write and simple DAGs for scheduling.

First of all, data is clone-on-write. Component state is not mutated. This allows all components from the previous cycle to be visible in order to create the components for the next cycle.

Second of all, each system outputs one component (or intermediate calculation; see below).

The main problem with this approach, of course, is the number of frames it may take for a change that depends on multiple components to propagate. The way I've solved this is by allowing systems to depend on certain components or intermediate calculations already being processed, and to use those up-to-date values. This way, all necessary changes can be propagated in 1 frame, as well as allowing systems to reuse intermediate calculations (I suspect this will be quite rare, but it's a useful feature nonetheless).

Therefore the signature for a system that requires 3 other systems to have completed will look like this:

fn process(&mut self, input: (&ComponentList<A>, &ComponentList<B>, &ComponentList<C>), previous_cycle: &Components) -> ComponentList<Out>;

There is then a macro that builds up a sort-of DAG representing the systems that need to be completed before a certain system can be run. Then, using a scoped thread pool and some RwLocks, all the systems are queued up (or as many as the thread pool allows at once) and executed as their dependencies are met. When everything is finished, a new Components structure is returned with the updated components, and intermediate calculations are discarded.
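
For illustration, the execution driven by such a DAG could be as simple as the following topological walk (a serial toy version, assuming an acyclic graph; a scoped thread pool would instead launch every node whose dependencies are met):

use std::collections::HashMap;

struct SystemNode {
    name: &'static str,
    deps: Vec<&'static str>, // systems whose output this one consumes
}

fn run_all(nodes: &[SystemNode]) {
    let mut done: HashMap<&str, bool> = nodes.iter().map(|n| (n.name, false)).collect();
    let mut remaining = nodes.len();
    while remaining > 0 {
        for node in nodes {
            if !done[node.name] && node.deps.iter().all(|d| done[*d]) {
                println!("running {}", node.name); // the system body goes here
                *done.get_mut(node.name).unwrap() = true;
                remaining -= 1;
            }
        }
    }
}

fn main() {
    run_all(&[
        SystemNode { name: "input", deps: vec![] },
        SystemNode { name: "physics", deps: vec!["input"] },
        SystemNode { name: "render", deps: vec!["physics"] },
    ]);
}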

It may even be possible to use critical path analysis to optimise the scheduling of systems at runtime based on some simple profiling, but that's not something I have the time to look into right now.

@kvark
Member

kvark commented Mar 29, 2016

@HeroesGrave very interesting, thanks for sharing! Is this something you got working, or are you just toying around? I'd like to peek at the source ;)

Also, @csherratt mentioned that the cost of copy-on-write appears to be high. Cloning the world on each iteration can be expensive.

@ghost

ghost commented Mar 29, 2016

@kvark there is always overhead in doing copy-on-write. The worst part about it is that a bunch of extra code gets generated to handle the aliased cases. But I didn't find it to be significantly higher than a non-COW data-structure.

The nice thing about copy-on-write is that it makes it really easy to get around the borrow checker. It's essentially free for readers to clone the data-structures, so if you want to do an update and need a copy of some of the old data, you just call clone and don't have to worry about who owns the tree. You end up with almost no RefCell<T>, Mutex<T>, or Arc<T>, because the data-structures behave well when shared by readers.

The big disadvantage of COW is that you limit yourself in which data-structures you can select. Trees map to copy-on-write extremely well, so BTrees and radix trees work well with it. HashMaps and VecMaps do not, since they require cloning the entire data-structure (not just the parts changed).

@HeroesGrave

@kvark: It's pretty much working, but I want to clean up the code a little bit more and test some more complex workloads.

Also, due to the lack of HKTs, it's not possible to be generic across different component storage types, which is a bit of a pain. I might throw everything in an enum as a workaround.

@liaud

liaud commented Mar 29, 2016

I would like to give my opinion on this topic, since I have already worked on implementing a thread-safe ECS. So I will give it a try.

In my implementation, entities are only composed of four things:

  • an index, for accessing the components' containers
  • a version, to prevent confusion with a recycled entity index
  • a bitset, to identify which components are attached to the entity
  • a UUID, to synchronize entities across the network

Most of the time the entity index is used for accessing everything; the handle (index + version) is useful only when you need to keep a reference to an entity across updates, for example a Target component that contains the targeted entity. If the entity dies for some reason, the component won't be able to upgrade the handle to a raw index, and thus avoids accessing stale data.
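
A minimal generational-index sketch matching that index + version layout (field and type names are illustrative):

#[derive(Clone, Copy, PartialEq, Eq)]
struct Entity {
    index: u32,   // slot into the component containers
    version: u32, // bumped whenever the slot is recycled
}

struct EntityManager {
    versions: Vec<u32>, // current version of every slot
    free: Vec<u32>,     // recycled indices awaiting reuse
}

impl EntityManager {
    fn create(&mut self) -> Entity {
        match self.free.pop() {
            Some(index) => Entity { index, version: self.versions[index as usize] },
            None => {
                self.versions.push(0);
                Entity { index: (self.versions.len() - 1) as u32, version: 0 }
            }
        }
    }

    fn destroy(&mut self, e: Entity) {
        if self.is_alive(e) {
            self.versions[e.index as usize] += 1; // invalidates stale handles
            self.free.push(e.index);
        }
    }

    /// A stale handle fails this check instead of reaching recycled data.
    fn is_alive(&self, e: Entity) -> bool {
        self.versions[e.index as usize] == e.version
    }
}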

Components are just pure data. They need to implement a Component trait with two associated types: one defines the component's container, the other defines how the component is updated.

In this implementation, I always have two states: the past state and the future one.

The past state is read-only; all processes read this state and compute modifications accordingly. The future state is write-only; it does not need to exist as a separate copy, since we can modify the past state in place at the sync points of the process scheduler.

Every process writes its modifications through a StateWriter that puts the updates in a buffer (an insert, a remove, or the update enum specified by the component trait). There is one StateWriter per worker thread, and we can reuse them across frames to avoid allocations.

The idea is that I can have a job system where a bunch of jobs are kicked to worker threads; each job can spawn other jobs, so processes can create finer-grained units of computation. This allows kicking a lot of processes onto worker threads and letting the work-stealing do the load balancing.

I can also use this job system to batch updates of components. Each job takes care of one component type and applies all the updates buffered for that component.

Finally, the process execution graph will at best be only one level deep: no dependencies, only a sync/update point. If a process depends on other processes' computations, we can define precise sync points between their executions and update the past state before continuing. With COW structures you can even start updating the components as soon as no more systems are modifying them. It is just a matter of adding a job that waits on the execution of other jobs before being kicked.

There is an incomplete draft of the implementation available here; note that it does not necessarily match what is described above.

@OvermindDL1

I've always questioned the point of the bitset that defines which components are set; I've just let the Component Containers handle that themselves. If something wants to know whether an entity has some components, it can just query them. If something wants to know all the entities that have certain components, it can make an Aspect, which gets updated. If something wants all entities and all components, it can make its own Aspect handler that listens for everything and holds that data however it wants.

The reason I hate the bitsets is that I use a LOT of components (many hundreds can even be a low number for a fairly small game), and carrying around that data about what is used and what is not is not really useful.

@liaud

liaud commented Mar 29, 2016

@OvermindDL1 I agree with you on that point. It is only useful when you don't have many entities to iterate over and you don't cache them for future updates. However, if a process wants to filter entities conditionally, it lets you just apply a bit mask to check whether the entity matches the filter, though this case might be rare/nonexistent.

@kvark
Member

kvark commented Mar 29, 2016

@Liotitch
Thank you for describing your approach!

@OvermindDL1
You assume there is an Aspect concept, which solves the problem. When would two different systems use the same aspect, though? I'm thinking that each system would have its own inputs/outputs, and thus its own aspect.

@OvermindDL1

@kvark In every setup I've used so far I have many cases where multiple systems work on the same aspects, systems that have no aspects, and systems that have multiple aspects. It is a good style to follow.

@kvark
Member

kvark commented Mar 29, 2016

@OvermindDL1 could you drop in some examples?

@OvermindDL1

@kvark Sure, from a small project:

*) Multiple systems with the same Aspects:
SoundComponent -> SoundManager (Play sounds), SoundBroadcaster (Broadcasts sounds to an AI system based on nearby things in an octree or so)

*) Systems with no Aspects
Anything that needs to process per tick that is not entities, like working through the network packet queue to distribute incoming packets where they belong

*) Systems with multiple different kinds of Aspects
SimpleCollision -> Has one Aspect list of entities with CollisionBodies and another Aspect with a list of entities that have CollisionAreas (many things often have both)

@kvark
Member

kvark commented Apr 6, 2016

Heads up everyone - I was able to implement an ECS design that fulfils my requirements:
https://github.com/kvark/parsec

Thread-safe parallel processing is there, with not even a char of unsafe code. Still got to figure out the entity/component recycling though, so will follow the thread.

@OvermindDL1

@kvark As I often like to do, I ran this ECS through a few of the stress tests that my own system handles with ease, to see if I have found a replacement so I can stop dealing with my own. So it is time for parsec. ^.^

I used your example.rs as the base and created 1 million entities (not unheard of in some of my mini games; I often have ten thousand active entities with bursts of up to a million that can be rapidly created and destroyed, hence why I had to learn how to optimize an ECS), with 99% having a parsec::VecStorage component and 2% having a parsec::HashMapStorage component, with 1% overlap. First issue found: these two lines take a substantial amount of time (almost a full second):

    scheduler.run0w1r(|b: &CompBool| { // The rare component
        println!("Entity {}", b.0);
    });
    scheduler.wait();

The equivalent Aspect in my system takes <1ms (even in a debug run). As I am still learning Rust, I wondered if it was caused by println buffering output, so I removed that line (making the system a complete no-op, still running in debug mode; I was worried it might be optimized out regardless, as it often would be in C++) and got the same results, so output buffering was not the issue.

At this point I delved into the code, and it looked like it was iterating over every single entity for every single system callback even when not wanted (the callback is just skipped, but the entity is still iterated over), which to me does not seem scalable in any sense. What is the plan to work around that?

@kvark
Member

kvark commented Apr 11, 2016

@OvermindDL1 thank you for having a look at parsec!

You might be interested in the existing project called ecs_bench (by @lschmierer) that aims to profile different ECS implementations in Rust. Its first test case is very similar to yours, where only a fraction of the entities actually satisfies the system's requirements for running.

What you found is a known issue - the runXwYr variants always check for all components, and don't attempt an early exit. We could check for the components to exist one by one, but we'd have to settle for the user-given order, and it has writables first, so it's limited. We'll do this anyway; there are no downsides to early-exiting at least like that.

One way to address this, if you know that some of the components are rarer than others, is to use the custom run function and query those rare components first, exiting early or otherwise proceeding.
Edit: just realized your test case only uses a single component, so this optimization wouldn't help you.

Another, more radical, way is to use tables by @csherratt. He's already got a prototype implemented on top of parsec's storage interface, and it performed 6 times faster on ecs_bench's test case. AFAIK, it doesn't require any extensions to parsec itself, and works somewhat automatically (@csherratt please correct me if that's wrong), which makes it a much nicer alternative (in my eyes) to rolling any sort of Aspect support. I like to keep things simple and separate complexity into layers, and that's what parsec succeeds at so far.

Please feel free to hop on our gitter and/or create issues in our project (or contribute!), to avoid off-topic in this thread.

@OvermindDL1

@kvark Fascinating links and information, thanks.

Yeah, my basic test involves a 2D screen with a set of tiny floating squares (a movement system floats them around, bouncing off screen edges); when two intersect, the collision system destroys both and emits (via an event) a shower of new entities for each (basically particles), and the spawn system respawns the squares shortly afterwards, within about 10ms. I generally have 1000 squares, or 10k for a stress test; each collision creates about a hundred more entities that zoom off until they fade out and are removed, and when 10k exist the collisions are... consistently constant, to put it mildly. I can only achieve real-time speeds in my system thanks to my aspect cache (and shader trickery on the rendering side). I should probably get a gitter account... done. ^.^

@kvark
Member

kvark commented Apr 11, 2016

@OvermindDL1 that sounds like a very good test case for an ECS. I'm currently porting yasteroids, and it's going to take some time, but I'd love to try out your scenario with parsec to see where we stand.

@kvark
Member

kvark commented May 4, 2016

Closed by #44

@JeanMertz

Just wanted to note that this was a super fascinating discussion. Lots of information here that's useful for someone figuring out what patterns there are, and how to handle them. Thank you all for spending your time here ❤️

CBenoit pushed a commit to CBenoit/amethyst that referenced this issue Feb 19, 2019
Refactor Transform in amethyst_core (and reformat amethyst_assets)
bors bot pushed a commit that referenced this issue Oct 30, 2019
Update base64 requirement from 0.9 to 0.11