Architecture/Threading idea #56
In general I think this is a great idea. It looks like a better, higher-level approach than the multithreaded version I had coded up, which performed the multithreading within a given operator. Have you done any benchmarking yet? Based on what I observed in my multithreaded version, my main concern is that, in some situations, the added overhead of creating/joining threads can eat up any performance gains. I think the higher-level implementation avoids this problem for the most part, but it would be good to benchmark it against the current version of the code to make sure it isn't going to make simple models (e.g., the MNIST example) less efficient. Maybe a single-threaded comparison to evaluate the overhead, and then look at a 2- or 4-threaded run to see how it scales.
I haven't thought much about supporting multithreading, but before going into technicalities I would like to discuss some other things. I think we should be clear about what we want to achieve and why, and then the "how" will come later. Some ideas off the top of my head:
Regarding:

```diff
 struct node_context {
     Onnx__NodeProto *onnx_node;
     Onnx__TensorProto **inputs;
     Onnx__TensorProto **outputs;
+    operator_executer executer;
+    size_t n_jobs;
+    void **jobs;
 };
```
So you were parallelising some operators? Just curious, which ones have you tried?
While I think that this is true for the average user, I think it's good to keep in mind that a lot of science software is written in older standards, and performance can be a concern in those situations. In my case, we're working with some modern software that is written in C99 for speed and for compatibility with some older modules, and we'd like to squeeze out any performance gains that are possible. So, there is definitely an interest in a multithreaded version of cONNXr. But, I think it is only worth implementing it into the master branch if it doesn't negatively impact the performance of single-threaded uses.
I parallelized all operators that were implemented at that time (early June) just to see if it was reasonable. I found that most operators didn't benefit from parallelization (the overhead of creating/joining threads ate up the performance gain of using N threads) except when there were large numbers of that operation. From what I recall, the exception to that was the add and matmul operators, which provided some performance gains even for small models.
This is a great point, and it should probably be figured out first before going down the rabbit hole. Based on my use case, my initial thought is that you could just parallelize over the zero-th axis. These would be separate, independent cases, so presumably the inference could be carried out in parallel without issue. E.g., if there is a batch of 256 cases to predict on, you could spawn 256 jobs and assign them to N workers. For all of my models, this holds true -- they all take inputs shaped as (num_cases, num_inputs) -- but maybe there are some other models that this is not true for. Thoughts?
@mdhimes Great input
This is something that I never thought of but shouldn't be very difficult to implement. Actually it should be possible to implement this without even touching cONNXr. Beware of #45. Currently there is one global context (that stores inputs and outputs) so if multiple threads are spawned it won't work. Once this issue is fixed we could build something on top to allow this. Might be a nice starting point to play with multithreading.
Yeah, the single global context threw a wrench in my early multithreading plans :) But, assuming that the internal arrays maintain the expected dimensionality, I think it should still be possible to parallelize over the zero-th axis even with the global context. Everything is presumably accessed through pointers to pre-allocated arrays, and since each case along the zero-th axis is independent, there's no risk of a worker trying to access a value that hasn't been filled in yet. As long as all of the workers are joined before trying to access the outputs, it should be fine, I think. Though, I think this would require a reworking of how the operators perform.
There is a lot of potential parallelism in the execution of any onnx model:
I wanted to tackle the batch and operator parallelism.
Afaik we do not need to change a thing with this approach, since all jobs of a node need to be independent and therefore can run in any order. These jobs work with already allocated memory and do nothing but fill in the numbers.
I would try to keep it simple, stupid.
@alrevuelta
This is why I'm in favor of parallelizing over the batch. If using pthread.h, the create/join calls will handle all of that. If you also parallelize operators, unless there is a more efficient way than I implemented, a lot of the time you won't get much benefit. Add, matmul, and conv are the few that I think would benefit from parallelization in most use cases. I can make a branch on my fork if you'd like to take a look at how I handled that -- it isn't the best coding style per se, but it was quick enough to implement for my purpose of testing.
I would not create a thread for each job, but have a job ringbuffer from which persistent workers grab their next job. If you create a thread for each job, you work against the fork/clone overhead.
As far as I see it, if we abstract the parallelism this way and each operator decides how to be parallelized, it's always at least the batch parallelism and in some cases even more, but the underlying structure can accommodate all kinds of parallelism an operator may offer. |
Okay, I see. So if we leave the model parallelism aside for now, these are my concerns:
What I mean here is that if we have multiple threads running inference at the same time, the struct |
In my view, batch parallelism is a specific kind of operator parallelism.
No corruption can happen afaik, because each job is (defined as) completely independent and therefore only writes to its own assigned location, not interfering with anything else. This is what I meant by "filling in the numbers".
regarding the necessary void pointer I mentioned:
Yes, mostly. I modified the existing function to use pthread.h to create/join the different threads. I defined a new function for those threads to carry out (this is basically the original lines of code that implement the operator). And, in order to tell the threads where to get the inputs from and where to store the outputs, I defined a new struct that held pointers to those locations and other relevant info (e.g., indices to carry out the operation).
Is there a reason that the full set of jobs can't just be split among the threads when they are created? Is it more efficient to use a ringbuffer? With a batch size of M and N requested threads, the i-th thread would handle the half-open range of elements [i*ceil(M/N), min((i+1)*ceil(M/N), M)). This is the approach I took when parallelizing the operators; it ensures that there are exactly as many jobs as there are threads, which avoids the overhead of creating additional jobs and having workers retrieve said jobs. Though, maybe that overhead is negligible.
This is a great point, I hadn't considered that. I like it. I think the biggest concern is performance impact, which I suppose we won't really know until it's implemented and tested. Worst-case scenario, if it negatively impacts simple cases, then it could exist as a branch for people that need multithreading.
this would make jobs and threads dependent on each other.
static ringbuffers are just neat for job dispatching. you have a mutex-protected ringbuffer in which you insert your jobs and a fixed number of persistent threads which remove the jobs. reading an empty ringbuffer blocks, writing to a full ringbuffer blocks, an insertion unblocks readers, a removal unblocks writers.
the impact should be minimal, if not non-existent.
I see, thanks for the info. I'm not familiar with ringbuffers so I wasn't sure what impact that might have. If the impact is as minimal as you say, then it sounds like a great approach to me. I'm not sure how much help I could be in coding it up (I'd have to go learn more about ringbuffers), but I'm happy to do some benchmarking once the method is implemented; I have a pretty wide range of hardware (~decade old server, ~7 year old consumer-grade workstation, ~4 year old i7 laptop, modern AMD EPYC server) to test it out.
nice to know, I could throw in some modern dual-socket xeon monster (2x6x2 threads) optimized for IO. |
I might be missing something here, but I see batch parallelism as more straightforward. We have the following function: `inference(Onnx__ModelProto *model, Onnx__TensorProto **inputs, int nInputs)`. Can't we just spawn different threads that call this function with different inputs?
@mdhimes @nopeslide Perhaps a |
let's wait until the PR is ready, then we see the code/performance impact and can decide how to proceed. |
we could, but this would either need special input (no simple onnx tensor), or preprocessing of the onnx tensor input (generating multiple tensors out of one).
Got the following proposal regarding threading:

- each operator splits its work into execution contexts ("jobs") attached to its `node_context` (same principle as the executer chosen by the resolver)
- `node_context` will be extended by a counter `size_t n_jobs` and a pointer to a list of pointers of these execution contexts, `void **jobs`
with this we can decouple the actual operator algorithms as much as possible from onnx, without losing anything.
- extension of the `node_context`
- simple single-threaded execution of all jobs an operator has provided
an execution context/job could look like this (e.g., for the add operator):
how the split into jobs is done is completely up to the operator. we may specify a desired level of parallelism globally, which the operator may try to achieve, but more jobs than threads shouldn't be a problem.
additionally we simplify customization, since the worker does not need any knowledge of the onnx structure.
the resulting flow would look like this:
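Roughly, based on the points above, the flow could be sketched like this:

```text
resolver picks the operator executer for each node
operator fills its node_context: n_jobs + jobs[]
jobs are dispatched to workers (or simply run in order, single-threaded)
each job fills in its own disjoint part of the pre-allocated outputs
all jobs of a node done -> node outputs complete -> next node
```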
@alrevuelta @mdhimes your opinions?