Architecture/Threading idea #56
In general I think this is a great idea. It looks like a better, higher-level approach than the multithreaded version I had coded up, which performed the multithreading within a given operator. Have you done any benchmarking yet? Based on what I observed in my multithreaded version, my main concern is that, in some situations, the added overhead of creating/joining threads can eat up any performance gains. I think the higher-level implementation avoids this problem for the most part, but it would be good to benchmark it against the current version of the code to make sure it isn't going to make simple models (e.g., the MNIST example) less efficient. Maybe a single-threaded comparison to evaluate the overhead, and then look at a 2- or 4-threaded run to see how it scales.
I haven't thought much about supporting multithreading, but before going into technicalities I would like to discuss some other things. I think we should be clear about what we want to achieve and why, and then the "how" will come later. Some ideas off the top of my head:
Regarding:

```diff
 struct node_context {
     Onnx__NodeProto *onnx_node;
     Onnx__TensorProto **inputs;
     Onnx__TensorProto **outputs;
+    operator_executer executer;
+    size_t n_jobs;
+    void **jobs;
 };
```
So you were parallelising some operators? Just curious, which ones have you tried?
While I think that this is true for the average user, I think it's good to keep in mind that a lot of science software is written in older standards, and performance can be a concern in those situations. In my case, we're working with some modern software that is written in C99 for speed and for compatibility with some older modules, and we'd like to squeeze out any performance gains that are possible. So, there is definitely an interest in a multithreaded version of cONNXr. But, I think it is only worth implementing it into the master branch if it doesn't negatively impact the performance of single-threaded uses.
I parallelized all operators that were implemented at that time (early June) just to see if it was reasonable. I found that most operators didn't benefit from parallelization (the overhead of creating/joining threads ate up the performance gain of using N threads) except when there were large numbers of that operation. From what I recall, the exception to that was the add and matmul operators, which provided some performance gains even for small models.
This is a great point, and it should probably be figured out first before going down the rabbit hole. Based on my use case, my initial thought is that you could just parallelize over the zero-th axis. These would be separate, independent cases, so presumably the inference could be carried out in parallel without issue. E.g., if there is a batch of 256 cases to predict on, you could spawn 256 jobs and assign them to N workers. For all of my models, this holds true -- they all take inputs shaped as (num_cases, num_inputs) -- but maybe there are some other models that this is not true for. Thoughts?
@mdhimes Great input
This is something that I never thought of but shouldn't be very difficult to implement. Actually it should be possible to implement this without even touching cONNXr. Beware of #45. Currently there is one global context (that stores inputs and outputs) so if multiple threads are spawned it won't work. Once this issue is fixed we could build something on top to allow this. Might be a nice starting point to play with multithreading.
Yeah, the single global context threw a wrench in my early multithreading plans :) But, assuming that the internal arrays maintain the expected dimensionality, I think it should still be possible to parallelize over the zero-th axis even with the global context. Everything is presumably accessed through pointers to pre-allocated arrays, and since each case along the zero-th axis is independent, there's no risk of a worker trying to access a value that hasn't been filled in yet. As long as all of the workers are joined before trying to access the outputs, it should be fine, I think. Though, I think this would require a reworking of how the operators perform.
There is a lot of potential parallelism in the execution of any onnx model:
I wanted to tackle the batch and operator parallelism.
Afaik we do not need to change a thing with this approach, since all jobs of a node need to be independent and therefore can run in any order. These jobs work with already allocated memory and do nothing but fill in the numbers.
I would try to keep it simple, stupid.
@alrevuelta
This is why I'm in favor of parallelizing over the batch. If using pthread.h, the create/join calls will handle all of that. If you also parallelize operators, unless there is a more efficient way than I implemented, a lot of the time you won't get much benefit. Add, matmul, and conv are the few that I think would benefit from parallelization in most use cases. I can make a branch on my fork if you'd like to take a look at how I handled that -- it isn't the best coding style per se, but it was quick enough to implement for my purpose of testing.
I would not create a thread for each job, but have a job ringbuffer from which persistent workers grab their next job. If you create a thread for each job, you work against the fork/clone overhead.
As far as I see it, if we abstract the parallelism this way and each operator decides how to be parallelized, it's always at least the batch parallelism and in some cases even more, but the underlying structure can accommodate all kinds of parallelism an operator may offer. |
Okay, I see. So if we leave the model parallelism aside for now, these are my concerns:
What I mean here is that if we have multiple threads running inference at the same time, the struct |
In my view, batch parallelism is a specific kind of operator parallelism.
No corruption can happen afaik, because each job is (defined as) completely independent and therefore only writes to its own assigned location, not interfering with anything else. This is what I meant by "filling in the numbers".
regarding the necessary void pointer I mentioned:
Yes, mostly. I modified the existing function to use pthread.h to create/join the different threads. I defined a new function for those threads to carry out (this is basically the original lines of code that implement the operator). And, in order to tell the threads where to get the inputs from and where to store the outputs, I defined a new struct that held pointers to those locations and other relevant info (e.g., indices to carry out the operation).
Is there a reason that the full set of jobs can't just be split among the threads when they are created? Is it more efficient to use a ringbuffer? With a batch size of M and N requested threads, the i-th thread would handle the half-open range of elements [i*ceil(M/N), min((i+1)*ceil(M/N), M)). This is the approach I took when parallelizing the operators; it ensures that there are exactly as many jobs as there are threads, which avoids the overhead of creating additional jobs and having workers retrieve said jobs. Though, maybe that overhead is negligible.
This is a great point, I hadn't considered that. I like it. I think the biggest concern is performance impact, which I suppose we won't really know until it's implemented and tested. Worst-case scenario, if it negatively impacts simple cases, then it could exist as a branch for people that need multithreading.
this would make jobs and threads dependent on each other.
static ringbuffers are just neat for job dispatching. you have a mutex-protected ringbuffer in which you insert your jobs and a fixed number of persistent threads which remove the jobs. reading an empty ringbuffer blocks, writing to a full ringbuffer blocks, an insertion unblocks readers, a removal unblocks writers.
the impact should be minimal, if not non-existent.
I see, thanks for the info. I'm not familiar with ringbuffers so I wasn't sure what impact that might have. If the impact is as minimal as you say, then it sounds like a great approach to me. I'm not sure how much help I could be in coding it up (I'd have to go learn more about ringbuffers), but I'm happy to do some benchmarking once the method is implemented; I have a pretty wide range of hardware (~decade old server, ~7 year old consumer-grade workstation, ~4 year old i7 laptop, modern AMD EPYC server) to test it out.
nice to know, I could throw in some modern dual-socket xeon monster (2x6x2 threads) optimized for IO. |
I might be missing something here, but I see batch parallelism as more straightforward. We have the following function: `inference(Onnx__ModelProto *model, Onnx__TensorProto **inputs, int nInputs)`. Can't we just spawn different threads that call this function with different inputs?
@mdhimes @nopeslide Perhaps a |
let's wait until the PR is ready, then we see the code/performance impact and can decide how to proceed. |
we could, but this would either need special input (no simple onnx tensor), or preprocessing of the onnx tensor input (generating multiple tensors out of one).
Got the following proposal regarding threading:

- each operator splits its work into execution contexts ("jobs") attached to its `node_context` (same principle as the executer chosen by the resolver)
- `node_context` will be extended by a counter `size_t n_jobs` and a pointer to a list of pointers of these execution contexts, `void **jobs`
with this we can decouple the actual operator algorithms as much as possible from onnx, without losing anything.
- extension of the `node_context`
- simple single-threaded execution of all jobs an operator has provided
an execution context/job could look like this (e.g., for the add operator):
how the split into jobs is done is completely up to the operator. we may specify a desired level of parallelism globally, which the operator may try to achieve, but more jobs than threads shouldn't be a problem.
additionally we simplify customization, since the worker does not need any knowledge of the onnx structure.
the resulting flow would look like this:
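Roughly, based on the points above, the flow could be sketched like this:

```text
resolver picks the operator executer for each node
operator fills its node_context: n_jobs + jobs[]
jobs are dispatched to workers (or simply run in order, single-threaded)
each job fills in its own disjoint part of the pre-allocated outputs
all jobs of a node done -> node outputs complete -> next node
```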
@alrevuelta @mdhimes your opinions?