Hi folks, I wanted to get your expert opinion on the following. We have an all-to-all computing the min of a single scalar real value among many chares, intended to run at large scales. This is our single synchronization point within a time step. I wonder whether replacing the single all-to-all with a reduction + broadcast targeting each chare individually might allow for more overlap. I believe a single all-to-all is implemented as a redn+bcast to/from a single chare, and the complexity of what I'm suggesting is probably worse, but it seemed worth asking. In code, with `DG` being a chare array, I'm suggesting replacing

```cpp
contribute( sizeof(double), &mindt, CkReduction::min_double,
            CkCallback(CkReductionTarget(DG,solve), thisProxy) );
```

with

```cpp
// pseudocode: one reduction per array element
for all DG chares i
  contribute( sizeof(double), &mindt, CkReduction::min_double,
              CkCallback(CkReductionTarget(DG,solve), thisProxy[i]) );
end
```

Would this allow for more overlap by removing the global sync, or would I be throwing the baby out with the bathwater, because I am replacing the log(n) algorithmic/parallel complexity with n due to the for loop? Thanks,
[Copied from email to the charm mailing list]

I would be shocked if individual redn+p2p would ever outperform a single redn+broadcast to all chares in the array. A couple of points:

- With over-decomposition, it's even worse than going from log(n) to n in terms of messages. For the broadcast it's log(p), where p is the number of PEs; for the individual scheme it's n, where n is the number of array elements. On top of that, it is also n reductions as compared to one reduction.
- I don't think there is going to be much, if any, benefit from overlap. You are serializing the sends, so one PE is initiating n reductions, which could be a very large number as you scale up. That already adds one extra bottleneck.
- There is already overlap present in the single redn+broadcast scheme. Once the reduction completes, the broadcast is initiated, but broadcasts in Charm++ are not synchronized, so the array elements will be able to act upon the result as soon as it reaches them.
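To make the message-count argument above concrete, here is a crude back-of-the-envelope model. It is a sketch under stated assumptions, not a description of Charm++'s actual implementation: binary spanning trees over the PEs, per-PE partial combining before the tree reduction, and the hypothetical example sizes `p = 1024` PEs with 16x over-decomposition.

```python
import math

def single_redn_bcast(p):
    """One tree reduction over p PEs plus one tree broadcast over p PEs.

    Assumes a binary spanning tree: ~(p - 1) messages up for the
    reduction and ~(p - 1) down for the broadcast, with a
    critical-path depth of O(log p) each way.
    """
    messages = 2 * (p - 1)
    depth = 2 * math.ceil(math.log2(p))
    return messages, depth

def per_element_redn_p2p(n, p):
    """n separate reductions, each delivered point-to-point to one chare.

    Each of the n reductions still spans all p PEs (~(p - 1) tree
    messages) plus one p2p delivery, and the n contribute calls are
    issued serially, so the serialized initiations dominate the depth.
    """
    messages = n * (p - 1) + n
    depth = n
    return messages, depth

p = 1024           # hypothetical PE count
n = 16 * p         # hypothetical 16x over-decomposition
m1, d1 = single_redn_bcast(p)
m2, d2 = per_element_redn_p2p(n, p)
print(f"single redn+bcast : {m1} messages, critical-path depth {d1}")
print(f"per-element scheme: {m2} messages, critical-path depth {d2}")
```

Even under this generous model (ignoring the cost of running n reduction trees concurrently), the per-element scheme sends thousands of times more messages and has a serialization depth of n rather than log p, which is the bottleneck described above.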