Hi folks, I wanted to get your expert opinion on the following. We have an all-to-all computing the min of a single scalar real value among many chares, intended to run at large scales. This is our single synchronization point within a time step. I wonder whether replacing the single all-to-all with a reduction + broadcast targeting each chare individually might allow for more overlap. I believe a single all-to-all is implemented as a redn+bcast to/from a single chare, and the complexity of what I'm suggesting is probably worse, but it seemed worth asking. In code, with `DG` being a chare array, I'm suggesting replacing

```cpp
contribute( sizeof(double), &mindt, CkReduction::min_double,
            CkCallback(CkReductionTarget(DG,solve), thisProxy) );
```

with

```cpp
// pseudocode: one reduction per array element
for all DG chares i
  contribute( sizeof(double), &mindt, CkReduction::min_double,
              CkCallback(CkReductionTarget(DG,solve), thisProxy[i]) );
end
```

Would this allow for more overlap by removing the global sync, or would I be throwing the baby out with the bathwater, because I am replacing the log(n) algorithmic/parallel complexity with n due to the for loop? Thanks,
[Copied from email to the charm mailing list]

I would be shocked if individual redn+p2p would ever outperform a single redn+broadcast to all chares in the array. A couple of points:

- With over-decomposition, it's even worse than going from log(n) to n in terms of messages. For the broadcast it's log(p), where p is the number of PEs; for the individual scheme it's n, where n is the number of array elements. On top of that, it is also n reductions as compared to one reduction.
- I don't think there is going to be much, if any, benefit from overlap. You are serializing the sends, so one PE is initiating n reductions, which could be a very large number as you scale up. That already adds one extra bottleneck.
- There is already overlap present in the single redn+broadcast scheme. Once the reduction completes, the broadcast is initiated, but broadcasts in Charm++ are not synchronized, so the array elements will be able to act upon the result as soon as it reaches them.
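To make the message-count argument above concrete, here is a crude back-of-the-envelope model. It is a sketch under stated assumptions, not a description of Charm++'s actual implementation: binary spanning trees over the PEs, per-PE partial combining before the tree reduction, and the hypothetical example sizes `p = 1024` PEs with 16x over-decomposition.

```python
import math

def single_redn_bcast(p):
    """One tree reduction over p PEs plus one tree broadcast over p PEs.

    Assumes a binary spanning tree: ~(p - 1) messages up for the
    reduction and ~(p - 1) down for the broadcast, with a
    critical-path depth of O(log p) each way.
    """
    messages = 2 * (p - 1)
    depth = 2 * math.ceil(math.log2(p))
    return messages, depth

def per_element_redn_p2p(n, p):
    """n separate reductions, each delivered point-to-point to one chare.

    Each of the n reductions still spans all p PEs (~(p - 1) tree
    messages) plus one p2p delivery, and the n contribute calls are
    issued serially, so the serialized initiations dominate the depth.
    """
    messages = n * (p - 1) + n
    depth = n
    return messages, depth

p = 1024           # hypothetical PE count
n = 16 * p         # hypothetical 16x over-decomposition
m1, d1 = single_redn_bcast(p)
m2, d2 = per_element_redn_p2p(n, p)
print(f"single redn+bcast : {m1} messages, critical-path depth {d1}")
print(f"per-element scheme: {m2} messages, critical-path depth {d2}")
```

Even under this generous model (ignoring the cost of running n reduction trees concurrently), the per-element scheme sends thousands of times more messages and has a serialization depth of n rather than log p, which is the bottleneck described above.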