
Adding allreduce for ndarray #234

Closed
Hakuyume opened this issue Apr 13, 2018 · 10 comments

Hakuyume (Member) commented Apr 13, 2018

I want a CommunicatorBase.allreduce method that can work with numpy/cupy.ndarray.
Although this can be achieved by using comm.mpi_comm.A(a)llreduce, that has two problems.

  1. Users have to take care of the MPI module.
    comm.mpi_comm.A(a)llreduce requires the MPI module to be imported, e.g. for MPI.IN_PLACE or MPI.SUM.
    However, we need to import MPI lazily (I don't know the reason, but ChainerMN's examples do so).
    With a wrapper function, handling the MPI module can be hidden.
  2. NCCL is not used.
    For cupy.ndarray, there are two backends: MPI and NCCL.
    However, going through comm.mpi_comm cannot make use of NCCL.
    It would be better to select the backend automatically.

For example,
https://github.com/chainer/chainermn/blob/master/chainermn/functions/batch_normalization.py#L121-L125 could be replaced with comm.allreduce(tmp), and we could remove self.memory_utility_module and self.mpi_module.
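
A minimal sketch of the contrast, assuming mpi4py and a ChainerMN-style communicator whose mpi_comm attribute wraps a raw MPI communicator; the wrapper call at the end is the API proposed in this issue, not existing code:

```python
import numpy as np
from mpi4py import MPI  # user code must import MPI just for IN_PLACE / SUM

def manual_sum_allreduce(comm, array):
    # What callers have to write today: an in-place sum-allreduce through
    # the raw MPI communicator wrapped by the ChainerMN communicator.
    comm.mpi_comm.Allreduce(MPI.IN_PLACE, array, op=MPI.SUM)
    return array

# With the proposed wrapper, the MPI constants and the MPI-vs-NCCL backend
# choice would be hidden inside the communicator (signature hypothetical):
# tmp = comm.allreduce(tmp)
```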

Hakuyume (Member, Author) commented Apr 13, 2018

And this is my use case (similar to that of batch_normalization):
https://github.com/Hakuyume/chainercv/blob/ssd-multi-gpu-train/chainercv/links/model/ssd/multibox_loss.py#L91-L95

I had to handle the MPI module manually.

kuenishi added this to the v1.3.0 milestone Apr 17, 2018
kuenishi (Member) commented May 8, 2018

@Hakuyume As #237 has been merged, this issue will be addressed in the 1.3.0 release. So far only the communicator will support allreduce, but do you think you need functions like AllGather?

To see the real use case: the URL of your code ( https://github.com/Hakuyume/chainercv/blob/ssd-multi-gpu-train/chainercv/links/model/ssd/multibox_loss.py#L91-L95 ) now returns 404, so could you provide the latest allreduce code if you still have it somewhere open?

Hakuyume (Member, Author) commented May 8, 2018

@kuenishi Thank you for supporting allreduce!

So far only the communicator will support allreduce, but do you think you need functions like AllGather?

For my use case, function classes are not required.

To see the real use case: the URL of your code ( https://github.com/Hakuyume/chainercv/blob/ssd-multi-gpu-train/chainercv/links/model/ssd/multibox_loss.py#L91-L95 ) now returns 404, so could you provide the latest allreduce code if you still have it somewhere open?

I'm sorry for removing the branch. Here is the latest code.
https://github.com/chainer/chainercv/blob/master/chainercv/links/model/ssd/multibox_loss.py#L91-L95

kuenishi (Member) commented May 8, 2018

Glad to hear that. For now I'll close this issue to clean up the milestone list, but feel free to reopen it if you find you need more.

kuenishi closed this as completed May 8, 2018
Hakuyume (Member, Author) commented May 8, 2018

@kuenishi Does #237 support reducing cupy.ndarray? According to this line, it takes only numpy.ndarray.

kuenishi (Member) commented May 8, 2018

Yes. It is just not documented, but _memory_utility.array_to_buffer_object() automatically handles both cupy and numpy array objects. It is tested with the hierarchical and naive communicators but should work with the other communicators. With the pure_nccl communicator it needs stream synchronization beforehand, IIRC.
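
As a rough illustration of that last point, a sketch assuming cupy and a communicator whose allreduce accepts a cupy.ndarray (not taken from the actual implementation):

```python
import cupy

def allreduce_gpu(comm, array):
    # array: cupy.ndarray on the current device.
    # With the pure_nccl communicator, make sure any pending kernels that
    # write to `array` have finished before the collective reads its buffer.
    cupy.cuda.Stream.null.synchronize()
    return comm.allreduce(array)
```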

Hakuyume (Member, Author) commented May 8, 2018

Thank you.

kuenishi (Member) commented

Actually, it wasn't implemented for cupy.ndarray; #293 should fix it.

ankahira commented

The allreduce function doesn't seem to use NCCL, or am I missing something?

keisukefukuda (Member) commented

Hi @ankahira,
First, this repository is an old one; ChainerMN was merged into Chainer.
Also, Chainer is no longer maintained.

That said, ChainerMN's PureNcclCommunicator does use NCCL: https://github.com/chainer/chainer/blob/master/chainermn/communicators/pure_nccl_communicator.py#L180
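
For reference, the NCCL path is chosen when the communicator is created; a minimal usage sketch (the 'pure_nccl' name follows the ChainerMN documentation, the rest is illustrative):

```python
import chainermn

# Collectives issued through this communicator, including allreduce on
# cupy.ndarray, go through NCCL rather than plain MPI.
comm = chainermn.create_communicator('pure_nccl')
```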
