
Preliminary MPI support for GeNN #158

Merged
merged 77 commits into master on Feb 28, 2018

Conversation

brad-mengchi
Contributor

@brad-mengchi brad-mengchi commented Sep 4, 2017

This now works pretty nicely as a very minimal MPI implementation for GeNN:

  • Linux only
  • Only supports spike-based communication over MPI links
  • Doesn't use multicast MPI (I believe there is a multicast API which could be helpful for larger simulations)

Essentially what's changed is:

  1. NNmodel now has separate maps containing local and remote neuron groups, representing those simulated on the local machine and those simulated on other nodes on the network (which map a neuron group goes into is determined by the host id you pass to NNmodel::addNeuronGroup)
  2. NNmodel also has separate maps of local and remote synapse groups (which map a synapse group goes into is automatically determined by the host id of its target neuron group)
  3. There is a new NeuronGroup::hasOutputToHost method which tests whether a neuron group has any outputs to a given host ID (MPI rank) - this is used to determine which remote neuron groups need to be synchronised every time step
  4. Basic data structures to hold incoming spikes, and GPU push methods, are generated for remote neuron groups which need synchronising.
  5. New mpi.cc and mpi.h files are generated for models built with the genn-buildmodels.sh -m flag. These contain methods to transmit a local neuron group's current spikes to a specific host over MPI and to receive a remote neuron group's current spikes from a specific host. Typically these are called automatically from the synchroniseMPI function, which sends and receives all required spikes each timestep.
  6. Macros are generated for each remote neuron group so user code can test whether a population is remote with #ifndef POP_NAME_REMOTE and, e.g., not try to generate afferent connectivity for it.

1 and 2 resulted in a fairly large number of search-and-replace changes, but the nice thing is that 99% of the code works just as it did before, only on the local neuron and synapse groups.
An example model using this is here: https://github.com/neworderofjamie/genn_examples/tree/master/va_benchmark_mpi - it can run on a local MPI install or using the SGE system on our cluster.

NOTE: I am going to merge this branch manually, as I don't think the changes to the userprojects are useful

…the postsynaptic neurons

Signed-off-by: Mengchi Zhang <zhan2308@purdue.edu>
* ``NNmodel::isDeviceInitRequired`` checks for remote neuron groups which have outputs to the local machine and have spike variables that should be initialised on the device
* ``genInitializeDeviceKernel`` now also initialises remote neuron group spike variables
* Previously ``StandardGeneratedSections::neuronOutputInit`` was only being used in a subset of the locations where this code runs
* ``StandardGeneratedSections::neuronOutputInit`` should only advance device spike queues
* For consistency, all host spike queues are now advanced in stepTimeCPU/stepTimeGPU
* Pull functions shouldn't be generated for remote populations which don't output to the local host
* Tidied up some auto-generated comments
@neworderofjamie neworderofjamie dismissed their stale review January 19, 2018 18:30

I've now made these changes myself!

# Conflicts:
#	lib/GNUmakefile
#	lib/include/modelSpec.h
#	lib/src/generateRunner.cc
@tnowotny
Member

tnowotny commented Jan 29, 2018

I have had a look at the example you made and had a bit of a poke around.
I know I am coming late to this (as usual), but I was surprised to see the MIMD approach (a separate executable for each MPI host). I always thought MPI was meant to be more SIMD than that. On the practical side, the disadvantage of the current solution is that it's not very scalable. If one could loop over populations E1 to E100 and, rather than a macro testing for locality, use a simple if branch, then I could see how this scales to big machines. I.e.

if (mpi_local("E1")) {
    pull ...;
}

would also allow something like

for (int i = 0; i < 100; i++) {
    std::string pop = std::string("E") + std::to_string(i);
    if (mpi_local(pop)) {
        pull...;
    }
}

With the macros this would be difficult ... but maybe it's too late now to rethink the entire design?
Maybe we can also discuss offline.

@tnowotny
Member

Argh... maybe this is rubbish. The pull commands etc. are also named at compile time, so what I was thinking wouldn't work anyway... would it?

@neworderofjamie
Contributor

neworderofjamie commented Jan 29, 2018

I think those issues are kind of two sides of the same coin. Because the structure of the network exists largely as generated code in GeNN - rather than being built in memory at runtime, as in NEST - I think it makes sense that the MPI code is more MIMD than would be typical.

However, I think the problem of not being able to loop through populations is more general than just MPI - the simulation code for the Potjans-Diesmann microcircuit (https://github.com/neworderofjamie/genn_examples/blob/master/potsjan_microcircuit/simulator.cc#L51-L114) is heading towards macro hell, and that only has 8 populations. As you say, there's not much choice as everything is compile-time. BUT I actually think the way the SpineML simulator works solves this quite neatly:

  1. Building the generated code into a dynamic library
  2. Loading it at runtime (https://github.com/genn-team/genn/blob/development/spineml/simulator/main.cc#L621-L629)
  3. Looking up functions/variables within that at runtime (https://github.com/genn-team/genn/blob/development/spineml/simulator/main.cc#L642)

This may be a good future direction for building the simulation code for larger GeNN models: the actual simulation executable would then be the same across all nodes, and it would just load a different dynamic library of generated code.

@tnowotny
Member

This looks like an interesting solution. I remember back then there was a deliberate decision to make it all compile-time and have explicitly named functions for each population etc., so that users wouldn't have to index anything and could call everything by name...

What is your gut feeling - is it worth just merging this solution (with the macros) now even if we may later do something more like the spineML2GeNN design?

@neworderofjamie
Contributor

Well, all the options are there to build models using the SpineML approach (there's a GENN_PREFERENCE to build a dynamic library), so you would be able to build models like this using the current version. The example model is in my personal GitHub, so no one ever needs to see it :)

I am keen to merge this PR into development to prevent it being lost (the MPI communications stuff and the splitting of remote and local populations are useful regardless), but perhaps I'll merge after I make the 3.1.0 release. Then I can experiment with building some models using the dynamic library approach and, if that's clearly a better fit for MPI, roll back some of the hacky bits for making filenames unique...

@tnowotny
Member

Ok - you have my blessing to merge this when it fits into your workflow. Overall it's not an awful approach. That the macros exist doesn't hurt anyone who may not want to use them ;-)


@tnowotny tnowotny left a comment


Merge when it suits best ... as discussed.

@neworderofjamie neworderofjamie removed this from the GeNN 3.1.0 milestone Jan 30, 2018
@neworderofjamie
Copy link
Contributor

neworderofjamie commented Feb 1, 2018

FYI @tnowotny, I had a go at re-implementing the microcircuit model using a shared library here: https://github.com/neworderofjamie/genn_examples/blob/master/potsjan_microcircuit/simulator_shared_library.cc. I think the result is already somewhat terser and less riddled with macros, and there's still quite a lot of boilerplate that could be moved into the SharedLibraryModel helper class.

@neworderofjamie neworderofjamie changed the base branch from development to master February 14, 2018 12:22
@neworderofjamie neworderofjamie merged commit 2026729 into master Feb 28, 2018
@neworderofjamie neworderofjamie deleted the MPI_support branch February 28, 2018 13:39