## Commloop
An MPI-based communication loop framework for inter-program, cross-language communication.
### Table of Contents
- Team Members
- System Requirements
- Background
- Files
- Makefile
- Commloop Benchmarks
#### Team Members
- Madison Stemm (author) (madison.stemm@gmail.com)
- Patricio Cubillos (pcubillos@fulbrightmail.org)
- Andrew Foster (andrew.scott.foster@gmail.com)
- Joe Harrington (jh@physics.ucf.edu)
#### System Requirements
Important: MPICH must be installed before mpi4py, because mpi4py builds against whichever MPI implementation is present at install time. Commloop was written for MPICH (with mpi4py built against it); nested spawning is not functional with OpenMPI as of this release.
#### Background
MPI (Message Passing Interface) is a communication protocol used to add parallel processing to programs. In this implementation, Commloop consists of a central hub (a 'Master') that acts as a mediator, interacting with a sequence of spawned programs ('Workers') in a loop. Python and C Workers are included here, but any language supported by MPI can be added easily. This allows communication between programs written in different languages.
We designed Commloop to be modular and expandable. As stated above, the core of Commloop consists of a central hub (`master.py`) and C and Python Workers (`worker_c.c` and `worker.py`, respectively). `mutils.py` contains a series of function wrappers for the MPI calls, written so that MPI could easily be replaced with another parallel-processing interface at a later date.
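For a rough idea of the shape of these wrappers, the mpi4py-based sketch below shows what a general-purpose spawn/send/receive layer could look like. The function names and signatures here are illustrative assumptions, not the actual mutils.py interface.

```python
# Illustrative sketch only; the real mutils.py API may differ.
import numpy as np
from mpi4py import MPI

def spawn(command, nprocs, args=None):
    """Spawn a worker program and return the inter-communicator to it."""
    return MPI.COMM_SELF.Spawn(command, args=args or [], maxprocs=nprocs)

def comm_send(comm, array, dest=0, tag=0):
    """Send a float64 NumPy array over the given communicator."""
    buf = np.ascontiguousarray(array, dtype=np.float64)
    comm.Send([buf, MPI.DOUBLE], dest=dest, tag=tag)

def comm_recv(comm, size, source=0, tag=0):
    """Receive a float64 NumPy array of a known size."""
    array = np.empty(size, dtype=np.float64)
    comm.Recv([array, MPI.DOUBLE], source=source, tag=tag)
    return array

def comm_disconnect(comm):
    """Tear down an inter-communicator once the loop has finished."""
    comm.Disconnect()
```

Keeping every MPI call behind wrappers like these is what would allow MPI to be swapped out without touching the Master or Worker logic.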
To execute the code as-is, run `mpiexec master.py`.
Initially, the Master sends data to the first Python Worker (pyWorker1) and awaits its output. The output from pyWorker1 is sent back to the Master, which then forwards it to the C Worker (cWorker). The cWorker output is returned to the Master and sent on to the second Python Worker (pyWorker2). Once that data is returned to the Master, the loop repeats.
During each loop iteration, each Worker receives an array of floats, divides every element in half, and sends the resulting array back to the Master. Halving makes the values rapidly approach zero, so there is a traceable difference after each Worker operation, without risk of the values blowing up and causing double-precision overflow.
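A minimal sketch of the Worker side of this loop is shown below, assuming a Python Worker spawned by the Master that communicates over the parent inter-communicator; the array size and iteration count are placeholder assumptions rather than values taken from worker.py.

```python
# Hypothetical sketch of a Python Worker's half of the loop;
# the real bin/worker.py may differ in detail.
import numpy as np
from mpi4py import MPI

comm  = MPI.Comm.Get_parent()   # inter-communicator back to the Master
size  = 1000                    # floats per transfer (assumed)
niter = 1000                    # loop iterations (assumed)

array = np.empty(size, dtype=np.float64)
for _ in range(niter):
    comm.Recv([array, MPI.DOUBLE], source=0, tag=0)   # array from the Master
    array /= 2.0                                      # divide every element in half
    comm.Send([array, MPI.DOUBLE], dest=0, tag=0)     # send the result back

comm.Disconnect()
```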
The code currently passes dummy arrays in the following structure:
| Sender    | Data   | Receiver  |
|-----------|--------|-----------|
| Master    | Array1 | pyWorker1 |
| pyWorker1 | Array2 | Master    |
| Master    | Array2 | cWorker   |
| cWorker   | Array3 | Master    |
| Master    | Array3 | pyWorker2 |
| pyWorker2 | Array4 | Master    |

Array4 then becomes Array1 for the next iteration, and the loop repeats.
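To make the relay concrete, the hedged mpi4py sketch below follows the Master's side of this table for one run; the spawn commands, array size, and iteration count are assumptions for illustration, not the actual master.py code.

```python
# Assumed sketch of the Master relay; bin/master.py differs in detail.
import sys
import numpy as np
from mpi4py import MPI

# Spawn the two Python Workers and the C Worker as child programs.
py1 = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=1)
cw  = MPI.COMM_SELF.Spawn('./worker_c',   args=[],            maxprocs=1)
py2 = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=1)

array = np.ones(1000, dtype=np.float64)        # Array1: dummy starting data

for _ in range(1000):
    for worker in (py1, cw, py2):
        worker.Send([array, MPI.DOUBLE], dest=0, tag=0)    # hand the data off
        worker.Recv([array, MPI.DOUBLE], source=0, tag=0)  # receive the halved copy
    # After pyWorker2 replies, Array4 is in `array` and becomes Array1
    # for the next pass through the loop.

for worker in (py1, cw, py2):
    worker.Disconnect()
```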
#### Files
`bin/mutils.py`
- Holds Python wrappers, in general form, for all of the MPI functions used
- (used by both master.py and worker.py)

`bin/master.py`
- Holds all of the Master's MPI calls

`bin/worker.py`
- Holds the Worker MPI calls for both Python portions of Commloop

`bin/worker_c`
- Compiled binary holding the Worker MPI calls for the C portion of Commloop

`src/Makefile`
- Compiles the C worker

`src/worker_c.c`
- Holds the Worker MPI calls for the C portion of Commloop (source for bin/worker_c)
#### Makefile
To compile the C worker, simply call `make` in src/. The compiled binary will be moved to bin/ automatically, overwriting any existing binary.

The Makefile generates the MPI-executable C worker with the following command:

    mpicc -fPIC -o worker_c worker_c.c
#### Commloop Benchmarks
The above plot is a benchmark of MPI itself rather than Commloop specifically. For this setup we used only one Master and one Worker, with 10 processes per spawned Python worker, looping over 1000 iterations and recording the minimum, maximum, median, and mean times for each transfer size. Performance remains roughly constant up to about 10 KB, after which runtimes begin to grow with transfer size (roughly linearly, per the table below).

The final benchmark (below) was run with the default source-code setup (arrays of sizes 10 B, 1 KB, 1 MB, and 10 B, respectively), with the 1 MB array being passed to the C worker. It shows the startup time and a breakdown of the loop speed.
| Part of code   | Time (seconds) |
|----------------|----------------|
| Start MPI Comm | 0.291091918945 |
| Avg Iteration  | 0.194536820277 |
| Total Code     | 82.5224819183  |
| Size of Array | Median Time (seconds) | Minimum Time (seconds) |
|---------------|-----------------------|------------------------|
| 1 B           | 8.10623e-06           | 3.81469e-06            |
| 10 B          | 8.10623e-06           | 6.91413e-06            |
| 100 B         | 8.10623e-06           | 2.86102e-06            |
| 1 KB          | 6.19888e-06           | 4.76837e-06            |
| 10 KB         | 3.49283e-05           | 1.69277e-05            |
| 100 KB        | 0.00130105            | 0.00126290             |
| 1 MB          | 0.0130050             | 0.0125229              |
| 10 MB         | 0.155957              | 0.150859               |
| 100 MB        | 1.60676               | 0.852834               |
| 1 GB          | 20.5827               | 5.63851                |
The transfer times for all 1000 iterations of each benchmark were recorded to show the variation in transfer times. The variance is assumed to be caused by background processes on the computer; spikes occur periodically, which may indicate an MPI buffer being flushed.
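For anyone who wants to reproduce numbers along these lines, a round-trip timing loop like the sketch below would do it; the worker command, transfer size, and use of wall-clock timing are assumptions rather than the exact benchmark script.

```python
# Assumed reconstruction of the transfer-size benchmark: time 1000 round
# trips of a fixed-size array between the Master and one spawned Worker.
import sys
import time
import numpy as np
from mpi4py import MPI

worker = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=1)

nbytes = 1024                                   # transfer size to test (1 KB)
array  = np.ones(nbytes // 8, dtype=np.float64) # 8-byte floats
times  = []

for _ in range(1000):
    tic = time.time()
    worker.Send([array, MPI.DOUBLE], dest=0, tag=0)
    worker.Recv([array, MPI.DOUBLE], source=0, tag=0)
    times.append(time.time() - tic)

print("min / median / max / mean (s):",
      np.min(times), np.median(times), np.max(times), np.mean(times))
worker.Disconnect()
```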