GitHub - fhstoica/N_Body_Simulation: Multithreaded N-body (star cluster) simulation using barrier synchronization, and OpenGL code to visualize the results.

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Gen_Init_Cond.py		Gen_Init_Cond.py
Makefile		Makefile
OpenGL_Makefile		OpenGL_Makefile
OpenGL_Makefile_Win		OpenGL_Makefile_Win
README		README
Show_Cluster_Motion.cpp		Show_Cluster_Motion.cpp
star_cluster.cpp		star_cluster.cpp

Repository files navigation

This project contains the source code needed for 
simulating the evolution of a star cluster using 
the Verlet algorithm. See for example:

http://en.wikipedia.org/wiki/Verlet_integration

Since the complexity is O(N^2) where N is the number 
of particles, it is essential to take advantage of
multi-core processors.

It is a multi-threaded simulation using barrier 
syncronization and the number of threads can be specified 
by the user via the "MaxThreads" parameter.
 
Under Linux CPU affinity can also be specified, by 
uncommenting the -DLINUX preprocessor option in the 

The particles are split in equal groups per thread, each 
thread computing the updated positions and velocities of 
each particle in its assigned group. 

The algorithm uses adaptive time step, and writes out the 
values only when enough evolution time has accumulated. 
At this point the cumulative time is reset, and the 
process is repeated. This way the updated particle 
positions are written out at regular time intervals, even 
though the number of simulation steps can change. 
To avoid delays due to I/O operations the intermediate 
results are accumulated in a vector which is written to 
file only at the end of the simulation.

The threads use barrier synchronization, but the last 
thread to reach the barrier much check if enough time 
has accumulated to write the current particle positions.
This is determined by having each thread set a bit on a
global 32-bit unsigned integer, so we are limited to 
32 threads. Locking is needed so that the integer is not
modified between the time the thread sets its bit and 
the time checks the resulting value. 

The total number of simulation steps is specified in the 
beginning, but the resulting number of frames depends 
on the actual evolution of the cluster. 
The largest disadvantage is the formation of a close binary
pair; the particles move at very large speed with very small
separation, so the timestep will be always very small. 
 
As mentioned in the beginning, the treads can have 
user-specified CPU affinity, so that any thread can 
run on any CPU, or each thread can be forced to run on a 
single CPU. Affinity is set via the function:
 
pthread_setaffinity_np(p_thread[i], sizeof(cpu_set_t), &cpuset);

It works only in Linux as "_np" stands for non-portable.
If a thread is created on the wrong CPU it is migrated. 
This is not very costly, as it is done only once at the 
beginning of the run.
To enable this, uncomment the -DLINUX preprocessor option
in the Makefile.
The user can leave these two lines in the main() function:

      CPU_ZERO(&cpuset);   
      CPU_SET(i, &cpuset); 

uncommented, and this will force each thread to run on a 
single CPU. Commenting them out will allow any thread to 
run on any CPU.

Leaving -DLINUX commented-out will allow the program to 
compile in Windows using MinGV. However, it will not work
in Cygwin, as barriers are not supported.

To build the program use:

make star_cluster.srl

and then simply run it with:

./star_cluster.srl

One also needs the intial data for the cluster, 
generated by the Gen_Init_Cond.py python script.
Simply edit the script to set the correct number of 
particles (the same as in star_cluster.cpp) and then 
run with:

python Gen_Init_Cond.py 99.54719

It creates the initial_data.dat file used by star_cluster.cpp.
The number given as argument is the seed for the random number
generator. 

One thing to keep in mind is that if you increase the number 
of particles you'll have to modify the velocity scale "vscale" 
and the position (inter-particle distance) scale "pscale" to 
obtain a well-behaved cluster. That is because the potential 
energy is negative (attractive) and increases as O(N^2) with 
the number of particles. Failing to do so will result in a 
cluster of very fast-moving particles, which will force the 
time step to be very small, resulting in very little evolution 
time. 

Visualising the data:

The data can be visualised using the Show_Cluster_Motion.cpp
program. The number of particles must be set manually to the 
correct value (same as in star_cluster.cpp).

const GLint N   = 4;

To build the program simply use

make -f OpenGL_Makefile Show_Cluster_Motion.srl

or in Windows

make -f OpenGL_Makefile_Win Show_Cluster_Motion.srl

You may need to edit the makefiles so that they point to the 
correct directories containing the gl.h, glu.h and glut.h.
Then use it as:

./Show_Cluster_Motion.srl

It reads the "positions.txt" generated by the simulation. 
Click on the window and move the mouse around to rotate
the entire cluster while the particles are moving.
Internally it creates a vector of display lists and loops 
over it calling one list at a time until the users presses ESC. 

The cluster simulation above can make a pretty good benchmark 
for processor speed on multi-threaded tasks. However, due to 
the N^2 scaling, it becomes impractical to run it for clusters 
larger than 128 particles. Past this point we enter the realm 
of GPU programming. The interesting part is to see for each 
number of particles what is optimal: to allow any thread on 
any CPU core, or to force each thread to run on a sigle CPU.

Happy hacking!

Update 07-Dec-2014:

The code has been updated to take advantage of the SSE instructions
available on modern processors. The -DSSE compiler flag selects the 
SSE code. The run time drops to a bout a half of the non-optimized 
version. 
The compiler flags for tuning to a specific architecture are set 
for Athlon 64, but they can be easily changed to other architectures.