simple port of hpl-2.0 to use NVIDIA GPU accelation with CUBLAS
C C++
Switch branches/tags
Nothing to show
Latest commit b0440f0 May 10, 2011 avidday@gmail.com avidday@gmail.com first commit
Permalink
Failed to load latest commit information.
include
makes first commit May 9, 2011
src first commit May 9, 2011
testing
COPYING
Make.CUDA
Make.top
Makefile
README first commit May 10, 2011

README

-- High Performance Computing Linpack Benchmark for CUDA
    hpl-cuda - 0.001 - 2010

    David Martin (cuda@avidday.net)
    (C) Copyright 2010-2011 All Rights Reserved

See the accompanying COPYING file for full details of the
license and copyright information of the code contained in
this distribution.

This distribution contains a simple acceleration scheme for
the standard HPL-2.0 benchmark with a double precision capable
NVIDIA GPU and the CUBLAS library.

The code has been known to build on Ubuntu 8.04LTS or later and Redhat 5
and derivatives, using mpich2 and GotoBLAS, with CUDA 2.2 or later.

The supplied Make.CUDA file relies on a number of environment variables
being set to correctly locate host BLAS and MPI, and CUBLAS libraries
and include files. Example values for the abovementioned mpich2 and gotoBLAS
combinations might be:

MPICH2_INSTALL_PATH=/opt/mpich2-1.3
MPICH2_INCLUDES=-I/opt/mpich2-1.3/include
MPICH2_LIBRARIES=-L/opt/mpich2-1.3/lib -lmpich -lmpl
BLAS_INCLUDES=-I/opt/goto/GotoBLAS2/include
BLAS_LIBRARIES=-L/opt/goto/GotoBLAS2/lib -lgoto2 -lgfortran -lpthread
CUBLAS_INCLUDES=-I/opt/cuda-3.2/include
CUBLAS_LIBRARIES=-L/opt/cuda-3.2/lib64 -lcublas -lcudart -lcuda

With these set, it should be possible to build by issuing

$ make arch=CUDA

which will drop the final xhpl executable in the bin/CUDA directory.
Correct runtime path and link load settings are the responsibility
of the user.

NOTE:

This version of the code only acclerates one section
of the factorization for a single GPU. The scheduling routine
gpuUpdatePlanCreate() in auxil/HPL_gpusupport.c contains two
tuning constants tune0 and tune1, which control the split
of work between the GPU and host CPU(s). These must be tuned
for optimal performance with a given GPU and host CPU/BLAS
combination. How this is done is left as an exercise for the
reader.

There are a number of further optimizations which can be applied
to this code - it should be regarded as a starting point rather
than a definitive version of the benchmark.

DISCLAIMER:

 THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT 
 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
 A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
 OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
 SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT 
 LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY 
 THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.