GitHub - andrewssobral/LowRankSVDCodes: RSVDPACK: Implementations of fast algorithms for computing the low rank SVD, interpolative and CUR decompositions of a matrix, using randomized sampling. Includes codes for single core (using GNU GSL), multi-core (using Intel MKL) and GPU (using NVIDIA CUDA/CULA) architectures.

andrewssobral / LowRankSVDCodes Public

forked from sergeyvoronin/LowRankMatrixDecompositionCodes

Notifications You must be signed in to change notification settings
Fork 2
Star 3

RSVDPACK: Implementations of fast algorithms for computing the low rank SVD, interpolative and CUR decompositions of a matrix, using randomized sampling. Includes codes for single core (using GNU GSL), multi-core (using Intel MKL) and GPU (using NVIDIA CUDA/CULA) architectures.

GPL-3.0 license

3 stars 32 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
matlab_code		matlab_code
multi_core_mkl_code		multi_core_mkl_code
nvidia_gpu_cula_code		nvidia_gpu_cula_code
single_core_gsl_code		single_core_gsl_code
LICENSE		LICENSE
README		README
make_matrix_binary.m		make_matrix_binary.m
make_matrix_binary2.m		make_matrix_binary2.m
make_matrix_binary3.m		make_matrix_binary3.m
read_matrix_binary.m		read_matrix_binary.m

Repository files navigation

RSVDPACK: Implementations of fast algorithms for computing the low rank SVD, 
interpolative and CUR decompositions, using randomized sampling. 
Given, a real mxn matrix A and a rank parameter 
k<< min(m,n) (or optionally, a tolerance parameter instead of rank) the 
rsvd routines compute  
U_k, Sigma_k, and V_k
where U_k is mxk, Sigma_k is kxk, and V_k is nxk, such that:
A \approx U_k Sigma_k V^T_k

Included is a blocked randomized routine which computes a so called QB 
decompositon of A such that,  
A \approx Q_k B_k, from which other decompositions may be efficiently computed.

The interpolative decomposition routines return a single or double sided ID of a matrix, 
while the CUR decomposition can decompose a matrix into the form A \approx C U R 
where C and R contain a selection of the columns and rows, respectively of the 
original A.

For more detailed information on the implemented algorithms see:
http://arxiv.org/abs/1502.05366
http://arxiv.org/abs/1412.8447
http://arxiv.org/abs/1503.07157

There are three codes: for single processor (with GNU GSL), multiprocessor 
(with Intel MKL) and GPU (with CULA library for NVIDIA CUDA capable cards).
A simple implementation in Matlab (or Octave) is also provided for 
illustration purposes. The majority of the algorithms are currently implemented with 
the multi-core Intel MKL version. The GPU version will soon receive updates 
and additional routines as well.

In addition, experimental Matlab mex-file support for the multi-core and 
GPU C codes is available with illustrating examples of calling some of the 
multi-core and GPU functions.

Written by Sergey Voronin, 2014-2016 as part of a collaborative project with Per-Gunnar Martinsson.
Tested with GSL-1.16, icc/mkl 14.03, cuda 6.5, cula r18. Should also work 
with more recent version of the software. 

Algorithms used based on those originally described in the article: 
"Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," N. Halko, P.G. Martinsson, J. Tropp, SIAM Review, 53(2), 2011
and in the other mentioned articles, with some modifications. 
See the arxiv articles for more details.

============ summary of installation and usage ==============

One must install the GNU GSL development libraries, Intel MKL (and the intel C compiler), 
NVIDIA CUDA libraries and the CULA Dense packages. 
Please refer to their corresponding documentations.
http://www.gnu.org/software/gsl/
https://software.intel.com/en-us/intel-mkl
https://developer.nvidia.com/cuda-zone
http://www.culatools.com/dense/

NOTE on software: The GSL library is free to download and distributed under a GNU license. A free 
version of the CULA library is available for download at 
http://www.culatools.com/downloads/ , which includes all the functions 
necessary for RSVDPACK (matrix-matrix and matrix-vector ops, QR, SVD, etc). 
The Intel MKL library is generally available in commercial form. 
The free, non-commercial versions of the Intel MKL library and the 
C++ compiler can be obtained from Intel by qualified users. Please 
see: https://software.intel.com/en-us/qualify-for-free-software .  

For simple illustration of the randomized SVD algorithms, see the 
Matlab implementation. Each C implementation resides in its own subfolder 
with a compile.sh file. After 
necessary paths (PATH variable, LD_LIBRARY_PATH) are set for referencing 
gsl, mkl, and cuda/cula, this file 
should be modified to reflect any local system changes and executed to yield the 
driver executable in each case. Paths for mkl are usually set via:
$ source /opt/intel/bin/compilervars.sh intel64
For cuda/cula, source the script nvidia_gpu_cula_code/setup_paths.sh after checking 
that the paths are correct for your system.  

In order to make a test matrix, use the provide make_matrix_binary.m script which can be run 
from Octave or Matlab. Note that this script writes a binary matrix file.

Once the matrix is made one can use any of the drivers to compute the low 
rank SVD or the ID and CUR decompositions. Inside the main loop of the programs one 
sets the rank k <= min(nrows,ncols) 
or the tolerance parameter if one of the autorank methods is used.
Additional parameters such as the block size apply to the block randomized methods.

One can call either algorithm I,II,III, or IV for computing the low rank SVD with 
each driver.
Each of the codes also has three autorank functions based on version II and III 
of the randomized algorithm, for use when a good rank k to use is not known in advance.
Thus, there is a total of 6 functions for computing the low rank SVD. Algorithm IV 
is block based and uses the QB algorithm to construct Q instead of QR.

Additionally, for the multi-core and GPU architectures the following algorithms 
are also available:
full column pivoted QR from lapack (only for multi-core), 
partial column pivoted QR (using householder reflectors), 
single vector randQB with power method, and 
blocked randQB with power method.

For the multi-core architecture, the following routines are also 
available: ID, double sided ID, CUR in full, randomized, and 
block randomized form.

Sample calling procedure for the rsvd algorithms is as follows 
(similar for all 3 architectures).

=======================

// declare matrices and vectors
mat *M, *U, *S, *V, *T, *S, *C, *R;
vec *Icol, *Irow;

// load matrix M from file
M = matrix_load_from_binary_file(mfile);
m = M->nrows; n = M->ncols;

// set svd rank (< min(m,n)) if not using the autorank version
// for autorank, define instead the TOL parameter (i.e. TOL = 0.1)
// set rank, block size, oversampling, power scheme and orthogonalization parameters
k = 300;
kstep = 100;
p = 20;
q = 2;
s = 1;

// call random SVD (use one of the six variants)
randomized_low_rank_svd1(M, k, &U, &S, &V);
randomized_low_rank_svd2(M, k, &U, &S, &V);
randomized_low_rank_svd3(M, k, q, s, &U, &S, &V);
randomized_low_rank_svd4(M, kblocksize, numblocks, q, &U, &S, &V);
randomized_low_rank_svd2_autorank1(M, frac, TOL, &U, &S, &V);
randomized_low_rank_svd2_autorank2(M, kblocksize, TOL, &U, &S, &V);
randomized_low_rank_svd3_autorank2(M, kblocksize, TOL, q, s, &U, &S, &V);

// call ID, dsID, CUR block randomized routines
id_decomp_fixed_rank_or_prec(M, k, TOL, &frank, &I, &T);
id_two_sided_blockrand_decomp_fixed_rank_or_prec(M, k, p, TOL, kstep, q, s, &frank, &Icol, &Irow, &T, &S);
cur_blockrand_decomp_fixed_rank_or_prec(M, k, p, TOL, kstep, q, s, &frank, &C, &U, &R);

// write results to disk
matrix_write_to_binary_file(U, "data/U.bin");
matrix_write_to_binary_file(S, "data/S.bin");
matrix_write_to_binary_file(V, "data/V.bin");

=======================

Notice that for the multiprocessor OpenMP based code, one can control the number of 
threads used via an environmental variable. For instance, in bash type:
export OMP_NUM_THREADS=6
to use 6 threads. You should use as many threads as there are physical cores for 
best results, but the optimal configuration differs for different systems.

Example run with GPU code

First, make the matrix using the provided script:
$ matlab -nodesktop
> make_matrix_binary2 
> ls data/A_mat_6kx12k.bin
$

Check to make sure nvidia libs are setup ok:
$ nvidia-smi 

Switch to correct directory
$ cd nvidia_gpu_cula_code 

Source paths to cuda/cula
$ source setup_paths.sh

Compile
$./compile.sh

Run:
$./driver_gpu_nvidia_cula
Initializing CULA
culaVersion is 18000
loading matrix from ../data/A_mat_6kx12k.bin
initializing M of size 6000 by 12000
done..
sizes of M are 6000 by 12000
calling random SVD version 3 with k = 1000
.........
elapsed time: about 13 seconds
normM = 180.600893 ; normU = 31.622777 ; normS = 176.342348 ; normV = 31.622777 ; normP = 176.342348
percent_error between M and U S V^T = 21.587897
$

Example run with Matlab Mex Interface of GPU code (run C GPU code inside of Matlab)

$ cd matlab_code/mex_code_nvidia_gpu_cula/
$ source setup_vars.sh
$ ./compile_mex.sh
$ matlab -nodesktop
>> A = randn(2000,5000);
>> k = 1900;
>> [U,S,V] = rsvd_cula_mex_interface1(A,k);
>> norm(A - U*S*V')/norm(A)
>> k = 2000;
>> [U,S,V] = rsvd_cula_mex_interface2(A,500,4,2);
>> norm(A - U*S*V')/norm(A)