# arbenson/mrtsqr (forked from dgleich/mrtsqr)

MapReduce Streaming TSQR Implementation

## MapReduce TSQR (MRTSQR)

Austin R. Benson, David F. Gleich, Paul G. Constantine, and James Demmel

This software provides a number of matrix computations for Hadoop Mapreduce, using Python Hadoop streaming. Among the computations, we focus on providing a variety of methods for computing the QR factorization and the SVD. The QR factorization is a standard matrix factorization used to solve many problems. Probably the most famous is linear regression:

```
minimize || Ax - b ||,
```

where A is an m-by-n matrix, and b is an m-by-1 vector. When the number of rows of the matrix A is much larger than the number of columns (m >> n), then A is called a tall-and-skinny matrix because of its shape.
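To make the regression connection concrete, here is a local NumPy sketch (independent of the MapReduce tools in this repository) that solves a tall-and-skinny least-squares problem via the thin QR factorization:

```python
import numpy as np

# A small tall-and-skinny least-squares problem: m >> n.
rng = np.random.default_rng(0)
m, n = 1000, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Thin QR: Q is m-by-n with orthonormal columns, R is n-by-n upper triangular.
Q, R = np.linalg.qr(A)

# minimize ||Ax - b|| reduces to the small triangular solve R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)
```

This is also why computing just the R factor (and Q^T b) is often enough: the expensive m-by-n work shrinks to a small n-by-n problem.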

The SVD is a small extension of QR when the matrix A is tall-and-skinny. If

```
A = QR
```

and

```
R = USV',
```

then the SVD of A is

```
A = (QU)SV'.
```

Since R is n-by-n, computing its SVD is cheap (O(n^3) operations). The QR implementations we offer are:
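The identity above is easy to verify locally with NumPy (a sketch, separate from the Hadoop codes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 6))  # tall-and-skinny: m >> n

# Thin QR of A, then the cheap SVD of the small n-by-n factor R.
Q, R = np.linalg.qr(A)
U, S, Vt = np.linalg.svd(R)

# A = (QU) S V': the left singular vectors of A are the columns of Q @ U,
# and the singular values of A are exactly those of R.
A_rebuilt = (Q @ U) * S @ Vt
```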

* Compute just the R factor, using TSQR or Cholesky QR.
* Indirect TSQR (compute R, then Q = AR^{-1}). This computation of Q is unstable.
* Indirect TSQR + (pseudo-)iterative refinement.
* Direct TSQR. This is a stable computation of Q.
* Householder QR. This is included for performance comparisons only; the other algorithms are better suited to MapReduce.

All of the analogous SVD computations are also available, with no additional cost in running time. We also provide the following computations that may be useful:

* B^T * A for tall-and-skinny matrices A and B.
* A^T * A for a tall-and-skinny matrix A.
* A * B for a tall-and-skinny matrix A and a small matrix B.
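These products fit the streaming model naturally because each block of rows contributes an independent partial sum. As a rough local simulation of the A^T * A pattern (plain NumPy, no Hadoop):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10000, 8))

# Simulate map tasks: each one sees a block of rows and emits its local
# n-by-n Gram matrix A_i^T A_i.
blocks = np.array_split(A, 4)            # 4 "mappers"
partials = [Ai.T @ Ai for Ai in blocks]

# Simulate the reducer: sum the small partial Gram matrices.
AtA = sum(partials)
```

Only small n-by-n matrices ever move between tasks, which is why these computations add essentially no running-time cost over a single pass through the data.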

These codes are written using Python Hadoop streaming. We use NumPy for local matrix computations and Dumbo for managing the streaming. Some C++ implementations are also provided in the `mrtsqr/cxx` directory.

The most recent work is described in the following paper by Benson, Gleich, and Demmel; please cite it if you use this software:

The original paper by Constantine and Gleich on MapReduce TSQR is:

The original work on the TSQR by Demmel et al. is:

## Setup

This code requires the following software:

Once everything is installed, run the small tests for MRTSQR:

```
cd dumbo
python run_tests.py all
```

## R and singular values examples

Here, we give a brief overview of the code and a small working example. For this example, we need to set the environment variable HADOOP_INSTALL to point to Hadoop on your system. For example:

```
export HADOOP_INSTALL=/usr/lib/hadoop
```

Our first example shows how to compute the R factor and singular values of a small matrix:

```
# Move a matrix into HDFS, properly formatted for our tools
dumbo start dumbo/matrix2seqfile.py \
  -input tsqr/verytiny.tmat -output tsqr/verytiny.mseq

# Look at the matrix in HDFS

# Run TSQR

# Look at R in HDFS

# Run TSQR with a different reduce schedule and output name
dumbo start dumbo/tsqr.py -mat tsqr/verytiny.mseq -reduce_schedule 2,1 \
  -output verytiny-qrr-double-reduce.mseq

# Look at R (should be the same, up to sign)

# Run TS-SVD to compute the singular values

# Look at the singular values
```

Our second example shows how to stably compute the thin QR and SVD factorizations of the same small matrix. For this example, Feathers must be installed and feathers.jar must be on the Java classpath.

```
# Change directories
cd dumbo

# Compute Q, R, and singular values stably:
python run_dirtsqr.py --input=tsqr/verytiny.mseq \
  --ncols=4 \
  --svd=1 \
  --local_output=tsqr-tmp \
  --output=verytiny_qr_svd

# Look at R

# Look at the singular values
```

The matrix Q is stored in `verytiny_qr_svd_3` on HDFS. However, we store it in the compressed TypedBytes string format (as opposed to the TypedBytes list format) for efficiency. This makes the output of `cat` unreadable, but it speeds up computations that use Q. We can check that Q is orthogonal:

```
dumbo start AtA.py -mat verytiny_qr_svd_3 \
# Q^T * Q should be close to the identity matrix
```
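Locally, the same check amounts to measuring how far Q^T Q is from the identity (a NumPy sketch; the random Q below is a stand-in for the factor stored in `verytiny_qr_svd_3`):

```python
import numpy as np

# Stand-in Q with orthonormal columns; in practice this is read back from HDFS.
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((100, 4)))

# Q^T Q should be close to the 4-by-4 identity matrix.
deviation = np.linalg.norm(Q.T @ Q - np.eye(4))
```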

## Many QRs

In this example, we look at many of the methods provided for computing the QR factorization. We will use a slightly larger test matrix for this example. The following code creates a 10M-by-20 matrix.

```
hadoop fs -copyFromLocal data/Simple_10k.txt Simple_10k.txt
dumbo start dumbo/GenBigExample.py -mat Simple_10k.txt \
```

Now, we compute the QR factorization of this matrix using several different methods:

```
cd dumbo
# TSQR + AR^{-1}.
--schedule=20,1 --output=ARINV
# Cholesky QR + AR^{-1}.  Option use_cholesky specifies the number of columns.
--schedule=20,1 --output=ARINV_CHOL --use_cholesky=20

# Direct TSQR.
--ncols=20 --svd=0 --schedule=20,20,20 --output=DIRTSQR
# Recursive Direct TSQR.
--ncols=20 --output=REC_DIRTSQR

# Indirect TSQR + iterative refinement.
--schedule=20,1 --output=TSQR_IR
# Indirect TSQR + pseudo-iterative refinement.