# Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) is a simple yet effective method to decompose a matrix into a product of two non-negative matrices(that is sparse matrices with all non-negative entries). This technique is most commonly used in recommender systems, and was made well known by the Netflix Prize. NMF aims to factor a data matrix $X$ into a product of two matrices:

$$X \approx AS $$

where $X$ is a $n \times m$ matrix, $A$ is a $n \times k$ matrix, and $S$ is a $k \times m$ matrix. $k$ is usually provided by the user, and symbolizes the number of distinct "factors" in the data. For example, if our data was the total productivity of a group of factories per hour for the past week, the number of factors $k$ would be the number of factories. Without prior knowledge the number of factors would be harder to pinpoint, and would have to be chosen using cross validation or something similar

It's important to note that this problem does not have a unique solution, and we could end up with many different combinations of $A$ and $S$ that multiply to get a decent approximation of $X$. Even more, each pair of $A$ and $S$ can be scaled by any real number $\alpha$ and $\frac{1}{\alpha}$ respectively to yield an infinite number of pairs. 

## Alternating Least Squares

The natural question to ask now is how to determine $A$ and $S$ when given $X$ and $k$. One relatively simple method is to use alternating least squares, which is a generalization of the least squares method for simple linear regression. In simple linear regression, the goal is to solve the following equation for $x$:

$$ Ax = b \implies A^TAx = A^Tb \implies x = (A)^{\dagger}b\$$

This can be generalized for a product of matrices by picking a random $i^{th}$ column of $X$ and $S$, which we will denote $x_{:,i}$ and $s_{:,i}$, fixing $A$, and solving for $s_{:,i}$. Then by our previous equation $X \approx AS$ we have 

$$ x_{:,i} \approx As_{:,i}$$
This yields the following update rule:

$$ s_{:,i} := (A)^{\dagger}x_{:,i}$$

However, since we also want to solve for $A$ we need to sample a column of $A$ and fix $S$. To get the same linear form we do the following:

$$x_{i,:} \approx a_{i,:}S \implies x_{i,:}^T \approx S^Ta_{i,:}^T$$

We switch to updating the rows of $A$ rather than the columns due to dimensionality, and get the following update rule:

$$ a_{i,:}^T = (SS^T)^{-1}Sx_{i,:}^T $$

We then repeat these updates until convergence or the number of iterations is fulfilled. 

In [27]:
import numpy as np
import matplotlib.pyplot as plt

np.set_printoptions(suppress=True)

In [4]:
np.random.seed(1)
col1 = np.array([[0, 0, 9, 5, 3, 2, 1, 0, 0, 0, 0, 0]])
col2 = np.array([[0, 0, 0, 0, 0, 3, 2, 1, 1, 0, 0, 0]])
col3 = np.array([[0, 5, 5, 6, 6, 7, 4, 2, 1, 0.5, 0, 0]])

factors = np.vstack((col1, col2, col3)).T
weights = np.random.randint(0, 2, size=(3, 10))

X = np.matmul(factors, weights)
print(X)

[[ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   5.   0.   0.   5.   0.   0.   0.   5.   0. ]
 [ 9.  14.   0.   0.  14.   9.   9.   9.  14.   0. ]
 [ 5.  11.   0.   0.  11.   5.   5.   5.  11.   0. ]
 [ 3.   9.   0.   0.   9.   3.   3.   3.   9.   0. ]
 [ 2.  12.   0.   3.  12.   2.   2.   5.   9.   0. ]
 [ 1.   7.   0.   2.   7.   1.   1.   3.   5.   0. ]
 [ 0.   3.   0.   1.   3.   0.   0.   1.   2.   0. ]
 [ 0.   2.   0.   1.   2.   0.   0.   1.   1.   0. ]
 [ 0.   0.5  0.   0.   0.5  0.   0.   0.   0.5  0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]]


Initializing the factor matrices $A$ and $S$ is typically done by filling a matrix of the right dimensions with randomized entries. However, to check that the ALS algorithm works properly we can initialize $A$ and $S$ close to our original factor matrices. The final $A$ and $S$ should yield a very low error.

In [28]:
np.random.seed(1)
k = 3
niter = 1000
A = factors + 0.01*np.random.rand(12, 3)
S = weights + 0.01*np.random.rand(3, 10)

for i in np.arange(niter):
    rowcol = np.random.randint(k)
    S[:, rowcol] = np.matmul(np.linalg.pinv(A), X[:, rowcol])
    A[rowcol, :] = np.matmul(X[rowcol, :], np.matmul(S.T, np.linalg.inv(np.matmul(S, S.T))))

approx = np.matmul(A, S)
print(X)
print(np.round(approx, 2))
print("Relative error: ", np.linalg.norm(X - approx) / np.linalg.norm(X))

[[ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   5.   0.   0.   5.   0.   0.   0.   5.   0. ]
 [ 9.  14.   0.   0.  14.   9.   9.   9.  14.   0. ]
 [ 5.  11.   0.   0.  11.   5.   5.   5.  11.   0. ]
 [ 3.   9.   0.   0.   9.   3.   3.   3.   9.   0. ]
 [ 2.  12.   0.   3.  12.   2.   2.   5.   9.   0. ]
 [ 1.   7.   0.   2.   7.   1.   1.   3.   5.   0. ]
 [ 0.   3.   0.   1.   3.   0.   0.   1.   2.   0. ]
 [ 0.   2.   0.   1.   2.   0.   0.   1.   1.   0. ]
 [ 0.   0.5  0.   0.   0.5  0.   0.   0.   0.5  0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]]
[[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [-0.02  4.98  0.    0.02  5.01  0.01  0.02 -0.01  5.02  0.03]
 [ 8.99 13.98  0.    0.03 14.02  9.04  9.01  8.97 14.    0.07]
 [ 5.02 11.03  0.    0.08 11.07  5.07  5.06  5.07 11.02  0.05]
 [ 3.01  9.01  0.    0.07  9.05  3.05  3.05  3.06  9.01  0.05]
 [ 2.   11.99  0.    3.07 12.04  2.07  2.07  5.0

We can see that the entries of our approximation $AS$ are pretty close to our data matrix, and the relative error is fairly low. Now that the algorithm works we can finalize it in a function.

In [34]:
def nmfals(data, k, niter, reinit = 5):
    # set to negative one so we can guarantee an update for the first init
    finalerror = -1
    
    # need to compare final error to overall best and store the overall best
    seqerror = np.empty(niter)
    lowesterror = np.empty(1)
    
    # store overall best factor matrices
    lbest = np.random.rand(data.shape[0], k)
    rbest = np.random.rand(k, data.shape[1])
    
    for j in np.arange(reinit):
        # randomly initialize the factor matrices
        lfactor = np.random.rand(data.shape[0], k)
        rfactor = np.random.rand(k, data.shape[1])

        for i in np.arange(niter):
            # sample random row or column
            rowcol = np.random.randint(k)
            # perform linear reg update 
            rfactor[:, rowcol] = np.matmul(np.linalg.pinv(lfactor), data[:, rowcol])
            lfactor[rowcol, :] = np.matmul(data[rowcol, :], np.matmul(rfactor.T, np.linalg.inv(np.matmul(rfactor, rfactor.T))))
            # calculate error after update
            seqerror[i] = np.linalg.norm(data - np.matmul(lfactor, rfactor)) / np.linalg.norm(data)
        # update after first init
        if (finalerror == -1):
            lowesterror = seqerror
            lbest = lfactor
            rbest = rfactor
        # if not first, only update if final error is lower than overall best
        elif (finalerror > seqerror[niter - 1]):
            finalerror = seqerror[niter - 1]
            lowesterror = seqerror
            lbest = lfactor
            rbest = rfactor
    return(lbest, rbest, lowesterror)

In [30]:
np.random.seed(1)
A, S, error = nmfals(X, 3, 100, 10)
print(np.round(np.matmul(A,S), 2))
print(error[90:100])

[[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [-0.27  4.96  0.    1.33  0.66  1.47  2.04  1.07  2.38 -0.05]
 [ 8.92 13.85  0.    6.67  5.13  9.12 11.73  8.94 11.47  2.53]
 [ 2.45  6.63  0.    0.56  0.45  0.9   1.45  0.76  1.08  1.16]
 [ 2.45  8.72  0.    0.43  0.28  0.69  1.31  0.44  0.9   1.33]
 [ 0.72  3.16  0.    0.45  0.28  0.58  0.87  0.46  0.82  0.36]
 [ 0.43  2.91  0.    0.23  0.1   0.27  0.49  0.14  0.45  0.29]
 [ 1.52  6.55  0.    0.77  0.47  1.02  1.59  0.77  1.44  0.79]
 [ 2.09  4.58  0.    0.41  0.38  0.7   1.1   0.65  0.77  0.93]
 [ 0.6   2.38  0.    0.5   0.32  0.64  0.9   0.54  0.89  0.26]
 [ 0.56  2.65  0.    0.33  0.19  0.42  0.66  0.32  0.61  0.3 ]
 [ 1.44  5.17  0.    0.55  0.36  0.76  1.21  0.6   1.03  0.72]]
[0.69647011 0.69647011 0.69646858 0.6964143  0.6964143  0.6964143
 0.69640732 0.69640174 0.69635492 0.69635001]


Unfortunately, the ALS algorithm does not always closely resemble the original data matrix in practice, as the random initializations of $A$ and $S$ can cause the resulting approximation to vary wildly even with multiple iterations. This makes sense, as there are many different factorizations that a matrix can have. While the factorized $A$ and $S$ don't form a matrix that matches $X$ closely, it did preserve the row and column of zeros that were present in $X$. 

One thing to note is we use the relative error to judge the quality of our approximation, which is the Frobenius norm of the difference between our original data matrix $X$ and the approximation $AS$ divided by the Frobenius norm of $X$. We do this rather than just use the raw error since factorizations can be scaled by multiplying $A$ by a number $r$ and $S$ by $\frac{1}{r}$. This scaling also scales the error, hence the need for a relative metric. 

## Randomized Kaczmarz Method

We saw previously that the algorithm to factorize $X$ has two main parts, one to pick our matrix column and row indices and one to take an iterative step towards the local optimum. The iterative step gives us the freedom to choose our favorite method to solve for $A$ and $S$, and rather than doing a traditional least squares method we can try applying the Randomized Kaczmarz(RK) method instead. This iterative step takes our randomly chosen row/column of either $A$ or $S$ and projects it towards the local optimum. It is equivalent to stochastic gradient descent with a specific step size when the matrix is positive definite.

Our current system $AS = X$ is reduced to $As_{:,i} = x_{:,i}$ when a column is sampled. Our RK iterative step then samples a row of $A$ and corresponding entry $k$ of $s_{:,i}$ and would then be the following:

$$s_{:,i}^{(j+1)} = s_{:,i}^{(j)} + \frac{x_{k,i} - a_{k,:}^Ts_{:,i}}{\lvert\lvert{a_{k,:}}\rvert\rvert ^2}a_{k,:}$$

Note that $j$ represents the current iteration of our RK method. This value can be explicitly chosen, making it an additional parameter in this algorithm. If we sampled a row of $S$ rather than a column of $A$, each step would instead be the following:

$$a_{i,:}^{(j+1)} = a_{i,:}^{(j)} + \frac{x_{i,k} - a_{i,:}^Ts_{:,k}}{\lvert\lvert{s_{:,k}}\rvert\rvert ^2}s_{:,k}$$

To summarize, we start by randomly sampling a row/column to reduce to a linear system(as usual). We then proceed to take RK steps, with each step sampling a random row and entry and updating. The number of steps before resampling our linear system can be provided as a parameter. 

We can also perform quick sanity check by initializing $A$ and $S$ close to the original factor matrices like we did with the ALS method.  

In [47]:
np.random.seed(1)
k = 3
niter = 100

A = factors + 0.01*np.random.rand(12, 3)
S = weights + 0.01*np.random.rand(3, 10)

kacziters = 5
for i in np.arange(niter):
    rowcol = np.random.randint(k)
    for i in np.arange(kacziters):
        kaczrow = np.random.randint(len(X[:, rowcol]))
        kaczcol = np.random.randint(len(X[rowcol, :]))
        S[:, rowcol] = S[:, rowcol] + (X[kaczrow, rowcol] - np.matmul(A[kaczrow, :], S[:, rowcol])) / (np.linalg.norm(A[kaczrow, :])**2) * A[kaczrow, :]
        A[rowcol, :] = A[rowcol, :] + (X[rowcol, kaczcol] - np.matmul(A[rowcol, :], S[:, kaczcol])) / (np.linalg.norm(S[:, kaczcol])**2) * S[:, kaczcol] 

approx = np.matmul(A, S)
print(X)
print(np.round(approx, 2))
print(np.linalg.norm(X - approx) / np.linalg.norm(X))

[[ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   5.   0.   0.   5.   0.   0.   0.   5.   0. ]
 [ 9.  14.   0.   0.  14.   9.   9.   9.  14.   0. ]
 [ 5.  11.   0.   0.  11.   5.   5.   5.  11.   0. ]
 [ 3.   9.   0.   0.   9.   3.   3.   3.   9.   0. ]
 [ 2.  12.   0.   3.  12.   2.   2.   5.   9.   0. ]
 [ 1.   7.   0.   2.   7.   1.   1.   3.   5.   0. ]
 [ 0.   3.   0.   1.   3.   0.   0.   1.   2.   0. ]
 [ 0.   2.   0.   1.   2.   0.   0.   1.   1.   0. ]
 [ 0.   0.5  0.   0.   0.5  0.   0.   0.   0.5  0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]]
[[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [-0.47  0.91  0.    0.53  0.33 -0.83 -0.83 -0.3  -0.19  0.  ]
 [ 4.9  -3.07  0.    0.    8.13 10.44 10.38 10.38  8.1   0.03]
 [ 3.24  9.07  0.    0.08 11.07  5.07  5.06  5.07 11.02  0.05]
 [ 2.24  9.    0.    0.07  9.05  3.05  3.05  3.06  9.01  0.05]
 [ 1.11 10.35  0.    3.07 12.04  2.07  2.07  5.0

Compared to the ALS method, the RK method has a higher relative error in its approximation. 

The finalized function is largely similar to the ALS one, except the ALS iterative step is substituted for the RK iteration loop. 

In [44]:
def nmfrk(data, k, niter, kacziter, reinit = 5):
    # set to negative one so we can guarantee an update for the first init
    finalerror = -1
    
    # need to compare final error to overall best and store the overall best
    seqerror = np.empty(niter)
    lowesterror = np.empty(1)
    
    # store overall best factor matrices
    lbest = np.random.rand(data.shape[0], k)
    rbest = np.random.rand(k, data.shape[1])
    
    for j in np.arange(reinit):
        # randomly initialize the factor matrices
        lfactor = np.random.rand(data.shape[0], k)
        rfactor = np.random.rand(k, data.shape[1])
        # outer loop for number of iterations 
        for i in np.arange(niter):
            rowcol = np.random.randint(k)
            # inner loop for number of RK iterations
            for i in np.arange(kacziter):
                kaczrow = np.random.randint(len(X[:, rowcol]))
                kaczcol = np.random.randint(len(X[rowcol, :]))
                rfactor[:, rowcol] = rfactor[:, rowcol] + (data[kaczrow, rowcol] - np.matmul(lfactor[kaczrow, :], rfactor[:, rowcol])) / (np.linalg.norm(lfactor[kaczrow, :])**2) * lfactor[kaczrow, :]
                lfactor[rowcol, :] = lfactor[rowcol, :] + (data[rowcol, kaczcol] - np.matmul(lfactor[rowcol, :], rfactor[:, kaczcol])) / (np.linalg.norm(rfactor[:, kaczcol])**2) * rfactor[:, kaczcol] 
            # calculate error after update
            seqerror[i] = np.linalg.norm(data - np.matmul(lfactor, rfactor)) / np.linalg.norm(data)
        # update after first init
        if (finalerror == -1):
            lowesterror = seqerror
            lbest = lfactor
            rbest = rfactor
        # if not first, only update if final error is lower than overall best
        elif (finalerror > seqerror[niter - 1]):
            finalerror = seqerror[niter - 1]
            lowesterror = seqerror
            lbest = lfactor
            rbest = rfactor
    return(lbest, rbest, lowesterror)

Varying the number of RK iterations per row/column sample does not seem to reliably change the relative error, as the variance of the relative error is quite high despite 10 reinitializations to attempt to mitigate it. 

In [75]:
A, S, error = nmfrk(X, k = 3, niter = 100, kacziter = 1, reinit = 10)

approx = np.matmul(A, S)
print(X)
print(np.round(approx, 2))
print(np.linalg.norm(X - approx) / np.linalg.norm(X))

[[ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   5.   0.   0.   5.   0.   0.   0.   5.   0. ]
 [ 9.  14.   0.   0.  14.   9.   9.   9.  14.   0. ]
 [ 5.  11.   0.   0.  11.   5.   5.   5.  11.   0. ]
 [ 3.   9.   0.   0.   9.   3.   3.   3.   9.   0. ]
 [ 2.  12.   0.   3.  12.   2.   2.   5.   9.   0. ]
 [ 1.   7.   0.   2.   7.   1.   1.   3.   5.   0. ]
 [ 0.   3.   0.   1.   3.   0.   0.   1.   2.   0. ]
 [ 0.   2.   0.   1.   2.   0.   0.   1.   1.   0. ]
 [ 0.   0.5  0.   0.   0.5  0.   0.   0.   0.5  0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]]
[[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 1.69 -1.51 -0.02 -1.02  0.19 -0.5  -1.25  0.   -0.13  0.33]
 [44.05 71.12 -0.23 18.33  8.17 18.86  9.   15.61 13.15 15.95]
 [ 3.54  4.92 -0.02  1.58  0.7   1.63  0.78  1.31  1.02  1.33]
 [ 0.38  0.41  0.    0.41  0.12  0.37  0.28  0.24  0.15  0.2 ]
 [ 2.    4.84 -0.01  1.44  0.41  1.28  0.97  0.9

In [77]:
A, S, error = nmfrk(X, k = 3, niter = 100, kacziter = 5, reinit = 10)

approx = np.matmul(A, S)
print(X)
print(np.round(approx, 2))
print(np.linalg.norm(X - approx) / np.linalg.norm(X))

[[ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   5.   0.   0.   5.   0.   0.   0.   5.   0. ]
 [ 9.  14.   0.   0.  14.   9.   9.   9.  14.   0. ]
 [ 5.  11.   0.   0.  11.   5.   5.   5.  11.   0. ]
 [ 3.   9.   0.   0.   9.   3.   3.   3.   9.   0. ]
 [ 2.  12.   0.   3.  12.   2.   2.   5.   9.   0. ]
 [ 1.   7.   0.   2.   7.   1.   1.   3.   5.   0. ]
 [ 0.   3.   0.   1.   3.   0.   0.   1.   2.   0. ]
 [ 0.   2.   0.   1.   2.   0.   0.   1.   1.   0. ]
 [ 0.   0.5  0.   0.   0.5  0.   0.   0.   0.5  0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]]
[[  0.     0.     0.     0.     0.     0.     0.     0.     0.     0.  ]
 [ 25.67 -17.61   0.     4.12   2.39   0.     1.27   2.45   1.4   -0.15]
 [-26.97 -52.08   0.     0.    -1.59   9.97  10.47   8.58   5.78  -5.  ]
 [  2.44  11.21   0.     0.8    1.19   0.39   0.26   0.44   0.17   0.84]
 [  1.4    7.64   0.     0.58   0.85   0.42   0.33   0.43   0.2    0.55

In [65]:
A, S, error = nmfrk(X, k = 3, niter = 100, kacziter = 10, reinit = 10)

approx = np.matmul(A, S)
print(X)
print(np.round(approx, 2))
print(np.linalg.norm(X - approx) / np.linalg.norm(X))

[[ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   5.   0.   0.   5.   0.   0.   0.   5.   0. ]
 [ 9.  14.   0.   0.  14.   9.   9.   9.  14.   0. ]
 [ 5.  11.   0.   0.  11.   5.   5.   5.  11.   0. ]
 [ 3.   9.   0.   0.   9.   3.   3.   3.   9.   0. ]
 [ 2.  12.   0.   3.  12.   2.   2.   5.   9.   0. ]
 [ 1.   7.   0.   2.   7.   1.   1.   3.   5.   0. ]
 [ 0.   3.   0.   1.   3.   0.   0.   1.   2.   0. ]
 [ 0.   2.   0.   1.   2.   0.   0.   1.   1.   0. ]
 [ 0.   0.5  0.   0.   0.5  0.   0.   0.   0.5  0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0.   0.   0. ]]
[[  0.     0.     0.     0.     0.     0.     0.     0.     0.     0.  ]
 [  0.56  57.59   0.     1.03   3.91   2.75   5.73   2.2    5.    -1.14]
 [-13.25 132.34   0.     6.85  14.    12.03  13.69   6.9   14.06  -4.01]
 [  1.74  -5.37   0.     1.81   1.3    1.29   1.11   0.65   0.76   1.37]
 [  0.62   1.06   0.     0.53   0.51   0.45   0.57   0.27   0.43   0.35