Matrix factorization is used in recommender systems, topic modeling, data compression, and anomaly detection. It is useful for learning latent representations of fixed entities. Sometimes, however, these entities change over time. In such cases, we can extend conventional matrix factorization to sequential matrix factorization.
Matrix factorization deals with the following problem: given a data matrix $X$, find low-rank factors $W$ and $H$ such that $X \approx WH$.
Here we deal with the problem where you have a sequence of data matrices $X_1, X_2, \ldots, X_N$, each with a decomposition of $X_i \approx W_i H$, subject to a transition rule based on a transition matrix $T$, i.e., $W_{i+1} \approx W_i T$. The loss function for optimization combines the distance (e.g., Frobenius norm) between the predicted and actual values of each $X_i$ with a transition term and regularizers:
$ \underset{W_i, H, T}{\mathrm{argmin}}\; \alpha\frac{1}{N}\sum_{i=1}^N \left\lVert W_i H - X_i \right\rVert^2 + (1-\alpha)\frac{1}{N-2}\sum_{i=3}^N \left\lVert W_i H - (W_{i-2}T^2H + W_{i-1}TH) \right\rVert^2 + \beta\sum_{i=1}^N \left\lVert W_i \right\rVert_{2,1}^2 + \kappa\left\lVert H \right\rVert_{1,2}^2 + \eta\sum_{\mathrm{all\ elements}}\left\lVert \mathrm{relu}(-[W, H]) \right\rVert^2 $
The first term is the compression loss: how much information is lost going from the raw representation to the condensed representation. The second term is the transition loss: how well the transition rule predicts the next step, given a compression scheme. The third and fourth terms are weight-decay regularizers aimed at better generalization. They apply a 1-norm along the axis corresponding to the latent features to enforce sparsity, and a 2-norm along the other axis. The final term is a nonnegativity enforcer, which incurs zero loss for nonnegative values and a quadratic penalty for negative ones. The relative importances of the loss terms are given by the coefficients $\alpha$, $\beta$, $\kappa$, and $\eta$.
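As an unofficial sketch, the objective can be expressed in TensorFlow 1.x roughly as follows; the shapes, coefficient values, and variable names are assumptions for illustration, not the exact contents of `main.py`:

```python
import tensorflow as tf

# Assumed shapes: N time steps, n rows (nodes), f raw features, k latent features.
N, n, f, k = 10, 1000, 30, 8
alpha, beta, kappa, eta = 0.5, 1e-3, 1e-3, 1.0    # illustrative coefficient values

X = tf.placeholder(tf.float32, [N, n, f])         # sequence of data matrices X_i
W = tf.Variable(tf.random_normal([N, n, k]))      # one W_i per time step
H = tf.Variable(tf.random_normal([k, f]))         # shared decoder H
T = tf.Variable(tf.eye(k))                        # transition matrix T

WH = tf.einsum('ink,kf->inf', W, H)               # W_i H for every i

# Compression loss: mean squared Frobenius distance between W_i H and X_i.
compression = tf.reduce_mean(tf.reduce_sum(tf.square(WH - X), axis=[1, 2]))

# Transition loss: predict step i (i >= 3) from steps i-2 and i-1 via T^2 and T.
pred = tf.einsum('ink,kf->inf', W[:-2], tf.matmul(tf.matmul(T, T), H)) + \
       tf.einsum('ink,kf->inf', W[1:-1], tf.matmul(T, H))
transition = tf.reduce_mean(tf.reduce_sum(tf.square(WH[2:] - pred), axis=[1, 2]))

# ||W_i||_{2,1}^2: 2-norm over the node axis, then 1-norm over the latent-feature axis.
w_reg = tf.reduce_sum(
    tf.square(tf.reduce_sum(tf.sqrt(tf.reduce_sum(tf.square(W), axis=1)), axis=1)))
# ||H||_{1,2}^2: 1-norm over the latent-feature axis, then 2-norm over the raw-feature axis.
h_reg = tf.reduce_sum(tf.square(tf.reduce_sum(tf.abs(H), axis=0)))

# Nonnegativity penalty: relu(-x)^2 is zero for nonnegative entries.
nonneg = tf.reduce_sum(tf.square(tf.nn.relu(-W))) + tf.reduce_sum(tf.square(tf.nn.relu(-H)))

loss = (alpha * compression + (1 - alpha) * transition
        + beta * w_reg + kappa * h_reg + eta * nonneg)
```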
Graph features (such as those extracted by ReFeX) typically contain large variations in their values. To make the optimization well behaved, we first normalize the graph feature values: we mean normalize each feature's marginal distribution, and some feature distributions are also log-scaled. We show the data exploration and normalization process in this notebook: https://colab.research.google.com/drive/13Gj8qYA2Nl8jucQbWaYeuclmaStkwODL
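A rough illustration of this step follows; the exact choices of which features get log-scaled are made in the notebook above, so `log_scale_cols` here is a hypothetical placeholder:

```python
import numpy as np

def normalize_features(X, log_scale_cols=()):
    """Mean normalize each feature column, log-scaling heavy-tailed ones first.

    X: array of shape [num_rows, num_features]. `log_scale_cols` stands in for
    the columns identified as log-scaled during data exploration.
    """
    X = X.astype(np.float64).copy()
    for j in log_scale_cols:
        X[:, j] = np.log1p(X[:, j])        # log(1 + x) keeps zero entries at zero
    X -= X.mean(axis=0, keepdims=True)     # subtract each feature's marginal mean
    return X
```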
A gif of all the nonzero feature marginal distributions is shown here. The data comes from feature extraction on a graph of transactions at a financial institution.
We define a row to be inactive if it contains all zeros for a particular time step, and we add an additional column that is 1 if the row is inactive for that time step and 0 otherwise. We define a row to be dead if it is inactive for all time steps. If a row is dead, we remove it from the data.
We unit normalize each row in each $X_i$.
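A minimal sketch of these row-handling steps, assuming the data is stacked as a [time steps, rows, features] array (the function and variable names are illustrative, not the repo's API):

```python
import numpy as np

def preprocess_rows(X):
    """X: array of shape [N, n, f] (time steps, rows, features).

    Flags inactive rows, drops dead rows, appends the 0/1 inactivity column,
    and unit normalizes each remaining row.
    """
    X = np.asarray(X, dtype=np.float64)
    inactive = np.all(X == 0, axis=2)             # [N, n]: row all zeros at step i
    dead = np.all(inactive, axis=0)               # [n]: inactive at every time step
    X = X[:, ~dead, :]                            # dead rows are removed entirely
    flag = inactive[:, ~dead, None].astype(X.dtype)
    X = np.concatenate([X, flag], axis=2)         # append the inactivity indicator
    norms = np.linalg.norm(X, axis=2, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)   # unit normalize each row
```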
Dependencies:
- tensorflow 1.9 (haven't tested on other versions)
- numpy

The code is self-contained within the `main.py` file.
An interactive TensorBoard view of the constructed computation graph is available at the following link: https://boards.aughie.org/board/IQg3vzIEAcHviNIniDicLLQ-U_E/
Screenshot of the high-level graph:
The main feed-forward computations are enclosed within their own name scope, which is what produces the collapsed high-level blocks in the graph, as sketched below.
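For illustration only: grouping ops under `tf.name_scope` yields collapsible blocks like these in TensorBoard. The scope names and shapes below are assumptions, not necessarily those in `main.py`:

```python
import tensorflow as tf

# Placeholder shapes for illustration only.
X = tf.placeholder(tf.float32, [10, 1000, 31])
W = tf.Variable(tf.random_normal([10, 1000, 8]))
H = tf.Variable(tf.random_normal([8, 31]))

with tf.name_scope('feed_forward'):        # shows up as one collapsed block
    WH = tf.einsum('ink,kf->inf', W, H)
with tf.name_scope('loss'):
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(WH - X), axis=[1, 2]))

# Write the graph for inspection with: tensorboard --logdir ./logs
tf.summary.FileWriter('./logs', tf.get_default_graph())
```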
General (negative values allowed) matrix factorization yields the following training curve.
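For context, here is a self-contained sketch of the sort of training loop that produces such a curve, for plain sign-unconstrained factorization of a single matrix; the shapes, learning rate, and step count are assumptions:

```python
import numpy as np
import tensorflow as tf

data = np.random.randn(1000, 30).astype(np.float32)   # stand-in for normalized features
X = tf.placeholder(tf.float32, data.shape)
W = tf.Variable(tf.random_normal([1000, 8]))
H = tf.Variable(tf.random_normal([8, 30]))
loss = tf.reduce_sum(tf.square(tf.matmul(W, H) - X))  # compression loss only
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(2000):
        _, l = sess.run([train_op, loss], feed_dict={X: data})
        if step % 200 == 0:
            print(step, l)   # logging these values over steps traces out the curve
```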