Merge pull request #54 from biocore/fix-ordering

bug fixes in ordering and fraction total variance explained

mortonjt committed Jan 10, 2020
2 parents b1f3705 + 98c3deb commit 7d685fd
Showing 57 changed files with 15,127 additions and 4,669 deletions.
1 change: 1 addition & 0 deletions .coveragerc
@@ -6,6 +6,7 @@ omit =
    */__init__.py
    */base.py
    *deicode/q2/plugin_setup.py
    *deicode/testing.py
source = DEICODE
branch = True
include = */deicode/*
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,31 @@
# DEICODE changelog

## Version 0.2.4 (2020-01-07)

### Features

* Feature presence frequency filter [raised in issue #45](https://github.com/biocore/DEICODE/issues/45)

This new feature lets users choose a number from 0 to 100, with decimals allowed. The filter
removes features that are present in fewer than that percentage of samples. In practice
this filter is useful for removing features that would be difficult to use for log-ratios
downstream in [Qurro](https://github.com/biocore/qurro); a minimal sketch of the idea appears after this list.

* 'optspace' option for n-components, which estimates the rank parameter automatically. See the bug fixes below for more info.
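
As a rough illustration of what the presence frequency filter does, here is a minimal numpy sketch (`presence_filter` is a hypothetical name used for illustration only; the actual implementation in this release operates on biom tables):

```python
import numpy as np

def presence_filter(table, min_feature_frequency):
    # hypothetical helper, not the DEICODE function: keep features
    # that are non-zero in at least min_feature_frequency percent of samples
    n_samples = table.shape[1]
    presence = 100.0 * np.count_nonzero(table > 0, axis=1) / n_samples
    return table[presence >= min_feature_frequency, :]

counts = np.array([[0, 0, 5, 0],   # feature present in 25% of samples
                   [3, 1, 0, 2]])  # feature present in 75% of samples
filtered = presence_filter(counts, min_feature_frequency=50.0)
# only the second feature survives a 50% presence filter
```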

### Bug fixes

* Axis order [raised in issue #52](https://github.com/biocore/DEICODE/issues/52) and [issue #32](https://github.com/biocore/DEICODE/issues/32)

This was partially fixed in PR #33 and PR #34, but a bug remained
in the optspace function [here](https://github.com/biocore/DEICODE/blob/b1f37059b79f6ca2e2db11ba2fb7500c1a92f87e/deicode/optspace.py#L211-L212).
The bug was subtle and only changed the ordering intermittently. Examples of this fix being tested on both
real and simulated data can be found [here](https://nbviewer.jupyter.org/github/cameronmartino/hartig-net/blob/master/percent-explained/real-data-fraction-var.ipynb) and [here](https://nbviewer.jupyter.org/github/cameronmartino/hartig-net/blob/master/percent-explained/simulation-fraction-var.ipynb), respectively.

* Fraction total variance explained [raised in issue #53](https://github.com/biocore/DEICODE/issues/53)

The fraction of total variance explained is currently calculated from the sum of squares of the retained singular values, and the number of singular values depends on the rank input. This causes the resulting fraction of variance explained to change dramatically as the rank is increased or decreased, which has been brought up multiple times on the QIIME 2 forum. For example, see [here](). Additionally, the rank that actually explains the total variance is not clear (i.e. what rank should I choose?). To solve both of these problems, a rank estimate is made based on part C of Keshavan, R. H., Montanari, A. & Oh, S. Low-rank matrix completion with noisy observations: A quantitative comparison. in 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton) 1216–1222 (2009). This can be enabled by setting n-components to 'optspace'. To prevent user confusion, the default was not changed in this version; however, a future warning was added to tell users that 'optspace'-based rank estimation will become the default in the next version.
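
To see why the reported fraction shifts with the rank: the proportion attributed to the i-th axis is s_i² / Σ_j s_j², where the sum runs only over the retained singular values, so truncating at a different rank changes the denominator and therefore every reported fraction. A toy numpy illustration (not DEICODE code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 40))
# all singular values of the matrix, largest first
s = np.linalg.svd(X, compute_uv=False)

for rank in (2, 3, 10):
    s_r = s[:rank]
    # fraction of the retained variance attributed to PC1
    frac_pc1 = s_r[0] ** 2 / np.sum(s_r ** 2)
    print(f"rank={rank}: PC1 fraction = {frac_pc1:.2f}")
```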

## Version 0.2.3 (2019-06-18)

### Backward-incompatible changes [stable]
33 changes: 9 additions & 24 deletions README.md
@@ -15,34 +15,19 @@ To install the most up to date version of deicode, run the following command

**Note**: deicode is not compatible with Python 2; it requires Python 3.4 or later. deicode is currently in alpha. We are actively developing it, and backward-incompatible interface changes may arise.

## Using DEICODE as a standalone tool

```
$ deicode --help
Usage: deicode [OPTIONS]
Runs RPCA with an rclr preprocessing step.
Options:
--in-biom TEXT Input table in biom format. [required]
--output-dir TEXT Location of output files. [required]
--n_components INTEGER The underlying low-rank structure (suggested: 1
< rank < 10) [minimum 2] [default: 3]
--min-sample-count INTEGER Minimum sum cutoff of sample across all
features [default: 500]
--min-feature-count INTEGER Minimum sum cutoff of features across all
samples [default: 10]
--max_iterations INTEGER The number of iterations to optimize the
solution (suggested to be below 100; beware of
overfitting) [minimum 1] [default: 5]
--help Show this message and exit.
```

## Using DEICODE inside [QIIME 2](https://qiime2.org/)

* The QIIME2 forum tutorial can be found [here](https://forum.qiime2.org/t/robust-aitchison-pca-beta-diversity-with-deicode/8333).
* The official plugin docs and tutorial can be found [here](https://library.qiime2.org/plugins/deicode).
* The in-repo tutorial can be found [here](https://github.com/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures.md).
* The in-repo tutorial notebook can be found [here](https://nbviewer.jupyter.org/github/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures.ipynb).

## Using DEICODE as a standalone tool

There are two commands within deicode: `rpca` and `auto-rpca`. The only difference is that `auto-rpca` automatically estimates the underlying rank of the matrix and requires no input for the `n_components` parameter, while in `rpca` the `n_components` parameter must be set explicitly. The structure of the commands follows the QIIME 2 commands exactly, so questions about the use of the tool are answered by the tutorials in the `Using DEICODE inside QIIME 2` section above. However, an example analysis without the use of QIIME 2 can be found [here](https://nbviewer.jupyter.org/github/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures-standalone-cli-and-api.ipynb).

## Using DEICODE as a Python API

The `rpca` functionality of DEICODE is also exposed as a Python API. An example analysis without the use of the command line can be found [here](https://nbviewer.jupyter.org/github/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures-standalone-cli-and-api.ipynb).
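
A minimal sketch of what an API call might look like (assuming the `rpca` and `auto_rpca` functions imported in `deicode/q2/plugin_setup.py` below, with `table.biom` as a placeholder file name; the linked notebook is the authoritative reference):

```python
from biom import load_table
from deicode.rpca import rpca, auto_rpca

table = load_table("table.biom")  # placeholder path to a biom table of counts

# rank set explicitly, mirroring the `rpca` command
ordination, distance = rpca(table, n_components=3)

# rank estimated automatically, mirroring the new `auto-rpca` command
ordination_auto, distance_auto = auto_rpca(table)
```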

## Other Resources

2 changes: 1 addition & 1 deletion deicode/__init__.py
@@ -8,4 +8,4 @@
#
# ----------------------------------------------------------------------------

__version__ = "0.2.3"
__version__ = "0.2.4"
20 changes: 16 additions & 4 deletions deicode/_rpca_defaults.py
@@ -4,12 +4,24 @@
DEFAULT_RANK = 3
DEFAULT_MSC = 500
DEFAULT_MFC = 10
DEFAULT_MFF = 0
DEFAULT_ITERATIONS = 5

DESC_RANK = ("The underlying low-rank structure (suggested: 1 < rank < 10)"
" [minimum 2]")
DESC_MSC = "Minimum sum cutoff of sample across all features"
DESC_MFC = "Minimum sum cutoff of features across all samples"
DESC_RANK = ("The underlying low-rank structure."
" The input can be an integer "
"(suggested: 1 < rank < 10) [minimum 2]."
" Note: as the rank increases the runtime"
" will increase dramatically.")
DESC_MSC = ("Minimum sum cutoff of sample across all features. "
"The value can be at minimum zero and must be an whole"
" integer. It is suggested to be greater than or equal"
" to 500.")
DESC_MFC = ("Minimum sum cutoff of features across all samples. "
"The value can be at minimum zero and must be an whole"
" integer")
DESC_MFF = ("Minimum percentage of samples a feature must appear"
" with a value greater than zero. This value can range"
" from 0 to 100 with decimal values allowed.")
DESC_ITERATIONS = ("The number of iterations to optimize the solution"
" (suggested to be below 100; beware of overfitting)"
" [minimum 1]")
32 changes: 24 additions & 8 deletions deicode/matrix_completion.py
@@ -82,11 +82,6 @@ def __init__(self, n_components=2, max_iterations=5, tol=1e-5):
        self.max_iterations = max_iterations
        self.tol = tol

        if self.n_components < 2:
            raise ValueError("n_components must be at least 2")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")

        return

    def fit(self, X):
@@ -103,13 +98,34 @@ def _fit(self):

        # make copy for imputation, check type
        X_sparse = self.X_sparse
        n, m = X_sparse.shape

        # the data should not contain any infinite values
        if np.count_nonzero(np.isinf(X_sparse)) != 0:
            raise ValueError('Contains either np.inf or -np.inf')

        if self.n_components > np.min(X_sparse.shape):
            raise ValueError(
                'n_components must be less than the minimum shape')
        # validate max_iterations
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")

        # check the settings for n_components
        if isinstance(self.n_components, str) and \
                self.n_components.lower() == 'auto':
            # the rank itself is estimated later, in OptSpace's solve
            self.n_components = 'auto'
        # check hard-set values
        elif isinstance(self.n_components, int):
            if self.n_components > (min(n, m) - 1):
                raise ValueError("n-components must be at most"
                                 " 1 minus the min. shape of the"
                                 " input matrix.")
            if self.n_components < 2:
                raise ValueError("n-components must "
                                 "be at least 2")
        # otherwise raise an error.
        else:
            raise ValueError("n-components must be "
                             "an integer or 'auto'")

        # return solved matrix
        self.U, self.s, self.V = OptSpace(n_components=self.n_components,
107 changes: 99 additions & 8 deletions deicode/optspace.py
@@ -40,10 +40,12 @@ def __init__(
        N = Features (i.e. OTUs, metabolites)
        M = Samples
        n_components: int
        n_components: int or {"optspace"}, optional
            The underlying rank of the dataset.
            This will also control the number of components
            (axis) in the U and V loadings.
            (axis) in the U and V loadings. The value can either
            be hard set by the user or estimated through the
            optspace algorithm.
        max_iterations: int
            The maximum number of convex iterations to optimize the solution
@@ -105,6 +107,28 @@ def solve(self, obs):
        # generate a mask that tracks where missing
        # values exist in the obs dataset
        mask = (np.abs(obs) > 0).astype(np.int)
        # save the shape of the matrix
        n, m = obs.shape
        # get a measure of sparsity level
        total_nonzeros = np.count_nonzero(mask)
        eps = total_nonzeros / np.sqrt(m * n)
        if isinstance(self.n_components, str):
            if self.n_components.lower() == 'auto':
                # estimate the rank of the matrix
                self.n_components = rank_estimate(obs, eps)
            else:
                raise ValueError("n-components must be an "
                                 "integer or 'auto'.")
        # raise future warning if hard set
        elif isinstance(self.n_components, int):
            if self.n_components > (min(n, m) - 1):
                raise ValueError("n-components must be at most"
                                 " 1 minus the min. shape of the"
                                 " input matrix.")
        # otherwise raise an error.
        else:
            raise ValueError("n-components must be "
                             "an integer or 'auto'")
        # the rescaling factor compensates for the smaller average size
        # of the missing entries (mask) with respect to obs
        rescal_param = np.count_nonzero(mask) * self.n_components
@@ -113,13 +137,9 @@
        # our initial guess is the loadings generated
        # by the traditional SVD
        U, S, V = svds(obs, self.n_components, which='LM')
        # save the shape of the matrix
        n, m = obs.shape
        # the shape and number of non-zero values
        # can set the input parameters for the gradient
        # descent
        total_nonzeros = np.count_nonzero(mask)
        eps = total_nonzeros / np.sqrt(m * n)
        rho = eps * n
        U = U * np.sqrt(n)
        V = (V * np.sqrt(m)).T
@@ -208,8 +228,8 @@ def svd_sort(U, S, V):
    # questions/36381356/sort-matrix-based
    # -on-its-diagonal-entries
    S = S[idx, :][:, idx]
    U = U[:, idx[::-1]]
    V = V[:, idx[::-1]]
    U = U[:, idx]
    V = V[:, idx]
    return U, S, V


@@ -366,3 +386,74 @@ def grassmann_manifold_two(U, step_size, n_components):
    step = np.multiply(U, repmat(step, 1, n_components)) / \
        (step_size * n_components)
    return step


def rank_estimate(obs, eps, k=20, lam=0.05,
                  min_rank=2, max_iter=5000):

    """
    This function estimates the rank of a
    sparse matrix (i.e. with missing values).

    Parameters
    ----------
    obs: numpy.ndarray - an rclr preprocessed matrix of shape (M, N)
        with missing values set to zero or np.nan.
        N = Features (i.e. OTUs, metabolites)
        M = Samples
    eps: float - Measure of the level of sparsity
        Equivalent to obs N-non-zeros / sqrt(obs.shape)
    k: int - Max number of singular values / rank
    lam: float - Step in the iteration
    min_rank: int - The min. rank allowed
    max_iter: int - The max. number of iterations allowed

    Returns
    -------
    The estimated rank of the matrix.

    References
    ----------
    .. [1] Part C in Keshavan, R. H., Montanari,
           A. & Oh, S. Low-rank matrix completion
           with noisy observations: A quantitative
           comparison. in 2009 47th Annual Allerton
           Conference on Communication, Control,
           and Computing (Allerton) 1216–1222 (2009).
    """

    # dim. of the data
    n, m = obs.shape
    # get the largest min(k, n, m) - 1 singular values, sorted descending
    s = svds(obs, min(k, n, m) - 1, which='LM',
             return_singular_vectors=False)[::-1]
    # gaps between consecutive singular values
    s_one = s[:-1] - s[1:]
    # normalize the gaps by the mean gap in the tail of the spectrum
    s_one_ = s_one / np.mean(s_one[-10:])
    # iteration one
    r_one = 0
    iter_ = 0
    while r_one <= 0:
        cost = []
        # get the cost
        for idx in range(s_one_.shape[0]):
            cost.append(lam * max(s_one_[idx:]) + idx)
        # estimate the rank
        r_one = np.argmin(cost)
        lam += lam
        iter_ += 1
        if iter_ > max_iter:
            break

    # iteration two
    cost = []
    # estimate the rank
    for idx in range(s.shape[0] - 1):
        cost.append((s[idx + 1] + np.sqrt(idx * eps) * s[0] / eps) / s[idx])
    r_two = np.argmin(cost)
    # return the final estimate
    return max(r_one, r_two, min_rank)
45 changes: 41 additions & 4 deletions deicode/q2/plugin_setup.py
@@ -9,10 +9,10 @@
import qiime2.plugin
import qiime2.sdk
from deicode import __version__
from deicode.rpca import rpca
from deicode.rpca import rpca, auto_rpca
from deicode._rpca_defaults import (DESC_RANK, DESC_MSC, DESC_MFC,
                                    DESC_ITERATIONS)
from qiime2.plugin import (Properties, Int)
                                    DESC_ITERATIONS, DESC_MFF)
from qiime2.plugin import (Properties, Int, Float)
from q2_types.feature_table import FeatureTable, Frequency
from q2_types.distance_matrix import DistanceMatrix
from q2_types.ordination import PCoAResults
@@ -38,6 +38,7 @@
        'n_components': Int,
        'min_sample_count': Int,
        'min_feature_count': Int,
        'min_feature_frequency': Float,
        'max_iterations': Int,
    },
    outputs=[
@@ -51,16 +52,52 @@
        'n_components': DESC_RANK,
        'min_sample_count': DESC_MSC,
        'min_feature_count': DESC_MFC,
        'min_feature_frequency': DESC_MFF,
        'max_iterations': DESC_ITERATIONS,
    },
    output_descriptions={
        'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
        'distance_matrix': ('The Aitchison distance of'
                            'the sample loadings from RPCA.')
                            ' the sample loadings from RPCA.')
    },
    name='(Robust Aitchison) RPCA Biplot',
    description=("Performs robust center log-ratio transform "
                 "robust PCA and ranks the features by the "
                 "loadings of the resulting SVD."),
    citations=[]
)

plugin.methods.register_function(
    function=auto_rpca,
    inputs={'table': FeatureTable[Frequency]},
    parameters={
        'min_sample_count': Int,
        'min_feature_count': Int,
        'min_feature_frequency': Float,
        'max_iterations': Int,
    },
    outputs=[
        ('biplot', PCoAResults % Properties("biplot")),
        ('distance_matrix', DistanceMatrix)
    ],
    input_descriptions={
        'table': 'Input table of counts.',
    },
    parameter_descriptions={
        'min_sample_count': DESC_MSC,
        'min_feature_count': DESC_MFC,
        'min_feature_frequency': DESC_MFF,
        'max_iterations': DESC_ITERATIONS,
    },
    output_descriptions={
        'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
        'distance_matrix': ('The Aitchison distance of'
                            ' the sample loadings from RPCA.')
    },
    name='(Robust Aitchison) RPCA Biplot',
    description=("Performs robust center log-ratio transform "
                 "robust PCA and ranks the features by the "
                 "loadings of the resulting SVD. Automatically"
                 " estimates the underlying rank (i.e. n-components)."),
    citations=[]
)
