Merge pull request #54 from biocore/fix-ordering

bug fixes in ordering and fraction total variance explained

mortonjt committed Jan 10, 2020
2 parents b1f3705 + 98c3deb commit 7d685fd
Showing 57 changed files with 15,127 additions and 4,669 deletions.
1 change: 1 addition & 0 deletions .coveragerc
@@ -6,6 +6,7 @@ omit =
    */__init__.py
    */base.py
    *deicode/q2/plugin_setup.py
    *deicode/testing.py
source = DEICODE
branch = True
include = */deicode/*
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,31 @@
# DEICODE changelog

## Version 0.2.4 (2020-01-07)

### Features

* Feature presence frequency filter [raised in issue #45](https://github.com/biocore/DEICODE/issues/45)

This new feature lets users choose a number from 0 to 100, with decimals allowed. The filter
removes features that are present in fewer than that percentage of samples. In practice
this filter is useful for removing features that would be difficult to use for log-ratios
downstream in [Qurro](https://github.com/biocore/qurro); a minimal sketch of the idea appears after this list.

* 'optspace' option for n-components, which estimates the rank parameter automatically. See the bug fixes below for more info.
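
As a rough illustration of what the presence frequency filter does, here is a minimal numpy sketch (`presence_filter` is a hypothetical name used for illustration only; the actual implementation in this release operates on biom tables):

```python
import numpy as np

def presence_filter(table, min_feature_frequency):
    # hypothetical helper, not the DEICODE function: keep features
    # that are non-zero in at least min_feature_frequency percent of samples
    n_samples = table.shape[1]
    presence = 100.0 * np.count_nonzero(table > 0, axis=1) / n_samples
    return table[presence >= min_feature_frequency, :]

counts = np.array([[0, 0, 5, 0],   # feature present in 25% of samples
                   [3, 1, 0, 2]])  # feature present in 75% of samples
filtered = presence_filter(counts, min_feature_frequency=50.0)
# only the second feature survives a 50% presence filter
```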

### Bug fixes

* Axis order [raised in issue #52](https://github.com/biocore/DEICODE/issues/52) and [issue #32](https://github.com/biocore/DEICODE/issues/32)

This was partially fixed in PR #33 and PR #34, but a bug remained
in the optspace function [here](https://github.com/biocore/DEICODE/blob/b1f37059b79f6ca2e2db11ba2fb7500c1a92f87e/deicode/optspace.py#L211-L212).
The bug was subtle and only changed the ordering intermittently. Examples of this fix being tested on both
real and simulated data can be found [here](https://nbviewer.jupyter.org/github/cameronmartino/hartig-net/blob/master/percent-explained/real-data-fraction-var.ipynb) and [here](https://nbviewer.jupyter.org/github/cameronmartino/hartig-net/blob/master/percent-explained/simulation-fraction-var.ipynb), respectively.

* Fraction total variance explained [raised in issue #53](https://github.com/biocore/DEICODE/issues/53)

The fraction of total variance explained is currently calculated from the sum of squares of the retained singular values, and the number of singular values depends on the rank input. This causes the resulting fraction of variance explained to change dramatically as the rank is increased or decreased, which has been brought up multiple times on the QIIME 2 forum. For example, see [here](). Additionally, the rank that actually explains the total variance is not clear (i.e. what rank should I choose?). To solve both of these problems, a rank estimate is made based on part C of Keshavan, R. H., Montanari, A. & Oh, S. Low-rank matrix completion with noisy observations: A quantitative comparison. in 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton) 1216–1222 (2009). This can be enabled by setting n-components to 'optspace'. To prevent user confusion, the default was not changed in this version; however, a future warning was added to tell users that 'optspace'-based rank estimation will become the default in the next version.
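
To see why the reported fraction shifts with the rank: the proportion attributed to the i-th axis is s_i² / Σ_j s_j², where the sum runs only over the retained singular values, so truncating at a different rank changes the denominator and therefore every reported fraction. A toy numpy illustration (not DEICODE code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 40))
# all singular values of the matrix, largest first
s = np.linalg.svd(X, compute_uv=False)

for rank in (2, 3, 10):
    s_r = s[:rank]
    # fraction of the retained variance attributed to PC1
    frac_pc1 = s_r[0] ** 2 / np.sum(s_r ** 2)
    print(f"rank={rank}: PC1 fraction = {frac_pc1:.2f}")
```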

## Version 0.2.3 (2019-06-18)

### Backward-incompatible changes [stable]
33 changes: 9 additions & 24 deletions README.md
@@ -15,34 +15,19 @@ To install the most up to date version of deicode, run the following command

**Note**: deicode is not compatible with Python 2; it requires Python 3.4 or later. deicode is currently in alpha. We are actively developing it, and backward-incompatible interface changes may arise.

## Using DEICODE as a standalone tool

```
$ deicode --help
Usage: deicode [OPTIONS]
Runs RPCA with an rclr preprocessing step.
Options:
--in-biom TEXT Input table in biom format. [required]
--output-dir TEXT Location of output files. [required]
--n_components INTEGER The underlying low-rank structure (suggested: 1
< rank < 10) [minimum 2] [default: 3]
--min-sample-count INTEGER Minimum sum cutoff of sample across all
features [default: 500]
--min-feature-count INTEGER Minimum sum cutoff of features across all
samples [default: 10]
--max_iterations INTEGER The number of iterations to optimize the
solution (suggested to be below 100; beware of
overfitting) [minimum 1] [default: 5]
--help Show this message and exit.
```

## Using DEICODE inside [QIIME 2](https://qiime2.org/)

* The QIIME2 forum tutorial can be found [here](https://forum.qiime2.org/t/robust-aitchison-pca-beta-diversity-with-deicode/8333).
* The official plugin docs and tutorial can be found [here](https://library.qiime2.org/plugins/deicode).
* The in-repo tutorial can be found [here](https://github.com/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures.md).
* The in-repo tutorial notebook can be found [here](https://nbviewer.jupyter.org/github/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures.ipynb).

## Using DEICODE as a standalone tool

There are two commands within deicode: `rpca` and `auto-rpca`. The only difference is that `auto-rpca` automatically estimates the underlying rank of the matrix and requires no input for the `n_components` parameter, while in `rpca` the `n_components` parameter must be set explicitly. The structure of the commands follows the QIIME 2 commands exactly, so questions about the use of the tool are answered by the tutorials in the `Using DEICODE inside QIIME 2` section above. However, an example analysis without the use of QIIME 2 can be found [here](https://nbviewer.jupyter.org/github/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures-standalone-cli-and-api.ipynb).

## Using DEICODE as a Python API

The `rpca` functionality of DEICODE is also exposed as a Python API. An example analysis without the use of the command line can be found [here](https://nbviewer.jupyter.org/github/biocore/DEICODE/blob/master/ipynb/tutorials/moving-pictures-standalone-cli-and-api.ipynb).
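
A minimal sketch of what an API call might look like (assuming the `rpca` and `auto_rpca` functions imported in `deicode/q2/plugin_setup.py` below, with `table.biom` as a placeholder file name; the linked notebook is the authoritative reference):

```python
from biom import load_table
from deicode.rpca import rpca, auto_rpca

table = load_table("table.biom")  # placeholder path to a biom table of counts

# rank set explicitly, mirroring the `rpca` command
ordination, distance = rpca(table, n_components=3)

# rank estimated automatically, mirroring the new `auto-rpca` command
ordination_auto, distance_auto = auto_rpca(table)
```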

## Other Resources

2 changes: 1 addition & 1 deletion deicode/__init__.py
@@ -8,4 +8,4 @@
#
# ----------------------------------------------------------------------------

__version__ = "0.2.3"
__version__ = "0.2.4"
20 changes: 16 additions & 4 deletions deicode/_rpca_defaults.py
@@ -4,12 +4,24 @@
DEFAULT_RANK = 3
DEFAULT_MSC = 500
DEFAULT_MFC = 10
DEFAULT_MFF = 0
DEFAULT_ITERATIONS = 5

DESC_RANK = ("The underlying low-rank structure (suggested: 1 < rank < 10)"
" [minimum 2]")
DESC_MSC = "Minimum sum cutoff of sample across all features"
DESC_MFC = "Minimum sum cutoff of features across all samples"
DESC_RANK = ("The underlying low-rank structure."
" The input can be an integer "
"(suggested: 1 < rank < 10) [minimum 2]."
" Note: as the rank increases the runtime"
" will increase dramatically.")
DESC_MSC = ("Minimum sum cutoff of sample across all features. "
"The value can be at minimum zero and must be an whole"
" integer. It is suggested to be greater than or equal"
" to 500.")
DESC_MFC = ("Minimum sum cutoff of features across all samples. "
"The value can be at minimum zero and must be an whole"
" integer")
DESC_MFF = ("Minimum percentage of samples a feature must appear"
" with a value greater than zero. This value can range"
" from 0 to 100 with decimal values allowed.")
DESC_ITERATIONS = ("The number of iterations to optimize the solution"
" (suggested to be below 100; beware of overfitting)"
" [minimum 1]")
32 changes: 24 additions & 8 deletions deicode/matrix_completion.py
@@ -82,11 +82,6 @@ def __init__(self, n_components=2, max_iterations=5, tol=1e-5):
        self.max_iterations = max_iterations
        self.tol = tol

        if self.n_components < 2:
            raise ValueError("n_components must be at least 2")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")

        return

    def fit(self, X):
@@ -103,13 +98,34 @@ def _fit(self):

        # make copy for imputation, check type
        X_sparse = self.X_sparse
        n, m = X_sparse.shape

        # the data should not contain any infinite values
        if np.count_nonzero(np.isinf(X_sparse)) != 0:
            raise ValueError('Contains either np.inf or -np.inf')

        if self.n_components > np.min(X_sparse.shape):
            raise ValueError(
                'n_components must be less than the minimum shape')
        # validate max_iterations
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")

        # check the settings for n_components
        if isinstance(self.n_components, str) and \
                self.n_components.lower() == 'auto':
            # the rank itself is estimated later, in OptSpace's solve
            self.n_components = 'auto'
        # check hard-set values
        elif isinstance(self.n_components, int):
            if self.n_components > (min(n, m) - 1):
                raise ValueError("n-components must be at most"
                                 " 1 minus the min. shape of the"
                                 " input matrix.")
            if self.n_components < 2:
                raise ValueError("n-components must "
                                 "be at least 2")
        # otherwise raise an error.
        else:
            raise ValueError("n-components must be "
                             "an integer or 'auto'")

        # return solved matrix
        self.U, self.s, self.V = OptSpace(n_components=self.n_components,
107 changes: 99 additions & 8 deletions deicode/optspace.py
@@ -40,10 +40,12 @@ def __init__(
        N = Features (i.e. OTUs, metabolites)
        M = Samples
        n_components: int
        n_components: int or {"optspace"}, optional
            The underlying rank of the dataset.
            This will also control the number of components
            (axis) in the U and V loadings.
            (axis) in the U and V loadings. The value can either
            be hard set by the user or estimated through the
            optspace algorithm.
        max_iterations: int
            The maximum number of convex iterations to optimize the solution
@@ -105,6 +107,28 @@ def solve(self, obs):
        # generate a mask that tracks where missing
        # values exist in the obs dataset
        mask = (np.abs(obs) > 0).astype(np.int)
        # save the shape of the matrix
        n, m = obs.shape
        # get a measure of sparsity level
        total_nonzeros = np.count_nonzero(mask)
        eps = total_nonzeros / np.sqrt(m * n)
        if isinstance(self.n_components, str):
            if self.n_components.lower() == 'auto':
                # estimate the rank of the matrix
                self.n_components = rank_estimate(obs, eps)
            else:
                raise ValueError("n-components must be an "
                                 "integer or 'auto'.")
        # raise future warning if hard set
        elif isinstance(self.n_components, int):
            if self.n_components > (min(n, m) - 1):
                raise ValueError("n-components must be at most"
                                 " 1 minus the min. shape of the"
                                 " input matrix.")
        # otherwise raise an error.
        else:
            raise ValueError("n-components must be "
                             "an integer or 'auto'")
        # the rescaling factor compensates for the smaller average size
        # of the missing entries (mask) with respect to obs
        rescal_param = np.count_nonzero(mask) * self.n_components
@@ -113,13 +137,9 @@
        # our initial guess is the loadings generated
        # by the traditional SVD
        U, S, V = svds(obs, self.n_components, which='LM')
        # save the shape of the matrix
        n, m = obs.shape
        # the shape and number of non-zero values
        # can set the input parameters for the gradient
        # descent
        total_nonzeros = np.count_nonzero(mask)
        eps = total_nonzeros / np.sqrt(m * n)
        rho = eps * n
        U = U * np.sqrt(n)
        V = (V * np.sqrt(m)).T
@@ -208,8 +228,8 @@ def svd_sort(U, S, V):
    # questions/36381356/sort-matrix-based
    # -on-its-diagonal-entries
    S = S[idx, :][:, idx]
    U = U[:, idx[::-1]]
    V = V[:, idx[::-1]]
    U = U[:, idx]
    V = V[:, idx]
    return U, S, V


@@ -366,3 +386,74 @@ def grassmann_manifold_two(U, step_size, n_components):
    step = np.multiply(U, repmat(step, 1, n_components)) / \
        (step_size * n_components)
    return step


def rank_estimate(obs, eps, k=20, lam=0.05,
                  min_rank=2, max_iter=5000):

    """
    This function estimates the rank of a
    sparse matrix (i.e. with missing values).

    Parameters
    ----------
    obs: numpy.ndarray - an rclr preprocessed matrix of shape (M, N)
        with missing values set to zero or np.nan.
        N = Features (i.e. OTUs, metabolites)
        M = Samples
    eps: float - Measure of the level of sparsity
        Equivalent to obs N-non-zeros / sqrt(obs.shape)
    k: int - Max number of singular values / rank
    lam: float - Step in the iteration
    min_rank: int - The min. rank allowed
    max_iter: int - The max. number of iterations allowed

    Returns
    -------
    The estimated rank of the matrix.

    References
    ----------
    .. [1] Part C in Keshavan, R. H., Montanari,
           A. & Oh, S. Low-rank matrix completion
           with noisy observations: A quantitative
           comparison. in 2009 47th Annual Allerton
           Conference on Communication, Control,
           and Computing (Allerton) 1216–1222 (2009).
    """

    # dim. of the data
    n, m = obs.shape
    # get the largest min(k, n, m) - 1 singular values, sorted descending
    s = svds(obs, min(k, n, m) - 1, which='LM',
             return_singular_vectors=False)[::-1]
    # gaps between consecutive singular values
    s_one = s[:-1] - s[1:]
    # normalize the gaps by the mean gap in the tail of the spectrum
    s_one_ = s_one / np.mean(s_one[-10:])
    # iteration one
    r_one = 0
    iter_ = 0
    while r_one <= 0:
        cost = []
        # get the cost
        for idx in range(s_one_.shape[0]):
            cost.append(lam * max(s_one_[idx:]) + idx)
        # estimate the rank
        r_one = np.argmin(cost)
        lam += lam
        iter_ += 1
        if iter_ > max_iter:
            break

    # iteration two
    cost = []
    # estimate the rank
    for idx in range(s.shape[0] - 1):
        cost.append((s[idx + 1] + np.sqrt(idx * eps) * s[0] / eps) / s[idx])
    r_two = np.argmin(cost)
    # return the final estimate
    return max(r_one, r_two, min_rank)
45 changes: 41 additions & 4 deletions deicode/q2/plugin_setup.py
@@ -9,10 +9,10 @@
import qiime2.plugin
import qiime2.sdk
from deicode import __version__
from deicode.rpca import rpca
from deicode.rpca import rpca, auto_rpca
from deicode._rpca_defaults import (DESC_RANK, DESC_MSC, DESC_MFC,
                                    DESC_ITERATIONS)
from qiime2.plugin import (Properties, Int)
                                    DESC_ITERATIONS, DESC_MFF)
from qiime2.plugin import (Properties, Int, Float)
from q2_types.feature_table import FeatureTable, Frequency
from q2_types.distance_matrix import DistanceMatrix
from q2_types.ordination import PCoAResults
@@ -38,6 +38,7 @@
        'n_components': Int,
        'min_sample_count': Int,
        'min_feature_count': Int,
        'min_feature_frequency': Float,
        'max_iterations': Int,
    },
    outputs=[
@@ -51,16 +52,52 @@
        'n_components': DESC_RANK,
        'min_sample_count': DESC_MSC,
        'min_feature_count': DESC_MFC,
        'min_feature_frequency': DESC_MFF,
        'max_iterations': DESC_ITERATIONS,
    },
    output_descriptions={
        'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
        'distance_matrix': ('The Aitchison distance of'
                            'the sample loadings from RPCA.')
                            ' the sample loadings from RPCA.')
    },
    name='(Robust Aitchison) RPCA Biplot',
    description=("Performs robust center log-ratio transform "
                 "robust PCA and ranks the features by the "
                 "loadings of the resulting SVD."),
    citations=[]
)

plugin.methods.register_function(
    function=auto_rpca,
    inputs={'table': FeatureTable[Frequency]},
    parameters={
        'min_sample_count': Int,
        'min_feature_count': Int,
        'min_feature_frequency': Float,
        'max_iterations': Int,
    },
    outputs=[
        ('biplot', PCoAResults % Properties("biplot")),
        ('distance_matrix', DistanceMatrix)
    ],
    input_descriptions={
        'table': 'Input table of counts.',
    },
    parameter_descriptions={
        'min_sample_count': DESC_MSC,
        'min_feature_count': DESC_MFC,
        'min_feature_frequency': DESC_MFF,
        'max_iterations': DESC_ITERATIONS,
    },
    output_descriptions={
        'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
        'distance_matrix': ('The Aitchison distance of'
                            ' the sample loadings from RPCA.')
    },
    name='(Robust Aitchison) RPCA Biplot',
    description=("Performs robust center log-ratio transform "
                 "robust PCA and ranks the features by the "
                 "loadings of the resulting SVD. Automatically"
                 " estimates the underlying rank (i.e. n-components)."),
    citations=[]
)
