Add RPCA and standalone RCLR functionality #36

cameronmartino · 2020-09-27T21:47:09Z

This PR merges into gemelli the functionality that was duplicated in DEICODE: the rclr transformation and the RPCA functions. All previous tests for RPCA/rclr in DEICODE have been adapted here. The DEICODE tutorials have been added and altered to run through gemelli. Additionally, the rclr transformation has been added as both a standalone command and a QIIME2 command (previously requested functionality from DEICODE). Overall, this prevents shared functionality from being repeated across repos and causing issues later. It also allows gemelli to function on both cross-sectional and repeated-measures experimental setups. The tutorials have been organized along those lines to minimize confusion about when to use CTF or RPCA.

v0.0.5 -> v0.0.6 (2020-09-27)

Features

Robust Aitchison PCA
- with manually chosen n_components
- with auto chosen n_components
- associated tests added for function, standalone, & QIIME2
Robust centered log-ratio (rclr) transformation
- allows users to use the transform alone outside CTF/RPCA
- associated tests added and extra tests for rclr added
Tutorials
- Cross-sectional for RPCA (standalone & QIIME2)
- Repeated measure for CTF (standalone & QIIME2)
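For readers unfamiliar with the transform, here is a minimal sketch of the rclr idea: take logs of the observed (nonzero) entries only and center each sample on the mean of those logs, leaving zeros as missing values. The function name `rclr_sketch` and the samples-by-features orientation are assumptions for illustration; this is not gemelli's implementation.

```python
import numpy as np


def rclr_sketch(counts):
    """Illustrative rclr on a samples x features count matrix (not gemelli's code)."""
    counts = np.asarray(counts, dtype=float)
    # Zeros are treated as unobserved: they stay NaN instead of receiving a pseudocount.
    logged = np.full(counts.shape, np.nan)
    observed = counts > 0
    logged[observed] = np.log(counts[observed])
    # Center each sample (row) on the mean log of its observed entries only.
    row_means = np.nanmean(logged, axis=1, keepdims=True)
    return logged - row_means


table = np.array([[1.0, 10.0, 100.0],
                  [5.0, 0.0, 5.0]])
transformed = rclr_sketch(table)
```

The NaN entries are what a downstream matrix-completion step would then be expected to handle.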

Miscellaneous

auto_rpca rank estimation min rank now 3
added citations for CTF & RPCA (linked when appropriate)
updated the function descriptions

gwarmstrong · 2020-09-28T17:56:50Z

gemelli/_rpca_defaults.py

+# descriptions. This is used by both the standalone RPCA and QIIME 2 RPCA sides
+# of gemelli.
+
+DEFAULT_RANK = 3


Are any of these shared with ctf? It may make sense to have a single file with defaults if you envision wanting to modify a single default/description that is shared by ctf/rpca.

These have been merged into one file and reduced.

gwarmstrong · 2020-09-28T18:03:25Z

gemelli/ctf.py

@@ -3,8 +3,8 @@
 from pandas import concat
 from pandas import DataFrame
 from skbio import OrdinationResults, DistanceMatrix
-from gemelli.factorization import TensorFactorization
-from gemelli.preprocessing import build, rclr
+from gemelli.tensor_factorization import TensorFactorization


factorization -> tensor_factorization and rclr -> tensor_rclr are kind of huge breaking API changes. Definitely requires a major version.

I would be worried if gemelli.factorization now automatically refers to matrix factorization or something like that. If anyone uses your API, the error message they see will be along the lines of "TensorFactorization does not exist in gemelli.factorization" as opposed to "TensorFactorization has moved from gemelli.factorization to gemelli.tensor_factorization".

Reverted the changes back to factorization
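For cases where a module rename does go ahead, one way to produce the clearer "has moved" message described above is a module-level `__getattr__` (PEP 562, Python 3.7+). The sketch below builds a stand-in module object so it can run on its own; the helper name `make_relocation_stub` is hypothetical, and in a real package the `__getattr__` function would simply be defined at the top level of the old module.

```python
import types


def make_relocation_stub(old_module, new_module, moved_names):
    """Build a stand-in module that explains where moved names now live."""
    stub = types.ModuleType(old_module)

    def __getattr__(name):
        # Called only when normal attribute lookup on the module fails.
        if name in moved_names:
            raise ImportError(
                f"{name} has moved from {old_module} to {new_module}")
        raise AttributeError(
            f"module {old_module!r} has no attribute {name!r}")

    stub.__getattr__ = __getattr__
    return stub


stub = make_relocation_stub("gemelli.factorization",
                            "gemelli.tensor_factorization",
                            {"TensorFactorization"})
try:
    stub.TensorFactorization
except ImportError as err:
    message = str(err)
```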

gwarmstrong · 2020-09-28T18:04:57Z

gemelli/matrix_completion.py

+from scipy.spatial import distance
+
+
+class MatrixCompletion(_BaseImpute):


I am going to assume this is copy and pasted from deicode

gwarmstrong · 2020-09-28T18:13:50Z

gemelli/matrix_completion.py

+
+        return
+
+    def fit(self, X):


Sklearn transformers' fit and fit_transform methods typically take an optional y=None argument (even if y is not used in the computation) [see here]. Including this argument makes it a lot easier to chain multiple models together (e.g., fitting deicode, then using its components in an ML model), especially when using some of sklearn's super useful pipeline functionality.

y=None is now added. Thanks!
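A small sketch of why the signature matters: pipeline-style code passes both X and y to every step, so a step whose `fit` only accepts X would raise a TypeError. `MeanCenter` and `run_steps` are hypothetical stand-ins written in plain Python; `run_steps` only mimics the calling convention of a scikit-learn pipeline and is not its implementation.

```python
class MeanCenter:
    """Toy transformer following the scikit-learn fit/transform signature."""

    def fit(self, X, y=None):
        # y is accepted for API compatibility and ignored.
        n_rows = len(X)
        self.means_ = [sum(col) / n_rows for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def run_steps(steps, X, y=None):
    """Hypothetical stand-in for a pipeline: every step receives X and y."""
    for step in steps:
        X = step.fit_transform(X, y)
    return X


X = [[1.0, 2.0], [3.0, 6.0]]
labels = ["a", "b"]
centered = run_steps([MeanCenter()], X, labels)
```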

gwarmstrong · 2020-09-28T18:22:04Z

gemelli/base.py

@@ -23,6 +23,10 @@ def fit(self):
    def label(self):
        """ Placeholder for fit this
        should be implemetned by sub-method"""
+    def transform(self):


I would be super hesitant to simply return sample_weights as the transform method like this.

Generally the expected behavior of transform is "I give you data, and you transform it for me". Your transform method is not doing that. Furthermore, sample_weights is already accessible as an attribute and you are not doing any extra work on it.

Side note: Is there any way to project new (i.e., not used in fitting procedure) samples into RPCA/CTF space?

These have been removed or altered to match the PCA function in sklearn.
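For reference, the PCA-style convention mentioned here is that `fit` learns a center and a set of components, and `transform` projects whatever data it is given onto them. The sketch below uses a plain SVD on complete data; the class name `ProjectionSketch` is hypothetical, and it does not show how gemelli's RPCA (which works on incomplete, rclr-transformed data) handles new samples.

```python
import numpy as np


class ProjectionSketch:
    """Plain PCA via SVD, used only to illustrate the fit/transform contract."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Rows of Vt are the principal axes of the centered training data.
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self

    def transform(self, X):
        # "I give you data, and you transform it": works on unseen rows too.
        X = np.asarray(X, dtype=float)
        return (X - self.mean_) @ self.components_.T

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


rng = np.random.default_rng(0)
train = rng.normal(size=(6, 4))
held_out = rng.normal(size=(2, 4))
model = ProjectionSketch(n_components=2).fit(train)
projected = model.transform(held_out)
```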

gwarmstrong · 2020-09-28T18:22:35Z

gemelli/matrix_completion.py

+        self.feature_weights = self.V
+        self.sample_weights = self.U
+
+    def fit_transform(self, X):


Same as above

gwarmstrong · 2020-09-28T18:24:04Z

gemelli/preprocessing.py

 from .base import _BaseConstruct


-def rclr(T):
+def tensor_rclr(T):


Same as above, be careful with name changes.

These have been changed to matrix_rclr and tensor_rclr.

gwarmstrong · 2020-09-28T18:28:26Z

gemelli/preprocessing.py



-def rclr_matrix(M):
+def rclr(M):


Here it is. I would strongly caution against this name change. matrix may make sense as the default, BUT switching the default will strongly affect any user code. I would suggest keeping rclr_matrix and rclr_tensor, and keeping rclr as tensor. It is possible to deprecate, but I would not just go around switching the API.

These have been changed to matrix_rclr and tensor_rclr.
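If an old function name needs to keep working for a release or two, the deprecation route mentioned above could look roughly like this. The helper `deprecated_alias` is hypothetical and the body of `tensor_rclr` is a placeholder, not the real transform; whether gemelli keeps such an alias is not stated in this thread.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns, then forwards to the new function."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def tensor_rclr(T):
    # Placeholder body for illustration only.
    return T


# Old name stays importable but tells callers where to go.
rclr = deprecated_alias(tensor_rclr, "rclr")
```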

gwarmstrong · 2020-09-28T18:32:15Z

gemelli/q2/tests/simulations.py

+from __future__ import division
+# utils
+import pandas as pd
+import numpy as np


Can I suggest a different location for this code than a tests directory? While used in your testing, I could see this code being pretty useful to others outside of a 'unit testing' context, in which it feels weird to import from tests.

Maybe something like gemelli.simulations?

The simulations have been moved to the main package.

gwarmstrong · 2020-09-28T18:36:17Z

gemelli/q2/tests/test_optspace.py

+            np.testing.assert_array_almost_equal(abs(V_exp[:, i]),
+                                                 abs(V_res[:, i]))
+
+    def test_OptSpace_rank_raises(self):


could ease debugging if you split these into different tests

These have been split. Thanks!

gwarmstrong · 2020-09-28T18:36:57Z

gemelli/q2/tests/test_rpca_method.py

+
+
+@nottest
+def create_test_table():


This looks like duplicated code

Now not duplicated. Thanks!

gwarmstrong · 2020-09-28T18:38:15Z

gemelli/scripts/__init__.py

+    ctx.call_on_close(_terribly_handle_brokenpipeerror)
+
+
+import_module('gemelli.scripts._standalone_transforms')


I am not sure I understand these imports

This link may be useful for avoiding the circular import anti-pattern

gwarmstrong · 2020-09-28T18:49:22Z

gemelli/scripts/tests/test_standalone_rpca.py

+        assert_ordinationresults_equal(ord_res, ord_exp)
+
+        # Lastly, check that gemelli's exit code was 0 (indicating success)
+        try:


This code block is repeated a lot. Possible to encapsulate it?

E.g.,

    class CliTestCase(unittest.TestCase):
        def assertExitCode(self, value, result):
            try:
                self.assertEqual(0, result.exit_code)
            except:
                ...

And then inheriting CliTestCase in all of your CLI tests, and calling assertExitCode

Good idea, this is now implemented. Thanks!
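A runnable variant of the suggested helper might look like the following. It is a sketch, not the code that was merged: the `expected` parameter and the failure message are assumptions, and `types.SimpleNamespace` stands in for the result object a CLI test runner would return (only `exit_code` and `output` attributes are assumed).

```python
import types
import unittest


class CliTestCase(unittest.TestCase):
    def assertExitCode(self, expected, result):
        """Fail with the command output attached when the exit code differs."""
        if result.exit_code != expected:
            output = getattr(result, "output", "")
            self.fail(f"exit code {result.exit_code} != {expected}\n{output}")


class ExampleCliTest(CliTestCase):
    def test_success(self):
        # Stand-in for the result object returned by a CLI test runner.
        result = types.SimpleNamespace(exit_code=0, output="done")
        self.assertExitCode(0, result)
```

Each CLI test class then inherits from `CliTestCase` and replaces its repeated try/except block with one `assertExitCode` call.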

gwarmstrong · 2020-09-28T19:02:10Z

setup.py

@@ -84,7 +84,7 @@ def run(self):
    hit = _version_re.search(f.read().decode('utf-8')).group(1)
    version = str(ast.literal_eval(hit))

-standalone = ['gemelli=gemelli.scripts._standalone_ctf:standalone_ctf']
+standalone = ['gemelli=gemelli.scripts.__init__:cli']


Suggested change

-standalone = ['gemelli=gemelli.scripts.__init__:cli']
+standalone = ['gemelli=gemelli.scripts:cli']
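The shorter form works because importing a package runs its `__init__.py`, so anything defined there is already an attribute of the package itself. A simplified sketch of resolving a `module:attribute` string is shown below; `resolve_entry_point` is a hypothetical helper, not setuptools' actual code, and the standard-library `json` package is used only because its `dumps` function lives in `json/__init__.py`.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'module.path:attribute' string to the object it names."""
    module_path, _, attribute = spec.partition(":")
    module = importlib.import_module(module_path)
    return getattr(module, attribute)


# json.dumps is defined in json/__init__.py, so the package path
# alone is enough to reach it.
dumps = resolve_entry_point("json:dumps")
```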

gwarmstrong · 2020-09-28T19:03:15Z

gemelli/tensor_factorization.py

@@ -521,6 +524,16 @@ def tenals(tensor,
    tensor_dimensions = tensor.shape
    # Frobenius norm initial for ALS minimization.
    initial_tensor_frobenius_norm = norm(tensor)**2
+    # rank est. (for each slice)


This doesn't seem super in scope, ffr

gwarmstrong

I have some comments scattered throughout, mostly about structure and repeated code.

gwarmstrong · 2020-11-12T18:14:49Z

Do we need to adjust the setup.py package_data or MANIFEST.in in order to make sure the test data ships with the package?

cameronmartino added 24 commits September 10, 2020 15:30

minor typo

bafe9ca

remove unlinked badge

9bb8863

add both citations

017fe62

add rpca transform

b9784cb

update setup for new cli setup

a229bbe

add base rpca functionality

3e5c289

add rpca tests

0ea636b

add rpca standalone

0000870

add standalone tests for rpca

249d882

add rpca methods

e6d1a33

add q2 rpca tests

879d21a

prevent warning when no duplicates

2a3aebb

fix warnings in tests

90ccc38

flake8

a6c3a8a

fix citations

d5f6279

add RPCA tutorials, updated for gemelli

dcd46b2

update readme

9b75e18

update readme

7f1fc56

fix some typos

b70c6da

add a few checks for rank est.

dad08b9

fix test file

573ca0e

add command and tests for standalone & QIIME2 rclr transformation

957bbef

fix rclr command flake8

c7e45f8

ignore some extentions

8c07cf6

cameronmartino requested a review from gwarmstrong September 27, 2020 21:47

update changelog

89a8c32

gwarmstrong reviewed Sep 28, 2020

gwarmstrong reviewed Sep 28, 2020

gwarmstrong requested changes Sep 28, 2020

cameronmartino added 10 commits November 11, 2020 14:46

fix multiple default files issue

f017cf6

revert back tensor_factorization -> factorization

81782fd

add y=None to all fit / fit_transform

c75addb

fix rclr -> matrix_rclr

562b731

move simulations into package

b37f29d

split tests

865e7bb

reduce repeated code

47eda84

reduce repeated code in tests

ad1805e

fix transformation

01c9316

fix pep / flake

f0b0ae4

cameronmartino added 2 commits November 30, 2020 11:20

add test data to MANIFEST.in

0ef2212

remove travis deprecated key sudo

3e7f8d8

gwarmstrong force-pushed the merge_rpca branch from a7c0c2a to 3e7f8d8 Compare November 30, 2020 19:20

gwarmstrong merged commit c419b03 into master Nov 30, 2020
