Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RPCA and standalone RCLR functionality #36

Merged
merged 37 commits into from
Nov 30, 2020
Merged

Add RPCA and standalone RCLR functionality #36

merged 37 commits into from
Nov 30, 2020

Conversation

cameronmartino
Copy link
Collaborator

@cameronmartino cameronmartino commented Sep 27, 2020

This PR merges the repeated functionality of DEICODE which included the rclr transformation and RPCA functions. All previous testing used for RPCA/rclr in DEICODE has been adapted here. The tutorials for DEICODE have been added and altered to function through gemelli. Additionally, the rclr transformation as a standalone command and QIIME2 command has been added (a previously requested functionality from DEICODE). Overall, this prevents any shared functionality from being repeated across repos and causing issues later. This also allows gemelli to function on both cross-sectional and repeated measure experimental setups. The tutorials have been set up this way to minimize confusion on when to use CTF or RPCA.

v0.0.5 -> v0.0.6 (2020-09-27)

Features

  • Robust Aitchison PCA
    • with manually chosen n_components
    • with auto chosen n_components
    • associated tests added for function, standalone, & QIIME2
  • Robust centered log-ratio (rclr) transformation
    • allows users to use the transform alone outside CTF/RPCA
    • associated tests added and extra tests for rclr added
  • Tutorials
    • Cross-sectional for RPCA (standalone & QIIME2)
    • Repeated measure for CTF (standalone & QIIME2)

Miscellaneous

  • auto_rpca rank estimation min rank now 3
  • added citations for CTF & RPCA (linked when appropriate)
  • updated the function descriptions

# descriptions. This is used by both the standalone RPCA and QIIME 2 RPCA sides
# of gemelli.

DEFAULT_RANK = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these shared with ctf? May make sense to have a single file with defaults if you envision wanting modify a single default/description that is shared by ctf/rpca

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have been merged into one file and reduced.

gemelli/ctf.py Outdated
@@ -3,8 +3,8 @@
from pandas import concat
from pandas import DataFrame
from skbio import OrdinationResults, DistanceMatrix
from gemelli.factorization import TensorFactorization
from gemelli.preprocessing import build, rclr
from gemelli.tensor_factorization import TensorFactorization
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factorization -> tensor_factorization and rclr -> tensor_rclr are kind of huge breaking API changes. Definitely requires a major version.

I would be worried if now gemelli.factorization now automatically refers to matrix factorization or something that. If anyone uses your API, the error message they see will be along the lines of "TensorFactorization does not exist in gemelli.factorization" as opposed to "TensorFactorization has moved from gemelli.factorization to gemelli.tensor_factorization"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the changes back to factorization

from scipy.spatial import distance


class MatrixCompletion(_BaseImpute):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to assume this is copy and pasted from deicode

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.


return

def fit(self, X):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sklearn transformers fit, and fit_transform methods typically take an optional y=None argument (even if y is not used in the computation) [see here]. Including this argument makes the logic around chaining together multiple models (e.g., fitting deicode, then using it's components in an ML) a lot easier to use, especially if using some of sklearn's super useful pipeline functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

y=None is now added. Thanks!

gemelli/base.py Outdated
@@ -23,6 +23,10 @@ def fit(self):
def label(self):
""" Placeholder for fit this
should be implemetned by sub-method"""
def transform(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be super hesitant to simply return sample_weights as the transform method like this.

Generally the expected behavior of transform is "I give you data, and you transform it for me". Your transform method is not doing that. Furthermore, sample_weights is already accessible as an attribute and you are not doing any extra work on it.

Side note: Is there any way to project new (i.e., not used in fitting procedure) samples into RPCA/CTF space?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have been removed or altered to match the PCA function in sklearn.

self.feature_weights = self.V
self.sample_weights = self.U

def fit_transform(self, X):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

from .base import _BaseConstruct


def rclr(T):
def tensor_rclr(T):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, be careful with name changes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have been changed to matrix_rclr and tensor_rclr



def rclr_matrix(M):
def rclr(M):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is. I would strongly caution avoiding this name change. matrix may make sense as the default. BUT switching the default will strongly affect any user code. Would probably suggest keeping rclr_matrix, and rclr_tensor, and keeping rclr as tensor. Possible to deprecate, but I would not just go around switching the API.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have been changed to matrix_rclr and tensor_rclr

from __future__ import division
# utils
import pandas as pd
import numpy as np
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I suggest a different location for this code than a tests directory? While used in your testing, I could see this code being pretty useful to others outside of a 'unit testing' context, in which it feels weird to import from tests.

Maybe something like gemelli.simulations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simulations have been moved to the main package.

np.testing.assert_array_almost_equal(abs(V_exp[:, i]),
abs(V_res[:, i]))

def test_OptSpace_rank_raises(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could ease debugging if you split these into different tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have been split. Thanks!



@nottest
def create_test_table():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like duplicated code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now not duplicated. Thanks!

ctx.call_on_close(_terribly_handle_brokenpipeerror)


import_module('gemelli.scripts._standalone_transforms')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand these imports

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link may be useful for avoiding the circular import anti-pattern

assert_ordinationresults_equal(ord_res, ord_exp)

# Lastly, check that gemelli's exit code was 0 (indicating success)
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block is repeated a lot. Possible to encapsulate it?

E.g.,

class CliTestCase(unittest.TestCase):

    def assertExitCode(self, value, result):
        try:
            self.assertEqual(0, result.exit_code)
        except:
            ...

And then inheriting CliTestCase in all of your CLI tests, and calling assertExitCode

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is now implemented. Thanks!

@@ -84,7 +84,7 @@ def run(self):
hit = _version_re.search(f.read().decode('utf-8')).group(1)
version = str(ast.literal_eval(hit))

standalone = ['gemelli=gemelli.scripts._standalone_ctf:standalone_ctf']
standalone = ['gemelli=gemelli.scripts.__init__:cli']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
standalone = ['gemelli=gemelli.scripts.__init__:cli']
standalone = ['gemelli=gemelli.scripts:cli']

@@ -521,6 +524,16 @@ def tenals(tensor,
tensor_dimensions = tensor.shape
# Frobenius norm initial for ALS minimization.
initial_tensor_frobenius_norm = norm(tensor)**2
# rank est. (for each slice)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem super in scope, ffr

Copy link
Member

@gwarmstrong gwarmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some comments scattered throughout, mostly about structure and repeated code.

@gwarmstrong
Copy link
Member

Do we need to adjust the setup.py package_data or MANIFEST.in in order to make sure the test data ships with the package?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants