-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RPCA and standalone RCLR functionality #36
Conversation
gemelli/_rpca_defaults.py
Outdated
# descriptions. This is used by both the standalone RPCA and QIIME 2 RPCA sides | ||
# of gemelli. | ||
|
||
DEFAULT_RANK = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of these shared with ctf? May make sense to have a single file with defaults if you envision wanting modify a single default/description that is shared by ctf/rpca
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have been merged into one file and reduced.
gemelli/ctf.py
Outdated
@@ -3,8 +3,8 @@ | |||
from pandas import concat | |||
from pandas import DataFrame | |||
from skbio import OrdinationResults, DistanceMatrix | |||
from gemelli.factorization import TensorFactorization | |||
from gemelli.preprocessing import build, rclr | |||
from gemelli.tensor_factorization import TensorFactorization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factorization
-> tensor_factorization
and rclr
-> tensor_rclr
are kind of huge breaking API changes. Definitely requires a major version.
I would be worried if now gemelli.factorization
now automatically refers to matrix factorization or something that. If anyone uses your API, the error message they see will be along the lines of "TensorFactorization
does not exist in gemelli.factorization
" as opposed to "TensorFactorization
has moved from gemelli.factorization
to gemelli.tensor_factorization
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the changes back to factorization
from scipy.spatial import distance | ||
|
||
|
||
class MatrixCompletion(_BaseImpute): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to assume this is copy and pasted from deicode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
gemelli/matrix_completion.py
Outdated
|
||
return | ||
|
||
def fit(self, X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sklearn transformers fit
, and fit_transform
methods typically take an optional y=None
argument (even if y
is not used in the computation) [see here]. Including this argument makes the logic around chaining together multiple models (e.g., fitting deicode, then using it's components in an ML) a lot easier to use, especially if using some of sklearn's super useful pipeline functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
y=None is now added. Thanks!
gemelli/base.py
Outdated
@@ -23,6 +23,10 @@ def fit(self): | |||
def label(self): | |||
""" Placeholder for fit this | |||
should be implemetned by sub-method""" | |||
def transform(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be super hesitant to simply return sample_weights
as the transform
method like this.
Generally the expected behavior of transform
is "I give you data, and you transform it for me". Your transform
method is not doing that. Furthermore, sample_weights
is already accessible as an attribute and you are not doing any extra work on it.
Side note: Is there any way to project new (i.e., not used in fitting procedure) samples into RPCA/CTF space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have been removed or altered to match the PCA function in sklearn.
gemelli/matrix_completion.py
Outdated
self.feature_weights = self.V | ||
self.sample_weights = self.U | ||
|
||
def fit_transform(self, X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
from .base import _BaseConstruct | ||
|
||
|
||
def rclr(T): | ||
def tensor_rclr(T): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, be careful with name changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these have been changed to matrix_rclr
and tensor_rclr
gemelli/preprocessing.py
Outdated
|
||
|
||
def rclr_matrix(M): | ||
def rclr(M): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is. I would strongly caution avoiding this name change. matrix
may make sense as the default. BUT switching the default will strongly affect any user code. Would probably suggest keeping rclr_matrix
, and rclr_tensor
, and keeping rclr
as tensor
. Possible to deprecate, but I would not just go around switching the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these have been changed to matrix_rclr and tensor_rclr
gemelli/q2/tests/simulations.py
Outdated
from __future__ import division | ||
# utils | ||
import pandas as pd | ||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I suggest a different location for this code than a tests
directory? While used in your testing, I could see this code being pretty useful to others outside of a 'unit testing' context, in which it feels weird to import from tests
.
Maybe something like gemelli.simulations
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simulations have been moved to the main package.
gemelli/q2/tests/test_optspace.py
Outdated
np.testing.assert_array_almost_equal(abs(V_exp[:, i]), | ||
abs(V_res[:, i])) | ||
|
||
def test_OptSpace_rank_raises(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could ease debugging if you split these into different tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these have been split. Thanks!
|
||
|
||
@nottest | ||
def create_test_table(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like duplicated code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now not duplicated. Thanks!
ctx.call_on_close(_terribly_handle_brokenpipeerror) | ||
|
||
|
||
import_module('gemelli.scripts._standalone_transforms') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand these imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link may be useful for avoiding the circular import anti-pattern
assert_ordinationresults_equal(ord_res, ord_exp) | ||
|
||
# Lastly, check that gemelli's exit code was 0 (indicating success) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block is repeated a lot. Possible to encapsulate it?
E.g.,
class CliTestCase(unittest.TestCase):
def assertExitCode(self, value, result):
try:
self.assertEqual(0, result.exit_code)
except:
...
And then inheriting CliTestCase
in all of your CLI tests, and calling assertExitCode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this is now implemented. Thanks!
@@ -84,7 +84,7 @@ def run(self): | |||
hit = _version_re.search(f.read().decode('utf-8')).group(1) | |||
version = str(ast.literal_eval(hit)) | |||
|
|||
standalone = ['gemelli=gemelli.scripts._standalone_ctf:standalone_ctf'] | |||
standalone = ['gemelli=gemelli.scripts.__init__:cli'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
standalone = ['gemelli=gemelli.scripts.__init__:cli'] | |
standalone = ['gemelli=gemelli.scripts:cli'] |
gemelli/tensor_factorization.py
Outdated
@@ -521,6 +524,16 @@ def tenals(tensor, | |||
tensor_dimensions = tensor.shape | |||
# Frobenius norm initial for ALS minimization. | |||
initial_tensor_frobenius_norm = norm(tensor)**2 | |||
# rank est. (for each slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem super in scope, ffr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some comments scattered throughout, mostly about structure and repeated code.
Do we need to adjust the |
a7c0c2a
to
3e7f8d8
Compare
This PR merges the repeated functionality of DEICODE which included the rclr transformation and RPCA functions. All previous testing used for RPCA/rclr in DEICODE has been adapted here. The tutorials for DEICODE have been added and altered to function through gemelli. Additionally, the rclr transformation as a standalone command and QIIME2 command has been added (a previously requested functionality from DEICODE). Overall, this prevents any shared functionality from being repeated across repos and causing issues later. This also allows gemelli to function on both cross-sectional and repeated measure experimental setups. The tutorials have been set up this way to minimize confusion on when to use CTF or RPCA.
v0.0.5 -> v0.0.6 (2020-09-27)
Features
Miscellaneous