Refactor redundant code between standalone and QIIME 2 RPCA methods #29
Previously, running deicode standalone from the command line (i.e. just "deicode") would give a confusing error due to the inputs being of type None. Marking certain options as "required" makes it clearer that the user is missing something when they run DEICODE without specifying an input BIOM and/or an output directory.
IMO this makes it a bit easier to use standalone DEICODE. This is sort of a subjective decision, so feel free to reject it if you want :)
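As a rough illustration of what marking options as "required" looks like in Click, here is a minimal sketch. The command name, help strings, and echo message are hypothetical and are not DEICODE's actual code; only `required=True` and the test runner are standard Click features.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--in-biom", required=True, help="Input table in BIOM format.")
@click.option("--output-dir", required=True, help="Directory for output files.")
def standalone_rpca(in_biom, output_dir):
    # Hypothetical body: a real command would load the table and run RPCA here.
    click.echo(f"would run RPCA on {in_biom} and write to {output_dir}")


runner = CliRunner()
# No arguments: Click stops with a usage error before the function body runs.
missing = runner.invoke(standalone_rpca, [])
# Both required options given: the command body runs normally.
ok = runner.invoke(
    standalone_rpca, ["--in-biom", "table.biom", "--output-dir", "out"]
)
```

With this setup the user gets a "Missing option" usage error instead of a failure deep inside the code caused by a `None` input.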
Now, the bulk of the RPCA code is just contained in deicode/rpca.py. The standalone DEICODE code (deicode/scripts/_rpca.py) calls the function in deicode/rpca.py, and does the extra work of loading the BIOM table, writing the output files, etc. that Q2 normally takes care of. (The setup is somewhat similar to how rankratioviz uses generate.py from both its Q2 and standalone codebases; a notable difference is that we don't even have a Q2 _method file here anymore, since the _method code is basically a subset of the code needed for the standalone version.) It's worth testing this a bit, but this should make maintaining things a lot easier.
OTHER NOTES:
- This means that all of the (previously exclusive) Q2 RPCA options are now available when running DEICODE outside of Q2. So now you can specify --minimum-feature-count and --iterations when running DEICODE outside of Q2.
- I changed some of the argument names/help messages to be consistent with Q2. THIS WILL BREAK prior code that uses DEICODE outside of Q2. (However, code that uses DEICODE within Q2 shouldn't be affected.)
- IMO, there is still some unnecessary redundancy in 1) the help messages and 2) the default settings of the various RPCA options. It may be worth defining a module where all of this information is stored to further eliminate redundancy (but we're already doing a lot better on that front than previously).
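The split described in this commit (one shared function with no file handling, plus a thin standalone wrapper that does the loading and writing) can be sketched roughly as follows. Everything here is a stand-in: `core_analysis` is a placeholder computation rather than RPCA, JSON replaces BIOM, and none of the names come from DEICODE.

```python
import json
import os
import tempfile


def core_analysis(table):
    """Stand-in for the shared function: works on in-memory data only."""
    # Placeholder computation (NOT RPCA): one sum per row.
    return [sum(row) for row in table]


def standalone(in_path, output_dir):
    """Thin wrapper doing the I/O that a framework would otherwise handle."""
    with open(in_path) as fh:
        table = json.load(fh)
    result = core_analysis(table)
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "result.json")
    with open(out_path, "w") as fh:
        json.dump(result, fh)
    return out_path


# Exercise the wrapper end to end inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "table.json")
    with open(in_path, "w") as fh:
        json.dump([[1, 2], [3, 4]], fh)
    out_path = standalone(in_path, os.path.join(tmp, "out"))
    with open(out_path) as fh:
        written = json.load(fh)
```

Because the wrapper only adds I/O, any change to the analysis itself happens in one place.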
fixes a bug in the refactoring done in the prior commit. Now, all the tests pass. Potentially worth noting that the test output (at least for the standalone RPCA) looks slightly different from what's in the repository. However, it looks like the values are only slightly different (it's probably due to numpy version diffs/machine precision diffs or something), and all the tests pass now, so I think this should be ok (but this isn't my call to make :)). At this point, I think this branch might be ready to merge back into biocore. But I'd like to see what y'all have to say about it.
This centralizes the various default parameters for the RPCA options, cutting down on redundancy. Each option should only have to be changed once now (so the Q2 and standalone DEICODE versions should be a lot more consistent).
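A minimal sketch of the idea, assuming a shared set of constants that both entry points read. The constant names and values below are hypothetical placeholders, not the contents of DEICODE's defaults module.

```python
# Stand-in for a shared defaults module (hypothetical names and values).
DEFAULT_RANK = 3
DEFAULT_ITERATIONS = 5


def rpca(table, rank=DEFAULT_RANK, iterations=DEFAULT_ITERATIONS):
    """Placeholder for the shared function; it only reports its settings."""
    return {"rank": rank, "iterations": iterations}


def standalone_entry(table, rank=DEFAULT_RANK, iterations=DEFAULT_ITERATIONS):
    """Placeholder for the command-line entry point."""
    return rpca(table, rank=rank, iterations=iterations)


# Both entry points agree because they read the same constants.
from_core = rpca(table=None)
from_cli = standalone_entry(table=None)
```

Changing `DEFAULT_RANK` once changes the default for every caller, which is the consistency this commit is after.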
(forgot to add it in the last commit, whoops)
This should pretty much knock out biocore#28. At this point, pretty much all of the standalone RPCA code is needed specifically for running DEICODE outside of QIIME 2: that is, there's minimal redundancy between the standalone and Q2 RPCA logic.
Merge redundant code
Looks like having less code decreased the overall coverage slightly. See lemurheavy/coveralls-public#565 (comment) for reference.
This is awesome, thank you @fedarko so much for this contribution.
I only have one request: increase the version number and update the changelog. Then I think this is ready to merge!
Got it, changelog and version bump should be ready later today. Does |
Spaces out some bullet points; switch from the arrow syntax for name changes to fancy-ish markdown tables
if this keeps not working i'll just revert to a prev commit
updated changelog accordingly
One more change (in d35b734) -- the output filenames of the ordination and distance matrix produced by the standalone DEICODE RPCA now match the underlying filenames in the artifacts produced by the Q2 DEICODE RPCA. Updated the changelog accordingly.
@cameronmartino Sorry to bother you with a ton of messages. I was looking through the code, and I don't think the standalone RPCA test (currently located in test_rpca.py, but soon to be renamed to Furthermore, I'm somewhat confused about what So: while we're at it with this PR, I think it would be worth updating this test to look at the relevant files: that is, verify that It looks like
And here's the entry for this OTU in
PC2 and PC3's values are approximately equal (ish), but PC1 is flipped (if you negate the PC1 value, it's approximately equal to the other PC1 value). Not sure if this is an indication of a past bug that was fixed, or a bug that has been introduced since
One possible reason why the values are only approximately equal (ignoring the "flipping" thing for now) is that, as mentioned in the changelog, this PR adds some of the stuff the Q2 RPCA was doing that the non-Q2 RPCA wasn't -- that is, adding a PC of 0s when
Also (not to get too nitpicky), it looks like the Q2 RPCA test doesn't look at any of the actual values of the OrdinationResults or distance matrix. Might be worth adding to that to ensure that these values make sense? Although I don't think I understand the math behind DEICODE well enough yet to do that.
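On the flipped PC1: for ordinations built on an SVD or eigendecomposition, each axis is only defined up to sign, so a negated axis is not necessarily a bug. Whether that explains the flip seen here is not established in this thread. A minimal sketch of comparing two coordinate tables up to per-axis sign, with a hypothetical helper name and made-up values:

```python
import numpy as np


def align_signs(reference, candidate):
    """Flip each column of `candidate` that points opposite to `reference`."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    # A negative column-wise dot product means that axis is negated.
    dots = np.sum(reference * candidate, axis=0)
    signs = np.where(dots < 0, -1.0, 1.0)
    return candidate * signs


# Made-up coordinates: the second table has its first axis negated.
expected = np.array([[0.5, 0.1], [-0.2, 0.3]])
observed = expected * np.array([-1.0, 1.0])

aligned = align_signs(expected, observed)
matches_raw = bool(np.allclose(observed, expected))
matches_aligned = bool(np.allclose(aligned, expected))
```

A test written this way still catches real numerical differences, but does not fail on a sign flip alone.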
It's failing, due to the "flipping" bug (?) and other small differences between values as discussed in biocore#29. The "truth" files might just be really old. I also added in "truth_sample_trimmed.txt", which is just a version of the truth_sample.txt file without the OrdinationResults-ish stuff at the end (so you can read it in pandas easier).
Turns out scikit-learn just imports assert_array_almost_equal directly from numpy, which is already a dependency of DEICODE. Seriously, check it out! https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py ...so if nothing else installing DEICODE will now be a tiny bit faster :)
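For reference, a small sketch of calling the NumPy function directly. The arrays are made-up values, not DEICODE output; the `decimal` argument controls the tolerance (the check is roughly `abs(desired - actual) < 1.5 * 10**(-decimal)`).

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# Made-up values that differ around the fifth decimal place.
expected = np.array([0.12345, -0.54321])
observed = np.array([0.12351, -0.54318])

# Passes: the differences are far below 1.5 * 10**-3.
assert_array_almost_equal(observed, expected, decimal=3)

# The default (decimal=6) is stricter and rejects the same pair.
try:
    assert_array_almost_equal(observed, expected)
    strict_passed = True
except AssertionError:
    strict_passed = False
```

Choosing `decimal` explicitly makes it clear how much drift between machines or NumPy versions a test is willing to accept.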
@fedarko I apologize for the late response, thanks for fixing this. This was an oversight on my part.
Apparently this has been addressed in Emperor, per biocore#29.
Probably don't need to go into specifics here (documenting the fixed test bugs is ok, but not sure we need to document just general test updates)
Did some more digging, it looks like python's built-in
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [True, False], "B": [False, False]})
>>> df2 = pd.DataFrame({"A": [False, False], "B": [False, False]})
>>> any(df)
True
>>> any(df2)
True
>>> all(df)
True
>>> all(df2)
True
There are apparently pandas
>>> df.any(axis=None)
True
>>> df2.any(axis=None)
False
>>> df.all(axis=None)
False
>>> df2.all(axis=None)
False
Just pushed a fix for this. I think we're done, then? Thanks!
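The reason the built-ins return `True` in the session above is that iterating over a DataFrame yields its column labels, and non-empty strings are truthy. A short sketch that makes this visible, using the same all-`False` frame:

```python
import pandas as pd

df2 = pd.DataFrame({"A": [False, False], "B": [False, False]})

# Iterating over a DataFrame yields its column labels, not its cell values.
column_labels = list(df2)

# The built-ins therefore only test the truthiness of the labels "A" and "B".
builtin_any = any(df2)
builtin_all = all(df2)

# Reductions over the actual values, across both axes at once.
values_any = bool(df2.any(axis=None))
values_all = bool(df2.all(axis=None))
```

So for checks on the data itself, the DataFrame methods (or the underlying array) are the safer choice.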
@fedarko Tricky! This is good to know, thanks for teasing that apart. Will have to be careful with those data frames and missing values. I think we are good to go, feel free to merge when you are ready. Thanks!
@cameronmartino No problem, happy to help. I don't think I can merge this in since I don't have "write access," though.
Ok, so four things. I finished up a test that generates an Emperor biplot with rank 2, as @cameronmartino and I discussed today. For reference, here's the relevant function (designed to be chucked into the
def test_rank2_matrix(self):
    """Verifies that using a rank 2 matrix works in DEICODE and Emperor."""
    ordination_qza, _ = q2deicode.actions.rpca(self.q2table, rank=2)
    # Generate fake sample metadata because it's required
    # self.table_sample_ids is just a list of the IDs in self.q2table
    fake_metadata_column_of_0s = [0] * len(self.table_sample_ids)
    sample_metadata_df = pd.DataFrame({"blah": fake_metadata_column_of_0s},
                                      index=self.table_sample_ids)
    sample_metadata_df.index.name = "Sample ID"
    # "Metadata" is imported from qiime2
    sample_metadata = Metadata(sample_metadata_df)
    biplot = q2emperor.actions.biplot(ordination_qza, sample_metadata)
    biplot.visualization.save("biplot.qzv")
I noticed a few things while writing/testing this code:
Anyway, that's what I've found so far. Sorry for the walls of text. @cameronmartino @ElDeveloper maybe we can meet sometime tomorrow / later this week to go over some or all of these points? I'm heading home now but I can try to help out with finishing this PR / squashing the relevant bugs this week. Thanks!
… in scripts now off by third dec.
@fedarko (4) Nice catch, I just pushed the iteration in the OptSpace helper by one. It should have little to no effect, because one extra iteration won't change much once it has converged. In the test, I think it changed the third or fourth decimal; I updated the test to reflect this.
…ing. Also added rank==2 addition of zeros in the PC3 to prevent breaking emperor
@fedarko (2/3) I added the zeros back for now; this also gives us some backward compatibility with older versions of Emperor. I think what is breaking this is that only two axes are given: even in the 2D case, Emperor seems to expect more than two axes in the ordination file. Unless @ElDeveloper objects to this workaround, I will keep it.
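The workaround (adding an all-zero third axis when rank is 2) can be sketched as below. This is an illustration with a hypothetical helper name and made-up coordinates, not the code from this PR; a full ordination object would presumably also need matching entries for its eigenvalues and proportion explained.

```python
import numpy as np


def pad_to_three_axes(coords):
    """Append zero-filled columns until the coordinates have three axes."""
    coords = np.asarray(coords, dtype=float)
    missing = 3 - coords.shape[1]
    if missing <= 0:
        # Already three or more axes: nothing to add.
        return coords
    zeros = np.zeros((coords.shape[0], missing))
    return np.hstack([coords, zeros])


# Made-up rank-2 sample coordinates (4 samples x 2 axes).
rank2_coords = np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.0], [0.0, 0.4]])
padded = pad_to_three_axes(rank2_coords)
```

The extra axis carries no information, so the first two columns are unchanged by the padding.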
…ented in skbio), Remove coverge blocks on code and add tests for them
I believe the workaround is OK to keep.
…On (Apr-09-19|10:45), Cameron Martino wrote:
@fedarko (**2/3**) I added the zeros back for now; this also gives us some backward compatibility with older versions of Emperor. I think what is breaking this is that only two axes are given: even in the 2D case, Emperor seems to expect more than two axes in the ordination file. Unless @ElDeveloper objects to this workaround, I will keep it.
Should make testing easier
asdofijosdifjoaisdjfoaidsj
@fedarko Thanks again for this contribution. Will merge and update conda today.
--in_biom and --output_dir).
deicode/rpca.py (which contains rpca(), a function that runs RPCA given various parameters)
--in_biom -> --in-biom
--output_dir -> --output-dir
--min_sample_depth -> --min-sample-count
deicode/scripts/_rpca.py has been renamed to deicode/scripts/_standalone_rpca.py. I updated deicode/scripts/__init__.py and setup.py accordingly.
--min-feature-count and --iterations options have been added to the standalone code. This means that all of the functionality from the Q2 version of DEICODE should now be available in the standalone DEICODE version. (...this was the main reason I did this today :))
deicode/_rpca_defaults.py. This way, parameter settings only have to be modified once to take effect.
show_default Click argument is used for all the standalone RPCA parameters with default values.
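To illustrate the last point, here is a minimal sketch of Click's `show_default` on two of the options named above. The default values, help strings, and command body are hypothetical placeholders rather than DEICODE's actual settings.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--min-feature-count", default=10, show_default=True,
              help="Minimum feature count.")
@click.option("--iterations", default=5, show_default=True,
              help="Number of iterations.")
def standalone_rpca(min_feature_count, iterations):
    # Hypothetical body: just report the settings that would be used.
    click.echo(f"min_feature_count={min_feature_count} iterations={iterations}")


runner = CliRunner()
# The generated --help text lists each default next to its option.
help_text = runner.invoke(standalone_rpca, ["--help"]).output
# Running with no arguments falls back to the declared defaults.
default_run = runner.invoke(standalone_rpca, []).output
```

With `show_default=True`, users can see the default for each parameter from `--help` without reading the source.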