Refactor redundant code between standalone and QIIME 2 RPCA methods #29

fedarko · 2019-04-01T22:35:48Z

Add minimal required options (--in_biom and --output_dir).
Try to create the specified output directory (if it already exists, this still works fine).
A few misc. typo fixes
Fix Duplicated Scripts #28: abstract shared RPCA functionality to a new module, deicode/rpca.py (which contains rpca(), a function that runs RPCA given various parameters)
- The QIIME 2 plugin setup directly references this module's function. From the perspective of a Q2 DEICODE plugin user, nothing should change.
- The standalone DEICODE code also references this module's function. In order to make things a bit more consistent with the Q2 code, this includes various minor breaking changes (i.e. some commands have been renamed). To be exact:
  - --in_biom -> --in-biom
  - --output_dir -> --output-dir
  - --min_sample_depth -> --min-sample-count
- For the sake of clarity (and to avoid namespace conflicts), deicode/scripts/_rpca.py has been renamed to deicode/scripts/_standalone_rpca.py. I updated deicode/scripts/__init__.py and setup.py accordingly.
- Also, the --min-feature-count and --iterations options have been added to the standalone code. This means that all of the functionality from the Q2 version of DEICODE should now be available in the standalone DEICODE version. (...this was the main reason I did this today :))
Abstract shared RPCA parameter settings (descriptions and default values) to a new module, deicode/_rpca_defaults.py. This way, parameter settings only have to be modified once to take effect.
I forgot to mention it in the commit message, but b2f7c4e makes it so that the show_default Click argument is used for all the standalone RPCA parameters with default values.

Previously, running deicode standalone from the command line (i.e. just "deicode") would give a confusing error due to the inputs being of type None. Marking certain options as "required" makes it clearer that the user is missing something when they run DEICODE without specifying an input BIOM and/or an output directory.

IMO this makes it a bit easier to use standalone DEICODE. This is sort of a subjective decision, so feel free to reject it if you want :)

Now, the bulk of the RPCA code is just contained in deicode/rpca.py. The standalone DEICODE code (deicode/scripts/_rpca.py) calls the function in deicode/rpca.py, and does the extra work of loading the BIOM table, writing the output files, etc. that Q2 normally takes care of. (The setup is somewhat similar to how rankratioviz uses generate.py from both its Q2 and standalone codebases; a notable difference is that we don't even have a Q2 _method file here anymore, since the _method code is basically a subset of the code needed for the standalone version.) It's worth testing this a bit, but this should make maintaining things a lot easier. OTHER NOTES: -This means that now all of the (previously exclusive) Q2 RPCA options are now available when running DEICODE outside of Q2. So now you can specify --minimum-feature-count and --iterations when running DEICODE outside of Q2. -I changed some of the argument names/help messages to be consistent with Q2. THIS WILL BREAK prior code that uses DEICODE outside of Q2. (However, code that uses DEICODE within Q2 shouldn't be affected.) -IMO, there is still some non-necessary redundancy in 1) the help messages and 2) the default settings of the various RPCA options. It may be worth defining a module where all of this information is stored to further eliminate redundancy (but we're already doing a lot better on that front than previously).

fixes a bug in the refactoring done in the prior commit. Now, all the tests pass. Potentially worth noting that the test output (at least for the standalone RPCA) looks slightly different from what's in the repository. However, it looks like the values are only slightly different (it's probably due to numpy version diffs/machine precision diffs or something), and all the tests pass now, so I think this should be ok (but this isn't my call to make :)). At this point, I think this branch might be ready to merge back into biocore. But I'd like to see what y'all have to say about it.

This centralizes the various default parameters for the RPCA options, cutting down on redundancy. Each option should only have to be changed once now (so the Q2 and standalone DEICODE versions should be a lot more consistent).

(forgot to add it in the last commit, whoops)

This should pretty much knock out biocore#28. At this point, pretty much all of the standalone RPCA code is needed specifically for running DEICODE outside of QIIME 2: that is, there's minimal redundancy between the standalone and Q2 RPCA logic.

Merge redundant code

fedarko · 2019-04-02T02:37:15Z

Looks like having less code decreased the overall coverage slightly. See lemurheavy/coveralls-public#565 (comment) for reference.

cameronmartino

This is awesome, thank you @fedarko so much for this contribution.

I only have one request to increase the version number and update the changelog. Then I think this is ready to merge!

fedarko · 2019-04-02T19:46:55Z

Got it, changelog and version bump should be ready later today. Does 0.1.7 sound good as a new version number? Since some of these changes aren't backwards compatible, not sure if e.g. 0.2.0 might be a better option. (Your call, of course :))

Spaces out some bullet points; switch from the arrow syntax for name changes to fancy-ish markdown tables

if this keeps not working i'll just revert to a prev commit

updated changelog accordingly

fedarko · 2019-04-02T20:54:05Z

One more change (in d35b734) -- the output filenames of the ordination and distance matrix produced by the standalone DEICODE RPCA now match the underlying filenames in the artifacts produced by the Q2 DEICODE RPCA. Updated the changelog accordingly.

fedarko · 2019-04-03T01:00:13Z

@cameronmartino Sorry to bother you with a ton of messages. I was looking through the code, and I don't think the standalone RPCA test (currently located in test_rpca.py, but soon to be renamed to test_standalone_rpca.py) is correctly checking the output of anything. It looks like dist_res, fea_res, etc. will naturally be identical to dist_exp, fea_exp, etc. every time, since they're looking at the same exact files.

Furthermore, I'm somewhat confused about what feature.txt, samples.txt, etc. are supposed to be. It looks like these are earlier versions of the OrdinationResults output? Neither of these files -- nor distance.txt -- changes when the test is run. So, regardless of what the standalone RPCA code is doing, this test will always pass so long as it doesn't raise an error.

So: while we're at it with this PR, I think it would be worth updating this test to look at the relevant files: that is, verify that distance-matrix.tsv and ordination.txt (the new names for RPCA_distance.txt and RPCA_Ordination.txt) approximately match up with the ground truths.

It looks like assert_array_almost_equal() is still True when comparing distance-matrix.tsv and truth_distance.txt. However, I've been messing around with testing that the Ordination output matches truth_feature.txt and truth_samples.txt and it looks like the signs of the first PC's values are flipped between the "truth" feature and sample values and their respective rows in the Ordination. As an example, here's DEICODE's rank values for OTU_2414 in the test data:

OTU_2414 -4.094486487895422 2.7467056679841555 1.3098674678907734

And here's the entry for this OTU in truth_feature.txt:

OTU_2414 4.0930150209210066 2.749414394166293 1.3179536195044124

PC2 and PC3's values are approximately equal (ish), but PC1 is flipped (if you negate the PC1 value, it's approximately equal to the other PC1 value). Not sure if this is an indication of a past bug that was fixed, or a bug that has been introduced since truth_feature.txt was created.

One possible reason why the values are only approximately equal (ignoring the "flipping" thing for now) is that, as mentioned in the changelog, this PR adds some of the stuff the Q2 RPCA was doing that the non-Q2 RPCA wasn't -- that is, adding a PC of 0s when rank == 2, and using --min-feature-count with a default value of 10.

Also (not to get too nitpicky), but it looks like the Q2 RPCA test doesn't looking any of the actual values of the OrdinationResults or distance matrix. Might be worth adding to that to ensure that these values make sense? Although I don't think I understand the math behind DEICODE well enough yet to do that.

It's failing, due to the "flipping" bug (?) and other small differences between values as discussed in biocore#29. The "truth" files might just be really old. I also added in "truth_sample_trimmed.txt", which is just a version of the truth_sample.txt file without the OrdinationResults-ish stuff at the end (so you can read it in pandas easier).

Turns out scikit-learn just imports assert_array_almost_equal directly from numpy, which is already a dependency of DEICODE. Seriously, check it out! https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py ...so if nothing else installing DEICODE will now be a tiny bit faster :)

cameronmartino · 2019-04-03T15:50:08Z

@fedarko I apologize for the late response, thanks for fixing this. This was an oversight on my part.

Comparing files in the tests -- This makes perfect sense. The distance matrix should always be close and should not have negative inverses etc... The loadings, however, are probably only directly comparable by absolute values. This is because of the sample loadings and feature loadings can also be the inverse, everything is relative and directionality is not absolute in ordination (but one cannot be different and the other not change). Basically, if you multiply both sample/feature loadings in any of the "PCs" by negative one it won't change the real interpretation, you would just be flipping the numerator and denominator in the log-ratio. :)
Editing tests -- It is definitely worth making all the tests check the real values, you are definitely right. As with above, it would probably be best to just check between the absolute values to some rounding error.
I think we can drop the adding zeros for rank < 3, that is updated in emperor now and will just default to 2D in that case (so that code is no longer needed). The Q2 script is definitely the most up to date script.

Apparently this has been addressed in Emperor, per biocore#29.

Probably don't need to go into specifics here (documenting the fixed test bugs is ok, but not sure we need to document just general test updates)

fedarko · 2019-04-08T20:01:09Z

Did some more digging, it looks like python's built-in any() and all() don't work as expected with pandas (see here for details). It looks like they return True for inputs where they should return False -- here are the results from briefly testing this (input DFs based on the pandas documentation):

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [True, False], "B": [False, False]})
>>> df2 = pd.DataFrame({"A": [False, False], "B": [False, False]})
>>> any(df)
True
>>> any(df2)
True
>>> all(df)
True
>>> all(df2)
True

There are apparently pandas .any() and .all() DataFrame functions that actually work as you'd expect:

>>> df.any(axis=None)
True
>>> df2.any(axis=None)
False
>>> df.all(axis=None)
False
>>> df2.all(axis=None)
False

Just pushed a fix for this. I think we're done, then?

Thanks!

cameronmartino · 2019-04-08T20:24:26Z

@fedarko Tricky! This is good to know, thanks for teasing that apart. Will have to be careful with those data frames and missing values. I think we are good to go, feel free to merge when you are ready. Thanks!

fedarko · 2019-04-08T20:39:45Z

@cameronmartino No problem, happy to help. I don't think I can merge this in since I don't have "write access," though.

For --rank and --iterations

fedarko · 2019-04-09T02:12:26Z

Ok, so four things.

I finished up a test that generates an Emperor biplot with rank 2, as @cameronmartino and I discussed today. For reference, here's the relevant function (designed to be chucked into the Test_qiime2_rpca class in the DEICODE Q2 tests file, after making a few imports/etc.):

    def test_rank2_matrix(self):
        """Verifies that using a rank 2 matrix works in DEICODE and Emperor."""

        ordination_qza, _ = q2deicode.actions.rpca(self.q2table, rank=2)
        # Generate fake sample metadata because it's required
        # self.table_sample_ids is just a list of the IDs in self.q2table
        fake_metadata_column_of_0s = [0] * len(self.table_sample_ids)
        sample_metadata_df = pd.DataFrame({"blah": fake_metadata_column_of_0s},
                                          index=self.table_sample_ids)
        sample_metadata_df.index.name = "Sample ID"
        # "Metadata" is imported from qiime2
        sample_metadata = Metadata(sample_metadata_df)
        biplot = q2emperor.actions.biplot(ordination_qza, sample_metadata)
        biplot.visualization.save("biplot.qzv")

I noticed a few things while writing/testing this code:

DEICODE didn't really check the rank or iterations parameters' values up front, so for the sake of clarity I added some basic parameter checking in 140bd75 (where it ensures that --rank is at least 2, and that --iterations is at least 1). (Previously, using a rank of 0 or less would still give you an error from scipy.sparse.linalg.svds(), but the error message wasn't super clear and this still allowed a rank of 1. So I figure this is a beneficial change.)
q2-emperor can produce a biplot from a rank 2 ordination (as is generated in the new test code), but it fails when loading the resulting QZV files in the browser (checking the developer console, I get the same error messages* as I do when trying to load bray_curtis_emperor.qzv from this forum post). This occurs with Chrome 73.0.3683.86 and QIIME 2 2019.1.

I was under the impression that biocore/emperor@a93f029 had fixed this problem in Emperor. Maybe this just wasn't included in the q2-emperor 2019.1 release? (If needed, we can just add back the "add a PC of zeroes" approach to DEICODE in the meantime.)

For reference: here's a link to the biplot QZV file produced by the test code at the top of this post (sorry, github doesn't let me upload QZVs for some reason). And for comparison, here's a link to the biplot QZV file produced with the same code but using rank=3 in the call to rpca() instead of rank=2 (just to show that the rank=2 is the problem here).

^{*with slightly different line numbers, for some reason? I think this is due to Emperor version differences within the forum's QZV file, since I'm using Q2 2019.1 and that post was from 2018.}
Since Emperor is failing in the browser, I'm not sure that a test in DEICODE is the best way to check this error (it seems like it'd be easier to do the testing in Emperor's repository's JS tests, if anything). This is why I haven't committed the test code in this post yet.
This isn't related to points 2 and 3 in this post, but I think I might have noticed a bug in deicode/_optspace.py (see here). Since the code uses 1 as a starting point in the iteration loop -- and since it assigns to index i+1 in the dist ndarray -- dist will always have a value of 0 at its second position (index 1), regardless of niter (which is just --iterations). This also means that the loop's body won't execute when --iterations is 1, since len(range(1, 1)) == 0.

You can test this out by adding a line that says print(dist) right after the for loop I linked to. I'm not familiar enough with the OptSpace algorithm to know the best way to fix this (or what its impacts on prior DEICODE output might have been), but this seems like unintentional behavior to me.

Anyway, that's what I've found so far. Sorry for the walls of text. @cameronmartino @ElDeveloper maybe we can meet sometime tomorrow / later this week to go over some or all of these points? I'm heading home now but I can try to help out with finishing this PR / squashing the relevant bugs this week. Thanks!

… in scripts now off by third dec.

cameronmartino · 2019-04-09T17:21:40Z

@fedarko (4) Nice catch, I just pushed the iteration in the OptSpace helper by one. It should have little to no effect because of one iteration won't affect much once it has converged. I think in the test it changed in the third or fourth decimal, I updated the test to reflect this.

cameronmartino · 2019-04-09T17:30:09Z

@fedarko (1) I agree but I moved this into the actual optspace.py function (here) so it is always restricted.

…ing. Also added rank==2 addition of zeros in the PC3 to prevent breaking emperor

cameronmartino · 2019-04-09T17:45:05Z

@fedarko (2/3) I added the zeros back, for now, this also allows us some ability to be back compatible with older versions of Emperor. I think what is breaking this is that there are only two axes given. When even in the 2D case emperor seems to expect to see more than two axes in the Ordination file. Unless @ElDeveloper objects to this workaround, I will keep it.

…tring

…ented in skbio), Remove coverge blocks on code and add tests for them

ElDeveloper · 2019-04-09T19:38:56Z

I believe the workaround is OK to keep.

…

On (Apr-09-19|10:45), Cameron Martino wrote: @fedarko (**2/3**) I added the zeros back for now, this also allows us some ability to be backcompatiable with older versions of Emperor. I think what is breaking this is that there are only two axis given. When even in the 2D case emperor seems to expect to see more than two axis in the Ordination file. Unless @ElDeveloper objects to this work around, I will keep it. -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #29 (comment)

This adds back the text first added to the changelog in 5daf5de, and removes the text added in ce05e00.

Should make testing easier

asdofijosdifjoaisdjfoaidsj

cameronmartino · 2019-04-10T14:14:50Z

@fedarko Thanks again for this contribution. Will merge and update conda today.

fedarko and others added 11 commits April 1, 2019 15:03

ENH: Standalone DEICODE: try creating output dir

84e7ad5

IMO this makes it a bit easier to use standalone DEICODE. This is sort of a subjective decision, so feel free to reject it if you want :)

MAINT: fix some typos

02e0422

MAINT: fix tiny typo in moving pictures tutorial

ce5fcad

MAINT: Create and integrate _rpca_defaults module

e09a27c

This centralizes the various default parameters for the RPCA options, cutting down on redundancy. Each option should only have to be changed once now (so the Q2 and standalone DEICODE versions should be a lot more consistent).

MAINT/BUG: actually add in _rpca_defaults module

7b3ffa8

(forgot to add it in the last commit, whoops)

MAINT: Update setup standalone entry point

8dbe18d

Merge pull request #1 from fedarko/merge_redundant_code

6e2d64a

Merge redundant code

fedarko changed the title ~~Some small standalone DEICODE improvements~~ Refactor redundant code between standalone and QIIME 2 RPCA methods Apr 2, 2019

cameronmartino requested review from cameronmartino and ElDeveloper April 2, 2019 10:29

cameronmartino reviewed Apr 2, 2019

View reviewed changes

fedarko added 5 commits April 2, 2019 13:14

REL: Update changelog; bump version to 0.1.7 biocore#29

5daf5de

STY: make changelog a little prettier

e9b4fff

Spaces out some bullet points; switch from the arrow syntax for name changes to fancy-ish markdown tables

STY: oh geez attempt to fix changelog tables

a9c3514

STY: another stab at these tables

c6ed781

if this keeps not working i'll just revert to a prev commit

ENH: Update non-Q2 output filenames to match Q2

d35b734

updated changelog accordingly

fedarko added 3 commits April 2, 2019 18:21

REL: Small fix to changelog re new output filenms

8321514

fedarko added 2 commits April 3, 2019 10:58

DEP: Don't add zeroes if rank < 3

09b4bf3

Apparently this has been addressed in Emperor, per biocore#29.

REL: Mention creating PC3 of 0s has been rm'd

ce05e00

fedarko added 2 commits April 8, 2019 12:26

REL/TST: Add to changelog re: test code updates

32b726b

Probably don't need to go into specifics here (documenting the fixed test bugs is ok, but not sure we need to document just general test updates)

TST: Properly check for NaN values in Ord. Results

ba939ed

REL: Update changelog re: date + q2 test bugfix

3673044

fedarko added 3 commits April 8, 2019 13:52

REL: Make version 0.2.0; add non-Q2 docs to README

3548664

ENH/BUG: Add parameter sanity checking

140bd75

For --rank and --iterations

DOC: Update README re: recently added minimum vals

1289649

add iteration plus one in OptSpace to fix indexing issue, update test…

bd4a278

… in scripts now off by third dec.

moved rank<2 & iteration<1 value_errors to optspace along with docstr…

c448b2d

…ing. Also added rank==2 addition of zeros in the PC3 to prevent breaking emperor

cameronmartino added 2 commits April 9, 2019 10:49

removed rank<2 & iteration<1 value_errors to optspace along with docs…

3820c06

…tring

Add more tests to OptSpace, Remove Inverseclr (this is already implem…

77b2629

…ented in skbio), Remove coverge blocks on code and add tests for them

cameronmartino and others added 6 commits April 9, 2019 12:48

remove ratioviz function, reflect chnages in the chnagelog

0baa81d

remove ratioviz function, reflect chnages in the chnagelog

4fac102

REL: Mention using PC3 of 0s in CHANGELOG

d0ba01d

This adds back the text first added to the changelog in 5daf5de, and removes the text added in ce05e00.

REL: make changelog a tiny bit prettier

3105592

update changelog, fix recursion issue with duplicates issue biocore#30

bf99516

Merge branch 'master' of https://github.com/fedarko/DEICODE

c568389

cameronmartino mentioned this pull request Apr 9, 2019

Maximum recursion depth exceeded #30

Closed

fedarko added 3 commits April 9, 2019 14:40

TST: Add Makefile with "make test"

2c6602b

Should make testing easier

Merge branch 'master' of https://github.com/fedarko/DEICODE

a9f7ed7

asdofijosdifjoaisdjfoaidsj

REL: tiny wording fix in the changelog

e8011fa

cameronmartino merged commit 41278a3 into biocore:master Apr 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor redundant code between standalone and QIIME 2 RPCA methods #29

Refactor redundant code between standalone and QIIME 2 RPCA methods #29

fedarko commented Apr 1, 2019 •

edited

fedarko commented Apr 2, 2019

cameronmartino left a comment

fedarko commented Apr 2, 2019

fedarko commented Apr 2, 2019 •

edited

fedarko commented Apr 3, 2019

cameronmartino commented Apr 3, 2019

fedarko commented Apr 8, 2019 •

edited

cameronmartino commented Apr 8, 2019

fedarko commented Apr 8, 2019

fedarko commented Apr 9, 2019

cameronmartino commented Apr 9, 2019 •

edited

cameronmartino commented Apr 9, 2019 •

edited

cameronmartino commented Apr 9, 2019 •

edited

ElDeveloper commented Apr 9, 2019 via email

cameronmartino commented Apr 10, 2019

Refactor redundant code between standalone and QIIME 2 RPCA methods #29

Refactor redundant code between standalone and QIIME 2 RPCA methods #29

Conversation

fedarko commented Apr 1, 2019 • edited

fedarko commented Apr 2, 2019

cameronmartino left a comment

Choose a reason for hiding this comment

fedarko commented Apr 2, 2019

fedarko commented Apr 2, 2019 • edited

fedarko commented Apr 3, 2019

cameronmartino commented Apr 3, 2019

fedarko commented Apr 8, 2019 • edited

cameronmartino commented Apr 8, 2019

fedarko commented Apr 8, 2019

fedarko commented Apr 9, 2019

cameronmartino commented Apr 9, 2019 • edited

cameronmartino commented Apr 9, 2019 • edited

cameronmartino commented Apr 9, 2019 • edited

ElDeveloper commented Apr 9, 2019 via email

cameronmartino commented Apr 10, 2019

fedarko commented Apr 1, 2019 •

edited

fedarko commented Apr 2, 2019 •

edited

fedarko commented Apr 8, 2019 •

edited

cameronmartino commented Apr 9, 2019 •

edited

cameronmartino commented Apr 9, 2019 •

edited

cameronmartino commented Apr 9, 2019 •

edited