Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor redundant code between standalone and QIIME 2 RPCA methods #29

Merged
merged 49 commits into from
Apr 10, 2019

Conversation

fedarko
Copy link
Contributor

@fedarko fedarko commented Apr 1, 2019

  • Add minimal required options (--in_biom and --output_dir).
  • Try to create the specified output directory (if it already exists, this still works fine).
  • A few misc. typo fixes
  • Fix Duplicated Scripts #28: abstract shared RPCA functionality to a new module, deicode/rpca.py (which contains rpca(), a function that runs RPCA given various parameters)
    • The QIIME 2 plugin setup directly references this module's function. From the perspective of a Q2 DEICODE plugin user, nothing should change.
    • The standalone DEICODE code also references this module's function. In order to make things a bit more consistent with the Q2 code, this includes various minor breaking changes (i.e. some commands have been renamed). To be exact:
      • --in_biom -> --in-biom
      • --output_dir -> --output-dir
      • --min_sample_depth -> --min-sample-count
    • For the sake of clarity (and to avoid namespace conflicts), deicode/scripts/_rpca.py has been renamed to deicode/scripts/_standalone_rpca.py. I updated deicode/scripts/__init__.py and setup.py accordingly.
    • Also, the --min-feature-count and --iterations options have been added to the standalone code. This means that all of the functionality from the Q2 version of DEICODE should now be available in the standalone DEICODE version. (...this was the main reason I did this today :))
  • Abstract shared RPCA parameter settings (descriptions and default values) to a new module, deicode/_rpca_defaults.py. This way, parameter settings only have to be modified once to take effect.
  • I forgot to mention it in the commit message, but b2f7c4e makes it so that the show_default Click argument is used for all the standalone RPCA parameters with default values.

fedarko and others added 11 commits April 1, 2019 15:03
Previously, running deicode standalone from the command line (i.e.
just "deicode") would give a confusing error due to the inputs being
of type None.

Marking certain options as "required" makes it clearer that the
user is missing something when they run DEICODE without specifying
an input BIOM and/or an output directory.
IMO this makes it a bit easier to use standalone DEICODE. This is
sort of a subjective decision, so feel free to reject it if you
want :)
Now, the bulk of the RPCA code is just contained in deicode/rpca.py.
The standalone DEICODE code (deicode/scripts/_rpca.py) calls the
function in deicode/rpca.py, and does the extra work of loading the
BIOM table, writing the output files, etc. that Q2 normally takes care
of.

(The setup is somewhat similar to how rankratioviz uses generate.py
from both its Q2 and standalone codebases; a notable difference is
that we don't even have a Q2 _method file here anymore, since the
_method code is basically a subset of the code needed for the
standalone version.)

It's worth testing this a bit, but this should make maintaining
things a lot easier.

OTHER NOTES:

-This means that now all of the (previously exclusive) Q2 RPCA
 options are now available when running DEICODE outside of Q2. So now
 you can specify --minimum-feature-count and --iterations when running
 DEICODE outside of Q2.

-I changed some of the argument names/help messages to be consistent
 with Q2. THIS WILL BREAK prior code that uses DEICODE outside of
 Q2. (However, code that uses DEICODE within Q2 shouldn't be
 affected.)

-IMO, there is still some non-necessary redundancy in 1) the help
 messages and 2) the default settings of the various RPCA options.
 It may be worth defining a module where all of this information is
 stored to further eliminate redundancy (but we're already doing a lot
 better on that front than previously).
fixes a bug in the refactoring done in the prior commit. Now,
all the tests pass.

Potentially worth noting that the test output (at least for the
standalone RPCA) looks slightly different from what's in the
repository. However, it looks like the values are only slightly
different (it's probably due to numpy version diffs/machine precision
diffs or something), and all the tests pass now, so I think this
should be ok (but this isn't my call to make :)).

At this point, I think this branch might be ready to merge back
into biocore. But I'd like to see what y'all have to say about it.
This centralizes the various default parameters for the RPCA options,
cutting down on redundancy. Each option should only have to be changed
once now (so the Q2 and standalone DEICODE versions should be a lot
more consistent).
(forgot to add it in the last commit, whoops)
This should pretty much knock out biocore#28. At this point, pretty much
all of the standalone RPCA code is needed specifically for running
DEICODE outside of QIIME 2: that is, there's minimal redundancy
between the standalone and Q2 RPCA logic.
@fedarko fedarko changed the title Some small standalone DEICODE improvements Refactor redundant code between standalone and QIIME 2 RPCA methods Apr 2, 2019
@fedarko
Copy link
Contributor Author

fedarko commented Apr 2, 2019

Looks like having less code decreased the overall coverage slightly. See lemurheavy/coveralls-public#565 (comment) for reference.

Copy link
Collaborator

@cameronmartino cameronmartino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thank you @fedarko so much for this contribution.

I only have one request to increase the version number and update the changelog. Then I think this is ready to merge!

@fedarko
Copy link
Contributor Author

fedarko commented Apr 2, 2019

Got it, changelog and version bump should be ready later today. Does 0.1.7 sound good as a new version number? Since some of these changes aren't backwards compatible, not sure if e.g. 0.2.0 might be a better option. (Your call, of course :))

Spaces out some bullet points; switch from the arrow syntax for name
changes to fancy-ish markdown tables
if this keeps not working i'll just revert to a prev commit
@fedarko
Copy link
Contributor Author

fedarko commented Apr 2, 2019

One more change (in d35b734) -- the output filenames of the ordination and distance matrix produced by the standalone DEICODE RPCA now match the underlying filenames in the artifacts produced by the Q2 DEICODE RPCA. Updated the changelog accordingly.

@fedarko
Copy link
Contributor Author

fedarko commented Apr 3, 2019

@cameronmartino Sorry to bother you with a ton of messages. I was looking through the code, and I don't think the standalone RPCA test (currently located in test_rpca.py, but soon to be renamed to test_standalone_rpca.py) is correctly checking the output of anything. It looks like dist_res, fea_res, etc. will naturally be identical to dist_exp, fea_exp, etc. every time, since they're looking at the same exact files.

Furthermore, I'm somewhat confused about what feature.txt, samples.txt, etc. are supposed to be. It looks like these are earlier versions of the OrdinationResults output? Neither of these files -- nor distance.txt -- changes when the test is run. So, regardless of what the standalone RPCA code is doing, this test will always pass so long as it doesn't raise an error.

So: while we're at it with this PR, I think it would be worth updating this test to look at the relevant files: that is, verify that distance-matrix.tsv and ordination.txt (the new names for RPCA_distance.txt and RPCA_Ordination.txt) approximately match up with the ground truths.

It looks like assert_array_almost_equal() is still True when comparing distance-matrix.tsv and truth_distance.txt. However, I've been messing around with testing that the Ordination output matches truth_feature.txt and truth_samples.txt and it looks like the signs of the first PC's values are flipped between the "truth" feature and sample values and their respective rows in the Ordination. As an example, here's DEICODE's rank values for OTU_2414 in the test data:

OTU_2414 -4.094486487895422 2.7467056679841555 1.3098674678907734

And here's the entry for this OTU in truth_feature.txt:

OTU_2414 4.0930150209210066 2.749414394166293 1.3179536195044124

PC2 and PC3's values are approximately equal (ish), but PC1 is flipped (if you negate the PC1 value, it's approximately equal to the other PC1 value). Not sure if this is an indication of a past bug that was fixed, or a bug that has been introduced since truth_feature.txt was created.

One possible reason why the values are only approximately equal (ignoring the "flipping" thing for now) is that, as mentioned in the changelog, this PR adds some of the stuff the Q2 RPCA was doing that the non-Q2 RPCA wasn't -- that is, adding a PC of 0s when rank == 2, and using --min-feature-count with a default value of 10.

Also (not to get too nitpicky), but it looks like the Q2 RPCA test doesn't looking any of the actual values of the OrdinationResults or distance matrix. Might be worth adding to that to ensure that these values make sense? Although I don't think I understand the math behind DEICODE well enough yet to do that.

It's failing, due to the "flipping" bug (?) and other small
differences between values as discussed in biocore#29. The "truth" files
might just be really old.

I also added in "truth_sample_trimmed.txt", which is just a version
of the truth_sample.txt file without the OrdinationResults-ish stuff
at the end (so you can read it in pandas easier).
Turns out scikit-learn just imports assert_array_almost_equal directly
from numpy, which is already a dependency of DEICODE. Seriously,
check it out!
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py

...so if nothing else installing DEICODE will now be a tiny bit faster
:)
@cameronmartino
Copy link
Collaborator

@fedarko I apologize for the late response, thanks for fixing this. This was an oversight on my part.

  1. Comparing files in the tests -- This makes perfect sense. The distance matrix should always be close and should not have negative inverses etc... The loadings, however, are probably only directly comparable by absolute values. This is because of the sample loadings and feature loadings can also be the inverse, everything is relative and directionality is not absolute in ordination (but one cannot be different and the other not change). Basically, if you multiply both sample/feature loadings in any of the "PCs" by negative one it won't change the real interpretation, you would just be flipping the numerator and denominator in the log-ratio. :)

  2. Editing tests -- It is definitely worth making all the tests check the real values, you are definitely right. As with above, it would probably be best to just check between the absolute values to some rounding error.

  3. I think we can drop the adding zeros for rank < 3, that is updated in emperor now and will just default to 2D in that case (so that code is no longer needed). The Q2 script is definitely the most up to date script.

Probably don't need to go into specifics here (documenting the
fixed test bugs is ok, but not sure we need to document just general
test updates)
@fedarko
Copy link
Contributor Author

fedarko commented Apr 8, 2019

Did some more digging, it looks like python's built-in any() and all() don't work as expected with pandas (see here for details). It looks like they return True for inputs where they should return False -- here are the results from briefly testing this (input DFs based on the pandas documentation):

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [True, False], "B": [False, False]})
>>> df2 = pd.DataFrame({"A": [False, False], "B": [False, False]})
>>> any(df)
True
>>> any(df2)
True
>>> all(df)
True
>>> all(df2)
True

There are apparently pandas .any() and .all() DataFrame functions that actually work as you'd expect:

>>> df.any(axis=None)
True
>>> df2.any(axis=None)
False
>>> df.all(axis=None)
False
>>> df2.all(axis=None)
False

Just pushed a fix for this. I think we're done, then?

Thanks!

@cameronmartino
Copy link
Collaborator

@fedarko Tricky! This is good to know, thanks for teasing that apart. Will have to be careful with those data frames and missing values. I think we are good to go, feel free to merge when you are ready. Thanks!

@fedarko
Copy link
Contributor Author

fedarko commented Apr 8, 2019

@cameronmartino No problem, happy to help. I don't think I can merge this in since I don't have "write access," though.

@fedarko
Copy link
Contributor Author

fedarko commented Apr 9, 2019

Ok, so four things.

I finished up a test that generates an Emperor biplot with rank 2, as @cameronmartino and I discussed today. For reference, here's the relevant function (designed to be chucked into the Test_qiime2_rpca class in the DEICODE Q2 tests file, after making a few imports/etc.):

    def test_rank2_matrix(self):
        """Verifies that using a rank 2 matrix works in DEICODE and Emperor."""

        ordination_qza, _ = q2deicode.actions.rpca(self.q2table, rank=2)
        # Generate fake sample metadata because it's required
        # self.table_sample_ids is just a list of the IDs in self.q2table
        fake_metadata_column_of_0s = [0] * len(self.table_sample_ids)
        sample_metadata_df = pd.DataFrame({"blah": fake_metadata_column_of_0s},
                                          index=self.table_sample_ids)
        sample_metadata_df.index.name = "Sample ID"
        # "Metadata" is imported from qiime2
        sample_metadata = Metadata(sample_metadata_df)
        biplot = q2emperor.actions.biplot(ordination_qza, sample_metadata)
        biplot.visualization.save("biplot.qzv")

I noticed a few things while writing/testing this code:

  1. DEICODE didn't really check the rank or iterations parameters' values up front, so for the sake of clarity I added some basic parameter checking in 140bd75 (where it ensures that --rank is at least 2, and that --iterations is at least 1). (Previously, using a rank of 0 or less would still give you an error from scipy.sparse.linalg.svds(), but the error message wasn't super clear and this still allowed a rank of 1. So I figure this is a beneficial change.)

  2. q2-emperor can produce a biplot from a rank 2 ordination (as is generated in the new test code), but it fails when loading the resulting QZV files in the browser (checking the developer console, I get the same error messages* as I do when trying to load bray_curtis_emperor.qzv from this forum post). This occurs with Chrome 73.0.3683.86 and QIIME 2 2019.1.

    I was under the impression that biocore/emperor@a93f029 had fixed this problem in Emperor. Maybe this just wasn't included in the q2-emperor 2019.1 release? (If needed, we can just add back the "add a PC of zeroes" approach to DEICODE in the meantime.)

    For reference: here's a link to the biplot QZV file produced by the test code at the top of this post (sorry, github doesn't let me upload QZVs for some reason). And for comparison, here's a link to the biplot QZV file produced with the same code but using rank=3 in the call to rpca() instead of rank=2 (just to show that the rank=2 is the problem here).

    *with slightly different line numbers, for some reason? I think this is due to Emperor version differences within the forum's QZV file, since I'm using Q2 2019.1 and that post was from 2018.

  3. Since Emperor is failing in the browser, I'm not sure that a test in DEICODE is the best way to check this error (it seems like it'd be easier to do the testing in Emperor's repository's JS tests, if anything). This is why I haven't committed the test code in this post yet.

  4. This isn't related to points 2 and 3 in this post, but I think I might have noticed a bug in deicode/_optspace.py (see here). Since the code uses 1 as a starting point in the iteration loop -- and since it assigns to index i+1 in the dist ndarray -- dist will always have a value of 0 at its second position (index 1), regardless of niter (which is just --iterations). This also means that the loop's body won't execute when --iterations is 1, since len(range(1, 1)) == 0.

    You can test this out by adding a line that says print(dist) right after the for loop I linked to. I'm not familiar enough with the OptSpace algorithm to know the best way to fix this (or what its impacts on prior DEICODE output might have been), but this seems like unintentional behavior to me.

Anyway, that's what I've found so far. Sorry for the walls of text. @cameronmartino @ElDeveloper maybe we can meet sometime tomorrow / later this week to go over some or all of these points? I'm heading home now but I can try to help out with finishing this PR / squashing the relevant bugs this week. Thanks!

@cameronmartino
Copy link
Collaborator

cameronmartino commented Apr 9, 2019

@fedarko (4) Nice catch, I just pushed the iteration in the OptSpace helper by one. It should have little to no effect because of one iteration won't affect much once it has converged. I think in the test it changed in the third or fourth decimal, I updated the test to reflect this.

@cameronmartino
Copy link
Collaborator

cameronmartino commented Apr 9, 2019

@fedarko (1) I agree but I moved this into the actual optspace.py function (here) so it is always restricted.

…ing. Also added rank==2 addition of zeros in the PC3 to prevent breaking emperor
@cameronmartino
Copy link
Collaborator

cameronmartino commented Apr 9, 2019

@fedarko (2/3) I added the zeros back, for now, this also allows us some ability to be back compatible with older versions of Emperor. I think what is breaking this is that there are only two axes given. When even in the 2D case emperor seems to expect to see more than two axes in the Ordination file. Unless @ElDeveloper objects to this workaround, I will keep it.

@ElDeveloper
Copy link
Member

ElDeveloper commented Apr 9, 2019 via email

@cameronmartino
Copy link
Collaborator

@fedarko Thanks again for this contribution. Will merge and update conda today.

@cameronmartino cameronmartino merged commit 41278a3 into biocore:master Apr 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicated Scripts
3 participants