Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loads tree in NewickFormat, uses BP to parse tree faster #42

Merged
merged 9 commits into from Apr 4, 2021

Conversation

ahdilmore
Copy link
Contributor

Uses BP to parse tree to improve Phylo-RPCA's memory usage. I also updated the setup.py to account for new dependencies.

gemelli/rpca.py Outdated
@@ -39,6 +42,11 @@ def phylogenetic_rpca(table: biom.Table,
gemelli.
"""

# loads NewickFormat tree as bp.BP tree; loads & parses tree faster
phylogeny = get_bp(phylogeny)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_bp is only like two lines and it is pretty easy to get burned by importing from private modules in python. I would probably recommend just duplicating that code here.

Copy link
Member

@ElDeveloper ElDeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks so much @ahdilmore. Since get_bp isn't a public function I would suggest to just copy the 2 lines directly into the code.

Also, NewickFormat can probably not be included here because the standalone CLI will not work if q2_types isn't installed. @cameronmartino or @gibsramen any suggestions on how to handle this? Short of writing a wrapper for this function that's decorated with the relevant types, etc?

gemelli/rpca.py Outdated Show resolved Hide resolved
gemelli/rpca.py Outdated Show resolved Hide resolved
@gibsramen
Copy link
Contributor

Not super familiar with how gemelli handles trees (or bp for that matter) but why does this necessitate changing phylogeny from a TreeNode object? Is the tree passed to phylogenetic_rpca changed when using bp?

@gwarmstrong
Copy link
Member

NewickFormat can probably not be included here because the standalone CLI will not work if q2_types isn't installed.

As a hack, you could probably use importlib.

>>> import importlib
>>> foo = importlib.import_module('random')
>>> foo.randrange(0, 10)
4
>>> bar = getattr(foo, 'randrange')
>>> bar
<bound method Random.randrange of <random.Random object at 0x000001AE50C48F20>>
>>> from random import randrange
>>> bar == randrange
True

So something like

try:
    q2_types = importlib.load_module('q2_types')
    NewickFormat = getattr(q2_types, 'NewickFormat')
except:
    NewickFormat = str

def phylo_rpca(..., tree: NewickFormat, ...):
    ...

Could work. Though I would want to check integration with qiime2.

Alternatively, it is pretty easy to include only qiime2 and q2-types in a conda install by adding -c qiime2 without doing a full qiime2 installation, so that NewickFormat could be used for the standalone without a full Qiime2 installation.

@gwarmstrong
Copy link
Member

Not super familiar with how gemelli handles trees (or bp for that matter) but why does this necessitate changing phylogeny from a TreeNode object? Is the tree passed to phylogenetic_rpca changed when using bp?

The iow.parse_newick is orders of magnitude faster than the skbio TreeNode parser. See https://github.com/wasade/improved-octo-waddle/blob/master/ipynb/performance%20comparison.ipynb

@gibsramen
Copy link
Contributor

Sure but in that case why not specify that phylogeny should be of type bp.BP?

Aside from that

Alternatively, it is pretty easy to include only qiime2 and q2-types in a conda install by adding -c qiime2 without doing a full qiime2 installation, so that NewickFormat could be used for the standalone without a full Qiime2 installation.

I think this is a decent solution.

@gwarmstrong
Copy link
Member

Sure but in that case why not specify that phylogeny should be of type bp.BP?

Good point. In that case I think a transformer would need to be defined for the q2 plugin.

@cameronmartino
Copy link
Collaborator

Thanks for the contribution @ahdilmore, this tree import is way faster. Also, good points from @gibsramen, @ElDeveloper, and @gwarmstrong.

I made a few changes:

  1. I moved the tree import to preprocessing.py as a function bp_read_phylogeny. That way we can use it in multiple commands (rclr, ctf, and rpca).
  2. I added this import to the phylo ctf and rclr command.
  3. The QIIME2 functions now all work fine with the NewickFormat 👍

It looks like everything is now passing. If someone wants to take a second check at everything, I think we should be good to merge.

Copy link
Member

@ElDeveloper ElDeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @ahdilmore and @cameronmartino! Just one suggested change because the comment seems out of place.

Comment on lines 41 to 42
# The file will still be closed even though we return from within the
# with block: see https://stackoverflow.com/a/9885287/10730311.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems a bit outdated.

Suggested change
# The file will still be closed even though we return from within the
# with block: see https://stackoverflow.com/a/9885287/10730311.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @ElDeveloper. That comment is now removed. Thanks!

@cameronmartino cameronmartino merged commit 9dfa1d1 into biocore:phylo-rclr Apr 4, 2021
@ElDeveloper
Copy link
Member

Thanks @cameronmartino!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants