
Speeding up run time on wide datasets #32

Closed · amorris28 opened this issue Aug 14, 2019 · 8 comments

@amorris28

Moved over from twitter.

I'm trying to run divnet on ASVs in a dataset of 44 samples and 19,921 ASVs. No ASV appears in all samples, so I've chosen a reference ASV, ref_otu, that is present in 42 of the 44. I'm also leaving X = NULL with no design matrix, so I'm just trying to estimate diversity and confidence intervals for each sample. physeq is my phyloseq object. If I run this on a cluster with 28 cores and 128 GB of memory, I don't see any progress after ~30 minutes. Running locally on my 4-core, 16 GB machine, it crashes, I think because it runs out of memory. Function call below:

asv_div <- divnet(physeq, ncores = 28, base = ref_otu)

Thank you for the help on this!

@adw96 (Owner) commented Aug 15, 2019

Hi Andrew! Thanks so much again for using DivNet. Some thoughts

  • I was concerned that if you're running in parallel the progress bar might not update because the cores don't talk to each other, but I confirmed that's not the case (we parallelise over the MH steps, while the progress bar updates at each EM step).
  • I would recommend network="diagonal" for a dataset of this size. This means you're allowing overdispersion (compared to a plugin aka multinomial model) but not a network structure. This isn't just about computational expense -- it's about the reliability of the network estimates. Essentially estimating network structure on 20k variables (taxa) with 50 samples with any kind of reliability is going to be very challenging, and I don't think that it's worth doing here. In our simulations we basically found that overdispersion contributes the bulk of the variance to diversity estimation (i.e. overdispersion is more important than network structure), so I don't think you are going to lose too much anyway.
  • You can control the speed–precision trade-off by varying the tuning argument. The default is
    list(EMiter = 6, EMburn = 3, MCiter = 500, MCburn = 250)
    Doing fewer EMiters and MCiters reduces runtime. Perhaps try
    list(EMiter = 6, EMburn = 3, MCiter = 250, MCburn = 100)
    If you're worried that it's stalling out entirely, to check that it runs, try
    list(EMiter = 6, EMburn = 3, MCiter = 10, MCburn = 5)
    Note that we parallelise over MCiter.
  • I'm running a simulation now to see how DivNet scales with ncores and q. I'm concerned that there might be some overhead from the parallelisation, and perhaps having so many cores hurts you. I'll post my results when I get them.
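
Putting the suggestions above together, a call for a dataset this size might look like the sketch below. It only uses arguments already named in this thread (network, tuning, base, ncores), with physeq and ref_otu taken from the original post; the specific values are illustrative, not prescriptive:

```r
library(DivNet)

# Faster settings for a wide dataset: diagonal network (overdispersion only,
# no network estimation) and fewer MC iterations, as suggested above.
quick_tuning <- list(EMiter = 6, EMburn = 3, MCiter = 250, MCburn = 100)

asv_div <- divnet(physeq,
                  base    = ref_otu,     # reference ASV present in most samples
                  network = "diagonal",  # skip network structure estimation
                  tuning  = quick_tuning,
                  ncores  = 4)           # a modest core count may be enough
```

Dropping MCiter/MCburn further (e.g. to 10/5) is a quick way to confirm the run isn't stalling before committing to a full run.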

This is a great test case for us so thanks for bringing it to our attention! Never in my wildest dreams did I think that someone would try to run this with 20k taxa. (My imagination stops at around 5k.) I guess I need to work with more soil!

Amy

@adw96 (Owner) commented Aug 15, 2019

@bryandmartin Anything you want to add?

@amorris28 (Author)

Hey Amy!

I'm glad this is a helpful case for you all. I would love to use DivNet going forward, and this is not an atypical data set for our lab group, so learning how to make it work will be super helpful. I will try network='diagonal' and play with the tuning argument to see how things go. Let me know how your simulations turn out.

Thank you for the quick turn-around!
Andrew

@adw96 (Owner) commented Aug 15, 2019

Ok a quick update (a bigger sim to come): time-vs-q.pdf

Conclusions:

  • no huge gains from adding more cores; 3 is about as good as 6
  • no huge gain for diagonal vs naive

I'm upscaling q (number of taxa) and will see how the trends continue.

@mooreryan (Contributor) commented Aug 15, 2019

I was having a similar issue to the original poster (see issue #28). Large numbers of ASVs/OTUs/taxa really don't seem feasible.

I've also found that the ncores option really doesn't provide much benefit.

In a comment on a previous pull request (#29 (comment)), I found that the MCrow (and MCmat) functions take the most CPU time, so today I started rewriting those functions in Rcpp. I'm still working some kinks out, but it's definitely faster.
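
For readers unfamiliar with the approach: the idea is to port a hot inner loop from R to C++ with Rcpp. The toy below is NOT DivNet's actual MCrow/MCmat code, just a minimal sketch of the pattern (a row-wise log-sum-exp, the kind of tight numeric loop that benefits most):

```r
library(Rcpp)

# Illustrative only -- not DivNet's real internals. Compiles a C++ function
# callable from R; tight double loops like this are where Rcpp pays off.
cppFunction('
NumericVector row_logsumexp(NumericMatrix x) {
  int n = x.nrow(), p = x.ncol();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    double m = x(i, 0);
    for (int j = 1; j < p; ++j) if (x(i, j) > m) m = x(i, j);
    double s = 0.0;
    for (int j = 0; j < p; ++j) s += std::exp(x(i, j) - m);
    out[i] = m + std::log(s);   // numerically stable log(sum(exp(row)))
  }
  return out;
}')
```

The same vectorised computation in base R (apply over rows) is typically much slower for large matrices, which is why hot spots identified by profiling are good Rcpp candidates.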

@mooreryan (Contributor) commented Aug 15, 2019

This might be helpful for the original poster as well. While working on speeding up the divnet function, I made this little graph of how runtime scales with the number of taxa. The dataset is the included Lee dataset.

If that trend holds for very large numbers of taxa (not sure if it actually would), then running ~20,000 ASVs would take at least a couple of hours.

[Figure: runtime vs. number of taxa]

@adw96 (Owner) commented Aug 15, 2019

This is fantastic to know, @mooreryan! EM-MH algorithms are really well-suited to Rcpp but we just haven't been able to prioritise rewriting it. We would be so rapt if you were to implement it, and we would love to add you as a package coauthor/maintainer.

@mooreryan mentioned this issue Aug 21, 2019 (closed)
@adw96 closed this as completed Jun 16, 2021
@ch16S commented Mar 10, 2022

Hey everyone,

Really appreciate the work everyone has put into divnet.
Amy, you mentioned that a diagonal matrix is most appropriate for a large number of taxa and a small number of samples, since you cannot reliably estimate the interactions.

Do you think this holds true if I have 1000-2000 samples?
The samples are from soil, and are geographically diverse.

Cheers,
Chris
