Skip to content
AmyW edited this page Oct 6, 2021 · 7 revisions

Welcome to the breakaway wiki! Its purpose is to provide some answers to Frequently Asked Questions. The issues page (including closed issues) may also help you.

This page is a work in progress, but if you have another question or would like more details, please send me an e-mail and I will do my best to answer it here.

If you've been redirected here from an e-mail, it's because I get more questions about breakaway than I can answer individually. I apologise I cannot respond to you personally, and I hope you find the answer to your questions here. Thank you for using our software.

Amy Willis, Ph.D.
Assistant Professor
Department of Biostatistics
University of Washington, Seattle, USA

Last updated June 2020

Frequently asked questions

  • What does breakaway do?

breakaway is an R package for estimating species richness. In most ecology studies, you take a sample of organisms from a large population, and you can count how many different organisms you saw in your sample (this is what I call the "sample richness"). However, since you didn't see all the organisms in the population, you probably missed some species! breakaway is a package to help you estimate how many species you missed.

By the way -- while "species richness" is a general term for the number of different categories in a system, breakaway doesn't only apply to estimating species-level richness. It can be used to estimate gene richness, MAG richness, or even phylum-level richness. The data structure that you provide to breakaway will determine the parameter that breakaway is estimating -- so if you want to estimate genus richness, give breakaway genus-level data. For more information on how to organize your data so that breakaway can estimate the richness, check out the package vignettes.

  • How does breakaway work?

How do we estimate how many taxa were unobserved? Essentially we look at the pattern of rare taxa (taxa observed once, twice, three times...) and try to predict the number of taxa observed zero times. Different statistical models exist for fitting that curve and obtaining diversity estimates, and we think that breakaway uses some reasonable models and estimation procedures. breakaway has a vignette called "Introduction to diversity estimation" that goes through some of these concepts in more detail.

TODO(AW) -- long-term plan is to make a video on this with more details #OpenScience

  • What's the best richness index? (Issue #87)

A note on terminology -- a "diversity index" is an unknown parameter associated with the population that you carry about. Species richness, Shannon diversity, and Simpson diversity are all examples of diversity indices.

There are many different ways to estimate any given diversity index. Unfortunately the most common is to use the "plug in estimate" -- this just pretends that your sample saw everything and you have no error nor uncertainty. More sophisticated methods, breakaway included, attempt to correct those errors, and tell you how uncertain your estimate is (via confidence intervals or standard errors).

The philosophy of the StatDivLab is that you (the ecologist) pick what "diversity index" you care about, and we (the statisticians) help you estimate it as well as possible. Therefore, I'm not going to tell you that richness is what you care about... but if you tell me that richness is what you care about, then I recommend using breakaway to estimate it, because I think it uses the most reasonable models and most sensible approaches to estimation.

breakaway estimates of species richness, but if you care about estimating other alpha diversity indices, check out DivNet, our R package for estimating non-richness diversity indices.

  • What is the difference between breakaway, CatchAll, the objective Bayes methods, and the Chao1 index?

breakaway refers to both the R package, and a method of estimating species richness. In addition to the method breakaway, the breakaway package also implements a number of other methods. These methods differ with respect to the models that they fit to the data, and/or how we estimate those model parameters. A quick overview of the different methods are as follows (if you see an error, please contact me to let me know ASAP, preferably providing a Markdown-formatted correction):

  • breakaway fits a sequence of models, beginning of the most complex class of models (Kemp, estimated via iteratively reweighted nonlinear least squares), and sequentially simplifying the models (Kemp, then Negative Binomial, then Poisson) if the models can't be fit. For more details on Kemp models and breakaway method, check out the paper.

  • chao_bunge fits a Negative Binomial model with a moment-based estimator.

  • wlrm_transformed fits a Negative binomial model using weighted least squares and a linear link.

  • wlrm_transformed fits a Negative binomial model using weighted least squares and a log link.

  • chao1 fits a Poisson model. An assumption of this model is that all species are equally abundant in your population. This assumption basically never holds and so I strongly recommend against this method. This method can also be interpreted as a lower bound on species richness under any mixed-Poisson model, but comparing lower bounds is rarely useful (and I also don't believe in the mixed-Poisson model, which is why I wrote breakaway in the first place!) and prone to misinterpretation, so I just recommend against this method. breakaway estimates the same quantity, but better. One fact that highlights how breakaway is more sophisticated than Chao1 is that the Chao1 estimator just uses two frequency counts, while breakaway uses as many as possible!

  • objective_bayes_*(): This set of methods fit Negative Binomial and other mixed-Poisson models using an objective Bayesian approach (where the method chooses the prior to have minimal information content).

Again, if there is anything missing here please let me know!

  • How do I compare diversity across samples?

Use the function betta() -- it's awesome and generally underused. Check out the vignettes for some examples.

Citation: Willis, A., Bunge, J., & Whitman, T. (2017). Improved detection of changes in species richness in high diversity microbial communities.. Journal of the Royal Statistical Society: Series C (Applied Statistics), 66 (5), 963-977.

TODO(AW) -- elaborate

  • Occasionally breakaway's confidence intervals are very large. Why does this occur and is it a problem? (Issue #64)

A large standard error on a single richness estimate indicates that we can't very precisely estimate the total species richness in that sample. Large confidence intervals occur because species richness estimation is an extrapolation problem, where we extrapolate off the range of the data (in this case, the frequency counts). There is a lot of uncertainty, and a high variance/standard error in these estimates.

Furthermore, the species richness estimates that breakaway implements are all asymptotically normally distributed, but in finite samples we the distribution is more skewed. For this reason, we show a log-normal approximation. Note that although it's unappealing to have so much error, this is a correct mathematical assessment of how uncertain each individual species richness estimate is.

However, each individual samples' species richness doesn't matter too much for most scientific questions of interest. Typically folks are interested in doing inference on whether diversity changes across a category/treatment/experiment, and they're going to be using all of the samples together -- I wrote betta() for doing this sort of modeling. The overall uncertainty in the inference problem is much less than the uncertainty in the individual estimation problems.

If you plug in all of your richness estimates into betta(), the samples with a large standard error will be given less weight in the regression than samples with small standard errors on their richness estimates. This is the behaviour we want, because while that sample contains some information, the standard error is telling us that the richness estimate may not be as reliable as other samples' estimates, and we should downweight the importance of that observation relative to observations would smaller standard errors on their richness estimates.

  • How do I get the output of breakaway on phyloseq objects in tabular format?

Use summary, e.g.,

GlobalPatterns %>% subset_samples(SampleType == "Soil") %>% breakaway %>% summary

will produce a table for you. (GlobalPatterns is a default data set that comes with phyloseq)

  • How I interpret the output of betta?

Check out my response to Issue #66 for some guidance

  • How do I choose a cut-off?

The cutoff controls how ...TODO(AW)

Regarding changing the cut-off -- this is a very reasonable thing to do when trying to get the best richness estimate possible for a specific sample. However, if you are going to use betta(), I recommend against tinkering with cutoff arguments, since with many samples, it all comes out in the wash, so to speak -- i.e., the central limit theorem kicks in and no single sample's cutoff choice will overly influence your results.

Note that "ignoring requested cutoff; it's too low" warnings are not bad, and they don't necessarily indicate model misspecification or cause for alarm. They just indicate that the frequency count table is "short", which means there is not a lot of data for estimating the number of missing species. I wrote in that warning to remind users not to overinterpret the total richness estimates -- essentially I want people to exercise caution and understand that there are uncertainties inherent in estimating species richness.

  • Breakaway has selected mixed poisson or negative binomial models for all my samples. Why? (#74)

By default breakaway will try to fit a Kemp model, and only if no Kemp models will fit will it go to a different model. You can specify a less well fitting model by just running, e.g., chao_bunge(ps) instead of breakaway(ps) where ps is a phyloseq object or similar, but breakaway does everything it can to make a Kemp model fit and there's no way to fix this as a user, unfortunately.

If a frequency count ratio approach is not possible, breakaway attempts to fit the Chao-Bunge estimator described here.

The Chao-Bunge estimator is based on a negative binomial model, and it's a moment-based estimator.

That estimator doesn't necessarily converge. If it doesn't converge, breakaway fits a Poisson model using maximum likelihood.

TODO(AW) cleanup and elaborate

  • If my models do not have reasonable fits is the estimate still usable? Is it still better than just using observed richness?

Yes -- definitely much better (unless the model is doing something absolutely wild like the fitted observations are way off the trend -- I think I've caught almost all of these cases by now but please post an issue if you see this!). Using the observed richness doesn't account for bias or variance at all, so even a bad bias correction and a small variance (which can come out of an unreasonable model) is better than that!

  • Where can I learn more?

TODO(AW)

Clone this wiki locally