Which body habitats are most/least variable through time? #2

Open
floresg opened this issue Nov 1, 2012 · 20 comments

@floresg
Contributor

floresg commented Nov 1, 2012

A. Alpha diversity
a) Metrics – richness, phylogenetic diversity, Shannon index

  1. Coefficient of variation (CV = standard deviation/mean) – useful for comparing the variation of two populations independent of the magnitude of their means.

b) Look for differences within each body habitat based on:
  • gender, university, antibiotics, C-section birth, allergies, BMI class, etc.
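
The CV comparison described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's analysis code: the `coefficient_of_variation` helper and the richness values are hypothetical, and only the habitat names come from the discussion.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


# Hypothetical richness values across time points for one individual
alpha_by_habitat = {
    "palm": [210, 340, 150, 290, 410],
    "tongue": [95, 100, 105, 98, 102],
}

cv_by_habitat = {
    habitat: coefficient_of_variation(values)
    for habitat, values in alpha_by_habitat.items()
}

# List habitats from most to least variable through time
for habitat, cv in sorted(cv_by_habitat.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{habitat}: CV = {cv:.3f}")
```

Because CV divides by the mean, habitats with very different richness levels can be compared on the same scale.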

B. Beta diversity
a) Metrics – weighted/unweighted UniFrac

  1. Metrics that contain abundance information are more appropriate for these data because skin habitats are rich in low-abundance transient OTUs, which are weighted more heavily by a presence/absence metric.
  2. Median absolute deviation (MAD) – not sensitive to outliers
  3. Mean of pairwise comparisons
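
The two beta diversity summaries listed above (MAD and the mean of pairwise comparisons) could be computed along these lines. This is a rough sketch under stated assumptions: the distances are hypothetical stand-ins for UniFrac values, stored in a dictionary keyed by sample pair, and the helper names are illustrative only.

```python
from itertools import combinations
from statistics import mean, median


def pairwise_distances(sample_ids, dist):
    """Collect the distance for every unordered pair of samples."""
    return [dist[frozenset(pair)] for pair in combinations(sample_ids, 2)]


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = median(values)
    return median([abs(v - med) for v in values])


# Hypothetical distances between four time points from one person and habitat
samples = ["t1", "t2", "t3", "t4"]
dist = {
    frozenset(("t1", "t2")): 0.30,
    frozenset(("t1", "t3")): 0.35,
    frozenset(("t1", "t4")): 0.90,
    frozenset(("t2", "t3")): 0.32,
    frozenset(("t2", "t4")): 0.88,
    frozenset(("t3", "t4")): 0.91,
}

distances = pairwise_distances(samples, dist)
print(f"mean of pairwise comparisons = {mean(distances):.3f}")
print(f"median = {median(distances):.3f}, MAD = {median_absolute_deviation(distances):.3f}")
```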
@ghost ghost assigned floresg Nov 1, 2012
@floresg
Contributor Author

floresg commented Nov 1, 2012

Question about the Beta diversity part of this analysis - instead of averaging all the pairwise comparisons for an individual, should we average only those from adjacent time points?
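
To make the two options concrete, here is a small sketch comparing the average over all pairs with the average over adjacent time points only. It assumes the samples are already in time order; the one-dimensional `toy_distance` is a hypothetical stand-in for a UniFrac lookup.

```python
from itertools import combinations
from statistics import mean


def mean_all_pairs(ordered_samples, distance):
    """Average distance over every pair of time points."""
    return mean([distance(a, b) for a, b in combinations(ordered_samples, 2)])


def mean_adjacent_pairs(ordered_samples, distance):
    """Average distance over consecutive time points only."""
    pairs = zip(ordered_samples, ordered_samples[1:])
    return mean([distance(a, b) for a, b in pairs])


# Hypothetical community that drifts steadily along one axis; a real
# analysis would look distances up in a distance matrix instead.
positions = [0.0, 0.1, 0.2, 0.3, 0.4]


def toy_distance(a, b):
    return abs(a - b)


all_pairs = mean_all_pairs(positions, toy_distance)
adjacent = mean_adjacent_pairs(positions, toy_distance)
print(f"all pairs: {all_pairs:.2f}, adjacent only: {adjacent:.2f}")
```

In this toy case the all-pairs average is about twice the adjacent-only average, because it mixes step-to-step variability with longer-term drift.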

@gregcaporaso
Member

Added some data showing a comparison across individuals. See the analysis results here. Working on within individual comparisons now.

@floresg
Contributor Author

floresg commented Nov 9, 2012

One question I had about these analyses is how to normalize sampling effort across individuals, since some people have only 5 samples and others have up to 14. If there is a time-distance decay relationship, then you would expect individuals who turned in samples further apart in time to have greater variability than those who turned in samples closer together in time. Should we be randomly sampling five samples from each individual for these analyses?
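
The random subsampling proposed here might look like the following. This is only a sketch: `subsample_time_points` is a hypothetical helper, the subject IDs and week numbers are invented, and a fixed seed is assumed so the draw is repeatable.

```python
import random


def subsample_time_points(samples_by_person, n, seed=0):
    """Randomly keep n time points per person; drop people with fewer than n."""
    rng = random.Random(seed)
    subsampled = {}
    for person in sorted(samples_by_person):
        time_points = samples_by_person[person]
        if len(time_points) < n:
            continue  # not enough samples to include at this depth
        subsampled[person] = sorted(rng.sample(time_points, n))
    return subsampled


# Hypothetical sampling weeks per individual
samples_by_person = {
    "subject_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    "subject_b": [1, 3, 6, 9, 12],
    "subject_c": [2, 4],
}

even_effort = subsample_time_points(samples_by_person, n=5)
for person, weeks in even_effort.items():
    print(person, weeks)
```

Note that a random draw equalizes the number of samples but not the time span they cover, so it would not by itself remove a time-distance decay effect.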

@rob-knight

I would recommend doing some matched analyses testing the effect against the subset of subjects who returned all samples (i.e. compare against the same 5 timepoints from subjects with all timepoints).
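
One way to read this suggestion is sketched below: take a subject with every sample, compute a statistic on all of their weeks, then recompute it using only the weeks that a less complete subject returned. The `matched_statistic` helper, the weekly values, and the choice of MAD as the statistic are all assumptions for illustration.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = median(values)
    return median([abs(v - med) for v in values])


def matched_statistic(complete_series, returned_weeks, statistic):
    """Compute a statistic on a complete subject twice: all weeks vs. matched weeks."""
    all_values = [complete_series[w] for w in sorted(complete_series)]
    matched_values = [
        complete_series[w] for w in sorted(returned_weeks) if w in complete_series
    ]
    return statistic(all_values), statistic(matched_values)


# Hypothetical weekly values for a subject who returned all samples
complete_series = {1: 0.30, 2: 0.34, 3: 0.31, 4: 0.60, 5: 0.33,
                   6: 0.29, 7: 0.35, 8: 0.32, 9: 0.58, 10: 0.31}
# Weeks returned by a hypothetical subject with only five samples
returned_weeks = {1, 3, 5, 7, 9}

full_mad, matched_mad = matched_statistic(
    complete_series, returned_weeks, median_absolute_deviation
)
print(f"MAD on all weeks: {full_mad:.3f}; MAD on matched weeks: {matched_mad:.3f}")
```

Repeating this across complete subjects would indicate how much a variability estimate moves purely because of which time points were returned.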


@rob-knight

Thanks. Those are extremely significant t-test values, and I bet all the nonparametric p-values are 0 even if you do 10^9 iterations.

The fact that forehead is lower diversity/lower variability than palm was known in Costello et al. though not sure we reported it clearly.

It might be worth reopening the discussion about which measures of variability are useful and how we should apply and compare them?
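
On the nonparametric values mentioned above: the thread does not say which test was run, so the sketch below assumes a Monte Carlo permutation test on the difference in group means. It uses the common (count + 1) / (n + 1) estimate, which means the reported p-value can never be smaller than 1 / (n + 1), however many iterations are run. The distance values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=999, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            at_least_as_extreme += 1
    # Counting the observed labelling itself keeps the estimate above zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical within-habitat distances for two body sites
palm = [0.82, 0.79, 0.85, 0.88, 0.81, 0.84, 0.80, 0.86, 0.83, 0.87]
forehead = [0.52, 0.49, 0.55, 0.58, 0.51, 0.54, 0.50, 0.56, 0.53, 0.57]

p_value = permutation_p_value(palm, forehead, n_permutations=999)
print(f"p = {p_value:.3f} (lower bound is {1 / 1000:.3f})")
```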

On Nov 9, 2012, at 1:27 PM, Greg Caporaso wrote:

Added some data showing a comparison across individuals. See the analysis results here (https://github.com/gregcaporaso/student-microbiome-project/wiki/Overall-variability-across-body-sites). Working on within-individual comparisons now.

@gregcaporaso
Member

From Rob's comment:

It might be worth reopening the discussion about which
measures of variability are useful and how we should apply
and compare them?

This is something that @jrrideout is actively working on for the microbiogeo analysis/paper and we'll feed the results into this analysis.

@antgonza

I think that one of the most important questions we need to answer is what
is the best way to characterize variation in bacterial communities: mean or
median. Now, I'm not sure this is the perfect dataset for doing this, but it
will be good to keep it in mind while selecting analytical tools.


Antonio González Peña
Research Assistant, Knight Lab
University of Colorado at Boulder
https://chem.colorado.edu/knightgroup/
http://scholar.google.com/citations?user=d5EXd78AAAAJ

@gregcaporaso
Member

I think mean/median alone is not enough; a five-number summary (minimum, first quartile, median, third quartile, and maximum) would be better. An alternative would be median and median absolute deviation. Thoughts on this?

I really don't like mean for this, for the usual reason of sensitivity to outliers, which can pop up here all the time, e.g. if someone sneezed on their hands a couple of minutes before sampling at one of the time points (while these would look different, they are probably not different enough to be flagged as mislabeled).
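
A quick sketch of the five-number summary and of the outlier argument, using only the standard library. The distance values are hypothetical, with one extreme time point standing in for the sneeze scenario, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is one of several quartile conventions.

```python
from statistics import mean, median, quantiles


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical distances between time points for one person and habitat
typical = [0.30, 0.32, 0.31, 0.33, 0.29]
with_outlier = typical + [0.95]  # one contaminated time point

mean_shift = abs(mean(with_outlier) - mean(typical))
median_shift = abs(median(with_outlier) - median(typical))

print("five-number summary:", five_number_summary(with_outlier))
print(f"mean moved by {mean_shift:.3f}, median moved by {median_shift:.3f}")
```

A single extreme value moves the mean far more than the median, which is the reason given above for preferring median-based summaries.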

@gregcaporaso
Member

From Rob's comment:

I would recommend doing some matched analyses testing
the effect vs the subset of subjects who returned all samples
(ie compare vs same 5 timepoints from subjects with all
timepoints).

One relatively minor issue here is that we don't currently define what it means for someone to have turned in all samples. Technically the sampling period was 10 weeks, but if people kept providing samples, we kept taking them, so we have up to ~13 weeks of data from some individuals. Gilbert/Dan, you're most familiar with the metadata - would we be safe defining 10 weeks as "all"? If so, does anyone object to that definition?

@floresg
Contributor Author

floresg commented Nov 11, 2012

We may want to define "all" as 8 weeks' worth of samples because then more individuals will be included. One other thing to consider is consecutive time points: for some individuals, those 8 samples could have been turned in over a 14-week period.
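
The two candidate definitions of "all" (at least 8 samples, or at least 8 consecutive weeks) can be compared with a filter like this. It is a sketch only: the helper names, subject IDs, and week numbers are hypothetical, and weeks are assumed to be recorded as integers.

```python
def longest_consecutive_run(weeks):
    """Length of the longest run of consecutive week numbers."""
    best = current = 0
    previous = None
    for week in sorted(set(weeks)):
        if previous is not None and week == previous + 1:
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = week
    return best


def complete_subjects(weeks_by_person, min_weeks=8, require_consecutive=False):
    """Return the people who meet the chosen definition of 'all samples'."""
    kept = []
    for person in sorted(weeks_by_person):
        weeks = weeks_by_person[person]
        if require_consecutive:
            count = longest_consecutive_run(weeks)
        else:
            count = len(set(weeks))
        if count >= min_weeks:
            kept.append(person)
    return kept


# Hypothetical sampling weeks
weeks_by_person = {
    "subject_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subject_b": [1, 2, 4, 6, 8, 10, 12, 14],  # 8 samples spread over 14 weeks
    "subject_c": [1, 2, 3, 4, 5],
}

print(complete_subjects(weeks_by_person))
print(complete_subjects(weeks_by_person, require_consecutive=True))
```

Running both variants on the real metadata would show how many individuals each definition keeps.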

@rob-knight

Sounds reasonable. If you're worried about outliers, might it be worth looking at histograms of some/all of the distributions, e.g. as thumbnails?


@antgonza

I guess my comment wasn't clear enough. My concern about mean vs. median
comes from the introduction of median absolute deviation (MAD) vs. the
histograms/mean we have used before for other analyses, and I just do not
want this point to get lost.


@gregcaporaso
Member

I think the histograms cover what we'd show in a five-number summary. I think you're saying that it'd be worth mentioning in the paper why we're choosing to use median, etc. rather than mean - is that right? I agree that that's a good technical point to mention.

Also, I wanted to point out that Jai is working on a subsampling strategy relevant for time series analysis to address Gilbert's suggestion for subsampling. We're discussing this here, and he is shooting to have a function in place that we could use to explore this by the end of this week.

@rob-knight

There are two separate points here:

  1. mean vs median for comparisons of distances
  2. whether to use a measure of central tendency (mean or median or whatever) or a measure of spread (standard deviation or MAD or whatever)

In both cases, comparison and discussion would probably be a good idea.

Rob

On Nov 13, 2012, at 5:40 PM, Greg Caporaso wrote:

Also, I wanted to point out that Jai is working on a subsampling strategy relevant for time series analysis to address Gilbert's suggestion for subsampling. We're discussing this here (https://github.com/biocore/qiime/issues/446), and he is shooting to have a function in place that we could use to explore this by the end of this week.

@floresg
Contributor Author

floresg commented Nov 14, 2012

Besides the moving pictures data and infant gut time-series, the other human microbiome time series studies involve the vagina and nares. Both used different metrics to quantify beta diversity variability. In the vaginal paper, they used the median of Jensen-Shannon divergence to represent "community deviation from constancy." The supplemental section of this manuscript describes this metric but it sounds like it is just another metric based on entropy. They do provide justification of this choice but it is not very clear. The nares paper used the index of multivariate dispersion (IMD) to measure "the variability of an individuals bacterial community structure among the months." I did a little digging on this metric but could not find anything very helpful. These two metrics might be something we want to look into for our work and at least should start a constructive conversation. I am not sure how to add the papers to GitHub so I will send them to Greg and maybe he can add them to my comment here?
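
For reference, Jensen-Shannon divergence itself is straightforward to compute from relative abundances. The sketch below takes the median JSD between each time point and the subject's mean profile as a stand-in for "deviation from constancy"; that choice is an assumption made here for illustration and may not match the definition in the vaginal microbiome paper. The abundance profiles are hypothetical.

```python
from math import log2
from statistics import median


def jensen_shannon_divergence(p, q):
    """Base-2 JSD between two relative-abundance vectors of equal length."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms where x_i is zero contribute nothing to the sum
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def median_jsd_to_mean_profile(profiles):
    """Median JSD between each time point and the mean profile (assumed summary)."""
    n = len(profiles)
    mean_profile = [sum(column) / n for column in zip(*profiles)]
    return median([jensen_shannon_divergence(p, mean_profile) for p in profiles])


# Hypothetical relative abundances of three taxa at four time points
stable = [[0.70, 0.20, 0.10], [0.68, 0.22, 0.10],
          [0.71, 0.19, 0.10], [0.69, 0.21, 0.10]]
shifting = [[0.90, 0.05, 0.05], [0.10, 0.85, 0.05],
            [0.80, 0.10, 0.10], [0.05, 0.15, 0.80]]

print(f"stable:   {median_jsd_to_mean_profile(stable):.4f}")
print(f"shifting: {median_jsd_to_mean_profile(shifting):.4f}")
```

With base-2 logarithms the divergence lies between 0 (identical profiles) and 1 (no shared taxa), so values are comparable across habitats.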

@gregcaporaso
Member

Here are links to those two papers: Camarinha-Silva (2012) and Gajer (2012).

@rob-knight

We collaborate with Jacques/Pawel so let me know if methods clarifications needed: Jacques and I are on the NIH call right after Fri meeting so I can bug him then...

On Nov 14, 2012, at 9:02 PM, Greg Caporaso wrote:

Here are links to those two papers: Camarinha-Silva (2012) (http://onlinelibrary.wiley.com/doi/10.1111/j.1758-2229.2011.00313.x/full) and Gajer (2012) (http://sciencemedicine.org/content/4/132/132ra52.short).

@floresg
Contributor Author

floresg commented Nov 30, 2012

Added beta diversity dotplots for average values and MAD. For unweighted UniFrac, the results agree with Greg's boxplots and statistical analysis; that is, variability of palm > forehead > gut > tongue. However, weighted UniFrac and MAD tell a different story.

@gregcaporaso
Member

@floresg is going to look specifically at what was previously issue #8 here (Are individuals that reported having atopic diseases (allergies, asthma, eczema, etc) more or less stable than those that did not? Diversity higher or lower?)

@floresg
Contributor Author

floresg commented Apr 9, 2013

I have added some text and tables but am still working on this issue.
