
suggested rarefied level #650

Closed
antgonza opened this issue Nov 20, 2014 · 13 comments

@antgonza
Member

From @clozupone, rewritten: add a way to see the number of sequences per sample in a meta-analysis before choosing the rarefaction level.

@antgonza antgonza added this to the Alpha 0.1 milestone Nov 20, 2014
@squirrelo
Contributor

This has come up a bunch, and the idea is to have an on-the-fly histogram of sequences per sample created on the page where you select rarefaction level. Not sure how computationally intensive this will be though since that's potentially a lot of biom tables to go through.

@antgonza
Member Author

antgonza commented Dec 2, 2014

The most expensive step is creating the stats of the biom file per study; if we can store these at creation time, then a histogram should be pretty fast.

@wasade
Contributor

wasade commented Dec 2, 2014

It isn't a violation of the biom file format to add in HDF5 attributes. On the other hand, we can do this specific computation much faster than the summarize method, as we only need to look at indptr, not the full matrix. It should actually be very quick to compute.


@wasade
Contributor

wasade commented Dec 2, 2014

From chatting with @antgonza and @josenavas, there is another subtlety to this: in a meta-analysis, the BIOM table doesn't yet exist. Regardless of whether we want to present sequence counts to the user prior to or following table creation, I think we can do something reasonably fast by just interrogating the compressed axis directly:

from h5py import File

def get_sample_stats(open_hdf5_file, samples=None):
    # Dataset paths follow the BIOM 2.x HDF5 layout ('sample/matrix/*');
    # the per-sample total is the sum of the CSR data slice delimited by
    # consecutive indptr values.
    if samples is None:
        samples = open_hdf5_file['sample/ids']

    samples = set(samples)

    indptr = open_hdf5_file['sample/matrix/indptr']
    data = open_hdf5_file['sample/matrix/data']

    for idx, id_ in enumerate(open_hdf5_file['sample/ids']):
        if id_ in samples:
            start = indptr[idx]
            end = indptr[idx + 1]

            yield (id_, data[start:end].sum())

Asking how many unique OTUs are observed per sample is even lighter. While in general it is nice to have the BIOM table in memory, it does represent a fair bit of overhead to assess basic summaries. The overhead will be significant if the number of tables that need to be interrogated is large.
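That lighter per-sample OTU count falls out of indptr alone: the number of stored (non-zero) entries for a sample is just the difference of consecutive indptr values. A minimal sketch of the idea (the helper name and plain-array inputs are illustrative, not from the thread):

```python
import numpy as np

def otus_per_sample(indptr, ids):
    """Yield (sample_id, n_observed_otus) from a CSR-style indptr array.

    indptr can be the h5py dataset itself or any sequence; no matrix
    data is read at all, only the index pointers.
    """
    counts = np.diff(np.asarray(indptr))
    for id_, count in zip(ids, counts):
        yield (id_, int(count))
```

For example, with indptr [0, 2, 5, 5], samples 'a', 'b', and 'c' observe 2, 3, and 0 OTUs respectively.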

@gregcaporaso
Contributor

What if you store the number of sequences per sample in the database, and then when the user selects their samples you could display the histogram of number of sequences per sample? To assist the user in making their choice, you could annotate (e.g.) the 10th, 20th, 25th, 50th, 75th, 80th, 90th percentiles in that distribution and note what fraction of their samples they'd discard with each level.
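A sketch of that annotation step, assuming only numpy (the function name and output shape are my own):

```python
import numpy as np

def depth_choices(seqs_per_sample, percentiles=(10, 20, 25, 50, 75, 80, 90)):
    """For each percentile of the per-sample sequence counts, report the
    candidate rarefaction depth and the fraction of samples that would
    be discarded (samples with fewer sequences than that depth)."""
    counts = np.asarray(seqs_per_sample)
    for p in percentiles:
        depth = np.percentile(counts, p)
        discarded = np.mean(counts < depth)
        yield (p, float(depth), float(discarded))
```

Each (percentile, depth, fraction-discarded) triple could then be drawn as an annotated vertical line on the histogram.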

@rob-knight

Great idea!


@wasade
Contributor

wasade commented Mar 12, 2015

Just as an extension: I like the thought of storing within the BIOM table, similar to how we store stats in the demux files. Since we're pending a minor release for biom, we could actually just get this lumped in as a new option to the table summarizer (--save-stats). Fetching the stats would not require a parse of the table, just opening the h5py file. However, storing in the db is light as well, though it puts the stats further away from the actual data they represent.


@squirrelo squirrelo modified the milestone: Alpha 0.1 Apr 4, 2015
@antgonza antgonza modified the milestone: Alpha 0.3 Oct 10, 2015
@antgonza
Member Author

I have been thinking about this and I can see the benefits of the described approaches. However, thinking pragmatically, the histograms need to be good but not perfect. With this in mind, what about adding a step to processing (OTU picking) so a default histogram is generated, and we add this information to the processed data table as an array of ints? The default histogram can cover 0->50K (we can discuss this, or we can create a histogram of all the current biom tables so we get a better range) with 100 or 1K points. With this approach we can quickly display the histogram of one biom, but more importantly we can show histograms of each processed data and the combination of all for meta-analyses. Finally, this solution is really easy to implement and has minimal DB changes. What do you think?
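The fixed-bin scheme can be sketched with numpy (the bin count, range, and function names here are illustrative, not from Qiita):

```python
import numpy as np

# Shared edges: 100 bins over 0..50K, so every table's histogram is a
# plain integer array that can be stored in the DB as-is.
BIN_EDGES = np.linspace(0, 50000, 101)

def histogram_counts(seqs_per_sample):
    counts, _ = np.histogram(seqs_per_sample, bins=BIN_EDGES)
    return counts

def combine(histograms):
    """Meta-analysis view: because all tables share BIN_EDGES, the
    combined histogram is just the element-wise sum."""
    return np.sum(histograms, axis=0)
```

Merging per-table histograms this way never touches the original biom files, which is what makes the meta-analysis display cheap.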

@wasade
Contributor

wasade commented Oct 23, 2015

This should break down to:

table = load_table(foo)
res = hist(table.sum(axis='sample'))

Recommend storing the bar positions and heights instead of a PDF so that the visual can be changed in the UI if needed, but the path of least resistance is to just generate the image for display.

@antgonza
Member Author

Yes, simple addition. We could store positions and height with the default 10 values and on merge we simply display the combination. Note that this might not work well when you start filtering down samples by metadata ...

@gregcaporaso
Contributor

I've got code for this (and for making the plots interactive) - code here: https://github.com/gregcaporaso/q2d2/blob/master/q2d2/__init__.py#L265, example image attached. Feel free to use - just throw an attribution in as a comment.

[example image: interactive sequences-per-sample histogram]


@clozupone

Sounds great!


@antgonza
Member Author

antgonza commented Apr 3, 2016

Solved in #1738.

Now every artifact has its own summary. For example, a biom artifact's summary looks like this:

[screenshot: artifact summary with sequences-per-sample histogram, 2016-04-03]

This should help you decide your rarefaction level.

@antgonza antgonza closed this as completed Apr 3, 2016