
suggested rarefied level #650

Closed
antgonza opened this issue Nov 20, 2014 · 13 comments

@antgonza
Member

From @clozupone, rewritten: add a way to see the number of sequences per sample in a meta-analysis before choosing the rarefaction level.

@antgonza antgonza added this to the Alpha 0.1 milestone Nov 20, 2014
@squirrelo
Contributor

This has come up a bunch, and the idea is to have an on-the-fly histogram of sequences per sample created on the page where you select rarefaction level. Not sure how computationally intensive this will be though since that's potentially a lot of biom tables to go through.

@antgonza
Member Author

antgonza commented Dec 2, 2014

The most expensive step is creating the stats of the biom file per study; if we can store these at creation time, then a histogram should be pretty fast.

@wasade
Contributor

wasade commented Dec 2, 2014

It isn't a violation of the biom file format to add in HDF5 attributes. On the other hand, we can do this specific computation much faster than the summarize method, as we only need to look at indptr, not the full matrix. It should actually be very quick to compute.


@wasade
Contributor

wasade commented Dec 2, 2014

From chatting with @antgonza and @josenavas, there is another subtlety to this: in a meta-analysis, the BIOM table doesn't yet exist. Regardless of whether we want to present sequence counts to the user prior to or following table creation, I think we can do something reasonably fast by just interrogating the compressed axis directly:

from h5py import File

def get_sample_stats(open_hdf5_file, samples=None):
    # Dataset paths follow the BIOM 2.x HDF5 layout ('sample/matrix/*');
    # the per-sample total is the sum of the CSR data slice delimited by
    # consecutive indptr values.
    if samples is None:
        samples = open_hdf5_file['sample/ids']

    samples = set(samples)

    indptr = open_hdf5_file['sample/matrix/indptr']
    data = open_hdf5_file['sample/matrix/data']

    for idx, id_ in enumerate(open_hdf5_file['sample/ids']):
        if id_ in samples:
            start = indptr[idx]
            end = indptr[idx + 1]

            yield (id_, data[start:end].sum())

Asking how many unique OTUs are observed per sample is even lighter. While in general it is nice to have the BIOM table in memory, it does represent a fair bit of overhead to assess basic summaries. The overhead will be significant if the number of tables that need to be interrogated is large.
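That lighter per-sample OTU count falls out of indptr alone: the number of stored (non-zero) entries for a sample is just the difference of consecutive indptr values. A minimal sketch of the idea (the helper name and plain-array inputs are illustrative, not from the thread):

```python
import numpy as np

def otus_per_sample(indptr, ids):
    """Yield (sample_id, n_observed_otus) from a CSR-style indptr array.

    indptr can be the h5py dataset itself or any sequence; no matrix
    data is read at all, only the index pointers.
    """
    counts = np.diff(np.asarray(indptr))
    for id_, count in zip(ids, counts):
        yield (id_, int(count))
```

For example, with indptr [0, 2, 5, 5], samples 'a', 'b', and 'c' observe 2, 3, and 0 OTUs respectively.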

@gregcaporaso
Contributor

What if you store the number of sequences per sample in the database, and then when the user selects their samples you could display the histogram of number of sequences per sample? To assist the user in making their choice, you could annotate (e.g.) the 10th, 20th, 25th, 50th, 75th, 80th, 90th percentiles in that distribution and note what fraction of their samples they'd discard with each level.
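A sketch of that annotation step, assuming only numpy (the function name and output shape are my own):

```python
import numpy as np

def depth_choices(seqs_per_sample, percentiles=(10, 20, 25, 50, 75, 80, 90)):
    """For each percentile of the per-sample sequence counts, report the
    candidate rarefaction depth and the fraction of samples that would
    be discarded (samples with fewer sequences than that depth)."""
    counts = np.asarray(seqs_per_sample)
    for p in percentiles:
        depth = np.percentile(counts, p)
        discarded = np.mean(counts < depth)
        yield (p, float(depth), float(discarded))
```

Each (percentile, depth, fraction-discarded) triple could then be drawn as an annotated vertical line on the histogram.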

@rob-knight

Great idea!


@wasade
Contributor

wasade commented Mar 12, 2015

Just as an extension: I like the thought of storing within the BIOM table, similar to how we store stats in the demux files. Since we're pending a minor release for biom, we could actually just get this lumped in as a new option to the table summarizer (--save-stats). Fetching the stats would not require a parse of the table, just opening the h5py file. However, storing in the db is light as well, though it puts the stats further away from the actual data they represent.


@squirrelo squirrelo modified the milestone: Alpha 0.1 Apr 4, 2015
@antgonza antgonza modified the milestone: Alpha 0.3 Oct 10, 2015
@antgonza
Member Author

I have been thinking about this and I can see the benefits of the described approaches. However, thinking pragmatically, the histograms need to be good but not perfect. With this in mind, what about adding a step to processing (OTU picking) so a default histogram is generated, and we add this information to the processed data table as an array of ints? The default histogram can cover 0->50K (we can discuss this, or we can create a histogram of all the current biom tables so we get a better range) with 100 or 1K points. With this approach we can quickly display the histogram of one biom, but more importantly we can show histograms of each processed data and the combination of all for meta-analyses. Finally, this solution is really easy to implement and has minimal DB changes. What do you think?
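The fixed-bin scheme can be sketched with numpy (the bin count, range, and function names here are illustrative, not from Qiita):

```python
import numpy as np

# Shared edges: 100 bins over 0..50K, so every table's histogram is a
# plain integer array that can be stored in the DB as-is.
BIN_EDGES = np.linspace(0, 50000, 101)

def histogram_counts(seqs_per_sample):
    counts, _ = np.histogram(seqs_per_sample, bins=BIN_EDGES)
    return counts

def combine(histograms):
    """Meta-analysis view: because all tables share BIN_EDGES, the
    combined histogram is just the element-wise sum."""
    return np.sum(histograms, axis=0)
```

Merging per-table histograms this way never touches the original biom files, which is what makes the meta-analysis display cheap.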

@wasade
Contributor

wasade commented Oct 23, 2015

This should break down to:

table = load_table(foo)
res = hist(table.sum(axis='sample'))

Recommend storing the bar positions and heights instead of a PDF so that the visual can be changed in the UI if needed, but the path of least resistance is to just generate the image for display.

@antgonza
Member Author

Yes, simple addition. We could store positions and height with the default 10 values and on merge we simply display the combination. Note that this might not work well when you start filtering down samples by metadata ...

@gregcaporaso
Contributor

I've got code for this (and for making the plots interactive) - code here: https://github.com/gregcaporaso/q2d2/blob/master/q2d2/__init__.py#L265, example image attached. Feel free to use - just throw an attribution in as a comment.

[example image: interactive sequences-per-sample histogram]


@clozupone

Sounds great!


@antgonza
Member Author

antgonza commented Apr 3, 2016

Solved in #1738.

Now every artifact has its own summary. For example, a biom artifact's summary looks like this:

[screenshot: artifact summary with sequences-per-sample histogram, 2016-04-03]

This should help you decide your rarefaction level.

@antgonza antgonza closed this as completed Apr 3, 2016