suggested rarefied level #650
Comments
This has come up a bunch, and the idea is to have an on-the-fly histogram of sequences per sample created on the page where you select rarefaction level. Not sure how computationally intensive this will be though since that's potentially a lot of biom tables to go through. |
The most expensive step is to create the stats of the biom file per study, |
It isn't a violation of the BIOM file format to add in HDF5 attributes.
|
From chatting with @antgonza and @josenavas, there is another subtlety to this, which is that in a meta-analysis the BIOM table doesn't yet exist. Regardless of whether we want to present sequence counts to the user prior to or following table creation, I think we can do something reasonably fast by just interrogating the compressed axis directly:

```python
from h5py import File

def get_sample_stats(open_hdf5_file, samples=None):
    """Yield (sample_id, sequence_count) pairs from an open HDF5 BIOM file."""
    # Default to every sample in the table
    if samples is None:
        samples = open_hdf5_file['sample/ids']
    samples = set(samples)
    # Compressed (CSC-style) sample axis; adjust these paths if your file
    # nests them elsewhere (e.g. under 'sample/matrix/')
    indptr = open_hdf5_file['sample/indptr']
    data = open_hdf5_file['sample/data']
    for idx, id_ in enumerate(open_hdf5_file['sample/ids']):
        if id_ in samples:
            # indptr[idx]:indptr[idx + 1] spans this sample's nonzero values
            start = indptr[idx]
            end = indptr[idx + 1]
            yield (id_, data[start:end].sum())
```

Asking how many unique OTUs are observed per sample is even lighter. While in general it is nice to have the BIOM table in memory, it does represent a fair bit of overhead to assess basic summaries. The overhead will be significant if the number of tables that need to be interrogated is large. |
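The "even lighter" query for observed OTUs per sample can be sketched without touching the data values at all. This is a minimal illustration, assuming the sample axis is stored in compressed sparse column form with no explicitly stored zeros; the function name `observed_otus_per_sample` is hypothetical and not part of any library.

```python
import numpy as np

def observed_otus_per_sample(indptr):
    """Count stored (nonzero) entries per sample from a CSC index pointer.

    Assumes no explicit zeros are stored, so each stored entry is one
    observed OTU for that sample.
    """
    indptr = np.asarray(indptr)
    # Consecutive differences give the number of stored entries per column
    return np.diff(indptr)

# Toy index pointer for three samples holding 2, 0 and 3 nonzero entries
counts = observed_otus_per_sample([0, 2, 2, 5])
```

Only the index pointer is read here, so the cost scales with the number of samples rather than the number of nonzero entries.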
What if you store the number of sequences per sample in the database, and then when the user selects their samples you could display the histogram of number of sequences per sample? To assist the user in making their choice, you could annotate (e.g.) the 10th, 20th, 25th, 50th, 75th, 80th, 90th percentiles in that distribution and note what fraction of their samples they'd discard with each level. |
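The percentile annotation suggested above could look roughly like the following. This is a sketch under assumptions: the function name `rarefaction_candidates` is hypothetical, the per-sample counts are taken to be already available (for example, from the database), and a sample is treated as discarded when it has fewer sequences than the candidate depth.

```python
import numpy as np

def rarefaction_candidates(seq_counts, percentiles=(10, 20, 25, 50, 75, 80, 90)):
    """Return (percentile, depth, fraction_discarded) for each percentile."""
    counts = np.asarray(seq_counts)
    rows = []
    for p in percentiles:
        # Candidate rarefaction depth at this percentile of the distribution
        depth = int(np.percentile(counts, p))
        # Samples with fewer sequences than the depth would be dropped
        discarded = float(np.mean(counts < depth))
        rows.append((p, depth, discarded))
    return rows

# Toy counts for five samples
rows = rarefaction_candidates([100, 200, 300, 400, 500])
```

Each row can then be drawn as a vertical marker on the histogram, labelled with the fraction of samples lost at that depth.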
Great idea! |
Just an extension, I like the thought of storing within the BIOM table
|
I have been thinking about this and I can see the benefits of the described approaches. However, thinking pragmatically, the histograms need to be good but not perfect. With this in mind, what about adding a step to the processing (OTU picking) so that a default histogram is generated, and we add this information to the process data table as an array of ints? The default histogram can be from 0->50K (we can discuss this, or we can create a histogram of all the current BIOM tables so we get a better range) with 100 or 1K points. With this approach we can quickly display the histogram of one BIOM but, more importantly, can show histograms of each process data and the combination of all for meta-analyses. Finally, this solution is really easy to implement and has minimal DB changes. What do you think? |
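A default histogram of this kind might be computed as in the sketch below. The range (0 to 50K) and the 100 bins come from the proposal above; the names `HIST_RANGE`, `N_BINS` and `default_histogram`, and the choice to fold counts above the range into the last bin, are assumptions made for illustration.

```python
import numpy as np

# Defaults taken from the proposal; both are open to discussion
HIST_RANGE = (0, 50_000)
N_BINS = 100

def default_histogram(seq_counts):
    """Return bar heights (a list of ints) over fixed, shared bins."""
    # Assumption: counts above the range are folded into the last bin
    clipped = np.clip(seq_counts, *HIST_RANGE)
    heights, _ = np.histogram(clipped, bins=N_BINS, range=HIST_RANGE)
    return heights.astype(int).tolist()

# Toy per-sample sequence counts
heights = default_histogram([0, 499, 500, 60_000])
```

Because every table uses the same bin edges, only the heights need to be stored, which keeps the database change to a single integer array per table.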
This should break down to: Table = load_table(foo) Recommend storing the bar positions and height instead of a PDF so that the
|
Yes, simple addition. We could store positions and height with the default 10 values and on merge we simply display the combination. Note that this might not work well when you start filtering down samples by metadata ... |
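Displaying "the combination" on merge could be as simple as summing bar heights, provided every stored histogram uses the same bar positions. The helper name `merge_histograms` is hypothetical; this is a sketch, not the implementation that was eventually adopted.

```python
def merge_histograms(histograms):
    """Element-wise sum of bar heights across histograms with identical bins."""
    if not histograms:
        return []
    n_bins = len(histograms[0])
    # Heights are only comparable when the bar positions match
    if any(len(h) != n_bins for h in histograms):
        raise ValueError("all histograms must share the same bins")
    return [sum(bars) for bars in zip(*histograms)]

# Two toy tables with three bars each
merged = merge_histograms([[1, 0, 2], [0, 3, 1]])
```

The limitation noted above still applies: stored heights describe all samples in a table, so once samples are filtered by metadata the merged bars no longer match the selected subset, and the counts would have to be recomputed per sample.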
I've got code for this (and for making the plots interactive) - code here
|
Sounds great! |
Solved in #1738. Now, every artifact has its own summary. For example, a BIOM has: |
From @clozupone. Rewriting: add a way to know the sequences per sample in a meta-analysis before defining the rarefaction level.