Converting weighted abundance to raw counts #461
A few belated responses :). @luizirber is tackling the first question in his thesis, with benchmarking and all; it should be available in a month or two. Regarding (2), sourmash operates on k-mers, not reads, so translation back to read counts will always be indirect. That having been said, gather outputs direct k-mer counts for the query, reported in one of its output columns. This number seems to reflect the abundance of the matching genome in the query with reasonable accuracy; @taylorreiter did a comparison with simka that she may be able to describe, too. |
Re-reading this, there are a few different issues.
The first question is whether the abundance of the subject genomes will be properly measured by sourmash. I believe the answer is yes: sourmash gather abundance (and I think lca gather :) should be closely related to the "true" k-mer abundance. (I misspoke when I invoked Taylor's experience with simka; that was about metagenome-to-metagenome abundance.)
Another issue is (of course) the representation in the database. If the exact genomes aren't in the gather database (and they never are, with metagenomes...) you run into several challenges, because you're only measuring the bits of the genome that are in the database. Maybe that's OK.
A related issue is the question of multiple related strains or species in the gather database. If there are overlaps between the genomes in the database, the reported abundances should represent the bits that are unique to each genome, in a way that is a bit hard to understand. Happy to expand.
That all having been said, |
Hi Titus @ctb, I'm not sure whether it's OK to ask questions related to this issue here, but anyway: we've used your tool to quantify the abundance of a set number of reference genomes against raw reads (query FASTQ).
Forgive me for hijacking this discussion. |
hi @yuzie0314 I'll move this to a new issue and respond there! |
Hi Titus @ctb, does the abundance output reflect the number of reads associated with a given species? So if I know the number of reads in a sample, could I do: relative abundance = median_abund / total number of reads? Or is this an apples-and-oranges situation where k-mer abundance doesn't equate to read abundance? Many thanks, Amanda |
Yes, I think so. More pointers in a bit.
|
Just to check @ctb, that's a 'yes' to doing relative abundance = median_abund / total number of reads? |
Some thoughts while @ctb is offline:
Yes - k-mer analysis (including abundance) is strongly correlated with read mapping. There's some additional context in the FAQ here: https://sourmash.readthedocs.io/en/latest/faq.html#how-do-read-mapping-rates-for-metagenomes-compare-with-k-mer-statistics and Figure 5 in https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2 provides a k-mers vs read-mapping plot showing correspondence. Note that we plot the number of bases covered to account for differences between reads and k-mers.
If you're going for differential abundance, we've had success using k-mer counts with For relative abundance, it would be better to normalize by the number of k-mers in each sample rather than the number of reads (since
|
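A minimal sketch of the normalization suggested above, assuming each sample is summarized as a mapping from genome name to a k-mer abundance value (for example `median_abund`) together with a total k-mer count for that sample. The function name `normalize_by_kmers`, the dict layout, the genome names, and all numbers are hypothetical and are not sourmash output.

```python
def normalize_by_kmers(abundances, total_kmers):
    """Divide each genome's k-mer abundance by the sample's total k-mer count."""
    if total_kmers <= 0:
        raise ValueError("total_kmers must be positive")
    return {genome: value / total_kmers for genome, value in abundances.items()}


# Hypothetical values: sample B was sequenced twice as deeply as sample A,
# so its raw abundances and its total k-mer count are both doubled.
sample_a = {"genome_1": 20, "genome_2": 5}
sample_b = {"genome_1": 40, "genome_2": 10}

norm_a = normalize_by_kmers(sample_a, total_kmers=1_000_000)
norm_b = normalize_by_kmers(sample_b, total_kmers=2_000_000)

print(norm_a)
print(norm_b)
```

With these made-up inputs the two samples end up with the same normalized values, which is the point of dividing by a per-sample k-mer total: differences in sequencing depth cancel out before samples are compared.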
Thanks so much for your help! One last question @bluegenes - is there a reason why you've recommended median_abund over average_abund? |
This is based on a long history with k-mers: some k-mers can have very high multiplicity for various reasons (repeats, or sequencing artifacts/errors), and the presence of one or two such k-mers in a read or contig will seriously bias the average, while the median will be largely unaffected. (This is why we used median k-mer abundance in digital normalization, over a decade ago.) |
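A small illustration of the point about averages versus medians, using Python's standard `statistics` module. The k-mer multiplicities below are invented for the example: most k-mers appear about 10 times, and two are assumed to fall in a repeat or sequencing artifact.

```python
from statistics import mean, median

# Hypothetical k-mer multiplicities for one genome match.
# The last two values stand in for repeat or artifact k-mers.
kmer_abundances = [9, 10, 10, 11, 10, 9, 11, 10, 4000, 5200]

avg = mean(kmer_abundances)
med = median(kmer_abundances)

print(f"mean   = {avg:.1f}")
print(f"median = {med}")
```

Here the two outliers pull the mean far above the typical multiplicity, while the median stays at the value most k-mers actually have.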
Hi Titus,
I am using lca gather to quantify the abundance of a set number of reference genomes I have in a panel of metagenomic datasets (fastq reads). I am then extracting the "f_unique_weighted" from the different samples and using this to determine which genomes are differentially abundant between two sample groups. I just had a couple of questions:
1. What is the actual meaning of the values output by f_unique_weighted? Does this equate to % of reads?
2. There is an increasing body of evidence suggesting that for differential abundance analysis we should be using raw counts subsequently transformed with compositional methods, instead of simply using percentages. Would there be a way of converting these values to actual counts? Would it be valid to multiply this number by the total read count of the metagenomic dataset?
Many thanks in advance for your help and advice.
Alex
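To make the second question concrete, the proposed conversion can be written as the sketch below. It assumes `f_unique_weighted` is a fraction between 0 and 1; the function name `fraction_to_pseudocount` and the numbers are hypothetical. The sketch only restates the arithmetic being asked about and is not a validated method for producing counts.

```python
def fraction_to_pseudocount(f_unique_weighted, total):
    """Scale a weighted fraction by a per-sample total and round to an integer."""
    if not 0.0 <= f_unique_weighted <= 1.0:
        raise ValueError("expected a fraction between 0 and 1")
    return round(f_unique_weighted * total)


# Hypothetical numbers for illustration only: a genome accounting for a
# quarter of the weighted query, in a sample with a total of 2,000,000.
pseudocount = fraction_to_pseudocount(0.25, 2_000_000)
print(pseudocount)
```

Whether `total` should be a read count or a k-mer count is part of what the question is asking, so the parameter is deliberately left generic.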