Converting weighted abundance to raw counts #461

alexmsalmeida · 2018-04-18T10:20:47Z

Hi Titus,

I am using lca gather to quantify the abundance of a set number of reference genomes I have in a panel of metagenomic datasets (fastq reads). I am then extracting the "f_unique_weighted" from the different samples and using this to determine which genomes are differentially abundant between two sample groups. I just had a couple of questions:

What is the actual meaning of the values output by f_unique_weighted? Does this equate to % of reads?
There is an increasing body of evidence suggesting that for differential abundance analysis we should be using raw counts subsequently transformed with compositional methods, instead of simply using percentages. Would there be a way of converting these values to actual counts? Would it be valid to multiply this number by the total read count of the metagenomic dataset?

Many thanks in advance for your help and advice.

Alex

ctb · 2020-04-04T13:57:39Z

A few belated responses :).

@luizirber is tackling the first question in his thesis, with benchmarking and all. Should be available in a month or two.

regarding (2), sourmash operates on k-mers, not reads, so translation back to read counts will always be direct. That having been said, gather outputs direct k-mer counts for the query; this is the column median_abund in the CSV produced by gather with -o/--output. The value is the median abundance of the match genome's hashes in the query signature. (Note the query signature must have been computed with --track-abundance.)

This number seems to reflect the abundance of the matching genome in the query with reasonable accuracy; @taylorreiter did a comparison with simka that she may be able to describe, too.

ctb · 2020-04-06T16:50:30Z

re-reading this, there are a few different issues.

the first question is whether the abundance of the subject genomes will be properly measured by sourmash. I believe the answer is yes - sourmash gather abundance (and I think lca gather :) should be closely related to the "true" k-mer abundance.

(I mis-spoke when I invoked Taylor's experience w/simka, that was about metagenome-to-metagenome abundance.)

another issue is (of course) the representation in the database. If the exact genomes aren't in the gather database (and they never are, with metagenomes...) you run into several challenges where you're only measuring the bits of the genome that is in the database. Maybe that's ok.

a related issue is the question of multiple related strains or species in the gather database. If there are overlaps in the genomes in the database, the reported abundances should represent the bits that are unique to each genome in a way that is a bit hard to understand. Happy to expand.

That all having been said, median_abund in the CSV output is probably the right thing to use and it addresses concerns about raw counts.

yuzie0314 · 2024-03-13T03:58:30Z

Hi Titus @ctb , I don't know whether I could ask the following questions related to this issue, anyway....

We've used your tool to quantify the abundance of a set number of reference genomes against raw reads (query fastq).
In addtion, according to your suggestion here, we used median_abund to determine which genomes are differentially abundant among sample groups. However, we had some doubts on this value.

when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?
we've compared your tool with bwa together to see if their results are comparable. The answer is Yes. why we used your tool to replace our original mapper was that your tool is way faster than bwa-based methods since our reference set was big (~60k genomes).
we also spend some time on comparing biobakery results with sourmash evaluating abundance and we wanted to start to draft our manuscripts on these findings, so we need to obtain more information related to this discussion from you, and thanks in advance.

forgive me for my offensiveness to hijack this disscussion.
Yuzie

ctb · 2024-03-13T15:25:44Z

hi @yuzie0314 I'll move this to a new issue and respond there!

Amanda-Biocortex · 2024-03-26T13:31:58Z

Hi Titus @ctb

Does the abundance output reflect the number of reads associated with a given species? So if I know the number of reads in a sample, could I do:
relative abundance = median_abund / total number of reads?

Or is this an apples and orange situation where kmer abundance doesnt equate to read abundance?

Many thanks,
Amanda

ctb · 2024-03-26T13:36:43Z

Yes, I think so. More pointers in a bit. On Mar 26, 2024, at 7:32 AM, Amanda ***@***.***> wrote: Hi Titus @ctb<https://github.com/ctb> Does the abundance output reflect the number of reads associated with a given species? So if I know the number of reads in a sample, could I do: relative abundance = median_abund / total number of reads? Or is this an apples and orange situation where kmer abundance doesnt equate to read abundance? Many thanks, Amanda — Reply to this email directly, view it on GitHub<#461 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAAMOSH3MQBG6UVJ6LI726DY2FTFJAVCNFSM4E3HG7G2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBSGA2DIMBVG4ZA>. You are receiving this because you were mentioned.Message ID: ***@***.***>

Amanda-Biocortex · 2024-03-26T14:00:52Z

Just to check @ctb , thats a 'yes' to doing relative abundance = median_abund / total number of reads?

bluegenes · 2024-03-26T17:41:16Z

hi @Amanda-Biocortex,

Some thoughts while @ctb is offline:

Does the abundance output reflect the number of reads associated with a given species?

Yes - k-mer analysis (including abundance) is strongly correlated with read mapping. There's some additional context in the FAQ here: https://sourmash.readthedocs.io/en/latest/faq.html#how-do-read-mapping-rates-for-metagenomes-compare-with-k-mer-statistics and Figure 5 in https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2 provides a k-mers vs read-mapping plot showing correspondence. Note that we plot the number of bases covered to account for differences between reads and k-mers.

So if I know the number of reads in a sample, could I do:
relative abundance = median_abund / total number of reads?

If you're going for differential abundance, we've had success using k-mer counts with corncob, see tutorial: https://taylorreiter.github.io/2022-08-29-From-raw-metagenome-reads-to-taxonomic-differential-abundance-with-sourmash-and-corncob/.

For relative abundance, it would be better to normalize by the number of k-mers in each sample rather than the number of reads (since median_abund is a k-mer abundance estimate, not a read count). We can get the total number of k-mers reflected by the median_abund first and then normalize:

n_weighted_match = (unique_intersect_bp/scaled) * median_abund
rel_abund = n_weighted_match/total_weighted_hashes

Relevant gather columns:

unique_intersect_bp- Size of overlap between match and remaining query, estimated by multiplying the number of overlapping hashes by scaled. Rank/order dependent. Does not double count hashes.

scaled - Scaled value of the comparison.

median_abund - Median abundance of the weighted hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count.

total_weighted_hashes - Sum of hashes x abundance for the entire dataset. Constant value.

You can see a full annotated list of gather output columns here: https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#appendix-d-gather-csv-output-columns

Amanda-Biocortex · 2024-03-27T14:16:22Z

Thanks so much for your help!

One last question @bluegenes - is there a reason why you've recommended median_abund over average_abund?

ctb · 2024-04-02T14:38:06Z

One last question @bluegenes - is there a reason why you've recommended median_abund over average_abund?

This is based on long history with k-mers - some k-mers can be very high multiplicity for various reasons (repeats, or sequencing artifacts/errors) and the presence of one or two such k-mers in a read or contig will seriously bias the average, while the median will be unaffected.

(This is why we used median k-mer abundance in digital normalization, over a decade ago.)

luizirber added the question label May 4, 2019

ctb mentioned this issue Apr 24, 2020

[MRG] document gather abund tests a bit better; minor refactoring #886

Merged

5 tasks

luizirber closed this as completed in #886 Apr 28, 2020

ctb mentioned this issue Mar 13, 2024

sourmash vs aligners, & comparison with other tools #3079

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Converting weighted abundance to raw counts #461

Converting weighted abundance to raw counts #461

alexmsalmeida commented Apr 18, 2018

ctb commented Apr 4, 2020

ctb commented Apr 6, 2020

yuzie0314 commented Mar 13, 2024 •

edited

ctb commented Mar 13, 2024

Amanda-Biocortex commented Mar 26, 2024

ctb commented Mar 26, 2024 via email

Amanda-Biocortex commented Mar 26, 2024

bluegenes commented Mar 26, 2024

Amanda-Biocortex commented Mar 27, 2024

ctb commented Apr 2, 2024

Converting weighted abundance to raw counts #461

Converting weighted abundance to raw counts #461

Comments

alexmsalmeida commented Apr 18, 2018

ctb commented Apr 4, 2020

ctb commented Apr 6, 2020

yuzie0314 commented Mar 13, 2024 • edited

ctb commented Mar 13, 2024

Amanda-Biocortex commented Mar 26, 2024

ctb commented Mar 26, 2024 via email

Amanda-Biocortex commented Mar 26, 2024

bluegenes commented Mar 26, 2024

Amanda-Biocortex commented Mar 27, 2024

ctb commented Apr 2, 2024

yuzie0314 commented Mar 13, 2024 •

edited