Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converting weighted abundance to raw counts #461

Closed
alexmsalmeida opened this issue Apr 18, 2018 · 10 comments · Fixed by #886
Closed

Converting weighted abundance to raw counts #461

alexmsalmeida opened this issue Apr 18, 2018 · 10 comments · Fixed by #886
Labels

Comments

@alexmsalmeida
Copy link

Hi Titus,

I am using lca gather to quantify the abundance of a set number of reference genomes I have in a panel of metagenomic datasets (fastq reads). I am then extracting the "f_unique_weighted" from the different samples and using this to determine which genomes are differentially abundant between two sample groups. I just had a couple of questions:

  1. What is the actual meaning of the values output by f_unique_weighted? Does this equate to % of reads?

  2. There is an increasing body of evidence suggesting that for differential abundance analysis we should be using raw counts subsequently transformed with compositional methods, instead of simply using percentages. Would there be a way of converting these values to actual counts? Would it be valid to multiply this number by the total read count of the metagenomic dataset?

Many thanks in advance for your help and advice.

Alex

@ctb
Copy link
Contributor

ctb commented Apr 4, 2020

A few belated responses :).

@luizirber is tackling the first question in his thesis, with benchmarking and all. Should be available in a month or two.

regarding (2), sourmash operates on k-mers, not reads, so translation back to read counts will always be direct. That having been said, gather outputs direct k-mer counts for the query; this is the column median_abund in the CSV produced by gather with -o/--output. The value is the median abundance of the match genome's hashes in the query signature. (Note the query signature must have been computed with --track-abundance.)

This number seems to reflect the abundance of the matching genome in the query with reasonable accuracy; @taylorreiter did a comparison with simka that she may be able to describe, too.

@ctb
Copy link
Contributor

ctb commented Apr 6, 2020

re-reading this, there are a few different issues.

the first question is whether the abundance of the subject genomes will be properly measured by sourmash. I believe the answer is yes - sourmash gather abundance (and I think lca gather :) should be closely related to the "true" k-mer abundance.

(I mis-spoke when I invoked Taylor's experience w/simka, that was about metagenome-to-metagenome abundance.)

another issue is (of course) the representation in the database. If the exact genomes aren't in the gather database (and they never are, with metagenomes...) you run into several challenges where you're only measuring the bits of the genome that is in the database. Maybe that's ok.

a related issue is the question of multiple related strains or species in the gather database. If there are overlaps in the genomes in the database, the reported abundances should represent the bits that are unique to each genome in a way that is a bit hard to understand. Happy to expand.

That all having been said, median_abund in the CSV output is probably the right thing to use and it addresses concerns about raw counts.

@yuzie0314
Copy link

yuzie0314 commented Mar 13, 2024

Hi Titus @ctb , I don't know whether I could ask the following questions related to this issue, anyway....

We've used your tool to quantify the abundance of a set number of reference genomes against raw reads (query fastq).
In addtion, according to your suggestion here, we used median_abund to determine which genomes are differentially abundant among sample groups. However, we had some doubts on this value.

  1. when we use aligner like bwa or blast, we have to set a reasonable coverage cutoff (say 60% coverage) to decrease false positive when we quantify abundance between genomes and reads. However, when we used your tool, how to make sure those genomes present in samples are not false positive?
  2. we've compared your tool with bwa together to see if their results are comparable. The answer is Yes. why we used your tool to replace our original mapper was that your tool is way faster than bwa-based methods since our reference set was big (~60k genomes).
  3. we also spend some time on comparing biobakery results with sourmash evaluating abundance and we wanted to start to draft our manuscripts on these findings, so we need to obtain more information related to this discussion from you, and thanks in advance.

forgive me for my offensiveness to hijack this disscussion.
Yuzie

@ctb
Copy link
Contributor

ctb commented Mar 13, 2024

hi @yuzie0314 I'll move this to a new issue and respond there!

@Amanda-Biocortex
Copy link

Hi Titus @ctb

Does the abundance output reflect the number of reads associated with a given species? So if I know the number of reads in a sample, could I do:
relative abundance = median_abund / total number of reads?

Or is this an apples and orange situation where kmer abundance doesnt equate to read abundance?

Many thanks,
Amanda

@ctb
Copy link
Contributor

ctb commented Mar 26, 2024 via email

@Amanda-Biocortex
Copy link

Just to check @ctb , thats a 'yes' to doing relative abundance = median_abund / total number of reads?

@bluegenes
Copy link
Contributor

hi @Amanda-Biocortex,

Some thoughts while @ctb is offline:

Does the abundance output reflect the number of reads associated with a given species?

Yes - k-mer analysis (including abundance) is strongly correlated with read mapping. There's some additional context in the FAQ here: https://sourmash.readthedocs.io/en/latest/faq.html#how-do-read-mapping-rates-for-metagenomes-compare-with-k-mer-statistics and Figure 5 in https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2 provides a k-mers vs read-mapping plot showing correspondence. Note that we plot the number of bases covered to account for differences between reads and k-mers.

So if I know the number of reads in a sample, could I do:
relative abundance = median_abund / total number of reads?

If you're going for differential abundance, we've had success using k-mer counts with corncob, see tutorial: https://taylorreiter.github.io/2022-08-29-From-raw-metagenome-reads-to-taxonomic-differential-abundance-with-sourmash-and-corncob/.

For relative abundance, it would be better to normalize by the number of k-mers in each sample rather than the number of reads (since median_abund is a k-mer abundance estimate, not a read count). We can get the total number of k-mers reflected by the median_abund first and then normalize:

n_weighted_match = (unique_intersect_bp/scaled) * median_abund
rel_abund = n_weighted_match/total_weighted_hashes

Relevant gather columns:

  • unique_intersect_bp- Size of overlap between match and remaining query, estimated by multiplying the number of overlapping hashes by scaled. Rank/order dependent. Does not double count hashes.
  • scaled - Scaled value of the comparison.
  • median_abund - Median abundance of the weighted hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count.
  • total_weighted_hashes - Sum of hashes x abundance for the entire dataset. Constant value.

You can see a full annotated list of gather output columns here: https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#appendix-d-gather-csv-output-columns

@Amanda-Biocortex
Copy link

Thanks so much for your help!

One last question @bluegenes - is there a reason why you've recommended median_abund over average_abund?

@ctb
Copy link
Contributor

ctb commented Apr 2, 2024

One last question @bluegenes - is there a reason why you've recommended median_abund over average_abund?

This is based on long history with k-mers - some k-mers can be very high multiplicity for various reasons (repeats, or sequencing artifacts/errors) and the presence of one or two such k-mers in a read or contig will seriously bias the average, while the median will be unaffected.

(This is why we used median k-mer abundance in digital normalization, over a decade ago.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants