Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behaviour on the ZymoBIOMICS Fecal Reference #5

Closed
fplazaonate opened this issue Dec 16, 2023 · 14 comments
Closed

Unexpected behaviour on the ZymoBIOMICS Fecal Reference #5

fplazaonate opened this issue Dec 16, 2023 · 14 comments

Comments

@fplazaonate
Copy link

Hi @bluenote-1577,

I have downloaded Illumina sequencing data of the ZymoBIOMICS Fecal Reference.
Then, I have performed taxonomic profiling with sylph (database: gtdb_r207, parameters: --estimate-unknown --read-length 150)
and with meteor, the tool we are currently developping based on species specific marker genes.

Then, I have compared species coverage of both tool in this log-scale scatterplot:
image

As you can see, both tools have good agreement for high coverage (lambda) species but sylph starts overestimating coverage when it reestimates lambda with its statistical model.
In addition, syplh fails to detect species below ~ 0.2x coverage which is one order of magnitude above the detection limit with simulated data.

There are strain mixtures of the same species in this sample that may explain this strange behaviour.

Your feedback is welcomed,
Florian

@bluenote-1577
Copy link
Owner

Hi @fplaza,

Thanks for the amazing benchmark.

This behaviour is interesting, and I believe it indicates that sylph's statistical model is not generalizing well enough to this set of sequencing conditions. Whether that's strain mixtures, exact sequencing technology, etc, I am not sure.

If you could link me the exact zymo dataset you're using, I would be happy to investigate further and see what's happening.

I'll leave this issue up as I investigate further. If I am unable to fix this and sylph ends up slightly overestimating coverage on real datasets, then I will update the manuscript/README with a warning. Perhaps this is a deeper technical problem that I will need to mathematically investigate on another project.

Thanks,

Jim

@bluenote-1577
Copy link
Owner

A few other points:

  1. I should mention that in Supplementary Fig. 8 in our manuscript (version 1, as of this post), you can see that sylph underestimates ANI a bit for a real illumina dataset. This is equivalent to overestimating coverage, which you have shown.
  2. Also, in Fig. 1 C, sylph overestimates coverage a bit on even synthetic datasets for very low effective coverage
  3. Note the difference in "effective coverage" versus 'coverage". The former takes into account sequencing error and a read-length factor, and is smaller than "coverage". I'm not sure if your synthetic results used effective or true coverage (specified by -u and --read-length options)

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 17, 2023

I ran an experiment on the Meslier et al. (2022) real dataset for Illumina reads downsampled to 1Gbp as specified on our manuscript.

image

I used minimap2 and coverm for x-axis coverage. Orange dots are not detected by sylph.

Similar behaviour occurs for the low-coverage genomes as you have presented. I'm using reference genomes with 100% ANI to the community, so this is a best-case scenario as far as detection limits. Our supplementary figures show that lower ANI means slightly lower detection limits.

@fplazaonate
Copy link
Author

fplazaonate commented Dec 17, 2023

Hi @bluenote-1577 ,

Thanks for the quick reply.

Here is the Illumina dataset (official link on the Zymo website is dead):
https://filesender.renater.fr/?s=download&token=6a4f92f1-fef4-4988-9f5a-01ee3c46fd11
PacBio sequencing data is available here:
https://t.co/LyMykkaWCD

I have plotted True coverage.
Notably, the results are the same with MetaPhlan4 (using this time relative abundance as MP4 does not compute coverage)
image

Best,
Florian

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 18, 2023

Hi @fplaza,

I discovered an interesting artifact recently. Can you let me know if your reads are quality controlled? Importantly, are your reads deduplicated?

I found that duplicated reads in Illumina sequencing cause big issues for sylph and low abundances. Even 9% duplicated reads, which fastQC says is high-quality, ruins sylph's statistical model for low abundance calculations. I am working now on an algorithm in sylph for deduplication.

After running fastp -i reads1.fq -I reads2.fq -o dedup.1.fq --out2 dedup.2.fq -D, try running sylph sketch -1 dedup.1.fq -2 dedup.2.fq ... etc instead and let me know how that goes if you have time. EDIT: See below

Thanks again for your extremely helpful benchmarks and bringing this important issue up.

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 19, 2023

Hi @fplaza,

I implemented a new sketching algorithm for paired-end reads in sylph that almost resolves this issue. I will be testing it and releasing the new algorithm in sylph v0.5.0 in the next few weeks.

If you want to try it, see the v0.5.0 branch in the sylph repository. You'll have to resketch your reads, though.

For a real mouse gut genome:

image

More tests will be to come...

@fplazaonate
Copy link
Author

fplazaonate commented Dec 19, 2023

Hi @bluenote-1577,

Reads have been quality controlled with fastp but without deduplication.
Duplication rate is around 8%.
The same pattern appears on other paired-end Illumina sequencing data (human fecal metagenomes).

image

Could you create a binary executable for this new version so I can give it a try?

Florian

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 19, 2023

@fplaza here's a link https://easyupload.io/4qqbgb

You can also compile it if you have rust available. Compilation is usually pretty easy

Run the same commands, making sure using -1 and -2 for sketching.

@fplazaonate
Copy link
Author

fplazaonate commented Dec 19, 2023

Here is a comparison for the ZymoBIOMICS Fecal Reference.
Results are indeed much better
image

Same thing on the other human fecal metagenomes (sylph new profiles only)
image

@bluenote-1577
Copy link
Owner

bluenote-1577 commented Dec 19, 2023

Awesome!

I will test on my own and improve this in the upcoming week. I'll let you know when sylph v0.5 is out officially.

Much thanks.

@fplazaonate
Copy link
Author

Great!
I have tested another sequencing technology for which I see the same bias as for Illumina in sylph v0.4. I will open another issue.

@fplazaonate
Copy link
Author

Hi @bluenote-1577 ,
Do you think you can provide a workaround for single end libraries as well?

@bluenote-1577
Copy link
Owner

@fplaza

Yes I can do it for single-end libraries too. I will need to do a bit of testing to make sure deduplication is not too aggressive.

Also: the deduplication procedure in the current sylph v0.5 branch is more up to date than the binary I provided you and the final version will be different than the binary, just FYI.

@bluenote-1577
Copy link
Owner

v0.5.0 Now released on github. Single and paired-end deduplication supported with options to remove deduplication. Closing this issue for now. Conda and static binary will be out in a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants