Scrubby updates #9

Closed
3 tasks
esteinig opened this issue Jan 27, 2024 · 7 comments

Comments

@esteinig

esteinig commented Jan 27, 2024

@gokeson Ammar mentioned you were integrating Scrubby with the pipeline! Really cool, it was mostly a small side project thing, but people seem to be using it here and there, so will do my best to upgrade it accordingly over the next two weeks or so.

Is there anything specific you were keen to see besides easy deployment via BioConda and/or binaries? We can keep a checklist here, including anything relevant for your lab that you'd like to add.

Scrubby wishlist:

  • Distribution via BioConda or at least private channel
  • HPRG reference genome database for depletion
  • Reference database downloader with pre-built indices
@esteinig
Author

Michael and Lachlan are publishing an updated version of their human pangenome database for depletion assessment (https://github.com/mbhall88/classification_benchmark, see preprint linked there). I am assessing it for clinical metagenomics data at the moment.

Michael's benchmark shows that, besides the obvious performance of long reads, (simulated) Illumina reads are depleted with high sensitivity and specificity using Kraken2 and the pangenome DB, and that a minimap2 alignment is a great follow-up from that. This is essentially the process that Scrubby follows to speed things up in our high-depth clinical samples. I still need to assess this under more realistic conditions for low abundance pathogens, but that may not be relevant to you.

So... all this to say it looks like the approach has decent performance at least for getting rid of human reads in these simulated conditions ^^
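To make the two-stage idea concrete, here is a minimal sketch of the bookkeeping involved. It is not Scrubby's implementation. It assumes single-end reads, Kraken2's default per-read output (status, read ID, taxid, length, LCA mapping) and minimap2 PAF output, and it only matches reads assigned directly to taxid 9606. The function names are made up for illustration.

```python
import io

HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def kraken2_human_ids(lines, taxid=HUMAN_TAXID):
    """Stage 1: read IDs that Kraken2 classified directly to `taxid`.

    Assumes the default tab-separated per-read output:
    status (C/U), read ID, taxid, read length, LCA mapping.
    """
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2] == taxid:
            ids.add(fields[1])
    return ids


def paf_query_ids(lines):
    """Stage 2: query names (PAF column 1) of reads that aligned to the host."""
    return {line.rstrip("\n").split("\t", 1)[0] for line in lines if line.strip()}


def deplete_fastq(fastq, remove_ids):
    """Yield 4-line FASTQ records whose read ID is not in `remove_ids`."""
    while True:
        record = [fastq.readline() for _ in range(4)]
        if not record[0]:
            break
        read_id = record[0][1:].split()[0]  # drop '@' and any description
        if read_id not in remove_ids:
            yield "".join(record)


# Tiny in-memory demo with made-up reads: r1 is caught by the k-mer stage,
# r2 by the alignment stage, r3 survives depletion.
kraken_lines = ["C\tr1\t9606\t4\t9606:1\n", "U\tr2\t0\t4\t0:1\n", "C\tr3\t562\t4\t562:1\n"]
paf_lines = ["r2\t4\t0\t4\t+\tchr1\t1000\t10\t14\t4\t4\t60\n"]
to_remove = kraken2_human_ids(kraken_lines) | paf_query_ids(paf_lines)
reads = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n")
kept = list(deplete_fastq(reads, to_remove))
```

In practice the alignment stage would only be run on reads that survived the first stage, which is where the speed-up comes from.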

@gokeson
Collaborator

gokeson commented Jan 30, 2024

@esteinig Thank you very much for developing Scrubby. I can tell you that it has greatly improved this pipeline. I was using bbsplit and kneaddata in the past for hg depletion. Neither of the two offers the efficiency that scrubby offers. The opportunity to extract at a set taxa level has made my life a lot easier and assembly faster especially with clinical samples (less so with isolates, unless there's heavy contamination).

Scrubby wishlist:

  • Distribution via BioConda or at least private channel in the meantime

This is going to make CtGAP easier to use for BioConda fans, so I appreciate your help with this. Our aim is to get this pipeline ready for publishing around June (still waiting on one more ref genome to be sequenced and included). So I guess we will have plenty of time to test this pipeline with all Scrubby updates.

  • HPRG reference genome database for depletion (see next comment)

Thank you so much for including this in the next version. It will be mighty useful when we start our comparative analyses of global Chlamydia trachomatis genomes from clinical samples (mostly metagenomes) later in the year.

@ammaraziz
Owner

HPRG reference genome database for depletion (see next comment)

Will you distribute the HPRG with scrubby or include a subcommand for downloading/preprocessing? That'd make it much easier for end users. But at the same time it means supporting the downloading, which can be a huge pain: see the kraken2 repo, which is filled with issues about downloading and building the reference. Another option is to have some docs for the end user that specify best practices, e.g. download this reference, run this minimap2 command, then point scrubby to it.

We could help there if you want!
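As a rough sketch of what the docs-based option could describe, the snippet below only assembles the minimap2 indexing command. `-d` (write an index) and `-x sr` (short-read preset) are real minimap2 options; the file name `human_reference.fa` and the helper name are placeholders, and how the resulting index is passed to scrubby is left out because it depends on the scrubby version.

```python
import shlex
from pathlib import Path


def minimap2_index_command(reference_fasta, preset="sr"):
    """Build the minimap2 command that writes a .mmi index next to the FASTA.

    The preset should match the reads to be depleted, since some indexing
    parameters (k-mer and window size) are fixed when the index is built.
    """
    reference = Path(reference_fasta)
    index = reference.with_suffix(".mmi")
    return ["minimap2", "-x", preset, "-d", str(index), str(reference)]


# Placeholder file name; pass the list to subprocess.run(cmd, check=True) to execute.
cmd = minimap2_index_command("human_reference.fa")
print(shlex.join(cmd))
# prints: minimap2 -x sr -d human_reference.mmi human_reference.fa
```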

@esteinig
Author

@gokeson Thanks for letting me know! Great to hear it's been useful. There are a lot of things that can be improved in terms of runtime and resource management. Without going too deep into the weeds, we found that for our deep short read data the depletion step can still be quite slow when there is overwhelming host material.

I'd be very curious (if you don't mind communicating publicly about this, otherwise always happy to change to email) - are you trying to retrieve whole genomes or are these very low abundance sample types you are sequencing, are you using short or long reads?

We have been building a technically fairly complex clinical diagnostic stack (including interface for interpretation and reporting, host genome analysis if you have consent and a few other nifty things) - it is not quite ready for people to use yet, but it's been used in production on challenging samples with expected low abundance of pathogenic agents from neurological conditions (strongly depending on wet-lab protocols in our experience). Happy to share if that might be useful for you as well - but given the recent push for this across public health labs, you probably have your own system going :)

@ammaraziz yes absolutely! If you remember vaguely from last year, there is something in the works for Cerebro (which includes host indices). I think it's a good suggestion in the interim and a simple downloader with a list of links is probably not too onerous to maintain (the indices are thankfully not as large as taxonomic databases)
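For illustration only, a "list of links" downloader could be as small as the sketch below. The manifest entry, URL and function names are hypothetical placeholders, not real Scrubby or Cerebro resources. The one deliberate design choice is an optional SHA-256 check, so a truncated download fails loudly rather than producing a broken index.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical manifest: the name and URL are placeholders, not real indices.
INDEX_MANIFEST = {
    "example-host-index": {
        "url": "https://example.org/indices/example-host-index.mmi",
        "sha256": None,  # set to the published checksum; None skips verification
    },
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_index(name, outdir, manifest=INDEX_MANIFEST):
    """Download one manifest entry into `outdir` and verify its checksum."""
    entry = manifest[name]
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    target = outdir / Path(urlsplit(entry["url"]).path).name
    with urllib.request.urlopen(entry["url"]) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    expected = entry.get("sha256")
    if expected is not None and sha256_of(target) != expected:
        target.unlink()  # do not leave a corrupt index behind
        raise ValueError(f"checksum mismatch for {name}")
    return target
```

Maintaining this mostly means keeping the manifest current, which fits the point that the indices are much smaller than taxonomic databases.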

@ammaraziz
Owner

ammaraziz commented Jan 31, 2024

@esteinig I have to confess that @gokeson knows about Cerebro. I spilled the beans about it last year when chatting with him. He was very keen to test it but I didn't mention anything because you weren't ready to share and it was undergoing the big change at the time. I could run him through the installation and usage for Cerebro if I have your blessing. If I remember correctly the metagenomic project is related to this pipeline but not exactly the same.

I think it's worth discussing this outside of this repo, but I'm not opposed to continuing the discussion here. We could have a Zoom meeting to discuss Cerebro, and actually I wanted to pick your brain on the best approach for Chlamydia assembly; there are a few oddities we could use help with.

P.S. Sola (Gokeson) is in QLD so our timezones are very close.

@esteinig
Author

Lmao no drama man! It's still not properly validated with clinical data and it's a bit of a construction site. I am a little hesitant to let people try and use it - it's absolutely gonna break for someone else and the database thing is a pain point ^^ I'm more than happy to share when it's usable of course, will let you know ASAP.

It's also very, very much focused on low abundance sample types and short reads (at the moment), simply because we don't have many other datasets for diagnostics. Doing something for ✨ metagenomics ✨, i.e. complex natural communities with diverse stuff hanging out, is not in scope for Cerebro. There's probably a better MAG-related pipeline from the ACE people at UQ.

Yeah agree, we can catch up on Zoom sometime on this! :)

@gokeson
Collaborator

gokeson commented Feb 11, 2024 via email


3 participants