Scrubby updates #9
Michael and Lachlan are publishing an updated version of their human pangenome database for depletion assessment (https://github.com/mbhall88/classification_benchmark, see preprint linked there). I am assessing it for clinical metagenomics data at the moment. Michael's benchmark shows that, besides the obvious performance of long reads, (simulated) Illumina reads are depleted with high sensitivity and specificity with Kraken2 and the pangenome DB. So... all this to say, it looks like the approach has decent performance, at least for getting rid of human reads in these simulated conditions ^^
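For context, a minimal sketch of what that Kraken2-based depletion step might look like; the database path, sample names and thread count are placeholders, not the benchmark's actual settings:

```bash
# Classify paired-end reads against a human pangenome Kraken2 database and
# keep only the reads that do NOT classify as host (the unclassified output).
# The '#' in --unclassified-out is expanded by Kraken2 to 1/2 for each mate.
kraken2 \
  --db /path/to/human_pangenome_k2_db \
  --threads 8 \
  --paired --gzip-compressed \
  --unclassified-out sample_depleted#.fastq \
  --report sample.k2report \
  --output sample.k2out \
  sample_R1.fastq.gz sample_R2.fastq.gz
```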
@esteinig Thank you very much for developing Scrubby. I can tell you that it has greatly improved this pipeline. I was using bbsplit and kneaddata in the past for hg depletion, and neither offers the efficiency that Scrubby does. The option to extract reads at a set taxonomic level has made my life a lot easier and assembly faster, especially with clinical samples (less so with isolates, unless there's heavy contamination).
This is going to make CtGAP easier to use for BioConda fans, so I appreciate your help with this. Our aim is to get this pipeline ready for publishing around June (still waiting on one more ref genome to be sequenced and included). So I guess we will have plenty of time to test this pipeline with all Scrubby updates.
Thank you so much for including this in the next version. It will be mighty useful when we start our comparative analyses of global Chlamydia trachomatis genomes from clinical samples (mostly metagenomes) later in the year.
Will you distribute the HPRG with scrubby or include a subcommand for downloading/preprocessing? That'd make it much easier for end users. But at the same time it means supporting the downloading, which can be a huge pain; see the kraken2 repo, which is full of issues about downloading and building the reference. Another option is to have some docs for the end user that specify best practices, e.g. download this reference, run this minimap2 command, then point scrubby to it. We could help there if you want!
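As a rough illustration of what such best-practice docs could look like (the download URL and file names are placeholders, and the final scrubby invocation is left as a comment because its exact flags depend on the release):

```bash
# 1. Download a human pangenome reference (placeholder URL).
wget -O human_pangenome.fa.gz "https://example.org/human_pangenome.fa.gz"
gunzip human_pangenome.fa.gz

# 2. Pre-build a minimap2 index so it doesn't have to be rebuilt on every run.
#    Use the preset matching your data, e.g. 'sr' for short reads or
#    'map-ont' for Nanopore reads.
minimap2 -x sr -d human_pangenome.sr.mmi human_pangenome.fa

# 3. Point scrubby at the prebuilt index for depletion
#    (exact subcommand/flags depend on the scrubby version, so check --help).
```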
@gokeson Thanks for letting me know! Great to hear it's been useful. There are a lot of things that can be improved in terms of runtime and resource management. Without going too deep into the weeds, we found that for our deep short-read data the depletion step can still be quite slow when there is overwhelming host material.

I'd be very curious (if you don't mind communicating publicly about this, otherwise always happy to switch to email) - are you trying to retrieve whole genomes or are these very low-abundance sample types you are sequencing, and are you using short or long reads?

We have been building a technically fairly complex clinical diagnostic stack (including an interface for interpretation and reporting, host genome analysis if you have consent, and a few other nifty things) - it is not quite ready for people to use yet, but it's been used in production on challenging samples with expected low abundance of pathogenic agents from neurological conditions (strongly depending on wet-lab protocols, in our experience). Happy to share if that might be useful for you as well - but given the recent push for this across public health labs, you probably have your own system going :)

@ammaraziz yes absolutely! If you remember vaguely from last year, there is something in the works for Cerebro (which includes host indices). I think it's a good suggestion in the interim, and a simple downloader with a list of links is probably not too onerous to maintain (the indices are thankfully not as large as taxonomic databases).
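On the downloader idea, a minimal sketch of what a list-of-links approach could look like; the URLs and file names below are purely illustrative placeholders, not real index locations:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative only: a tiny downloader driven by a hard-coded list of index
# links. Real links and checksums would live in a maintained manifest.
INDEX_LIST=(
  "https://example.org/indices/human_pangenome.sr.mmi"
  "https://example.org/indices/human_pangenome.map-ont.mmi"
)

OUTDIR="${1:-scrubby-indices}"
mkdir -p "$OUTDIR"

for url in "${INDEX_LIST[@]}"; do
  # -c resumes partial downloads, which helps with large index files.
  wget -c -P "$OUTDIR" "$url"
done
```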
@esteinig I have to confess that @gokeson knows about Cerebro. I think it's worth discussing this outside of this repo, but I'm not opposed to continuing the discussion here. We could have a Zoom meeting to discuss Cerebro, and actually I wanted to pick your brain on the best approach for Chlamydia assembly; there are a few oddities we could use help with. P.S. Sola (Gokeson) is in QLD, so our timezones are very close.
Lmao no drama man! It's still not properly validated with clinical data and it's a bit of a construction site. I am a little hesitant to let people try and use it - it's absolutely gonna break for someone else, and the database thing is a pain point ^^ I'm more than happy to share when it's usable of course, will let you know ASAP. It's also very, very much focused on low-abundance sample types and short reads (at the moment), simply because we don't have many other datasets for diagnostics, and doing something for ✨ metagenomics ✨ proper, i.e. complex natural communities with diverse stuff hanging out, is not in scope for Cerebro. There are probably better MAG-related pipelines from the ACE people at UQ. Yeah, agree, we can catch up on Zoom sometime on this! :)
Hi Eike,

Apologies for the very late reply. I got busy away from work these past many days. Happy to be back at my desk again.

*Are you trying to retrieve whole genomes or are these very low-abundance sample types you are sequencing, and are you using short or long reads?*

Both! We do some in-house QC to ensure we recover as close as possible to a whole genome for our clinical samples. We also have a separate project focusing mostly on the microbiome. In the latter project, we don't do much in-house QC but still try to recover the whole genome sequences of *C. trachomatis* where possible (it fails many times but is always worth trying).

*Happy to share if that might be useful for you as well - but given the recent push for this across public health labs, you probably have your own system going :)*

Sounds exciting. Ammar has mentioned it in the past and I am very much looking forward to it. We do not have such a system yet, so I am super keen to give this a try.

And we should meet up on Zoom soon. Ammar speaks highly of you. Keen to put a face to the name.
--
Regards,
Shola
@gokeson Ammar mentioned you were integrating Scrubby with the pipeline! Really cool, it was mostly a small side project thing, but people seem to be using it here and there, so I will do my best to upgrade it accordingly over the next two weeks or so. Is there anything specific you were keen to see besides easy deployment via BioConda and/or binaries? We can keep a checklist here, including if you'd like to add anything relevant for your lab as well.

Scrubby wishlist:
- BioConda or at least a private channel
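Once a BioConda recipe exists, installation would presumably look something like this (channel setup and package name assumed, not yet confirmed):

```bash
# Assumes a 'scrubby' recipe lands on the bioconda channel.
conda install -c conda-forge -c bioconda scrubby
```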