Scalpel InDel calling support #428
Comments
Quite interesting. It looks like it might be another good candidate for somatic variant calling. |
My thoughts exactly. Could implement it as a separate tool or bunch it with MuTect. |
Given that MuTect doesn't handle indels and Scalpel does exactly fit the niche, IMO it looks sensible to bundle both together. Brad, what do you think about it? |
Miika; A focus on somatic calling to start with makes a lot of sense, but it would be great if you could make this general so we could inject it into different tools as an indel supplement. A lot of somatic callers focus on SNPs so I'd imagine it would be generally useful (as long as it benchmarks well). Thanks again. |
cool, I'll look into this in the next few weeks |
OK so I've started work on Scalpel integration here: I reckon we can make calls to I've logged a few tickets about Scalpel here https://sourceforge.net/p/scalpel/bugs/3/, https://sourceforge.net/p/scalpel/tickets/2/ and https://sourceforge.net/p/scalpel/tickets/1/ but in the meantime I fixed some of the show stoppers myself (https://github.com/mjafin/scalpel). I've also been testing Scalpel on real data, and in single-sample mode it catches indels well. However, in somatic mode it always produces empty VCF output. My test cases are very obvious indels that are clearly present in the tumour sample and not in the normal. I've checked that coverage is good in both, and in single-sample mode Scalpel reports the indel for the tumour sample fine. I'll see if I can spot something obvious in the code. |
Miika -- this is great, thank you for doing this work. Definitely keep us up to date on the bugs and tumor/normal results. I'll be happy to integrate this and scalpel installation whenever you think things are in a useful state. I'm excited to test this out. |
@chapmanb looking to get this working in the next 24h or so. Is there any chance you could introduce an insertion and deletion in the |
Miika; https://www.synapse.org/#!Synapse:syn312572/wiki/62018 I'll download and give this a go on subsetting for a more useful test case. Unfortunately I don't believe they've released a truth VCF but we could bug them for this now that the challenge is complete. |
On Monday 26 May 2014 11:32:11, Brad Chapman wrote:
Agreed! The more test datasets we have for this, the better. |
@chapmanb @lbeltrame Great ideas both - I know one of my colleagues is partaking in the Dream challenge and has some data but I don't know if I'm allowed to pass it on. Let me know what you think. @lbeltrame would be awesome if you could test it out! You need to clone https://github.com/mjafin/scalpel and run
If you see any bugs lemme know! Edit:
Edit2:
|
Oh, and scalpel supports |
So the way I'm pulling the scalpel calls together is by combining inferred somatic and common InDels, and setting the FILTER field for the common InDels to REJECT (just made a commit here https://github.com/mjafin/bcbio-nextgen/commit/3b8aea914e0700bad7c3b67e4ea9199fa7ba5d32). Another way would be to have an INFO field value for somatic or common, I suppose. In tumor-only mode I'm just reporting all discovered indels (obviously not optimal). Anyway, @chapmanb would you like me to do a pull request of my scalpel branch, or would you be able to run the basic tests by cloning my scalpel branch? |
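A minimal sketch of the merge described above, for anyone following along. It assumes both inputs are lists of tab-separated VCF record lines with the header already removed; the function name `merge_scalpel_calls` is hypothetical and this is not the code from the linked commit.

```python
def merge_scalpel_calls(somatic_lines, common_lines):
    """Combine somatic and common indel records into one sorted list.

    Common indels (seen in both tumour and normal) are kept in the output
    but have their FILTER column overwritten with REJECT.
    """
    merged = [line.rstrip("\n") for line in somatic_lines]
    for line in common_lines:
        fields = line.rstrip("\n").split("\t")
        fields[6] = "REJECT"  # column 7 of a VCF record is FILTER
        merged.append("\t".join(fields))

    def sort_key(record):
        chrom, pos = record.split("\t")[:2]
        return (chrom, int(pos))

    # Plain string sort on chromosome name, numeric sort on position
    merged.sort(key=sort_key)
    return merged
```

Keeping the common indels in the file with a REJECT filter, rather than dropping them, means downstream tools can still see what was called in both samples.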
…ata on chr22 that includes indels #428
Miika; I added better tumor/normal test cases using chr22 data from the ICGC-TCGA DREAM challenge BAMs that include indels. I'm not sure if it has a somatic-specific indel (VarScan and FreeBayes disagree on one potential) but it definitely has more complex regions for assessment, so it can hopefully test Scalpel better. Thanks again for all the work. |
How do you define "common"?
I think you should add SOMATIC to putative positive somatic indels. See my (VarScan 2 needs more adjustments as I'd love to flag calls that are germline, |
@lbeltrame By common I mean InDels that are both in the tumour and normal, and have the same inferred zygosity. @chapmanb Big thanks - how do I trigger bcbio to download/run the chr22 test data? |
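To make the "common" definition concrete, here is a rough sketch that compares VCF-style genotype strings for the same indel in tumour and normal. The helper names are hypothetical, and this only illustrates the definition above; it is an assumption about how one could check it, not how Scalpel infers zygosity internally.

```python
def normalise_gt(gt):
    """Turn a GT string such as '0/1' or '1|0' into a sorted allele tuple."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def is_common(tumour_gt, normal_gt):
    """True if the indel is called in the normal with the tumour's zygosity."""
    tumour = normalise_gt(tumour_gt)
    normal = normalise_gt(normal_gt)
    # The normal must carry at least one non-reference, non-missing allele
    called_in_normal = any(allele not in ("0", ".") for allele in normal)
    return called_in_normal and tumour == normal
```

With this check, a het indel in both samples (`0/1` and `0/1`) counts as common, while `1/1` in the tumour against `0/1` in the normal does not.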
Ok, thanks for the explanation: indeed marking REJECT would be probably the I'm hoping to get scalpel to run on my data later this week, no promises though (have to attend a conference the next weekend). |
@lbeltrame I'll look into the SOMATIC info field when I get the chance, thanks for the pointer. I ran FreeBayes previously for the ERP002442 study sample 10-497-T. However, for whatever reason, the VT (somatic, LOH, germline) tag is only present in 2 out of the 600 indels. Scalpel is identifying 300 indels it thinks are somatic. |
Are you running FreeBayes using bcbio-nextgen? Look for SOMATIC tags if so, I |
Ah, I must be looking at a run finished quite some time ago then. Will need to rerun FreeBayes. Or better yet, identify a proper dataset like the DREAM one. |
Miika;
It'll be in |
Thanks @chapmanb, the output definitely contains (scalpel) indels now! |
Closing, as per #432, thanks! |
Actually, I don't think scalpel is producing any InDels at all. Looking at the logs:
Looks like the Microassembler part hasn't compiled. |
As an aside, I ran the NA12878 exome benchmark and Scalpel isn't doing too well (even when it's working! Used my own compiled version). HaplotypeCaller is catching some 6400 indels, FreeBayes around 6k and scalpel 3600 indels using default settings. Disappointing, to be honest. |
Did you have any go at somatic indels, e.g. Scalpel compared to the Somatic |
Hm, fishy. It looks however like some other code path. VarScan 2 ran fine last |
Well, varscan is officially also pants :)
|
I assume this does include the strand bias filter? (if you did through bcbio-nextgen, yes, it was included). |
Miika; From these results it seems like the best calling approach right out of the box is MuTect plus an indel caller. If we can tweak and improve scalpel that would be ideal. Pindel is also an option. I also have faith we can tweak FreeBayes calls to improve them, based on success with germline calling. It would be ideal to have a couple of choices, and I'd also like to have a non-licensed option. Thanks again, that's a huge help to focus future effort. Much appreciated. |
@chapmanb Which LCR bed file would you recommend pulling? (I'm not using bcbio to manage my genomes) |
Miika; https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/dbsnp.py#L170 |
I wonder if, given these results, we can split out some bullet points of things that can be done from here? So we can, potentially, split the work ahead. |
@chapmanb Thanks I'll give that a try!
The chr19 Dream data is synthetic so we could possibly use that as a basic benchmark data set. I took the BAM files, used samtools to extract the reads into fastq (discarding singletons) and started from there. |
Just a quick update, excluding the low complexity regions reduced the false positive InDel calls for Scalpel (1058) and VarDict (717) to around 350. Although I think Scalpel was advertised in the paper to be immune to repetitive regions..? |
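For reference, a rough sketch of the kind of low complexity region exclusion mentioned above. It assumes a BED file of LCR intervals (0-based, half-open) and tab-separated VCF record lines; the function names are hypothetical. It only tests the record's POS, so an indel that starts just outside a region but extends into it would be kept.

```python
import bisect


def load_bed(lines):
    """Parse BED lines into {chrom: (starts, ends)} with overlaps merged."""
    raw = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw.setdefault(chrom, []).append((int(start), int(end)))
    regions = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or adjacent interval: extend the previous one
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions[chrom] = ([m[0] for m in merged], [m[1] for m in merged])
    return regions


def in_regions(regions, chrom, pos):
    """Check a 1-based VCF position against 0-based half-open intervals."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    pos0 = pos - 1
    i = bisect.bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]


def drop_lcr_calls(vcf_records, regions):
    """Keep only records whose POS falls outside every region."""
    kept = []
    for record in vcf_records:
        chrom, pos = record.split("\t")[:2]
        if not in_regions(regions, chrom, int(pos)):
            kept.append(record)
    return kept
```

Merging the intervals up front lets the lookup use a single binary search per variant, which matters once the call set and the BED file both get large.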
OK, another quick update: with the latest fixes to FreeBayes and excluding the low complexity regions, its indel false positive rate has dropped to 185! The SNP FP rate is still relatively high but it's something to work from. I'll close the ticket for now. |
…428). This check already handled as part of VarScan calling. Move test to samtools, where it is needed. This will need reworking/testing with samtools 0.2.0 calling
Miika; I also fixed the VarScan scaling issue you reported to hopefully help with future evaluations at scale. Thanks again. |
Miika; http://biorxiv.org/content/biorxiv/early/2014/06/10/006148.full.pdf The caller might already apply some of these as part of the standard process, but it's an interesting read in addition to the practical stuff. The 60x coverage for WGS indel detection is a good reference to have. |
Thanks Brad, will definitely give that a thorough read! I think their focus is on lower coverage data and I suspect (based on personal communication with the authors) that high(er) coverage regions can make the microassembler not converge. One of the authors' suggestions was to downsample the data, but I don't know how practical that would be. In any case, I'm worried that with such a low sensitivity for the NA12878 sample (only 3k+) it really doesn't matter what the specificity is: the sensitivity is unacceptable. If you have the chance to run scalpel on any NA12878 data it would be great, just to be sure I haven't botched something. |
Just another quick update, I managed to run ensemble calling from bcbio.variation on the combo of MuTect (SNPs only), FreeBayes and VarDict. I had to manually remove any variants that were REJECTed and also the 'normal' sample column. The SNP TP rate was up there with FreeBayes (857 vs. 856) and the FP rate was marginally higher than that of MuTect (102->106). For InDels, there were only 707 TP but only 14 FP! Edit. I'll need to rerun all of this, with all the latest updates to all the tools, at some point and update the tables. |
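The manual clean-up step mentioned above (dropping REJECTed variants and the normal sample column) could look roughly like the sketch below. The function name and the sample name passed in are hypothetical; it assumes a single-file VCF given as a list of lines with a standard `#CHROM` header row.

```python
def strip_rejects_and_normal(vcf_lines, normal_name):
    """Drop REJECT records and remove the normal sample's column."""
    out = []
    drop_idx = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            out.append(line)  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Raises ValueError if the normal sample is not in the header
            drop_idx = fields.index(normal_name)
        elif "REJECT" in fields[6].split(";"):
            continue  # FILTER column contains REJECT, skip the record
        if drop_idx is not None:
            del fields[drop_idx]
        out.append("\t".join(fields))
    return out
```

Looking the column up by sample name rather than by position avoids assumptions about whether a caller writes the tumour or the normal first.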
Impressive results! Can you share the parameters you used for bcbio.variation? (And Brad, I wonder if all the findings / results from these investigation |
@lbeltrame I used pretty standard settings:
However I suspect the output indels might be the intersect set of freebayes and vardict as I don't think any of the above annotation will be in the calls.. but then again I'm still trying to understand how the ensemble calling actually works. This is still very much 'live' work and requires quite a bit of manual intervention (plus I don't know where the chr19 DREAM data could be pulled from without username/password). Our guys also haven't made the vardict code publicly available yet. I'm happy to write something up though once things settle down a bit. |
On Thursday 12 June 2014 07:14:39, Miika Ahdesmaki wrote:
Thanks. It goes without saying, but of course I'm going to help on that too. While I don't have many data sets available, we're starting to do some To add to the discussion and provide some data from my experience on targeted
We've validated a MuTect called SNP with digital PCR with a predicted fraction |
Luca and Miika; It's great to hear about the Ensemble results, and I suspect it's probably what Miika suggested: the variants called by two reliable callers. My longer term thinking for Ensemble calling is to reduce it to something simpler and rely on these types of heuristics. The SVM can recover a small fraction of additional variants but I'm not sure all the time spent and tuning is worth the gain. A simpler approach should be able to speed things up and make the overall process much cleaner. Thanks again. |
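As an illustration of the simpler heuristic discussed above (keep variants reported by at least two callers), here is a minimal sketch. It is not how bcbio.variation's Ensemble calling is implemented; the function names and the `min_callers` parameter are hypothetical, and it matches variants on exact CHROM/POS/REF/ALT without normalising indel representations between callers.

```python
from collections import Counter


def variant_key(record):
    """Identify a variant by chromosome, position, REF and ALT."""
    chrom, pos, _id, ref, alt = record.split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def consensus_calls(callsets, min_callers=2):
    """Return sorted variant keys reported by at least min_callers callers.

    callsets maps a caller name to its list of VCF record lines.
    """
    counts = Counter()
    for records in callsets.values():
        # A set so each caller votes at most once per variant
        counts.update({variant_key(record) for record in records})
    return sorted(key for key, n in counts.items() if n >= min_callers)
```

With two callers and `min_callers=2` this reduces to the plain intersection suspected above; with three or more callers it becomes a majority-style vote.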
We could possibly filter variants if |
To follow up on Miika's initial evaluation work we now have an automated whole genome evaluation for cancer data using DREAM challenge synthetic dataset 3 (https://www.synapse.org/#!Synapse:syn312572/wiki/62018): https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/cancer-dream-syn3-getdata.sh Here are results for MuTect, VarScan, VarDict and FreeBayes with the current development version: This confirms all the observations above and gives us a practical dataset we can iterate and improve on. The plans are to try and improve filtering to help with false positives then begin testing Ensemble calling and other approaches to get a high quality final callset. It's exciting to see cancer calling continue to improve -- thanks to everyone for their help so far with this. |
Absolutely terrific! Thanks to everyone involved! Luca Beltrame, Ph.D. |
Hi mjafin , |
It does work on WGS samples as well, but is quite slow since Scalpel is primarily developed on exomes. We use it in cases where we need indels, but also recommend the VarDict variant caller included in bcbio which calls indels as well as Scalpel and runs much quicker. Hope this helps. |
On Friday 9 October 2015 07:32:12 CEST, Brad Chapman wrote:
also recommend the VarDict variant caller included in bcbio which calls
indels as well as Scalpel and runs much quicker. Hope this helps.
Can it be used as indelcaller? I found scalpel to be extremely slow even for
targeted sequencing analyses.
|
Luca -- VarDict calls indels as part of the standard calling. You don't use it as https://github.com/bcbio/bcbio.github.io/blob/master/_posts/2015-10-05-vardict-filtering.md |
Looks like vcf support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/
Opening this ticket while I'm looking into testing Scalpel and integrating it within bcbio; bear with me.