bcbio / bcbio-nextgen Public
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RNA-seq variant calling best practices #389
Comments
|
Miika; |
|
I've seen some 10% of SNPs in RNA-seq data being in putative RNA editing loci (as per http://rnaedit.com/). The Broad seem to have completely ignored things like these in their best practices (with focus only on splitting N-reads and lowering quality thresholds). To be continued.. |
|
Hi Miika, Variant support for RNA-seq would be awesome, thanks for getting the discussion started. We've had a couple pings about it. Great-- I am agreed that editing loci need to be filtered out of the SNP analysis, I'm surprised the Broad tools haven't implemented that. That might be something to add to the GEMINI database as a user-supplied annotation at the end so you can do selects by not being in the putative editing loci. There are some other gotcha areas of calling SNPs from RNA-seq data that probably need to get filtered out too, calling at junctions is one place, calling in regions with high self-chaining too. |
|
At any rate having something set and then tweaking it like Brad said sounds like a good plan. |
|
Hi Rory, I'm happy to look into introducing the required steps in bcbio. As far as I can see, one easy route would be to check within the |
|
Thanks for getting going with this MIika, looks great. @chapmanb, what do you think? I remember we talked about how to slip in the RNA-seq variant calling but I forget if we came to any plan about it. |
|
Rory and Miika; |
|
Hi Miika, I've seen you posting around a bit (elsewhere, now here) regarding variant calling from RNA-seq, something I'm trying to do myself. If the thread is too old to revive, I can start a new one. I'm trying out your splitnreads and wanted to chime in on an issue that I can see regarding such an approach in general. When converting rna-seq reads to pseudo-dna-seq, the splice site will very frequently be a new read end. I expect this will lead to a bias against calling variants or indels near the splice junction, since you won't have a window of good mapping around the putative variant after the reads are split since they were mapped as spliced RNA, and thus the window around the variant is in a different exon. As an atlernative to just splitting at Ns, what about something like a heuristic "tweaking" of the splitting? We could take the minimum of the actual overhang and maybe 5, 6, 10, or 20 bases and add reference sequence to the end, pulling the mapping and base quality from the actual mapping to the other exon. This shouldn't affect intronic variants since they'd be missed anyway coming from RNA-seq, though I think it would maybe hurt sensitivity to variants in exons containing an alternative splice site. Still, there are more exons than alternative splice sites, so I think it would be a net gain in sensitivity (could even test this). Jason |
|
Hi Jason, However, the only initial reasons for implementing splitNreads were the choking of FreeBayes and Gatk on RNA-seq data (where they see complex spliced reads). Ideally, FreeBayes should support RNA-seq data natively without these hacks I think. I see a few options:
Regardless, there will be loads of false positives so some filtering will be in order. I'd suggest filtering RNA-editing sites (http://rnaedit.com/) to begin with. A reference standard evaluation wouldn't hurt either (something like NA12878). |
|
I'd also be interested in having RNA Seq variant calling implemented. Is this topic still of interest to the community, and may a single-run implementation of the vardict approach as a free alternative to the existing GATK implementation make sense? |
|
Hi @schelhorn, you can currently use STAR (or Tophat) aligned bam files from the RNA-seq run in a second pass of bcbio, selecting the variant calling pipeline and |
|
Hi Sven-Eric, Another option is to stick variantcall: gatk under the algorithm section for your RNA-seq run, and it will call variants as part of your RNA-seq run using GATK. We haven't done very much work with RNA-seq variant calling, so if you have any suggestions about where we can improve, especially at filtering out the false positives, that would be super sweet. |
|
@mjafin: Thanks, but that two-run approach was something I'd like to avoid, if possible. That's why I was asking if it would make sense to promote Vardict to a first class citizen for RNA Seq variant calling, similar to GATK. Perhaps the intention of my question wasn't clear. @roryk: True, but the GATK walker used in the current implementation is non-free, AFAIK. I would currently like to avoid shelling out $32k for the GATK commercial license. |
|
I'm not familiar with how @roryk implemented the direct Gatk based RNA-seq variant calling, but if it's possible to skip the Gatk spliced read splitting then VarDict should be able to produce variants fairly easily (it handles spliced reads intrinsically). |
|
Hi, I am now changing freebayes from the first step for VarDict. The first round of variant calling is done sample-wise and the second round is done using multisample calling on all position identified as likely variation on the first round. I choose HISAT as first mapper because it is pretty fast (any comments on this welcomed), but could be replace by something else fast and good like SNAP, and GSNAP as second mapper because it is very sensitive and comes out well on evaluation. This was actually somehow inspired by the description of GSTRUCT,a GSNAP/GMAP software family not released yet but described on sup mat of Engström et al (2013 www.nature.com/articles/nmeth.2722). What option are you using for VarDict on RNA-seq @mjafin (if I can pick your brain)? |
|
I kind of forgot about this-- @mjafin have you run VarDict on RNA-seq data? It would be sweet to have a free version implemented instead of having to use GATK. |
|
Bcbio VarDict support for RNA Seq variant calling would be great, +1! |
|
Ugh, me too.. Yes VarDict runs on Star aligned BAMs out of the box. No Broad tools should be run on the bams between alignment and VarDict. I think @roryk if you can skip the bam splice splitting in the rnaseq pipeline then we could produce VarDict variants readily. We should use a 15 to 20 percent AF cutoff to get rid of some of the noise plus I should really look into annotating RNA editing sites after calling |
|
Han et al., Cancer Cell. (2015) have recently published a RNA-editing study on most of the TCGA data (5,672 RNA-Seq samples, that is a little less than number of samples that were screened for genomic mutations in TCGA I believe). Their approach may be a little ad-hoc, but they also use the RADAR/rnaedit.com database. In any way, having RNA-Seq calling with From the publication:
|
|
Thanks @schelhorn, Totally agree, thanks for putting this on our RADAR. I'm slowly working through my list of stuff to implement for bcbio-nextgen and this is on the list. |
|
Great; take your time, your work is much appreciated. |
|
@mjafin, I hope you don't mind me re-opening an old thread, but when you said that VarDict could/would handle STAR aligned bams out of the box, can you explain what pre-processing - if any - VarDict requires be applied? Note, I have applied STAR by hand (NOT via the bcbio pipeline), using the 2-pass alignment, as recommended by the Broad; but having provided a GTF to STAR (as recommended by STAR's author). The exact steps I followed are here: https://groups.google.com/d/msg/rna-star/0YGQtuwcmTY/NxgAapIJAwAJ. Am I correct in assuming, they would need to be sorted & de-duped? Should users also call STAR with the "--outSAMstrandField intronMotif" parameter (necessary to add the XS param to the bams that are generated)? |
|
@pwaltman VarDict is generally unfussy about bam files but it's generally best to avoid using any Broad tools being applied to BAMs. Sorting and indexing is necessary, de-duping is optional. We typically use the BAMs as they come from bcbio. For unstranded data bcbio uses the infronMotif setting, see https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/star.py#L67 |
|
@mjafin and others, freebayes should now handle the spliced reads. If this isn't the case please ping me over on the freebayes repo. The only problem I remember was during left alignment of indels. I don't have the commit I'd handy but I believe it has been resolved. |
|
Thanks Erik, that should give us plenty of free and commercial options now to run variant calling in RNA-seq. I'll close this issue even though not everything here has been necessarily implemented (RNA-editing) |
Hi All,
Collecting some information and opinions around RNA-seq variant calling here. The Broad institute have released best practices for germline variant calling in RNA-seq, http://gatkforums.broadinstitute.org/discussion/3891/best-practices-for-variant-calling-on-rnaseq.
What they're essentially doing is using a new restricted walker,
SplitNCigarReads, to split splice junction reads (Nin CIGAR) into multiple reads, as most variant callers (Gatk, FreeBayes too) seem to choke on N-reads. Further, they're doing some trimming of intronic tails, saying they're resulting from poor alignments, but in my opinion they could also be pre-mRNAs. Furthermore in variant calling they're generally lowering the quality thresholds.I took a look at the output of
SplitNCigarReadsand it's actually producing improper BAM files but the Broad commented that these BAM files must only be used for HaplotypeCaller and not for anything else downstream: http://gatkforums.broadinstitute.org/discussion/comment/12950/#Comment_12950For a more generic approach, I could maybe look into implementing a proper N-read splitter that would also set the
FLAGcorrectly and fix some of the tags (likeNM) for the split reads, if read splitting was seen as the way to go.Any thoughts anyone?
The text was updated successfully, but these errors were encountered: