Skip to content

Latest commit

 

History

History
172 lines (156 loc) · 8.51 KB

rnaseq_variants.md

File metadata and controls

172 lines (156 loc) · 8.51 KB

Variant calling using bulk RNA-seq data

2020-05-12: works for validation samples, fails for some random samples due to pybedtools issue. To avoid HaplotypeCallerSpark issue, update to the latest development version.

Workflow

This workflow demonstrates how to call variants with GATK3.8 using bulk RNA-seq data of GM12878 cell line (blood).

At least 3 RNA-seq datasets exist for NA12878:

  • SRR307897 is of bad quality - don't use it;
  • SRR307898 is a bit old (Illumina GAII) but it was used in Piskol2013 article, which is reliable work on RNA-seq variant calling validation;
  • SRR5665260, NextSeq-500.

GATK3.8 requires additional installation step.

1. Project structure

mkdir NA12878_RNA-seq_validation
cd NA12878_RNA-seq_validation
mkdir config input final work

2. Download input data (~6G total)

cd input
wget -c -O NA12878_1.fq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR307/SRR307898/SRR307898_1.fastq.gz
wget -c -O NA12878_2.fq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR307/SRR307898/SRR307898_2.fastq.gz

or (if ebi is not online)

wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR307898/SRR307898.3
fastq-dump --gzip --split-files SRR307898.3

3. Config file input/NA12878.yaml

Note the use of batch even for a single sample.

details:
- algorithm:
    aligner: star
    strandedness: unstranded
    variantcaller: gatk-haplotype
    jointcaller: gatk-haplotype-joint
    tools_off:
    - gatk4
  analysis: RNA-seq
  description: NA12878_SRR307898
  files:
  - /path/NA12878/input/NA12878_SRR307898_1.fq.gz
  - /path/NA12878/input/NA12878_SRR307898_2.fq.gz
  genome_build: hg38
  metadata:
    batch: NA12878
fc_name: NA12878
resources:
  default:
    cores: 4
    jvm_opts:
    - -Xms750m
    - -Xmx7000m
    memory: 15G
upload:
  dir: ../final

4. Run bcbio - assuming 60G/4cores machine or cluster node.

cd work
bcbio_nextgen.py ../config/NA12878.yaml -n 4

Parameters

  • variantcaller: gatk-haplotype or vardict. You can use just one variant caller for RNA-seq data in a bcbio project. If you want calls from two callers, run a separate project or edit variantcaller parameter and re-run.
  • tools_off: [gatk4]: when set, runs gatk3.8, which gives better precision.
  • batch is required:
    metadata:
      batch: NA12878
    

Validation

How to validate calls from bcbio

Use high quality variants (PASS filters, depth>=10 reads), remove potential RNA editing events.

bcftools view -f PASS -e "INFO/DP<10 | INFO/possible_rnaedit==1" NA12878-gatk-haplotype-annotated.vcf.gz |  bgzip -c > NA12878.pass.vcf.gz
tabix NA12878.pass.vcf.gz
wget https://raw.githubusercontent.com/naumenko-sa/cre/master/data/intersect.hg38.bed
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/NA12878.validate.sh
NA12878.validate.sh \
NA12878.pass.vcf.gz \
/path/bcbio/genomes/Hsapiens/hg38/validation/giab-NA12878/truth_small_variants.vcf.gz \
intersect.hg38.bed \
/path/bcbio/genomes/Hsapiens/hg38/rtg/hg38.sdf

Results will be in NA12878.pass.vcf.gz.stat:

snp tp-baseline 5720
indels tp-baseline 184
snp fp 1205
indels fp 2141
snp fn 29323
indels fn 2455

Validation results, 2016-2019, GRCh37, SRR307898

+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|date      |type |bcbio|gatk   |TP   |FP    |FN    |FDR|FNR|Target|Total called|
+==========+=====+=====+=======+=====+======+======+===+===+======+============+
|2016-12-08|SNP  |1.0.0|3.6    |7,892|266   |30,851|3% |80%|38,743|8,158       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2016-12-08|INDEL|1.0.0|3.6    |256  |117   |2,788 |31%|92%|3,044 |373         |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2018-06-06|SNP  |1.0.9|4.0.1.2|6,817|2,575 |31,926|27%|82%|38,743|9,392       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2018-06-06|INDEL|1.0.9|4.0.1.2|228  |24,448|2,816 |99%|93%|3,044 |24,676      |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2018-06-14|SNP  |1.0.9|3.8    |7,936|315   |30,807|4% |80%|38,743|8,251       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2018-06-14|INDEL|1.0.9|3.8    |306  |280   |2,738 |48%|90%|3,044 |586         |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2019-03-01|SNP  |1.1.3|4.1.0.0|6,380|842   |32,363|12%|84%|38,743|7,222       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2019-03-01|INDEL|1.1.3|4.1.0.0|374  |3,131 |2,670 |89%|88%|3,044 |3,505       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2019-03-01|SNP  |1.1.3|3.8    |6,244|558   |32,499|8% |84%|38,743|6,802       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+
|2019-03-01|INDEL|1.1.3|3.8    |192  |1,951 |2,852 |91%|94%|3,044 |2,143       |
+----------+-----+-----+-------+-----+------+------+---+---+------+------------+

Validation results, 2020, hg38

+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+
|date      |data     |type |bcbio|caller            |TP   |FP   |FN    |FDR|FNR|Target|Total called|
+==========+=========+=====+=====+==================+=====+=====+======+===+===+======+============+
|2020-05-12|SRR307898|SNP  |1.2.3|gatk,4.1.6.0      |5,849|1,570|29,194|21%|83%|35,043|7,419       |
+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+
|2020-05-12|SRR307898|INDEL|1.2.3|gatk,4.1.6.0      |181  |2907 |2458  |94%|93%|2,639 |3,088       |
+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+
|2020-05-12|SRR307898|SNP  |1.2.3|vardict-java,1.7.0|6,333|916  |28,710|13%|82%|35,043|7,249       |
+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+
|2020-05-12|SRR307898|INDEL|1.2.3|vardict-java,1.7.0|166  |580  |2,473 |78%|94%|2,639 |746         |
+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+
|2020-05-13|SRR307898|SNP  |1.2.3|gatk3.8-1.0       |5720 |1205 |29,323|17%|83%|35043 |6925        |
+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+
|2020-05-13|SRR307898|INDEL|1.2.3|gatk3.8-1.0       |184  |2,141|2,455 |92%|93%|2,639 |2325        |
+----------+---------+-----+-----+------------------+-----+-----+------+---+---+------+------------+

Conclusions

  • It is not surprising to see high False Negative rate FN = FN/(FN+TP) in RNA-seq variant calling, i.e. that we are not calling 83% of variants. We validate against intersect.hg38.bed, which is based on exome capture regions and is representing all protein coding genes. Only a minor fraction of genes are expressed in blood, only these genes have read coverage that makes variant calling possible.
  • Indel calling is not reliable with RNA-seq data.
  • Current recommendation is to use gatk3.8 > vardict > gatk4 for better SNP calling precision in bcbio.
  • Annotation of RNA-editing sites and removal of variants around splice junctions improved precision in bcbio, but still more work is needed to achieve better precision at the level of Piskol2013 (<1% FDR).

TODO

  • update gatk3.8 validation
  • validate with SRR5665260
  • integrate validation into bcbio
  • improve post GATK4 filters for better precision
  • paired DNA/RNA-seq variant calling
  • somatic variant calling with RNA-seq, see discusion
  • track pybedtools release.

References