Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets such as BCR-ABL in chronic myeloid leukemia and EML4-ALK in lung cancer. Accurate detection of gene fusion plays a pivotal role in precision medicine by matching the right drugs to the right patients.
Challenges in the diagnosis of gene fusions include that there could be many and sometimes unknown fusion partners, low gene expression (e.g. ALK), non fusion-specific protein expression (e.g. ROS1), potential involvments of cryptic splice sites, low sequence diversity at genomic breakpoints and associated mapping difficulty, and poor sample (poor quality, limited amount and low tumor cellularity) in clinical specimens. The anchored multiplex PCR (AMP) is a clinically proven technology that addresses all these issues and has accelerated gene fusion discoveries and supported robust clinical diagnosis (Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014).
Equally important to a robust wet lab technology is a high-performing computational method for calling gene fusions. SplitFusion is fast by leveraging the chimeric split-read alignments of BWA-MEM (Li H. 2013). SplitFusion is agnostic to known coding transcripts. SplitFusion is sensitive, specific, computationally efficient, and features highly desirable abilities in clinical reporting, including the capabilities to infer fusion transcript frame-ness and exon-boundary alignments; to calculate number of unique DNA/RNA fragment ligation sites; and the SplitFusion-Target mode allows for continuous evidence-based improvement in clinical reporting.
SplitFusion can be used for RNA-seq data and the Anchored Multiplex PCR (AMP) data.
The analsyis consists of the following computational steps:
-
Reference alignment and deduplication.
-
CIGAR transformation.
-
Candidate breakpoint calling.
-
Initial breakpoint filtering.
-
Breakpoint gene annotation, frame-ness, exon boundary, further filtering and target reporting.
-
Result reporting and visualization.
When running SplitFusion, you can specify paths to the tools and genome files you already have. If not, here are the human genome data and tools for installation.
-
E.g. I saved my large database files under /home/user1/database/:
cd /home/user1/database wget https://data.broadinstitute.org/snowman/hg19/Homo_sapiens_assembly19.fasta
-
E.g. I installed the above tool in /home/user1/tools/:
cd /home/user1/tools wget -O samtools.tar.bz2 https://sourceforge.net/projects/samtools/files/latest/download tar -xvf samtools.tar.bz2 cd samtools-1.10 ## Note the samtools-x.xx version mkdir /home/user1/tools/samtools ./configure --prefix=/home/user1/tools/samtools make; make install; cd ..
-
git clone https://github.com/lh3/bwa.git cd bwa; make; cd ..
-
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz tar -zxvf bedtools-2.29.1.tar.gz cd bedtools2; make
-
Currently, SplitFusion by default uses snpEff but it also supports ANNOVA. snpEff is a free annotation software. When you run it for the first time, it will automatically download the genome database (either hg19 or hg38) from the internet. To run with snpEff, you only need to specify your snpEff directory via --annovar /home/user1/tools/snpEff
-
If you use ANNOVAR, you need a free registration. Then run SplitFusion with the flag --AnnotationMethod annovar. Note that the annovar sub-directory structure should be maintained, e.g. if you install annovar under /home/user1/tools: /home/user1/tools/annovar/annotate_variation.pl /home/user1/tools/annovar/table_annovar.pl /home/user1/tools/annovar/humandb/hg19_refGeneMrna.fa /home/user1/tools/annovar/humandb/hg19_refGene.txt
-
R Install requried R packages within R:
install.packages(c("Rcpp", "data.table", "plyr"))
-
Python Install a Python module by pip:
pip install future
cd /home/user1/tools/
git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git
python /home/user1/tools/SplitFusion/exec/SplitFusion.py -h
usage: SplitFusion.py [-h] --refGenome REFGENOME --annovar SNPEFF_PATH --samtools
SAMTOOLS --bedtools BEDTOOLS --bwa BWA --R R --perl PERL
--output OUTPUT --sample_id SAMPLE_ID
[--bam_file BMS_FILE] [--fastq_file1 FASTQ_FILE1] [--fastq_file2 FASTQ_FILE2]
[--panel_dir PANEL_DIR] [--panel PANEL] [--steps STEPS]
[--AnnotationMethod ANNOTATIONMETHOD] [--thread THREAD]
[--minMQ MINMQ] [--minMQ1 MINMQ1]
[--minMapLength MINMAPLENGTH]
[--minMapLength2 MINMAPLENGTH2]
[--maxQueryGap MAXQUERYGAP] [--maxOverlap MAXOVERLAP]
[--minExclusive MINEXCLUSIVE]
[--FusionMinStartSite FUSIONMINSTARTSITE]
[--minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION]
[--minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION]
Split-Fusion is a fast data analysis pipeline detects gene fusion based on
chimeric split-read alignments.
optional arguments:
-h, --help show this help message and exit
--refGenome REFGENOME
The reference genome file, with a full path
[required].
--annovar SNPEFF The snpEff path [required].
--samtools SAMTOOLS The samtools executable file with full path [Optional].
--bedtools BEDTOOLS The bedtools executable file with full path [Optional].
--bwa BWA The bwa executable file with full path [Optional].
--R R The R executable file with full path [Optional].
--perl PERL The perl executable file with full path [Optional].
--output OUTPUT The directory for output SplitFusion results [required].
--sample_id SAMPLE_ID
The name of sample to be analyzed [required].
--bam_file BAM_FILE If the bam_file is specified, the Kickstart mode will
be used. [Either fastq_file or bam_file should be
specified]
--fastq_file1 FASTQ_FILE1
The fastq file (Read 1 of paired-end) to be analyzed
[Either fastq_file or bam_file should be specified].
--fastq_file2 FASTQ_FILE2
Read 2 of paired-end fastq file.
--panel_dir PANEL_DIR
For TARGET mode: the path where known significant
fusions or splicing isoforms (whitelist) or unwanted
fusions involving homologous genes or recurrent falsed
positives (blacklist) are stored. Default='NA'
--panel PANEL The prefix name of TARGET gene panel file. E.g.,
'LungFusion' for LungFusion.GSP2.bed. Default='NA'
--steps STEPS Specify steps to run. Default='1_fastq-bam,2_bam-
breakpoint,3_breakpoint-filter,4_breakpoint-anno
,5_breakpoint-anno-post'
# Examples of running different modes of SplitFusion
## First, copy example data files from pipeline to local testing directory, e.g.:
mkdir -p /home/user1/SplitFusion-test/data
cp /home/user1/tools/SplitFusion/inst/data/example_data/Lib001.* /home/user1/SplitFusion-test/data/
## Before using BWA, the reference genome needs to be indexed, e.g.:
/home/user1/tools/bwa/bwa index -a bwtsw /home/user1/database/Homo_sapiens_assembly19.fasta
##=========================================================
## Start from FASTQ files, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
--refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
--annovar /home/user1/tools/snpEff \
--samtools /home/user1/tools/samtools/bin/samtools \
--bedtools /home/user1/tools/bedtools2/bin/bedtools \
--bwa /home/user1/tools/bwa/bwa \
--R /usr/bin/R \
--perl /usr/bin/perl \
--output /home/user1/SplitFusion-test/output \
--sample_id "Lib001" \
--fastq_file1 /home/user1/SplitFusion-test/data/Lib001.R1.fq \
--fastq_file2 /home/user1/SplitFusion-test/data/Lib001.R2.fq \
--thread 4
##=========================================================
## Kickstart mode, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
--refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
--annovar /home/user1/tools/snpEff \
--samtools /home/user1/tools/samtools/bin/samtools \
--bedtools /home/user1/tools/bedtools2/bin/bedtools \
--bwa /home/user1/tools/bwa/bwa \
--R /usr/bin/R \
--perl /usr/bin/perl \
--output /home/user1/SplitFusion-test/kickstart-mode-output \
--sample_id "Lib001" \
--bam_file /home/user1/SplitFusion-test/data/Lib001.bam \
--thread 1
##=========================================================
## TARGET gene panel mode, with panel info
##=========================================================
mkdir -p /home/user1/SplitFusion-test/panel
cp /home/user1/tools/SplitFusion/inst/data/panel/* /home/user1/SplitFusion-test/panel/
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
--refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
--annovar /home/user1/tools/snpEff \
--samtools /home/user1/tools/samtools/bin/samtools \
--bedtools /home/user1/tools/bedtools2/bin/bedtools \
--bwa /home/user1/tools/bwa/bwa \
--R /usr/bin/R \
--perl /usr/bin/perl \
--output /home/user1/SplitFusion-test/target-mode-output \
--sample_id "Lib001" \
--fastq_file1 /home/user1/SplitFusion-test/data/Lib001.R1.fq \
--fastq_file2 /home/user1/SplitFusion-test/data/Lib001.R2.fq \
--panel_dir /home/user1/SplitFusion-test/panel \
--thread 4 \
--panel LungFusion
##=======================================================================
## Selecting only some steps to run.
## If you have run Steps 1 and 2, and just want to repeat analyses from
## the Step 3 (e.g. the whitelist/blacklist has been updated), then:
##=======================================================================
python /home/user1/tools/SplitFusion/exec/SplitFusion.py \
--refGenome /home/user1/database/Homo_sapiens_assembly19.fasta \
--annovar /home/user1/tools/snpEff \
--samtools /home/user1/tools/samtools/bin/samtools \
--bedtools /home/user1/tools/bedtools2/bin/bedtools \
--bwa /home/user1/tools/bwa/bwa \
--R /usr/bin/R \
--perl /usr/bin/perl \
--output /home/user1/SplitFusion-test/output \
--sample_id "Lib001" \
--fastq_file1 /home/user1/SplitFusion-test/data/Lib001.R1.fq \
--fastq_file2 /home/user1/SplitFusion-test/data/Lib001.R2.fq \
--panel_dir /home/user1/SplitFusion-test/panel \
--panel LungFusion \
--steps "3_breakpoint-filter,4_breakpoint-anno,5_breakpoint-anno-post"
An example brief output table:
SampleID | GeneExon5_GeneExon3 | frame | num_partner_ends | num_unique_reads | exon.junction | breakpoint | transcript_5 | transcript_3 | function_5 | function_3 | gene_5 | cdna_5 | gene_3 | cdna_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lib001 | EML4_intronic---ALK_exon20 | in-frame | 9 | 11 | Both | 2_29446396__2_42492091 | NM_019063 | NM_004304 | intronic | exonic | EML4 | 667 | ALK | 3171 |
Lib001 | EML4_exon6---ALK_exon20 | in-frame | 2 | 1 | Both | 2_29446396__2_42491871 | NM_019063 | NM_004304 | exonic | exonic | EML4 | 667 | ALK | 3171 |
An example output fastq file for the EML4_intronic---ALK_exon20 fusion of sample Lib001 is:
>NS500673:45:HHK2HAFXX:1:11206:14487:5244:
ATTGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTATTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTGTCTGCAATTTTGGTAACTTTTG
>NS500673:45:HHK2HAFXX:1:11307:26804:17812:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTG
>NS500673:45:HHK2HAFXX:2:11207:15517:8468:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:2:21212:23872:20046:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTG
>NS500673:45:HHK2HAFXX:3:21410:6281:16202:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTTATGCTTG
>NS500673:45:HHK2HAFXX:3:21605:12646:6668:
GTCATCATCAACCAAGCAAAAATGTCAACTCGCGAAAAAAACAGCCAAGTGTTCCGCCGGAAGCACCAGGAGCTGCAAGCCAT
>NS500673:45:HHK2HAFXX:4:21510:10840:13732:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTAGTTGATGATGACATC
>NS500673:45:HHK2HAFXX:4:21605:21501:18033:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGGTGATGACATCTTTATGCTTG
Within R, run:
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("igvR")
> source("SplitFusion/R/bam2igv.R")
> bam2igv(bamfile = "/home/user1/SplitFusion-test/output/Lib001/Lib001.EML4_intronic---ALK_exon20.bam")