GitHub - ding-lab/Fusion_hg38

Fusion_GRCh38

The Fusion pipeline now officially works on RIS compute1. 🤗 In the meantime, here are the new highlights:

Suitable for batch jobs and large projects, provided the project has a catalog in one of these formats:

Alchemist.RNA.Map; CPTAC3.BamMap; PE-CGS.FileMap
Currently, this Fusion pipeline is designed to work for 3 main projects: PE-CGS, ALCHEMIST and CPTAC3

Updates 02/29/24:

Added Arriba to the pipeline; the pipeline outputs now include the fusions called by four fusion-calling tools

Updates 02/25/22:

Updated Fusion filter annotation to hgnc_complete_set_2021-06-01.txt

Major Updates 01/13/22:

Updated STAR v2.7.2c to STAR v2.7.8a
Updated STAR-Fusion v1.7.0 to STAR-Fusion v1.10.0
Updated ctat genome libs, from GRCh38_gencode_v31_CTAT_lib_Oct012019 to GRCh38_gencode_v37_CTAT_lib_Mar012021

Adapted from Dan Cui Zhou's hg38 pipeline v2.2 and Qingsong Gao's hg19 pipeline

VERSION: v2.7

Build: hg38 (GRCh38_gencode_v37 latest version supported by STAR-Fusion)

Cluster: RIS Compute1

Compute resources: at least 60GB MEM and 150 GB scratch disk space for each sample

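Three tools are used for fusion calling: STAR-Fusion, EricScript and Integrate (Arriba was added as a fourth caller in the 02/29/24 update). Databases and references were downloaded from their respective websites.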

STAR-Fusion is downloaded from https://github.com/STAR-Fusion/STAR-Fusion/wiki

EricScript is downloaded from https://sites.google.com/site/bioericscript

EricScript is currently located in /storage1/fs1/dinglab/Active/Projects/PECGS/PECGS_pipeline/Fusion/ericscript-0.5.5
because the Docker version of EricScript has this issue: https://github.com/ding-lab/Fusion_hg38/issues/1

Integrate is downloaded from https://sourceforge.net/p/integrate-fusion/wiki/Home/

Integrate is located in /storage1/fs1/dinglab/Active/Projects/PECGS/PECGS_pipeline/Fusion/INTEGRATE_0_2_6

Fusion pipeline: Fusions are called by each tool and then merged into a single file. This is done for tumor and normal RNA-Seq files individually. Since raw fusion calls may contain many false positives, extensive filtering is done, as detailed below. Finally, normal fusions are filtered out from the tumor fusions.
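
The merge step described above can be pictured with a minimal Python sketch. It is not the code in combine_call.pl; the function name, the dictionary layout, and the example gene pairs are assumptions made for illustration, and the `GENE1--GENE2` naming is assumed from STAR-Fusion's convention.

```python
from collections import defaultdict


def merge_calls(calls_by_caller):
    """Merge per-caller fusion lists into {fusion_name: sorted list of callers}."""
    merged = defaultdict(set)
    for caller, fusions in calls_by_caller.items():
        for fusion in fusions:
            merged[fusion].add(caller)
    return {fusion: sorted(callers) for fusion, callers in merged.items()}


# Hypothetical per-caller results for one sample (illustrative values only).
calls = {
    "STAR-Fusion": ["EML4--ALK", "TMPRSS2--ERG"],
    "EricScript": ["EML4--ALK"],
    "Integrate": ["EML4--ALK", "BCR--ABL1"],
}

merged = merge_calls(calls)
for name, callers in sorted(merged.items()):
    # One line per fusion: name, number of callers, caller names.
    print(name, len(callers), ",".join(callers))
```

Keeping the set of callers per fusion, rather than only a count, makes it easy to apply caller-specific rules later, such as the STAR-Fusion FFPM rule in the filtering strategy.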

Filtering strategy: Get fusions reported by at least 2 callers or reported by STAR-Fusion (shows higher sensitivity) but with higher supporting evidence (defined by fusion fragments per million total reads, or FFPM, >0.1). Then, remove fusions present in the filtering database, which includes:

Uncharacterized genes, immunoglobulin genes, mitochondrial genes, etc.
Fusions from the same gene or paralog genes (downloaded from https://www.genenames.org/cgi-bin/statistics)
Fusions reported in TCGA normal samples from pancan fusion analysis (PMID:29617662), GTEx tissues (reported in star-fusion output), and non-cancer cell studies (PMID: 26837576)
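
As a rough illustration of the filtering rule above, the following Python sketch keeps a fusion if it has at least two callers, or if STAR-Fusion called it with FFPM above 0.1, and then drops anything found in the filtering database. It is a sketch under assumptions, not the logic of filter.pl: the function name, the `blocklist` set, and the way callers are passed in are all hypothetical.

```python
def passes_filter(callers, ffpm, fusion_name, blocklist, ffpm_cutoff=0.1):
    """Return True if a fusion survives the filtering strategy.

    callers     -- collection of caller names that reported the fusion
    ffpm        -- STAR-Fusion FFPM as a float, or None when unavailable ('NA')
    blocklist   -- hypothetical set of fusion names from the filtering database
    """
    # Fusions present in the filtering database are always removed.
    if fusion_name in blocklist:
        return False
    # Rule 1: reported by at least two callers.
    if len(set(callers)) >= 2:
        return True
    # Rule 2: STAR-Fusion only, but with stronger evidence (FFPM > cutoff).
    return "STAR-Fusion" in callers and ffpm is not None and ffpm > ffpm_cutoff


# Illustrative calls with made-up names and values.
print(passes_filter({"EricScript", "Integrate"}, None, "GENE1--GENE2", set()))
print(passes_filter({"STAR-Fusion"}, 0.05, "GENE1--GENE2", set()))
print(passes_filter({"STAR-Fusion"}, 0.20, "GENE1--GENE2", set()))
```

The blocklist is checked first so that a fusion seen in normal tissue is removed no matter how many callers support it.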

Output format: In the output file, each row represents one fusion. There are 9 columns for each fusion:

FusionName
LeftBreakpoint
RightBreakpoint
Cancer__Sample
JunctionReadCount
SpanningFragCount
FFPM - fusion fragments per million total reads, 'NA' means the fusion was found by both EricScript and Integrate but not STAR-Fusion
PROT_FUSION_TYPE - INFRAME, FRAMESHIFT or '.'
CallerN - number of callers
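
For downstream analysis, the nine columns above can be read into dictionaries with a short Python sketch. It assumes the file is tab-separated and may or may not start with a header row; neither detail is stated here, so check a real output file first. The example row uses placeholder values, not real results.

```python
import csv
import io

COLUMNS = [
    "FusionName", "LeftBreakpoint", "RightBreakpoint", "Cancer__Sample",
    "JunctionReadCount", "SpanningFragCount", "FFPM", "PROT_FUSION_TYPE",
    "CallerN",
]


def parse_fusions(handle):
    """Parse a tab-separated fusion table into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and an optional header row.
        if not fields or fields[0] == "FusionName":
            continue
        row = dict(zip(COLUMNS, fields))
        # 'NA' means the fusion was not called by STAR-Fusion.
        row["FFPM"] = None if row["FFPM"] == "NA" else float(row["FFPM"])
        row["CallerN"] = int(row["CallerN"])
        rows.append(row)
    return rows


# Placeholder content standing in for a real output file.
example = (
    "\t".join(COLUMNS) + "\n"
    "GENE1--GENE2\tchr1:100:+\tchr2:200:-\tCANCER__SAMPLE1\tNA\tNA\tNA\t.\t2\n"
)
rows = parse_fusions(io.StringIO(example))
print(rows[0]["FusionName"], rows[0]["FFPM"], rows[0]["CallerN"])
```

The count columns are left as strings because they may also hold 'NA' for fusions that STAR-Fusion did not report.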

Processing details: Run "makeDir.py" to create the directory for each sample. Run "to_run.sh" to submit the tmux instances. Once all jobs are done, run "normalFilter.py" to remove normal fusions.
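
The idea behind the final normal-filtering step can be sketched in Python as a simple set difference. This is not the implementation of normalFilter.py; it assumes fusions are matched by name only, and the function name and example values are hypothetical.

```python
def remove_normal_fusions(tumor_fusions, normal_fusions):
    """Return tumor fusions that were not also called in the matched normal.

    Matching is by fusion name only (an assumption); input order is preserved.
    """
    normal_set = set(normal_fusions)
    return [fusion for fusion in tumor_fusions if fusion not in normal_set]


# Illustrative names only.
tumor = ["GENE1--GENE2", "GENE3--GENE4", "GENE5--GENE6"]
normal = ["GENE3--GENE4"]
print(remove_normal_fusions(tumor, normal))
```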

v2.2 revision history:

Fixed bug that basically ignored Integrate calls
Automatically compresses large Star fusion file
Added a copy any fusion script so we can provide normal calls too (no longer filtering germline by default)

Usage

add the following into bashrc or bash_profile on compute1

export PATH=/storage1/fs1/dinglab/Active/Projects/PECGS/PECGS_pipeline/Fusion/INTEGRATE_0_2_6/INTEGRATE-build/bin:$PATH
export LD_LIBRARY_PATH="/storage1/fs1/dinglab/Active/Projects/PECGS/PECGS_pipeline/Fusion/INTEGRATE_0_2_6/INTEGRATE-build/vendor/divsufsort/lib:$LD_LIBRARY_PATH"

pull fusion docker on compute1

LSF_DOCKER_VOLUMES="$STORAGE1_DINGLAB:$STORAGE1_DINGLAB" PATH="$STORAGE1_DINGLAB:$PATH" bsub -Is -q 'dinglab-interactive general-interactive' -G compute-dinglab -M 50G -R 'select[mem>50G] span[hosts=1] rusage[mem=50G]' -a 'docker(songyizhe/fusion:2.6)' /bin/bash -l

The latest Docker image is here:

https://hub.docker.com/repository/registry-1.docker.io/songyizhe/fusion/

Read in samples and make a directory and symlinks for each sample

## Worklog -- This is how I processed Alchemist data for fusion hg38 pipeline

## cd to the folder for fusion analysis
cd /storage1/fs1/dinglab/Active/Projects/ALCHEMIST/Alchemist_analysis/RNA-seq_Fusion_Alchemist_batch_01_11_22

## download the ALCHEMIST catalog to the folder

wget https://raw.githubusercontent.com/ding-lab/GDAN.catalog/main/Alchemist/Alchemist.RNAMap.storage1.tsv

## Get the sample_name and save to samples.txt

grep ALCH Alchemist.RNAMap.storage1.tsv | cut -f 1 | rev | cut -c6- | rev | sort | uniq > samples.txt

# cut the first column, reverse the character order, drop the first 5 characters (i.e. the last 5 of the original), reverse back, sort, keep unique sample names

## make directories for each sample in the subfolder of the current folder.

python makeDir_ris_alchemist.py samples.txt Alchemist.RNAMap.storage1.tsv .
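
For readers who prefer Python, the sample-name extraction in the worklog above (grep, cut, rev, cut, rev, sort, uniq) can be approximated as follows. This is a sketch, not part of the pipeline: the function name is hypothetical, and the example catalog lines use a made-up 5-character suffix because the real catalog layout is not shown here.

```python
def sample_names(lines, keyword="ALCH", suffix_len=5):
    """Mimic: grep KEYWORD | cut -f 1 | rev | cut -c6- | rev | sort | uniq."""
    names = set()
    for line in lines:
        # grep matches the keyword anywhere in the line, not just column 1.
        if keyword not in line:
            continue
        first_column = line.rstrip("\n").split("\t")[0]
        # `rev | cut -c6- | rev` drops the last 5 characters of the field.
        names.add(first_column[:-suffix_len])
    # Note: Python's sort order may differ from `sort` under some locales.
    return sorted(names)


# Hypothetical catalog lines; ".XXX1" and ".XXX2" are placeholder suffixes.
catalog = [
    "# name\tpath\n",
    "ALCH-B2NW-TTP1-A.XXX1\t/some/path_1.fastq.gz\n",
    "ALCH-B2NW-TTP1-A.XXX2\t/some/path_2.fastq.gz\n",
]
print(sample_names(catalog))
```

To use it on the real catalog, pass an open file handle instead of the list, for example `sample_names(open("Alchemist.RNAMap.storage1.tsv"))`.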

activate conda env

source activate Fusion

or

conda activate Fusion

Run fusion

Example script (generated from step 2)

bash /storage1/fs1/dinglab/Active/Projects/PECGS/PECGS_pipeline/Fusion/Fusion_hg38_scripts/fusion_pipeline_ris_v1.sh ALCH-B2NW-TTP1-A ALCH-B2NW-TTP1-A_1.fastq.gz ALCH-B2NW-TTP1-A_2.fastq.gz 10
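
For batch runs, one command per sample can be generated from samples.txt with a small Python sketch like the one below. It only rebuilds the example command above for each sample name; the function name is hypothetical, the FASTQ naming (`<sample>_1.fastq.gz`, `<sample>_2.fastq.gz`) is taken from that example, and the final argument `10` is copied as-is because its meaning is not documented here.

```python
import shlex

# Script path copied from the example command above.
PIPELINE = (
    "/storage1/fs1/dinglab/Active/Projects/PECGS/PECGS_pipeline/Fusion/"
    "Fusion_hg38_scripts/fusion_pipeline_ris_v1.sh"
)


def build_command(sample, last_arg="10"):
    """Build the pipeline command line for one sample name."""
    args = [
        "bash",
        PIPELINE,
        sample,
        f"{sample}_1.fastq.gz",  # read 1, naming assumed from the example
        f"{sample}_2.fastq.gz",  # read 2, naming assumed from the example
        last_arg,                # copied from the example; meaning not stated
    ]
    # Quote each argument so unusual sample names stay shell-safe.
    return " ".join(shlex.quote(arg) for arg in args)


for sample in ["ALCH-B2NW-TTP1-A"]:
    print(build_command(sample))
```

In practice the list would come from samples.txt, for example `[line.strip() for line in open("samples.txt") if line.strip()]`.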

Contact

Yizhe EJ Song (y.song@wustl.edu)

Thanks a lot to Qingsong and Daniel Cui Zhou who initially developed and optimized this pipeline.

Version Update:

v2.2 Fusion hg38 build for katmai, annotation libs GRCh38_gencode_v31

v2.3 Dockerized fusion hg38 v2.2 pipeline

v2.4 Updated STAR v2.7.2c to STAR v2.7.8a; Updated STAR-Fusion v1.7.0 to STAR-Fusion v1.10.0

v2.5 Fixed EricScript error with docker

v2.6 Updated STAR-Fusion and Integrate annotation libs to GRCh38_gencode_v37

v2.7 Combined STAR-Fusion and Integrate Fusion alignment steps. Works more efficiently for batch jobs. Minor updates to accommodate changes of catalog and analysis summary files.

