SV caller for long-reads
Overview
- Extract reads suggesting presence of SVs from a bam file
- Identify SV candidates
- Filter SV candidates using repeats information and number of support reads
- python3
- pysam module of python
- numpy module
- perl
- samtools (version 0.1.18 or higher)
- Two bam files (one bam sorted by read name and another sorted by genome coordinates)
- Index file (.bai) for bam file sorted by genome coordinate
- Fastq file of the sequence data
If samtools is not installed in the environment, the path to the execution file of samtools can be specified within the config file (pram.config).
vcf file of SVs (SV.vcf)
$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ sh CAMPHOR.sh <bam(sorted by read name)> <bam(sorted by genome coordinate)> <fastq> <output>
$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ sh CAMPHOR.sh ./example/NA18943.chr22.sort_by_name.test.bam ./example/NA18943.chr22.sort.test.bam ./example/NA18943.chr22.sort.test.fastq test
Install Docker in your computer, and run the following commands to install and run test.
Currently,CAMPHOR requires 16GB memory at build.
$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ docker build -t camphor .
$ docker run --rm -it -v $PWD:/out camphor
If you want to run your own data, please run the following commands.
$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ docker build -t camphor .
$ docker run --rm -it -v <path to output directory>:/out -v <path to input directory>:/input camphor sh CAMPHOR.sh /input/NA18943.chr22.sort_by_name.test.bam /input/NA18943.chr22.sort.test.bam /input/NA18943.chr22.sort.test.fastq /out/
We consider the parameter set in the provided configuration appropriate for 20x coverage WGS data. If you would like to use different parameters, please make changes in the parm.config file. In the current file, minimum variant allele frequency (MIN_VAF), minimum number of reads (MIN_READ_NUMBER) and minimum indel length (MIN_INDEL_LENGTH) (bp) are set to 0.15, 2 and 100, respectively. For greater depth of coverage, larger minimum number of reads should be appropriate.
We developed this method with nanopore sequence data base-called by albacore (total error rate =~ 15%), and set the minimum indel length to 100bp to remove false positives. However, newer basecallers have increased accuracy, and smaller minimum indel length (50bp or smaller) as well as lower minimum variant allele frequency (0.1 or lower) can be used.
If repeat information files are provided, our method filters SV candidates with the provided repeat information (Repeat masker, Tandem repeat finder, Segmental duplication, Self-chain). The filtering can increase the specificity. Please prepare annotation files with the following procedures.
Download rmsk.txt from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
$ grep Simple_repeat <path to rmsk.txt>|python3 .src/repeat/rmsk.py /dev/stdin > ./data/rmsk.txt
Download simpleRepeat.txt from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
$ python3 ./src/repeat/simpleRepeat.py <path to simpleRepeat.txt>|sort -k1,1 -k2,2g > ./data/simplerepeat.txt
Download genomicSuperDups.txt from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
$ python3 ./src/repeat/seg_dup.py <path to genomicSuperDups.txt>|sort -k1,1 -k2,2g > ./data/seg_dup.txt
Download chainSelf.txt file from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
$ python3 .src/repeat/ucsc_selfchain.py <path to chainSelf.txt> | sort -k1,1 -k2,2g > ./data/chainSelf.txt
Obtaining the annotation data used from UCSC Download and format change for repeat information can be performed automatically with the commands below.
$ cd CAMPHOR/CAMPHOR
$ mkdir ./data/
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz | zcat | grep Simple_repeat | python3 ./src/repeat/rmsk.py /dev/stdin > ./data/rmsk.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz | zcat | python3 ./src/repeat/simpleRepeat.py /dev/stdin | sort -k1,1 -k2,2g > ./data/simplerepeat.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz | zcat | python3 ./src/repeat/seg_dup.py /dev/stdin | sort -k1,1 -k2,2g > ./data/seg_dup.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chainSelf.txt.gz | zcat | python3 ./src/repeat/ucsc_selfchain.py /dev/stdin | sort -k1,1 -k2,2g > ./data/chainSelf.txt
Performance of this tool is provided in the following publications; Fujimoto et al. Whole genome sequencing with long-reads reveals complex structure and origin of structural variation in human genetic variations and somatic mutations in cancer. Genome Medicine (2021).
The performance comparison in Fujimoto et al. was performed as follows.
- Read selection was performed with the following script. This script divides a large fastq file to small ones
python3 ./src/split_fastq_with_number.py <input fastq file> <output fastq file> <number of files>
- Indels in vcf files were extracted and converted to the following format.
<chr> <start> <chr> <end> <SV type (DEL or INS)> <length (bp)>
For the format change, the following commands were used.
Sniffles
$python3 ./src/change_vcf_format_Sniffles.py <vcf from sniffles>
SVIM
$python3 ./src/change_vcf_format_SVIM.py <vcf from SVIM>
Gold standard set
$python3 ./src/change_vcf_format_GS.py <NA19240.BIP-unified.vcf>
CAMPHOR
<SVtype>_candidate.filtered.txt files were used.
- SV lists were compared with the following command
$python3 ./src/SV_comparison.py <SV list of the gold standard set generated in step2> <SV list of the each caller generated in step2> 500 120
GPLv3
Akihiro Fujimoto - afujimoto@m.u-tokyo.ac.jp