CAMPHOR

SV caller for long-reads

Overview

Extract reads suggesting presence of SVs from a bam file
Identify SV candidates
Filter SV candidates using repeats information and number of support reads

Requirement

python3
pysam module of python
numpy module
perl
samtools (version 0.1.18 or higher)

Input file

Two bam files (one bam sorted by read name and another sorted by genome coordinates)
Index file (.bai) for bam file sorted by genome coordinate
Fastq file of the sequence data

If samtools is not installed in the environment, the path to the execution file of samtools can be specified within the config file (pram.config).

Output file format

vcf file of SVs (SV.vcf)

Usage

$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ sh CAMPHOR.sh <bam(sorted by read name)> <bam(sorted by genome coordinate)> <fastq> <output>

Example

$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ sh CAMPHOR.sh ./example/NA18943.chr22.sort_by_name.test.bam ./example/NA18943.chr22.sort.test.bam ./example/NA18943.chr22.sort.test.fastq test

Installation and usage via Docker

Install Docker in your computer, and run the following commands to install and run test.

Currently,CAMPHOR requires 16GB memory at build.

$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ docker build -t camphor .
$ docker run --rm -it -v $PWD:/out camphor

If you want to run your own data, please run the following commands.

$ git clone https://github.com/afujimoto/CAMPHOR
$ cd CAMPHOR/CAMPHOR
$ docker build -t camphor .
$ docker run --rm -it -v <path to output directory>:/out -v <path to input directory>:/input camphor sh CAMPHOR.sh /input/NA18943.chr22.sort_by_name.test.bam /input/NA18943.chr22.sort.test.bam /input/NA18943.chr22.sort.test.fastq /out/

Parameter settings in configuration file

We consider the parameter set in the provided configuration appropriate for 20x coverage WGS data. If you would like to use different parameters, please make changes in the parm.config file. In the current file, minimum variant allele frequency (MIN_VAF), minimum number of reads (MIN_READ_NUMBER) and minimum indel length (MIN_INDEL_LENGTH) (bp) are set to 0.15, 2 and 100, respectively. For greater depth of coverage, larger minimum number of reads should be appropriate.

We developed this method with nanopore sequence data base-called by albacore (total error rate =~ 15%), and set the minimum indel length to 100bp to remove false positives. However, newer basecallers have increased accuracy, and smaller minimum indel length (50bp or smaller) as well as lower minimum variant allele frequency (0.1 or lower) can be used.

Repeat filtering

If repeat information files are provided, our method filters SV candidates with the provided repeat information (Repeat masker, Tandem repeat finder, Segmental duplication, Self-chain). The filtering can increase the specificity. Please prepare annotation files with the following procedures.

Repeat masker

Download rmsk.txt from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

$ grep Simple_repeat <path to rmsk.txt>|python3 .src/repeat/rmsk.py /dev/stdin > ./data/rmsk.txt

Tandem repeat

Download simpleRepeat.txt from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

$ python3 ./src/repeat/simpleRepeat.py <path to simpleRepeat.txt>|sort -k1,1 -k2,2g > ./data/simplerepeat.txt

Segmental duplication

Download genomicSuperDups.txt from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

$ python3 ./src/repeat/seg_dup.py <path to genomicSuperDups.txt>|sort -k1,1 -k2,2g > ./data/seg_dup.txt

Self-chain

Download chainSelf.txt file from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

$ python3 .src/repeat/ucsc_selfchain.py <path to chainSelf.txt> | sort -k1,1 -k2,2g > ./data/chainSelf.txt

Obtaining the annotation data used from UCSC Download and format change for repeat information can be performed automatically with the commands below.

$ cd CAMPHOR/CAMPHOR
$ mkdir ./data/
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz | zcat | grep Simple_repeat | python3 ./src/repeat/rmsk.py /dev/stdin > ./data/rmsk.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz | zcat | python3 ./src/repeat/simpleRepeat.py /dev/stdin | sort -k1,1 -k2,2g > ./data/simplerepeat.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz | zcat | python3 ./src/repeat/seg_dup.py /dev/stdin | sort -k1,1 -k2,2g > ./data/seg_dup.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chainSelf.txt.gz | zcat | python3 ./src/repeat/ucsc_selfchain.py /dev/stdin | sort -k1,1 -k2,2g > ./data/chainSelf.txt

Performance

Performance of this tool is provided in the following publications; Fujimoto et al. Whole genome sequencing with long-reads reveals complex structure and origin of structural variation in human genetic variations and somatic mutations in cancer. Genome Medicine (2021).

The performance comparison in Fujimoto et al. was performed as follows.

Read selection was performed with the following script. This script divides a large fastq file to small ones

python3 ./src/split_fastq_with_number.py <input fastq file> <output fastq file> <number of files>

Indels in vcf files were extracted and converted to the following format.
<chr> <start> <chr> <end> <SV type (DEL or INS)> <length (bp)>

For the format change, the following commands were used.
Sniffles

$python3 ./src/change_vcf_format_Sniffles.py <vcf from sniffles>

SVIM

$python3 ./src/change_vcf_format_SVIM.py <vcf from SVIM>

Gold standard set

$python3 ./src/change_vcf_format_GS.py <NA19240.BIP-unified.vcf>

CAMPHOR
<SVtype>_candidate.filtered.txt files were used.

SV lists were compared with the following command

$python3 ./src/SV_comparison.py <SV list of the gold standard set generated in step2> <SV list of the each caller generated in step2> 500 120

Licence

GPLv3

Contact

Akihiro Fujimoto - afujimoto@m.u-tokyo.ac.jp

http://www.humgenet.m.u-tokyo.ac.jp/index.en.html

Name		Name	Last commit message	Last commit date
Latest commit History 93 Commits
CAMPHOR		CAMPHOR
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CAMPHOR

Requirement

Input file

Output file format

Usage

Example

Installation and usage via Docker

Parameter settings in configuration file

Repeat filtering

Repeat masker

Tandem repeat

Segmental duplication

Self-chain

Performance

Licence

Contact

About

Releases

Packages

Contributors 2

Languages

License

afujimoto/CAMPHOR

Folders and files

Latest commit

History

Repository files navigation

CAMPHOR

Requirement

Input file

Output file format

Usage

Example

Installation and usage via Docker

Parameter settings in configuration file

Repeat filtering

Repeat masker

Tandem repeat

Segmental duplication

Self-chain

Performance

Licence

Contact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages