CAMPHORsomatic

Somatic structural variation (SV) caller for long reads

Overview

Identify SV candidates from BAM files of cancer and matched normal samples
Compare SV candidates between the cancer and matched normal samples to identify somatic SV candidates

Requirement

python3
pysam module of python
numpy module
perl

samtools (version 0.1.18 or higher)

If samtools is not available in your environment's PATH, specify the path to the samtools executable in the configuration file (pram.config).

Input files

Two BAM files for each of the cancer and matched normal samples: one sorted by read name and one sorted by genome coordinate
Index file (.bai) for each BAM file sorted by genome coordinate
FASTQ file of the cancer sample
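
The input requirements above can be checked before starting a run. The following is a minimal sketch, not part of CAMPHORsomatic; the function names are hypothetical. It assumes the index sits next to the coordinate-sorted BAM as either `<name>.bam.bai` (the name written by `samtools index`) or `<name>.bai`.

```python
import os


def find_bam_index(bam_path):
    """Return the path of the .bai index for bam_path, or None if absent."""
    # samtools index writes <name>.bam.bai; some other tools write <name>.bai.
    candidates = [bam_path + ".bai", os.path.splitext(bam_path)[0] + ".bai"]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def missing_inputs(name_sorted_bam, coord_sorted_bam, fastq=None):
    """Return a list of problems found with the input files of one sample."""
    problems = []
    for label, path in [("name-sorted BAM", name_sorted_bam),
                        ("coordinate-sorted BAM", coord_sorted_bam)]:
        if not os.path.isfile(path):
            problems.append(f"{label} not found: {path}")
    # Only look for the index when the coordinate-sorted BAM itself exists.
    if os.path.isfile(coord_sorted_bam) and find_bam_index(coord_sorted_bam) is None:
        problems.append(f"index (.bai) not found for: {coord_sorted_bam}")
    # The FASTQ file is only needed for the cancer sample.
    if fastq is not None and not os.path.isfile(fastq):
        problems.append(f"FASTQ not found: {fastq}")
    return problems
```

Calling `missing_inputs(...)` once for the cancer sample (with its FASTQ file) and once for the normal sample (without one) returns an empty list when everything is in place.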

Output file format

vcf file of SVs (somatic_SV.vcf)
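
As a quick way to inspect the output, the sketch below counts the records in a VCF file by SV type. It is an illustration only: the function name is hypothetical, and it assumes that the INFO column (the eighth tab-separated column of a VCF record) may carry an `SVTYPE=` entry, as the VCF specification defines for structural variants. Whether somatic_SV.vcf includes that entry is not stated here, so records without it are counted as "unknown".

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count VCF records by the SVTYPE entry of their INFO column."""
    counts = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and header lines (both "##" and "#CHROM").
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        info = fields[7] if len(fields) > 7 else ""
        sv_type = "unknown"
        for entry in info.split(";"):
            if entry.startswith("SVTYPE="):
                sv_type = entry[len("SVTYPE="):]
                break
        counts[sv_type] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for somatic_SV.vcf can be passed directly.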

Usage

$ cd <path to CAMPHOR>
$ sh CAMPHOR_SVcall.sh <bam of cancer sample (sorted by read name)> <bam of cancer sample (sorted by genome coordinate)> <output directory of cancer>
$ sh CAMPHOR_SVcall.sh <bam of normal sample (sorted by read name)> <bam of normal sample (sorted by genome coordinate)> <output directory of normal>
$ sh CAMPHOR_comparison.sh <output directory of cancer> <output directory of normal> <bam of cancer sample (sorted by genome coordinate)> <bam of normal sample (sorted by genome coordinate)> <fastq file of cancer> <output directory of somatic SV>
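
The three steps above can also be driven from a script. The following is a minimal sketch under the assumption that the scripts are run from the CAMPHORsomatic directory, as in the commands above; the function names and argument names are hypothetical, and the argument order is copied from the usage lines.

```python
import subprocess


def build_commands(cancer_name_bam, cancer_coord_bam,
                   normal_name_bam, normal_coord_bam,
                   cancer_fastq, cancer_out, normal_out, somatic_out):
    """Return the three CAMPHORsomatic commands in the order they must run."""
    return [
        # Step 1: SV candidates of the cancer sample.
        ["sh", "CAMPHOR_SVcall.sh", cancer_name_bam, cancer_coord_bam, cancer_out],
        # Step 2: SV candidates of the matched normal sample.
        ["sh", "CAMPHOR_SVcall.sh", normal_name_bam, normal_coord_bam, normal_out],
        # Step 3: comparison of the two candidate sets.
        ["sh", "CAMPHOR_comparison.sh", cancer_out, normal_out,
         cancer_coord_bam, normal_coord_bam, cancer_fastq, somatic_out],
    ]


def run_pipeline(commands, camphor_dir):
    """Run each command inside camphor_dir and stop at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=camphor_dir, check=True)
```

With `check=True`, `subprocess.run` raises `CalledProcessError` if a step exits with a non-zero status, so the comparison is not attempted on incomplete SV calls.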

Example

$ git clone https://github.com/afujimoto/CAMPHORsomatic
$ cd CAMPHORsomatic
$ sh CAMPHOR_SVcall.sh ./example/sample1.sort_by_name.test.bam ./example/sample1.sort.test.bam sample1
$ sh CAMPHOR_SVcall.sh ./example/sample2.sort_by_name.test.bam ./example/sample2.sort.test.bam sample2
$ sh CAMPHOR_comparison.sh ./sample1 ./sample2 ./example/sample1.sort.test.bam ./example/sample2.sort.test.bam ./example/sample1.sort.test.fastq SV

Installation and usage via Docker

Currently, building the CAMPHORsomatic Docker image requires 16 GB of memory.

Install Docker on your computer, then run the following commands to build the image and run the test.

$ git clone https://github.com/afujimoto/CAMPHORsomatic
$ cd <path to CAMPHORsomatic>
$ docker build -t camphorsomatic .
$ docker run --rm -it -v $PWD/sv:/CAMPHOR/SV -v $PWD/sample1:/CAMPHOR/sample1 -v $PWD/sample2:/CAMPHOR/sample2 camphorsomatic

To analyze your own data, run the commands below.

$ git clone https://github.com/afujimoto/CAMPHORsomatic
$ cd <path to CAMPHORsomatic>
$ docker build -t camphorsomatic .
$ docker run --rm -it \
  -v <path to directory of cancer bam>:/cancer_input \
  -v <path to directory of normal bam>:/normal_input \
  -v <path to output directory of cancer>:/cancer_output \
  -v <path to output directory of normal>:/normal_output \
  -v <path to output directory>:/output \
  camphorsomatic \
  sh -c 'sh CAMPHOR_SVcall.sh \
    /cancer_input/<name of bam file of cancer sample (sorted by read name)> \
    /cancer_input/<name of bam file of cancer sample (sorted by genome coordinate)> \
    /cancer_output \
   && sh CAMPHOR_SVcall.sh \
     /normal_input/<name of bam file of normal sample (sorted by read name)> \
     /normal_input/<name of bam file of normal sample (sorted by genome coordinate)> \
     /normal_output \
   && sh CAMPHOR_comparison.sh \
     /cancer_output \
     /normal_output \
     /cancer_input/<name of bam file of cancer sample (sorted by genome coordinate)> \
     /normal_input/<name of bam file of normal sample (sorted by genome coordinate)> \
     /cancer_input/<name of fastq file of cancer> \
     /output'

Parameter settings in configuration file

The parameters in the provided configuration file are intended for WGS data at about 20x coverage.

We developed this method with nanopore sequence data basecalled by albacore (total error rate of approximately 15%) and set the minimum indel length to 100 bp to remove false positives. Newer basecallers are more accurate, so a smaller minimum indel length (50 bp or smaller) can be used. Users can change "MIN_INDEL_LENGTH" in the pram.config file.
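
Changing the parameter can be scripted when several settings are to be compared. The sketch below is an illustration only and makes a loud assumption: that pram.config stores each parameter on its own line as `KEY=VALUE` (optionally with spaces around `=`). The file format is not described here, so check the file before relying on this; the function name is hypothetical.

```python
import re


def set_config_value(config_text, key, value):
    """Return config_text with the value of key replaced by value.

    Assumes one KEY=VALUE pair per line; raises KeyError if key is absent.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*)\S+",
        flags=re.MULTILINE,
    )
    # Keep the original "KEY=" prefix (including spacing) and swap the value.
    new_text, replaced = pattern.subn(lambda m: m.group(1) + str(value), config_text)
    if replaced == 0:
        raise KeyError(f"{key} not found in configuration text")
    return new_text
```

A typical use would read pram.config into a string, call `set_config_value(text, "MIN_INDEL_LENGTH", 50)`, and write the result to a copy of the file so the original settings are kept.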

Repeat filtering

Our method filters SV candidates using repeat annotation (RepeatMasker, Tandem Repeats Finder, segmental duplication, and self-chain). Prepare the annotation files as follows.

Make a directory for repeat files in CAMPHORsomatic directory

$ mkdir data

Repeat masker

Download rmsk.txt.gz from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ and decompress it.

$ grep Simple_repeat <path to rmsk.txt> | python3 ./src/repeat/rmsk.py /dev/stdin > ./data/rmsk.txt

Tandem repeat

Download simpleRepeat.txt.gz from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ and decompress it.

$ python3 ./src/repeat/simpleRepeat.py <path to simpleRepeat.txt> | sort -k1,1 -k2,2g > ./data/simplerepeat.txt

Segmental duplication

Download genomicSuperDups.txt.gz from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ and decompress it.

$ python3 ./src/repeat/seg_dup.py <path to genomicSuperDups.txt> | sort -k1,1 -k2,2g > ./data/seg_dup.txt

Self-chain

Download chainSelf.txt.gz from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ and decompress it.

$ python3 ./src/repeat/ucsc_selfchain.py <path to chainSelf.txt> | sort -k1,1 -k2,2g > ./data/chainSelf.txt

Obtaining the annotation data from UCSC

Downloading and reformatting the repeat information can be performed automatically with the commands below.

$ cd CAMPHORsomatic
$ mkdir ./data/
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz | zcat | grep Simple_repeat | python3 ./src/repeat/rmsk.py /dev/stdin > ./data/rmsk.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz | zcat | python3 ./src/repeat/simpleRepeat.py /dev/stdin | sort -k1,1 -k2,2g > ./data/simplerepeat.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz | zcat | python3 ./src/repeat/seg_dup.py /dev/stdin | sort -k1,1 -k2,2g > ./data/seg_dup.txt
$ curl -L http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chainSelf.txt.gz | zcat | python3 ./src/repeat/ucsc_selfchain.py /dev/stdin | sort -k1,1 -k2,2g > ./data/chainSelf.txt
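
Because a failed download in a pipeline like the ones above still leaves an (empty) output file behind, it can be worth checking the results. The following is a minimal sketch, not part of CAMPHORsomatic; the function name is hypothetical, and the four file names are the ones written by the commands in this section.

```python
import os

# Output files written into ./data by the commands in this section.
REPEAT_FILES = ["rmsk.txt", "simplerepeat.txt", "seg_dup.txt", "chainSelf.txt"]


def check_repeat_files(data_dir="data"):
    """Return {file name: problem} for repeat files that are missing or empty."""
    problems = {}
    for name in REPEAT_FILES:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
        elif os.path.getsize(path) == 0:
            problems[name] = "empty"
    return problems
```

An empty dictionary means all four files exist and contain data; it does not verify that their contents are correctly formatted.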

Data of normal samples

CAMPHOR_comparison.sh compares SV candidates from the cancer sample with those from the normal sample and removes germline SVs. This comparison uses the _candidate.txt0 files in the output directory of the normal sample. Users can merge these files from multiple normal samples and save the merged files as _candidate.txt0 files in a new directory. The new directory can then be given to CAMPHOR_comparison.sh as the output directory of the normal sample. Using multiple normal samples increases the power to remove germline SVs.
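
One way to build such a merged directory is sketched below. This is an illustration under stated assumptions, not the authors' procedure: it assumes that the candidate files have the same file names in every normal output directory, that they are line-oriented text without header lines, and that merging means concatenating files of the same name. The function name is hypothetical.

```python
import os

SUFFIX = "_candidate.txt0"


def merge_normal_candidates(normal_dirs, merged_dir):
    """Concatenate same-named *_candidate.txt0 files from several directories.

    merged_dir must not be one of normal_dirs. Returns the paths written.
    """
    os.makedirs(merged_dir, exist_ok=True)
    # Collect every candidate file name seen in any of the normal directories.
    names = sorted({name
                    for directory in normal_dirs
                    for name in os.listdir(directory)
                    if name.endswith(SUFFIX)})
    written = []
    for name in names:
        out_path = os.path.join(merged_dir, name)
        with open(out_path, "w") as out:
            for directory in normal_dirs:
                src = os.path.join(directory, name)
                if not os.path.isfile(src):
                    continue
                with open(src) as handle:
                    for line in handle:
                        # Guard against a missing final newline joining two records.
                        out.write(line if line.endswith("\n") else line + "\n")
        written.append(out_path)
    return written
```

The merged directory would then take the place of the normal output directory in the CAMPHOR_comparison.sh command. If the downstream scripts expect sorted or de-duplicated input, that step would need to be added.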

Performance

The false positive rate was estimated to be about 7% by PCR in a liver cancer sample set (Fujimoto et al. Whole genome sequencing with long-reads reveals complex structure and origin of structural variation in human genetic variations and somatic mutations in cancer. Genome Medicine (2021)).

Licence

GPLv3

Contact

Akihiro Fujimoto - afujimoto@m.u-tokyo.ac.jp

http://www.humgenet.m.u-tokyo.ac.jp/index.en.html
