Skip to content

dpuiu/MitoHPC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MitoHPC : Mitochondrial High Performance Caller

For Calling Mitochondrial Homoplasmies and Heteroplasmies

Citing

A bioinformatics pipeline for estimating mitochondrial DNA copy number and heteroplasmy levels from whole genome sequencing data, Battle et. al, NAR 2022 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9112767/

INPUT

* Illumina paired-end reads pre-aligned to a reference: .bam or .cram files 
* Alignment indeces: .bai or .crai files
* Alignment indexstats: .idxstats files

SYSTEM PREREQUISITES

* OS:                UNIX/LINUX 
* SOFTWARE PACKAGES: git,wget,tar,unzip,make,autoconf,gcc,java,perl,python 

PIPELINE PREREQUISITES

* SOFTWARE PACKAGES: bwa,samtools,bedtools,fastp,samblaster,bcftools,htslib,freebayes
* JAVA JARS:         gatk,mutserve,haplogrep,haplocheck
* HUMAN ASSEMBLY:    hs38DH

INSTALL

DOWNLOAD PIPELINE

$ git clone https://github.com/dpuiu/MitoHPC.git

UPDATE PIPELINE (optional)

$ cd MitoHPC/
$ git pull
  or
$ git checkout .

SETUP ENVIRONMENT (important)

$ cd MitoHPC/scripts
$ export HP_SDIR=`pwd`                           # set script directory variable 
$ echo "export HP_SDIR=`pwd`" >> ~/.bashrc       # add variable to your ~/.bashrc so it gets initialized automatically
$ . ./init.sh                                    # or one of init.{hs38DH,hg19,mm39}.sh  correponding to  a different reference

INSTALL SYSTEM PREREQUISITES (optional)

# needs admin/sudo permissions
$ sudo $HP_SDIR/install_sysprerequisites.sh      # install perl,pthon,java,wget ...
                                                 #  (unless already installed)

INSTALL PREREQUISITES & CHECK INSTALL

$ $HP_SDIR/install_prerequisites.sh              # install bwa,samtools,bedtools ...
                                                 #  (unless already	installed)
  or
$ $HP_SDIR/install_prerequisites.sh -f           # force install  bwa,samtools,bedtools ...
                                                 #  (latest versions) 

$ $HP_SDIR/checkInstall.sh
# if successfull => "Success message!"

$ cat checkInstall.log                           # records software version/path info

PIPELINE USAGE

SETUP ENVIRONMENT

# go to your working directory 
# copy over the init.sh file

$ cp -i $HP_SDIR/init.sh .                   # copy init to working dir.

# view/edit init.sh file
$ cat init.sh                                # check/edit local init file
    ...
   export HP_RNAME=hs38DH                    # human reference used for alignment (Ex: hs38DH, hg19)
   
   export HP_IN=$PWD/in.txt                  # input TSV file (automatically generated by runnng ". ./init.sh")

   export HP_ADIR=$PWD/bams/                 # pre-existing alignment dir; .bam, .bai, [.idxstats] files
   # or  
   export HP_ADIR=$PWD/crams/                #                             .cram, .crai, [.idxstats] files
  
   export HP_ODIR=$PWD/out/                  # output dir  

   export HP_L=222000                        # Use at most 222K reads (random subsampling); if empty, use all the reads

   export HP_M=mutect2                       # SNV caller: mutect2, mutserve, or freebayes
   export HP_I=2                             # number of SNV calling iterations
                                             #  1: one, uses rCRS or RSRS as reference 
                                             #  2: two, the 2nd one used the sample consensus 
                                             #     computed in the 1st iteration; higher accuracy 

   export HP_FNAME=filter                                                                                                  # VCF filter name
   export HP_FRULE="egrep -v 'strict_strand|strand_bias|base_qual|map_qual|weak_evidence|slippage|position|HP'"            # VCF filtering rule 

   export HP_SH=bash                         # job scheduling: bash, qsub,sbatch, ..

$ rm -f $HP_IN                                # delete the previous $HP_IN("in.txt") file if necessary
$ . ./init.sh                                # source init file 

$ printenv | grep HP_ | sort                 # check HP_ variables 
$ nano $HP_IN                                # check input file

RUN PIPELINE

$ $HP_SDIR/run.sh > run.all.sh               # create command file in working dir.
$ bash ./run.all.sh                          # execute command file      

# or run in parallel
$ grep filter.sh ./run.all.sh | parallel ; getSummary.sh

RE-RUN PIPELINE (optional)

$ nano init.sh                               # update parameters
$ . ./init.sh                                # source init file
$ $HP_SDIR/run.sh > run.all.sh               # recreate command file

$ bash ./run.all.sh

DOCKERHUB IMAGE

# available at
https://hub.docker.com/r/dpuiu1/mitohpc/tags

# note: not fully tested; uses the newest version of prerequisite software + a few other updates 
 
# pull image (if necessary)
$ docker pull dpuiu1/mitohpc

# cd work_directory

# run interactively; use the "PIPELINE USAGE" instructions
$ docker run -e  HP_SDIR="/MitoHPC/scripts/" -w  $PWD  -v $HOME:$HOME -it dpuiu1/mitohpc

# run using the "mitohpc.sh" script
docker run -e HP_ADIR="bams" -e HP_ODIR="out" -e HP_IN="in.txt"  -w  $PWD  -v $HOME:$HOME -u "$(id -u):$(id -g)"  dpuiu1/mitohpc  mitohpc.sh
# or
docker run -e HP_ADIR="bams" -e HP_ODIR="out" -e HP_IN="in.txt"  -w  $PWD  -v $HOME:$HOME -u "$(id -u):$(id -g)"  dpuiu1/mitohpc  cat /MitoHPC/scripts//mitohpc.sh > ./mitohpc.sh
# edit ./mitohpc.sh ; chmod u+x ./mitohpc.sh
docker run -e HP_ADIR="bams" -e HP_ODIR="out" -e HP_IN="in.txt"  -w  $PWD  -v $HOME:$HOME -u "$(id -u):$(id -g)"  dpuiu1/mitohpc ./mitohpc.sh 

SINGULARITY IMAGE

# available at
https://cloud.sylabs.io/library/dpuiu/project-dir/mitohpc

# note: not fully tested; uses the newest version of prerequisite software + a few other updates 
 
# pull image (if necessary)
$ singularity pull mitohpc.sif library://dpuiu/project-dir/mitohpc:latest
$ SINGULARITY_PATH=`pwd -P`

# cd work_directory

# run interactively ; use the "PIPELINE USAGE" instructions
$ singularity shell --env HP_SDIR="/MitoHPC/scripts/" $SINGULARITY_PATH/mitohpc.sif --pwd

OUTPUT

# under $HP_ODIR: TAB/SUMMARY/VCF/FASTA Files: 

count.tab                                                      # total reads & mtDNA-CN counts
subsample.tab                                                  # subsampled read counts
cvg.tab                                                        # subsampled coverage stats

# 1st ITERATION
{mutect2,mutserve,freebayes}.{03,05,10}.vcf                    # SNVs; 3,5,10% heteroplasmy thold; single SNV from single sample in each line; last column = sample name
{mutect2,mutserve,freebayes}.{03,05,10}.concat.vcf             # SNVs; 3,5,10% heteroplasmy thold; single SNV from single sample in	each line;               GT/AF/SF
{mutect2,mutserve,freebayes}.{03,05,10}.merge.vcf              # SNVs; 3,5,10% heteroplasmy thold; single SNV from multiple samples in each line
{mutect2,mutserve,freebayes}.{03,05,10}.merge.sitesOnly.vcf    
{mutect2,mutserve,freebayes}.{03,05,10}.$HP_FNAME.*            # SNVs filtered according to $HP_RULE

{mutect2,mutserve,freebayes}.{03,05,10}.tab                    # SNV counts
{mutect2,mutserve,freebayes}.{03,05,10}.summary                # SNV count summaries
{mutect2,mutserve,freebayes}.{03,05,10}.pos                    # position summaries; AC and AF values for both HOM and HET
{mutect2,mutserve,freebayes}.{03,05,10}.suspicious.{tab,ids}   # samples which either have low mtDNA-CN,low cvg, failed haplockeck, multiple NUMT/HG labels (AF<1)  

{mutect2,mutserve,freebayes}.fa                                # consensus sequence
{mutect2,mutserve,freebayes}.haplogroup?.tab                   # haplogroup
{mutect2,mutserve,freebayes}.haplocheck.tab                    # contamination screen   

# 2nd ITERATION
{mutect2.mutect2,freebayes.freebayes}.*	

LEGEND

Run                           # Sample name

all                           # number of reads in the sample reads
mapped                        # number of aligned reads in the sample
MT                            # number of reads aligned to chrM
M                             # mtDNA copy number

haplogroup                    # sample haplogroup

H                             # number of homozygous SNVs (the multiallelic SNVs counted once)
h                             # number of heterozygous SNVs
S                             # number of homozygous SNVs (excluding INDELs)
s                             # number of heterozygous SNVs (excluding INDELs)
I                             # number of homozygous INDELs
i                             # number of heterozygous INDELs
...
?p                            # "p" suffix stands for non homopolimeric
...
A                             # total number of SNVs (H+h)

SNV Annotation

FILES

CDS.bed.gz        : coding regions
COMPLEX.bed.gz    : coding regions
RNR.bed.gz        : rRNA
TRN.bed.gz        : tRNA
DLOOP.bed.gz      : D-loop
HV.bed.gz         : hyper variable regions
HP.bed.gz         : homopolymers
TAG.bed.gz        : DLOOP,CDS,RNR,TRN,GAP or NON(coding)

MITIMPACT.vcf.gz  : MITIMPACT APOGEE scores
HG.vcf.gz         : haplogroup specific SNV's
dbSNP.vcf.gz      : dbSNP
NUMT.vcf.gz       : NUMT SNV's
HS.bed.gz         : hotspots
NONSYN.vcf.gz     : NONSYNONIMOUS codon mutations in CDS regions
STOP.vcf.gz       : STOP codon mutations

SUMMARY

.        #regions  min  q1   q2   q3    max   mean    sum
CDS      13        207  525  784  1141  1812  876.54  11395  
RNR      2         954  954  1559 1559  1559  1256.50 2513
TRN      22        59   68   69   70    75    68.54   1508
DLOOP    2         576  576  746  746   746   661.00  1322
HV       3         136  136  315  359   359   270.00  810
HP       10        5    9    13   15    22    12.600  126

.          #SNVs    #POSITIONS     
NONSYN     25890    9314
MITIMPACT  24190    9103                                     
STOP       1615     1425
HG         1098     1098
dbSNP      388      359      
NUMT       382      381
HS         13       13

EXAMPLE

3 simulated samples, 100x chrM coverage 

INPUT

$ head $HP_IN
  #sampleName       inputFile            outputPath/prefix
  sample_A          bams/sample_A.bam    out/sample_A/sample_A
  sample_B          bams/sample_B.bam    out/sample_B/sample_B
  sample_C          bams/sample_C.bam    out/sample_C/sample_C

OUTPUT

output directories : 1/sample

$ ls out/
  sample_A/
  sample_B/
  sample_C/

directory structure

$ ls $HP_ADIR/*
  bams/sample_A.bam
  bams/sample_A.bam.bai
  bams/sample_A.idxstats
  bams/sample_A.count

read counts and mtDNA-CN(M)

 $ cat $HP_ODIR/count.tab
   Run         all_reads  mapped_reads  MT_reads    mtDNA-CN
   sample_A    851537886  848029490     396766      182
   sample_B    884383716  882213718     506597      223
   sample_C    786560467  785208588     503241      249

vcf files : concat & merge

# 1st iteration : homoplasmies(AF=1) and heteroplasmies(AF<1)
$ cat mutect2.03.concat.vcf  
  ...
  ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="INDEL">
  ##INFO=<ID=GT,Number=1,Type=String,Description="Genotype">
  ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
  ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
  ##INFO=<ID=HP,Number=0,Type=Flag,Description="Homoloplymer">
  ##INFO=<ID=HS,Number=0,Type=Flag,Description="Hot spot">
  ##INFO=<ID=CDS,Number=0,Type=Flag,Description="coding region">
  ##INFO=<ID=RNR,Number=0,Type=Flag,Description="rRNA">
  ##INFO=<ID=TRN,Number=0,Type=Flag,Description="tRNA">
  ##INFO=<ID=DLOOP,Number=0,Type=Flag,Description="D-loop">
  ##INFO=<ID=HG,Number=1,Type=String,Description="haplogroup">
  ##INFO=<ID=NUMT,Number=0,Type=Flag,Description="NUMT like">
  ##INFO=<ID=HV,Number=0,Type=Flag,Description="Hypervariable">
  ##INFO=<ID=AP,Number=1,Type=String,Description="Mitimpact APOGEE">
  ##INFO=<ID=APS,Number=1,Type=String,Description="Mitimpact APOGEE_score">
  ...
  #CHROM POS   ID REF ALT  QUAL  FILTER                      INFO                                                                               FORMAT SAMPLE
  chrM   64    .  C   T    .     clustered_events;haplotype  HV=HV-II;DLOOP;GT=0|1;DP=66;AF=1                                                   SM     sample_A
  chrM   73    .  A   G    .     PASS                        HV=HV-II;DLOOP;NUMT=MT780574.1|KM281532.1|KM281533.1|KM281524.1;GT=0/1;DP=51;AF=1  SM     sample_B
  chrM   73    .  A   G    .     PASS                        HV=HV-II;DLOOP;NUMT=MT780574.1|KM281532.1|KM281533.1|KM281524.1;GT=0/1;DP=46;AF=1  SM     sample_C
  chrM   73    .  A   G    .     clustered_events;haplotype  HV=HV-II;DLOOP;NUMT=MT780574.1|KM281532.1|KM281533.1|KM281524.1;GT=0|1;DP=70;AF=1  SM     sample_A
  chrM   146   .  T   C    .     clustered_events;haplotype  HV=HV-II;DLOOP;GT=0|1;DP=88;AF=1                                                   SM     sample_A
  ...
  chrM   1037  .  A   C    .     PASS                        RNR=RNR1;GT=0/1;DP=70;AF=0.178                                                     SM     sample_B
  chrM   1038  .  C   G    .     PASS                        RNR=RNR1;GT=0/1;DP=70;AF=0.165                                                     SM     sample_A
  chrM   1042  .  T   A    .     PASS                        RNR=RNR1;GT=0/1;DP=60;AF=0.24                                                      SM     sample_C
  ...
  chrM   3988  .  T   G    .     PASS                        CDS=ND1;AP=Pathogenic;APS=0.52;GT=0/1;DP=60;AF=0.224;AP=Pathogenic;APS=0.52        SM     sample_B 
  chrM   3989  .  A   T    .     PASS                        CDS=ND1;AP=Pathogenic;APS=0.52;GT=0/1;DP=65;AF=0.237;AP=Pathogenic;APS=0.52        SM     sample_A
  chrM   5186  .  A   T    .     PASS                        CDS=ND2;HG=U;AP=Pathogenic;APS=0.51;GT=0|1;DP=97;AF=0.212;AP=Pathogenic;APS=0.51   SM     sample_C
  ...

$ cat mutect2.03.merge.vcf  
  ...      
  #CHROM POS   ID REF ALT  QUAL  FILTER  INFO                                      FORMAT    sample_A      sample_B      sample_C
  ...
  chrM   3988  .  T   G    .     .       AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52  GT:DP:AF  .:.:.         0/1:60:0.224  .:.:.
  chrM   3989  .  A   T    .     .       AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52  GT:DP:AF  0/1:65:0.237  .:.:.         .:.:.
  chrM   3993  .  A   T    .     .       AC=1;AN=2;CDS=ND1                         GT:DP:AF  .:.:.         .:.:.         0/1:52:0.258
  ...

$ cat mutect2.03.merge.sitesOnly.vcf
   ...
  #CHROM POS   ID REF ALT  QUAL  FILTER  INFO                                      
  ...
  chrM   3988  .  T   G    .     .       AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52
  chrM   3989  .  A   T    .     .       AC=1;AN=2;CDS=ND1;AP=Pathogenic;APS=0.52
  chrM   3993  .  A   T    .     .       AC=1;AN=2;CDS=ND1                       
  ...

SNV counts

# 1st iteration
$ cat mutect2.03.tab            
  Run       H   h   S   s   I  i  Hp  hp  Sp  sp  Ip ip  A
  sample_A  30  45  28  36  2  9  28  44  28  36  0  8   75
  sample_B  26  43  25  34  1  9  22  42  22  34  0  8   69
  sample_C  39  43  35  35  4  8  37  43  35  35  2  8   82

# 2nd iteration
$ cat  mutect2.mutect2.03.tab 
  Run       H   h   S   s   I  i  Hp  hp  Sp  sp  Ip  ip  A
  sample_A  30  44  28  36  2  8  28  44  28  36  0   8   74		# one less heteroplasmy (h) compared with the 1st iteration
  sample_B  26  42  25  34  1  8  22  42  22  34  0   8   68		# one less heteroplasmy	(h) compared with the 1st iteration
  sample_C  39  43  35  35  4  8  37  43  35  35  2   8   82

SNV Summaries

# 1st iteration
$ cat mutect2.03.summary  | column -t
  id  count  nonZero  min  max  median  mean   sum
  H   3      3        26   39   30      31.67  95
  h   3      3        43   45   43      43.67  131
  S   3      3        25   35   28      29.33  88
  s   3      3        34   36   35      35     105
  I   3      3        1    4    2       2.33   7
  i   3      3        8    9    9       8.67   26
  Hp  3      3        22   37   28      29     87
  hp  3      3        42   44   43      43     129
  Sp  3      3        22   35   28      28.33  85
  sp  3      3        34   36   35      35     105
  Ip  3      1        0    2    0       0.67   2
  ip  3      3        8    8    8       8      24
  A   3      3        69   82   75      75.33  226

Haplogroups

$ cat mutect2.haplogroup.tab    
  Run        haplogroup  
  sample_A   A2+(64)     
  sample_B   B2          
  sample_C   C           

Haplocheck Contamination

$ cat mutect2.haplocheck.tab  | column -t
  Run        ContaminationStatus  ContaminationLevel  Distance  SampleCoverage
  sample_A   NO                   ND                  14        78
  sample_B   NO                   ND                  14        79
  sample_C   NO                   ND                  14        78
Re-annotate
 $ reAnnotateVcf.sh old.vcf new.vcf

FAQ

How to Use a Different Reference

  • Edit init.sh: variables HP_RNAME, HP_RMT, HP_RNUMT, HP_RCOUNT, HP_RURL, HP_MTLEN

  • Or use an alternative init.*.sh

    # Example: mm39 mouse reference
    # INSTALL section: replace ./init.sh with ./init.mm39.sh
    cd $HP_SDIR/
    mv ./init.sh ./init.default.sh
    cp ./init.mm39.sh ./init.sh
    
  • Rerun "init" scripts (!!! important)

    cd $HP_SDIR/
    . ./init.sh                    # init nvironmental variables
    ./install_prerequisites.sh     # downloads new reference from $HP_RURL, extracts the $HP_MT sequnce,creates new reference .fasta, .fai, .dict, .bwt files
    

How to Use Unaligned FASTQ Files

  • One has to pre-aligned the FASTQ reads to the whole genome reference

    # Example: hs38DH human reference
    # cd to the work directory  ; run "init.sh"
    cp -i $HP_SDIR/init.sh .
    . ./init.sh
    
    # create hs38DH bwa index (only run once)
    bwa_index.sh
    
    # create directory paths(symlinks)
    ln -s path_to_fastq_directory fastq
    mkdir bams
    
    # for each sample: test fwd/rev FASTQ files exist
    ls -l fastq/sample_[12].f*q*
    
    # for each sample: run bwa mem
    bwa_mem.sh fastq/sample bams/sample
    
    ls bams/
    

How to update the SNV filtering criteria

  • By default, the SNV with DP<100 or labeled as "strict_strand|strand_bias|base_qual|map_qual|weak_evidence|slippage|position|HP" are filtered out

  • To change the criteria, update the following 2 variables in the "init.sh" file

    export HP_FRULE="egrep -v 'strict_strand|strand_bias|base_qual|map_qual|weak_evidence|slippage|position|HP'"
    export HP_DP=100
    
  • Samples with multiple problems (failing haplocheck, multiple NUMTs, multiple heteroplasmies belonging to a different haplogroup, average coverage < HP_DP ...) are saved in "*.suspicious.{id,tab}" files

  • By default, suspicious samples are not removed; The users should check these files and maually remove the samples

Using HiFi reads(experimental)

  • Update the git: download scripts/{filterHiFi.sh,fixbcftoolsVcf.pl,bcftools.vcf} ; make the *.pl scripts executable

  • Run replacing the default(mutect2) snvcaller with bcftools , filter.sh with filterHiFi.sh, minimum coverage 250

    $ run.sh | sed 's|mutect2|bcftools|g' | sed 's|filter.sh|filterHiFi.sh|' |  sed 's|100|25|g' > run.all.sh