Skip to content

Latest commit

 

History

History
128 lines (100 loc) · 4.12 KB

disambiguate.md

File metadata and controls

128 lines (100 loc) · 4.12 KB

Disambiguation Micro-Benchmark

Introduction

As a preliminary test, we will attempt to disambiguate simulated paired-end reads from a homologous gene (ABL1) across human and murine species.

Assembly Species Coordinates
hg38 Homo sapiens chr9:130713881-130887675
mm10 Mus musculus chr2:31688354-31807093
rn6 Rattus norvegicus chr3:10041820-10145076

Location of the ABL1 gene in three model species.

Dependencies

Simulation of Training Data

We will simulate 1,000 paired-end reads with uniform coverage over the target gene in all three species. For the sake of tracking read pairs, we use the -P flag and prefix each read name with the source assembly ID.

❯ mkdir -p insilico
❯ parallel \
    'dwgsim -N 1000 -c 0 -z 42 -x <( echo {2} ) -P {1} \
         /pipeline/reference-data/references/{1}/{1}.fa \
         insilico/{1}' \
    ::: hs38DH mm10 rn6 \
    :::+ 'chr9 130713881 130887675' 'chr2 31688354 31807093' 'chr3 10041820 10145076'

Digitaly Mixing All Training Data

We concatenate all reads into a single FASTQ pair and then convert that file into an umapped BAM (uBAM) file.

❯ cat insilico/*read1* > insilico/all.read1.fastq
❯ cat insilico/*read2* > insilico/all.read2.fastq
❯ picard FastqToSam \
    F1=insilico/all.read1.fastq \
    F2=insilico/all.read2.fastq  \
    OUTPUT=insilico/all.unmapped.bam \
    SAMPLE_NAME=all 

Alignment

We align all read pairs to all three reference genomes independently. Although verbose, we follow the GATK best practice alignment guide so that we ensure the sequence dictionaries of the output BAMs are correct.

❯ parallel \
    'picard SamToFastq \
         INPUT=insilico/all.unmapped.bam \
         INTERLEAVE=true \
         FASTQ=/dev/stdout \
     | bwa mem -t 8 -p \
         /pipeline/reference-data/references/{}/{}.fa \
         /dev/stdin \
     | picard MergeBamAlignment \
         ALIGNED=/dev/stdin \
         UNMAPPED=insilico/all.unmapped.bam \
         OUTPUT=insilico/all.{}.mapped.bam \
         REFERENCE_SEQUENCE=/pipeline/reference-data/references/{}/{}.fa
  ::: hs38DH mm10 rn6

Disambiguation

We are now ready to disambiguate our aligned templates!

❯ java -jar cvbio.jar Disambiguate -i insilico/all.*.mapped.bam -o insilico/disambiguated

Validation

Let's see how we did:

\ls -1 insilico/disambiguated*
insilico/disambiguated.hs38DH.bam
insilico/disambiguated.mm10.bam
insilico/disambiguated.rn6.bam
❯ parallel -k \
    'echo Distribution of known alignments disambiguated into: {} \
     && picard ViewSam \
          INPUT=insilico/disambiguated.{}.bam \
          ALIGNMENT_STATUS=All \
          PF_STATUS=All \
        2>/dev/null \
        | grep -v "@" \
        | tr "_" "\t" \
        | cut -f1 \
        | sort \
        | uniq -c \
     && echo' \
    ::: hs38DH mm10 rn6

Distribution of known alignments disambiguated into: hs38DH
1898 hs38DH

Distribution of known alignments disambiguated into: mm10
1906 mm10
   6 rn6

Distribution of known alignments disambiguated into: rn6
   2 mm10
1914 rn6

Conclusion

Because these are alignments, and not read pairs or templates, we cannot directly calculate performance metrics such as sensitivity or specificty of this method. We hope to improve these benchmarks.