As a preliminary test, we will attempt to disambiguate simulated paired-end reads from a homologous gene (ABL1
) across human and murine species.
Assembly | Species | Coordinates |
---|---|---|
hg38 |
Homo sapiens | chr9:130713881-130887675 |
mm10 |
Mus musculus | chr2:31688354-31807093 |
rn6 |
Rattus norvegicus | chr3:10041820-10145076 |
Location of the
ABL1
gene in three model species.
We will simulate 1,000 paired-end reads with uniform coverage over the target gene in all three species.
For the sake of tracking read pairs, we use the -P
flag and prefix each read name with the source assembly ID.
❯ mkdir -p insilico
❯ parallel \
'dwgsim -N 1000 -c 0 -z 42 -x <( echo {2} ) -P {1} \
/pipeline/reference-data/references/{1}/{1}.fa \
insilico/{1}' \
::: hs38DH mm10 rn6 \
:::+ 'chr9 130713881 130887675' 'chr2 31688354 31807093' 'chr3 10041820 10145076'
We concatenate all reads into a single FASTQ pair and then convert that file into an umapped BAM (uBAM) file.
❯ cat insilico/*read1* > insilico/all.read1.fastq
❯ cat insilico/*read2* > insilico/all.read2.fastq
❯ picard FastqToSam \
F1=insilico/all.read1.fastq \
F2=insilico/all.read2.fastq \
OUTPUT=insilico/all.unmapped.bam \
SAMPLE_NAME=all
We align all read pairs to all three reference genomes independently. Although verbose, we follow the GATK best practice alignment guide so that we ensure the sequence dictionaries of the output BAMs are correct.
❯ parallel \
'picard SamToFastq \
INPUT=insilico/all.unmapped.bam \
INTERLEAVE=true \
FASTQ=/dev/stdout \
| bwa mem -t 8 -p \
/pipeline/reference-data/references/{}/{}.fa \
/dev/stdin \
| picard MergeBamAlignment \
ALIGNED=/dev/stdin \
UNMAPPED=insilico/all.unmapped.bam \
OUTPUT=insilico/all.{}.mapped.bam \
REFERENCE_SEQUENCE=/pipeline/reference-data/references/{}/{}.fa
::: hs38DH mm10 rn6
We are now ready to disambiguate our aligned templates!
❯ java -jar cvbio.jar Disambiguate -i insilico/all.*.mapped.bam -o insilico/disambiguated
Let's see how we did:
❯ \ls -1 insilico/disambiguated*
insilico/disambiguated.hs38DH.bam
insilico/disambiguated.mm10.bam
insilico/disambiguated.rn6.bam
❯ parallel -k \
'echo Distribution of known alignments disambiguated into: {} \
&& picard ViewSam \
INPUT=insilico/disambiguated.{}.bam \
ALIGNMENT_STATUS=All \
PF_STATUS=All \
2>/dev/null \
| grep -v "@" \
| tr "_" "\t" \
| cut -f1 \
| sort \
| uniq -c \
&& echo' \
::: hs38DH mm10 rn6
Distribution of known alignments disambiguated into: hs38DH
1898 hs38DH
Distribution of known alignments disambiguated into: mm10
1906 mm10
6 rn6
Distribution of known alignments disambiguated into: rn6
2 mm10
1914 rn6
Because these are alignments, and not read pairs or templates, we cannot directly calculate performance metrics such as sensitivity or specificty of this method. We hope to improve these benchmarks.