# rDNA Haplotype Discovery
```
pi:ababaian
start: 2016 11 25
complete : 2016 12 08
```
## Introduction

To date, rDNA alignments are based on the reference sequence [GenBank: U13369.1](http://www.ncbi.nlm.nih.gov/nuccore/555853). One of the main issues I anticipated, and which appears to hold true in the data, is that if there are clusters of variants relative to this consensus sequence, alignment is likely to be difficult because reads carrying those variants will drop out.

## Discovery - A3838

During the 'automated VCF' experiment exactly this case was found: the standard VCF calling pipeline fails to call these variants because the reads are 'poorly aligned'. In NA19240, at position 1,003,838, there is a variant allele (C to A) at 5% frequency, with reads coming from both strands. Keep in mind that if one estimates 300 rDNA copies, then even a minor allele frequency as low as ~0.3% (about 1 copy in 300) is biologically pertinent, although it may be below detection thresholds. Interpreted another way, ~15/300 copies of rDNA are variant in NA19240.

```
chr13:1,003,838
Total count: 652
A      : 34  (5%,     15+,   19- )
C      : 617  (95%,     283+,   334- )
G      : 0
T      : 1  (0%,     0+,   1- )
N      : 0
---------------
```
In IGV I sorted reads by 'base', which shows that the A variant co-occurs with several neighbouring variants. Together these mismatches give the reads a very low MAPQ score, which excludes them from VCF calling. This variant occurs in a conserved sequence of 18S.
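To illustrate how a MAPQ cutoff removes such reads, here is a minimal sketch that partitions SAM records on the MAPQ column (field 5 in the SAM format). The function name `split_by_mapq`, the threshold of 20, and the toy records are all assumptions for illustration; the actual cutoff used by the calling pipeline is not recorded in this entry.

```python
def split_by_mapq(sam_lines, min_mapq=20):
    """Partition SAM alignment lines into (kept, dropped) read names by MAPQ.

    min_mapq=20 is an assumed threshold, not the pipeline's real setting.
    """
    kept, dropped = [], []
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # MAPQ is the fifth mandatory SAM field
        (kept if mapq >= min_mapq else dropped).append(fields[0])
    return kept, dropped


# Toy records, not real data: names, positions and MAPQ values are invented.
toy_sam = [
    "@SQ\tSN:chr13\tLN:1100000",
    "read_ref\t0\tchr13\t1003800\t60\t100M\t*\t0\t0\t*\t*",
    "read_alt\t0\tchr13\t1003800\t3\t100M\t*\t0\t0\t*\t*",
]
kept, dropped = split_by_mapq(toy_sam)
print("kept:", kept)
print("dropped:", dropped)
```

Listing the dropped read names is one way to check whether the reads lost at this step are the same ones carrying the variant cluster.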

![20161115_na19240_igv_altHaplo_A3838.png](../figure/20161115_na19240_igv_altHaplo_A3838.png)


I selected the reads with the A3838 variant, together with the original rDNA consensus sequence, and multiple-aligned them with standard parameters using `clustalX`. I then manually fixed the alignment in Jalview. The resulting alignment is:


```
cd ~/Crown/data/haplo_discovery
cat 20161125_altHaplo_A3838.mfa

>NHIST_0024/1-100
........................................................................
........................................................................
.......GCTCCTCTCCTACTTGGAAAACTGTGGTAATTCTAGAGCTATTACATGCCGAAGGGCACTGACCC
CCTTCACGGGGAAGATGCGTGCATTTATCAGATTA.....................................
........................................................................
...........................................
>NHIST_0004/1-100
........................................................................
....................................TGTGAATGGCTCATTAAATCAGTTATGGTTCTTTTG
GTTGCTGGCTCCTCTCCTACTTGGAAAACTGTGGTAATTCTAGAGCTATTACATGCCGAAGGGC........
........................................................................
........................................................................
...........................................
>NHIST_0015/1-100
........................................................................
.......................................
```

![20161115_rDNA_haplo_A3838_realigned](../figure/20161115_rDNA_haplo_A3838_realigned.png)

The realignment shows what appears to be a variant haplotype of 18S with multiple neighbouring variations.
```
>Standard_rDNA_around_C3838
TGCGAATGGCTCATTAAATCAGTTATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTA
ATTCTAGAGCTAATACATGCCGACGGGCGCTGACCCCCTTCGCGGGGGGGATGCGTGCATTTATCAGATCAA
AACCAACCCGGTCAGCCCCTCTCCGGCCCCGGCCGGGGGGCGGGCGCCG
>Variant_rDNA_around_A3838
TGTGAATGGCTCATTAAATCAGTTATGGTTCTTTTGGTCGCTCGCTCCTCTCCTACTTGGAAAACTGTGGTA
ATTCTAGAGCTAATACATGCCGAAGGGCGCTGACCCCCTTCGCGGGGAAGATGCGTGCATTTATCAGATCAA
AACCAACCCGGTCAGCCCCTCTCCGGCCCCGGCCGGGGGGTCGGGTACC
```
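The differences between the two records can be listed with a simple position-by-position comparison. This is a minimal sketch, assuming an ungapped alignment; `diff_positions` is a hypothetical helper name. Only the first two lines (144 bases) of each record are compared, because the last few bases of the third line diverge in a way that an ungapped comparison cannot interpret reliably.

```python
def diff_positions(ref, alt):
    """Return 1-based (position, ref_base, alt_base) for each mismatch."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt), start=1) if r != a]


# First two lines (144 bases) of each record above.
standard = (
    "TGCGAATGGCTCATTAAATCAGTTATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTA"
    "ATTCTAGAGCTAATACATGCCGACGGGCGCTGACCCCCTTCGCGGGGGGGATGCGTGCATTTATCAGATCAA"
)
variant = (
    "TGTGAATGGCTCATTAAATCAGTTATGGTTCTTTTGGTCGCTCGCTCCTCTCCTACTTGGAAAACTGTGGTA"
    "ATTCTAGAGCTAATACATGCCGAAGGGCGCTGACCCCCTTCGCGGGGAAGATGCGTGCATTTATCAGATCAA"
)

# Positions are relative to the start of the excerpt, not rDNA coordinates.
for pos, ref_base, alt_base in diff_positions(standard, variant):
    print(f"{pos}\t{ref_base}>{alt_base}")
```

The printed positions would need an offset added to map them back to reference coordinates.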

BLAT (UCSC) of the A3838 variant sequence against hg19:
```
Side by Side Alignment


00001 tgtgaatggctcattaaatcagttatggttcttttggtcgctcgctcctc 00050
>>>>> || ||||||||||||||||||||||||||||||||| ||||||||||||| >>>>>
18257 tgcgaatggctcattaaatcagttatggttcttttgatcgctcgctcctc 18306

00051 tcctacttggaaaactgtggtaattctagagctaatacatgccgaagggc 00100
>>>>> |||||||| ||||||||||||||||||||||||||| ||||||||||||| >>>>>
18307 tcctactttgaaaactgtggtaattctagagctaatgcatgccgaagggc 18356

00101 gctgacccccttcgcggggaagatgcgtgcatttatcagatcaaaaccaa 00150
>>>>> |||||||||||||| ||||||||||||||||||||||||||||||||||| >>>>>
18357 gctgacccccttcgtggggaagatgcgtgcatttatcagatcaaaaccaa 18406

00151 cccggtcagcccctctccggccccggccggggggtcgggtacc 00193
>>>>> ||||||||||||||||| |||||||||| ||||||||||| || >>>>>
18407 cccggtcagcccctctctggccccggccagggggtcgggtgcc 18449

```
BLAST (NCBI)
There are strong hits to the variant consensus sequence; the one below is 96% identical (186/193):

```
Score	Expect	Identities	Gaps	Strand
318 bits(172) 	2e-83 	186/193(96%) 	0/193(0%) 	Plus/Plus

>AC018688.9 Homo sapiens BAC clone RP11-462H3 from UL, complete sequence
TGTGAATGGCTCATTAAATCAGTTATGGTTCTTTTGGTTGCTGGCTCCTCTCCTACTTGGAAAACTGTGGTAATTCTAGA
GCTATTACATGCCGAAGGGCACTGACCCCCTTCACGGGGAAGATGCGTGCATTTATCAGATTAAAACCAACCCAGTCAGC
CCCTCTCCGGCCCCGGCCGGGGGGTCGGGTACC
```
Downloaded that BAC clone sequence to `~/Crown/data/haplo_discovery`

## T3774
"Walking" down the fragments, there appear to be more variants at about 5% frequency, and paired-end reads link the A3838 variations with these ones. The same sort-then-align strategy was used to pull out the T3774 variant reads:
```
chr13:1,003,774
Total count: 771
A      : 0
C      : 733  (95%,     357+,   376- )
G      : 1  (0%,     1+,   0- )
T      : 37  (5%,     18+,   19- )
N      : 0
---------------
```
![20161125_T3447_aligned.png](../figure/20161125_T3447_aligned.png)
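The idea of linking two variant sites through shared read pairs can be sketched as a co-occurrence count. This is only an illustration under stated assumptions: `count_linked_alleles` is a hypothetical helper, the input is assumed to be already reduced to (pair name, position, base) tuples, and the toy observations are invented rather than taken from the data.

```python
from collections import Counter


def count_linked_alleles(observations, pos_a, pos_b):
    """Count (base at pos_a, base at pos_b) combinations seen per read pair.

    observations: iterable of (pair_name, position, base) tuples, where both
    mates of a pair share the same pair_name.
    """
    by_pair = {}
    for pair_name, position, base in observations:
        by_pair.setdefault(pair_name, {})[position] = base
    combos = Counter()
    for alleles in by_pair.values():
        # Only pairs that cover both sites can say anything about linkage.
        if pos_a in alleles and pos_b in alleles:
            combos[(alleles[pos_a], alleles[pos_b])] += 1
    return combos


# Toy observations (invented pair names), not real reads.
toy = [
    ("pair1", 3774, "T"), ("pair1", 3838, "A"),
    ("pair2", 3774, "C"), ("pair2", 3838, "C"),
    ("pair3", 3774, "T"), ("pair3", 3838, "A"),
    ("pair4", 3838, "C"),  # covers only one site, so it is not counted
]
print(count_linked_alleles(toy, 3774, 3838))
```

If the two variants sit on the same haplotype, the T/A combination should dominate among pairs carrying either variant, with few mixed T/C or C/A pairs.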


## Replicating A3838 + T3774 in NA12878

### Using NA12878 PacBio - rep2
```
chr13:1,003,774
Total count: 500
A      : 2  (0%,     1+,   1- )
C      : 480  (96%,     206+,   274- )
G      : 1  (0%,     0+,   1- )
T      : 17  (3%,     10+,   7- )
N      : 0
---------------
DEL: 41
INS: 98
```
```
chr13:1,003,838
Total count: 521
A      : 13  (2%,     5+,   8- )
C      : 506  (97%,     208+,   298- )
G      : 2  (0%,     1+,   1- )
T      : 0
N      : 0
---------------
DEL: 25
INS: 78
```
### Using NA12878 Ultra-deep DNA library
```
chr13:1,003,774
Total count: 9314
A      : 40  (0%,     40+,   0- )
C      : 8972  (96%,     8888+,   84- )
G      : 18  (0%,     18+,   0- )
T      : 284  (3%,     256+,   28- )
N      : 0
---------------
```
```
chr13:1,003,838
Total count: 9489
A      : 321  (3%,     276+,   45- )
C      : 9096  (96%,     8743+,   353- )
G      : 37  (0%,     29+,   8- )
T      : 35  (0%,     35+,   0- )
N      : 0
---------------
```
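Since IGV rounds percentages to whole numbers, comparing samples is easier with unrounded allele fractions. Below is a minimal sketch of a parser for the pasted IGV count blocks; `parse_igv_counts` and `allele_fraction` are hypothetical helper names, and the regular expression assumes the line layout shown in this entry.

```python
import re

# Matches lines such as "A      : 321  (3%, ...)" or "N      : 0".
COUNT_LINE = re.compile(r"^([ACGTN])\s*:\s*(\d+)")


def parse_igv_counts(text):
    """Return {base: read_count} parsed from a pasted IGV count block."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def allele_fraction(counts, allele):
    """Fraction of base calls supporting the given allele."""
    total = sum(counts.values())
    return counts.get(allele, 0) / total if total else 0.0


# Ultra-deep NA12878 counts at chr13:1,003,838, copied from above.
ultra_deep_3838 = """\
chr13:1,003,838
Total count: 9489
A      : 321  (3%,     276+,   45- )
C      : 9096  (96%,     8743+,   353- )
G      : 37  (0%,     29+,   8- )
T      : 35  (0%,     35+,   0- )
N      : 0
"""
counts = parse_igv_counts(ultra_deep_3838)
print(counts)
print(f"A fraction: {allele_fraction(counts, 'A'):.4f}")
```

The same two functions can be applied to the NA19240 and PacBio blocks, where the `DEL:` and `INS:` lines are ignored by the pattern.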

## Discussion

This probably needs to be expanded, but my takeaway is that there are alternative rDNA regions. These could very likely be pseudogenes, which needs to be ruled out. I'm not satisfied with the alignment/detection of these variants by GATK/VCF, so I pivoted to testing whether alternative aligners could do a better job of aligning divergent reads.

To be honest, this section wasn't really 'hypothesis' driven. It was partly informative, but I'm unsure whether it was worth the effort. I need to focus on asking disprovable questions instead of tugging at every thread that shows itself on this project.