# hgr0 reference rDNA
```
pi:ababaian
files: ~/Crown/resources/hgr0/
start: 2017 02 13
complete : 2017 02 20
```
## Introduction

I've gone through and 'refined' the original rDNA repeat reference. For now I'll be focusing on RNA45S and its promoter. I won't worry about phasing issues and alternative haplotypes for this first round of analysis; the aim is to get a bunch of genomes/transcriptomes aligned and get a 'global' picture of rDNA variation.



## Objective

Define a reference 'rDNA' based on NA19240 for use in an initial series of alignment experiments.


## Materials and Methods

Schematic of hgr0 reference rDNA
``` 
                         1M                                          +43 kb
hgr       ----------------|              U13369.1                     |---------
                          ( RNA45S )                                ( )promoter



              10k     -1kb 1M          13.5 kb
hgr0       ---| 5s |--|prom|    RNA45S |-----------------------------------------
```

### Reference Sequences


In [2]:
# Starting Reference Sequences
cd ~/Crown/data/hgr0

# 5s_noAlu (Alu is N-masked)
# To be inserted starting at position 10,000
cat 5s_nAlu.fa

>r5s_nAlu
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
gtgggccctgggccctgacgcctcggagcactccctgctccgagcgggcc
cgatgtggtggaagctcgggagcgcgggagccgggggaaggccgcgggcc
agcggctcgggggtccccgatccgagccccgcggccccgggctggcggtg
tcggctgcaatccggcgggcacggccgggccgggctgggctcttggggca
gccaggcgcctccttcagcGTCTACGGCCATACCACCCTGAACGCGCCCG
ATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTG
GATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTTTttctttggct
ttttgctgtttctttccttttcttccNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ccgcccggcctgctgtaggcttttgtggcttccccgctgcctcccttccc
cccacagtcgccatgcttcccaacctcccctgactctgctccccctttac
c

In [3]:
# NA19240 corrected RNA45S
# Inserted starting at position 1,000,001
cat NA19240_alt45s.fa

>NA19240_45S
GCTGACACGCTGTCCTCTGGCGACCTGTCGCTGGAGAGGTTGGGCCTCCG
GATGCGCGCGGGGCTCTGGCCTACCGGTGACCCGGCTAGCCGGCCGCGCT
CCTGCTTGAGCCGCCTGCCGGGGCCCGCGGGCCTGCTGTTCTCTCGCGCG
TCCGAGCGTCCCGACTCCCGGTGCCGGCCCGGGTCCGGGTCTCTGACCCA
CCCGGGGGCGGCGGGGAAGGCGGCGAGGGCCACCGTGCCCCCGTGCGCTC
TCCGCTGCGGGCGCCCGGGGCGGCCGCGACAACCCCACCCCGCTGGCTCC
GTGCCGTGCGTGTCAGGCGTTCTCGTCTCCGCGGGGTTGTCCGCCGCCCC
TTCCCCGGAGTGGGGGGTTGGCCGGAGCCGATCGGCTCGCTGGCCGGCCG
GCCTCCGCTCCCGGGGGGCTCTTCGTGATCGATGTGGTGACGTCGTGCTC
TCCCGGGCCGGGTCCGAGCCGCGACGGGCGAGGGGCGGACGTTCGTGGCG
AACGGGACCGTCCTTCTCGCTCCGCCCCGCTGGGGTCCCCTCGTCTCTCC
TCTCCCCGCCCGCCGGCGGTGCGTGTGGGAAGGCGTGGGGTGCGGACCCC
GGCCCGACCTCGCCGTCCCGCCCGCCGCCTTCTGCGTCGCGGGTGCGGGC
CGGCGGGGTCCTCTGACGCGGCAGACAGCCCTCGCTGTCGCCTCCAGTGG
TTGTCGACTTGCGGGCGGCCCCCCTCCGCGGCGGTGGGGGTGCCGTCCCG
CCGGCCCGTCGTGCTGCCCTCTCGGGGGGTTTGCGCGAGCGTCGGCTCCG
CCTGGGCCCTTGCGGTGCTCCTGGAGCGCTCCGGGTTGTCCCTCAGGTGC
CCGAGGCCGAACGGTGGTGTGTCGTTCCCGCCCCCGGCGCCCCCTCCTCC
GGTCGCCGCCGCGGTGTCCGCGCGTGGGTCCTGAGGGAGCTCGTCGGTGT

In [3]:
# NA19240 Upstream promoter of RNA45S
# Inserted /ENDING/ at position 1,000,000
cat rRNA_Promoter_refined.fa

>NA19240_refined_promoter
AACCGCGCCGTGGGTTGTCTTCTGACTCTGTCGCGGTCGAGGCAGAGACG
CGTTTTGGGCACCGTTTGTGTGGGGTTGGGGCAGAGGGGCTGCGTTTTCG
GCCTCGGGAAGAGCTTCTCGACTCACGGTTTCGCTTTCGCGGTCCACGGG
CCGCCCTGCCAGCCGGATCTGTCTCGCTGACGTCCGCGGCGGTTGTCGGG
CTCCATCTGGCGGCCGCTTTGAGATCGTGCTCTCGGCTTCCGGAGCTGCG
GTGGCAGCTGCCGAGGGAGGGGACCGTCCCCGCTGTGAGCTAGGCAGAGC
TCCGGAAAGCCCGCGGTCGTCAGCCCGGCTGGCCCGGTGGCGCCAGAGCT
GTGGCGCGTCGCTTGTGAGTCACAGCTCTGGCGTGCAGGTTTATGTGGGG
GAGAGGCTGTCGCTGCGCTTCTGGGCCCGCGGCGGGCGTGGGGCTGCCCG
GGCCGGTCGACCAGCGCGCCGTAGCTCCCGAGGCCCGAGCCGCGACCCGC
GGGGACCCGCCGCGCGTGGCGCGGGAGGCTGGGGACGCCCTTCCCGGCCC
GGTCGCGGGTCCGCGCTCATCCTGGCCGTCTGAGGCGGCGGCCGAATTCG
TTTCCGAGTCCCCGTGGGGAGCCGGGGACCGTCCCGCCCCCGTCCCCCGG
GTGCCGGGGAGCGGTCCCCGGGCCGGGCCGCGGTCCCTCTGCCGCGATCC
TTTCTGGCGAGTCCCCGTGCGGAGTCGGAGAGCGCTCCCTGAGCGCGCGT
GCGGCCCGAGAGGTCGCGCCTGGCCGGCCTTCGGTCCCTCGTGTGTCCCG
GTCGTAGGAGGGGCCGGCCGAAAATGCTTCCGGCTCCCGCTCTGGAGACA
CGGGCCGGCCCCCTGCGTGTGGCACGGGCGGCCGGGAGGGCGTCCCCGGC
CCGGCGCTGCTCCCGCGTGTGTCCTGGGGTTGACCAG

In [4]:
## These were manually put together into
## hgr0_draft2.fa
#cat hgr0_draft2.fa
md5sum hgr0_draft2.fa

58e6df397fee0615ff081e091ccddd26  hgr0_draft2.fa
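
The draft was assembled by hand, so no command is recorded for it. A rough sketch of the placement logic described in the cell comments (one fragment starting at a given position, another ending just before it), assuming 1-based positions on an all-N backbone. `place_fragment`, the 30-base backbone and the toy fragments are hypothetical stand-ins, not the real sequences.

```bash
#!/bin/bash
# place_fragment BACKBONE START FRAGMENT
# Overwrite BACKBONE (a plain sequence string) with FRAGMENT beginning at
# START (1-based) and print the resulting sequence string.
place_fragment() {
  awk -v bb="$1" -v s="$2" -v frag="$3" 'BEGIN {
    if (s < 1 || s + length(frag) - 1 > length(bb)) {
      print "fragment does not fit on backbone" > "/dev/stderr"; exit 1
    }
    print substr(bb, 1, s - 1) frag substr(bb, s + length(frag))
  }'
}

# Toy backbone of 30 N's (hypothetical; the real backbone is far longer)
backbone=$(printf 'N%.0s' $(seq 1 30))

prom="ccgg"        # stands in for the promoter
gene="GCTGACAC"    # stands in for RNA45S

# The gene STARTS at position 21; the promoter must END at position 20,
# so its start is 20 - length + 1.
backbone=$(place_fragment "$backbone" 21 "$gene")
backbone=$(place_fragment "$backbone" $((20 - ${#prom} + 1)) "$prom")
echo "$backbone"
```

Computing the promoter start from its required end position keeps the promoter/45S junction fixed regardless of promoter length.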


One issue in the alignments of the 5S repeat is a simple-sequence repeat of (TC)n and (TG)n between r5s:1797-2038.

For this initial run, this will be masked to 'N's. I see the irony of masking a repeat when studying a region which was masked as a repeat for years.
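
A minimal sketch of this kind of range masking, assuming a single-record FASTA and 1-based inclusive coordinates. `mask_range` is a hypothetical helper and the toy sequence only stands in for the 5S repeat; it is not the command that produced the masked file.

```bash
#!/bin/bash
# mask_range FASTA START END
# Replace bases START..END (1-based, inclusive) of a single-record FASTA
# with 'N' and re-wrap the sequence at 50 bp per line.
mask_range() {
  awk -v s="$2" -v e="$3" '
    /^>/ { print; next }            # pass the header through unchanged
    { seq = seq $0 }                # join the wrapped sequence lines
    END {
      n = length(seq)
      if (s < 1) s = 1              # clamp the range to the sequence
      if (e > n) e = n
      masked = ""
      for (i = s; i <= e; i++) masked = masked "N"
      out = substr(seq, 1, s - 1) masked substr(seq, e + 1)
      for (i = 1; i <= n; i += 50) print substr(out, i, 50)
    }' "$1"
}

# Toy example: mask a (TC)n run at positions 5-12
tmp=$(mktemp -d)
printf '>toy\nACGTTCTCTCTCGGCA\n' > "$tmp/toy.fa"
mask_range "$tmp/toy.fa" 5 12
```

For the 5S case this would be called as `mask_range 5s_nAlu.fa 1797 2038 > 5s_nAlu_nSR.fa`, if the r5s coordinates are 1-based and inclusive (an assumption).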


In [3]:
## simple repeat masked 5s
cat 5s_nAlu_nSR.fa

>r5s_nAlu
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
gtgggccctgggccctgacgcctcggagcactccctgctccgagcgggcc
cgatgtggtggaagctcgggagcgcgggagccgggggaaggccgcgggcc
agcggctcgggggtccccgatccgagccccgcggccccgggctggcggtg
tcggctgcaatccggcgggcacggccgggccgggctgggctcttggggca
gccaggcgcctccttcagcGTCTACGGCCATACCACCCTGAACGCGCCCG
ATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTTG
GATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTTTttctttggct
ttttgctgtttctttccttttcttccNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ccgcccggcctgctgtaggcttttgtggcttccccgctgcctcccttccc
cccacagtcgccatgcttcccaacctcccctgactctgctccccctttac
c

## Results

Putting all of this together, the hgr0 draft rDNA reference for alignment is below:

In [5]:
#cat hgr0_draft3.fa
md5sum hgr0_draft3.fa

4e5964cf2d536b7bad1f400d18e75993  hgr0_draft3.fa


### Corrections (for hgr0)

There is an extra 'AT' at 999,995 and 999,996. Line 20,006 changed.

Line 20,017
`AACGGGACCGTCCTTCTCGCTCCGCCCCGCTGGGGTCCCCTCGTCTCTCC` changed to
`AACGGGACCGTCCTTCTCGCTCCGCCCCGCGGGGGTCCCCTCGTCTCTCC`

Line 20,076
`TCCGGTTCGCCGCGCCCCGCCCCGGCCCCACCTGTCCCGGCCGCCGCCCC` changed to
`TCCGGTTCGCCGCGCCCCGCCCCGGCCCCACCGGTCCCGGCCGCCGCCCC`

Line 20,153, single C insertion removed (post-5.8S, pre-28S).
`GCGTCCCGGTCGCCGCGGTTCCGCCGCCCGCCCCCGGTGGCGGCCCGGCG` changed to
`GCGTCCCGGTCGCCGCGGTTCGCCGCCCGCCCCCGGTGGCGGCCCGGCG`
(All subsequent lines in hgr0_draft4 also have to change, but do this last.)

Line 20,246, single C insertion removed.
`GGCGGGGTCCGCCGGCCCTGCGGGCCGCCCGGTGAAATACCACTACTCTG` changed to
`GGCGGGGTCCGCCGGCCCTGCGGGCCGCCCGGTGAAATACCACTACTCTG`

#### Post 18S, AG simple repeat.

There is an (AG)n simple repeat after 18S. Lots of reads map here, so to simplify alignments and variant analysis it is masked to 'N's for hgr0.
```
CGTTCGTTCGCCGCCCGGCCCCGCCGCCGCGAGAGCCGAGAACTCGGGAG
GGAGACGGGGGGGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
GAGAAAGAAGGGCGTGTCGTTGGTGTGCGCGTGTCGTGGGGCCGGCGGGC
```
changed to
```
CGTTCGTTCGCCGCCCGGCCCCGCCGCCGCGAGAGCCGAGAACTCGGGAG
GGAGACGGGGGGGAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNAAAGAAGGGCGTGTCGTTGGTGTGCGCGTGTCGTGGGGCCGGCGGGC
```
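
One way to double-check masking like this is to list every run of N in a sequence as zero-based, half-open BED intervals. The sketch below assumes a single-record FASTA. `n_runs` is a hypothetical helper and the input is toy data, not the hgr0 sequence.

```bash
#!/bin/bash
# n_runs FASTA
# Print each run of N/n in a single-record FASTA as a BED line
# (name, 0-based start, end-exclusive).
n_runs() {
  awk '
    /^>/ { name = substr($1, 2); next }   # record name without ">"
    { seq = seq toupper($0) }             # join lines, ignore soft-masking case
    END {
      n = length(seq); start = -1
      for (i = 1; i <= n; i++) {
        isN = (substr(seq, i, 1) == "N")
        if (isN && start < 0) start = i - 1              # a run opens here
        if (!isN && start >= 0) {                        # a run just closed
          printf "%s\t%d\t%d\n", name, start, i - 1
          start = -1
        }
      }
      if (start >= 0) printf "%s\t%d\t%d\n", name, start, n   # run reaches the end
    }' "$1"
}

# Toy input: one N run spanning a line break, one at the end of the record
tmp=$(mktemp -d)
printf '>toy\nACGTNNNN\nNNACgtnn\n' > "$tmp/toy.fa"
n_runs "$tmp/toy.fa"
```

Run against the real reference, the intervals should correspond to the masked Alu, simple-repeat and (AG)n regions plus the N backbone.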


### Inspecting Draft 4 of hgr0

Re-aligning the test files from NA19240 and manually going over the alignments, I'm confident that `hgr0_draft4.fa` represents a good 'consensus' model for RNA45S / 5S. This can be used as a starting point for analysis.

In [6]:
#draft4
md5sum hgr0_draft4.fa

a4186427240b5491c7bc9eb2b2b4219e  hgr0_draft4.fa


### Bed file (draft 3 coords). Zero-based
```
chr13	10220	10340	5S	cctccttcagc|GTCTACGGCCA	TGTAGGCTTT|ttctttggctttt
chr13	1000000	1013400	45S	gggttataatt|GCTGACACGCT	GGGTCGACCAGC|agaccgcgggtgg
chr13	1003653	1005522	18S	tctaccttacc|TACCTGGTTGAT	TGCGGAAGGATCATTA|acggagcccggaggg
chr13	1006615	1006772	5.8S	cgacctgcgta|CGACTCTTAGCGG	TGTCTGAGCGTCGCTT|gccgatcaatcgc
chr13	1007940	1013009	28S	gtccccctccgaga|CGCGACCTCA	CACAAGGGTTTGTC|cgcgcgcgcgtgc
```
Note: the 45S upstream junction should read `gggttatt|GCTG...`

### Bed file (draft 4 coords)

```
chr13	10219	10340	5S
chr13	1000000	1013398	45S
chr13	1003653	1005522	18S
chr13	1006615	1006772	5.8S
chr13	1007939	1013007	28S
```
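
The 45S and 28S coordinate changes between the two BED files are what removing single bases from the reference would produce: every boundary downstream of a removed base moves left by one. A sketch of that shift, assuming zero-based half-open BED intervals. `shift_bed` is a hypothetical helper, and the two deleted positions are placeholders chosen only to fall in the right regions (pre-28S and inside 28S); the exact positions are not recorded in this notebook. The 5S row is left out of the toy input because its one-base start change lies upstream of both removals and so is not caused by them.

```bash
#!/bin/bash
# shift_bed BED DELETED
# BED     : tab-separated, 0-based half-open intervals
# DELETED : comma-separated 0-based positions of bases removed from the reference
shift_bed() {
  awk -v dels="$2" '
    BEGIN { FS = OFS = "\t"; nd = split(dels, d, ",") }
    function lost(pos,   k, c) {          # deleted bases strictly before pos
      c = 0
      for (k = 1; k <= nd; k++) if (d[k] + 0 < pos) c++
      return c
    }
    { $2 = $2 - lost($2); $3 = $3 - lost($3); print }' "$1"
}

# Draft 3 coordinates for the 45S features (from the table above)
tmp=$(mktemp -d)
printf 'chr13\t1000000\t1013400\t45S\n'   > "$tmp/draft3.bed"
printf 'chr13\t1003653\t1005522\t18S\n'  >> "$tmp/draft3.bed"
printf 'chr13\t1006615\t1006772\t5.8S\n' >> "$tmp/draft3.bed"
printf 'chr13\t1007940\t1013009\t28S\n'  >> "$tmp/draft3.bed"

# Placeholder deletion positions (hypothetical): one pre-28S, one inside 28S
shift_bed "$tmp/draft3.bed" "1007320,1011980"
```

With those placeholders the output reproduces the draft 4 rows for 45S, 18S, 5.8S and 28S.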

### Clean-up and Renaming

Moved the plethora of sequences into `draft_sequences`. In the folder `testAlign`, the bam files and associated bowtie2-build files from `testAlign/runcmd.sh` were deleted; only the hgr0_draft4 alignment is kept.


`hgr0_draft4.fa` --> `hgr0.fa`

Initialized and copied to `~/Crown/resources/hgr0`


Made a GATK-compatible reference, `hgr0.gatk.fa` with no empty chromosomes.
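
How `hgr0.gatk.fa` was produced is not recorded here. One possible approach is sketched below, assuming "empty chromosomes" means FASTA records that have a header but no sequence lines. `drop_empty_records` is a hypothetical helper and the input is toy data.

```bash
#!/bin/bash
# drop_empty_records FASTA
# Print only those FASTA records that contain at least one non-blank
# sequence line; headers with nothing under them are dropped.
drop_empty_records() {
  awk '
    /^>/ {
      # a new header: flush the previous record if it had any sequence
      if (hdr != "" && seq != "") printf "%s\n%s", hdr, seq
      hdr = $0; seq = ""; next
    }
    NF { seq = seq $0 "\n" }              # keep non-blank sequence lines
    END { if (hdr != "" && seq != "") printf "%s\n%s", hdr, seq }
  ' "$1"
}

# Toy input: chr1 and chrM are empty, chr13 carries sequence
tmp=$(mktemp -d)
printf '>chr1\n>chr13\nACGT\nNNNN\n>chrM\n\n' > "$tmp/in.fa"
drop_empty_records "$tmp/in.fa"
```

Records that consist only of N still count as non-empty in this sketch; only records with no sequence at all are removed.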


In [None]:
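# Commands to set up the reference
cd ~/Crown/resources/hgr0

# FASTA index (hgr0.fa.fai)
samtools faidx hgr0.fa

# bowtie2 index files (hgr0.*.bt2)
bowtie2-build hgr0.fa hgr0

# Sequence dictionary for the GATK-compatible reference
picard CreateSequenceDictionary R=hgr0.gatk.fa O=hgr.gatk.dict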


In [1]:
# list resources
cd ~/Crown/resources/hgr0

ls -alh

total 11M
drwxrwxr-x 2 artem artem  4.0K Feb 20 20:10 .
drwxrwxr-x 9 artem artem  4.0K Feb 20 19:24 ..
-rw-rw-r-- 1 artem artem  4.1M Feb 20 19:36 hgr0.1.bt2
-rw-rw-r-- 1 artem artem  4.0K Feb 20 19:36 hgr0.2.bt2
-rw-rw-r-- 1 artem artem    71 Feb 20 19:36 hgr0.3.bt2
-rw-rw-r-- 1 artem artem  4.0K Feb 20 19:36 hgr0.4.bt2
-rw-rw-r-- 1 artem artem  1.1M Feb 18 14:05 hgr0.fa
-rw-rw-r-- 1 artem artem   16K Feb 20 18:10 hgr0.fa.fai
-rw-rw-r-- 1 artem artem 1020K Feb 20 20:09 hgr0.gatk.fa
-rw-rw-r-- 1 artem artem  4.1M Feb 20 19:36 hgr0.rev.1.bt2
-rw-rw-r-- 1 artem artem  4.0K Feb 20 19:36 hgr0.rev.2.bt2
-rw-rw-r-- 1 artem artem   137 Feb 20 20:10 hgr.gatk.dict
-rw-rw-r-- 1 artem artem   126 Feb 20 18:10 rRNA.bed


## Discussion

This sequence, `hgr0`, is now the first refined prototype of the RNA45S and 5S repeat for variant analysis. Simple-repeat sequences were masked to speed up alignment, reduce file sizes, and increase the throughput of a larger analysis.

A sequence exactly as represented in `hgr0` may not exist in NA19240. There are many variants even within this single genome, and the Illumina sequencing provides no long-distance phasing information. This is OK for alignment / variant detection, but further investigation is needed to reconstruct what variant 45S molecules may look like.

As of `hgr0` there are no alternative haplotypes for aligning. This may pose a problem in local regions of high variation, where reads from known alternative haplotypes will have difficulty aligning.

For now I accept this limitation of `hgr0`. I have partial haplotypes rebuilt in previous notebooks; with `hgr1` the main focus will be on assembling known haplotypes and including a way of aligning to all the haplotypes simultaneously. That is a second-iteration analysis. For now, just align to many genomes!

In [None]:
QED

## Addendum - 170221. rDNA sequence description

The secondary sequence descriptions are very helpful when interpreting rDNA variation. Re-make them for hgr0.

- GC-content
- Evolutionary Conservation
- Shannon Entropy
- Domain descriptions
- Protein Contact Points
- rRNA base-pairing
- Sequence annotations



### GC Content

In [5]:
CROWN='/home/artem/Crown'
cd ~/Crown/data/rDNA_stats/hgr0/

cp ~/Crown/resources/hgr0/hgr0.fa ./
cp ~/Crown/resources/hgr0/rRNA.bed ./

# cat rRNA_gc.bed
  #5s_gc.bed
  #chr13	9800	13000	5S
  #45s_gc.bed
  #chr13	998000	1014000	45S



In [14]:
#!/bin/bash
# gcContent Calculator
# for rDNA
#
# Calculates GC content of rDNA in sliding windows (e.g. 30, 50, 75 bp; set WINDOW below)

# index =
# chr13	1023550

WINDOW='70'
SLIDE='1'
NAME="gc.w$WINDOW"

#bedtools makewindows -g hgr0.fa.idx -w $WINDOW -s $SLIDE > $NAME.bed
#bedtools makewindows -b rRNA_gc.bed -w $WINDOW -s $SLIDE > $NAME.bed

bedtools makewindows -b 5s_gc.bed -w $WINDOW -s $SLIDE > 5s.bed
bedtools makewindows -b 45s_gc.bed -w $WINDOW -s $SLIDE > 45s.bed


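# wig start = region start + (0.5 * $WINDOW), i.e. the midpoint of the first window
# column 5 of `bedtools nuc` output is %GC for a 3-column BED; sed drops the header line
echo "fixedStep chrom=chr13 start=$((9800 + WINDOW / 2)) step=$SLIDE" > $NAME.wig
bedtools nuc -fi hgr0.fa -bed 5s.bed | cut -f 5 - | sed 1d - >> $NAME.wig

echo "fixedStep chrom=chr13 start=$((998000 + WINDOW / 2)) step=$SLIDE" >> $NAME.wig
bedtools nuc -fi hgr0.fa -bed 45s.bed | cut -f 5 - | sed 1d - >> $NAME.wig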

rm 5s.bed 45s.bed
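
To sanity-check what the wig values mean on a small input, here is a sketch of the same sliding-window calculation in plain awk, independent of bedtools. It assumes the GC fraction is (G+C) divided by the window length, with N counted in the denominator. Whether that matches `bedtools nuc` exactly for N-containing windows should be checked. Unlike `bedtools makewindows`, it emits only full-length windows. `gc_windows` is a hypothetical helper.

```bash
#!/bin/bash
# gc_windows SEQUENCE WINDOW
# Print the GC fraction of every full-length window of size WINDOW,
# sliding one base at a time along SEQUENCE.
gc_windows() {
  awk -v seq="$1" -v w="$2" 'BEGIN {
    seq = toupper(seq); n = length(seq)
    for (i = 1; i + w - 1 <= n; i++) {
      win = substr(seq, i, w)
      gc = gsub(/[GC]/, "", win)     # gsub returns the number of G/C it removed
      printf "%.4f\n", gc / w
    }
  }'
}

# Toy sequence: GC fraction falls from 1 to 0 across the windows
gc_windows "GGCCATAT" 4
```

Each output line corresponds to one `fixedStep` value, with the first value belonging to the window that starts at the first base.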

