# rDNA Conservation and Statistics
```
pi:ababaian
start: 2016 10 31
complete : 2016 11 06
addendum : 2017 12 26 (hgr0)
```
## Introduction

rDNA variation can be difficult to interpret. I'd like to generate conservation, GC content and other annotation tracks for IGV or the UCSC Genome Browser so that I can look at the relationship between conservation and variation.



## Objective

- Generate 'Conservation' plots for SSU and LSU in wig/bigwig format
- This will let variants be interpreted quickly based on their conservation


## Materials and Methods


- riboZone alignment files
```
euk_LSU_rRNA.fa
euk_SSU_rRNA.fa
LSU_rRNA_alignment.fa
SSU_rRNA_alignment.fa
```


#### Getting Sequence and Consensus Values
- Jalview was used to quickly get a 'consensus' value for each column of the multiple alignments

SSU_align.fa:
    This file is the gapped alignment row extracted for Homo sapiens

SSU_cons.csv:
```
Annotations > Autocalculated Annotations > Consensus

Right click "Consensus" > Export Annotation
```
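As a rough cross-check of the exported values, the per-column consensus can be approximated on the command line. This is a minimal sketch, assuming the Jalview 'Consensus' annotation is the percentage of sequences carrying the most common residue in each column, and that gaps count towards the column total (Jalview's own gap handling may differ depending on settings). The function name `column_consensus` is hypothetical.

```bash
#!/bin/bash
# Sketch: percentage of sequences carrying the most common residue per column.
# Assumes an aligned FASTA on stdin (all sequences the same length).
# Gaps ('.' or '-') are never the consensus but still count in the total.
column_consensus() {
  awk '
    /^>/ { if (seq != "") seqs[++n] = seq; seq = ""; next }  # header: store previous record
         { seq = seq toupper($0) }                            # sequence line: append
    END {
      if (seq != "") seqs[++n] = seq
      len = length(seqs[1])
      for (i = 1; i <= len; i++) {
        split("", count)                                      # reset per-column counts
        best = 0
        for (j = 1; j <= n; j++) {
          c = substr(seqs[j], i, 1)
          if (c == "." || c == "-") continue
          if (++count[c] > best) best = count[c]
        }
        printf "%.1f\n", 100 * best / n
      }
    }'
}

# Toy alignment: three sequences, four columns (last column is all gaps)
printf '>a\nACG-\n>b\nACT.\n>c\nAGT-\n' | column_consensus
```

On the toy alignment this prints one value per column, which can be compared by eye against the Jalview export for the same columns.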


- The `cons.sh` script below was run for SSU and LSU subsets:
  - Life: All aligned sequences
  - Euk: Eukaryotes
  - Vert: Vertebrates (to Xenopus)

![Example of SSU Conservation](../figure/20161031_SSU_conservation_example.png)

#### Differences between hgr 28s and riboZone 28s

In the process of calculating the consensus and applying it to hgr, the 28S sequences did not line up: the riboZone sequence is longer than the hgr sequence. I aligned the two sequences to one another. Output is in CROWN/data/riboZone/SSU_rRNA_alignment.fa

The first 157 nt missing from hgr are the 5.8S rDNA; the remaining differences are variation between the two 'human' rRNA sequences.

#### Deleted in hgr (28S) relative to riboZone
1,157
293,295
407,409
417,419
939,942
3119,3120
3137
3274
3319
3332,3335
3348,3349
3539
3730
4182
4260
4297,4303
4879
4963
4978,4980

#### Inserted in hgr (28S) relative to riboZone
3309
3367
3481
3494
3512

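The `cons.sh` script quotes "40 nucleotides removed" for the LSU deletions (not counting the 5.8S block at 1,157). A quick way to check that arithmetic from the list above is sketched here; `count_positions` is a hypothetical helper that sums the lengths of sed-style `start,end` ranges.

```bash
#!/bin/bash
# Sketch: total number of positions covered by a list of sed-style ranges,
# one per line, each either "start,end" (inclusive) or a single position.
count_positions() {
  awk -F, '
    NF == 0 { next }                           # skip blank lines
    { total += (NF > 1 ? $2 : $1) - $1 + 1 }   # inclusive range length
    END { print total + 0 }'
}

# Deletions in hgr (28S) relative to riboZone, excluding the 5.8S block (1,157)
printf '%s\n' 293,295 407,409 417,419 939,942 3119,3120 3137 3274 3319 \
  3332,3335 3348,3349 3539 3730 4182 4260 4297,4303 4879 4963 4978,4980 |
  count_positions
```

This prints 40, matching the offset applied to the insertion coordinates in the script.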

In [None]:
#!/bin/bash
# Converts a gapped ALIGN and its CONSERVATION to a wig file
ALIGN='lsu_vert_align.fa' # Gapped Fasta Alignment
CONS='lsu_vert_cons.csv' # JalView Consensus Output
OUTPUT='VertCons_LSU.wig'

# Start position of SSU (1003657) or 5.8S (1006623) or LSU (1007935)

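# Convert ALIGN.fa into one character per row:
# '-' for a gap column, 'N' for any nucleotide

sed 1d "$ALIGN" |                # drop the FASTA header line
sed 's/\./-\n/g' - |             # gap ('.') -> '-' on its own row
sed 's/[ATGCU]/N\n/g' - |        # nucleotide -> 'N' on its own row
sed '/^\s*$/d' - > align.csv     # drop empty rows

# Convert the Jalview consensus export into one value per row
sed 2d "$CONS" |                 # drop the second line of the export
sed 's/,/\n/g' - |               # one comma-separated field per row
sed 1d - |                       # drop the first field
head -n -1 - > cons.csv          # drop the last row

# Keep consensus values only for columns where the reference has a nucleotide
paste align.csv cons.csv |
grep -e '^N' - |
cut -f 2 - |
sed 's/ //g' - > cons.values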


# SSU Alignments

	#echo "fixedStep chrom=chr13 start=1003657 step=1" > $OUTPUT
	#cat cons.values >> $OUTPUT
	# rm align.csv cons.csv cons.values 

# LSU Alignments
# there are differences between hgr and riboZone LSU

	# Delete in hgr (28S) relative to riboZone
	# 40 nucleotides removed
	sed 4978,4980d cons.values |
	sed 4963d - |
	sed 4879d - |
	sed 4297,4303d - |
	sed 4260d - |
	sed 4182d - |
	sed 3730d - |
	sed 3539d - |
	sed 3348,3349d - |
	sed 3332,3335d - |
	sed 3319d - |
	sed 3274d - |
	sed 3137d - |
	sed 3119,3120d - |
	sed 939,942d - |
	sed 417,419d - |
	sed 407,409d - |
	sed 293,295d - > cons.d.values

	# Insert in hgr (28S) relative to riboZone (offset by 40nt)
	sed 3472i0.1 cons.d.values |
	sed 3454i0.1 - |
	sed 3441i0.1 - |
	sed 3327i0.1 - |
	sed 3269i0.1 - > cons.di.values

	# 5.8S subunit (1-157 nt)
	sed -i '158ifixedStep chrom=chr13 start=1007935 step=1' cons.di.values
	sed -i '1ifixedStep chrom=chr13 start=1006623 step=1' cons.di.values
	mv cons.di.values $OUTPUT

	rm align.csv cons.csv cons.values cons.d.values


# NOTE: There seems to be about a 2bp mis-alignment with the above script
# but it's kind of hit and miss. Don't take the bp value directly (or refer
# back to riboZone) but the general consensus trend does hold pretty well.

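The core step of the script above (keep only the consensus values for columns where the human row has a base, then prepend a `fixedStep` header) can be tried on a toy input. This is a minimal sketch, not the script itself: `ungapped_values` and `to_wig` are hypothetical helper names, and the four consensus values are made up for illustration.

```bash
#!/bin/bash
# Sketch: keep per-column values only where the gapped reference row has a base.
# Assumes gaps are written as '.' or '-' and there is one value per column.
ungapped_values() {   # usage: ungapped_values GAPPED_SEQ VALUES_FILE
  printf '%s\n' "$1" | fold -w 1 | paste - "$2" |
    awk '$1 !~ /^[.-]$/ { print $2 }'
}

# Write a single-block fixedStep wig track to stdout
to_wig() {            # usage: to_wig CHROM START < values
  printf 'fixedStep chrom=%s start=%s step=1\n' "$1" "$2"
  cat
}

tmp=$(mktemp)
printf '%s\n' 90 10 80 70 > "$tmp"      # toy per-column consensus values
ungapped_values 'A.GC' "$tmp" | to_wig chr13 1003657
rm -f "$tmp"
```

The second value (10) belongs to the gap column and is dropped, so the track has one value per reference nucleotide.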
In [6]:
#!/bin/bash
# gcContent Calculator
# for rDNA
#
# Calculated gc of rDNA for 30,50,75 bp windows
CROWN='/home/artem/Crown'

cd $CROWN/resources/rDNA/

WINDOW='75'
SLIDE='1'
NAME="w$WINDOW.s$SLIDE"

bedtools makewindows -g rDNA.fa.idx -w $WINDOW -s $SLIDE > rDNA.$NAME.bed

# Centre each value on its window: start = 1,000,000 + (0.5 * $WINDOW)
START=$((1000000 + WINDOW / 2))
echo "fixedStep chrom=chr13 start=$START step=$SLIDE" > rDNA.gc.$NAME.wig
bedtools nuc -fi rDNA.fa -bed rDNA.$NAME.bed | cut -f 5 - | sed 1d - >> rDNA.gc.$NAME.wig

rm rDNA.$NAME.bed

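To sanity-check the GC track without bedtools, the sliding-window calculation can be sketched in plain awk. This is an illustrative stand-in under stated assumptions, not a replacement for `bedtools nuc`: it reads a single-sequence FASTA and only emits full-length windows, so the number of rows at the end of the sequence will differ from the `makewindows` output. `gc_windows` and `wig_start` are hypothetical names.

```bash
#!/bin/bash
# Sketch: GC fraction in sliding windows over a single sequence.
gc_windows() {   # usage: gc_windows WINDOW SLIDE < one-sequence FASTA
  awk -v w="$1" -v s="$2" '
    /^>/ { next }
         { seq = seq toupper($0) }
    END {
      for (i = 1; i + w - 1 <= length(seq); i += s) {   # full-length windows only
        win = substr(seq, i, w)
        printf "%.6f\n", gsub(/[GC]/, "", win) / w       # gsub returns the match count
      }
    }'
}

# Wig start coordinate that centres each value on its window
wig_start() {    # usage: wig_start OFFSET WINDOW (integer division, so 75 -> +37)
  echo $(( $1 + $2 / 2 ))
}

wig_start 1000000 75
printf '>toy\nGGCCAATT\n' | gc_windows 4 1
```

For a 75 bp window `wig_start` gives 1000037, the value used in the header above; for 30 and 50 bp windows it gives 1000015 and 1000025.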



In [2]:
# rDNA Mappability
# rDNA aligned to hgr genome

# On AWS
# Requires HMMcopy bins to run
# - fastaToRead

# Convert fasta to reads; 75 bp slide by 1
fastaToRead -w 75 rDNA.fa > rDNA.w75.fa

# Re-align rDNA to HGR 
bowtie hgr -S -k 10 -f rDNA.w75.fa > rDNA.w75.k10.sam

# Downloaded and converted to bam
# (-T sets the temporary-file prefix for samtools sort)
samtools view -bh rDNA.w75.k10.sam | samtools sort -T temp -O bam -o rDNA.w75k10.bam -

# Moved to Crown directory
# ~/Crown/data/rDNA_stats/map/




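The read-tiling step of the mappability track can be illustrated without the HMMcopy binaries. This is a minimal sketch of the idea (every fixed-length substring, sliding by 1), assuming a single-sequence FASTA; the read names it writes are an illustrative choice and are not claimed to match what `fastaToRead` produces. `tile_reads` is a hypothetical name.

```bash
#!/bin/bash
# Sketch: tile a single-sequence FASTA into overlapping fixed-length "reads".
# Read names are <seqname>_<1-based start>, chosen here for illustration only.
tile_reads() {   # usage: tile_reads WINDOW < FASTA
  awk -v w="$1" '
    /^>/ { name = substr($1, 2); next }
         { seq = seq $0 }
    END {
      for (i = 1; i + w - 1 <= length(seq); i++)
        printf ">%s_%d\n%s\n", name, i, substr(seq, i, w)
    }'
}

# Toy example: 3 bp reads from a 5 bp sequence gives 3 reads
printf '>rDNA\nACGTA\n' | tile_reads 3
```

Aligning reads like these back to the reference with `-k 10` then shows, per position, whether a 75 bp read from rDNA has more than one valid placement.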
## Results

#### Conservation and low mapping of reads

![rDNA Conservation and Alignment](../figure/20161031_rDNA_conservation_alignment.png)

Conservation is shown for all Life, Eukaryotes and Vertebrates (top to bottom), along with the alignment from NA19240.

There is a really exciting correlation here: regions which are less conserved show less alignment (and generally a more even GC% as well).

![rDNA Conservation and Alignment and GC Content](../figure/20161031_rDNA_conservation_alignment2.png)

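The correlation above is judged by eye from the tracks. One way to put a number on it would be a Pearson correlation between two tracks; the sketch below is an assumption about how that could be done, not something that was run for this entry. It expects two files of equal length with one value per line covering the same coordinates (for the wig files here, header lines would need stripping first, e.g. with `grep -v '^fixedStep'`). `pearson` is a hypothetical name and the numbers are toy data.

```bash
#!/bin/bash
# Sketch: Pearson correlation between two equal-length value columns.
pearson() {   # usage: pearson FILE_X FILE_Y
  paste "$1" "$2" | awk '
    NF == 2 { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
      if (den == 0) { print "NA"; exit }    # constant track: correlation undefined
      printf "%.3f\n", (n * sxy - sx * sy) / den
    }'
}

x=$(mktemp); y=$(mktemp)
printf '%s\n' 1 2 3 4 > "$x"   # toy track 1
printf '%s\n' 8 6 4 2 > "$y"   # toy track 2 (perfectly anti-correlated)
pearson "$x" "$y"
rm -f "$x" "$y"
```

Because the GC track is window-centred and the conservation track is per-base, the two would need to be trimmed to the same start and end before comparing.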
## Discussion


- These look good. A very obvious correlation is visible: high-GC regions are less conserved throughout life. If that's where lots of the variation is, it'll be difficult to analyze =D


![Deletion at 1,005,294 in NA19240](../figure/20161104_g.5294del_NA19240.png)

- That being said, some of the variation (see above) falls into mid-GC ranges in areas with low/moderate conservation. This is actually really informative for interpreting variants.

NOTE: the 'mappability' track generated here had a second copy of rDNA at chr13:2,000,000. This was fixed in the genome file, but I'm not going to remake the alignment since it won't influence the output.

## Addendum - hgr0

Using the same data as above I re-generated rDNA statistics/annotation tracks for the hgr0 reference sequence.

- GC content, 30, 50 and 70 bp windows
- Shannon Entropy
- Protein Contact Points (rvis)
- Nucleotide (rRNA) base pairing (rvis)
- Ribosomal Domains

Files stored in ~/Crown/data/rDNA_stats/hgr0/

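For reference, the Shannon entropy track can be reproduced in outline from a multiple alignment. This is a minimal sketch assuming the usual per-column definition, H = -sum(p * log2(p)) over the residues observed in a column, with gaps ignored; how the hgr0 track actually treated gaps is not recorded here, so that part is an assumption. `column_entropy` is a hypothetical name.

```bash
#!/bin/bash
# Sketch: per-column Shannon entropy (bits) of an aligned FASTA read from stdin.
# Assumption: gaps ('.' or '-') are ignored; an all-gap column is reported as 0.
column_entropy() {
  awk '
    /^>/ { if (seq != "") seqs[++n] = seq; seq = ""; next }
         { seq = seq toupper($0) }
    END {
      if (seq != "") seqs[++n] = seq
      len = length(seqs[1])
      for (i = 1; i <= len; i++) {
        split("", count); total = 0                # reset per-column counts
        for (j = 1; j <= n; j++) {
          c = substr(seqs[j], i, 1)
          if (c != "." && c != "-") { count[c]++; total++ }
        }
        h = 0
        for (c in count) { p = count[c] / total; h -= p * log(p) / log(2) }
        printf "%.3f\n", h
      }
    }'
}

# Toy alignment: column 1 invariant, column 2 two residues, column 3 all four
printf '>a\nAAA\n>b\nAAC\n>c\nACG\n>d\nACT\n' | column_entropy
```

An invariant column scores 0 bits and a column with all four nucleotides equally represented scores 2 bits, so low entropy corresponds to high conservation.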
