# Initial Variant Calling - total RNA (Adaconis)
```
pi:ababaian
start: 2016 04 28
complete : 2016 05 04
```

## Objective
Do a first-pass analysis of the variants to get a rough idea of which positions to look at, then work on methods to 'better' describe a variant.

## Materials and Methods


```
# Move to working directory
cd ~/Crown/data/adaconis/totalRNA1
mkdir -p vcf
```


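```
# Legacy samtools 0.1.x recipe, kept for reference:
# samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
# bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

# Pile up against the hgr reference (-f), up to 10000 reads per position per file (-d),
# writing uncompressed (-u) VCF (-v)
samtools mpileup -f ~/Crown/resources/hgr/hgr.fa -d 10000 -uv accepted_hits.bam > vcf/total1.uv.vcf
```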
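A minimal sketch of how the raw VCF could be reduced to a shortlist of positions worth opening in IGV. It only assumes the standard tab-separated VCF columns and a `DP` key in INFO. The placeholder ALT names (`<*>`, `<X>`), the `min_depth` cutoff, and the function name `candidate_sites` are my assumptions, not something taken from the pipeline above.

```python
# Placeholder ALT alleles that mpileup-style VCFs may emit (assumed names).
SYMBOLIC_ALTS = {"<*>", "<X>", "."}


def candidate_sites(vcf_lines, min_depth=100):
    """Yield (chrom, pos, ref, alts, depth) for sites with a real ALT allele.

    min_depth is an arbitrary illustrative cutoff, not a tuned value.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = (
            fields[0], int(fields[1]), fields[3], fields[4], fields[7],
        )
        # Keep only concrete alternate alleles.
        alts = [a for a in alt.split(",") if a not in SYMBOLIC_ALTS]
        # INFO is key=value pairs separated by ';' (bare flags are skipped).
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        depth = int(info_map.get("DP", 0))
        if alts and depth >= min_depth:
            yield chrom, pos, ref, alts, depth
```

Usage would be along the lines of `with open("vcf/total1.uv.vcf") as fh: sites = list(candidate_sites(fh))`, then sorting by depth before inspecting positions by eye.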

## Results

The VCF is quite detailed and not very useful at the moment. I'll go over the results in IGV and simply look at positions that are obviously variable: the low-hanging fruit by IGV inspection.
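Since the results are pasted IGV coverage pop-ups, a small parser makes the numbers reusable. This is a sketch that assumes the pop-up text keeps the layout seen in this notebook (`Total count:`, one line per base with `+`/`-` strand counts, then `DEL:` and `INS:`); the function name `parse_igv_block` and the returned dictionary layout are my own choices.

```python
import re

# "A      : 1749  (1%,     1366+,   383- )"; the bracketed part is optional ("N      : 0").
BASE_LINE = re.compile(
    r"^([ACGTN])\s*:\s*(\d+)(?:\s*\(\s*\d+%,\s*(\d+)\+,\s*(\d+)-\s*\))?"
)
COUNT_LINE = re.compile(r"^(Total count|DEL|INS):\s*(\d+)")
COUNT_KEYS = {"Total count": "total", "DEL": "del", "INS": "ins"}


def parse_igv_block(text):
    """Parse one pasted IGV coverage pop-up into a dictionary."""
    site = {"bases": {}, "strand": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = BASE_LINE.match(line)
        if m:
            base, count, plus, minus = m.groups()
            site["bases"][base] = int(count)
            if plus is not None:
                site["strand"][base] = (int(plus), int(minus))
            continue
        m = COUNT_LINE.match(line)
        if m:
            site[COUNT_KEYS[m.group(1)]] = int(m.group(2))
        elif line.startswith("chr"):
            site["locus"] = line
    return site


# The TotalRNA2 pop-up at chr13:1,004,904, copied from this notebook.
U1248_BLOCK = """chr13:1,004,904
Total count: 118194
A      : 1749  (1%,     1366+,   383- )
C      : 32288  (27%,     20271+,   12017- )
G      : 537  (0%,     411+,   126- )
T      : 83620  (71%,     39198+,   44422- )
N      : 0
DEL: 4083
INS: 0
"""

site = parse_igv_block(U1248_BLOCK)
for base, count in site["bases"].items():
    print(base, f"{count / site['total']:.3f}")
```

Keeping the strand counts alongside the totals means the same dictionary can later feed strand or deletion checks without re-pasting from IGV.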

### Total RNA 2 - 18S Variants

#### U1248C | CU1248--
TotalRNA2
```
chr13:1,004,904
Total count: 118194
A      : 1749  (1%,     1366+,   383- )
C      : 32288  (27%,     20271+,   12017- )
G      : 537  (0%,     411+,   126- )
T      : 83620  (71%,     39198+,   44422- )
N      : 0
DEL: 4083
INS: 0
```
TotalRNA1 
```
chr13:1,004,904
Total count: 129476
A      : 1943  (2%,     1530+,   413- )
C      : 35441  (27%,     21938+,   13503- )
G      : 568  (0%,     416+,   152- )
T      : 91524  (71%,     42285+,   49239- )
N      : 0
---------------
DEL: 4470
INS: 0
```
![U1248C Conservation](../figure/20160428_18S_U1248C.png)
- U1248 is "C" in two other species; not perfectly evolutionarily conserved

- Also present is a deletion of C1247 and U1248 at about 3-4%

- 4UG0 Sequence: ...UUUGAC**U**CAACACGGG...

![U1248C Secondary Structure](../figure/20160502_secondary_18S_U1248C.png)
   
- Helix 31: UUUG AC**U**CAACA CGGG

- There aren't any base-stacking or base-interactions with U1248 from the Petrov SSU (RiboZone) structural data.

- Helix 30, by contrast, associates with eIF1 below the P-site for start-codon selection

This likely means that this variant is a 'normal variant', possibly polymorphic in many species. If it is a mutation, it is striking that it 'reverts' or 'converts' to a form (C) that is present in some other species, such as lungfish. This could mean that the 'deregulation' occurs within an acceptable mutation space for 18S. Human DNA data from normal individuals is needed to confirm or refute this. Worth following up on.

Note: This position may be a hyper-modified pseudouridine which means that the "variant" being observed could be an artifact of faulty reverse transcription at such a base.
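A quick check that the two total RNA replicates agree at U1248, using the counts pasted above. The sketch assumes that IGV's "Total count" excludes deletions (the per-base counts sum to it exactly), so the deletion fraction is computed over bases plus deletions; the function name `fractions` is mine.

```python
def fractions(counts, deletions):
    """Return the C fraction among called bases and the deletion fraction among reads."""
    called = sum(counts.values())
    return {
        "C_of_bases": counts["C"] / called,
        "del_of_reads": deletions / (called + deletions),
    }


# Counts at chr13:1,004,904 copied from the two IGV pop-ups above.
total_rna2 = {"A": 1749, "C": 32288, "G": 537, "T": 83620, "N": 0}
total_rna1 = {"A": 1943, "C": 35441, "G": 568, "T": 91524, "N": 0}

rep2 = fractions(total_rna2, deletions=4083)
rep1 = fractions(total_rna1, deletions=4470)

for name, rep in (("TotalRNA2", rep2), ("TotalRNA1", rep1)):
    print(f"{name}: C = {rep['C_of_bases']:.2%}, deletion = {rep['del_of_reads']:.2%}")
```

Both replicates come out at roughly 27% C with a deletion fraction a little over 3%, so whatever produces the signal is reproducible between the two libraries.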

#### -1638G or G1638-
```
chr13:1,005,294
Total count: 36962
A      : 490  (1%,     162+,   328- )
C      : 45  (0%,     0+,   45- )
G      : 36369  (98%,     194+,   36175- )
T      : 31  (0%,     1+,   30- )
N      : 27  (0%,     1+,   26- )
---------------
DEL: 111134
INS: 0
```
- Secondary Structure

Note: In the Petrov alignments this 'G' at 1638 is absent, so the reference sequence used here and the Petrov sequence do not appear to be the same. It will be worth aligning rRNA sequences from different sources to check this. The above positions are from the Petrov alignment for human.

#### G1647-
```
chr13:1,005,304
Total count: 14953
A      : 11640  (78%,     222+,   11418- )
C      : 3081  (21%,     705+,   2376- )
G      : 180  (1%,     12+,   168- )
T      : 40  (0%,     1+,   39- )
N      : 12  (0%,     0+,   12- )
---------------
DEL: 86629
INS: 0
```
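For these two sites the deletion count is larger than the base count, which fits the note that the reference may carry bases that are absent elsewhere. A rough sketch of the arithmetic, again assuming IGV's "Total count" excludes deletion-spanning reads (the helper name `deletion_fraction` is mine):

```python
def deletion_fraction(base_total, deletions):
    """Fraction of reads spanning the site that carry a deletion there."""
    return deletions / (base_total + deletions)


# (Total count, DEL) copied from the IGV pop-ups above.
sites = {
    "G1638 (chr13:1,005,294)": (36962, 111134),
    "G1647 (chr13:1,005,304)": (14953, 86629),
}

for name, (total, dels) in sites.items():
    frac = deletion_fraction(total, dels)
    label = "major allele" if frac > 0.5 else "minor allele"
    print(f"{name}: {frac:.1%} of reads deleted (deletion is the {label})")
```

When the deletion is the major allele in total RNA, a discrepancy in the reference sequence seems a more likely explanation than a true variant, though that still needs checking against other rRNA references.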

### Total RNA 2 - 28S Variants
28S start (position 1): chr13:1,007,935

####  A60G (A217G)
Note: In the Petrov LSU alignment, this base is a 'G'.
In the Petrov secondary structure this 'G' is at position 60.
In the genome browser this is at position 60 of the 28S.
```
chr13:1,007,994
Total count: 997220
A      : 635482  (64%,     632964+,   2518- )
C      : 117  (0%,     117+,   0- )
G      : 361491  (36%,     360145+,   1346- )
T      : 101  (0%,     100+,   1- )
N      : 29  (0%,     29+,   0- )
---------------
DEL: 8
INS: 3
```
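The genome-browser positions used in these notes look like a plain linear offset from the start of each rRNA on chr13 of the hgr reference. A small sketch of that conversion follows. The 28S start is the coordinate listed under the 28S heading; the 18S start is inferred from U1248 sitting at chr13:1,004,904, which is my assumption. Other numbering schemes in these notes (Petrov, or headings such as A1479T) do not follow this offset.

```python
# Start coordinate (position 1) of each rRNA on chr13 of the hgr reference.
# 18S value is inferred from U1248 <-> chr13:1,004,904 (assumption).
RRNA_START = {"18S": 1_003_657, "28S": 1_007_935}


def to_rrna_position(coord, subunit):
    """Genomic chr13 coordinate -> 1-based position within the rRNA."""
    return coord - RRNA_START[subunit] + 1


def to_genomic(position, subunit):
    """1-based rRNA position -> genomic chr13 coordinate."""
    return RRNA_START[subunit] + position - 1


# Coordinate pairs taken from this notebook.
for subunit, coord in [
    ("18S", 1_004_904),
    ("18S", 1_005_294),
    ("28S", 1_007_994),
    ("28S", 1_008_706),
]:
    print(subunit, coord, "->", to_rrna_position(coord, subunit))
```

This reproduces 18S 1248 and 1638, and 28S 60 and 772, so it can be used to jump between IGV and the browser-style numbering.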

#### G772 (?)
'G' insertion at 772
```
chr13:1,008,706
Total count: 5325
A      : 1  (0%,     1+,   0- )
C      : 5262  (99%,     3941+,   1321- )
G      : 62  (1%,     4+,   58- )
T      : 0
N      : 0
---------------
DEL: 0
INS: 1301
```
```
chr13:1,008,707
Total count: 4968
A      : 4  (0%,     1+,   3- )
C      : 20  (0%,     15+,   5- )
G      : 536  (11%,     323+,   213- )
T      : 4408  (89%,     3298+,   1110- )
N      : 0
---------------
DEL: 0
INS: 1300
```

#### A1479T

- A1322 on Petrov Structure
- Yeast A645 or A1309 per [Waku et al.](http://www.ncbi.nlm.nih.gov/pubmed/?term=27149924)
- N1-methyladenosine (m1A) modification at this position according to PMID:27149924
- Nucleomethylin mediated modification; deficiency inhibits cell proliferation.
- Waku also shows methylation at A1136 by their numbering; should be ~**A1652** on my alignment


```
chr13:1,009,243
Total count: 6341
A      : 4779  (75%,     3296+,   1483- )
C      : 36  (1%,     34+,   2- )
G      : 360  (6%,     304+,   56- )
T      : 1166  (18%,     1015+,   151- )
N      : 0
---------------
DEL: 24
INS: 0
```
- A1322 on secondary structure
- Helix 25a: CCGU CUUG**A**AAC ACGG
![A1479T Conservation](../figure/20160504_A1479_Conservation.png)


#### G4641A
```
chr13:1,012,388
Total count: 143874
A      : 143382  (100%,     97285+,   46097- )
C      : 274  (0%,     69+,   205- )
G      : 110  (0%,     35+,   75- )
T      : 104  (0%,     17+,   87- )
N      : 4  (0%,     3+,   1- )
---------------
DEL: 6
INS: 0
```
 - 100% of rRNA is A-variant
 - Petrov alignment sequence is 'A' not 'G'
 - In some species, such as yeast, it's a G; in human/chimp it's an A
 - Secondary Structure: 4484 Helix 91: GUGAA GCAGAA UUC**A**C
![G4641A Conservation](../figure/20160504_G4641A_Conservation.png)


#### T4687A/G
```
chr13:1,012,434
Total count: 163870
A      : 31777  (19%,     20843+,   10934- )
C      : 779  (0%,     571+,   208- )
G      : 26071  (16%,     16590+,   9481- )
T      : 105243  (64%,     69732+,   35511- )
N      : 0
---------------
DEL: 2063
INS: 4
```
- Secondary Structure U4530, between helix 90,93: 90GGG **U**UUA GACC93
- Strong evolutionary conservation (perfect) around this region
![U4687A Conservation](../figure/20160504_U4687A_Conservation.png)

#### C5063U (Expansion Loop Area)
```
chr13:1,012,805
Total count: 445143
A      : 272  (0%,     1+,   271- )
C      : 267568  (60%,     1557+,   266011- )
G      : 944  (0%,     0+,   944- )
T      : 176357  (40%,     1114+,   175243- )
N      : 2  (0%,     0+,   2- )
---------------
DEL: 0
INS: 4
```

- Secondary Structure Stem 98ES39b: C4906 


#### Variations in CCG / poly-C / simple / expansion Loops
U1025C (?)
```
chr13:1,008,804
Total count: 16469
A      : 10  (0%,     9+,   1- )
C      : 15396  (93%,     4558+,   10838- )
G      : 1049  (6%,     464+,   585- )
T      : 14  (0%,     7+,   7- )
N      : 0
---------------
DEL: 2
INS: 8

chr13:1,008,921
Total count: 8594
A      : 5788  (67%,     2524+,   3264- )
C      : 2623  (31%,     208+,   2415- )
G      : 131  (2%,     2+,   129- )
T      : 52  (1%,     2+,   50- )
N      : 0
---------------
DEL: 2
INS: 2

chr13:1,008,935
Total count: 13497
A      : 56  (0%,     34+,   22- )
C      : 2656  (20%,     12+,   2644- )
G      : 171  (1%,     15+,   156- )
T      : 10614  (79%,     2543+,   8071- )
N      : 0
---------------
DEL: 1
INS: 2

chr13:1,008,953
Total count: 25663
A      : 61  (0%,     2+,   59- )
C      : 4468  (17%,     7+,   4461- )
G      : 208  (1%,     6+,   202- )
T      : 20926  (82%,     2948+,   17978- )
N      : 0
---------------
DEL: 0
INS: 1

chr13:1,009,035
Total count: 8884
A      : 89  (1%,     0+,   89- )
C      : 1321  (15%,     14+,   1307- )
G      : 50  (1%,     2+,   48- )
T      : 7424  (84%,     1885+,   5539- )
N      : 0
---------------
DEL: 4
INS: 49

chr13:1,010,959
Total count: 11440
A      : 8817  (77%,     1166+,   7651- )
C      : 1364  (12%,     14+,   1350- )
G      : 1222  (11%,     241+,   981- )
T      : 37  (0%,     0+,   37- )
N      : 0
---------------
DEL: 0
INS: 1
```

CG --> CG to make poly-CGG
```
chr13:1,011,149
Total count: 2461
A      : 0
C      : 10  (0%,     0+,   10- )
G      : 2442  (99%,     82+,   2360- )
T      : 9  (0%,     0+,   9- )
N      : 0
---------------
chr13:1,011,150
Total count: 2497
A      : 2  (0%,     1+,   1- )
C      : 2451  (98%,     90+,   2361- )
G      : 39  (2%,     0+,   39- )
T      : 5  (0%,     0+,   5- )
N      : 0
---------------
DEL: 0
INS: 1

chr13:1,012,038
Total count: 25792
A      : 22000  (85%,     4177+,   17823- )
C      : 411  (2%,     35+,   376- )
G      : 3275  (13%,     0+,   3275- )
T      : 93  (0%,     0+,   93- )
N      : 13  (0%,     0+,   13- )
---------------

chr13:1,012,057
Total count: 31855
A      : 98  (0%,     1+,   97- )
C      : 20147  (63%,     2023+,   18124- )
G      : 11295  (35%,     101+,   11194- )
T      : 315  (1%,     8+,   307- )
N      : 0
---------------
DEL: 3
INS: 167
```


## Discussion

### Variant Summary
The most interesting variant is the 18S U1248C in helix 31. In E. coli, this position corresponds to G966 of the 16S rRNA (helix 31, also called the 970 loop).

Many variants occurred in the so-called Expansion Loops, especially at 'CGG' positions. These seem to be quite variable in general and may represent a good chunk of the 'difficult' parts of rRNA to sequence. I won't focus on them much initially, since an oncogenic mutation would probably be variable at a more "important" place.


### The Irony

I essentially shelved this project before because I found the variation too difficult to deal with. The irony is that in my 'sketchier' 2013 analysis (based on mRNA-seq), where I was less confident in the results, the example I was dealing with was r.U1248 or 18S U1248, the same position I "found" in this initial analysis. Direct quote from May 2013 below.

```
May 13 2013
 	This is cool because it means that unless you're certain there
	are variants then automated processes will miss this.

Example: Position r.4904 (chr13:1004904)
	In the manual annotation I had
                r.4904 del      0.036   *
                r.4904 T > C    0.34    *
                r.4904 T > A    0.030   *
```

and May 25th 2013
```
In the human Ribosomal structure paper by Anger et al,. (nature, may 2013)

I have looked at position r.4904 which is variable in Hodgkin's lymphoma and
K562 to see how it fits into the structure of the ribosome.
	
	1) The paper refers to this position as 18S 1248 U
	2) 1248 U is within stem-loop H31 of the 18S human rRNA
	3) 18S:1246-1248 [ACU] interact with residues 149,151 of SERPB1 (serpine1)

I'm not sure as of right now how the U > C change of the rRNA will effect the interaction
but if it's recurently there then that may suggest significance.
I need to run the normal B-cells immediatly and see if the variation is absent in
those cells. Anger used human peripheral blood as the source of 80S ribosome
so we may be on to something!!!!!!
```

### IGV at Colorectal Cancer

For LIONS I have access to some colorectal cancer transcriptomes from a patient (as opposed to a cell line). Since hg19r is the standard transcriptome I use, I have information about this position there.
```
chr13:1,004,904
Total count: 53585
A      : 2397  (4%,     1159+,   1238- )
C      : 23565  (44%,     11707+,   11858- )
G      : 3402  (6%,     1495+,   1907- )
T      : 24221  (45%,     11718+,   12503- )
N      : 0
---------------
DEL: 2188
INS: 1
```
This position is also variable here, but at a higher level (44% C), and there are A and G variants at 4% and 6%. This isn't background noise: variation in the immediately surrounding area is substantially lower.

![Colorectal Cancer. 587294.bam 18S U1248 in IGV](../figure/20160504_Colorectal_18S_U1248.png)

### IGV at Hodgkins

I have A05254r.bam on my local computer. Took a look on IGV. Same general pattern, only a bit lower C at 30%.
```
chr13:1,004,904
Total count: 12174
A      : 339  (3%,     186+,   153- )
C      : 3704  (30%,     1731+,   1973- )
G      : 463  (4%,     237+,   226- )
T      : 7667  (63%,     3682+,   3985- )
N      : 1  (0%,     1+,   0- )
---------------
DEL: 504
INS: 0
```

### IGV At B-Cell

Downloaded the HGR region from the normal B-cell A05247r.bam and HS2254r.bam transcriptomes.
```
chr13:1,004,904
Total count: 38694
A      : 1326  (3%,     626+,   700- )
C      : 15820  (41%,     7501+,   8319- )
G      : 1867  (5%,     948+,   919- )
T      : 19680  (51%,     9413+,   10267- )
N      : 1  (0%,     0+,   1- )
---------------
DEL: 2452
INS: 0

Total count: 32550
A      : 1237  (4%,     734+,   503- )
C      : 14700  (45%,     8292+,   6408- )
G      : 1532  (5%,     909+,   623- )
T      : 15081  (46%,     8371+,   6710- )
N      : 0
---------------
DEL: 1465
INS: 0
```

It's possible this variant is present in many cells. But the base is also 'hypermodified', which may mean that this is simply a reverse-transcriptase artifact. If so, that may be interesting, since the 'error profile' at this position could be used to 'predict' other modified sites in RNA samples.
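One way to make the 'error profile' idea concrete is to tabulate the mismatch spectrum at chr13:1,004,904 across every sample pasted in this notebook. The sketch below uses only the IGV counts above; the sample labels and the function name `mismatch_profile` are mine, and the two B-cell entries are labelled by block order because the second pop-up has no locus line.

```python
# Base counts at chr13:1,004,904 copied from the IGV pop-ups in this notebook.
SAMPLES = {
    "TotalRNA2": {"A": 1749, "C": 32288, "G": 537, "T": 83620},
    "TotalRNA1": {"A": 1943, "C": 35441, "G": 568, "T": 91524},
    "Colorectal": {"A": 2397, "C": 23565, "G": 3402, "T": 24221},
    "Hodgkin": {"A": 339, "C": 3704, "G": 463, "T": 7667},
    "B-cell (first block)": {"A": 1326, "C": 15820, "G": 1867, "T": 19680},
    "B-cell (second block)": {"A": 1237, "C": 14700, "G": 1532, "T": 15081},
}


def mismatch_profile(counts, ref="T"):
    """Fraction of each non-reference base among called A/C/G/T bases."""
    called = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / called for b in "ACGT" if b != ref}


profiles = {name: mismatch_profile(c) for name, c in SAMPLES.items()}
for name, p in profiles.items():
    print(f"{name:22s} " + "  ".join(f"{b}={p[b]:.3f}" for b in "ACG"))
```

In these numbers C dominates everywhere, while A and G sit near or below 1% in the total RNA libraries and at a few percent in the colorectal, Hodgkin and B-cell samples. Whether that difference reflects biology or library preparation is not something this table can answer.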

### U1248 in E. coli

- G966 in E. coli or U1191 in yeast
- Secondary Structure: 16S helix 31 (also 970 loop)
- 960... UUCG AU**G**CAACG CGAA
- m2-G966 and m5-C967 are methylated in E. coli [PMID:21278676](http://www.ncbi.nlm.nih.gov/pubmed/21278676)
- Authors report that the 'corresponding base in eukarya is modified': 1-methyl-3-(3-amino-3-carboxypropyl)pseudouridine.
- "Single mutations at m2-G966 or m5-C967 produce more protein in vivo than wt ribosomes"
- "Binding site for IF3" (eukaryotic eIF1)

#### h31-tRNA Interaction
h31, and more specifically m2-G966, interacts with the anticodon of the tRNA at position 34. This means that the first anticodon base (position 34, the wobble position, which pairs with the third base of the codon) may somehow be involved at the P-site if these are really variant rRNA molecules.

E.coli helix 31 structure. From PMID:21278676
![E.coli helix 31 structure. From PMID:21278676](../figure/20160504_Ecoli_h31.png)
P-site structure. H31 - AntiCodon Interaction. PMID:11283358
![P-site structure. H31 - AntiCodon Interaction. PMID:11283358](../figure/20160504_Ecoli_Psite_5.5A.png)
Yeast tRNA-Phe. PMID:10943889
![Yeast tRNA-Phe. PMID:10943889](../figure/20160504_Yeast_tRNA_Phe.png)

- U1248 is U1191 in Yeast


