# 18S : 18SE Assay
```
pi:ababaian
files: /home/artem/Desktop/Crown/data2/tcga_analysis/18SE_Assay
start: 2018 11 28
complete : YYYY MM DD
```
## Introduction

The 1248.U modification to 1248.macpPsi is a late modification in the maturation of 18S rRNA. It occurs in the cytoplasm, presumably before 80S assembly (otherwise the site would not be accessible to TSR3).

One hypothesis that could explain the hypo-macp phenotype is a defect in rRNA biogenesis that substantially increases the level of immature 18S rRNA accumulating in the cell. This would have to happen to quite an extreme degree to produce the extent of hypo-macp seen in cancers like CRC (up to 95+% unmodified), which would imply a greater proportion of pre-rRNA than mature rRNA. Whether or not that is plausible, it should be possible to measure the rate of 18S biogenesis by comparing the coverage of 18S to that of its precursor 18S-E, which includes an ~80 bp 3' extension.



## Objective

Create a rapid pipeline to measure 18S coverage and 18S-E coverage. For the TCGA cohorts, compare the 18S:18SE ratio to macp-modification levels.

### Hypothesis

The hypo-macp phenotype (decrease in VAF at 18S.1248 U) is the result of the accumulation of pre-rRNA and will correlate positively and linearly (Pearson correlation) with the 18SE : 18S ratio.

The mild hypo-macp phenotype in some normal libraries will correlate with higher 18SE.

The hypo-macp phenotype in cancer libraries will correlate with higher 18SE.
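
As a rough illustration of the test implied by the hypothesis, the sketch below computes a Pearson correlation coefficient from two numeric columns with `awk`. The helper name `pearson_r` and the two-column input layout (18SE : 18S ratio, then VAF) are assumptions made for this example and are not part of the analysis code. If hypo-macp is read out as a decrease in VAF, the hypothesis would predict a negative coefficient between VAF and the ratio.

```bash
# pearson_r: read two whitespace-separated numeric columns (x y) on stdin
# and print the Pearson correlation coefficient to 4 decimal places.
# Prints NA when fewer than 2 rows are given or either column is constant.
pearson_r() {
  awk '
    NF >= 2 { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      vx = n * sxx - sx * sx      # n^2 times the variance of x
      vy = n * syy - sy * sy      # n^2 times the variance of y
      if (n < 2 || vx <= 0 || vy <= 0) { print "NA"; exit }
      printf "%.4f\n", (n * sxy - sx * sy) / sqrt(vx * vy)
    }'
}

# Toy values only (not TCGA data): column 1 = ratio, column 2 = VAF
printf '1 6\n2 4\n3 2\n' | pearson_r
```

Pearson's coefficient only captures linear association, so a rank-based statistic may be a useful companion if the relationship turns out to be non-linear.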



## Materials and Methods


### rRNA Precursor Coordinates

End coordinates for 18S-E taken from [PMC3632142](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3632142/) and 21S from [PMC3017594](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3017594/).


```
   |----------------------------------------45S Unit--------------------------------------------|
   
...|-------|=========18S======|--E-----------|==5.8S=|-------|========28S===================|---|...
     5 ETS                           ITS1               ITS2                                 3 ETS
   
                                       32S   |----------------------------------------------|
                                       
     21S-C |----------------------------|
     
     18S-E |------------------***|
     
```


The `rRNA.bed` file:

```
chr13	10219	10340	5S
chr13	1000000	1013408	45S
chr13	1003660	1005529	18S
chr13	1006622	1006779	5.8S
chr13	1007947	1013018	28S
chr13	1006622	1013018	32S
chr13	1003660	1003683	21C
chr13	1003660	1005608	18SE
chr13	1005529	1005608	18SE_frag
```


```
# Testing bedtools
## Workspace
WORKDIR='/home/artem/Desktop/Crown/data2/tcga_analysis/18SE_Assay'
cd "$WORKDIR"

#test data
#ln -s ~/Desktop/var/TCGA-22-4593-01A.hgr1.bam ./
#ln -s ~/Desktop/var/TCGA-22-4593-01A.hgr1.bam.bai ./
#ln -s ~/Desktop/var/TCGA-22-4593-11A.hgr1.bam ./
#ln -s ~/Desktop/var/TCGA-22-4593-11A.hgr1.bam.bai ./

bedtools --version
```

Output:

```
bedtools v2.26.0
```


```
# Test multiBamCov program (gives total coverage only)
start=$(date +%s)

multiBamCov -split -bams *.bam -bed rRNA.bed

end=$(date +%s)
runtime=$((end - start))
echo "Runtime (s): $runtime"
```

Output (one count column per BAM, in the order the files were listed):

```
chr13	10219	10340	5S	198	357
chr13	1000000	1013408	45S	2563659	4624354
chr13	1003660	1005529	18S	408048	1355474
chr13	1006622	1006779	5.8S	4690	19402
chr13	1007947	1013018	28S	2128864	3245891
chr13	1006622	1013018	32S	2136068	3266217
chr13	1003660	1003683	21C	5929	24012
chr13	1003660	1005608	18SE	408252	1355511
chr13	1005529	1005608	18SE_frag	1381	272
113
```
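
As a quick sanity check before running the per-base pipeline, the interval counts from `multiBamCov` can be turned into a crude precursor-to-mature ratio. The sketch below is only illustrative: `frag_ratio` is a hypothetical helper, the ratio uses raw counts that are not normalised for interval length, and it is not the `SE_ratio` used in the Results.

```bash
# frag_ratio: read multiBamCov output on stdin and print, for each BAM column
# (fields 5..NF), the ratio of 18SE_frag counts to 18S counts, tab-separated.
# Prints NA for a column whose 18S count is zero.
frag_ratio() {
  awk '
    $4 == "18S"       { n = NF; for (i = 5; i <= NF; i++) s[i] = $i }
    $4 == "18SE_frag" { for (i = 5; i <= NF; i++) e[i] = $i }
    END {
      for (i = 5; i <= n; i++) {
        sep = (i > 5) ? "\t" : ""
        if (s[i] > 0) printf "%s%.6f", sep, e[i] / s[i]
        else          printf "%sNA", sep
      }
      printf "\n"
    }'
}

# Rows copied from the multiBamCov output above
{
  printf 'chr13\t1000000\t1013408\t45S\t2563659\t4624354\n'
  printf 'chr13\t1003660\t1005529\t18S\t408048\t1355474\n'
  printf 'chr13\t1005529\t1005608\t18SE_frag\t1381\t272\n'
} | frag_ratio
```

For these two test libraries the raw ratios come out at roughly 0.0034 and 0.0002, so the two columns differ by more than tenfold even before any per-base normalisation.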


```
# #!/bin/bash
#
# bamCov.sh < BAMLIST FILE >
#

# Test genomeCoverageBed in serial (per base output)
start=$(date +%s)

# BAMLIST=$(cat $1)
BAMLIST=$(ls *.bam)
#START_COORD='1003660'
#END_COORD='1005608'

rm -f test2.tmp # remove leftovers from a previous run, if any

for FILE in $BAMLIST
do
  genomeCoverageBed -d -split -ibam "$FILE" > test.tmp

  # cut coverage column only for 18S-E coordinates
  # (sed selects by line number, so this assumes line N of the -d output
  #  is position N on chr13, i.e. chr13 is the first sequence in the BAM)
  if [ -e test2.tmp ]
  then
    sed -n 1003660,1005608p test.tmp | cut -f3 - | paste test2.tmp - > test3.tmp
  else
    sed -n 1003660,1005608p test.tmp | cut -f3 - > test3.tmp
  fi

  mv test3.tmp test2.tmp

done

# header row: one tab-separated BAM name per coverage column
echo $BAMLIST | tr ' ' '\t' > header.tmp
cat header.tmp test2.tmp > 18SE.coverage.tsv

# clean up intermediate files
rm -f test.tmp test2.tmp header.tmp

end=$(date +%s)
runtime=$((end - start))
echo "Runtime (s): $runtime"
```

Output:

```
Runtime (s): 58
```
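
The `sed -n 1003660,1005608p` step selects rows by line number, so it is worth cross-checking how those rows map onto the `rRNA.bed` intervals. The sketch below converts BED coordinates (0-based, half-open) into 1-based row indices within the extracted block. It assumes that line N of the `genomeCoverageBed -d` output corresponds to position N of chr13; `bed_to_rows` is a hypothetical helper and is not part of `bamCov.sh`.

```bash
# bed_to_rows: convert BED intervals (0-based start, half-open end) into
# 1-based, inclusive row ranges relative to the first extracted line.
#   $1    = first line number passed to sed (here 1003660)
#   stdin = BED rows (chrom, start, end, name)
# Output columns: name, interval length, first row, last row
bed_to_rows() {
  awk -v first="$1" '{
    len       = $3 - $2               # BED interval length in bases
    row_start = ($2 + 1) - first + 1  # 0-based start to 1-based position, then to row
    row_end   = $3 - first + 1        # half-open end equals the last 1-based position
    printf "%s\t%d\t%d\t%d\n", $4, len, row_start, row_end
  }'
}

{
  printf 'chr13\t1003660\t1005529\t18S\n'
  printf 'chr13\t1005529\t1005608\t18SE_frag\n'
} | bed_to_rows 1003660
```

Under these assumptions 18S (1869 bases) occupies rows 2 to 1870 and the 18S-E unique region (79 bases) occupies rows 1871 to 1949, one row later than the `1:1869` and `1870:1949` indices used in the Results. If that is the case the shift is a single base and should have little effect on mean coverage, but it may be worth confirming against the reference.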


```
# Launch crown-tcga instance
# crown-tcga-181124 (ami-053dfb448b82492ac)
## REMOTE:

# Installing bedtools on REMOTE
cd ~/software/
wget https://github.com/arq5x/bedtools2/releases/download/v2.27.1/bedtools-2.27.1.tar.gz

tar -xvf bedtools-2.27.1.tar.gz
cd bedtools2   # the tarball unpacks into bedtools2/
make

cd ~/bin/
ln -s /home/ubuntu/software/bedtools2/bin/* ./
```

```
## vim ~/scripts/bamCov.sh {insert}
## cd tcga

### RUN1------------------------------
## screen
## ls TCGA-COA*/*.bam > coad.list
##
## bash ~/scripts/bamCov.sh coad.list
### runtime: 14296
### aws s3 cp 18SE.coverage.tsv s3://crownproject/tcga/181128_18SE/18SE.coverage.coad.tsv

### RUN2------------------------------
## screen
## mkdir run2; cd run2
## ls ../TCGA-[BC]*/*.bam > bc.list
## bash ~/scripts/bamCov.sh bc.list
### runtime: 13625
### aws s3 cp 18SE.coverage.tsv s3://crownproject/tcga/181128_18SE/18SE.coverage.r2.tsv

### RUN3------------------------------
## screen
## mkdir run3; cd run3
## ls ../TCGA-[DEHK]*/*.bam > dehk.list
## bash ~/scripts/bamCov.sh dehk.list
### runtime: 7859
### aws s3 cp 18SE.coverage.tsv s3://crownproject/tcga/181128_18SE/18SE.coverage.r3.tsv

### RUN4------------------------------
## screen
## mkdir run4; cd run4
## ls ../TCGA-[LPRST]*/*.bam > lt.list
## bash ~/scripts/bamCov.sh lt.list
### runtime: 29781
### aws s3 cp 18SE.coverage.tsv s3://crownproject/tcga/181128_18SE/18SE.coverage.r4.tsv

## ls */*.bam > bam.list
##
## bash ~/scripts/bamCov.sh bam.list

## Requires: ~16h for complete set

## paste run2/18SE.coverage.tsv run3/18SE.coverage.tsv run4/18SE.coverage.tsv > 18SE.coverage.all.tsv
## gzip 18SE.coverage.all.tsv
## aws s3 cp 18SE.coverage.all.tsv.gz s3://crownproject/tcga/181128_18SE/
```





## Results

Analysis R Markdown file: `~/Crown/data2/tcga_analysis/18SE_Assay/18SE.rmd`

The analysis covers 457 `01A` cases (red), 40 paired `11A` control cases (blue) and 19 other libraries (`01B`, `01C` or `06A`; green).


### Calculated using 18S Tail Region
Using: 
```
# Coordinates
xy_18 = c(1850:1869) # 18S tail region
xy_E  = c(1870:1949) # 18S-E unique region

SE_ratio = coverage(xy_E) / coverage(xy_18)
```
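
The `coverage()` calls above are shorthand. A minimal sketch of the same calculation applied directly to the coverage table is given below, assuming that `coverage()` means the mean per-base depth over the listed rows and that the table has one header line of library names followed by one row per base. `se_ratio` and `make_demo` are hypothetical helpers for illustration; the actual analysis lives in `18SE.rmd`.

```bash
# se_ratio: mean coverage over the 18S-E rows divided by mean coverage over
# the 18S rows, for every library column of a coverage table on stdin.
#   $1,$2 = first,last data row of the 18S window   (e.g. 1850 1869)
#   $3,$4 = first,last data row of the 18S-E window (e.g. 1870 1949)
# Row numbers count data rows only (the header line is skipped).
se_ratio() {
  awk -v s1="$1" -v s2="$2" -v e1="$3" -v e2="$4" '
    NR == 1 { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    { row = NR - 1 }
    row >= s1 && row <= s2 { ns++; for (i = 1; i <= ncol; i++) s[i] += $i }
    row >= e1 && row <= e2 { ne++; for (i = 1; i <= ncol; i++) e[i] += $i }
    END {
      for (i = 1; i <= ncol; i++) {
        if (ns > 0 && ne > 0 && s[i] > 0)
          printf "%s\t%.4f\n", name[i], (e[i] / ne) / (s[i] / ns)
        else
          printf "%s\tNA\n", name[i]
      }
    }'
}

# make_demo: synthetic table (not real data) with 1949 rows and two libraries
make_demo() {
  awk 'BEGIN {
    OFS = "\t"
    print "libA.bam", "libB.bam"
    for (r = 1; r <= 1949; r++) {
      a = (r <= 1869) ? 10 : 5   # libA: depth 10 over 18S, 5 over the extension
      b = (r <= 1869) ? 4 : 8    # libB: depth 4 over 18S, 8 over the extension
      print a, b
    }
  }'
}

make_demo | se_ratio 1850 1869 1870 1949
```

Passing `1 1869` as the first window instead of `1850 1869` would give the whole-18S variant of the ratio.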

#### Coverage vs. VAF
![Coverage vs. VAF](../../data2/tcga_analysis/18SE_Assay/plots/coad_cov_v_vaf.png)

There does not appear to be an association between cancer `01A` library coverage at the 18S tail region and the variant allele frequency (VAF) at 18S.1248U. Notably, the `01B/C` libraries have higher coverage and consistently very low VAF (near 5%), which suggests these samples are chemically distinct from the rest. I cannot find what the exact distinction between `01A` and `01B` library preparation is (if it is library prep). Even within the same patient there is discordance in VAF, and all B libraries are ultra-low VAF.

Focusing on only `01A` libraries, there is a discontinuity in the data along the diagonal between cov = 9 and VAF = 40, with seemingly two clusters of data on either side of this line. Comparing `01A` libraries to the normal control `11A` libraries, the `01A` and `11A` data appear to be continuously distributed, with `01A` showing lower VAF values and slightly elevated coverage. The cancer range of coverage at normo-VAF levels is greater than the normal range, suggesting higher variability in the cancer data.

#### SE Ratio vs. VAF
![SE Ratio vs. VAF](../../data2/tcga_analysis/18SE_Assay/plots/coad_se_v_vaf)

Calculating the 18S-E / 18S ratio, the majority of `01A` libraries show a low ratio (<0.25), which is consistent with normal 18S rRNA processing. This includes a large bulk of samples which are low-VAF (hypo-macp). This excludes hypo-macp being the result of an 18S rRNA biogenesis deficiency in CRC.

Notably, there is a subset of libraries, in both the normo-macp and hypo-macp range (arbitrary cut-off VAF = 40%), which have an SE_ratio elevation consistent with a ribosome biogenesis deficiency phenotype. It is possible this is simply a carry-over of pre-rRNA in library preparation, although the lack of association with macp-modification suggests this is not the case (also check other 18S rRNA modifications perhaps?).

Note: the SE ratio is calculated using the tail region of 18S, which is why the ratio can approach or exceed 1.0.


### Calculated using entire 18S coverage

Using: 
```
# Coordinates
xy_18 = c(1:1869)    # 18S whole
xy_E  = c(1870:1949) # 18S-E unique region

SE_ratio = coverage(xy_E) / coverage(xy_18)
```

#### Coverage vs. VAF
![Coverage vs. VAF](../../data2/tcga_analysis/18SE_Assay/plots/coad_cov18S_v_vaf.png)

Same trend as above: there does not appear to be a correlation between VAF and mean coverage over the entire 18S.

#### SE Ratio vs. VAF
![SE Ratio vs. VAF](../../data2/tcga_analysis/18SE_Assay/plots/coad_se18S_v_vaf.png)

This plot shows the log-scaled ratio of 18S-E tail to 18S. There is a distinct correlation here between 18S-E and VAF. Note the 'shadow' of high 18S-E libraries which have higher (shift left) modification. This is consistent with libraries which have comparable levels of modification (x-axis) but higher relative levels of 18S-E (precursor accumulation) than the majority of libraries. One explanation is that these are very late-stage biogenesis defects, or perhaps the exonucleases responsible for 18S-E processing are disrupted while ribogenesis continues. Parsing out the difference between these populations at the RNA-seq expression level may be of significant interest.




## Discussion
