# Colorectal Carcinoma RNAseq - 2
```
pi:ababaian
files: ~/Crown/data/CRC/
start: 2017 03 31
complete : 2017 04 01
```
## Introduction

[Continuation of CRC_1](./20170325_CRC_RNAseq_PRIVATE.ipynb)

If a variant's allele frequency changes within a patient between cancer and normal tissue, then a consistent deviation between the mean reference allele frequency (RAF) of the cancer and normal samples is indicative of directional selection. This can be used to identify oncogenic or tumour-suppressor variation in rRNA, if it exists.

## Hypothesis

1) Null: An allele is not under directional selection if the mean reference allele frequency (or variant allele frequency) is the same between the cancer and normal cell populations.
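
Stated formally, as a sketch in the notation of the analysis script (RD = reference allele depth, DP = total depth). The patient index $i$, the pair count $N$, and the symbols $\bar{d}$ and $s_d$ are introduced here for illustration only:

$$
\mathrm{RAF} = \frac{RD}{DP}, \qquad
d_i = \mathrm{RAF}_i^{\mathrm{cancer}} - \mathrm{RAF}_i^{\mathrm{normal}}, \qquad
H_0:\ \mathbb{E}[d] = 0
$$

$$
t = \frac{\bar{d}}{s_d / \sqrt{N}}, \qquad \mathrm{df} = N - 1
$$

With 69 matched pairs this gives 68 degrees of freedom, which matches the paired t-test outputs in the Results.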


## Materials and Methods

1) [Data generated previously](./20170325_CRC_RNAseq_PRIVATE.ipynb) for all of 18S and 28S rRNA in each of the 138 cancer/normal RNAseq libraries.

- `18S_crc.gvcf` - DP:AD for chr13:1003660-1005529 across 138 samples (69 patient-tumour matched)
- `28S_crc.gvcf` - DP:AD for chr13:1007948-1013560 across 138 samples (69 patient-tumour matched)

NOTE: The 28S region was incorrectly extended to the end of 45S rather than the end of 28S. For this analysis the extra bases are trimmed manually in the code at 1013018.

2) Ran crcAnalysis.r script

3) Molecule-wide significance

4) Measuring 18E pre-rRNA content

- Ran adCalc.sh on gsc on `REGION='chr13:1005530-1005724'` and `OUTPUT='18E_crc.gvcf'`
- Ran adCalc.sh on gsc on `REGION='chr13:1006622-1006779'` and `OUTPUT='5.8S_crc.gvcf'`

Note: I'm not predicting any change in 5.8S; I will look at it but don't consider it part of the screen. 18E is the ~200 bases downstream of 18S, and its coverage can be used to measure the relative proportion of 18S-E (pre-rRNA accumulation), which could also explain a decrease in the amount of modified base being read out.
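
One way the 18E measurement could be summarised is as a per-library ratio of mean depth over the 18E window to mean depth over 18S. The sketch below is not part of the original scripts: `DP_18S`, `DP_18E` and `ratio18E` are hypothetical names, the depth matrices are simulated, and the odd/even cancer/normal row ordering is assumed to match crcAnalysis.r.

```r
# Simulated depth matrices (libraries x positions); real values would come
# from the DP fields of 18S_crc.gvcf and 18E_crc.gvcf.
set.seed(2)
nLib   = 6
DP_18S = matrix(rpois(nLib * 1869, 5000), nrow = nLib)  # 1869 bases of 18S
DP_18E = matrix(rpois(nLib * 195, 50),    nrow = nLib)  # chr13:1005530-1005724

# Relative 18S-E content per library: mean 18E depth / mean 18S depth
ratio18E = rowMeans(DP_18E) / rowMeans(DP_18S)

# Split into pairs, assuming odd rows = cancer and even rows = normal
canRatio  = ratio18E[seq(1, nLib, 2)]
normRatio = ratio18E[seq(2, nLib, 2)]
print(round(cbind(canRatio, normRatio), 4))
```

A paired comparison of `canRatio` and `normRatio` would then indicate whether pre-rRNA accumulation differs between tumour and normal.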


### crcAnalysis.r

```
# crcAnalysis.R
#
# Analysis of adCalc.sh
# output gvcf files
#
library(ggplot2)

# Import
GVCF = read.table('18S_crc.gvcf')
  GVCF = data.frame(t(GVCF))
  colnames(GVCF) = seq(1, length(GVCF[1,]))

# Cut 28S to just 28S (5071 bases)
# chr13    1007948    1013018    28S
  # trim = which( (apply(GVCF[2,], 1, as.numeric) < 1007949) | 
  #                 (apply(GVCF[2,], 1 , as.numeric) > 1013018) )
  # GVCF = GVCF[, -trim]  

  
# Cut 18S to just 18S (1869 bases)
# chr13    1003661    1005529    18S
  trim = which( (apply(GVCF[2,], 1, as.numeric) < 1003661) | 
                (apply(GVCF[2,], 1 , as.numeric) > 1005529) )
  GVCF = GVCF[, -trim]  
  

refAllele = GVCF[4,]
altAllele = GVCF[5,]
genCoord  = GVCF[2,]
rnaCoord  = seq(1, length(GVCF[2,]))

sampleN = length(GVCF[,1]) - 9 # remove 9 header vcf rows
bpN     = length(genCoord) # 1869 for 18S; 5071 for 28S

# Functions =========================================================

# Convert DP:AD string to numeric DP (Total Depth)
dpCalc = function(inSTR){
# inSTR is from vcf
# in format DP:AD
# 2000:1500,400,50,50
# extract 2000
inSTR = as.character(inSTR)
as.numeric(unlist(strsplit(inSTR,split=':'))[1])

}

# Convert DP:AD string to numeric RD for the REFERENCE ALLELE DEPTH
# Thus Alternative_Allele_Depth = Total_Depth - Reference_Allele_Depth
# for all alternative alleles.
rdCalc = function(inSTR){
  # inSTR is from vcf
  # in format DP:AD
  # 2000:1500,400,50,50
  # extract 1500
  inSTR = as.character(inSTR)
  as.numeric(unlist(strsplit(unlist(strsplit(inSTR,split=":"))[2], split = ","))[1])
  
}


# Calculations ======================================================
# Calculate Depth of Coverage (baq > 30)
# for all positions

#Initialize DP vector
DP = vapply( GVCF[-c(1:9),1], dpCalc, 1)

#Extend the DP vector for all positions
for (i in 2:bpN){
DP = cbind(DP,
           vapply( GVCF[-c(1:9),i], dpCalc, 1) )
}

# Calculate Reference Depth of Coverage (baq > 30)
# for all positions
#
#Initialize
RD = vapply( GVCF[-c(1:9),1], rdCalc, 1)

#The rest
for (i in 2:bpN){
  RD = cbind(RD,
             vapply( GVCF[-c(1:9),i], rdCalc, 1) )
}


# Reference Allele Frequency
# Intra-Library
# RD / DP
RAF = RD / DP

# NOTE: division by zero is possible here and will introduce NAs


# Deconvolute Cancer samples from normal samples
# Odd Rows = Cancer Sample
# Even Rows = Normal Sample
# Paired for CRC
canRAF = RAF[seq(1,sampleN,2),]
normRAF = RAF[seq(2,sampleN,2),]
              
# Change in Reference Allele Frequency
# of Cancer from Normal
dRAF = canRAF - normRAF


# Calculate some descriptive statistics
# about the change in Reference Allele Frequency
# Remove NA from calculations (no sequencing depth in a library)
mean_dRAF = apply(dRAF,2,mean, na.rm = TRUE)
sd_dRAF   = apply(dRAF,2,sd, na.rm = TRUE)
var_dRAF  = apply(dRAF,2,var, na.rm = TRUE)
mean_DP = apply(DP,2,mean, na.rm = TRUE)

# Remove poorly 'covered' positions (i.e. less than 1000x coverage on average);
# the magnitude of bias is simply too high in such regions
dropPOS = (mean_DP < 1000)

canRAF[,dropPOS]  = 0
normRAF[,dropPOS] = 0

```
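
The `dpCalc`/`rdCalc` pair and the `cbind` loops can also be written as a single vectorised parse. This is only a sketch: `parseDPAD` is a hypothetical helper, not part of crcAnalysis.r, and it assumes every field is in the `DP:AD` format described in the comments above.

```r
# Hypothetical helper: parse a vector of "DP:AD" strings in one pass.
# Example field: "2000:1500,400,50,50" -> DP = 2000, RD = 1500
parseDPAD = function(x) {
  x     = as.character(x)
  parts = strsplit(x, ":", fixed = TRUE)
  dp    = as.numeric(vapply(parts, `[`, character(1), 1))  # total depth
  ad    = vapply(parts, `[`, character(1), 2)              # "ref,alt1,alt2,..."
  rd    = as.numeric(vapply(strsplit(ad, ",", fixed = TRUE),
                            `[`, character(1), 1))         # reference depth
  data.frame(DP = dp, RD = rd, RAF = rd / dp)
}

example = parseDPAD(c("2000:1500,400,50,50", "100:90,10,0,0"))
print(example)
```

Applied column by column to `GVCF[-c(1:9), ]`, this would give the same DP, RD and RAF values without growing matrices inside a loop.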

### screenPlot_crc.r
Script to process crcAnalysis.r data files and plot molecule-wide changes in RAF and significance.

```
# screenPlot_crc.r
#
# Screen for changes between allele frequency
# between two biological samples
# using adCalc.sh / crcAnalysis.r
#

library(ggplot2)
library(rgl)
library(grid)

# Calculate P-value for difference of means in RAF
# between cancer and normal
# use as a 'score' 

# t.test based
Pval = t.test(canRAF[,1], normRAF[,1], paired = TRUE)$p.value

for (i in 2:bpN){
  Pval = cbind(Pval,
               t.test(canRAF[,i], normRAF[,i], paired = TRUE)$p.value)
}

# Score
Pscore = -log(Pval)
Pscore[is.na(Pscore)] = 0
Pscore = as.numeric(Pscore)


# Bonferroni correction
# p < alpha / m
# p < significance_cutoff / numberTests
# p * numberTests < significance_cutoff

Pscore_bon = -log(Pval*bpN)
Pscore_bon[is.na(Pscore_bon)] = 0
Pscore_bon[Pscore_bon < 0] = 0
Pscore_bon = as.numeric(Pscore_bon)

# NOTE: this should be re-calculated as a Manhattan plot
# for the publication. T-test is a bit basic.


# PLOT ==================================================

DATA = data.frame(1:bpN) # for 18S
  colnames(DATA) = c("RNA")

DATA$mean_dRAF = mean_dRAF    
DATA$var_dRAF = var_dRAF
DATA$mean_DP = mean_DP
DATA$Pscore = Pscore
DATA$Pscore_bon = Pscore_bon

# plot(mean_dRAF)
#  plot(var_dRAF)
#  plot(mean_DP)
#  plot(log(mean_DP))
#  plot(Pscore)
#  plot(Pscore_bon)
# plot3d(mean_DP, Pscore)
 
# PLOTS for 18S
PLOT1 = ggplot(DATA, aes(RNA, mean_dRAF)) +
  geom_point() + theme_minimal() +
  ylim(c(-0.11,0.11))
#PLOT1

PLOT2 = ggplot(DATA, aes(RNA, var_dRAF)) +
  geom_point() + theme_minimal()
#PLOT2

PLOT3 = ggplot(DATA, aes(RNA, mean_DP)) +
  geom_point() + theme_minimal() +
  scale_y_log10()
#PLOT3

PLOT4 = ggplot(DATA, aes(RNA, Pscore)) +
  geom_point() + theme_minimal() +
  geom_hline(yintercept = -log(0.001/bpN))
#PLOT4

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

multiplot(PLOT1, PLOT2, PLOT3, PLOT4, cols=1)

```
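
The per-position screen above can be sketched more compactly with `vapply` and `p.adjust`. The example uses simulated data only. `pairedP`, `canSim` and `normSim` are hypothetical names, and the shift spiked into position 10 is arbitrary. Note that the script's `-log` is the natural log, whereas Manhattan plots conventionally use `-log10`.

```r
set.seed(1)                         # simulated data, not the CRC libraries
nPatients = 69
nPos      = 50
normSim = matrix(rnorm(nPatients * nPos, mean = 0.90, sd = 0.02),
                 nPatients, nPos)
canSim  = normSim + matrix(rnorm(nPatients * nPos, mean = 0, sd = 0.02),
                           nPatients, nPos)
canSim[, 10] = canSim[, 10] + 0.05  # spike a directional shift into one position

# Paired t-test p-value per position; NA if the test cannot be computed
pairedP = function(can, norm) {
  vapply(seq_len(ncol(can)), function(i) {
    tryCatch(t.test(can[, i], norm[, i], paired = TRUE)$p.value,
             error = function(e) NA_real_)
  }, numeric(1))
}

pRaw = pairedP(canSim, normSim)
pBon = p.adjust(pRaw, method = "bonferroni", n = nPos)  # pmin(1, p * nPos)
hits = which(pBon < 0.001)
print(hits)
```

Passing `n` explicitly keeps the correction tied to the number of positions tested, as `Pval*bpN` does in the script.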

## Results

### 18S rRNA Change in Reference Allele Frequency

![dRAF 18S](../../data/CRC/plot/18S_dRAF.png)

A single position, 18S.1248U, reaches molecule-wide significance by paired t-test for a difference of means at alpha = 0.001.

This is the hyper-modified base macp-Psi. The increase in reference allele frequency is consistent with a decrease in modification of this base.

### 28S rRNA Change in Reference Allele Frequency

![dRAF 28S](../../data/CRC/plot/28S_dRAF.png)

In 28S a single position, 28S.470A (hgr1.1008418), also reaches molecule-wide significance. Note that the region of 28S between positions 3000 and 3600 is poorly covered: mean coverage drops below 1000x, so it is excluded as a bias region.


I therefore reject the null hypothesis at two positions, 18S.1248U and 28S.470A, which reach molecule-wide significance for changes in reference allele frequency.


### 18S.1248U
`basePlot_crc.r` analysis

![18S.1248U RAF](../../data/CRC/plot/18S.1248U_RAF.png)

```
> t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)

	Paired t-test

data:  POS$cancer_RAF and POS$normal_RAF
t = 8.331, df = 68, p-value = 5.491e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.08220607 0.13398997
sample estimates:
mean of the differences 
               0.108098 

> var.test(POS$cancer_RAF, POS$normal_RAF)

	F test to compare two variances

data:  POS$cancer_RAF and POS$normal_RAF
F = 11.003, num df = 68, denom df = 68, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
  6.812664 17.770178
sample estimates:
ratio of variances 
          11.00283 
```

After Bonferroni correction this is 1.0256 x 10^-8 molecule-wide and ~3.8437 x 10^-8 experiment-wide, both well below the alpha of 0.001. This position differs between cancer and normal by a wide margin, and the change is directional towards the reference allele (U). This is likely a marker of strong selection at this position, but it's slightly more complicated than that since this is a hyper-modified base...
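
The correction is simply the raw p-value multiplied by the number of tests. A minimal sketch of the arithmetic follows, assuming the test counts are the 18S and 28S lengths given in crcAnalysis.r. The exact counts used for the figures quoted above are not stated, so small differences in the trailing digits are expected. The names `pMolecule` and `pExperiment` are hypothetical.

```r
pRaw  = 5.491e-12   # paired t-test p-value reported above for 18S.1248U
n18S  = 1869        # bases in 18S (from crcAnalysis.r)
n28S  = 5071        # bases in 28S (from crcAnalysis.r)
alpha = 0.001

# Bonferroni: p_adjusted = min(1, p * numberTests)
pMolecule   = min(1, pRaw * n18S)            # corrected across 18S only
pExperiment = min(1, pRaw * (n18S + n28S))   # assumed: all 18S + 28S positions

print(c(molecule = pMolecule, experiment = pExperiment))
print(c(molecule = pMolecule < alpha, experiment = pExperiment < alpha))
```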

![18S_1248U dRAF](../../data/CRC/plot/18S.1248U_dRAF.png)


```
> t.test(POS$dRAF, POS$sim_dRAF, paired = FALSE)

	Welch Two Sample t-test

data:  POS$dRAF and POS$sim_dRAF
t = 5.4939, df = 135.51, p-value = 1.881e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.06654615 0.14139895
sample estimates:
  mean of x   mean of y 
0.108098019 0.004125469 

> var.test(POS$dRAF, POS$sim_dRAF)

	F test to compare two variances

data:  POS$dRAF and POS$sim_dRAF
F = 0.88707, num df = 68, denom df = 68, p-value = 0.6226
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.549253 1.432674
sample estimates:
ratio of variances 
         0.8870739 
```


### 28S.470A
`basePlot_crc.R` analysis

![28S.470A RAF](../../data/CRC/plot/28S.470A_RAF.png)

```
> t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)

	Paired t-test

data:  POS$cancer_RAF and POS$normal_RAF
t = -5.8538, df = 68, p-value = 1.528e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.020305158 -0.009981117
sample estimates:
mean of the differences 
            -0.01514314 

> var.test(POS$cancer_RAF, POS$normal_RAF)

	F test to compare two variances

data:  POS$cancer_RAF and POS$normal_RAF
F = 2.0824, num df = 68, denom df = 68, p-value = 0.002877
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.289365 3.363186
sample estimates:
ratio of variances 
          2.082397 
```

After Bonferroni correction for multiple testing, the P-value is 7.886 x 10^-4 across 28S and 1.074 x 10^-3 for the entire experiment, which by the book rules this position out as being significantly different at the experiment-wide level. I think that's a super conservative interpretation though, and something is probably going on at 470A.

![28S.470A dRAF](../../data/CRC/plot/28S.470A_dRAF.png)

```
> t.test(POS$dRAF, POS$sim_dRAF, paired = FALSE)

	Welch Two Sample t-test

data:  POS$dRAF and POS$sim_dRAF
t = -4.8017, df = 135.81, p-value = 4.098e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02434797 -0.01014286
sample estimates:
   mean of x    mean of y 
-0.015143137  0.002102281 

> var.test(POS$dRAF, POS$sim_dRAF)

	F test to compare two variances

data:  POS$dRAF and POS$sim_dRAF
F = 1.0781, num df = 68, denom df = 68, p-value = 0.7574
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6675255 1.7411761
sample estimates:
ratio of variances 
          1.078091
```

## Conclusion

I reject the null hypothesis at 18S.1248U: there is a significant difference at this position between cancer and normal. This is an incredibly important position in the ribosome. It sits at the core of the P-site and is hyper-modified to macp-Psi, which complicates the analysis, but I don't believe this is simply chance or noise. This is the biochemical signature of a difference between a sub-population of rRNA in cancer cells and normal cells.

[Continued in CRC 3](./20170401_CRC_RNAseq_3_PRIVATE.ipynb)

### 5.8S dRAF

Just because I can't not look at this point.

![5.8S dRAF](../../data/CRC/plot/5.8S_dRAF.png)

No significant points : )



## Addendum -- 170423

In a parallel analysis of T-ALL, I identified two more modifications detectable by RNA-seq, which are likely RNA nucleotide differences [See Database](https://people.biochem.umass.edu/fournierlab/3dmodmap/hum28sseq.php).

I hypothesized that if there is a pan-ribobiogenesis phenotype in the CRC samples then other modifications should also be 'hypo-modified'.

### 18S.1851A
Modification to di-me6_A

`basePlot_crc.R` analysis

![18S.1851A RAF](../../data/CRC/plot/18S.1851A_RAF.png)

```
> t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)

	Paired t-test

data:  POS$cancer_RAF and POS$normal_RAF
t = -1.5446, df = 68, p-value = 0.1271
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.010416566  0.001326732
sample estimates:
mean of the differences 
           -0.004544917 

> var.test(POS$cancer_RAF, POS$normal_RAF)

	F test to compare two variances

data:  POS$cancer_RAF and POS$normal_RAF
F = 0.51365, num df = 68, denom df = 68, p-value = 0.006684
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3180367 0.8295680
sample estimates:
ratio of variances 
         0.5136468 
```

There is no evidence that this position is 'hypo-modified', if anything it's reading out more (although non-significant) error reads.

### 28S.1321A
Modification to me1_A.

`basePlot_crc.R` analysis

![28S.1321A RAF](../../data/CRC/plot/28S.1321A_RAF.png)

```
> t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)

	Paired t-test

data:  POS$cancer_RAF and POS$normal_RAF
t = -2.4384, df = 68, p-value = 0.01737
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0083813933 -0.0008371938
sample estimates:
mean of the differences 
           -0.004609294 

> var.test(POS$cancer_RAF, POS$normal_RAF)

	F test to compare two variances

data:  POS$cancer_RAF and POS$normal_RAF
F = 1.2623, num df = 68, denom df = 68, p-value = 0.3392
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7815606 2.0386256
sample estimates:
ratio of variances 
          1.262264 
```

This position is not molecule-wide or experiment-wide significant, but taken by itself there is a difference. It's key to note that the magnitude of this change is much less than that of the 1248U difference.
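
To put the magnitudes side by side, the mean paired differences reported in the t-test outputs so far can be compared directly. This is only a sketch: the values are copied from the outputs in this notebook, and `effect` and `foldVs1248U` are hypothetical names.

```r
# Mean of the paired differences (cancer - normal), copied from the outputs above
effect = c("18S.1248U" =  0.108098,
           "28S.470A"  = -0.01514314,
           "18S.1851A" = -0.004544917,
           "28S.1321A" = -0.004609294)

# How many times larger the 1248U shift is than each other position's shift
foldVs1248U = abs(effect[["18S.1248U"]]) / abs(effect)
print(round(foldVs1248U, 1))
```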

### 28S.4532U
Modification to me3_U


`basePlot_crc.R` analysis

![28S.4532U RAF](../../data/CRC/plot/28S.4532U_RAF.png)

```
> t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)

	Paired t-test

data:  POS$cancer_RAF and POS$normal_RAF
t = -0.19977, df = 68, p-value = 0.8423
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.002957874  0.002419535
sample estimates:
mean of the differences 
          -0.0002691691 

> var.test(POS$cancer_RAF, POS$normal_RAF)

	F test to compare two variances

data:  POS$cancer_RAF and POS$normal_RAF
F = 0.7182, num df = 68, denom df = 68, p-value = 0.1749
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4446933 1.1599398
sample estimates:
ratio of variances 
         0.7182043 
```

And this position is not different between cancer and normal.


Altogether this supports the notion that there isn't an overwhelming ribosome biogenesis phenotype which can explain the 1248U reference allele frequency.


### Controls --- Bona-fide unmodified bases

One could argue that the 1248U signal is simply 'read error', but read error is of a different order of magnitude. Positions 4030, 4899 (high GC), and 4900 (high conservation) are shown below as controls.


![Control position 4030](../../data/CRC/plot/18S.4030_ctrl.png)

![High GC Control Position 4899](../../data/CRC/plot/18S.4899_hiGC_ctrl.png)

![Control position 4900](../../data/CRC/plot/18S.4900_ctrl.png)