# Colorectal Carcinoma RNA-seq - 1
```
pi:ababaian
files: ~/Crown/data/CRC/
start: 2017 03 25
complete : 2017 03 31
```
## Introduction

I'm fairly certain, from the DNAseq/RNAseq, that 28S.59G/A is conserved across individuals. I emailed the manuscript to Jon Dinmen, and he had the same question as everyone else: what's the biological function?

Casually observing the AzaSen experiment, there weren't any all-or-none effects, but there is a pretty consistent change at 28S.59 between replicating and non-replicating cells, and between primary fibroblasts (p53 -/+) and immortalized fibroblasts (p53 -/-).

![28S 59 G / A Alleles](../../data/CRC/plot/azaSen_28S.png)

I think this position may be a pretty decent 'assay'. I don't know what the consequences of either allele are, and they may simply be in linkage with the relevant variants, but since they are so easy to study (moderate GC content in this area) it's worth trying.

The trend is: A% increases from primary to transformed cells and decreases with cell senescence. Thus...

### Hypothesis

1) In primary colorectal carcinoma (CRC) RNA-seq, the relative abundance of 28S.A will increase in the cancer sample relative to adjacent normal tissues.

2) CRC samples which are p53 mutated (I should be able to look this up) will have an increased 28S.A allele frequency.

## Materials and Methods
This analysis will be done on the GSC; the data is already there and just needs to be piped in.

`crc_align_hgr1.fa`

In [None]:
#!/bin/bash
# crc_align_hgr1.fa
# rDNA alignment pipeline
# for CRC data on GSC to hgr1
# 170325 -- 1750 build
# xhost10 

# Control Panel -------------------------------

# Project Dir
  BASE='/home/ababaian/projects/rDNA/CRC'
  cd $BASE

# Sequencing Data
  CRC_DIR='/projects/magerlab/mkarimi/Colon/human/Colon'
  LIB_LIST='crc_data.txt' # list of crc data fastq files
  
# CPU
  THREADS='4'
  
# Initialize start-up sequence ----------------
# Make working directory
  mkdir -p align
  mkdir -p flagstat
  
#Resources
  aws s3 cp s3://crownproject/resources/hgr1.fa ./
  samtools faidx hgr1.fa
  bowtie2-build hgr1.fa hgr1

# GATK variant calling resources
  aws s3 cp s3://crownproject/resources/hgr1.gatk.fa ./
  aws s3 cp s3://crownproject/resources/hgr1.gatk.fa.fai ./
  aws s3 cp s3://crownproject/resources/hgr1.gatk.dict ./


# ---------------------------------------------
# SCRIPT LOOP ---------------------------------
# ---------------------------------------------
# For each line in input LIB_LIST; run the pipeline

cat $LIB_LIST | while read LINE
do
    # Initialize run
    echo "Start Iteration:"
    echo "  $LINE"
    echo ''
    
    LIBRARY=$(echo $LINE | cut -f1 -d' ' -) # Library Name
    RGSM=$(echo $LINE | cut -f2 -d' ' -)    # Sample / Patient Identifier
    RGID=$(echo $LINE | cut -f3 -d' ' -)    # Read Group ID
    RGLB=$(echo $LINE | cut -f3 -d' ' -)    # Library Name. Accession Number
    RGPL='ILLUMINA'                   # Sequencing Platform.
    RGPO=$(echo $LINE | cut -f4 -d' ' -)    # Patient Population

    FASTQ1=$(echo $LINE | cut -f5 -d' ' -)  # Filename Read 1
    FASTQ2=$(echo $LINE | cut -f6 -d' ' -)  # Filename Read 2
    
    FQ1="$CRC_DIR/$FASTQ1"            # Fastq1 Filepath
    FQ2="$CRC_DIR/$FASTQ2"            # Fastq2 Filepath
    
    # Extract Sequencing Run Info
    RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d':' | cut -f2 -d' ')
    
    echo " Library: $LIBRARY"
    echo " SAMPLE: $RGSM"
    echo " ID: $RGID"
    echo " RGLIB: $RGLB"
    echo " Platform: $RGPL"
    echo " Population: $RGPO"
    echo " RunName: $RGPU"
    echo " Fastq1: $FQ1"
    echo " Fastq2: $FQ2"
    
    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID \
      --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -1 $FQ1 -2 $FQ2 | \
      samtools view -bS - > aligned_unsorted.bam
      
    # Calculate library flagstats
    samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat

    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Unmapped reads with mapped pairs
      # Extract Mapped Reads
      # and their unmapped pairs
      samtools view -b -F 4 aligned_unsorted.bam > align.F4.bam #mapped
      samtools view -b -f 4 -F 8 aligned_unsorted.bam > align.f4F8.bam #unmapped pairs

      # Extract just the 45S unit
      #aws s3 cp s3://crownproject/resources/rDNA_45s.bed ./
      #samtools view -b -L rDNA_45s.bed align.F4.bam > align.F4.45s.bam

      # What are the mapped readnames
      samtools view align.F4.bam | cut -f1 - > read.names.tmp

      # Extract mapped reads
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam


      # Extract cases of read pairs mapped on edge of region of interest
      # -------|======= R O I ======| ----------
      # read:                  ====---====
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

      # Complete mapped reads list
      #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

      # Extract unmapped reads with a mapped pair
      samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

      # Re-compile bam file
      cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr1.tmp.bam
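        # Legacy (pre-1.3) samtools sort syntax; with samtools >= 1.3 use:
        #   samtools sort -o align.hgr1.bam align.hgr1.tmp.bam
        samtools sort align.hgr1.tmp.bam align.hgr1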
        samtools index align.hgr1.bam
        samtools flagstat align.hgr1.bam > align.hgr1.flagstat

      # Clean up 
      rm *tmp* align.F4.bam align.f4F8.bam

    # Rename/remove the total Bam Files
      rm aligned_unsorted.bam
      mv aligned_unsorted.flagstat flagstat/$LIBRARY.flagstat

    # Rename the hgr Bam files
      mv align.hgr1.bam align/$LIBRARY.hgr1.bam
      mv align.hgr1.bam.bai align/$LIBRARY.hgr1.bam.bai
      mv align.hgr1.flagstat flagstat/$LIBRARY.hgr1.flagstat

done

# Primary VCF ----------------------------
# N/A

# Script complete

#### crc_data.xls / crc_data.txt
```
ID	Individual	Accession	Population	Read_1	Read_2
220_c	587220	587220	CRC	587220_1_1.fastq.gz	587220_1_2.fastq.gz
220_n	587220	587221	CRC	587221_1_1.fastq.gz	587221_1_2.fastq.gz
222_c	587222	587222	CRC	587222_1_1.fastq.gz	587222_1_2.fastq.gz
222_n	587222	587223	CRC	587223_1_1.fastq.gz	587223_1_2.fastq.gz
...
```
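
The alignment loop above splits each line with `echo $LINE | cut`. As a minimal sketch, assuming `crc_data.txt` is tab-separated with the header row shown here, the same columns could be read directly with `read`. The function name `parse_libs` is hypothetical and not part of the pipeline.

```shell
#!/bin/sh
# Sketch only: parse_libs is a hypothetical helper, not part of crc_align_hgr1.
# Assumes a tab-separated sample sheet with one header row and the columns
# ID, Individual, Accession, Population, Read_1, Read_2.
TAB="$(printf '\t')"

parse_libs() {
  # $1 = path to the sample sheet; tail -n +2 skips the header row
  tail -n +2 "$1" | while IFS="$TAB" read -r LIBRARY RGSM RGID RGPO FASTQ1 FASTQ2
  do
    # Skip blank or truncated lines (e.g. a literal "..." row)
    [ -n "$FASTQ2" ] || continue
    printf '%s\t%s\t%s\t%s\t%s\n' "$LIBRARY" "$RGSM" "$RGID" "$FASTQ1" "$FASTQ2"
  done
}
```

Calling `parse_libs crc_data.txt` would print one line per library with the library name, patient, accession and the two fastq file names. Reading the fields by name avoids re-running `cut` for every column.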


## Results

### Sample 220 - Pilot

In the first sample run, I manually inspected position chr13:1008007 (28S.59). 28S.59A rises from 44% of reads in normal to 49% in cancer, an increase of ~5 percentage points.

Each library is taking ~20 minutes to analyze; an estimated 69 libraries will be done in ~23 hours (all 138 would take ~46 hours at this rate).

#### Normal
```
chr13:1,008,007
Total count: 63472
A : 28190 (44%, 26995+, 1195- )
C : 30 (0%, 30+, 0- )
G : 35239 (56%, 33890+, 1349- )
T : 11 (0%, 10+, 1- )
N : 2 (0%, 2+, 0- )
---------------
DEL: 22
INS: 3
```

#### Cancer
```
chr13:1,008,007
Total count: 21730
A : 10703 (49%, 10019+, 684- )
C : 16 (0%, 14+, 2- )
G : 10996 (51%, 10398+, 598- )
T : 14 (0%, 14+, 0- )
N : 1 (0%, 1+, 0- )
---------------
DEL: 4
INS: 1
```
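
As a quick check on the two pileups above, the A-allele fraction and its change can be computed directly from the counts. This is a minimal sketch: `allele_pct` is a hypothetical helper, and the fraction is taken as A count / Total count.

```shell
#!/bin/sh
# Sketch only: allele_pct is a hypothetical helper.
# Prints an allele count as a percentage of the total read count.
allele_pct() {
  # $1 = allele count, $2 = total count
  awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * a / t }'
}

# Counts copied from the Normal and Cancer pileups at chr13:1,008,007
normal=$(allele_pct 28190 63472)
cancer=$(allele_pct 10703 21730)

echo "normal A%: $normal"
echo "cancer A%: $cancer"

# Absolute change (percentage points) and change relative to normal
awk -v n="$normal" -v c="$cancer" 'BEGIN {
  printf "absolute change: %.2f points\n", c - n
  printf "relative change: %.1f%%\n", 100 * (c - n) / n
}'
```

The absolute change comes out at just under 5 percentage points, while the change relative to the normal A% is closer to 11%, so the two should not be used interchangeably.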



## Methods - Automating Allele Counts
I can't (read: don't want to) go into each of the 138 libraries and manually extract counts for chr13:1,008,007 (28S.59). That's a computer's job.

28s58.bed: `chr13 1008006 1008007`



In [7]:
# Updated bcftools to version 1.4
bcftools --version

bcftools 1.4
Using htslib 1.4
Copyright (C) 2016 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


In [27]:
cd ~/Crown/data/CRC/

#samtools mpileup -vu -f hgr1.fa \
#  --max-depth 1000000 --min-BQ 30 \
#  -t DP,AD \
#  -l 28s58.bed 220_n.hgr1.bam | 
# tail -n1 - | cut -f 4,5,9,10 -

BAM='220_c.hgr1.bam'
# If using multiple samples per run
# add: --ignore-RG

bcftools mpileup -f hgr1.fa \
  --max-depth 1000000 --min-BQ 30 \
  -a FORMAT/DP,AD \
  -r chr13:1008007 \
  $BAM |
  bcftools annotate -x INFO,FORMAT/PL - |
  bcftools view -O v -H -
  

[mpileup] 1 samples in 1 input files
chr13	1008007	.	G	A,T,C	0	.	.	DP:AD	20257:10361,9868,14,13
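

The DP:AD field in this output can be reduced to a reference allele frequency (reference depth / total depth) without leaving the shell. A minimal sketch, assuming headerless VCF lines as printed by `bcftools view -H` with FORMAT fixed to `DP:AD`; `vcf_raf` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: vcf_raf is a hypothetical helper.
# Input : headerless VCF lines with FORMAT = DP:AD (sample columns start at 10).
# Output: chrom, pos, sample index, DP, RAF (= first AD value / DP).
vcf_raf() {
  awk -F'\t' '{
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")       # f[1] = DP, f[2] = comma-separated AD
      split(f[2], ad, ",")    # ad[1] = reference allele depth
      if (f[1] > 0)
        printf "%s\t%s\t%d\t%d\t%.4f\n", $1, $2, i - 9, f[1], ad[1] / f[1]
      else
        printf "%s\t%s\t%d\t0\tNA\n", $1, $2, i - 9
    }
  }'
}

# Example using the line printed above for 220_c
printf 'chr13\t1008007\t.\tG\tA,T,C\t0\t.\t.\tDP:AD\t20257:10361,9868,14,13\n' | vcf_raf
```

Zero-depth positions are reported as `NA` rather than dividing by zero, which mirrors the NA handling needed later in the R analysis.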


### adCalc.sh

This method slowly processes each library independently; it can be sped up greatly if all the libraries are read in one pass. Thus the adCalc.sh script was written.

I'm showing an example run locally here but the same thing can be done to a region as shown below.

Initially just the 59G position was analyzed (shown below), but on the GSC I also ran the full 18S and 28S regions on each of the 138 libraries. Alignment took ~20 minutes per library and gvcf file generation ~10 minutes each. Only base calls with a base quality/BAQ of at least 30 (`--min-BQ 30`) were used for the analysis.

#### OUTPUT FILES

- `18S_crc.gvcf` - DP:AD for chr13:1003660-1005529 across 138 libraries (69 matched tumour-normal pairs)
- `28S_crc.gvcf` - DP:AD for chr13:1007948-1013560 across 138 libraries (69 matched tumour-normal pairs)

The samples are sorted alphabetically, in the same order as `crc_data.txt`.
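
The R analysis further down relies on this ordering (odd rows cancer, even rows normal), so it may be worth checking the BAM list before running the pileup. A minimal sketch, assuming file names of the form `<patient>_c.hgr1.bam` / `<patient>_n.hgr1.bam` as in `220_c.hgr1.bam`; `check_pairs` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: check_pairs is a hypothetical helper.
# Verifies that a list of BAM names alternates <patient>_c.* then <patient>_n.*
check_pairs() {
  awk '
    NR % 2 == 1 {                 # odd line: expect the cancer library
      if ($0 !~ /_c\./) bad = 1
      patient = $0; sub(/_c\..*$/, "", patient)
      next
    }
    {                             # even line: expect the matched normal
      mate = $0; sub(/_n\..*$/, "", mate)
      if ($0 !~ /_n\./ || mate != patient) bad = 1
    }
    END {
      if (NR % 2 == 1) bad = 1    # unpaired trailing library
      print (bad ? "MISMATCH" : "OK")
      exit bad
    }
  ' "$1"
}
```

Running `check_pairs bam.list` prints `OK` and returns 0 only when every cancer library is immediately followed by the normal library of the same patient.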

In [2]:
#!/bin/bash
# ADcalc.sh
# Allelic Depth Calculator
# for a position (28S.59G>A)

# Controls -----------------
REGION='chr13:1008007'
OUTPUT='28S_59G.vcf'
DEPTH='100000'
BAMLIST='bam.list'

ls *.bam > bam.list

# Iterate through every bam file in directory
# look-up position and return VCF

    bcftools mpileup -f hgr1.fa \
  --max-depth $DEPTH --min-BQ 30 \
  -a FORMAT/DP,AD \
  -r $REGION \
  --ignore-RG \
  -b $BAMLIST |
  bcftools annotate -x INFO,FORMAT/PL - |
  bcftools view -O v -H -


[mpileup] 2 samples in 2 input files
chr13	1008007	.	G	A,C,T	0	.	.	DP:AD	20257:10361,9868,13,14	60045:33639,26366,30,10


In [None]:
#!/bin/bash
# ADcalc.sh
# Allelic Depth Calculator
# for a range of coordinates

# Controls -----------------
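# NOTE: the end coordinate below has 8 digits and does not match the
# 28S range listed under OUTPUT FILES (chr13:1007948-1013560); check before re-running
REGION='chr13:1007948-10130018'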
OUTPUT='28S_crc.gvcf'
DEPTH='100000'
BAMLIST='bam.list'

# Iterate through every bam file in directory
# look-up position and return VCF

    bcftools mpileup -f hgr1.fa \
  --max-depth $DEPTH --min-BQ 30 \
  -a FORMAT/DP,AD \
  -r $REGION \
  --ignore-RG \
  -b $BAMLIST |
  bcftools annotate -x INFO,FORMAT/PL - |
  bcftools view -O v -H - >> $OUTPUT


## R Analysis of Change in Variant Allele Frequency

The processed data was downloaded from the GSC and analyzed in R.

`~/Crown/data/CRC/crcAnalysis.r`

### crcAnalysis Script Core


```
# crcAnalysis.R
#
# Analysis of adCalc.sh
# output gvcf files
#
library(ggplot2)

# Import
GVCF = read.table('28S_crc.gvcf')
GVCF = data.frame(t(GVCF))

colnames(GVCF) = seq(1, length(GVCF[1,]))

refAllele = GVCF[4,]
altAllele = GVCF[5,]
genCoord  = GVCF[2,]
rnaCoord  = seq(-1, (length(GVCF[1,]) - 2))

sampleN = length(GVCF[,1]) - 9 # remove 9 header vcf rows
bpN     = length(genCoord)

# Functions =========================================================

# Convert DP:AD string to numeric DP (Total Depth)
dpCalc = function(inSTR){
# inSTR is from vcf
# in format DP:AD
# 2000:1500,400,50,50
# extract 2000
inSTR = as.character(inSTR)
as.numeric(unlist(strsplit(inSTR,split=':'))[1])

}

# Convert DP:AD string to numeric RD for the REFERENCE ALLELE DEPTH
# Thus Alternative_Allele_Depth = Total_Depth - Reference_Allele_Depth
# for all alternative alleles.
rdCalc = function(inSTR){
  # inSTR is from vcf
  # in format DP:AD
  # 2000:1500,400,50,50
  # extract 1500
  inSTR = as.character(inSTR)
  as.numeric(unlist(strsplit(unlist(strsplit(inSTR,split=":"))[2], split = ","))[1])
  
}


# Calculations ======================================================
# Calculate Depth of Coverage (baq > 30)
# for all positions


#Initialize DP vector
DP = vapply( GVCF[-c(1:9),1], dpCalc, 1)

#Extend the DP vector for all positions
for (i in 2:bpN){
DP = cbind(DP,
           vapply( GVCF[-c(1:9),i], dpCalc, 1) )
}

# Calculate Reference Depth of Coverage (baq > 30)
# for all positions
#
#Initialize
RD = vapply( GVCF[-c(1:9),1], rdCalc, 1)

#The rest
for (i in 2:bpN){
  RD = cbind(RD,
             vapply( GVCF[-c(1:9),i], rdCalc, 1) )
}


# Reference Allele Frequency
# Intra-Library
# RD / DP
RAF = RD / DP

# NOTE: division by zero is possible here and will introduce NAs


# Deconvolute Cancer samples from normal samples
# Odd Rows = Cancer Sample
# Even Rows = Normal Sample
# Paired for CRC
canRAF = RAF[seq(1,sampleN,2),]
normRAF = RAF[seq(2,sampleN,2),]
              
# Change in Reference Allele Frequency
# of Cancer from Normal
dRAF = canRAF - normRAF


# Calculate some descriptive statistics
# about the change in Reference Allele Frequency
# Remove NA from calculations (no sequencing depth in a library)
mean_dRAF = apply(dRAF,2,mean, na.rm = TRUE)
sd_dRAF   = apply(dRAF,2,sd, na.rm = TRUE)
var_dRAF  = apply(dRAF,2,var, na.rm = TRUE)
mean_DP = apply(DP,2,mean, na.rm = TRUE)

# Remove poorly 'covered' positions (i.e. less than 10000x coverage on average)
# the magnitude of bias is simply too high at such regions
dropPOS = (mean_DP < 10000)

canRAF[,dropPOS]  = 0
normRAF[,dropPOS] = 0


# Calculate P-value for difference of means in RAF
# between cancer and normal
# use as a 'score' 

# t.test based
Pval = t.test(canRAF[,1], normRAF[,1], paired = FALSE)$p.value

for (i in 2:bpN){
  Pval = cbind(Pval,
               t.test(canRAF[,i], normRAF[,i], paired = FALSE)$p.value)
}

# Score
Pscore = -log(Pval)
Pscore[is.na(Pscore)] = 0
Pscore = as.numeric(Pscore)


# Bonferroni correction
  # p < alpha / m
  # p < significance_cutoff / numberTests
  # p * numberTests < significance_cutoff

Pscore_bon = -log(Pval*bpN)
Pscore_bon[is.na(Pscore_bon)] = 0
Pscore_bon[Pscore_bon < 0] = 0
Pscore_bon = as.numeric(Pscore_bon)

# NOTE: this should be re-calculated as a Manhattan plot
# for the publication. T-test is a bit basic.

 plot(mean_dRAF)
 plot(var_dRAF)
 plot(mean_DP)
 plot(log(mean_DP))
 plot(Pscore)
 plot(Pscore_bon)
```
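
In equation form, the quantities computed by the script above can be summarised as follows. This is only a restatement of the code; the index symbols are my own notation and not from the script: $s$ indexes libraries, $p$ patients, $i$ positions, and $m$ is `bpN`, the number of positions tested.

$$
\mathrm{RAF}_{s,i} = \frac{\mathrm{RD}_{s,i}}{\mathrm{DP}_{s,i}}, \qquad
\Delta\mathrm{RAF}_{p,i} = \mathrm{RAF}^{\mathrm{cancer}}_{p,i} - \mathrm{RAF}^{\mathrm{normal}}_{p,i}
$$

$$
\mathrm{Pscore}_i = -\ln p_i, \qquad
\mathrm{Pscore\_bon}_i = \max\bigl(0,\ -\ln(m\,p_i)\bigr)
$$

Since $p_i < \alpha / m$ is equivalent to $-\ln(m\,p_i) > -\ln\alpha$, a position passes the Bonferroni threshold at $\alpha = 0.05$ when its `Pscore_bon` exceeds $-\ln 0.05 \approx 3.0$ (natural log, as used in the script).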

Script to plot a single position:

```
# basePlot_crc.r
#
# Analysis and plot the data
# for a single base (after running crcAnalysis.r)
#

library(ggplot2)
library(reshape2)

# Position of interest
# in hgr1 / chr13 coordinates 

pos ='1008007' # G58A / hgr1:1008007
  pos = which(genCoord == pos)

  
# Change in allele frequency between cancer-normal
POS = data.frame(dRAF[,pos])
  colnames(POS) = "dRAF"

# Raw reference allele frequency
POS$cancer_RAF = canRAF[,pos]
POS$normal_RAF = normRAF[,pos]

# Depth of coverage at position of interest
POS$cancer_DP = DP[seq(1, sampleN, 2), pos]
POS$normal_DP = DP[seq(2, sampleN, 2), pos]


# Plot canRAF vs. normRAF
POSDATA = melt(POS[,2:3])

PLOT = ggplot(POSDATA, aes(variable, value)) +
  geom_boxplot(stat = 'boxplot') + 
  geom_jitter( width = 0.2)
PLOT

# Test for significance
t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)
var.test(POS$cancer_RAF, POS$normal_RAF)


# Plot change in reference allele frequency
# compared to a normal distribution with the same
# standard deviation (null hypothesis)
POS$sim_dRAF = rnorm(sampleN/2, mean = 0,
                     sd = sd(POS$dRAF))

POSDATA = melt(POS[,c(1,6)])

# Plot dRAF alone
# PLOT = ggplot(POS, aes('delta 28S.r.59G', dRAF)) +
#   geom_boxplot(stat = 'boxplot') +
#   geom_jitter(width = 0.2)
# PLOT

# Plot dRAF vs. Normal Distribution
PLOT = ggplot(POSDATA, aes(variable, value)) +
  geom_boxplot(stat = 'boxplot') + 
  geom_jitter( width = 0.2)
PLOT

t.test(POS$dRAF, POS$sim_dRAF, paired = FALSE)
var.test(POS$dRAF, POS$sim_dRAF)
```

### 28S.r.G59A

Testing the hypothesis: there should be an increase in A59 in the colorectal carcinoma samples relative to their normal controls.

#### Cancer vs. Normal Reference Allele Frequency (intra-sample)
![28S_59G_RAF Boxplot](../../data/CRC/plot/28S_59G_RAF.png)

```
> mean(POS$cancer_RAF)
[1] 0.6393323
> mean(POS$normal_RAF)
[1] 0.6445118
```

```
> t.test(POS$cancer_RAF, POS$normal_RAF, paired = TRUE)

	Paired t-test

data:  POS$cancer_RAF and POS$normal_RAF
t = -0.43754, df = 68, p-value = 0.6631
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02880163  0.01844265
sample estimates:
mean of the differences 
           -0.005179492 

> var.test(POS$cancer_RAF, POS$normal_RAF)

	F test to compare two variances

data:  POS$cancer_RAF and POS$normal_RAF
F = 1.2305, num df = 68, denom df = 68, p-value = 0.3946
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7618793 1.9872890
sample estimates:
ratio of variances 
          1.230477 
```


There is no significant difference in the mean or variance of the 28S.59G/A allele frequency between Cancer and Normal as a population (F test, p = 0.39) or with paired testing (paired t-test, p = 0.66).


#### Cancer-Normal matched change in Reference Allele Frequency (inter-patient)
The (Cancer - Normal) difference in RAF is compared to a simulated normal distribution (mean = 0, sd = 0.0983329).


![28S 59G dRAF](../../data/CRC/plot/28S_59G_dRAF.png)

```
> t.test(POS$dRAF, POS$sim_dRAF, paired = FALSE)

	Welch Two Sample t-test

data:  POS$dRAF and POS$sim_dRAF
t = 0.94326, df = 135.85, p-value = 0.3472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01703680  0.04811067
sample estimates:
   mean of x    mean of y 
-0.005179492 -0.020716424 

> var.test(POS$dRAF, POS$sim_dRAF)

	F test to compare two variances

data:  POS$dRAF and POS$sim_dRAF
F = 1.0683, num df = 68, denom df = 68, p-value = 0.7861
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6614696 1.7253798
sample estimates:
ratio of variances 
           1.06831 
```

#### Discussion 28S.59G

28S.59G has a global-normal average allele frequency of 64.45% (i.e. ~35.5% for 59A), which is pretty consistent with the previous iVAF measurements of ~34% for this position globally. Most interesting is that there can be such a substantial change in allele frequency in the matched cancers (up to 30%), which, given the ~10+ years they take to develop plus cytogenetic abnormalities, is not completely unexpected.

This experiment fails to reject the null hypothesis: there is no detectable difference in RAF between matched cancer and normal samples. If anything, this position changes perfectly neutrally.

This is pretty exciting because it means that this method can measure changes in variant/reference allele frequency quite well, and if there's neutral change, deviations from neutrality will be a good indicator of selection. The cancer-normal pairs can be analyzed as evolutionary trajectories for rDNA fluctuations.

Notably, none of the samples reached fixation in either direction. It's unclear at the moment if this is because my sample size is too small or because both alleles of 28S.59G|A are necessary. Worth following up on.

```
> range(POS$normal_RAF)
[1] 0.1482212 0.9874421
> range(POS$cancer_RAF)
[1] 0.1427057 0.9807644
```



Follow-up Experiment: [ Selection of rRNA variant alleles in CRC ](./20170331_CRC_RNAseq_2_PRIVATE.ipynb)
