# CPTAC - CRC Cohort 5 hg38 align
```
pi:ababaian
files: ~/Crown/data2/crc_cptac/
start: 2019 10 11
complete : 2019 -- --
```
## Introduction

No CPTAC-CRC RNA-seq quantification is available. Align the 66 RNA-seq samples to hg38 and perform differential expression analysis on them.


In [1]:
# Initialize
WORKDIR='/home/artem/Crown/data2/crc_cptac/hg38'
mkdir -p $WORKDIR; cd $WORKDIR



## Objective

1. Pilot: Align 2x CRC5 RNA-seq libraries to hg38. Confirm OK.
2. Full : Align remaining CRC5 libraries to hg38


## Materials and Methods

### Data Initialization


See `20190918 CPTAC CRC`. 

### Scripts and Localization

#### 1 - Localization

In [2]:
WORKDIR='/home/artem/Crown/data2/crc_cptac/hg38'
cd $WORKDIR
ls

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/cptac'

cptac_2.input       droneB.sh               hgr1_align_v4.cptac.sh
cptac_pilot0.input  hg38_align_v4.cptac.sh  queenB.sh


In [3]:
INPUT='cptac_pilot0.input'
# Note the different column requirements from CCLE

cat $INPUT

01CO001	crc5	SAMN03453626	SRR1999563	SRX1011590
01CO005	crc5	SAMN03453627	SRR1999549	SRX1011576
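
The input file drives every instance launch, so a quick format check before uploading can catch a malformed line. The sketch below is a hypothetical helper (`validate_input` is not part of queenB.sh). It assumes the layout seen above: five tab-separated fields, with an `SRR` run accession in the fourth.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the pipeline scripts.
# Assumes each line has 5 tab-separated fields and that field 4 is an
# SRA run accession of the form SRR<digits>, as in cptac_pilot0.input.
validate_input() {
  awk -F'\t' '
    NF != 5             { printf "line %d: expected 5 fields, found %d\n", NR, NF; bad = 1; next }
    $4 !~ /^SRR[0-9]+$/ { printf "line %d: unexpected run accession \"%s\"\n", NR, $4; bad = 1 }
    END                 { exit bad + 0 }   # non-zero exit if any line failed
  ' "$1"
}
```

Usage would be `validate_input "$INPUT" && aws s3 cp "$INPUT" "$S3URL/scripts/"`, so the upload only happens when every line passes.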

#### 2 - Script Versions

In [4]:
cd $WORKDIR
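# Echo scripts to be used for this analysis for version control.
# Note these need to be manually copied to the $WORKDIR
# (The output recorded below came from mistyped filenames,
#  hgr38_align_v4.cptac.sh and queenB.shS, so only droneB.sh was echoed.)

cat hg38_align_v4.cptac.sh
echo 
echo
cat queenB.sh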
echo 
echo
cat droneB.sh
echo 
echo 

cat: hgr38_align_v4.cptac.sh: No such file or directory


cat: queenB.shS: No such file or directory


#!/bin/bash
# droneB.sh
#

# This script-layer is necessary to launch a screen session
# on each ec2-machine. The pipeline is run within that session
# and the output is logged. This allows 'looking in' on sessions
# as they are running.

# Commands to run on server-side

SCRIPTPATH=$1

SCRIPT=$(basename $1)

shift # drop first (TASK or SCRIPT variable)

# Download pipeline / droneB's function
  aws s3 cp $SCRIPTPATH ./

  chmod 777 *.sh

# open screen; run gather.sh function. -L logged
  screen -Ldmt sh ~/$SCRIPT $@






## Results - CPTAC Pilot Run I

#### 3 - Copy local to S3

In [5]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hg38_align_v4.cptac.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
aws s3 cp dbgap.key $S3URL/scripts/


upload: ./queenB.sh to s3://crownproject/cptac/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/cptac/scripts/droneB.sh
upload: ./hg38_align_v4.cptac.sh to s3://crownproject/cptac/scripts/hg38_align_v4.cptac.sh
upload: ./cptac_pilot0.input to s3://crownproject/cptac/scripts/cptac_pilot0.input

The user-provided path dbgap.key does not exist.
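
The last upload fails because `dbgap.key` is not in `$WORKDIR`. A pre-flight check makes that visible before any file is sent. This is a minimal sketch; `require_files` is a hypothetical helper name, not something defined in the pipeline scripts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: report every listed file that is missing
# and return non-zero if any are, so the upload can be aborted early.
require_files() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the file names from the cell above:
# require_files queenB.sh droneB.sh hg38_align_v4.cptac.sh "$INPUT" dbgap.key \
#   && echo "all files present, safe to upload"
```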


In [8]:
# start
date
date -u

Thu Oct  3 09:47:47 PDT 2019
Thu Oct  3 16:47:47 UTC 2019


#### 4 - Launch and run master EC2 node

In [10]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/cptac/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh cptac_pilot0.input
#
# aws s3 cp screenlog.0 s3://crownproject/cptac/logs/cptac_pilot0.hg38.log

aws s3 cp s3://crownproject/cptac/logs/cptac_pilot0.hg38.log ./
cat cptac_pilot0.hg38.log
date -u

# Run completed successfully!


download: s3://crownproject/cptac/logs/cptac_pilot0.log to ./cptac_pilot0.log
[ec2-user@ip-172-31-40-172 ~]$ ls
ADcalc_ccle2.sh     CrownKey.pem  droneB.sh		  queenB.sh
cptac_pilot0.input  dbgap.key	  hgr1_align_v4.cptac.sh  screenlog.0
[ec2-user@ip-172-31-40-172 ~]$ bash queenB.sh cptac_pilot0.input
Launch instance # 1
Thu Oct  3 16:50:00 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/cptac/scripts/hgr1_align_v4.cptac.sh
Parameters: 01CO001 crc5 SAMN03453626 SRR1999563 SRX1011590
Instance ID: i-0e576ba89c5776da1
Public DNS: ec2-54-185-143-115.us-west-2.compute.amazonaws.com
download: s3://crownproject/cptac/scripts/hgr1_align_v4.cptac.sh to ./hgr1_align_v4.cptac.sh


Launch instance # 2
Thu Oct  3 16:53:07 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a
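
To see which sample went to which instance, the launch log can be reduced to one line per instance. The sketch below assumes the `Parameters:` and `Instance ID:` line layout shown in the log above. `summarize_launches` is a hypothetical name, and the carriage-return stripping is a guess at what a `screen` log may contain.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<sample>\t<instance id>" for each launch block
# in a queenB log. Assumes the "Parameters:" line precedes "Instance ID:".
summarize_launches() {
  awk '
    { sub(/\r$/, "") }                       # drop a trailing carriage return, if any
    /^Parameters:/  { sample = $2 }          # first parameter is the sample name
    /^Instance ID:/ { print sample "\t" $3; sample = "" }
  ' "$1"
}

# Usage:
# summarize_launches cptac_pilot0.hg38.log
```

A launch block that is cut off before its `Instance ID:` line produces no output, so the row count can be compared against the number of input lines.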

## CRC - CPTAC Full Run



In [8]:
# Repeat above with entire cohort, 45 nodes ~3x run
cd $WORKDIR
INPUT="cptac_crc.input"

cat $INPUT

01CO006	crc5	SAMN04111321	SRR2518440	SRX1288089
01CO008	crc5	SAMN04111439	SRR2518441	SRX1288090
01CO013	crc5	SAMN05127283	SRR9861902	SRX6616551
01CO014	crc5	SAMN06208758	SRR9861950	SRX6616680
01CO019	crc5	SAMN05127298	SRR9862138	SRX6615968
01CO022	crc5	SAMN06208530	SRR9861951	SRX6616681
05CO002	crc5	SAMN03453636	SRR1999486	SRX1011513
05CO003	crc5	SAMN03453647	SRR1999570	SRX1011597
05CO005	crc5	SAMN06208707	SRR9861979	SRX6615528
05CO006	crc5	SAMN03453668	SRR1999616	SRX1011643
05CO007	crc5	SAMN03453645	SRR1999590	SRX1011617
05CO011	crc5	SAMN03453615	SRR1999580	SRX1011607
05CO014	crc5	SAMN03453622	SRR1999556	SRX1011583
05CO020	crc5	SAMN04111397	SRR2518460	SRX1288109
05CO026	crc5	SAMN05127186	SRR9862145	SRX6615975
05CO028	crc5	SAMN04111354	SRR2518461	SRX1288110
05CO029	crc5	SAMN04111362	SRR2518462	SRX1288111
05CO032	crc5	SAMN04111410	SRR2518463	SRX1288112
05CO033	crc5	SAMN04111339	SRR2518464	SRX1288113
05CO034	crc5	SAMN05127120	SRR9862146	SRX6615976
05CO035	crc5	SAMN051

In [11]:
aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hg38_align_v4.cptac.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
aws s3 cp dbgap.key $S3URL/scripts/


upload: ./queenB.sh to s3://crownproject/cptac/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/cptac/scripts/droneB.sh
upload: ./hg38_align_v4.cptac.sh to s3://crownproject/cptac/scripts/hg38_align_v4.cptac.sh
upload: ./cptac_crc.input to s3://crownproject/cptac/scripts/cptac_crc.input

The user-provided path dbgap.key does not exist.


In [13]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/cptac/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh cptac_crc.input
#
# aws s3 cp screenlog.0 s3://crownproject/cptac/logs/cptac_crc.log

aws s3 cp s3://crownproject/cptac/logs/cptac_crc.log ./
cat cptac_crc.log
date -u

# Run completed successfully!


download: s3://crownproject/cptac/logs/cptac_crc.log to ./cptac_crc.log
[ec2-user@ip-172-31-40-172 ~]$ cat queenB.sh 
#!/bin/bash
# queenB.sh
# 20180814 build
# EC2 Launch / Control Script
#

# 1. queenB script is initialized locally and input files
#    are parsed ready for cluster analysis
# 2. queenB launches instances, logs in to it and runs the
#    droneB.sh script remotely.
# 3. The droneB script is executed on the instance and it
#    launches a `screen` on the instance and loads and 
#    starts to perform the $TASK (gather.sh) script.
# 4. TASK script should include an instance shut-down
#    command to close instance upon completion.
#

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/cptac'

# EC2 TASK Script - script for droneB to execute
TASK="$S3URL/scripts/hgr1_align_v4.cptac.sh"

# Parameter file:
# Each line of PARAMETERS will

In [13]:
aws s3 ls s3://crownproject/cptac/hg38/
aws s3 ls s3://crownproject/cptac/hg38/ > hg38.filelist


2019-10-12 20:37:41 6388802782 01CO001.crc5.hg38.bam
2019-10-12 20:37:55    3661208 01CO001.crc5.hg38.bam.bai
2019-10-12 20:38:24        452 01CO001.crc5.hg38.flagstat
2019-10-12 20:31:48        420 01CO001.crc5.hg38.stats
2019-10-12 20:38:36 4454509268 01CO005.crc5.hg38.bam
2019-10-12 20:39:07    3250664 01CO005.crc5.hg38.bam.bai
2019-10-12 20:39:17        450 01CO005.crc5.hg38.flagstat
2019-10-12 20:29:52        420 01CO005.crc5.hg38.stats
2019-10-12 21:22:31  133562641 01CO006.crc5.hg38.bam
2019-10-12 21:22:33    2062760 01CO006.crc5.hg38.bam.bai
2019-10-12 21:22:33        431 01CO006.crc5.hg38.flagstat
2019-10-12 22:36:44 1720106244 01CO008.crc5.hg38.bam
2019-10-12 22:37:00    4162704 01CO008.crc5.hg38.bam.bai
2019-10-12 22:37:01        446 01CO008.crc5.hg38.flagstat
2019-10-13 01:04:47 2710662377 01CO013.crc5.hg38.bam
2019-10-13 01:05:30    3074728 01CO013.crc5.hg38.bam.bai
2019-10-13 01:05:31        449 01CO013.crc5.hg38.flagstat
2019-10-13 01:20:41 3194160116 01
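
In the visible part of this listing, 01CO006 has `.bam`, `.bam.bai` and `.flagstat` entries but no `.stats` entry. The sketch below scans `hg38.filelist` for such gaps. It assumes the `aws s3 ls` column layout shown above and object names of the form `<sample>.<cohort>.<genome>.<suffix>`. `missing_outputs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list "<sample.cohort.genome>\t<suffix>" for every
# expected output suffix that is absent from an `aws s3 ls` listing.
# $1 = listing file, remaining arguments = expected suffixes.
missing_outputs() {
  local list=$1; shift
  awk -v suffixes="$*" '
    BEGIN { n = split(suffixes, suf, " ") }
    NF >= 4 {
      split($4, p, ".")
      base = p[1] "." p[2] "." p[3]          # e.g. 01CO001.crc5.hg38
      ext  = substr($4, length(base) + 2)    # e.g. bam, bam.bai, flagstat
      seen[base, ext] = 1
      bases[base] = 1
    }
    END {
      for (b in bases)
        for (i = 1; i <= n; i++)
          if (!((b, suf[i]) in seen)) print b "\t" suf[i]
    }
  ' "$list" | sort
}

# Usage:
# missing_outputs hg38.filelist bam bam.bai flagstat stats
```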

## Differential Expression - DESeq2

Launch an instance and install R and DESeq2, then switch to a high-performance instance type and run DESeq2 on the CRC-CPTAC samples.


#### make cptac.bam.list
Copy below to instance as `cptac.bam.list`

```
cptac/01CO001.crc5.hg38.bam	cptac.crc5	01CO001	84795329
cptac/01CO005.crc5.hg38.bam	cptac.crc5	01CO005	62583928
cptac/01CO006.crc5.hg38.bam	cptac.crc5	01CO006	486585
cptac/01CO008.crc5.hg38.bam	cptac.crc5	01CO008	11933501
cptac/01CO013.crc5.hg38.bam	cptac.crc5	01CO013	58595652
cptac/01CO014.crc5.hg38.bam	cptac.crc5	01CO014	54106405
cptac/01CO019.crc5.hg38.bam	cptac.crc5	01CO019	108110670
cptac/01CO022.crc5.hg38.bam	cptac.crc5	01CO022	54803753
cptac/05CO002.crc5.hg38.bam	cptac.crc5	05CO002	60715107
cptac/05CO003.crc5.hg38.bam	cptac.crc5	05CO003	61460879
cptac/05CO005.crc5.hg38.bam	cptac.crc5	05CO005	55063897
cptac/05CO006.crc5.hg38.bam	cptac.crc5	05CO006	60449439
cptac/05CO007.crc5.hg38.bam	cptac.crc5	05CO007	68929982
cptac/05CO011.crc5.hg38.bam	cptac.crc5	05CO011	60984306
cptac/05CO014.crc5.hg38.bam	cptac.crc5	05CO014	60333206
cptac/05CO020.crc5.hg38.bam	cptac.crc5	05CO020	458141
cptac/05CO026.crc5.hg38.bam	cptac.crc5	05CO026	56995432
cptac/05CO028.crc5.hg38.bam	cptac.crc5	05CO028	439542
cptac/05CO029.crc5.hg38.bam	cptac.crc5	05CO029	469877
cptac/05CO032.crc5.hg38.bam	cptac.crc5	05CO032	402848
cptac/05CO033.crc5.hg38.bam	cptac.crc5	05CO033	548880
cptac/05CO034.crc5.hg38.bam	cptac.crc5	05CO034	57530826
cptac/05CO035.crc5.hg38.bam	cptac.crc5	05CO035	57575161
cptac/05CO037.crc5.hg38.bam	cptac.crc5	05CO037	57420777
cptac/05CO039.crc5.hg38.bam	cptac.crc5	05CO039	71843026
cptac/05CO041.crc5.hg38.bam	cptac.crc5	05CO041	53702189
cptac/05CO047.crc5.hg38.bam	cptac.crc5	05CO047	54634173
cptac/05CO050.crc5.hg38.bam	cptac.crc5	05CO050	54132092
cptac/05CO053.crc5.hg38.bam	cptac.crc5	05CO053	56087556
cptac/05CO055.crc5.hg38.bam	cptac.crc5	05CO055	54200802
cptac/06CO001.crc5.hg38.bam	cptac.crc5	06CO001	498500
cptac/06CO002.crc5.hg38.bam	cptac.crc5	06CO002	536829
cptac/09CO005.crc5.hg38.bam	cptac.crc5	09CO005	56955952
cptac/09CO006.crc5.hg38.bam	cptac.crc5	09CO006	485420
cptac/09CO008.crc5.hg38.bam	cptac.crc5	09CO008	519252
cptac/09CO013.crc5.hg38.bam	cptac.crc5	09CO013	56509790
cptac/09CO014.crc5.hg38.bam	cptac.crc5	09CO014	55403403
cptac/09CO015.crc5.hg38.bam	cptac.crc5	09CO015	55708790
cptac/09CO019.crc5.hg38.bam	cptac.crc5	09CO019	54867148
cptac/11CO005.crc5.hg38.bam	cptac.crc5	11CO005	54594284
cptac/11CO018.crc5.hg38.bam	cptac.crc5	11CO018	54538190
cptac/11CO020.crc5.hg38.bam	cptac.crc5	11CO020	57107083
cptac/11CO021.crc5.hg38.bam	cptac.crc5	11CO021	55251050
cptac/11CO022.crc5.hg38.bam	cptac.crc5	11CO022	54369512
cptac/11CO027.crc5.hg38.bam	cptac.crc5	11CO027	56946657
cptac/11CO031.crc5.hg38.bam	cptac.crc5	11CO031	56092491
cptac/11CO032.crc5.hg38.bam	cptac.crc5	11CO032	54076443
cptac/11CO042.crc5.hg38.bam	cptac.crc5	11CO042	64239729
cptac/11CO044.crc5.hg38.bam	cptac.crc5	11CO044	54687567
cptac/11CO047.crc5.hg38.bam	cptac.crc5	11CO047	54512690
cptac/11CO053.crc5.hg38.bam	cptac.crc5	11CO053	55555524
cptac/11CO057.crc5.hg38.bam	cptac.crc5	11CO057	56218474
cptac/11CO058.crc5.hg38.bam	cptac.crc5	11CO058	55692957
cptac/11CO059.crc5.hg38.bam	cptac.crc5	11CO059	60834976
cptac/11CO072.crc5.hg38.bam	cptac.crc5	11CO072	53859044
cptac/11CO077.crc5.hg38.bam	cptac.crc5	11CO077	55447886
cptac/11CO079.crc5.hg38.bam	cptac.crc5	11CO079	54732361
cptac/14CO002.crc5.hg38.bam	cptac.crc5	14CO002	56414882
cptac/14CO003.crc5.hg38.bam	cptac.crc5	14CO003	56634777
cptac/15CO001.crc5.hg38.bam	cptac.crc5	15CO001	58270132
cptac/16CO012.crc5.hg38.bam	cptac.crc5	16CO012	53034108
cptac/20CO001.crc5.hg38.bam	cptac.crc5	20CO001	519522
cptac/20CO003.crc5.hg38.bam	cptac.crc5	20CO003	60072721
cptac/20CO004.crc5.hg38.bam	cptac.crc5	20CO004	54181924
cptac/20CO007.crc5.hg38.bam	cptac.crc5	20CO007	54729760
cptac/22CO006.crc5.hg38.bam	cptac.crc5	22CO006	58185486
```
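
Several libraries in this list have a fourth-column value near 500,000, while most sit above 50 million. Assuming that column is a per-library read count (the notebook does not label it), the sketch below flags entries under a cutoff. Both the 1,000,000 default and the helper name `flag_low_depth` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<sample>\t<count>" for every line of the bam
# list whose 4th column is below a minimum.
# $1 = bam list (tab-separated), $2 = cutoff (assumed default: 1000000).
flag_low_depth() {
  awk -F'\t' -v min="${2:-1000000}" '$4 + 0 < min + 0 { print $3 "\t" $4 }' "$1"
}

# Usage:
# flag_low_depth cptac.bam.list
# flag_low_depth cptac.bam.list 20000000
```

Flagged samples may deserve a look before they enter the count matrix, since a very shallow library can distort normalization.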


#### AMI instance for DESeq2



In [None]:
# Loaded Crown-AMI on t2.small
# with 1.5 TB of HD space
#
# To change instance type in GUI
# Instance > Stop; Instance Settings > Change Type

if false; # catch accidental runs
then

# Download gencode v31 reference
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gtf.gz
gzip -d *.gtf.gz

# Download RNA-seq files for CPTAC
mkdir -p cptac;
cd cptac
aws s3 cp --recursive s3://crownproject/cptac/hg38/ ./


# Install R Base + XML
sudo apt-get install r-base-core
sudo apt-get install libxml2-dev
sudo apt-get install r-cran-xml

sudo chmod -R 777 /usr/lib/R/
sudo chmod -R 777 /home/ubuntu/R/
  
R

fi


```
# {in R}
install.packages("BiocManager")
library("BiocManager")

install("Rsubread")
install("DESeq2")
install("biomaRt")
install("Rsamtools")
install("GenomicFeatures")
```

```
# {in R}
library("biomaRt")
# Connect to Ensembl BioMart (used below to look up gene symbols)
  ensembl = useMart("ensembl")
  ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

# Import bamlist
library("Rsamtools")
  bam.list = read.table('cptac.bam.list', header = F)
  bam.list = as.character(bam.list$V1)
  bamfiles <- BamFileList(bam.list, yieldSize = 200000000)


# Import annotation
library("GenomicFeatures")
  gtffile <- file.path("gencode.v31.basic.annotation.gtf")
  txdb <- makeTxDbFromGFF(gtffile, format="auto",
                          dataSource='gencode.31',
                          organism='Homo sapiens',
                          dbxrefTag = 'gene_id')


  gencode = transcripts(txdb, columns = c("tx_id", "tx_name", "gene_id", "cds_id"))
  gencode$gene_id2 = unlist(strsplit(as.character(gencode$gene_id),
                                     split = '\\.[0-9_PARY]*', fixed = F, perl = T))
    #gencode$id = gencode$tx_name

  # Subset to protein coding genes only
  gencode_pc = gencode[which(sum(is.na(gencode$cds_id)) == 0),]


  # Add Gene Metadata
  geneSymbols = getBM(attributes=c('ensembl_gene_id','hgnc_symbol','description'),
                      filters = 'ensembl_gene_id',
                      values = gencode_pc$gene_id2,
                      mart = ensembl)

  whichGrepl = function(lookup, VECTOR){
    key = which(grepl(lookup, VECTOR))

    if (length(key) == 0){
      return(0)
    }else{
      return(key[1])
    }}

  # gencode --> bioMart Conversion Key
  keyVector = unlist(lapply(gencode_pc$gene_id2,
                            FUN = whichGrepl,
                            VECTOR = geneSymbols$ensembl_gene_id))

    # Dealing with no-match in Ensembl
    noMatch = (length(geneSymbols$ensembl_gene_id) + 1)
    keyVector2 = keyVector
    keyVector2[which(keyVector2 == 0)] = noMatch

  gencode_pc$symbol = c(geneSymbols$hgnc_symbol,NA)[keyVector2]
  gencode_pc$description = c(geneSymbols$description,NA)[keyVector2]


# Count Reads per bamfile
library("Rsubread")
  # gencode = createAnnotationFile(gencode)
  # gencode1 = gencode[1:100,]
  
  # Feature counts is performed over the entire gencode file
  # including lincRNA
  # exon --> gene level mapping
  fc_SE <- featureCounts(bam.list,
                         nthreads = 7,
                         isPairedEnd = TRUE,
                         annot.ext = gtffile,
                         isGTFAnnotationFile = TRUE,
                         GTF.featureType = "exon",
                         GTF.attrType = "gene_id")
  
  
  # Save input calculations
  
  save(bam.list, bamfiles, fc_SE, gencode, gencode_pc,
       geneSymbols, keyVector,
       gtffile, txdb, file = 'deseqInitialize_cptac_191015.Rdata')
  
```
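
The R code above strips the version suffix from GENCODE gene IDs before querying BioMart. The same transformation can be spot-checked from the shell. This is a sketch rather than the notebook's method: the pattern is stricter than the R regex (`\.[0-9_PARY]*`), and `strip_gene_version` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove a trailing ".<version>" (optionally followed
# by "_PAR_Y") from each gene ID read on stdin.
strip_gene_version() {
  sed -E 's/\.[0-9]+(_PAR_Y)?$//'
}

# Usage: list the unversioned gene IDs found in the annotation, e.g.
# grep -o 'gene_id "[^"]*"' gencode.v31.basic.annotation.gtf \
#   | cut -d'"' -f2 | strip_gene_version | sort -u | head
```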


In [None]:
aws s3 cp deseqInitialize_cptac_191015.Rdata s3://crownproject/cptac/
