# CCLE - Pilot
```
pi:ababaian
files: ~/Crown/data2/ccle/
start: 2019 05 28
complete : 2019 06 02
```
## Introduction

With the TCGA data analysis complete and some leftover Amazon credits, I'm looking at other datasets I can analyze.

The **Cancer Cell Line Encyclopedia** (CCLE) is RNAseq from ~1000 cell lines and WGS DNAseq from ~380 of those lines.

This data is on the SRA, so it will require a tweaked download script. Otherwise, the data will be processed the same way.


## Objective

1. Pilot: Align 2x RNAseq and 2x WGS data to the `hgr1` reference sequence and QC the output.
2. Set up a full run for the entire CCLE cohort.


## Materials and Methods

### Data Initialization


From the SRA website, the CCLE project was selected: [SRP186687](https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP186687)

The data was imported into Excel for filtering and prioritization. For the pilot, RNA and WGS will be analyzed from HCT15 and HCT116.

The output of this parsing is copied to the input file: `ccle_pilot.input`

Input columns are (see below):

1. Library Name
2. Data Type
3. Sample ID
4. SRA Accession
5. Experiment Accession
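
A minimal sketch of how one row of this five-column input could be split into named fields, assuming the file is tab-delimited. The function name `print_rows` and the variable names are hypothetical and are not taken from `queenB.sh`; the two sample rows are copied from `ccle_pilot.input`.

```shell
#!/bin/bash
# Hypothetical helper: print each input row as named fields.
# Column order follows the list above.
print_rows() {
  local input="$1"
  while IFS=$'\t' read -r library type sample run experiment; do
    [ -z "$library" ] && continue   # skip blank lines
    echo "library=$library type=$type sample=$sample run=$run experiment=$experiment"
  done < "$input"
}

# Demo on a temporary file holding two pilot rows (tab-delimited).
tmp_input=$(mktemp)
printf 'HCT116\trna\tSAMN10988251\tSRR8615282\tSRX5414471\n' >  "$tmp_input"
printf 'HCT15\twgs\tSAMN10987770\tSRR8639146\tSRX5437587\n'  >> "$tmp_input"
print_rows "$tmp_input"
```

Naming the fields once at the top keeps the positional arguments (`$1` to `$5`) passed to the alignment script easy to audit.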


### Scripts and Localization

#### 1 - Localization

In [2]:
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR
ls

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'

ccle_pilot2.input      HCT116.rna.flagstat       logs
ccle_pilot.input       HCT116.rna.hgr1.flagstat  metadata
CCLE_SraRunTable.xlsx  HCT116.wgs.screenlog      old_scripts
droneB.sh              hgr1_align_v4.ccle.sh     queenB.sh


In [2]:
INPUT='ccle_pilot.input'
# Note the different column requirements for SRA data access

cat $INPUT

HCT116	rna	SAMN10988251	SRR8615282	SRX5414471
HCT116	wgs	SAMN10988251	SRR8639145	SRX5437588
HCT15	rna	SAMN10987770	SRR8615281	SRX5414472
HCT15	wgs	SAMN10987770	SRR8639146	SRX5437587

#### 2 - Script Versions

In [5]:
cd $WORKDIR
# Echo scripts to be used for this analysis for version control.
# Note these need to be manually copied to the $WORKDIR

cat hgr1_align_v3.ccle.sh
echo 
echo
cat queenB.sh
echo 
echo
cat droneB.sh
echo 
echo 

#!/bin/bash
# hgr1_align_v3.ccle.sh
# rDNA alignment pipeline - SRA version
PIPE_VERSION='190528 build -- CCLE'
AMI_VERSION='crown-180813 - ami-0031fd61f932bdef9'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

## get SRA toolkit
# wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.6-1/sratoolkit.2.9.6-1-ubuntu64.tar.gz
# aws s3 cp sratoolkit.2.9.6-1-ubuntu64.tar.gz s3://crownproject/ccle/scripts/sratoolkit.tar.gz
aws s3 cp s3://crownproject/ccle/scripts/sratoolkit.tar.gz ./
gzip -dc sratoolkit.tar.gz | tar -xf -
SRABIN="$HOME/sratoolkit.2.9.6-1-ubuntu64/bin" # binary path

# $1 : Library name + Output name(unique)
# $2 : Seq-read type (wgs|rna)
# $3 : BioSample ID
# $4 : Library SRA Accession

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject/ccle'

# CPU
	THREADS='3'

# Terminate instances upon completion (for debuggi

## Results - CCLE Pilot Run I

#### 3 - Copy local to S3

In [6]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v3.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
#aws s3 cp ../../gdc.token.txt $S3URL/scripts/gdc.token


upload: ./queenB.sh to s3://crownproject/ccle/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/ccle/scripts/droneB.sh
upload: ./hgr1_align_v3.ccle.sh to s3://crownproject/ccle/scripts/hgr1_align_v3.ccle.sh
upload: ./ccle_pilot.input to s3://crownproject/ccle/scripts/ccle_pilot.input


In [8]:
# start
date
date -u

Wed May 29 18:50:27 PDT 2019
Thu May 30 01:50:27 UTC 2019


#### 4 - Launch and run master EC2 node

In [9]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open a logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_pilot.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_pilot.log

aws s3 cp s3://crownproject/ccle/logs/ccle_pilot.log ./
cat ccle_pilot.log
date -u

# Run completed successfully

download: s3://crownproject/ccle/logs/ccle_pilot.log to ./ccle_pilot.log
[ec2-user@ip-172-31-17-5 ~]$ ls
ccle_pilot.input  droneB.sh		 queenB.sh    sratoolkit.tar.gz
CrownKey.pem	  hgr1_align_v3.ccle.sh  screenlog.0
[ec2-user@ip-172-31-17-5 ~]$ bash queenB.sh ccle_pilot.input
Error 02 - Duplicate Sample ID detected in input
 I hope you know what you're doing
 script will not exit in this version
 re-run with unique ID or outputs will overwrite

Launch instance # 1
Thu May 30 01:50:46 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v3.ccle.sh
Parameters: HCT116 rna SAMN10988251 SRR8615282 SRX5414471
Instance ID: i-081e0951d2b25ecda
Public DNS: ec2-54-245-171-125.us-west-2.compute.amazonaws.com
download: s3://crownpro
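
The `Error 02` warning above is presumably triggered because each cell line appears on two rows (one `rna`, one `wgs`) with the same library name and BioSample ID; the logic inside `queenB.sh` is not shown here, so that is an assumption. Since the pilot outputs are named by library and data type (e.g. `HCT116.rna.flagstat`), one way to separate real collisions from this benign case would be to check the combined `library.type` prefix. A sketch with a hypothetical `duplicate_prefixes` function, assuming tab-delimited input:

```shell
#!/bin/bash
# Hypothetical check: list output prefixes (library.type) that occur more
# than once in the input file. This is NOT the check used by queenB.sh.
duplicate_prefixes() {
  awk -F'\t' 'NF >= 2 { n[$1 "." $2]++ }
              END { for (k in n) if (n[k] > 1) print k }' "$1" | sort
}

# Demo: the four pilot rows share library names but not library.type prefixes.
tmp_input=$(mktemp)
printf 'HCT116\trna\tSAMN10988251\tSRR8615282\tSRX5414471\n' >  "$tmp_input"
printf 'HCT116\twgs\tSAMN10988251\tSRR8639145\tSRX5437588\n' >> "$tmp_input"
printf 'HCT15\trna\tSAMN10987770\tSRR8615281\tSRX5414472\n'  >> "$tmp_input"
printf 'HCT15\twgs\tSAMN10987770\tSRR8639146\tSRX5437587\n'  >> "$tmp_input"
duplicate_prefixes "$tmp_input"   # prints nothing for the pilot rows
```

An empty result means no two rows would write to the same output file names.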

**Thu May 30 03:15:34 UTC 2019:**

On instance 1, the fastq files are still downloading. The instance (c4.xlarge) has now been running for over an hour, and the bottleneck is the data download. This would be very inefficient on a larger cohort.

`1.1G May 30 03:15 SRR8615282_1.fastq.gz`. 

**Thu May 30 03:46:53 UTC 2019:**
`-rw-rw-r--  1 ubuntu ubuntu 1.4G May 30 03:46 SRR8615282_1.fastq.gz`

**Thu May 30 16:13:58 UTC 2019**

Next AM: the runs did not work. Need to remove or fix the single-end read detection, as it fails with this script. It also takes far too long to download the data from SRA. I need to use a swarm of t2.micro machines to download the data, store it on S3, and then launch the c4 aligners from that t2.micro machine. One more level of abstraction.
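
One possible replacement for the single-end detection is to check which files `fastq-dump --split-files` left behind once the download has finished: a paired run yields both `_1` and `_2` files, a single-end run only `_1`. This is a sketch under that assumption, not the logic in `hgr1_align_v3.ccle.sh`; the function name `read_layout` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report the read layout of a finished fastq-dump.
# Usage: read_layout <run accession> [directory]
read_layout() {
  local run="$1" dir="${2:-.}"
  if [ -s "$dir/${run}_1.fastq.gz" ] && [ -s "$dir/${run}_2.fastq.gz" ]; then
    echo paired
  elif [ -s "$dir/${run}_1.fastq.gz" ]; then
    echo single
  else
    echo missing
    return 1
  fi
}

# Demo with placeholder files standing in for real downloads.
demo_dir=$(mktemp -d)
echo placeholder > "$demo_dir/SRR8615282_1.fastq.gz"
echo placeholder > "$demo_dir/SRR8615282_2.fastq.gz"
read_layout SRR8615282 "$demo_dir"
```

The check is only meaningful after the dump has completed, since both files grow in parallel during the download.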

**Fri May 31 02:08:23 UTC 2019**

The RNA runs are both complete. The WGS runs are both still downloading (>24 hours). Haven't even set up the manual alignments yet. These files will require even longer downloads.

```
-rw-rw-r-- 1 ubuntu ubuntu 16G May 31 02:08 SRR8639145_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 17G May 31 02:08 SRR8639145_2.fastq.gz
...
-rw-rw-r--  1 ubuntu ubuntu  15G May 31 02:09 SRR8639146_1.fastq.gz
-rw-rw-r--  1 ubuntu ubuntu  16G May 31 02:09 SRR8639146_2.fastq.gz
```

**Fri May 31 15:15:56 UTC 2019**

The DNA WGS data is still not downloaded, and the download rate seems to be at a standstill. The rate-limiting step may be server-side I/O in generating the fastq files. `prefetch` of the raw SRA files is likely to be faster.

```
-rw-rw-r--  1 ubuntu ubuntu  18G May 31 15:17 SRR8639146_1.fastq.gz
-rw-rw-r--  1 ubuntu ubuntu  19G May 31 15:17 SRR8639146_2.fastq.gz
```

**Fri May 31 19:39:16 UTC 2019:**

Accidentally logged into the wrong node (HCT116 wgs) and stopped the screen.
```
-rw-rw-r--  1 ubuntu ubuntu  21G May 31 19:37 SRR8639145_1.fastq.gz
-rw-rw-r--  1 ubuntu ubuntu  21G May 31 19:37 SRR8639145_2.fastq.gz

ubuntu@ip-172-31-17-192:~/ncbi$ du -ch
395M	./public/refseq
31G	./public/sra
32G	./public
32G	.
32G	total

```

Will have to restart. Note the inefficiency of fastq-dump; it's totally unfeasible. Shut down this node. Keep the HCT15 WGS node running.

In [10]:
date -u
aws s3 cp s3://crownproject/ccle/tmp/screen.fail.1 ./
cat screen.fail.1

Thu May 30 16:13:58 UTC 2019
download: s3://crownproject/ccle/tmp/screen.fail.1 to ./screen.fail.1
download: s3://crownproject/ccle/scripts/sratoolkit.tar.gz to ./sratoolkit.tar.gz
 -- hgr1 Alignment Pipeline -- 
 version: 190528 build -- CCLE 
 ami:     crown-180813 - ami-0031fd61f932bdef9  
 s3:      s3://crownproject/ccle  
 library: HCT116 -- rna
 date:    Thu May 30 01:53:55 UTC 2019

Initializing ...
Download

In [12]:
# HCT116 RNA run finished and looks good
date -u
aws s3 cp s3://crownproject/ccle/hgr1/HCT116.rna.flagstat ./
aws s3 cp s3://crownproject/ccle/hgr1/HCT116.rna.hgr1.flagstat ./

cat HCT116.rna.flagstat
echo '------------------'
cat HCT116.rna.hgr1.flagstat

Thu May 30 18:17:14 UTC 2019
download: s3://crownproject/ccle/hgr1/HCT116.rna.flagstat to ./HCT116.rna.flagstat
download: s3://crownproject/ccle/hgr1/HCT116.rna.hgr1.flagstat to ./HCT116.rna.hgr1.flagstat
167789816 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
381747 + 0 mapped (0.23% : N/A)
167789816 + 0 paired in sequencing
83894908 + 0 read1
83894908 + 0 read2
367836 + 0 properly paired (0.22% : N/A)
373224 + 0 with itself and mate mapped
8523 + 0 singletons (0.01% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
------------------
390270 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
381747 + 0 mapped (97.82% : N/A)
390270 + 0 paired in sequencing
195135 + 0 read1
195135 + 0 read2
367836 +
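
The mapped percentage that flagstat reports is simply mapped reads over total reads, so it can be recomputed from the two relevant lines when comparing libraries. A minimal sketch; the function name `mapped_percent` is hypothetical, and the sample lines are copied from `HCT116.rna.flagstat` above.

```shell
#!/bin/bash
# Hypothetical helper: recompute the mapped percentage from a flagstat file.
# Matches the "in total" line and the "mapped (" line only, so the
# "with itself and mate mapped" line is ignored.
mapped_percent() {
  awk '/in total/    { total  = $1 }
       / mapped \(/  { mapped = $1 }
       END { if (total > 0) printf "%.2f\n", 100 * mapped / total }' "$1"
}

# Demo on three lines taken from the HCT116 RNA flagstat output.
tmp_flagstat=$(mktemp)
cat > "$tmp_flagstat" <<'EOF'
167789816 + 0 in total (QC-passed reads + QC-failed reads)
381747 + 0 mapped (0.23% : N/A)
373224 + 0 with itself and mate mapped
EOF
mapped_percent "$tmp_flagstat"
```

For this library the result agrees with the 0.23% printed by flagstat itself.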

## Materials and Methods II

The old architecture for AWS doesn't work here. Will update the AMI to accommodate SRA.


In [13]:
date -u

Fri May 31 15:41:25 UTC 2019


In [15]:
## Test download speeds
## prefetch only vs. fastq split
## Open 2 of the same nodes and download two files of ~equal size with
## different commands.
##


## REMOTE
## On T3A test nodes

## Common
#aws s3 cp s3://crownproject/ccle/scripts/sratoolkit.tar.gz ./
#gzip -dc sratoolkit.tar.gz | tar -xf -
#SRABIN="$HOME/sratoolkit.2.9.6-1-ubuntu64/bin" # binary path

## Test prefetch:
## Jurkat 6304Mb SRR8615712
## K562 8212Mb SRR8615717

#screen -L
#mkdir -p align; cd align;
#SRABIN="$HOME/sratoolkit.2.9.6-1-ubuntu64/bin"
#echo Jurkat; date; $SRABIN/prefetch -X 100G -O ./ SRR8615712; date;\
#        echo K562; $SRABIN/prefetch -X 100G -O ./ SRR8615717; date

## -O option copies sra to local dir; use regular cache instead
# aws s3 cp --recursive ./ s3://crownproject/ccle/sra/

# Use `fasterq-dump`


## START: Fri May 31 15:43:02 UTC 2019
## FILE:  Fri May 31 15:50:29 UTC 2019
## END:   Fri May 31 15:58:20 UTC 2019

## Test fastq-dump
## KMH2 6271Mb SRR8615908
## HL60 8281Mb SRR8616133

#screen -L
#mkdir -p align; cd align;
#SRABIN="$HOME/sratoolkit.2.9.6-1-ubuntu64/bin"
#echo KMH2; date; $SRABIN/fastq-dump --gzip --split-files SRR8615908; date;\
#        echo HL60; $SRABIN/fastq-dump --gzip --split-files SRR8616133; date

## START: Fri May 31 15:43:09 UTC 2019
## ..   : Fri May 31 18:02:11 UTC 2019
## ..   : Fri May 31 19:37:45 UTC 2019 (still not done)

## fastq-dump didn't even finish the first file.
## killing prematurely, this is obviously far less efficient



In [14]:
cd ~/Crown/data2/ccle/
cat prefetch.log

[1m[7m%[27m[1m[0m                                                                                       ]1;~/Desktop]2;artem@glitch[~/Desktop][0m[27m[24m[Jartem@glitch[Desktop] [K[55C[1m[ 6:53PM][0m[64D[?1h=[?2004hssh -i "~/.ssh/CrownKey.pem" ubuntu@ec2-54-212-204-255.us-west-2.compute.amazonaws.com[K[?1l>[?2004l
^C
[1m[7m%[27m[1m[0m                                                                                       ]1;~/Desktop]2;artem@glitch[~/Desktop][0m[27m[24m[Jartem@glitch[Desktop] [K[55C[1m[ 6:54PM][0m[64D[?1h=[?2004hssh -i "~/.ssh/CrownKey.pem" ubuntu@ec2-54-212-204-255.us-west-2.compute.amazonaws.com[K[K                  c [K[A[86C[K[1B[K[A[86C         [K[1C[1m[ 6:54PM][0m[10D                  ec2-54-245-171-125.[Kus-west-2. [Kccompute.amazonaws.com[?1l>[?2004l
Welcome to Ubuntu 16.04.1 LTS (GNU/Linu

In [None]:
# Install Aspera ASCP for faster DL (possibly)
# https://download.asperasoft.com/download/sw/connect/3.9.1/ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.tar.gz

wget https://download.asperasoft.com/download/sw/connect/3.9.1/ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.tar.gz
mv ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.tar.gz ascp.tar.gz
aws s3 cp ascp.tar.gz s3://crownproject/ccle/scripts/ascp.tar.gz

aws s3 cp s3://crownproject/ccle/scripts/ascp.tar.gz ./
gzip -dc ascp.tar.gz | tar -xf -
bash ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.sh

SRABIN="$HOME/sratoolkit.2.9.6-1-ubuntu64/bin" # binary path
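# Aspera Connect install path: /home/ec2-user/.aspera/connect

# Double quotes so that $HOME expands; $SRA is the run accession to fetch
$SRABIN/prefetch --ascp-path \
  "$HOME/.aspera/connect/bin/ascp|$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
  "$SRA"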

## Test fastq-dump
## KMH2 6271Mb SRR8615908
## HL60 8281Mb SRR8616133

SRABIN="$HOME/sratoolkit.2.9.6-1-ubuntu64/bin" # binary path
echo KMH2; date -u; $SRABIN/prefetch --ascp-path \
  "$HOME/.aspera/connect/bin/ascp|$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
  SRR8615908; date -u; \
echo HL60; $SRABIN/prefetch --ascp-path \
  "$HOME/.aspera/connect/bin/ascp|$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
  SRR8616133; date -u
  
  
## START: Fri May 31 19:23:10 UTC 2019
## FILE:  Fri May 31 19:35:15 UTC 2019
## END:    ~  May 31 19:47

#test fasterq-dump
# memory; threads; split
#$SRABIN/fasterq-dump --mem 500MB -e 2 -S

# SRA Files are stored with 'reference sequence' requisites
# download times can be minimized if an update of the TCGA run node is made with
# the reference files already pre-downloaded. They won't have to re-download.
#
# 418M May 31 19:27 SRR8615908.sra.vdbcache.cache #SRR file finished here.



In [16]:
## Update EC2 Instance for rapid SRA compatibility
## Launch AMI: crown-180914 (ami-096bcb9d18c32d4d5)
## t2.large
## 

# sudo apt-get update

## Clear out unnecessary TCGA data (this isn't a data instance) ==========================
## -128 Gb
# rm -r ~/tcga/*
# rmdir ~/tcga
# rm ~/logs/*

## Install Software Updates
# cd software

## SRA TOOLKIT 2.9.6-1-ubuntu64 ============================================================
# wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
# gzip -dc sratoolkit.current-ubuntu64.tar.gz | tar -xf -
# mv sratoolkit.current-ubuntu64.tar.gz zips/
# ln -fs /home/ubuntu/software/sratoolkit.2.9.6-1-ubuntu64/bin/* ~/bin/

## ASCP 3.9.1.171801.linux ================================================================
# wget https://download.asperasoft.com/download/sw/connect/3.9.1/ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.tar.gz
# gzip -dc ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.tar.gz | tar -xf -
# bash ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.sh
# rm ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.sh

## Download refseq reference files for SRA (hg38 I think)
# mkdir -p ~/tmp; cd ~/tmp

## SRA prefetch resources - RNAseq
# prefetch --ascp-path "$HOME/.aspera/connect/bin/ascp|$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" SRR8615908
# prefetch --ascp-path "$HOME/.aspera/connect/bin/ascp|$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" SRR8616133

## SRA prefetch resources - DNAseq
# prefetch -X 100G --ascp-path "$HOME/.aspera/connect/bin/ascp|$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" SRR8639145

## clear SRA files (retaining refseq files)
# rm /home/ubuntu/ncbi/public/sra/*

## Bowtie2 SRA ============================================================================
## NOTE: Update bowtie2 to newest version and it can read SRA files directly for alignment
## this will bypass alignment dump steps completely.

# wget -O bowtie2-2.3.5.1-sra-linux-x86_64.zip 'https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.5.1/bowtie2-2.3.5.1-sra-linux-x86_64.zip?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Fbowtie-bio%2Ffiles%2Fbowtie2%2F2.3.5.1%2Fbowtie2-2.3.5.1-sra-linux-x86_64.zip%2Fdownload&ts=1559337015'
# unzip bowtie2-2.3.5.1-sra-linux-x86_64
# cd ~/software/bowtie2-2.3.5.1-sra-linux-x86_64
# mv bowtie* ~/bin/
# which bowtie2 ## /home/ubuntu/bin/bowtie2

## Test new bowtie2 with SRA
# testing SRA align
# cd ~/tmp
# bowtie2 --very-fast -p 1 \
#      --rg-id TEST --rg LB:TEST --rg SM:TEST \
#      --rg PL:TEST --rg PU:TEST \
#      -x hgr1 --sra SRR8639145 | \
#      ~/bin/samtools view -bS - > aligned_unsorted.bam

## Appears to work transparently... will take too long to run the entire process here
      

## Copy to S3 Zips
# aws s3 cp bowtie2-2.3.5.1-sra-linux-x86_64.zip s3://crownproject/software/
# aws s3 cp ibm-aspera-connect-3.9.1.171801-linux-g2.12-64.tar.gz s3://crownproject/software/
# aws s3 cp sratoolkit.2.9.6-1-ubuntu64/sratoolkit.current-ubuntu64.tar.gz s3://crownproject/software/

# Final du -ch
# 8.5G    total


## Take AMI image
## Save AMI: crown-190601 (ami-0b375c9c58cb4a7a2)
## description: aligner drone -- sra compatible




### Test Run II

Test the WGS files for alignment. This should be much faster.


#### Input Files / Scripts

In [19]:
cd $WORKDIR

INPUT='ccle_pilot2.input'
cat $INPUT

echo ''
echo ''

HCT116	wgs	SAMN10988251	SRR8639145	SRX5437588
HCT15	wgs	SAMN10987770	SRR8639146	SRX5437587



In [20]:
# Echo scripts to be used for this analysis for version control.
# Note these need to be manually copied to the $WORKDIR


cat hgr1_align_v4.ccle.sh
echo 
echo
cat queenB.sh
echo 
echo
cat droneB.sh
echo 
echo 

#!/bin/bash
# hgr1_align_v4.ccle.sh
# rDNA alignment pipeline - SRA version
PIPE_VERSION='190531 build -- CCLE'
AMI_VERSION='crown-190601 - ami-0b375c9c58cb4a7a2'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name + Output name(unique)
# $2 : Seq-read type (wgs|rna)
# $3 : BioSample ID
# $4 : Library SRA Accession

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject/ccle'

# CPU
	THREADS='3'

# Terminate instances upon completion (for debuggin)
  TERMINATE='TRUE'
    
# Read Group Data
  LIBRARY=$1    # Library Name / File prefix
  TYPE=$2       # wgs OR rna data-type
	RGPO='ccle'   # Patient Population - CCLE
	RGSM=$3       # Sample / Patient Identifer
	RGID=$4       # Read Group ID. SRA Accession Number
  RGLB=$LIBRARY # Library Name. Accession Number
  RGPL='ILLUMINA' # Seq Platform
  RGPU=$5      

In [21]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
#aws s3 cp ../../gdc.token.txt $S3URL/scripts/gdc.token


upload: ./queenB.sh to s3://crownproject/ccle/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/ccle/scripts/droneB.sh
upload: ./hgr1_align_v4.ccle.sh to s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
upload: ./ccle_pilot2.input to s3://crownproject/ccle/scripts/ccle_pilot2.input


#### Remote Node Launch


In [None]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open a logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_pilot2.input; aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_pilot2.log

date -u
aws s3 cp s3://crownproject/ccle/logs/ccle_pilot2.log ./
cat ccle_pilot2.log


# Run completed successfully

Post error fix re-launch

```
Launch instance # 1
Sat Jun  1 00:06:24 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: HCT15 wgs SAMN10987770 SRR8639146 SRX5437587
Instance ID: i-082c3fa46aeb746b5
Public DNS: ec2-54-218-77-163.us-west-2.compute.amazonaws.com
Warning: Permanently added 'ec2-54-218-77-163.us-west-2.compute.amazonaws.com,172.31.27.17' 
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh

```

#### Errors
The HCT116 wgs pipe failed; it looks like the `prefetch` command didn't work. Stay and play on the HCT116 drone node to fix.

A possible issue is that `prefetch` in `~/bin/` is a link to the binary in the software folder. Will try an explicit call.

Line 70: `prefetch ...` changed to `$HOME/bin/prefetch ...`. And it works. Retry the HCT15 node from the CCLE queen.
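
Since the explicit path fixed the `command not found` error, a likely cause is that `~/bin` is not on `PATH` in the shell that runs the pipeline; that is an assumption, as the notes do not confirm it. A more defensive pattern is to resolve each tool once at the top of the script and fail early if it is missing. A sketch with a hypothetical `resolve_tool` helper:

```shell
#!/bin/bash
# Hypothetical helper: find a tool on PATH, else in a fallback directory.
# Prints the resolved path, or returns 1 with a message on stderr.
resolve_tool() {
  local name="$1" fallback_dir="$2"
  if command -v "$name" >/dev/null 2>&1; then
    command -v "$name"
  elif [ -x "$fallback_dir/$name" ]; then
    echo "$fallback_dir/$name"
  else
    echo "ERROR: $name not found on PATH or in $fallback_dir" >&2
    return 1
  fi
}

# Demo: resolve prefetch before any download is attempted.
if PREFETCH=$(resolve_tool prefetch "$HOME/bin"); then
  echo "using prefetch at: $PREFETCH"
else
  echo "prefetch unavailable; stop before the alignment steps"
fi
```

Failing at this point would also have prevented the empty `HCT116.wgs` outputs from being uploaded to S3 after the download step failed.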

```
 -- hgr1 Alignment Pipeline -- 
 version: 190531 build -- CCLE 
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2  
 s3:      s3://crownproject/ccle  
 library: HCT116 -- wgs
 date:    Fri May 31 23:50:04 UTC 2019

Initializing ...
Download SRA file: SRR8639145
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8639145
/home/ubuntu/hgr1_align_v4.ccle.sh: line 70: prefetch: command not found
SRA Input Pipe

Starting hgr1 alignment
Warning: Could not open read file "SRR8639145" for reading; skipping...
Error: No input read files were valid
(ERR): bowtie2-align exited with value 1
Alignment complete.
Calculate flagstats.
Subset reads (retain mapped & their pairs, remove unmapped).
Recompiling mapped bam file.
Processing complete. Copy files to S3
upload: ./HCT116.wgs.flagstat to s3://crownproject/ccle/hgr1/HCT116.wgs.flagstat
upload: ./HCT116.wgs.hgr1.bam to s3://crownproject/ccle/hgr1/HCT116.wgs.hgr1.bam
upload: ./HCT116.wgs.hgr1.bam.bai to s3://crownproject/ccle/hgr1/HCT116.wgs.hgr1.bam.bai
upload: ./HCT116.wgs.hgr1.flagstat to s3://crownproject/ccle/hgr1/HCT116.wgs.hgr1.flagstat
Create log files and copy to S3
upload: ./HCT116.wgs.screenlog to s3://crownproject/ccle/logs/HCT116.wgs.screenlog
^C
```


In [23]:
## post bug-fix
cat hgr1_align_v4.ccle.sh
aws s3 cp hgr1_align_v4.ccle.sh s3://crownproject/ccle/scripts/


#!/bin/bash
# hgr1_align_v4.ccle.sh
# rDNA alignment pipeline - SRA version
PIPE_VERSION='190531 build -- CCLE'
AMI_VERSION='crown-190601 - ami-0b375c9c58cb4a7a2'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name + Output name(unique)
# $2 : Seq-read type (wgs|rna)
# $3 : BioSample ID
# $4 : Library SRA Accession

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject/ccle'

# CPU
	THREADS='3'

# Terminate instances upon completion (for debuggin)
  TERMINATE='TRUE'
    
# Read Group Data
  LIBRARY=$1    # Library Name / File prefix
  TYPE=$2       # wgs OR rna data-type
	RGPO='ccle'   # Patient Population - CCLE
	RGSM=$3       # Sample / Patient Identifer
	RGID=$4       # Read Group ID. SRA Accession Number
  RGLB=$LIBRARY # Library Name. Accession Number
  RGPL='ILLUMINA' # Seq Platform
  RGPU=$5      

In [3]:
## Ran into error with bowtie2 alignment
## need to de-bug
##
## Download was successful, so the explicit prefetch works.
## problem may be with bowtie2 scripts since it was 'installed'
## with a mv command instead of a ln command
##

cd $WORKDIR
cat HCT116.wgs.screenlog

ubuntu@ip-172-31-30-89:~$ bash hgr1_align_v4.ccle.sh HCT116 wgs SAMN10988251 SRR8639145 SRX5437588
 -- hgr1 Alignment Pipeline -- 
 version: 190531 build -- CCLE 
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2  
 s3:      s3://crownproject/ccle  
 library: HCT116 -- wgs
 date:    Fri May 31 23:55:55 UTC 2019

Initializing ...
Download SRA file: SRR8639145
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8639145

2019-05-31T23:55:58 prefetch.2.9.3: 1) Downloading 'SRR8639145'...
2019-05-31T23:55:58 prefetch.2.9.3:  Downloading via fasp...
                                                SRR8639145                                     

In [7]:
cat ccle_pilot3.input
echo ''

## re-run with a KMH2 RNAseq test set
## set "TERMINATE=FALSE"
## 
## change bowtie2 command to: "~/bin/bowtie2 --very-sensitive-local -p $THREADS \"...

cd $WORKDIR

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp ccle_pilot3.input $S3URL/scripts/

# Download on remote head node

KMH2	rna	SAMN10988578	SRR8615908	SRX5415142
upload: ./queenB.sh to s3://crownproject/ccle/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/ccle/scripts/droneB.sh
upload: ./hgr1_align_v4.ccle.sh to s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
upload: ./ccle_pilot3.input to s3://crownproject/ccle/scripts/ccle_pilot3.input


```
Launch instance # 1
Sun Jun  2 19:17:34 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: KMH2 rna SAMN10988578 SRR8615908 SRX5415142
Instance ID: i-0ee9fe81089d31795
```

Error appears on line 87: `-x hgr1 -sra $SRA | ` should be  `-x hgr1 --sra-acc $SRA | `

Will fix locally and re-run above box.

```
Launch instance # 1
Sun Jun  2 19:45:11 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: KMH2 rna SAMN10988578 SRR8615908 SRX5415142
Instance ID: i-009efec44e015cd87
Public DNS: ec2-35-155-253-153.us-west-2.compute.amazonaws.com
Warning: Permanently added 'ec2-35-155-253-153.us-west-2.compute.amazonaws.com,172.31.18.88' (ECDSA) to the list of known hosts.
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh
```

It worked! =D On manual inspection, the alignments look good. I don't anticipate (lol) a difference between RNA and WGS, so I can open up the analysis to a wider pipe.

Time to set up shop.

In [8]:
ls -alh bam/

total 155M
drwxr-xr-x 2 artem artem 4.0K Jun  2 14:53 .
drwxrwxr-x 6 artem artem 4.0K Jun  2 14:42 ..
-rw-rw-r-- 1 artem artem  25M May 30 11:13 HCT116.rna.hgr1.bam
-rw-rw-r-- 1 artem artem  632 May 30 11:13 HCT116.rna.hgr1.bam.bai
-rw-r--r-- 1 artem artem  33M May 30 11:53 HCT15.rna.hgr1.bam
-rw-r--r-- 1 artem artem  632 May 30 11:53 HCT15.rna.hgr1.bam.bai
-rw-rw-r-- 1 artem artem  97M Jun  2 14:44 KMH2.rna.hgr1.bam
-rw-rw-r-- 1 artem artem  632 Jun  2 14:42 KMH2.rna.hgr1.bam.bai
