# CCLE - hgr1 alignments
```
pi:ababaian
files: ~/Crown/data2/ccle/
start: 2019 06 02
complete : 2019 06 28
```
## Introduction

The **Cancer Cell Line Encyclopedia** (CCLE) includes RNAseq data from ~1000 cell lines and WGS DNAseq data from ~380 of those lines.

This will be an analysis of a sub-set of the data to measure costs and efficiency. As it progresses, it will be extended to include all CCLE WGS and RNAseq data.

In [1]:
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR
ls

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'

1_1248macp_quantification.Rmd  ccle.sra.table         metadata
ADcalc_ccle2.sh                ccle.wgs.gvcf.log      old_scripts
bam                            droneB.sh              plots
ccle.m1a.Rdata                 gvcf                   queenB.sh
ccle.m3u.Rdata                 hgr1_align_v4.ccle.sh  VAF_disease.pdf
ccle.macp.Rdata                inputs                 VAF_tissue.pdf
ccle.Rproj                     input.set.table
CCLE_SraRunTable.xlsx          macp.blood.csv


## Objective

1. Analyze a sub-set of CCLE data to find any inefficiencies, bugs or hang-ups.

This will allow us to:

2. Set up a full run for the entire CCLE data cohort.

3. Set up a run for additional CCLE WGS data alignments.


## Materials and Methods -- Initial Run

### Data Initialization


From the SRA website, the CCLE project was selected: [SRP186687](https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP186687)

The data were imported into Excel for filtering and prioritization. For the pilot, RNA and WGS will be analyzed from HT115 and HCT116.

The output of this parsing is copied to the input file: `ccle_pilot.input`

Input columns are (see below):

1. Library Name
2. Data Type
3. Sample ID
4. SRA Accession
5. Experiment Accession
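
A quick column check before launching can catch malformed rows. The sketch below assumes the input file is tab-separated with exactly the five fields listed above and that the data type is either `wgs` or `rna`; `check_input` is a hypothetical helper name and is not part of `queenB.sh` or `droneB.sh`.

```shell
#!/bin/bash
# check_input FILE
# Hypothetical helper (not part of the pipeline scripts): report any row
# that does not have exactly 5 tab-separated fields, or whose data type
# (column 2) is not 'wgs' or 'rna'. Returns non-zero if a bad row is found.
check_input() {
  awk -F'\t' '
    NF != 5 || ($2 != "wgs" && $2 != "rna") {
      printf "line %d: bad row: %s\n", NR, $0
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Example using two rows from the Set 1 input, written to a temp file
tmp=$(mktemp)
printf 'HCT116\twgs\tSAMN10988251\tSRR8639145\tSRX5437588\n'  > "$tmp"
printf 'HCT15\twgs\tSAMN10987770\tSRR8639146\tSRX5437587\n' >> "$tmp"
check_input "$tmp" && echo "input OK"
```

Running the check on every `ccle_set*.input` file before copying it to S3 would flag rows with missing or extra columns before any instance is launched.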


#### Set 1

Set 1 selects WGS data from HCT116 and HCT15, and RNA data from every cell line annotated as `lower intestine` or `upper intestine`.


### Scripts and Localization

#### 1 - Localization

In [1]:
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR
ls

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'

 1_1248macp_quantification.Rmd  '~$CCLE_SraRunTable.xlsx'   logs
 bam                             CCLE_SraRunTable.xlsx      metadata
 ccle.filelist                   ccle.sra.table             old_scripts
 ccle.m1a.Rdata                  ccle.wgs.filelist          plots
 ccle.m3u.Rdata                  droneB.sh                  queenB.sh
 ccle.macp.Rdata                 gvcf                       VAF_disease.pdf
 ccle.rna.filelist               hgr1_align_v4.ccle.sh      VAF_tissue.pdf
 ccle.Rproj                      inputs
 ccle_set6.input                 input.set.table


In [2]:
INPUT='ccle_set1.input'
# Note the different column requirements for SRA data access

cat $INPUT

HCT116	wgs	SAMN10988251	SRR8639145	SRX5437588
HCT15	wgs	SAMN10987770	SRR8639146	SRX5437587
C2BBe1	rna	SAMN10987952	SRR8616200	SRX5415202
CACO2	rna	SAMN10989600	SRR8616194	SRX5415208
CCK81	rna	SAMN10987776	SRR8615251	SRX5414502
CL11	rna	SAMN10989560	SRR8616139	SRX5414911
CL14	rna	SAMN10988457	SRR8616140	SRX5414910
CL34	rna	SAMN10989562	SRR8616141	SRX5414909
CL40	rna	SAMN10988464	SRR8616142	SRX5414908
COLO201	rna	SAMN10987883	SRR8616145	SRX5414905
COLO320	rna	SAMN10988542	SRR8616146	SRX5414904
COLO678	rna	SAMN10988483	SRR8615792	SRX5414608
CW2	rna	SAMN10987729	SRR8615958	SRX5415092
GP2d	rna	SAMN10988004	SRR8615381	SRX5414372
HCC56	rna	SAMN10988141	SRR8615451	SRX5414302
Hs255.T	rna	SAMN10987994	SRR8615417	SRX5414336
Hs675.T	rna	SAMN10988123	SRR8615783	SRX5414617
Hs698.T	rna	SAMN10988478	SRR8615779	SRX5414621
HT115	rna	SAMN10987744	SRR8616044	SRX5415006
HT29	rna	SAMN10988348	SRR8616032	SRX5415018
HT55	rna	SAMN10988014	SRR8616033	SRX5415017
HuTu80	rna	SAMN10988182	SRR86

#### 2 - Script Versions

In [5]:
cd $WORKDIR
# Echo scripts to be used for this analysis for version control.
# Note these need to be manually copied to the $WORKDIR

cat hgr1_align_v4.ccle.sh
echo 
echo
cat queenB.sh
echo 
echo
cat droneB.sh
echo 
echo 

#!/bin/bash
# hgr1_align_v4.ccle.sh
# rDNA alignment pipeline - SRA version
PIPE_VERSION='190531 build -- CCLE'
AMI_VERSION='crown-190601 - ami-0b375c9c58cb4a7a2'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name + Output name(unique)
# $2 : Seq-read type (wgs|rna)
# $3 : BioSample ID
# $4 : Library SRA Accession

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject/ccle'

# CPU
	THREADS='3'

# Terminate instances upon completion (for debugging)
  TERMINATE='TRUE'
    
# Read Group Data
  LIBRARY=$1    # Library Name / File prefix
  TYPE=$2       # wgs OR rna data-type
	RGPO='ccle'   # Patient Population - CCLE
	RGSM=$3       # Sample / Patient Identifier
	RGID=$4       # Read Group ID. SRA Accession Number
  RGLB=$LIBRARY # Library Name. Accession Number
  RGPL='ILLUMINA' # Seq Platform
  RGPU=$5      

## Results - CCLE Set 1 Run

Initial run with runtime testing and `Intestine` cell lines analyzed.

#### 3 - Copy local to S3

In [4]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/


upload: ./queenB.sh to s3://crownproject/ccle/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/ccle/scripts/droneB.sh
upload: ./hgr1_align_v4.ccle.sh to s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
upload: ./ccle_set1.input to s3://crownproject/ccle/scripts/ccle_set1.input


In [6]:
# start
date
date -u

Sun Jun  2 15:18:41 PDT 2019
Sun Jun  2 22:18:41 UTC 2019


#### 4 - Launch and run master EC2 node

In [7]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set1.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set1.log

aws s3 cp s3://crownproject/ccle/logs/ccle_set1.log ./
cat ccle_set1.log
date -u

# Run completed successfully

download: s3://crownproject/ccle/logs/ccle_set1.log to ./ccle_set1.log
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set1.input
Launch instance # 1
Sun Jun  2 22:20:22 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: HCT116 wgs SAMN10988251 SRR8639145 SRX5437588
Instance ID: i-0dd7c732e4db661e7
Public DNS: ec2-54-187-231-155.us-west-2.compute.amazonaws.com
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Sun Jun  2 22:23:30 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: HCT15 wgs SAMN10987770 SRR8639146 SRX5437587
Instance ID: i-08e724002e0854aac
Public DNS: ec2-34-219-29-94.us-west-2.comput

#### Example RNAseq analysis

```
-- hgr1 Alignment Pipeline --
 version: 190531 build -- CCLE
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2
 s3:      s3://crownproject/ccle
 library: C2BBe1 -- rna
 date:    Sun Jun  2 22:29:48 UTC 2019

Initializing ...
Download SRA file: SRR8616200
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8616200

2019-06-02T22:29:52 prefetch.2.9.3: 1) Downloading 'SRR8616200'...
2019-06-02T22:29:52 prefetch.2.9.3:  Downloading via fasp...
SRR8616200
2019-06-02T22:56:09 prefetch.2.9.3:  fasp download succeed
2019-06-02T22:56:09 prefetch.2.9.3: 1) 'SRR8616200' was downloaded successfully
2019-06-02T22:56:32 prefetch.2.9.3: 'SRR8616200' has 0 unresolved dependencies
2019-06-02T22:56:33 prefetch.2.9.3: 'SRR8616200' has remote vdbcache
2019-06-02T22:56:33 prefetch.2.9.3:  Downloading vdbcache...
2019-06-02T22:56:33 prefetch.2.9.3:  Downloading via https...
2019-06-02T22:56:48 prefetch.2.9.3:  vdbcache was downloaded successfully
SRA Input Pipe
...
```

In this run of the C2BBe1 cell line, SRA download took `26 minutes` (22:29 to 22:56).

Completion of first alignment took `46 minutes` (23:40).

Re-processing reads into hgr1.bam took `55 minutes` (00:35).
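
The step durations quoted in these examples are read off the log timestamps by hand. As a minimal sketch, the same arithmetic can be scripted; this assumes GNU `date` (the `-d` option) and treats the timestamps as UTC. `elapsed_min` is a hypothetical helper name.

```shell
#!/bin/bash
# elapsed_min START END
# Print the whole minutes elapsed between two ISO-8601 timestamps,
# both interpreted as UTC. Assumes GNU date (-d), not BSD/macOS date.
elapsed_min() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 60 ))
}

# Timestamps copied from the C2BBe1 prefetch log above:
# first "Downloading" line to "fasp download succeed"
elapsed_min 2019-06-02T22:29:52 2019-06-02T22:56:09
```

The same helper can be applied to the first and last timestamp of any step in a library's log to compare run times across RNAseq and WGS libraries.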

#### Example RNAseq T84
```
-- hgr1 Alignment Pipeline --
 version: 190531 build -- CCLE
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2
 s3:      s3://crownproject/ccle
 library: T84 -- rna
 date:    Mon Jun  3 03:56:01 UTC 2019

Initializing ...
Download SRA file: SRR8615829
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8615829

2019-06-03T03:56:04 prefetch.2.9.3: 1) Downloading 'SRR8615829'...
2019-06-03T03:56:04 prefetch.2.9.3:  Downloading via fasp...
SRR8615829
2019-06-03T04:47:22 prefetch.2.9.3:  fasp download succeed
2019-06-03T04:47:22 prefetch.2.9.3: 1) 'SRR8615829' was downloaded successfully
2019-06-03T04:47:43 prefetch.2.9.3: 'SRR8615829' has 0 unresolved dependencies
2019-06-03T04:47:45 prefetch.2.9.3: 'SRR8615829' has remote vdbcache
2019-06-03T04:47:45 prefetch.2.9.3:  Downloading vdbcache...
2019-06-03T04:47:45 prefetch.2.9.3:  Downloading via https...
2019-06-03T04:50:26 prefetch.2.9.3:  vdbcache was downloaded successfully
SRA Input Pipe

Starting hgr1 alignment
```

Download time: `54 min` (03:56 to 04:50).

Alignments: `3 hours` (04:50 to 06:29, pending)

S3 Upload: `4 hours` (07:49)

#### Example DNAseq HCT116

```
-- hgr1 Alignment Pipeline --
 version: 190531 build -- CCLE
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2
 s3:      s3://crownproject/ccle
 library: HCT116 -- wgs
 date:    Sun Jun  2 22:23:29 UTC 2019

Initializing ...
Download SRA file: SRR8639145
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8639145

2019-06-02T22:23:32 prefetch.2.9.3: 1) Downloading 'SRR8639145'...
2019-06-02T22:23:32 prefetch.2.9.3:  Downloading via fasp...
SRR8639145
2019-06-03T01:21:27 prefetch.2.9.3:  fasp download succeed
2019-06-03T01:21:27 prefetch.2.9.3: 1) 'SRR8639145' was downloaded successfully
2019-06-03T01:21:52 prefetch.2.9.3: 'SRR8639145' has 0 unresolved dependencies
2019-06-03T01:21:53 prefetch.2.9.3: 'SRR8639145' has remote vdbcache
2019-06-03T01:21:53 prefetch.2.9.3:  Downloading vdbcache...
2019-06-03T01:21:53 prefetch.2.9.3:  Downloading via https...
2019-06-03T01:22:36 prefetch.2.9.3:  vdbcache was downloaded successfully
SRA Input Pipe

Starting hgr1 alignment
```

Download time: `3 hours` (22:23 to 01:22).

Primary alignment: `13 hours` (06:28, pending)

S3 Upload: `nearly 16 hours` (14:10)

#### Example DNAseq HCT15

```
-- hgr1 Alignment Pipeline --
 version: 190531 build -- CCLE
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2
 s3:      s3://crownproject/ccle
 library: HCT15 -- wgs
 date:    Sun Jun  2 22:26:37 UTC 2019

Initializing ...
Download SRA file: SRR8639146
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8639146

2019-06-02T22:26:41 prefetch.2.9.3: 1) Downloading 'SRR8639146'...
2019-06-02T22:26:41 prefetch.2.9.3:  Downloading via fasp...
SRR8639146
2019-06-03T00:05:33 prefetch.2.9.3:  fasp download succeed
2019-06-03T00:05:33 prefetch.2.9.3: 1) 'SRR8639146' was downloaded successfully
2019-06-03T00:05:59 prefetch.2.9.3: 'SRR8639146' has 0 unresolved dependencies
2019-06-03T00:06:00 prefetch.2.9.3: 'SRR8639146' has remote vdbcache
2019-06-03T00:06:00 prefetch.2.9.3:  Downloading vdbcache...
2019-06-03T00:06:00 prefetch.2.9.3:  Downloading via https...
2019-06-03T00:06:23 prefetch.2.9.3:  vdbcache was downloaded successfully
SRA Input Pipe

Starting hgr1 alignment
521383474 reads; of these:
  521383474 (100.00%) were paired; of these:
    521079292 (99.94%) aligned concordantly 0 times
    303640 (0.06%) aligned concordantly exactly 1 time
    542 (0.00%) aligned concordantly >1 times
    ----
    521079292 pairs aligned concordantly 0 times; of these:
      6992 (0.00%) aligned discordantly 1 time
    ----
    521072300 pairs aligned 0 times concordantly or discordantly; of these:
      1042144600 mates make up the pairs; of these:
        1042102889 (100.00%) aligned 0 times
        41228 (0.00%) aligned exactly 1 time
        483 (0.00%) aligned >1 times
0.06% overall alignment rate
Alignment complete.
Calculate flagstats.
Subset reads (retain mapped & their pairs, remove unmapped).
...
```
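
The alignment summary in the log above can be reduced to one or two numbers for comparison across libraries. The sketch below pulls the overall rate out of a summary and recomputes the concordant-pair percentage from the HCT15 counts; `overall_rate` is a hypothetical helper, and the summary layout is assumed to match the one shown above.

```shell
#!/bin/bash
# overall_rate: read an aligner summary on stdin and print the
# percentage from the "overall alignment rate" line, without the % sign.
overall_rate() {
  awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }'
}

# Concordantly aligned pairs (exactly 1 time plus >1 times) as a
# percentage of all pairs, using the HCT15 counts from the log above.
awk 'BEGIN { printf "%.4f\n", 100 * (303640 + 542) / 521383474 }'

# Extract the overall rate from a one-line example summary
printf '0.06%% overall alignment rate\n' | overall_rate
```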

Download time: `2 hours` (22:26 to 00:06).

Alignments: `7 hours` for a 59G bam file (00:06 to 04:41, pending)

Upload to S3: `9 hours total` (07:18)


#### Set A Finish

58 RNAseq samples finished before the 2 WGS.

## Results -- Set 2

Run the same analysis on Set 2; include `Blood, Bone, and Lung` cell lines.

In [9]:
# Initialize
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'
INPUT='ccle_set2.input'
cat $INPUT
echo ''

# Local Folder Operations
aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
echo ''

date -u

697	rna	SAMN10988566	SRR8616164	SRX5414886
A3/KAW	rna	SAMN10987848	SRR8616021	SRX5415029
A4/Fuk	rna	SAMN10987839	SRR8616016	SRX5415034
A427	rna	SAMN10989597	SRR8616014	SRX5415036
A549	rna	SAMN10988257	SRR8616017	SRX5415033
A673	rna	SAMN10987881	SRR8616012	SRX5415038
ABC1	rna	SAMN10988302	SRR8615407	SRX5414346
ALLSIL	rna	SAMN10988449	SRR8615411	SRX5414342
AML193	rna	SAMN10987604	SRR8615409	SRX5414344
AMO1	rna	SAMN10989573	SRR8615408	SRX5414345
BCP1	rna	SAMN10987880	SRR8615772	SRX5414628
BDCM	rna	SAMN10988332	SRR8615770	SRX5414630
BEN	rna	SAMN10988541	SRR8615771	SRX5414629
BL41	rna	SAMN10987706	SRR8616119	SRX5414931
BL70	rna	SAMN10988162	SRR8616118	SRX5414932
BV173	rna	SAMN10988488	SRR8616198	SRX5415204
C8166	rna	SAMN10989599	SRR8616202	SRX5415200
CA46	rna	SAMN10987878	SRR8616193	SRX5415209
CADOES1	rna	SAMN10988559	SRR8615273	SRX5414480
CAL12T	rna	SAMN10987846	SRR8615269	SRX5414484
CAL78	rna	SAMN10988508	SRR8615527	SRX5414226
Calu1	rna	SAMN10988371	SRR8615533	SRX5414

In [2]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/; chmod 400 ~/.ssh/<KEY>.pem
#
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set2.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set2.log

cd ~/Crown/data2/ccle/
aws s3 cp s3://crownproject/ccle/logs/ccle_set2.log ./
cat ccle_set2.log
date -u

# Run completed successfully

download: s3://crownproject/ccle/logs/ccle_set2.log to ./ccle_set2.log
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set2.input
Launch instance # 1
Mon Jun  3 16:54:54 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: 697 rna SAMN10988566 SRR8616164 SRX5414886
Instance ID: i-02d19a4b1211ba92a
Public DNS: ec2-34-219-29-58.us-west-2.compute.amazonaws.com
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Mon Jun  3 16:58:01 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: A3/KAW rna SAMN10987848 SRR8616021 SRX5415029
Instance ID: i-0ccd46854da4b83b4
Public DNS: ec2-54-218-79-244.us-west-2.co

## Results -- Set 3

The last set ran efficiently, so the remaining cell lines will be run now.

In [3]:
# Initialize
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'
INPUT='ccle_set3.input'
cat $INPUT
echo ''

# Local Folder Operations
aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
echo ''

date -u

5637	rna	SAMN10987658	SRR8616172	SRX5414878
23132.87	rna	SAMN10988573	SRR8616168	SRX5414882
22Rv1	rna	SAMN10988315	SRR8616169	SRX5414881
253J	rna	SAMN10987961	SRR8616166	SRX5414884
253JBV	rna	SAMN10987963	SRR8616167	SRX5414883
42MGBA	rna	SAMN10988275	SRR8616173	SRX5414877
59M	rna	SAMN10988041	SRR8616171	SRX5414879
639V	rna	SAMN10989559	SRR8616170	SRX5414880
647V	rna	SAMN10989565	SRR8616165	SRX5414885
769P	rna	SAMN10988198	SRR8615643	SRX5414757
786O	rna	SAMN10988199	SRR8615642	SRX5414758
8305C	rna	SAMN10988447	SRR8615645	SRX5414755
8505C	rna	SAMN10988556	SRR8615644	SRX5414756
8MGBA	rna	SAMN10988303	SRR8615647	SRX5414753
A101D	rna	SAMN10988170	SRR8615646	SRX5414754
A1207	rna	SAMN10988410	SRR8615649	SRX5414751
A172	rna	SAMN10988258	SRR8615648	SRX5414752
A204	rna	SAMN10987802	SRR8615638	SRX5414762
A2058	rna	SAMN10988063	SRR8615637	SRX5414763
A253	rna	SAMN10988034	SRR8616018	SRX5415032
A2780	rna	SAMN10988216	SRR8616019	SRX5415031
A375	rna	SAMN10988347	SRR8616020	SRX5415

In [1]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/; chmod 400 ~/.ssh/<KEY>.pem
#
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set3.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set3.log

# ETA: June 8 23:00

aws s3 cp s3://crownproject/ccle/logs/ccle_set3.log ./
cat ccle_set3.log
date -u

# Run completed successfully

download: s3://crownproject/ccle/logs/ccle_set3.log to ./ccle_set3.log
[ec2-user@ip-172-31-31-245 ~]$ ls
ccle_pilot2.input  ccle_set2.input  droneB.sh		   screenlog.0
ccle_pilot3.input  ccle_set3.input  hgr1_align_v4.ccle.sh
ccle_set1.input    CrownKey.pem     queenB.sh
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set3.input
Launch instance # 1
Thu Jun  6 12:52:40 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: 5637 rna SAMN10987658 SRR8616172 SRX5414878
Instance ID: i-055e3347c0e451014
Public DNS: ec2-34-222-145-102.us-west-2.compute.amazonaws.com
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Thu Jun  6 12:5

#### Errors

One instance hung during download; it was cancelled manually and the `253J` cell line will be dropped from this set. All but this one ran fine.

```
 -- hgr1 Alignment Pipeline --
 version: 190531 build -- CCLE
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2
 s3:      s3://crownproject/ccle
 library: 253J -- rna
 date:    Thu Jun  6 13:05:16 UTC 2019

Initializing ...
Download SRA file: SRR8616166
  cmd: prefetch -X 100G --ascp-path <PATH> SRR8616166

2019-06-06T13:05:22 prefetch.2.9.3: 1) Downloading 'SRR8616166'...
2019-06-06T13:05:22 prefetch.2.9.3:  Downloading via fasp...
SRR8616166
```

Three bam files were generated with an improper name and contain no alignment data: the libraries `NCIH2077`, `RS4` and `TM87` all produced files named `TM87..hgr1.bam`. Most likely the `rna` data-type column was not generated correctly in one of the processing steps. These three libraries will be dropped from the initial GVCF calculations.

This was caused by semi-colons in the names of these libraries in input set 2.
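
One way to avoid this class of failure is to normalize library names before they are written to an input file. The sketch below is an illustration only, not what the pipeline did: `sanitize_name` is a hypothetical helper that replaces any character outside letters, digits, `.`, `_` and `-` with an underscore.

```shell
#!/bin/bash
# sanitize_name NAME
# Hypothetical helper: replace every character outside [A-Za-z0-9._-]
# with '_' so the result is safe to use as a file prefix and as an
# unquoted shell argument.
sanitize_name() {
  printf '%s\n' "$1" | sed 's/[^A-Za-z0-9._-]/_/g'
}

# A slash, a space, and a semi-colon (the last name is made up)
sanitize_name 'A3/KAW'
sanitize_name 'Hs 274.T'
sanitize_name 'name;with;semicolons'
```

Applying a substitution like this to column 1 of the input table keeps the library name, the output file prefix, and the read-group fields consistent with each other.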

## Results -- GVCF Calculation (CCLE)

On a single instance, calculate the GVCF file for all CCLE data across the ROI, in two sets.

Use a modified version of the `adCalc_TCGA.sh` script.



In [9]:
# DNS: ec2-54-203-15-245.us-west-2.compute.amazonaws.com
# AMI: i-04aa55c347e233e33 (TCGA aligner)
# Instance: m4.4xlarge

# ON REMOTE:

## Copy CCLE files into its dir
#mkdir -p ~/ccle; cd ccle;
#aws s3 cp --recursive s3://crownproject/ccle/ ./

## AMI not saved

## Copy over file list to s3
# ls -alr ./* > ccle.filelist
# aws s3 cp ccle.filelist s3://crownproject/ccle/gvcf/

# cd ~/ccle/hgr1/
# rm NCIH2077*
# rm RS4*
# rm TM87*

## Run ADcalc script.sh
# screen -L
# bash ADcalc_ccle.sh

# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle.gvcf.log

aws s3 cp s3://crownproject/ccle/logs/ccle.gvcf.log ./
cat ccle.gvcf.log

## DONE

download: s3://crownproject/ccle/logs/ccle.gvcf.log to ./ccle.gvcf.log
ubuntu@ip-172-31-35-24:~$ bash ADcalc_ccle.sh
Analyzing ccle...
Started processing 18S
[mpileup] 945 samples in 945 input files
Done with 18S 
chr13:1003660-1005529
Started processing 18SE
[mpileup] 945 samples in 945 input files
Done with 18SE 
chr13:1005529-1005629
Started processing 5S
[mpileup] 945 samples in 945 input files
Done with 5S 
chr13:10219-10340
Started processing 5.8S
[mpileup] 945 samples in 945 input files
Done with 5.8S 
chr13:1006622-1006779
Started processing 28S
[mpileup] 945 samples in 945 input files
Done with 28S 
chr13:1007948-1013018
upload: ./ccle.18SE.gvcf to s3://crownproject/ccle/gvcf/ccle.18SE.gvcf

In [None]:
#!/bin/bash
# ADcalc_ccle.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/ccle/scripts/ADcalc_ccle.sh

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instances upon completion (for debugging)
TERMINATE='FALSE'

# S3 Output directory
S3DIR='s3://crownproject/ccle/gvcf/'

# Script ------------------
BAMLIST='bam.list.tmp'

cd ~/ccle/
mkdir -p GVCF #Output Folder

TYPE='ccle' # hardcode single ccle run
cd hgr1

#for TYPE in $(echo "hgr1")
#do
    echo Analyzing $TYPE...
    #cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam > ../GVCF/$TYPE.bamlist
          
    for index in "${!GENES[@]}"
    do
      printf "Started processing %s\n" "${GENES[$index]}"
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Pile up reads from every bam file listed in $BAMLIST over the
      # region and keep per-sample depth (DP) and allelic depth (AD)
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth "$DEPTH" -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r "${REGIONS[$index]}" \
        --ignore-RG \
        -b "$BAMLIST" | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> "$OUTPUT"

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" "${GENES[$index]}"
      printf "%s\n" "${REGIONS[$index]}"

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../GVCF
aws s3 cp --recursive ./ $S3DIR

# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi
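
The region lengths noted in the script comments can be re-derived from the coordinate strings. This is a small sketch using the `REGIONS` and `GENES` arrays from `ADcalc_ccle.sh`; `region_len` is a hypothetical helper and assumes inclusive `chr:start-end` coordinates.

```shell
#!/bin/bash
# region_len REGION
# Print the inclusive length of a 'chr:start-end' region string.
region_len() {
  local range=${1#*:}       # strip 'chr13:' leaving 'start-end'
  local start=${range%-*}   # text before the dash
  local end=${range#*-}     # text after the dash
  echo $(( end - start + 1 ))
}

# Same arrays as in ADcalc_ccle.sh
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
         'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# Print gene name and region length, tab-separated
for i in "${!GENES[@]}"; do
  printf '%s\t%s\n' "${GENES[$i]}" "$(region_len "${REGIONS[$i]}")"
done
```

The printed lengths can be compared against the per-gene values listed in the script comments as a sanity check whenever the coordinates are edited.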


## Results -- CCLE CRC Lines WGS 1

With funds available, and since TARGET approvals are unlikely to come through this month, the remaining credits will be used for cell line WGS analysis.


In [3]:
# Initialize
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'
INPUT='ccle_set4.input'
cat $INPUT
echo ''

# Local Folder Operations
# slight mods
# Max instance = 45
# instance size = 300 Gb

# align script
# allow for 200G prefetch download (files up to 120G)

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
echo ''

date -u

A3KAW	wgs	SAMN10987848	SRR8639174	SRX5437559
C2BBe1	wgs	SAMN10987952	SRR8639203	SRX5437530
CCK81	wgs	SAMN10987776	SRR8639191	SRX5437542
COLO201	wgs	SAMN10987883	SRR8639227	SRX5437506
DB	wgs	SAMN10987903	SRR8639211	SRX5437522
EOL1	wgs	SAMN10988490	SRR8639143	SRX5437590
GP2d	wgs	SAMN10988004	SRR8639178	SRX5437555
HEL9217	wgs	SAMN10987842	SRR8639141	SRX5437592
HT115	wgs	SAMN10987744	SRR8652111	SRX5449789
HT29	wgs	SAMN10988348	SRR8652114	SRX5449786
HuTu80	wgs	SAMN10988182	SRR8652109	SRX5449791
KARPAS299	wgs	SAMN10988031	SRR8652059	SRX5449841
KM12	wgs	SAMN10988369	SRR8652076	SRX5449824
KMS11	wgs	SAMN10987857	SRR8652075	SRX5449825
LAMA84	wgs	SAMN10988484	SRR8652091	SRX5449809
LoVo	wgs	SAMN10988310	SRR8652098	SRX5449802
LS180	wgs	SAMN10988327	SRR8652101	SRX5449799
LS411N	wgs	SAMN10988286	SRR8652100	SRX5449800
LS513	wgs	SAMN10987976	SRR8652099	SRX5449801
MM1S	wgs	SAMN10988254	SRR8652132	SRX5449768
MONOMAC1	wgs	SAMN10988495	SRR8652131	SRX5449769
MV411	wgs	SAMN10988366	SRR86

In [4]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/; chmod 400 ~/.ssh/<KEY>.pem
#
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set4.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set4.log

# ETA: June 8 23:00

aws s3 cp s3://crownproject/ccle/logs/ccle_set4.log ./
cat ccle_set4.log
date -u

## Done

download: s3://crownproject/ccle/logs/ccle_set4.log to ./ccle_set4.log
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set4.input
Launch instance # 1
Tue Jun 11 08:51:53 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: A3KAW wgs SAMN10987848 SRR8639174 SRX5437559
Instance ID: i-0dc3106b682b30644
Public DNS: ec2-34-215-245-167.us-west-2.compute.amazonaws.com
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Tue Jun 11 08:55:14 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: C2BBe1 wgs SAMN10987952 SRR8639203 SRX5437530
Instance ID: i-017d3bf5dfe18ab15
Public DNS: ec2-18-236-158-61.us-west-2.compute.amazon

In [6]:
aws s3 ls --summarize --recursive s3://crownproject/ccle/ | grep 'wgs' -


2019-06-12 09:02:28        430 ccle/hgr1/A3KAW.wgs.flagstat
2019-06-12 09:02:28   66798443 ccle/hgr1/A3KAW.wgs.hgr1.bam
2019-06-12 09:02:29       1008 ccle/hgr1/A3KAW.wgs.hgr1.bam.bai
2019-06-12 09:02:30        418 ccle/hgr1/A3KAW.wgs.hgr1.flagstat
2019-06-11 23:46:53        433 ccle/hgr1/C2BBe1.wgs.flagstat
2019-06-11 23:46:53   76134295 ccle/hgr1/C2BBe1.wgs.hgr1.bam
2019-06-11 23:46:57        896 ccle/hgr1/C2BBe1.wgs.hgr1.bam.bai
2019-06-11 23:46:58        423 ccle/hgr1/C2BBe1.wgs.hgr1.flagstat
2019-06-12 03:15:48        430 ccle/hgr1/CCK81.wgs.flagstat
2019-06-12 03:15:48   43682859 ccle/hgr1/CCK81.wgs.hgr1.bam
2019-06-12 03:15:49        912 ccle/hgr1/CCK81.wgs.hgr1.bam.bai
2019-06-12 03:15:50        418 ccle/hgr1/CCK81.wgs.hgr1.flagstat
2019-06-11 23:25:47        430 ccle/hgr1/COLO201.wgs.flagstat
2019-06-11 23:25:47   49363678 ccle/hgr1/COLO201.wgs.hgr1.bam
2019-06-11 23:25:50        752 ccle/hgr1/COLO201.wgs.hgr1.bam.bai
2019-06-11 23:25:50        418 ccle/hgr1/COL

## Results -- CCLE WGS 2

Continuation of WGS alignments

In [7]:
# Initialize
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'
INPUT='ccle_set5.input'
cat $INPUT
echo ''

# Local Folder Operations
# slight mods
# Max instance = 45
# instance size = 300 Gb

# align script
# allow for 200G prefetch download (files up to 120G)

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
echo ''

date -u

NCIH2077	rna	SAMN10989610	SRR8615794	SRX5414606
RS4	rna	SAMN10987766	SRR8615434	SRX5414319
TM87	rna	SAMN10989624	SRR8616086	SRX5414964
769P	wgs	SAMN10988198	SRR8639134	SRX5437599
786O	wgs	SAMN10988199	SRR8639133	SRX5437600
8305C	wgs	SAMN10988447	SRR8639136	SRX5437597
A101D	wgs	SAMN10988170	SRR8639135	SRX5437598
A204	wgs	SAMN10987802	SRR8639132	SRX5437601
A2058	wgs	SAMN10988063	SRR8639131	SRX5437602
A375	wgs	SAMN10988347	SRR8639175	SRX5437558
A549	wgs	SAMN10988257	SRR8639173	SRX5437560
A704	wgs	SAMN10987988	SRR8639172	SRX5437561
ABC1	wgs	SAMN10988302	SRR8639171	SRX5437562
ACHN	wgs	SAMN10988309	SRR8639170	SRX5437563
AU565	wgs	SAMN10987882	SRR8639167	SRX5437566
BFTC909	wgs	SAMN10988547	SRR8639207	SRX5437526
BT20	wgs	SAMN10987894	SRR8639208	SRX5437525
BT474	wgs	SAMN10988308	SRR8639205	SRX5437528
BT483	wgs	SAMN10988523	SRR8639206	SRX5437527
Caki1	wgs	SAMN10987640	SRR8639204	SRX5437529
CAL120	wgs	SAMN10989564	SRR8639201	SRX5437532
CAL51	wgs	SAMN10988576	SRR8639209	SRX543

In [10]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/; chmod 400 ~/.ssh/<KEY>.pem
#
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set5.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set5.log

# 88/181 ran on jun 14 05:52
# approx 3 day run time.

aws s3 cp s3://crownproject/ccle/logs/ccle_set5.log ./
cat ccle_set5.log
date -u

## Done

download: s3://crownproject/ccle/logs/ccle_set5.log to ./ccle_set5.log
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set5.input
Launch instance # 1
Thu Jun 13 09:49:39 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: NCIH2077 rna SAMN10989610 SRR8615794 SRX5414606
Instance ID: i-07db749f16bfa55d1
Public DNS: ec2-54-244-70-171.us-west-2.compute.amazonaws.com
Offending key for IP in /home/ec2-user/.ssh/known_hosts:452
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Thu Jun 13 09:52:48 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: RS4 rna SAMN10987766 SRR

#### Failed Runs
The following instance failed during download for unknown reasons and will have to be re-run (or dropped):

HCC38 - wgs - SRR8639157

Five more libraries are still running with 100G alignments.

Done by June 18.


In [3]:
date -u

aws s3 ls s3://crownproject/ccle/hgr1/ > ccle.filelist

grep 'wgs' ccle.filelist > ccle.wgs.filelist

## 216 / 329 WGS files ran

Tue Jun 18 21:19:22 UTC 2019


## Results -- Set 6 RNA-seq clean-up

On review of the data, ~20 libraries did not make it into the GVCF run; these will be run as a "clean-up" set to fill in the missing data. In addition, all breast cancer cell lines were missed and will be analyzed here. These libraries will need their own GVCF calculation file, so a script-level merge will be required in the analysis.
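
Finding which libraries are missing can be done by comparing an input table against an S3 file list. The sketch below assumes output files are named `<library>.<type>.hgr1.bam`, as in the `aws s3 ls` listing for Set 4, and that library names contain no whitespace; `missing_libs` and the `LIB_A`/`LIB_B` rows are hypothetical.

```shell
#!/bin/bash
# missing_libs INPUT FILELIST
# Hypothetical helper: print each library (column 1 of a tab-separated
# input file) that has no '<library>.<type>.hgr1.bam' entry in an
# 'aws s3 ls' style file list (object key in the last column).
missing_libs() {
  local input=$1 filelist=$2 lib type rest
  while IFS=$'\t' read -r lib type rest; do
    awk -v want="${lib}.${type}.hgr1.bam" '
      { name = $NF; sub(/.*\//, "", name) }   # drop any key prefix
      name == want { found = 1 }
      END { exit !found }
    ' "$filelist" || echo "$lib"
  done < "$input"
}

# Made-up example: LIB_A has a bam file in the list, LIB_B does not
input=$(mktemp); filelist=$(mktemp)
printf 'LIB_A\twgs\tSAMN_A\tSRR_A\tSRX_A\n'  > "$input"
printf 'LIB_B\twgs\tSAMN_B\tSRR_B\tSRX_B\n' >> "$input"
printf '2019-06-12 09:02:28   66798443 ccle/hgr1/LIB_A.wgs.hgr1.bam\n' > "$filelist"
missing_libs "$input" "$filelist"
```

Run against each `ccle_set*.input` and `ccle.filelist`, the output could be pasted into a new clean-up input file.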


In [2]:
# Initialize
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'
INPUT='ccle_set6.input'
cat $INPUT
echo ''

# Local Folder Operations
# slight mods
# Max instance = 45
# instance size = 300 Gb

# align script
# allow for 200G prefetch download (files up to 120G)

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
echo ''

date -u

253J	rna	SAMN10987961	SRR8616166	SRX5414884
A3_KAW	rna	SAMN10987848	SRR8616021	SRX5415029
A4_Fuk	rna	SAMN10987839	SRR8616016	SRX5415034
LU99	rna	SAMN10987825	SRR8615394	SRX5414359
LUDLU1	rna	SAMN10987864	SRR8615402	SRX5414351
LXF289	rna	SAMN10988494	SRR8615401	SRX5414352
M07e	rna	SAMN10988568	SRR8615762	SRX5414638
MC116	rna	SAMN10987876	SRR8615760	SRX5414640
ME1	rna	SAMN10988452	SRR8615585	SRX5414815
MEC1	rna	SAMN10989555	SRR8615586	SRX5414814
MEG01	rna	SAMN10988184	SRR8615234	SRX5414519
MHHCALL3	rna	SAMN10988500	SRR8615523	SRX5414230
MM1S	rna	SAMN10988254	SRR8616070	SRX5414980
MOLT3	rna	SAMN10989586	SRR8615694	SRX5414706
MOR_CPR	rna	SAMN10987872	SRR8615698	SRX5414702
NALM6	rna	SAMN10988491	SRR8615346	SRX5414407
P31_FUJ	rna	SAMN10987620	SRR8616148	SRX5414902
MDA-MB-361	rna	SAMN10987900	SRR8615581	SRX5414819
MDA-MB-468	rna	SAMN10988340	SRR8615578	SRX5414822
Hs 274.T	rna	SAMN10988532	SRR8615416	SRX5414337
ZR-75-1	rna	SAMN10988134	SRR8618301	SRX5417215
MDA-MB-453	rna	

In [2]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/; chmod 400 ~/.ssh/<KEY>.pem
#
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set6.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set6.log

aws s3 cp s3://crownproject/ccle/logs/ccle_set6.log ./
cat ccle_set6.log
date -u

## All but one library finished running.

download: s3://crownproject/ccle/logs/ccle_set6.log to ./ccle_set6.log
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set5.input
Launch instance # 1
Thu Jun 13 09:49:39 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: NCIH2077 rna SAMN10989610 SRR8615794 SRX5414606
Instance ID: i-07db749f16bfa55d1
Public DNS: ec2-54-244-70-171.us-west-2.compute.amazonaws.com
Offending key for IP in /home/ec2-user/.ssh/known_hosts:452
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Thu Jun 13 09:52:48 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: RS4 rna SAMN10987766 SRR

One instance looks to have failed. The input library name `Hs 274.T` was changed to `Hs.274.T` and re-run; this needs to be adjusted manually in the total input set list. Going over the output bam file lists, there appear to be a few libraries which are in effect 'empty' due to a similar naming error.
```
Launch instance # 20
Wed Jun 19 03:48:03 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: Hs 274.T rna SAMN10988532 SRR8615416 SRX5414337
Instance ID: i-0f602cbfdcca0ee30
Public DNS: ec2-35-167-110-40.us-west-2.compute.amazonaws.com
ssh: connect to host ec2-35-167-110-40.us-west-2.compute.amazonaws.com port 22: Connection refused
```
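
Assuming the space in the library name is what broke this launch (the exact point of failure was not tracked down), an input table can be checked before any instances are started. A minimal sketch; `check_input` and `dot_names` are hypothetical helpers, and the five-column tab-separated layout is taken from the input files shown in this notebook:

```shell
# check_input FILE
# Report rows that do not have 5 tab-separated fields or whose
# library name (column 1) contains a space. Returns 1 if any are found.
check_input() {
  awk -F'\t' '
    NF != 5  { printf "line %d: expected 5 fields, found %d\n", NR, NF; bad = 1; next }
    $1 ~ / / { printf "line %d: space in library name \"%s\"\n", NR, $1; bad = 1 }
    END      { exit (bad ? 1 : 0) }
  ' "$1"
}

# dot_names FILE
# Print FILE with spaces in column 1 replaced by dots (Hs 274.T -> Hs.274.T).
dot_names() {
  awk -F'\t' -v OFS='\t' '{ gsub(/ /, ".", $1); print }' "$1"
}

# Usage (not run here):
#   check_input ccle_set6.input || dot_names ccle_set6.input > ccle_set6.fixed.input
```

Renaming with dots mirrors the manual fix used here; whichever convention is chosen, the same renamed table should be used downstream so library names stay consistent between the input list and the output bam files.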

#### Set 6 B
```
Hs.274.T	rna	SAMN10988532	SRR8615416	SRX5414337
Hs.742.T	rna	SAMN10987685	SRR8615321	SRX5414432
Hs.739.T	rna	SAMN10987649	SRR8615791	SRX5414609
Hs.578.T	rna	SAMN10987893	SRR8615420	SRX5414333
Hs.343.T	rna	SAMN10988552	SRR8615421	SRX5414332
Hs.606.T	rna	SAMN10988335	SRR8615418	SRX5414335
Hs.281.T	rna	SAMN10988524	SRR8615415	SRX5414338
```

In [1]:
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set6B.log

aws s3 cp s3://crownproject/ccle/logs/ccle_set6B.log ./
cat ccle_set6B.log

download: s3://crownproject/ccle/logs/ccle_set6B.log to ./ccle_set6B.log
[ec2-user@ip-172-31-31-245 ~]$ cat ccle_20.input
Hs.274.T	rna	SAMN10988532	SRR8615416	SRX5414337
Hs.742.T	rna	SAMN10987685	SRR8615321	SRX5414432
Hs.739.T	rna	SAMN10987649	SRR8615791	SRX5414609
Hs.578.T	rna	SAMN10987893	SRR8615420	SRX5414333
Hs.343.T	rna	SAMN10988552	SRR8615421	SRX5414332
Hs.606.T	rna	SAMN10988335	SRR8615418	SRX5414335
Hs.281.T	rna	SAMN10988524	SRR8615415	SRX5414338
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_20.input
Launch instance # 1
Wed Jun 19 13:49:22 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: Hs.274.T rna SAMN10988532 SRR8615416 SRX5414337
Instance ID: i-0df09ce0cc564c3e7
Public DNS: ec2-54-200-150-15.us-west-2.compute.amazonaws.com
downl

## Results -- Set 7 Remaining WGS

Remainder of the WGS data to be processed.

In [2]:
# Initialize
WORKDIR='/home/artem/Crown/data2/ccle'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/ccle'
INPUT='ccle_set7.input'
cat $INPUT
echo ''

# Local Folder Operations
# slight mods
# Max instance = 45
# instance size = 300 Gb

# align script
# allow for 200G prefetch download (files up to 120G)

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v4.ccle.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
echo ''

date -u

5637	wgs	SAMN10987658	SRR8639140	SRX5437593
22Rv1	wgs	SAMN10988315	SRR8639138	SRX5437595
23132_87	wgs	SAMN10988573	SRR8639137	SRX5437596
59M	wgs	SAMN10988041	SRR8639139	SRX5437594
A2780	wgs	SAMN10988216	SRR8639176	SRX5437557
AGS	wgs	SAMN10988305	SRR8639169	SRX5437564
AN3.CA	wgs	SAMN10988375	SRR8639168	SRX5437565
CAL.27	wgs	SAMN10987905	SRR8639202	SRX5437531
Caov-3	wgs	SAMN10988200	SRR8639190	SRX5437543
Capan-1	wgs	SAMN10987884	SRR8639189	SRX5437544
CAS-1	wgs	SAMN10987869	SRR8639192	SRX5437541
CHP-212	wgs	SAMN10988326	SRR8639197	SRX5437536
COV362	wgs	SAMN10987971	SRR8639222	SRX5437511
COV644	wgs	SAMN10988003	SRR8639214	SRX5437519
DAN-G	wgs	SAMN10987807	SRR8788980	SRX5578769
Daoy	wgs	SAMN10988097	SRR8639212	SRX5437521
Detroit.562	wgs	SAMN10988307	SRR8639218	SRX5437515
DU.145	wgs	SAMN10988253	SRR8639152	SRX5437581
EFO-21	wgs	SAMN10988346	SRR8639153	SRX5437580
EFO-27	wgs	SAMN10988215	SRR8639154	SRX5437579
ES-2	wgs	SAMN10988319	SRR8639144	SRX5437589
ESS-1	wgs	SAMN109882

In [1]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/ccle/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/; chmod 400 ~/.ssh/<KEY>.pem
#
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh ccle_set7.input
#
# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle_set7.log

aws s3 cp s3://crownproject/ccle/logs/ccle_set7.log ./
cat ccle_set7.log
date -u

## All but one library finished running.

download: s3://crownproject/ccle/logs/ccle_set7.log to ./ccle_set7.log
[ec2-user@ip-172-31-31-245 ~]$ bash queenB.sh ccle_set7.input
Launch instance # 1
Wed Jun 19 17:38:15 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: 5637 wgs SAMN10987658 SRR8639140 SRX5437593
Instance ID: i-0a5ad42bfa719a6d1
Public DNS: ec2-34-212-89-53.us-west-2.compute.amazonaws.com
download: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh to ./hgr1_align_v4.ccle.sh


Launch instance # 2
Wed Jun 19 17:41:24 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0b375c9c58cb4a7a2
Run Script: s3://crownproject/ccle/scripts/hgr1_align_v4.ccle.sh
Parameters: 22Rv1 wgs SAMN10988315 SRR8639138 SRX5437595
Instance ID: i-0600764273f6fc9a0
Public DNS: ec2-35-165-18-171.us-west-2.comp

All instances finished, one failure on:

```
- hgr1 Alignment Pipeline --
 version: 190531 build -- CCLE
 ami:     crown-190601 - ami-0b375c9c58cb4a7a2
 s3:      s3://crownproject/ccle
 library: RL95-2 -- wgs
 date:    Thu Jun 20 21:41:22 UTC 2019
```


## Results -- GVCF Calculation II WGS Data

Re-calculate GVCF Files for RNA/WGS datasets from CCLE


In [3]:
## RNA ---------------

# DNS 1: ec2-54-212-210-220.us-west-2.compute.amazonaws.com
# AMI: i-04aa55c347e233e33 (TCGA aligner)
# Instance: m4.4xlarge

## ON REMOTE:

## Copy ADcalc_ccle2.sh to instance
# aws s3 cp s3://crownproject/ccle/scripts/ADcalc_ccle2.sh ./
# chmod 777 ADcalc_ccle2.sh


## Copy CCLE files into its dir
#mkdir -p ~/ccle; cd ~/ccle
#mkdir -p hgr1; cd hgr1

## For RNA run
#aws s3 cp s3://crownproject/ccle/hgr1/ ./ --recursive --exclude "*" --include "*rna*"

## AMI not saved
#cd ~/ccle

## Copy over file list to s3
# ls -alr ./* > ccle.rna.filelist
# aws s3 cp ccle.rna.filelist s3://crownproject/ccle/gvcf/

## Run ADcalc script.sh
# screen -L
# bash ~/ADcalc_ccle2.sh

## POST-RUN

# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle.rna1.gvcf.log

aws s3 cp s3://crownproject/ccle/logs/ccle.rna1.gvcf.log ./
cat ccle.rna1.gvcf.log

## DONE


# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle.rna2.gvcf.log

aws s3 cp s3://crownproject/ccle/logs/ccle.rna2.gvcf.log ./
cat ccle.rna2.gvcf.log

## DONE

download: s3://crownproject/ccle/logs/ccle.rna1.gvcf.log to ./ccle.rna1.gvcf.log
ubuntu@ip-172-31-14-64:~$ bash ADcalc_ccle2.sh
Analyzing ccle.rna1...
Started processing 18S
[mpileup] 600 samples in 600 input files
ubuntu@ip-172-31-14-64:~$ ls
ADcalc_ccle2.sh  bin   data  ncbi       RNA          scripts   tmp
ADcalc_ccle3.sh  ccle  logs  resources  screenlog.0  software
ubuntu@ip-172-31-14-64:~$ exit
exit
Done with 18S 
chr13:1003660-1005529
Started processing 18SE
[mpileup] 420 samples in 420 input files
Done with 18SE 
chr13:1005529-1005629
Started processing 5S
[mpileup] 420 samples in 420 input files
Done with 5S 
chr13:10219-10340
Started processing 5.8S
[mpileup] 420

In [3]:
## WGS ---------------

# DNS 1: ec2-18-237-224-159.us-west-2.compute.amazonaws.com
# AMI: i-04aa55c347e233e33 (TCGA aligner)
# Instance: m4.4xlarge

## ON REMOTE:

## Copy ADcalc_ccle2.sh to instance
# aws s3 cp s3://crownproject/ccle/scripts/ADcalc_ccle2.sh ./
# chmod 777 ADcalc_ccle2.sh
## MANUALLY CHANGE TYPE='ccle.wgs'
# vim ADcalc_ccle2.sh


## Copy CCLE files into its dir
#mkdir -p ~/ccle; cd ~/ccle
#mkdir -p hgr1; cd hgr1

## For WGS run
#aws s3 cp s3://crownproject/ccle/hgr1/ ./ --recursive --exclude "*" --include "*wgs*"

## AMI not saved
#cd ~/ccle

## Copy over file list to s3
# ls -alr ./* > ccle.wgs.filelist
# aws s3 cp ccle.wgs.filelist s3://crownproject/ccle/gvcf/

## Run ADcalc script.sh
# screen -L
# bash ~/ADcalc_ccle2.sh

## POST-RUN

# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle.wgs.gvcf.log

aws s3 cp s3://crownproject/ccle/logs/ccle.wgs.gvcf.log ./
cat ccle.wgs.gvcf.log

## DONE

download: s3://crownproject/ccle/logs/ccle.wgs.gvcf.log to ./ccle.wgs.gvcf.log
ubuntu@ip-172-31-13-131:~$ bash ADcalc_ccle2.sh
Analyzing ccle.wgs...
Started processing 18S
[mpileup] 328 samples in 328 input files
Done with 18S 
chr13:1003660-1005529
Started processing 18SE
[mpileup] 328 samples in 328 input files
Done with 18SE 
chr13:1005529-1005629
Started processing 5S
[mpileup] 328 samples in 328 input files
Done with 5S 
chr13:10219-10340
Started processing 5.8S
[mpileup] 328 samples in 328 input files
Done with 5.8S 
chr13:1006622-1006779
Started processing 28S
[mpileup] 328 samples in 328 input files
Done with 28S 
chr13:1007948-1013018
upload: ./ccle.wgs.18SE.gvcf to s3://crownproject/ccle/gvcf/ccle.wgs.18SE.gvcf

In [2]:
# Uplink AD calc script
 echo $WORKDIR; cd $WORKDIR;

aws s3 cp ADcalc_ccle2.sh s3://crownproject/ccle/scripts/ADcalc_ccle2.sh

/home/artem/Crown/data2/ccle
Completed 2.0 KiB/2.0 KiB with 1 file(s) remainingupload: ./ADcalc_ccle2.sh to s3://crownproject/ccle/scripts/ADcalc_ccle2.sh


In [None]:
#!/bin/bash
# ADcalc_ccle2.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/ccle/scripts/ADcalc_ccle2.sh

# Type [rna | wgs]
TYPE='ccle.wgs'

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instances upon completion (for debugging)
TERMINATE='FALSE'

# S3 Output directory
S3DIR='s3://crownproject/ccle/gvcf/'

# Script ------------------
BAMLIST='bam.list.tmp'

cd ~/ccle/
mkdir -p GVCF #Output Folder

cd hgr1

#for TYPE in $(echo "hgr1")
#do
    echo Analyzing $TYPE...
    #cd $TYPE

    ls *.wgs.hgr1.bam > bam.list.tmp
    ls *.wgs.hgr1.bam > ../GVCF/$TYPE.bamlist
          
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../GVCF
aws s3 cp --recursive ./ $S3DIR

# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi
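
The script above hard-codes `*.wgs.hgr1.bam` and relies on editing `TYPE` by hand for each run. One possible alternative, sketched under the assumption that the label always starts with `ccle.rna` or `ccle.wgs`, is to derive the glob from the label. `bam_pattern` is a hypothetical helper and not part of `ADcalc_ccle2.sh`:

```shell
# bam_pattern TYPE
# Map a run label (e.g. ccle.wgs, ccle.rna1) to the glob of its BAM files.
# Hypothetical helper; the body runs in a subshell, so `exit 1` only ends the call.
bam_pattern() (
  case "$1" in
    ccle.rna*) echo '*.rna.hgr1.bam' ;;
    ccle.wgs*) echo '*.wgs.hgr1.bam' ;;
    *) echo "unknown TYPE: $1" >&2; exit 1 ;;
  esac
)

# Usage inside the script (not run here):
#   TYPE="${1:-ccle.wgs}"
#   ls $(bam_pattern "$TYPE") > "$BAMLIST"
```

Taking `TYPE` from the command line this way would remove the manual `vim` step between the RNA and WGS runs, at the cost of one more argument to remember.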


In [None]:
#!/bin/bash
# ADcalc_ccle2.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/ccle/scripts/ADcalc_ccle2.sh

# Type [rna | wgs]
TYPE='ccle.rna'

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instances upon completion (for debugging)
TERMINATE='FALSE'

# S3 Output directory
S3DIR='s3://crownproject/ccle/gvcf/'

# Script ------------------
BAMLIST='bam.list.tmp'

cd ~/ccle/
mkdir -p GVCF #Output Folder

cd hgr1
ls *.rna.hgr1.bam > ../GVCF/$TYPE.bamlist

# lazy 600 file limiter
TYPE='ccle.rna1'
ls *rna.*.bam | head -n600 - > bam.list.tmp
ls *rna.*.bam | head -n600 - > ../GVCF/$TYPE.bamlist

#TYPE='ccle.rna2'
#ls *rna.*.bam | tail -n +601 - > bam.list.tmp
#ls *rna.*.bam | tail -n +601 - > ../GVCF/$TYPE.bamlist

echo Analyzing $TYPE...
          
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../GVCF
aws s3 cp --recursive ./ $S3DIR

# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi
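
The 600-file limiter above is switched between `ccle.rna1` and `ccle.rna2` by commenting lines in and out. A minimal sketch of an alternative that writes every batch list in one pass; `make_batches` and its output naming (`<prefix><n>.bamlist`) are assumptions for illustration, chosen to line up with the `ccle.rna1` / `ccle.rna2` labels:

```shell
# make_batches LIST SIZE PREFIX
# Split LIST into consecutive chunks of SIZE lines, written to
# PREFIX1.bamlist, PREFIX2.bamlist, ... in the current directory.
make_batches() {
  awk -v size="$2" -v prefix="$3" '
    { out = sprintf("%s%d.bamlist", prefix, int((NR - 1) / size) + 1)
      print > out }
  ' "$1"
}

# Usage (not run here):
#   ls *.rna.hgr1.bam > all.rna.list
#   make_batches all.rna.list 600 ccle.rna
#   # -> ccle.rna1.bamlist (first 600 files), ccle.rna2.bamlist (the rest)
```

Each resulting list could then be passed to `bcftools mpileup -b` in turn, which keeps the bamlist stored alongside each GVCF identical to the list that was actually piled up.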


### Error

NOTE: the ccle.rna1.18S file is incomplete; it contains only 1055 lines out of the ~1884 expected.

Will need to be re-generated.
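
A truncated output like this could be caught before upload by comparing the positions present in each GVCF against the length of its region. A minimal sketch, assuming every position in the region produces at least one record (which may not hold for low-coverage libraries); the function names are hypothetical:

```shell
# region_length REGION: number of bases in a "chr:start-end" region (inclusive).
region_length() (
  coords=${1#*:}
  start=${coords%-*}
  end=${coords#*-}
  echo $(( end - start + 1 ))
)

# covered_positions FILE: count distinct POS values among non-header VCF lines.
covered_positions() {
  awk -F'\t' '!/^#/ && !($2 in seen) { seen[$2] = 1; n++ } END { print n + 0 }' "$1"
}

# check_gvcf FILE REGION: return 1 if FILE covers fewer positions than REGION spans.
check_gvcf() (
  expected=$(region_length "$2")
  covered=$(covered_positions "$1")
  if [ "$covered" -lt "$expected" ]; then
    echo "INCOMPLETE: $1 covers $covered of $expected positions"
    exit 1
  fi
  echo "OK: $1 covers $covered positions"
)

# Usage (not run here):
#   check_gvcf ../GVCF/ccle.rna1.18S.gvcf chr13:1003660-1005529
```

For the 18S region, `region_length` gives 1870, matching the length noted in the script comments. Counting distinct positions rather than raw lines avoids being thrown off by header lines or by more than one record at the same position.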