# TCGA2 - BRCA
```
pi:ababaian
files: ~/Crown/data2/tcga2_2_brca/
start: 2019 05 08
complete : 2019 05 10
```
## Introduction

The TCGA2-COAD analysis appears to have completed successfully: the ~444 files took about 24 hours to run. Here I will tune the parameters slightly to allow faster boot times, run more instances at once, and analyze the larger BRCA patient cohort.



## Objective

1. Align this set of TCGA seq data to the `hgr1` reference sequence for further analysis


## Materials and Methods

### Data Initialization


From the GDC/TCGA website, this cohort of data was selected with the following filter command.

```
cases.project.project_id in ["TCGA-BRCA"] and files.data_format in ["BAM"] and files.data_type in ["Aligned Reads"] and files.experimental_strategy in ["RNA-Seq"] and cases.samples.is_ffpe in [false]
```

This yields a total of `1206` files in `1096` cases.

The `Sample Sheet`, `File Manifest`, and `Biospecimen` data for this selection were downloaded and stored in `$PWD/metadata`.

In the `TCGA_File_Selection_BRCA.xlsx` spreadsheet, this set of files was filtered/parsed down to `984` files using two rules:

1. Remove files already aligned in the `tcga1` set, as defined in `~/Crown/data2/tcga2_pilot/tcga1_filelist.bamlist`

2. Technical replicates of the same sample share a SampleID (`TCGA-XX-####-01A`). Add a replicate suffix to keep names unique downstream (`TCGA-XX-####-01Ax`, where x = {a,b,c...})


The output of this parsing was copied to two input files: `tcga_brca_input1.txt` for files 1-500 and `tcga_brca_input2.txt` for files 501-984.
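
The parsing above was done by hand in the spreadsheet. As a rough command-line equivalent, the sketch below suffixes duplicated SampleIDs and splits a list at a given line. The function names `add_replicate_suffix` and `split_at` are hypothetical, the input is assumed to be tab-separated (SampleID, project, UUID), and it assumes every member of a replicate group receives a suffix, which the rule above does not state explicitly.

```shell
# Hypothetical helper: append a, b, c... to every SampleID that occurs more than once.
# $1 : tab-separated file (SampleID, project, UUID). Assumes at most 26 replicates.
add_replicate_suffix() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { count[$1]++; next }   # first pass: count each SampleID
    count[$1] > 1 {                   # second pass: suffix duplicated IDs
      id = $1
      n = ++seen[id]
      $1 = id substr("abcdefghijklmnopqrstuvwxyz", n, 1)
    }
    { print }
  ' "$1" "$1"
}

# Hypothetical helper: split a list into lines 1..N and N+1..end.
# $1 : input file, $2 : N, $3 : first output, $4 : second output
split_at() {
  head -n "$2" "$1" > "$3"
  tail -n "+$(( $2 + 1 ))" "$1" > "$4"
}
```

For this cohort the split would be called as `split_at <parsed list> 500 tcga_brca_input1.txt tcga_brca_input2.txt`, where the parsed list is whatever file the spreadsheet was exported to.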

### Scripts and Localization

#### 1 - Localization

In [3]:
WORKDIR='/home/artem/Crown/data2/tcga2_2_brca'
cd $WORKDIR
ls

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/tcga2'

droneB.sh              queenB.sh             tcga_brca_input1.txt
hgr1_align_v3.tcga.sh  tcga2_brca1_setA.log  tcga_brca_input2.txt
metadata               tcga2_brca2_setA.log  TCGA_File_Selection_BRCA.xlsx


In [2]:
INPUT='tcga_brca_input1.txt'

cat $INPUT

TCGA-3C-AAAU-01A	TCGA-BRCA	4c68dee0-d1f5-4c6c-86bd-d4aad85ce6fb
TCGA-3C-AALI-01A	TCGA-BRCA	f84ccc77-6b78-4a1b-8862-e97038d0182c
TCGA-3C-AALJ-01A	TCGA-BRCA	e07f4631-c01d-47d4-8ee3-16ee5e79b984
TCGA-3C-AALK-01A	TCGA-BRCA	4d56cc4d-a1ef-4044-91f2-b0c71e3d18f1
TCGA-4H-AAAK-01A	TCGA-BRCA	4ab02a27-4259-4077-a00e-74df48025e72
TCGA-5L-AAT0-01A	TCGA-BRCA	01db8cf4-ace7-4ef6-8ad6-5e511c2b3175
TCGA-5L-AAT1-01A	TCGA-BRCA	a1c9ab7a-5561-4a43-925a-8b1d56cbf4bf
TCGA-5T-A9QA-01A	TCGA-BRCA	b0eaece8-d2b7-42e5-bb0e-072ce5273ce1
TCGA-A1-A0SB-01A	TCGA-BRCA	fa7a13d8-80b6-46dc-938a-af0d17b61a4e
TCGA-A1-A0SD-01A	TCGA-BRCA	e9643c1e-19dc-4c1e-96d0-36725f29080b
TCGA-A1-A0SE-01A	TCGA-BRCA	9e592817-2660-40f6-b3a2-58b164af3778
TCGA-A1-A0SF-01A	TCGA-BRCA	6f3b0c29-f679-4f63-a665-7d2e0599ae40
TCGA-A1-A0SG-01A	TCGA-BRCA	b79bc87e-0fcf-4015-9948-d7a544a9fffc
TCGA-A1-A0SH-01A	TCGA-BRCA	729a67a4-ed1a-498d-acdd-f4de8497b0ca
TCGA-A1-A0SI-01A	TCGA-BRCA	02663fbf-3402-45fa-aa2e-093f280414e7
TCGA-A1-A0SJ-01A	TCGA-BRC

In [3]:
INPUT2='tcga_brca_input2.txt'

cat $INPUT2

TCGA-B6-A402-01A	TCGA-BRCA	96e9872b-65bf-4411-b7e9-ebaa414cc4c0
TCGA-B6-A408-01A	TCGA-BRCA	fd93e4f2-8d6c-4aec-8fcd-de840ce2346d
TCGA-B6-A409-01A	TCGA-BRCA	7f871048-8a72-4512-bd14-04027f4370be
TCGA-B6-A40B-01A	TCGA-BRCA	6072673c-d564-4cb7-8513-214ad6dc6cc0
TCGA-B6-A40C-01A	TCGA-BRCA	d0e8399a-8ed8-4aa5-a00d-5a2023fd3d09
TCGA-BH-A0AV-01A	TCGA-BRCA	7bd65e59-890f-474a-9d61-78bcae04bb35
TCGA-BH-A0AW-01A	TCGA-BRCA	3f20c077-04eb-41a2-91f9-24842347d41f
TCGA-BH-A0B0-01A	TCGA-BRCA	040139ec-fabd-42b8-bf5a-17fbf4970adc
TCGA-BH-A0B1-01A	TCGA-BRCA	817b923e-988b-4c3c-90fd-65cfd413293f
TCGA-BH-A0B2-01A	TCGA-BRCA	65a5e8be-2ba8-4b8c-a5ff-95502303f0fb
TCGA-BH-A0B4-01A	TCGA-BRCA	3bbe1fff-6264-43e7-9a6b-42aa09c68a1b
TCGA-BH-A0B6-01A	TCGA-BRCA	c12355be-38b4-4eba-8e30-3b415a2d1351
TCGA-BH-A0B9-01A	TCGA-BRCA	bf30d526-8140-4f3d-8676-f6ed71aaed36
TCGA-BH-A0BD-01A	TCGA-BRCA	8e66d9b8-4eac-47bb-a6c5-df8240e95498
TCGA-BH-A0BF-01A	TCGA-BRCA	d2abb2af-b6eb-4596-98e5-48500e18e8bc
TCGA-BH-A0BG-01A	TCGA-BRC

#### 2 - Script Versions

In [4]:
cd $WORKDIR
# Echo scripts to be used for this analysis for version control.
# Note these need to be manually copied to the $WORKDIR

cat hgr1_align_v3.tcga.sh
echo 
echo
cat queenB.sh
echo 
echo
cat droneB.sh
echo 
echo 

#!/bin/bash
# hgr1_align_v3.tcga.sh
# rDNA alignment pipeline
PIPE_VERSION='190506 build -- TCGA'
AMI_VERSION='crown-180813 - ami-0031fd61f932bdef9'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name and file-output name (unique)
# $2 : Library population/analysis set
# $3 : Library UUID

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject/tcga2'

# CPU
	THREADS='3'

# Terminate instances upon completion (for debuggin)
  TERMINATE='TRUE'

# Sequencing Data
	LIBRARY=$1 # Library/ File name

# TCGA FILE UUID
  UUID=$3

 # FastQ File-names
    FQ0="$LIBRARY.tmp.sort.0.fq"
    FQ1="$LIBRARY.tmp.sort.1.fq"
    FQ2="$LIBRARY.tmp.sort.2.fq"
    
# Read Group Data
# Extract from downloaded BAM file / input
	RGPO=$2 # Patient Population

	#RGSM= # Sample. Patient Identifer
	#RGID= # Read Group ID. Acces

## Results - TCGA2 Pilot Run



#### 3 - Copy local to S3

In [5]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v3.tcga.sh $S3URL/scripts/

aws s3 cp $INPUT $S3URL/scripts/
aws s3 cp $INPUT2 $S3URL/scripts/

aws s3 cp ../../gdc.token.txt $S3URL/scripts/gdc.token


upload: ./queenB.sh to s3://crownproject/tcga2/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/tcga2/scripts/droneB.sh
upload: ./hgr1_align_v3.tcga.sh to s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh
upload: ./tcga_brca_input1.txt to s3://crownproject/tcga2/scripts/tcga_brca_input1.txt
upload: ./tcga_brca_input2.txt to s3://crownproject/tcga2/scripts/tcga_brca_input2.txt
upload: ../../gdc.token.txt to s3://crownproject/tcga2/scripts/gdc.token


#### 4 - Launch and run master EC2 node

In [4]:
# Remote EC2 Instance Operations ----------------------

# QUEEN BEE 1
# Instance ID: i-0f74a2d0a41c64e45
# ec2-user@ec2-34-217-67-161.us-west-2.compute.amazonaws.com
# input 1

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/tcga2/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh tcga_brca_input1.txt
#
# aws s3 cp screenlog.0 s3://crownproject/tcga2/logs/tcga2_brca1_setA.log
#
# After ~38 instances, run was stopped to increase node usage (see note)
#
# bash queenB.sh tcga_brca_input1.txt
#
# aws s3 cp screenlog.0 s3://crownproject/tcga2/logs/tcga2_brca1_setB.log

aws s3 cp s3://crownproject/tcga2/logs/tcga2_brca1_setA.log ./
aws s3 cp s3://crownproject/tcga2/logs/tcga2_brca1_setB.log ./

cat tcga2_brca1_setA.log
cat tcga2_brca1_setB.log

# Run ...

download: s3://crownproject/tcga2/logs/tcga2_brca1_setA.log to ./tcga2_brca1_setA.log
download: s3://crownproject/tcga2/logs/tcga2_brca1_setB.log to ./tcga2_brca1_setB.log
Launch instance # 1
Wed May  8 21:22:23 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh
Parameters: TCGA-3C-AAAU-01A TCGA-BRCA 4c68dee0-d1f5-4c6c-86bd-d4aad85ce6fb
Instance ID: i-0edc6b994d9a404d4
Public DNS: ec2-34-216-221-229.us-west-2.compute.amazonaws.com
Offending key for IP in /home/ec2-user/.ssh/known_hosts:292
download: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh to ./hgr1_align_v3.tcga.sh


Launch instance # 2
Wed May  8 21:25:33 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownpro

In [5]:
# Remote EC2 Instance Operations ----------------------

# QUEEN BEE 2
# Instance ID: i-05fc4b0d63f87157d
# ec2-user@ec2-34-209-196-15.us-west-2.compute.amazonaws.com
# input 2

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/tcga2/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh tcga_brca_input2.txt
#
# aws s3 cp screenlog.0 s3://crownproject/tcga2/logs/tcga2_brca2.log

aws s3 cp s3://crownproject/tcga2/logs/tcga2_brca2_setA.log ./
aws s3 cp s3://crownproject/tcga2/logs/tcga2_brca2_setB.log ./
cat tcga2_brca2_setA.log
cat tcga2_brca2_setB.log

# Run ...

download: s3://crownproject/tcga2/logs/tcga2_brca2_setA.log to ./tcga2_brca2_setA.log
download: s3://crownproject/tcga2/logs/tcga2_brca2_setB.log to ./tcga2_brca2_setB.log
[ec2-user@ip-172-31-17-64 ~]$ bash queenB.sh tcga_brca_input2.txt
Launch instance # 1
Wed May  8 21:25:44 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh
Parameters: TCGA-B6-A402-01A TCGA-BRCA 96e9872b-65bf-4411-b7e9-ebaa414cc4c0
Instance ID: i-0ef3d19940a20fd2a
Launchign instance not ready yet, chill
Public DNS: ec2-54-213-235-80.us-west-2.compute.amazonaws.com
download: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh to ./hgr1_align_v3.tcga.sh


Launch instance # 2
Wed May  8 21:31:52 UTC 2019
Ins
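
The queen logs above record one `Parameters:` line per launched library. To see which libraries a given log covers, a small helper along these lines may be useful. It is only a sketch: `launched_libraries` is a hypothetical name, and it assumes the log keeps the `Parameters: <library> <population> <UUID>` layout shown above.

```shell
# Hypothetical helper: print the library name from each "Parameters:" line of a queen log.
# $1 : log file saved from `screen -L` (carriage returns are stripped first)
launched_libraries() {
  tr -d '\r' < "$1" | awk '$1 == "Parameters:" { print $2 }'
}
```

For example, `launched_libraries tcga2_brca1_setA.log | wc -l` would give the number of launches recorded in the first log.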

## Discussion


#### Initial Check


In [7]:
aws s3 ls s3://crownproject/tcga2/TCGA-BRCA/ | grep -e '.bam$' > brca2.bamlist
wc -l brca2.bamlist

aws s3 ls s3://crownproject/tcga2/TCGA-BRCA/ | grep -e '.bam$'

983 brca2.bamlist
2019-05-08 15:42:48   12487560 TCGA-3C-AAAU-01A.hgr1.bam
2019-05-08 15:19:50   35271169 TCGA-3C-AALI-01A.hgr1.bam
2019-05-08 15:01:27   20045523 TCGA-3C-AALJ-01A.hgr1.bam
2019-05-08 15:51:01   63799031 TCGA-3C-AALK-01A.hgr1.bam
2019-05-08 15:35:37   41833600 TCGA-4H-AAAK-01A.hgr1.bam
2019-05-08 15:21:51   23781297 TCGA-5L-AAT0-01A.hgr1.bam
2019-05-08 15:11:53   12275992 TCGA-5L-AAT1-01A.hgr1.bam
2019-05-08 15:23:27    7320168 TCGA-5T-A9QA-01A.hgr1.bam
2019-05-08 16:01:12   35266269 TCGA-A1-A0SB-01A.hgr1.bam
2019-05-08 16:14:12   42620824 TCGA-A1-A0SD-01A.hgr1.bam
2019-05-08 16:21:48   70986252 TCGA-A1-A0SE-01A.hgr1.bam
2019-05-08 16:17:19   32766608 TCGA-A1-A0SF-01A.hgr1.bam
2019-05-08 16:32:05   27242529 TCGA-A1-A0SG-01A.hgr1.bam
2019-05-08 16:39:33   38972187 TCGA-A1-A0SH-01A.hgr1.bam
2019-05-08 17:10:52  106723726 TCGA-A1-A0SI-01A.hgr1.bam
2019-05-08 16:48:24   39544226 TCGA-A1-A0SJ-01A.hgr1.bam
2019-05-08 17:07:10   60565770 TCGA-A1-A0SK-01A.hgr1.

On first check, 983 of the 984 expected files are present. The one missing file was lost at the mid-run stoppage:

```
TCGA-A2-A0EX-01A	TCGA-BRCA	6d135cc4-4eca-48be-8f87-bf94daf83e7a
```
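
One way to find such a gap is to compare the input lists against the S3 listing. The sketch below assumes the `aws s3 ls` output format shown above (object name in the last column, `.hgr1.bam` suffix) and tab-separated input files; `find_missing` is a hypothetical helper name, not part of the pipeline scripts.

```shell
# Hypothetical helper: print input rows that have no matching <SampleID>.hgr1.bam.
# $1   : saved `aws s3 ls` listing (e.g. brca2.bamlist)
# $2.. : tab-separated input files (SampleID, project, UUID)
find_missing() {
  bamlist=$1; shift
  cat "$@" | awk -F'\t' -v list="$bamlist" '
    BEGIN {
      # Load the finished library names from the last column of the listing
      while ((getline line < list) > 0) {
        n = split(line, f, " ")
        name = f[n]
        sub(/\.hgr1\.bam$/, "", name)
        have[name] = 1
      }
    }
    NF && !($1 in have)   # keep non-empty rows whose SampleID is not finished
  '
}
```

Called as `find_missing brca2.bamlist tcga_brca_input1.txt tcga_brca_input2.txt`, the output is already in the three-column format that `queenB.sh` takes as input.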

I made a one-line input file (`tcga_brca_missing.txt`) and ran this file alone.

```
[ec2-user@ip-172-31-31-231 ~]$ bash queenB.sh tcga_brca_missing.txt 
Launch instance # 1
Fri May 10 14:41:41 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh
Parameters: TCGA-A2-A0EX-01A TCGA-BRCA 6d135cc4-4eca-48be-8f87-bf94daf83e7a
Instance ID: i-0a6c50a6cd9fa3134
```

### Errors / Debugging

#### Increasing number of instances running at once

There was a "collision" of instances when running the set-up above. The idea was that each queenB instance would run 25 independent instances, for a total of 50 simultaneous instances. This limit is set by the `$MAX_INSTANCE` variable in the `queenB.sh` script. An important note going forward, though: `$MAX_INSTANCE` is compared against `$running_instance`, which is computed as

```
running_instance=$(aws ec2 describe-instances |\
    grep '"Name": "running"' - |\
    wc -l - | cut -f1 -d' ' - )
```

This count includes ALL EC2 instances that are online, including the head nodes. In this case, a `$MAX_INSTANCE` value of 50 will give each of the two head nodes approximately 25 worker nodes.
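
To make the shared cap concrete, here is a minimal sketch of the check. The body of `queenB.sh` is not reproduced in this notebook, so `count_running` and `can_launch` are hypothetical names and the strict `<` comparison is an assumption. The counting uses the same `"Name": "running"` match as the snippet above.

```shell
# Cap on ALL running instances in the account/region, head nodes included
MAX_INSTANCE=50

# Count running instances from `aws ec2 describe-instances` JSON on stdin.
count_running() {
  grep -c '"Name": "running"'
}

# Succeed (exit 0) when the running count in $1 is still below the cap.
# Assumption: a strict "less than" test; the real script may differ.
can_launch() {
  [ "$1" -lt "$MAX_INSTANCE" ]
}

# Usage on a head node (requires configured AWS credentials):
#   running=$(aws ec2 describe-instances | count_running)
#   can_launch "$running" || sleep 60
```

With two head nodes online, a cap of 50 leaves 48 slots for workers, so each head node ends up with about 24 of them. The exact split depends on which head node polls first.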
