# TCGA2 - COAD
```
pi:ababaian
files: ~/Crown/data2/tcga2_1_coad/
start: 2019 05 07
complete : 2019 05 08
```
## Introduction

The remaining files from the TCGA-COAD cohort will be analyzed. None of the new patients in these files have matched normal samples. Some files are also biological replicates (01A samples vs 01B samples) of cases analyzed previously.

The 01B samples generally appear to differ in chemistry/processing, including but not limited to a subset of them being FFPE samples. Thus, switch over to the 01A vs 11A analysis once these cohorts are complete.

## Objective

1. Align this set of TCGA seq data to the `hgr1` reference sequence for further analysis


## Materials and Methods

### Data Initialization


On the GDC/TCGA website, this cohort of data was selected with the following filter query.

```
cases.project.project_id in ["TCGA-COAD"] and files.data_format in ["BAM"] and files.data_type in ["Aligned Reads"] and files.experimental_strategy in ["RNA-Seq"] and cases.samples.is_ffpe in [false]
```

The `Sample Sheet`, `File Manifest`, and `Biospecimen` data for this selection were downloaded and stored in `$PWD/metadata`.

In the `TCGA_File_Selection_COAD.xlsx` spreadsheet, this set of files was filtered/parsed to:

1. Remove files already aligned in the `tcga1` set, as defined in `~/Crown/data2/tcga2_pilot/tcga1_filelist.bamlist`

2. Make names unique downstream. Technical replicates of the same sample share a SampleID (`TCGA-XX-####-01A`), so a replicate suffix is added (`TCGA-XX-####-01Ax`, where x = {a, b, c, ...}).


The output of this parsing is copied to the input file: `tcga2_coad_input.txt`
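The replicate-suffix rule in item 2 was applied by hand in the spreadsheet. A minimal shell sketch of the same rule follows, assuming a tab-separated table in the three-column layout of `tcga2_coad_input.txt` (SampleID, project, UUID). `add_replicate_suffix` is a hypothetical helper and is not part of the pipeline scripts.

```bash
# Hypothetical helper: append a, b, c, ... to SampleIDs that occur more than once.
# Reads tab-separated rows (SampleID, project, UUID) on stdin, writes them to stdout.
add_replicate_suffix() {
  awk -F'\t' -v OFS='\t' '
    # First pass over the stream: remember every row and count each SampleID.
    { id[NR] = $1; rest[NR] = $2 OFS $3; total[$1]++ }
    END {
      for (i = 1; i <= NR; i++) {
        if (total[id[i]] > 1) {
          seen[id[i]]++
          # 1 -> a, 2 -> b, ... (assumes fewer than 27 replicates per sample)
          suffix = substr("abcdefghijklmnopqrstuvwxyz", seen[id[i]], 1)
        } else {
          suffix = ""
        }
        print id[i] suffix, rest[i]
      }
    }'
}

# Example (file names are placeholders):
# add_replicate_suffix < selected_files.tsv > tcga2_coad_input.txt
```

Samples that occur only once keep their original SampleID, which matches the mix of `-01A` and `-01Aa`/`-01Ab` names in the input file.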

### Scripts and Localization

#### 1 - Localization

In [3]:
WORKDIR='/home/artem/Crown/data2/tcga2_1_coad'
cd $WORKDIR
ls

# Amazon AWS S3 Home URL
S3URL='s3://crownproject/tcga2'

coad2.filter           metadata              tcga2_coad.log
droneB.sh              queenB.sh             TCGA_File_Selection_COAD.xlsx
hgr1_align_v3.tcga.sh  tcga2_coad_input.txt  TCGA_File_Selection_template.xlsx


In [2]:
INPUT='tcga2_coad_input.txt'

cat $INPUT

TCGA-3L-AA1B-01A	TCGA-COAD	c1c36a5e-5410-45ef-8954-70c26ef27066
TCGA-4N-A93T-01A	TCGA-COAD	fd9ac46f-2517-446c-9325-06f8db2ab89c
TCGA-4T-AA8H-01A	TCGA-COAD	06921a3a-5c30-4fb0-8ed0-347f51af459d
TCGA-5M-AAT4-01A	TCGA-COAD	ef99b87e-4d27-4689-be93-6a55f20ca577
TCGA-5M-AAT5-01A	TCGA-COAD	f1b27b36-e2c0-42da-beb1-2bc2bc61abb9
TCGA-5M-AAT6-01A	TCGA-COAD	b80f2f67-842c-4b6d-9b8c-936c6f03ac96
TCGA-5M-AATA-01A	TCGA-COAD	cbbd47c7-cc50-479d-a1a9-7199f0bdb9eb
TCGA-5M-AATE-01A	TCGA-COAD	8315040e-4201-42fe-9c4e-10ff635672cf
TCGA-A6-2671-01A	TCGA-COAD	80ff8844-3c90-4c66-b6d9-a72ea86219ed
TCGA-A6-2672-01A	TCGA-COAD	f08dc7f4-3cc3-4743-a84e-d586d74af8d1
TCGA-A6-2674-01Aa	TCGA-COAD	9d537202-d436-48de-8ee8-2d417576705f
TCGA-A6-2674-01Ab	TCGA-COAD	0c02bf18-3f95-468a-bdb8-408ad4e77e6a
TCGA-A6-2676-01A	TCGA-COAD	2e48e315-5cdf-4fee-aa5d-3c7baa4030ad
TCGA-A6-2677-01A	TCGA-COAD	90832632-cf57-463b-9d08-c76975066f56
TCGA-A6-2678-01A	TCGA-COAD	670e9a3d-5a6e-4ddb-a1bf-6e48e3786fb1
TCGA-A6-2679-01A	TCGA-C
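Because the alignment script expects a unique library name (`$1`), an analysis set (`$2`), and a UUID (`$3`) per row, a quick sanity check of the input file before launching can catch malformed rows. The sketch below assumes the three-column, tab-separated layout shown above. `check_input` is a hypothetical helper, and the UUID test is only a loose shape check (36 characters of lowercase hex and hyphens).

```bash
# Hypothetical pre-flight check for the three-column launch list.
# Prints one message per problem and returns non-zero if any row is malformed.
check_input() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    # Every row needs exactly: library name, analysis set, UUID.
    NF != 3 { printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF; bad = 1; next }
    # Loose UUID shape check.
    length($3) != 36 || $3 !~ /^[0-9a-f-]+$/ { printf "line %d: UUID looks malformed: %s\n", NR, $3; bad = 1 }
    # Library names double as output file names, so they must be unique.
    seen[$1]++ { printf "line %d: duplicate library name: %s\n", NR, $1; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Example:
# check_input tcga2_coad_input.txt && echo "input OK"
```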

#### 2 - Script Versions

In [3]:
cd $WORKDIR
# Echo scripts to be used for this analysis for version control.
# Note these need to be manually copied to the $WORKDIR

cat hgr1_align_v3.tcga.sh
echo 
echo
cat queenB.sh
echo 
echo
cat droneB.sh
echo 
echo 

#!/bin/bash
# hgr1_align_v3.tcga.sh
# rDNA alignment pipeline
PIPE_VERSION='190506 build -- TCGA'
AMI_VERSION='crown-180813 - ami-0031fd61f932bdef9'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name and file-output name (unique)
# $2 : Library population/analysis set
# $3 : Library UUID

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject/tcga2'

# CPU
	THREADS='3'

# Terminate instances upon completion (for debugging)
  TERMINATE='TRUE'

# Sequencing Data
	LIBRARY=$1 # Library/ File name

# TCGA FILE UUID
  UUID=$3

 # FastQ File-names
    FQ0="$LIBRARY.tmp.sort.0.fq"
    FQ1="$LIBRARY.tmp.sort.1.fq"
    FQ2="$LIBRARY.tmp.sort.2.fq"
    
# Read Group Data
# Extract from downloaded BAM file / input
	RGPO=$2 # Patient Population

	#RGSM= # Sample. Patient Identifier
	#RGID= # Read Group ID. Acces

## Results - TCGA2 Pilot Run



#### 3 - Copy local to S3

In [4]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/
aws s3 cp droneB.sh $S3URL/scripts/
aws s3 cp hgr1_align_v3.tcga.sh $S3URL/scripts/
aws s3 cp $INPUT $S3URL/scripts/
aws s3 cp ../../gdc.token.txt $S3URL/scripts/gdc.token


upload: ./queenB.sh to s3://crownproject/tcga2/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/tcga2/scripts/droneB.sh
upload: ./hgr1_align_v3.tcga.sh to s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh
upload: ./tcga2_coad_input.txt to s3://crownproject/tcga2/scripts/tcga2_coad_input.txt
upload: ../../gdc.token.txt to s3://crownproject/tcga2/scripts/gdc.token
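Since the scripts have to be copied into `$WORKDIR` manually, it may help to confirm that every file is present and non-empty before uploading. The following is a small sketch of such a check. `require_files` is a hypothetical helper, not something the pipeline provides.

```bash
# Hypothetical helper: report any argument that is missing or an empty file.
# Returns 0 only if every listed file exists and has a size greater than zero.
require_files() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, run from $WORKDIR before the `aws s3 cp` commands:
# require_files queenB.sh droneB.sh hgr1_align_v3.tcga.sh "$INPUT" ../../gdc.token.txt \
#   && echo "ready to upload"
```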


#### 4 - Launch and run master EC2 node

In [2]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/tcga2/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh $INPUT
#
# aws s3 cp screenlog.0 s3://crownproject/tcga2/logs/tcga2_coad.log

aws s3 cp s3://crownproject/tcga2/logs/tcga2_coad.log ./
cat tcga2_coad.log

# Run completed successfully - 23 hour run

download: s3://crownproject/tcga2/logs/tcga2_coad.log to ./tcga2_coad.log
[ec2-user@ip-172-31-31-231 ~]$ ls
droneB.sh  gdc.token  hgr1_align_v3.tcga.sh  input_tcga2.txt  queenB.sh  screenlog.0
[ec2-user@ip-172-31-31-231 ~]$ bash queenB.sh input_tcga2.txt
Launch instance # 1
Mon May  6 19:37:19 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh
Parameters: TCGA-44-2662-01Aa TCGA-LUAD 10cdc012-492d-4090-83ab-ff73ac0ee536
Instance ID: i-06cea275db49daf32
Public DNS: ec2-18-236-102-47.us-west-2.compute.amazonaws.com
download: s3://crownproject/tcga2/scripts/hgr1_align_v3.tcga.sh to ./hgr1_align_v3.tcga.sh


Launch instance # 2
Mon May  6 19:40:26 UTC 2019
Instance Type: c4.xlarge
AMI Image: ami-003
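Logs captured with `screen -L` usually carry terminal control sequences and carriage-return progress updates, which makes them harder to read or grep. The sketch below is one way to tidy such a log before archiving it. `clean_screenlog` is a hypothetical helper, and it only handles CSI-style escape sequences (ESC `[` ... letter) and `aws s3` progress lines. Other control sequences would pass through.

```bash
# Hypothetical helper: strip common terminal debris from a screen log.
# Reads the raw log on stdin and writes the cleaned log to stdout.
clean_screenlog() {
  esc=$(printf '\033')
  # Turn carriage returns into newlines so each progress update is its own line,
  # then drop CSI escape sequences and the "Completed ... remaining" progress lines.
  tr '\r' '\n' |
    sed -e "s/${esc}\[[0-9;?]*[A-Za-z]//g" -e '/^Completed .* remaining$/d'
}

# Example:
# clean_screenlog < screenlog.0 > tcga2_coad.clean.log
```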

## Discussion

The final node was launched ~23 hours after the run started (444 launches, ~3 minutes apart).
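As a quick check of that estimate, the two launch timestamps visible in the log excerpt (19:37:19 and 19:40:26) are 187 seconds apart, slightly more than 3 minutes. The sketch below multiplies the number of launches by the interval, which is how the figure above is expressed. `launch_window_hours` is a hypothetical helper, and the 187-second interval is taken from that single pair of timestamps, so it is only an approximation.

```bash
# Hypothetical helper: hours needed to launch N instances at a fixed interval.
# $1 = number of launches, $2 = seconds between launches
launch_window_hours() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 3600 }'
}

launch_window_hours 444 180   # 22.2 (exactly 3 minutes apart)
launch_window_hours 444 187   # 23.1 (interval seen in the log excerpt)
```

The second figure agrees with the ~23 hour run noted above.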

Spot check of alignment files looks good : )


#### Output BAM filelist

Confirmed that `444` non-empty BAM files were created as the output of this pipeline. The run was successful.

In [5]:
aws s3 ls s3://crownproject/tcga2/TCGA-COAD/ | grep -e '\.bam$'

2019-05-07 12:06:36   35383644 TCGA-3L-AA1B-01A.hgr1.bam
2019-05-07 12:13:00   90566954 TCGA-4N-A93T-01A.hgr1.bam
2019-05-07 11:38:54   47597991 TCGA-4T-AA8H-01A.hgr1.bam
2019-05-07 12:11:25   73053497 TCGA-5M-AAT4-01A.hgr1.bam
2019-05-07 12:07:18   25078449 TCGA-5M-AAT5-01A.hgr1.bam
2019-05-07 11:54:12   30808977 TCGA-5M-AAT6-01A.hgr1.bam
2019-05-07 12:20:55   48855228 TCGA-5M-AATA-01A.hgr1.bam
2019-05-07 11:53:48   23435843 TCGA-5M-AATE-01A.hgr1.bam
2019-05-07 11:43:24   39286269 TCGA-A6-2671-01A.hgr1.se.bam
2019-05-07 11:41:17   34556124 TCGA-A6-2672-01A.hgr1.se.bam
2019-05-07 11:45:03   26967237 TCGA-A6-2674-01Aa.hgr1.se.bam
2019-05-07 14:43:29 2170905715 TCGA-A6-2674-01Ab.hgr1.bam
2019-05-07 11:49:47   29623278 TCGA-A6-2676-01A.hgr1.se.bam
2019-05-07 11:58:02   49517537 TCGA-A6-2677-01A.hgr1.se.bam
2019-05-07 11:58:01   24774455 TCGA-A6-2678-01A.hgr1.se.bam
2019-05-07 11:48:05    9895726 TCGA-A6-2679-01A.hgr1.se.bam
2019-05-07 12:02:45   36812629 TCGA-A6-2680-01A.h

In [6]:
aws s3 ls s3://crownproject/tcga2/TCGA-COAD/ | grep -e '\.bam$' > coad.bamlist
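The count of non-empty BAM files can also be checked from the saved listing rather than by eye. The sketch below assumes the four-column `aws s3 ls` layout shown above (date, time, size in bytes, file name) and the `.hgr1.bam` / `.hgr1.se.bam` naming of the outputs. Both helpers are hypothetical and not part of the pipeline.

```bash
# Hypothetical helper: count listing rows that are .bam files with size > 0.
# Reads `aws s3 ls` output on stdin.
count_nonempty_bams() {
  awk '$4 ~ /\.bam$/ && $3 > 0 { n++ } END { print n + 0 }'
}

# Hypothetical helper: list library names from the input file that have no
# non-empty BAM in the listing.
# $1 = launch list (tab-separated, library name in column 1), $2 = saved listing
missing_libraries() {
  awk -F'\t' -v list="$2" '
    BEGIN {
      # Load the listing and record the library name of every non-empty BAM.
      while ((getline line < list) > 0) {
        n = split(line, f, " ")
        if (n < 4 || f[3] + 0 == 0 || f[n] !~ /\.bam$/) continue
        name = f[n]
        sub(/\.hgr1\..*$/, "", name)   # strip ".hgr1.bam" or ".hgr1.se.bam"
        have[name] = 1
      }
    }
    NF && !($1 in have) { print $1 }
  ' "$1"
}

# Example:
# count_nonempty_bams < coad.bamlist
# missing_libraries tcga2_coad_input.txt coad.bamlist
```

If every library was aligned, the count matches the number of rows in the input file and `missing_libraries` prints nothing.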



QED