# TCGA-5 COAD Complete RNA-seq hgr1 alignments
```
pi:ababaian
files: ~/Crown/data2/tcga_5_coad2
start: 2018 11 20
complete : 2018 .. ..
```

## Introduction

Download and align the complete set of TCGA-COAD data (all cancer RNA-seq libraries, including those without matched normal controls).

This complete set will form the basis for group-based analysis of normo- vs. hypo-macp differential gene expression and survival. It will be a large data set, which should hopefully give robust differential expression analyses.



## Materials and Methods

#### TCGA Data Input

Search Term for limiting files
``` 
cases.project.project_id in ["TCGA-COAD"] and files.data_category in ["Raw Sequencing Data","Transcriptome Profiling"] and files.data_format in ["BAM"] and files.experimental_strategy in ["RNA-Seq"]
```

- Yields 521 files in 456 cases
- Metadata and manifest downloaded and saved to `../data2/tcga_5_coad2/metadata/`
- 434 Files will be analyzed in coad2 run (excludes those already analyzed)

Data selection saved in `COAD2_input.xlsx`
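
Before launching, the input list can be sanity-checked. The sketch below assumes the three tab-separated columns used in `coad2_input.txt` (library name, project ID, file UUID). `check_input` is a hypothetical helper and not part of the pipeline. Because the library name doubles as the output file name in the alignment script, repeated library names are reported separately from hard errors.

```bash
#!/bin/bash
# check_input: hypothetical helper, not part of the original pipeline.
# Reports malformed rows, repeated UUIDs and repeated library names in a
# tab-separated list of: library name, project ID, file UUID.
check_input() {
    awk -F'\t' '
        NF != 3 { printf "line %d: expected 3 columns, found %d\n", NR, NF; bad++; next }
        {
            # A UUID is five hyphen-separated hex groups of 8-4-4-4-12 characters
            n = split($3, p, "-")
            if (n != 5 || $3 !~ /^[0-9a-f-]+$/ || length(p[1]) != 8 ||
                length(p[2]) != 4 || length(p[3]) != 4 ||
                length(p[4]) != 4 || length(p[5]) != 12) {
                printf "line %d: malformed UUID %s\n", NR, $3; bad++
            }
        }
        seen_uuid[$3]++ { printf "line %d: repeated UUID %s\n", NR, $3; bad++ }
        seen_lib[$1]++  { printf "line %d: repeated library name %s\n", NR, $1; dup++ }
        END { printf "%d rows, %d errors, %d repeated library names\n", NR, bad, dup }
    ' "$1"
}

# Demo on three rows taken from coad2_input.txt
tmp=$(mktemp)
printf '%s\t%s\t%s\n' \
    TCGA-A6-2672-01A TCGA-COAD f08dc7f4-3cc3-4743-a84e-d586d74af8d1 \
    TCGA-A6-2674-01A TCGA-COAD 9d537202-d436-48de-8ee8-2d417576705f \
    TCGA-A6-2674-01A TCGA-COAD 0c02bf18-3f95-468a-bdb8-408ad4e77e6a > "$tmp"
check_input "$tmp"
rm -f "$tmp"
```

On the full list this would be run as `check_input coad2_input.txt`. A repeated library name is not necessarily an error, but two files sharing one name may overwrite each other's output.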


In [None]:
# New GDC Token downloaded - 181120

In [12]:
WORKDIR='/home/artem/Crown/data2/tcga_5_coad2'

cd $WORKDIR
ls

coad2_input.txt   hgr1_align_v2.tcga.sh  tcga_5_coad_pilot.log
COAD2_input.xlsx  metadata               tcga_5_coad_pilot.txt
droneB.sh         queenB.sh              tcga_5_coad_run1.txt


In [4]:
INPUT_ALL='coad2_input.txt'

cat $INPUT_ALL

# Pilot set
INPUT0='tcga_5_coad_pilot.txt'
sed -n 1,3p $INPUT_ALL > $INPUT0

TCGA-3L-AA1B-01A	TCGA-COAD	c1c36a5e-5410-45ef-8954-70c26ef27066
TCGA-4N-A93T-01A	TCGA-COAD	fd9ac46f-2517-446c-9325-06f8db2ab89c
TCGA-4T-AA8H-01A	TCGA-COAD	06921a3a-5c30-4fb0-8ed0-347f51af459d
TCGA-5M-AAT4-01A	TCGA-COAD	ef99b87e-4d27-4689-be93-6a55f20ca577
TCGA-5M-AAT5-01A	TCGA-COAD	f1b27b36-e2c0-42da-beb1-2bc2bc61abb9
TCGA-5M-AAT6-01A	TCGA-COAD	b80f2f67-842c-4b6d-9b8c-936c6f03ac96
TCGA-5M-AATA-01A	TCGA-COAD	cbbd47c7-cc50-479d-a1a9-7199f0bdb9eb
TCGA-5M-AATE-01A	TCGA-COAD	8315040e-4201-42fe-9c4e-10ff635672cf
TCGA-A6-2672-01A	TCGA-COAD	f08dc7f4-3cc3-4743-a84e-d586d74af8d1
TCGA-A6-2672-01B	TCGA-COAD	ff6d6688-c19c-4a7f-8058-4d1bc0249d83
TCGA-A6-2674-01A	TCGA-COAD	9d537202-d436-48de-8ee8-2d417576705f
TCGA-A6-2674-01A	TCGA-COAD	0c02bf18-3f95-468a-bdb8-408ad4e77e6a
TCGA-A6-2674-01B	TCGA-COAD	f9745437-1d51-43f3-940d-0086a1912aec
TCGA-A6-2676-01A	TCGA-COAD	2e48e315-5cdf-4fee-aa5d-3c7baa4030ad
TCGA-A6-2677-01A	TCGA-COAD	90832632-cf57-463b-9d08-c76975066f56
TCGA-A6-2677-01B	TCGA-COA

#### Scripts
Echo the run scripts for this analysis


In [5]:
cd $WORKDIR
# Echo 

cat hgr1_align_v2.tcga.sh
echo 
echo
cat queenB.sh
echo 
echo
cat droneB.sh
echo 
echo 

#!/bin/bash
# 1kg_align_v2.tcga.sh
# rDNA alignment pipeline
# 180831 build -- TCGA
# AMI: crown-180813 - ami-0031fd61f932bdef9
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name and file-output name
# $2 : Library population/analysis set
# $3 : Library UUID

# Control Panel -------------------------------
# CPU
	THREADS='3'

# Sequencing Data
	LIBRARY=$1 # Library/ File name

# TCGA FILE UUID
  UUID=$3

 # FastQ File-names
    FQ0="$LIBRARY.tmp.sort.0.fq"
    FQ1="$LIBRARY.tmp.sort.1.fq"
    FQ2="$LIBRARY.tmp.sort.2.fq"
    
# Read Group Data
# Extract from downloaded BAM file / input
	RGPO=$2 # Patient Population

	#RGSM= # Sample. Patient Identifer
	#RGID= # Read Group ID. Accession Number
    
	RGLB=$LIBRARY # Library Name. Accession Number
	RGPL='ILLUMINA'  # Sequencing Platform.
    
	# Extract Sequencing Run Info
	#  RGPU=$(gzip -dc $

### Results -- COAD-2 Pilot Run

Run first three libraries as a pilot to ensure pipeline is operational
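
One quick check that a run was dispatched in full is to compare the number of `Launch instance` entries in the queenB screen log with the number of rows in the input list. This is a minimal sketch: `count_launches` is a hypothetical helper, and it only confirms that instances were launched, not that each alignment finished.

```bash
#!/bin/bash
# count_launches: hypothetical helper, not part of the original pipeline.
# $1 : queenB screen log
# $2 : input list used for that run
# Succeeds only if the log holds one launch entry per input row.
count_launches() {
    launched=$(grep -c '^Launch instance #' "$1")
    expected=$(grep -c . "$2")      # count non-empty input rows
    echo "launched=$launched expected=$expected"
    [ "$launched" -eq "$expected" ]
}
```

For the pilot this would be called as `count_launches tcga_5_coad_pilot.log tcga_5_coad_pilot.txt`, with a non-zero exit status indicating a mismatch.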

In [8]:
# Local Folder Operations -----------------------------
# also re-ran with update
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, the EC2 shutdown command is commented out. Re-upload the hgr1 script before the full run

aws s3 cp queenB.sh s3://crownproject/tcga/scripts/
aws s3 cp droneB.sh s3://crownproject/tcga/scripts/
aws s3 cp hgr1_align_v2.tcga.sh s3://crownproject/tcga/scripts/
aws s3 cp $INPUT0 s3://crownproject/tcga/scripts/
aws s3 cp ../../gdc.token.txt s3://crownproject/tcga/scripts/gdc.token


upload: ./queenB.sh to s3://crownproject/tcga/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject/tcga/scripts/droneB.sh
upload: ./hgr1_align_v2.tcga.sh to s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh
upload: ./tcga_5_coad_pilot.txt to s3://crownproject/tcga/scripts/tcga_5_coad_pilot.txt
upload: ../../gdc.token.txt to s3://crownproject/tcga/scripts/gdc.token


In [9]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-6cd6f714
# t2.micro
#
# ssh login:
# ssh -i "crown.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject/tcga/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh tcga_5_coad_pilot.txt
#
# aws s3 cp screenlog.0 s3://crownproject/tcga/logs/tcga_5_coad_pilot.log

aws s3 cp s3://crownproject/tcga/logs/tcga_5_coad_pilot.log ./
cat tcga_5_coad_pilot.log

# Run completed successfully

download: s3://crownproject/tcga/logs/tcga_5_coad_pilot.log to ./tcga_5_coad_pilot.log
[ec2-user@ip-172-31-32-142 ~]$ bash queenB.sh tcga_5_coad_pilot.txt
Launch instance # 1
Tue Nov 20 22:05:52 UTC 2018
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh
Parameters: TCGA-3L-AA1B-01A TCGA-COAD c1c36a5e-5410-45ef-8954-70c26ef27066
Instance ID: i-0a1665d37becb1e64
Public DNS: ec2-34-221-231-57.us-west-2.compute.amazonaws.com
download: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh to ./hgr1_align_v2.tcga.sh


Launch instance # 2
Tue Nov 20 22:08:59 UTC 2018
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh
Parameters: TCGA-4N-A93T-01A TCGA-COAD fd9ac46f-2517-446c-9325-06f8db2ab89c
Instance ID: i-0dcc93092a19

### TCGA 5 - COAD Run set 1

Run files 4 - 100

In [11]:
# Uncomment EC2 shutdown
cat hgr1_align_v2.tcga.sh
aws s3 cp hgr1_align_v2.tcga.sh s3://crownproject/tcga/scripts/

# Run tcga5 files 4-100
sed -n 4,100p coad2_input.txt > tcga_5_coad_run1.txt
aws s3 cp tcga_5_coad_run1.txt s3://crownproject/tcga/scripts/


In [13]:
# REMOTE:
# aws s3 cp --recursive s3://crownproject/tcga/scripts/ ./

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh tcga_5_coad_run1.txt
#
# aws s3 cp screenlog.0 s3://crownproject/tcga/logs/tcga_5_coad_run1.log

aws s3 cp s3://crownproject/tcga/logs/tcga_5_coad_run1.log ./
cat tcga_5_coad_run1.log

download: s3://crownproject/tcga/logs/tcga_5_coad_run1.log to ./tcga_5_coad_run1.log
[ec2-user@ip-172-31-32-142 ~]$ bash queenB.sh tcga_5_coad_run1.txt
Launch instance # 1
Tue Nov 20 23:47:35 UTC 2018
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh
Parameters: TCGA-5M-AAT4-01A TCGA-COAD ef99b87e-4d27-4689-be93-6a55f20ca577
Instance ID: i-052c816e10176b89a
Public DNS: ec2-18-236-82-109.us-west-2.compute.amazonaws.com
download: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh to ./hgr1_align_v2.tcga.sh


Launch instance # 2
Tue Nov 20 23:50:43 UTC 2018
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh
Parameters: TCGA-5M-AAT5-01A TCGA-COAD f1b27b36-e2c0-42da-beb1-2bc2bc61abb9
Instance ID: i-0b789cb4d7263e70a
P

## GVCF Analysis of TCGA-COAD Complete Set

(Note: Incomplete, only contains Pilot + COAD-2-Run1)
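
Because the set is still incomplete, it helps to record how many libraries each GVCF actually contains. A minimal sketch, assuming a standard VCF `#CHROM` header line with nine fixed columns ahead of the sample columns; `count_vcf_samples` is a hypothetical helper, not part of the pipeline.

```bash
#!/bin/bash
# count_vcf_samples: hypothetical helper.
# Prints the number of sample columns in the first #CHROM header line of
# the VCF/GVCF given as $1 (columns 1-9 are the fixed VCF fields).
count_vcf_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}
```

The result can be compared against the BAM list for the same capture, for example `count_vcf_samples TCGA-COAD.18S.gvcf` against `wc -l < tcga-coad.bamlist`. Only the first header line is read, since a GVCF written with `>>` may hold more than one header if the script is re-run.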

In [None]:
# Ran on Remote:
# Download TCGA-COAD Data
aws s3 cp --recursive "s3://crownproject/tcga/tcga-coad0"  ./TCGA-COAD/
aws s3 cp --recursive "s3://crownproject/tcga/tcga-coad-1" ./TCGA-COAD/
aws s3 cp --recursive "s3://crownproject/tcga/TCGA-COAD"   ./TCGA-COAD/

cd tcga
bash ~/scripts/ADcalc_c.sh

# Copied to s3://crownproject/tcga/181120_tcga4_gvcf/

In [None]:
#!/bin/bash
# ADcalc_c.sh
# Allelic Depth Calculator
# for a position

cd ~/tcga/

# Controls -----------------
DEPTH='100000'
BAMLIST='bam.list.tmp'

#regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:10219-10340' \
	'chr13:1006622-1006779' 'chr13:1007948-1013018')

#corresponding gene names
GENES=('18S' '5S' '5.8S' '28S')

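# Iterate through every TCGA Cancer Type
for TYPE in "TCGA-COAD"
do
    echo "Analyzing $TYPE..."
    cd "$TYPE" || exit 1

    # List the BAM files for this type; mpileup reads them via -b $BAMLIST
    ls *.bam > "$BAMLIST"
    # OUTPUT is not set until the gene loop below, so name the per-type list by TYPE
    ls *.bam >> "../$TYPE.bamlist"
    ls *.bam >> ../tcga.bamlist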

    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done
      
    rm bam.list.tmp
    
    cd ..
done

In [14]:
WORKDIR='/home/artem/Desktop/Crown/data2/tcga_analysis/181120_tcga4_gvcf'
cd $WORKDIR

aws s3 cp s3://crownproject/tcga/181120_tcga4_gvcf/tcga-coad.bamlist ./

aws s3 cp s3://crownproject/tcga/181120_tcga4_gvcf/TCGA-COAD.18S.gvcf ./
aws s3 cp s3://crownproject/tcga/181120_tcga4_gvcf/TCGA-COAD.5S.gvcf ./
aws s3 cp s3://crownproject/tcga/181120_tcga4_gvcf/TCGA-COAD.5.8S.gvcf ./
aws s3 cp s3://crownproject/tcga/181120_tcga4_gvcf/TCGA-COAD.28S.gvcf ./

# Note: updated data will be in 181123 capture

download: s3://crownproject/tcga/181120_tcga4_gvcf/tcga-coad.bamlist to ./tcga-coad.bamlist

In [15]:
## After TCGA-COAD-2 Run2 Complete - Re-run above analysis
## Launched m4.xlarge
## AMI: crown-180914 ami-096bcb9d18c32d4d5
## 250 Gb SSD

# Ran on Remote:
screen
cd tcga
rm -r TCGA-COAD/*

# Download TCGA-COAD Data
aws s3 cp --recursive "s3://crownproject/tcga/tcga-coad0"  ./TCGA-COAD/
aws s3 cp --recursive "s3://crownproject/tcga/tcga-coad-1" ./TCGA-COAD/
aws s3 cp --recursive "s3://crownproject/tcga/TCGA-COAD"   ./TCGA-COAD/

rm -r ./TCGA-COAD/logs/*
rmdir ./TCGA-COAD/logs

screen -L
cd tcga
bash ~/scripts/ADcalc_c.sh

# Copied to s3://crownproject/tcga/181120_tcga4_gvcf/




## TCGA-COAD-2 Run 2

Files 101 - end
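
The pilot, run 1 and run 2 lists are cut from `coad2_input.txt` by line range (1-3, 4-100, 101-434). A minimal sketch for confirming that a set of ranges covers every line exactly once follows; `check_ranges` is a hypothetical helper, not part of the pipeline, and the total of 434 lines is the file count stated in the data input section.

```bash
#!/bin/bash
# check_ranges: hypothetical helper.
# $1      : total number of lines in the input list
# $2 ...  : "start,end" line ranges, in order
# Verifies the ranges start at line 1, are contiguous, and end at the
# total, so that every input line lands in exactly one run.
check_ranges() {
    total=$1; shift
    next=1
    for range in "$@"; do
        start=${range%,*}
        end=${range#*,}
        if [ "$start" -ne "$next" ] || [ "$end" -lt "$start" ]; then
            echo "gap or overlap at range $range (expected start $next)"
            return 1
        fi
        next=$((end + 1))
    done
    if [ "$next" -ne $((total + 1)) ]; then
        echo "ranges end at line $((next - 1)), expected $total"
        return 1
    fi
    echo "ranges cover lines 1-$total exactly once"
}

check_ranges 434 1,3 4,100 101,434
```

In practice the total could be read from the file itself, for example `check_ranges "$(wc -l < coad2_input.txt)" 1,3 4,100 101,434`.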

In [16]:
cd /home/artem/Crown/data2/tcga_5_coad2

# Run tcga5 files 101 - 434
sed -n 101,434p coad2_input.txt > tcga_5_coad_run2.txt
aws s3 cp tcga_5_coad_run2.txt s3://crownproject/tcga/scripts/

upload: ./tcga_5_coad_run2.txt to s3://crownproject/tcga/scripts/tcga_5_coad_run2.txt


In [1]:
# Remote EC2 Instance Operations ----------------------
# REMOTE:
# aws s3 cp --recursive s3://crownproject/tcga/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh tcga_5_coad_run2.txt
#
# aws s3 cp screenlog.0 s3://crownproject/tcga/logs/tcga_5_coad_run2.log


# Local: 
cd /home/artem/Crown/data2/tcga_5_coad2
aws s3 cp s3://crownproject/tcga/logs/tcga_5_coad_run2.log ./
cat tcga_5_coad_run2.log

# Run completed successfully

download: s3://crownproject/tcga/logs/tcga_5_coad_run2.log to ./tcga_5_coad_run2.log
[ec2-user@ip-172-31-26-254 ~]$ aws s3 cp screenlog.0 s3://crownproject/tcga/logs/tcga_5_coad_run2.log
upload: ./screenlog.0 to s3://crownproject/tcga/logs/tcga_5_coad_run2.log
[ec2-user@ip-172-31-26-254 ~]$ bash queenB.sh tcga_5_coad_run2.txt
Launch instance # 1
Thu Nov 22 18:47:55 UTC 2018
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh
Parameters: TCGA-AA-3678-01A TCGA-COAD b9861b19-9008-4381-b6fc-1119427b93fd
Instance ID: i-0a81b90263c98cb9f
Public DNS: ec2-52-40-102-61.us-west-2.compute.amazonaws.com
download: s3://crownproject/tcga/scripts/hgr1_align_v2.tcga.sh to ./hgr1_alig

## TCGA-COAD-2 Analysis

Preliminary analysis using `TCGA-COAD-1` and `TCGA-COAD-2-Run1` datasets only.

Downloaded `TCGA-COAD.18S.gvcf`, extracted VCF data, analyzed initially in `/home/artem/Desktop/Crown/data2/tcga_analysis/pilot/TCGA-18S_1248.xlsx`
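
Reference and variant allele frequencies (RAF, VAF) can be derived from the per-sample allelic depths (`AD`) in the GVCF. This is a minimal sketch, not the spreadsheet calculation actually used. It assumes RAF is the reference depth as a percentage of the summed allelic depths, and that `AD` is present in the FORMAT column (as requested by `-a FORMAT/DP,AD` in `ADcalc_c.sh`). The function name `vcf_to_raf` and the demo record are made up for illustration.

```bash
#!/bin/bash
# vcf_to_raf: hypothetical helper, not part of the original pipeline.
# $1 : VCF/GVCF file
# $2 : position (POS column) to report
# Prints sample name, RAF and VAF (percent) for each sample with coverage.
vcf_to_raf() {
    awk -F'\t' -v pos="$2" '
        /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
        /^#/      { next }
        $2 == pos {
            # Locate the AD entry within the FORMAT column (field 9)
            n = split($9, fmt, ":"); ad = 0
            for (k = 1; k <= n; k++) if (fmt[k] == "AD") ad = k
            if (!ad) next
            for (i = 10; i <= NF; i++) {
                split($i, val, ":")
                m = split(val[ad], depth, ",")
                total = 0
                for (j = 1; j <= m; j++) total += depth[j]
                if (total == 0) continue      # no coverage in this sample
                printf "%s\t%.2f\t%.2f\n", name[i],
                    100 * depth[1] / total, 100 * (total - depth[1]) / total
            }
        }
    ' "$1"
}

# Demo on a made-up two-sample record (not real data)
tmp=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n' > "$tmp"
printf 'chr13\t100\t.\tT\tC,<*>\t0\t.\t.\tDP:AD\t200:150,50,0\t100:99,1,0\n' >> "$tmp"
vcf_to_raf "$tmp" 100
rm -f "$tmp"
```

Summing all non-reference depths into VAF is a design choice of this sketch; a per-allele breakdown would need one output column per ALT allele.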

Note: There IS a very large difference between `TCGA-NN-XXXX-01A` and `TCGA-NN-XXXX-01B` samples. I can't determine what distinguishes these two sample types.

This may be a substantial technical confounding effect in the data; ensure that all future analyses compare 01A vs. 11A.

```
SampleID	RAF	VAF	SubmitterID	sample
TCGA-A6-2674-01A	85.66552901	14.33447099	TCGA-A6-2674	01A
TCGA-A6-2677-01A	93.750000	6.250000	TCGA-A6-2677	01A
TCGA-A6-2684-01A	64.19308357	35.80691643	TCGA-A6-2684	01A
TCGA-A6-3809-01A	65.21478521	34.78521479	TCGA-A6-3809	01A
TCGA-A6-3810-01A	44.64922711	55.35077289	TCGA-A6-3810	01A
TCGA-A6-5656-01A	72.45129135	27.54870865	TCGA-A6-5656	01A
TCGA-A6-5659-01A	66.33076467	33.66923533	TCGA-A6-5659	01A
TCGA-A6-5661-01A	53.61972951	46.38027049	TCGA-A6-5661	01A
TCGA-A6-5665-01A	59.06011854	40.93988146	TCGA-A6-5665	01A
TCGA-A6-6780-01A	96.88378632	3.116213683	TCGA-A6-6780	01A
TCGA-A6-6781-01A	50.15587115	49.84412885	TCGA-A6-6781	01A

TCGA-A6-2674-01B	98.49981512	1.500184882	TCGA-A6-2674	01B
TCGA-A6-2677-01B	97.51013318	2.489866821	TCGA-A6-2677	01B
TCGA-A6-3809-01B	98.58570552	1.414294476	TCGA-A6-3809	01B
TCGA-A6-5656-01B	98.8341379	1.1658621	TCGA-A6-5656	01B
TCGA-A6-5659-01B	98.8868536	1.1131464	TCGA-A6-5659	01B
TCGA-A6-5661-01B	99.4146838	0.585316198	TCGA-A6-5661	01B
TCGA-A6-5665-01B	99.02525275	0.974747246	TCGA-A6-5665	01B
TCGA-A6-6650-01B	97.95766803	2.042331972	TCGA-A6-6650	01B
TCGA-A6-6780-01B	96.80881107	3.19118893	TCGA-A6-6780	01B
TCGA-A6-6781-01B	98.45925376	1.540746241	TCGA-A6-6781	01B
TCGA-A6-2684-01C	98.36116505	1.638834951	TCGA-A6-2684	01C
```
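
To put a number on the 01A versus 01B difference, the mean VAF per sample type can be computed from a table in the layout above. A minimal sketch with awk follows. `mean_vaf_by_type` is a hypothetical helper, the column positions (VAF in column 3, sample type in column 5) are taken from the table header, and the demo uses only four of the rows above.

```bash
#!/bin/bash
# mean_vaf_by_type: hypothetical helper, not part of the original analysis.
# Reads a tab-separated table with columns SampleID, RAF, VAF, SubmitterID,
# sample and prints, per sample type: type, number of samples, mean VAF.
mean_vaf_by_type() {
    awk -F'\t' '
        NR == 1 || NF < 5 { next }          # skip header and blank rows
        { sum[$5] += $3; n[$5]++ }
        END { for (t in n) printf "%s\t%d\t%.2f\n", t, n[t], sum[t] / n[t] }
    ' "$1" | sort
}

# Demo on four rows copied from the table above
tmp=$(mktemp)
{
    printf '%s\t%s\t%s\t%s\t%s\n' SampleID RAF VAF SubmitterID sample
    printf '%s\t%s\t%s\t%s\t%s\n' \
        TCGA-A6-2674-01A 85.66552901 14.33447099 TCGA-A6-2674 01A \
        TCGA-A6-2677-01A 93.750000   6.250000    TCGA-A6-2677 01A \
        TCGA-A6-2674-01B 98.49981512 1.500184882 TCGA-A6-2674 01B \
        TCGA-A6-2677-01B 97.51013318 2.489866821 TCGA-A6-2677 01B
} > "$tmp"
mean_vaf_by_type "$tmp"
rm -f "$tmp"
```

A group mean is only a first look. Since several patients have both an 01A and an 01B library, a paired comparison by SubmitterID would be the more direct test of a technical effect.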