# Initialize Amazon EC2 Instance for TCGA Analysis
```
pi:ababaian
files: ~/Crown/data2/tcga_0/
start: 2018 07 30
complete : 2018 08 14
Addendum: 2018 09 06
```
## Introduction

I've been granted access to TCGA data and was awarded $8K of Amazon AWS credits. I have 11 months to process the entire TCGA cohort of DNA- and RNA-seq data, aligning it against hgr1 to build a database of rDNA alignments for analysis.

The first step is to set up an EC2 pipeline instance that I can then launch in parallel to process large amounts of TCGA data.


## Objective

Initialize an EC2 instance such that the entire TCGA analysis pipeline can be run in a single step (use hgr1 align pipeline). 

This instance will be private and encrypted, so my TCGA/dbGaP credentials can be stored on it and each instance can download TCGA data directly.


## Materials and Methods

### Crown 180813 TCGA Instance Image (AMI) Set-up

Launched an EC2 instance based on Crown 170220 AMI

```
ssh -i "~/.ssh/glitch.pem" ubuntu@ec2-34-216-2-13.us-west-2.compute.amazonaws.com 
```

In [None]:
# Install hgr1 resource into AMI image

# In Resource folder create dir for each genome type
cd ~/resources/

mkdir hg38
mv hg38*.* hg38/

mkdir hgr0
mv hgr*.* hgr0/

# Download hgr1 resource genome
mkdir hgr1
cd hgr1
aws s3 cp s3://crownproject/resources/hgr1.fa ./
samtools faidx hgr1.fa
bowtie2-build hgr1.fa hgr1


In [1]:
# Install gdc-client into AMI image

cd ~/software/

# Download TCGA-GDC File Transfer Tool (ubuntu)
wget https://gdc.cancer.gov/system/files/authenticated%20user/0/gdc-client_v1.3.0_Ubuntu14.04_x64.zip
  sudo apt install unzip
  unzip gdc-client_v1.3.0_Ubuntu14.04_x64.zip
  mv gdc-client ~/bin/
  
# Copy over gdc token to resources
cd ~/resources/
vim gdc.token #copied over manually
chmod 400 gdc.token

# Files can now be downloaded from a manifest file (created on GDC)
# gdc-client download -t ~/resources/gdc.token -d ./ -m manifest.test 


[sudo] password for artem: 


In [None]:
# Install HISAT2 into AMI image

cd ~/software/
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2*
mv hisat2-2.1.0-Linux_x86_64.zip hisat2-2.1.0

# add hisat2 dir to PATH
export PATH="/home/ubuntu/software/hisat2-2.1.0:$PATH"

# clean-up some older software zips to one folder
mkdir zips
mv cufflinks-2.2.1.Linux_x86_64.tar.gz zips/
mv gdc-client_v1.3.0_Ubuntu14.04_x64.zip  zips/

# Install Samtools 1.9 to AMI Image

cd ~/software/
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar -xf samtools-1.9.tar.bz2
# Install dependencies
sudo apt-get update  # Ensure the package list is up to date
sudo apt-get install autoconf automake make gcc perl \
  zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev \
  libssl-dev libncurses5-dev

# Install samtools
cd samtools-1.9/
./configure
make
make DESTDIR=$PWD install

# samtools back-version compatibility
which samtools

mv /home/ubuntu/software/pitchfork/deployment/bin/samtools ~/bin/samtools_0.1.20
ln /home/ubuntu/software/samtools-1.9/samtools ~/bin/samtools

# Requires re-login to function.
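
The `export PATH=...` line in the HISAT2 step only affects the current shell session, so it would not survive the re-login mentioned above or carry over into a new instance launched from the AMI. A minimal sketch of one way to persist it is below. It assumes `~/.bashrc` is the file read at login on this AMI, and `add_line_once` is a helper name made up for this example. The demo writes to a scratch file rather than a real rc file.

```shell
#!/bin/bash
# Sketch: make a PATH addition survive re-login by recording it in a shell rc file.

add_line_once() {
    # Append line $2 to file $1 unless an identical line is already present.
    touch "$1"
    grep -qxF -e "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# The export line from the HISAT2 step (single quotes keep $PATH literal).
PATH_LINE='export PATH="/home/ubuntu/software/hisat2-2.1.0:$PATH"'

# Demo against a scratch file; calling twice still leaves a single copy.
DEMO_RC=$(mktemp)
add_line_once "$DEMO_RC" "$PATH_LINE"
add_line_once "$DEMO_RC" "$PATH_LINE"
cat "$DEMO_RC"
```

On the instance itself the call would be `add_line_once ~/.bashrc "$PATH_LINE"`, run once before the AMI is saved.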



### Running GDC Client Locally to debug

In [5]:
### Running gdc-client locally
cd ~/Desktop/gdc

# From the GDC website; created a manifest file containing a 
# single rnaseq bam file for testing.

cat gdc_manifest*
echo ''
echo ''

#./gdc-client download -t gdc.token -d ./ -n 2 -m gdc_manifest_20180730_223304.txt
# downloaded

ls -alh *

echo ''
echo 'select bam file only'
echo ''
ls -alh */*.bam

# can also use command to select bam filename specifically
# cut -f2 gdc_manifest* | tail -n-1

id	filename	md5	size	state
9f52e91b-cf6f-4701-a651-096979267ae6	5192c580-43f3-4f41-81f2-cfd68b7a78fd_gdc_realn_rehead.bam	03bdc3e675ec69f2a110cedce1ad52a6	2240824619	submitted

-rwxr-xr-x 1 artem artem  24M Aug 14  2017 gdc-client
-rw-rw-r-- 1 artem artem  24M Jul 30 14:44 gdc-client_v1.3.0_Ubuntu14.04_x64.zip
-rw-rw-r-- 1 artem artem  175 Jul 30 15:33 gdc_manifest_20180730_223304.txt
-r-------- 1 artem artem 1.1K Jul 30 14:49 gdc.token

9f52e91b-cf6f-4701-a651-096979267ae6:
total 2.1G
drwxrwxr-x 3 artem artem 4.0K Jul 30 17:09 .
drwxrwxr-x 3 artem artem 4.0K Jul 30 16:53 ..
-rw-rw-r-- 1 artem artem 5.4M Jul 30 17:09 5192c580-43f3-4f41-81f2-cfd68b7a78fd_gdc_realn_rehead.bai
-rw-rw-r-- 1 artem artem 2.1G Jul 30 17:09 5192c580-43f3-4f41-81f2-cfd68b7a78fd_gdc_realn_rehead.bam
-rw-rw-r-- 1 artem artem  410 Jul 30 17:09 annotations.txt
drwxrwxr-x 2 artem artem 4.0K Jul 30 17:09 logs

select bam file only

-rw-rw-r-- 1 artem artem 2.1G Jul 30 17:09 9f52e91b-cf6f-4701-a651-
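
The `cut -f2 ... | tail` one-liner above works for a single-file manifest. For manifests with several rows, a sketch like the following skips the header explicitly and keeps the UUID next to each BAM filename. `manifest_bams` is a name invented for this example, and the demo rebuilds the one-row test manifest shown above in a scratch file.

```shell
#!/bin/bash
# Sketch: list "uuid<TAB>filename" for every BAM entry in a GDC manifest.
# The manifest is tab-separated with a header row: id, filename, md5, size, state.

manifest_bams() {
    # NR > 1 skips the header; keep only rows whose filename ends in .bam
    awk -F'\t' 'NR > 1 && $2 ~ /\.bam$/ { print $1 "\t" $2 }' "$1"
}

# Demo on a scratch copy of the single-file test manifest.
DEMO_MANIFEST=$(mktemp)
printf 'id\tfilename\tmd5\tsize\tstate\n' > "$DEMO_MANIFEST"
printf '%s\t%s\t%s\t%s\t%s\n' \
    9f52e91b-cf6f-4701-a651-096979267ae6 \
    5192c580-43f3-4f41-81f2-cfd68b7a78fd_gdc_realn_rehead.bam \
    03bdc3e675ec69f2a110cedce1ad52a6 2240824619 submitted >> "$DEMO_MANIFEST"

manifest_bams "$DEMO_MANIFEST"
```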

In [None]:
# Compare HISAT2 vs. bowtie2 alignment for hgr1
# to test if I should switch over alignment protocol

# Extracted reads from bamfile into fastq_0 format.

# HISAT2: align to genome
    hisat2 -p 2 \
      -x hgr1 -U tmp.sort.0.fq | \
      samtools view -bS - > aligned_unsorted.bam
      
    # 30120644 reads; of these:
    # 30120644 (100.00%) were unpaired; of these:
    # 29522710 (98.01%) aligned 0 times
    # 597927 (1.99%) aligned exactly 1 time
    # 7 (0.00%) aligned >1 times
    # 1.99% overall alignment rate
    
    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p 2 \
      -x hgr1 -U tmp.sort.0.fq | \
      samtools view -bS - > aligned_unsorted.bt2.bam
      
    # 30120644 reads; of these:
    # 30120644 (100.00%) were unpaired; of these:
    # 29448773 (97.77%) aligned 0 times
    # 669940 (2.22%) aligned exactly 1 time
    # 1931 (0.01%) aligned >1 times
    # 2.23% overall alignment rate
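
To put the two summaries above side by side, the aligned-read counts can be derived from the "aligned 0 times" lines. The sketch below only does that arithmetic on the numbers reported above and makes no claim about which aligner is preferable.

```shell
#!/bin/bash
# Counts copied from the HISAT2 and bowtie2 alignment summaries.
TOTAL=30120644
HISAT2_UNALIGNED=29522710
BOWTIE2_UNALIGNED=29448773

# Aligned reads = total reads minus reads that aligned 0 times
HISAT2_ALIGNED=$((TOTAL - HISAT2_UNALIGNED))
BOWTIE2_ALIGNED=$((TOTAL - BOWTIE2_UNALIGNED))
EXTRA=$((BOWTIE2_ALIGNED - HISAT2_ALIGNED))

echo "HISAT2 aligned:  $HISAT2_ALIGNED"
echo "bowtie2 aligned: $BOWTIE2_ALIGNED"
echo "bowtie2 gain:    $EXTRA reads"

# Relative gain of bowtie2 over HISAT2, as a percentage of HISAT2-aligned reads
awk -v e="$EXTRA" -v h="$HISAT2_ALIGNED" \
    'BEGIN { printf "relative gain:   %.1f%%\n", 100 * e / h }'
```

With these inputs bowtie2 `--very-sensitive-local` places 73937 more reads on hgr1 than HISAT2, roughly 12% more in relative terms.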



In [None]:
# samtools tricks for extracting data

# Extract ReadGroup Sample Name (SM)
samtools view -H */*.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq

# Extract ReadGroup identifier (ID)
samtools view -H */*.bam | grep '^@RG' | sed "s/.*ID:\([^\t]*\).*/\1/g" | uniq

# Extract ReadGroup platform (PL)
samtools view -H */*.bam | grep '^@RG' | sed "s/.*PL:\([^\t]*\).*/\1/g" | uniq


# Convert input bam file to fastq files for re-alignment
samtools sort -n initial.bam | samtools fastq -O \
    -0 tmp.sort.0.fq \
    -1 tmp.sort.1.fq \
    -2 tmp.sort.2.fq - 
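
The read-group extractions above repeat one `sed` pattern per tag. As an alternative sketch (not the method used in this notebook), the same lookup can be written once with `awk`, splitting each `@RG` line on tabs. `rg_tag` is a name invented here, and the demo header contains made-up values rather than anything from a real TCGA file. On an instance the input would come from `samtools view -H input.bam`.

```shell
#!/bin/bash
# Sketch: pull one tag (SM, ID, PL, ...) out of @RG header lines read from stdin.

rg_tag() {
    awk -F'\t' -v tag="$1" '
        /^@RG/ {
            # Fields look like "SM:value"; print the value for the requested tag
            for (i = 2; i <= NF; i++)
                if (index($i, tag ":") == 1) print substr($i, length(tag) + 2)
        }' | sort -u
}

# Demo header with hypothetical values.
DEMO_HEADER=$(printf '@HD\tVN:1.4\tSO:coordinate\n@RG\tID:rg_demo_1\tSM:sample_demo\tPL:ILLUMINA\tLB:lib_demo\n')

printf '%s\n' "$DEMO_HEADER" | rg_tag SM
printf '%s\n' "$DEMO_HEADER" | rg_tag ID
printf '%s\n' "$DEMO_HEADER" | rg_tag PL
```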


### Running test data on EC2

Launched crown-180813 AMI on a t2.xlarge instance with 80 Gb storage.

Running through test script below manually to debug / test commands

In [None]:
#!/bin/bash
# TESTING SCRIPT (NON PRODUCTION)
# rDNA alignment pipeline
# 180731 build -- TCGA
# AMI: crown-XXXXX - ami-XXXXX
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 300 Gb
#

# Control Panel -------------------------------
# CPU
	THREADS='3'

# Sequencing Data
	LIBRARY="test_align" # Library/ File name
#	LIBRARY=$1 # Library/ File name

# TCGA FILE UUID
    UUID='9f52e91b-cf6f-4701-a651-096979267ae6'
#   UUID=$2

    # FastQ File-names
    FQ0="$LIBRARY.tmp.sort.0.fq"
    FQ1="$LIBRARY.tmp.sort.1.fq"
    FQ2="$LIBRARY.tmp.sort.2.fq"
    
# Read Group Data
# Extract from downloaded BAM file / input
	RGPO="CRC" # Patient Population
#	RGPO=$2 # Patient Population

	#RGSM=   # Sample. Patient Identifer
	#RGID= # Read Group ID. Accession Number
    
	RGLB=$LIBRARY # Library Name. Accession Number
	RGPL='ILLUMINA'  # Sequencing Platform.
    
	# Extract Sequencing Run Info
	#  RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d':' | cut -f2 -d' ')

# Initialize workdir --------------------------

# Make working directory
  mkdir -p align; cd align

# Copy hgrX genome and create bowtie2 index
  cp ~/resources/hgr1/* ./
  
# Download RNAseq BAM file
# with a UUID as input
  gdc-client download -t ~/resources/gdc.token -d ./ \
  -n $THREADS \
  $UUID
  
# Link the RNA-seq bamfile (named by its UUID) into the workdir
  ln -s */*.bam input.bam
  
# Extract ReadGroup Sample Name (SM)
  RGSM=$(samtools view -H input.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq )

# Extract ReadGroup identifier (ID)
  RGID=$(samtools view -H input.bam | grep '^@RG' | sed "s/.*ID:\([^\t]*\).*/\1/g" | uniq )

# Convert input bam file to fastq files for re-alignment
samtools sort -@ $THREADS -n input.bam | \
    samtools fastq -@ $THREADS -O \
    -0 $FQ0 \
    -1 $FQ1 \
    -2 $FQ2 -

# SINGLE END READS ====================================================

if [ -s $FQ0 ]
then
    # Single-End Extracted Reads Alignment

    # Extract Sequencing Run Info
    #RGPU=$(gzip -dc $FQ0| head -n1 - | cut -f1 -d':' | cut -f2 -d' ')
    RGPU=$(head -n1 $FQ0 | cut -f1 -d':' | cut -f2 -d' ')

    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS \
      --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -U $FQ0 | \
      samtools view -bS - > aligned_unsorted.bam
     
    # Calculate library flagstats
      samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat
      rm $FQ0 # Remove fastq files to save space

    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Extract Mapped Reads
      samtools view -b -F 4 aligned_unsorted.bam | \
      samtools sort > align.hgr1.bam #mapped
      
    # Calculate library flagstats
    samtools index align.hgr1.bam
    samtools flagstat align.hgr1.bam > align.hgr1.flagstat

    # Rename the total Bam Files
      #mv aligned_unsorted.bam $LIBRARY.se.bam
      #mv aligned_unsorted.bam.bai $LIBRARY.se.bam.bai
      mv aligned_unsorted.flagstat $LIBRARY.se.flagstat

    # Rename the hgr-aligned Bam files
      mv align.hgr1.bam $LIBRARY.hgr1.se.bam
      mv align.hgr1.bam.bai $LIBRARY.hgr1.se.bam.bai
      mv align.hgr1.flagstat $LIBRARY.hgr1.se.flagstat
      
    # Alignments (Full)
    aws s3 cp $LIBRARY.se.flagstat s3://crownproject/tcga/

    # Alignments (Aligned)
    aws s3 cp $LIBRARY.hgr1.se.bam s3://crownproject/tcga/
    aws s3 cp $LIBRARY.hgr1.se.bam.bai s3://crownproject/tcga/
    aws s3 cp $LIBRARY.hgr1.se.flagstat s3://crownproject/tcga/

fi

# PAIRED END READS ====================================================


if [ -s $FQ1 ]
then
    # Paired-End Extracted Reads Alignment

    # Extract Sequencing Run Info
    # FASTQ files are written uncompressed by samtools fastq, so read them directly
    RGPU=$(head -n1 $FQ1 | cut -f1 -d':' | cut -f2 -d' ')
    
    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS \
      --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -1 $FQ1 -2 $FQ2 | \
      samtools view -bS - > aligned_unsorted.bam
      
    # Calculate library flagstats
      samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat
      
      rm $FQ1 $FQ2 # Remove fastq files to save space

      
    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Unmapped reads with mapped pairs
      # Extract Mapped Reads
      # and their unmapped pairs
      samtools view -b -F 4 aligned_unsorted.bam > align.F4.bam #mapped
      samtools view -b -f 4 -F 8 aligned_unsorted.bam > align.f4F8.bam #unmapped pairs

      # Extract just the 45S unit
      #aws s3 cp s3://crownproject/resources/rDNA_45s.bed ./
      #samtools view -b -L rDNA_45s.bed align.F4.bam > align.F4.45s.bam

      # What are the mapped readnames
      samtools view align.F4.bam | cut -f1 - > read.names.tmp

      # Extract mapped reads
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam


      # Extract cases of read pairs mapped on edge of region of interest
      # -------|======= R O I ======| ----------
      # read:                  ====---====
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

      # Complete mapped reads list
      #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

      # Extract unmapped reads with a mapped pair
      samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

      # Re-compile bam file
      cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr1.tmp.bam
        # samtools >= 1.0 syntax (the old "sort in.bam out_prefix" form fails in 1.9)
        samtools sort -o align.hgr1.bam align.hgr1.tmp.bam
        samtools index align.hgr1.bam
        samtools flagstat align.hgr1.bam > align.hgr1.flagstat

      # Clean up 
      rm *tmp* align.F4.bam align.f4F8.bam

    # Rename the total Bam Files
      mv aligned_unsorted.bam $LIBRARY.bam
      #mv aligned_unsorted.bam.bai $LIBRARY.bam.bai # no index is built for the unsorted BAM
      mv aligned_unsorted.flagstat $LIBRARY.flagstat

    # Rename the hgr-aligned Bam files
      mv align.hgr1.bam $LIBRARY.hgr1.bam
      mv align.hgr1.bam.bai $LIBRARY.hgr1.bam.bai
      mv align.hgr1.flagstat $LIBRARY.hgr1.flagstat
    
fi

rm -f $FQ0 $FQ1 $FQ2 # Remove any remaining fastq files to save space
  
# Alignments (Full)
 aws s3 cp $LIBRARY.flagstat s3://crownproject/tcga/

# Alignments (Aligned)
  aws s3 cp $LIBRARY.hgr1.bam s3://crownproject/tcga/
  aws s3 cp $LIBRARY.hgr1.bam.bai s3://crownproject/tcga/
  aws s3 cp $LIBRARY.hgr1.flagstat s3://crownproject/tcga/

# VCF
 aws s3 cp $LIBRARY.hgr1.vcf s3://crownproject/tcga/
 aws s3 cp $LIBRARY.hgr1.vcf.idx s3://crownproject/tcga/
 
# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
aws ec2 terminate-instances --instance-ids $EC2ID

# Script complete
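
Because the file UUID is the only handle `gdc-client` has on the download, a format check before the download step could catch a mistyped or truncated value early. The sketch below is an assumption about how such a guard might look and is not part of the pipeline above. `is_uuid` is a made-up helper, and the second candidate is the test UUID with its last character deliberately dropped.

```shell
#!/bin/bash
# Sketch: reject malformed file UUIDs before calling gdc-client.

is_uuid() {
    # 8-4-4-4-12 groups of lowercase hexadecimal digits
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

for candidate in \
    9f52e91b-cf6f-4701-a651-096979267ae6 \
    9f52e91b-cf6f-4701-a651-096979267ae
do
    if is_uuid "$candidate"; then
        echo "ok:        $candidate"
    else
        echo "malformed: $candidate"
    fi
done
```

In a production script the `else` branch would more likely print to stderr and exit non-zero before any instance time is spent on the download.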

### TCGA-COAD RNA-seq selection

For the initial trial of the multiple-launch pipeline (using queenB / droneB scripts), I first need a selection of RNA-seq files from TCGA to align. Let's start with 5 examples.

On the [GDC website](https://portal.gdc.cancer.gov), I used the advanced selection to download a `Manifest File`, `Sample File`, `Clinical File` and `Exposure File` for all files which match the following selection criteria:

```
cases.primary_site in ["Colorectal"] and cases.project.project_id in ["TCGA-COAD"] and files.data_category in ["Raw Sequencing Data","Transcriptome Profiling"] and files.data_format in ["BAM"] and files.experimental_strategy in ["RNA-Seq"]
```



In [2]:
cd /home/artem/Desktop/Crown/data2/tcga_0/coad

head -n5 COAD_rnaseq_manifest.txt
echo ''

head -n5 COAD_clinical_table.tsv
echo ''

head -n5 COAD_sample_table.tsv

id	filename	md5	size	state
01581d3a-8427-41db-82c2-a7dd880a3937	9e79af22-063c-4f44-a353-3066a096e9c8_gdc_realn_rehead.bam	a872da35d9fff4cdd63cadc386045983	6299788578	submitted
017e4572-bbde-4e44-9de2-22e7d2573603	5ef172da-d39f-4f80-89a9-10f656a441ba_gdc_realn_rehead.bam	ee95d6e07a8892e76216e1627c58f236	6167063349	submitted
01871314-b195-4142-8149-1cff7ea8c3b4	eb089e55-398c-4f58-84d6-484bc5f5707d_gdc_realn_rehead.bam	2649ec6ab10affb57bb84e88f34d6420	2254640489	submitted
0188bc58-2deb-4978-98c3-3028d6fa61f8	c513f9b0-764d-4db0-a163-9d2b28111957_gdc_realn_rehead.bam	dbf7c6bd6bf2b6888e55dd3f3e6d30e0	5795499393	submitted

case_id	submitter_id	project_id	gender	year_of_birth	race	ethnicity	year_of_death	classification_of_tumor	last_known_disease_status	primary_diagnosis	tumor_stage	age_at_diagnosis	vital_status	morphology	days_to_death	days_to_last_known_disease_status	days_to_recurrence	tumor_grade	tissue_or_organ_of_origin	days_to_birth	progression_or_recurrence	prior_malignancy	site_

In [3]:
# Which can be simplified to the input parameters:

cat ../tcga0_input.tsv


TCGA-F4-6461-01A	TCGA-COAD	01581d3a-8427-41db-82c2-a7dd880a3937
TCGA-G4-6315-01A	TCGA-COAD	017e4572-bbde-4e44-9de2-22e7d2573603
TCGA-AA-3529-01A	TCGA-COAD	01871314-b195-4142-8149-1cff7ea8c3b4
TCGA-AZ-6599-01A	TCGA-COAD	0188bc58-2deb-4978-98c3-3028d6fa61f8
TCGA-QG-A5Z1-01A	TCGA-COAD	019ae823-acfc-49a6-a144-3b363b15a0dc
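
Each row of this table carries the three inputs a single alignment run needs: sample barcode, project ID, and file UUID. A rough sketch of how the rows might be turned into one job per line is shown below. The `droneB.sh` command line it prints is a placeholder, since the real queenB / droneB interface is not described in this notebook, and the demo only echoes the commands instead of launching anything.

```shell
#!/bin/bash
# Sketch: read a 3-column, tab-separated input table and print one job per row.
# Columns: sample barcode, project ID, file UUID (no header row).

TAB=$(printf '\t')

plan_jobs() {
    while IFS="$TAB" read -r sample project uuid; do
        [ -n "$uuid" ] || continue   # skip blank or incomplete rows
        # Placeholder command line; the actual launcher arguments are an assumption
        echo "bash droneB.sh $sample $project $uuid"
    done < "$1"
}

# Demo on a scratch file holding the first two rows of the table.
DEMO_INPUT=$(mktemp)
printf '%s\t%s\t%s\n' TCGA-F4-6461-01A TCGA-COAD 01581d3a-8427-41db-82c2-a7dd880a3937 >  "$DEMO_INPUT"
printf '%s\t%s\t%s\n' TCGA-G4-6315-01A TCGA-COAD 017e4572-bbde-4e44-9de2-22e7d2573603 >> "$DEMO_INPUT"

plan_jobs "$DEMO_INPUT"
```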

## Discussion

These should be the necessary prerequisites to run and debug the analysis in the next experiment.

# Addendum - 20180906

Initial hgr1 alignments of TCGA data are now complete (see TCGA_3_General). Load all of this data from S3 onto an AMI containing the data organized for analysis.

Run initial MACP GVCF analysis.


In [None]:
# Login to EC2 TCGA AMI
mkdir tcga; cd tcga

# Download TCGA data
aws s3 cp --recursive s3://crownproject/tcga/ ./

# Move COAD and LUSC data so folder names are matching
mv tcga-lusc/ TCGA-LUSC
mv tcga-coad-1/ TCGA-COAD


In [None]:
cd ~/software

wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar -xvf bcftools-1.9.tar.bz2

mv bcftools-1.9.tar.bz2 zips/

# Build inside the extracted source directory
cd bcftools-1.9/
./configure --prefix=$PWD
make
make install

cp bin/* ~/bin/

sudo mv /usr/bin/bcftools /usr/bin/bcftools_0.1.19

# Move to home directory and run ADcalc.sh script
cd ~

screen -L

bash scripts/ADcalc.sh #see below

mkdir tcga/vcf; cd tcga

mv *.vcf vcf/
mv ~/screenlog.0 18S_1248_vcf.log

aws s3 cp --recursive ./ s3://crownproject/tcga/vcf/

# Created crown-180906 EC2 AMI: ami-0ed79a4de8e2ca023


In [None]:
#!/bin/bash
# ADcalc.sh
# Allelic Depth Calculator
# for a position

cd ~/tcga/

# Controls -----------------
REGION='chr13:1004900-1004915'
OUTPUT='18S_1248.vcf'
DEPTH='100000'
BAMLIST='bam.list.tmp'

# Iterate through every TCGA Cancer Type
for TYPE in $(ls)
do
    echo Analyzing $TYPE...
    
    cd $TYPE

    ls *.bam > $BAMLIST

    # Iterate through every bam file in directory
    # look-up position and return VCF
        bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
      --max-depth $DEPTH --min-BQ 30 \
      -a FORMAT/DP,AD \
      -r "$REGION" \
      --ignore-RG \
      -b $BAMLIST |
      bcftools annotate -x INFO,FORMAT/PL - |
      bcftools view -O v -H - \
      > ../$TYPE.$OUTPUT
      
    rm $BAMLIST
    
    cd ..
done
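
The files written by `ADcalc.sh` are headerless VCF rows whose per-sample columns carry the `AD` (allelic depth) values requested with `-a FORMAT/DP,AD`. A minimal sketch of how the alternate-allele fraction per sample could be read back out is below. `alt_fraction` is a name invented for this example, the demo row uses made-up depths, and samples are labeled by column order because `bcftools view -H` drops the header that would name them.

```shell
#!/bin/bash
# Sketch: alternate-allele fraction per sample from headerless VCF rows.
# Output columns: chrom, pos, sample index, (non-reference depth / total depth).

alt_fraction() {
    awk -F'\t' '
        {
            # Locate AD within the FORMAT column (column 9)
            n = split($9, fmt, ":"); ad = 0
            for (i = 1; i <= n; i++) if (fmt[i] == "AD") ad = i
            if (ad == 0) next

            # Sample columns start at column 10
            for (s = 10; s <= NF; s++) {
                split($s, val, ":")
                m = split(val[ad], depth, ",")
                total = 0
                for (j = 1; j <= m; j++) total += depth[j]
                if (total > 0)
                    printf "%s\t%s\tsample%d\t%.3f\n", $1, $2, s - 9, (total - depth[1]) / total
            }
        }' "$1"
}

# Demo row with hypothetical depths for two samples.
DEMO_VCF=$(mktemp)
printf 'chr13\t1004908\t.\tA\tG,<*>\t0\t.\t.\tDP:AD\t120:100,20,0\t80:40,40,0\n' > "$DEMO_VCF"

alt_fraction "$DEMO_VCF"
```

The first `AD` value is taken as the reference depth, which matches the VCF convention of listing the reference allele first.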