# hgr1 TCGA Pilot Experiment
```
pi:ababaian
files: ~/Crown/data2/tcga_0
start: 2018 08 14
complete : 2018 08 16
```
## Introduction

Pilot run of the hgr1 alignment script on TCGA RNA-seq data.

## Materials and Methods

### DRAFT Scripts

### queenB.sh

In [None]:
#!/bin/bash
# queenB.sh
# 20180814 build
# EC2 Launch / Control Script
#

# 1. queenB script is initialized locally and input files
#    are parsed ready for cluster analysis
# 2. queenB launches instances, logs in to each one and runs
#    the droneB.sh script remotely.
# 3. The droneB script is executed on the instance; it
#    launches a `screen` session there, then loads and
#    starts the $TASK (gather.sh) script.
# 4. TASK script should include an instance shut-down
#    command to close the instance upon completion.
#

# Control Panel =========================
# EC2 TASK Script - script for droneB to execute
TASK="s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh"

# Parameter file:
# Each line of PARAMETERS is passed as positional arguments
# to the droneB script, which forwards them to the
# TASK script.
# i.e. bash droneB.sh $TASK <line_N_of_PARAMETERS>
# PARAMETERS="tcga0_input.txt"
PARAMETERS=$1

# EC2 Set-up
instanceTYPE='c4.xlarge'
imageID='ami-0031fd61f932bdef9' #AMI TCGA

devNAME='/dev/sda1' # /dev/sda1 for Crown-AMI
volSIZE='200' # in Gb

# Number of instances to launch
#COUNT=2 # predetermined number
COUNT=$(wc -l $PARAMETERS | cut -f 1 -d' ' ) # for each input argument

# Security
keyNAME='CrownKey'
keyPATH="/home/artem/.ssh/CrownKey.pem"
secGROUP='crown-group'

# =======================================

for ITER in $(seq 1 $COUNT)
do

  # Extract Parameters/Arguments ----------

  ARGS=$(sed -n "$ITER"p $PARAMETERS | sed 's/\t/ /g' - )

  echo "Launch instance # $ITER"
  echo "Instance Type: $instanceTYPE"
  echo "AMI Image: $imageID"
  echo "Run Script: $TASK"
  echo "Parameters: $ARGS"

  # Launch an instance --------------------
  # NOTE: --count 1, so each iteration of the for loop
  # launches exactly one instance
  aws ec2 run-instances --image-id $imageID --count 1 \
   --instance-type $instanceTYPE --key-name $keyNAME \
   --block-device-mappings DeviceName=$devNAME,Ebs={VolumeSize=$volSIZE} \
   --security-groups $secGROUP > launch.tmp

  # Another alternative is to use --user-data droneB.sh 
  # which will run at instance boot-up
  # passing arguments to it may be challenging

  # Retrieve instance ID
  instanceID=$(cat launch.tmp | \
    egrep -o -e 'InstanceId[":/A-Za-z0-9_ \\-]*' - |\
    cut -f2 -d' ' - | xargs)

  echo "Instance ID: $instanceID"


  # Add a few minute wait here to allow for Public DNS to be assigned
  # otherwise ssh doesn't work
  sleep 180s

  # Retrieve public DNS
  aws ec2 describe-instances --instance-ids $instanceID > launch2.tmp

  pubDNS=$(cat launch2.tmp | \
    egrep -o -m 1 -e 'PublicDnsName[.":/A-Za-z0-9_ \\-]*' - |\
    cut -f2 -d' ' - | xargs)

  echo "Public DNS: $pubDNS"

  # Access the instance -------------------

  LOGIN="ubuntu@$pubDNS" 

  ssh -i $keyPATH \
    -o StrictHostKeyChecking=no \
    $LOGIN 'bash -s' < droneB.sh $TASK $(echo $ARGS)

  # Cleanup
  rm *.tmp

  echo ''
  echo ''

done

# end of script
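
The `egrep` parsing of `launch.tmp` and the fixed `sleep 180s` could be replaced by the AWS CLI's own `--query` / `--output text` options and the `aws ec2 wait instance-running` waiter. This is a minimal sketch, not the code used in the pilot; the function names `launch_instance` and `get_public_dns` are hypothetical, and the variables are assumed to be set as in the control panel above.

```bash
#!/bin/bash
# Sketch only: helper functions, not the code used in the pilot.
# Assumes imageID, instanceTYPE, keyNAME, devNAME, volSIZE and secGROUP
# are set as in the queenB.sh control panel.

# Launch one instance and print its instance ID.
launch_instance() {
  aws ec2 run-instances --image-id "$imageID" --count 1 \
    --instance-type "$instanceTYPE" --key-name "$keyNAME" \
    --block-device-mappings "DeviceName=$devNAME,Ebs={VolumeSize=$volSIZE}" \
    --security-groups "$secGROUP" \
    --query 'Instances[0].InstanceId' --output text
}

# Block until the instance is running, then print its public DNS name.
get_public_dns() {
  aws ec2 wait instance-running --instance-ids "$1" || return 1
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicDnsName' --output text
}

# Usage inside the launch loop:
#   instanceID=$(launch_instance)
#   pubDNS=$(get_public_dns "$instanceID")
```

Note that a running instance may still need a short moment before sshd accepts connections, so a brief retry around the `ssh` call may still be needed.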

### droneB.sh

In [None]:
#!/bin/bash
# droneB.sh
#

# This script-layer is necessary to launch a screen session
# on each ec2-machine. The pipeline is run within that session
# and the output is logged. This allows 'looking in' on sessions
# as they are running.

# Commands to run on server-side
# ===============================================================

SCRIPTPATH=$1

SCRIPT=$(basename $1)

shift # drop first (TASK or SCRIPT variable)

# Download pipeline / droneB's function
  aws s3 cp $SCRIPTPATH ./

  chmod 777 *.sh

# Open a detached screen session and run the TASK script in it (-L: log output)
  screen -Ldmt sh ~/$SCRIPT $@


# ===============================================================
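
droneB.sh takes the TASK path as its first argument and forwards everything after it to the TASK script. The sketch below isolates that argument handling so it can be checked without S3 or `screen`; the function name `split_task_args` is hypothetical. Quoting `"$@"` keeps each parameter as one argument even if it ever contains spaces.

```bash
#!/bin/bash
# Sketch only: mirrors the droneB.sh argument handling, nothing else.

# Print the script basename on the first line, then one forwarded
# argument per line. The ( ) body keeps the variables local.
split_task_args() (
  script=$(basename "$1")
  shift                      # drop the TASK path, keep the parameters
  echo "$script"
  for arg in "$@"; do
    echo "$arg"
  done
)

split_task_args s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh \
  TCGA-F4-6461-01A TCGA-COAD 01581d3a-8427-41db-82c2-a7dd880a3937
```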


### hgr1_align_v2.tcga.0.sh

In [1]:
#!/bin/bash
# hgr1_align_v2.tcga.0.sh
# rDNA alignment pipeline
# 180813 build -- TCGA
# AMI: crown-180813 - ami-0031fd61f932bdef9
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 300 Gb
#

# Input Requirements --------------------------

# $1 : Library name and file-output name
# $2 : Library population/analysis set
# $3 : Library UUID

# Control Panel -------------------------------
# CPU
	THREADS='3'

# Sequencing Data
	LIBRARY=$1 # Library/ File name

# TCGA FILE UUID
  UUID=$3

 # FastQ File-names
    FQ0="$LIBRARY.tmp.sort.0.fq"
    FQ1="$LIBRARY.tmp.sort.1.fq"
    FQ2="$LIBRARY.tmp.sort.2.fq"
    
# Read Group Data
# Extract from downloaded BAM file / input
	RGPO=$2 # Patient Population

	#RGSM= # Sample. Patient Identifier
	#RGID= # Read Group ID. Accession Number
    
	RGLB=$LIBRARY # Library Name. Accession Number
	RGPL='ILLUMINA'  # Sequencing Platform.
    
	# Extract Sequencing Run Info
	#  RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d':' | cut -f2 -d' ')

# Initialize workdir --------------------------

# Make working directory
  mkdir -p align; cd align

# Copy hgrX genome and create bowtie2 index
  cp ~/resources/hgr1/* ./
  
# Download RNAseq BAM file
# with a UUID as input
  gdc-client download -t ~/resources/gdc.token -d ./ \
  -n $THREADS\
  $FILE_UUID
  
# Link the RNA-seq bamfile, which is named by its UUID, into the workdir
  ln -s */*.bam input.bam
  
# Extract ReadGroup Sample Name (SM)
  RGSM=$(samtools view -H input.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq )

# Extract ReadGroup identifier (ID)
  RGID=$(samtools view -H input.bam | grep '^@RG' | sed "s/.*ID:\([^\t]*\).*/\1/g" | uniq )

# Convert input bam file to fastq files for re-alignment
samtools sort -@ $THREADS -n input.bam | \
    samtools fastq -@ $THREADS -O \
    -0 $FQ0 \
    -1 $FQ1 \
    -2 $FQ2 -

# SINGLE END READS ====================================================

if [ -s $FQ0 ]
then
    # Single-End Extracted Reads Alignment

    # Extract Sequencing Run Info
    #RGPU=$(gzip -dc $FQ0| head -n1 - | cut -f1 -d':' | cut -f2 -d' ')
    RGPU=$(head -n1 $FQ0 | cut -f1 -d':' | cut -f2 -d' ')

    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS \
      --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -U $FQ0 | \
      samtools view -bS - > aligned_unsorted.bam
     
    # Calculate library flagstats
      samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat
      rm $FQ0 # Remove fastq files to save space

    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Extract Mapped Reads
      samtools view -b -F 4 aligned_unsorted.bam | \
      samtools sort > align.hgr1.bam #mapped
      
    # Calculate library flagstats
    samtools index align.hgr1.bam
    samtools flagstat align.hgr1.bam > align.hgr1.flagstat

    # Rename the total Bam Files
      #mv aligned_unsorted.bam $LIBRARY.se.bam
      #mv aligned_unsorted.bam.bai $LIBRARY.se.bam.bai
      mv aligned_unsorted.flagstat $LIBRARY.se.flagstat

    # Rename the hgr-aligned Bam files
      mv align.hgr1.bam $LIBRARY.hgr1.se.bam
      mv align.hgr1.bam.bai $LIBRARY.hgr1.se.bam.bai
      mv align.hgr1.flagstat $LIBRARY.hgr1.se.flagstat
      
    # Alignments (Full)
    aws s3 cp $LIBRARY.se.flagstat s3://crownproject/tcga/

    # Alignments (Aligned)
    aws s3 cp $LIBRARY.hgr1.se.bam s3://crownproject/tcga/
    aws s3 cp $LIBRARY.hgr1.se.bam.bai s3://crownproject/tcga/
    aws s3 cp $LIBRARY.hgr1.se.flagstat s3://crownproject/tcga/

fi

# PAIRED END READS ====================================================


if [ -s $FQ1 ]
then
    # Paired-End Extracted Reads Alignment

    # Extract Sequencing Run Info
    # (the fastq files are uncompressed, so read the header line directly)
    RGPU=$(head -n1 $FQ1 | cut -f1 -d':' | cut -f2 -d' ')
    
    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS \
      --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -1 $FQ1 -2 $FQ2 | \
      samtools view -bS - > aligned_unsorted.bam
      
    # Calculate library flagstats
      samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat
      
      rm $FQ1 $FQ2 # Remove fastq files to save space

      
    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Extract mapped reads (-F 4) and
      # unmapped reads whose mate is mapped (-f 4 -F 8)
      samtools view -b -F 4 aligned_unsorted.bam > align.F4.bam #mapped
      samtools view -b -f 4 -F 8 aligned_unsorted.bam > align.f4F8.bam #unmapped pairs

      # Extract just the 45S unit
      #aws s3 cp s3://crownproject/resources/rDNA_45s.bed ./
      #samtools view -b -L rDNA_45s.bed align.F4.bam > align.F4.45s.bam

      # What are the mapped readnames
      samtools view align.F4.bam | cut -f1 - > read.names.tmp

      # Extract mapped reads, including read pairs mapped on the
      # edge of the region of interest
      # -------|======= R O I ======| ----------
      # read:                  ====---====
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

      # Complete mapped reads list
      #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

      # Extract unmapped reads with a mapped pair
      samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

      # Re-compile bam file
      cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr1.tmp.bam
        samtools sort align.hgr1.tmp.bam align.hgr1
        samtools index align.hgr1.bam
        samtools flagstat align.hgr1.bam > align.hgr1.flagstat

      # Clean up 
      rm *tmp* align.F4.bam align.f4F8.bam

    # Rename the total Bam Files
      mv aligned_unsorted.bam $LIBRARY.bam
      mv aligned_unsorted.bam.bai $LIBRARY.bam.bai
      mv aligned_unsorted.flagstat $LIBRARY.flagstat

    # Rename the hgr-aligned Bam files
      mv align.hgr1.bam $LIBRARY.hgr1.bam
      mv align.hgr1.bam.bai $LIBRARY.hgr1.bam.bai
      mv align.hgr1.flagstat $LIBRARY.hgr1.flagstat
    
fi

rm -f $FQ0 $FQ1 $FQ2 # Remove any remaining fastq files to save space
  
# Alignments (Full)
 aws s3 cp $LIBRARY.se.flagstat s3://crownproject/tcga/

# Alignments (Aligned)
  aws s3 cp $LIBRARY.hgr1.bam s3://crownproject/tcga/
  aws s3 cp $LIBRARY.hgr1.bam.bai s3://crownproject/tcga/
  aws s3 cp $LIBRARY.hgr1.flagstat s3://crownproject/tcga/

# VCF
 aws s3 cp $LIBRARY.hgr1.vcf s3://crownproject/tcga/
 aws s3 cp $LIBRARY.hgr1.vcf.idx s3://crownproject/tcga/
 
# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
aws ec2 terminate-instances --instance-ids $EC2ID

# Script complete




### tcga0_input.tsv

In [None]:
TCGA-F4-6461-01A	TCGA-COAD	01581d3a-8427-41db-82c2-a7dd880a3937
TCGA-G4-6315-01A	TCGA-COAD	017e4572-bbde-4e44-9de2-22e7d2573603
TCGA-AA-3529-01A	TCGA-COAD	01871314-b195-4142-8149-1cff7ea8c3b4
TCGA-AZ-6599-01A	TCGA-COAD	0188bc58-2deb-4978-98c3-3028d6fa61f8
TCGA-QG-A5Z1-01A	TCGA-COAD	019ae823-acfc-49a6-a144-3b363b15a0dc
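
Each line carries the three fields named in the align script's input requirements: library name, analysis set, and file UUID. A malformed line costs a full instance launch, so it may be worth checking lines locally first. This is a minimal sketch; `valid_param_line` is a hypothetical helper, and the lowercase 8-4-4-4-12 hex pattern is an assumption based on the UUIDs listed here.

```bash
#!/bin/bash
# Sketch only: check one parameter line before launching an instance.
# Accepts tab- or space-delimited fields: library name, set, file UUID.
valid_param_line() {
  # word splitting on the unquoted $1 handles both tabs and spaces
  set -- $1
  [ "$#" -eq 3 ] || return 1
  printf '%s\n' "$3" | grep -Eq \
    '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

line=$(printf 'TCGA-F4-6461-01A\tTCGA-COAD\t01581d3a-8427-41db-82c2-a7dd880a3937')
if valid_param_line "$line"; then
  echo "ok"
else
  echo "malformed line"
fi
```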

## Pilot Run 1 - 180814 1448


In [1]:
# Copy tcga script to S3
cd /home/artem/Desktop/Crown/data2/tcga_0
aws s3 cp hgr1_align_v2.tcga.0.sh s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh

Completed 7.0 KiB/7.0 KiB with 1 file(s) remaining
upload: ./hgr1_align_v2.tcga.0.sh to s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh


In [3]:
bash queenB.sh tcga0_input.tsv

Launch instance # 1
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Parameters: TCGA-F4-6461-01A TCGA-COAD 01581d3a-8427-41db-82c2-a7dd880a3937
Instance ID: i-0559a94aff30c5a00
Public DNS: ec2-54-212-59-228.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh to ./hgr1_align_v2.tcga.0.sh


Launch instance # 2
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Parameters: TCGA-G4-6315-01A TCGA-COAD 017e4572-bbde-4e44-9de2-22e7d2573603
Instance ID: i-0668ef286d8c2af74



In [4]:
# Stop with interrupt
# Didn't work; the instances shut down prematurely. I'll re-test by turning off self-close of instances
# (Line 223/224 of align script commented out)
aws s3 cp hgr1_align_v2.tcga.0.sh s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh

Completed 6.8 KiB/6.8 KiB with 1 file(s) remaining
upload: ./hgr1_align_v2.tcga.0.sh to s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh


`bash queenB.sh tcga0_input0.tsv # two files only`
 
Note: ran in separate window.

--- Bugs ---
* Only the first line of input ran; the second parameter line did not. Cause: EOL char. Changed to an awk command -- DONE
* Found bug on lines 55-57 of the tcga align script: gdc-client uses FILE_UUID, but the parameter is UUID -- DONE
* Input file must be SPACE delimited, not TAB delimited (changed) -- DONE
* screenlog: `/home/ubuntu/hgr1_align_v2.tcga.0.sh: line 55: gdc-client: command not found`. Changed the command to an explicit call, `~/bin/gdc-client`. The samtools version being called is also the older one, which suggests the most recent updates to the AMI are not taking effect. -- DONE (but sloppy)

--- TODO ---
* Copy the generated screenlog.0 file to the output tcga folder, so that each instance's run can be debugged and reviewed later to confirm all commands ran successfully -- DONE
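
The end-of-line and delimiter bugs both come from indexing the parameter file by line number with `wc -l` and `sed`. The notes do not record the exact cause, but one plausible trigger is a missing final newline, since `wc -l` counts newline characters. A `while read` loop avoids both problems; this is a minimal sketch with a hypothetical function name, not the awk fix that was actually applied.

```bash
#!/bin/bash
# Sketch only: iterate over a parameter file without wc/sed line indexing.
# `read` splits on tabs and spaces alike; the `|| [ -n "$lib" ]` clause
# keeps a final line that lacks a trailing newline.
each_param_line() {
  while read -r lib pop uuid || [ -n "$lib" ]; do
    [ -n "$lib" ] || continue          # skip blank lines
    echo "$lib $pop $uuid"
  done < "$1"
}

# Demonstration: two tab-delimited lines, no newline after the second.
demo=$(mktemp)
printf 'LIB1\tPOP\tuuid-1\nLIB2\tPOP\tuuid-2' > "$demo"
each_param_line "$demo"
rm -f "$demo"
```

In queenB.sh the `echo` would be replaced by the launch steps. The `ssh` call there already takes its stdin from droneB.sh, so it would not consume the rest of the parameter file.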


In [2]:
# Pilot 2 with above bugfixes
cd ~/Crown/data2/tcga0/
aws s3 cp hgr1_align_v2.tcga.0.sh s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
bash queenB.sh tcga0_input0.tsv

bash: cd: /home/artem/Crown/data2/tcga0/: No such file or directory
Completed 7.0 KiB/7.0 KiB with 1 file(s) remaining
upload: ./hgr1_align_v2.tcga.0.sh to s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Launch instance # 1
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Parameters: TCGA-F4-6461-01A TCGA-COAD 01581d3a-8427-41db-82c2-a7dd880a3937
Instance ID: i-09dbea3ca5362b27c
Public DNS: ec2-34-219-137-196.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh to ./hgr1_align_v2.tcga.0.sh


Launch instance # 2
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Parameters: TCGA-G4-6315-01A TCGA-COAD 017e4572-bbde-4e44-9de2-22e7d2573603
Instance ID: i-0255b652d1082e594
Public DNS: ec2-34-209-170-244.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh 

From TCGA.screenlog the problem is with a `samtools sort` command:

```
{running correctly}...
...
1.04% overall alignment rate
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
Usage: samtools sort [options...] [in.bam]
...
{errors}
```

Lines 101-103 of `hgr1_align_v2.tcga.0.sh` script updated to use samtools 1.9 command syntax
```
      # Extract Mapped Reads
      ~/bin/samtools view -b -F 4 aligned_unsorted.bam | \
      ~/bin/samtools sort -@ $THREADS -O BAM - > align.hgr1.bam #mapped
```

Since iteration over the input file now works, I'm re-running the updated script on a single bam file only, for testing.
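
Both instance-side failures so far (`gdc-client` not found, and an unexpected samtools on the PATH) only showed up in the screenlog after launch. A preflight check at the top of the TASK script would surface them immediately. This is a minimal sketch; `require_tools` is a hypothetical helper.

```bash
#!/bin/bash
# Sketch only: fail early if a required tool is missing from PATH.
# Prints the name of each missing tool to stderr and returns non-zero
# if any is absent. The ( ) body keeps the variables local.
require_tools() (
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
)

# Usage at the top of the TASK script:
#   require_tools gdc-client samtools bowtie2 aws || exit 1
#   samtools --version | head -n 1   # samtools 1.x: record version in screenlog
```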

In [3]:
# Pilot 2 with above bugfixes
cd ~/Crown/data2/tcga0/
echo ''; cat tcga0_input0.tsv; echo ''
aws s3 cp hgr1_align_v2.tcga.0.sh s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
bash queenB.sh tcga0_input0.tsv

bash: cd: /home/artem/Crown/data2/tcga0/: No such file or directory

TCGA-F4-6461-01A TCGA-COAD 01581d3a-8427-41db-82c2-a7dd880a3937

Completed 7.1 KiB/7.1 KiB with 1 file(s) remaining
upload: ./hgr1_align_v2.tcga.0.sh to s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Launch instance # 1
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Parameters: TCGA-F4-6461-01A TCGA-COAD 01581d3a-8427-41db-82c2-a7dd880a3937
Instance ID: i-0c7296d75f8dc68eb
Public DNS: ec2-34-221-200-179.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh to ./hgr1_align_v2.tcga.0.sh




This version now worked and there are bam files on S3 : )

Trying with remaining 4 libraries.

Errors in screenlog
```
mv: cannot stat 'aligned_unsorted.bam.bai': No such file or directory

The user-provided path TCGA-F4-6461-01A.se.flagstat does not exist.
```
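
Both errors come from steps that assume a file exists: the script never builds an index for `aligned_unsorted.bam`, and the `.se.flagstat` file is only written in the single-end branch. Filtering the file list before `mv` or `aws s3 cp` would keep the screenlog clean. This is a minimal sketch; `existing_files` is a hypothetical helper.

```bash
#!/bin/bash
# Sketch only: print the arguments that name existing, non-empty files,
# one per line, so later steps only touch outputs that were produced.
existing_files() (
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "$f"
    fi
  done
)

# Usage in the TASK script (requires the aws CLI):
#   existing_files "$LIBRARY.se.flagstat" "$LIBRARY.hgr1.bam" |
#     while read -r f; do aws s3 cp "$f" s3://crownproject/tcga/; done
```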


In [5]:
# Pilot 3 with above bugfixes
cd ~/Crown/data2/tcga0/
echo ' '; cat tcga0_input.tsv; echo ' '

aws s3 cp hgr1_align_v2.tcga.0.sh s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
bash queenB.sh tcga0_input.tsv

bash: cd: /home/artem/Crown/data2/tcga0/: No such file or directory
 
TCGA-G4-6315-01A	TCGA-COAD	017e4572-bbde-4e44-9de2-22e7d2573603
TCGA-AA-3529-01A	TCGA-COAD	01871314-b195-4142-8149-1cff7ea8c3b4
TCGA-AZ-6599-01A	TCGA-COAD	0188bc58-2deb-4978-98c3-3028d6fa61f8
TCGA-QG-A5Z1-01A	TCGA-COAD	019ae823-acfc-49a6-a144-3b363b15a0dc 
Completed 7.1 KiB/7.1 KiB with 1 file(s) remaining
upload: ./hgr1_align_v2.tcga.0.sh to s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Launch instance # 1
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh
Parameters: TCGA-G4-6315-01A TCGA-COAD 017e4572-bbde-4e44-9de2-22e7d2573603
Instance ID: i-0e27041d09664f9af
Public DNS: ec2-34-219-143-181.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/hgr1_align_v2.tcga.0.sh to ./hgr1_align_v2.tcga.0.sh


Launch instance # 2
Instance Type: c4.xlarge
AMI Image: ami-0031fd61f932bdef9
Run Script: s3://crownproject/scri

## Results

Alignment worked successfully : ) This should be adequate for piping through larger amounts of RNA-seq data, especially the matched cancer-normal controls.

Note: The single-end RNA-seq libraries (se) are quite error-prone over the 18S/28S regions since their reads are shorter and older. They should not be used for variant calling of 18S/28S/5.8S.
The converse is true for 5S rRNA: this smaller, lower-GC-content RNA is often lost in the larger-fragment-size paired-end libraries, but it is sequenced very well on both strands in the se libraries. These rRNAs should be treated separately based on which type of library each sample was prepared from.
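
The split described in this note can be keyed off the output file names, since the align script writes `$LIBRARY.hgr1.se.bam` for single-end and `$LIBRARY.hgr1.bam` for paired-end libraries. This is a minimal sketch with hypothetical function names and made-up library names; the region lists are an assumption drawn from the note, not a tested rule.

```bash
#!/bin/bash
# Sketch only: classify an output BAM by the naming used in the TASK script
# ($LIBRARY.hgr1.se.bam = single-end, $LIBRARY.hgr1.bam = paired-end).
library_type() {
  case "$1" in
    *.hgr1.se.bam) echo "se" ;;
    *.hgr1.bam)    echo "pe" ;;
    *)             echo "unknown" ;;
  esac
}

# rRNA regions to consider for variant calling, following the note above.
# This mapping is an assumption drawn from that note.
rrna_targets() {
  case "$(library_type "$1")" in
    se) echo "5S" ;;
    pe) echo "18S 5.8S 28S" ;;
    *)  return 1 ;;
  esac
}

rrna_targets LIB_A.hgr1.se.bam
rrna_targets LIB_B.hgr1.bam
```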

![18S Alignment Output](../../data2/tcga_0/align/tcga0_18S.png)

![5S Alignment Output](../../data2/tcga_0/align/tcga0_5S.png)

Good to go!


Archived these libraries to: s3://crownproject/tcga-coad0/

```
aws s3 mv --recursive --include "*hgr1*" s3://crownproject/tcga/ s3://crownproject/tcga/tcga-coad0/
```
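
One caveat for re-use: in the AWS CLI all files are included by default, so `--include` on its own does not narrow the set; a filter normally pairs `--exclude "*"` with `--include`. The sketch below wraps that pattern and allows a `--dryrun` preview first. `archive_hgr1` is a hypothetical helper, and the paths in the usage lines are taken from the command above.

```bash
#!/bin/bash
# Sketch only: move just the hgr1 outputs, optionally previewing first.
# --exclude "*" drops everything, then --include re-adds the hgr1 files.
# Extra options after the two paths (e.g. --dryrun) are passed through.
archive_hgr1() (
  src=$1; dst=$2; shift 2
  aws s3 mv --recursive --exclude "*" --include "*hgr1*" "$@" "$src" "$dst"
)

# Usage (requires the aws CLI and credentials):
#   archive_hgr1 s3://crownproject/tcga/ s3://crownproject/tcga/tcga-coad0/ --dryrun
#   archive_hgr1 s3://crownproject/tcga/ s3://crownproject/tcga/tcga-coad0/
```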

