# 1kg hgr0 Alignment Pilot
```
pi:ababaian
files: ~/Crown/data/1kg_hgr0
start: 2017 02 20
complete : 2017 02 22
```
## Introduction

I've identified that there is bona fide intra- and inter-personal variation in rDNA. All my hours of analysis have essentially focused on RNA45S and its promoter; there is LOTS of analysis to be done with it and it's the low-hanging fruit. The remainder of the rDNA repeat is more difficult to study but I will get back to it (hopefully with help).

To measure how much population-level variation exists, I want to quickly assay these 'interesting' regions across many samples, so I've narrowed down a reference sequence of interest to something I'm calling [hgr0](./20170213_hgr0_reference_rDNA.ipynb). It contains 5S in the context of a full 5S rDNA repeat (without simple-repeat sequences), plus RNA45S and its immediate upstream promoter (1.1 kb).

The alignments and variant calling should run very quickly, so I'll be able to hit my target of 100 human genomes and analyze them.



## Objective

* Run the `1kg_align_v0.sh` pipeline on NA19240 and her parents.
* Informally compare sequence alignments in the Trio between `hgr` and `hgr0`.

## Materials and Methods

Alignments will be run on EC2 using a modified `1kgpilot.sh` script called `1kg_align_v0.sh`, as in 'pipeline' version 0.

This should be much faster than `1kgpilot.sh` since a lot of the simple-repeat sequences have been taken out and the focus is on RNA45S and 5S themselves as opposed to the rest of the rDNA repeat.

### NA19240 Trio Files


| Sample | Population | Sex | Notes       | Fastq | Size (KB) |
|--------|:----------:|:---:|-------------|-------|-----------|
|NA19240|YRI|F|WGS, PCR-free|  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894723/ERR894723_1.fastq.gz|24068394|
|NA19240|YRI|F|WGS, PCR-free| ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894723/ERR894723_2.fastq.gz|24313932|
|NA19238|YRI|F|WGS, PCR-free| ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899706/ERR899706_1.fastq.gz|5423970|
|NA19238|YRI|F|WGS, PCR-free| ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899706/ERR899706_2.fastq.gz|5554683|
|NA19239|YRI|M|WGS, PCR-free| ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899707/ERR899707_1.fastq.gz|8052464|
|NA19239|YRI|M|WGS, PCR-free| ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899707/ERR899707_2.fastq.gz|8192156|
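
The FASTQ URLs in the table all follow the same pattern: the first six characters of the run accession, then the full accession, then `<accession>_<mate>.fastq.gz`. A minimal sketch of building them from the accession alone is below. The function name `ena_fastq_url` is hypothetical, and the pattern is only inferred from the nine-character accessions listed here; other accession lengths may be laid out differently on the server.

```bash
# Build a FASTQ URL from a run accession and mate number (1 or 2).
# Assumption: nine-character accessions such as ERR894723, as in the table above.
ena_fastq_url() {
  acc=$1                             # run accession, e.g. ERR894723
  mate=$2                            # 1 or 2
  prefix=$(printf '%.6s' "$acc")     # first six characters, e.g. ERR894
  printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s/%s_%s.fastq.gz\n' \
    "$prefix" "$acc" "$acc" "$mate"
}

# Example: both mates for the NA19240 run
ena_fastq_url ERR894723 1
ena_fastq_url ERR894723 2
```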


### 1kg_align_v0_a.sh

In [None]:
#!/bin/bash
# 1kg_align_v0_a.sh script
# AMI: crown-170220 - ami-66129306
# EC2: c4.2xlarge (8cpu / 15 gb)
# Storage: 100-500 Gb
# Start: 
# Alignment done: 
# Align.subset done: 
# End:

# Control Panel -------------------------------
# CPU
	THREADS='7'
	AWSID='#########'

# Sequencing Data
	LIBRARY=$1 # Library/ File name
	FASTQ1=$5
	FASTQ2=$6

    # File-names
    FQ1=$(basename $FASTQ1)
    FQ2=$(basename $FASTQ2)

# Read Group Data
	RGSM=$2   # Sample. Patient Identifier
	RGID=$3 # Read Group ID. Accession Number
	RGLB=$LIBRARY # Library Name. Accession Number
	RGPL='ILLUMINA'  # Sequencing Platform.
	RGPO=$4 # Patient Population
	# Extract Sequencing Run Info
	#  RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d':' | cut -f2 -d' ')


# Initialize workdir ---------------------------

# Make working directory
  mkdir -p align; cd align

# Copy hgr0 genome and create bowtie2 index
  cp ~/resources/hgr0.fa ./
  cp ~/resources/hgr0.fa.fai ./
  
  bowtie2-build hgr0.fa hgr0
  
# Download Genome Sequencing Data
  wget $FASTQ1
  wget $FASTQ2

    # Extract Sequencing Run Info
    RGPU=$(gzip -dc $FQ1| head -n1 - | cut -f1 -d':' | cut -f2 -d' ')

# Primary Alignment -------------------------

# Bowtie2: align to genome

bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM --rg PL:$RGPL --rg PU:$RGPU -x hgr0 -1 $FQ1 -2 $FQ2 | samtools view -bS - > aligned_unsorted.bam

rm $FQ1 $FQ2 # Remove fastq files to save space

# Sort alignment file
  samtools sort -@ $THREADS aligned_unsorted.bam aligned
  samtools index aligned.bam
  rm aligned_unsorted.bam

# Calculate library flagstats
  samtools flagstat aligned.bam > aligned.flagstat
  
# Rename the final Bam Files
  mv aligned.bam $LIBRARY.bam
  mv aligned.bam.bai $LIBRARY.bam.bai
  mv aligned.flagstat $LIBRARY.flagstat

# GATK variant calling over 45S region
  aws s3 cp s3://crownproject/resources/hgr0.gatk.fa ./
  aws s3 cp s3://crownproject/resources/hgr0.gatk.fa.fai ./
  aws s3 cp s3://crownproject/resources/hgr0.gatk.dict ./
  
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr0.gatk.fa -T HaplotypeCaller \
  -ploidy 2 --max_alternate_alleles 6 \
  -I $LIBRARY.bam -o $LIBRARY.rDNA_p2.vcf
   # Memory issues, restrict to 45S region only
     # -ploidy 100, 50, 20 failed... do 2 and analyze 45S further
     
# Upload final output files to S3
 
# Alignments
 aws s3 cp $LIBRARY.bam s3://crownproject/1kg_hgr0/
 aws s3 cp $LIBRARY.bam.bai s3://crownproject/1kg_hgr0/
 aws s3 cp $LIBRARY.flagstat s3://crownproject/1kg_hgr0/
 
# VCF
 aws s3 cp $LIBRARY.rDNA_p2.vcf s3://crownproject/1kg_hgr0/
 aws s3 cp $LIBRARY.rDNA_p2.vcf.idx s3://crownproject/1kg_hgr0/
 
 
# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
aws ec2 terminate-instances --instance-ids $EC2ID

# Script complete
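
The `RGPU` line in the script takes the first FASTQ header, keeps everything before the first `:`, and then takes the second space-separated token. A small sketch of what that does is below, using a made-up header (the real headers in these files may look different). `extract_rgpu` is a hypothetical wrapper around the same `cut` calls.

```bash
# Same field extraction as the RGPU line in the script, reading from stdin.
extract_rgpu() {
  head -n1 | cut -f1 -d':' | cut -f2 -d' '
}

# Hypothetical header: "@<read name> <instrument>:<lane>:..."
printf '%s\n' '@ERR000001.1 INSTR01:1:FLOWCELL:1:1101:1000:2000/1' | extract_rgpu

# Caveat: with no space before the first ':', cut passes the line through
# unchanged, so RGPU would keep the leading '@'.
printf '%s\n' '@INSTR01:1:FLOWCELL:1:1101:1000:2000/1' | extract_rgpu
```

If the headers turn out not to have a space-separated read name, the second `cut` is worth revisiting before trusting the `PU` tag.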

### EC2 Runs

In [None]:
# AMI crown-170220
# 200 Gb of storage on C4.2xlarge
# On each instance run the commands below:

# Common Command
screen
aws s3 cp s3://crownproject/scripts/1kg_align_v0.sh ./

# ec2-52-39-30-14.us-west-2.compute.amazonaws.com
# DOES NOT AUTO-SHUTDOWN
sh 1kg_align_v0.sh NA19240_pcr NA19240 ERR894723 YRI ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894723/ERR894723_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR894/ERR894723/ERR894723_2.fastq.gz

# ec2-52-36-62-251.us-west-2.compute.amazonaws.com
sh 1kg_align_v0.sh NA19238_pcr NA19238 ERR899706 YRI ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899706/ERR899706_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899706/ERR899706_2.fastq.gz

# ec2-35-164-38-177.us-west-2.compute.amazonaws.com
sh 1kg_align_v0.sh NA19239_pcr NA19239 ERR899707 YRI ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899707/ERR899707_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899707/ERR899707_2.fastq.gz

# Output moved to s3://crownproject/1kg_hgr0/na19240_trio/

### 1kg_align_v0_b.sh

* Alignment worked; files are large though (they include unmapped reads)
* VCF didn't work: GATK failed with `hgr.gatk.fa.fai does not exist`. The missing index has since been UPLOADED.
* It may be more efficient to re-work the pipeline to feed 'mapped reads' and 'unmapped reads' into two distinct files at the alignment step.
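
As a rough illustration of the last bullet, bowtie2 can do part of the split at the alignment step: `--no-unal` suppresses SAM records for reads that fail to align, and `--un-conc-gz` writes pairs that fail to align concordantly to gzipped FASTQ files. The sketch below only prints such a command and does not run it. The function name and output file names are hypothetical, the read-group options are left out for brevity, and this is not the approach the script in the next cell takes (that one filters after alignment). Note that `--un-conc-gz` also catches discordantly aligned pairs, so it is not a clean 'unmapped only' file.

```bash
# Dry run: print an alignment command that separates aligned reads from
# pairs that did not align concordantly. Nothing is executed.
print_split_align_cmd() {
  library=$1; fq1=$2; fq2=$3; threads=${4:-7}
  printf 'bowtie2 --very-sensitive-local -p %s -x hgr0 -1 %s -2 %s ' \
    "$threads" "$fq1" "$fq2"
  printf '%s %s.unconc.fq.gz ' '--no-unal --un-conc-gz' "$library"
  printf '| samtools view -bS - > %s.mapped.bam\n' "$library"
}

print_split_align_cmd NA19240_pcr ERR894723_1.fastq.gz ERR894723_2.fastq.gz
```

How `--no-unal` treats a pair where only one mate aligns should be checked against the bowtie2 manual for the installed version before relying on this.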

In [None]:
#!/bin/bash
# 1kg_align_v0_b.sh script
# AMI: crown-170220 - ami-66129306
# EC2: c4.2xlarge (8cpu / 15 gb)
# Storage: 100-500 Gb
# Start: 
# Alignment done: 
# Align.subset done: 
# End:

# Control Panel -------------------------------
# CPU
	THREADS='7'
	AWSID='#########'

# Sequencing Data
	LIBRARY=$1 # Library/ File name
	FASTQ1=$5
	FASTQ2=$6

    # File-names
    FQ1=$(basename $FASTQ1)
    FQ2=$(basename $FASTQ2)

# Read Group Data
	RGSM=$2   # Sample. Patient Identifier
	RGID=$3 # Read Group ID. Accession Number
	RGLB=$LIBRARY # Library Name. Accession Number
	RGPL='ILLUMINA'  # Sequencing Platform.
	RGPO=$4 # Patient Population
	# Extract Sequencing Run Info
	#  RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d':' | cut -f2 -d' ')


# Initialize workdir ---------------------------

# Make working directory
  mkdir -p align; cd align

# Copy hgr0 genome and create bowtie2 index
  cp ~/resources/hgr0.fa ./
  cp ~/resources/hgr0.fa.fai ./
  
  bowtie2-build hgr0.fa hgr0
  
# Download Genome Sequencing Data
  wget $FASTQ1
  wget $FASTQ2

    # Extract Sequencing Run Info
    RGPU=$(gzip -dc $FQ1| head -n1 - | cut -f1 -d':' | cut -f2 -d' ')

# Primary Alignment -------------------------

# Bowtie2: align to genome

bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM \
--rg PL:$RGPL --rg PU:$RGPU -x hgr0 -1 $FQ1 -2 $FQ2 | samtools view -bS - > aligned_unsorted.bam

rm $FQ1 $FQ2 # Remove fastq files to save space

# Sort alignment file
  samtools sort -@ $THREADS aligned_unsorted.bam aligned
  samtools index aligned.bam
  rm aligned_unsorted.bam

# Calculate library flagstats
  samtools flagstat aligned.bam > aligned.flagstat
  
# Read Subset ------------------------------
# Extract mapped reads, and their unmapped pairs

  # Extract Header
  samtools view -H aligned.bam > align.header.tmp

  # Split reads by SAM flag:
  #   -F 4       read is mapped
  #   -f 4 -F 8  read is unmapped but its mate is mapped
  samtools view -b -F 4 aligned.bam > align.F4.bam #mapped
  samtools view -b -f 4 -F 8 aligned.bam > align.f4F8.bam #unmapped pairs
  
  # Extract just the 45S unit
  #aws s3 cp s3://crownproject/resources/rDNA_45s.bed ./
  #samtools view -b -L rDNA_45s.bed align.F4.bam > align.F4.45s.bam
  
  # What are the mapped readnames
  samtools view align.F4.bam | cut -f1 - > read.names.tmp

  # Extract mapped reads by name. This also keeps read pairs mapped on the
  # edge of the region of interest:
  # -------|======= R O I ======| ----------
  # read:                  ====---====
  samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

  # Complete mapped reads list
  #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

  # Extract unmapped reads with a mapped pair
  samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

  # Re-compile bam file
  cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr0.tmp.bam
    samtools sort align.hgr0.tmp.bam align.hgr0
    samtools index align.hgr0.bam
    samtools flagstat align.hgr0.bam > align.hgr0.flagstat
    
  # Read Counts: align.hgr0.bam (NA19240_pcr)
    # 651340 + 0 in total (QC-passed reads + QC-failed reads)
    # 0 + 0 duplicates
    # 614264 + 0 mapped (94.31%:-nan%)
    # 651340 + 0 paired in sequencing
    # 325670 + 0 read1
    # 325670 + 0 read2
    # 166576 + 0 properly paired (25.57%:-nan%)
    # 577188 + 0 with itself and mate mapped
    # 37076 + 0 singletons (5.69%:-nan%)
    # 0 + 0 with mate mapped to a different chr
    # 0 + 0 with mate mapped to a different chr (mapQ>=5)
  
  rm *tmp* align.F4.bam align.f4F8.bam # Clean-up

# Rename the total Bam Files
  mv aligned.bam $LIBRARY.bam
  mv aligned.bam.bai $LIBRARY.bam.bai
  mv aligned.flagstat $LIBRARY.flagstat

# Rename the hgr0 Bam files
  mv align.hgr0.bam $LIBRARY.hgr0.bam
  mv align.hgr0.bam.bai $LIBRARY.hgr0.bam.bai
  mv align.hgr0.flagstat $LIBRARY.hgr0.flagstat
  
# Primary VCF ----------------------------

# GATK variant calling over 45S region
  aws s3 cp s3://crownproject/resources/hgr0.gatk.fa ./
  aws s3 cp s3://crownproject/resources/hgr0.gatk.fa.fai ./
  aws s3 cp s3://crownproject/resources/hgr0.gatk.dict ./
  
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr0.gatk.fa -T HaplotypeCaller \
  -ploidy 2 --max_alternate_alleles 6 \
  -I $LIBRARY.bam -o $LIBRARY.hgr0.vcf
   # Memory issues, restrict to 45S region only
     # -ploidy 100, 50, 20 failed... do 2 and analyze 45S further
     
# Upload final output files to S3
 
# Alignments (Full)
 #aws s3 cp $LIBRARY.bam s3://crownproject/1kg_hgr0/
 #aws s3 cp $LIBRARY.bam.bai s3://crownproject/1kg_hgr0/
 aws s3 cp $LIBRARY.flagstat s3://crownproject/1kg_hgr0/

# Alignments (Aligned)
  aws s3 cp $LIBRARY.hgr0.bam s3://crownproject/1kg_hgr0/
  aws s3 cp $LIBRARY.hgr0.bam.bai s3://crownproject/1kg_hgr0/
  aws s3 cp $LIBRARY.hgr0.flagstat s3://crownproject/1kg_hgr0/

# VCF
 aws s3 cp $LIBRARY.hgr0.vcf s3://crownproject/1kg_hgr0/
 aws s3 cp $LIBRARY.hgr0.vcf.idx s3://crownproject/1kg_hgr0/
 
# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
aws ec2 terminate-instances --instance-ids $EC2ID

# Script complete
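
The read-subset step in the script depends on two bits of the SAM FLAG field: 0x4 (read unmapped) and 0x8 (mate unmapped). `-F 4` keeps reads with 0x4 clear, and `-f 4 -F 8` keeps reads with 0x4 set and 0x8 clear. A small sketch of the same test in shell arithmetic is below; `classify_flag` and its labels are hypothetical names used for illustration only.

```bash
# Classify a SAM FLAG value the way the two samtools filters do.
classify_flag() {
  flag=$1
  if [ $(( flag & 4 )) -eq 0 ]; then
    echo mapped                  # kept by: samtools view -F 4
  elif [ $(( flag & 8 )) -eq 0 ]; then
    echo unmapped_mate_mapped    # kept by: samtools view -f 4 -F 8
  else
    echo pair_unmapped           # dropped by both filters
  fi
}

# Example flags: 99 = paired, proper pair, mate reverse, first in pair
#                133 = paired, read unmapped, second in pair
#                77 = paired, read unmapped, mate unmapped, first in pair
for f in 99 133 77; do
  printf '%s\t%s\n' "$f" "$(classify_flag "$f")"
done
```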

## Results - Discussion

After a bit of hand-holding and refining the script to `1kg_align_v0_b.sh` (note: still called `1kg_align_v0.sh` on disk) the alignments worked.

This is a much faster pipeline and I can 'open the hose'. I talked with S. Jackman last night about ways I can auto-launch EC2 instances with a script, and he suggested coding the ssh command to log in to the EC2 machine and run the commands automatically. The hope is that the EC2 launch / run / close cycle is all automated and I can do 1000 genomes alignments from a single CSV file.
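
One piece of that automation can be sketched already: turning a CSV into the per-sample commands used in the EC2 runs above. The column order (`library,sample,accession,population,fastq1,fastq2`) and the function name `emit_align_cmds` are assumptions for illustration. Launching the instances and running the commands over ssh is not covered here.

```bash
# Read sample rows from stdin and print one pipeline command per row.
# Assumed columns: library,sample,accession,population,fastq1,fastq2
emit_align_cmds() {
  while IFS=, read -r library sample accession population fastq1 fastq2; do
    # Skip blank lines, comment lines, and a header row
    case $library in ''|'#'*|library) continue ;; esac
    printf 'sh 1kg_align_v0.sh %s %s %s %s %s %s\n' \
      "$library" "$sample" "$accession" "$population" "$fastq1" "$fastq2"
  done
}

# Example with the NA19238 row from the trio
emit_align_cmds <<'EOF'
library,sample,accession,population,fastq1,fastq2
NA19238_pcr,NA19238,ERR899706,YRI,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899706/ERR899706_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR899/ERR899706/ERR899706_2.fastq.gz
EOF
```

Printing the commands rather than running them keeps the sketch safe to test locally; each line could later be handed to whatever mechanism starts the instance.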

![NA19240 Trio hgr0 Variation](../figure/20170222_na19240_trio_hgr0.png)

Time to open up the throttle on these alignments; the components are in place.

## Addendum - 170228 NA12878

From the 1kgpilot experiment, I left the NA12878 instance 'online' but paused. I didn't want to re-download the data (which was stupid and expensive in the end).

Anyways, I resumed the m4.4xlarge instance and ran the 1kg_hgr0 pipeline with the command:

```
sh 1kg_align_v0.sh NA12878_pcr NA12878 SRR622457 CEU ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR622/SRR622457/SRR622457_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR622/SRR622457/SRR622457_2.fastq.gz
```