# hgr1 Reference rDNA
```
pi:ababaian
files: ~/Crown/data/hgr1/
start: 2017 03 03
complete : 2017 03 04
```
## Introduction

As I was preparing the larger array of libraries from 1000 Genomes data, I re-read the [data use policy](http://www.internationalgenome.org/IGSR_disclaimer). The site hosts both the 1000 Genomes Project data, which is freely available, and the Human Genome Structural Variation Consortium (HGSVC) data, which is currently under embargo. The embargoed set includes the NA19240 data [I used to make](./20170213_hgr0_reference_rDNA.ipynb) `hgr0`, which sucks. This is something I need to check in the future, but regardless, there is lots of available data which can be freely used: Genome in a Bottle and Illumina Platinum Pedigree are two such projects with open data.

Anyways, I guess the gentlemanly thing to do here is use public data and restart. As such, the index genome will now be Utah `NA12878`. I'll be using the [Illumina platinum pedigree](http://dx.doi.org/10.1101/gr.210500.116) data, which is PCR-free WGS at 50x for the trio.

I know exactly what I have to do this time though, so it should be faster. I'll have to remake the rDNA stats though -_-



## Objective

Define a reference 'rDNA' based on NA12878 for use in initial series of alignment experiments.


## Materials and Methods

Schematic of hgr1 reference rDNA
``` 
                         1M                                          +43 kb
hgr       ----------------|              U13369.1                     |---------
                          ( RNA45S )                                ( )promoter



              10k     -1kb 1M          13.5 kb
hgr0       ---| 5s |--|prom|    RNA45S |-----------------------------------------
```

### Reference Sequences

- The 5S and promoter sequence was directly taken from hgr0.
- `hgr.fa` reference sequence, Accession: U13369.1

### Index Data

- High Coverage WGS. 1kg
- This was part of hgr_pilot and is already done.

```
NA12878 NA12878	SRR622457	CEU	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR622/SRR622457/SRR622457_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR622/SRR622457/SRR622457_2.fastq.gz
```

- WGS, PCR-free. Illumina Platinum
- https://www.illumina.com/platinumgenomes.html

```
NA12878_pp	NA12878	ERR194147	CEU	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz

NA12891_pp	NA12891	ERR194160	CEU	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194160/ERR194160_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194160/ERR194160_2.fastq.gz

NA12892_pp	NA12892	ERR194161	CEU	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194161/ERR194161_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194161/ERR194161_2.fastq.gz
```

#### CEPH 1463 Pedigree
This is actually pretty cool because there is sequencing for the extended pedigree. Should include these data in analysis.

![NA12878 Pedigree. CEPH 1463](../figure/20170304_na12878_pedigree.png)
From: [Illumina platinum pedigree](http://dx.doi.org/10.1101/gr.210500.116) NA128**


### Drafting hgr1

#### Draft 1

Starting with hgr.fa sequence. `NA12878_pcr` was used as the initial index and fasta sequence was edited to match the consensus from the aligned reads.

Note: This dataset `SRR622457` isn't really PCR-free; there is substantial GC-bias which makes certain regions difficult to interpret. Coverage drops from ~10,000x to ~100x and the error rate appears to be higher. This is an older Illumina sequencing set, but large regions of the reference could still be corrected.

Output sequence: `hgr1_d1.fa`.

Perform realignment in `~/Crown/data/hgr1/d1/`


#### Re-alignment

Extract the aligned reads for re-alignment
```
samtools sort -n NA12878_pcr.45s.bam -o nsort.bam 
bam2fastx -q -NAQP -o na12878_reads.fq nsort.bam   
```

Prepare the draft reference sequence: 1) remove all fasta headers manually from the file; 2) refold the file to a standard width and add back the `>chr13` header; 3) index the genome.
```
# Join all sequence lines, then re-wrap at 50 bp per line
cat hgr1_draft_1.fa | tr '\n' ' ' | sed 's/ //g' - | fold -w 50 > hgr1_draft1_w50.fa

# (add the >chr13 header back by hand before indexing)
samtools faidx hgr1_draft1_w50.fa
bowtie2-build hgr1_draft1_w50.fa hgr1_d1
```
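The manual header removal and add-back could probably be folded into one step. A minimal sketch, assuming a hypothetical helper `refold_fasta` (not part of the original workflow) and that every record in the draft should be merged into a single `chr13` sequence:

```bash
# refold_fasta: hypothetical helper, not from the original notebook.
# Usage: refold_fasta <in.fa> <header_name> <width>
# Merges all records of <in.fa> into one sequence under a single header.
refold_fasta() {
  local in_fa=$1 name=$2 width=$3
  printf '>%s\n' "$name"
  # Drop header lines, strip all whitespace, then wrap to a fixed width
  grep -v '^>' "$in_fa" | tr -d ' \t\r\n' | fold -w "$width"
  # fold does not end the last line when its input has no trailing newline
  printf '\n'
}

# Toy example on a two-record file (sequence invented for illustration)
tmp_fa=$(mktemp)
printf '>a\nACGTAC\n>b\nGTACGT\nAC\n' > "$tmp_fa"
refold_fasta "$tmp_fa" chr13 5
rm -f "$tmp_fa"
```

For the real draft this would be something like `refold_fasta hgr1_draft_1.fa chr13 50 > hgr1_draft1_w50.fa`, followed by the same `samtools faidx` and `bowtie2-build` calls.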

Re-align reads to the draft reference sequence to look for errors/improved consensus
```
bowtie2 --very-sensitive-local -x hgr1_d1 -1 na12878_reads.1.fq -2 na12878_reads.2.fq | samtools view -bS - > re_aligned_unsorted.bam

samtools sort re_aligned_unsorted.bam -o re_aligned.bam
samtools index re_aligned.bam
```

#### Draft 2

Perform the same steps as in Draft 1. Plat. genomes will be re-aligned to draft 2 for further refinement, especially over GC-rich regions.

The na12878_pp.bam genome was completed successfully (see below). This was used as a secondary reference sequence to refine the sequence to draft 3.

#### Draft 3 CG

Testing adding 150 'C' and 150 'G' reads to the beginning of the file. A common source of error in some regions seems to be that a CG-rich region gets polyC or polyG reads aligning over it, which skews the results. Possibly adding a polyC - polyG sequence will hybridize against those reads instead.

```
bowtie2 --very-sensitive-local -x hgr1_d3cg -1 na12878_pp.1.fq -2 na12878_pp.2.fq | samtools view -bS - > re_aligned_unsorted_cg.bam

samtools sort re_aligned_unsorted_cg.bam -o na12878_pp_d3cg.bam
samtools index na12878_pp_d3cg.bam
```
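The notebook does not record exactly how the polyC / polyG tract was added. A rough sketch of one way to build it, assuming it goes in as its own FASTA record ahead of `chr13`; the record name `polyCG` and both helper names are placeholders:

```bash
# poly_run: print N copies of a single base (hypothetical helper)
poly_run() {
  local base=$1 n=$2
  printf "%${n}s" '' | tr ' ' "$base"
}

# make_cg_decoy: emit a FASTA record with 150 C followed by 150 G.
# The record name "polyCG" is an assumption, not taken from the notebook.
make_cg_decoy() {
  printf '>polyCG\n'
  { poly_run C 150; poly_run G 150; } | fold -w 50
  printf '\n'   # fold leaves the last line unterminated
}

make_cg_decoy
```

Something like `{ make_cg_decoy; cat <draft 3 fasta>; } > hgr1_d3cg.fa` would then put the decoy at the beginning of the file before running `bowtie2-build`.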

For Draft 4 / hgr1, there were a couple of SNV changes made. Otherwise it's done!


In [1]:
cd ~/Crown/data/hgr1/

md5sum hgr1.fa # with other chromosomes (empty)
md5sum hgr1_d4.fa # without other chromosomes ( = gatk version)

498146050141bfdcaa039844c1edd74a  hgr1.fa
1334c8cb69101aefab6242bf0387d01c  hgr1_d4.fa


### PP genome alignments

Since `SRR622457` isn't necessarily the best quality, re-align the platinum genomes to hgr.fa to allow for better drafting.

I've also re-worked the pipeline script: `1kg_align_v0.sh` to bypass some very time-expensive alignment steps. There isn't really a need for whole (100G+) bam files to be aligned when I'm interested in a subset of them. Bypassing this substantially reduces time and memory requirements.

- Running NA12878 and NA12891 on ec2 against hgr for use in drafting. Will keep the instances running, with the fastq files on them, to allow for rapid re-alignment to hgr1.fa once it's made. =D
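
The subsetting that makes this cheaper comes down to two SAM FLAG tests: keep reads that mapped, and keep unmapped reads whose mate mapped. As a rough illustration of what those filters select, here is a sketch in plain awk on SAM text. The function name `classify_sam` and the toy records are made up for the example; the real pipeline uses `samtools view` for this.

```bash
# classify_sam: read SAM text on stdin, print "<read name>\t<class>" per record.
# The classes mirror the two samtools filters in the pipeline:
#   mapped         ~ samtools view -F 4        (bit 0x4 unset)
#   unmapped_mate  ~ samtools view -f 4 -F 8   (bit 0x4 set, bit 0x8 unset)
#   dropped        = read and mate both unmapped
classify_sam() {
  awk -F'\t' '
    /^@/ { next }                         # skip header lines
    {
      flag = $2
      unmapped      = int(flag / 4) % 2   # bit 0x4: this read is unmapped
      mate_unmapped = int(flag / 8) % 2   # bit 0x8: the mate is unmapped
      if (!unmapped)           cls = "mapped"
      else if (!mate_unmapped) cls = "unmapped_mate"
      else                     cls = "dropped"
      print $1 "\t" cls
    }'
}

# Toy records: only the name and FLAG columns matter for this sketch
printf 'r1\t99\nr1\t147\nr2\t73\nr2\t133\nr3\t77\nr3\t141\n' | classify_sam
```

Here `r1` is a normally mapped pair, `r2` is a mapped read with an unmapped mate (both records are kept), and `r3` is a fully unmapped pair, which never makes it into the subset.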

`1kgpilot_2.sh`

In [None]:
#!/bin/bash
# 1kgpilot_2
# DNS: ec2-52-34-12-139.us-west-2.compute.amazonaws.com
# AMI: crown-170220 - ami-66129306
# EC2: c4.2xlarge (8cpu / 15 gb)
# Storage: 1000 Gb
# Start: 
# Alignment done: 
# Align.subset done: 
# End:
#
# CMD:
# ec2-52-34-12-139.us-west-2.compute.amazonaws.com
# sh 1kgpilot_2.sh NA12878_pp NA12878 ERR194147 CEU ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz

# ec2-52-27-70-31.us-west-2.compute.amazonaws.com
# sh 1kgpilot_2.sh NA12891_pp NA12891 ERR194160 CEU ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194160/ERR194160_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194160/ERR194160_2.fastq.gz


# Control Panel -------------------------------
# CPU
	THREADS='7'

# Sequencing Data
	LIBRARY=$1 # Library/ File name
	FASTQ1=$5
	FASTQ2=$6

    # File-names
    FQ1=$(basename $FASTQ1)
    FQ2=$(basename $FASTQ2)

# Read Group Data
	RGSM=$2   # Sample. Patient Identifier
	RGID=$3 # Read Group ID. Accession Number
	RGLB=$LIBRARY # Library Name. Accession Number
	RGPL='ILLUMINA'  # Sequencing Platform.
	RGPO=$4 # Patient Population
	# Extract Sequencing Run Info
	#  RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d':' | cut -f2 -d' ')


# Initialize workdir --------------------------

# Make working directory
  mkdir -p align; cd align

# Copy hgr genome and create bowtie2 index
  cp ~/resources/hgr_45s.fa ./
  samtools faidx hgr_45s.fa
  
  bowtie2-build hgr_45s.fa hgr
  
# Download Genome Sequencing Data
  wget $FASTQ1
  wget $FASTQ2

    # Extract Sequencing Run Info
    RGPU=$(gzip -dc $FQ1| head -n1 - | cut -f1 -d':' | cut -f2 -d' ')

# Primary Alignment -------------------------

# Bowtie2: align to genome

bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID --rg LB:$RGLB --rg SM:$RGSM \
--rg PL:$RGPL --rg PU:$RGPU -x hgr -1 $FQ1 -2 $FQ2 | samtools view -bS - > aligned_unsorted.bam

#rm $FQ1 $FQ2 # Remove fastq files to save space

# Sort alignment file
#  samtools sort -@ $THREADS aligned_unsorted.bam aligned
#  samtools index aligned.bam
#  rm aligned_unsorted.bam

# Calculate library flagstats
  samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat
  
# Read Subset ------------------------------
# Extract mapped reads, and their unmapped pairs

  # Extract Header
  samtools view -H aligned_unsorted.bam > align.header.tmp

  # Extract mapped reads (-F 4)
  # and unmapped reads whose mate is mapped (-f 4 -F 8)
  samtools view -b -F 4 aligned_unsorted.bam > align.F4.bam #mapped
  samtools view -b -f 4 -F 8 aligned_unsorted.bam > align.f4F8.bam #unmapped pairs
  
  # Extract just the 45S unit
  #aws s3 cp s3://crownproject/resources/rDNA_45s.bed ./
  #samtools view -b -L rDNA_45s.bed align.F4.bam > align.F4.45s.bam
  
  # What are the mapped readnames
  samtools view align.F4.bam | cut -f1 - > read.names.tmp
  
  # Extract mapped reads, including read pairs mapped on the edge
  # of the region of interest
  # -------|======= R O I ======| ----------
  # read:                  ====---====
  samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

  # Complete mapped reads list
  #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

  # Extract unmapped reads with a mapped pair
  samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

  # Re-compile bam file
  cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr.tmp.bam
    samtools sort align.hgr.tmp.bam align.hgr
    samtools index align.hgr.bam
    samtools flagstat align.hgr.bam > align.hgr.flagstat
    
  # Read Counts: align.hgr0.bam (NA19240_pcr)
    # 651340 + 0 in total (QC-passed reads + QC-failed reads)
    # 0 + 0 duplicates
    # 614264 + 0 mapped (94.31%:-nan%)
    # 651340 + 0 paired in sequencing
    # 325670 + 0 read1
    # 325670 + 0 read2
    # 166576 + 0 properly paired (25.57%:-nan%)
    # 577188 + 0 with itself and mate mapped
    # 37076 + 0 singletons (5.69%:-nan%)
    # 0 + 0 with mate mapped to a different chr
    # 0 + 0 with mate mapped to a different chr (mapQ>=5)
  
  rm *tmp* align.F4.bam align.f4F8.bam # Clean-up

# Rename the total Bam Files
  mv aligned_unsorted.bam $LIBRARY.bam
  #mv aligned_unsorted.bam.bai $LIBRARY.bam.bai # no index exists: whole-bam sort/index is skipped above
  mv aligned_unsorted.flagstat $LIBRARY.flagstat

# Rename the hgr Bam files
  mv align.hgr.bam $LIBRARY.hgr.bam
  mv align.hgr.bam.bai $LIBRARY.hgr.bam.bai
  mv align.hgr.flagstat $LIBRARY.hgr.flagstat
  
# Primary VCF ----------------------------

# GATK variant calling over 45S region
#  aws s3 cp s3://crownproject/resources/hgr.gatk.fa ./
#  aws s3 cp s3://crownproject/resources/hgr.gatk.fa.fai ./
#  aws s3 cp s3://crownproject/resources/hgr.gatk.dict ./
  
#  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
#  -R hgr.gatk.fa -T HaplotypeCaller \
#  -ploidy 2 --max_alternate_alleles 6 \
#  -I $LIBRARY.bam -o $LIBRARY.hgr.vcf
   # Memory issues, restrict to 45S region only
     # -ploidy 100, 50, 20 failed... do 2 and analyze 45S further
     
# Upload final output files to S3
 
# Alignments (Full)
 #aws s3 cp $LIBRARY.bam s3://crownproject/1kg_hgr0/
 #aws s3 cp $LIBRARY.bam.bai s3://crownproject/1kg_hgr0/
 aws s3 cp $LIBRARY.flagstat s3://crownproject/1kg_pilot/

# Alignments (Aligned)
  aws s3 cp $LIBRARY.hgr.bam s3://crownproject/1kg_pilot/
  aws s3 cp $LIBRARY.hgr.bam.bai s3://crownproject/1kg_pilot/
  aws s3 cp $LIBRARY.hgr.flagstat s3://crownproject/1kg_pilot/

# VCF (disabled while the GATK step above is commented out)
 #aws s3 cp $LIBRARY.hgr.vcf s3://crownproject/1kg_pilot/
 #aws s3 cp $LIBRARY.hgr.vcf.idx s3://crownproject/1kg_pilot/
 
# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
#aws ec2 terminate-instances --instance-ids $EC2ID

# Script complete

#### Standard vs. PCR-free alignment comparison

Standard Illumina
![NA12878 Standard](../figure/20170304_na12878_pcr_18S.png)

PCR-Free Platinum Genome
![NA12878 Plat. Genome. PCR Free](../figure/20170304_na12878_pp_18S.png)

### hgr1.fa Reference sequence

Reference sequence complete using NA12878 as an index.

### rDNA Stats for hgr1

This is a bit of a headache to do, but I need to re-generate the rDNA stats for this genome. I'll do this when I need something boring and trivial to work on. For now let's get aligning.

#### hgr1 BED file
#### hgr0 Bed file (draft 4 coords). Zero-based
```
chr13	10220	10340	5S	cctccttcagc|GTCTACGGCCA	TGTAGGCTTT|ttctttggctttt
chr13	1000000	1013400	45S gggttataatt|GCTGACACGCT	[Note should read gggttatt|GCTG...]	GGGTCGACCAGC|agaccgcgggtgg
chr13	1003653	1005522	18S tctaccttacc|TACCTGGTTGAT TGCGGAAGGATCATTA|acggagcccggaggg
chr13	1006615	1006772	5.8S cgacctgcgta|CGACTCTTAGCGG TGTCTGAGCGTCGCTT|gccgatcaatcgc
chr13	1007940	1013009	28S gtccccctccgaga|CGCGACCTCA CACAAGGGTTTGTC|cgcgcgcgcgtgc
```

#### rRNA Bed file for hgr1

```
chr13	10219	10340	5S
chr13	1000000	1013408	45S
chr13	1003660	1005529	18S
chr13	1006622	1006779	5.8S
chr13	1007947	1013018	28S
```
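
As a quick sanity check on these coordinates, the feature lengths can be read straight off the BED intervals (zero-based start, end exclusive, so length = end - start). A small sketch; `bed_lengths` is a made-up helper name:

```bash
# bed_lengths: print "<name>\t<length>" for each interval of a 4-column BED
bed_lengths() {
  awk -F'\t' '{ print $4 "\t" ($3 - $2) }' "$1"
}

# The hgr1 rRNA intervals from the table above, written tab-delimited
tmp_bed=$(mktemp)
printf 'chr13\t10219\t10340\t5S\n'        >  "$tmp_bed"
printf 'chr13\t1000000\t1013408\t45S\n'   >> "$tmp_bed"
printf 'chr13\t1003660\t1005529\t18S\n'   >> "$tmp_bed"
printf 'chr13\t1006622\t1006779\t5.8S\n'  >> "$tmp_bed"
printf 'chr13\t1007947\t1013018\t28S\n'   >> "$tmp_bed"

bed_lengths "$tmp_bed"
rm -f "$tmp_bed"
```

This gives 121 bp for 5S, 13408 bp for 45S, 1869 bp for 18S, 157 bp for 5.8S and 5071 bp for 28S.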

In [5]:
## Sequence Extraction (hgr1)

mkdir -p ~/Crown/data/rDNA_stats/hgr1
mkdir -p ~/Crown/data/rDNA_stats/hgr1/seq

cd ~/Crown/data/rDNA_stats/hgr1/seq

cp ~/Crown/resources/hgr1/hgr1.fa ./
cp ~/Crown/resources/hgr1/rDNA.bed ./

bedtools getfasta -fi hgr1.fa -bed rDNA.bed -fo FASTA.tmp -name

fastaexplode FASTA.tmp

rm FASTA.tmp



In [6]:
## Sequence Annotation
cd ~/Crown/data/rDNA_stats/hgr1/
mkdir -p annot; cd annot

cp ~/Crown/resources/hgr1/rDNA.bed ./
cp ~/Crown/resources/hgr1/rDNA.gtf ./



In [8]:
## GC Content Wig Initialize
cd ~/Crown/data/rDNA_stats/hgr1/
mkdir -p gc; cd gc

# cat rRNA_gc.bed
# bedtools expects tab-delimited BED, so write the tabs explicitly
  #5s_gc.bed
printf 'chr13\t9800\t13000\t5S\n' > 5s_gc.bed

  #45s_gc.bed
printf 'chr13\t998000\t1014000\t45S\n' > 45s_gc.bed

printf 'chr13\t1023550\n' > hgr1.fa.idx
cp ~/Crown/resources/hgr1/rDNA.bed ./

cp ../seq/hgr1.fa ./



In [13]:
#!/bin/bash
# gcContent Calculator
# for rDNA
#
# Calculates GC content of rDNA in sliding windows (window size set below)

# index =
# chr13	1023550

WINDOW='70'
SLIDE='1'
NAME="gc.w$WINDOW"

#bedtools makewindows -g hgr0.fa.idx -w $WINDOW -s $SLIDE > $NAME.bed
#bedtools makewindows -b rRNA_gc.bed -w $WINDOW -s $SLIDE > $NAME.bed

bedtools makewindows -b 5s_gc.bed -w $WINDOW -s $SLIDE > 5s.bed
bedtools makewindows -b 45s_gc.bed -w $WINDOW -s $SLIDE > 45s.bed


# wig start = region start + (0.5 * $WINDOW), e.g. 9800 + 35 = 9835
echo "fixedStep chrom=chr13 start=9835 step=$SLIDE" > $NAME.wig
bedtools nuc -fi hgr1.fa -bed 5s.bed | cut -f 5 - | sed 1d - >> $NAME.wig

echo "fixedStep chrom=chr13 start=998035 step=$SLIDE" >> $NAME.wig
bedtools nuc -fi hgr1.fa -bed 45s.bed | cut -f 5 - | sed 1d - >> $NAME.wig

rm 5s.bed 45s.bed
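
For reference, the two pieces of arithmetic in this cell can be checked without bedtools. The sketch below uses made-up helper names (`wig_start`, `gc_windows`) and a toy sequence. It reports the GC fraction of every full window, whereas `bedtools makewindows` also emits truncated windows at the end of a region, so the tail of the track will differ.

```bash
# wig_start: centre the first window, i.e. region start + WINDOW / 2
wig_start() {
  local region_start=$1 window=$2
  echo $(( region_start + window / 2 ))
}

# gc_windows: GC fraction of each full window, sliding by 1 bp
#   $1 = sequence string, $2 = window size
gc_windows() {
  awk -v seq="$1" -v w="$2" 'BEGIN {
    n = length(seq)
    for (i = 1; i + w - 1 <= n; i++) {
      win = substr(seq, i, w)
      gc = gsub(/[GCgc]/, "", win)   # gsub returns the number of G/C removed
      printf "%.6f\n", gc / w
    }
  }'
}

wig_start 9800 70      # 5S region
wig_start 998000 70    # 45S region
gc_windows GGCCAATT 4  # toy sequence, 4 bp windows
```

The two `wig_start` calls print 9835 and 998035, matching the `fixedStep` start values hard-coded above for `WINDOW='70'`.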



In [14]:
# clean-up
rm hgr1.fa hgr1.fa.fai hgr1.fa.idx *.bed

rm: cannot remove 'hgr1.fa.ix': No such file or directory


In [None]:
## Protein / BP Contacts

## pending