# Set-up Reference Genomes
```
pi:ababaian
start: 2016 03 29 
complete : 2016 06 02
```
## Objective
Set-up reference genomes for the Crown Project

* hg38: Human Reference Genome
* hgr: Human Ribosomal Unit inserted on masked chr13
* hg38r: Human Reference Genome with ribosome unit injection
* chrM: Human Mitochondrial Genome

## Methods

In [1]:
# Navigate to Reference Directory
#
CROWN='/home/artem/Crown'

cd $CROWN/resources

# Initialize Directories
mkdir -p hg38
mkdir -p hgr
mkdir -p hg38r
mkdir -p chrM




In [None]:
#!/bin/bash
# hg38Get.sh

# Move to hg38 Directory
cd $CROWN/resources/hg38

# UCSC Download Genome
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

# Unzip archive
gzip -d hg38.fa.gz

# Index fasta file with samtools
samtools faidx hg38.fa

# Creates hg38.fa.fai file

### Download human Ribosomal Sequence

The human ribosomal DNA repeat sequence was downloaded from the NCBI Nucleotide website:

* [GenBank: U13369.1](http://www.ncbi.nlm.nih.gov/nuccore/555853)

Both GenBank (.gb) and FASTA (.fa) formats were downloaded.
The FASTA header was changed to `>rDNA`.

The file names are `rDNA.gb` and `rDNA.fa`, respectively.

#### Notes on rDNA.fa :
- rDNA.fa is 42,999 bp long
- 70 bp per line for 614 lines
- 19 bp on line 615

( These resources are in $CROWN/resources/rDNA )
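
The line arithmetic in the notes above (614 × 70 + 19 = 42,999) can be checked directly on a FASTA file. This is a minimal sketch on a toy record; the `fasta_stats` helper and the toy file are assumptions for illustration and are not part of the original notebook.

```shell
#!/bin/bash
# Hypothetical helper: for a single-record FASTA, print
# "<total bases> <number of full-width lines> <length of last line>".
fasta_stats() {
    local fasta=$1 width=$2
    awk -v w="$width" '
        /^>/ { next }                                 # skip the header line
        { total += length($0); last = length($0) }    # running total, last line length
        length($0) == w { full++ }                    # count full-width lines
        END { printf "%d %d %d\n", total, full, last }
    ' "$fasta"
}

# Toy record: 2 full lines of 10 bp plus a 4 bp remainder
tmp=$(mktemp -d)
printf '>toy\nACGTACGTAC\nGTACGTACGT\nACGT\n' > "$tmp/toy.fa"
stats=$(fasta_stats "$tmp/toy.fa" 10)
echo "$stats"

# The same arithmetic for rDNA.fa as described in the notes
rdna_len=$(( 614 * 70 + 19 ))
echo "$rdna_len"
```

Run against the real file, this would be called as `fasta_stats rDNA.fa 70`.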

In [2]:
cd $CROWN/resources/hgr
cp ../rDNA/rDNA.fa rDNA.fa
cp ../rDNA/rDNA.gb rDNA.gb
ls

rDNA.fa  rDNA.gb


In [3]:
# Transform fasta file from 70 bp per line to 50 bp per line
# since the rest of the hg38.fa genome uses 50 bp per line

# Buffer the fasta file into one line with no header
# Fold it into 50 bp lines

tail -n +2 rDNA.fa |
tr '\n' ' ' |
sed 's/ //g' - |
fold -w 50 - > 50fa.tmp

# Add back header
echo '>rDNA' > head.tmp

# Add back fasta sequence
cat 50fa.tmp >> head.tmp

# Ugly/specific code to add one last 'N'
# to the end of the sequence to round it out
sed -i 's/CGGGTTATAT/CGGGTTATATN/g' head.tmp

mv head.tmp rDNA.fa
rm *.tmp
ls

rDNA.fa  rDNA.gb
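
The `sed` padding step in this cell is tied to the last ten bases of U13369. A more general sketch follows, assuming the goal is only to rewrap at a given width and pad with `N` to a whole number of lines. The `rewrap_fasta` helper is hypothetical, and its output is not guaranteed to be byte-identical to the file whose md5sum is recorded in this notebook.

```shell
#!/bin/bash
# Hypothetical helper: rewrap a single-record FASTA to a fixed width and
# pad the sequence with 'N' so that the last line is full.
rewrap_fasta() {
    local fasta=$1 width=$2
    local header seq pad
    header=$(head -n 1 "$fasta")
    seq=$(tail -n +2 "$fasta" | tr -d '\n')
    # N's needed to reach the next multiple of width (0 if already full)
    pad=$(( (width - ${#seq} % width) % width ))
    while [ "$pad" -gt 0 ]; do
        seq="${seq}N"
        pad=$(( pad - 1 ))
    done
    printf '%s\n' "$header"
    printf '%s\n' "$seq" | fold -w "$width"
}

# Toy record: 12 bp over two uneven lines, rewrapped to 5 bp per line
tmp=$(mktemp -d)
printf '>toy\nACGTACG\nTACGT\n' > "$tmp/toy.fa"
rewrap_fasta "$tmp/toy.fa" 5 > "$tmp/toy.5.fa"
cat "$tmp/toy.5.fa"
```

For rDNA.fa the equivalent call would be `rewrap_fasta rDNA.fa 50`, which pads 42,999 bp to 43,000 bp.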


In [4]:
# rDNA.fa
md5sum rDNA.fa # f3e692ec2beea5d9fd3f6b3ded273c2e

f3e692ec2beea5d9fd3f6b3ded273c2e  rDNA.fa


In [5]:
#!/bin/bash
# hgrMake.sh
#
# Combine rDNA.fa and the headers from hg38.fa
# A single U13369 ribosomal DNA unit will be artificially
# inserted into an empty all "N" chromosome on the
# acrocentric arm of Chromosome 13
#
# chr13:1,000,000-1,042,999
#
cd $CROWN/resources/hgr

# Initialize file
grep ">" ../hg38/hg38.fa > hgr.fa

# Initialize an all "N" Chr13:1-1,000,000
echo ">chr13" > chr13.tmp

# Print 1m "N"s and fold them to 50 characters per line
printf 'N%.0s' {1..1000000} | fold -w 50 - >> chr13.tmp

# and add terminal newline
echo -e '\n' >> chr13.tmp

# Append the rDNA sequence to acrocentric arm of chr13
tail -n +2 rDNA.fa >> chr13.tmp
echo -e '\n' >> chr13.tmp

# Add a buffer of 10k "N"s
printf 'N%.0s' {1..10000} | fold -w 50 - >> chr13.tmp

sed -i '/^\s*$/d' chr13.tmp # removes empty lines

echo -e '\n' >> chr13.tmp # add terminal newline

md5sum chr13.tmp # 335cf2769014e9968cc791034f40e18a
md5sum hgr.fa # acf3ab8b9360aaca24aac608148ef8b3

335cf2769014e9968cc791034f40e18a  chr13.tmp
acf3ab8b9360aaca24aac608148ef8b3  hgr.fa
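
The chromosome layout built in this cell (leading N's, the rDNA unit, then an N buffer) can be sketched at toy scale. This version is an assumption-laden variant, not the original method: it generates N runs with `head -c` and `tr` instead of brace expansion, and wraps the whole stream once. The `n_run` helper and the toy sizes are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: emit a run of N's of the given length, no trailing newline.
n_run() {
    head -c "$1" /dev/zero | tr '\0' 'N'
}

tmp=$(mktemp -d)
width=5
insert="ACGTACGTAC"    # stand-in for the rDNA unit (10 bp)

{
    echo ">chr13"
    # 20 bp of leading N's, the insert, then a 10 bp N buffer,
    # joined into one stream and wrapped once at the end
    { n_run 20; printf '%s' "$insert"; n_run 10; echo; } | fold -w "$width"
} > "$tmp/chr13.tmp"

cat "$tmp/chr13.tmp"
lines=$(wc -l < "$tmp/chr13.tmp")
```

Wrapping once at the end means there are no intermediate blank lines to strip and no separate terminal-newline step.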


In [8]:
# hgr_main.fa
# Delete rDNA array after position 13,500
# (after 28S, before polyT/TCT simple repeat)

sed '20277,20866d' hgr.fa > hgr_main.fa
samtools faidx hgr_main.fa

md5sum hgr_main.fa # 6b36e847dcffdb5be8b094f402fa8038

6b36e847dcffdb5be8b094f402fa8038  hgr_main.fa
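
The line range passed to `sed` follows from the layout of hgr.fa once chr13 is filled in. A worked sketch, assuming `>chr13` sits on line 6, chr13 starts with 1,000,000 N's, lines are 50 bp wide, and the padded rDNA unit is 43,000 bp; the variable names are illustrative only.

```shell
#!/bin/bash
width=50            # bases per line
header_line=6       # line number of ">chr13" in hgr.fa
n_prefix=1000000    # leading N's before the rDNA unit
rdna_len=43000      # rDNA unit including the trailing pad 'N'
keep_bp=13500       # bases of the rDNA unit to keep

rdna_first=$(( header_line + n_prefix / width + 1 ))   # first rDNA line
del_first=$(( rdna_first + keep_bp / width ))          # first line to delete
del_last=$(( rdna_first + rdna_len / width - 1 ))      # last rDNA line

range="${del_first},${del_last}d"
echo "$range"
```

The computed expression matches the one used in the cell, so the deletion keeps the first 13,500 bp of the unit and leaves the trailing N buffer in place.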


In [6]:
# Replace hgr.fa "chr13" which is empty with the chr13.tmp
CHR13=$(grep -n ">chr13$" hgr.fa | cut -f1 -d':' -  ) # = 6
ABOVE=$(expr $CHR13 - 1) # 5
BELOW=$(expr $CHR13 + 1) # 7
END=$(wc -l hgr.fa | cut -f1 -d' ' -)


# All entries starting above chr13
sed -n "1,$ABOVE"p hgr.fa > hgr.start

# All entries below chr13
sed -n "$BELOW,$END"p hgr.fa > hgr.end

cat hgr.start chr13.tmp hgr.end > hgr_2.fa

sed '/^$/d' hgr_2.fa > hgr.fa

rm hgr.start hgr.end hgr_2.fa

md5sum hgr.fa # 8744076592c47fcb7f2c04238ccbc924

8744076592c47fcb7f2c04238ccbc924  hgr.fa
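
The same splice can be written as a single pass over the header file. This is a sketch with `awk` on toy files, assuming a header-only FASTA like hgr.fa at this stage; the `splice_record` helper is hypothetical and has not been checked against the md5sum recorded in this notebook.

```shell
#!/bin/bash
# Hypothetical helper: print a header-only FASTA, substituting the full
# contents of a record file for the line that exactly matches the header.
splice_record() {
    local headers=$1 name=$2 record=$3
    awk -v target=">$name" -v rec="$record" '
        $0 == target {
            # Emit the replacement record, skipping blank lines
            while ((getline line < rec) > 0)
                if (line != "") print line
            close(rec)
            next
        }
        { print }
    ' "$headers"
}

tmp=$(mktemp -d)
printf '>chr1\n>chr13\n>chr13_random\n>chr2\n' > "$tmp/headers.fa"
printf '>chr13\nNNNNN\nACGTA\n\n' > "$tmp/chr13.tmp"
splice_record "$tmp/headers.fa" chr13 "$tmp/chr13.tmp" > "$tmp/out.fa"
cat "$tmp/out.fa"
```

The exact string comparison plays the role of the `$` anchor in the `grep` pattern: names such as `chr13_random` are left alone.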


In [None]:
#!/bin/bash
# hg38rMake.sh

# Extract and explode each chromosome from hg38

mkdir 38; cd 38
fastaexplode -f ../../hg38/hg38.fa
cd ..

mkdir hgr; cd hgr
fastaexplode -f ../../hgr/hgr.fa
cd ..


In [None]:
# Inject hgr.fa (chr13) into hg38 (chr13) at the beginning of chr13

# Reformat both fasta files so that they have the same line width
fastareformat chr13.tmp > chr13.r.tmp
fastareformat 38/chr13.fa > chr13.38.tmp

# Oddly specific line to remove last line from chr13.r.tmp
# so it's formatted correctly (full N line)
# tail may be more elegant here
sed -i '/^NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN$/d' chr13.r.tmp

# Post-rDNA end of chr13
RLEN=$(wc -l chr13.r.tmp | cut -f1 -d' ' -)
sed "1,$RLEN"d chr13.38.tmp > chr13.end

# Append rDNA insert and post-rDNA end of chr13
cat chr13.r.tmp chr13.end > chr13.fa
sed -i '/^$/d' chr13.fa

# Compare sizes before removing the intermediate files
wc chr13.fa
wc chr13.38.tmp

rm chr13.38.tmp chr13.end chr13.r.tmp chr13.tmp

mv chr13.fa 38/

In [None]:
# chrUn_GL000220v1
# This unplaced contig contains an rDNA sequence.
# It is not very high quality, but it is similar enough that it would
# cause substantial mapping problems with the rDNA insertion on chr13
#
# Mask this chromosome to N's

# Skip the '>' header line so the contig name is left intact
sed -i '/^>/!s/[atgcATGC]/N/g' 38/chrUn_GL000220v1.fa
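
When masking a FASTA record, the substitution should be limited to sequence lines, because the header itself contains letters such as `c` and `G`. A minimal sketch on a toy file, using the `sed` address negation `/^>/!`; the file name and contents are made up for illustration.

```shell
#!/bin/bash
tmp=$(mktemp -d)
printf '>chrUn_toy\nACGTnnacgt\nGGCC\n' > "$tmp/toy.fa"

# Mask bases on every line that is NOT a header line
sed '/^>/!s/[acgtACGT]/N/g' "$tmp/toy.fa" > "$tmp/toy.masked.fa"
cat "$tmp/toy.masked.fa"

# For contrast: the same substitution without the address also edits the header
bad_header=$(sed 's/[acgtACGT]/N/g' "$tmp/toy.fa" | head -n 1)
echo "$bad_header"
```

A quick `grep '>' file.fa` after masking is a cheap way to confirm the contig name survived.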

In [None]:
# Copy Mitochondrial chromosome to its own folder too
cp 38/chrM.fa ../chrM/


In [None]:
# Recompile hg38r genome
cut -f1 ../hg38/hg38.fa.fai | sed -e 's/$/.fa/g' - > hg38.chrList
cd 38
cat $(cat ../hg38.chrList) > hg38r.FULL
cd ..


mv 38/hg38r.FULL ../hg38r/hg38r.fa
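
A sketch of the same reassembly on toy files, with one added safeguard that is an assumption rather than part of the original workflow: each listed file is checked before it is concatenated, so a missing chromosome stops the build instead of silently dropping out. All file names here are illustrative.

```shell
#!/bin/bash
tmp=$(mktemp -d)
mkdir "$tmp/38"
printf '>chr1\nAAAA\n' > "$tmp/38/chr1.fa"
printf '>chr2\nCCCC\n' > "$tmp/38/chr2.fa"
printf '>chrM\nGGGG\n' > "$tmp/38/chrM.fa"

# Toy stand-in for a .fai index (only the first column is used)
printf 'chr1\t4\nchrM\t4\nchr2\t4\n' > "$tmp/toy.fa.fai"
cut -f1 "$tmp/toy.fa.fai" | sed 's/$/.fa/' > "$tmp/chrList"

# Concatenate in index order, stopping at the first missing file
: > "$tmp/rebuilt.fa"
missing=0
while read -r f; do
    if [ -f "$tmp/38/$f" ]; then
        cat "$tmp/38/$f" >> "$tmp/rebuilt.fa"
    else
        echo "missing: $f" >&2
        missing=1
        break
    fi
done < "$tmp/chrList"

# Record order in the rebuilt file follows the index, not the directory listing
order=$(grep '>' "$tmp/rebuilt.fa" | tr -d '>' | paste -sd' ' -)
echo "$order"
```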


In [None]:
# Superficial QC for hg38r
cd $CROWN/resources/hg38r

# I don't remember where I got fastacomposition but it's useful here
fastacomposition ../hg38/hg38.fa
fastacomposition hg38r.fa

md5sum hg38r.fa # 665e94ddf08da4d4a3a168e0f427b5f5
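
If `fastacomposition` is not available, a rough stand-in for this kind of superficial check is to count total bases and N's with `awk`. The `base_counts` helper below is hypothetical and only reports two numbers, not a full per-base composition.

```shell
#!/bin/bash
# Hypothetical helper: print "<total bases> <N count>" for a FASTA file,
# ignoring header lines and counting both 'N' and 'n'.
base_counts() {
    awk '
        /^>/ { next }
        {
            total += length($0)          # measure the line before gsub edits it
            n += gsub(/[Nn]/, "")        # gsub returns the number of replacements
        }
        END { printf "%d %d\n", total + 0, n + 0 }
    ' "$1"
}

tmp=$(mktemp -d)
printf '>a\nACGTNN\nnnAC\n>b\nNNNN\n' > "$tmp/toy.fa"
counts=$(base_counts "$tmp/toy.fa")
echo "$counts"
```

Comparing the two numbers between hg38.fa and hg38r.fa gives a quick view of how the total length and the N count changed between the two builds.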

In [None]:
#Clean up directory
cd $CROWN/resources/hgr

rm 38/* hgr/*
rmdir 38 hgr

rm hg38.chrList rDNA.fa rDNA.gb

In [None]:
# Go to Crown Directory
cd $CROWN/resources

# Index the genomes with samtools
cd hg38
samtools faidx hg38.fa

cd ../hg38r
samtools faidx hg38r.fa

cd ../hgr
samtools faidx hgr.fa

cd ../chrM
samtools faidx chrM.fa

cd ..

In [None]:
# Twobit conversion
cd $CROWN/resources/hg38r

faToTwoBit hg38r.fa hg38r.2bit


In [None]:
md5sum hg38r.2bit # c3bf4b0d9e35b352b7489726224e5aab

# Addendum
This generated the final outputs I need to base the project off of

One issue that arose is that git-lfs (GitHub really) doesn't like files greater than 2 GB (like hg38.fa and hg38r.fa)

To get around this I will upload these genomes as .2bit files. Meaning

```
faToTwoBit hg38[r].fa hg38[r].2bit
```

was run and the 2bit files will be stored in the repository through git-lfs.

To accommodate this, these files were added to .gitignore

```
resources/hg38/hg38.fa
resources/hg38r/hg38r.fa
```

(When downloading the repository make sure to unpack the 2bit files)

Finally the 2bit files were added to a local ignore list and deleted locally to save space

```
git update-index --assume-unchanged resources/hg38/hg38.2bit
git update-index --assume-unchanged resources/hg38r/hg38r.2bit
```


#QED

## Addendum II - 2016 10 20

For CNV analysis I'd like to do some genome-wide alignments along with rDNA alignments (i.e. with hg38r).

The hg38r which I currently have (Oct 20th) contains an rDNA contig. This contig is now converted to 'N's.

## Addendum III -  2016 10 31

Whoops, accidentally didn't put the rDNA onto chr13 in hg38r (it's OK in hgr), so the files made didn't have rDNA in them. Have to re-generate all resources.

Also all of chr13 in hg38 (original) was at some point turned into "N"s and I had to re-download. Lots of fixing -_-'

For now don't worry about hg38r genome. Focus on doing variant/structural variation identification. Remove hg38r.2bit from git repo.

## Addendum IV - 2016 11 06

Some of the hgr genome I've used is a bit wonky. There was a second copy of rDNA unit at position 2 million of chr13. The mappability
plot therefore has a second wiggle (identical to the first).

Regenerating hgr.fa. Confirmed it's correct. Moved other hgr.fa to $CROWN/resources/hgr/hr_DoubleError.

Also for alignment purposes, made hgr_main.fa (which is the 18S / 5.8S / 28S region plus a little bit of flanking sequence).