# Encodings of hg19
```
pi:ababaian
start: 2016 12 19
partially complete : 2017 01 02
```
## Introduction

This work was done between Dec 19 and Jan 2 and is now being written up as a digital notebook from the notes I took.

To analyze the information content of alternative encodings of hg19, a standard genome was chosen and conversion scripts were written.

## Objective

* Write Standard-Encoding to Alternative-Encoding Scripts
* Create RY, SW, MK, B and Random encoded genomes for further experiments
* Create Sub-genomes: transcriptome, coding-sequence-ome, transposable elements, simple repeats, regul-ome for sub-analysis


## Materials and Methods

### Resources Used
* [UCSC hg19](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit)

### Software Used
* [fastaexplode](http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate)



In [2]:
# Resource Directory
SERRATUS='/home/artem/Serratus'

cd $SERRATUS
 mkdir -p resources
 cd resources



In [None]:
## Human Genome (hg19)
# ==============================================================
  cd $SERRATUS/resources/
  mkdir -p hg19
  cd hg19

# Copied hg19.fa (UCSC)
  cp ~/resources/hg19.fa ./hg19.fa

# Expand genome to all chromosomes
  fastaexplode hg19.fa

# Manually removed alternative haplotypes (higher similarity)
# to get a more 'fair' view of the haploid genome.
# Note: the chr*_* glob below also matches unplaced and *_random contigs.
# Repeat afterwards with hg38 and haplotypes included. This will likely
# yield even more repetition per search
  rm chr*_*.fa hg19.fa
  
# Genome of chr1 ... chrX (concatenated in shell glob order, e.g. chr10 before chr2)
  cat chr* > hg19.fa
  rm chr*


# Note: The downloaded genome is soft-masked (repeats are in lower case),
# so it's important to retain upper and lower case. It is possible to
# analyze a 'Unique' genome by simply masking all lower-case characters
# to Ns
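
The hard-masking mentioned in the note above could look roughly like this. It is a minimal sketch that was not run as part of this notebook: `mask_lowercase` and the toy FASTA file are hypothetical names used only for illustration, and header lines are skipped so sequence names stay intact.

```shell
# Hypothetical helper (not part of the original notebook):
# replace soft-masked (lower-case) bases with N, leaving FASTA headers untouched.
mask_lowercase() {
  sed '/^>/!s/[acgtn]/N/g' "$1"
}

# Toy input standing in for hg19.fa
toy_masked=$(mktemp)
printf '>chr_toy\nACGTacgtNNac\nGGcc\n' > "$toy_masked"

# Print the hard-masked version to stdout
mask_lowercase "$toy_masked"
```

For the real genome the output would be redirected to a new file (for example a hypothetical `hg19_unique.fa`) rather than printed.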

In [None]:
## Non-Standard Encodings (hg19)
# ==============================================================
# Create Purine/Pyrimidine Genome (ry)
# CA genome
  cp hg19.fa hg19_ry.fa

  # Collapse Pyrimidines (T -> C); /^>/! restricts edits to sequence lines
  sed -i '/^>/!s/t/c/g' hg19_ry.fa
  sed -i '/^>/!s/T/C/g' hg19_ry.fa

  # Collapse Purines (G -> A)
  sed -i '/^>/!s/g/a/g' hg19_ry.fa
  sed -i '/^>/!s/G/A/g' hg19_ry.fa



# Create Strong/Weak Genome (sw)
# GT genome
  cp hg19.fa hg19_sw.fa

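  # Collapse Strong nt (C -> G); /^>/! skips FASTA headers,
  # otherwise '>chr1' would be rewritten to '>ghr1'
  sed -i '/^>/!s/c/g/g' hg19_sw.fa
  sed -i '/^>/!s/C/G/g' hg19_sw.fa

  # Collapse Weak nt (A -> T)
  sed -i '/^>/!s/a/t/g' hg19_sw.fa
  sed -i '/^>/!s/A/T/g' hg19_sw.fa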



# Create Amino/Keto Genome (mk)
# GA genome
  cp hg19.fa hg19_mk.fa

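  # Collapse Amino nt (C -> A); /^>/! skips FASTA headers,
  # otherwise '>chr1' would be rewritten to '>ahr1'
  sed -i '/^>/!s/c/a/g' hg19_mk.fa
  sed -i '/^>/!s/C/A/g' hg19_mk.fa

  # Collapse Keto nt (T -> G)
  sed -i '/^>/!s/t/g/g' hg19_mk.fa
  sed -i '/^>/!s/T/G/g' hg19_mk.fa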
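
A quick way to check that an encoded genome really contains only the intended two letters is to count each base in the sequence lines. The following is a minimal sketch, not something that was run for this notebook: `count_base` and the toy file are hypothetical, and the toy content stands in for `hg19_ry.fa`.

```shell
# Hypothetical check (not part of the original notebook).
# usage: count_base FILE CHAR
# Counts occurrences of CHAR in the sequence lines of a FASTA file (headers excluded).
count_base() {
  grep -v '^>' "$1" | tr -cd "$2" | wc -c | tr -d '[:space:]'
}

# Toy RY-encoded input standing in for hg19_ry.fa
toy_ry=$(mktemp)
printf '>chr_toy\nACCAacca\nNNCA\n' > "$toy_ry"

# In an RY-encoded file the counts for G, T, g and t should all be 0
for b in A C G T a c g t N; do
  printf '%s\t%s\n' "$b" "$(count_base "$toy_ry" "$b")"
done
```

Counting one character per pass re-reads the file each time, but it avoids sorting billions of single characters, which matters at genome scale.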

In [None]:
## Lossy Encodings (hg19)
# ==============================================================
# To save space, not every single-nucleotide genome is included. H was chosen arbitrarily.

# As a control, include NOT-N genomes in the analysis.
# They too contain binary data of the genome but strip a lot of the data
# i.e.
# not-G (H)
#   A C T --> C
#   G     --> G

# not-A (B)
#   G C T --> T
#   A     --> A

# not-T (V)
#   A C G --> C
#   T     --> T

# not-C (D)
#   A G T --> T
#   C     --> C

# Note: V and D end up using the same two characters (C and T),
# so at least one of the genomes has to be ambiguous.
# All possible two-letter combinations: AT AC AG TC TG GC

# Create not-G Genome (h)
# GC genome
  cp hg19.fa hg19_h.fa
  # Collapse not-G nts (A, T -> C); /^>/! restricts edits to sequence lines
  sed -i '/^>/!s/[at]/c/g' hg19_h.fa
  sed -i '/^>/!s/[AT]/C/g' hg19_h.fa

# CODE FOR B V D genomes


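# Create not-A Genome (b)
# AT genome
  cp hg19.fa hg19_b.fa
  # Collapse not-A nts (G, C -> T); /^>/! skips FASTA headers,
  # otherwise '>chr1' would be rewritten to '>thr1'
  sed -i '/^>/!s/[gc]/t/g' hg19_b.fa
  sed -i '/^>/!s/[GC]/T/g' hg19_b.fa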

# Create not-T Genome (v)
# CT genome
  cp hg19.fa hg19_v.fa
  # Collapse not-T nts (A, G -> C); /^>/! restricts edits to sequence lines
  sed -i '/^>/!s/[ag]/c/g' hg19_v.fa
  sed -i '/^>/!s/[AG]/C/g' hg19_v.fa

# Create not-C Genome (d)
# TC genome
  cp hg19.fa hg19_d.fa
  # Collapse not-C nts (A, G -> T); /^>/! restricts edits to sequence lines
  sed -i '/^>/!s/[ag]/t/g' hg19_d.fa
  sed -i '/^>/!s/[AG]/T/g' hg19_d.fa
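
The four not-N encodings above all follow one pattern (map two bases onto a third, keep case, leave headers alone), so they could be produced by a single helper. This is a minimal sketch under stated assumptions, not the method used in this notebook: `encode_fasta` is a hypothetical name, and it swaps the per-character `s///g` calls for sed's `y` (transliterate) command.

```shell
# Hypothetical helper (not part of the original notebook).
# usage: encode_fasta IN.fa FROM TO
# FROM and TO are equal-length character lists passed to sed's y command.
# The first expression branches past header lines so sequence names are kept.
encode_fasta() {
  sed -e '/^>/b' -e "y/$2/$3/" "$1"
}

# Toy input standing in for hg19.fa
toy_enc=$(mktemp)
printf '>chr_toy\nACGTacgt\n' > "$toy_enc"

encode_fasta "$toy_enc" ATat CCcc   # not-G (h): A,T -> C
encode_fasta "$toy_enc" GCgc TTtt   # not-A (b): G,C -> T
encode_fasta "$toy_enc" AGag CCcc   # not-T (v): A,G -> C
encode_fasta "$toy_enc" AGag TTtt   # not-C (d): A,G -> T
```

Because `y` maps every listed character in a single pass, upper and lower case are handled together and the soft-masking information is preserved.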

## Further Genomes / Sub-genomes to analyze

* Human Coding Sequences (CDS)
* Human Transcriptome (GENCODE annotation)
* Human Transposable Elements (RepeatMasker)
* Human Genome hg38 with haplotype data
* Mouse mm10
* E. coli Genome
