# Defining Nucleotide Encodings
```
pi:ababaian
start: 2016 12 20
complete : 2017 01 02
```
## Introduction to IUPAC Characters

In bioinformatics, nucleotide sequence information is usually represented using the characters {A, T, U, C, G, N}. Additionally, [IUPAC](http://www.bioinformatics.org/sms/iupac.html) defines {R, Y, M, K, S, W, H, B, V, D, X} for the different combinations of {A, T, C, G}.
    
| IUPAC nucleotide code | Base  | Number of bases represented |
|-----------------------| ----- | ---------- |
|A                      |   Adenine      | 1 |
|T                      |	Thymine      | 1 |
|U                      |	Uracil       | 1 |
|C                      |   Cytosine     | 1 |
|G                      |   Guanine      | 1 |
| | | |
|S                      |	G or C       | 2 |
|W                      |	A or T       | 2 |
|R                      |	A or G       | 2 |
|Y                      |	C or T       | 2 |
|M                      |	A or C       | 2 | 
|K                      |	G or T       | 2 |
| | | |
|B                      |	C or G or T  | 3 |
|D                      |	A or G or T  | 3 |
|H                      |	A or C or T  | 3 |
|V                      |	A or C or G  | 3 |
| | | |
|N                      |	any base     | 4 |
|X                      |	unknown base | 0 |
|-                      |	gap          | 0 |
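
The table above can be held as a small lookup structure. The following is a minimal sketch, not part of the project code; the names `IUPAC`, `n_represented` and `matches` are hypothetical and chosen only for illustration.

```python
# Base sets for each code in the table above.
# X and '-' represent no concrete base, so their sets are empty.
IUPAC = {
    "A": "A", "T": "T", "U": "U", "C": "C", "G": "G",
    "S": "GC", "W": "AT", "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "", "-": "",
}

def n_represented(code):
    """Number of bases a code stands for (third column of the table)."""
    return len(IUPAC[code.upper()])

def matches(code, base):
    """True if `base` is one of the bases that `code` can represent."""
    return base.upper() in IUPAC[code.upper()]

print(n_represented("S"))  # 2
print(matches("R", "G"))   # True
```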


 

## ASCII Nucleotide Encodings

### Standard Encoding (ACTG)
The standard or ACTG-encoding for this project is the broadly used {A, T, C, G}. One character represents exactly one nucleotide. For the 16-bit string {ATGAACGT}, the 8-bit substrings in standard encoding are {ATGA} {TGAA} {GAAC} {AACG} {ACGT}.
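
The 8-bit substrings listed above are the overlapping 4-character windows of the string. A minimal sketch of that sliding window follows; the function name `kmers` is an assumption made for this example.

```python
def kmers(seq, k):
    """Return every overlapping substring of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Four nucleotides at 2 bits each give the 8-bit substrings.
print(kmers("ATGAACGT", 4))
# ['ATGA', 'TGAA', 'GAAC', 'AACG', 'ACGT']
```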
    
### Non-standard Encodings
Alternative encodings of the nucleotides are provided in the IUPAC format as well as in the letters of the standard encoding. The latter allows analysis software that expects {A, T, C, G} to parse the genomes without throwing errors.
    
#### Strong-Weak (SW)
    SW-encoding is {S, W} in IUPAC, but will also be {G, T} respectively.
    The 16-bit string {ATGAACGT} can only be incompletely represented in 8 bits by {WWSWWSSW}

#### Purine-Pyrimidine (RY)
    RY-encoding is {R, Y} in IUPAC, but will also be {C, A} respectively.
    The 16-bit string {ATGAACGT} can only be incompletely represented in 8 bits by {RYRRRYRY}
    
#### Amino-Keto (MK)
    MK-encoding is {M, K} in IUPAC, but will also be {A, G} respectively.
    The 16-bit string {ATGAACGT} can only be incompletely represented in 8 bits by {MKKMMMKK}
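
The three two-letter encodings can be sketched as character translations of a standard-encoded string. This is an illustrative sketch only: it produces the IUPAC letters, not the project's alternative standard-letter form, and the names `SW`, `RY`, `MK` and `encode` are hypothetical.

```python
# Translation tables from standard encoding to each two-letter IUPAC encoding.
SW = str.maketrans("ACGT", "WSSW")  # Strong = G/C, Weak = A/T
RY = str.maketrans("ACGT", "RYRY")  # puRine = A/G, pYrimidine = C/T
MK = str.maketrans("ACGT", "MMKK")  # aMino = A/C, Keto = G/T

def encode(seq, table):
    """Re-encode a standard {A, C, G, T} string with one of the tables."""
    return seq.upper().translate(table)

print(encode("ATGAACGT", SW))  # WWSWWSSW
print(encode("ATGAACGT", RY))  # RYRRRYRY
print(encode("ATGAACGT", MK))  # MKKMMMKK
```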
    
### Lossy Encodings
Finally, there are lossy B-, D-, H-, V-encodings.
    
    B-encoding is {B, A} in IUPAC or {T, A} respectively.
    V-encoding is {V, T} in IUPAC or {C, T} respectively. 
    D-encoding is {D, C} in IUPAC or {T, C} respectively. 
    H-encoding is {H, G} in IUPAC or {C, G} respectively.
    
Unlike the Non-standard Encodings, two lossy encodings cannot be combined to recapitulate the complete original information of a nucleotide string.
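
To see why no pair of lossy encodings is sufficient, one can check whether a pair gives each of the four bases a distinct combined signature. The sketch below uses the IUPAC letters only; `KEPT`, `lossy_encode` and `recovers_all_bases` are hypothetical names introduced for this example.

```python
from itertools import combinations

# Each lossy code collapses three bases into one letter and keeps the fourth.
KEPT = {"B": "A", "V": "T", "D": "C", "H": "G"}

def lossy_encode(seq, code):
    """Replace every base except the kept one with the lossy code letter."""
    kept = KEPT[code]
    return "".join(base if base == kept else code for base in seq.upper())

def recovers_all_bases(codes):
    """True if the combined encodings give A, C, G and T distinct signatures."""
    signatures = {
        tuple(lossy_encode(base, code) for code in codes) for base in "ACGT"
    }
    return len(signatures) == 4

print(lossy_encode("ATGAACGT", "B"))  # ABBAABBB
# No pair of lossy encodings separates all four bases.
print(any(recovers_all_bases(pair) for pair in combinations(KEPT, 2)))  # False
```

For any pair, the two bases that neither encoding keeps receive the same signature, so they cannot be told apart.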

## Binary Encodings
The set {A, T, C, G} can be minimally represented with 2 bits of information per nucleotide, which means that in binary each nucleotide can be written as one of the four combinations of two 1's or 0's. The convention in the .2bit file format is:

|        |  A  |  T  |  G  |  C  |
|--------|-----|-----|-----|-----|
| Binary |  10 |  00 |  11 |  01 |
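
As an illustration of the table above, a sequence can be packed into an integer two bits at a time. This is a sketch of the bit values only, not a reader or writer for the .2bit file format itself (which, as far as I am aware, stores four bases per byte alongside header and mask records); `TWO_BIT`, `pack` and `unpack` are hypothetical names.

```python
# 2-bit values from the table above.
TWO_BIT = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BASES = "TCAG"  # index in this string equals the 2-bit value

def pack(seq):
    """Pack a standard-encoded string into an int, first base in the high bits."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | TWO_BIT[base]
    return value

def unpack(value, length):
    """Recover `length` bases from an int produced by pack()."""
    return "".join(
        BASES[(value >> (2 * i)) & 0b11] for i in reversed(range(length))
    )

packed = pack("ATGAACGT")
print(format(packed, "016b"))  # 1000111010011100
print(unpack(packed, 8))       # ATGAACGT
```

The length must be passed to `unpack` because leading T's (00) are indistinguishable from padding in a plain integer.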


An alternative interpretation of this binary encoding is that the first bit represents whether the nucleotide is a Purine or a Pyrimidine, and the second bit represents whether it is a Strong or a Weak nucleotide.
    
In this way, any two non-standard encodings (excluding the lossy encodings) can completely represent the information in a standard-encoding string.


|        |  A  |  T  |  G  |  C  |
|--------|-----|-----|-----|-----|
| Binary |  10 |  00 |  11 |  01 |
||||||
| Purine |  1  |  0  |  1  |  0  |
| Strong |  0  |  0  |  1  |  1  |
| Amino  |  1  |  0  |  0  |  1  |


For the 16-bit string:
        - Standard: { A T G A A C G T }           16 bits
        - 2bit:     { 10 00 11 10 10 01 11 00 }   16 bits
        - RY:       { 1 0 1 1 1 0 1 0 }            8 bits
        - SW:       { 0 0 1 0 0 1 1 0 }            8 bits
        - MK:       { 1 0 0 1 1 1 0 0 }            8 bits
        
So 2-bit encoding is equivalent to RY-SW encoding, but the alternatives of RY-MK or SW-MK are just as feasible. The information is additive.
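
The additivity can be checked directly: in the bit table above, the Amino bit is the XOR of the Purine and Strong bits, so any two of the three bit strings determine the third and hence the nucleotide. The following sketch assumes the bit assignments from that table; the names `RY_BIT`, `SW_BIT`, `MK_BIT`, `bits` and `decode` are hypothetical.

```python
RY_BIT = {"A": 1, "T": 0, "G": 1, "C": 0}  # 1 = purine
SW_BIT = {"A": 0, "T": 0, "G": 1, "C": 1}  # 1 = strong
MK_BIT = {"A": 1, "T": 0, "G": 0, "C": 1}  # 1 = amino

def bits(seq, table):
    """One bit per nucleotide under the given encoding."""
    return [table[base] for base in seq.upper()]

def decode(ry, sw):
    """Rebuild the standard string from its RY and SW bit lists."""
    lookup = {(RY_BIT[b], SW_BIT[b]): b for b in "ATGC"}
    return "".join(lookup[pair] for pair in zip(ry, sw))

seq = "ATGAACGT"
ry, sw, mk = bits(seq, RY_BIT), bits(seq, SW_BIT), bits(seq, MK_BIT)

print(decode(ry, sw))                     # ATGAACGT
print(mk == [r ^ s for r, s in zip(ry, sw)])  # True

# RY-MK works as well: recover the SW bits by XOR, then decode.
print(decode(ry, [r ^ m for r, m in zip(ry, mk)]))  # ATGAACGT
```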
    

## Biological interpretation of Non-standard Encodings

### RY
The puRines, A and G, are synthesized by a pathway distinct from that of the pYrimidines, T and C. Purines and pyrimidines have distinct chemical structures, and the rate of mutation within the same class of nucleotide (transitions) is higher than the rate of mutation between classes (transversions).
    
This means that two homologous sequences in the genome will accumulate transition differences faster than they accumulate transversion differences. Because transition mutations are not represented in the RY encoding, homologous sequences are expected to be more similar in RY encoding than in any other encoding.
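
A toy comparison illustrates the point. The variant sequence below is made up for this example (two transitions and one transversion relative to {ATGAACGT}), and the helper names `hamming` and `is_transition` are hypothetical.

```python
RY = str.maketrans("ACGT", "RYRY")

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def is_transition(x, y):
    """True for a substitution that stays within purines or within pyrimidines."""
    return x != y and x.translate(RY) == y.translate(RY)

ref = "ATGAACGT"
var = "GTGATCGC"  # invented variant: A>G, A>T, T>C at positions 0, 4, 7

print(hamming(ref, var))                              # 3
print(hamming(ref.translate(RY), var.translate(RY)))  # 1
print([is_transition(x, y) for x, y in zip(ref, var) if x != y])
# [True, False, True]
```

Only the transversion survives in RY encoding; the two transitions are invisible.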
    
### SW 
The 'Strong' nucleotides, guanine and cytosine, which form three hydrogen bonds when base-paired, are distinct from the 'Weak' nucleotides, adenine and thymine, which form two hydrogen bonds when base-paired.
    
Blocks of high GC-content or AT-content, called isochores, stain differently with Giemsa and segregate gene-rich and gene-depleted genomic regions. These blocks have long been noted to differ in GC content, which means that given a string with high GC content, the adjacent sequence is more likely to have a high GC content as well.
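
GC content is simply the fraction of Strong characters, so it can be computed per window along a sequence. The sketch below uses a toy window size on the example string; real isochore analyses use far larger windows, and the names `gc_fraction` and `windowed_gc` are assumptions made for illustration.

```python
def gc_fraction(seq):
    """Fraction of Strong (G or C) nucleotides in a standard-encoded string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def windowed_gc(seq, window, step):
    """GC fraction of each full window, moving `step` characters at a time."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]

print(gc_fraction("ATGAACGT"))        # 0.375
print(windowed_gc("ATGAACGT", 4, 4))  # [0.25, 0.5]
```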
 
### MK 
Finally, the IUPAC division into 'Amino' (A and C) and 'Keto' (G and T) is based on a difference in the chemical structure of the nucleotides. This is probably the least well-defined grouping of nucleotides compared to the Strong-Weak and Purine-Pyrimidine axes.