First I created a small CRM program that, when given a nucleotide sequence, translates it into its protein counterpart.

In [1]:
import subprocess

subprocess.call('cat nuc2protein.crm', shell=True)

0

Talked with Jarrod, and we both came up with ideas about the direction of the project.

###Initial Need

Initially the idea was a fuzzy protein search. You input 'PXFARX{5}C' into the search and it gives you the location, promoters, etc. associated with derivatives of PFARQ. Basically we're trying to make a fuzzy search engine for [sequence motifs](https://en.wikipedia.org/wiki/Sequence_motif).

[Promoter Region][PXFARX{5}CXXCH][Region until Stop codon]

N = A,G,C,T
R = A/G
Y = C/T

PXFARX{5}CXXCH = CCN N{3} TTY GCN CGN|AGR N{15} TGY N{6} TGY CAY
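The expansion above can be sketched as a small translator from a protein motif to a regular expression over DNA. This is only a rough sketch: the names `CODONS`, `BASES`, and `motif_to_regex` are made up for illustration, the codon table only covers the residues in the example, and it ignores reading frame and the reverse strand.

```python
import re

# Degenerate codons for the residues used in the example, written with the
# N/R/Y shorthand from the notes. Extending to all 20 amino acids is left open.
CODONS = {
    "P": "CCN",
    "F": "TTY",
    "A": "GCN",
    "R": "CGN|AGR",
    "C": "TGY",
    "H": "CAY",
    "X": "NNN",  # any amino acid
}

# Degenerate base -> regex character class
BASES = {"N": "[ACGT]", "R": "[AG]", "Y": "[CT]"}


def codon_to_regex(codon):
    """Expand one degenerate codon (possibly with '|' alternatives)."""
    expanded = "".join(BASES.get(base, base) for base in codon)
    return "(?:" + expanded + ")"


def motif_to_regex(motif):
    """Translate a protein motif such as 'PXFARX{5}CXXCH' into a DNA regex."""
    parts = []
    # Each token is one residue letter with an optional {n} repeat count.
    for residue, count in re.findall(r"([A-Z])(?:\{(\d+)\})?", motif):
        piece = codon_to_regex(CODONS[residue])
        if count:
            piece += "{" + count + "}"
        parts.append(piece)
    return "".join(parts)


pattern = re.compile(motif_to_regex("PXFARX{5}CXXCH"))
# Hand-built test sequence: P X F A R, five arbitrary codons, C X X C H
dna = "CCAGGGTTTGCACGT" + "A" * 15 + "TGTAAACCCTGCCAT"
print(bool(pattern.fullmatch(dna)))
```

Going through a regex keeps the search exact at the nucleotide level; the fuzzy part would have to come from somewhere else (for example an edit-distance scan).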

Someone already built a [Regex search engine for DNA](http://benchling.engineering/dna-regex-search/).

###Derivatives of PFARQ

We defined derivatives as sequences in which an amino acid is replaced by one that is similar in some way. For example, F (phenylalanine) and Y (tyrosine) are both aromatic, so a derivative of PFARQ could be PYARQ.

Other differentiators could be amino acid size, charge (+/-), acidic groups, basic groups, polar/nonpolar, and (at the nucleotide level) purine vs. pyrimidine.

###How to find PFARQ derivatives
As I was reading, I came across the concept of the Levenshtein edit distance. I mentioned it to Jarrod, and we thought of a way we could merge the edit distance concept with the PFARQ derivatives.

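*Edit distance*: p*ho*to -> p*oh*to (1 operation if swapping adjacent letters counts as a single edit, as in Damerau-Levenshtein distance; 2 substitutions under plain Levenshtein distance)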
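To make the edit distance concrete, here is a minimal sketch of plain Levenshtein distance using the standard dynamic-programming recurrence (the function name `levenshtein` is just for illustration). Plain Levenshtein has no swap operation, so photo -> pohto costs 2 here; a swap counts as a single operation only in variants such as Damerau-Levenshtein.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("photo", "pohto"))  # 2
print(levenshtein("PFARQ", "PYARQ"))  # 1
```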

The genomics version would use amino acid sequences. Say you have the amino acid F (phenylalanine). Going from F -> [WY] (F, W, and Y are all aromatic), the edit distance weight would only be 0.5. Going from F -> G (the two are unrelated), the edit distance weight would be 2.

I came up with the idea of constructing a matrix of these weights between amino acids. 

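|       | P  | F  | A  | R  | Q  |
|-------|----|----|----|----|----|
| **P** | .5 | 2  | .7 | .. | .. |
| **F** |    |    |    |    |    |
| **A** |    |    |    |    |    |
| ..    |    |    |    |    |    |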
I then also thought about the idea of emergence, where maybe we input the weights in the matrix initially, but over time the optimal weight arrangement would appear.

Jarrod then suggested I could create synthetic sequences by changing the "initial conditions" of these weights, and then use a machine learning classifier or look through a database of known sequences to see if these synthetic sequences are legitimate.

His inspiration came from a [paper on chaotic mapping to generate musical variations](http://www.eecs.qmul.ac.uk/~eniale/teaching/ise575/c/presentations/7-1-dabby_chuan.pdf); see also https://dspace.mit.edu/bitstream/handle/1721.1/27282/Dabby_Diana_PhD_1995.pdf?sequence=1




###Literature search

Obviously before I try this, I want to see what other people have done.

####Fuzzy string matching libraries/classifiers written in Python

 * Fuzzy string matching library in Python based off of Levenshtein distance (https://github.com/seatgeek/fuzzywuzzy)

 * [Fuzzy matching classifier from yhat](http://blog.yhat.com/posts/fuzzy-matching-with-yhat.html)


####Motifs in DNA and their significance

[A Guided Tour to Approximate String Matching](http://www.ling.helsinki.fi/kit/2002s/ctl190net/materiaali/p31-navarro.pdf)

With the rise of high-throughput sequencing techniques, DNA sequence data has become abundant. Recently, scientists have become more interested in searching for gene sequence motifs because they are often locations for sequence-specific binding sites or are involved in important RNA processes [6]. Motifs are short, recurring patterns in DNA (normally between 5 and 20 base pairs long) that can be found in similar genes in different organisms. Motifs can encode structural motifs in protein sequences, which encode the secondary structure of proteins (maybe explain more about importance of motifs) [2]. Many algorithms have been developed to identify motifs in DNA: Bailey and Elkan [3] developed the MEME algorithm based on expectation maximization; Liu *et al.* [4] used self-organizing neural networks to find motifs in DNA and protein sequences; other tools like CLUSTAL W [5] first align multiple sequences so that motifs can then be identified from conserved regions.

List of other algorithms: [Quang, Daniel, and Xiaohui Xie. "EXTREME: an online EM algorithm for motif discovery." Bioinformatics 30, no. 12 (2014): 1667-1673.], [Yao, Zizhen, Zasha Weinberg, and Walter L. Ruzzo. "CMfinder—a covariance model based RNA motif finding algorithm." Bioinformatics 22, no. 4 (2006): 445-452.]

Once a motif is identified, biologists are often interested in finding its occurrences (need more info why, Jarrod any insight?) in other gene sequences (Talk about example with motifs being found in only certain types of organisms, in certain places, etc, any insight Jarrod?). DNA can be seen simply as long streams of text over an alphabet composed of A, C, G, and T, and thus computer algorithms that search for patterns in text can be used to find specified sequences like motifs. Unfortunately, exact pattern searching is often futile because gene sequences may have small differences between them due to sequencing error, evolutionary differences between organisms, and mutations. Thus there is tremendous value in developing a "fuzzy search" that would allow a certain amount of error in the search [9]. One way to allow a certain amount of error is via *edit distance*, most commonly exemplified by Levenshtein distance [10], a measure of the minimum number of operations needed to transform one string into another using only single-character insertions, deletions, or substitutions. In the evolution of DNA, certain kinds of point mutations via single-nucleotide deletions, insertions, and substitutions are more likely to occur than others [1]. By assigning weights to certain editing operations, one obtains a *weighted edit distance*, which is more interesting to computational biologists because it can reflect these differing mutation likelihoods.
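As a sketch of what a "fuzzy search" with a bounded number of errors can look like, here is the classic dynamic-programming scan for approximate substring matching (the semi-global variant of the edit distance table, where a match may start anywhere in the text). The function name `fuzzy_find` and the example sequences are made up; unit costs are assumed, and a weighted version would swap in per-operation weights.

```python
def fuzzy_find(pattern, text, max_edits):
    """Return (end_index, distance) for every position in text where some
    substring ending there is within max_edits of pattern."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and any substring
    # of text ending at the current position.
    col = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        new = [0]  # a match may start anywhere, so the empty prefix is free
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                           # extra character in text
                new[i - 1] + 1,                       # pattern character missing from text
                col[i - 1] + (pattern[i - 1] != ch),  # match or substitution
            ))
        col = new
        if col[m] <= max_edits:
            hits.append((pos + 1, col[m]))
    return hits


print(fuzzy_find("TGCA", "ACGTTGCATGA", 0))  # exact occurrences only
print(fuzzy_find("TGCA", "AATGGAAA", 1))     # allows one edit
```

The scan costs O(len(pattern) x len(text)) time, which is fine for short motifs but would need a smarter index for genome-scale searches.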

The nucleotide sequence of a motif can be described via a Position Frequency Matrix (PFM), which records how often each base occurs at every given site. This PFM can then be visualized via sequence logos [7] in which motifs can be easily identified via visual inspection as seen in Figure 1. of [8]. 
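A minimal sketch of building a PFM from pre-aligned, equal-length sites. The function name and the example sites are invented for illustration only.

```python
from collections import Counter


def position_frequency_matrix(sequences, alphabet="ACGT"):
    """Count how often each base occurs at each column of equal-length,
    pre-aligned sequences. Returns {base: [count per position]}."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    columns = [Counter(col) for col in zip(*sequences)]
    return {base: [c[base] for c in columns] for base in alphabet}


# Made-up aligned sites, purely for illustration.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
pfm = position_frequency_matrix(sites)
for base, counts in pfm.items():
    print(base, counts)
```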

####Reference

1. Freese, Ernst. "The difference between spontaneous and base-analogue induced mutations of phage T4." Proceedings of the National Academy of Sciences 45, no. 4 (1959): 622-633.

2. Das, Modan K., and Ho-Kwok Dai. "A survey of DNA motif finding algorithms." BMC bioinformatics 8, no. Suppl 7 (2007): S21.

3. Bailey, Timothy L., and Charles Elkan. "Unsupervised learning of multiple motifs in biopolymers using expectation maximization." Machine learning 21, no. 1-2 (1995): 51-80.

4. Liu, Derong, B. DasGupta, and Huaguang Zhang. "Motif discoveries in unaligned molecular sequences using self-organizing neural networks." Neural Networks, IEEE Transactions on 17, no. 4 (2006): 919-928.

5. Thompson, Julie D., Desmond G. Higgins, and Toby J. Gibson. "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic acids research 22, no. 22 (1994): 4673-4680.

6. D'haeseleer, Patrik. "What are DNA sequence motifs?." Nature biotechnology 24, no. 4 (2006): 423-425.

7. Schneider, Thomas D., and R. Michael Stephens. "Sequence logos: a new way to display consensus sequences." Nucleic acids research 18, no. 20 (1990): 6097-6100.

8. D'haeseleer, Patrik. "What are DNA sequence motifs?." Nature biotechnology 24, no. 4 (2006): 423-425.

9. Navarro, Gonzalo. "A guided tour to approximate string matching." ACM computing surveys (CSUR) 33, no. 1 (2001): 31-88.

10. Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." In Soviet physics doklady, vol. 10, no. 8, pp. 707-710. 1966.

11. Behura, Susanta K., and David W. Severson. "Codon usage bias: causative factors, quantification methods and genome‐wide patterns: with emphasis on insect genomes." Biological Reviews 88, no. 1 (2013): 49-61.

So in terms of a "distance function", the Levenshtein distance would be most appropriate.

This PDF (https://web.stanford.edu/class/cs124/lec/med.pdf) has a pretty good explanation of the "weighted minimum edit distance" which is what I'm interested in, but so far it's in the realm of sequence alignment.

*Citation Bank*:  Needleman and Wunsch [1970], Waterman [1995], Wagner-Fischer algorithm, 
[Gusfield 1997 (p.224) on weighted edit distance](http://s3.amazonaws.com/academia.edu.documents/31053376/Algorithms_on_String_Trees_and_Sequences.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1466020763&Signature=vf5TMiTSz4AhH%2BrbUI%2FF%2FgboGvw%3D&response-content-disposition=inline%3B%20filename%3DAlgorithms_on_strings_trees_and_sequence.pdf), 

Basically in my searches (in which I discover that [Ben Langmead is a legend](http://www.langmead-lab.org/teaching-materials/)), the minimum edit distance concept seems to be mostly applied to alignment type things. Let's see if I can find papers that show it being applied in a PFARQ-like manner...

Distance d(A,B): minimum number of operations needed to transform genome A into genome B

###DCJ (Double Cut Join) operations
Model for edit distance based on gene order and orientation rather than nucleotide sequences. (lmao not what I want I don't think...). Basically Bergeron, et al defined DCJ distance as some graph theory thing. "The genomic distance problem in the Hannenhalli-Pevzner theory is the following: Given two genomes whose chromosomes are linear, calculate the minimum number of inversions and translocations that transform one genome into the other." - so this is the basis for synteny algorithms...


###What is a PWM (position weight matrix)?

Has 4 rows (if doing nucleotide comparison) or 20 rows for amino acids, and one column per position. First, create a position frequency matrix (PFM) by counting occurrences of each nucleotide at each position. Then create a position probability matrix (PPM) by dividing each count by the number of sequences. Then you figure out how frequently each letter appears in the dataset (the background); assuming each letter appears equally often, the background probability (bK) would be 0.25 for nucleotides and 0.05 for amino acids. Finally, each PPM entry m is transformed to ln(m/bK) (log base 2 is also common) to get the PWM.

When you add the relevant values at each position in the PWM, you get a "sequence score", a log-likelihood ratio that measures how different the sequence is from a random one. A score of 0 means the sequence is equally likely to be a functional site or a random one. Greater than 0 means it is more likely to be a functional site; less than 0 means it is more likely to be random.
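The PFM -> PPM -> PWM steps and the sequence score can be sketched as below. Assumptions to flag: the counts in `example_pfm` are invented, the background is uniform, and a pseudocount is added so that zero counts do not produce ln(0) (the notes above do not mention pseudocounts; this is a common workaround, not necessarily what any particular tool does).

```python
import math


def pwm_from_pfm(pfm, background=None, pseudocount=0.5):
    """Convert {base: [counts per position]} into log-odds weights."""
    bases = list(pfm)
    n_positions = len(pfm[bases[0]])
    if background is None:
        # Uniform background: 0.25 for nucleotides, 0.05 for amino acids.
        background = {b: 1.0 / len(bases) for b in bases}
    pwm = {b: [] for b in bases}
    for pos in range(n_positions):
        total = sum(pfm[b][pos] for b in bases) + pseudocount * len(bases)
        for b in bases:
            prob = (pfm[b][pos] + pseudocount) / total  # PPM entry
            pwm[b].append(math.log(prob / background[b]))
    return pwm


def score(pwm, seq):
    """Sum the PWM weights of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(seq))


# Invented counts for a 6-position motif built from 4 sites.
example_pfm = {
    "A": [0, 4, 0, 3, 4, 0],
    "C": [0, 0, 1, 0, 0, 0],
    "G": [1, 0, 0, 1, 0, 0],
    "T": [3, 0, 3, 0, 0, 4],
}
example_pwm = pwm_from_pfm(example_pfm)
print(score(example_pwm, "TATAAT"))  # consensus: positive score
print(score(example_pwm, "CCCCCC"))  # unlike the motif: negative score
```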

###Another interesting topic: Substitution matrices for amino acids/nucleotides

PAM1 means that, on average, the probability of an amino acid changing to another is ~1% and the probability of not changing is ~99% (doesn't seem legit...). [This link explains the PAM matrix really well](http://www.math.lsa.umich.edu/~dburns/seqalmit2.pdf). So the PAM250 matrix is a table of the expected change in amino acids?? Is there a way to consolidate this model with nucleotide changes, since this model appears to ignore them? PAM250 means 250 accepted point mutations per 100 residues; a position can change more than once, so this is not the same as 250% of residues differing.

Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345–352. National Biomedical Research Foundation, Silver Spring, MD.

Expected similarity: PAM120: 40%, PAM80: 50%, PAM60: 60%, PAM250: 15–30%. You have to run all of them and check which gives the highest ungapped alignment score.

*Derivation*: maximum parsimony (emphasizes least number of substitutions), relative mutability (ratio of total # of times an amino acid has changed vs. # of occurrences), **relative mutability** based on chemical properties, effective frequency (consider difference in variability of primary protein structure), **mutational probability matrix** (probability of an amino acid being substituted over a given unit of evolutionary time), log-odds (used most in practice)

tl;dr: [summary of types of substitution matrices for amino acids (PAM, BLOSUM, etc.)](http://homepages.ulb.ac.be/~dgonze/TEACHING/pam_blosum.pdf)

####Substitution matrices for protein sequences

**PAM Matrix is based on rate of divergence between sequences**. PAMn = PAM1^n (gives the probability to observe the changes i->j per 100xn mutations). *J. van Helden (log-odd PAM250)*

Schwartz RM, Dayhoff M. Matrices for detecting distant relationships. In: Dayhoff M, editor. Atlas of Protein Sequence and Structure. supplement 3. volume 5. National Biomedical Research Foundation; Silver Spring, MD: 1978. pp. 353–358.
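The relation PAMn = PAM1^n is just repeated matrix multiplication of the mutation probability matrix. A toy sketch with a 3-letter alphabet follows; the numbers in `toy_pam1` are invented so that each letter stays unchanged with probability 0.99, and they are not Dayhoff's values. Helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power by
    repeated squaring."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1": row i holds the probabilities of letter i becoming each letter.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after 250 steps, and the diagonal stays above zero: a position can change several times and even change back, which is why "250" does not translate into 250% of residues differing.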

**BLOSUM Substitution matrix - designed to find conserved regions in proteins**, not based on an explicit evolutionary model; derived from substitutions observed in ungapped conserved blocks of aligned proteins, so the scores end up reflecting chemical similarity (hydrophobicity, aromaticity, polarity, basicity, acidity)

 *Henikoff S and Henikoff JG (1992). Amino acid substitution matrices from
protein blocks. PNAS 89:10915-10919.* 

Improved BLOSUM62 (MP Styczynski, KL Jensen, I Rigoutsos, G Stephanopoulos
Nat. Biotech. 26: 274–275, 2008)

**GONNET Substitution matrix** - based on exhaustive sequence alignment analysis(Gonnet, Cohe

####Substitution matrices for nucleotides

Considering transition (purine to purine, pyrimidine to pyrimidine) and transversion effects (purine -> pyrimidine and vice versa)

Li, Wen-Hsiung. "Unbiased estimation of the rates of synonymous and nonsynonymous substitution." Journal of molecular evolution 36, no. 1 (1993): 96-99.
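The transition/transversion distinction is easy to express in code. A small sketch (function names are hypothetical) that classifies a single substitution and counts both kinds between two aligned, equal-length DNA strings:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'match', 'transition', or 'transversion' for two bases."""
    if ref not in "ACGT" or alt not in "ACGT" or len(ref) != 1 or len(alt) != 1:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        return "match"
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    kinds = [classify_substitution(a, b) for a, b in zip(seq_a, seq_b)]
    return kinds.count("transition"), kinds.count("transversion")


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
print(count_ts_tv("ACGT", "GCTT"))      # (1, 1)
```

A nucleotide substitution matrix could then give the two classes different weights, in the same spirit as the amino acid weights.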

####Idea: use the probability distribution found in PAM matrices as a basis for amino acid mutations in project. use PWM as scoring metric. MACHINE LEARN THESE TABLES

How to choose appropriate PAM matrix? PAM120 most appropriate for database searches, PAM200 most appropriate for comparing two specific proteins with suspected homology (Altschul SF, 1991)



The wikipedia article on [sequence motifs](https://en.wikipedia.org/wiki/Sequence_motif) led me to a bunch more papers (like this related but not quite relevant one: [Finding Tandem sequences (sequences that repeat one after another) algorithm](http://bioinformatics.oxfordjournals.org/content/23/2/e30.full))

###List of papers

* MaMF [A deterministic motif finding algorithm with application to the human genome, Hon and Jain (2006)](http://bioinformatics.oxfordjournals.org/content/22/9/1047.full): given a set of N promoters and motif width (w), MaMF can find high scoring motifs across different promoters. Their scoring function is based on motif conservation, uniqueness of motif relative to the genome (seeing how many motifs and approximate cousins live in the same genome?). They align two motifs using PWMs.

* [EXTREME: an online EM algorithm for motif discovery](http://bioinformatics.oxfordjournals.org/content/30/12/1667.full.pdf+html) (its more advanced cousin using neural networks: http://nar.oxfordjournals.org/content/early/2016/04/15/nar.gkw226.full.pdf+html)

* [Approximate String Searching under Weighted Edit Distance (Kurtz, 1996)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.6709&rep=rep1&type=pdf)

####References
* [Use of perceptron algorithm to distinguish translational initiation sites in E. coli (the OG paper that spawned the PWM)](http://nar.oxfordjournals.org/content/10/9/2997.full.pdf+html)

* [Prior knowledge in finding Motifs using MEME](http://www.aaai.org/Papers/ISMB/1995/ISMB95-003.p
* (J. Korhonen, P. Martinmäki, C. Pizzi, P. Rastas and E. Ukkonen. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 25(23), pages 3181-3182. (2009)) 

###Pattern Matching on Weighted Sequences AKA fuzzy matching for protein sequences
* **[Pattern Matching on Weighted Sequences (Christodoulakis, 2004)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.63.604&rep=rep1&type=pdf), Approximate Matching in Weighted Sequences - Amir 2006.** - I think this is the jackpot. Basically PXFARX{5}CXXCH devolves into CCN N{3} TTY GCN CGN|AGR N{15} TGY N{6} TGY CAY. *This nucleotide sequence can be represented by a Position Weight Matrix, of which we can then apply Pattern Matching* Probably need to somehow bring in BLOSUM65 matrices into this...
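A naive sketch of the general idea of matching a pattern against a weighted sequence, where every position carries a probability for each symbol and an occurrence counts only if the product of probabilities clears a threshold. This is a brute-force illustration under my own assumptions (data layout, names, threshold semantics), not the algorithm from either paper, and the example data is invented.

```python
def weighted_matches(weighted_seq, pattern, threshold):
    """Find starts where pattern occurs with probability >= threshold.

    weighted_seq is a list of {symbol: probability} dicts, one per position.
    Returns a list of (start, probability) tuples.
    """
    hits = []
    for start in range(len(weighted_seq) - len(pattern) + 1):
        prob = 1.0
        for offset, symbol in enumerate(pattern):
            prob *= weighted_seq[start + offset].get(symbol, 0.0)
            if prob < threshold:
                break  # cannot recover, probabilities only shrink
        else:
            hits.append((start, prob))
    return hits


# Invented weighted sequence: position 1 is C or G with equal probability.
wseq = [
    {"A": 1.0},
    {"C": 0.5, "G": 0.5},
    {"T": 1.0},
    {"A": 0.9, "C": 0.1},
    {"C": 1.0},
]
print(weighted_matches(wseq, "CT", 0.25))  # [(1, 0.5)]
```

A PWM column could play the role of each per-position dict here, which is roughly where a substitution matrix like BLOSUM might come in as an extra source of weights.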



##side idea: machine learning actual gene sequences, creating synthetic ones, blasting them to get scores, building 2d protein structures, assess stability, retrain

###Reading list for that:


