<h4>What is homology?</h4>

* Two genes are said to be homologous if they have both evolved from a common ancestor 
* Homology cannot be MEASURED i.e. it is impossible to be 50% HOMOLOGOUS – it is either YES or NO
* However, the LIKELIHOOD that two genes are related (i.e. are homologous) can be ESTIMATED

<h4>What are the measures of amino acid similarity?</h4>

1. Identity
2. Similarity in properties (e.g. hydrophobicity or size)
3. Similarity in Genetic code (codons)
4. Exchange propensities
   - Dayhoff Matrix
   - BLOSUM Matrix
   
   
---------------

<h4>Explain mutation (aka. substitution) matrices</h4>

* These are ways of calculating the similarity between two sequences in order to infer evolution
* The log-odds scores are used to express the probability of transformation:
$$\log\frac{\text{observed freq.}}{\text{expected freq.}} = \log\frac{P(j \mid i) \cdot \text{freq. of } i}{\text{freq. of } j \cdot \text{freq. of } i}$$
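
As a rough illustration of the formula above, the sketch below computes a log-odds score from a conditional probability and two background frequencies. The function name, the base-2 logarithm and the toy frequencies are assumptions made for this example; they are not taken from any real substitution matrix.

```python
import math

def log_odds(p_j_given_i, freq_i, freq_j):
    """Log-odds score for aligning residue i with residue j (base-2 log assumed)."""
    observed = p_j_given_i * freq_i   # how often the pair (i, j) is actually seen
    expected = freq_j * freq_i        # how often the pair would occur by chance
    return math.log2(observed / expected)

# Hypothetical numbers: j replaces i twice as often as chance predicts
favoured = log_odds(p_j_given_i=0.5, freq_i=0.5, freq_j=0.25)
# Hypothetical numbers: j replaces i half as often as chance predicts
disfavoured = log_odds(p_j_given_i=0.125, freq_i=0.5, freq_j=0.25)
print(favoured, disfavoured)
```

A pair seen more often than chance gets a positive score and a pair seen less often gets a negative score.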

<b>PAM</b>

The PAM (Point Accepted Mutation) matrix was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed. Using this logic, Dayhoff derived matrices as high as PAM250. 

A matrix for more distantly related sequences can be calculated from a matrix for closely related sequences by raising the latter to a power. For instance, we can roughly approximate the WIKI2 matrix from the WIKI1 matrix by saying $W_2 = W_1^2$, where $W_1$ is WIKI1 and $W_2$ is WIKI2. This is how the PAM250 matrix is calculated (by raising PAM1 to the power 250).
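
The matrix-power idea can be sketched in plain Python. The 3-letter alphabet and the values in `W1` are hypothetical (chosen so that each row sums to 1 and about 1% of residues change per step); a real PAM1 matrix is 20 x 20.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Hypothetical 1-step mutation probability matrix (rows sum to 1)
W1 = [[0.990, 0.006, 0.004],
      [0.005, 0.990, 0.005],
      [0.002, 0.008, 0.990]]

W2 = mat_pow(W1, 2)      # analogue of a "2-step" matrix
W250 = mat_pow(W1, 250)  # analogue of PAM250 for this toy alphabet
print(W2[0][0], W250[0][0])
```

The probability of a residue staying unchanged (the diagonal) drops as the power increases, which is why higher-numbered PAM matrices suit more divergent sequences.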

<b>BLOSUM</b>

Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. 

The probabilities used in the matrix calculation are computed by comparing present-day related sequences to each other, rather than to inferred common ancestors.

It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST. A positive score is given to the more likely substitutions while a negative score is given to the less likely substitutions.

<b>Differences between PAM and BLOSUM</b>

PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution.

The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in a series of alignments that are not allowed to contain gaps.

-------------------------

<h4>Explain Dynamic Programming for global sequence comparison</h4>

* Used to compare the similarity in two sequences of possibly different lengths (i.e. allowing gaps where they don't match)
* But where sequences are presumed to be related over their entire lengths

<img src="../_img/bio_2.jpg" width="100">

* Put your two sequences into a matrix - one along the top and the other down the side
* Decide how you are going to align your sequences (this means that where they don't match you might leave a gap in one so that other portions do match)
* This means you could have one of these scenarios:
   - Match: The two letters are the same
   - Mismatch: The two letters are different
   - Indel (INsertion or DELetion) : One letter aligns to a gap in the other string
   
There are various ways to score these three scenarios. 

---------------

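For example, take an alignment with 7 non-gap positions (assuming each scores +1) and a single gap of length 3:

* Constant gap penalty: 1 gap, 7 non-gaps, so score = 7 - 1 = 6
* Linear gap penalty: length of gap = 3, so score = 7 - 3 = 4
* Affine gap penalty: let A = 1 and B = 2, so gap cost = A + (B * gap length) = 7, giving score = 7 - 7 = 0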
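
The three schemes can be written as small functions. This is a minimal sketch; the function names and default parameter values are assumptions chosen to reproduce the numbers in the list above, and the affine form follows the formula used in these notes (A + B * gap length).

```python
def constant_gap_penalty(gap_length, penalty=1):
    """Same cost for any gap, regardless of its length."""
    return penalty if gap_length > 0 else 0

def linear_gap_penalty(gap_length, per_position=1):
    """Cost grows in proportion to the gap length."""
    return per_position * gap_length

def affine_gap_penalty(gap_length, opening=1, extension=2):
    """Opening cost A plus extension cost B for each gap position."""
    return opening + extension * gap_length if gap_length > 0 else 0

non_gap_score = 7  # 7 non-gap positions scored +1 each
gap_length = 3
for penalty in (constant_gap_penalty, linear_gap_penalty, affine_gap_penalty):
    print(penalty.__name__, non_gap_score - penalty(gap_length))
```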


<h4>NW Global Alignment</h4>

* Needleman and Wunsch (NW) algorithm; in this example matches are given +1, mismatches are given -1 and indels are given -1

<img src="../_img/bio_3.jpg" width="500">

<img src="../_img/bio_4.jpg" width="500">
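
The matrix-filling step of a Needleman-Wunsch style global alignment can be sketched as follows, using the +1 / -1 / -1 scoring given above. The function name is an assumption, and the sketch only returns the optimal score (it does not trace back to recover the alignment itself).

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal global alignment score of strings a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs one indel per letter
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + indel      # gap in b
            left = score[i][j - 1] + indel    # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]

print(needleman_wunsch_score("ABC", "ADC"))  # two matches, one mismatch
```

Every cell depends only on its upper, left and upper-left neighbours, which is what makes the dynamic programming approach work.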

<b>The Smith-Waterman Algorithm</b>
* The SW algorithm is guaranteed to find the optimal-scoring local
alignment between two sequences
* The optimum local alignment could be identical to the optimum global
alignment
* There could be other significant local alignments, but SW will just return
the highest scoring one
* We can find other significant alignments by masking out the initial SW
alignment and re-running the algorithm
* Note that SW will only produce a local alignment under certain
conditions, i.e. with certain score and gap penalty choices. For example, a score
matrix without any negative scores is unlikely to result in a local
alignment
* Necessary conditions for local alignment behaviour are hard to
predetermine, but as a rule of thumb, for aligned random sequences,
the expectation value of the alignment score should be negative for local
behaviour to result.
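
A minimal sketch of the Smith-Waterman recurrence is shown below. The function name and the default +1 / -1 / -1 scores are assumptions for illustration; the only change from a global alignment recurrence is that cell scores are never allowed to drop below zero and the best cell anywhere in the matrix is reported.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, indel=-1):
    """Best local alignment score of strings a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # borders stay at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + indel
            left = score[i][j - 1] + indel
            score[i][j] = max(0, diagonal, up, left)  # 0 restarts the alignment
            best = max(best, score[i][j])
    return best

# The shared "ABC" region is found despite unrelated flanking letters
print(smith_waterman_score("XXXABCYYY", "ZZABCZZ"))
# With no negative scores the result no longer behaves like a local alignment
print(smith_waterman_score("ABC", "AXBXC", mismatch=0, indel=0))
```

The second call illustrates the note above: with no negative scores there is nothing to stop the alignment extending across the whole of both sequences.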

<b>The FASTA method</b>

<img src="https://github.com/BadrulAlom/Data-Science-Notes/raw/master/_img/bio/bio099.png" height="100" width="300">

Notes:
* A larger ktuple increases speed since fewer “hits” are found, but it also decreases sensitivity for finding similar but not identical sequences, since exact matches of this length are required.

FASTA can miss significant similarity since, for proteins, similar sequences do not have to share identical residues:
* Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a ktuple size of 1 since none of the amino acids match
* Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but there is no match with a ktuple size of 2 (no common words of length 2)
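
The two examples above can be checked with a short sketch that lists the words (ktuples) two sequences share. The function names are assumptions; the sequences use one-letter codes (Asp = D, Lys = K, Val = V, Glu = E, Arg = R, Ile = I, Gly = G), and the small score table holds only the handful of BLOSUM62 entries needed here.

```python
def ktuples(seq, k):
    """Set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_ktuples(a, b, k):
    """Words of length k that occur in both sequences (the 'hits' a word search needs)."""
    return ktuples(a, k) & ktuples(b, k)

# Subset of BLOSUM62 scores, enough for the examples below
BLOSUM62_SUBSET = {("D", "E"): 2, ("K", "R"): 2, ("V", "I"): 3, ("G", "G"): 6}

def ungapped_score(a, b):
    """Sum of substitution scores for two equal-length sequences aligned without gaps."""
    total = 0
    for x, y in zip(a, b):
        key = (x, y) if (x, y) in BLOSUM62_SUBSET else (y, x)
        total += BLOSUM62_SUBSET[key]  # raises KeyError for pairs outside the subset
    return total

print(shared_ktuples("DKV", "ERI", 1), ungapped_score("DKV", "ERI"))
print(shared_ktuples("GDGKG", "GEGRG", 2), ungapped_score("GDGKG", "GEGRG"))
```

In both cases the set of shared words is empty even though every aligned pair has a positive substitution score, which is the kind of similarity an exact-word search cannot see.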

<h4> Explain multiple sequence alignment</h4>

* Although pairwise sequence comparison is the single most widely used similarity/homology detection method, multiple sequence methods are much more powerful

* Allows a statistical model to be built of a whole family of sequences rather than just an isolated pair of sequences -- basically given a sequence, you get a feel for how many different (or not) places it seems to occur across biology

Typical algorithm:

* Compare all pairs of sequences (i.e. align every sequence with every other sequence).
* Construct a guide tree based on these alignments
* Build the multiple alignment by following the order of the tree branches – i.e. first aligning the most similar pair of sequences within each cluster, then the next most similar pair and so on, until all sequences have been aligned.

<img src="https://github.com/BadrulAlom/Data-Science-Notes/raw/master/_img/bio/bio098.png" height="100" width="400">
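
The guide tree step can be sketched as a simple clustering. This is a toy version under several assumptions: the sequences are already the same length, the fraction of mismatched positions stands in for a real pairwise alignment score, and clusters are joined by average distance. The names `mismatch_fraction` and `guide_tree_order` are made up for this example.

```python
from itertools import combinations

def mismatch_fraction(a, b):
    """Toy distance: fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_distance(seqs, cluster1, cluster2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(mismatch_fraction(seqs[a], seqs[b]) for a in cluster1 for b in cluster2)
    return total / (len(cluster1) * len(cluster2))

def guide_tree_order(seqs):
    """Return the list of cluster merges, most similar pair first."""
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(seqs, pair[0], pair[1]))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

# Hypothetical sequences: s1/s2 are close, s3/s4 are close, the two pairs are more distant
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCA", "s4": "TTGTAGGA"}
for step in guide_tree_order(toy):
    print(step)
```

The merge order is the order in which a progressive aligner would combine sequences (and then groups of sequences) into the final multiple alignment.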


<h4>Explain Sequence Profiles</h4>

* One way of profiling sequences is by the exact ordering of letters they contain (e.g. a regular expression pattern)
* Sequence profiling provides a more general alternative to regular expression patterns
* It takes a set of aligned sequences and turns each position into a set of probabilities, based on how frequently each letter occurs at that position
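
A very small sketch of a frequency-based profile is shown below. It assumes a set of already aligned, equal-length sequences; the function names and the toy alignment are made up, and no pseudocounts are used, so any letter never seen at a position gets probability 0.

```python
from collections import Counter

def build_profile(aligned_seqs):
    """One dict per alignment column, mapping each letter to its observed frequency."""
    n = len(aligned_seqs)
    return [{letter: count / n for letter, count in Counter(column).items()}
            for column in zip(*aligned_seqs)]

def profile_probability(profile, seq):
    """Probability of seq under the profile, treating positions as independent."""
    probability = 1.0
    for column, letter in zip(profile, seq):
        probability *= column.get(letter, 0.0)
    return probability

# Hypothetical aligned family of four sequences
family = ["ACDE", "ACDF", "AGDE", "ACDE"]
profile = build_profile(family)
print(profile[1])                            # frequencies at the second position
print(profile_probability(profile, "ACDE"))  # consensus-like sequence
print(profile_probability(profile, "AGDF"))  # rarer combination, still allowed
```

Unlike a regular expression, the profile gives a graded answer: a rare but previously seen letter lowers the probability without ruling the sequence out.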

----------------

<h4> What is Pfam?</h4>

* Protein families database of alignments and HMMs
* Uses profile-HMMs to represent families

Allows you to:
- Look at multiple alignments
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures

---------------

<h4>  Describe the 3 main HMM algorithms </h4>

1. The Viterbi algorithm: get the most probable state sequence i.e. alignment

2. The Forward algorithm: get the overall match probability and (combined with the Backward algorithm) the probability of each state at each position

3. Expectation Maximization: derive the parameters of the model from the data
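
The first two algorithms can be sketched on a tiny two-state HMM. Everything about the model here (state names `A` and `B`, symbols `x` and `y`, and all probabilities) is hypothetical, and raw probabilities are multiplied directly for readability; practical implementations usually work in log space to avoid underflow.

```python
def viterbi(obs, states, start, trans, emit):
    """Return (probability, path) of the single most probable state sequence."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for s
            prob, path = max(((best[p][0] * trans[p][s], best[p][1]) for p in states),
                             key=lambda t: t[0])
            new_best[s] = (prob * emit[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

def forward(obs, states, start, trans, emit):
    """Return the total probability of obs, summed over all state sequences."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical toy model
states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

best_prob, best_path = viterbi("xxyy", states, start, trans, emit)
total_prob = forward("xxyy", states, start, trans, emit)
print(best_path, best_prob, total_prob)
```

The Viterbi probability can never exceed the Forward probability, since the best path is just one of the paths that the Forward algorithm sums over.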

HMMER is now the most widely used tool for building and using protein HMMs:

* Uses a slight simplification of the state model called “Plan 7” (but with some extra features)
* Now includes several heuristic speedups to allow searching of very large databases

------------------

<h4> Iterative HMM </h4>

1. Close homologues of a seed sequence are first found using a rapid databank search technique (usually BLAST) and a suitably conservative E-value threshold.
2. These sequences are multiply aligned (either using an HMM approach or a classic multiple sequence alignment technique such as CLUSTALW).
3. An initial HMM is constructed based on this seed alignment (most often with Viterbi training for speed).
4. The HMM is used to identify new members of the family in the database according to the log-odds ratio (i.e. sequences which score above a set threshold and which were not already found).
5. If no new members are found then the process terminates; otherwise the new members are added to the alignment and the procedure is repeated.
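
The control flow of this iterative search can be sketched with stand-in components. This is only an outline of the loop: `build_model` and `score` are hypothetical placeholders (a set of 3-letter words and the fraction of words shared), not an HMM, and the sequences and threshold are made up.

```python
def iterative_search(seed, database, build_model, score, threshold, max_rounds=10):
    """Grow a family from a seed until a round finds no new members."""
    family = [seed]
    for _ in range(max_rounds):
        model = build_model(family)
        new_members = [s for s in database
                       if s not in family and score(model, s) >= threshold]
        if not new_members:
            break  # converged: nothing new scores above the threshold
        family.extend(new_members)
    return family

def words(seq, k=3):
    """All words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_word_model(family):
    """Stand-in 'model': every 3-letter word seen in the current family."""
    return set().union(*(words(s) for s in family))

def word_score(model, seq):
    """Stand-in 'score': fraction of the sequence's words that the model contains."""
    seq_words = words(seq)
    return len(seq_words & model) / len(seq_words)

database = ["BCDEFG", "DEFGHI", "XYZXYZ"]
print(iterative_search("ABCDEF", database, build_word_model, word_score, threshold=0.5))
```

In this toy run `DEFGHI` scores below the threshold against the seed alone but above it once `BCDEFG` has joined the family, which is the benefit of iterating.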

<h4> Compared to BLAST</h4>

* Profile HMMs have higher sensitivity at a 1% false positive rate (FPR) compared to PSI-BLAST
* However, better PSI-BLAST statistics generally result in much sharper convergence


---------------------

<h4>Describe Secondary (or more) Protein Structure Predictions</h4>


The greatest unsolved problem in molecular biology: the Protein Folding Problem. How does a protein fold into its secondary, tertiary and quaternary structure to form what we see?

* 1st gen method: Based on propensity analysis (i.e. how likely each amino acid is to occur in a given type of structure). There are two ways to view the significance of this preference (or propensity):
   - It may control or affect the folding of the protein in its immediate vicinity (amino acid determines structure)
   - It may constitute selective pressure to use particular amino acids in regions that must have a particular structure (structure determines amino acid)

* 2nd gen method: Uses information from neighbouring residues to improve the prediction

* 3rd gen method: Neural networks (NN), e.g. PSIPRED
   - PROs
      - High residue accuracy
      - Less underprediction of strands
      - Good quality segment predictions
   - CONs
      - Provides a prediction for the FAMILY CONSENSUS structure, NOT THE STRUCTURE OF THE TARGET SEQUENCE

<b>PSIPRED</b>

* Works directly on PSI-BLAST profiles
* Uses 2 stages of feedforward neural networks
* The first network predicts secondary structure
* The second network cleans up the outputs from the first network


<h4> What are common Measures of Secondary Structure Prediction Accuracy?</h4>

* Q3 scores give the percentage of correctly predicted residues across 3 states (H, E, C)
* Sov (segment overlap) scores give the percentage of correctly predicted SEGMENTS across 3 states
* Other scores such as the Matthews Correlation Coefficient try to identify accuracy for individual states (Coil, Strand, Helix)
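
Two of these measures are simple enough to sketch directly. The function names and the example strings are assumptions; predictions and observed structures are given as equal-length strings over H, E and C, and the correlation coefficient is returned as 0 when its denominator is zero (a common convention, assumed here).

```python
import math

def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def matthews_cc(predicted, observed, state):
    """Matthews correlation coefficient for one state (e.g. 'H') versus the rest."""
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if p == state and o == state:
            tp += 1
        elif p == state:
            fp += 1
        elif o == state:
            fn += 1
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

observed = "HHHEEECCC"
predicted = "HHHHEECCC"  # one strand residue wrongly predicted as helix
print(q3(predicted, observed))
print(matthews_cc(predicted, observed, "H"), matthews_cc(predicted, observed, "C"))
```

Q3 gives a single overall figure, whereas the per-state coefficient shows that the coil prediction is perfect while the helix prediction is not.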

<h4>Advantages of more data </h4>

* In most molecular evolution applications a common simplifying assumption is made that mutations occurring at one site in a protein are independent of mutations occurring in other sites

* This simplification allows the use of Markovian methods e.g. HMMs and profiles

* This is also the best we can do with limited data. With lots of sequence data, however, we can start considering coevolutionary epistatic effects.