# Motif search and HMMs

In this lab, we will introduce BioPython's facilities to create and search for motifs, including building HMMs.

## Sequence motifs

Routines for dealing with motifs are implemented in the [Bio.motifs](https://biopython.org/docs/latest/api/Bio.motifs.html) package. It contains classes for interfacing with several motif databases, such as the transcription factor databases [TRANSFAC](https://genexplain.com/transfac/) and [JASPAR](https://jaspar.genereg.net/) via the [Bio.motifs.transfac](https://biopython.org/docs/latest/api/Bio.motifs.transfac.html) and [Bio.motifs.jaspar](https://biopython.org/docs/latest/api/Bio.motifs.jaspar.html) modules, and with the motif discovery tool [MEME](https://meme-suite.org/meme/) via the [Bio.motifs.meme](https://biopython.org/docs/latest/api/Bio.motifs.meme.html) module.

Here, we will show how to use BioPython directly to build a motif. The main entry point is the `create` function of the [Bio.motifs](https://biopython.org/docs/latest/api/Bio.motifs.html) package, which builds a motif from a set of [Bio.Seq](https://biopython.org/docs/latest/api/Bio.Seq.html) objects. Let's create a DNA motif.

In [None]:
from Bio import motifs
from Bio.Seq import Seq

dna_motif = motifs.create([
    Seq("TACAA"),
    Seq("TACGC"),
    Seq("TACAC"),
    Seq("TACCC"),
    Seq("AACCC"),
    Seq("AATGC"),
    Seq("AATGC")
    ])
# len(motif) is the motif length (number of positions), not the number of sequences
print("The motif has length {} and consists of the following sequences:".format(len(dna_motif)))
print(dna_motif)

The development of the motifs package was driven mainly by DNA motifs, as evidenced by the available parsers and by the fact that the default alphabet for motifs is the nucleotide one. To create a protein motif, one needs to specify the alphabet to be used (note that the [Bio.Alphabet](https://biopython.org/docs/1.76/api/Bio.Alphabet.html) package has been removed from BioPython, so the allowed characters need to be enumerated explicitly). Let's use the coronavirus spike protein PFAM family from the previous labs.

In [None]:
from Bio import SeqIO 
spike_proteins = list(SeqIO.parse("data/PF01600_serialized.faa", "fasta"))
pt_alphabet='ACDEFGHIKLMNPQRSTVWY-'
pt_motif = motifs.create([p.seq for p in spike_proteins], alphabet=pt_alphabet)

In [None]:
print(pt_motif)

The original sequences are stored in `motif.instances`, but more useful is the fact that creating a motif also produces a consensus sequence and a count matrix.

In [None]:
print("DNA motif consensus: {}\nProtein motif consensus: {}".format(dna_motif.consensus, pt_motif.consensus))

The motif object also contains the `counts` matrix, which holds the number of occurrences of each character at each position of the motif.
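To make the contents of the count matrix concrete, here is a minimal, library-independent sketch that tallies the letters column by column for the same seven DNA sequences. The variable names (`instances`, `column_counters`, `counts`, `consensus`) are illustrative only, and taking the most frequent letter per column is a simplified notion of a consensus, not necessarily the exact rule BioPython applies in the case of ties.

```python
from collections import Counter

# The same seven sequences that were used to build dna_motif
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
alphabet = "ACGT"

# zip(*instances) yields one tuple of letters per motif position (column)
column_counters = [Counter(column) for column in zip(*instances)]

# counts[letter][i] = number of sequences that have `letter` at position i
counts = {letter: [c[letter] for c in column_counters] for letter in alphabet}

# The most frequent letter in each column gives a simple consensus
consensus = "".join(c.most_common(1)[0][0] for c in column_counters)

for letter in alphabet:
    print(letter, counts[letter])
print("Consensus:", consensus)  # TACGC
```

Each column of the table sums to 7, the number of sequences in the motif.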

In [None]:
print(dna_motif.counts)
for symbol in dna_motif.alphabet:
    print("Counts of {}: {}".format(symbol, dna_motif.counts[symbol]))


More interesting for practical use are the [position weight matrix (PWM)](https://biopython.org/docs/1.75/api/Bio.motifs.matrix.html?highlight=log_odds#Bio.motifs.matrix.PositionWeightMatrix.log_odds) and the [position specific scoring matrix (PSSM)](https://biopython.org/docs/1.75/api/Bio.motifs.matrix.html?highlight=log_odds#Bio.motifs.matrix.PositionSpecificScoringMatrix). The PWM is simply the normalized count matrix (it contains frequencies).

In [None]:
print(dna_motif.pwm)
print(dna_motif.pssm)

As seen above, neither the PWM nor the PSSM contains pseudocounts. To get a PWM and a PSSM with pseudocounts, we can compute them ourselves from the count matrix, or we can use the `normalize` method of the count matrix.
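A minimal sketch of the manual computation, assuming that the same pseudocount is added to every cell of the count matrix before each column is renormalized to sum to 1. The helper `normalize_counts` is a hypothetical name used for illustration and is not part of BioPython; the count table is the DNA motif's count matrix written out by hand.

```python
def normalize_counts(counts, pseudocount=0.0):
    """Turn a {letter: [count per position]} table into frequencies."""
    letters = sorted(counts)
    length = len(counts[letters[0]])
    freqs = {letter: [] for letter in letters}
    for i in range(length):
        # Column total after adding the pseudocount to every letter
        total = sum(counts[letter][i] + pseudocount for letter in letters)
        for letter in letters:
            freqs[letter].append((counts[letter][i] + pseudocount) / total)
    return freqs


# Count matrix of the DNA motif (7 sequences, 5 positions)
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}

plain = normalize_counts(counts)
smoothed = normalize_counts(counts, pseudocount=0.5)
print([round(x, 3) for x in plain["G"]])
print([round(x, 3) for x in smoothed["G"]])
```

Without pseudocounts, letters that never occur at a position get a frequency of exactly 0; with a pseudocount of 0.5 they get a small positive frequency instead (here 0.5 / 9).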

In [None]:
pwm = dna_motif.counts.normalize(pseudocounts=0.5)
#pwm = dna_motif.counts.normalize(pseudocounts={'A':0.6, 'C': 0.4, 'G': 0.4, 'T': 0.6})
print(pwm)

To obtain the PSSM, we can use the [log_odds](https://biopython.org/docs/latest/api/Bio.motifs.matrix.html#Bio.motifs.matrix.PositionWeightMatrix.log_odds) method of the `PositionWeightMatrix`. It computes log odds based on the frequencies and the background distribution (uniform by default).
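The log-odds computation itself can be sketched in a few lines. This is an illustrative re-implementation under the assumption of base-2 logarithms and a uniform background unless one is given; the function below is our own and only shares its name with the BioPython method. The frequency table is a made-up toy example with two positions.

```python
import math


def log_odds(freqs, background=None):
    """Convert {letter: [frequency per position]} into log2 odds scores."""
    letters = sorted(freqs)
    if background is None:
        # Uniform background: every letter is equally likely
        background = {letter: 1.0 / len(letters) for letter in letters}
    scores = {}
    for letter in letters:
        scores[letter] = [
            # A zero frequency has no finite logarithm, so score it as -inf
            math.log2(p / background[letter]) if p > 0 else float("-inf")
            for p in freqs[letter]
        ]
    return scores


# Toy frequency table for a motif of length 2 (each column sums to 1)
freqs = {"A": [0.5, 0.25], "C": [0.25, 0.25], "G": [0.25, 0.0], "T": [0.0, 0.5]}
scores = log_odds(freqs)
print(scores["A"])  # [1.0, 0.0]
print(scores["G"])  # [0.0, -inf]
```

A score of 0 means the letter is exactly as frequent as in the background, positive scores mean enrichment, and the `-inf` entries show why pseudocounts matter: a single unseen letter would otherwise rule out a whole candidate site.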

In [None]:
pssm = pwm.log_odds()
print(pssm)
print(dna_motif.pssm)

Using a motif, we can search for exact matches (of any sequence forming the motif) in a sequence.

In [None]:
dna_seq=Seq("TACACTGCATTACAACCCAAGCATTA")

In [None]:
for pos, seq in dna_motif.instances.search(dna_seq):
    print("{} {}".format(pos, seq))

However, having a PSSM also enables us to search for sites that are more probable under the motif than under the background (score > 0) or that score above a specified threshold.

In [None]:
for position, score in pssm.search(dna_seq, threshold=3.0, both=True):
    print("Position {}: score = {:.3f}".format(position, score))

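The negative positions are hits of the reverse-complement motif, i.e., hits on the opposite strand. Thus, if you recreated the motif from the reverse complements of the sequences, the same hits would be reported with the strands swapped: the hits that now have negative positions would get non-negative positions, and vice versa.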

If we were interested in scores at every position we could do that, too.
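Conceptually, scoring every position means sliding a window of the motif's length along the sequence and summing the per-position log-odds scores of the letters in the window. The sketch below illustrates this for the forward strand only; the function `score_all_positions` and the toy PSSM values are hypothetical and chosen purely for illustration.

```python
def score_all_positions(pssm, sequence):
    """Score every window of the motif's length along the sequence."""
    length = len(next(iter(pssm.values())))
    scores = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        # Sum the score of each letter at its position within the window
        scores.append(sum(pssm[letter][i] for i, letter in enumerate(window)))
    return scores


# Toy log-odds scores for a motif of length 2 that prefers "AC"
toy_pssm = {
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-1.0, -1.0],
    "T": [-1.0, -1.0],
}

scores = score_all_positions(toy_pssm, "ACGAC")
print(scores)  # [2.0, -2.0, -2.0, 2.0]

# Thresholding the scores gives the hits, as in a PSSM search
hits = [(pos, s) for pos, s in enumerate(scores) if s >= 1.5]
print(hits)  # [(0, 2.0), (3, 2.0)]
```

A sequence of length n and a motif of length m yield n - m + 1 scores, one per possible start position.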

In [None]:
pssm.calculate(dna_seq)

The motif object includes an interface to the [WebLogo](http://weblogo.threeplusone.com/) tool. Be aware that this is just a call to the remote WebLogo service, so it requires an internet connection.

In [None]:
dna_motif.weblogo("dna_logo.png")
pt_motif.weblogo("spike_protein_logo.png",  alphabet='alphabet_protein', sequence_type='protein')

In [None]:
pt_motif.pssm.calculate(spike_proteins[0])

### ---- Begin Exercise ----
If you try to compute PSSM scores for a protein motif (`pt_motif.pssm.calculate(spike_proteins[0])`), you get an error message along the following lines:
```
C:\Python39\lib\site-packages\Bio\motifs\matrix.py in calculate(self, sequence)
    340         # TODO - Code itself tolerates ambiguous bases (as NaN).
    341         if sorted(self.alphabet) != ["A", "C", "G", "T"]:
--> 342             raise ValueError(
    343                 "PSSM has wrong alphabet: %s - Use only with DNA motifs" % self.alphabet
    344             )
```
This means that PSSM is actually implemented only for nucleotide sequences in the current (1.84) version of BioPython.
 
Implement code to carry out the `pssm.calculate` method for protein sequences. The method can be implemented either as an individual function or by extending the respective implementations, for example, by adding `calculate_proteins` into the PSSM class. 

But first, notice that to fit into the PSSM framework, we passed in the MSA directly, meaning we also passed the gaps. Gap symbols, however, cannot be part of the PSSM: when aligning a sequence to the profile, we do not know where the gaps should be located (remember the drawback of PSSMs from the lecture: they cannot handle insertions and deletions). Therefore, first identify the longest gap-free region of the MSA (i.e., the longest run of consecutive columns in which no sequence has a gap) and use that as the input to the profile. Alternatively, you can use the count matrix of the motif.

You can build the motif from spike proteins and test on SARS-CoV2:

```
spike_proteins = list(SeqIO.parse("data/PF01600_serialized.faa", "fasta"))
sars_cov2_spike = list(SeqIO.parse("data/YP_009724390.1_spike_protein.fa", "fasta"))[0] 
```

### ---- End Exercise ----

## HMMs

BioPython supports hidden Markov models via the [Bio.HMM](https://biopython.org/docs/1.74/api/Bio.HMM.MarkovModel.html) package. Specifically, it provides the capability to build and use HMMs in the [Bio.HMM.MarkovModel](https://biopython.org/docs/1.74/api/Bio.HMM.MarkovModel.html) module and to train them in the [Bio.HMM.Trainer](https://biopython.org/docs/1.74/api/Bio.HMM.Trainer.html) module. The trainer supports both supervised and unsupervised (Baum-Welch algorithm) training.

In the following example, we will implement the dishonest casino example (see the lecture).

First, we create the emission probabilities for the fair and loaded dice.

In [None]:
import numpy as np
from random import choices

fair_weights = list(np.ones(6) * 1/6)
loaded_weights = [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]

Next, we load the required packages and create the model using `MarkovModelBuilder`. The builder is simply passed all the model-defining transition and emission probabilities.

In [None]:
from Bio.HMM import MarkovModel
from Bio.HMM.Utilities import pretty_print_prediction

In [None]:
states = {'fair': 'F', 'loaded': 'L'}
alphabet = list(range(1,7))
mm_builder = MarkovModel.MarkovModelBuilder(state_alphabet=[states['fair'], states['loaded']], emission_alphabet=alphabet)
mm_builder.set_initial_probabilities({states['fair']: 0.5, states['loaded']: 0.5})
mm_builder.allow_transition(states['fair'], states['fair'], 0.95)
mm_builder.allow_transition(states['fair'], states['loaded'], 0.05)
mm_builder.allow_transition(states['loaded'], states['loaded'], 0.9)
mm_builder.allow_transition(states['loaded'], states['fair'], 0.1)
for symbol in alphabet:
    mm_builder.set_emission_score(states['fair'], symbol, fair_weights[symbol-1])
    mm_builder.set_emission_score(states['loaded'], symbol, loaded_weights[symbol-1])
    
m = mm_builder.get_markov_model()

Let's apply the model to a simple example where the rolls are generated either from the fair or from the loaded state.
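Labeling the rolls means finding the most probable sequence of hidden states, which is what the Viterbi algorithm computes. For intuition, here is a rough, library-independent sketch of Viterbi decoding in log space for the dishonest casino. The function `viterbi_sketch` and the dictionaries are illustrative names, not part of Bio.HMM; the sketch assumes a non-empty observation list and strictly positive probabilities.

```python
import math


def viterbi_sketch(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path as a string (log-space Viterbi)."""
    # best[t][s] = log-probability of the best path that ends in s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Choose the predecessor that maximizes the path probability
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            pointers[s] = prev
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


states = ["F", "L"]
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "L": {**{roll: 1 / 10 for roll in range(1, 6)}, 6: 1 / 2},
}

print(viterbi_sketch([6] * 10, states, start_p, trans_p, emit_p))
print(viterbi_sketch([1, 2, 3, 4, 5] * 2, states, start_p, trans_p, emit_p))
```

Ten sixes in a row are decoded as all loaded, while ten rolls without a six are decoded as all fair. With the BioPython model built above, the corresponding decoding is done by `m.viterbi`.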

In [None]:
rolls=choices(population=list(range(1,7)), weights=fair_weights, k=10)
print(rolls)
path = m.viterbi(rolls, states.values())
print(path)
rolls=choices(population=list(range(1,7)), weights=loaded_weights, k=10)
print(rolls)
path = m.viterbi(rolls, states.values())
print(path)


Now let's generate a sequence of rolls based on the given probabilities and let the model label it.

In [None]:
s = choices(population=list(['F', 'L']), weights=[0.5,0.5], k=1)[0]
t= {
    'F': [0.95, 0.05],
    'L': [0.1, 0.9]    
}
e={
    'F': fair_weights,
    'L': loaded_weights
}
rolls = []
true_states = []
for i in range(200):
    true_states.append(s)
    rolls.append(choices(population=list(range(1,7)), weights=e[s], k=1)[0])
    s = choices(population=list(['F', 'L']), weights=t[s], k=1)[0]
print(rolls)
print(true_states) 
    

In [None]:
path = m.viterbi(rolls, states.values())
print(path)

In [None]:
pretty_print_prediction([str(r) for r in rolls], true_states, [s for s in str(path[0])], line_width=40)

### ---- Begin Exercise ----
**This exercise is optional, but if you hand it in, you earn 5 extra points for the exam.**

Implement a profile HMM for the spike protein family's MSA (`data/PF01600_serialized.faa`) based on the description in the lecture and show the alignment with the SARS-CoV2 spike protein (`data/YP_009724390.1_spike_protein.fa`).
### ---- End Exercise ----