# Lecture 5.3: Tree Likelihoods

In [1]:
import numpy as np
import math
import matplotlib.pyplot as plt
import random
from scipy.special import comb
from scipy.linalg import expm

## Example 5.9 Tree Likelihood

**Parameters:**
$$
\begin{aligned}
\mu=1.2\\
t_1=0.3\\
t_2=0.5
\end{aligned}
$$

**1. What is the likelihood of this tree given the observed sequences under the JC69 model?**

To answer this we need to calculate $\mathbf{P}(t)$ for which we need to start by stating what $\mathbf{Q}$ is:

In [2]:
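# Parameters
mu = 1.2
t1 = 0.3
t2 = 0.5
# JC69 rate matrix: every off-diagonal rate is mu, each diagonal entry is -3*mu
QJC69 = np.array([[-3*mu, mu, mu, mu],
                  [mu, -3*mu, mu, mu],
                  [mu, mu, -3*mu, mu],
                  [mu, mu, mu, -3*mu]])
print(QJC69)
# Stationary distribution
piVec = [0.25, 0.25, 0.25, 0.25]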

[[-3.6  1.2  1.2  1.2]
 [ 1.2 -3.6  1.2  1.2]
 [ 1.2  1.2 -3.6  1.2]
 [ 1.2  1.2  1.2 -3.6]]
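
As a quick sanity check on the rate matrix and the stationary distribution stated above, the sketch below rebuilds the JC69 matrix and tests two standard properties: each row sums to zero, and $\pi\mathbf{Q}=0$. The names `Q`, `pi`, `row_sums`, and `flow` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

mu = 1.2  # substitution rate from Example 5.9

# JC69 rate matrix: every off-diagonal rate is mu, each diagonal entry is -3*mu
Q = mu * (np.ones((4, 4)) - 4 * np.eye(4))
pi = np.full(4, 0.25)  # uniform stationary distribution

row_sums = Q.sum(axis=1)  # each row of a rate matrix should sum to 0
flow = pi @ Q             # pi is stationary when pi Q = 0
print(np.allclose(row_sums, 0), np.allclose(flow, 0))
```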


Luckily, there is a built-in function, `scipy.linalg.expm`, that calculates the matrix exponential $\mathbf{P}(t)=e^{\mathbf{Q}t}$ for us. So we have:

In [3]:
def P(t):
    return expm(QJC69*t)

To test this out, evaluate $P(0)$, $P(0.01)$, and $P(10)$.

For $P(0)$ we would expect the system to stay put.

In [23]:
P(0)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

For small $t$ we would expect only a small probability of change, so the matrix stays close to the identity.

In [24]:
P(0.01)

array([[0.96485034, 0.01171655, 0.01171655, 0.01171655],
       [0.01171655, 0.96485034, 0.01171655, 0.01171655],
       [0.01171655, 0.01171655, 0.96485034, 0.01171655],
       [0.01171655, 0.01171655, 0.01171655, 0.96485034]])

For large $t$ every row is close to the stationary distribution, regardless of the starting state.

In [25]:
P(10)

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]])
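
The three checks above can also be compared against the closed-form JC69 transition probabilities, $P_{ii}(t)=\tfrac14+\tfrac34e^{-4\mu t}$ and $P_{ij}(t)=\tfrac14-\tfrac14e^{-4\mu t}$ for $i\neq j$. The following is a minimal sketch; the helper name `P_closed` is hypothetical and not part of the lecture code.

```python
import numpy as np
from scipy.linalg import expm

mu = 1.2
Q = mu * (np.ones((4, 4)) - 4 * np.eye(4))  # JC69 rate matrix

def P_closed(t):
    """JC69 closed form: 1/4 + 3/4*exp(-4*mu*t) on the diagonal,
    1/4 - 1/4*exp(-4*mu*t) off the diagonal."""
    decay = np.exp(-4 * mu * t)
    return 0.25 * (1 - decay) * np.ones((4, 4)) + decay * np.eye(4)

# Compare the numerical matrix exponential with the closed form
for t in (0, 0.01, 10):
    print(t, np.allclose(expm(Q * t), P_closed(t)))
```

Agreement between the two is a useful guard against a mistyped entry in $\mathbf{Q}$.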

Our goal is to calculate:
$$
\Pr(\mathcal{D}|\mathcal{T},\Theta)=\sum_{Y\in\{A,C,G,T\}}\Pr(Y)\,P_{Y,C}(t_1+t_2)\sum_{X\in\{A,C,G,T\}}P_{X,Y}(t_2)\,P_{G,X}(t_1)\,P_{G,X}(t_1)
$$
- We use the following indices for the nucleotides: $\{T:0,C:1,A:2,G:3\}$.
- Here $\Pr(Y)$ is the probability of having state $Y$ at the root. Assuming that evolution has run for a long time before reaching the root, this is given by the stationary distribution, $\Pr(Y)=\pi_Y$.

In [8]:
Lik=np.sum([piVec[y]*P(t1+t2)[y,1]*np.sum([P(t2)[x,y]*P(t1)[3,x]*P(t1)[3,x] for x in range(4)]) for y in range(4)])
print(Lik)

0.01823845983359636
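
The nested list comprehension above recomputes `P(...)` inside both loops. As an alternative sketch (not the lecture's method), the same double sum can be written with matrix operations after computing each transition matrix once. The names `P1`, `P2`, `P12`, `inner`, and `lik` are illustrative.

```python
import numpy as np
from scipy.linalg import expm

mu, t1, t2 = 1.2, 0.3, 0.5
Q = mu * (np.ones((4, 4)) - 4 * np.eye(4))  # JC69 rate matrix
pi = np.full(4, 0.25)                       # root prior (stationary distribution)
T, C, A, G = 0, 1, 2, 3                     # index convention used above

# Compute each transition matrix once
P1, P2, P12 = expm(Q * t1), expm(Q * t2), expm(Q * (t1 + t2))

# Inner sum over X for every root state Y at once:
# inner[y] = sum_x P2[x, y] * P1[G, x] * P1[G, x]
inner = P2.T @ (P1[G] * P1[G])

# Outer sum over Y, weighted by the root prior and the branch to the C tip
lik = np.sum(pi * P12[:, C] * inner)
print(lik)
```

This follows the same index order as the formula, so it should reproduce the value printed above.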


<span style="color:green;">**Discussion:** This value is quite small.  Does this make sense?</span>

**2. What is the probability that ancestor $Y$ is a $G$ given this tree topology and molecular model?**

To calculate this we use a version of Bayes' theorem in which every term is additionally conditioned on $C$:

$$
\Pr(A|B,C)=\frac{\Pr(B|A,C)\Pr(A|C)}{\Pr(B|C)}
$$

Here we consider a specific tree $\mathcal{T}$ (see the tree topology above) and molecular model $\Theta$ (JC69 with $\mu=1.2$) so:

$$
\Pr(Y=G|\mathcal{D},\mathcal{T},\Theta)=\frac{\Pr(\mathcal{D}|Y=G,\mathcal{T},\Theta)\Pr(Y=G|\mathcal{T},\Theta)}{\Pr(\mathcal{D}|\mathcal{T},\Theta)}
$$

where $\mathcal{D}$ is the 'data' at the tips.

Here $\Pr(Y=G|\mathcal{T},\Theta)=\pi_G$ is the "prior" probability that the root is a G. Assuming that evolution has occurred for a long time before this tree, our best guess for the root state is the stationary distribution of the JC69 model, so $\pi_G=\frac{1}{4}$.

The denominator $\Pr(\mathcal{D}|\mathcal{T},\Theta)$ is the likelihood we calculated in the last part.

Finally, $\Pr(\mathcal{D}|Y=G,\mathcal{T},\Theta)$ is very similar to the likelihood above, but with the root state fixed at G instead of summed over:

$$
\Pr(\mathcal{D}|Y=G,\mathcal{T},\Theta)=P_{\textcolor{red}{G},C}(t_1+t_2)\sum_{X\in\{A,C,G,T\}}P_{X,\textcolor{red}{G}}(t_2)P_{G,X}(t_1)P_{G,X}(t_1)
$$

In [19]:
def LikY(y):
    return P(t1+t2)[y,1]*np.sum([P(t2)[x,y]*P(t1)[3,x]*P(t1)[3,x] for x in range(4)])
print([LikY(0),LikY(1),LikY(2),LikY(3)])

[0.01705096667800147, 0.01854911398017205, 0.01705096667800147, 0.02030279199821045]


Now we calculate the posterior probability that the root state is $Y=y$; we are interested in $Y=3$ (that is, $G$) in particular.

As a check, the probabilities over all four root states must add up to 1.

In [17]:
def probY(y):
    # Posterior = likelihood given root state y * prior pi_y / total likelihood
    return LikY(y)*piVec[y]/Lik
print([probY(0),probY(1),probY(2),probY(3)])
np.sum([probY(y) for y in range(4)])

[0.2337226777037465, 0.2542582288939147, 0.2337226777037465, 0.27829641569859237]


1.0
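
The same posterior can be obtained by normalising the joint probabilities $\Pr(\mathcal{D},Y=y|\mathcal{T},\Theta)$ directly, which makes the sum-to-one check automatic. This is a minimal sketch under the parameters of Example 5.9; the names `joint` and `posterior` are illustrative.

```python
import numpy as np
from scipy.linalg import expm

mu, t1, t2 = 1.2, 0.3, 0.5
Q = mu * (np.ones((4, 4)) - 4 * np.eye(4))  # JC69 rate matrix
pi = np.full(4, 0.25)                       # prior on the root state
C, G = 1, 3                                 # indices from {T:0, C:1, A:2, G:3}

P1, P2, P12 = expm(Q * t1), expm(Q * t2), expm(Q * (t1 + t2))

# joint[y] = pi_y * Pr(D | Y=y), the numerator of Bayes' theorem
joint = pi * P12[:, C] * (P2.T @ (P1[G] * P1[G]))

# Dividing by the sum (the tree likelihood) gives the posterior over root states
posterior = joint / joint.sum()
print(posterior, posterior.sum())
```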

## Example 5.10 Tree Likelihood 2

Here we reuse the likelihood calculation from part 1 above, but rewrite it as a function of the tip data so that we can easily change the observed states.

In [21]:
def LikSite(data):
    return np.sum([piVec[y]*P(t1+t2)[y,data[2]]*np.sum([P(t2)[x,y]*P(t1)[data[0],x]*P(t1)[data[1],x] for x in range(4)]) for y in range(4)])
#Example with data from Ex 5.9
print(LikSite([3,3,1]))

0.01823845983359636


To get the likelihood of the tree with multiple sites, we multiply the three individual site likelihoods together, treating the sites as independent.

Using the indices $\{T:0,C:1,A:2,G:3\}$:

$[T,T,T]\to[0,0,0]$, $[G,G,C]\to[3,3,1]$, and $[A,C,C]\to[2,1,1]$

In [22]:
LikSite([0,0,0])*LikSite([3,3,1])*LikSite([2,1,1])

4.928652094978466e-06

Note that this probability is really small because we are calculating $\Pr(\mathcal{D}|\mathcal{T})$ and the state space of $\mathcal{D}$ is getting very large: with three tips and three sites there are already $4^9$ possible data sets.

In this case, it is often better to calculate the log-probability (log-likelihood) so:

$$
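\begin{aligned}
\ell(\mathcal{T}|\mathcal{D})=&\ln\left(\mathcal{L}(\mathcal{T}|\mathcal{D})\right)=\ln\left(\Pr(\mathcal{D}|\mathcal{T})\right)\\
=&\sum_{i}\ln\left(\Pr(\mathcal{D}_i|\mathcal{T})\right)
\end{aligned}
$$

where $\mathcal{D}_i$ is the data at site $i$, so the product of site likelihoods becomes a sum of site log-likelihoods.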
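
To illustrate the log-likelihood for the three sites of this example, the sketch below sums the logs of the site likelihoods and then exponentiates as a check against the product computed earlier. The helper `lik_site` is a hypothetical re-implementation of `LikSite` using matrix operations, included only so the block is self-contained.

```python
import numpy as np
from scipy.linalg import expm

mu, t1, t2 = 1.2, 0.3, 0.5
Q = mu * (np.ones((4, 4)) - 4 * np.eye(4))  # JC69 rate matrix
pi = np.full(4, 0.25)                       # root prior
P1, P2, P12 = expm(Q * t1), expm(Q * t2), expm(Q * (t1 + t2))

def lik_site(data):
    """Site likelihood for tip states [tip1, tip2, tip3], same formula as LikSite."""
    # inner[y] = sum_x P2[x, y] * P1[data[0], x] * P1[data[1], x]
    inner = P2.T @ (P1[data[0]] * P1[data[1]])
    return np.sum(pi * P12[:, data[2]] * inner)

sites = [[0, 0, 0], [3, 3, 1], [2, 1, 1]]  # TTT, GGC, ACC

# Sum of logs instead of a product of small numbers
log_lik = sum(np.log(lik_site(s)) for s in sites)
print(log_lik, np.exp(log_lik))
```

Working on the log scale avoids numerical underflow when the number of sites grows, since each additional site adds a moderate negative number rather than multiplying by a small one.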