# Counting and Sampling SCJ Small Parsimony Solutions

I. Miklos, S. Kiss, E. Tannier (2014)

Article link: https://www.sciencedirect.com/science/article/pii/S0304397514005969

The paper discusses the problem of enumerating the solutions to the small parsimony problem (SPP) under the single cut-or-join (SCJ) distance. The authors use Sankoff's algorithm to reconstruct the genomes at the internal nodes from the presence or absence of adjacencies.

## Why sample solutions?

A single solution cannot stand in for the whole solution space, as it may lead to incorrect results or hypotheses. On the other hand, enumerating all optimal solutions is not possible in polynomial time. However, considering a few samples drawn from a uniform distribution over the solutions may suffice.

The distance, median and small parsimony problems can all be solved in polynomial time under the SCJ model. However, even an easy optimization problem often has multiple solutions that may be in stark contrast with each other. Even estimating the number of solutions may not be possible in polynomial time unless RP = NP. (RP is the class of decision problems for which a randomized algorithm exists with three properties: 1. it runs in polynomial time, 2. if the answer is "no", the output is always "no", and 3. if the answer is "yes", the output is "yes" with probability at least 1/2.)

NOTE: Although not of immediate interest, this paper (and another one by Miklos and Smith, 2015) provides a lot of information on the computational complexity of counting problems.

## Computing internal nodes

The problem is to find the most parsimonious solution for the SPP under the SCJ distance. The paper illustrates the use of Fitch and Sankoff algorithms for computing internal nodes of the tree.

Initially, the focus is on an individual adjacency. Applying the bottom-up and top-down passes of the Fitch algorithm to the presence/absence of an adjacency $\alpha$ often yields an ambiguous solution, because presence and absence give equal SCJ scores. In general, the assignments inferred independently for different adjacencies are not expected to be consistent with each other. However, Feijao and Meidanis (2011) showed that Fitch solutions for the SPP under SCJ form valid genomes. The drawback of the Fitch algorithm is that it does not find all possible solutions.
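The two Fitch passes for a single adjacency can be sketched as follows. This is a minimal illustration, not the paper's code: the tree encoding (a dict mapping each internal node to its two children), the node names, and the function names `fitch_bottom_up` and `fitch_top_down` are all assumptions made for the example. State 1 means the adjacency is present, 0 means absent.

```python
def fitch_bottom_up(tree, root, leaf_state):
    """Bottom-up Fitch pass for one adjacency on a rooted binary tree.

    tree: dict mapping each internal node to its (left, right) children
          (hypothetical encoding chosen for this sketch).
    leaf_state: dict mapping each leaf to 0 (absent) or 1 (present).
    Returns the candidate state set of every node and the parsimony score.
    """
    sets, score = {}, 0

    def visit(node):
        nonlocal score
        if node not in tree:  # leaf: its state is observed
            sets[node] = {leaf_state[node]}
            return
        left, right = tree[node]
        visit(left)
        visit(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common
        else:
            # Children disagree: keep both states and pay one change.
            sets[node] = sets[left] | sets[right]
            score += 1

    visit(root)
    return sets, score


def fitch_top_down(tree, root, sets, root_state):
    """Top-down pass: a child copies its parent's state when allowed."""
    assignment = {root: root_state}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, ()):
            parent_state = assignment[node]
            if parent_state in sets[child]:
                assignment[child] = parent_state
            else:
                # With binary states the child's set is a singleton here.
                assignment[child] = next(iter(sets[child]))
            stack.append(child)
    return assignment


# Toy instance: four leaves, adjacency present in A and C only.
tree = {"root": ("u", "v"), "u": ("A", "B"), "v": ("C", "D")}
leaf_state = {"A": 1, "B": 0, "C": 1, "D": 0}
sets, score = fitch_bottom_up(tree, "root", leaf_state)
with_alpha = fitch_top_down(tree, "root", sets, 1)
without_alpha = fitch_top_down(tree, "root", sets, 0)
```

On this toy tree the root set contains both states, so the top-down pass can start from either presence or absence, which is the ambiguity described above.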

This is taken care of by the Sankoff algorithm, a dynamic programming algorithm that keeps track of two values, $s0$ and $s1$. $s0(\alpha,u)$ is the minimum number of edges in the subtree rooted at $u$ along which the presence of $\alpha$ changes, given that $\alpha$ is absent at $u$. $s1(\alpha,u)$ is defined analogously, given that $\alpha$ is present at $u$. At the root, the smaller of $s0$ and $s1$ gives the best SPP solution for the adjacency $\alpha$. Although it has been proved that this method accounts for all solutions, not all solutions thus obtained are consistent (that is, not all of them lead to valid genomes).
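A minimal sketch of the $s0$/$s1$ recurrence for one adjacency is given below. It assumes the standard Sankoff recurrence with unit cost per state change along an edge; the tree encoding (a dict from internal node to its children), the node names, and the function name `sankoff_scores` are hypothetical choices for the example, not taken from the paper.

```python
import math


def sankoff_scores(tree, root, leaf_state):
    """Compute s0 and s1 for every node, for a single adjacency.

    tree: dict mapping each internal node to a tuple of its children
          (hypothetical encoding chosen for this sketch).
    leaf_state: dict mapping each leaf to 0 (absent) or 1 (present).
    s0[u] / s1[u]: minimum number of changes in the subtree rooted at u
    when the adjacency is absent / present at u.
    """
    s0, s1 = {}, {}

    def visit(node):
        if node not in tree:  # leaf: the unobserved state is forbidden
            present = leaf_state[node]
            s0[node] = math.inf if present else 0
            s1[node] = 0 if present else math.inf
            return
        s0[node] = 0
        s1[node] = 0
        for child in tree[node]:
            visit(child)
            # Either the child keeps the parent's state (no cost on the
            # edge) or it switches state (cost 1 on the edge).
            s0[node] += min(s0[child], s1[child] + 1)
            s1[node] += min(s1[child], s0[child] + 1)

    visit(root)
    return s0, s1


# Toy instance: four leaves, adjacency present in A and C only.
tree = {"root": ("u", "v"), "u": ("A", "B"), "v": ("C", "D")}
leaf_state = {"A": 1, "B": 0, "C": 1, "D": 0}
s0, s1 = sankoff_scores(tree, "root", leaf_state)
best = min(s0["root"], s1["root"])
```

Here both root values are equal, so presence and absence at the root are both optimal for this adjacency. The sketch scores each adjacency independently and does not check that the chosen adjacencies form a valid genome.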

## Sampling solutions

The paper provides insight into the enumeration of SCJ SPP solutions. The authors provide a polynomial-time algorithm that constructs an instance of the SCJ SPP from a 3CNF formula, such that at least half of the solutions for the SPP are guaranteed to be correct if the 3CNF formula is satisfiable.

They provide a three-step process to obtain an SPP instance from a clause. The first step introduces constructs called elementary subtrees (which help in enumeration) and builds a "unit" subtree from them. The second step repeats this unit subtree. In the final step, they amend it with another subtree so that all adjacencies are ambiguous.

## Key takeaway

Here we are interested in obtaining the genomes at the internal nodes. In the presence of gene losses, I was wondering if we could adapt the Sankoff algorithm to our problem with lost adjacencies. 

The problem we had faced with accounting for gene losses was as follows: Let $A = abcde$ and let $b,c,d$ be lost in $D$. $a$ and $e$ would be joined in $A$ (considered as an adjacency) only if $D$ contains the adjacency $ae$. Using the scores $s0$ and $s1$ does not provide a way around accounting for all possible lost gene sequences between $a$ and $e$.

I did realize that we face a decision only if the said genes are lost along exactly one branch. If they are lost in both branches, then any adjacencies involving lost genes are not present in two of the three genomes. Hence, the ILP will prefer to exclude them from the parent.