# On Computing Breakpoint Distances for Genomes with Duplicate Genes

M. Shao and B. Moret (2017)

Article link: https://link.springer.com/chapter/10.1007/978-3-319-31957-5_14

The paper addresses computing the distance between two genomes under the breakpoint model in the presence of duplicate genes. It considers three variants of the problem: the Exemplar, Maximum matching and Intermediate models. The authors provide an ILP-based algorithm for solving them.

## Breakpoint distance models

Models proposed before this paper obtain a matching between duplicate genes and leave out the unmatched copies. Each matched pair is treated as a new gene family, which reduces the problem to one without any duplicates. The aim is to choose a matching that minimizes the specified distance.

1. Exemplar model: (Sankoff, 99) <br>
Select exactly one matched gene pair for each gene family and discard the rest.
2. Max. matching: (Blin, 04) <br>
Match as many gene pairs as possible $\implies$ if gene $g$ has $c_a$ copies in genome $A$ and $c_b$ copies in genome $B$ with $c_b < c_a$, then all $c_b$ copies in $B$ should be matched.
3. Intermediate: (Angibaud, 07) <br>
Match at least one gene pair for each gene family.
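
The three models differ only in how many gene pairs each family may contribute to the matching. The following is a minimal sketch of those bounds, assuming genomes are given as plain lists of family labels and that only families present in both genomes are of interest; the function name `allowed_pair_counts` is hypothetical and does not come from the paper.

```python
from collections import Counter


def allowed_pair_counts(genome_a, genome_b):
    """Per shared family, the (min, max) number of matched pairs per model."""
    count_a, count_b = Counter(genome_a), Counter(genome_b)
    bounds = {}
    # Assumption: only families occurring in both genomes are considered.
    for family in count_a.keys() & count_b.keys():
        most = min(count_a[family], count_b[family])
        bounds[family] = {
            "exemplar": (1, 1),         # exactly one pair
            "intermediate": (1, most),  # at least one pair
            "maximum": (most, most),    # as many pairs as possible
        }
    return bounds
```

For a family with three copies in one genome and two in the other, the Intermediate model allows one or two matched pairs, while the Maximum matching model forces two.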

Under all three models, the respective distance problems are NP-hard. The Exemplar model is oversimplified as it assumes each gene family to have a single orthologous pair in the two genomes. On the other hand, the Maximum matching model assumes that each gene in one genome has an ortholog in the other, thus indicating that duplications and losses are less likely as compared to rearrangements. Hence, the paper deems the Intermediate Breakpoint Distance model the best option. The paper focusses on the latter two models.

## Notation and problem statement

$\mathcal{F}$ is the set of all gene families and $F(G,f)$ is the set of genes belonging to a family $f \in \mathcal{F}$ from a genome $G$.

Let $G_1$ and $G_2$ be the genomes in question. Adjacencies $g_1h_1 \in G_1$ and $g_2h_2 \in G_2$ form a "pair of shared adjacencies (PSA)" if either $g_1$ and $g_2$, as well as $h_1$ and $h_2$, are from the same family and have the same sign, or $g_1$ and $h_2$, as well as $h_1$ and $g_2$, are from the same family and have opposite signs.
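
The PSA condition can be written as a small predicate. This is a sketch under the assumption that a gene is represented by its family label and a sign of $+1$ or $-1$; the `Gene` type and the function names are hypothetical, chosen here for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    family: str
    sign: int  # +1 or -1


def same_family_same_sign(a: Gene, b: Gene) -> bool:
    return a.family == b.family and a.sign == b.sign


def same_family_opposite_sign(a: Gene, b: Gene) -> bool:
    return a.family == b.family and a.sign == -b.sign


def is_psa(g1: Gene, h1: Gene, g2: Gene, h2: Gene) -> bool:
    """True if adjacencies g1h1 (in G1) and g2h2 (in G2) form a PSA."""
    # Same orientation: g1 ~ g2 and h1 ~ h2 with equal signs.
    direct = same_family_same_sign(g1, g2) and same_family_same_sign(h1, h2)
    # Reversed orientation: g1 ~ h2 and h1 ~ g2 with opposite signs.
    flipped = (same_family_opposite_sign(g1, h2)
               and same_family_opposite_sign(h1, g2))
    return direct or flipped
```

For example, the adjacency $(+a)(+b)$ forms a PSA with both $(+a)(+b)$ and its reverse complement $(-b)(-a)$, but not with $(+a)(-b)$.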

A matching is a 1-to-1 correspondence between genes in $G_1$ and $G_2$ that pairs only genes of the same family. $\mathcal{M}$ is the set of possible matchings while $M$ is a specific matching in $\mathcal{M}$. $M(f)$ is the set of edges in the matching whose endpoints belong to $f$, so $|M(f)|$ is the number of matched pairs in family $f$.

The score $S(M)$ of the matching is the total number of PSAs in the matching. Thus, the higher the score, the lower the distance.

For each model, the matchings are defined as follows:
1. $M_e = \{M \in \mathcal{M} \mid |M(f)| = 1, \forall f \in \mathcal{F}\}$
2. $M_i = \{M \in \mathcal{M} \mid |M(f)| \geq 1, \forall f \in \mathcal{F}\}$
3. $M_m = \{M \in \mathcal{M} \mid |M(f)| = \min\{|F(G_1,f)|, |F(G_2,f)|\}, \forall f \in \mathcal{F}\}$

Thus, for any $x \in \{e,i,m\}$, the problem is to maximize $S(M)$ over all matchings $M \in M_x$.
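
To make the three matching classes concrete, the sketch below tests which of $M_e$, $M_i$ and $M_m$ a given matching belongs to. It assumes genomes are lists of family labels and a matching is a list of index pairs `(i, j)`; these representations and the name `model_membership` are illustrative assumptions, not the paper's.

```python
from collections import Counter


def model_membership(genome_1, genome_2, matching):
    """Report membership of a matching in M_e, M_i and M_m.

    matching: list of (i, j), gene i of genome_1 paired with gene j of genome_2.
    """
    used_1 = [i for i, _ in matching]
    used_2 = [j for _, j in matching]
    # A matching must be 1-to-1 and pair genes of the same family.
    if len(set(used_1)) != len(used_1) or len(set(used_2)) != len(used_2):
        raise ValueError("a gene is matched more than once")
    if any(genome_1[i] != genome_2[j] for i, j in matching):
        raise ValueError("matched genes must belong to the same family")

    matched = Counter(genome_1[i] for i, _ in matching)  # |M(f)| per family
    count_1, count_2 = Counter(genome_1), Counter(genome_2)
    families = set(count_1) | set(count_2)
    return {
        "exemplar": all(matched[f] == 1 for f in families),
        "intermediate": all(matched[f] >= 1 for f in families),
        "maximum": all(matched[f] == min(count_1[f], count_2[f])
                       for f in families),
    }
```

Note that every matching in $M_e$ or $M_m$ is also in $M_i$ when each family occurs in both genomes, which is why the Intermediate model has the largest search space of the three.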

$[g,h]$ is the segment of genes from $g$ to $h$ including $g$ and $h$, while $(g,h)$ is the segment of genes from $g$ to $h$ excluding $g$ and $h$. Accordingly, $[g,h]$ forms a "potential" adjacency if all genes in $(g,h)$ are allowed to be removed. Thus, they define a "pair of shared potential adjacencies (PSPA)" as a pair of segments $\langle [g_1,h_1],[g_2,h_2] \rangle$ that can potentially be reduced to a PSA $g_1h_1,g_2h_2$ by removing all genes between the ends, keeping in mind the requirements of the model.
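
Because both genomes are fixed, candidate PSPAs can be listed by brute force. The sketch below is a simplification: it assumes linear genomes given as lists of `(family, sign)` tuples and only checks that the segment endpoints satisfy the PSA condition. It does not check the model-specific requirement that the interior genes may actually be removed, so its output is a superset of the true PSPAs. The function names are hypothetical.

```python
from itertools import combinations


def endpoints_match(g1, h1, g2, h2):
    """PSA condition on the endpoints; genes are (family, sign) tuples."""
    def same(a, b):
        return a[0] == b[0] and a[1] == b[1]

    def opposite(a, b):
        return a[0] == b[0] and a[1] == -b[1]

    return ((same(g1, g2) and same(h1, h2))
            or (opposite(g1, h2) and opposite(h1, g2)))


def candidate_pspas(genome_1, genome_2):
    """List ((i1, j1), (i2, j2)) index pairs of segments whose ends match."""
    candidates = []
    for i1, j1 in combinations(range(len(genome_1)), 2):
        for i2, j2 in combinations(range(len(genome_2)), 2):
            if endpoints_match(genome_1[i1], genome_1[j1],
                               genome_2[i2], genome_2[j2]):
                candidates.append(((i1, j1), (i2, j2)))
    return candidates
```

The enumeration examines every pair of segments, so its cost grows with the product of the squared genome lengths; it is meant to illustrate the definition, not to be efficient.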

## ILP formulation

The formulation uses three types of constraints to capture the essence of the model. The paper shows the formulation for the Intermediate Breakpoint Distance Problem. The same formulation can be adapted to the other two models by changing the relevant constraints.

Variables:
1. $x_g$ is a binary variable indicating whether gene $g$ is an endpoint of an edge chosen in the optimal matching.
2. $y_{g_1,g_2}$ is a binary variable indicating whether the edge $(g_1,g_2)$ is in the optimal matching.
3. $z_p$ is a binary variable indicating whether the potential PSA $p$ (which they call a PSPA) exists as a PSA in the genomes corresponding to the optimal matching.

Constraints:
1. Existence of a family representative: <br>
This constraint is determined by the definitions of matching provided above. Thus, for the Intermediate model, it will be <br>
$\sum_{g \in F(G_1,f)} x_g \geq 1$ <br>
$\sum_{g \in F(G_2,f)} x_g \geq 1$ <br>
Thus, for every $f \in \mathcal{F}$, at least one gene of $f$ in each genome is covered by the matching.


2. Valid coverage of a gene: <br>
This constraint ensures that a gene $g$ is covered only if some edge in the matching includes $g$. <br>
$\sum_{g_1 \in F(G_1,f)} y_{g_1,g_2} \geq x_{g_2}$ $\forall g_2 \in F(G_2,f)$ <br>
$\sum_{g_2 \in F(G_2,f)} y_{g_1,g_2} \geq x_{g_1}$ $\forall g_1 \in F(G_1,f)$


3. Existence of PSPAs: <br>
$y_{g_1,g_2}, y_{h_1,h_2} \geq z_p$ for every PSPA $p = \langle [g_1,h_1],[g_2,h_2] \rangle$ <br>
$1 - z_p \geq x_g$ for all genes $g \in (g_1,h_1) \cup (g_2,h_2)$
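
The three constraint families can be checked mechanically for a given 0/1 assignment. The sketch below is not an ILP solver and is not the authors' implementation; it only reports which of the constraints listed above an assignment violates. The data layout is an assumption made for illustration: genomes are lists of family labels, `x1` and `x2` are 0/1 lists per genome, `y` maps index pairs `(i, j)` to 0/1, and each PSPA is a dict carrying its two endpoint edges and the indices of its interior genes.

```python
def violated_constraints(genome_1, genome_2, x1, x2, y, pspas, z):
    """Return the violated constraints of the Intermediate-model formulation."""
    violations = []

    # 1. Existence of a family representative in each genome.
    for f in sorted(set(genome_1) | set(genome_2)):
        if sum(x1[i] for i, fam in enumerate(genome_1) if fam == f) < 1:
            violations.append(("representative", "G1", f))
        if sum(x2[j] for j, fam in enumerate(genome_2) if fam == f) < 1:
            violations.append(("representative", "G2", f))

    # 2. Valid coverage: a covered gene needs an incident same-family edge.
    for i, fam_i in enumerate(genome_1):
        edges = sum(y.get((i, j), 0)
                    for j, fam_j in enumerate(genome_2) if fam_j == fam_i)
        if edges < x1[i]:
            violations.append(("coverage", "G1", i))
    for j, fam_j in enumerate(genome_2):
        edges = sum(y.get((i, j), 0)
                    for i, fam_i in enumerate(genome_1) if fam_i == fam_j)
        if edges < x2[j]:
            violations.append(("coverage", "G2", j))

    # 3. PSPAs: both endpoint edges chosen, all interior genes uncovered.
    for p, pspa in enumerate(pspas):
        for edge in pspa["edges"]:
            if y.get(edge, 0) < z[p]:
                violations.append(("pspa-edge", p, edge))
        for i in pspa["inner_1"]:
            if 1 - z[p] < x1[i]:
                violations.append(("pspa-inner", p, ("G1", i)))
        for j in pspa["inner_2"]:
            if 1 - z[p] < x2[j]:
                violations.append(("pspa-inner", p, ("G2", j)))
    return violations
```

An empty result means the assignment satisfies all three constraint families as written in these notes. Setting $z_p = 1$ while leaving an interior gene covered shows up as a `pspa-inner` violation.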

## Key takeaway

The notion of PSPAs can be investigated in the case of losses. We are interested in pairs of genes (extremities) that may form an adjacency if all the genes between them are lost. The usage differs, however: in the above problem the genomes are fixed, so the PSPAs can be listed explicitly, which is not the case in the SPP.