# Plasmids Assembly MILP formulation

## Input and output

We use the following as input:
1. Assembly graph (vertices and edges)
2. Length of each contig ( len[c] )
3. Gene coverage of each contig ( gd[c] )
4. Read depth of each contig ( rd[c] )
5. GC content of each contig ( GC[c] )

We expect the following as output:
1. List of contigs belonging to plasmids,
2. Read depth of each contig in a specific plasmid ( rd[p][c] ),
3. List of edges/links belonging to plasmids

## Motivation for objective function

The problem can be approached in one of two ways:
1. We fix a limited number ($k$) of possible plasmids. For each value of $k$, we solve the MILP and determine the optimal value of the objective function (described below). The value of $k$ that gives the best optimum is chosen as the final answer.
2. We may determine and remove one plasmid at a time. Once that has been done, we rerun the MILP, iteratively removing plasmids till a stopping criterion has been reached.
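
The first approach can be sketched as a simple loop over candidate values of $k$. This is only an illustration: `solve_for_k` is a hypothetical callable (not part of this document) that is assumed to build and solve the MILP for exactly $k$ plasmids and return the optimal objective value. Since the objective is minimized, "best" is taken to mean smallest.

```python
from typing import Callable, Iterable, Tuple


def best_k(solve_for_k: Callable[[int], float],
           candidates: Iterable[int]) -> Tuple[int, float]:
    """Return (k, objective) with the smallest objective over the candidates.

    `solve_for_k` is a hypothetical callable that solves the MILP for
    exactly k plasmids and returns its optimal (minimized) objective value.
    """
    best = None
    for k in candidates:
        value = solve_for_k(k)
        # Keep the candidate with the smallest objective value seen so far.
        if best is None or value < best[1]:
            best = (k, value)
    if best is None:
        raise ValueError("no candidate values of k were given")
    return best
```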

We form an objective function that is a linear combination of:

- Read depth difference: Firstly, we try to obtain uniform read depth along the plasmid. We try to minimize the difference between the mean read depth of the plasmid and the read depth contribution of each contig. Furthermore, as each plasmid may have a different total length, we consider the length-weighted read depth deviation. If rd[p][c] is the read depth contribution of contig c to plasmid p, len[c] is the length of contig c, len[p] = $\sum_{c \in p}$ len[c] is the total length of plasmid p, and mean_rd[p] is the mean read depth of the plasmid, the expression we try to minimize is: $\sum_{c \in p}$ |mean_rd[p] - rd[p][c]| . len[c] / len[p].


From what I have seen so far, the quantity |mean_rd[p] - rd[p][c]| is very close to 0. Thus, the value of this term is expected to lie between 0 and 1.
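
As a small illustration of the read depth term, the sketch below evaluates the length-weighted deviation for a fixed set of contigs. The function name and the example numbers are made up for illustration. In the MILP itself these quantities are decision variables, not fixed inputs.

```python
def read_depth_deviation(contigs):
    """Length-weighted read depth deviation of one candidate plasmid.

    `contigs` is a list of (length, read_depth_in_plasmid) pairs, i.e.
    (len[c], rd[p][c]) for every contig c selected for plasmid p.
    """
    total_len = sum(length for length, _ in contigs)  # len[p]
    # Length-weighted mean read depth of the plasmid, mean_rd[p].
    mean_rd = sum(length * rd for length, rd in contigs) / total_len
    # Sum of |mean_rd[p] - rd[p][c]| . len[c] / len[p] over the contigs.
    return sum(abs(mean_rd - rd) * length for length, rd in contigs) / total_len
```

For two contigs of lengths 1000 and 3000 with read depths 1.0 and 1.2, the mean read depth is 1.15 and the term evaluates to roughly 0.075.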

- %GC content difference: The %GC content of plasmids is slightly lower than that of chromosomes. Keeping this criterion in mind, the objective is to maximize the difference between the overall %GC content and the %GC content of the plasmid. Here, we represent the %GC content of a contig as GC[c] and the overall %GC content as GC_mean. GC_mean is a constant and can be computed in advance. NOTE: As the final objective function will be a linear combination of multiple terms, it is advisable for these terms to be of comparable magnitude. Otherwise, a term that accounts for most of the objective function value may dominate the others. Hence, we again weight the %GC content difference by the length of the contig: $\sum_{c \in p}$ (GC_mean - GC[c]) . len[c] / len[p].

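If the %GC content of a contig c is 90%, we use GC[c] = 0.9. Each difference GC_mean - GC[c] therefore lies in $[-1, 1]$, and the term is expected to contribute a value in $[0, 1]$ to the objective function when the selected contigs have a lower %GC content than the overall mean.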

- Gene coverage: We also try to maximize the gene coverage or density, which is the fraction of the plasmid covered by genes. The gene density of a contig is represented as gd[c], which is a number between 0 and 1. Once again, we weight the gene density of the selected contigs by their lengths: $\sum_{c \in p}$ gd[c] . len[c] / len[p].

As the term is the length-weighted sum of the gene densities of contigs, this is also expected to contribute between 0 and 1 to the objective function.

## Objective function

Thus, our objective is to minimize: 

$\alpha_1 . \sum_{c \in p}$ [ |mean_rd[p] - rd[p][c]| . len[c] / $\sum_{c \in p}$ len[c] ]

$- \alpha_2 . \sum_{c \in p}$ [ (GC_mean - GC[c]) . len[c] / $\sum_{c \in p}$ len[c] ]

$- \alpha_3 . \sum_{c \in p}$ [ gd[c] . len[c] / $\sum_{c \in p}$ len[c] ]

where $\alpha_1+\alpha_2+\alpha_3 = 1$
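
To make the combination concrete, here is a minimal sketch that evaluates the objective for one plasmid whose contigs and read depth contributions are already fixed. The `Contig` class, the function name and the weights used in the usage note are assumptions for illustration only. The MILP optimizes over these quantities instead of evaluating them.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    length: int  # len[c]
    rd: float    # rd[p][c], read depth contribution to the plasmid
    gc: float    # GC[c], as a fraction between 0 and 1
    gd: float    # gd[c], gene density between 0 and 1


def objective(contigs, gc_mean, alphas):
    """Value of the (minimized) objective for one fixed plasmid."""
    a1, a2, a3 = alphas
    total_len = sum(c.length for c in contigs)  # sum of len[c] over c in p
    mean_rd = sum(c.rd * c.length for c in contigs) / total_len
    rd_term = sum(abs(mean_rd - c.rd) * c.length for c in contigs) / total_len
    gc_term = sum((gc_mean - c.gc) * c.length for c in contigs) / total_len
    gd_term = sum(c.gd * c.length for c in contigs) / total_len
    # The read depth deviation is penalized; GC difference and gene density are rewarded.
    return a1 * rd_term - a2 * gc_term - a3 * gd_term
```

With contigs `Contig(1000, 1.0, 0.45, 0.8)` and `Contig(3000, 1.2, 0.40, 0.6)`, `gc_mean = 0.5` and weights `(0.5, 0.25, 0.25)`, the three terms are about 0.075, 0.0875 and 0.65, so the objective is about -0.147.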

## Variables

1. contigs[p][c] = 1 iff contig c ∈ plasmid p. Var Type = Binary.
2. links[p][e] = 1 iff link e ∈ plasmid p. Var Type = Binary.
3. contigs_ext[p][ext] = 1 iff extremity ext ∈ link e such that e ∈ plasmid p. Var Type = Binary. The links and contigs_ext variables enable us to model the structure of plasmids as cycles and paths (for now, a disjoint union of paths).
4. rd[p][c] denotes the read depth contribution of contig c to plasmid p. Var Type = Continuous.
5. mean_rd[p] denotes the length-weighted average read depth of plasmid p. Var Type = Continuous.
6. counted_seed[p][c] = 1 if c ∈ p and c is a seed, else 0. Var Type = Binary.
7. counted_ln[p][c] = length of c if c ∈ p, else 0. Var Type = Continuous. In other words, counted_ln[p][c] = len[c] . contigs[p][c].

Notice that the objective function contains the term $\sum_{c \in p}$ len[c]. This term depends on whether contig c is a part of plasmid p, which makes the objective function nonlinear. However, it is possible to linearize the objective using extra variables. An example of such a linearization is shown below. I chose to describe these variables separately to make them easier to understand.

## Constraints

We now look at the constraints that help us model the problem.

- A link e belongs to plasmid p only if both endpoints of the link are in p. Here, end1 and end2 are endpoints of link e. For all link variables,

links[p][e] $==$ floor[ (contigs_ext[p][end1] + contigs_ext[p][end2]) $/$ 2 ]

- A contig extremity can occur only once in a plasmid. For the plasmid to be a consistent path or cycle, an extremity cannot have multiple edges incident on it in the same plasmid. Hence, for all extremities,

$\sum_{e~inc.~on~ext}$ links[p][e] $<=$ 1
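
The degree constraint above can be checked on a candidate solution with a few lines of code. This is a sketch only: the function name and the extremity labels such as `("c1", "h")` for the head of contig c1 are hypothetical.

```python
from collections import Counter


def extremity_degrees_ok(selected_links):
    """Check that no extremity is used by more than one selected link.

    `selected_links` is a list of (ext1, ext2) pairs, one per link e with
    links[p][e] = 1 in the plasmid under consideration.
    """
    degree = Counter()
    for ext1, ext2 in selected_links:
        degree[ext1] += 1
        degree[ext2] += 1
    # The constraint requires every extremity to have at most one incident link.
    return all(count <= 1 for count in degree.values())
```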


- A contig c belongs to p if at least one of its extremities is in p. Here, end1 and end2 are the extremities of contig c.

contigs[p][c] $==$ ceil[ (contigs_ext[p][end1] + contigs_ext[p][end2]) $/$ 2 ]
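
The floor and ceil expressions in the link and contig constraints are not linear as written. For binary variables they act as a logical AND and a logical OR, and a standard way to express those is with three linear inequalities each. The sketch below is an assumption about how this could be done, not necessarily how the implementation does it. It enumerates all binary inputs and confirms that the inequalities leave exactly one feasible value, equal to the floor or ceil expression.

```python
from itertools import product


def feasible_and(x, y):
    """Binary values z allowed by z <= x, z <= y, z >= x + y - 1."""
    return [z for z in (0, 1) if z <= x and z <= y and z >= x + y - 1]


def feasible_or(x, y):
    """Binary values z allowed by z >= x, z >= y, z <= x + y."""
    return [z for z in (0, 1) if z >= x and z >= y and z <= x + y]


for x, y in product((0, 1), repeat=2):
    # floor((x + y) / 2) for the link constraint.
    assert feasible_and(x, y) == [(x + y) // 2]
    # ceil((x + y) / 2) for the contig constraint.
    assert feasible_or(x, y) == [(x + y + 1) // 2]
```

Here x and y play the role of contigs_ext[p][end1] and contigs_ext[p][end2], and z plays the role of links[p][e] or contigs[p][c].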

- The total read depth contribution from a contig to all plasmids should be at most the read depth of the contig itself. Here rd[c] is the total read depth of contig c as provided in the input. Note that this constraint is necessary only if we assemble multiple plasmids in a single run of the MILP.

$\sum_{p}$ rd[p][c] $<=$ rd[c]

- The length of c in p is relevant only if c ∈ p. To account for this "if" condition, we have

counted_ln[p][c] $==$ len[c] . contigs[p][c]

- Whether c is a seed is relevant for p only if c ∈ p. To account for this "if" condition, we have the following, where seed[c] is 1 if contig c is a seed and 0 otherwise:

counted_seed[p][c] $==$ seed[c] . contigs[p][c]

- The mean read depth of a plasmid is given by:

mean_rd[p] . $\sum_{c}$ (len[c] . contigs[p][c]) $==$ $\sum_{c}$ (rd[p][c] . len[c] . contigs[p][c])

This is not a linear constraint, as rd[p][c] and contigs[p][c] are both variables. However, it can be represented by a group of linear constraints, as shown below. This is possible since contigs[p][c] is a binary variable.
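
One possible way to write this constraint linearly is sketched here. It is an assumption for illustration and not necessarily the formulation used in the implementation. It introduces a hypothetical auxiliary variable z[p][c] standing for the product mean_rd[p] . contigs[p][c], and a constant R that bounds mean_rd[p] from above (for example, the largest rd[c] in the input). With counted_rd[p][c] as defined under "Additional constraints", the constraint becomes

$\sum_{c}$ len[c] . z[p][c] $==$ $\sum_{c}$ len[c] . counted_rd[p][c]

with z[p][c] tied to the product by

z[p][c] $<=$ R . contigs[p][c] <br>
z[p][c] $<=$ mean_rd[p] <br>
z[p][c] $>=$ mean_rd[p] - (1 - contigs[p][c]) . R <br>
z[p][c] $>=$ 0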

- Each plasmid should have at least one seed. Hence, for each plasmid,

$\sum_{c}$ counted_seed[p][c] $>=$ 1

## Additional constraints

- Handling "if" conditions: Multiple terms in the objective function depend on the inclusion of a contig c in plasmid p. For instance, the read depth contribution of a contig towards a plasmid is to be "counted" only if the contig is a part of the plasmid. So, we introduce the variable counted_rd[p][c]:

counted_rd[p][c] $==$ rd[p][c] . contigs[p][c]

This is a nonlinear constraint. However, since contigs[p][c] is a binary variable and rd[p][c] has a well-defined upper limit (rd[c]), we can model it using a set of linear constraints. Here U = rd[c].

counted_rd[p][c] $<=$ U . contigs[p][c] <br>
counted_rd[p][c] $<=$ rd[p][c] <br>
counted_rd[p][c] $>=$ rd[p][c] - (1 - contigs[p][c]) . U <br>
counted_rd[p][c] $>=$ 0

Notice that these constraints force counted_rd[p][c] to take exactly the value rd[p][c] if contigs[p][c] = 1 and 0 otherwise.
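
The claim can be verified numerically. The sketch below (function and parameter names are illustrative) computes the interval that the four constraints leave for counted_rd[p][c], assuming 0 <= rd[p][c] <= U. In both cases the interval collapses to a single point.

```python
def counted_rd_bounds(rd_pc, in_plasmid, upper):
    """Interval [lo, hi] allowed for counted_rd[p][c] by the four constraints.

    rd_pc      -- value of rd[p][c], assumed to satisfy 0 <= rd_pc <= upper
    in_plasmid -- value of contigs[p][c], either 0 or 1
    upper      -- the bound U (rd[c] in the text)
    """
    # The two "<=" constraints give the upper end of the interval.
    hi = min(upper * in_plasmid, rd_pc)
    # The two ">=" constraints give the lower end of the interval.
    lo = max(rd_pc - (1 - in_plasmid) * upper, 0.0)
    return lo, hi
```

For example, with rd[p][c] = 3.5 and U = 10, the interval is [3.5, 3.5] when contigs[p][c] = 1 and [0, 0] when contigs[p][c] = 0.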

- Handling absolute values: The objective function contains the absolute value of the difference between the mean read depth (mean_rd[p]) and the contig read depth in the plasmid (counted_rd[p][c]). This in itself is not linear. So, we introduce a variable diff[p][c] which represents the absolute value of the difference between the mean read depth and the contig read depth for a specific plasmid. For each plasmid-contig pair we add the following constraints:

diff[p][c] $>=$ mean_rd[p] - counted_rd[p][c] <br>
diff[p][c] $>=$ counted_rd[p][c] - mean_rd[p]
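
These two constraints only bound diff[p][c] from below. Assuming the weight $\alpha_1$ on the read depth term is positive, minimizing the objective pushes diff[p][c] down to the larger of the two right-hand sides, which is exactly the absolute value. A minimal sketch of that argument, with an illustrative function name:

```python
def min_feasible_diff(mean_rd, counted_rd):
    """Smallest diff with diff >= mean_rd - counted_rd and diff >= counted_rd - mean_rd."""
    # The tightest value satisfying both lower bounds is their maximum.
    return max(mean_rd - counted_rd, counted_rd - mean_rd)
```

For any pair of values, `min_feasible_diff(m, r)` agrees with `abs(m - r)`.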