# Formulation for obtaining all plasmids simultaneously

This notebook details the modified formulation. We change the objective function to mimic the one from the greedy heuristic as closely as possible. We also explore the idea of using binary variables to model multiple occurrences of a contig in a plasmid.


The objective function is still a linear combination of three terms. However, we introduce a constraint on the read depth of a contig. Since we now use the copy number of a contig in a plasmid instead of its read depth, the binary variables k[p][c][i] replace rd[p][c].

## Copy number variables and constraints

If a plasmid $p$ contains exactly $i$ copies of contig $c$, then k[p][c][i] = 1, else k[p][c][i] = 0. Since the total copy number of a particular contig cannot exceed rd[c], we use the following constraint for all contigs:

$\sum_{p \in \mathcal{P}} \sum_{i=1}^{n}$ i . k[p][c][i] $<=$ rd[c]  

Here $\mathcal{P}$ is the set of all potential plasmids and n is the floor of rd[c]. Also, note here that $i$ is a constant and not a variable, hence preserving linearity.
Furthermore, the same plasmid $p$ cannot have k[p][c][i] = 1 for multiple $i$ for the same contig $c$. Hence, we have:

$\sum_{i=1}^{n}$ k[p][c][i] $<=$ 1
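
The two constraints above can be checked on a candidate assignment with a few lines of plain Python. This is only an illustrative sketch: the dictionary layout, the helper names `copy_number` and `satisfies_copy_constraints`, and the example data are assumptions, not part of the notebook's model code.

```python
from math import floor


def copy_number(k_pc):
    """Decode the copy number from the list k[p][c][1..n].

    k_pc[i - 1] holds the binary value k[p][c][i].
    """
    return sum(i * bit for i, bit in enumerate(k_pc, start=1))


def satisfies_copy_constraints(k, rd):
    """Check both copy number constraints for every contig.

    k[p][c] is the list of binary values for contig c in plasmid p,
    rd[c] is the read depth of contig c.
    """
    for c, depth in rd.items():
        n = floor(depth)
        # At most one i may be selected per (plasmid, contig) pair.
        if any(sum(k[p][c][:n]) > 1 for p in k):
            return False
        # Total copies of c over all plasmids must not exceed rd[c].
        if sum(copy_number(k[p][c][:n]) for p in k) > depth:
            return False
    return True


# Hypothetical data: two plasmids, two contigs.
rd = {"c1": 3.4, "c2": 1.2}
k = {
    "p1": {"c1": [0, 1, 0], "c2": [1]},  # 2 copies of c1, 1 copy of c2
    "p2": {"c1": [1, 0, 0], "c2": [0]},  # 1 copy of c1, no copy of c2
}
print(satisfies_copy_constraints(k, rd))
```

Here contig c1 is used three times in total, which stays below rd[c1] = 3.4, so the assignment is feasible. Giving p2 two copies of c1 as well would push the total to four and violate the first constraint.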

## 1. Read depth term

The aim is to minimize the variation in read depth among the contigs of a plasmid. To this end, we compute the mean read depth of a plasmid (mean_rd[p]) and minimize the sum of the absolute differences between this mean and the individual values of the contigs involved (k[p][c], the copy number of contig $c$ in plasmid $p$, which takes the place of the read depth). Each difference is weighted by the length of the contig.

### Objective
$\sum_{c \in p}$( $|$ mean_rd[p] - k[p][c] $|$ . len[c]/len[p] )

Here, len[p] = $\sum_{c \in p}$( len[c] ) and mean_rd[p] = $\sum_{c \in p}$( k[p][c] . len[c]/len[p] )
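
As a numerical illustration of the objective, the following sketch evaluates len[p], mean_rd[p] and the weighted deviation for one fixed plasmid. The function name `weighted_deviation` and the contig data are hypothetical; in the MILP these quantities are decision variables rather than computed values.

```python
def weighted_deviation(values, lengths):
    """Length-weighted mean and mean absolute deviation for one plasmid.

    values[c] plays the role of k[p][c], lengths[c] the role of len[c].
    """
    total_len = sum(lengths[c] for c in values)  # len[p]
    mean = sum(values[c] * lengths[c] for c in values) / total_len
    deviation = sum(abs(mean - values[c]) * lengths[c] for c in values) / total_len
    return mean, deviation


# Hypothetical plasmid with three contigs.
k_p = {"c1": 2, "c2": 1, "c3": 1}
len_c = {"c1": 1000, "c2": 3000, "c3": 1000}

mean_rd, wtd_diff = weighted_deviation(k_p, len_c)
print(mean_rd, wtd_diff)
```

For this data the mean is about 1.2 and the weighted deviation about 0.32. A plasmid whose contigs all share the same copy number would have a deviation of zero.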

### Greedy objective
$|$ 1 − depth(c) / average_depth(p) $| = |$ ( average_depth(p) − depth(c) ) / average_depth(p) $|$ 

Thus, the objective function of the MILP tries to minimize the numerator of the objective function of the greedy heuristic, as required.

### Related constraints

We have a variable diff[p][c] = $|$ mean_rd[p] - k[p][c] $|$. Thus, diff[p][c] $=$ max(mean_rd[p] - k[p][c], k[p][c] - mean_rd[p])

diff[p][c] >= mean_rd[p] - k[p][c] <br>
diff[p][c] >= k[p][c] - mean_rd[p]
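
The two inequalities only bound diff[p][c] from below. Because the objective minimizes it, the solver is expected to push it down to the larger of the two bounds, which is the absolute value. A minimal sketch of that argument, with a hypothetical helper name and sample values:

```python
def smallest_feasible_diff(mean_rd, k_pc):
    """Smallest value satisfying both lower-bound constraints on diff[p][c]."""
    return max(mean_rd - k_pc, k_pc - mean_rd)


# The smallest feasible value coincides with the absolute difference.
for mean_rd, k_pc in [(1.2, 2), (1.2, 1), (3.0, 3)]:
    assert smallest_feasible_diff(mean_rd, k_pc) == abs(mean_rd - k_pc)
```

If diff[p][c] were not pushed down by the objective, any larger value would also satisfy both constraints, so the equality holds only at an optimum.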

We only want to count the deviation from mean_rd[p] for those contigs that are in p. So, for each contig, we have a variable counted_diff[p][c] = diff[p][c] . contigs[p][c]. However, this product is not linear, as both terms on the right are variables, so we bound counted_diff[p][c] with linear constraints instead:

counted_diff[p][c] $<=$ diff[p][c] <br>
counted_diff[p][c] $<=$ UBD(diff[p][c]) . contigs[p][c], where UBD(diff[p][c]) is an upper bound on diff[p][c]. The maximum read depth over all contigs, max_rd, is a reasonable choice for this bound. <br>
counted_diff[p][c] $>= 0$
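
The standard way to linearize the product of a binary variable and a bounded non-negative variable uses four inequalities. The sketch below assumes that the upper bound is multiplied by contigs[p][c] and adds a fourth constraint, counted_diff[p][c] $>=$ diff[p][c] - UBD . (1 - contigs[p][c]), which is not listed in the text; both are assumptions for illustration. Without such a lower bound, a minimizing solver could set counted_diff[p][c] to zero even for contigs in the plasmid.

```python
def feasible_interval(diff, in_plasmid, ubd):
    """Range of counted_diff allowed by the four linear constraints.

    in_plasmid plays the role of the binary variable contigs[p][c],
    ubd the role of UBD(diff[p][c]).
    """
    # counted_diff <= diff and counted_diff <= ubd * contigs[p][c]
    upper = min(diff, ubd * in_plasmid)
    # counted_diff >= 0 and the assumed extra constraint
    # counted_diff >= diff - ubd * (1 - contigs[p][c])
    lower = max(0.0, diff - ubd * (1 - in_plasmid))
    return lower, upper


ubd = 10.0  # hypothetical max_rd
for diff in [0.0, 0.8, 4.5, 10.0]:
    for in_plasmid in (0, 1):
        lower, upper = feasible_interval(diff, in_plasmid, ubd)
        # The interval collapses to the single value diff * contigs[p][c].
        assert lower == upper == diff * in_plasmid
```

The check confirms that, for any diff between 0 and the bound, the only feasible value is the intended product.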

Finally, we wish to compute the weighted deviation of the read depths, with the weights being len[c]/len[p]. So, we have a variable wtd_diff[p] equal to the above objective function. Thus, wtd_diff[p] $= \sum_{c \in p}$( $|$ mean_rd[p] - k[p][c] $|$ . len[c]/len[p] ). This can be rewritten as follows:

wtd_diff[p] . len[p] $= \sum_{c \in p}$( $|$ mean_rd[p] - k[p][c] $|$ . len[c])

Again, both terms on the left are variables. Thus, we introduce a variable counted_wtd_diff[p][c] and the following constraints to ensure that counted_wtd_diff[p][c] = wtd_diff[p] . contigs[p][c]

counted_wtd_diff[p][c] $<=$ wtd_diff[p] <br>
counted_wtd_diff[p][c] $<=$ UBD(wtd_diff[p]) . contigs[p][c] <br>
counted_wtd_diff[p][c] $>= 0$

Thus, ultimately, we simply wish to minimize $\sum_{all~c}$( counted_wtd_diff[p][c] )
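
The reason the product wtd_diff[p] . contigs[p][c] is needed can be spelled out as a short worked step. This is a sketch under the assumption that len[p] is expressed through the binary variables as $\sum_{all~c}$( len[c] . contigs[p][c] ), which the text does not state explicitly:

wtd_diff[p] . len[p] $=$ wtd_diff[p] . $\sum_{all~c}$( len[c] . contigs[p][c] ) $= \sum_{all~c}$( len[c] . counted_wtd_diff[p][c] )

Under the same assumption, the right-hand side of the rewritten equation becomes $\sum_{all~c}$( len[c] . counted_diff[p][c] ), so the equality would read:

$\sum_{all~c}$( len[c] . counted_wtd_diff[p][c] ) $= \sum_{all~c}$( len[c] . counted_diff[p][c] )

Every len[c] is a constant, so both sides are linear in the variables.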

## 2. Gene density term

Here, the higher the gene density, the more plasmid genes are covered by the predicted plasmids. Hence, the aim is to maximize the gene density. As above, we weight the gene density of each contig by the contig length.

### Objective
$\sum_{c \in p}$(- gd[c] . len[c]/len[p] ) 
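
To make the sign convention concrete, the sketch below evaluates the weighted gene density for two candidate contig sets. The function name `weighted_gene_density` and all values are hypothetical and serve only as an illustration.

```python
def weighted_gene_density(gd, lengths, contigs_in_p):
    """Length-weighted gene density of the contigs chosen for one plasmid."""
    len_p = sum(lengths[c] for c in contigs_in_p)  # len[p]
    return sum(gd[c] * lengths[c] for c in contigs_in_p) / len_p


# Hypothetical contigs: c3 is long and carries no plasmid genes.
gd = {"c1": 0.9, "c2": 0.5, "c3": 0.0}
len_c = {"c1": 2000, "c2": 2000, "c3": 6000}

wtd_gd_a = weighted_gene_density(gd, len_c, ["c1", "c2"])
wtd_gd_b = weighted_gene_density(gd, len_c, ["c1", "c2", "c3"])

# The MILP term is the negated density, so the smaller value is preferred.
print(-wtd_gd_a, -wtd_gd_b)
```

The first candidate has a weighted density of about 0.7 and the second about 0.28, so the negated term favors leaving out the long gene-free contig.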

### Greedy objective
1 − density(c)

Thus, minimizing the MILP objective is equivalent to minimizing the greedy objective as well.

### Related constraints
We wish to compute the weighted gene density with the weight len[c]/len[p]. So, we use a variable wtd_gd[p]  $= \sum_{c \in p}$( gd[c] . len[c]/len[p] ). We rewrite the equation as follows:

wtd_gd[p] . len[p] $= \sum_{c \in p}$( gd[c] . len[c])

Both terms on the left are variables. Thus, we introduce a variable counted_wtd_gd[p][c] and the following constraints to ensure that counted_wtd_gd[p][c] = wtd_gd[p] . contigs[p][c]

counted_wtd_gd[p][c]  $<=$  wtd_gd[p] <br>
counted_wtd_gd[p][c]  $<=$  UBD(wtd_gd[p]) . contigs[p][c], where UBD(wtd_gd[p]) is an upper bound on wtd_gd[p]. Here, UBD = max_gd, where max_gd is the maximum gene density of all contigs.<br>
counted_wtd_gd[p][c]  $>=0$ 

Thus, ultimately, we minimize  $\sum_{all~c}$ (- counted_wtd_gd[p][c] )

## 3. GC content term

The aim is to minimize the variation in GC content among the contigs of a plasmid. To this end, we compute the mean GC content of a plasmid (mean_GC[p]) and minimize the sum of the absolute differences between this mean and the individual GC contents of the contigs involved (GC[c]). Each difference is weighted by the length of the contig.

### Objective
$\sum_{c \in p}$( $|$ mean_GC[p] - GC[c] $|$ . len[c]/len[p] )

Here, len[p] = $\sum_{c \in p}$( len[c] ) and mean_GC[p] = $\sum_{c \in p}$( GC[c] . len[c]/len[p] )
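
The effect of this term can be seen by comparing a plasmid of contigs with similar GC content against one that also includes an outlier. The function name `gc_term` and the data are hypothetical and serve only as an illustration.

```python
def gc_term(gc, lengths, contigs_in_p):
    """Length-weighted mean absolute deviation of GC content for one plasmid."""
    len_p = sum(lengths[c] for c in contigs_in_p)  # len[p]
    mean_gc = sum(gc[c] * lengths[c] for c in contigs_in_p) / len_p
    return sum(abs(mean_gc - gc[c]) * lengths[c] for c in contigs_in_p) / len_p


# Hypothetical contigs: c3 has a clearly lower GC content.
gc = {"c1": 0.50, "c2": 0.52, "c3": 0.35}
len_c = {"c1": 4000, "c2": 4000, "c3": 2000}

homogeneous = gc_term(gc, len_c, ["c1", "c2"])
mixed = gc_term(gc, len_c, ["c1", "c2", "c3"])
print(homogeneous, mixed)
```

The homogeneous candidate scores about 0.01 and the mixed one about 0.0512, so minimizing this term discourages adding the outlier contig.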

### Greedy objective
$|$ gc_content(p) − gc_content(c) $|$

Thus, the objective function of the MILP tries to minimize the same deviation as the objective function of the greedy heuristic, as required.

### Related constraints

We have a variable GC_diff[p][c] = $|$ mean_GC[p] - GC[c] $|$. Thus, GC_diff[p][c] $=$ max(mean_GC[p] - GC[c], GC[c] - mean_GC[p])

GC_diff[p][c] >= mean_GC[p] - GC[c] <br>
GC_diff[p][c] >= GC[c] - mean_GC[p]

We only want to count the deviation from mean_GC[p] for those contigs that are in p. So, for each contig, we have a variable counted_GC_diff[p][c] = GC_diff[p][c] . contigs[p][c]. However, this product is not linear, as both terms on the right are variables, so we bound counted_GC_diff[p][c] with linear constraints instead:

counted_GC_diff[p][c] $<=$ GC_diff[p][c] <br>
counted_GC_diff[p][c] $<=$ UBD(GC_diff[p][c]) . contigs[p][c], where UBD(GC_diff[p][c]) is an upper bound on GC_diff[p][c]. Here, UBD = max_GC, where max_GC is the maximum GC content of all contigs. <br>
counted_GC_diff[p][c] $>= 0$

Finally, we wish to compute the weighted deviation of the GC contents, with the weights being len[c]/len[p]. So, we have a variable wtd_GC_diff[p] equal to the above objective function. Thus, wtd_GC_diff[p] $= \sum_{c \in p}$( $|$ mean_GC[p] - GC[c] $|$ . len[c]/len[p] ). This can be rewritten as follows:

wtd_GC_diff[p] . len[p] $= \sum_{c \in p}$( $|$ mean_GC[p] - GC[c] $|$ . len[c])

Again, both terms on the left are variables. Thus, we introduce a variable counted_wtd_GC_diff[p][c] and the following constraints to ensure that counted_wtd_GC_diff[p][c] = wtd_GC_diff[p] . contigs[p][c]

counted_wtd_GC_diff[p][c] $<=$ wtd_GC_diff[p] <br>
counted_wtd_GC_diff[p][c] $<=$ UBD(wtd_GC_diff[p]) . contigs[p][c] <br>
counted_wtd_GC_diff[p][c] $>= 0$

Thus, we minimize $\sum_{all~c}$( counted_wtd_GC_diff[p][c] )