# Improvements in the Plasmid Assembly Formulation

## Single plasmid per iteration

#### Possibility of multiple occurrences of the same contig
The current version of the MILP removes a single plasmid per run. However, a putative plasmid cannot contain the same contig twice, whereas reference plasmids can. To handle this, we change the contigs[p][c] variables from binary to integer. Currently, contigs[p][c] is 1 if contig 'c' is in plasmid 'p' and 0 otherwise. As an integer, the variable represents the number of times contig 'c' occurs in plasmid 'p'.

Accordingly, we now look for walks instead of paths. Also, only cycles that are not part of walks will be considered in the cycle removal stage (to ensure that we have a single connected component). 
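
To illustrate what the integer variables encode, here is a minimal Python sketch, not part of the MILP itself. It takes a candidate walk (a list of contig names), checks that consecutive contigs are linked in the graph, and counts how often each contig occurs. The helper names, the toy graph, and the choice to ignore contig orientation are all assumptions made for illustration.

```python
from collections import Counter


def is_walk(walk, edges):
    """Return True if every consecutive pair of contigs is joined by an edge.

    `edges` is a set of frozensets, i.e. undirected links; contig
    orientation is ignored here (an assumption of this sketch).
    """
    return all(frozenset((a, b)) in edges for a, b in zip(walk, walk[1:]))


def contig_multiplicities(walk):
    """Number of occurrences of each contig, i.e. the value contigs[p][c]."""
    return dict(Counter(walk))


# Hypothetical three-contig cycle in the assembly graph.
edges = {
    frozenset(("c1", "c2")),
    frozenset(("c2", "c3")),
    frozenset(("c3", "c1")),
}

# A walk may revisit contigs, which a path cannot do.
walk = ["c1", "c2", "c3", "c1", "c2"]
print(is_walk(walk, edges))         # True
print(contig_multiplicities(walk))  # {'c1': 2, 'c2': 2, 'c3': 1}
```

With binary variables, the walk above could not be represented, since 'c1' and 'c2' would each be capped at one occurrence.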

#### Changes in objective function
1. The objective term that accounts for variations in coverage (depth) changes to accommodate the above change. Previously the variation in coverages (rd_diff) was counted as follows.

    rd_diff[i] = max(0, 1 - rd[i])

    where 'i' is the contig, rd[i] is the remaining (unused) coverage for the contig and rd_diff[i] is the coverage penalty for the contig. 

    However, we now also need to consider that the same contig may occur multiple times. Hence, the new penalty term for contig 'i' in plasmid 'p' becomes

    rd_diff[i] = max(0, contigs[p][i] - rd[i])

    Thus, effectively, if the read depth has to be split or shared between multiple occurrences of the contig, the penalty will be distributed too.

2. The gene density term will be multiplied by contigs[p][c].

3. The GC_diff term computes the penalty for variations in the GC content of contigs. It is computed as:

    GC_diff[i] = $|$ GC_mean[p] - GC[i] $|$ 

    This term will be multiplied by contigs[p][i].
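
The read depth and GC content terms above can be evaluated for a fixed assignment as in the following Python sketch. It only computes the penalty values and does not build the MILP. The function names and the example numbers are hypothetical. The gene density term is left out because its base form is not given here.

```python
def rd_diff(count, rd):
    """Read depth penalty: max(0, contigs[p][i] - rd[i])."""
    return max(0.0, count - rd)


def gc_diff(count, gc_mean, gc):
    """GC penalty |GC_mean[p] - GC[i]|, multiplied by the multiplicity."""
    return count * abs(gc_mean - gc)


# Hypothetical contig with 1.5 units of unused read depth.
print(rd_diff(1, 1.5))        # 0.0, one occurrence fits in the remaining depth
print(rd_diff(2, 1.5))        # 0.5, the second occurrence exceeds it
print(gc_diff(2, 0.5, 0.25))  # 0.5, GC deviation of 0.25 counted twice
```

With count fixed to 1, `rd_diff` reduces to the previous penalty max(0, 1 - rd[i]).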

#### Weighting by length
The read depth and GC content terms can be weighted by contig length. The MILP will therefore have two versions: one where none of the objective terms are weighted by length, and another where the read depth and GC content penalty terms are weighted by contig length. Weighting avoids penalizing shorter contigs too heavily, even when their individual penalties are high.
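
The difference between the two versions can be seen on a toy example. The sketch below assumes the weight of contig 'i' is len[i]/len[p], as in the length-normalized coverage term later in this document, and that len[p] is the total length of the contigs in the plasmid counted with multiplicity. Both are assumptions of the sketch, as are the function names and the data.

```python
def plasmid_length(counts, lengths):
    """Total plasmid length, counting each contig once per occurrence."""
    return sum(k * lengths[c] for c, k in counts.items())


def penalty(counts, rd, gc, gc_mean, lengths, weighted):
    """Sum of read depth and GC penalties over the contigs of one plasmid."""
    total_len = plasmid_length(counts, lengths)
    total = 0.0
    for c, k in counts.items():
        if k == 0:
            continue  # unused contigs contribute nothing
        pen = max(0.0, k - rd[c]) + k * abs(gc_mean - gc[c])
        weight = lengths[c] / total_len if weighted else 1.0
        total += weight * pen
    return total


# Hypothetical plasmid: a long well-supported contig and a short poor one.
counts = {"long": 1, "short": 1}
lengths = {"long": 9000, "short": 1000}
rd = {"long": 1.0, "short": 0.0}
gc = {"long": 0.5, "short": 0.5}

unweighted = penalty(counts, rd, gc, 0.5, lengths, weighted=False)
weighted = penalty(counts, rd, gc, 0.5, lengths, weighted=True)
print(unweighted, weighted)
```

Here the short contig carries the whole penalty, and the weighted version scales it down by its 10% share of the plasmid length.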

## Multiple plasmids obtained at the same time

#### Penalize edges
Edges between contigs with high gene density are penalized if they are NOT chosen in any plasmid. This gives the MILP an incentive to assign these edges to at least one plasmid, and prevents it from outputting individual contigs as plasmids.
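
A rough way to evaluate such a penalty for a given solution is sketched below in Python. The density threshold `min_density`, the unit penalty per unused edge, and all names are assumptions of this sketch, since the exact form of the term is not fixed here.

```python
def unused_edge_penalty(edges, gene_density, used_edges, min_density):
    """Count edges joining two high-density contigs that no plasmid uses.

    `used_edges` is the union of the edges selected over all plasmids.
    """
    penalty = 0
    for u, v in edges:
        both_dense = (gene_density[u] >= min_density
                      and gene_density[v] >= min_density)
        if both_dense and (u, v) not in used_edges:
            penalty += 1
    return penalty


# Hypothetical graph: only edge (a, b) is used by some plasmid.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
gene_density = {"a": 0.8, "b": 0.9, "c": 0.7, "d": 0.1}
used_edges = {("a", "b")}

# (b, c) is dense but unused; (c, d) is ignored because d has low density.
print(unused_edge_penalty(edges, gene_density, used_edges, 0.5))  # 1
```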

#### Preprocessing
Preprocess the graph to remove 'desert' contigs. For each contig, we check the distance to the two nearest seed contigs; if the distance is larger than a threshold, the contig and its incident edges are removed from the assembly graph. This avoids large stretches of contigs without plasmid genes (very low gene density).
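
One possible reading of this preprocessing step is sketched below in Python. It rests on several assumptions: distance is measured in number of edges, a contig is removed when fewer than two seeds lie within the threshold, and seed contigs are always kept. Other readings (for example, testing only the nearest seed, or measuring distance in base pairs) would need small changes. All names and the toy graph are hypothetical.

```python
from collections import deque


def two_nearest_seed_distances(graph, start, seeds):
    """BFS from `start`; distances (in edges) to up to two nearest seeds."""
    dist = {start: 0}
    queue = deque([start])
    found = []
    while queue and len(found) < 2:
        node = queue.popleft()
        if node in seeds and node != start:
            found.append(dist[node])  # BFS order gives nearest seeds first
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return found


def remove_desert_contigs(graph, seeds, threshold):
    """Drop non-seed contigs whose second nearest seed is beyond threshold."""
    deserts = set()
    for contig in graph:
        if contig in seeds:
            continue
        d = two_nearest_seed_distances(graph, contig, seeds)
        if len(d) < 2 or d[1] > threshold:
            deserts.add(contig)
    # Rebuild adjacency lists without the removed contigs and their edges.
    return {c: [n for n in nbrs if n not in deserts]
            for c, nbrs in graph.items() if c not in deserts}


# Hypothetical graph: s1 - a - b - s2 - x - y - z, with seeds s1 and s2.
graph = {
    "s1": ["a"], "a": ["s1", "b"], "b": ["a", "s2"],
    "s2": ["b", "x"], "x": ["s2", "y"], "y": ["x", "z"], "z": ["y"],
}
pruned = remove_desert_contigs(graph, {"s1", "s2"}, threshold=3)
print(sorted(pruned))  # ['a', 'b', 's1', 's2']
```

The tail x, y, z is removed because its second nearest seed (s1) is at least four edges away.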

#### Copy number
Some plasmids might have a higher copy number. In the single plasmid version, the same plasmid would be output multiple times according to its copy number. If we remove multiple plasmids at once, we can follow the same method and consider multiple occurrences of the same plasmid as different plasmids. Another way to handle this would be to compute a copy number for each plasmid and consider it just once.

If the number of plasmids is given as an input, the first strategy might not be the best: multiple copies of the same plasmid may leave fewer slots for other potential plasmids whose objective value is lower than that of the plasmid with the higher copy number.

On the other hand, if we assign a copy number to each plasmid, we add another variable thereby making it difficult to obtain a linear formulation.

#### Coverage term
If we allow multiple plasmids to be assembled simultaneously, the coverage of a contig 'i' will be split between the plasmids. Hence, in order to compute the penalty for deviation from mean coverage, we would have to sum the number of occurrences of a contig over all plasmids. The new penalty term will look as follows:

max(0, $\sum_{p}$ contigs[p][i] - cov[i]) for all contigs 'i'.

Note that this penalty term holds only if not normalized by length. If the read depth penalty is to be normalized by length, the penalties for individual plasmids would be computed separately.

$\sum_{p}$ (max(0, contigs[p][i] - cov[i]) * len[i]/len[p])
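
The two coverage penalties can be compared on a small example with the Python sketch below. It assumes the joint version sums the occurrences of contig 'i' over all plasmids before subtracting cov[i], and that len[p] is the total length of the contigs in plasmid 'p' counted with multiplicity. The function names and data are hypothetical.

```python
def joint_coverage_penalty(plasmids, cov, contig):
    """max(0, sum_p contigs[p][i] - cov[i]), without length normalization."""
    total = sum(p.get(contig, 0) for p in plasmids)
    return max(0.0, total - cov[contig])


def normalized_coverage_penalty(plasmids, cov, lengths, contig):
    """sum_p max(0, contigs[p][i] - cov[i]) * len[i] / len[p]."""
    total = 0.0
    for p in plasmids:
        k = p.get(contig, 0)
        if k == 0:
            continue  # max(0, 0 - cov[i]) is 0 for non-negative coverage
        plen = sum(n * lengths[c] for c, n in p.items())
        total += max(0.0, k - cov[contig]) * lengths[contig] / plen
    return total


# Hypothetical solution: contig 'a' occurs once in plasmid 0, twice in plasmid 1.
plasmids = [{"a": 1, "b": 1}, {"a": 2}]
cov = {"a": 1.5, "b": 1.0}
lengths = {"a": 1000, "b": 3000}

print(joint_coverage_penalty(plasmids, cov, "a"))                # 1.5
print(normalized_coverage_penalty(plasmids, cov, lengths, "a"))  # 0.25
```

In this example the per-plasmid version only penalizes plasmid 1, so the two forms generally give different values even before the length weight is applied.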
