# Compute uni-mutational fractions

In [1]:
%run "Header.ipynb"
%run "LoadMutationJSONData.ipynb"
%run "GeneUtils.ipynb"

## Compute uni-mutational fractions: $N_1$, $N_2$, $N_3$ for each position in the genome

NOTE: The paper numbers these quantities differently (e.g., $N_2$ instead of $N_1$), but the idea is the same.

(In the following text, we make the simplifying assumption that all reads contain only the four standard nucleotides, and no degenerate bases like $Y$ or $N$.)

For a genomic position $P$ with a reference nucleotide $R \in \{A, C, G, T\}$, we consider three values: $N_1$, $N_2$, and $N_3$. __Ignoring reads aligned to $P$ that match $R$__, a read aligned to $P$ can have one of three other nucleotides at this position (for example, if $R$ is $A$, then these "non-matching" reads should only describe $C$, $G$, or $T$).

$N_1$, $N_2$, and $N_3$ are the numbers of non-matching reads aligned to $P$ for each of these three remaining nucleotides, ordered such that $N_1 \geq N_2 \geq N_3$:

- $N_1$ is the number of non-matching reads that contain the most common "alternate" nucleotide at $P$.
- $N_2$ is the number of non-matching reads for the second-most common "alternate" nucleotide at $P$.
- $N_3$ is the number of non-matching reads for the third-most (least) common "alternate" nucleotide at $P$.

For example, if a given position in a reference sequence is $G$ and the covering reads for this position are:

- 100 $G$
- 1 $T$
- 50 $C$
- 5 $A$

Then we would say that $N_1$ = 50 (for $C$), $N_2$ = 5 (for $A$), and $N_3$ = 1 (for $T$).
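
The ranking in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the `ranked_alt_counts` helper and the plain dictionary of per-nucleotide read counts are hypothetical illustrations, not the data structures used elsewhere in this notebook.

```python
def ranked_alt_counts(counts, ref):
    """Return (N1, N2, N3): non-matching read counts, largest first.

    `counts` maps each nucleotide to its read count at one position and
    `ref` is the reference nucleotide there. Both are hypothetical inputs
    used for illustration only.
    """
    # Collect counts for the three nucleotides that differ from the reference;
    # a nucleotide with no reads counts as zero.
    alts = [counts.get(nt, 0) for nt in "ACGT" if nt != ref]
    n1, n2, n3 = sorted(alts, reverse=True)
    return n1, n2, n3


# Counts from the example above (reference nucleotide G).
example = {"G": 100, "T": 1, "C": 50, "A": 5}
print(ranked_alt_counts(example, ref="G"))
```

Sorting in descending order guarantees $N_1 \geq N_2 \geq N_3$ regardless of which nucleotide each count came from.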

We note that, for low-coverage and/or stable genomic positions, some or all of $N_1$, $N_2$, and $N_3$ may be zero.

We are interested in computing $f = \dfrac{N_1}{N_1 + N_2 + N_3}$ (which is only defined when $N_1 > 0$).
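
As a worked instance, using the counts from the example above ($N_1 = 50$, $N_2 = 5$, $N_3 = 1$):

$$f = \dfrac{50}{50 + 5 + 1} = \dfrac{50}{56} \approx 0.893$$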

For a position with reasonably high coverage, high values of this fraction (at most $f = 1$) indicate that the metagenome likely has one primary "alternate" nucleotide at this position, while lower values (at lowest $f = \dfrac{1}{3}$) indicate that there are likely multiple "common" alternate nucleotides.
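
The bounds on $f$ follow from the ordering $N_1 \geq N_2 \geq N_3$: the denominator is at least $N_1$ and at most $3 N_1$, so $\dfrac{1}{3} \leq f \leq 1$ whenever $f$ is defined. The sketch below illustrates both extremes and the undefined case. The function name `uni_mutational_fraction` is a hypothetical helper introduced here for illustration, not something defined elsewhere in this notebook.

```python
def uni_mutational_fraction(n1, n2, n3):
    """Return f = N1 / (N1 + N2 + N3), or None when f is undefined.

    Assumes the inputs are already ordered so that n1 >= n2 >= n3 >= 0.
    """
    if n1 == 0:
        # No non-matching reads at all, so f would be 0 / 0.
        return None
    return n1 / (n1 + n2 + n3)


print(uni_mutational_fraction(7, 0, 0))  # a single alternate nucleotide
print(uni_mutational_fraction(4, 4, 4))  # three equally common alternates
print(uni_mutational_fraction(0, 0, 0))  # no non-matching reads
```

Returning `None` for the undefined case mirrors how positions without any non-matching reads are handled in the code cell below.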

In [2]:
seq2pos2f = {}
for seq in SEQS:
    seq2pos2f[seq] = {}
    for pos in seq2pos2matchct[seq]:
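        # Counts of reads at this position that do not match the reference.
        non_matches = seq2pos2mismatches[seq][pos]
        if len(non_matches) > 0:
            non_match_cts = non_matches.values()
            # N1 is the largest mismatch count; the denominator is N1 + N2 + N3.
            f = max(non_match_cts) / sum(non_match_cts)
            seq2pos2f[seq][pos] = f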
#             if max(non_match_cts) > seq2pos2matchct[seq][pos]:
#                 print("Seq {} at pos {} has reference that isn't consensus of aligned reads.".format(
#                     seq2name[seq], pos
#                 ))
#                 print("Num matches: {}, mismatches: {}".format(seq2pos2matchct[seq][pos], non_matches))
        else:
            # There are no non-matches, so N1, N2, and N3 are all 0. This means that f is undefined
            # (since it'd be 0 / 0).
            seq2pos2f[seq][pos] = None
            
# The following commented-out code plots "f" across each genome.
# The plots are not very informative.
#     pyplot.scatter([int(x) for x in seq2pos2f[seq].keys()], seq2pos2f[seq].values(), c=SCATTERPLOT_PT_COLOR)
#     pyplot.title(r"{}: f across genome, where defined and where coverage $\geq$ {}".format(
#         seq2name[seq], MIN_COV
#     ))
#     pyplot.xlabel("Sequence position")
#     pyplot.ylabel(r"$\dfrac{N_1}{N_1 + N_2 + N_3}$")
#     pyplot.locator_params(axis="x", nbins=5)
#     use_thousands_sep(pyplot.gca().xaxis)
#     pyplot.show()