Design the implementation of precision--recall calculations

## Our goal

Let $V$ be the set of (somatic) variants in our sample. From sequencing data we produce callsets with various callers and filters. It is useful to think of precision and recall as functions on the domain of callsets. For each callset $C$, precision and recall are defined as

$$
\begin{eqnarray}
\mathrm{precision}&=& p(C) = \frac{|C \cap V|}{|C|} = \frac{\#\text{true calls}}{\#\mathrm{calls}}
\; \text{: the fraction of calls that are true} \\
\mathrm{recall}&=& r(C) = \frac{|C \cap V|}{|V|} = \frac{\#\text{true calls}}{\#\mathrm{variants}}
\; \text{: the fraction of variants that are called}
\end{eqnarray}
$$
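
The two definitions translate directly into set operations. The following is a minimal sketch, assuming that each variant is represented by a hashable key such as a `(chrom, pos, ref, alt)` tuple; the function names and the toy data are hypothetical and only serve as illustration.

```python
def precision(calls, variants):
    """Fraction of calls that are true: |C ∩ V| / |C|."""
    calls, variants = set(calls), set(variants)
    if not calls:
        return float("nan")  # precision is undefined for an empty callset
    return len(calls & variants) / len(calls)


def recall(calls, variants):
    """Fraction of variants that are called: |C ∩ V| / |V|."""
    calls, variants = set(calls), set(variants)
    if not variants:
        return float("nan")  # recall is undefined when there are no variants
    return len(calls & variants) / len(variants)


# Toy data (hypothetical): four true variants, three calls, two of them true
V = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
     ("2", 50, "G", "A"), ("2", 75, "T", "C")}
C = {("1", 100, "A", "G"), ("2", 50, "G", "A"), ("3", 10, "A", "T")}

print("precision:", precision(C, V))
print("recall:", recall(C, V))
```

Returning `nan` for an empty $C$ or $V$ is a design choice of this sketch; raising an exception would be equally defensible.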

We need the following:
1. $V$ based on four germline callsets---one callset from the unmixed DNA of NA12889, and three others from NA12890, NA12891, and NA12892
1. $C$ based on a DNA mix of NA12889, ..., NA12892
1. an implementation of precision and recall calculation


## Terms, quantities

* everything below refers to a given variant $v$
* reference allele: $R$ or $0$; alternative allele: $A$ or $1$
* subject/individual: $s \in \{1, 2, 3, 4\}$ or $s \in \{\mathrm{NA12889}, ..., \mathrm{NA12892}\}$
* genotype of subject $s$: $g_s \in \{0, 1, 2\}$ or $g_s \in \{RR, RA, AA\}$, i.e. homozygous reference, heterozygous, and homozygous alternative, respectively
* genotype vector (for all subjects): $g = (g_1, ..., g_4)$
* note that $g \in G_{(1)} \times ... \times G_{(4)} \setminus \{(0,0,0,0)\}$, where each $G_{(s)} \equiv \{0, 1, 2\}$, so that $g$ can take $3^4 - 1 = 80$ values (when $g = (0,0,0,0)$, none of the four subjects carries the variant)
* mix $m \in \{1, 2, 3\}$ or $m \in \{\mathrm{Mix}1, \mathrm{Mix}2, \mathrm{Mix}3\}$
* each mix $m$ is defined by the mixing ratios $q_m \equiv (q_{1m}, q_{2m}, q_{3m}, q_{4m})$, which are normalized so $\sum_{s=1}^4 q_{sm} = 1$
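* for mix $m$ the alternative allele frequency is $f_m = \frac{1}{2} \sum_{s=1}^4 g_s q_{sm}$, or simply $\frac{1}{2} \, g \cdot q_m$ written with the dot product; the factor $\frac{1}{2}$ converts the alternative allele counts $g_s$ of diploid genotypes into a fraction, so that $0 < f_m \le 1$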
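
The quantities above can be sketched in a few lines of code. This is an illustration only: the mixing ratios `q_example` are hypothetical placeholders (the actual ratios of Mix1, ..., Mix3 are not given here), and dividing the dot product $g \cdot q_m$ by the ploidy 2, so that the result lies in $[0, 1]$, is an assumption about diploid genotypes.

```python
from itertools import product


def genotype_vectors(n_subjects=4):
    """All genotype vectors g, excluding the all-reference vector (0, ..., 0)."""
    return [g for g in product((0, 1, 2), repeat=n_subjects) if any(g)]


def alt_allele_frequency(g, q, ploidy=2):
    """Alternative allele frequency of a variant with genotype vector g in a
    mix with ratios q. Assumption: g_s counts alternative alleles of a diploid
    genotype, hence the division by the ploidy."""
    if len(g) != len(q):
        raise ValueError("g and q must have the same length")
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("mixing ratios must sum to 1")
    return sum(g_s * q_s for g_s, q_s in zip(g, q)) / ploidy


# Hypothetical mixing ratios for illustration; NOT the real mix design
q_example = (0.1, 0.2, 0.3, 0.4)

vectors = genotype_vectors()
print(len(vectors), "possible genotype vectors")
print(alt_allele_frequency((1, 0, 0, 0), q_example))
```

Several genotype vectors can map to the same frequency in a given mix, so the number of distinct values of $f_m$ may be smaller than the number of genotype vectors.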

The set $V$ is constructed from germline callsets in the following steps:

1. germline callsets from our lab `NA12889-chess.vcf`, ..., `NA12892-chess.vcf` and from Illumina `NA12889-illumina.vcf`, ..., `NA12892-illumina.vcf`
1. take intersections `NA12889-chess.vcf` $\cap$ `NA12889-illumina.vcf` $\rightarrow$ `NA12889.vcf`; likewise obtain `NA12890.vcf`, ...
1. separate `NA12889.vcf` $\rightarrow$ `NA12889-snps.vcf`, `NA12889-indels.vcf`; the treatment of indels is the same as that of SNPs from this point on
1. let $V_1\equiv$ `NA12889-snps.vcf`, ..., $V_4\equiv$ `NA12892-snps.vcf`; the notation $V_s$ makes the point that we consider these sets as the sets of all *variants*, i.e. the biological reality
1. take the union $V = \bigcup_{s=1}^4 V_s$, a.k.a. `snps.vcf`
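
The steps above can be sketched with plain set operations. This is a simplified illustration, not a replacement for a dedicated VCF tool: it assumes that two records denote the same variant when they agree on `(CHROM, POS, REF, ALT)`, it ignores variant normalization and genotype fields, and the helper names (`read_variants`, `is_snv`, `truth_set`) as well as the toy records are hypothetical.

```python
def read_variants(lines):
    """Collect (chrom, pos, ref, alt) keys from the lines of a VCF file."""
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # split multi-allelic records
            variants.add((chrom, int(pos), ref, alt))
    return variants


def is_snv(variant):
    """True for single-base substitutions, False for indels and others."""
    _, _, ref, alt = variant
    bases = {"A", "C", "G", "T"}
    return ref.upper() in bases and alt.upper() in bases


def truth_set(chess, illumina):
    """Intersect the two callsets per subject, keep the single-base
    substitutions as V_s, and return ({subject: V_s}, union V)."""
    per_subject = {s: {v for v in chess[s] & illumina[s] if is_snv(v)}
                   for s in chess}
    return per_subject, set().union(*per_subject.values())


# Toy records (hypothetical) standing in for the contents of the VCF files
chess = {
    "NA12889": read_variants(["#CHROM\tPOS\tID\tREF\tALT\n",
                              "1\t100\t.\tA\tG\n", "1\t200\t.\tC\tCT\n",
                              "2\t50\t.\tG\tA\n"]),
    "NA12890": read_variants(["1\t300\t.\tT\tC,G\n"]),
}
illumina = {
    "NA12889": read_variants(["1\t100\t.\tA\tG\n", "1\t200\t.\tC\tCT\n"]),
    "NA12890": read_variants(["1\t300\t.\tT\tC\n"]),
}

V_s, V = truth_set(chess, illumina)
print(sorted(V))
```

With real files one would pass an open file handle to `read_variants`, for example `read_variants(open("NA12889-chess.vcf"))`, since the function only iterates over lines.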

## Refinements

### Conditioning on allele frequency

Our DNA mix design results in modeled somatic variants, whose alternative allele frequency depends on the mixing ratios as well as the presence/absence and zygosity of the variant in each of NA12889, ..., NA12892 (see below).  Let $V_f \subset V$ be the set of variants whose frequency is $f$.  Then the definitions above can be refined as

$$
\begin{eqnarray}
\text{precision at}\, f &=& p(C, f) = \frac{|C \cap V_f|}{|C|} \\
\text{recall at}\, f &=& r(C, f) = \frac{|C \cap V_f|}{|V_f|}
\end{eqnarray}
$$
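
A minimal sketch of the frequency-conditioned quantities follows, assuming that the frequency $f$ of every variant in $V$ is already known and supplied as a mapping from variant to $f$. The function names and the toy data are hypothetical. Rounding $f$ before grouping is an assumption of the sketch, made so that floating-point noise does not split one $V_f$ into several.

```python
from collections import defaultdict


def stratify_by_frequency(variant_freq, ndigits=6):
    """Partition V into the sets V_f; variant_freq maps variant -> f."""
    strata = defaultdict(set)
    for variant, f in variant_freq.items():
        strata[round(f, ndigits)].add(variant)
    return dict(strata)


def precision_recall_at_f(calls, variant_freq):
    """Return {f: (p(C, f), r(C, f))} following the definitions above:
    p(C, f) = |C ∩ V_f| / |C| and r(C, f) = |C ∩ V_f| / |V_f|."""
    calls = set(calls)
    result = {}
    for f, v_f in stratify_by_frequency(variant_freq).items():
        true_calls = len(calls & v_f)
        p = true_calls / len(calls) if calls else float("nan")
        r = true_calls / len(v_f)  # every V_f is non-empty by construction
        result[f] = (p, r)
    return result


# Toy data (hypothetical): five variants at two frequencies, four calls
variant_freq = {"v1": 0.05, "v2": 0.05, "v3": 0.2, "v4": 0.2, "v5": 0.2}
C = {"v1", "v3", "v4", "x"}

for f, (p, r) in sorted(precision_recall_at_f(C, variant_freq).items()):
    print(f, p, r)
```

Because the sets $V_f$ partition $V$, the values $p(C, f)$ defined this way sum over $f$ to the overall precision $p(C)$, whereas $r(C, f)$ is a separate recall for each frequency.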

### 