Added whitepaper describing fingerprinting Math #1247

Merged
merged 5 commits into from Jun 18, 2019

Conversation

Contributor

yfarjoun commented Nov 1, 2018

 CheckFingerprints, CrosscheckFingerprints, and IdentifyContaminant are based on some probabilistic calculations that are not properly represented anywhere, really. This PR adds a short paper that describes the math.
 - Added whitepaper describing fingerprinting Math 
 a6597cc 

coveralls commented Nov 1, 2018 • edited

 Coverage increased (+0.7%) to 82.137% when pulling 40f078c on yf_add_fingerprinting_whitepaper into 1342f59 on master.

takutosato requested review from takutosato and maddyduran Dec 23, 2018

Contributor Author

yfarjoun commented Apr 8, 2019

 @takutosato and @madduran are you going to get to this or should I find someone else to review?
Contributor

takutosato commented Apr 9, 2019

 @yfarjoun we will review today
reviewed
Contributor

 Looks good to me! My comments are about typos. After section 1.2 Contaminated Samples, things seem less complete? Is this intentional?
 \label{odds_swap} \frac{p(s = 1 \, | \, x,y)}{p(s = 0 \, | \, x,y)} = \frac{p(x,y \, | \, s = 1) \, p(s = 1)}{p(x,y \, | \, s = 0) \, p(s = 0)} \end{align} In particular, if sample swap rarely occurs then the posterior log odds of a swap is well-approximated by

Contributor

"if sample swap" --> "if a sample swap"

 Here \eqref{total_prob} is the law of total probability, \eqref{xy_cond_indep_s} uses that $x$ and $y$ are conditionally independent of $s$ given $\theta$ and $\varphi$, \eqref{apply_xy_cond_indep} applies \eqref{xy_cond_indep}, and \eqref{apply_theta_phi_cond_indep} applies \eqref{theta_phi_cond_indep}. Substituting \eqref{apply_theta_phi_cond_indep} into \eqref{odds_swap}, we conclude that the posterior odds of a swap is: \begin{equation} \label{odds} \boxed{\frac{\sum_\theta p(x \, | \, \theta) \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} p(x \, | \, \theta) \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.}

Contributor

is the period needed after "\frac{p(s=1)}{p(s=0)}"
it looks almost like the \cdot in the multiplication?
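
For readers following the boxed posterior-odds expression quoted above, here is a minimal numerical sketch. It assumes a single haplotype block with three diploid genotypes (AA, AB, BB) and the same prior for $\theta$ and $\varphi$. The function name, argument names, and example numbers are hypothetical and are not taken from the Picard code-base.

```python
def posterior_swap_odds(lik_x, lik_y, prior, prior_swap):
    """Posterior odds of a swap for one haplotype block (illustrative sketch).

    lik_x[g] = p(x | theta = g), lik_y[g] = p(y | phi = g), prior[g] = p(g),
    prior_swap = p(s = 1), assumed to be strictly less than 1.
    """
    # Numerator: under a swap, x and y come from independently drawn genotypes.
    evidence_x = sum(lx * p for lx, p in zip(lik_x, prior))
    evidence_y = sum(ly * p for ly, p in zip(lik_y, prior))
    # Denominator: without a swap, both data sets share one genotype (theta = phi).
    evidence_same = sum(lx * ly * p for lx, ly, p in zip(lik_x, lik_y, prior))
    if evidence_same == 0.0:
        # The data are impossible under "no swap", so the odds are unbounded.
        return float("inf")
    prior_odds = prior_swap / (1.0 - prior_swap)
    return (evidence_x * evidence_y / evidence_same) * prior_odds


# Hypothetical inputs: both data sets point firmly at genotype AA.
prior = [0.25, 0.5, 0.25]
odds = posterior_swap_odds([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], prior, prior_swap=0.5)
```

With uninformative likelihoods (all equal) the likelihood ratio is 1 and the result reduces to the prior odds, which is a quick sanity check on the formula.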

 For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed) Sequence data arrives in the form of reads. We assume that evidence for haplotype $h_i\in\{A,B\}$ with probability of error $e_i\in(0,1)$ are given at a certain haplotype block. We further assume that said evidence is independent (so, for example, reads have been duplicate marked, and close SNPs from the same read-pair are not used twice)

Contributor

period needed after "...not used twice)"
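
The sequence-data assumptions discussed in this thread (independent evidence for haplotype $A$ or $B$, each with its own error probability, where an error switches $A$ and $B$) can be sketched as a likelihood calculation. This is an illustration only: the function name is hypothetical, and the factor 0.5 for the heterozygous genotype is an assumption that follows from the stated absence of reference bias, not a formula copied from the whitepaper.

```python
def haplotype_likelihoods(evidence):
    """Return [p(x | AA), p(x | AB), p(x | BB)] for independent observations.

    evidence is a list of (observed, error) pairs, with observed in {"A", "B"}
    and error in (0, 1). An error is assumed to switch A and B.
    """
    lik_aa, lik_ab, lik_bb = 1.0, 1.0, 1.0
    for observed, error in evidence:
        if observed not in ("A", "B"):
            raise ValueError("non-conformant haplotypes are assumed to be discarded")
        p_correct, p_switched = 1.0 - error, error
        # Homozygous genotypes: the observation is either correct or a switch error.
        lik_aa *= p_correct if observed == "A" else p_switched
        lik_bb *= p_correct if observed == "B" else p_switched
        # Heterozygous genotype (assumption): either haplotype is sampled with
        # probability 1/2, so 0.5 * (1 - e) + 0.5 * e = 0.5 regardless of e.
        lik_ab *= 0.5
    return [lik_aa, lik_ab, lik_bb]


# Hypothetical evidence: two observations of haplotype A with 1% error each.
likelihoods = haplotype_likelihoods([("A", 0.01), ("A", 0.01)])
```

Because the observations are assumed independent, the per-observation terms simply multiply, which is why duplicate marking matters for this model.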

 \end{cases} \end{equation} Where $I_x$ is the indicator function of $x$ and the assumption is that an error will cause a switch in the interpretation of the haplotype between $A$ and $B$. This assumes that we throw away non-conformant haplotypes, and ignores the possibility a non-conformant haplotype erroneously looking conformant. (By conformant we mean either $A$ or $B$.)

Contributor

" ignores the possibility a non-conformant" --> " ignores the possibility of a non-conformant"

 At times, one knows that data from a particular sample is contaminated at a known level (that level can be estimated using VarifyBamID, or ContEst, for example). However, the identity of the contaminator is unknown. In this section we describe how the diploid Haplotype likelihood of the contaminator can be (sometimes) extracted from the data. For the calculation we will need the prior on the haplotypes, $p(\theta)$ (which can be calculated from the haplotype frequency by assuming Hardy-Weinberg equilibrium.

Contributor

the parentheses don't close
need one after Hardy-Weinberg equilibrium
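
The prior $p(\theta)$ mentioned in the snippet above can be sketched under Hardy-Weinberg equilibrium. This assumes a biallelic haplotype block with haplotypes $A$ and $B$; the function and parameter names are hypothetical and not from Picard.

```python
def hardy_weinberg_prior(freq_b):
    """Return [p(AA), p(AB), p(BB)] given the population frequency of haplotype B."""
    if not 0.0 <= freq_b <= 1.0:
        raise ValueError("haplotype frequency must lie in [0, 1]")
    freq_a = 1.0 - freq_b
    # Hardy-Weinberg: the two haplotypes of an individual are drawn independently.
    return [freq_a * freq_a, 2.0 * freq_a * freq_b, freq_b * freq_b]


# Hypothetical haplotype frequency of 0.3 for B.
prior = hardy_weinberg_prior(0.3)
```

The three values always sum to 1, since they expand $(f_A + f_B)^2$ with $f_A + f_B = 1$.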

 ... \subsection{LoH samples} When a sample comes from a tumor there is the possibility that it has undergone a loss of hetrozygosity (LoH) where large sections of chromosomes are lost (whole arms, and sometimes one copy of a whole chromosome can be lost).

Contributor

"hetrozygosity" --> "heterozygosity"

 ... \subsection{LoH samples} When a sample comes from a tumor there is the possibility that it has undergone a loss of hetrozygosity (LoH) where large sections of chromosomes are lost (whole arms, and sometimes one copy of a whole chromosome can be lost). This makes all the hetrozygous haplotypes (from the germline) in that region of the chromosome seem homozygous since the only evidence comes from the remaining copy.

Contributor

"hetrozygous" --> "heterozygous"

 \begin{equation} \boxed{\frac{\sum_\theta \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.} \end{equation} This needs to be understood in-terms of the objects in the Picard code-base, namely HapotypeProbability class and its descendants.

Contributor

"HapotypeProbability" --> "HaplotypeProbability"

 \boxed{\frac{\sum_\theta \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.} \end{equation} This needs to be understood in-terms of the objects in the Picard code-base, namely HapotypeProbability class and its descendants.

Contributor

is this finished? ends abruptly

 \section{Haplotype Likelihoods} In this section we describe how the haplotype likelihood $p(x|\theta)$ can be computed for various kinds of data. \subsection{Sequence data} For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)

Contributor

"data come from" --> "data comes from" or "data came from" ?

period needed after "...rarely needed)"

Contributor

still missing a period here

yfarjoun Jun 18, 2019

Author Contributor

data is plural

approved these changes
added 2 commits May 31, 2019
 -responding to review comments. 
 edccf24 
 - review response 
 7bb530c 
Contributor Author

yfarjoun commented May 31, 2019

approved these changes
Contributor

 I just pointed out a missing period, other than that I think it looks good!
 \section{Haplotype Likelihoods} In this section we describe how the haplotype likelihood $p(x|\theta)$ can be computed for various kinds of data. \subsection{Sequence data} For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)

Contributor

still missing a period here

Contributor

takutosato commented Jun 10, 2019

 👍
added 2 commits Jun 18, 2019
 - review comments 
 b43e5b1 
 - review comments 
 40f078c 

yfarjoun merged commit aefcf21 into master Jun 18, 2019

2 of 3 checks passed

code-review/pullapprove Approval required by 1 of: als364, apchagi, chrisfarnham, gbggrant, gordonwade, hensonc, infispiel, jacarey, jessicaway, kishorikonwar, ktib
Travis CI - Branch Build Passed
Travis CI - Pull Request Build Passed

yfarjoun deleted the yf_add_fingerprinting_whitepaper branch Jun 18, 2019

added a commit to mjhipp/picard that referenced this pull request Aug 1, 2019
 Added whitepaper describing fingerprinting Math (broadinstitute#1247) 
- Added a whitepaper describing fingerprinting Math
 8c97119