# broadinstitute/picard

# Added whitepaper describing fingerprinting Math #1247

Merged
yfarjoun merged 5 commits into master from yf_add_fingerprinting_whitepaper on Jun 18, 2019

## Conversation

Contributor

### yfarjoun commented Nov 1, 2018

 CheckFingerprints, CrosscheckFingerprints, and IdentifyContaminant are based on some probabilistic calculations that are not properly represented anywhere, really. This PR adds a short paper that describes the math.
 - Added whitepaper describing fingerprinting Math 
 a6597cc 

### coveralls commented Nov 1, 2018 • edited

 Coverage increased (+0.7%) to 82.137% when pulling 40f078c on yf_add_fingerprinting_whitepaper into 1342f59 on master.

### takutosato requested review from takutosato and maddyduran Dec 23, 2018

Contributor Author

### yfarjoun commented Apr 8, 2019

 @takutosato and @madduran are you going to get to this or should I find someone else to review?
Contributor

### takutosato commented Apr 9, 2019

 @yfarjoun we will review today
reviewed
Contributor

### maddyduran left a comment

 Looks good to me! My comments are about typos. After section 1.2 Contaminated Samples, things seem less complete? Is this intentional?
 \label{odds_swap} \frac{p(s = 1 \, | \, x,y)}{p(s = 0 \, | \, x,y)} = \frac{p(x,y \, | \, s = 1) \, p(s = 1)}{p(x,y \, | \, s = 0) \, p(s = 0)} \end{align} In particular, if sample swap rarely occurs then the posterior log odds of a swap is well-approximated by

#### maddyduran Apr 9, 2019

Contributor

"if sample swap" --> "if a sample swap"

 Here \eqref{total_prob} is the law of total probability, \eqref{xy_cond_indep_s} uses that $x$ and $y$ are conditionally independent of $s$ given $\theta$ and $\varphi$, \eqref{apply_xy_cond_indep} applies \eqref{xy_cond_indep}, and \eqref{apply_theta_phi_cond_indep} applies \eqref{theta_phi_cond_indep}. Substituting \eqref{apply_theta_phi_cond_indep} into \eqref{odds_swap}, we conclude that the posterior odds of a swap is: \begin{equation} \label{odds} \boxed{\frac{\sum_\theta p(x \, | \, \theta) \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} p(x \, | \, \theta) \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.}

#### maddyduran Apr 9, 2019

Contributor

is the period needed after "\frac{p(s=1)}{p(s=0)}"
it looks almost like the \cdot in the multiplication?
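
The boxed posterior-odds expression quoted in this thread can be checked numerically. The snippet below is a minimal sketch, not Picard's implementation: it assumes a single haplotype block with a small, shared set of diploid states, and the name `posterior_swap_odds` and its parameters are hypothetical.

```python
from typing import Sequence


def posterior_swap_odds(
    lik_x: Sequence[float],
    lik_y: Sequence[float],
    prior: Sequence[float],
    prior_swap: float,
) -> float:
    """Posterior odds of a swap (s=1) versus no swap (s=0) for one block.

    lik_x[i] = p(x | theta_i), lik_y[i] = p(y | theta_i), prior[i] = p(theta_i).
    prior_swap = p(s=1). All names are illustrative assumptions.
    """
    if not (len(lik_x) == len(lik_y) == len(prior)):
        raise ValueError("likelihoods and prior must have the same length")
    if not 0.0 < prior_swap < 1.0:
        raise ValueError("prior_swap must lie strictly between 0 and 1")

    # Numerator: the two samples are independent draws from the population.
    marginal_x = sum(lx * p for lx, p in zip(lik_x, prior))
    marginal_y = sum(ly * p for ly, p in zip(lik_y, prior))

    # Denominator: both samples share one underlying state (theta = phi).
    joint_same = sum(lx * ly * p for lx, ly, p in zip(lik_x, lik_y, prior))
    if joint_same == 0.0:
        raise ZeroDivisionError("data are impossible under the no-swap model")

    return (marginal_x * marginal_y) / joint_same * (prior_swap / (1.0 - prior_swap))
```

With equal prior probability of swap and no swap, likelihoods that favor the same state in both samples give odds below 1, and likelihoods that favor opposite homozygous states give odds above 1.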

 For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed) Sequence data arrives in the form of reads. We assume that evidence for haplotype $h_i\in\{A,B\}$ with probability of error $e_i\in(0,1)$ are given at a certain haplotype block. We further assume that said evidence is independent (so, for example, reads have been duplicate marked, and close SNPs from the same read-pair are not used twice)

#### maddyduran Apr 9, 2019

Contributor

period needed after "...not used twice)"

 \end{cases} \end{equation} Where $I_x$ is the indicator function of $x$ and the assumption is that an error will cause a switch in the interpretation of the haplotype between $A$ and $B$. This assumes that we throw away non-conformant haplotypes, and ignores the possibility a non-conformant haplotype erroneously looking conformant. (By conformant we mean either $A$ or $B$.)

#### maddyduran Apr 9, 2019

Contributor

" ignores the possibility a non-conformant" --> " ignores the possibility of a non-conformant"

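The quoted hunk describes per-read evidence for haplotype $A$ or $B$ with an error probability that flips the call, but the cases of the equation are cut off in the excerpt. The snippet below is therefore a sketch under an assumed standard form, not a transcription of the whitepaper: a homozygous state explains a matching read with probability $1 - e_i$ and a mismatching read with probability $e_i$, and the heterozygous state explains either read with probability $1/2$. The function name and the state labels `AA`, `AB`, `BB` are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def haplotype_likelihoods(
    observations: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Likelihood of independent read evidence under each diploid state.

    Each observation is (h, e): the observed haplotype 'A' or 'B' and its
    error probability e in (0, 1). The per-read terms are an assumed model.
    """
    lik = {"AA": 1.0, "AB": 1.0, "BB": 1.0}
    for h, e in observations:
        if h not in ("A", "B"):
            raise ValueError("non-conformant haplotypes must be filtered out first")
        if not 0.0 < e < 1.0:
            raise ValueError("error probability must lie strictly between 0 and 1")
        # Homozygous states: a mismatching read is explained only by an error.
        lik["AA"] *= (1.0 - e) if h == "A" else e
        lik["BB"] *= (1.0 - e) if h == "B" else e
        # Heterozygous: 0.5 * (1 - e) + 0.5 * e = 0.5 regardless of e.
        lik["AB"] *= 0.5
    return lik
```

Multiplying the per-read terms relies on the independence assumption stated in the quoted text (duplicate-marked reads, nearby SNPs from one read pair not counted twice).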
 At times, one knows that data from a particular sample is contaminated at a known level (that level can be estimated using VarifyBamID, or ContEst, for example). However, the identity of the contaminator is unknown. In this section we describe how the diploid Haplotype likelihood of the contaminator can be (sometimes) extracted from the data. For the calculation we will need the prior on the haplotypes, $p(\theta)$ (which can be calculated from the haplotype frequency by assuming Hardy-Weinberg equilibrium.

#### maddyduran Apr 9, 2019

Contributor

the parentheses don't close
need one after Hardy-Weinberg equilibrium
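
For readers following the quoted passage, the prior $p(\theta)$ obtained from a haplotype frequency under Hardy-Weinberg equilibrium can be sketched as below. The function name, the state labels, and the choice to parameterize by the frequency of haplotype $B$ are assumptions made for illustration.

```python
from typing import Dict


def hardy_weinberg_prior(freq_b: float) -> Dict[str, float]:
    """Prior over diploid states given the population frequency of haplotype B.

    Assumes Hardy-Weinberg equilibrium: the two copies are drawn independently.
    """
    if not 0.0 <= freq_b <= 1.0:
        raise ValueError("freq_b must lie in [0, 1]")
    freq_a = 1.0 - freq_b
    return {
        "AA": freq_a * freq_a,        # both copies are A
        "AB": 2.0 * freq_a * freq_b,  # one copy of each, in either order
        "BB": freq_b * freq_b,        # both copies are B
    }
```

The three probabilities sum to 1 for any valid frequency, since they expand $(f_A + f_B)^2$.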

 ... \subsection{LoH samples} When a sample comes from a tumor there is the possibility that it has undergone a loss of hetrozygosity (LoH) where large sections of chromosomes are lost (whole arms, and sometimes one copy of a whole chromosome can be lost).

#### maddyduran Apr 9, 2019

Contributor

"hetrozygosity" --> "heterozygosity"

 ... \subsection{LoH samples} When a sample comes from a tumor there is the possibility that it has undergone a loss of hetrozygosity (LoH) where large sections of chromosomes are lost (whole arms, and sometimes one copy of a whole chromosome can be lost). This makes all the hetrozygous haplotypes (from the germline) in that region of the chromosome seem homozygous since the only evidence comes from the remaining copy.

#### maddyduran Apr 9, 2019

Contributor

"hetrozygous" --> "heterozygous"

 \begin{equation} \boxed{\frac{\sum_\theta \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.} \end{equation} This needs to be understood in-terms of the objects in the Picard code-base, namely HapotypeProbability class and its descendants.

#### maddyduran Apr 9, 2019

Contributor

"HapotypeProbability" --> "HaplotypeProbability"

 \boxed{\frac{\sum_\theta \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.} \end{equation} This needs to be understood in-terms of the objects in the Picard code-base, namely HapotypeProbability class and its descendants.

#### maddyduran Apr 9, 2019

Contributor

is this finished? ends abruptly

 \section{Haplotype Likelihoods} In this section we describe how the haplotype likelihood $p(x|\theta)$ can be computed for various kinds of data. \subsection{Sequence data} For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)

#### maddyduran Apr 9, 2019

Contributor

"data come from" --> "data comes from" or "data came from" ?

period needed after "...rarely needed)"

#### maddyduran Jun 10, 2019

Contributor

still missing a period here

#### yfarjoun Jun 18, 2019

Author Contributor

data is plural

approved these changes
added 2 commits May 31, 2019
 -responding to review comments. 
 edccf24 
 - review response 
 7bb530c 
Contributor Author

### yfarjoun commented May 31, 2019

 a bit slow...but I responded...please re-read.
approved these changes
Contributor

### maddyduran left a comment

 I just pointed out a missing period, other than that I think it looks good!
 \section{Haplotype Likelihoods} In this section we describe how the haplotype likelihood $p(x|\theta)$ can be computed for various kinds of data. \subsection{Sequence data} For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)

#### maddyduran Jun 10, 2019

Contributor

still missing a period here

Contributor

### takutosato commented Jun 10, 2019

 👍
added 2 commits Jun 18, 2019
 - review comments 
 b43e5b1 
 - review comments 
 40f078c 

### yfarjoun merged commit aefcf21 into master Jun 18, 2019 2 of 3 checks passed

#### 2 of 3 checks passed

code-review/pullapprove Approval required by 1 of: als364, apchagi, chrisfarnham, gbggrant, gordonwade, hensonc, infispiel, jacarey, jessicaway, kishorikonwar, ktib
Travis CI - Branch Build Passed
Travis CI - Pull Request Build Passed

### yfarjoun deleted the yf_add_fingerprinting_whitepaper branch Jun 18, 2019

added a commit to mjhipp/picard that referenced this pull request Aug 1, 2019
 Added whitepaper describing fingerprinting Math (broadinstitute#1247) 
- Added a whitepaper describing fingerprinting Math
 8c97119 
4 participants