Added whitepaper describing fingerprinting Math #1247

Merged
merged 5 commits into master from yf_add_fingerprinting_whitepaper on Jun 18, 2019

Conversation

@yfarjoun
Contributor

commented Nov 1, 2018

CheckFingerprints, CrosscheckFingerprints, and IdentifyContaminant are based on some probabilistic calculations that are not properly represented anywhere, really.

This PR adds a short paper that describes the math.

@coveralls

commented Nov 1, 2018

Coverage Status

Coverage increased (+0.7%) to 82.137% when pulling 40f078c on yf_add_fingerprinting_whitepaper into 1342f59 on master.

@yfarjoun

Contributor Author

commented Apr 8, 2019

@takutosato and @madduran are you going to get to this or should I find someone else to review?

@takutosato

Contributor

commented Apr 9, 2019

@yfarjoun we will review today

@maddyduran
Contributor

left a comment

Looks good to me! My comments are about typos. After section 1.2 Contaminated Samples, things seem less complete? Is this intentional?

\label{odds_swap}
\frac{p(s = 1 \, | \, x,y)}{p(s = 0 \, | \, x,y)} = \frac{p(x,y \, | \, s = 1) \, p(s = 1)}{p(x,y \, | \, s = 0) \, p(s = 0)}
\end{align}
In particular, if sample swap rarely occurs then the posterior log odds of a swap is well-approximated by

maddyduran Apr 9, 2019

Contributor

"if sample swap" --> "if a sample swap"

Here \eqref{total_prob} is the law of total probability, \eqref{xy_cond_indep_s} uses that $x$ and $y$ are conditionally independent of $s$ given $\theta$ and $\varphi$, \eqref{apply_xy_cond_indep} applies \eqref{xy_cond_indep}, and \eqref{apply_theta_phi_cond_indep} applies \eqref{theta_phi_cond_indep}. Substituting \eqref{apply_theta_phi_cond_indep} into \eqref{odds_swap}, we conclude that the posterior odds of a swap is:
\begin{equation}
\label{odds}
\boxed{\frac{\sum_\theta p(x \, | \, \theta) \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} p(x \, | \, \theta) \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.}

maddyduran Apr 9, 2019

Contributor

is the period needed after "\frac{p(s=1)}{p(s=0)}"
it looks almost like the \cdot in the multiplication?
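
As a rough illustration of the boxed posterior-odds expression quoted above, here is a minimal Python sketch for a single haplotype block. It assumes three diploid states (AA, AB, BB) and that both samples share the same prior, so $p(\varphi)$ and $p(\theta)$ come from one list. The function and argument names are hypothetical and are not taken from the Picard code base.

```python
def posterior_odds_of_swap(lik_x, lik_y, prior, prior_odds):
    """Posterior odds of a swap (s=1 vs. s=0) for one haplotype block.

    lik_x[i]   -- p(x | theta_i), likelihood of the first data set
    lik_y[i]   -- p(y | phi_i), likelihood of the second data set
    prior[i]   -- p(theta_i), assumed identical for both samples
    prior_odds -- p(s=1) / p(s=0)
    """
    # Numerator: under a swap the two samples are independent draws
    # from the prior, so the marginals of x and y factorize.
    marginal_x = sum(lx * p for lx, p in zip(lik_x, prior))
    marginal_y = sum(ly * p for ly, p in zip(lik_y, prior))

    # Denominator: without a swap both data sets share one state
    # (the sum over theta = phi in the boxed expression).
    joint_same = sum(lx * ly * p for lx, ly, p in zip(lik_x, lik_y, prior))

    return marginal_x * marginal_y / joint_same * prior_odds


# Hypothetical numbers: x strongly supports AA, y strongly supports BB.
prior = [0.25, 0.5, 0.25]
odds = posterior_odds_of_swap([1.0, 0.0, 0.0], [0.01, 0.01, 0.98], prior, 1.0)
print(odds)
```

The sketch divides by the shared-state term directly, so it fails if the two likelihood vectors have no overlapping support; a practical implementation would more likely work with log-likelihoods.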

For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)
Sequence data arrives in the form of reads.
We assume that evidence for haplotype $h_i\in\{A,B\}$ with probability of error $e_i\in(0,1)$ are given at a certain haplotype block.
We further assume that said evidence is independent (so, for example, reads have been duplicate marked, and close SNPs from the same read-pair are not used twice)

maddyduran Apr 9, 2019

Contributor

period needed after "...not used twice)"

\end{cases}
\end{equation}
Where $I_x$ is the indicator function of $x$ and the assumption is that an error will cause a switch in the interpretation of the haplotype between $A$ and $B$.
This assumes that we throw away non-conformant haplotypes, and ignores the possibility a non-conformant haplotype erroneously looking conformant. (By conformant we mean either $A$ or $B$.)

maddyduran Apr 9, 2019

Contributor

" ignores the possibility a non-conformant" --> " ignores the possibility of a non-conformant"
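
The per-read likelihood equation itself is cut off in the quoted hunk, so the following Python sketch is only one model consistent with the assumptions stated there: each piece of evidence is independent, and an error flips the observed haplotype between $A$ and $B$. The treatment of the heterozygous state (each haplotype sampled with probability 1/2) and all names in the code are assumptions made for illustration, not the whitepaper's or Picard's definitions.

```python
import math

GENOTYPES = ("AA", "AB", "BB")


def haplotype_likelihoods(evidence):
    """Return p(x | theta) for theta in AA, AB, BB.

    evidence -- iterable of (h, e) pairs, where h is "A" or "B" and
                e in (0, 1) is the error probability of that observation.
    """
    log_lik = {g: 0.0 for g in GENOTYPES}
    for h, e in evidence:
        # Probability of observing h when the true haplotype is A (or B):
        # correct with probability 1 - e, flipped with probability e.
        p_given_a = 1.0 - e if h == "A" else e
        p_given_b = 1.0 - e if h == "B" else e

        log_lik["AA"] += math.log(p_given_a)
        # Assumption: a heterozygote yields A or B with equal probability.
        log_lik["AB"] += math.log(0.5 * p_given_a + 0.5 * p_given_b)
        log_lik["BB"] += math.log(p_given_b)

    # Independence of the evidence means the per-read terms multiply,
    # i.e. their logs add.
    return {g: math.exp(v) for g, v in log_lik.items()}


# Hypothetical evidence: two reads supporting A, each with 1% error.
print(haplotype_likelihoods([("A", 0.01), ("A", 0.01)]))
```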

At times, one knows that data from a particular sample is contaminated at a known level (that level can be estimated using VarifyBamID, or ContEst, for example).
However, the identity of the contaminator is unknown.
In this section we describe how the diploid Haplotype likelihood of the contaminator can be (sometimes) extracted from the data.
For the calculation we will need the prior on the haplotypes, $p(\theta)$ (which can be calculated from the haplotype frequency by assuming Hardy-Weinberg equilibrium.

maddyduran Apr 9, 2019

Contributor

the parentheses don't close
need one after Hardy-Weinberg equilibrium
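
For reference, the Hardy-Weinberg prior mentioned in the quoted passage can be sketched in a few lines of Python. This assumes a biallelic haplotype block with states AA, AB, BB and takes `freq_b` to be the population frequency of haplotype B; both the function name and that parameterization are hypothetical choices for this example.

```python
def hardy_weinberg_prior(freq_b):
    """Prior p(theta) over diploid states, assuming Hardy-Weinberg equilibrium.

    freq_b -- population frequency of haplotype B, in [0, 1].
    """
    if not 0.0 <= freq_b <= 1.0:
        raise ValueError("freq_b must lie in [0, 1]")
    freq_a = 1.0 - freq_b
    return {
        "AA": freq_a * freq_a,        # two independent draws of A
        "AB": 2.0 * freq_a * freq_b,  # A then B, or B then A
        "BB": freq_b * freq_b,        # two independent draws of B
    }


print(hardy_weinberg_prior(0.5))
```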


...
\subsection{LoH samples}
When a sample comes from a tumor there is the possibility that it has undergone a loss of hetrozygosity (LoH) where large sections of chromosomes are lost (whole arms, and sometimes one copy of a whole chromosome can be lost).

maddyduran Apr 9, 2019

Contributor

"hetrozygosity" --> "heterozygosity"

...
\subsection{LoH samples}
When a sample comes from a tumor there is the possibility that it has undergone a loss of hetrozygosity (LoH) where large sections of chromosomes are lost (whole arms, and sometimes one copy of a whole chromosome can be lost).
This makes all the hetrozygous haplotypes (from the germline) in that region of the chromosome seem homozygous since the only evidence comes from the remaining copy.

maddyduran Apr 9, 2019

Contributor

"hetrozygous" --> "heterozygous"

\begin{equation}
\boxed{\frac{\sum_\theta \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.}
\end{equation}
This needs to be understood in-terms of the objects in the Picard code-base, namely HapotypeProbability class and its descendants.

maddyduran Apr 9, 2019

Contributor

"HapotypeProbability" --> "HaplotypeProbability"

\boxed{\frac{\sum_\theta \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(\theta) \, \sum_{\varphi} p(y \, | \, \varphi) \, p(\varphi)}{\sum_{\theta = \varphi} \sum_{\theta'} p(x' \, | \, \theta')T_{\theta}^{\theta'} \, p(y \, | \, \varphi) \, p(\theta)} \cdot \frac{p(s=1)}{p(s=0)}.}
\end{equation}
This needs to be understood in-terms of the objects in the Picard code-base, namely HapotypeProbability class and its descendants.

maddyduran Apr 9, 2019

Contributor

is this finished? ends abruptly

\section{Haplotype Likelihoods}
In this section we describe how the haplotype likelihood $p(x|\theta)$ can be computed for various kinds of data.
\subsection{Sequence data}
For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)

maddyduran Apr 9, 2019

Contributor

"data come from" --> "data comes from" or "data came from" ?

period needed after "...rarely needed)"

maddyduran Jun 10, 2019

Contributor

still missing a period here

yfarjoun Jun 18, 2019

Author Contributor

data is plural

@yfarjoun

Contributor Author

commented May 31, 2019

a bit slow...but I responded...please re-read.

@maddyduran
Contributor

left a comment

I just pointed out a missing period, other than that I think it looks good!

\section{Haplotype Likelihoods}
In this section we describe how the haplotype likelihood $p(x|\theta)$ can be computed for various kinds of data.
\subsection{Sequence data}
For sequence data, we assume that the data come from a single individual (i.e. not contaminated, see subsection below) and that there is no reference bias (can correct for that, but it's rarely needed)

maddyduran Jun 10, 2019

Contributor

still missing a period here

@takutosato

Contributor

commented Jun 10, 2019

👍

@yfarjoun yfarjoun merged commit aefcf21 into master Jun 18, 2019

2 of 3 checks passed

code-review/pullapprove Approval required by 1 of: als364, apchagi, chrisfarnham, gbggrant, gordonwade, hensonc, infispiel, jacarey, jessicaway, kishorikonwar, ktib
Details
Travis CI - Branch Build Passed
Details
Travis CI - Pull Request Build Passed
Details

@yfarjoun yfarjoun deleted the yf_add_fingerprinting_whitepaper branch Jun 18, 2019

mjhipp added a commit to mjhipp/picard that referenced this pull request Aug 1, 2019
Added whitepaper describing fingerprinting Math (broadinstitute#1247)
- Added a whitepaper describing fingerprinting Math