## This notebook explains the theory behind microbial source tracking using a Gibbs sampler.

## The theory and examples are based on work detailed in [Knights et al. 2011](http://www.nature.com/nmeth/journal/v8/n9/abs/nmeth.1650.html). If you find this work helpful, please consider citing Dan Knights' paper.


#### Note:
The formula for calculating the probability that a held-out sequence will be assigned to a given environment is reported incorrectly in [Knights et al. 2011](http://www.nature.com/nmeth/journal/v8/n9/abs/nmeth.1650.html). The corrected formula is:

$$P( z_{i} = v \mid  z^{ \neg i}, x)  \propto P(x_{i} \mid v)  \times P(v \mid x^{ \neg i}) = \begin{cases}
\frac{m_{x_{i}v} + \alpha_{1}}{m_{v} + \tau \alpha_{1}} \times \frac{n_{v}^{ \neg i} + \beta}{n - 1 + \beta V} & v < V
\\
\\
\frac{m_{x_{i}v} + n \alpha_{2}}{m_{v} + \tau n \alpha_{2}} \times \frac{n_{v}^{ \neg i} + \beta}{n - 1 + \beta V} & v = V
\end{cases}$$

This corrected formula is what both the [R sourcetracking algorithm](https://github.com/danknights/sourcetracker) and this repository actually compute (personal communication with Dan Knights).
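
As a rough illustration, the corrected formula can be evaluated for a single held-out sequence with a few lines of NumPy. This is a minimal sketch and not the code used by either implementation. The function name and argument names are hypothetical, and the reading of the symbols is an assumption taken from Knights et al. 2011: $m_{x_{i}v}$ is the count of taxon $x_{i}$ in environment $v$, $m_{v}$ is the total count in environment $v$, $\tau$ is the number of taxa, $n$ is the number of sequences in the sink, $n_{v}^{\neg i}$ is the number of sink sequences currently assigned to $v$ excluding sequence $i$, and $V$ is the number of environments with the unknown environment last.

```python
import numpy as np


def conditional_probabilities(taxon, source_counts, unknown_counts,
                              env_counts_excl, alpha1, alpha2, beta):
    """Evaluate P(z_i = v | z^{not i}, x) for every environment v.

    All names here are hypothetical and exist only for this sketch.
    taxon           : index of the held-out sequence's taxon (x_i)
    source_counts   : (V - 1, tau) taxon counts of the known sources
    unknown_counts  : (tau,) taxon counts of sink sequences currently
                      assigned to the unknown environment (excluding i)
    env_counts_excl : (V,) sink sequences assigned to each environment,
                      excluding sequence i (n_v^{not i})
    """
    n_known, tau = source_counts.shape
    V = n_known + 1
    n = env_counts_excl.sum() + 1  # sink depth, held-out sequence included

    # P(x_i | v) for the known sources (v < V).
    p_known = ((source_counts[:, taxon] + alpha1) /
               (source_counts.sum(axis=1) + tau * alpha1))
    # P(x_i | v) for the unknown source (v = V).
    p_unknown = ((unknown_counts[taxon] + n * alpha2) /
                 (unknown_counts.sum() + tau * n * alpha2))
    p_x_given_v = np.append(p_known, p_unknown)

    # P(v | x^{not i}), identical in form for every environment.
    p_v = (env_counts_excl + beta) / (n - 1 + beta * V)

    # The formula is a proportionality, so normalise before sampling.
    joint = p_x_given_v * p_v
    return joint / joint.sum()


# Toy data: two known sources, two taxa, nine other sink sequences.
probs = conditional_probabilities(
    taxon=0,
    source_counts=np.array([[8, 2], [1, 9]]),
    unknown_counts=np.array([0, 0]),
    env_counts_excl=np.array([3, 3, 3]),
    alpha1=1.0, alpha2=0.1, beta=1.0)
```

The toy numbers are invented for illustration. Because taxon 0 dominates the first source, that source receives the largest share of the probability.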


## Problem

Given a number of `sources`, determine the contribution of each `source` to a `sink`. 

The Gibbs sampler works in four basic steps:

1. Randomly assign the sequences in a sink to source environments. These random assignments represent which source a given sink sequence came from.
2. Select one of the sequences from step 1, calculate the *actual* probabilities of that sequence having come from each of the source environments, and update the assigned source environment of the sequence based on a random draw with the calculated probabilities. Repeat many times.
3. At intervals in the repeats of step 2, take the source environment assignments of all the sequences in a sink and record them.
4. After doing step 3 a certain number of times (i.e. recording the full assignments of source environments for each sink sequence), terminate the iteration and move to the next sink.
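
The four steps can be sketched for a single sink roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in this repository: the function name, the parameter names (`burnin`, `draws`, `delay`) and all default values are hypothetical, the unknown environment is stored as the last index, and each sweep visits every sink sequence once in random order rather than picking sequences one at a time.

```python
import numpy as np


def gibbs_sketch(sink, sources, alpha1=0.001, alpha2=0.1, beta=10,
                 burnin=20, draws=5, delay=2, seed=0):
    """Return a (draws, V) array of recorded mixing proportions.

    sink    : (tau,) taxon counts of one sink
    sources : (V - 1, tau) taxon counts of the known sources
    All names and default values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    n_known, tau = sources.shape
    V = n_known + 1                       # known sources plus unknown
    seq_taxa = np.repeat(np.arange(tau), sink)  # one entry per sequence
    n = seq_taxa.size
    source_totals = sources.sum(axis=1)

    # Step 1: random initial assignment of every sink sequence.
    z = rng.integers(0, V, size=n)
    env_counts = np.bincount(z, minlength=V).astype(float)
    unknown = np.bincount(seq_taxa[z == V - 1], minlength=tau).astype(float)

    recorded = []
    total_sweeps = burnin + (draws - 1) * delay + 1
    for sweep in range(total_sweeps):
        # Step 2: hold out each sequence, compute its conditional
        # probabilities, and redraw its source assignment.
        for i in rng.permutation(n):
            t = seq_taxa[i]
            env_counts[z[i]] -= 1
            if z[i] == V - 1:
                unknown[t] -= 1

            p_known = (sources[:, t] + alpha1) / (source_totals + tau * alpha1)
            p_unknown = ((unknown[t] + n * alpha2) /
                         (unknown.sum() + tau * n * alpha2))
            p_v = (env_counts + beta) / (n - 1 + beta * V)
            joint = np.append(p_known, p_unknown) * p_v

            z[i] = rng.choice(V, p=joint / joint.sum())
            env_counts[z[i]] += 1
            if z[i] == V - 1:
                unknown[t] += 1

        # Step 3: after burn-in, record the assignments every `delay` sweeps.
        if sweep >= burnin and (sweep - burnin) % delay == 0:
            recorded.append(env_counts / n)

    # Step 4: stop once `draws` recordings exist; the caller moves on
    # to the next sink.
    return np.array(recorded)


# Toy data: two known sources, three taxa, a sink of 100 sequences.
toy_sources = np.array([[90, 10, 0], [0, 10, 90]])
toy_sink = np.array([45, 10, 45])
proportions = gibbs_sketch(toy_sink, toy_sources)
```

Each row of `proportions` is one recorded draw of the fraction of sink sequences assigned to each environment, so every row sums to one. Averaging the rows gives a point estimate of the source contributions for this toy sink.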

## Machinery

There are two matrices that form the basis of the Gibbs calculation.

### `p_tv`
The first is `p_tv`, the probability of seeing a taxon, given any of the sources. This matrix is calculated with the following sequence



## Terminology

**`Source`**: an