# Waiting distance likelihoods (stats only)
This notebook explores how to weight likelihood calculations for ancestral recombination graphs (ARGs).

In [1]:
from scipy import stats
import toytree
import ipcoal
import numpy as np
import toyplot

### Waiting distances
Given a recombination rate ($r$) and the summed branch lengths of each genealogy in an ARG ($L$), we can calculate the expected waiting distance between recombination events. Here these are calculated as a vector of rate parameters ($\Lambda$), one per genealogy, where each rate is $r \times L$ and the expected waiting distance is its reciprocal.
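
As a minimal sketch of this calculation, assume a hypothetical array of summed branch lengths (the values below are made up for illustration, not taken from a simulation). Each genealogy's rate is the recombination rate times its summed branch length, and its expected waiting distance is the reciprocal of that rate:

```python
import numpy as np

# per-site per-generation recombination rate (same value used later in this notebook)
recomb_rate = 2e-8

# hypothetical summed branch lengths (in generations) of four genealogies
sum_edge_lengths = np.array([0.8e6, 1.0e6, 1.2e6, 2.0e6])

# vector of exponential rate parameters, one per genealogy
lambdas = recomb_rate * sum_edge_lengths

# expected waiting distance (in sites) until the next recombination event
expected_dists = 1 / lambdas

print(lambdas)
print(expected_dists)
```

Genealogies with longer summed branch lengths have higher rates and therefore shorter expected waiting distances.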

#### Stats example
**Question**: what is the likelihood of this observation given the model?  
**The observation**: the breakpoints (interval lengths) on a continuous sequence of finite length.  
**The model**: breakpoints are added moving from left to right as waiting distances until the next event given a global rate parameter $\lambda$.

#### Example 1
Generate breakpoints along a sequence of length 1e6 and fit a model to this observation. Then test alternative rates (models) on the same data (observation). This demonstrates that we can compare the likelihoods of different parameter values to find the best fit to an observed ARG.

In [2]:
def get_breakpoint_rvs(rate, max_length, random_seed):
    """Return a sequence of distances sampled from an exponential distribution that sum to less than max_length."""
    mean_len = 1 / rate
    buffer = mean_len / 2.
    
    # calculate many waiting dists as random variables
    dists = stats.expon.rvs(scale=1 / rate, size=int(max_length / buffer), random_state=random_seed)
    
    # get index of the rv that puts the summed value over the finite limit of max_length
    idx = 0
    sumdist = 0
    while 1:
        sumdist += dists[idx]
        if sumdist >= max_length:
            break
        idx += 1

    # subset the distances to those ending before max_length (this slice also drops the final one that fits)
    return dists[:idx - 1]

In [3]:
# example
get_breakpoint_rvs(rate=1 / 50, max_length=1e5, random_seed=123)

array([59.61360717, 16.85337413, 12.86420401, ..., 19.66368561,
        9.51715173, 25.70912602])

In [4]:
# set global rate as recomb rate * avg gtree sum edges in generations
RECOMB_RATE = 2e-8
MEAN_GTREE_SUM_EDGES = 1e6 
WAITING_DIST_RATE = RECOMB_RATE * MEAN_GTREE_SUM_EDGES
CHROMLEN = 1e6

# example observation dataset
observation = get_breakpoint_rvs(WAITING_DIST_RATE, CHROMLEN, 123)

# confirm observation sums less than CHROMLEN, but within 10%
assert observation.sum() < CHROMLEN
assert np.allclose(observation.sum(), CHROMLEN, rtol=0.1)

# calculate the negative log-likelihood of the observations given the true rate (lower is better)
print(f"RATE = {WAITING_DIST_RATE:.4f}, loglik = {-stats.expon.logpdf(observation, scale=1 / WAITING_DIST_RATE).sum():.3f} TRUE")

# calculate the negative log-likelihood under alternative rates (should be worse, i.e., higher)
for rate in np.linspace(WAITING_DIST_RATE / 2, WAITING_DIST_RATE * 2, 10):
    print(f"RATE = {rate:.4f}, loglik = {-stats.expon.logpdf(observation, scale=1 / rate).sum():.3f}")

RATE = 0.0200, loglik = 98559.923 TRUE
RATE = 0.0100, loglik = 102480.366
RATE = 0.0133, loglik = 100036.248
RATE = 0.0167, loglik = 98888.192
RATE = 0.0200, loglik = 98559.923
RATE = 0.0233, loglik = 98797.382
RATE = 0.0267, loglik = 99448.917
RATE = 0.0300, loglik = 100416.711
RATE = 0.0333, loglik = 101633.974
RATE = 0.0367, loglik = 103053.068
RATE = 0.0400, loglik = 104638.818
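
The grid search above can be checked against a closed-form estimate. For independent exponential waiting distances the log-likelihood is $n \log \lambda - \lambda \sum x$, which is maximized at $\lambda = n / \sum x$. The sketch below draws its own sample directly with `stats.expon.rvs` (rather than `get_breakpoint_rvs`) so that it is self-contained; the sample size and seed are arbitrary choices, and `neg_loglik` is a helper written for this example.

```python
import numpy as np
from scipy import stats

true_rate = 0.02

# arbitrary sample of waiting distances drawn under the true rate
dists = stats.expon.rvs(scale=1 / true_rate, size=20_000, random_state=123)


def neg_loglik(rate, dists):
    """Negative log-likelihood of exponential waiting distances (lower is better)."""
    return -stats.expon.logpdf(dists, scale=1 / rate).sum()


# closed-form maximum likelihood estimate of the rate
mle_rate = dists.size / dists.sum()

# score the same grid of alternative rates used in the cell above
grid = np.linspace(true_rate / 2, true_rate * 2, 10)
grid_scores = [neg_loglik(rate, dists) for rate in grid]

print(f"MLE rate = {mle_rate:.4f}, neg loglik = {neg_loglik(mle_rate, dists):.3f}")
print(f"best grid rate = {grid[np.argmin(grid_scores)]:.4f}")
```

The closed-form estimate should score at least as well as every rate on the grid, which gives a quick sanity check on the grid comparison.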


#### Example 2
Now suppose we generate many different ARGs under the same model (i.e., the same model parameters) and want to compare them to ask which are a better fit to the model. Is this a valid test, or does it always favor the observation that just happens to have the most breakpoints? Simply using the sum or mean of the log-likelihoods of the interval lengths (plots 1-2) clearly does not work: the score depends almost entirely on how many breakpoints an observation happens to have. However, weighting each likelihood calculation per unit, by dividing by the length of the interval, does appear to give a distribution that is not biased by observation size (plots 3-4), which is encouraging. But does this truly show that some observations are better than others, and what is the proper absolute likelihood score to calculate?

In [5]:
NREPS = 1000

# generate data under this rate (e.g., sometimes few long intervals, other times many shorter intervals)
observations = [get_breakpoint_rvs(WAITING_DIST_RATE, CHROMLEN, i) for i in range(NREPS)]

# calculate likelihood across replicates for this data given the one true rate
logliks = [stats.expon.logpdf(i, scale=1 / WAITING_DIST_RATE) for i in observations]

# plot 1: obviously bad
toyplot.scatterplot(
    [i.sum() for i in logliks], [i.size for i in observations], 
    width=250, height=250, size=6, opacity=0.5, label='plot 1', xlabel="loglik", ylabel="nbreakpoints");

# plot 2: also bad
toyplot.scatterplot(
    [i.mean() for i in logliks], [i.size for i in observations], 
    width=250, height=250, size=6, opacity=0.5, label='plot 2', xlabel="loglik", ylabel="nbreakpoints");

# plot 3: looking valid
toyplot.scatterplot(
    [np.sum(i / j) for (i, j) in zip(logliks, observations)], [i.size for i in observations], 
    width=250, height=250, size=6, opacity=0.5, label='plot 3', xlabel="loglik", ylabel="nbreakpoints");

# plot 4: ...
# toyplot.scatterplot(
#     [np.mean(i / j) for (i, j) in zip(logliks, observations)], [i.size for i in observations], 
#     width=250, height=250, size=6, opacity=0.5, label='plot 4', xlabel="loglik", ylabel="nbreakpoints");

# # plot 5: but what should the absolute likelihood value be (sum, mean, or something else?)
# toyplot.scatterplot(
#     [np.sum(i * (j / sum(j))) for (i, j) in zip(logliks, observations)], [i.size for i in observations], 
#     width=250, height=250, size=6, opacity=0.5, label='plot 5', xlabel="loglik", ylabel="nbreakpoints");

# # plot 6: but what should the absolute likelihood value be (sum, mean, or something else?)
# toyplot.scatterplot(
#     [np.mean(i * (j / sum(j))) for (i, j) in zip(logliks, observations)], [i.size for i in observations], 
#     width=250, height=250, size=6, opacity=0.5, label='plot 6', xlabel="loglik", ylabel="nbreakpoints");
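
One way to see why the summed log-likelihood in plot 1 tracks the number of breakpoints: for exponential waiting distances the sum reduces to $n \log \lambda - \lambda \sum x$. Because every replicate nearly fills the same sequence length, $\sum x$ is roughly constant and the score varies almost linearly (and negatively) with $n$. The sketch below checks this on a few replicates. Note that `sample_intervals` is a simplified stand-in for `get_breakpoint_rvs` written only for this example, and the replicate count is arbitrary.

```python
import numpy as np
from scipy import stats

RATE = 0.02
CHROMLEN = 1e6


def sample_intervals(rate, max_length, seed):
    """Simplified stand-in: keep waiting distances whose running sum stays below max_length."""
    rng = np.random.default_rng(seed)
    # draw about twice as many distances as expected so the running sum overshoots
    dists = rng.exponential(scale=1 / rate, size=int(2 * max_length * rate))
    return dists[np.cumsum(dists) < max_length]


reps = [sample_intervals(RATE, CHROMLEN, seed) for seed in range(50)]
sizes = np.array([rep.size for rep in reps])
sum_logliks = np.array([stats.expon.logpdf(rep, scale=1 / RATE).sum() for rep in reps])

# closed form of the summed log-likelihood: n * log(rate) - rate * sum(x)
closed_form = np.array([rep.size * np.log(RATE) - RATE * rep.sum() for rep in reps])
print(np.allclose(sum_logliks, closed_form))

# correlation between the number of intervals and the summed log-likelihood
print(np.corrcoef(sizes, sum_logliks)[0, 1])
```

This only explains the pattern in the unweighted sum; it does not by itself say which weighting is the right one.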

Let's say we go with the method in plot 3 above. Next, let's check whether the observations with better log-likelihood scores have mean waiting distances closer to the expected mean. This doesn't tell us much: they are all close to the expected mean, so we'll have to examine something else.

In [7]:
mlogliks = [np.sum(i / j) for (i, j) in zip(logliks, observations)]
meandists = [np.mean(j) for j in observations]

# mean waiting distance does not correlate with loglik score
toyplot.scatterplot(
    mlogliks, meandists,
    width=250, height=250, size=6, opacity=0.25, label='plot 3', xlabel="loglik", ylabel="mean dist", xscale="log");


### Comparing different datasets using same model
The entropy, or relative entropy (KL divergence), of a distribution can tell us about the information it contains. Here we examine the entropy of distributions of likelihoods, where the likelihoods of several different datasets are each evaluated under one assumed model (fixed $\lambda$). In the results below, the dataset generated under this $\lambda$ has the highest entropy, whereas datasets generated under a different $\lambda$ have lower entropies. This suggests a way to compare different ARGs given one fixed rate $\lambda$.

In [9]:
for x in [0.2, 0.5, 1, 2, 5]:
    
    # generate alt data under a rate scaled by x (slower when x < 1, faster when x > 1)
    alt_observations = [get_breakpoint_rvs(WAITING_DIST_RATE * x, CHROMLEN, i) for i in range(NREPS)]
    
    # calculate likelihood across replicates for some other data given the true rate
    alt_logliks = [stats.expon.logpdf(i, scale=1 / WAITING_DIST_RATE) for i in alt_observations]
    
    # get summed logliks, each weighted by dividing by its interval length (as in plot 3)
    alt_mlogliks = [np.sum(i / j) for (i, j) in zip(alt_logliks, alt_observations)]

    # show results across several different data sets
    print(f"rate={WAITING_DIST_RATE * x:.3f}, {stats.entropy(alt_mlogliks):.3f}")

rate=0.004, 6.2044183034661105
rate=0.010, 6.417084892053115
rate=0.020, 6.54135140698647
rate=0.040, 6.345547787975862
rate=0.100, 3.9321297734646072
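
For context when reading these numbers: `scipy.stats.entropy` normalizes its input to sum to 1 and returns the Shannon entropy in nats. For `NREPS = 1000` values the largest possible score is therefore $\ln(1000) \approx 6.908$, reached only when all values are equal, so the score measures how evenly the values are spread across replicates. A small sketch with made-up inputs (not scores from this notebook):

```python
import numpy as np
from scipy import stats

n = 1000

# identical scores across replicates give the maximum entropy, log(n)
uniform_scores = np.full(n, 3.0)
print(stats.entropy(uniform_scores), np.log(n))

# scores dominated by a few replicates give a lower entropy
skewed_scores = np.ones(n)
skewed_scores[:10] = 1000.0
print(stats.entropy(skewed_scores))

# the same value computed by hand: -sum(p * log(p)) after normalizing
p = skewed_scores / skewed_scores.sum()
print(-np.sum(p * np.log(p)))
```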


In addition, I believe we can use the *relative entropy* measurement to compare the likelihood distributions calculated using recomb, tree, topo, or combination distances, to ask *how much more informative* one distribution is than another.
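
A minimal sketch of how such a comparison could be set up with `scipy.stats.entropy`, which returns the KL divergence when given a second array `qk`. The two score vectors here are random placeholders, not recomb, tree, or topo distances computed in this notebook:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# placeholder score vectors standing in for two likelihood distributions
scores_a = rng.uniform(1, 2, size=1000)
scores_b = rng.uniform(1, 10, size=1000)

# relative entropy (KL divergence) in each direction; inputs are normalized to sum to 1
kl_ab = stats.entropy(scores_a, scores_b)
kl_ba = stats.entropy(scores_b, scores_a)
print(kl_ab, kl_ba)

# the divergence of a distribution from itself is zero
print(stats.entropy(scores_a, scores_a))
```

KL divergence is not symmetric in general, so the direction of the comparison (which distribution is treated as the reference) would need to be chosen deliberately.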