# Estimating quantiles with the t-digest

The [t-digest](https://github.com/tdunning/t-digest) is a compact data structure for summarizing observed cumulative probability distributions. Like the other structures we've discussed, it's incremental, scalable, and parallel (although the implementation in this notebook is not parallel), and it has many useful applications in systems, performance analysis, and data science.

In [None]:
import math
import numpy as np
import scipy
import scipy.stats
import pandas as pd
import altair as alt
alt.renderers.enable("notebook")

# T-Digest

In the following cell we will import a `TDigest` class that implements the `update` operation for inserting data elements into the sketch. This tutorial implementation does not provide the `merge` operation, which is useful for combining partial results in a distributed computing setting.

For those interested in how t-digests operate, the implementation is in the `tdigest.py` file of this workshop repository.

In [None]:
from detail.tdigest import TDigest
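
To make the `merge` idea more concrete: a t-digest stores clusters of nearby values as centroids, each with a mean and a weight, so two partial summaries can be combined without revisiting the raw data. The sketch below is only an illustration of that idea under simplifying assumptions. The function names `merge_centroids` and `merge_summaries` are hypothetical and are not part of the workshop's `TDigest` class, and the "collapse the closest neighbours" rule is a naive stand-in for the size limits a real t-digest applies.

```python
# Hypothetical illustration; this is NOT the API of the workshop's TDigest class.

def merge_centroids(a, b):
    """Combine two (mean, weight) centroids into one weighted-mean centroid."""
    mean_a, w_a = a
    mean_b, w_b = b
    total = w_a + w_b
    return ((mean_a * w_a + mean_b * w_b) / total, total)

def merge_summaries(left, right, max_centroids):
    """Naively merge two centroid lists into at most max_centroids centroids.

    Assumption: we simply collapse the adjacent pair with the smallest gap
    between means. A real t-digest bounds centroid sizes differently.
    """
    merged = sorted(left + right)
    while len(merged) > max_centroids:
        # Index of the adjacent pair whose means are closest together.
        i = min(range(len(merged) - 1),
                key=lambda j: merged[j + 1][0] - merged[j][0])
        merged[i:i + 2] = [merge_centroids(merged[i], merged[i + 1])]
    return merged

# Two partial summaries, e.g. from two workers, each holding unit-weight points.
worker_a = [(1.0, 1), (5.0, 1)]
worker_b = [(1.5, 1), (9.0, 1)]
print(merge_summaries(worker_a, worker_b, max_centroids=3))
```

The total weight is preserved by the merge, which is what lets the combined summary answer CDF queries as if it had seen all of the data.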

## Sketching data with the t-digest

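The data below is drawn from an equal-weight mixture of two gamma distributions with shape parameters `mu1` and `mu2`. The result is non-negative and right-skewed, loosely similar in shape to a Poisson distribution, which means that it could resemble latencies for a network service.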

In [None]:
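# Shape parameters of the two gamma components; at unit scale the
# shape is also the component's mean.
mu1 = 1
mu2 = 7
# Mixture weights; they should sum to 1.
w1 = 0.5
w2 = 0.5

def distsamp():
    """Draw one sample from the two-component gamma mixture."""
    # Pick a component with probability w1 or w2, then sample from it.
    r = scipy.stats.uniform.rvs(size=1)[0]
    if r <= w1:
        return scipy.stats.gamma.rvs(mu1, size=1)[0]
    else:
        return scipy.stats.gamma.rvs(mu2, size=1)[0]

sketch = TDigest(compression=0.1)

# Feed 100,000 samples into the sketch one at a time.
for _ in range(100000):
    sketch.update(distsamp())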

### Visualizing the CDF

In [None]:
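# Evaluate both curves on a unit-spaced grid between the sketch's extremes
# (this assumes `cdfi` is the inverse CDF, so `cdfi(0)` and `cdfi(1)` are
# the smallest and largest values the sketch has seen).
xvals = np.arange(sketch.cdfi(0), sketch.cdfi(1)).tolist()
df = pd.DataFrame()
df["x"] = xvals + xvals
# First half: the sketch's estimate. Second half: the exact mixture CDF.
df["cdf"] = [sketch.cdf(x) for x in xvals] + [(w1 * scipy.stats.gamma.cdf(x, mu1)) + (w2 * scipy.stats.gamma.cdf(x, mu2)) for x in xvals]
df["src"] = (["tdigest"] * len(xvals)) + (["cdf"] * len(xvals))
alt.Chart(df).mark_line().encode(x="x", y="cdf", color="src")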
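
The reference curve in the cell above is the weighted sum of the two component CDFs. Because both shape parameters in this notebook are integers, that curve can also be written in closed form with only the standard library, which is handy for sanity-checking individual points. The helper names `erlang_cdf` and `mixture_cdf` below are illustrative and not part of the notebook, and the closed form assumes integer shapes and unit scale.

```python
import math

def erlang_cdf(x, k):
    """CDF of a gamma distribution with integer shape k and unit scale."""
    if x <= 0:
        return 0.0
    # 1 - exp(-x) * sum_{n=0}^{k-1} x^n / n!
    partial = sum(x ** n / math.factorial(n) for n in range(k))
    return 1.0 - math.exp(-x) * partial

def mixture_cdf(x, shapes=(1, 7), weights=(0.5, 0.5)):
    """Weighted sum of the component CDFs, mirroring mu1, mu2, w1, w2 above."""
    return sum(w * erlang_cdf(x, k) for k, w in zip(shapes, weights))

# Spot-check a few points along the curve.
for x in (1.0, 5.0, 10.0, 20.0):
    print(x, round(mixture_cdf(x), 4))
```

Comparing `sketch.cdf(x)` against `mixture_cdf(x)` at a handful of points gives a quick numeric complement to the chart.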

In [None]:
class TDigestAnomalyDetector(object):
    def __init__(self, td):
        self.td = td

    # use the negative log of the tail probability as the score
    def anomaly(self, x):
        cdf = self.td.cdf(x)
        # Here we'll assume we're only testing for "large" anomalies
        t = 1 - cdf
        # make sure we don't try to take the logarithm of zero
        t = max(t, 1e-100)
        return -math.log(t)
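
To get a feel for the scale of these scores, the sketch below applies the same negative-log-tail formula to a few fixed CDF values instead of a real digest. The function name `tail_score` is illustrative only; the floor value matches the one used in the class above.

```python
import math

def tail_score(cdf_value, floor=1e-100):
    """Negative log of the upper-tail probability, floored to avoid log(0)."""
    tail = max(1.0 - cdf_value, floor)
    return -math.log(tail)

# Higher quantiles leave less tail mass, so they get larger scores.
for q in (0.5, 0.9, 0.99, 0.999):
    print(q, round(tail_score(q), 3))
```

Each additional factor of ten in rarity adds about 2.3 (that is, ln 10) to the score, so a value at the 99th percentile scores roughly 4.6 and a score of 10 corresponds to a tail probability of about 4.5e-5.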

In [None]:
detector = TDigestAnomalyDetector(sketch)
data = [0, 5, 10, 15, 20, 25]

In [None]:
scores = pd.DataFrame({
    'x': data,
    'score': [detector.anomaly(x) for x in data]
})
scores

In [None]:
alt.Chart(scores).mark_line(point=True, clip=True).encode(
    alt.Y('score', scale=alt.Scale(domain=(0, 10))),
    x='x'
)

# Exercises
