# Detecting Data Drift
The characteristics of systems, and the data they generate, often change over time.
This type of change is sometimes referred to as "data drift".

Data Drift is important to measure.
It indicates that system inputs or operating characteristics are changing.
Problems detectable via data drift include:
* predictive models experiencing performance loss, due to changing input features
* degrading latency characteristics from software or hardware failures
* changes to customer or user behavior patterns
* system intrusion, malware or other compromised security

Data drift shows up as changes to the shape of the data's _distribution_, and it is detectable with statistical tools.
This notebook explores techniques for detecting data drift using the Kolmogorov-Smirnov test.

# Kolmogorov-Smirnov Test
The [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) test, or KS-test, is a statistical test designed to measure the difference between two cumulative distribution functions (CDFs).
Like most statistical tests, it has two components: a KS test _statistic_, and a corresponding _p-value_.
The KS statistic is the largest absolute difference between the two CDFs, taken over all values of x, as illustrated here:

![KS-statistic](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/KS_Example.png/450px-KS_Example.png)
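
To make the definition concrete, here is a minimal from-scratch sketch of the one-sample KS statistic in plain Python. The function name `ks_statistic` and the five-point toy sample are made up for illustration; in the rest of this notebook `scipy.stats.kstest` does this work.

```python
import math

def ks_statistic(sample, cdf):
    """Largest absolute gap between the sample's empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical toy data, tested against an Exponential(1) CDF,
# which is the same distribution as Gamma with shape 1.0.
sample = [0.1, 0.4, 0.7, 1.2, 2.5]
d = ks_statistic(sample, lambda x: 1.0 - math.exp(-x))
print(round(d, 4))
```

Checking both sides of each jump matters because the empirical CDF is a step function: the largest gap can occur just before a step as well as just after it.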

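The KS _sampling distribution_ can be used to convert a KS-statistic into a corresponding p-value: the probability of observing a KS-statistic at least as large as the one measured, assuming the sample really was drawn from the reference distribution (or, in the two-sample case, that both samples were drawn from the same distribution). A small p-value is therefore evidence that the distributions differ.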
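
As a rough illustration of that conversion, the sketch below uses the large-sample (asymptotic) Kolmogorov distribution to turn a statistic and a sample size into an approximate p-value. The function name `ks_pvalue_asymptotic` is hypothetical, and `scipy.stats.kstest` may use more exact methods for small samples, so its numbers can differ somewhat.

```python
import math

def ks_pvalue_asymptotic(d, n, terms=100):
    """Approximate two-sided p-value for a one-sample KS statistic `d` from `n` points.

    Uses the limiting Kolmogorov distribution, so it is only accurate for large n.
    """
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The truncated alternating series is unreliable near zero; p is ~1 here.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))

# The same statistic gives very different p-values at different sample sizes.
p_small_n = ks_pvalue_asymptotic(0.05, 100)
p_large_n = ks_pvalue_asymptotic(0.05, 10000)
```

The p-value depends on the statistic only through the product of the statistic and the square root of the sample size, which is why a fixed gap between CDFs looks unremarkable in a small sample and overwhelming in a large one.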

In [None]:
import math
import numpy as np
import scipy
import scipy.stats
from scipy.stats import gamma, kstest
import pandas as pd
import altair as alt
from detail.altairdf import altairDF
alt.renderers.enable("notebook")

# T-Digest Distribution Sketches
As we saw in previous notebooks, t-digest sketches provide efficient and accurate approximations to distributions with arbitrary shapes.
That makes them an excellent tool for detecting data drift.
In this cell we import the Python implementation of t-digests used in this workshop.

In [None]:
from detail.tdigest import TDigest

# Sketching a Baseline Distribution
Data drift can only be measured with respect to some reference, or baseline, data.
In the following cell we sketch some example data drawn from a Gamma distribution.
The resulting sketch will serve as the baseline for our experiments.

In [None]:
sketch = TDigest(compression = 0.1)
# Feed 100000 samples from Gamma(shape=1.0) into the sketch, one value at a time.
for p in gamma.rvs(1.0, size=100000):
    sketch.update(p)
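
If the workshop's `detail.tdigest` module is not at hand, a plain empirical CDF can stand in as a baseline for experimenting. This is not a t-digest: it stores every sample instead of compressing them, so it is only a sketch for small experiments. The class name `EmpiricalBaseline` is hypothetical; its `cdf` and `cdfi` methods simply mirror the names called on `sketch` in this notebook.

```python
import bisect
import random

class EmpiricalBaseline:
    """Stand-in for a sketch: keeps every sample (unlike a t-digest, which compresses)."""

    def __init__(self, samples):
        self.xs = sorted(samples)

    def cdf(self, x):
        # Fraction of baseline samples less than or equal to x.
        return bisect.bisect_right(self.xs, x) / len(self.xs)

    def cdfi(self, q):
        # Inverse CDF (quantile): clamp q to [0, 1] and index into the sorted data.
        q = min(1.0, max(0.0, q))
        idx = min(len(self.xs) - 1, int(q * len(self.xs)))
        return self.xs[idx]

rng = random.Random(42)  # fixed seed so the example is repeatable
# random.gammavariate(alpha, beta) takes shape alpha and scale beta.
baseline = EmpiricalBaseline(rng.gammavariate(1.0, 1.0) for _ in range(100000))
```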

# Effects of Sample Size on KS Tests
The following cell explores how sample size affects the output of a KS test.
It generates a table of the KS statistic and the corresponding p-value for each sample size:

In [None]:
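# Sample sizes to test against the baseline sketch.
ssv = [100, 1000, 10000, 100000]
# kstest accepts a callable CDF; sketch.cdf takes one value at a time, so map it over the sample.
ksv = [kstest(gamma.rvs(1.0, size=ss), lambda xv: [sketch.cdf(x) for x in xv]) for ss in ssv]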
ksdf = pd.DataFrame()
ksdf["ss"] = ssv
ksdf["stat"] = [ks.statistic for ks in ksv]
ksdf["pval"] = [ks.pvalue for ks in ksv]
ksdf

In the table above, we see two things.
As the test sample size increases, the KS statistic stabilizes at just above 0.01 and the corresponding p-values approach zero.
Why is this happening?
Remember that our sketch is an _approximation_ of the true distribution.
The KS-statistic becomes small (but never zero) and the p-values eventually approach zero, because the two distributions are definitely _not_ the same, merely similar.
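
A back-of-the-envelope calculation suggests where this turnover happens. It uses the standard large-sample critical value of the one-sample KS test at the 5% level, and assumes the statistic has settled at about 0.01:

$$
D_{\mathrm{crit}}(n) \approx \frac{1.36}{\sqrt{n}},
\qquad
D_{\mathrm{crit}}(n) < 0.01 \iff n > \left(\frac{1.36}{0.01}\right)^{2} \approx 18{,}500
$$

Under these assumptions, the sketch's approximation error stays within sampling noise for sample sizes up to 10000, while at 100000 the test has enough power to detect it.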

The following plot shows what this looks like visually.
As you can see, a KS statistic of ~0.01 represents distributions that are almost exactly the same, which in this case is a reflection of the fidelity of the t-digest sketch.

In [None]:
plotdf = altairDF(np.linspace(sketch.cdfi(0), sketch.cdfi(1), num=20).tolist(), \
                  [lambda x: sketch.cdf(x), lambda x: gamma.cdf(x, 1.0)], \
                  labels = ["sketch", "gamma(1.0)"], ycol="CDF")
alt.Chart(plotdf).mark_line().encode(x="x", y="CDF", color="color")

# Introducing Some Drift

So far, we have not examined any changes to the underlying distribution.
If we change the shape parameter of our Gamma distribution, we can simulate what happens when something changes in our software systems.
In the following cell, we sample using a shape parameter of 1.25 instead of 1.0, and see how that affects the outputs of our KS testing:

In [None]:
ksv = [kstest(gamma.rvs(1.25, size=ss), lambda xv: [sketch.cdf(x) for x in xv]) for ss in ssv]
ksdf = pd.DataFrame()
ksdf["ss"] = ssv
ksdf["stat"] = [ks.statistic for ks in ksv]
ksdf["pval"] = [ks.pvalue for ks in ksv]
ksdf

In this table, we can see that the p-values approach zero even faster, which reflects that our distribution has drifted from our baseline.
More importantly, our KS-statistic now levels off at roughly 0.1 instead of ~0.01: the KS-statistic is providing a numeric measure of how much drift has actually occurred.

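Another pattern we can see from our tables is that the KS-statistic does not stabilize until our sample size reaches somewhere between 1000 and 10000.
That suggests a "best practice" for measuring data drift: compare at least a few thousand samples against the baseline before trusting the statistic.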
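
One way to apply this in practice is to buffer incoming values into fixed-size windows and test each full window against the baseline. The sketch below is a minimal illustration in plain Python, not part of the workshop code: the class `DriftMonitor`, the window size of 5000 and the threshold of 0.05 are all hypothetical choices, and the exact Gamma(1.0) CDF stands in for the baseline sketch.

```python
import math
import random

class DriftMonitor:
    """Buffers incoming values and runs a KS check each time a full window arrives."""

    def __init__(self, baseline_cdf, window_size=5000, threshold=0.05):
        # window_size and threshold are illustrative defaults, not recommendations.
        self.baseline_cdf = baseline_cdf
        self.window_size = window_size
        self.threshold = threshold
        self.buffer = []

    def update(self, value):
        """Add one value; return the KS statistic when a window completes, else None."""
        self.buffer.append(value)
        if len(self.buffer) < self.window_size:
            return None
        xs, self.buffer = sorted(self.buffer), []
        n = len(xs)
        stat = 0.0
        for i, v in enumerate(xs):
            f = self.baseline_cdf(v)
            # Largest gap between the empirical CDF (a step function) and the baseline.
            stat = max(stat, (i + 1) / n - f, f - i / n)
        return stat

    def drifted(self, statistic):
        return statistic is not None and statistic > self.threshold

rng = random.Random(7)  # fixed seed so the example is repeatable
monitor = DriftMonitor(lambda x: 1.0 - math.exp(-x))  # Gamma(1.0) baseline CDF
results = []
for shape in (1.0, 1.25):  # first window matches the baseline, second has drifted
    for _ in range(monitor.window_size):
        stat = monitor.update(rng.gammavariate(shape, 1.0))
    results.append((shape, stat, monitor.drifted(stat)))
```

Thresholding the statistic instead of the p-value is a deliberate choice here: with large windows the p-value reacts to tiny, harmless differences, while the statistic reflects the size of the change.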

The following plot visually shows the drift from our baseline sketch:

In [None]:
plotdf = altairDF(np.linspace(sketch.cdfi(0), sketch.cdfi(1), num=20).tolist(), \
                  [lambda x: sketch.cdf(x), lambda x: gamma.cdf(x, 1.25)], \
                  labels = ["baseline sketch", "drift"], ycol="CDF")
alt.Chart(plotdf).mark_line().encode(x="x", y="CDF", color="color")

# Increasing Drift

Using the intuitions from our previous table, we'll fix a sample size of 10000 and generate a table where we alter our shape parameter to simulate increasing drift:

In [None]:
ss = 10000
shapes = [1.25, 1.5, 2.0, 3.0]
ksv = [kstest(gamma.rvs(a, size=ss), lambda xv: [sketch.cdf(x) for x in xv]) for a in shapes]
ksdf = pd.DataFrame()
ksdf["shape"] = shapes
ksdf["stat"] = [ks.statistic for ks in ksv]
ksdf["pval"] = [ks.pvalue for ks in ksv]
ksdf

As drift increases the KS-statistic also increases, reflecting the increasing difference between the distributions and our baseline.
The following plot visually shows what increasing drift looks like with respect to the CDFs of the distribution.

In [None]:
plotdf = altairDF(np.linspace(sketch.cdfi(0), sketch.cdfi(1), num=20).tolist(), \
                  [lambda x: sketch.cdf(x), lambda x: gamma.cdf(x, 1.25), lambda x: gamma.cdf(x, 1.5), lambda x: gamma.cdf(x, 2.0), lambda x: gamma.cdf(x, 3.0)], \
                  labels = ["1. baseline", "2. drift 1.25", "3. drift 1.5", "4. drift 2.0", "5. drift 3.0"], \
                  ycol="CDF")
alt.Chart(plotdf).mark_line().encode(x="x", y="CDF", color="color")
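
The gaps visible in such a plot can also be computed directly, without sampling, by scanning the exact CDFs on a grid. The sketch below does this in plain Python so that it is self-contained; the helpers `gamma_cdf` and `sup_distance` are hypothetical, and `scipy.stats.gamma.cdf` could replace the hand-written series. The statistics reported by `kstest` will differ slightly from these values, since they also include sampling noise and the approximation error of the sketch.

```python
import math

def gamma_cdf(x, shape, terms=200):
    """CDF of Gamma(shape, scale=1) via the power series of the incomplete gamma function."""
    if x <= 0:
        return 0.0
    total, term = 0.0, 1.0 / shape
    for n in range(terms):
        total += term
        term *= x / (shape + n + 1)
        if term < 1e-16 * total:
            break  # remaining terms no longer change the sum
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape))

def sup_distance(shape_a, shape_b, x_max=20.0, steps=2000):
    """Largest CDF gap between two gamma distributions, scanned on an even grid."""
    grid = (x_max * i / steps for i in range(steps + 1))
    return max(abs(gamma_cdf(x, shape_a) - gamma_cdf(x, shape_b)) for x in grid)

# Distance from the Gamma(1.0) baseline for each drifted shape used above.
distances = {a: sup_distance(1.0, a) for a in (1.25, 1.5, 2.0, 3.0)}
for a, d in distances.items():
    print(f"shape {a}: {d:.3f}")
```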

# Exercises
1. Try running these experiments using a different distribution family.
1. What are some applications of data drift detection that would be useful to you?
1. Based on the experiments above, what would be a good rule of thumb for picking a KS-statistic threshold for flagging significant data drift?