## Data drift

**Data drift** occurs when the statistical properties of the data change over time, so that models trained on historical data become less accurate. The shift can be gradual (e.g., a slow change in a feature's distribution) or abrupt (e.g., a sudden change in how the data is collected). Detecting and measuring drift helps keep machine learning models reliable and consistent. Common measures include:

- **Jensen-Shannon Distance**  
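  A general-purpose measure of dissimilarity between two distributions. It is symmetric and bounded: 0 if the distributions are identical and, with base-2 logarithms, 1 if they share no support. It works on probability vectors, so continuous features must be binned first, and estimates get noisy when there are many sparse categories.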
- **Hellinger Distance**  
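  Best suited to capturing small-to-moderate differences between probability distributions. It is bounded between 0 and 1 and depends only on how much the two distributions overlap, so it saturates at 1 once they no longer overlap and cannot tell how far apart they have moved.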
- **Kolmogorov-Smirnov Test (KS Test)**  
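  A non-parametric test that works directly on continuous samples, with no binning. It is easy to interpret, but it is less sensitive to differences in the tails than near the center of the distribution, and with large samples even tiny shifts come out as statistically significant.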
- **Chi-Squared Test**  
  Standard method for categorical data to check whether two distributions are significantly different. Struggles with small samples or many sparse categories (a common rule of thumb is an expected count of at least 5 per cell).
- **L-Infinity Distance**  
  Measures the maximum (peak) difference between two distributions. Very quick and simple but may overlook finer shifts.
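
As a quick illustration of how these measures behave on continuous data, the sketch below bins two samples on a shared set of edges and reports the Jensen-Shannon, Hellinger, and L-Infinity distances alongside the KS test. The helper names `binned_pmfs` and `drift_summary`, the bin count, and the synthetic samples are assumptions made for this example. It uses SciPy's `jensenshannon` with `base=2` so that the distance falls on a 0-to-1 scale.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp


def binned_pmfs(before, after, bins=10):
    """Bin both samples on the same edges and return two PMFs (hypothetical helper)."""
    # Shared edges computed from the pooled samples keep the two PMFs comparable.
    edges = np.histogram_bin_edges(np.concatenate([before, after]), bins=bins)
    counts_before, _ = np.histogram(before, bins=edges)
    counts_after, _ = np.histogram(after, bins=edges)
    return counts_before / counts_before.sum(), counts_after / counts_after.sum()


def drift_summary(before, after, bins=10):
    """Return a dict of drift measures for two continuous samples (hypothetical helper)."""
    p, q = binned_pmfs(before, after, bins=bins)
    ks_stat, ks_p = ks_2samp(before, after)  # KS works on the raw samples
    return {
        "js_distance": float(jensenshannon(p, q, base=2)),  # in [0, 1] with base 2
        "hellinger": float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))),
        "linf": float(np.max(np.abs(p - q))),
        "ks_stat": float(ks_stat),
        "ks_pvalue": float(ks_p),
    }


# Synthetic data: one pair drawn from the same distribution, one with a large mean shift.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=2000)
same = rng.normal(0.0, 1.0, size=2000)
shifted = rng.normal(3.0, 1.0, size=2000)

no_drift = drift_summary(reference, same)
big_drift = drift_summary(reference, shifted)
print(no_drift)
print(big_drift)
```

Every distance should come out noticeably larger for the shifted pair than for the pair drawn from the same distribution. The binned measures depend on the chosen number of bins, whereas the KS statistic does not.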

In [5]:
"""
Below is an example that creates:
1) 10 pairs of *continuous* 'before'/'after' data (using normal/uniform distributions)
2) 10 pairs of *categorical* 'before'/'after' data (using random category probabilities)

We then measure distribution shifts using:

For *continuous data*:
- Jensen-Shannon Distance (requires converting to PMFs/bins)
- Hellinger Distance (requires PMFs/bins)
- Kolmogorov-Smirnov Test (directly on samples)
- L-Infinity Distance (on PMFs/bins)
- (Chi-Squared is typically not appropriate for continuous data, so we skip it.)

For *categorical data*:
- Jensen-Shannon Distance (using category PMFs)
- Hellinger Distance (using category PMFs)
- Chi-Squared Test (using contingency tables)
- L-Infinity Distance (using PMFs)
- (Kolmogorov-Smirnov is typically not appropriate for categorical data, so we skip it.)

The final results are combined in a DataFrame. For continuous distributions, Chi-Squared fields
will be set to NaN; for categorical, KS fields will be set to NaN. Significance is determined by:
- KS p<0.05 for continuous
- Chi-Squared p<0.05 for categorical
"""

import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.stats import ks_2samp, chi2_contingency

# ----- Helper Functions -----
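def jensen_shannon_distance(p, q):
    """Compute the Jensen-Shannon distance between two discrete distributions p and q.

    Uses natural logarithms, so the result lies in [0, sqrt(ln 2)] (about 0.83)
    rather than [0, 1]; divide by sqrt(ln 2), or use log base 2, for a 0-to-1 scale.
    """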
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl_divergence(a, b):
        mask = (a != 0)
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    js_div = 0.5*kl_divergence(p, m) + 0.5*kl_divergence(q, m)
    return np.sqrt(js_div)

def hellinger_distance(p, q):
    """Compute the Hellinger distance between two discrete distributions p and q."""
    return distance.euclidean(np.sqrt(p), np.sqrt(q)) / np.sqrt(2)

def linfinity_distance(p, q):
    """Chebyshev (L-Infinity) distance between two distributions."""
    return np.max(np.abs(p - q))

def to_pmf(samples, bins=10, categories=None):
    """Turn samples into a discrete PMF. If categorical, pass a list/array of categories."""
    if categories is not None:
        # Categorical
        counts = np.array([np.sum(samples == c) for c in categories], dtype=float)
        pmf = counts / counts.sum()
        return pmf
    else:
        # Continuous: normalize raw bin counts, which stays correct
        # even if unequal-width bin edges are passed in
        counts, _ = np.histogram(samples, bins=bins)
        pmf = counts / counts.sum()
        return pmf

def chi_squared_test(counts1, counts2):
    """Perform Chi-Squared test on two sets of counts (same length). Returns (stat, p_value)."""
    contingency = np.stack([counts1, counts2], axis=0)
    chi2_stat, p_val, _, _ = chi2_contingency(contingency)
    return chi2_stat, p_val

# ----- Data Generation -----
np.random.seed(42)

# Generate 10 Continuous 'before'/'after' distributions
continuous_pairs = []
for i in range(10):
    if i < 5:
        # normal
        before = np.random.normal(loc=i*0.2, scale=1.0, size=1000)
        after  = np.random.normal(loc=i*0.2 + 0.3, scale=1.0, size=1000)
    else:
        # uniform
        before = np.random.uniform(low=0,     high=1 + i*0.1, size=1000)
        after  = np.random.uniform(low=0.1,   high=1 + i*0.1, size=1000)
    continuous_pairs.append((before, after))

# Generate 10 Categorical 'before'/'after' distributions
categorical_pairs = []
for i in range(10):
    # random category probabilities for 6 categories
    before_probs = np.random.dirichlet(np.ones(6), size=1).flatten()
    after_probs  = np.random.dirichlet(np.ones(6), size=1).flatten()
    categories = np.arange(6)

    before_cat_samples = np.random.choice(categories, p=before_probs, size=1000)
    after_cat_samples  = np.random.choice(categories, p=after_probs, size=1000)

    categorical_pairs.append((before_cat_samples, after_cat_samples, categories))

# ----- Evaluate and store results -----
results = []
dist_id = 1

# Evaluate Continuous
for before, after in continuous_pairs:
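    # Convert to PMFs for JS, Hellinger, L-Infinity.
    # Caveat: with bins=10, np.histogram derives the edges from each sample's own
    # min/max, so the two PMFs are not on a common grid. For comparable values,
    # pass shared edges (e.g. np.histogram_bin_edges on the pooled samples) as bins.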
    pmf_before = to_pmf(before, bins=10)
    pmf_after  = to_pmf(after,  bins=10)
    
    # JSD, Hellinger, L-Infinity
    js_val = jensen_shannon_distance(pmf_before, pmf_after)
    hel_val = hellinger_distance(pmf_before, pmf_after)
    l_inf_val = linfinity_distance(pmf_before, pmf_after)
    
    # KS test (directly on raw samples)
    ks_stat, ks_p = ks_2samp(before, after)
    
    # Chi2 is not appropriate for continuous => set to NaN
    chi2_stat, chi2_p = np.nan, np.nan
    
    # Significance (continuous => KS-based)
    significant = (ks_p < 0.05)
    
    results.append({
        'Dist_ID': f'C{dist_id}',
        'Type': 'Continuous',
        'JSD': js_val,
        'Hellinger': hel_val,
        'KS_stat': ks_stat,
        'KS_pval': ks_p,
        'Chi2_stat': chi2_stat,
        'Chi2_pval': chi2_p,
        'LInf': l_inf_val,
        'SignificantShift': significant
    })
    dist_id += 1

# Evaluate Categorical
for i, (before_cat, after_cat, cats) in enumerate(categorical_pairs):
    # Convert to PMFs for JS, Hellinger, L-Infinity
    pmf_before = to_pmf(before_cat, categories=cats)
    pmf_after  = to_pmf(after_cat,  categories=cats)
    
    # JSD, Hellinger, L-Infinity
    js_val = jensen_shannon_distance(pmf_before, pmf_after)
    hel_val = hellinger_distance(pmf_before, pmf_after)
    l_inf_val = linfinity_distance(pmf_before, pmf_after)
    
    # KS is not typical for categorical => set to NaN
    ks_stat, ks_p = np.nan, np.nan
    
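    # Chi2 test on the observed category counts, taken from the samples directly
    # (truncating pmf * 1000 with astype(int) can lose a count to float rounding)
    counts_before = np.array([np.sum(before_cat == c) for c in cats])
    counts_after  = np.array([np.sum(after_cat  == c) for c in cats])
    chi2_stat, chi2_p = chi_squared_test(counts_before, counts_after)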
    
    # Significance (categorical => Chi2-based)
    significant = (chi2_p < 0.05)
    
    results.append({
        'Dist_ID': f'Cat{i+1}',
        'Type': 'Categorical',
        'JSD': js_val,
        'Hellinger': hel_val,
        'KS_stat': ks_stat,
        'KS_pval': ks_p,
        'Chi2_stat': chi2_stat,
        'Chi2_pval': chi2_p,
        'LInf': l_inf_val,
        'SignificantShift': significant
    })

df_results = pd.DataFrame(results)
df_results


| | Dist_ID | Type | JSD | Hellinger | KS_stat | KS_pval | Chi2_stat | Chi2_pval | LInf | SignificantShift |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1 | Continuous | 0.105641 | 0.106254 | 0.155 | 6.763913e-11 | NaN | NaN | 0.063 | True |
| 1 | C2 | Continuous | 0.115753 | 0.116493 | 0.123 | 5.214703e-07 | NaN | NaN | 0.062 | True |
| 2 | C3 | Continuous | 0.073288 | 0.073368 | 0.119 | 1.378745e-06 | NaN | NaN | 0.039 | True |
| 3 | C4 | Continuous | 0.252251 | 0.256351 | 0.154 | 9.23442e-11 | NaN | NaN | 0.12 | True |
| 4 | C5 | Continuous | 0.130671 | 0.131155 | 0.121 | 8.513708e-07 | NaN | NaN | 0.068 | True |
| 5 | C6 | Continuous | 0.04917 | 0.049196 | 0.078 | 0.004543822 | NaN | NaN | 0.026 | True |
| 6 | C7 | Continuous | 0.047322 | 0.047339 | 0.073 | 0.009677755 | NaN | NaN | 0.022 | True |
| 7 | C8 | Continuous | 0.05929 | 0.059333 | 0.068 | 0.01960232 | NaN | NaN | 0.031 | True |
| 8 | C9 | Continuous | 0.035384 | 0.035391 | 0.067 | 0.02243866 | NaN | NaN | 0.016 | True |
| 9 | C10 | Continuous | 0.054904 | 0.054948 | 0.09 | 0.0006029006 | NaN | NaN | 0.032 | True |
