# Embedding vector drift detection

This notebook shows a simple way to detect drift in embedding vectors. Detecting drift helps reveal changes in the underlying reference data used in a retrieval-augmented generation (RAG) pattern.

## Method

An embedding vector represents an item as a point in n-dimensional space. `n` is often large; Bedrock's Titan embedding model creates vectors of size 4096. We start by reducing the dimensionality with PCA, then use KMeans to identify a set of cluster centers in the reduced space.

We'll calculate the following information to set a baseline:

* The number of dimensions found in PCA
* The location of each cluster centroid
* The proportion of samples in each cluster
* The mean and standard deviation of the distance from each sample to its own cluster centroid
* The mean and standard deviation of each sample's span, that is, the difference between its distances to its farthest and closest centroids
* Inertia (sum of squared distances to cluster centroids)

When we want to compare a newer set of embeddings, we'll compute a new baseline. Then we'll compare:

* Change in dimensions after PCA
* How far the cluster centroids have moved
* Change in proportion of samples in each cluster
* Change in mean and standard deviation of the sample-to-centroid distance
* Change in mean and standard deviation of the span between each sample's farthest and closest centroids
* Change in inertia
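
The comparison listed above can be sketched as a small helper that takes two baseline records and returns the differences. This is only an illustration: the `compare_baselines` name and the dictionary keys are hypothetical, and the notebook itself computes each value in its own cell.

```python
import numpy as np

# Hypothetical key names for the scalar statistics in a baseline record.
SCALAR_KEYS = ("n_dims", "dist_mean", "dist_std", "span_mean", "span_std", "inertia")

def compare_baselines(base, new):
    """Return new-minus-base differences for two baseline dicts."""
    diff = {f"{key}_change": new[key] - base[key] for key in SCALAR_KEYS}
    # Euclidean distance between centroids paired by index; this is only
    # meaningful when both sets of centroids live in the same reduced space.
    diff["centroid_shift"] = np.linalg.norm(new["centroids"] - base["centroids"], axis=1)
    diff["proportion_change"] = new["proportions"] - base["proportions"]
    return diff

# Made-up records, purely to show the shape of the inputs and outputs.
base = {"n_dims": 3, "centroids": np.zeros((2, 3)), "proportions": np.array([0.5, 0.5]),
        "dist_mean": 2.0, "dist_std": 0.5, "span_mean": 1.0, "span_std": 0.2, "inertia": 100.0}
new = {"n_dims": 3, "centroids": np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]),
       "proportions": np.array([0.7, 0.3]),
       "dist_mean": 1.5, "dist_std": 0.4, "span_mean": 0.8, "span_std": 0.1, "inertia": 80.0}

diff = compare_baselines(base, new)
print(diff["centroid_shift"])  # first centroid moved by 5.0, second did not move
```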

## Establish first baseline

In [4]:
import pandas as pd
import sklearn as sk
import numpy as np

In [7]:
from sklearn.decomposition import PCA

rng = np.random.default_rng()
embed_dim = 4096
num_embeds = 1000

# Random vectors stand in for real embeddings in this demonstration.
X = rng.standard_normal((num_embeds, embed_dim))


In [9]:
X.shape

(1000, 4096)

In [11]:
pca = PCA(n_components=0.95)
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [14]:
reduced = pca.transform(X)

In [15]:
reduced.shape

(1000, 862)
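
Passing a fraction such as `n_components=0.95` asks scikit-learn's PCA to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction, which is why 4096 dimensions shrink to 862 here. A minimal NumPy sketch of that selection rule on a small made-up dataset follows; the `components_for_variance` helper is hypothetical and not part of scikit-learn.

```python
import numpy as np

def components_for_variance(data, fraction):
    """Smallest number of principal components whose cumulative
    explained variance ratio exceeds `fraction`."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    ratio_cumsum = np.cumsum(explained / explained.sum())
    return int(np.searchsorted(ratio_cumsum, fraction, side="right") + 1)

# Toy data: two directions carry almost all of the variance.
toy_rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0, 1.0, 0.1, 0.1])
toy = toy_rng.standard_normal((200, 5)) * scales

k = components_for_variance(toy, 0.95)
print(k)
```

With isotropic random data like `X`, variance is spread almost evenly across directions, so many components are needed to reach 95%.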

In [16]:
from sklearn.cluster import KMeans

In [18]:
num_clusters = 10
kmeans = KMeans(n_clusters=num_clusters).fit(reduced)

In [20]:
kmeans.cluster_centers_.shape

(10, 862)

In [23]:
unique, counts = np.unique(kmeans.labels_, return_counts=True)

In [25]:
counts

array([  7,  45,  30,  10, 319, 388,   3,  14,  22, 162])

In [29]:
proportion = counts / reduced.shape[0]

In [30]:
proportion

array([0.007, 0.045, 0.03 , 0.01 , 0.319, 0.388, 0.003, 0.014, 0.022,
       0.162])
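
`np.unique` only reports labels that actually occur, so if a cluster were ever empty the `counts` array would be shorter than `num_clusters` and an element-wise comparison with another baseline would misalign. One way to guard against that, sketched here on made-up labels, is `np.bincount` with `minlength`. The `toy_labels` and `n_clusters` names are illustrative.

```python
import numpy as np

toy_labels = np.array([0, 0, 2, 2, 2, 3])  # cluster 1 has no samples
n_clusters = 4

# np.unique drops the empty cluster, so only three counts come back.
counts_unique = np.unique(toy_labels, return_counts=True)[1]

# np.bincount keeps one slot per cluster, including empty ones.
counts_full = np.bincount(toy_labels, minlength=n_clusters)
proportions = counts_full / toy_labels.size

print(counts_unique)  # [2 3 1]
print(counts_full)    # [2 0 3 1]
```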

In [33]:
kmeans.inertia_

3850626.7938258965

In [34]:
distances = kmeans.transform(reduced)

In [35]:
distances.shape

(1000, 10)

In [40]:
distances_to_center = np.array([d[idx] for d, idx in zip(distances, kmeans.labels_)])

In [41]:
distances_to_center.mean()

62.040904793487655

In [44]:
distances_to_center.std()

1.2461646084280062
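
Inertia is the sum of squared distances from each sample to its assigned centroid, so it is tied directly to the distances computed above. A minimal NumPy sketch of that definition on made-up points follows; `points`, `centers`, and `labels` are illustrative names, not variables from this notebook.

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])  # cluster assignment of each point

# Squared Euclidean distance from each point to its own centroid.
sq_dist = np.sum((points - centers[labels]) ** 2, axis=1)
inertia = sq_dist.sum()
print(inertia)  # 10.0
```

The values in this notebook are consistent with that relationship: 1000 × (62.04² + 1.246²) ≈ 3.8506 × 10⁶, which is close to the reported `kmeans.inertia_`.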

In [46]:
distances_span = np.array([(d.max() - d.min()) for d in distances])

In [47]:
distances_span.mean()

11.013843036234153

In [48]:
distances_span.std()

0.9421704989282121
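
The two list comprehensions above can also be written with NumPy indexing and axis reductions, which avoids the Python-level loop. A small sketch on a made-up distance matrix (the `toy_` names and values are illustrative):

```python
import numpy as np

# Rows are samples, columns are distances to each centroid.
toy_distances = np.array([[1.0, 4.0, 6.0],
                          [5.0, 2.0, 3.0]])
toy_labels = np.array([0, 1])  # assigned cluster of each sample

# Distance from each sample to its assigned centroid.
own = toy_distances[np.arange(len(toy_labels)), toy_labels]

# Farthest minus closest centroid distance for each sample.
span = toy_distances.max(axis=1) - toy_distances.min(axis=1)

print(own)   # [1. 2.]
print(span)  # [5. 3.]
```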

## Compute new baseline

In [49]:
Q = rng.standard_normal((num_embeds, embed_dim))

In [50]:
pca_q = PCA(n_components=0.95)
pca_q.fit(Q)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [51]:
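# Note: Q is projected with the baseline `pca` (fit on X), not `pca_q`, so
# `reduced_q` shares the baseline's reduced space and column count.
# `pca_q.n_components_` holds the dimension count PCA keeps for Q itself.
reduced_q = pca.transform(Q)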

In [52]:
kmeans_q = KMeans(n_clusters=num_clusters).fit(reduced_q)

In [53]:
unique_q, counts_q = np.unique(kmeans_q.labels_, return_counts=True)

In [54]:
proportion_q = counts_q / reduced_q.shape[0]

In [55]:
reduced_q.shape

(1000, 862)

In [56]:
kmeans_q.inertia_

849432.8058074604

In [57]:
distances_q = kmeans_q.transform(reduced_q)

In [58]:
distances_to_center_q = np.array([d[idx] for d, idx in zip(distances_q, kmeans_q.labels_)])

In [59]:
distances_to_center_q.mean()

29.13537454834681

In [60]:
distances_to_center_q.std()

0.7501704706298035

In [61]:
distances_span_q = np.array([(d.max() - d.min()) for d in distances_q])

In [62]:
distances_span_q.mean()

2.744750933658466

In [63]:
distances_span_q.std()

0.4567290935252769
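
Both baselines repeat the same cells with different variable names. One way to avoid the duplication is to wrap the steps in a function, sketched below. The `compute_baseline` name, its parameters, and the returned keys are hypothetical. The optional `pca` argument reuses an already fitted projection, which mirrors the `pca.transform(Q)` call above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def compute_baseline(embeddings, n_clusters=10, variance=0.95, pca=None, seed=None):
    """Compute the baseline statistics used in this notebook."""
    if pca is None:
        # Fit a new projection when none is supplied.
        pca = PCA(n_components=variance).fit(embeddings)
    reduced = pca.transform(embeddings)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(reduced)
    distances = kmeans.transform(reduced)
    own = distances[np.arange(len(reduced)), kmeans.labels_]
    span = distances.max(axis=1) - distances.min(axis=1)
    return {
        "pca": pca,
        "n_dims": reduced.shape[1],
        "centroids": kmeans.cluster_centers_,
        "proportions": np.bincount(kmeans.labels_, minlength=n_clusters) / len(reduced),
        "dist_mean": own.mean(), "dist_std": own.std(),
        "span_mean": span.mean(), "span_std": span.std(),
        "inertia": kmeans.inertia_,
    }

# Small made-up inputs so the sketch runs quickly.
toy_rng = np.random.default_rng(0)
first = compute_baseline(toy_rng.standard_normal((60, 8)), n_clusters=3, seed=0)
second = compute_baseline(toy_rng.standard_normal((60, 8)), n_clusters=3,
                          pca=first["pca"], seed=0)
```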

## Comparison

In [64]:
print(f"Change in dimensions after PCA: {reduced_q.shape[1] - reduced.shape[1]}")

Change in dimensions after PCA: 0


In [72]:
print(f"Change in centroid location: {[np.linalg.norm(c1-c2) for c1, c2 in zip(kmeans_q.cluster_centers_, kmeans.cluster_centers_)]}")

Change in centroid location: [24.29478196785469, 9.921150675881297, 12.906074809768342, 20.964729505574148, 4.679796653390379, 7.267394171524955, 39.6715833862202, 17.480444818721494, 14.339360910407517, 5.461923124989665]
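
KMeans assigns cluster indices arbitrarily, so centroid `i` from one fit does not necessarily correspond to centroid `i` from another, and pairing by index can overstate movement. One possible refinement, not part of the original notebook, is to pair centroids by minimum total distance with SciPy's `linear_sum_assignment` before measuring the shift. The `match_centroids` helper below is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_centroids(base_centers, new_centers):
    """Pair each baseline centroid with one new centroid so that the
    total Euclidean distance across pairs is minimal."""
    cost = np.linalg.norm(base_centers[:, None, :] - new_centers[None, :, :], axis=2)
    base_idx, new_idx = linear_sum_assignment(cost)
    return new_idx, cost[base_idx, new_idx]

# Made-up centroids: the same three clusters, shuffled and nudged slightly.
base = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
moved = base[[2, 0, 1]] + 0.5

order, shift = match_centroids(base, moved)
print(order)  # [1 2 0]
print(shift)  # each matched centroid moved by about 0.707
```

The same `order` array can reindex the new proportions (for example `proportion_q[order]`) before subtracting, so that both comparisons refer to the same pairing.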


In [67]:
print(f"Change in proportion of samples in each cluster: {proportion_q - proportion}")

Change in proportion of samples in each cluster: [ 0.135  0.078  0.013  0.127 -0.2   -0.362  0.004  0.048  0.075  0.082]


In [68]:
print(f"Change in mean of sample distance: {distances_to_center_q.mean() - distances_to_center.mean()}")

Change in mean of sample distance: -32.905530245140845


In [69]:
print(f"Change in standard deviation of sample distance: {distances_to_center_q.std() - distances_to_center.std()}")

Change in standard deviation of sample distance: -0.4959941377982028


In [70]:
print(f"Change in mean of sample span: {distances_span_q.mean() - distances_span.mean()}")

Change in mean of sample span: -8.269092102575687


In [71]:
print(f"Change in standard deviation of sample span: {distances_span_q.std() - distances_span.std()}")

Change in standard deviation of sample span: -0.48544140540293523


In [66]:
print(f"Change in inertia: {kmeans_q.inertia_ - kmeans.inertia_}")

Change in inertia: -3001193.988018436
