# Faiss building blocks: clustering, PCA, quantization

https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

Faiss ships with several efficient building-block algorithms: k-means clustering, PCA, and PQ encoding/decoding.

# Clustering

Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a 2-D array (here `xb`) is done as follows:

In [11]:
import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
print('xb', xb.shape)
print('xb', xb[:1])
print('xq', xq.shape)
print('xq', xq[:1])

import faiss

ncentroids = 1024
niter = 20
verbose = True
d = xb.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(xb)

xb (100000, 64)
xb [[0.19151945 0.62210876 0.43772775 0.7853586  0.77997583 0.2725926
  0.27646425 0.8018722  0.95813936 0.87593263 0.35781726 0.5009951
  0.6834629  0.71270204 0.37025076 0.5611962  0.50308317 0.01376845
  0.7728266  0.8826412  0.364886   0.6153962  0.07538124 0.368824
  0.9331401  0.65137815 0.39720258 0.78873014 0.31683612 0.56809866
  0.8691274  0.4361734  0.8021476  0.14376682 0.70426095 0.7045813
  0.21879211 0.92486763 0.44214076 0.90931594 0.05980922 0.18428709
  0.04735528 0.6748809  0.59462476 0.5333102  0.04332406 0.5614331
  0.32966843 0.5029668  0.11189432 0.6071937  0.5659447  0.00676406
  0.6174417  0.9121229  0.7905241  0.99208146 0.95880175 0.7919641
  0.28525096 0.62491673 0.4780938  0.19567518]]
xq (10000, 64)
xq [[0.81432974 0.7409969  0.8915324  0.02642949 0.24954738 0.75948536
  0.33756447 0.0388501  0.06253924 0.04496585 0.6500265  0.14300306
  0.10555115 0.7554373  0.8733019  0.91065574 0.949595   0.4678057
  0.7957018  0.06088004 0.5086471  0.77 ... [output truncated]

493312.94

The resulting centroids are stored in `kmeans.centroids`.

In [12]:
kmeans.centroids.shape

(1024, 64)

The values of the objective function (the total squared error, in the case of k-means) at each iteration are stored in `kmeans.obj`.

In [13]:
kmeans.obj

array([804842.8 , 506015.06, 499761.9 , 497422.06, 496135.38, 495346.94,
       494836.62, 494480.78, 494210.84, 494013.75, 493859.06, 493735.47,
       493647.5 , 493568.6 , 493505.4 , 493452.9 , 493410.2 , 493369.8 ,
       493339.38, 493312.94], dtype=float32)

In [14]:
len(kmeans.obj)

20
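
To make the objective concrete, here is a minimal plain-NumPy sketch of the total squared error for a given set of centroids. It is only an illustration of the quantity being tracked, not Faiss's implementation; the helper `kmeans_objective` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_objective(x, centroids):
    """Sum over all rows of x of the squared L2 distance to the nearest centroid."""
    # Pairwise squared distances, shape (n, k)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Toy data: two well-separated groups of 2-D points
x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]], dtype='float32')
good = np.array([[0.0, 1.0], [10.0, 1.0]], dtype='float32')  # one centroid per group
bad = np.array([[5.0, 1.0], [6.0, 1.0]], dtype='float32')    # both centroids in the middle

obj_good = kmeans_objective(x, good)
obj_bad = kmeans_objective(x, bad)
print(obj_good, obj_bad)
```

A better placement of the centroids gives a lower objective, which is why the values in `kmeans.obj` decrease from one iteration to the next.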

### Assignment

To compute the mapping from a set of vectors `x` to the cluster centroids after k-means has finished training, use:

In [20]:
x=xq[:5]
D, I = kmeans.index.search(x, 1)
print(I)

[[313]
 [ 51]
 [309]
 [ 51]
 [753]]


For each row vector in `x`, `I` holds the index of the nearest centroid and `D` holds the corresponding squared L2 distance.
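
As a rough illustration of what this search computes, the sketch below performs the same nearest-centroid assignment by brute force in NumPy. It assumes nothing about Faiss internals; `assign_to_centroids` and the toy arrays are hypothetical names introduced for this example.

```python
import numpy as np

def assign_to_centroids(x, centroids):
    """Return (D, I): squared L2 distance to, and index of, the nearest centroid."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, evaluated for all pairs at once
    d2 = (
        (x ** 2).sum(axis=1, keepdims=True)
        - 2.0 * x @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    I = d2.argmin(axis=1)
    D = d2[np.arange(len(x)), I]
    # Keep the (n, 1) shape that a search with k=1 produces
    return D[:, None], I[:, None]

centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]], dtype='float32')
x = np.array([[1.0, 0.0], [3.0, 1.0], [0.0, 5.0]], dtype='float32')
D, I = assign_to_centroids(x, centroids)
print(I.ravel(), D.ravel())
```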

For the reverse operation, e.g. to find the 15 nearest points in the dataset (here `xb`) to each computed centroid, a new index must be used:

In [22]:
index = faiss.IndexFlatL2(d)
index.add(xb)
D, I = index.search(kmeans.centroids, 15)
print(I)

[[79083 78056 78305 ... 78937 79683 78890]
 [42086 41925 41587 ... 41972 42339 41297]
 [36808 36129 35861 ... 36321 36296 35554]
 ...
 [18369 18180 18085 ... 17047 17510 18735]
 [66867 67113 67147 ... 66796 66963 66965]
 [50521 50623 51618 ... 51415 50469 51024]]


`I`, of shape `(ncentroids, 15)`, contains the nearest neighbors of each centroid.
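
The reverse lookup is an exhaustive k-nearest-neighbor search with the centroids as queries. A minimal NumPy sketch of that idea, on toy 1-D data, could look as follows; `knn_l2` is a hypothetical helper, and how it orders ties is a choice made for this example only.

```python
import numpy as np

def knn_l2(queries, database, k):
    """Exhaustive search: squared L2 distances and ids of the k nearest database rows."""
    d2 = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(axis=2)
    # Sort each row by increasing distance and keep the first k columns
    I = np.argsort(d2, axis=1, kind='stable')[:, :k]
    D = np.take_along_axis(d2, I, axis=1)
    return D, I

database = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]], dtype='float32')
centroids = np.array([[1.0], [10.5]], dtype='float32')
D, I = knn_l2(centroids, database, 2)
print(I)
print(D)
```

Each row of `I` lists database ids by increasing distance to the corresponding centroid, which is the same layout as the `(ncentroids, 15)` result above.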

### Clustering on the GPU
Clustering on one or several GPUs requires adapting the index object slightly.

# Computing a PCA

Let's reduce 40-D vectors to 10-D.

In [24]:
# random training data 
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))

[116.01529  115.3129   108.345406 107.58896  105.79919  101.523346
 100.9948    98.78388   98.579926  96.67803 ]


Note that the C++ equivalent of `apply_py` is `apply`; the Python name differs to avoid a clash with Python's own `apply` (a Python 2 built-in function rather than a reserved word).
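
For readers who want to see why the column magnitudes decrease, the following NumPy sketch computes a PCA in one standard way: center the data, take the eigenvectors of the covariance matrix, and project onto the directions with the largest eigenvalues. It is an assumption-laden illustration, not necessarily how `PCAMatrix` works internally; `pca_fit` and `pca_apply` are hypothetical helpers.

```python
import numpy as np

def pca_fit(x, d_out):
    """Return (mean, components), with components of shape (d_in, d_out)."""
    x = np.asarray(x, dtype='float64')
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    # eigh returns eigenvalues in ascending order; keep the d_out largest
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:d_out]
    return mean, eigvecs[:, order]

def pca_apply(x, mean, components):
    """Center x and project it onto the principal directions."""
    return (np.asarray(x, dtype='float64') - mean) @ components

rng = np.random.default_rng(0)
mt = rng.random((1000, 40)).astype('float32')   # random training data
mean, components = pca_fit(mt, 10)
tr = pca_apply(mt, mean, components)
energy = (tr ** 2).sum(axis=0)                  # per-column sum of squares
print(energy)
```

Because the output columns are ordered by decreasing eigenvalue, the per-column energy is non-increasing, which is the same pattern as in the Faiss output above.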

# PQ encoding / decoding
The `ProductQuantizer` object can be used to encode vectors to codes and to decode codes back to approximate vectors:

In [27]:
d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
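
To show the mechanism behind these calls, here is a toy product quantizer in plain NumPy: split each vector into `M` sub-vectors, run a small k-means per sub-space, and store one centroid id per sub-vector. This is a simplified sketch under stated assumptions, not Faiss's implementation. All function names are hypothetical, and it uses 16 centroids per sub-quantizer to keep it fast, whereas the 8-bit setting above corresponds to 256.

```python
import numpy as np

def kmeans_np(x, k, niter, rng):
    """Very small Lloyd's k-means; returns centroids of shape (k, dsub)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(niter):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:          # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(xt, M, ksub, niter=5, seed=0):
    """Train one codebook per sub-vector; returns an array of shape (M, ksub, dsub)."""
    rng = np.random.default_rng(seed)
    dsub = xt.shape[1] // M
    return np.stack([
        kmeans_np(xt[:, m * dsub:(m + 1) * dsub], ksub, niter, rng)
        for m in range(M)
    ])

def pq_encode(x, codebooks):
    """Replace each sub-vector by the id of its nearest centroid."""
    M, ksub, dsub = codebooks.shape
    codes = np.empty((len(x), M), dtype='uint8')
    for m in range(M):
        sub = x[:, m * dsub:(m + 1) * dsub]
        d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d2.argmin(axis=1)
    return codes

def pq_decode(codes, codebooks):
    """Look up each sub-vector's centroid and concatenate along the dimension axis."""
    return np.concatenate(
        [codebooks[m][codes[:, m]] for m in range(codes.shape[1])], axis=1)

rng = np.random.default_rng(1234)
d, M, ksub = 32, 4, 16                     # toy sizes, chosen for this sketch
xt = rng.random((2000, d)).astype('float32')
x = rng.random((500, d)).astype('float32')

codebooks = pq_train(xt, M, ksub)
codes = pq_encode(x, codebooks)
x2 = pq_decode(codes, codebooks)
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
print(codes.shape, avg_relative_error)
```

Decoding is lossy: `x2` is built from centroids only, so the relative error measures how much information the codes discard.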

A scalar quantizer works similarly:

In [28]:
d = 32  # data dimension

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode 
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
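
The idea of an 8-bit scalar quantizer can be sketched in a few lines of NumPy: learn a value range per dimension from the training set, then map each component to one of 256 levels. This is a simplified illustration whose rounding and range conventions are assumptions and may differ from Faiss's `QT_8bit`; `sq_train`, `sq_encode` and `sq_decode` are hypothetical helpers.

```python
import numpy as np

def sq_train(xt):
    """Learn a per-dimension [min, max] range from the training set."""
    return xt.min(axis=0), xt.max(axis=0)

def sq_encode(x, vmin, vmax):
    """Map each component to an integer level in [0, 255]."""
    span = np.maximum(vmax - vmin, 1e-12)   # avoid division by zero on flat dimensions
    scaled = (x - vmin) / span
    # Values outside the trained range are clipped to the nearest end level
    return np.clip(np.rint(scaled * 255), 0, 255).astype('uint8')

def sq_decode(codes, vmin, vmax):
    """Map integer levels back to approximate float values."""
    return vmin + codes.astype('float32') / 255 * (vmax - vmin)

rng = np.random.default_rng(0)
d = 32
xt = rng.random((10000, d)).astype('float32')   # train set
x = rng.random((20000, d)).astype('float32')    # dataset to encode

vmin, vmax = sq_train(xt)
codes = sq_encode(x, vmin, vmax)                # one byte per dimension
x2 = sq_decode(codes, vmin, vmax)
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
print(codes.shape, avg_relative_error)
```

With one byte per dimension, the reconstruction error is much smaller than with the 4-byte product quantizer codes above, at the cost of codes that are eight times larger for `d = 32`.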