Faiss building blocks: clustering, PCA, quantization

Matthijs Douze edited this page Jun 28, 2018 · 2 revisions

Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.

Clustering

Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D array x is done as follows:

import faiss

# x is assumed to be an (n, d) float32 array of training vectors
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)

The resulting centroids are in kmeans.centroids.

The values of the objective function (total squared error, in the case of k-means) over the iterations are stored in the variable kmeans.obj.
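To see why this objective decreases over iterations, here is a minimal NumPy-only sketch of one Lloyd iteration (assignment step, then update step) on hypothetical toy data; this is an illustration of the algorithm, not Faiss's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.random((1000, 8)).astype('float32')   # toy data: 1000 vectors in 8-D
k = 16
centroids = xb[rng.choice(len(xb), k, replace=False)].copy()

def objective(xb, centroids):
    # total squared error: sum over points of the squared L2 distance
    # to their nearest centroid
    d2 = ((xb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

obj_before = objective(xb, centroids)

# assignment step: nearest centroid for each point
d2 = ((xb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
assign = d2.argmin(axis=1)

# update step: each centroid moves to the mean of its assigned points
for j in range(k):
    pts = xb[assign == j]
    if len(pts) > 0:
        centroids[j] = pts.mean(axis=0)

obj_after = objective(xb, centroids)
```

Both steps can only decrease the total squared error, which is why the values recorded in kmeans.obj are non-increasing.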

Assignment

To compute the mapping from a set of vectors x to the cluster centroids after kmeans has finished training, use:

D, I = kmeans.index.search(x, 1)

This will return, in I, the nearest centroid for each row vector of x. D contains the corresponding squared L2 distances.
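Semantically, this search is a brute-force nearest-centroid computation. The following NumPy-only sketch (on hypothetical toy data) reproduces what the assignment returns:

```python
import numpy as np

rng = np.random.default_rng(0)
xq = rng.random((100, 4)).astype('float32')        # toy query vectors
centroids = rng.random((10, 4)).astype('float32')  # toy centroids

# squared L2 distance from every vector to every centroid
d2 = ((xq[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)

I = d2.argmin(axis=1)   # index of the nearest centroid per vector
D = d2.min(axis=1)      # squared distance to that centroid
```

Faiss computes the same result, but with heavily optimized distance computations.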

For the reverse operation, e.g. finding the 15 nearest points in x to each computed centroid, a new index must be used:

index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)

I, of size (ncentroids, 15), contains the nearest neighbors of each centroid.

Clustering on the GPU

Clustering on one or several GPUs requires adapting the indexing object slightly.

Computing a PCA

Let's reduce 40D vectors to 10D.

import faiss
import numpy as np

# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))

Note that in C++, apply_py is replaced with apply (apply is a reserved word in Python).
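The effect of PCAMatrix can be reproduced with plain NumPy by projecting centered data onto the top eigenvectors of its covariance matrix. This is a sketch of the underlying math, not Faiss's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
mt = rng.random((1000, 40)).astype('float32')

# center the data and form the covariance matrix
mu = mt.mean(axis=0)
xc = mt - mu
cov = xc.T @ xc / len(xc)

# eigendecomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# keep the 10 directions of largest variance
W = eigvecs[:, ::-1][:, :10]
tr = xc @ W

# column energies decrease, mirroring the PCAMatrix example above
energies = (tr ** 2).sum(0)
```

The i-th column energy is (up to a factor n) the i-th largest eigenvalue of the covariance matrix, which is why the printed magnitudes decrease.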

PQ encoding / decoding

The ProductQuantizer object can be used to encode vectors to codes and decode them back:

import faiss
import numpy as np

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)  # 8 bits per sub-quantizer
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute average relative reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
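The principle behind product quantization can be sketched in plain NumPy: split each vector into cs sub-vectors, and encode each sub-vector as the index of its nearest codeword in a per-subspace codebook. The sketch below uses random training points as codewords (Faiss instead runs k-means per subspace, so its codebooks are better), purely to illustrate the encode/decode mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
d, cs, ksub = 32, 4, 256   # 4 sub-quantizers, 8 bits -> 256 codewords each
dsub = d // cs             # each sub-vector has d / cs = 8 dimensions

xt = rng.random((10000, d)).astype('float32')  # toy train set
x = rng.random((200, d)).astype('float32')     # toy data to encode

# toy "training": pick 256 training sub-vectors per subspace as the codebook
codebooks = np.stack([
    xt[rng.choice(len(xt), ksub, replace=False), m * dsub:(m + 1) * dsub]
    for m in range(cs)
])  # shape (cs, ksub, dsub)

# encode: nearest codeword per sub-vector -> one byte per sub-quantizer
codes = np.empty((len(x), cs), dtype=np.uint8)
for m in range(cs):
    sub = x[:, m * dsub:(m + 1) * dsub]
    d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
    codes[:, m] = d2.argmin(axis=1)

# decode: concatenate the selected codewords from each codebook
x2 = np.concatenate([codebooks[m][codes[:, m]] for m in range(cs)], axis=1)

avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```

Each vector is thus compressed to cs bytes, at the cost of a reconstruction error that depends on the codebook quality.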

A scalar quantizer works similarly:

import faiss
import numpy as np

d = 32  # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute average relative reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
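The idea of 8-bit scalar quantization can likewise be sketched in plain NumPy: learn a per-dimension value range on the training set, then round each value to one of 256 uniformly spaced levels. This mirrors the spirit of QT_8bit, though Faiss's trained ranges and rounding details may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
xt = rng.random((10000, d)).astype('float32')  # toy train set
x = rng.random((200, d)).astype('float32')     # toy data to encode

# "train": record per-dimension min and max on the training set
vmin = xt.min(axis=0)
vmax = xt.max(axis=0)
scale = (vmax - vmin) / 255.0   # 8 bits -> 256 levels per dimension

# encode: round each value to its nearest level, one byte per dimension
codes = np.clip(np.round((x - vmin) / scale), 0, 255).astype(np.uint8)

# decode: map levels back to float values
x2 = vmin + codes.astype('float32') * scale

avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```

Because every dimension keeps 8 bits, the reconstruction error is much smaller than with product quantization, at the cost of a larger code (d bytes instead of cs bytes per vector).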