Faiss building blocks: clustering, PCA, quantization
Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.
Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D tensor `x` is done as follows:
```python
import numpy as np
import faiss

ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]   # x is an (n, d) float32 numpy array
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
```
The resulting centroids are in `kmeans.centroids`. The values of the objective function (total squared error in the case of k-means) along the iterations are stored in the variable `kmeans.obj`, and more extensive statistics in `kmeans.iteration_stats`.
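For example, convergence can be checked by inspecting the objective values (a minimal snippet of ours):

```python
# the objective should decrease across iterations
print(kmeans.obj)      # one total squared error value per iteration
print(kmeans.obj[-1])  # final clustering objective
```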
To run it on GPU, add the option `gpu=True` to the `Kmeans` constructor. This will use all available GPUs on the machine.
The `Kmeans` object is mainly a layer above the C++ `Clustering` object, and all fields of that object can be set via the constructor (a usage sketch follows the list). The fields include:
- `nredo`: run the clustering this number of times, and keep the best centroids (selected according to the clustering objective)
- `verbose`: make clustering more verbose
- `spherical`: perform spherical k-means -- the centroids are L2-normalized after each iteration
- `int_centroids`: round centroid coordinates to integers
- `update_index`: re-train the index after each iteration?
- `min_points_per_centroid` / `max_points_per_centroid`: below the minimum you get a warning, above the maximum the training set is subsampled
- `seed`: seed for the random number generator
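All of these fields can be passed as keyword arguments to the Python constructor. A minimal sketch, with illustrative parameter values:

```python
# spherical k-means, restarted 5 times, keeping the best run
kmeans = faiss.Kmeans(d, ncentroids, niter=20, verbose=True,
                      nredo=5, spherical=True, seed=1234,
                      max_points_per_centroid=512)
kmeans.train(x)
```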
To compute the mapping from a set of vectors `x` to the cluster centroids after `kmeans` has finished training, use:

```python
D, I = kmeans.index.search(x, 1)
```

This returns the nearest centroid for each row vector of `x` in `I`. `D` contains the squared L2 distances.
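For example, per-cluster sizes can be derived from the assignment (a small sketch of ours):

```python
# count how many input vectors fall into each cluster
cluster_sizes = np.bincount(I.ravel(), minlength=ncentroids)
print(cluster_sizes.min(), cluster_sizes.max())
```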
For the reverse operation, e.g. to find the 15 nearest points in `x` to the computed centroids, a new index must be used:

```python
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
```

`I`, of size `(ncentroids, 15)`, contains the nearest neighbors of each centroid.
Clustering on one or several GPUs can be done via the `gpu=True` (use all GPUs) or `gpu=3` (use 3 GPUs) constructor option of the `Kmeans` object.
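A minimal sketch, assuming a GPU-enabled Faiss build:

```python
# use all available GPUs for the clustering iterations
kmeans = faiss.Kmeans(d, ncentroids, niter=20, verbose=True, gpu=True)
kmeans.train(x)
```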
Faiss also provides a PCA transform. Let's reduce 40D vectors to 10D.
```python
# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
```
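Since `PCAMatrix` is a `VectorTransform`, the (lossy) inverse transform is also available; a short sketch:

```python
# map the 10D vectors back to 40D; the components dropped by the PCA are lost
back = mat.reverse_transform(tr)
rel_err = ((mt - back) ** 2).sum() / (mt ** 2).sum()
print(rel_err)
```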
The quantizer objects inherit from `Quantizer`, which offers three common methods (see `impl/Quantizer.h`):
- `train`: train the quantizer on a matrix of vectors
- `compute_codes` and `decode`: encoder and decoder. The encoder is usually lossy and returns a `uint8` matrix of codes for each input vector.
- `get_DistanceComputer`: a method returning a `DistanceComputer` object (see below).
The state of the `Quantizer` object is the result of the training (codebook tables or normalization coefficients). The field `code_size` indicates the number of bytes per code generated by the quantizer.
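As a quick illustration (parameter values are ours): a product quantizer with 4 sub-quantizers of 8 bits each produces 4-byte codes:

```python
pq = faiss.ProductQuantizer(32, 4, 8)  # dimension 32, 4 sub-quantizers, 8 bits each
assert pq.code_size == 4               # 4 * 8 bits = 4 bytes per code
```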
The supported quantizer types are:

- `ScalarQuantizer`: quantizes each vector component separately in a linear range.
- `ProductQuantizer`: performs vector quantization on sub-vectors.
- `AdditiveQuantizer`: encodes a vector as a sum of codebook entries, see Additive Quantizers for details. Additive quantizers can be trained in several ways, hence the sub-classes `ResidualQuantizer`, `LocalSearchQuantizer`, `ProductAdditiveQuantizer`.
Interestingly, each quantizer is a superset of the previous one.
Each quantizer has a corresponding index type that also stores a set of quantized vectors.
| Quantizer class | Flat index class | IVF index class |
|---|---|---|
| `ScalarQuantizer` | `IndexScalarQuantizer` | `IndexIVFScalarQuantizer` |
| `ProductQuantizer` | `IndexPQ` | `IndexIVFPQ` |
| `AdditiveQuantizer` | `IndexAdditiveQuantizer` | `IndexIVFAdditiveQuantizer` |
| `ResidualQuantizer` | `IndexResidualQuantizer` | `IndexIVFResidualQuantizer` |
| `LocalSearchQuantizer` | `IndexLocalSearchQuantizer` | `IndexIVFLocalSearchQuantizer` |
In addition, all indexes except the `ScalarQuantizer` ones exist in a `FastScan` version; see Fast accumulation of PQ and AQ codes (FastScan).
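A minimal sketch of the correspondence, with illustrative parameters: `IndexPQ` embeds a `ProductQuantizer` (accessible as `index.pq`) and stores the codes it produces:

```python
d = 32
xt = np.random.rand(10000, d).astype('float32')

index = faiss.IndexPQ(d, 4, 8)    # same parameters as ProductQuantizer(d, 4, 8)
index.train(xt)                   # trains the underlying index.pq
index.add(xt)                     # vectors are stored as 4-byte PQ codes
D, I = index.search(xt[:5], 10)   # distances computed on the compressed codes
```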
Some quantizers return a `DistanceComputer` object. Its purpose is to compute vector-to-code distances efficiently when one vector is compared with many codes. In that case, it is often possible to precompute a set of tables so that the distances can be computed directly in the compressed domain. Therefore, the `DistanceComputer` offers:
- a `set_query` method that sets the current uncompressed vector to compare with
- a `distance_to_code` method that computes the actual distance to a given code.
The `ProductQuantizer` object can be used to encode or decode vectors to codes:
```python
d = 32    # data dimension
cs = 4    # code size (bytes)

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
```
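For reference, the encoder output is one `cs`-byte code per input vector:

```python
# compute_codes returns an (n, code_size) uint8 matrix
assert codes.shape == (n, cs)
assert codes.dtype == np.uint8
```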
A scalar quantizer works similarly:
```python
d = 32    # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
```