Faiss building blocks: clustering, PCA, quantization
Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.
Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D array x is done as follows:
```python
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
```
The resulting centroids are in kmeans.centroids.

The values of the objective function (total square error in the case of k-means) over the iterations are stored in the variable kmeans.obj.
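For example, the per-iteration objective can be printed to check that training behaved as expected (a minimal sketch; the attribute name obj is an assumption and may differ between Faiss versions):

```python
# inspect the per-iteration k-means objective (assumed attribute: kmeans.obj);
# the values should broadly decrease as the iterations progress
for i, obj in enumerate(kmeans.obj):
    print("iteration", i, "objective", obj)
```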
To compute the mapping from a set of vectors x to the cluster centroids after kmeans has finished training, use:
```python
D, I = kmeans.index.search(x, 1)
```
This returns the nearest centroid for each row vector in x: I contains the index of the nearest centroid and D contains the squared L2 distances.
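As a quick sanity check, these distances can be averaged to get the mean quantization error of the clustering (a minimal sketch reusing the D computed above):

```python
# D has shape (n, 1): squared L2 distance from each vector to its assigned centroid
mean_quantization_error = D.mean()
print("mean squared quantization error:", mean_quantization_error)
```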
For the reverse operation, e.g. finding the 15 nearest points in x to the computed centroids, a new index must be used:
```python
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
```
The array I, of size (ncentroids, 15), contains the nearest neighbors of each centroid.
Clustering on the GPU
Clustering on one or several GPUs requires adapting the indexing object slightly.
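A minimal sketch, assuming a Faiss build with GPU support where faiss.Kmeans accepts a gpu option (treat the exact keyword as an assumption and check your version):

```python
# same clustering as above, but with the distance computations run on GPU;
# gpu=True uses all available GPUs (an integer limits how many are used)
kmeans_gpu = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans_gpu.train(x)
```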
Computing a PCA
Let's reduce 40D vectors to 10D.
```python
# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
```
Note that in C++, apply_py is replaced with apply (apply is a reserved word in Python).
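The transform can also be (approximately) inverted to map reduced vectors back to the original space. The sketch below assumes the reverse_transform method of Faiss linear transforms; since the PCA drops 30 of the 40 components, the reconstruction is lossy:

```python
# map the 10D vectors back to 40D (assumed method: reverse_transform)
mt_back = mat.reverse_transform(tr)
# relative reconstruction error due to the discarded components
rel_err = ((mt - mt_back) ** 2).sum() / (mt ** 2).sum()
print("relative PCA reconstruction error:", rel_err)
```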
PQ encoding / decoding
The ProductQuantizer object can be used to encode or decode vectors to codes:
```python
d = 32    # data dimension
cs = 4    # code size (bytes)

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```
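Each code occupies cs bytes: with 8 bits per sub-quantizer there are 256 centroids per sub-vector, i.e. one byte per sub-vector. A quick check of the code layout (a sketch reusing the pq and codes from above):

```python
# compute_codes returns one uint8 row of cs bytes per encoded vector
assert codes.shape == (n, cs)
assert codes.dtype == 'uint8'
```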
A scalar quantizer works similarly:
```python
d = 32    # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```
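To see the size/accuracy trade-off, the same steps can be repeated with the 4-bit variant; this sketch only swaps the quantizer type and reuses the data from above:

```python
# QT_4bit stores 4 bits per dimension: half the code size,
# at the cost of a larger reconstruction error
sq4 = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_4bit)
sq4.train(xt)
codes4 = sq4.compute_codes(x)
x2_4 = sq4.decode(codes4)
avg_relative_error_4bit = ((x - x2_4) ** 2).sum() / (x ** 2).sum()
print(avg_relative_error, avg_relative_error_4bit)
```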