Faiss building blocks: clustering, PCA, quantization
Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.
Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D tensor x is done as follows:
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
The resulting centroids are in kmeans.centroids. The values of the objective function (total squared error, in the case of k-means) along the iterations are stored in the variable kmeans.obj.
To compute the mapping from a set of vectors x to the cluster centroids after k-means has finished training, use:
D, I = kmeans.index.search(x, 1)
This returns, for each row vector of x, the index of its nearest centroid in I. D contains the corresponding squared L2 distances.
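Conceptually, kmeans.index.search(x, 1) performs an exact nearest-centroid assignment. A minimal NumPy sketch of the same computation, on a tiny hand-made dataset (illustrative values, not produced by Faiss):

```python
import numpy as np

# toy data: 5 points and 3 centroids in 2-D
x = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.], [10., 0.]], dtype='float32')
centroids = np.array([[0., 0.], [5., 5.], [10., 0.]], dtype='float32')

# squared L2 distance from every point to every centroid, shape (5, 3)
dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)

I = dists.argmin(axis=1)   # index of the nearest centroid per point
D = dists.min(axis=1)      # squared distance to that centroid

print(I)  # -> [0 0 1 1 2]
```

This mirrors the shapes returned by kmeans.index.search(x, 1), up to the extra trailing dimension of size 1 that Faiss keeps.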
For the reverse operation, e.g. to find the 15 nearest points in x to the computed centroids, a new index must be used:
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
I, of size (ncentroids, 15), contains the nearest neighbors of each centroid.
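This reverse search is again a brute-force k-nearest-neighbor computation. A NumPy sketch of what IndexFlatL2 computes here, using random stand-ins for the dataset and the centroids (assumed shapes only, not Faiss code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 8)).astype('float32')   # dataset added to the index
q = rng.standard_normal((4, 8)).astype('float32')     # queries (stand-in for centroids)
k = 15

# all pairwise squared L2 distances, shape (4, 100)
d2 = ((q[:, None, :] - x[None, :, :]) ** 2).sum(-1)

I = np.argsort(d2, axis=1)[:, :k]          # indices of the k nearest points per query
D = np.take_along_axis(d2, I, axis=1)      # their squared distances, sorted ascending

print(I.shape)  # -> (4, 15)
```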
Clustering on one or several GPUs requires adapting the indexing object slightly.
As an example of PCA, let's reduce 40D vectors to 10D.
# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
Note that in C++, apply_py is replaced with apply (apply is a reserved word in Python).
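The transform PCAMatrix learns is, up to implementation details, the classical eigendecomposition of the data covariance. A rough NumPy equivalent of the example above (an illustrative sketch, not Faiss's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
mt = rng.random((1000, 40)).astype('float32')   # random training data, as above

# center the data and compute the covariance matrix
mean = mt.mean(axis=0)
xc = mt - mean
cov = xc.T @ xc / len(xc)

# keep the 10 eigenvectors with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1][:10]
W = eigvecs[:, order]                           # (40, 10) projection matrix

tr = xc @ W                                     # reduced 10-D vectors

# as in the PCAMatrix example, the magnitude of tr's columns decreases
energy = (tr ** 2).sum(0)
assert all(energy[i] >= energy[i + 1] for i in range(9))
```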
The ProductQuantizer object can be used to encode or decode vectors to codes:
d = 32 # data dimension
cs = 4 # code size (bytes)
# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
pq = faiss.ProductQuantizer(d, cs, 8)  # 8 bits per sub-quantizer index
pq.train(xt)
# encode
codes = pq.compute_codes(x)
# decode
x2 = pq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
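Under the hood, product quantization splits each vector into sub-vectors and quantizes each sub-vector against its own small codebook. A toy NumPy sketch of the encode/decode round trip, with fixed random codebooks standing in for the trained ones (not Faiss's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, ksub = 8, 2, 4           # dimension, sub-quantizers, centroids per sub-space
dsub = d // M                  # sub-vector dimension

x = rng.random((20, d)).astype('float32')
# toy codebooks; in Faiss these come from training k-means in each sub-space
codebooks = rng.random((M, ksub, dsub)).astype('float32')

# encode: for each sub-vector, store the index of the nearest codebook entry
codes = np.empty((len(x), M), dtype='uint8')
for m in range(M):
    sub = x[:, m * dsub:(m + 1) * dsub]
    d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
    codes[:, m] = d2.argmin(axis=1)

# decode: replace each code with its codebook entry and concatenate
x2 = np.concatenate([codebooks[m][codes[:, m]] for m in range(M)], axis=1)

avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```

With trained codebooks (as pq.train produces) the reconstruction error would be much lower than with these random ones.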
A scalar quantizer works similarly:
d = 32 # data dimension
# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)
# encode
codes = sq.compute_codes(x)
# decode
x2 = sq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
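A scalar quantizer quantizes each dimension independently rather than in sub-vector groups. A NumPy sketch of uniform 8-bit quantization per dimension, in the spirit of QT_8bit (an illustrative simplification, not Faiss's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((20000, 32)).astype('float32')

# per-dimension range; Faiss estimates this on the train set, here we use x itself
vmin = x.min(axis=0)
vmax = x.max(axis=0)
scale = (vmax - vmin) / 255.0          # 8 bits -> 256 levels per dimension

# encode: one byte per dimension
codes = np.round((x - vmin) / scale).astype('uint8')
# decode
x2 = vmin + codes.astype('float32') * scale

avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```

With 8 bits per dimension the reconstruction error is tiny; QT_4bit trades half the code size for a larger error.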