# Faiss building blocks: clustering, PCA, quantization

Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.

## Clustering

Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D array `x` is done as follows:

```
import faiss

ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
```

The resulting centroids are in `kmeans.centroids`.

The values of the objective function (total square error in the case of k-means) over the iterations are stored in the variable `kmeans.obj`.
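
As a quick sanity check, both can be inspected directly (a small sketch continuing the snippet above):

```
# centroids is a (ncentroids, d) float32 array
print(kmeans.centroids.shape)

# one objective value per iteration; it should decrease
print(kmeans.obj)
```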

### Assignment

To compute the mapping from a set of vectors `x` to the cluster centroids after kmeans has finished training, use:

`D, I = kmeans.index.search(x, 1)`

This returns the nearest centroid for each row vector of `x` in `I`. `D` contains the corresponding squared L2 distances.
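
For instance, the assignment can be used to count how many vectors of `x` fall into each cluster (a small NumPy sketch on the `I` array returned above):

```
import numpy as np

# I has shape (n, 1): one centroid id per row of x
cluster_sizes = np.bincount(I.ravel(), minlength=ncentroids)
print(cluster_sizes)
```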

For the reverse operation, e.g. finding the 15 nearest points of `x` to the computed centroids, a new index must be used:

```
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
```

`I` of size (ncentroids, 15) contains the nearest neighbors for each centroid.

### Clustering on the GPU

Clustering on one or several GPUs requires adapting the indexing object slightly.
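
With a GPU build of Faiss, the simplest route is the `gpu` option of the `faiss.Kmeans` wrapper; the sketch below assumes such a build (pass an integer instead of `True` to restrict the number of GPUs used):

```
import faiss

ncentroids = 1024
niter = 20
d = x.shape[1]
# gpu=True runs the clustering on all available GPUs
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=True, gpu=True)
kmeans.train(x)
```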

## Computing a PCA

Let's reduce 40D vectors to 10D.

```
import faiss
import numpy as np

# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
```

Note that in C++ the method is simply called `apply`; the Python interface exposes it as `apply_py` because `apply` clashes with a built-in Python name.
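
The transform can also be approximately undone. The sketch below assumes the `reverse_transform` method of the trained `PCAMatrix`, which maps the 10D vectors back to 40D (the variance discarded by the PCA is lost):

```
# back-project to the original 40D space and measure the reconstruction error
mt2 = mat.reverse_transform(tr)
err = ((mt - mt2) ** 2).sum() / (mt ** 2).sum()
print(err)
```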

## PQ encoding / decoding

The `ProductQuantizer` object can be used to encode or decode vectors to codes:

```
import faiss
import numpy as np

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
```
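
The third constructor argument (8) is the number of bits per sub-quantizer, so each vector is split into `cs` sub-vectors and each sub-vector is encoded on one byte. The codes returned by `compute_codes` are therefore a `uint8` array with `cs` bytes per vector:

```
# with 8 bits per sub-quantizer, the code takes cs bytes per vector
assert codes.shape == (n, cs)
assert codes.dtype == 'uint8'
```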

A scalar quantizer works similarly:

```
import faiss
import numpy as np

d = 32  # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
```