# Faiss composite indexes
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)

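This section covers more advanced uses of Faiss. By combining several of the indexing methods described in the previous section, a better operating point can be reached.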

## Data preparation

In [2]:
import numpy as np
d = 512            # vector dimension
n_data = 2000      # number of database vectors
np.random.seed(0)
data = []
mu = 3
sigma = 0.1
for i in range(n_data):
    data.append(np.random.normal(mu, sigma, d))
data = np.array(data).astype('float32')

# query
n_query = 10
np.random.seed(12)
query = []
for i in range(n_query):
    query.append(np.random.normal(mu, sigma, d))
query = np.array(query).astype('float32')

In [5]:
print(data.shape)
print(query.shape)

(2000, 512)
(10, 512)


# Cell-probe method with a product quantizer as the coarse quantizer
A product quantizer (PQ) can also serve as the coarse quantizer. This corresponds to the Multi-Index described in [The inverted multi-index, Babenko & Lempitsky, CVPR'12]. For a PQ with m segments, each encoded with c centroids, the number of inverted lists is c^m. In practice, `m=2` is almost always used.

In Faiss, the corresponding coarse quantizer index is the `MultiIndexQuantizer`. This index is special because no vector is added to it. Therefore a specific flag (`quantizer_trains_alone`) has to be set on the `IndexIVF`.
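
To make the c^m count concrete, here is a minimal library-free sketch of how a multi-index could assign a vector to a cell: each segment is quantized separately and the per-segment centroid ids are combined into one cell id. The helper names (`nearest`, `multi_index_cell`), the random codebooks, and the mixed-radix id layout are assumptions made for illustration; the layout Faiss uses internally may differ.

```python
import random

def nearest(centroids, sub):
    """Index of the centroid closest (squared L2) to the sub-vector."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centroids[j], sub)))

def multi_index_cell(x, codebooks):
    """Map x to one of c**m cells by quantizing each of its m segments."""
    m = len(codebooks)
    seg = len(x) // m                      # dimensions per segment
    cell = 0
    for s, centroids in enumerate(codebooks):
        j = nearest(centroids, x[s * seg:(s + 1) * seg])
        cell = cell * len(centroids) + j   # mixed-radix combination of segment ids
    return cell

random.seed(0)
d_toy, m, nbits = 8, 2, 5
c = 2 ** nbits                             # 32 centroids per segment
# Hypothetical codebooks: random centroids stand in for trained ones.
codebooks = [[[random.gauss(0, 1) for _ in range(d_toy // m)] for _ in range(c)]
             for _ in range(m)]
n_cells = c ** m                           # 32**2 = 1024 inverted lists
x = [random.gauss(0, 1) for _ in range(d_toy)]
cell = multi_index_cell(x, codebooks)
print(n_cells, cell)
```

With only 2 * 32 = 64 stored centroids the space is split into 1024 cells, which is why the multi-index can afford a very fine partition.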

In [13]:
import sys
# sys.path.append('/home/maliqi/faiss/python/')
import faiss

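nbits_mi = 5   # bits per segment, so c = 2**5 = 32 centroids per segment
M_mi = 2       # m, the number of segments
coarse_quantizer_mi = faiss.MultiIndexQuantizer(d, M_mi, nbits_mi)  # no vectors need to be added to it
ncentroids_mi = 2 ** (M_mi * nbits_mi)  # c**m = 1024 inverted lists

index = faiss.IndexIVFFlat(coarse_quantizer_mi, d, ncentroids_mi)
index.quantizer_trains_alone = True  # the coarse quantizer is trained on its own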
index.train(data)
index.add(data)
index.nprobe = 50
dis, ind = index.search(query, 10)
print(ind)

[[1269 1028 1895  120 1267  178 1061 1972 1029 1913]
 [1398  289   70 1023 1177  940  940  969  969 1568]
 [ 345  389 1904 1992 1612 1623 1632  539  366 1805]
 [ 112  112 1412 1624  879  394 1506 1398   91  440]
 [  94 1459 1517 1723 1255   66  238 1755  472  375]
 [ 574  574 1523   91  456  296  296  444 1384  103]
 [1391  876   91 1914   78   78  969  732  732  999]
 [1662 1654  722 1070  121 1496  631 1442 1442 1738]
 [ 154   99   99   31 1237  289  661  426 1008 1727]
 [ 375 1826  610  750 1430  459 1339  471  441  818]]


Compared with an `IndexFlat` coarse quantizer, the `MultiIndexQuantizer` is better suited to fast, low-accuracy operating points.

# Pre-filtering PQ codes with polysemous codes
It is about 6x faster to compare codes with Hamming distances than to use a product quantizer. Moreover, with a proper reordering of the quantization centroids, the Hamming distances between PQ codes become correlated with the true distances. Then, by applying a threshold on the Hamming distance, most of the expensive PQ code comparisons can be avoided.
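
The filtering idea can be sketched without Faiss. In the toy example below, codes are plain Python integers, `hamming` counts differing bits, and the "expensive" distance is a hypothetical lookup table standing in for a PQ distance computation. The function names and the choice of a non-strict comparison against the threshold are assumptions of this sketch, not the Faiss implementation.

```python
def hamming(a, b):
    """Number of differing bits between two codes stored as Python ints."""
    return bin(a ^ b).count("1")

def filtered_search(query_code, db_codes, ht, expensive_distance):
    """Evaluate expensive_distance only for codes within Hamming distance ht."""
    results = []
    for idx, code in enumerate(db_codes):
        if hamming(query_code, code) > ht:
            continue                       # cheap rejection, no expensive distance computed
        results.append((expensive_distance(idx), idx))
    return sorted(results)

# Toy 8-bit codes; Hamming distances to the query 0b00000000 are 0, 3, 8 and 1.
db_codes = [0b00000000, 0b00000111, 0b11111111, 0b00000001]
true_dist = [0.1, 0.9, 5.0, 0.2]           # hypothetical "true" distances
hits = filtered_search(0b00000000, db_codes, ht=2,
                       expensive_distance=lambda i: true_dist[i])
print(hits)  # [(0.1, 0), (0.2, 3)]
```

Only two of the four database codes reach the expensive comparison here; the other two are rejected by the bit count alone.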

In [16]:
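index = faiss.IndexPQ(d, 16, 8)
# before training
index.do_polysemous_training = True
index.train(data)
index.add(data)  # the index must hold vectors before it is searched

# before searching
index.search_type = faiss.IndexPQ.ST_polysemous
index.polysemous_ht = 54    # the Hamming threshold
D, I = index.search(query, 10)
print(I)
print(D)

Without the `index.add(data)` call the index is empty, and every query returns label `-1` with distance `3.4028235e+38` (the largest float32 value), as shown here:
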
[[-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]
 ...
 [-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]
 [-1 -1 -1 ... -1 -1 -1]]
[[3.4028235e+38 3.4028235e+38 3.4028235e+38 ... 3.4028235e+38
  3.4028235e+38 3.4028235e+38]
 [3.4028235e+38 3.4028235e+38 3.4028235e+38 ... 3.4028235e+38
  3.4028235e+38 3.4028235e+38]
 [3.4028235e+38 3.4028235e+38 3.4028235e+38 ... 3.4028235e+38
  3.4028235e+38 3.4028235e+38]
 ...
 [3.4028235e+38 3.4028235e+38 3.4028235e+38 ... 3.4028235e+38
  3.4028235e+38 3.4028235e+38]
 [3.4028235e+38 3.4028235e+38 3.4028235e+38 ... 3.4028235e+38
  3.4028235e+38 3.4028235e+38]
 [3.4028235e+38 3.4028235e+38 3.4028235e+38 ... 3.4028235e+38
  3.4028235e+38 3.4028235e+38]]


For `IndexIVFPQ`:

```
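nlist = 16                               # number of inverted lists (illustrative value, an assumption)
coarse_quantizer = faiss.IndexFlatL2(d)  # flat coarse quantizer (an assumption; any trained quantizer works)
index = faiss.IndexIVFPQ(coarse_quantizer, d, nlist, 16, 8)
# before training
index.do_polysemous_training = True
index.train(data)
index.add(data)

# before searching
index.polysemous_ht = 54 # the Hamming threshold
D, I = index.search(query, 10)
print(I)
print(D)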
```

To set a reasonable threshold, keep the following in mind.

The threshold should be between 0 and the number of bits per code (128 = 16*8 in this case), and codes follow a binomial distribution.

Setting the threshold to 1/2 the number of bits per code will spare 1/2 of the code comparisons, which is not enough. It should be set to a lower value (hence the 54 for 128-bit codes).
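
A rough way to see why 54 is a sensible value is to evaluate the binomial model directly. The sketch below assumes an idealized setting in which each of the 128 bits differs between two codes independently with probability 1/2; real PQ codes only approximate this. The function name `fraction_kept` is made up for this example.

```python
from math import comb

def fraction_kept(n_bits, ht):
    """P(Hamming distance <= ht) when each bit differs independently with p = 1/2."""
    return sum(comb(n_bits, k) for k in range(ht + 1)) / 2 ** n_bits

n_bits = 16 * 8                            # 128-bit codes, as in the example above
for ht in (64, 54):
    print(ht, round(fraction_kept(n_bits, ht), 3))
```

Under this model a threshold of 64 lets slightly more than half of the codes through, while 54 lets through only a few percent, so most of the expensive comparisons are skipped.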

# IndexIVFPQR: refining `IVFPQ` search results with an additional product quantizer
The `IndexIVFPQR` adds an additional level of quantization (the third!) on top of an `IndexIVFPQ`. Similar to `IndexRefineFlat`, it refines the distances computed by an `IndexIVFPQ` and reorders the results based on them.
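
The refine-and-reorder step can be illustrated with a small library-free sketch. Simple grid rounding stands in for the product quantizers, and the inverted-file level is left out, so only two of the three levels appear. A first-level code gives approximate distances for a shortlist, and a refinement code for the residual is then used to recompute distances and reorder. All names here (`encode`, `search_refined`, the step sizes) are illustrative assumptions; `k_factor` is borrowed as a name for the shortlist multiplier.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(v, step):
    """Round each component to a grid of the given step (stand-in for a PQ)."""
    return [round(x / step) * step for x in v]

def encode(v, coarse_step=1.0, fine_step=0.1):
    """First-level code plus a refinement code for the residual."""
    first = quantize(v, coarse_step)
    residual = [x - f for x, f in zip(v, first)]
    return first, quantize(residual, fine_step)

def search_refined(query, codes, k, k_factor=4):
    # Stage 1: rank by the first-level reconstruction, keep k * k_factor candidates.
    shortlist = sorted(range(len(codes)),
                       key=lambda i: sq_dist(query, codes[i][0]))[:k * k_factor]
    # Stage 2: recompute distances with the refined reconstruction and reorder.
    def refined(i):
        first, res = codes[i]
        return sq_dist(query, [f + r for f, r in zip(first, res)])
    return sorted(shortlist, key=refined)[:k]

database = [[0.4, 0.4], [0.1, 0.1], [0.45, -0.3], [2.0, 2.0]]
codes = [encode(v) for v in database]
print(search_refined([0.12, 0.08], codes, k=1))  # [1]
```

The first three database vectors share the same first-level code, so stage 1 cannot tell them apart; the refinement code is what moves vector 1 to the top.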