# Guidelines to choose an index

https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

The following basic questions help in choosing an index. They apply mainly to the L2 distance. We indicate:
- the index_factory string for each of them.
- if there are parameters, we indicate them as the corresponding ParameterSpace argument.

In [1]:
import numpy as np

d = 64                           # dimension
nb = 10000                       # database size
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.

# Will you perform few searches?

If you plan to perform only a few searches (say 1000-10000), the index building time will not be amortized by the search time. In that case, direct computation is the most efficient option.

This is done via a "Flat" index. If the whole dataset does not fit in memory, you can build small indexes one batch at a time and merge the search results (see [Combining results of several searches](https://github.com/facebookresearch/faiss/wiki/Brute-force-search-without-an-index#combining-the-results-from-several-searches) on how to do this).

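# Do you need exact results?

### If yes, use "Flat"

The only indexes that can guarantee exact results are `IndexFlatL2` and `IndexFlatIP` (inner product). They provide the baseline results for the other indexes. A Flat index does not compress the vectors, but does not add overhead on top of them. It does not support adding with ids (`add_with_ids`), only sequential adds, so if you need `add_with_ids`, use `"IDMap,Flat"`. The `Flat` index does not require training and has no parameters.

Supported on GPU: yes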

In [2]:
import faiss

index = faiss.index_factory(d, "Flat")
index.add(xb)
D, I = index.search(xb[:5], 10)
print(I)

[[   0  393  363   78  924  364  100  677  491  247]
 [   1  555  277  364  617  175 1063  756   77  191]
 [   2  304  101   13  801  134  365  225  837  397]
 [   3  173   18  182  484   64  527  887  409  316]
 [   4  288  370  531  178  381  175  270   18  364]]


In [4]:
ids = np.arange(100, nb + 100, dtype='int64')  # custom ids (position + 100); Faiss expects int64 ids

index = faiss.index_factory(d, "IDMap,Flat")  
index.add_with_ids(xb, ids)
D, I = index.search(xb[:5], 10)
print(I)

[[ 100  493  463  178 1024  464  200  777  591  347]
 [ 101  655  377  464  717  275 1163  856  177  291]
 [ 102  404  201  113  901  234  465  325  937  497]
 [ 103  273  118  282  584  164  627  987  509  416]
 [ 104  388  470  631  278  481  275  370  118  464]]


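# Is memory a concern?

Note: whatever the index type, Faiss keeps the whole index in memory.
If exact results are not required and memory is limited, you have to trade off accuracy against speed within that memory budget.

### If not, use "HNSWx"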

If you have a lot of memory or the dataset is small, `HNSW` is the best option: it is a very fast and accurate index. The 4 <= x <= 64 is the number of links per vector, higher is more accurate but uses more RAM. The speed-accuracy tradeoff is set via the efSearch parameter. The memory usage is (d * 4 + x * 2 * 4) bytes per vector.
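As a quick worked example of the memory formula, the sketch below evaluates it for a few values of x. The helper name `hnsw_bytes_per_vector` is hypothetical, and the numbers cover only what the formula counts.

```python
def hnsw_bytes_per_vector(d, x):
    """Evaluate the (d * 4 + x * 2 * 4) bytes-per-vector formula quoted above."""
    return d * 4 + x * 2 * 4

d, nb = 64, 10000
for x in (8, 32, 64):
    per_vec = hnsw_bytes_per_vector(d, x)
    print(f"HNSW{x}: {per_vec} bytes/vector, {per_vec * nb / 1e6:.2f} MB for {nb} vectors")
```

For d = 64 this gives 320, 512 and 768 bytes per vector, against 256 bytes for the raw float32 vector alone.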

`HNSW` supports only sequential adds (not `add_with_ids`), so here again, prefix with `IDMap` if needed. HNSW does not require training and does not support removing vectors from the index.

Supported on GPU: no

In [23]:
index = faiss.index_factory(d, "HNSW8")
index.add(xb)
D, I = index.search(xb[:5], 10)
print(I)

[[   0  393  363  924  364  100  247 1124  270  608]
 [   1  555  277  364  617  175 1063  756   77  917]
 [   2  101   13  801  134  365  225  837  397  265]
 [   3  173   18  182   64  527  887  316  412  911]
 [   4  288  370  531  178  381  175   18  364  614]]


### If somewhat, use "...,Flat"

"..." means a clustering of the dataset has to be performed beforehand (read below). After clustering, "Flat" just organizes the vectors into buckets, so it does not compress them, the storage size is the same as that of the original dataset. The tradeoff between speed and accuracy is set via the nprobe parameter.

Supported on GPU: yes (but see below, the clustering method must be supported as well)

In [24]:
index = faiss.index_factory(d, "IVF100,Flat")
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], 10)
print(I)

[[   0  363  584 1124  608  424  278  175  281  549]
 [   1  277  617   88  270  393  306  779 1246  138]
 [   2  304  801  837   43  642  282   81   31  513]
 [   3   18  182  484   64   74  210    8  149  225]
 [   4   18  614  225   52  159  541   61   51  484]]


### If quite important, use "PCARx,...,SQ8"

If storing the whole vectors is too expensive, this performs two operations:

- a PCA to dimension x to reduce the dimension
- a scalar quantization of each vector component into 1 byte.

Therefore the total storage is x bytes per vector.

`SQ4` and `SQ6` are also supported (for 4 or 6 bits per vector component).
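The storage arithmetic can be checked with a few lines of plain Python. The helper `sq_code_bytes` is hypothetical, and rounding the bit count up to whole bytes is an assumption. The figures cover only the encoded vector, not ids or other index structures.

```python
def sq_code_bytes(x, bits):
    """Bytes per vector for x components at `bits` bits each (rounded up: an assumption)."""
    return (x * bits + 7) // 8

d, x = 64, 16
raw = d * 4                                  # original float32 vector
for bits in (8, 6, 4):
    print(f"PCAR{x} + SQ{bits}: {sq_code_bytes(x, bits)} bytes/vector (raw: {raw})")
```

With d = 64 reduced to x = 16, the code is 16 bytes with SQ8, 12 with SQ6 and 8 with SQ4, compared with 256 bytes for the raw vector.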

Supported on GPU: yes (except SQ6)

In [25]:
index = faiss.index_factory(d, "PCAR16,IVF50,SQ8")
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], 10)
print(I)

[[  0 456 216 205  81 393 363  76 100  78]
 [  1  15 364 389 779  34  48 477 698   5]
 [  2 304 100   4 265 225 456  22 159  35]
 [  3 173 373 383 434 182  76 528 108 149]
 [  4 153 527 225  22 244 541 159   8 211]]


### If very important, use "OPQx_y,...,PQx"

PQx compresses the vectors using a product quantizer that outputs x-byte codes. x is typically <= 64, for larger codes SQ is usually as accurate and faster. OPQ is a linear transformation of the vectors to make them easier to compress. y is a dimension such that:

- y is a multiple of x (required)
- y <= d, with d the dimension of the input vectors (preferable)
- y <= 4*x (preferable)

Supported on GPU: yes (note: the OPQ transform is done on CPU, but it is not performance critical)
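The three conditions on y are easy to check before building a factory string. The helper below is a hypothetical convenience, not part of Faiss; it only restates the rules listed above.

```python
def check_opq_dims(x, y, d):
    """Evaluate the three guidelines for "OPQx_y,...,PQx" on input dimension d."""
    return {
        "y is a multiple of x (required)": y % x == 0,
        "y <= d (preferable)": y <= d,
        "y <= 4*x (preferable)": y <= 4 * x,
    }

for y in (64, 512):
    print(y, check_opq_dims(x=32, y=y, d=64))
```

For x = 32 and d = 64, y = 64 satisfies all three conditions, while y = 512 meets only the required one.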

In [26]:
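# Note: y = 512 is a multiple of x = 32 as required, but it exceeds both d = 64
# and 4*x = 128, so it does not follow the two "preferable" guidelines; y = 64 would.
index = faiss.index_factory(d, "OPQ32_512,IVF50,PQ32")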
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], 10)
print(I) 

[[   0  363  491   85  608   41  372  473  281  616]
 [   1  277  617  756  779  706 1246  270   88  393]
 [   2  101   13  134  265  225  397  642  390   70]
 [   3  182  527   18  484  498  139    8   74  476]
 [   4  370  614  531   61  175  541   18  364   52]]


# How big is the dataset?

This question is used to fill in the clustering options (the ... above). The dataset is clustered into buckets and at search time, only a fraction of the buckets are visited (nprobe buckets). The clustering is trained on a representative sample of the dataset vectors. We indicate the optimal size for this sample.

### If below 1M vectors, use "...,IVFx,..."
Here `x` is between `4*sqrt(N)` and `16*sqrt(N)`, where `N` is the size of the dataset. This just clusters the vectors with k-means. You will need between `30*x` and `256*x` vectors for training (the more the better).
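As a worked example of these ranges, the sketch below computes the bounds on `x` for a given `N`, together with the training-set sizes that each bound implies. The helper name and the rounding to integers are assumptions made for illustration.

```python
import math

def ivf_small_dataset_ranges(N):
    """Bounds on x (4*sqrt(N) to 16*sqrt(N)) and training sizes (30*x to 256*x)."""
    root = math.sqrt(N)
    x_lo, x_hi = round(4 * root), round(16 * root)
    return {
        "x": (x_lo, x_hi),
        "train_for_x_lo": (30 * x_lo, 256 * x_lo),
        "train_for_x_hi": (30 * x_hi, 256 * x_hi),
    }

print(ivf_small_dataset_ranges(250_000))
```

For N = 250,000 this gives x between 2000 and 8000. With x = 2000, training needs between 60,000 and 512,000 vectors, so the upper figure already exceeds the dataset and in practice the whole dataset would be used.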

Supported on GPU: yes

### If 1M - 10M, use "...,IVF65536_HNSW32,..."
IVF in combination with HNSW uses HNSW to do the cluster assignment. You will need between 30 * 65536 and 256 * 65536 vectors for training.

Supported on GPU: no (on GPU, use IVF as above)

### If 10M - 100M, use "...,IVF262144_HNSW32,..."
Same as above, replace 65536 with 262144 (2^18). Note that training is going to be slow. It is possible to do just the training on GPU, everything else running on CPU, see [train_ivf_with_gpu.ipynb](https://gist.github.com/mdouze/46d6bbbaabca0b9778fca37ed2bcccf6).

### If 100M - 1B, use "...,IVF1048576_HNSW32,..."
Same as above, replace 65536 with 1048576 (2^20). Training will be even slower!
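The three larger size tiers can be summarized in a small lookup. This is only a sketch: the function name is hypothetical, the handling of sizes that fall exactly on a tier boundary is an assumption, and sizes outside 1M-1B are rejected because the guidelines above do not cover them with an IVF_HNSW string.

```python
def ivf_hnsw_clustering(n):
    """Return (factory substring, (min, max) training vectors) for 1M <= n <= 1B."""
    if n < 1_000_000 or n > 1_000_000_000:
        raise ValueError("outside the 1M-1B range covered by the IVF_HNSW tiers")
    if n < 10_000_000:
        nlist = 65536            # 2^16
    elif n < 100_000_000:
        nlist = 262144           # 2^18
    else:
        nlist = 1048576          # 2^20
    # Training needs between 30 and 256 vectors per centroid.
    return f"IVF{nlist}_HNSW32", (30 * nlist, 256 * nlist)

for n in (5_000_000, 50_000_000, 500_000_000):
    print(n, ivf_hnsw_clustering(n))
```

The returned substring replaces the "..." in the factory strings from the memory section, for example "IVF65536_HNSW32,Flat".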