# Tutorial: Quick Start

https://github.com/facebookresearch/faiss/wiki/Faster-search

In [1]:
import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
print('xb', xb.shape)
print('xb', xb[:1])
print('xq', xq.shape)
print('xq', xq[:1])

xb (100000, 64)
xb [[0.19151945 0.62210876 0.43772775 0.7853586  0.77997583 0.2725926
  0.27646425 0.8018722  0.95813936 0.87593263 0.35781726 0.5009951
  0.6834629  0.71270204 0.37025076 0.5611962  0.50308317 0.01376845
  0.7728266  0.8826412  0.364886   0.6153962  0.07538124 0.368824
  0.9331401  0.65137815 0.39720258 0.78873014 0.31683612 0.56809866
  0.8691274  0.4361734  0.8021476  0.14376682 0.70426095 0.7045813
  0.21879211 0.92486763 0.44214076 0.90931594 0.05980922 0.18428709
  0.04735528 0.6748809  0.59462476 0.5333102  0.04332406 0.5614331
  0.32966843 0.5029668  0.11189432 0.6071937  0.5659447  0.00676406
  0.6174417  0.9121229  0.7905241  0.99208146 0.95880175 0.7919641
  0.28525096 0.62491673 0.4780938  0.19567518]]
xq (10000, 64)
xq [[0.81432974 0.7409969  0.8915324  0.02642949 0.24954738 0.75948536
  0.33756447 0.0388501  0.06253924 0.04496585 0.6500265  0.14300306
  0.10555115 0.7554373  0.8733019  0.91065574 0.949595   0.4678057
  0.7957018  0.06088004 0.5086471  0.77

## This is too slow, how can I make it faster? IndexIVFFlat

To speed up search, the dataset can be partitioned into pieces. We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells. At search time, only the database vectors y contained in the cell the query x falls in, and in a few neighboring cells, are compared against the query vector.

This is done via the `IndexIVFFlat` index. This type of index requires a training stage, which can be performed on any collection of vectors that has the same distribution as the database vectors. In this case we just use the database vectors themselves.
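To make the training stage concrete: Faiss learns the cell centroids with k-means. The sketch below is a plain NumPy version of Lloyd's algorithm, not Faiss's implementation; the function name `train_centroids`, the iteration count, and the toy data sizes are assumptions chosen for illustration.

```python
import numpy as np

def train_centroids(x, nlist, n_iter=10, seed=0):
    """Plain Lloyd's k-means: returns an (nlist, d) array of centroids."""
    rng = np.random.default_rng(seed)
    # Start from nlist distinct training vectors picked at random.
    centroids = x[rng.choice(len(x), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        # Squared L2 distance from every vector to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(nlist):
            members = x[assign == c]
            if len(members) > 0:      # an empty cell keeps its old centroid
                centroids[c] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(1234)
x_train = rng.random((2000, 8)).astype('float32')   # toy training set
centroids = train_centroids(x_train, nlist=16)
print(centroids.shape)   # (16, 8)
```

Because the centroids only need to reflect the data distribution, any representative sample can be used for training, which is why the tutorial simply trains on `xb`.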

`IndexIVFFlat` also requires another index, the quantizer, which assigns vectors to Voronoi cells. Each cell is defined by a centroid, and finding the Voronoi cell a vector falls in consists of finding the nearest neighbor of the vector in the set of centroids. This is the task of the other index, which is typically an `IndexFlatL2`.
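The quantizer's job can be sketched in a few lines of NumPy: each vector goes to the cell of its nearest centroid, and the vector ids are grouped per cell into "inverted lists". The helper `assign_to_cells` and the hand-picked 2-D centroids are hypothetical, used only to illustrate the idea.

```python
import numpy as np

def assign_to_cells(x, centroids):
    """For each row of x, return the index of its nearest centroid (squared L2)."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy 2-D example with three hand-picked centroids.
centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]], dtype='float32')
x = np.array([[1.0, 1.0], [9.0, 1.0], [2.0, 8.0], [0.5, 0.2]], dtype='float32')

cells = assign_to_cells(x, centroids)
print(cells)            # [0 1 2 0]

# Inverted lists: for each cell, the ids of the vectors stored in it.
inverted_lists = {c: np.flatnonzero(cells == c).tolist()
                  for c in range(len(centroids))}
print(inverted_lists)   # {0: [0, 3], 1: [1], 2: [2]}
```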

There are two parameters to the search method: `nlist`, the number of cells, and `nprobe`, the number of cells (out of nlist) that are visited to perform a search. 

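The search time increases roughly linearly with the number of probes, plus a constant cost for the quantization step. Setting `nprobe = nlist` is equivalent to brute-force search.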
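A toy NumPy version of the search shows why the cost grows with `nprobe`: the number of database vectors compared against each query is the total size of the probed cells. This is a sketch under simplifying assumptions (the first `nlist` database vectors stand in for trained centroids, and `ivf_search` is a hypothetical helper), not how Faiss is implemented.

```python
import numpy as np

def sq_dists(a, b):
    """Squared L2 distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def ivf_search(xq, xb, centroids, cells, k, nprobe):
    """Toy IVF search: scan only the nprobe cells nearest to each query."""
    I = np.full((len(xq), k), -1, dtype='int64')
    scanned = 0
    # Coarse step: rank cells by centroid distance, keep the nprobe closest.
    probe = np.argsort(sq_dists(xq, centroids), axis=1)[:, :nprobe]
    for i, q in enumerate(xq):
        ids = np.flatnonzero(np.isin(cells, probe[i]))   # candidate vectors
        scanned += len(ids)
        d2 = ((xb[ids] - q) ** 2).sum(axis=1)
        order = np.argsort(d2)[:k]
        I[i, :len(order)] = ids[order]
    return I, scanned

rng = np.random.default_rng(0)
xb = rng.random((2000, 8))
xq = rng.random((20, 8))
nlist, k = 16, 4
centroids = xb[:nlist].copy()          # assumption: stand-in for trained centroids
cells = sq_dists(xb, centroids).argmin(axis=1)

exact = np.argsort(sq_dists(xq, xb), axis=1)[:, :k]   # brute-force reference
results = {}
for nprobe in (1, 4, nlist):
    results[nprobe] = ivf_search(xq, xb, centroids, cells, k, nprobe)
    I, scanned = results[nprobe]
    # Print nprobe, the number of distance computations, and whether
    # the result is identical to the brute-force reference.
    print(nprobe, scanned, np.array_equal(I, exact))
```

With `nprobe = nlist` every cell is visited, so the scan covers all `len(xq) * len(xb)` query-vector pairs, which is exactly the brute-force workload.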

In [4]:
nlist = 100
k = 4

import faiss 

quantizer = faiss.IndexFlatL2(d)  # the other index, used as the coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist)
assert not index.is_trained
index.train(xb)
assert index.is_trained

index.add(xb)                  # add may be a bit slower as well
D, I = index.search(xq, k)     # actual search
print(len(I))                  # len(I) is equal to nq
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # default nprobe is 1, try a few more
D, I = index.search(xq, k)
print(len(I))
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = nlist           # visit every cell: equivalent to brute-force search
D, I = index.search(xq, k)
print(len(I))
print(I[-5:])                  # neighbors of the 5 last queries
print(D[-5:])                  # distances to those neighbors

10000
[[ 9900  9309  9810 10048]
 [11055 10895 10812 11321]
 [11353 10164  9787 10719]
 [10571 10664 10632 10203]
 [ 9628  9554  9582 10304]]
10000
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
10000
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
[[6.531542  6.978715  7.003928  7.0137343]
 [4.3351846 5.236974  5.3194113 5.703156 ]
 [6.0726953 6.5766897 6.6140213 6.732214 ]
 [6.6373663 6.648776  6.8578253 7.0096517]
 [6.218346  6.4524803 6.54873   6.5813117]]


Notes:

- `len(I)` equals the number of queries in `xq`.
- The result `I`:
```
[[ 9900  9309  9810 10048]
 [11055 10895 10812 11321]
 [11353 10164  9787 10719]
 [10571 10664 10632 10203]
 [ 9628  9554  9582 10304]]
```
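`search` returns the indices of the database vectors. Here `k = 4`, so each row holds the indices of the 4 nearest neighbors of one query.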
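The layout of `I` and `D` can be reproduced without Faiss. The sketch below uses a NumPy brute-force search as a stand-in for `index.search(xq, k)`; the toy sizes are assumptions. It shows that both arrays have shape `(nq, k)`, that `xb[I]` recovers the neighbor vectors themselves, and that each row of `D` is ordered nearest first (for `IndexFlatL2`, these are squared L2 distances).

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.random((100, 8))    # toy database
xq = rng.random((5, 8))      # toy queries
k = 4

# Brute-force k-NN in NumPy, standing in for index.search(xq, k).
d2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
I = np.argsort(d2, axis=1)[:, :k]          # neighbor ids, one row per query
D = np.take_along_axis(d2, I, axis=1)      # matching squared L2 distances

print(I.shape, D.shape)      # (5, 4) (5, 4)
neighbors = xb[I]            # fancy indexing: the neighbor vectors themselves
print(neighbors.shape)       # (5, 4, 8)
print((np.diff(D, axis=1) >= 0).all())     # True: rows sorted nearest first
```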

# Results

With `nprobe = 1`, the result looks like this:
```text
[[ 9900  9309  9810 10048]
 [11055 10895 10812 11321]
 [11353 10164  9787 10719]
 [10571 10664 10632 10203]
 [ 9628  9554  9582 10304]]
```
This result is less accurate than brute-force search, which is the reference against which search accuracy is measured:
```python
index = faiss.IndexFlatL2(d)
index.add(xb)
D, I = index.search(xq, k)
print(I[-5:])
# [[ 9900 10500  9309  9831]
#  [11055 10895 10812 11321]
#  [11353 11103 10164  9787]
#  [10571 10664 10632  9638]
#  [ 9628  9554 10036  9582]]
```
This is because some of the true nearest neighbors do not lie in the same Voronoi cell as the query. Therefore, visiting a few more cells may prove useful.
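One way to quantify the gap is recall: the fraction of the exact neighbors that the approximate search also returned. The sketch below applies a hypothetical helper, `recall_at_k`, to the five rows printed above for `nprobe = 1` and for `IndexFlatL2`. Five queries is far too small a sample for a real evaluation; it only illustrates the computation.

```python
import numpy as np

def recall_at_k(I_approx, I_exact):
    """Fraction of exact neighbors also present in the approximate result, row by row."""
    hits = sum(len(set(a) & set(e))
               for a, e in zip(I_approx.tolist(), I_exact.tolist()))
    return hits / I_exact.size

# Last five rows printed above for nprobe = 1 ...
I_nprobe1 = np.array([[ 9900,  9309,  9810, 10048],
                      [11055, 10895, 10812, 11321],
                      [11353, 10164,  9787, 10719],
                      [10571, 10664, 10632, 10203],
                      [ 9628,  9554,  9582, 10304]])
# ... and for the brute-force IndexFlatL2.
I_exact = np.array([[ 9900, 10500,  9309,  9831],
                    [11055, 10895, 10812, 11321],
                    [11353, 11103, 10164,  9787],
                    [10571, 10664, 10632,  9638],
                    [ 9628,  9554, 10036,  9582]])

print(recall_at_k(I_nprobe1, I_exact))   # 0.75
```

On these rows, 15 of the 20 exact neighbors are found, and the nearest neighbor (first column) is correct in every row.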

When `nprobe` is increased to 10:
```text
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
```
This is the correct result. Note that getting a perfect result in this case is merely an artifact of the data distribution, as it has a strong component on the x-axis, which makes it easier to handle.

The `nprobe` parameter is the way to adjust the trade-off between speed and accuracy.

Setting `nprobe = nlist` is equivalent to brute-force search, so the search becomes slower.