# Binary indexes

https://github.com/facebookresearch/faiss/wiki/Binary-indexes

# Overview
Faiss natively supports indexing binary vectors (compared with the Hamming distance) through `IndexBinaryFlat` and `IndexBinaryIVF`, both of which inherit from `IndexBinary`.

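These indexes store the vectors as arrays of bytes, so a vector of size `d` takes only `d / 8` bytes in memory. Note that, at the moment, only vector sizes that are a multiple of 8 are supported. If needed, you can round the vector length up to the next multiple of 8.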
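
As an illustration of the byte layout and of rounding the length up, here is a minimal pure-Python sketch that packs a list of 0/1 values into bytes and zero-pads it to a multiple of 8. The helper name `pack_bits` and the most-significant-bit-first order are assumptions made for this example, not something the Faiss documentation prescribes. For Hamming distance any bit order works as long as every vector uses the same one, and padding every vector with the same zero bits leaves the distances unchanged.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, zero-padding to a multiple of 8.

    Hypothetical helper for illustration; bit order is MSB-first by assumption.
    """
    padded = list(bits) + [0] * (-len(bits) % 8)
    out = bytearray()
    for start in range(0, len(padded), 8):
        byte = 0
        for bit in padded[start:start + 8]:
            # Shift previous bits left so the first bit becomes the most significant.
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


bits = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # d = 10, not a multiple of 8
packed = pack_bits(bits)                # 2 bytes after padding
d_padded = len(packed) * 8              # dimension to pass to the index: 16
```

With numpy, `np.packbits` performs the same kind of packing on `uint8` arrays, which is the input type the indexes expect.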

The `add()` and `search()` methods also take byte arrays as input (`uint8_t` in C++, `uint8` in numpy).

Hamming distance computations are optimized using the `popcount` CPU instruction.
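
To show why popcount is the relevant operation: the Hamming distance between two packed vectors is the number of set bits in their XOR. The sketch below is a pure-Python stand-in, with `bin(x).count("1")` playing the role of the hardware instruction. The function name `hamming_bytes` is made up for this example, and the code says nothing about how Faiss implements the computation internally.

```python
def hamming_bytes(a, b):
    """Hamming distance between two equal-length byte strings (XOR, then count set bits)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of bytes")
    # XOR leaves a 1 exactly at the bit positions where a and b differ.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # Counting the 1 bits is what a popcount instruction does in hardware.
    return bin(diff).count("1")


a = bytes([0b10110001, 0b11000000])
b = bytes([0b10010001, 0b11000011])
dist = hamming_bytes(a, b)  # one differing bit in the first byte, two in the second
```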

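# IndexBinaryFlat
The "flat" binary index performs an exhaustive search.

The exhaustive search is carefully optimized, especially for 256-bit vectors, which are quite common.

Batching is applied on both the query side and the database side to avoid cache misses.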

The values of `hamming_batch_size` and `faiss::IndexBinaryFlat#query_batch_size` can be customized to adjust the batch sizes, but the default values were found to be close to optimal for a large range of settings.

```
import faiss

# Dimension of the vectors.
d = 256

# Vectors to be indexed, each represented by d / 8 bytes, laid out sequentially,
# i.e. the i-th vector starts at db[i * (d / 8)].
db = ...

# Vectors to be queried from the index.
queries = ...

# Initializing index.
index = faiss.IndexBinaryFlat(d)

# Adding the database vectors.
index.add(db)

# Number of nearest neighbors to retrieve per query vector.
k = ...

# Querying the index
D, I = index.search(queries, k)

# D[i, j] contains the distance from the i-th query vector to its j-th nearest neighbor.
# I[i, j] contains the id of the j-th nearest neighbor of the i-th query vector.
```
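
To make concrete what an exhaustive Hamming search returns, here is a small pure-Python sketch that produces `D` and `I` in the same shape as described in the comments above. It is only a reference for the semantics: the function names are hypothetical, there is no batching or popcount optimization, and ties are broken by the lower id, which is an assumption of this sketch rather than documented Faiss behavior.

```python
import heapq


def hamming(a, b):
    """Hamming distance between two equal-length byte strings."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def brute_force_search(db, queries, k):
    """Return (D, I): for each query, the distances and ids of its k nearest db vectors."""
    D, I = [], []
    for q in queries:
        # Score every database vector; (distance, id) tuples sort by distance, then id.
        scored = [(hamming(q, v), i) for i, v in enumerate(db)]
        nearest = heapq.nsmallest(k, scored)
        D.append([dist for dist, _ in nearest])
        I.append([idx for _, idx in nearest])
    return D, I


# Toy data with d = 8, so each vector is a single byte.
db = [bytes([0b00000000]), bytes([0b00001111]),
      bytes([0b11111111]), bytes([0b00000001])]
queries = [bytes([0b00000011])]

D, I = brute_force_search(db, queries, k=2)
# D[0] holds the two smallest distances for the query, I[0] the matching ids.
```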

# IndexBinaryIVF
The "IVF" (Inverse Vector File) flavor speeds up the search by clustering the vectors. This clustering is done using a second (binary) index for quantization (usually a flat index). This is equivalent to the IndexIVFFlat of the floating-point indexes.

```
import faiss

# Dimension of the vectors.
d = 256

# Vectors to be indexed, each represented by d / 8 bytes, laid out sequentially,
# i.e. the i-th vector starts at db[i * (d / 8)].
db = ...

# Vectors to train the quantizer.
training = ...

# Vectors to be queried from the index.
queries = ...

# Initializing the quantizer.
quantizer = faiss.IndexBinaryFlat(d)

# Number of clusters.
nlist = ...

# Initializing index.
index = faiss.IndexBinaryIVF(quantizer, d, nlist)
index.nprobe = 4 # Number of nearest clusters to be searched per query. 

# Training the quantizer.
index.train(training)

# Adding the database vectors.
index.add(db)

# Number of nearest neighbors to retrieve per query vector.
k = ...

# Querying the index.
D, I = index.search(queries, k)

# D[i, j] contains the distance from the i-th query vector to its j-th nearest neighbor.
# I[i, j] contains the id of the j-th nearest neighbor of the i-th query vector.
```
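
The idea behind the clustering and `nprobe` can be sketched in pure Python: each database vector is stored in the list of its nearest centroid, and a query scans only the lists of its `nprobe` nearest centroids. This is a conceptual sketch under simplifying assumptions, not Faiss's implementation. In particular, the centroids are supplied by hand here (Faiss obtains them in `train()`), and the names `build_ivf` and `ivf_search` are hypothetical.

```python
import heapq


def hamming(a, b):
    """Hamming distance between two equal-length byte strings."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def build_ivf(db, centroids):
    """Put each database id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(db):
        nearest = min(range(len(centroids)), key=lambda c: hamming(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(db, centroids, lists, query, k, nprobe):
    """Scan only the nprobe closest lists; return up to k (distance, id) pairs."""
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: hamming(query, centroids[c]))
    candidates = [(hamming(query, db[i]), i) for c in probe for i in lists[c]]
    return heapq.nsmallest(k, candidates)


# Toy data with d = 8; the two centroids are chosen by hand for the example.
db = [bytes([x]) for x in (0x00, 0x01, 0x03, 0xFF, 0xFE, 0xFC)]
centroids = [bytes([0x00]), bytes([0xFF])]

lists = build_ivf(db, centroids)
result = ivf_search(db, centroids, lists, bytes([0x07]), k=2, nprobe=1)
```

A larger `nprobe` scans more lists, which costs more time but lowers the chance of missing a true nearest neighbor that was assigned to a different cluster.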

# Shorter versions using index factory
The faiss::index_binary_factory() allows for shorter declarations of binary indexes. It is especially useful for IndexBinaryIVF, for which a quantizer needs to be initialized.

Instead of the above initialization code:
```
# Initializing the quantizer.
quantizer = faiss.IndexBinaryFlat(d)

# Number of clusters.
nlist = 32

# Initializing index.
index = faiss.IndexBinaryIVF(quantizer, d, nlist)
index.nprobe = 4 # Number of nearest clusters to be searched per query. 
```
one could write:
```
# Initializing the index (the factory also creates the quantizer).
index = faiss.index_binary_factory(d, "BIVF32")
index.nprobe = 4 # Number of nearest clusters to be searched per query.
```

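### Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. Equivalently, it is the number of character substitutions needed to turn one string into the other.
The Hamming distance between 1011101 and 1001001 is 2.
The Hamming distance between 2143896 and 2233796 is 3.
The Hamming distance between "toned" and "roses" is 3.
The Hamming weight of a string is its Hamming distance from the all-zero string of the same length, that is, the number of nonzero elements in the string. For a binary string this is the number of 1s, so the Hamming weight of 11101 is 4.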
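
The definitions above translate directly into a few lines of Python. This is a minimal sketch that works on character strings rather than packed bytes, and the function names are chosen for this example only.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(a != b for a, b in zip(s, t))


def hamming_weight(s):
    """Number of nonzero symbols, i.e. the distance to the all-zero string."""
    return sum(ch != "0" for ch in s)


d_bits = hamming_distance("1011101", "1001001")    # 2
d_digits = hamming_distance("2143896", "2233796")  # 3
d_words = hamming_distance("toned", "roses")       # 3
w = hamming_weight("11101")                        # 4
```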