# Problem setup and discussion
**tl;dr: When ids are assigned sequentially, we can compress them down to less than `0.1` bits per id (roughly `0.1%` of their original size), while still guaranteeing random access to the set of ids of any given cluster.**

There are two different scenarios for index attribution: sequential, where indices are assigned to vectors according to the order they are added to the database, and non-sequential or external, where the value of the index is set by some mechanism external to the FAISS index.

**Here we focus on the sequential case.**

Summary and take-aways:
- Assuming the number of clusters is $m = \sqrt{n}$, where $n$ is the database size, we can store the ids of a sequentially labelled index using $\frac{\log_2(n)}{8\cdot\sqrt{n}}$ bytes per id with a uniform (fixed-width) code, which is very fast to decode. This scheme guarantees random access: given a cluster index, we can decode the vector ids in that cluster without having to decode the ids of any other cluster.

In what follows, we first model the joint distribution over all possible values of the collection of clusters (which are themselves sets of ids) to understand the lower bound on the number of bits for this scenario.
We then introduce the constraint of random access (i.e., given a cluster, we must be able to decode the ids of that cluster without having to decode the ids of all other clusters).

## Knowing the cluster sizes is enough to describe the entire index in the sequential case.

The sequential scenario, in code, is: if the database is `db: NDArray[np.floating] = ds.get_database()`, then vector `db[i]: NDArray[np.floating]` is assigned index `i`.
The assumption here is that ids have no meaning outside the FAISS index and can be changed at will.
This assumption lets us relabel the ids once the vectors have been assigned to clusters, so that $I_j$, the set of ids for the $j$-th cluster with centroid $c_j$, satisfies the following constraints:
- **Contiguous**: $I_j = [s_j, s_j + k_j)$ are always contiguous integer intervals with a starting value $s_j$
- **Monotone**: $x \in I_j$ and $y \in I_{j+1}$ implies $x < y$ for any $j$.
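
One way to obtain a labelling that satisfies both constraints is a stable sort of the cluster assignments. The sketch below is only an illustration under stated assumptions: `relabel_sequentially` and the `assignments` array are hypothetical names, and this is not necessarily how the notebook's `prepare_index` helper performs the relabelling.

```python
import numpy as np


def relabel_sequentially(assignments: np.ndarray, num_clusters: int):
    """Map old ids to new ids so that each cluster owns a contiguous id range.

    `assignments[i]` is the cluster of the vector with old id `i` (hypothetical input).
    Returns `(new_ids, sizes)`, where `new_ids[i]` is the new id of old id `i`.
    """
    # A stable sort groups vectors by cluster and keeps their original order within a cluster.
    order = np.argsort(assignments, kind="stable")
    new_ids = np.empty_like(order)
    new_ids[order] = np.arange(len(order))
    sizes = np.bincount(assignments, minlength=num_clusters)
    return new_ids, sizes


assignments = np.array([2, 0, 1, 0, 2, 1, 1])
new_ids, sizes = relabel_sequentially(assignments, num_clusters=3)

# Contiguous and monotone: cluster j owns exactly the ids [s_j, s_j + k_j).
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
for j in range(3):
    ids_j = np.sort(new_ids[assignments == j])
    assert np.array_equal(ids_j, np.arange(starts[j], starts[j] + sizes[j]))
```

After this relabelling, only the cluster sizes are needed to recover every id set.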

To frame this as a source coding problem, we can work with the distribution $P_{I^m}$, where $m$ is the total number of clusters, and $I^m = (I_0, \dots, I_{m-1})$.
First, consider the conditional distribution $P_{I^m \vert K^m}(\cdot \vert k^m)$ modeling when the cluster sizes $K^m = k^m$ are known.
This distribution is a delta function: if the cluster sizes are known, there is only one sequence $I^m$ obeying our constraints, namely
- $I_0 = [0, k_0)$
- $I_1 = [k_0, k_0 + k_1)$
- $I_2 = [k_0 + k_1, k_0 + k_1 + k_2)$
- $\dots$
- $I_{m-1} = [n - k_{m-1}, n)$

where `n: int = len(db)` is the database size and $k^m = (k_0, \dots, k_{m-1})$.
The index set of cluster $j$ is fully specified by its starting index, $s_j = \sum_{i=0}^{j-1} k_i$, and its size $k_j$.
Note that the first starting point is always fixed, $s_0 = 0$, as is the last endpoint, $\sum_{i=0}^{m-1} k_i = n$.
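
To make the delta-function claim concrete, the following minimal sketch rebuilds every interval from the cluster sizes alone. The helper name `intervals_from_sizes` is an assumption made for illustration and is not part of the notebook's helper module.

```python
from itertools import accumulate


def intervals_from_sizes(sizes: list[int]) -> list[range]:
    """Rebuild the unique contiguous, monotone intervals from the cluster sizes."""
    # Running sums give the interval boundaries: 0, first size, first + second size, ..., n.
    bounds = [0, *accumulate(sizes)]
    return [range(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


# Cluster sizes taken from the small example in the experiments (3, 4 and 3 ids).
intervals = intervals_from_sizes([3, 4, 3])
print([list(r) for r in intervals])
```

Since the function is deterministic, two indexes with the same cluster sizes always yield the same intervals, which is why no bits beyond the sizes are needed.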

$P_{I^m \vert K^m}(\cdot \vert k^m)$ is a delta function, hence, in theory, we can compress the index to $0$ bits if the cluster sizes are known.
In other words, the cost of encoding the entire index is that of encoding the size of the clusters.

To encode the index, we can store in memory, uncompressed, the array `start_ids` of $m + 1$ interval boundaries: entry `j` is the starting id of cluster `j`, so the first entry is `0`, and the last entry is `n`.
To decode the ids of cluster `j`, we simply take `np.arange(start_ids[j], start_ids[j+1])`.
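
The sketch below shows the encode and decode steps end to end on a toy example. The names `make_start_ids` and `decode_cluster` are hypothetical (the notebook uses its own `make_start_ids_array` helper, whose implementation is not shown here), and the `uint32` storage type is an assumption that only holds for `n < 2**32`.

```python
import numpy as np


def make_start_ids(sizes, dtype=np.uint32) -> np.ndarray:
    """Encode: store the m + 1 interval boundaries [0, ..., n] (assumes n fits in `dtype`)."""
    return np.concatenate(([0], np.cumsum(sizes))).astype(dtype)


def decode_cluster(start_ids: np.ndarray, j: int) -> np.ndarray:
    """Decode the ids of cluster `j` by reading only entries j and j + 1."""
    return np.arange(int(start_ids[j]), int(start_ids[j + 1]), dtype=np.int64)


start_ids = make_start_ids([3, 4, 3])
cluster_1 = decode_cluster(start_ids, 1)
print(start_ids, cluster_1)
```

Decoding touches two array entries regardless of the number of clusters, which is what gives random access.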

The `start_ids` array occupies roughly $\frac{m\cdot u}{8\cdot n}$ bytes per id in memory, where $u$ is the number of bits used to represent each integer in `start_ids`, and must obey $u \geq \log_2(n)$.
Assuming $u = \log_2(n)$ is achievable and $m = \sqrt{n}$, the cost per id is $\frac{\log_2(n)}{8 \cdot \sqrt{n}}$ bytes, which converges to $0$ as $n$ grows.
For `n=1e6`, the cost is already very small, at $0.0025$ bytes per id, or $2.5$ kB for the entire index.
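
The arithmetic above can be checked with a few lines of Python. The function name `start_ids_bytes_per_id` is made up for this sketch; the 32-bit case is included because the `start_ids` array shown in the experiments has dtype `uint32`.

```python
import math


def start_ids_bytes_per_id(n: int, m: int, u: float) -> float:
    """Approximate cost m * u / (8 * n), in bytes per id, of storing `start_ids`."""
    return m * u / (8 * n)


n = 1_000_000
m = math.isqrt(n)  # m = sqrt(n) = 1000 clusters

ideal = start_ids_bytes_per_id(n, m, u=math.log2(n))  # about 0.0025 bytes per id
with_uint32 = start_ids_bytes_per_id(n, m, u=32)  # 0.004 bytes, i.e. 0.032 bits per id

print(f"ideal: {ideal:.4f} bytes per id, {ideal * n / 1e3:.1f} kB in total")
print(f"uint32: {8 * with_uint32:.3f} bits per id")
```

The gap between the two numbers is the price of rounding $u$ up to a machine integer width, which keeps decoding a plain array lookup.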

It is possible to compress `start_ids` further by entropy modelling the start values, but this might not be worth it, given that the compressed ids already take up at most about $0.1\%$ of their original size (see experiments below).

# Experiments

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import faiss
from faiss.contrib.datasets import (
    SyntheticDataset,
    DatasetSIFT1M,
    DatasetGIST1M
)

import numpy as np

from ipynb_helper_functions import (
    prepare_index,
    get_ivfs,
    make_start_ids_array
)


## Build index and relabel sequentially (small example)

In [3]:
# Initially, we take a small index to be able to view the relabelling of ids.
ds = SyntheticDataset(d=2, nt=int(1e6), nb=10, nq=1)
num_clusters = 3
index_str = f"IVF{num_clusters},SQ8"

# This function takes care of re-ordering the ids according to our constraints.
index = prepare_index(ds, index_str, relabel_ids_sequentially=True)

# Returns `list[NDArray[np.integer]]` containing the ids of cluster `j` at index `j`.
ivfs = get_ivfs(index)

# Process ivfs to get starting ids for each interval.
start_ids = make_start_ids_array(index)

# Reconstruct the ivfs from start_ids, which is what will be done at search time. 
ivfs_reconstructed = [
    np.arange(start_ids[j], start_ids[j+1])
    for j in range(len(start_ids) - 1)
]

display(ivfs, start_ids, ivfs_reconstructed)

Successfully relabelled index sequentially.


[array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])]

array([ 0,  3,  7, 10], dtype=uint32)

[array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])]

## Compute compression ratio on larger indices 

In [4]:
for ds in [DatasetGIST1M(), DatasetSIFT1M()]:
    for num_clusters in [1_000, 2_000]:
        index_str = f"IVF{num_clusters},SQ8"
        index = prepare_index(ds, index_str, relabel_ids_sequentially=True)
        ivfs = get_ivfs(index)
        start_ids = make_start_ids_array(index)
        ivfs_reconstructed = [
            np.arange(start_ids[j], start_ids[j+1])
            for j in range(len(start_ids) - 1)
        ]

        # Assert reconstruction is correct (array_equal also checks that the lengths match)
        assert all(np.array_equal(ivfs[j], ivfs_reconstructed[j]) for j in range(num_clusters))

        ivf_bytes = sum(ivf.nbytes for ivf in ivfs)
        start_ids_bytes = start_ids.nbytes
        n = index.ntotal
        print(f'Results for {ds.__class__.__name__} w/ {index_str}')
        print(f'Original size: {8*ivf_bytes/n:.2f} bits per id')
        print(f'Compressed size: {8*start_ids_bytes/n:.2f} bits per id')
        print(f'Compression ratio: {100*start_ids_bytes/ivf_bytes:.2f}% of original size')
        print(10*'-')

Successfully relabelled index sequentially.
Results for DatasetGIST1M w/ IVF1000,SQ8
Original size: 64.00 bits per id
Compressed size: 0.03 bits per id
Compression ratio: 0.05% of original size
----------
Successfully relabelled index sequentially.
Results for DatasetGIST1M w/ IVF2000,SQ8
Original size: 64.00 bits per id
Compressed size: 0.06 bits per id
Compression ratio: 0.10% of original size
----------
Successfully relabelled index sequentially.
Results for DatasetSIFT1M w/ IVF1000,SQ8
Original size: 64.00 bits per id
Compressed size: 0.03 bits per id
Compression ratio: 0.05% of original size
----------
Successfully relabelled index sequentially.
Results for DatasetSIFT1M w/ IVF2000,SQ8
Original size: 64.00 bits per id
Compressed size: 0.06 bits per id
Compression ratio: 0.10% of original size
----------
