<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-a-PQLite-instance-and-fit-it" data-toc-modified-id="Create-a-PQLite-instance-and-fit-it-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create a <code>PQLite</code> instance and fit it</a></span></li><li><span><a href="#Adding-data" data-toc-modified-id="Adding-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Adding data</a></span><ul class="toc-item"><li><span><a href="#Understanding-the-underlying-sqlite-connection" data-toc-modified-id="Understanding-the-underlying-sqlite-connection-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Understanding the underlying sqlite connection</a></span></li></ul></li><li><span><a href="#Search-without-filtering" data-toc-modified-id="Search-without-filtering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Search without filtering</a></span></li><li><span><a href="#Search-with-filtering" data-toc-modified-id="Search-with-filtering-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Search with filtering</a></span></li><li><span><a href="#Benchmark-PQLite" data-toc-modified-id="Benchmark-PQLite-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Benchmark PQLite</a></span></li></ul></div>

In [1]:
%load_ext autoreload
%autoreload 2

## Create a `PQLite` instance and fit it

The following cells create a `PQLite` instance and fit it. Fitting learns a set of prototypes (a codebook) for each of the sub-spaces.

In [7]:
import random
import numpy as np
from pqlite import PQLite

N = 10_000 # number of data points
Nt = 2000
Nq = 10
D = 128 # dimensionality / number of features

# 2,000 128-dim vectors for training
Xt = np.random.random((Nt, D)).astype(np.float32)  

# the column schema: (name:str, dtype:type, create_index: bool)
pqlite = PQLite(d_vector=D, n_cells=64, n_subvectors=8, columns=[('x', float, True)])

In [6]:
pqlite.fit(Xt)

2021-11-10 17:12:49.040 | INFO     | pqlite.index:fit:92 - => start training VQ codec with 2000 data...
2021-11-10 17:12:49.145 | INFO     | pqlite.index:fit:95 - => start training PQ codec with 2000 data...
2021-11-10 17:12:50.364 | INFO     | pqlite.index:fit:98 - => pqlite is successfully trained!
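The log above mentions two training stages, a VQ codec and a PQ codec. As a rough illustration of what such a two-stage fit involves, here is a minimal NumPy sketch: one k-means run over the full vectors for the coarse (cell) prototypes, then one k-means run per sub-space for the product-quantization prototypes. This is not PQLite's implementation. The `kmeans` helper, the sizes, and the choice to ignore residuals are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances between every point and every centroid: (N, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
X_train = rng.random((500, 32)).astype(np.float32)
n_cells, n_subvectors, n_clusters = 8, 4, 16  # hypothetical small sizes
d_sub = X_train.shape[1] // n_subvectors

# stage 1: coarse prototypes, one per cell
coarse_codebook = kmeans(X_train, n_cells)

# stage 2: one codebook per sub-space
pq_codebooks = np.stack([
    kmeans(X_train[:, m * d_sub:(m + 1) * d_sub], n_clusters)
    for m in range(n_subvectors)
])

print(coarse_codebook.shape)  # (8, 32)
print(pq_codebooks.shape)     # (4, 16, 8)
```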


In [3]:
import pprint
pprint.pprint(list(pqlite.__dict__.keys()))

['initial_size',
 'expand_step_size',
 'expand_mode',
 'n_cells',
 'code_size',
 'dtype',
 '_doc_id_dtype',
 '_vecs_storage',
 '_cell_size',
 '_cell_capacity',
 '_cell_tables',
 '_meta_table',
 'd_vector',
 'n_subvectors',
 'd_subvector',
 'metric',
 'use_residual',
 'n_probe',
 '_use_smart_probing',
 '_smart_probing_temperature',
 'vq_codec',
 'pq_codec']


In [None]:
pqlite.d_vector / pqlite.n_subvectors == pqlite.d_subvector
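The relation checked above is just the arithmetic of splitting a vector into equal parts. A small NumPy sketch with the same numbers as this notebook (the variable names are only for illustration):

```python
import numpy as np

d_vector, n_subvectors = 128, 8
d_subvector = d_vector // n_subvectors

# split one 128-dim vector into 8 contiguous sub-vectors of 16 dims each
v = np.arange(d_vector, dtype=np.float32)
sub_vectors = v.reshape(n_subvectors, d_subvector)

print(d_subvector)        # 16
print(sub_vectors.shape)  # (8, 16)
```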

- Where can we find the data for each cell after fitting a `PQLite` instance?

`PQLite.fit` partitions the data into `n_cells` groups, but no data is stored in the object until the `.add` method is called.

In [None]:
len(pqlite._vecs_storage)

- Where can we find the codebooks for each of the `n_cells` regions?

The prototypes for the cells of the coarse quantization step can be found in `pqlite.vq_codec.codebook`. There is a single prototype per cell, hence `n_cells` prototypes in total, which the shape below confirms.

In [None]:
pqlite.vq_codec.codebook.shape

## Adding data

Before any data is added, `_cell_size` is zero for each of the `n_cells` cells.

In [None]:
pqlite._cell_size

Next we add data to the index. Here we reuse the training vectors `Xt`:

In [None]:
Xt.shape

In [None]:
tags = [{'x': random.random()} for _ in range(len(Xt))]
pqlite.add(Xt, ids=list(range(len(Xt))), doc_tags=tags)

After adding, `_cell_size` shows how many examples each cell holds.

In [None]:
pqlite._cell_size

In total we should have 2000 examples across all cells.

In [None]:
pqlite._cell_size.sum()
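To see why the per-cell counts must add up to the number of added vectors, here is a sketch of coarse assignment in plain NumPy: each vector goes to the cell whose prototype is nearest, and the cell sizes are a histogram of those assignments. The random `codebook` stands in for trained prototypes, and none of this is PQLite's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, dim, n_points = 64, 128, 2000
codebook = rng.random((n_cells, dim)).astype(np.float32)  # stand-in for trained prototypes
X = rng.random((n_points, dim)).astype(np.float32)

# squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2, shape (n_points, n_cells)
d2 = (X ** 2).sum(axis=1, keepdims=True) - 2.0 * X @ codebook.T + (codebook ** 2).sum(axis=1)

cell_ids = d2.argmin(axis=1)                          # exactly one cell per vector
cell_size = np.bincount(cell_ids, minlength=n_cells)  # number of vectors per cell

print(cell_size.sum())  # 2000
```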

- Where can we access the quantized data of a particular cell?

The data of the n-th cell is available in `pqlite._vecs_storage[n]`. Each data point is assigned to exactly one cell.

In [None]:
pqlite._vecs_storage[0]

In [None]:
cell_0 = pqlite._vecs_storage[0]
cell_0

The number of elements in cell 0, given by `pqlite._cell_size[0]`, matches the number of non-zero rows in `pqlite._vecs_storage[0]`.

In [None]:
inds = cell_0.sum(axis=1)!=0
len(cell_0[inds]) == pqlite._cell_size[0]
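The check above filters out all-zero rows, which suggests the storage is allocated with spare capacity and filled from the top (attribute names such as `initial_size` and `_cell_capacity` point the same way, though this is an inference, not something verified against the library). A hypothetical sketch of that pattern:

```python
import numpy as np

capacity, code_size = 10, 8  # hypothetical values
storage = np.zeros((capacity, code_size), dtype=np.uint8)  # pre-allocated, all zeros
size = 0                                                   # number of rows in use

rng = np.random.default_rng(0)
# codes drawn from 1..255 so that no inserted row is all zeros in this toy example
new_codes = rng.integers(1, 256, size=(3, code_size), dtype=np.uint8)
storage[size:size + len(new_codes)] = new_codes
size += len(new_codes)

non_empty = storage.sum(axis=1) != 0
print(non_empty.sum() == size)  # True
```

In real data a valid code made only of zeros would be miscounted by the non-zero test, so an explicit counter such as `size` is the more reliable source.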

### Understanding the underlying sqlite connection

In [None]:
pqlite.cell_tables[0].count()

In [None]:
pqlite.cell_tables[0]

In [None]:
pqlite.cell_tables[0].__dict__
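For readers unfamiliar with SQLite, the sketch below shows the general idea of keeping per-document tags in a table with an indexed column and querying it with a condition. It uses only Python's standard `sqlite3` module. The table name, column names, and values are invented for the example and do not reflect PQLite's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# hypothetical layout: one table per cell, one indexed column per tag
conn.execute("CREATE TABLE cell_0 (doc_id INTEGER PRIMARY KEY, x REAL)")
conn.execute("CREATE INDEX idx_x ON cell_0 (x)")

rows = [(i, i / 10.0) for i in range(10)]
conn.executemany("INSERT INTO cell_0 (doc_id, x) VALUES (?, ?)", rows)

# how many documents the cell holds
total = conn.execute("SELECT COUNT(*) FROM cell_0").fetchone()[0]

# ids of the documents that satisfy a condition on the tag
cursor = conn.execute("SELECT doc_id FROM cell_0 WHERE x < ? ORDER BY doc_id", (0.3,))
matching = [row[0] for row in cursor]

print(total, matching)
conn.close()
```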

In [None]:
pqlite

## Search without filtering

In [None]:
Nq = 1
query = np.random.random((Nq, D)).astype(np.float32)  # a 128-dim query vector

# search for the 5 nearest neighbours of each query, without filtering
dists, ids = pqlite.search(query, k=5)

print('the result without filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')


In [None]:
dists
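As a reference point for the approximate results above, an exact top-k search can be written in a few lines of NumPy. This is a brute-force baseline on random data, not what `pqlite.search` does internally. It uses squared Euclidean distance; whether `pqlite` reports the same distance is not checked here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 128)).astype(np.float32)
query = rng.random((1, 128)).astype(np.float32)
k = 5

# squared Euclidean distance from the query to every vector
d2 = ((X - query) ** 2).sum(axis=1)

# argpartition finds the k smallest in O(N); then sort only those k
top = np.argpartition(d2, k)[:k]
top = top[np.argsort(d2[top])]
exact_ids, exact_dists = top, d2[top]

print(exact_ids, exact_dists)
```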

## Search with filtering

In [None]:
query = np.random.random((Nq, D)).astype(np.float32)  # a 128-dim query vector

# without filtering
dists, ids = pqlite.search(query, k=5)

print(f'the result without filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')

# with filtering
# condition schema: (column_name: str, relation: str, value: any)
conditions = [('x', '<', 0.3)]
dists, ids = pqlite.search(query, conditions=conditions, k=5)

print(f'the result with filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')
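The `(column_name, relation, value)` condition format can be mimicked in NumPy to show what filtered search means: only the vectors whose tags satisfy every condition are candidates. The `OPS` mapping and the brute-force distance computation are assumptions for this sketch. Which relation strings PQLite accepts, and how it applies the filter internally, are not covered here.

```python
import operator
import numpy as np

# hypothetical mapping from relation strings to comparison functions
OPS = {'<': operator.lt, '<=': operator.le, '>': operator.gt,
       '>=': operator.ge, '=': operator.eq}

rng = np.random.default_rng(0)
X = rng.random((1000, 128)).astype(np.float32)
columns = {'x': rng.random(1000)}  # one tag value per vector
q = rng.random(128).astype(np.float32)
conds = [('x', '<', 0.3)]

# combine all conditions into one boolean mask
mask = np.ones(len(X), dtype=bool)
for name, relation, value in conds:
    mask &= OPS[relation](columns[name], value)

# rank only the surviving candidates by squared Euclidean distance
candidates = np.flatnonzero(mask)
d2 = ((X[candidates] - q) ** 2).sum(axis=1)
order = np.argsort(d2)[:5]
filtered_ids, filtered_dists = candidates[order], d2[order]

print(filtered_ids, columns['x'][filtered_ids])
```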

## Benchmark PQLite

In [None]:
#[v.sum() for v in pqlite._vecs_storage]

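Note that product quantization compresses each original vector.
Each vector is split into `n_subvectors` sub-vectors, so each sub-vector has dimension `d_vector / n_subvectors`.

That is 128/8 = 16. In standard product quantization each sub-vector is then replaced by the index of its nearest prototype, so the stored code would have `n_subvectors` = 8 entries rather than 16; the cells below check the actual shape and dtype.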

In [None]:
# inspect the shape of the stored codes for cell 0
pqlite._vecs_storage[0].shape

In [None]:
pqlite._vecs_storage[0].dtype
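To make the shape and dtype easier to interpret, here is a sketch of generic product-quantization encoding in NumPy. Each sub-vector is replaced by the index of its nearest prototype, giving one small integer per sub-space. The codebooks are random stand-ins, and `n_clusters = 256` (so that an index fits in one `uint8`) is an assumption, not a value read from PQLite.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vector, n_subvectors, n_clusters = 128, 8, 256  # n_clusters is assumed
d_sub = d_vector // n_subvectors

# random stand-ins for trained codebooks: (n_subvectors, n_clusters, d_sub)
codebooks = rng.random((n_subvectors, n_clusters, d_sub)).astype(np.float32)
X = rng.random((100, d_vector)).astype(np.float32)

codes = np.empty((len(X), n_subvectors), dtype=np.uint8)
for m in range(n_subvectors):
    sub = X[:, m * d_sub:(m + 1) * d_sub]  # (100, 16)
    d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
    codes[:, m] = d2.argmin(axis=1)        # nearest prototype index

print(codes.shape, codes.dtype)  # (100, 8) uint8
print(X.nbytes // codes.nbytes)  # 64
```

Under these assumptions each 128-dim `float32` vector (512 bytes) is stored as 8 bytes, a 64x reduction.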

In [None]:
X = np.random.random((N, D)).astype(np.float32)  # 10,000 128-dim vectors to be indexed

tags = [{'x': random.random()} for _ in range(N)]
pqlite.add(X, ids=list(range(len(X))), doc_tags=tags)

In [None]:
len(pqlite._vecs_storage)