<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-a-PQLite-instance-and-fit-it" data-toc-modified-id="Create-a-PQLite-instance-and-fit-it-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create a <code>PQLite</code> instance and fit it</a></span></li><li><span><a href="#Adding-data" data-toc-modified-id="Adding-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Adding data</a></span><ul class="toc-item"><li><span><a href="#Understanding-the-underlying-sqlite-connection" data-toc-modified-id="Understanding-the-underlying-sqlite-connection-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Understanding the underlying sqlite connection</a></span></li></ul></li><li><span><a href="#Search-without-filtering" data-toc-modified-id="Search-without-filtering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Search without filtering</a></span></li><li><span><a href="#Search-with-filtering" data-toc-modified-id="Search-with-filtering-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Search with filtering</a></span></li><li><span><a href="#Benchmark-PQLite" data-toc-modified-id="Benchmark-PQLite-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Benchmark PQLite</a></span></li></ul></div>

In [1]:
import pqlite
pqlite.__path__

['/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite']

In [2]:
#%load_ext autoreload
#%autoreload 2

## Create a `PQLite` instance and fit it

The following cells create a `PQLite` instance and fit it. Fitting trains a coarse VQ codec and a PQ codec that learns a set of prototypes for each of the sub-spaces.

In [3]:
import random
import numpy as np
from pqlite import PQLite

N = 10_000 # number of data points
Nt = 2000
Nq = 10
D = 128 # dimensionality / number of features

# 2,000 128-dim vectors for training
Xt = np.random.random((Nt, D)).astype(np.float32)  

# the column schema: (name:str, dtype:type, create_index: bool)
pqlite = PQLite(d_vector=D, n_cells=64, n_subvectors=8, columns=[('x', float, True)])

In [4]:
pqlite.fit(Xt)

2021-11-11 13:27:38.407 | INFO     | pqlite.index:fit:90 - => start training VQ codec with 2000 data...
2021-11-11 13:27:38.490 | INFO     | pqlite.index:fit:93 - => start training PQ codec with 2000 data...
2021-11-11 13:27:39.482 | INFO     | pqlite.index:fit:96 - => pqlite is successfully trained!
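
The PQ training step can be pictured with a small NumPy sketch. This is not pqlite's implementation: the helper names `kmeans` and `fit_pq`, the toy sizes and the plain Lloyd iterations are assumptions made for illustration. It only shows the idea that one independent codebook is learned per sub-space.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances between every point and every centroid: (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def fit_pq(X, n_subvectors, n_clusters):
    """Learn one codebook per sub-space (hypothetical helper, not pqlite's API)."""
    d_sub = X.shape[1] // n_subvectors
    return np.stack([
        kmeans(X[:, m * d_sub:(m + 1) * d_sub], n_clusters, seed=m)
        for m in range(n_subvectors)
    ])

rng = np.random.default_rng(0)
X_demo = rng.random((500, 32)).astype(np.float32)   # toy training set
pq_codebooks = fit_pq(X_demo, n_subvectors=4, n_clusters=16)
print(pq_codebooks.shape)  # (4, 16, 8): sub-spaces, prototypes, sub-vector dim
```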


In [5]:
import pprint
pprint.pprint(list(pqlite.__dict__.keys()))

['initial_size',
 'expand_step_size',
 'expand_mode',
 'n_cells',
 'code_size',
 'dtype',
 '_doc_id_dtype',
 '_vecs_storage',
 '_cell_size',
 '_cell_capacity',
 '_cell_tables',
 '_meta_table',
 'd_vector',
 'n_subvectors',
 'd_subvector',
 'metric',
 'use_residual',
 'n_probe',
 '_use_smart_probing',
 '_smart_probing_temperature',
 'vq_codec',
 'pq_codec']


In [6]:
pqlite.d_vector / pqlite.n_subvectors == pqlite.d_subvector

True

- Where can we find the data for each cell after fitting a `PQLite` instance?

`PQLite.fit` partitions the space into `n_cells` cells, but no data is stored in the object until `.add` is called. The storage only holds one (still empty) array per cell:

In [7]:
len(pqlite._vecs_storage)

64

- Where can we find the codebook for the `n_cells` regions?

The prototypes for the cells of the coarse quantization step can be found in `pqlite.vq_codec.codebook`. There is a single prototype per cell, hence `n_cells` prototypes, each of dimension `d_vector`.

In [8]:
pqlite.vq_codec.codebook.shape

(64, 128)
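
As a rough sketch of what such a coarse codebook is used for, the snippet below assigns each vector to the cell whose prototype is nearest. The function `assign_cells` and the random stand-in codebook are assumptions for illustration; they are not taken from pqlite.

```python
import numpy as np

def assign_cells(X, codebook):
    """Return, for each row of X, the index of the nearest prototype."""
    # squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2, shape (n, n_cells)
    d2 = (
        (X ** 2).sum(axis=1)[:, None]
        - 2.0 * X @ codebook.T
        + (codebook ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook_demo = rng.random((64, 128)).astype(np.float32)  # stand-in for a trained codebook
X_demo = rng.random((1000, 128)).astype(np.float32)

cell_ids = assign_cells(X_demo, codebook_demo)
cell_sizes = np.bincount(cell_ids, minlength=64)  # number of vectors per cell
print(cell_sizes.sum())  # 1000: every vector lands in exactly one cell
```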

## Adding data

Before data is added the `_cell_size` is zero for each of the `n_cells` cells.

In [9]:
pqlite._cell_size

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Next we add the training vectors `Xt` to the index, attaching a random tag `x` to each one.

In [10]:
Xt.shape

(2000, 128)

In [11]:
tags = [{'x': random.random()} for _ in range(len(Xt))]
pqlite.add(Xt, ids=list(range(len(Xt))), doc_tags=tags)

2021-11-11 13:27:41.376 | DEBUG    | pqlite.storage.cell:_expand:148 - => total storage capacity is expanded by 0 for 64 cells
2021-11-11 13:27:41.378 | DEBUG    | pqlite.storage.cell:insert:90 - => 2000 new items added


We can see that each cell now contains some examples.

In [12]:
pqlite._cell_size

array([53, 12, 11, 18,  7, 70,  5, 12, 65,  4,  4, 18, 16, 14, 20,  8, 56,
       43, 50, 25, 35, 48, 10, 18, 74, 26, 31, 29, 42,  9, 44, 12, 16, 33,
       10, 12, 25,  7, 67, 53, 33, 42, 25, 35, 60, 62, 65, 16,  7, 51, 34,
       45, 67, 53, 29, 28, 23, 22, 37, 23,  4, 28, 43, 56])

In total we should have 2000 examples across all cells.

In [13]:
pqlite._cell_size.sum()

2000

- Where can we access the quantized data of a particular cell?

The data of the n-th cell is available in `pqlite._vecs_storage[n]`. Each data point is assigned to exactly one cell.

In [14]:
pqlite._vecs_storage[0]

array([[162, 172, 251, ..., 167, 107, 147],
       [  5,  78,  44, ..., 162, 216, 232],
       [ 59,  10, 133, ...,  93, 134, 105],
       ...,
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=uint8)

In [15]:
cell_0 = pqlite._vecs_storage[0]
cell_0

array([[162, 172, 251, ..., 167, 107, 147],
       [  5,  78,  44, ..., 162, 216, 232],
       [ 59,  10, 133, ...,  93, 134, 105],
       ...,
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=uint8)

The number of elements in cell 0, given by `pqlite._cell_size[0]`, matches the number of non-zero rows in `pqlite._vecs_storage[0]`. The trailing all-zero rows appear to be pre-allocated capacity that is still unused (see `_cell_capacity`).

In [16]:
inds = cell_0.sum(axis=1)!=0
len(cell_0[inds]) == pqlite._cell_size[0]

True
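
The pattern of a pre-allocated array plus a separate size counter can be sketched as follows. `CellBuffer` and its parameters are hypothetical names, loosely inspired by the `_cell_size`, `_cell_capacity` and `expand_step_size` attributes listed earlier; pqlite's actual storage class may work differently.

```python
import numpy as np

class CellBuffer:
    """Toy growable storage for the uint8 codes of one cell."""

    def __init__(self, code_size, initial_capacity=4, expand_step=4):
        self.codes = np.zeros((initial_capacity, code_size), dtype=np.uint8)
        self.size = 0                 # number of rows actually in use
        self.expand_step = expand_step

    def insert(self, new_codes):
        new_codes = np.asarray(new_codes, dtype=np.uint8)
        needed = self.size + len(new_codes)
        if needed > len(self.codes):
            # grow by whole multiples of expand_step (ceiling division)
            n_steps = -(-(needed - len(self.codes)) // self.expand_step)
            extra = np.zeros((n_steps * self.expand_step, self.codes.shape[1]),
                             dtype=np.uint8)
            self.codes = np.vstack([self.codes, extra])
        self.codes[self.size:needed] = new_codes
        self.size = needed

    def valid(self):
        """Rows that hold real data, without the zero padding."""
        return self.codes[:self.size]

buf = CellBuffer(code_size=8)
buf.insert(np.ones((6, 8)))
print(buf.codes.shape, buf.size)  # (8, 8) 6
```

Slicing by the size counter is more robust than testing for non-zero rows, since a valid code can in principle consist only of zeros.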

### Understanding the underlying sqlite connection

In [17]:
pqlite.cell_tables[0].count()

53

In [18]:
pqlite.cell_tables[0]

<pqlite.storage.table.CellTable at 0x7fe6304ccd90>

In [19]:
pqlite.cell_tables[0].__dict__

{'_conn_name': ':memory:',
 '_name': 'cell_table_0',
 '_conn': <sqlite3.Connection at 0x7fe6306113f0>,
 '_columns': ['x FLOAT'],
 '_indexed_keys': {'x'}}
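
To make the role of the SQLite tables more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout (`_id`, `_doc_id`, `x`) and the index name are assumptions chosen to mirror the `_columns` and `_indexed_keys` shown above; the schema pqlite actually creates is not visible in this notebook.

```python
import sqlite3

# an in-memory database, like the ':memory:' connection shown above
conn = sqlite3.connect(':memory:')

# hypothetical per-cell table with one filterable column `x`
conn.execute(
    'CREATE TABLE cell_table_0 (_id INTEGER PRIMARY KEY, _doc_id TEXT, x FLOAT)'
)
conn.execute('CREATE INDEX idx_x ON cell_table_0 (x)')  # index for fast filtering

rows = [('0', 0.10), ('1', 0.45), ('2', 0.25), ('3', 0.90)]
conn.executemany('INSERT INTO cell_table_0 (_doc_id, x) VALUES (?, ?)', rows)

# number of documents in the cell
n_rows = conn.execute('SELECT COUNT(*) FROM cell_table_0').fetchone()[0]

# documents in the cell that satisfy the condition ('x', '<', 0.3)
matching = [
    r[0] for r in conn.execute(
        'SELECT _doc_id FROM cell_table_0 WHERE x < ? ORDER BY _id', (0.3,)
    )
]
print(n_rows, matching)  # 4 ['0', '2']
```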

In [20]:
pqlite

<pqlite.index.PQLite at 0x7fe6117e6790>

## Search without filtering

In [21]:
Nq = 1
query = np.random.random((Nq, D)).astype(np.float32)  # a 128-dim query vector

# without filtering
dists, ids = pqlite.search(query, k=5)

print(f'the result without filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')

# no `conditions` argument is passed, so the results above are unfiltered

the result without filtering:
query [0]: [11.450677 13.367572 13.403849 13.49054  13.53594 ] [b'1707' b'607' b'654' b'361' b'1367']


In [22]:
dists

array([[11.450677, 13.367572, 13.403849, 13.49054 , 13.53594 ]],
      dtype=float32)
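
A common way to search over product-quantized codes is asymmetric distance computation: the query stays uncompressed, and its distance to every prototype is precomputed once per sub-space. The sketch below illustrates that general technique with random stand-in data. The function `adc_search` is hypothetical, and whether pqlite reports squared distances or uses residuals is not something this notebook shows.

```python
import numpy as np

def adc_search(query, codes, codebooks, k):
    """Rank PQ codes by approximate squared distance to `query`."""
    n_sub, n_clusters, d_sub = codebooks.shape
    q_sub = query.reshape(n_sub, d_sub)
    # table[m, c] = squared distance between query sub-vector m and prototype c
    table = ((q_sub[:, None, :] - codebooks) ** 2).sum(axis=2)
    # for every stored code, add up the table entries it selects
    approx = table[np.arange(n_sub), codes].sum(axis=1)
    order = np.argsort(approx)[:k]
    return approx[order], order

rng = np.random.default_rng(0)
codebooks_demo = rng.random((4, 16, 8)).astype(np.float32)       # stand-in codebooks
codes_demo = rng.integers(0, 16, size=(100, 4), dtype=np.uint8)  # stand-in codes
query_demo = rng.random(32).astype(np.float32)

top_dists, top_rows = adc_search(query_demo, codes_demo, codebooks_demo, k=5)
print(top_dists.shape, top_rows.shape)  # (5,) (5,)
```

The lookup table has only `n_sub * n_clusters` entries, so scoring a stored vector costs `n_sub` table lookups instead of a full distance computation.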

## Search with filtering

In [23]:
query = np.random.random((Nq, D)).astype(np.float32)  # a 128-dim query vector

# without filtering
dists, ids = pqlite.search(query, k=5)

print(f'the result without filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')

# with filtering
# condition schema: (column_name: str, relation: str, value: any)
conditions = [('x', '<', 0.3)]
dists, ids = pqlite.search(query, conditions=conditions, k=5)

print(f'the result with filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')

the result without filtering:
query [0]: [13.944338 14.238414 14.262455 14.2831   14.428671] [b'1349' b'1460' b'1370' b'529' b'892']
the result with filtering:
query [0]: [14.528587  14.8192425 14.847575  14.941542  15.022545 ] [b'1679' b'139' b'1446' b'1573' b'1983']
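
Conceptually, a filtered search keeps only the candidates whose tag satisfies the condition and then takes the top `k` among them. The following sketch shows that idea on a handful of made-up distances; `filtered_top_k` is a hypothetical helper and says nothing about the order in which pqlite itself applies the filter and the ranking.

```python
import numpy as np

def filtered_top_k(dists, ids, tag_values, threshold, k):
    """Top-k nearest candidates among those with tag value < threshold."""
    mask = tag_values < threshold           # the condition ('x', '<', threshold)
    kept_dists, kept_ids = dists[mask], ids[mask]
    order = np.argsort(kept_dists)[:k]
    return kept_dists[order], kept_ids[order]

dists_demo = np.array([3.0, 1.0, 2.0, 0.5, 4.0])   # made-up distances
ids_demo = np.array([10, 11, 12, 13, 14])          # made-up document ids
x_demo = np.array([0.1, 0.9, 0.2, 0.7, 0.05])      # made-up tag values

top_d, top_ids = filtered_top_k(dists_demo, ids_demo, x_demo, threshold=0.3, k=2)
print(top_d, top_ids)  # [2. 3.] [12 10]
```

Because filtering can only shrink the candidate set, the best filtered distance cannot be smaller than the best unfiltered one for the same candidates, which is consistent with the larger distances in the filtered output above.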


## Benchmark PQLite

- Let us benchmark with 100k, 1 million, 5 million and 10 million vectors of 128 floats.

- We want to measure time, memory usage and search quality (precision, recall).

- Do a detailed profiling of which function calls take the most time in pqlite.

    - Propose improvements to make it faster.
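
One possible way to measure recall and time is sketched below, using exact brute-force search as ground truth. The helpers `exact_knn` and `recall_at_k` are illustrative names, and the "approximate" result is a deliberately degraded stand-in rather than real pqlite output. When plugging in `pqlite.search`, note that the ids come back as byte strings (for example `b'1707'`) and need converting with `int()` first.

```python
import time
import numpy as np

def exact_knn(X, queries, k):
    """Brute-force k nearest neighbours; returns (n_queries, k) row indices."""
    d2 = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of true neighbours that the approximate search returned."""
    hits = [len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids)]
    return sum(hits) / exact_ids.size

rng = np.random.default_rng(0)
X_demo = rng.random((2000, 32)).astype(np.float32)
queries_demo = rng.random((5, 32)).astype(np.float32)

start = time.perf_counter()
truth = exact_knn(X_demo, queries_demo, k=10)
elapsed = time.perf_counter() - start   # wall-clock time of the exact search

# stand-in for an approximate result: half of the true ids replaced by -1
approx = truth.copy()
approx[:, 5:] = -1

print(recall_at_k(approx, truth))  # 0.5
print(f'exact search took {elapsed:.4f} s')
```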



In [47]:
import random
import numpy as np
from pqlite import PQLite

N = 100_000 # number of data points
D = 128 # dimensionality / number of features

# 2,000 128-dim vectors for training
Xt = np.random.random((Nt, D)).astype(np.float32)  

# the column schema: (name:str, dtype:type, create_index: bool)
pqlite = PQLite(d_vector=D, n_cells=64, n_subvectors=8, columns=[('x', float, True)])

pqlite.fit(Xt)

2021-11-11 15:22:02.546 | INFO     | pqlite.index:fit:90 - => start training VQ codec with 2000 data...
2021-11-11 15:22:02.626 | INFO     | pqlite.index:fit:93 - => start training PQ codec with 2000 data...
2021-11-11 15:22:03.505 | INFO     | pqlite.index:fit:96 - => pqlite is successfully trained!


In [48]:
pqlite.add(Xt, ids=list(range(len(Xt))))

2021-11-11 15:22:03.741 | DEBUG    | pqlite.storage.cell:_expand:148 - => total storage capacity is expanded by 0 for 64 cells
2021-11-11 15:22:03.743 | DEBUG    | pqlite.storage.cell:insert:90 - => 2000 new items added


In [58]:
query = np.random.random((1, D)).astype(np.float32)  # a 128-dim query vector

In [65]:
pq_dists, ids = pqlite.search(query, k=5)

In [66]:
pq_dists

array([[11.835682 , 12.070293 , 12.2886715, 12.305212 , 12.500567 ]],
      dtype=float32)

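Note that each original vector is compressed by product quantization.
The vector is split into `n_subvectors` sub-vectors, each of dimension `d_vector / n_subvectors`.

That is 128/8 = 16, which matches `pqlite.d_subvector`. Each sub-vector is then replaced by the index of its nearest prototype, so the stored code should have `n_subvectors` = 8 entries instead of 128 floats.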
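
The encoding step of product quantization, as it is commonly described, can be sketched like this. The function `pq_encode`, the random stand-in codebooks and the choice of 256 prototypes per sub-space are assumptions for illustration (256 is the largest count whose indices fit in a `uint8`); they are not read from pqlite.

```python
import numpy as np

def pq_encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest prototype."""
    n_sub, n_clusters, d_sub = codebooks.shape
    codes = np.empty((len(X), n_sub), dtype=np.uint8)
    for m in range(n_sub):
        sub = X[:, m * d_sub:(m + 1) * d_sub]            # (n, d_sub) slice
        d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d2.argmin(axis=1)                  # nearest prototype id
    return codes

rng = np.random.default_rng(0)
codebooks_demo = rng.random((8, 256, 16)).astype(np.float32)  # stand-in, not trained
X_demo = rng.random((100, 128)).astype(np.float32)

codes_demo = pq_encode(X_demo, codebooks_demo)
print(codes_demo.shape, codes_demo.dtype)       # (100, 8) uint8
print(X_demo[0].nbytes, codes_demo[0].nbytes)   # 512 8
```

Under these assumptions each 128-float vector (512 bytes) is stored as 8 bytes, one per sub-vector of dimension 16.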

In [None]:
# check the shape of the codes stored for cell 0
pqlite._vecs_storage[0].shape

In [None]:
pqlite._vecs_storage[0].dtype

In [None]:
X = np.random.random((N, D)).astype(np.float32)  # 100,000 128-dim vectors to be indexed

tags = [{'x': random.random()} for _ in range(N)]
pqlite.add(X, ids=list(range(len(X))), doc_tags=tags)

In [None]:
len(pqlite._vecs_storage)