<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#PQLite-explained" data-toc-modified-id="PQLite-explained-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>PQLite explained</a></span></li><li><span><a href="#Understanding-pq.fit" data-toc-modified-id="Understanding-pq.fit-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Understanding <code>pq.fit</code></a></span><ul class="toc-item"><li><span><a href="#Encoding-data-in-the-PQ-space" data-toc-modified-id="Encoding-data-in-the-PQ-space-0.2.1"><span class="toc-item-num">0.2.1&nbsp;&nbsp;</span>Encoding data in the PQ space</a></span></li></ul></li><li><span><a href="#Understanding-Storage" data-toc-modified-id="Understanding-Storage-0.3"><span class="toc-item-num">0.3&nbsp;&nbsp;</span>Understanding Storage</a></span><ul class="toc-item"><li><span><a href="#._vec_indexes" data-toc-modified-id="._vec_indexes-0.3.1"><span class="toc-item-num">0.3.1&nbsp;&nbsp;</span>._vec_indexes</a></span></li><li><span><a href="#._doc_stores" data-toc-modified-id="._doc_stores-0.3.2"><span class="toc-item-num">0.3.2&nbsp;&nbsp;</span>._doc_stores</a></span></li><li><span><a href="#._cell_tables" data-toc-modified-id="._cell_tables-0.3.3"><span class="toc-item-num">0.3.3&nbsp;&nbsp;</span>._cell_tables</a></span></li></ul></li><li><span><a href="#Understanding-pq.index" data-toc-modified-id="Understanding-pq.index-0.4"><span class="toc-item-num">0.4&nbsp;&nbsp;</span>Understanding <code>pq.index</code></a></span></li><li><span><a href="#Understanding-pq.search" data-toc-modified-id="Understanding-pq.search-0.5"><span class="toc-item-num">0.5&nbsp;&nbsp;</span>Understanding <code>pq.search</code></a></span><ul class="toc-item"><li><span><a href="#Searching-with-filtering" data-toc-modified-id="Searching-with-filtering-0.5.1"><span class="toc-item-num">0.5.1&nbsp;&nbsp;</span>Searching with filtering</a></span></li></ul></li></ul></li><li><span><a href="#Advanced-Resources" data-toc-modified-id="Advanced-Resources-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Advanced Resources</a></span><ul class="toc-item"><li><span><a href="#Training-in-minibatches" data-toc-modified-id="Training-in-minibatches-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Training in minibatches</a></span></li></ul></li></ul></div>

## PQLite explained

In [1]:
import pyximport
pyximport.install()

import pqlite
pqlite.__path__
import time

import jina
from docarray.math.distance import cdist
from docarray import DocumentArray, Document

import sklearn
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

import random
import numpy as np
from pqlite import PQLite



#### How does pqlite works

Pqlite has a first coarse search step.

When adding elements to PQLite elements are stored in cells.

The number `n_datapoints / n_cells` will be roughly the number of elements in each cell.

In [2]:
Nq = 1
D = 128 
top_k = 100
n_cells = 10
n_subvectors = 32

In [3]:
!rm -rf ./data

## Understanding `pq.fit`

Internally, calling `pq.fit(Xtr)` makes the `pq` class to learn a quantizer stored in `pq.pq_codec`.

The `pq` does not add data unitl `pq.add()` is called.

We can see that the cells in `pq` are empty

Let us see what we have after adding to PQLIte with 500 examples


In [4]:

Nt = 2000

np.random.seed(1234)
Xtr, Xte = train_test_split(make_blobs(n_samples = Nt, n_features = D)[0].astype(np.float32), test_size=20)

# the column schema: (name:str, dtype:type, create_index: bool)
pq = PQLite(dim=D,
            metric='euclidean',
            n_cells=n_cells,
            n_subvectors=n_subvectors, 
            columns=[('price',float), ('category', str)])

pq.train(Xtr)

2021-12-29 08:15:34.127 | INFO     | pqlite.index:__init__:89 - Initialize VQ codec (K=10)
2021-12-29 08:15:34.128 | INFO     | pqlite.index:__init__:99 - Initialize PQ codec (n_subvectors=32)
2021-12-29 08:15:34.149 | INFO     | pqlite.index:train:142 - Start training VQ codec (K=10) with 1980 data...
2021-12-29 08:15:34.331 | INFO     | pqlite.index:train:148 - Start training PQ codec (n_subvectors=32) with 1980 data...
2021-12-29 08:15:55.409 | INFO     | pqlite.index:train:153 - The pqlite is successfully trained!
2021-12-29 08:15:55.410 | INFO     | pqlite.index:dump_model:387 - Save the trained parameters to data/0a7dfc558abb6bc6cb48db43ccf64964


Note that cells are empty because we have not added yet any information

In [5]:
[c.size for c in pq.cell_tables] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [6]:
!du -h data

 16K	data/cell_1
 16K	data/cell_6
 16K	data/cell_8
 16K	data/cell_9
 16K	data/cell_7
 16K	data/cell_0
528K	data/0a7dfc558abb6bc6cb48db43ccf64964
 16K	data/cell_5
 16K	data/cell_2
 16K	data/cell_3
 16K	data/cell_4
688K	data


Internally `pq.train` will train the the product quantizers in `pq.pq_codec` using the `.fit` of `pq.pq_codec`.

In [7]:
pq.pq_codec.fit(Xtr)

Information about hyperparams: There are 3 important hyperparams.

- `n_cells`: Number of cells in the coarse distance estimation step
- `n_subvectors`:  Number of subvectors. 
- `n_probe`: Number of cells used to perform search.


In [8]:
print(f'pq.n_cells = {pq.n_cells }')
print(f'pq.n_subvectors = {pq.n_subvectors }')
print(f'pq.n_probe = {pq.n_probe}')

pq.n_cells = 10
pq.n_subvectors = 32
pq.n_probe = 16


### Encoding data in the PQ space

We can use the current `quantizer` stored in `pq.pq_codec` to encode data.

The dimension of the encoded data will be equal to `n_subvectors`.

In [9]:
pq.encode(Xte[[0]])

array([[134, 147, 144, 105, 160,  97,  54,  77, 139,  61, 205, 111, 193,
        220, 160, 141,  24,  91,  12, 163,  41, 122, 215,   1, 212, 201,
        233,  18, 246, 156, 190,  65]], dtype=uint8)

The quantizer stored in `pq.pq_codec` uses a codebook for each of the subspaces of the Product Quantized space. 

Since we have `n_subvectors = 32` this means we will have 32 different subspaces. Each sub-space will have been used to learn a quantizer with the corresponding columns from the training data.

In this case since `pq.pq_codec.d_subvector` is `4` then each of the slices `Xtr[:,0:4],Xtr[:,4:8],Xtr[:,8:12],....` will have a corresponding codebook. Note that `d_subvector` was not set but is a consequence of having `n_subvectors=32` because `D=128` and `128/32=4`.  

All codebooks are stored in `pq.pq_codec.codebooks`. Since each subvector is mapped into a single integer and we have 32 subvectors we will need 32 sub-codebooks to map those vectors. 

Each of the sub-codebooks will consist on `pq.pq_codec.n_clusters` **codevectors** of size 4. By default `pq.pq_codec.n_clusters` is set to `256`.

In [10]:
pq.pq_codec.codebooks.shape

(32, 256, 4)

In [11]:
pq.pq_codec.n_clusters

256

Note that each codebook contains is a matrix of shape `(K, d_subvector)` where `K` is the number of prototypes for each subspace (that is `K` equal `pq.pq_codec.n_clusters`).

In [12]:
pq.pq_codec.codebooks[0].shape

(256, 4)

##### Understanding the encoding

Once we have fitted a `pq.codec` we can encode the data.
This process takes a real valued vector, splits it in slices of size `pq.d_subvector` and each of the slices is assigned to the closest prototype stored in the codebook of the corresponding slice.

For example we can take an slice of a vector and look where it should be matched


In [13]:
slice_0 = Xte[0][0:4]
slice_0

array([-6.4715757,  3.6630175, -1.0034125,  5.976884 ], dtype=float32)

In [14]:
dists_to_prototypes_slice_0 = np.sum((pq.pq_codec.codebooks[0] - slice_0)**2, axis=1)
print(dists_to_prototypes_slice_0.shape)
print(dists_to_prototypes_slice_0.argmin())

(256,)
134


Repeating this process for each slice will encode our vector in the PQ space.

This can be done using `pq.encode`

In [15]:
pq.encode(Xte[[0]])

array([[134, 147, 144, 105, 160,  97,  54,  77, 139,  61, 205, 111, 193,
        220, 160, 141,  24,  91,  12, 163,  41, 122, 215,   1, 212, 201,
        233,  18, 246, 156, 190,  65]], dtype=uint8)

This method will internally call the stored `pq_codec` and call the `.encode` method of the internal `pq_codec`

In [16]:
pq.pq_codec.encode(Xte[[0]])

array([[134, 147, 144, 105, 160,  97,  54,  77, 139,  61, 205, 111, 193,
        220, 160, 141,  24,  91,  12, 163,  41, 122, 215,   1, 212, 201,
        233,  18, 246, 156, 190,  65]], dtype=uint8)

## Understanding Storage


### ._vec_indexes


`pq` stores the quantized data in `pq._vec_indexes`. This is a list of `n_cells` elements containing matrices with the quantized data added to `pq`. Note that if no data is added this matrices will contain only 0 values.


In [32]:
pq._vec_indexes

[<pqlite.core.index.pq_index.PQIndex at 0x7ff67931a640>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff6784955e0>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff678495730>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff678495850>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff670411a30>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff6784956d0>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff67931a160>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff67931a610>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff67931a9a0>,
 <pqlite.core.index.pq_index.PQIndex at 0x7ff67931a9d0>]

In [43]:
pq._vec_indexes[0]._data

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

In [44]:
pq._vec_indexes[0]._data.shape

(10240, 32)

In [55]:
pq._vec_indexes[0].__dict__

{'_dense_dim': 128,
 'initial_size': 10240,
 'expand_step_size': 10240,
 'expand_mode': <ExpandMode.STEP: 1>,
 'dim': 32,
 'dtype': numpy.uint8,
 'metric': <Metric.EUCLIDEAN: 1>,
 '_size': 0,
 '_capacity': 10240,
 '_data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 '_pq_codec': <pqlite.core.codec.pq.PQCodec at 0x7ff6902fc0a0>}

### ._doc_stores

In [40]:
pq._doc_stores

[<pqlite.storage.kv.DocStorage at 0x7ff67931a430>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931a5e0>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931ad00>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931a700>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931ae50>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931abb0>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931af70>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931ab50>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931ad60>,
 <pqlite.storage.kv.DocStorage at 0x7ff67931aee0>]

### ._cell_tables

`CellTable` objects provide an interface with sqlite that allows retriving data with filters

In [57]:
pq._cell_tables

[<pqlite.storage.table.CellTable at 0x7ff67931a460>,
 <pqlite.storage.table.CellTable at 0x7ff679338040>,
 <pqlite.storage.table.CellTable at 0x7ff6793380a0>,
 <pqlite.storage.table.CellTable at 0x7ff679338100>,
 <pqlite.storage.table.CellTable at 0x7ff679338190>,
 <pqlite.storage.table.CellTable at 0x7ff679338220>,
 <pqlite.storage.table.CellTable at 0x7ff679338280>,
 <pqlite.storage.table.CellTable at 0x7ff6793382e0>,
 <pqlite.storage.table.CellTable at 0x7ff679338340>,
 <pqlite.storage.table.CellTable at 0x7ff6793383a0>]

In [60]:
pq._cell_tables[0].__dict__

{'_conn_name': ':memory:',
 '_name': 'table_0',
 '_conn': <sqlite3.Connection at 0x7ff6792c3990>,
 '_columns': ['price FLOAT', 'category TEXT'],
 '_indexed_keys': {'category', 'price'}}

## Understanding `pq.index`

We have seen that `pq` has not stored a single example

In [17]:
[c.size for c in pq.cell_tables] 

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

To add examples we have to do 

In [18]:
from jina import DocumentArray, Document

In [19]:
np.random.choice((100,25,10)),np.random.choice(['comics','movies','audiobook'])

(10, 'audiobook')

In [20]:
CATEGORIES = ['comics','movies','audiobook']
da = DocumentArray([Document(id=f'{i}', 
                             embedding=Xtr[i], 
                             tags={
                                   'price': np.random.choice((5.,10.,25.,100.)),
                                   'category':np.random.choice(CATEGORIES),
                                 }) for i in range(len(Xtr))])
    

In [21]:
len(da)

1980

In [22]:
da[0].tags['price']

5.0

Before indexing we can see in `./data` that there are some folders that containg the basic data structures used to store the `indexed data`.

In [23]:
!du -h data

 16K	data/cell_1
 16K	data/cell_6
 16K	data/cell_8
 16K	data/cell_9
 16K	data/cell_7
 16K	data/cell_0
528K	data/0a7dfc558abb6bc6cb48db43ccf64964
 16K	data/cell_5
 16K	data/cell_2
 16K	data/cell_3
 16K	data/cell_4
688K	data


In [24]:
pq.index(da)

2021-12-27 13:41:53.273 | DEBUG    | pqlite.container:insert:225 - => 1980 new docs added


In [25]:
[c.size for c in pq.cell_tables] 

[99, 203, 231, 202, 255, 186, 210, 175, 221, 198]

After indexing we can see that each cell  has a sensible amount of information

In [26]:
!du -h data

200K	data/cell_1
212K	data/cell_6
224K	data/cell_8
204K	data/cell_9
188K	data/cell_7
112K	data/cell_0
528K	data/0a7dfc558abb6bc6cb48db43ccf64964
196K	data/cell_5
232K	data/cell_2
212K	data/cell_3
248K	data/cell_4
2.5M	data


If we sum the elements across cells we will se that this number matches the length of the indexed DocumentArray

In [28]:
np.sum([c.size for c in pq.cell_tables] ) == len(da)

True

The cell information is 

In [29]:
print(f'The number of cells is n_cells={pq.n_cells}')

print('\nCells can be accessed in pq.cell_tables')
print(f'\twe have len(pq.cell_tables)={len(pq.cell_tables)} cells')
print(f'\nWe have added len(Xtr)={len(Xtr)} elements to pq')

The number of cells is n_cells=10

Cells can be accessed in pq.cell_tables
	we have len(pq.cell_tables)=10 cells

We have added len(Xtr)=1980 elements to pq


Note that `pq.cell_tables` is a list of  `CellTable` objects

In [30]:
pq.cell_tables[0]

<pqlite.storage.table.CellTable at 0x7fb468aa3880>


Each CellTable allows you to `insert`, `query` and `delete` vectors

We can inspect how many elements are in a cell using `.count()`

In [31]:
pq.cell_tables[0].count()

99

Not all `cell_tables` will contain the same number of elements because not all of them are assigned to the same prototype. Nevertheless the sum of the elements across cells equalts the number of added elements

In [33]:
elements_per_cell = [pq_cell_table.count() for pq_cell_table in pq.cell_tables]
print('elements_per_cell =', elements_per_cell)
print('total number of elements added =', np.sum(elements_per_cell))
print('np.sum(elements_per_cell) == len(Xtr) is ',np.sum(elements_per_cell) == len(Xtr))

elements_per_cell = [99, 203, 231, 202, 255, 186, 210, 175, 221, 198]
total number of elements added = 1980
np.sum(elements_per_cell) == len(Xtr) is  True


In [42]:
pq._doc_stores[0]

<pqlite.storage.kv.DocStorage at 0x7fb468aa3550>

In [44]:
pq.total_docs

1980

## Understanding `pq.search`


Internally `pq.search` first computes the distance between each query and the prototypes that define the cells. Then the cells whose prototypes are closest to a query are selected as search space. The best  `pq.n_probe` cells are selected (this is a hyperparameter of the algorithm).

Since `pq.n_probe` in this case is bigger than `pq.n_cells` all the cells will be searched.

In [32]:
pq.n_probe, pq.n_cells

(16, 10)

Note that  `pq.search` can be called with a batch of vectors. Once called it will end up calling `pq.search_cells` with the full batch of queries but with an array of arrays containing at each position a list of the ids of the cells that best batch the query. So if 5 queries are passed into the `pq.search` it will pass to `self.search_cells` an array of size `(len(queries), max(pq.n_probe, pq.n_cells)`.

The `.search_cells` method iterates over the queries and comptues the distance between each query and all retrieved elements in the activated cells.

For each query in the batch  the Asymetric Distance Computation is performed using `pq.pq_codec.precompute_adc` which returns a table of shape `(pq.n_subvectors, pq.pq_codec.n_clusters)`.  In our case a matrix of shape `(32, 256)`.


In [33]:
query = Xtr[[10]]

In [34]:
dtable = pq.pq_codec.precompute_adc(query[0])
dtable.dtable.shape

(32, 256)

We can do this faster with a cython function as follows

In [35]:
import pqlite.pq_bind
from pqlite.pq_bind import precompute_adc_table

In [36]:
d_subvector = int(query.shape[1]/pq.pq_codec.n_subvectors)

In [37]:
dt = precompute_adc_table(query[0], 
                          d_subvector,
                          pq.pq_codec.n_clusters,
                          pq.pq_codec.codebooks)

In [38]:
np.mean(np.asarray(dt) - dtable.dtable)

0.0

This table contains the distance between each possible subvector in que query and each possible subvector from any subcodevector.

Therefore we go from `search` -> `search_cells` -> `search_cells` -> `precomputed_k = pq_codec.precompute_adc(query_k)` -> `ivfpq_topk`

Therefore for `query_k` we compute the ADC table. Then this table is used to compute the distance between the query and all the database. 

In this case, since there is filtering, the computations are done only on a subset of the database. Distances are computed between the query and the exapmles that come from the selected cells and verify the conditions specified by the provided filter.

```
self.ivfpq_topk(precomputed, cells=cell_idx,conditions=conditions,k=k )
```

In [39]:
pq.pq_codec.codebooks.shape

(32, 256, 4)

In [40]:
precomputed = pq.pq_codec.precompute_adc(query[0])
precomputed.dtable.shape

(32, 256)

Given a bunch of datapoint candidates from the database (from which we already have the pqcode) we want to find distances between the query and the candidates. This is done with `precomputed.adist(codes)` which returns the distance between each code in codes and the pqcode of the query.

Let us recall that each subvector in the quantized space represented 4 values in the original space and those 4 real-valued values are represented with a single intenger from 0 to n_clusters.

In [41]:
pq.pq_codec.d_subvector, pq.pq_codec.n_clusters

(4, 256)

We can look at the indexed (quantized) data in `cell_k` using  `pq._vec_indexes[cell_k].data`

In [42]:
pq._vec_indexes[0]._data

array([[ 83,  91,  31, ...,  41,   5,  37],
       [ 45, 109,  36, ...,  45, 178,  40],
       [120,   1,  20, ..., 150, 147,   8],
       ...,
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=uint8)

This is an array of pq-codes of vectors that have been indexed.

Note that several rows are 0 because memory is preemtively saved to avoid too much memory resizes.

In [43]:
pq._vec_indexes[0]._data.shape

(10240, 32)

We can see that the number of items in a cell starts at a position that is full of 0 values

In [44]:
print(np.where(pq._vec_indexes[0]._data.sum(axis=1)==0)[0][0])
print(np.where(pq._vec_indexes[1]._data.sum(axis=1)==0)[0][0])
print([c.size for c in pq.cell_tables])

99
203
[99, 203, 231, 202, 255, 186, 210, 175, 221, 198]


Note that this is not necessarily true since a vector in the original feature space could be mapped to a pq-code that is represented  as zeros [0,0,...0].

In [45]:
precomputed = pq.pq_codec.precompute_adc(query[0])

In [46]:
from jina import DocumentArray
da = DocumentArray([Document(embedding=query[0]),
                    Document(embedding=Xtr[0])])

We can search matches for documents in a documentarray using `.search`.

Note that this does not return anything

In [47]:
pq.search(da,limit=5)

But the documentarray is updated with matches in each of the docs of the documentarray

In [48]:
[m.id for m in da[0].matches]

['10', '499', '710', '1862', '1979']

We can anually look at the euclidean distances with

In [49]:
[x.scores['euclidean'].value for x in da[0].matches]

[13.598586082458496,
 162.94834899902344,
 178.1434783935547,
 178.3148651123047,
 180.0396270751953]

In [50]:
[x.scores['euclidean'].value for x in da[1].matches]

[12.756232261657715,
 178.5064239501953,
 182.26963806152344,
 183.38494873046875,
 186.2959442138672]

The search method will look into the different cells and search on each cell retrieve elements and compute distances. In each cell the method `.search` will be called.

Note that the distance 5.45 appears in one of the cells if we exhaustively search across cells for the closest matches.

In [51]:
[pq._vec_indexes[i].search(query[0], 1) for i in range(len(pq._vec_indexes))]

[(array([5336.07617188]), array([99])),
 (array([5336.07617188]), array([203])),
 (array([186.16217041]), array([64])),
 (array([5336.07617188]), array([202])),
 (array([5336.07617188]), array([255])),
 (array([5336.07617188]), array([186])),
 (array([178.14347839]), array([75])),
 (array([5336.07617188]), array([175])),
 (array([13.59858608]), array([0])),
 (array([5336.07617188]), array([198]))]

In [52]:
[pq._vec_indexes[i].search(Xtr[0], 1) for i in range(len(pq._vec_indexes))]

[(array([193.5615387]), array([32])),
 (array([5816.02050781]), array([203])),
 (array([5816.02050781]), array([231])),
 (array([5816.02050781]), array([202])),
 (array([5816.02050781]), array([255])),
 (array([182.26963806]), array([94])),
 (array([5816.02050781]), array([210])),
 (array([12.75623226]), array([0])),
 (array([5816.02050781]), array([221])),
 (array([186.29594421]), array([119]))]

An important observation is that the closest elements in many cells are really far from the best elments in a few cells. This suggests there is not need to look into all cells at query time (for this examples).

### Searching with filtering

We can filter according to a set of tags of the documents

In [53]:
!rm -rf ./data

In [54]:

np.random.seed(1234)
Xtr, Xte = train_test_split(make_blobs(n_samples = Nt, n_features = D)[0].astype(np.float32), test_size=20)

# the column schema: (name:str, dtype:type, create_index: bool)
pq = PQLite(dim=D,
            metric='euclidean',
            n_cells=n_cells,
            n_subvectors=n_subvectors, 
            columns=[('price',float), ('category', str)],
            include_metadata=True)

pq.train(Xtr)

CATEGORIES = ['comics','movies','audiobook']
da = DocumentArray([Document(id=f'{i}', 
                             embedding=Xtr[i], 
                             tags={
                                   'price': np.random.choice((5.,10.,25.,100.)),
                                   'category':np.random.choice(CATEGORIES),
                                 }) for i in range(len(Xtr))])
    
pq.index(da)

2021-12-23 13:46:51.832 | INFO     | pqlite.index:__init__:96 - Initialize VQ codec (K=10)
2021-12-23 13:46:51.833 | INFO     | pqlite.index:__init__:108 - Initialize PQ codec (n_subvectors=32)
2021-12-23 13:46:51.856 | INFO     | pqlite.index:train:153 - Start training VQ codec (K=10) with 1980 data...
2021-12-23 13:46:52.163 | INFO     | pqlite.index:train:160 - Start training PQ codec (n_subvectors=32) with 1980 data...
2021-12-23 13:47:39.608 | INFO     | pqlite.index:train:166 - The pqlite is successfully trained!
2021-12-23 13:47:39.609 | INFO     | pqlite.index:dump_model:404 - Save the trained parameters to data/0a7dfc558abb6bc6cb48db43ccf64964
2021-12-23 13:47:40.008 | DEBUG    | pqlite.container:insert:225 - => 1980 new docs added


In [55]:
query_da = DocumentArray([Document(embedding=Xtr[0], tags={'price':0.23})])

pq.search(query_da, filter={'price': {'$lt': 120}}, limit=5)

dists = [x.scores['euclidean'].value for x in query_da[0].matches]
prices = [x.tags['price'] for x in query_da[0].matches]

print(dists)
print(prices)

[12.333423614501953, 175.82518005371094, 179.87442016601562, 187.24586486816406, 188.56765747070312]
[25.0, 25.0, 25.0, 5.0, 100.0]


Note that if we put a more restrictive filter the top_k distances will be worse (in general) but all values will verify the filter.

In [56]:

pq.search(query_da, filter={'price': {'$lt': 50}}, limit=5)

dists = [round(x.scores['euclidean'].value,2) for x in query_da[0].matches]
prices = [x.tags['price'] for x in query_da[0].matches]

print(dists)
print(prices)

[12.33, 175.83, 179.87, 187.25, 189.3]
[25.0, 25.0, 25.0, 5.0, 5.0]


If no filter is applied...

In [57]:
pq.search(query_da,  limit=10)

dists = [round(x.scores['euclidean'].value,2) for x in query_da[0].matches]
prices = [x.tags['price'] for x in query_da[0].matches]

print(dists)
print(prices)

[12.33, 175.83, 179.87, 187.25, 188.57, 189.3, 191.03, 191.43, 192.73, 193.25]
[25.0, 25.0, 25.0, 5.0, 100.0, 5.0, 25.0, 25.0, 5.0, 5.0]


In [58]:
%timeit pq.search(query_da)

15 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Advanced Resources

## Training in minibatches

Sometimes, data is too large to fit in memory, in such cases pqlite allows to train in minibatches using `.partial_fit`

In [59]:
codebook_full_batch = pq.pq_codec.codebooks
codebook_full_batch.shape

(32, 256, 4)

In [60]:

# the column schema: (name:str, dtype:type, create_index: bool)
pq_minibatch = PQLite(dim=D,
                      metric='euclidean',
                      n_cells=n_cells,
                      n_subvectors=n_subvectors, 
                      columns=[('price',float), ('category', str)],
                      data_path='data_minibatch',
                      logger_flag=False)


Now we would create a custom minibatch generator (probably reading from disk) and use `.partial_train`

In [61]:

def minibatch_generator(Xtr,batch_size):
    num = 0
    pos_begin_batch = 0
    n_examples = len(Xtr)
    
    while True:
        Xtr_batch = Xtr[pos_begin_batch:pos_begin_batch+batch_size]
        yield Xtr_batch
        
        num += len(Xtr_batch)
        pos_begin_batch += batch_size 

        if num + batch_size >= n_examples:
            break

In [62]:
n_epochs = 100
n_examples = Xtr.shape[0]
n_batch = int(n_examples/3)
n_batches = int(n_examples/n_batch)

minibatch_generator_ = minibatch_generator(Xtr, n_batch)

In [63]:
for i in range(n_epochs):
    minibatch_generator_ = minibatch_generator(Xtr, n_batch)
    
    for batch in minibatch_generator_:
        pq_minibatch.partial_train(batch)
    

In [64]:
pq_minibatch.vq_codec.build_codebook()
vq_codebook_minibatch = pq_minibatch.vq_codec.codebook

In [65]:
vq_codebook_minibatch.shape

(10, 128)

Once training is finished codebooks need to be set with `.build_codebook()`

In [66]:
pq_minibatch.build_codebook()

vq_codebook_full_batch = pq.vq_codec.codebook
vq_codebook_minibatch = pq_minibatch.vq_codec.codebook
print(f'vq_codebook_full_batch.shape = {vq_codebook_full_batch.shape}')
print(f'vq_codebook_minibatch.shape = {vq_codebook_minibatch.shape}')

pq_codebook_full_batch = pq.pq_codec.codebooks
pq_codebook_minibatch = pq_minibatch.pq_codec.codebooks
print(f'pq_codebook_full_batch.shape = {pq_codebook_full_batch.shape}')
print(f'pq_codebook_minibatch.shape = {pq_codebook_minibatch.shape}')

vq_codebook_full_batch.shape = (10, 128)
vq_codebook_minibatch.shape = (10, 128)
pq_codebook_full_batch.shape = (32, 256, 4)
pq_codebook_minibatch.shape = (32, 256, 4)


The average SSE error between a full batch version and the minibatch version is 

In [67]:
kmeans = sklearn.cluster.KMeans()
kmeans.fit(Xtr)

abs(kmeans.score(Xtr)/len(Xtr) - pq_minibatch.vq_codec.kmeans.score(Xtr)/len(Xtr))

0.22011521464645512