Embedding Statistics
=================

This code empirically determines the [domain](https://en.wikipedia.org/wiki/Domain_of_a_function) of the enrolled vectors, that is, the range of values their components actually take, and reports other aggregate statistics. These are useful for understanding the overall properties of the embeddings.
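
To make "empirically determining the domain" concrete, here is a minimal sketch that computes the same per-vector minimum, maximum, and sum on synthetic data. The function name `summarize_embeddings`, the dimension of 128, and the unit-length normalization are assumptions chosen for illustration; none of them come from `binah` or from the enrolled data.

```python
import numpy as np


def summarize_embeddings(vectors):
    """Return per-vector min, max, and sum, each sorted ascending."""
    vectors = np.asarray(vectors, dtype=np.float32)
    mins = np.sort(vectors.min(axis=1))
    maxes = np.sort(vectors.max(axis=1))
    sums = np.sort(vectors.sum(axis=1))
    return mins, maxes, sums


# Synthetic stand-in for the enrolled vectors (assumed shape and scaling):
# 1000 vectors of dimension 128, each normalized to unit length.
rng = np.random.default_rng(0)
fake = rng.normal(size=(1000, 128)).astype(np.float32)
fake /= np.linalg.norm(fake, axis=1, keepdims=True)

mins, maxes, sums = summarize_embeddings(fake)

# The empirical domain is bounded by the smallest min and the largest max.
print("Empirical domain: [%f, %f]" % (mins[0], maxes[-1]))
print("Mean of maxes:", maxes.mean())
print("Mean of mins:", mins.mean())
```

Working on a single 2-D array lets NumPy reduce along `axis=1` in one call per statistic. That is only practical when all vectors fit in memory at once, whereas the notebook streams them from a query one row at a time.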

In [1]:
from binah.model import Vector, get_config_single_host
# tqdm_notebook is deprecated in recent tqdm releases; tqdm.notebook.tqdm is its replacement.
from tqdm.notebook import tqdm
import numpy as np


config = get_config_single_host()
# Select only the vector column; each result row is a 1-tuple.
query = config.query(Vector.vec)
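
The loop in the next cell decodes each stored vector with `np.frombuffer(vec, dtype=np.float32)`, which suggests the column holds the raw bytes of a float32 array. The sketch below shows that round trip in isolation. The sample values are made up, and the storage layout (contiguous float32 in native byte order) is an assumption inferred from that call, not something confirmed about the `binah` schema.

```python
import numpy as np

# Hypothetical vector standing in for one enrolled embedding.
original = np.array([0.25, -0.5, 0.125], dtype=np.float32)

# Serialize to raw bytes: 3 values x 4 bytes each.
blob = original.tobytes()

# Reinterpret the bytes as float32 without copying the data.
decoded = np.frombuffer(blob, dtype=np.float32)

print(len(blob))
print(decoded)
```

Because `np.frombuffer` returns a view over an immutable `bytes` object, the resulting array is read-only. That is fine for reductions such as `np.min`, `np.max`, and `np.sum`, but in-place modification would need a `.copy()` first.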

In [2]:
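mins = []
maxes = []
sums = []
# Each row is a 1-tuple holding one vector as a raw byte buffer.
for (vec,) in tqdm(query, total=query.count(), unit=' vecs',
                   desc='Vectors'):
    # Decode the byte buffer into a float32 array.
    vec = np.frombuffer(vec, dtype=np.float32)
    maxes.append(np.max(vec))
    mins.append(np.min(vec))
    sums.append(np.sum(vec))
# Sort ascending so the extremes sit at the ends of each array.
mins = np.sort(mins)
maxes = np.sort(maxes)
sums = np.sort(sums)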

print("Embedding Statistics")
print("====================")
print('Top Maxes: ' + str(maxes[-20:]))
print('Top Mins: ' + str(mins[0:20]))
print('Mean of Maxes: ' + str(np.mean(maxes)))
print('Mean of Mins: ' + str(np.mean(mins)))
print('Sums (largest): ' + str(sums[-20:]))
print('Sums (smallest): ' + str(sums[0:20]))
print('Mean of Sums: ' + str(np.mean(sums)))

[Progress bar: Vectors, 616767 total]


Embedding Statistics
Top Maxes: [0.14407796 0.14431612 0.14437321 0.14445287 0.1445786  0.14481735
 0.14497212 0.14619437 0.14660802 0.14782998 0.14818104 0.14863257
 0.1489033  0.15048969 0.1515979  0.15184359 0.15277444 0.15513025
 0.16090414 0.16335225]
Top Mins: [-0.17133677 -0.1655125  -0.1655125  -0.1655125  -0.16292623 -0.15957768
 -0.15745819 -0.1573801  -0.1562428  -0.1562428  -0.15618019 -0.15579247
 -0.15521643 -0.15374443 -0.15366852 -0.15309294 -0.1529138  -0.15218389
 -0.1517213  -0.15163776]
Mean of Maxes: 0.0946391
Mean of Mins: -0.094648376
Sums (largest): [3.916528  3.917465  3.9222074 3.9232428 3.9300818 3.9364767 3.9439158
 3.950307  3.9511452 3.9673526 3.979783  4.0115128 4.0670533 4.091491
 4.091509  4.0947056 4.1919756 4.230557  4.378519  4.456129 ]
Sums (smallest): [-2.3035164 -2.2830157 -2.2495246 -2.0803638 -2.0059276 -2.0014672
 -1.993067  -1.9765998 -1.9698577 -1.9527419 -1.9513801 -1.9367818
 -1.9156747 -1.9022464 -1.9005005 -1.8995728 -1.8970131 -1.879287