# DLMR Code
---------------

# Surrogate Text Representations



If you want to use a textual search engine (e.g. Elasticsearch, Apache Lucene, Whoosh) to search vectors, you can use **Surrogate Text Representations** (STR), a method that enables similarity search on top of standard textual search engines.

The key idea of STRs is to transform a real-valued vector into fake text that can be indexed and searched for using standard full-text search engines.

To do so, we need to **transform a real-valued vector $v \in \mathbb{R}^d$ into a vector of non-negative integers $\tilde{v} \in \mathbb{N}^{d'}$** representing *term frequencies* that can be mapped easily to text. This transformation $\tilde{v} = T(v)$ should have the following properties:

1.   $T$ should **preserve the rank** produced by the inner product between vectors; the more it does, the less precision we are sacrificing when scoring documents.

     $ q \cdot x < q \cdot y  \quad \Longrightarrow \quad T(q) \cdot T(x) < T(q) \cdot T(y)$ 

2.   $T$ should create **sparse** vectors; sparse vectors mean few terms per vector and thus shorter posting lists.
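
The first property can be checked empirically. The sketch below is only an illustration: `rank_agreement` is a hypothetical helper (not part of this notebook) that, for one query, counts the fraction of document pairs whose relative order is the same under the original scores and under the transformed scores.

```python
from itertools import combinations

def rank_agreement(original_scores, transformed_scores):
    """Fraction of document pairs ordered the same way by both score lists.

    Hypothetical helper for illustration; pairs tied in the original
    scores are skipped because they impose no order to preserve.
    """
    agree, total = 0, 0
    for i, j in combinations(range(len(original_scores)), 2):
        diff_orig = original_scores[i] - original_scores[j]
        if diff_orig == 0:
            continue
        diff_trans = transformed_scores[i] - transformed_scores[j]
        total += 1
        if diff_orig * diff_trans > 0:  # same sign means same order
            agree += 1
    return agree / total if total else 1.0

# Toy scores of one query against four documents.
original = [0.9, 0.1, 0.5, 0.3]   # stands for q . x
transformed = [40, 2, 10, 25]     # stands for T(q) . T(x)
agreement = rank_agreement(original, transformed)
print(agreement)  # 5 of the 6 pairs keep their order
```

A value of 1.0 would mean the transformation preserves the ranking perfectly for that query; lower values indicate how much ordering is lost.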


We will explore the **permutation of pivots** as the transformation function `T()`.

In [1]:
import h5py
import numpy as np

But first, we'll get some data to index. We will use the GloVe word embedding dataset provided by [ann-benchmarks.com](https://github.com/erikbern/ann-benchmarks/#data-sets). It's a dataset of 1M+ word embeddings that also provides 10k queries for which true nearest neighbors are already computed. Vectors are compared with the cosine similarity.

In [2]:
!wget --no-clobber http://ann-benchmarks.com/glove-100-angular.hdf5

--2022-07-13 08:26:49--  http://ann-benchmarks.com/glove-100-angular.hdf5
Resolving ann-benchmarks.com (ann-benchmarks.com)... 52.216.102.18
Connecting to ann-benchmarks.com (ann-benchmarks.com)|52.216.102.18|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 485413888 (463M) [binary/octet-stream]
Saving to: ‘glove-100-angular.hdf5’


2022-07-13 08:27:01 (41.7 MB/s) - ‘glove-100-angular.hdf5’ saved [485413888/485413888]



The dataset is in HDF5 format, a popular scientific format that can be accessed using the `h5py` library. Let's see what's inside the dataset file.

In [3]:
data = h5py.File('glove-100-angular.hdf5', 'r')  # open in read mode
list(data.items())

[('distances', <HDF5 dataset "distances": shape (10000, 100), type "<f4">),
 ('neighbors', <HDF5 dataset "neighbors": shape (10000, 100), type "<i4">),
 ('test', <HDF5 dataset "test": shape (10000, 100), type "<f4">),
 ('train', <HDF5 dataset "train": shape (1183514, 100), type "<f4">)]

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

n_samples = 10_000
n_queries = 200
k = 100

# data can be accessed as numpy arrays
database = data['train'][:n_samples]
queries = data['test'][:n_queries]

true_scores = cosine_similarity(queries, database)
true_neighbors = true_scores.argsort(axis=1)[:, ::-1][:, :k] 

database[0]

array([-0.11333  ,  0.48402  ,  0.090771 , -0.22439  ,  0.034206 ,
       -0.55831  ,  0.041849 , -0.53573  ,  0.18809  , -0.58722  ,
        0.015313 , -0.014555 ,  0.80842  , -0.038519 ,  0.75348  ,
        0.70502  , -0.17863  ,  0.3222   ,  0.67575  ,  0.67198  ,
        0.26044  ,  0.4187   , -0.34122  ,  0.2286   , -0.53529  ,
        1.2582   , -0.091543 ,  0.19716  , -0.037454 , -0.3336   ,
        0.31399  ,  0.36488  ,  0.71263  ,  0.1307   , -0.24654  ,
       -0.52445  , -0.036091 ,  0.55068  ,  0.10017  ,  0.48095  ,
        0.71104  , -0.053462 ,  0.22325  ,  0.30917  , -0.39926  ,
        0.036634 , -0.35431  , -0.42795  ,  0.46444  ,  0.25586  ,
        0.68257  , -0.20821  ,  0.38433  ,  0.055773 , -0.2539   ,
       -0.20804  ,  0.52522  , -0.11399  , -0.3253   , -0.44104  ,
        0.17528  ,  0.62255  ,  0.50237  , -0.7607   , -0.071786 ,
        0.0080131, -0.13286  ,  0.50097  ,  0.18824  , -0.54722  ,
       -0.42664  ,  0.4292   ,  0.14877  , -0.0072514, -0.1648

Since we know the true nearest neighbors of each query, we measure the quality of the results returned by our indices with **Recall**, that is, the fraction of a query's true neighbors that the system retrieves.

$ \text{Recall} = \dfrac{|\text{True Neighbors} \cap \text{Retrieved Objects}|}{|\text{True Neighbors}|}\,.$

In [5]:
def compute_recall(true_neighbors, predicted_neighbors):
  recalls = []
  for t, p in zip(true_neighbors, predicted_neighbors):
    intersection = np.intersect1d(t, p)
    recall = len(intersection) / len(t)
    recalls.append(recall)

  return np.mean(recalls)
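
As a quick sanity check of the definition, here is a toy computation for a single query using plain Python sets. The name `recall_single` and the IDs are made up for illustration.

```python
def recall_single(true_neighbors, retrieved):
    """|true ∩ retrieved| / |true| for one query (illustrative helper)."""
    true_set = set(true_neighbors)
    return len(true_set & set(retrieved)) / len(true_set)

true_ids = [1, 2, 3, 4]        # hypothetical true neighbors
retrieved_ids = [2, 4, 7, 9]   # hypothetical system output
r = recall_single(true_ids, retrieved_ids)
print(r)  # 0.5: two of the four true neighbors were retrieved
```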

## Pivots Permutation Representation

Let's select some pivots at random from the database:

In [6]:
n_samples = len(database)
n_queries = len(queries)
n_pivots = 1000

pivots = np.random.choice(n_samples, n_pivots, replace=False)  # sample distinct pivots
pivots = database[pivots]
pivots.shape

(1000, 100)

In [7]:
xp_distances = cosine_similarity(database, pivots)
xp_distances

array([[ 0.0730714 ,  0.20748228,  0.02928537, ...,  0.23704147,
         0.35764992,  0.11531361],
       [-0.1317926 ,  0.04130875, -0.13022028, ...,  0.07318358,
        -0.11490967,  0.13735631],
       [-0.07231024,  0.32932368, -0.0081216 , ...,  0.07290298,
         0.27425113,  0.29042706],
       ...,
       [ 0.09051503,  0.4369079 ,  0.11866744, ...,  0.22238837,
         0.14660211,  0.20440254],
       [ 0.17491174,  0.24385123,  0.10760495, ...,  0.09169679,
         0.20435211,  0.36831784],
       [ 0.18078792,  0.1798296 ,  0.06412008, ...,  0.16927849,
         0.17336157,  0.26738298]], dtype=float32)

In [8]:
# permutation: i -> pivot with rank i
xp_permutation = xp_distances.argsort(axis=1)
xp_permutation = xp_permutation[:, ::-1]  # reverse order, higher is better (cosine similarity)
xp_permutation

array([[345, 519, 318, ..., 362,  99, 210],
       [762, 964, 717, ..., 626,  81, 276],
       [926, 869, 258, ...,  73, 210, 807],
       ...,
       [411, 103, 892, ..., 312, 918, 274],
       [641, 177, 385, ..., 741, 807, 395],
       [ 24, 341, 613, ..., 463, 739, 210]])

In [9]:
# inverse permutation: i -> position of pivot i
xp_inv_perm = xp_permutation.argsort(axis=1)
xp_inv_perm

array([[904, 565, 946, ..., 467, 123, 827],
       [973, 591, 971, ..., 457, 961, 225],
       [940,   5, 838, ..., 604,  26,  17],
       ...,
       [757,  12, 667, ..., 308, 560, 362],
       [513, 256, 729, ..., 775, 401,  43],
       [549, 555, 886, ..., 603, 585, 276]])
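
The double `argsort` can be hard to read at first, so here is a toy version with one object and four pivots. It is a sketch in plain Python; the `argsort` helper below is a stand-in written for this example and is assumed to behave like NumPy's on a 1-D list without ties.

```python
def argsort(values):
    """Indices that sort `values` in ascending order (stand-in for np.argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])

similarities = [0.2, 0.9, 0.5, 0.1]  # similarity of one object to 4 pivots

# permutation: rank r -> pivot at rank r (most similar pivot first)
permutation = argsort(similarities)[::-1]
# inverse permutation: pivot i -> rank of pivot i
inverse_permutation = argsort(permutation)

print(permutation)          # [1, 2, 0, 3]
print(inverse_permutation)  # [2, 0, 1, 3]
```

Pivot 1 is the most similar (rank 0), pivot 2 comes next (rank 1), then pivot 0 (rank 2) and pivot 3 (rank 3), which is exactly what the inverse permutation records.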

In [31]:
# term frequency: i -> the importance of pivot i
xp_term_freq = n_pivots - xp_inv_perm
xp_term_freq

array([[ 96, 435,  54, ..., 533, 877, 173],
       [ 27, 409,  29, ..., 543,  39, 775],
       [ 60, 995, 162, ..., 396, 974, 983],
       ...,
       [243, 988, 333, ..., 692, 440, 638],
       [487, 744, 271, ..., 225, 599, 957],
       [451, 445, 114, ..., 397, 415, 724]])

In [11]:
# truncated term frequency: i -> the importance of pivot i
# but only keeping the nearest p pivots
prefix_length = 10
xp_trunc_tf = np.maximum(prefix_length - xp_inv_perm, 0)
xp_trunc_tf

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 5, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
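
To make the two term-frequency variants concrete, the following toy example applies them to a made-up inverse permutation over four pivots. It uses plain Python lists rather than NumPy so it runs on its own.

```python
# Rank of each of 4 pivots for one object (0 = most similar); toy values.
inverse_permutation = [2, 0, 1, 3]
n_pivots = len(inverse_permutation)

# Full term frequency: closer pivots get larger weights.
term_freq = [n_pivots - rank for rank in inverse_permutation]

# Truncated term frequency: only the nearest `prefix_length` pivots survive.
prefix_length = 2
truncated_tf = [max(prefix_length - rank, 0) for rank in inverse_permutation]

print(term_freq)     # [2, 4, 3, 1]
print(truncated_tf)  # [0, 2, 1, 0]
```

Only the two nearest pivots keep a non-zero weight after truncation, which is what makes the resulting vectors sparse.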

In [12]:
qp_distances = cosine_similarity(queries, pivots)  # similarities between queries (not database) and pivots
qp_inv_perm = qp_distances.argsort(axis=1)[:, ::-1].argsort(axis=1)
qp_trunc_tf = np.maximum(prefix_length - qp_inv_perm, 0)

In [13]:
def T(samples, pivots, prefix_length=None):
  """ Transform real-valued vectors into Term Frequencies-like vectors via
      Pivots Permutations. """
  n_pivots = len(pivots)
  prefix_length = prefix_length or n_pivots

  xp_distances = cosine_similarity(samples, pivots)
  xp_permutations = xp_distances.argsort(axis=1)[:, ::-1]
  xp_inv_permutat = xp_permutations.argsort(axis=1)
  xp_truncated_tf = np.maximum(prefix_length - xp_inv_permutat, 0)

  return xp_truncated_tf

In [14]:
tx = T(database, pivots, prefix_length=250)
tq = T(queries, pivots, prefix_length=250)
tx.shape, tq.shape

((10000, 1000), (200, 1000))

Let's measure the **sparsity** of the transformed dataset. The sparsity of a vector is the fraction of zero elements in the vector.
Sparsity is directly related to the number of entries that will be stored in the posting lists of the textual search engine: every non-zero element becomes a posting, so the higher the sparsity, the shorter the posting lists.

In [15]:
sparsity = 1 - (np.count_nonzero(tx) / tx.size)
sparsity

0.75
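
The measured value can be predicted from the parameters. Since each object gets a distinct rank for every pivot, exactly `prefix_length` entries per row are non-zero after truncation, so the sparsity should be $1 - \text{prefix\_length} / \text{n\_pivots}$. The sketch below checks this on a toy vector; the helper names are hypothetical.

```python
def expected_sparsity(n_pivots, prefix_length):
    """Fraction of zeros when exactly `prefix_length` of `n_pivots` entries are kept."""
    return 1 - prefix_length / n_pivots

def measured_sparsity(vector):
    """Fraction of zero entries in a single vector."""
    return sum(1 for v in vector if v == 0) / len(vector)

# Toy check: 4 pivots, keep the 2 nearest.
inverse_permutation = [2, 0, 1, 3]
truncated = [max(2 - rank, 0) for rank in inverse_permutation]
toy = measured_sparsity(truncated)

# Setting used in this notebook: 1000 pivots, prefix length 250.
notebook_setting = expected_sparsity(1000, 250)
print(toy, notebook_setting)  # 0.5 0.75
```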

We can evaluate the loss in retrieval quality introduced by the approximation by computing the inner product between the transformed vectors. This is essentially the operation that a textual search engine like Elasticsearch performs internally.

In [16]:
# compute scores (without an index, this is slow)

# this takes forever
# scores = tq.dot(tx.T)

# we use sparse multiplication (similar to what an index does)
# that is only efficient if sparsity is high enough
from scipy.sparse import csr_matrix
stq = csr_matrix(tq)
stx = csr_matrix(tx)

scores = stq.dot(stx.T).toarray()

# find 100 nearest neighbors
k = 100
sorted_indices = scores.argsort(axis=1)[:, ::-1]  # sort descending per row
topk = sorted_indices[:, :k]  # get **indices** of the top-k database vectors for each row

recall = compute_recall(true_neighbors, topk)
print(recall)

0.52705
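
To see why a sparse inner product resembles what an inverted index does, here is a deliberately simplified model in plain Python. The functions `build_postings` and `score` are hypothetical names for this sketch. Real engines apply their own similarity function (Elasticsearch uses BM25 by default), so actual scores will differ from a raw inner product.

```python
from collections import defaultdict

def build_postings(doc_vectors):
    """Map each term id to a list of (doc_id, term_frequency) with tf > 0."""
    postings = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, tf in enumerate(vec):
            if tf:
                postings[term].append((doc_id, tf))
    return postings

def score(query_vec, postings, n_docs):
    """Accumulate q_tf * d_tf by visiting only the posting lists of query terms."""
    scores = [0] * n_docs
    for term, q_tf in enumerate(query_vec):
        if not q_tf:
            continue  # terms absent from the query are never touched
        for doc_id, d_tf in postings.get(term, []):
            scores[doc_id] += q_tf * d_tf
    return scores

docs = [[0, 2, 1, 0], [3, 0, 0, 1], [0, 1, 0, 2]]  # toy term-frequency vectors
query = [0, 2, 0, 1]

postings = build_postings(docs)
index_scores = score(query, postings, len(docs))
dense_scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
print(index_scores, dense_scores)  # both [4, 1, 4]
```

The posting-list traversal gives the same scores as the dense inner product while only visiting non-zero entries, which is why sparsity matters for query time.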


Once we have transformed our vectors, we can create surrogate text representations by repeating each term as many times as indicated by the transformed vector, i.e., we interpret the transformed vector as term frequencies.

In [17]:
def surrogate_text(term_frequencies):
  """ Creates Surrogate Text by repeating the i-th term
      the number of times indicated by term_frequencies[i].
  """
  tokens = []
  for i, tf in enumerate(term_frequencies):
    tokens += [f'f{i}'] * int(tf)

  text = ' '.join(tokens)
  return text

In [18]:
surrogate_text(tx[0])

'f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f12 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19 f19

If the textual search engine supports boosting, we can generate shorter surrogate texts, which leads to lower query times:

In [19]:
def surrogate_text_boost(term_frequencies):
  """ Creates Surrogate Text with Boosting.
      Instead of repeating a term N times, writes 'term^N'
      that in many full-text search engines has the same effect
      of setting the term frequency of that term to N.
      Useful to get shorter surrogate text representations.
  """
  tokens = []
  for i, tf in enumerate(term_frequencies):
    if tf:
      tokens.append(f'f{i}^{tf:g}')

  text = ' '.join(tokens)
  return text

In [20]:
surrogate_text_boost(tq[0])

'f1^189 f9^45 f20^64 f24^164 f27^124 f29^135 f32^72 f34^195 f39^67 f47^48 f51^151 f53^55 f55^215 f61^44 f62^169 f66^46 f70^137 f72^167 f77^75 f78^144 f82^188 f85^63 f89^2 f91^54 f92^37 f98^201 f102^160 f103^248 f106^192 f108^86 f110^77 f116^56 f118^224 f124^74 f129^104 f131^108 f132^146 f134^120 f135^170 f148^180 f149^177 f156^87 f159^190 f160^140 f163^3 f165^4 f169^203 f174^69 f178^102 f180^19 f184^181 f191^216 f198^11 f200^184 f209^197 f212^174 f214^35 f216^30 f220^109 f223^231 f235^99 f237^139 f247^130 f251^36 f254^110 f258^1 f259^113 f260^228 f264^58 f282^28 f298^128 f304^79 f305^82 f306^234 f311^38 f313^122 f314^162 f316^7 f317^5 f318^194 f319^187 f325^31 f326^142 f327^93 f329^163 f333^105 f334^243 f336^238 f345^205 f348^136 f349^221 f353^159 f356^20 f360^8 f368^13 f369^80 f371^40 f380^157 f385^165 f386^232 f388^66 f389^76 f391^89 f394^246 f399^15 f400^210 f411^208 f414^141 f417^81 f418^171 f421^123 f425^230 f427^23 f428^204 f433^206 f438^145 f445^168 f447^125 f450^106 f454^198 f4
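
The real outputs are long, so a tiny example may help compare the two encodings side by side. The two functions are restated compactly here so the snippet runs on its own; the input vector is a toy value.

```python
def surrogate_text(term_frequencies):
    """Repeat the i-th term term_frequencies[i] times."""
    tokens = []
    for i, tf in enumerate(term_frequencies):
        tokens += [f'f{i}'] * int(tf)
    return ' '.join(tokens)

def surrogate_text_boost(term_frequencies):
    """Write 'term^N' once instead of repeating the term N times."""
    return ' '.join(f'f{i}^{tf:g}' for i, tf in enumerate(term_frequencies) if tf)

toy_tf = [0, 3, 0, 1]  # toy term-frequency vector over 4 pivots
repeated = surrogate_text(toy_tf)
boosted = surrogate_text_boost(toy_tf)
print(repeated)  # f1 f1 f1 f3
print(boosted)   # f1^3 f3^1
```

The boosted form has one token per non-zero pivot regardless of the weights, whereas the repeated form grows with the sum of the term frequencies.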

### Surrogate Text Representations on Elasticsearch

Let's see a working example of using STRs with Elasticsearch.
First, let's download and run an Elasticsearch instance.

In [21]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q --no-clobber
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.0.0
!pip install -q elasticsearch==7.0.0


In [22]:
# start server
import os
from subprocess import Popen, STDOUT, DEVNULL
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'], 
                  stdout=DEVNULL, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# wait a bit for ES to start
!sleep 30

In [23]:
! ps -ef | grep elasticsearch

daemon       139      59 99 08:27 ?        00:00:33 /content/elasticsearch-7.0.0/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-6497323665627477750 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Dio.netty.allocator.type=unpooled -Des.path.home=/content/elasticsearch-7.0.0 -Des.path.conf=/content/elasticsearch-7.0.0/config -Des.distribution.flavor=default 

In [24]:
from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=30)
print(es.ping())

True


Once Elasticsearch is up and running, let's create an index.
We need to specify one field (we name it `repr`) of type `text`, so that it is searched with full-text semantics (TF-IDF-like scoring). We also define a whitespace analyzer, which simply splits tokens on spaces.

In [25]:
index_config = {
    "mappings": {
        "_source": {"enabled": False},  # do not store STR
        "properties": {"repr": {"type": "text"}}  # declare the field 'repr' as FULLTEXT
    },
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
        "analysis": {"analyzer": {"first": {"type": "whitespace"}}},  # tokenize by spaces, we don't need fancier analyzers
        # "similarity": {"inner_product": {"type": "scripted", "script": {"source": "return query.boost * doc.freq;"}}}  # multiply term frequencies only
    }
}

# delete any pre-existing indices
es.indices.delete('simsearch_index', ignore=(400, 404))

# create the index
es.indices.create('simsearch_index', index_config, ignore=400)

{'acknowledged': True, 'index': 'simsearch_index', 'shards_acknowledged': True}
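
Note that the configuration above defines the analyzer `first` under `settings`, but the `repr` field does not reference it, so the field presumably falls back to the default analyzer. Because the surrogate tokens are simple alphanumeric strings, this likely makes no practical difference here. A possible variant, assuming the standard `analyzer` mapping parameter of `text` fields, would attach the analyzer explicitly (the variable name is hypothetical):

```python
# Variant of the index configuration that attaches the custom analyzer
# to the field. Assumption: the `analyzer` mapping parameter refers to
# an analyzer by the name given under settings.analysis.analyzer.
index_config_with_analyzer = {
    "mappings": {
        "_source": {"enabled": False},  # do not store the surrogate text
        "properties": {
            "repr": {"type": "text", "analyzer": "first"}
        },
    },
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
        "analysis": {"analyzer": {"first": {"type": "whitespace"}}},
    },
}

field = index_config_with_analyzer["mappings"]["properties"]["repr"]
print(field["analyzer"])  # first
```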

We use the utilities provided by the `elasticsearch` Python package to bulk index documents.
We define a generator function that creates Elasticsearch indexing commands and use `streaming_bulk` to process them sequentially.

In [26]:
from elasticsearch.helpers import streaming_bulk
from tqdm.auto import tqdm

def generate_docs(index_name, v):
  for i, vi in enumerate(v):
    yield {'_index': index_name, '_id': i, 'repr': surrogate_text(vi)}

docs = generate_docs('simsearch_index', tx)
indexing = streaming_bulk(es, docs, chunk_size=150, max_chunk_bytes=2**26)
indexing = tqdm(indexing, total=n_samples)

for _ in indexing:
  pass

  0%|          | 0/10000 [00:00<?, ?it/s]
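
The loop above discards what `streaming_bulk` yields. Each yielded element is an `(ok, item)` tuple, so the results can also be tallied. The sketch below uses a hypothetical helper `summarize_bulk` and runs on fabricated tuples so it does not need a server. With the real helper, failures raise an exception by default, so observing them this way would require passing `raise_on_error=False`.

```python
def summarize_bulk(results):
    """Count successes and collect failed items from (ok, item) tuples."""
    ok_count, failed = 0, []
    for ok, item in results:
        if ok:
            ok_count += 1
        else:
            failed.append(item)
    return ok_count, failed

# Fabricated results in the shape yielded by streaming_bulk (illustration only).
fake_results = [
    (True, {"index": {"_id": "0", "status": 201}}),
    (False, {"index": {"_id": "1", "status": 400, "error": "example"}}),
    (True, {"index": {"_id": "2", "status": 201}}),
]

ok_count, failed = summarize_bulk(fake_results)
print(ok_count, len(failed))  # 2 1
```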

Let's check how many documents have been inserted.

In [27]:
print(es.count(index='simsearch_index'))

{'count': 10000, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}


Now, we can query the database. We specify we want to search the `repr` field. Since Elasticsearch supports the boosting syntax, we use the shorter surrogate text representation with boosting.

In [28]:
qi = tq[0]  # first query

query = {
  "query": {"query_string": {"default_field": "repr", "query": surrogate_text_boost(qi)}},
  "from": 0, "size": k
}

results = es.search(index='simsearch_index', body=query)
results

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1989',
    '_index': 'simsearch_index',
    '_score': 39733.582,
    '_type': '_doc'},
   {'_id': '3143',
    '_index': 'simsearch_index',
    '_score': 39706.637,
    '_type': '_doc'},
   {'_id': '9995',
    '_index': 'simsearch_index',
    '_score': 39231.664,
    '_type': '_doc'},
   {'_id': '50',
    '_index': 'simsearch_index',
    '_score': 39173.15,
    '_type': '_doc'},
   {'_id': '6392',
    '_index': 'simsearch_index',
    '_score': 38692.97,
    '_type': '_doc'},
   {'_id': '5041',
    '_index': 'simsearch_index',
    '_score': 38680.87,
    '_type': '_doc'},
   {'_id': '7472',
    '_index': 'simsearch_index',
    '_score': 38610.68,
    '_type': '_doc'},
   {'_id': '1418',
    '_index': 'simsearch_index',
    '_score': 38373.816,
    '_type': '_doc'},
   {'_id': '5356',
    '_index': 'simsearch_index',
    '_score': 38337.066,
    '_type': '_doc'},
   {'_id': '4867',
    '_inde

Let's compute the recall for our query.

In [29]:
topk = [int(hit['_id']) for hit in results['hits']['hits']]
topk = np.array([topk])  # make it a matrix with 1 row
topk

array([[1989, 3143, 9995,   50, 6392, 5041, 7472, 1418, 5356, 4867, 2021,
        1961, 6499, 9901, 7785, 5941, 4503,  870,  322, 1226, 6303, 1811,
        5735, 5538, 9085, 4467, 3616, 4663, 4574, 6598, 6590, 5686, 1491,
        3684, 9635, 8673, 8452, 3075, 2131, 6747,   72, 1091, 5457, 5940,
        3309, 3398, 9616,  852, 6221, 1744, 6577, 3971, 8239, 3617, 3059,
        5178, 4535, 6390, 2910, 3085, 9923, 4204, 7499,  909, 5487, 7432,
        6800, 8705,  450, 9057, 6640, 3560, 9348, 4499,  725, 1089,  668,
        6083, 4918, 4377,  260, 9169, 9244, 1556, 7340, 3819, 8360, 2441,
        9467, 9107, 8450,  628,  630, 1058, 5287, 7923, 7589, 8327, 5102,
        8395]])

In [30]:
qi_true_neighbors = true_neighbors[[0]]  # make it a matrix with 1 row
compute_recall(qi_true_neighbors, topk)

0.5
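
The same steps can be wrapped in a function and repeated for every query. The sketch below is an assumption about how one might organize this, not part of the original notebook: `boost_query` and `search_ids` are hypothetical helpers, and `FakeClient` is a stand-in that returns canned hits so the snippet runs without an Elasticsearch server.

```python
def boost_query(term_frequencies):
    """Boosted surrogate text: 'f{i}^{tf}' for every non-zero entry."""
    return ' '.join(f'f{i}^{tf:g}' for i, tf in enumerate(term_frequencies) if tf)

def search_ids(client, index_name, term_frequencies, k):
    """Run one query and return the retrieved document ids as integers."""
    body = {
        "query": {"query_string": {"default_field": "repr",
                                   "query": boost_query(term_frequencies)}},
        "from": 0, "size": k,
    }
    response = client.search(index=index_name, body=body)
    return [int(hit['_id']) for hit in response['hits']['hits']]

class FakeClient:
    """Stand-in for the Elasticsearch client; returns canned hits."""
    def search(self, index=None, body=None):
        hits = [{'_id': '3'}, {'_id': '1'}, {'_id': '7'}]
        return {'hits': {'hits': hits[:body['size']]}}

ids = search_ids(FakeClient(), 'simsearch_index', [0, 2, 1], k=2)
print(ids)  # [3, 1]
```

With the real client, one could call `search_ids(es, 'simsearch_index', qi, k)` for each row `qi` of `tq` and pass the collected lists to `compute_recall` together with `true_neighbors`. Keep in mind that a query may return fewer than `k` hits.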

## Additional Resources

*   https://github.com/erikbern/ann-benchmarks/
*   http://melisandre.deepfeatures.org/
*   https://www.elastic.co/elasticsearch/
