# DLMR Code
---------------

# Surrogate Text Representations



If you want to use a textual search engine (e.g. Elasticsearch, Apache Lucene, Whoosh) to search vectors, you can use **Surrogate Text Representations** (STR), a method that enables similarity search on top of standard textual search engines.

The key idea of STRs is to transform a real-valued vector into fake text that can be indexed and searched for using standard full-text search engines.

To do so, we need to **transform a real-valued vector $v \in \mathbb{R}^d$ into a vector of non-negative integers $\tilde{v} \in \mathbb{N}^{d'}$** representing *term frequencies* that can be mapped easily to text. This transformation $\tilde{v} = T(v)$ should have the following properties:

1.   $T$ should **preserve the rank** produced by the inner product between vectors; the more it does, the less precision we are sacrificing when scoring documents.

     $ q \cdot x < q \cdot y  \quad \Longrightarrow \quad T(q) \cdot T(x) < T(q) \cdot T(y)$ 

2.   $T$ should create **sparse** vectors; sparse vectors mean few terms per vector and thus shorter posting lists.


We will explore the **permutation of pivots** as the transformation function `T()`.
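
As a minimal sketch of the idea, a vector can be described by the order in which it ranks a fixed set of pivots, from most to least similar. The 2-D pivots and the vector below are toy values chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

# toy pivots and vector (illustrative values, not from the dataset)
pivots = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
v = np.array([0.6, 0.8])

# cosine similarity between v and each pivot
sims = pivots @ v / (np.linalg.norm(pivots, axis=1) * np.linalg.norm(v))

# permutation: pivot ids sorted from most to least similar
permutation = np.argsort(-sims)
print(permutation)  # [1 0 2]
```

Two vectors that are close to each other tend to rank the pivots in a similar order, which is what the rest of this section exploits.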

In [1]:
import h5py
import numpy as np

But first, we'll get some data to index. We will use the GloVe word embedding dataset provided by [ann-benchmarks.com](https://github.com/erikbern/ann-benchmarks/#data-sets). It's a dataset of 1M+ word embeddings that also provides 10k queries for which true nearest neighbors are already computed. Vectors are compared with the cosine similarity.

In [2]:
!wget --no-clobber http://ann-benchmarks.com/glove-100-angular.hdf5

--2022-07-12 16:56:19--  http://ann-benchmarks.com/glove-100-angular.hdf5
Resolving ann-benchmarks.com (ann-benchmarks.com)... 52.217.235.69
Connecting to ann-benchmarks.com (ann-benchmarks.com)|52.217.235.69|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 485413888 (463M) [binary/octet-stream]
Saving to: ‘glove-100-angular.hdf5’


2022-07-12 16:56:29 (51.7 MB/s) - ‘glove-100-angular.hdf5’ saved [485413888/485413888]



The dataset is in HDF5 format, a popular scientific format that can be accessed using the `h5py` library. Let's see what's inside the dataset file.

In [3]:
data = h5py.File('glove-100-angular.hdf5', 'r')  # open in read mode
list(data.items())

[('distances', <HDF5 dataset "distances": shape (10000, 100), type "<f4">),
 ('neighbors', <HDF5 dataset "neighbors": shape (10000, 100), type "<i4">),
 ('test', <HDF5 dataset "test": shape (10000, 100), type "<f4">),
 ('train', <HDF5 dataset "train": shape (1183514, 100), type "<f4">)]

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

n_samples = 10_000
n_queries = 200
k = 100

# data can be accessed as numpy arrays
database = data['train'][:n_samples]
queries = data['test'][:n_queries]

true_scores = cosine_similarity(queries, database)
true_neighbors = true_scores.argsort(axis=1)[:, ::-1][:, :k] 

database[0]

array([-0.11333  ,  0.48402  ,  0.090771 , -0.22439  ,  0.034206 ,
       -0.55831  ,  0.041849 , -0.53573  ,  0.18809  , -0.58722  ,
        0.015313 , -0.014555 ,  0.80842  , -0.038519 ,  0.75348  ,
        0.70502  , -0.17863  ,  0.3222   ,  0.67575  ,  0.67198  ,
        0.26044  ,  0.4187   , -0.34122  ,  0.2286   , -0.53529  ,
        1.2582   , -0.091543 ,  0.19716  , -0.037454 , -0.3336   ,
        0.31399  ,  0.36488  ,  0.71263  ,  0.1307   , -0.24654  ,
       -0.52445  , -0.036091 ,  0.55068  ,  0.10017  ,  0.48095  ,
        0.71104  , -0.053462 ,  0.22325  ,  0.30917  , -0.39926  ,
        0.036634 , -0.35431  , -0.42795  ,  0.46444  ,  0.25586  ,
        0.68257  , -0.20821  ,  0.38433  ,  0.055773 , -0.2539   ,
       -0.20804  ,  0.52522  , -0.11399  , -0.3253   , -0.44104  ,
        0.17528  ,  0.62255  ,  0.50237  , -0.7607   , -0.071786 ,
        0.0080131, -0.13286  ,  0.50097  ,  0.18824  , -0.54722  ,
       -0.42664  ,  0.4292   ,  0.14877  , -0.0072514, -0.1648

Since the dataset provides the ground truth, we will measure result quality with **Recall**, that is, the fraction of a query's true neighbors that the system retrieves.

$ \text{Recall} = \dfrac{|\text{True Neighbors} \cap \text{Retrieved Objects}|}{|\text{True Neighbors}|}\,.$

In [5]:
def compute_recall(true_neighbors, predicted_neighbors):
  recalls = []
  for t, p in zip(true_neighbors, predicted_neighbors):
    intersection = np.intersect1d(t, p)
    recall = len(intersection) / len(t)
    recalls.append(recall)

  return np.mean(recalls)
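
To make the metric concrete, here is a small worked example for a single query. The ids are made up, and `recall_single` is a hypothetical helper written with plain Python sets rather than the notebook's `compute_recall`.

```python
def recall_single(true_ids, retrieved_ids):
  """Fraction of true neighbors found among the retrieved ids (one query)."""
  return len(set(true_ids) & set(retrieved_ids)) / len(true_ids)

true_ids = [3, 7, 9, 12]       # hypothetical ground-truth neighbors
retrieved_ids = [7, 12, 5, 1]  # hypothetical system output

r = recall_single(true_ids, retrieved_ids)
print(r)  # 2 of the 4 true neighbors were retrieved -> 0.5
```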

## Pivots Permutation Representation

Let's select some pivots at random from the database:

In [6]:
n_samples = len(database)
n_queries = len(queries)
n_pivots = 1000

pivots = np.random.choice(n_samples, n_pivots, replace=False)  # sample distinct pivots (no repeats)
pivots = database[pivots]
pivots.shape

(1000, 100)

In [7]:
xp_distances = cosine_similarity(database, pivots)
xp_distances

array([[ 0.20021504,  0.13863502,  0.33101022, ...,  0.22211789,
         0.23687509,  0.47158659],
       [ 0.09615555,  0.179611  ,  0.0932782 , ...,  0.13987598,
         0.14759186,  0.02403159],
       [ 0.05402752,  0.04419706,  0.14909385, ...,  0.05323974,
        -0.03399039,  0.07625086],
       ...,
       [ 0.42470384,  0.03864544,  0.13144507, ..., -0.03763069,
         0.1984465 ,  0.3507079 ],
       [ 0.2378237 , -0.01016196,  0.12738702, ...,  0.01752536,
         0.12821895,  0.18311518],
       [ 0.13217287,  0.04220441,  0.3731196 , ...,  0.14416984,
         0.08462623,  0.23728451]], dtype=float32)

In [8]:
# permutation: i -> pivot with rank i
xp_permutation = xp_distances.argsort(axis=1)
xp_permutation = xp_permutation[:, ::-1]  # reverse order, higher is better (cosine similarity)
xp_permutation

array([[567, 638, 455, ..., 473, 654, 330],
       [392, 293, 470, ..., 860, 125, 826],
       [615, 830, 265, ..., 679, 730, 554],
       ...,
       [545, 649,   3, ..., 473, 905, 654],
       [184, 723, 273, ..., 945, 654, 473],
       [595, 695, 661, ...,  99, 473, 972]])

In [9]:
# inverse permutation: i -> position of pivot i
xp_inv_perm = xp_permutation.argsort(axis=1)
xp_inv_perm += 1  # let's make the positions 1-based for clarity
xp_inv_perm

array([[581, 760, 193, ..., 501, 465,  21],
       [365, 104, 374, ..., 204, 178, 650],
       [688, 719, 293, ..., 691, 905, 607],
       ...,
       [  6, 868, 613, ..., 958, 393,  30],
       [252, 939, 632, ..., 903, 626, 444],
       [729, 912,  66, ..., 703, 839, 366]])
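
The "argsort of an argsort" trick used above can be checked on a toy row. The similarity values below are illustrative only.

```python
import numpy as np

sims = np.array([[0.1, 0.9, 0.5, 0.3]])  # toy similarities of one vector to 4 pivots

perm = sims.argsort(axis=1)[:, ::-1]     # pivot ids, most similar first
inv_perm = perm.argsort(axis=1) + 1      # 1-based position of each pivot

print(perm)      # [[1 2 3 0]]
print(inv_perm)  # [[4 1 2 3]]
```

Pivot 1 has the highest similarity, so it sits at position 1; pivot 0 has the lowest, so it sits at position 4.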

In [10]:
qp_distances = cosine_similarity(queries, pivots)  # similarities between queries and pivots
qp_inv_perm = qp_distances.argsort(axis=1)[:, ::-1].argsort(axis=1) + 1

In [11]:
def T(samples, pivots, prefix_length=None):
  """ Transform real-valued vectors into Term Frequencies-like vectors via
      Pivots Permutations. """
  n_pivots = len(pivots)
  prefix_length = prefix_length or n_pivots

  xp_distances = cosine_similarity(samples, pivots)
  xp_permutations = xp_distances.argsort(axis=1)[:, ::-1]
  xp_inv_permutat = xp_permutations.argsort(axis=1) + 1
  xp_truncated_ip = np.maximum(xp_inv_permutat - n_pivots + prefix_length , 0)

  return xp_truncated_ip
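
To see what the truncation expression in `T` does, it can be applied to a toy list of positions. This sketch only replays that one line with small, made-up values for `n_pivots` and `prefix_length`.

```python
import numpy as np

n_pivots, prefix_length = 5, 2
positions = np.array([1, 2, 3, 4, 5])  # 1-based positions from an inverse permutation

# same expression as in T(): shift positions down and clip negatives to zero
truncated = np.maximum(positions - n_pivots + prefix_length, 0)
print(truncated)  # [0 0 0 1 2]
```

With this expression, only the `prefix_length` entries with the largest positions stay non-zero, so every transformed vector has exactly `prefix_length` non-zero components.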

In [12]:
tx = T(database, pivots, prefix_length=250)
tq = T(queries, pivots, prefix_length=250)
tx.shape, tq.shape

((10000, 1000), (200, 1000))

Let's measure the **sparsity** of the transformed dataset. The sparsity of a vector is the fraction of its elements that are zero.
Sparsity is directly related to the number of entries that will be stored in the posting lists of the textual search engine: the higher the sparsity, the fewer the entries, so higher is better.

In [13]:
sparsity = 1 - (np.count_nonzero(tx) / tx.size)
sparsity

0.75
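
This value matches a back-of-the-envelope estimate: since each row produced by `T` keeps exactly `prefix_length` non-zero entries out of `n_pivots`, the sparsity should be `1 - prefix_length / n_pivots`. A quick check with the values used above:

```python
n_pivots = 1000
prefix_length = 250

# fraction of zero entries when exactly prefix_length entries per row are non-zero
expected_sparsity = 1 - prefix_length / n_pivots
print(expected_sparsity)  # 0.75
```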

We can evaluate the performance loss introduced by the approximation by computing the inner product between the transformed vectors. This is essentially the operation that a textual search engine like Elasticsearch performs internally.

In [14]:
# compute scores (without an index, this is slow)

# this takes forever
# scores = tq.dot(tx.T)

# we use sparse multiplication (similar to what an index does)
# that is only efficient if sparsity is high enough
from scipy.sparse import csr_matrix
stq = csr_matrix(tq)
stx = csr_matrix(tx)

scores = stq.dot(stx.T).toarray()

# find 100 nearest neighbors
k = 100
sorted_indices = scores.argsort(axis=1)[:, ::-1]  # sort descending per row
topk = sorted_indices[:, :k]  # get **indices** of the top-k database vectors for each row

recall = compute_recall(true_neighbors, topk)
print(recall)

0.51505


Once we have transformed our vectors, we can create surrogate text representations by repeating each term as many times as indicated by the transformed vector, i.e., we interpret the transformed vector as term frequencies.

In [15]:
def surrogate_text(term_frequencies):
  """ Creates Surrogate Text by repeating the i-th term
      as many times as indicated by term_frequencies[i].
  """
  tokens = []
  for i, tf in enumerate(term_frequencies):
    tokens += [f'f{i}'] * int(tf)

  text = ' '.join(tokens)
  return text

In [16]:
surrogate_text(tx[0])

'f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f18 f18 f25 f25 f25 f25 f25 f25 f25 f25 f25 f25 f25 f2
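
As a sanity check on the encoding, splitting a surrogate text on whitespace and counting tokens recovers the term frequencies. The text below is a toy example, not actual notebook output.

```python
from collections import Counter

text = 'f0 f0 f0 f2 f3 f3'  # toy surrogate text (illustrative)

# a whitespace tokenizer followed by counting recovers the term frequencies
term_frequencies = Counter(text.split())
print(term_frequencies['f0'], term_frequencies['f2'], term_frequencies['f3'])  # 3 1 2
```

This is roughly what a whitespace analyzer plus an inverted index does when the text is indexed.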

If the textual search engine supports boosting, we can generate shorter surrogate texts, which leads to lower query times:

In [17]:
def surrogate_text_boost(term_frequencies):
  """ Creates Surrogate Text with Boosting.
      Instead of repeating a term N times, writes 'term^N'
      that in many full-text search engines has the same effect
      of setting the term frequency of that term to N.
      Useful to get shorter surrogate text representations.
  """
  tokens = []
  for i, tf in enumerate(term_frequencies):
    if tf:
      tokens.append(f'f{i}^{tf:g}')

  text = ' '.join(tokens)
  return text

In [18]:
surrogate_text_boost(tq[0])

'f7^130 f9^215 f24^23 f25^247 f29^98 f31^38 f33^182 f34^108 f38^183 f45^141 f51^163 f52^88 f56^147 f57^203 f58^34 f64^12 f71^33 f74^13 f86^139 f91^217 f93^81 f94^56 f95^126 f98^233 f99^211 f101^218 f108^236 f115^68 f122^177 f129^168 f135^166 f141^82 f153^109 f154^14 f155^84 f157^110 f162^164 f165^129 f166^90 f172^49 f177^221 f185^250 f195^66 f196^189 f200^11 f207^148 f215^167 f217^29 f218^17 f220^43 f224^22 f229^243 f235^131 f241^195 f245^190 f246^142 f252^206 f254^105 f255^36 f257^128 f260^62 f262^99 f266^50 f281^175 f284^104 f286^27 f288^19 f292^18 f300^213 f301^140 f306^87 f313^127 f316^227 f327^54 f330^239 f332^113 f336^1 f338^46 f341^103 f343^16 f344^107 f345^96 f356^120 f357^24 f359^235 f361^3 f364^44 f367^65 f368^208 f372^75 f374^162 f377^21 f378^209 f380^125 f383^72 f384^6 f388^26 f396^229 f398^173 f400^53 f404^101 f405^201 f414^55 f416^205 f424^63 f425^154 f437^161 f440^42 f457^144 f459^86 f465^5 f472^94 f473^246 f478^93 f481^91 f482^32 f483^185 f485^123 f490^79 f491^200 f501^
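
The following sketch illustrates why a boosted query can stand in for repeated terms under an idealized scoring that just multiplies query boost by document term frequency (the kind of scoring hinted at by the commented-out `similarity` setting in the index configuration later on). `parse_boosted` is a hypothetical helper and the texts are toy values; Elasticsearch's default scoring is BM25-based, so its actual scores will differ.

```python
from collections import Counter

def parse_boosted(query_text):
  """Parse 'term^boost' tokens into a {term: boost} dict."""
  boosts = {}
  for token in query_text.split():
    term, boost = token.split('^')
    boosts[term] = float(boost)
  return boosts

query_text = 'f0^2 f2^3'  # toy boosted query, i.e. T(q) = [2, 0, 3]
doc_text = 'f0 f0 f2'     # toy document,      i.e. T(x) = [2, 0, 1]

boosts = parse_boosted(query_text)
doc_tf = Counter(doc_text.split())

# sum over query terms of boost * term frequency equals T(q) . T(x)
score = sum(boost * doc_tf[term] for term, boost in boosts.items())
print(score)  # 2*2 + 3*1 = 7.0
```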

### Surrogate Text Representations on Elasticsearch

Let's see a working example of using STRs with Elasticsearch.
First, let's download and run an Elasticsearch instance.

In [19]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q --no-clobber
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.0.0
!pip install -q elasticsearch==7.0.0

In [20]:
# start server
import os
from subprocess import Popen, STDOUT, DEVNULL
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'], 
                  stdout=DEVNULL, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# wait a bit for ES to start
!sleep 30

In [21]:
! ps -ef | grep elasticsearch

daemon       137      59 99 16:56 ?        00:00:33 /content/elasticsearch-7.0.0/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-2618654414886008709 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Dio.netty.allocator.type=unpooled -Des.path.home=/content/elasticsearch-7.0.0 -Des.path.conf=/content/elasticsearch-7.0.0/config -Des.distribution.flavor=default 

In [22]:
from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=30)
print(es.ping())

True
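
The fixed `sleep 30` used above is simple, but it can be too short on a slow machine or needlessly long on a fast one. A possible alternative is to poll until the server answers. `wait_until` below is a hypothetical helper, not part of the Elasticsearch client.

```python
import time

def wait_until(predicate, timeout=60.0, interval=1.0):
  """Poll `predicate` until it returns True or `timeout` seconds elapse."""
  deadline = time.monotonic() + timeout
  while time.monotonic() < deadline:
    try:
      if predicate():
        return True
    except Exception:
      pass  # service not reachable yet; retry
    time.sleep(interval)
  return False

# hypothetical usage with the client created above:
# wait_until(es.ping, timeout=60)
```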


Once Elasticsearch is up and running, let's create an index.
We need to specify one field (we name it `repr`) of type `text`, so that it is searched with full-text semantics (TF-IDF-like scoring). We also configure a whitespace analyzer, which simply splits tokens on spaces.

In [23]:
index_config = {
    "mappings": {
        "_source": {"enabled": False},  # do not store STR
        "properties": {"repr": {"type": "text", "analyzer": "first"}}  # declare the field 'repr' as full-text, tokenized by the whitespace analyzer defined below
    },
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
        "analysis": {"analyzer": {"first": {"type": "whitespace"}}},  # tokenize by spaces, we don't need fancier analyzers
        # "similarity": {"inner_product": {"type": "scripted", "script": {"source": "return query.boost * doc.freq;"}}}  # multiply term frequencies only
    }
}

# delete any pre-existing indices
es.indices.delete('simsearch_index', ignore=(400, 404))

# create the index
es.indices.create('simsearch_index', index_config, ignore=400)

{'acknowledged': True, 'index': 'simsearch_index', 'shards_acknowledged': True}

We use the utilities provided by the Elasticsearch Python package to bulk-index documents.
We define a generator that creates Elasticsearch indexing commands and use `streaming_bulk` to process them sequentially.

In [24]:
from elasticsearch.helpers import streaming_bulk
from tqdm.auto import tqdm

def generate_docs(index_name, v):
  for i, vi in enumerate(v):
    yield {'_index': index_name, '_id': i, 'repr': surrogate_text(vi)}

docs = generate_docs('simsearch_index', tx)
indexing = streaming_bulk(es, docs, chunk_size=150, max_chunk_bytes=2**26)
indexing = tqdm(indexing, total=n_samples)

for _ in indexing:
  pass


Let's check how many documents have been inserted.

In [25]:
print(es.count(index='simsearch_index'))

{'count': 10000, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
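
Besides counting documents after the fact, the results of the bulk operation can be inspected directly: `streaming_bulk` yields one `(ok, info)` tuple per document. By default it raises on the first failed document, so collecting failures this way assumes it is called with `raise_on_error=False`. `summarize_bulk` is a hypothetical helper, demonstrated here on made-up tuples rather than on a live cluster.

```python
def summarize_bulk(results):
  """Count successes and collect failures from an iterable of (ok, info) tuples."""
  n_ok, failures = 0, []
  for ok, info in results:
    if ok:
      n_ok += 1
    else:
      failures.append(info)
  return n_ok, failures

# stand-in for the tuples yielded by streaming_bulk (made-up values)
fake_results = [
  (True, {'index': {'_id': '0'}}),
  (False, {'index': {'_id': '1', 'error': 'example'}}),
]
n_ok, failures = summarize_bulk(fake_results)
print(n_ok, len(failures))  # 1 1
```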


Now we can query the database. We specify that the search targets the `repr` field. Since Elasticsearch supports the boosting syntax, we use the shorter surrogate text representation with boosting.

In [26]:
qi = tq[0]  # first query

query = {
  "query": {"query_string": {"default_field": "repr", "query": surrogate_text_boost(qi)}},
  "from": 0, "size": k
}

results = es.search(index='simsearch_index', body=query)
results

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '5940',
    '_index': 'simsearch_index',
    '_score': 38927.918,
    '_type': '_doc'},
   {'_id': '4241',
    '_index': 'simsearch_index',
    '_score': 36631.074,
    '_type': '_doc'},
   {'_id': '4503',
    '_index': 'simsearch_index',
    '_score': 36261.16,
    '_type': '_doc'},
   {'_id': '502',
    '_index': 'simsearch_index',
    '_score': 35741.13,
    '_type': '_doc'},
   {'_id': '7680',
    '_index': 'simsearch_index',
    '_score': 35453.28,
    '_type': '_doc'},
   {'_id': '870',
    '_index': 'simsearch_index',
    '_score': 35335.723,
    '_type': '_doc'},
   {'_id': '6303',
    '_index': 'simsearch_index',
    '_score': 35250.234,
    '_type': '_doc'},
   {'_id': '3616',
    '_index': 'simsearch_index',
    '_score': 35045.746,
    '_type': '_doc'},
   {'_id': '7576',
    '_index': 'simsearch_index',
    '_score': 35016.098,
    '_type': '_doc'},
   {'_id': '8262',
    '_ind

Let's compute the recall for our query.

In [27]:
topk = [int(hit['_id']) for hit in results['hits']['hits']]
topk = np.array([topk])  # make it a matrix with 1 row
topk

array([[5940, 4241, 4503,  502, 7680,  870, 6303, 3616, 7576, 8262, 5491,
        3684, 3438, 8679, 2518, 3971, 4467,  852, 9264, 6577, 6083, 1961,
        1058, 8828, 1037,  983, 4162, 5200, 3398, 4575, 4571, 4215, 1519,
        4663, 4867, 3429, 8868, 7472, 9085, 2088, 2310, 6706, 5187,  395,
        4304,  450, 2798, 4574, 3882, 1811, 5432, 6885, 5213, 3702, 5102,
        1744, 5821, 7364, 1226, 8715, 5941,  800, 3264, 2819,  740, 6799,
         988, 9150, 3819, 5645, 3143, 7814, 6656, 1446, 1514,  685, 8335,
        7420, 4850, 7541,  529, 9169, 9205, 3117, 8239, 5735, 9628, 1233,
        2422, 3059, 1495, 8686, 8158, 2961, 5890, 5605, 2165, 1551, 5977,
        1989]])

In [28]:
qi_true_neighbors = true_neighbors[[0]]  # make it a matrix with 1 row
compute_recall(qi_true_neighbors, topk)

0.41
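
The same steps can be wrapped in a function to evaluate every query rather than just the first one. This is a sketch: `boosted_query` and `search_topk` are hypothetical helpers that mirror the query body used above, and `es` can be any object exposing a compatible `search(index=..., body=...)` method.

```python
def boosted_query(term_frequencies):
  """Same 'term^N' encoding used for queries in this notebook."""
  return ' '.join(f'f{i}^{tf:g}' for i, tf in enumerate(term_frequencies) if tf)

def search_topk(es, index_name, term_frequencies, k):
  """Run one boosted query and return the ids of the top-k hits."""
  body = {
    "query": {"query_string": {"default_field": "repr",
                               "query": boosted_query(term_frequencies)}},
    "from": 0, "size": k,
  }
  hits = es.search(index=index_name, body=body)['hits']['hits']
  return [int(hit['_id']) for hit in hits]

# hypothetical usage with the objects defined in this notebook:
# all_topk = [search_topk(es, 'simsearch_index', qi, k) for qi in tq]
# compute_recall(true_neighbors, all_topk)
```

Averaging over all queries gives a more reliable estimate than the single-query recall computed above.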

## Additional Resources

*   https://github.com/erikbern/ann-benchmarks/
*   http://melisandre.deepfeatures.org/
*   https://www.elastic.co/elasticsearch/
