# DLMR Code
---------------

# Surrogate Text Representations



If you want to use a textual search engine (e.g. Elasticsearch, Apache Lucene, Whoosh) to search vectors, you can use **Surrogate Text Representations** (STR), a method that enables similarity search on top of standard textual search engines.

The key idea of STRs is to transform a real-valued vector into fake text that can be indexed and searched for using standard full-text search engines.

To do so, we need to **transform a real-valued vector $v \in \mathbb{R}^d$ into a vector of non-negative integers $\tilde{v} \in \mathbb{N}^{d'}$** representing *term frequencies* that can be mapped easily to text. This transformation $\tilde{v} = T(v)$ should have the following properties:

1.   $T$ should **preserve the rank** produced by the inner product between vectors; the more it does, the less precision we are sacrificing when scoring documents.

     $ q \cdot x < q \cdot y  \quad \Longrightarrow \quad T(q) \cdot T(x) < T(q) \cdot T(y)$ 

2.   $T$ should create **sparse** vectors; sparse vectors mean few terms per vector and thus shorter posting lists.


We will explore the **permutation of pivots** as the transformation function `T()`.

In [None]:
import h5py
import numpy as np

But first, we'll get some data to index. We will use the GloVe word embedding dataset provided by [ann-benchmarks.com](https://github.com/erikbern/ann-benchmarks/#data-sets). It's a dataset of 1M+ word embeddings that also provides 10k queries for which true nearest neighbors are already computed. Vectors are compared with the cosine similarity.

In [None]:
!wget --no-clobber http://ann-benchmarks.com/glove-100-angular.hdf5

--2021-06-04 13:52:53--  http://ann-benchmarks.com/glove-100-angular.hdf5
Resolving ann-benchmarks.com (ann-benchmarks.com)... 52.216.110.218
Connecting to ann-benchmarks.com (ann-benchmarks.com)|52.216.110.218|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 485413888 (463M) [binary/octet-stream]
Saving to: ‘glove-100-angular.hdf5’


2021-06-04 13:53:00 (63.0 MB/s) - ‘glove-100-angular.hdf5’ saved [485413888/485413888]

time: 7.68 s (started: 2021-06-04 13:52:53 +00:00)


The dataset is in HDF5 format, a popular scientific format that can be accessed using the `h5py` library. Let's see what's inside the dataset file.

In [None]:
data = h5py.File('glove-100-angular.hdf5', 'r')  # open in read mode
list(data.items())

[('distances', <HDF5 dataset "distances": shape (10000, 100), type "<f4">),
 ('neighbors', <HDF5 dataset "neighbors": shape (10000, 100), type "<i4">),
 ('test', <HDF5 dataset "test": shape (10000, 100), type "<f4">),
 ('train', <HDF5 dataset "train": shape (1183514, 100), type "<f4">)]

time: 25.3 ms (started: 2021-06-04 13:53:00 +00:00)


In [None]:
# data can be accessed as numpy arrays
database = data['train'][:]
queries = data['test'][:]
true_neighbors = data['neighbors'][:]

database[0]

array([-0.11333  ,  0.48402  ,  0.090771 , -0.22439  ,  0.034206 ,
       -0.55831  ,  0.041849 , -0.53573  ,  0.18809  , -0.58722  ,
        0.015313 , -0.014555 ,  0.80842  , -0.038519 ,  0.75348  ,
        0.70502  , -0.17863  ,  0.3222   ,  0.67575  ,  0.67198  ,
        0.26044  ,  0.4187   , -0.34122  ,  0.2286   , -0.53529  ,
        1.2582   , -0.091543 ,  0.19716  , -0.037454 , -0.3336   ,
        0.31399  ,  0.36488  ,  0.71263  ,  0.1307   , -0.24654  ,
       -0.52445  , -0.036091 ,  0.55068  ,  0.10017  ,  0.48095  ,
        0.71104  , -0.053462 ,  0.22325  ,  0.30917  , -0.39926  ,
        0.036634 , -0.35431  , -0.42795  ,  0.46444  ,  0.25586  ,
        0.68257  , -0.20821  ,  0.38433  ,  0.055773 , -0.2539   ,
       -0.20804  ,  0.52522  , -0.11399  , -0.3253   , -0.44104  ,
        0.17528  ,  0.62255  ,  0.50237  , -0.7607   , -0.071786 ,
        0.0080131, -0.13286  ,  0.50097  ,  0.18824  , -0.54722  ,
       -0.42664  ,  0.4292   ,  0.14877  , -0.0072514, -0.1648

time: 12.6 ms (started: 2021-06-04 13:53:00 +00:00)


Since the dataset provides the ground truth, we will measure the quality of the results returned by our indices with **Recall**, that is, the fraction of the true neighbors of a query that the system retrieves.

$ \text{Recall} = \dfrac{|\text{True Neighbors} \cap \text{Retrieved Objects}|}{|\text{True Neighbors}|}\,.$

In [None]:
def compute_recall(true_neighbors, predicted_neighbors):
  recalls = []
  for t, p in zip(true_neighbors, predicted_neighbors):
    intersection = np.intersect1d(t, p)
    recall = len(intersection) / len(t)
    recalls.append(recall)

  return np.mean(recalls)

time: 4.97 ms (started: 2021-06-04 13:53:00 +00:00)
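As a quick sanity check of the Recall definition, here is a minimal sketch on toy data. The helper name `recall_per_query` and the toy id lists are assumptions made for this illustration; they are not part of the dataset or of the notebook's code.

```python
import numpy as np

def recall_per_query(true_ids, retrieved_ids):
  """Fraction of the true neighbors that appear among the retrieved ids."""
  true_set = set(true_ids)
  return len(true_set & set(retrieved_ids)) / len(true_set)

# toy ground truth and toy results for two queries (hypothetical ids)
true_neighbors_toy = [[1, 2, 3, 4], [5, 6, 7, 8]]
retrieved_toy = [[1, 2, 9, 4], [8, 0, 0, 5]]

recalls = [recall_per_query(t, p) for t, p in zip(true_neighbors_toy, retrieved_toy)]
mean_recall = float(np.mean(recalls))
print(recalls, mean_recall)
```

The first query retrieves three of its four true neighbors and the second retrieves two of four, so the per-query recalls are 0.75 and 0.5.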


## Pivots Permutation Representation

Let's select some pivots at random from the database:

In [None]:
n_pivots = 1000

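# np.random.choice only samples from 1-D input, so we sample
# distinct row indices and then gather the corresponding vectors
pivot_ids = np.random.choice(len(database), n_pivots, replace=False)
pivots = database[pivot_ids]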

In [None]:
from sklearn.metrics import pairwise_distances

xp_distances = pairwise_distances(database, pivots, metric='cosine')
xp_distances

In [None]:
# permutation: i -> pivot with rank i
xp_permutation = xp_distances.argsort(axis=1)
xp_permutation

In [None]:
# inverse permutation: i -> position of pivot i
xp_inv_perm = xp_permutation.argsort(axis=1)
xp_inv_perm
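To make the two `argsort` calls concrete, here is a small sketch with a single sample and four pivots. The distance values are made up for illustration.

```python
import numpy as np

# hypothetical distances from one sample to four pivots
distances = np.array([[0.9, 0.1, 0.5, 0.3]])

# permutation: entry i is the index of the pivot with rank i (closest first)
permutation = distances.argsort(axis=1)

# inverse permutation: entry j is the rank (position) of pivot j
inverse = permutation.argsort(axis=1)

print(permutation)  # pivot 1 is the closest, pivot 0 the farthest
print(inverse)      # pivot 0 sits at position 3, pivot 1 at position 0
```

Applying `argsort` twice turns distances into ranks, which is why the second call yields the position of each pivot.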

In [None]:
qp_distances = pairwise_distances(queries, pivots, metric='cosine')
qp_inv_perm = qp_distances.argsort(axis=1).argsort(axis=1)

In [None]:
def T(samples, pivots):
  """ Transform real-valued vectors into Term Frequencies-like vectors via permutations """
  xp_distances = pairwise_distances(samples, pivots, metric='cosine')

  xp_permutations = xp_distances.argsort(axis=1)
  xp_inv_permutat = xp_permutations.argsort(axis=1)
  return xp_inv_permutat

time: 8.05 ms (started: 2021-05-31 18:07:17 +00:00)


In [None]:
tx = T(database, pivots)
tq = T(queries, pivots)

time: 3.76 s (started: 2021-05-31 18:07:17 +00:00)


Let's measure the **sparsity** of the transformed dataset. The sparsity of a vector is the fraction of its elements that are zero.
Sparsity is directly related to the number of entries stored in the posting lists of the textual search engine: each non-zero element becomes a posting, so the higher the sparsity, the shorter the posting lists.

In [None]:
sparsity = 1 - (np.count_nonzero(tx) / tx.size)
sparsity

0.508522514309083

time: 568 ms (started: 2021-06-01 09:24:55 +00:00)
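A full inverse permutation contains exactly one zero per vector (the closest pivot). One common way to obtain sparser term frequencies from permutations is to keep only the closest pivots and to give higher frequencies to closer ones. The sketch below illustrates that idea under stated assumptions: the helper `truncate_inverse_permutation` and the parameter `k_closest` are hypothetical names, and this is not necessarily the transformation used to produce the numbers reported in this notebook.

```python
import numpy as np

def truncate_inverse_permutation(inv_perm, k_closest):
  """Map pivot positions to term frequencies, keeping only the k closest pivots.

  A pivot at position p (0 = closest) gets frequency k_closest - p
  if p < k_closest, and 0 otherwise.
  """
  inv_perm = np.asarray(inv_perm)
  return np.maximum(k_closest - inv_perm, 0)

# toy inverse permutations for two samples and four pivots
inv_perm_toy = np.array([[3, 0, 2, 1],
                         [0, 1, 2, 3]])

tf_toy = truncate_inverse_permutation(inv_perm_toy, k_closest=2)
sparsity_toy = 1 - np.count_nonzero(tf_toy) / tf_toy.size
print(tf_toy, sparsity_toy)
```

With `k_closest=2` out of four pivots, half of the entries become zero, so the toy sparsity is 0.5. A smaller `k_closest` gives sparser vectors at the cost of a coarser approximation.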


We can evaluate the quality loss introduced by the approximation by computing the inner product between the transformed vectors. This is essentially the operation that a textual search engine like Elasticsearch performs internally when scoring documents.

In [None]:
# compute scores
scores = tq.dot(tx.T)
nq, ndb = scores.shape

# find 100 nearest neighbors
k = 100
sorted_indices = scores.argsort(axis=1)[:, ::-1]  # sort descending per row
topk = sorted_indices[:, :k]  # get **indices** of the top-k items for each row

# evaluate only the k retrieved items, not the full ranking
recall = compute_recall(true_neighbors, topk)
print(recall)

0.569
time: 1.63 s (started: 2021-05-31 18:07:30 +00:00)
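Sorting every row of the score matrix is more work than needed when only the best `k` entries matter. As a possible alternative, the sketch below selects the top `k` per row with `np.argpartition` and sorts only those. The helper name `topk_indices` and the toy scores are assumptions for this illustration.

```python
import numpy as np

def topk_indices(scores, k):
  """Indices of the k highest scores in each row, best first."""
  scores = np.asarray(scores)
  # argpartition places the k largest entries of each row in the last k slots (unordered)
  part = np.argpartition(scores, -k, axis=1)[:, -k:]
  part_scores = np.take_along_axis(scores, part, axis=1)
  # sort only the k selected candidates, in descending score order
  order = np.argsort(-part_scores, axis=1)
  return np.take_along_axis(part, order, axis=1)

scores_toy = np.array([[5, 1, 9, 3],
                       [2, 8, 4, 6]])
top2 = topk_indices(scores_toy, k=2)
print(top2)
```

For large databases the full score matrix may not fit in memory; the same helper can be applied to blocks of queries at a time.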


Once we have transformed our vectors, we can create surrogate text representations by repeating each term as many times as indicated by the transformed vector, i.e., we interpret the transformed vector as term frequencies.

In [None]:
def surrogate_text(term_frequencies):
  """ Creates Surrogate Text by repeating the i-th term
      as many times as indicated by term_frequencies[i].
  """
  tokens = []
  for i, tf in enumerate(term_frequencies):
    tokens += [f'f{i}'] * int(tf)

  text = ' '.join(tokens)
  return text

time: 2.73 ms (started: 2021-05-31 18:25:23 +00:00)


In [None]:
surrogate_text(tx[0])

'f0 f0 f0 f0 f0 f0 f0 f3 f3 f3 f3 f3 f3 f3 f7 f7 f7 f8 f8 f8 f8 f8 f8 f8 f9 f9 f9 f9 f9 f9 f9 f14 f14 f14 f14 f14 f14 f14 f14 f17 f17 f17 f17 f17 f17 f17 f18 f18 f18 f18 f19 f19 f19 f22 f22 f22 f22 f22 f22 f22 f30 f30 f30 f31 f31 f32 f32 f32 f33 f33 f33 f33 f35 f35 f35 f35 f35 f39 f39 f40 f40 f40 f40 f40 f40 f40 f40 f40 f40 f40 f43 f43 f43 f43 f43 f43 f45 f45 f45 f45 f45 f45 f45 f47 f47 f47 f47 f48 f48 f48 f48 f48 f48 f48 f48 f48 f48 f48 f49 f49 f49 f49 f49 f49 f49 f49 f54 f54 f54 f54 f54 f54 f54 f54 f54 f54 f54 f54 f56 f56 f56 f56 f57 f57 f57 f57 f57 f59 f59 f59 f66 f66 f66 f67 f67 f67 f68 f68 f68 f68 f68 f68 f68 f68 f68 f68 f68 f68 f68 f68 f71 f71 f71 f71 f71 f71 f71 f71 f71 f71 f71 f71 f71 f74 f74 f74 f74 f74 f74 f74 f74 f74 f74 f74 f74 f74 f74 f75 f75 f75 f75 f75 f80 f80 f80 f80 f80 f80 f80 f80 f80 f80 f80 f80 f80 f80 f82 f82 f82 f82 f86 f86 f86 f86 f86 f86 f86 f86 f86 f86 f87 f87 f87 f87 f87 f87 f87 f87 f87 f89 f89 f89 f89 f89 f89 f90 f90 f91 f91 f91 f91 f91 f91 f91 f91 f91 f94 f9

time: 8.95 ms (started: 2021-05-31 18:25:24 +00:00)


If the textual search engine supports boosting, we can generate shorter surrogate texts, which lead to lower query times:

In [None]:
def surrogate_text_boost(term_frequencies):
  """ Creates Surrogate Text with Boosting.
      Instead of repeating a term N times, writes 'term^N'
      which in many full-text search engines has the same effect
      as setting the term frequency of that term to N.
      Useful to get shorter surrogate text representations.
  """
  tokens = []
  for i, tf in enumerate(term_frequencies):
    if tf:
      tokens.append(f'f{i}^{tf:g}')

  text = ' '.join(tokens)
  return text

time: 6.32 ms (started: 2021-05-31 18:16:16 +00:00)


In [None]:
surrogate_text_boost(tq[0])

'f3^5 f7^7 f10^3 f11^14 f15^3 f17^4 f18^4 f20^7 f22^15 f23^5 f31^6 f33^2 f35^6 f39^13 f46^3 f48^5 f49^12 f50^10 f51^17 f52^3 f59^2 f62^11 f65^5 f66^19 f69^13 f70^26 f71^17 f75^17 f76^11 f77^4 f80^17 f91^7 f92^7 f94^11 f95^23 f98^8 f99^7 f100^2 f101^15 f102^2 f104^8 f105^9 f106^14 f108^7 f109^11 f112^11 f113^11 f116^4 f119^14 f121^3 f124^13 f125^11 f126^8 f127^7 f128^3 f129^14 f130^11 f132^5 f134^25 f138^11 f140^8 f141^9 f142^4 f144^14 f153^10 f154^19 f156^18 f157^3 f158^5 f161^12 f164^3 f167^12 f168^2 f172^3 f173^6 f178^14 f179^3 f183^18 f184^20 f185^11 f197^7'

time: 6.08 ms (started: 2021-05-31 18:16:18 +00:00)
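To check that the plain and the boosted surrogate texts encode the same term frequencies, one can parse them back into a vector. The sketch below does so for tokens of the form `f<i>` and `f<i>^<n>`; the helper name `text_to_term_frequencies` is an assumption for this illustration and is not part of any search engine API.

```python
import numpy as np

def text_to_term_frequencies(text, n_terms):
  """Rebuild a term-frequency vector from a surrogate text.

  Accepts both repeated terms ('f0 f0') and boosted terms ('f0^2').
  """
  tf = np.zeros(n_terms, dtype=int)
  for token in text.split():
    term, _, boost = token.partition('^')
    index = int(term[1:])  # strip the leading 'f'
    # a missing boost counts as a single occurrence
    tf[index] += int(float(boost)) if boost else 1
  return tf

plain = text_to_term_frequencies('f0 f0 f2', n_terms=4)
boosted = text_to_term_frequencies('f0^2 f2^1', n_terms=4)
print(plain, boosted)
```

Both texts map back to the same vector, which is the property the boosted form relies on.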


### Surrogate Text Representations on Elasticsearch

Let's see a working example of using STRs with Elasticsearch.
First, let's download and run an Elasticsearch instance.

In [None]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q --no-clobber
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.0.0
!pip install -q elasticsearch

time: 13.4 s (started: 2021-05-31 18:23:28 +00:00)


In [None]:
# start server
import os
from subprocess import Popen, STDOUT, DEVNULL
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'], 
                  stdout=DEVNULL, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# wait a bit for ES to start

time: 141 ms (started: 2021-05-31 18:23:58 +00:00)
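The server needs some time before it accepts connections. Instead of waiting for a fixed amount of time, one option is to poll a readiness check until it succeeds. The sketch below is a generic helper; the name `wait_until` and its `timeout` and `interval` parameters are assumptions for this illustration.

```python
import time

def wait_until(predicate, timeout=60.0, interval=1.0):
  """Call predicate() repeatedly until it returns True or the timeout expires."""
  deadline = time.monotonic() + timeout
  while time.monotonic() < deadline:
    try:
      if predicate():
        return True
    except Exception:
      pass  # treat errors as "not ready yet"
    time.sleep(interval)
  return False
```

Once the `es` client exists, a call such as `wait_until(es.ping)` would block until the server answers or the timeout expires.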


In [None]:
from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=30)
print(es.ping())

True
time: 104 ms (started: 2021-05-31 18:24:25 +00:00)


In [None]:
! ps -ef | grep elasticsearch

daemon      5728      64 33 18:23 ?        00:00:34 /content/elasticsearch-7.0.0/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-6799401008857908504 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Dio.netty.allocator.type=unpooled -Des.path.home=/content/elasticsearch-7.0.0 -Des.path.conf=/content/elasticsearch-7.0.0/config -Des.distribution.flavor=default 

Once Elasticsearch is up and running, let's create an index.
We need to specify one field (we name it `repr`) of type `text`, so that it is searched with full-text semantics (TF-IDF-like scoring). We also specify a whitespace analyzer, which simply splits tokens on spaces.

In [None]:
index_config = {
    "mappings": {
        "_source": {"enabled": False},  # do not store STR
        # declare the field 'repr' as full-text and attach the whitespace analyzer defined below
        "properties": {"repr": {"type": "text", "analyzer": "first"}}
    },
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
        "analysis": {"analyzer": {"first": {"type": "whitespace"}}}  # tokenize by spaces, we don't need fancier analyzers
    }
}

# delete any pre-existing indices
es.indices.delete('simsearch_index', ignore=(400, 404))

# create the index
es.indices.create('simsearch_index', index_config, ignore=400)

{'acknowledged': True, 'index': 'simsearch_index', 'shards_acknowledged': True}

time: 394 ms (started: 2021-05-31 18:24:46 +00:00)


We use the utilities provided by the `elasticsearch` Python package to bulk index documents.
We define a function that creates Elasticsearch indexing commands and use `streaming_bulk` to process them sequentially.

In [None]:
from elasticsearch.helpers import streaming_bulk
from tqdm.auto import tqdm  # progress bar

n_samples = 10_000

def generate_docs(index_name, v):
  for i, vi in enumerate(v):
    yield {'_index': index_name, '_id': i, 'repr': surrogate_text(vi)}

docs = generate_docs('simsearch_index', tx[:n_samples])
indexing = streaming_bulk(es, docs, chunk_size=150, max_chunk_bytes=2**26)
indexing = tqdm(indexing, total=n_samples)

for _ in indexing:
  pass



time: 9.52 s (started: 2021-05-31 18:25:45 +00:00)


Let's check how many documents have been inserted.

In [None]:
print(es.count(index='simsearch_index'))

{'count': 10000, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
time: 28.6 ms (started: 2021-05-31 18:30:22 +00:00)


Now, we can query the database. We specify we want to search the `repr` field. Since Elasticsearch supports the boosting syntax, we use the shorter surrogate text representation with boosting.

In [None]:
qi = tq[0]  # first query

query = {
  "query": {"query_string": {"default_field": "repr", "query": surrogate_text_boost(qi)}},
  "from": 0, "size": k
}

results = es.search(query, index='simsearch_index')
results

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '3143',
    '_index': 'simsearch_index',
    '_score': 9325.603,
    '_type': '_doc'},
   {'_id': '870',
    '_index': 'simsearch_index',
    '_score': 9300.942,
    '_type': '_doc'},
   {'_id': '1548',
    '_index': 'simsearch_index',
    '_score': 8952.77,
    '_type': '_doc'},
   {'_id': '852',
    '_index': 'simsearch_index',
    '_score': 8823.858,
    '_type': '_doc'},
   {'_id': '6828',
    '_index': 'simsearch_index',
    '_score': 8791.831,
    '_type': '_doc'},
   {'_id': '4098',
    '_index': 'simsearch_index',
    '_score': 8688.694,
    '_type': '_doc'},
   {'_id': '8868',
    '_index': 'simsearch_index',
    '_score': 8667.128,
    '_type': '_doc'},
   {'_id': '8828',
    '_index': 'simsearch_index',
    '_score': 8655.381,
    '_type': '_doc'},
   {'_id': '8063',
    '_index': 'simsearch_index',
    '_score': 8639.455,
    '_type': '_doc'},
   {'_id': '3004',
    '_index': 's

time: 197 ms (started: 2021-05-31 15:34:04 +00:00)


Let's prepare a ground truth of true nearest neighbors that includes only the subset of data points we indexed with Elasticsearch (indexing all the data in Colab takes roughly 40 minutes).

In [None]:
true_scores_small = queries.dot(database[:n_samples].T)
true_neighbors_small = true_scores_small.argsort(axis=1)[:, ::-1][:, :k] 

time: 136 ms (started: 2021-05-31 18:44:05 +00:00)
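Note that the dataset compares vectors with the cosine similarity, whereas a plain dot product also depends on vector norms. If the vectors are not normalized, the two rankings can differ. The sketch below shows a cosine-based variant on toy data; the helper name `cosine_topk` and the toy arrays are assumptions for this illustration.

```python
import numpy as np

def cosine_topk(queries, database, k):
  """Indices of the k database vectors most cosine-similar to each query."""
  # normalize rows to unit length so that the dot product equals the cosine
  q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
  x = database / np.linalg.norm(database, axis=1, keepdims=True)
  similarities = q @ x.T
  return np.argsort(-similarities, axis=1)[:, :k]

database_toy = np.array([[10.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
queries_toy = np.array([[1.0, 1.0]])

dot_best = (queries_toy @ database_toy.T).argmax(axis=1)
neighbors_toy = cosine_topk(queries_toy, database_toy, k=1)
print(dot_best, neighbors_toy)
```

In this toy case the dot product prefers the long vector at index 0, while the cosine similarity prefers the vector at index 1, which points in the same direction as the query.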


Next, let's perform the search for all our queries and compute the obtained recall.

### Exercise

Perform all the queries using Elasticsearch and measure the obtained mean recall.

*Suggestion: perform the queries one at a time and build the D (scores) and I (ids) matrices incrementally; then use `compute_recall(true_neighbors_small, I)` to compute recall as usual.* 

In [None]:
D = []
I = []

for qi in tqdm(tq):
  query = {
    "query": {"query_string": {"default_field": "repr", "query": surrogate_text_boost(qi)}},
    "from": 0, "size": k
  }

  results = es.search(query, index='simsearch_index')

  Di = [hit['_score'] for hit in results['hits']['hits']]
  Ii = [int(hit['_id']) for hit in results['hits']['hits']]

  D.append(Di)
  I.append(Ii)

D = np.array(D)
I = np.array(I)



time: 8.34 s (started: 2021-05-31 18:45:26 +00:00)


In [None]:
recall = compute_recall(true_neighbors_small, I)
print(recall)

0.5236000000000001
time: 10.8 ms (started: 2021-05-31 18:45:35 +00:00)


## Additional Resources

*   https://ai.facebook.com/tools/faiss/
*   https://github.com/facebookresearch/faiss/wiki/
*   https://github.com/erikbern/ann-benchmarks/
*   http://melisandre.deepfeatures.org/
