# An Overview of Approximate Nearest Neighbors Search

## Background

In [1]:
import nmslib
import numpy as np
import pandas as pd
import requests
from collections import Counter
from scipy.spatial import distance
from sklearn.feature_extraction.text import CountVectorizer

### Relational algebra and SQL

In relational algebra, a **theta join** ($⋈_{\theta}$) is an operation between two relations (tables) $R$ and $S$ that returns the set of all combinations of tuples (rows) in $R$ and $S$ that satisfy a predicate $\theta$. 

$$R ⋈_{\theta} S$$

The predicate $\theta$ is essentially a comparison between attributes (columns) of the two relations. The operators for comparisons are =, ≠, <, ≤, >, and ≥. The literature on relational algebra highlights two cases of the theta join:
- The **equijoin**: a theta join where the operator in the predicate is an equal sign (=).
- The **natural join**: an equijoin on all common attributes between $R$ and $S$, preserving only one of each compared attributes. The attributes to be compared are implicit and do not need to be specified.

In practice, relational algebra is extended to distinguish between inner and outer joins. Inner joins return only the tuples (rows) that satisfy the matching criteria. Outer joins return the matched tuples, plus all other tuples from one or both tables, filling empty attributes with NULL values. There are three outer join operations:
- **Left outer join**: preserves all tuples in the relation on the left side of the operation
- **Right outer join**: preserves all tuples in the relation on the right side of the operation
- **Full outer join**: preserves all tuples in both relations

<div align="center">
    <img src="join-operations.png" style="width:calc(1em * 37); margin:20px;" alt="Join Operations"/>
    <div style="font-size: 0.85em;">Illustration of the inner and outer join operations</div>
</div>
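The four join variants can be tried out with pandas, whose `DataFrame.merge` method takes a `how` argument. The sketch below uses two small made-up tables; the names `left`, `right`, `key`, `l_val` and `r_val` are illustrative assumptions and not part of the data used later in this notebook.

```python
import pandas as pd

# Two toy relations sharing the attribute "key" (illustrative data)
left = pd.DataFrame({"key": [1, 2, 3], "l_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r_val": ["x", "y", "z"]})

inner = left.merge(right, on="key", how="inner")        # matched keys only
left_outer = left.merge(right, on="key", how="left")    # all keys of `left`
right_outer = left.merge(right, on="key", how="right")  # all keys of `right`
full_outer = left.merge(right, on="key", how="outer")   # all keys of both

# Unmatched attributes are filled with NaN, pandas' stand-in for NULL
print(full_outer)
```

The inner join keeps only the keys present in both tables, while each outer join adds back the unmatched rows of one or both sides.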

SQL builds upon relational algebra. The SQL `INNER JOIN` operation corresponds to the theta join, with the `ON` keyword specifying the predicate. For example, given the `books` and `stocks` tables, we can find all books available at the Frescati bookshop by joining the two tables on the `isbn` attribute and restricting `stocks.location` to Frescati.

`books` table:

|isbn (PK)        |author            |title                   |
|-----------------|------------------|------------------------|
|978-2-253-06707-8|Fiodor Dostoïevski|Les Frères Karamazov    |
|978-0-141-19027-3|Tennessee Williams|A Streetcar Named Desire|

`stocks` table:

|isbn (PK)        |location (PK)  |quantity|
|-----------------|---------------|--------|
|978-2-253-06707-8|Frescati       |2       |
|978-0-141-19027-3|Östermalmstorg |1       |
|978-0-141-19027-3|Frescati       |3       |

SQL query:

```SQL
SELECT books.isbn, books.author, books.title, stocks.quantity
FROM books
INNER JOIN stocks ON books.isbn=stocks.isbn
    AND stocks.location = 'Frescati';
```

Result:

|isbn             |author            |title                   |quantity|
|-----------------|------------------|------------------------|--------|
|978-2-253-06707-8|Fiodor Dostoïevski|Les Frères Karamazov    |2       |
|978-0-141-19027-3|Tennessee Williams|A Streetcar Named Desire|3       |

Note that since `isbn` is the only attribute common to both tables, we could also have performed a natural join (`NATURAL JOIN stocks`) and filtered the result with a `WHERE` clause (`WHERE stocks.location = 'Frescati'`).
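To check that the two formulations agree, the query can be run against an in-memory SQLite database using Python's built-in `sqlite3` module. This is a minimal sketch: the column types and the composite primary key on `stocks` are assumptions inferred from the tables above, and an `ORDER BY` clause is added only to make the row order deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE books (isbn TEXT PRIMARY KEY, author TEXT, title TEXT);
CREATE TABLE stocks (isbn TEXT, location TEXT, quantity INTEGER,
                     PRIMARY KEY (isbn, location));
INSERT INTO books VALUES
  ('978-2-253-06707-8', 'Fiodor Dostoïevski', 'Les Frères Karamazov'),
  ('978-0-141-19027-3', 'Tennessee Williams', 'A Streetcar Named Desire');
INSERT INTO stocks VALUES
  ('978-2-253-06707-8', 'Frescati', 2),
  ('978-0-141-19027-3', 'Östermalmstorg', 1),
  ('978-0-141-19027-3', 'Frescati', 3);
""")

# Theta join with the location filter inside the ON predicate
inner_rows = con.execute("""
SELECT books.title, stocks.quantity
FROM books
INNER JOIN stocks ON books.isbn = stocks.isbn
    AND stocks.location = 'Frescati'
ORDER BY stocks.quantity;
""").fetchall()

# Natural join on the shared attribute (isbn), filtered afterwards
natural_rows = con.execute("""
SELECT title, quantity
FROM books NATURAL JOIN stocks
WHERE location = 'Frescati'
ORDER BY quantity;
""").fetchall()

print(inner_rows == natural_rows)
```

Both queries return the two books stocked at Frescati, with quantities 2 and 3.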

Relational algebra and SQL are limited in their application to information that is consistent and standardized. The relational model is based on classical logic (Boolean algebra), in which a predicate either holds or does not, and therefore does not account for uncertainty or imprecision. Yet data is often ambiguous or inconsistent. One example is natural language: a query on the `books` table for "Fyodor Dostoevsky" would not return any results. While the French and English spellings are both valid and identify the same author, their string representations are not equal, so an equality predicate cannot match them.

In [2]:
binary = lambda s: " ".join(map(bin,bytearray(s, "utf-8")))
binary("Fiodor Dostoïevski")

'0b1000110 0b1101001 0b1101111 0b1100100 0b1101111 0b1110010 0b100000 0b1000100 0b1101111 0b1110011 0b1110100 0b1101111 0b11000011 0b10101111 0b1100101 0b1110110 0b1110011 0b1101011 0b1101001'

In [3]:
binary("Fyodor Dostoevsky")

'0b1000110 0b1111001 0b1101111 0b1100100 0b1101111 0b1110010 0b100000 0b1000100 0b1101111 0b1110011 0b1110100 0b1101111 0b1100101 0b1110110 0b1110011 0b1101011 0b1111001'

In [4]:
"Fiodor Dostoïevski" == "Fyodor Dostoevsky"

False

Other sources of ambiguity and inconsistency include differing standards and imperfect data collection methods.

Instead of using Boolean algebra to test whether two values are equal, we can measure the distance between them.

### Representations and Metrics

Attributes can be represented as elements of sets. Strings, for example, are sequences of symbols taken from an alphabet. The name "Fyodor Dostoevsky" is therefore an element in the set of all strings $\Sigma^*$ over the alphabet $\Sigma$. Representations can be given by the type of the attribute (strings, real numbers, geographic coordinates) or derived by transforming it (n-grams, word embeddings). Based on the definition of the set, the relationship between its elements can be quantified using a metric. A ***metric*** is a function that gives the distance between each pair of elements in the set as a non-negative real number. A set with a metric is a ***metric space***.

For strings, two common metrics are the Hamming distance and the Levenshtein distance.

The ***Hamming distance*** is a metric on the set of strings of length $n$. It is the number of positions at which the corresponding symbols of two strings differ.

For example, the Hamming distance between "F<span style="color:cyan">y</span>odor" and "F<span style="color:red">i</span>odor" is 1.

In [5]:
def hamming_distance(x, y):
    assert len(x) == len(y)
    return sum(xi != yi for xi, yi in zip(x, y))

hamming_distance("Fyodor", "Fiodor")

1

The ***Levenshtein distance*** is a metric on the set of all strings. It is the minimum number of single-character edits required to change one string into the other. Unlike the Hamming distance, it is therefore defined for strings of any length and is able to take account of insertions and deletions.

For example, the Levenshtein distance between "kitten" and "sitting" is 3. The <span style="color:cyan">k</span> and <span style="color:cyan">e</span> need to be substituted by an <span style="color:red">s</span> and <span style="color:red">i</span>, and a <span style="color:red">g</span> needs to be inserted at the end:

<span style="color:cyan">k</span>itt<span style="color:cyan">e</span>n<span style="color:cyan">_</span><br><span style="color:red">s</span>itt<span style="color:red">i</span>n<span style="color:red">g</span>
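The Levenshtein distance can be computed with dynamic programming. The function below is a minimal sketch of the standard Wagner-Fischer recurrence that keeps only one row of the table at a time; the name `levenshtein` is our own choice and does not refer to a library function.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between two strings (Wagner-Fischer)."""
    # prev[j] is the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if symbols match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("Fiodor Dostoïevski", "Fyodor Dostoevsky"))
```

Applied to the two spellings of the author's name, the function counts two substitutions (i → y, twice) and one deletion (ï), i.e. a distance of 3.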

Sequences such as strings are defined by the order of their elements. However, for data such as natural language, units such as words matter more than the index of characters in a sequence of text.

One representation for text is the ***bag-of-words model***. It represents text as the bag ([multiset](https://en.wikipedia.org/wiki/Multiset)) of its words.

In [6]:
BoW = lambda s: Counter(s.split())
BoW("the author of the book is Tennessee Williams")

Counter({'the': 2,
         'author': 1,
         'of': 1,
         'book': 1,
         'is': 1,
         'Tennessee': 1,
         'Williams': 1})

The bag-of-words model lends itself to a vector representation: each distinct word is a dimension, and the count of the word is the coordinate along it. The distance between two elements in the set can thus be defined by a distance measure such as the cosine distance.

A drawback of the bag-of-words model is that it represents only the frequency of words and does not capture their order in the sequence.
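Both points can be illustrated with a few lines of Python. The helper `bow_cosine_distance` below is a hypothetical name introduced for this sketch; it splits on whitespace like the `BoW` function above and computes the cosine distance directly from the two word counts.

```python
import math
from collections import Counter

def bow_cosine_distance(s: str, t: str) -> float:
    """Cosine distance between the bag-of-words vectors of two texts."""
    a, b = Counter(s.split()), Counter(t.split())
    # Words missing from a bag have count 0, so only shared words contribute
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1 - dot / (norm_a * norm_b)

# Same words in a different order: the distance is (numerically) zero
print(bow_cosine_distance("the author of the book", "the book of the author"))
# No words in common: the distance is one
print(bow_cosine_distance("crime and punishment", "war peace"))
```

The first pair of sentences has different meanings but identical bags of words, which is exactly the loss of order described above.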

An alternative representation for sequences such as text is the ***n-gram*** model. An n-gram is a contiguous sub-sequence of *n* items from a given sequence. The unit of an n-gram depends on the elements of the given sequence. For text, it can be words or characters, while for a DNA sequence it might be base pairs. For example, the string "language" can be split into 5 character 4-grams:

lang, angu, ngua, guag, uage

In [7]:
def word_ngrams(s: str, n=2):
    """Splits a string into n-grams of n words"""
    words = s.split()
    return [" ".join(words[i:i+n]) for i in range(len(words)-n+1)]

word_ngrams("John likes to watch movies")

['John likes', 'likes to', 'to watch', 'watch movies']

In [8]:
def char_ngrams(s: str, n=3):
    """Splits a string into n-grams of n characters"""
    return [s[i:i+n] for i in range(len(s)-n+1)]

char_ngrams("Apple Inc.")

['App', 'ppl', 'ple', 'le ', 'e I', ' In', 'Inc', 'nc.']

The distance between the sets of n-grams of two sequences can be measured using a metric such as the Jaccard distance. The Jaccard distance is a metric on the set of all finite sets. For two sets $A$ and $B$, the Jaccard similarity coefficient (also known as the Jaccard index) is given by the size of their intersection divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard distance is then obtained by subtracting the similarity coefficient from $1$:

$$d_{Jaccard}(A, B) = 1 - J(A, B)$$

In [9]:
def jaccard(a: set, b: set):
    """Jaccard distance between two sets a and b"""
    coeff = len(a & b) / len(a | b)
    return 1 - coeff

# character 3-grams of two DNA sequences
a = ("ATC", "TCG", "CGA", "GAT")
b = ("CGA", "GAT", "ATT", "TTG", "TGA")
jaccard(set(a), set(b))

0.7142857142857143

In [10]:
jaccard(set(a), set(a))

0.0

Alternatively, a set of n-gram sequences can be vectorized so that the distance between two sequences is measurable using a metric such as the cosine distance.

In [12]:
# List of n-gram sequences (a list keeps the rows aligned with the index labels)
seqs = [a, b]
# Find all distinct n-grams, sorted for a deterministic column order
dims = sorted(set().union(*seqs))
# Create a feature matrix
matrix = np.zeros((len(seqs), len(dims)))
# Populate the matrix with n-gram counts
for e_idx, element in enumerate(seqs):
    features = Counter(element)
    for d_idx, dim in enumerate(dims):
        matrix[e_idx, d_idx] = features.get(dim, 0)

df = pd.DataFrame(matrix, index=["ATCGAT", "CGATTGA"], columns=dims, dtype=int)
df

|       |ATC|ATT|CGA|GAT|TCG|TGA|TTG|
|-------|---|---|---|---|---|---|---|
|ATCGAT |1  |0  |1  |1  |1  |0  |0  |
|CGATTGA|0  |1  |1  |1  |0  |1  |1  |


In [13]:
distance.cosine(df.loc["ATCGAT"], df.loc["CGATTGA"])

0.5527864045000421

***Word and sentence embeddings*** are two other representations of text. The idea is to represent words or sentences as low-dimensional vectors using a language model such as a neural network. The language model is typically designed to encode the meaning of the word or sentence. Words or sentences with similar meaning or semantic content will thus be close (e.g. small cosine distance) in the vector space.

<!-- Locality sensitive hashing

Combine different dimensions (e.g. name, location, employee count) -->

An important subset of metric spaces are normed vector spaces. A vector space is a set of vectors in which two operations are defined: addition and scalar multiplication. Normed vector spaces are vector spaces with a specified norm, i.e. a function which determines the length of all vectors in the vector space. A norm induces a metric, so every normed vector space is a metric space.

The $p$-norm or $L^p$ norm of a vector $x$ is defined by

$${\lVert x \rVert}_p = \left(\sum_{i=1}^n |x_i|^p\right)^{\frac{1}{p}}$$

for a real number $p ≥ 1$. For a given $p$-norm, the corresponding distance (also known as Minkowski distance) between two points $x$ and $y$ is:

$$d(x,y) = {\lVert x-y \rVert}_p$$

Three values of $p$ give commonly used metrics:
- $p=1$:    ***Manhattan distance*** / ***taxicab metric***

$$d_{Manhattan}(x, y) = \sum_{i=1}^n|x_i-y_i|$$

In [14]:
def manhattan(x: np.ndarray, y: np.ndarray):
    """Manhattan distance between two points x and y"""
    return np.sum(np.abs(x - y))

x = np.array([0, 0])
y = np.array([2, 1])
manhattan(x, y)

3

The Manhattan distance can be used when the dimensions of the vector space are not comparable. It is faster to compute than the Euclidean distance and may therefore be preferred in very high dimensional spaces.

- $p=2$: ***Euclidean distance***

$$d_{Euclidean}(x, y) = \left(\sum_{i=1}^n (x_i-y_i)^2\right)^{\frac{1}{2}}$$

In [15]:
def euclidean(x: np.ndarray, y: np.ndarray):
    """Euclidean distance between two points x and y"""
    return np.sqrt(np.sum((x - y)**2))

euclidean(x, y)

2.23606797749979

Often, metrics are used only to compare distances relative to one another, in which case the absolute distance is not needed. Because the square root is monotonically increasing, omitting it from the Euclidean distance does not change the ordering of the distances.
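A quick numerical check of this claim, as a sketch on randomly generated points (the data and the variable names are illustrative assumptions): ranking by squared distance and ranking by distance give the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 8))
query = rng.normal(size=8)

sq_dists = np.sum((points - query) ** 2, axis=1)  # no square root
dists = np.sqrt(sq_dists)                         # true Euclidean distances

# The square root is monotonic, so both arrays sort into the same order
same_order = np.array_equal(np.argsort(sq_dists), np.argsort(dists))
print(same_order)
```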

The Euclidean distance is among the most widely used metrics. However, as the difference between the two points is squared, an outlier coordinate can skew the result.

- $p=\infty$: ***Chebyshev distance***

The $\infty$-norm is the limit of the $p$-norm for $p \to \infty$:

$$\|x\|_\infty = \lim_{p\to\infty} \|x\|_p = \sup_i |x_i|$$

In a finite-dimensional vector space, the corresponding distance function is the largest absolute difference between the coordinates of $x$ and $y$:

$$d_{Chebyshev}(x, y) = \max_i|x_i - y_i|$$

See proof [here](https://proofwiki.org/wiki/Chebyshev_Distance_is_Limit_of_P-Product_Metric).

In [16]:
def chebyshev(x: np.ndarray, y: np.ndarray):
    """Chebyshev distance between two points x and y"""
    return np.max(np.abs(x - y))

chebyshev(x, y)

2

The Chebyshev distance is useful when we are interested in the largest difference along any single dimension.
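The three cases can be covered by a single function parameterized by $p$. The sketch below (the name `minkowski` is our own) also shows numerically that the distance approaches the Chebyshev distance as $p$ grows. The differences are cast to floating point so that large powers do not overflow integer arithmetic.

```python
import numpy as np

def minkowski(x: np.ndarray, y: np.ndarray, p: float) -> float:
    """Minkowski distance of order p >= 1; p = np.inf gives Chebyshev."""
    diff = np.abs(x - y).astype(float)
    if np.isinf(p):
        return float(diff.max())
    return float(np.sum(diff ** p) ** (1 / p))

x = np.array([0, 0])
y = np.array([2, 1])
for p in (1, 2, 10, 100, np.inf):
    print(p, minkowski(x, y, p))
```

For these two points the value decreases from 3 ($p=1$) towards 2, the largest coordinate difference.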

***Inner product spaces*** are normed vector spaces with an inner product. An ***inner product*** $\langle \cdot, \cdot \rangle$ is an operation on two vectors that satisfies the *bilinearity*, *symmetry*, and *positive-definiteness* properties. For example, the real vector space $\mathbb{R}^n$ with the $L^2$ norm, also known as Euclidean space, has an inner product that is known as the dot product. For two vectors $x$ and $y$:

$$\langle x, y \rangle = \sum_{i=1}^n x_i y_i = x \cdot y$$

In [17]:
x = np.array([-1, 3])
y = np.array([.5, 2])

np.dot(x, y)

5.5

In [18]:
np.dot(x, x)

10

The dot product depends on both the angle between two vectors and their magnitudes. Two vectors separated by a small angle can thus have a smaller dot product than two vectors separated by a larger angle but with larger magnitudes.

In [19]:
np.dot(x, x) > np.dot(x**2, y**2)

False

By dividing the vectors by their norms, i.e. converting them into unit vectors, the effect of magnitude can be eliminated. The result of this operation is known as the cosine similarity.

In [20]:
x = x / np.linalg.norm(x)
y = y / np.linalg.norm(y)
np.dot(x, x) > np.dot(x**2, y**2)

True

The ***cosine similarity*** is a measure of the similarity of two vectors $x$ and $y$, and is defined as the cosine of the angle $\theta$ between them:

$$S_C(x, y) \coloneqq \cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|} $$

In [21]:
def cosine(x: np.ndarray, y: np.ndarray):
    """Cosine similarity between two points x and y"""
    assert np.any(x) and np.any(y), "cosine similarity is undefined for zero vectors"
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([-1, 3])
y = np.array([.5, 2])
cosine(x, y)

0.8436614877321075

The dot product and cosine similarity are not metrics, but are often used in information retrieval when text is represented as term/n-gram frequency vectors or as word/sentence embeddings.

Vectors pointing in the same direction (positively proportional vectors) have a cosine similarity of $1$, orthogonal vectors have a similarity of $0$, and vectors pointing in opposite directions have a similarity of $-1$.

In [22]:
cosine(x, x)

0.9999999999999998

In [23]:
orthogonal = np.array([3, 1])
cosine(x, orthogonal)

0.0

In [24]:
opposite = np.array([1, -3])
cosine(x, opposite)

-0.9999999999999998

To join two sets based on a metric or similarity measure, we need to find the pairs of elements between both sets that are near to each other. This problem is essentially a form of the k-nearest neighbor search problem. The next section reviews data structures and algorithms to solve this problem efficiently.

### Data Structures and Algorithms

The ***k*-nearest neighbor (*k*-NN) search problem** consists of finding, in a dataset $X \subset S$, the *k* points nearest to a query point $\mathbf{q} \in S$, where $S$ is a metric space.

A brute force approach would be to compute the distance between the query point and each point in $X$, and then loop *k* times through the results to return the *k* points with the smallest distance. The downside of this approach is its high computational cost. Let's imagine, for example, that $S$ is a vector space $\mathbb{R}^d$ with $d$ dimensions and that the Euclidean distance is used as the metric. Each distance computation has a time complexity of $O(d)$. Computing the distances for the $n$ points in $X$ therefore requires $O(nd)$ runtime. Finally, selecting the *k* smallest distances requires $O(nk)$ runtime. The time complexity of this brute force algorithm is therefore $O(nd + nk)$ per query. If we have a high number of dimensions, a large number of samples, or need to evaluate many queries, this approach becomes impractical.
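The brute force search fits in a few lines of NumPy. This is a sketch rather than a reference implementation: the function name `knn_brute_force` is our own, and the selection step uses `np.argpartition` followed by a sort of the *k* candidates instead of the *k* passes described above.

```python
import numpy as np

def knn_brute_force(X: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k points in X closest to q (Euclidean), nearest first."""
    # O(nd): squared distance from q to each of the n points
    sq_dists = np.sum((X - q) ** 2, axis=1)
    # Partial selection of the k smallest distances, then sort only those k
    candidates = np.argpartition(sq_dists, k - 1)[:k]
    return candidates[np.argsort(sq_dists[candidates])]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
q = rng.normal(size=16)
print(knn_brute_force(X, q, k=3))
```

Every query still touches all $n$ points, which is what the index structures below try to avoid.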

A more refined approach is to build an index of the spatial structure of the dataset $X$. The data structures available for partitioning space depend on the metric of the space.

General metric spaces:
- **vp-tree** (vantage-point tree): a tree that partitions space into smaller and smaller balls. The root of the tree is an element of the dataset that serves as the "vantage point". Elements that are within a certain distance (radius) of this point are on one side of the node, while elements outside of the radius are on the other side of the node. Applying this partitioning recursively yields a tree of increasingly smaller balls.
- **BK-tree** (Burkhard-Keller tree): a tree designed specifically for discrete metrics (e.g. Levenshtein distance; Manhattan and Chebyshev distance on a space of natural numbers). Each node represents an element of the dataset, and each link between a node and its child is labeled with the distance between them.
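As an illustration of how a metric index prunes the search, here is a compact BK-tree sketch over the Levenshtein distance. The class `BKTree`, its methods, and the sample names are assumptions made for this example; the pruning rule follows from the triangle inequality.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance (single-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """BK-tree over a discrete metric; children are keyed by distance."""

    def __init__(self, metric):
        self.metric = metric
        self.root = None  # a node is a pair (element, {distance: child node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.metric(item, node[0])
            if d == 0:
                return  # element already stored
            if d not in node[1]:
                node[1][d] = (item, {})
                return
            node = node[1][d]  # descend along the edge labeled d

    def search(self, query, radius):
        """All (distance, element) pairs within `radius` of the query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            element, children = stack.pop()
            d = self.metric(query, element)
            if d <= radius:
                results.append((d, element))
            # Triangle inequality: matches can only lie in subtrees whose
            # edge label is within [d - radius, d + radius]
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)

names = ["Fyodor", "Fiodor", "Fedor", "Victor", "Hector", "Miguel"]
tree = BKTree(levenshtein)
for name in names:
    tree.add(name)
print(tree.search("Fyodor", radius=1))
```

Subtrees whose edge label falls outside the allowed interval are skipped without computing any further distances.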

Euclidean spaces:
- ***k*-d tree** (*k*-dimensional tree): a type of binary search tree. Every node in the tree is a point in *k*-dimensional space. The two children of every node are points on the left and right side of an axis-aligned hyperplane through that node.
- **r-tree**: a tree where groups of nearby points are represented by their minimum bounding rectangle (hence the name r-tree). Each node represents a rectangle and links to its sub-rectangles. The r-tree can therefore be thought of as a tree of increasingly smaller rectangles.
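SciPy ships a k-d tree in `scipy.spatial.KDTree`. The sketch below, on random three-dimensional points (illustrative data), builds the index once and checks that its exact 5-nearest-neighbor result matches a brute force scan.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))  # low-dimensional data suits k-d trees
q = rng.uniform(size=3)

tree = KDTree(X)                           # build the index once
tree_dists, tree_idx = tree.query(q, k=5)  # exact 5 nearest neighbors

# Same answer as a brute force scan over all points
brute_idx = np.argsort(np.linalg.norm(X - q, axis=1))[:5]
print(np.array_equal(tree_idx, brute_idx))
```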

Levenshtein spaces:
- **trie** (also known as **prefix tree**): a search tree where each node is a partial or complete sequence and links between nodes represent individual elements.
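A minimal trie can be sketched as nested dictionaries. The class below (the names `Trie`, `insert` and `with_prefix` are our own) only supports prefix lookup; searching within a Levenshtein radius would additionally carry a row of the edit-distance table down each branch, which is left out here.

```python
class Trie:
    """Prefix tree: each link is one symbol, each node one prefix."""

    def __init__(self):
        self.children = {}   # symbol -> child Trie
        self.is_end = False  # True if a stored sequence ends at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def with_prefix(self, prefix: str) -> list:
        """All stored sequences that start with `prefix`, sorted."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Depth-first traversal of the subtree below the prefix node
        found, stack = [], [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_end:
                found.append(word)
            for ch, child in node.children.items():
                stack.append((child, word + ch))
        return sorted(found)

trie = Trie()
for name in ["Dostoevsky", "Dostoyevsky", "Dostoïevski", "Dumas"]:
    trie.insert(name)
print(trie.with_prefix("Dosto"))
```

All three spellings share the path `D-o-s-t-o`, so they are stored under a common subtree and only diverge afterwards.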

Depth-first search (DFS) of these data structures yields exact results. DFS of exact space indexes works well on low dimensional spaces, but is expensive on high-dimensional ones. For high dimensions, this method is not much more efficient than the brute force approach.

<!-- See "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces" -->

An alternative approach for high-dimensional spaces is to relax the constraints of the problem: instead of searching for the exact nearest neighbors, we search only for approximate nearest neighbors. One idea behind **approximate nearest neighbor (ANN) search** is to leverage quantization, i.e. to reduce the cardinality of the representation, to improve search efficiency. Essentially, the dataset is compressed lossily before being indexed. ANN search therefore trades search quality for efficiency.

One technique to perform quantization is **locality-sensitive hashing** (LSH). A locality-sensitive hash function maps points from $S$ to a lower-dimensional representation in which nearby points are likely to receive the same hash, so that they can be indexed and queried efficiently.
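One well-known LSH family for the cosine similarity uses random hyperplanes: each bit of the hash records on which side of a hyperplane the vector lies. The sketch below is an illustration under assumed parameters (`d`, `n_bits` and the test vectors are made up) and only computes signatures; a full index would also group signatures into buckets.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 128
planes = rng.normal(size=(n_bits, d))  # one random hyperplane per bit

def lsh_signature(v: np.ndarray) -> tuple:
    """Bit i records on which side of hyperplane i the vector lies."""
    return tuple((planes @ v > 0).astype(int))

def hamming(a: tuple, b: tuple) -> int:
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

base = rng.normal(size=d)
near = base + 0.1 * rng.normal(size=d)  # small perturbation of base
far = rng.normal(size=d)                # unrelated vector

near_bits = hamming(lsh_signature(base), lsh_signature(near))
far_bits = hamming(lsh_signature(base), lsh_signature(far))
print(near_bits, far_bits)
```

The fraction of differing bits estimates the angle between two vectors, so the perturbed vector differs from `base` in far fewer bits than the unrelated one.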

In Euclidean space, another technique is **product quantization**. It partitions a high dimensional space into a Cartesian product of low dimensional subspaces and quantizes each subspace separately (Jégou, Douze, & Schmid, 2011). The distance between a query vector and a PQ-encoded vector can be estimated with low runtime cost (this is known as asymmetric distance computation). Based on product quantization, an **inverted index** of the encoded dataset can be built to efficiently search for nearest neighbors. The inverted index proposed by Jégou, Douze, & Schmid (2011) is similar to the inverted file system of Sivic and Zisserman (2003).
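The encoding and the asymmetric distance computation can be sketched with NumPy alone. Everything below is an illustrative assumption rather than the implementation of Jégou, Douze, & Schmid: the helper names, the plain Lloyd's k-means, and the parameters (`m = 4` subspaces, `k = 16` centroids each) were picked to keep the example small.

```python
import numpy as np

def kmeans(data, k, rng, iters=10):
    """Plain Lloyd's algorithm; returns a (k, dim) array of centroids."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(X, m, k, rng):
    """One codebook of k centroids for each of the m subspaces."""
    return [kmeans(sub, k, rng) for sub in np.split(X, m, axis=1)]

def pq_encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    subs = np.split(X, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def adc(q, codes, codebooks):
    """Asymmetric squared distance: exact query vs. quantized vectors."""
    m = len(codebooks)
    # Lookup table of squared distances from each query sub-vector
    # to every centroid of the matching codebook, shape (m, k)
    table = np.stack([((cb - qs) ** 2).sum(axis=1)
                      for qs, cb in zip(np.split(q, m), codebooks)])
    return table[np.arange(m), codes].sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
q = rng.normal(size=16)
codebooks = pq_train(X, m=4, k=16, rng=rng)
codes = pq_encode(X, codebooks)  # 4 one-byte codes per vector
approx_sq_dists = adc(q, codes, codebooks)
print(codes.shape, approx_sq_dists[:3])
```

Once the lookup table is built, each approximate distance costs only `m` table lookups and additions, independent of the original dimensionality.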

The **Hierarchical Navigable Small World** graph (HNSW) is a proximity graph, i.e. its vertices are linked based on their proximity in space (Malkov & Yashunin, 2016). HNSW is inspired by two data structures: *skip lists* and *navigable small world* graphs (NSW). Skip lists consist of layered linked lists that allow for fast search. NSW models are graphs with logarithmic or polylogarithmic search complexity that use a greedy routing algorithm. HNSW consists of layered NSW graphs, with a hierarchical multilayer structure similar to that of the skip list.

<div align="center">
    <img src="hnsw.png" style="width:calc(1em * 20); margin:20px; background-color: white;" alt="Hierarchical Navigable Small World graph"/>
    <div style="font-size: 0.85em;">Hierarchical Navigable Small World graph (Malkov & Yashunin, 2016)</div>
</div>
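The greedy routing used inside each layer can be illustrated on a single proximity graph. The sketch below is not HNSW itself: it links every point to its exact nearest neighbors by brute force (an assumption made to keep the example short) and omits the layers and the candidate list that HNSW maintains. The function names are our own.

```python
import numpy as np

def build_knn_graph(X: np.ndarray, n_links: int) -> np.ndarray:
    """Link every point to its n_links nearest neighbors (brute force)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # no self-links
    return np.argsort(d2, axis=1)[:, :n_links]

def greedy_search(X, graph, q, entry):
    """Move to the neighbor closest to q until no neighbor is closer."""
    current = entry
    current_d = np.sum((X[current] - q) ** 2)
    while True:
        neighbors = graph[current]
        d = np.sum((X[neighbors] - q) ** 2, axis=1)
        best = d.argmin()
        if d[best] >= current_d:
            return current  # local minimum: no neighbor improves
        current, current_d = neighbors[best], d[best]

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
q = rng.uniform(size=2)
graph = build_knn_graph(X, n_links=8)
found = greedy_search(X, graph, q, entry=0)
print(found, np.sqrt(np.sum((X[found] - q) ** 2)))
```

Greedy routing can stop in a local minimum that is not the true nearest neighbor, which is one reason the result of a graph-based search is approximate.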

## ANN Search in Practice



In [25]:
queries = ["Feynman", "Victor Hugo", "Dostoïevski", "Martin Luther King", "Cervantes"]

SPARQL query optimized to use the label service. See [Wikidata:SPARQL query service/query optimization](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#Label_service).

In [8]:
url = "https://query.wikidata.org/sparql"
query = """
SELECT ?itemLabel ?sitelink
WHERE {
  {
    SELECT ?item ?sitelink WHERE {
      ?item wdt:P106 wd:Q36180.
      ?sitelink schema:about ?item;
      schema:isPartOf <https://en.wikipedia.org/>.
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
r = requests.get(url, params = {"format": "json", "query": query})
r.status_code

200

In [9]:
data = r.json()
links = [result["sitelink"]["value"] for result in data["results"]["bindings"]]
authors = [result["itemLabel"]["value"] for result in data["results"]["bindings"]]
assert len(links) == len(authors)
df = pd.DataFrame({"author": authors, "link": links})
df

|      |author                      |link                                             |
|------|----------------------------|-------------------------------------------------|
|0     |George Orwell               |https://en.wikipedia.org/wiki/George_Orwell      |
|1     |Gautama Buddha              |https://en.wikipedia.org/wiki/Gautama_Buddha     |
|2     |Hafez                       |https://en.wikipedia.org/wiki/Hafez              |
|3     |Francis Coventry            |https://en.wikipedia.org/wiki/Francis_Coventry   |
|4     |Sir John Barrow, 1st Baronet|https://en.wikipedia.org/wiki/Sir_John_Barrow,...|
|...   |...                         |...                                              |
|100834|Bridget Minamore            |https://en.wikipedia.org/wiki/Bridget_Minamore   |
|100835|Hasibe Çerko                |https://en.wikipedia.org/wiki/Hasibe_%C3%87erko  |
|100836|Marshall Thornton           |https://en.wikipedia.org/wiki/Marshall_Thornton  |
|100837|Mahmood Ashraf Usmani       |https://en.wikipedia.org/wiki/Mahmood_Ashraf_U...|


### *n*-grams and Vectorization

Let's represent the names of the authors as a count matrix of their 3-grams.

In [27]:
# Example of the character 3-grams sequence of Victor Hugo
char_ngrams("Victor Hugo")

['Vic', 'ict', 'cto', 'tor', 'or ', 'r H', ' Hu', 'Hug', 'ugo']

We build the matrix of 3-gram counts using scikit-learn's `CountVectorizer` together with the `char_ngrams` function defined previously.

In [28]:
vectorizer = CountVectorizer(analyzer=char_ngrams, lowercase=False)
count_matrix = vectorizer.fit_transform(list(df["author"]))
count_matrix.shape

(100839, 27864)

The resulting vector space has 27864 dimensions.

Next, we transform the list of queries using the learned vocabulary. 

In [29]:
query_matrix = vectorizer.transform(queries)
query_matrix.shape

(5, 27864)

### Brute Force Approach

The simplest approach is to do a brute force search, i.e. calculate the distance between the query and all the points in the dataset, and return the *k* points with the smallest distance.

In [30]:
from sklearn.neighbors import NearestNeighbors

K = 3
kNN = NearestNeighbors(n_neighbors=K, algorithm="brute", metric="cosine")
kNN.fit(count_matrix)

# 3 nearest neighbors of the query `Dostoïevski`
print("Query: Dostoïevski")
distances, neighbors = kNN.kneighbors(query_matrix[2])
# `dist` avoids shadowing the `distance` module imported from scipy.spatial
for dist, neighbor in zip(distances[0], neighbors[0]):
    print(df.iloc[neighbor].to_string(index=False))
    # With metric="cosine", kneighbors returns cosine distances
    print(f"Cosine distance: {dist:.4f}")

Query: Dostoïevski
                             Fyodor Dostoyevsky
https://en.wikipedia.org/wiki/Fyodor_Dostoevsky
Cosine distance: 0.5275
                             Andrey Dostoyevsky
https://en.wikipedia.org/wiki/Andrey_Dostoevsky
Cosine distance: 0.5275
                              Lyubov Dostoevskaya
https://en.wikipedia.org/wiki/Lyubov_Dostoevskaya
Cosine distance: 0.5417


In [31]:
# Nearest neighbor (k=1) of each query
K = 1
distances, neighbors = kNN.kneighbors(query_matrix, n_neighbors=K)
for query, neighbor, dist in zip(queries, neighbors, distances):
    print(f"Query: {query}")
    print(df.iloc[neighbor.item()].to_string(index=False))
    print(f"Cosine distance: {dist.item():.4f}")

Query: Feynman
                              Richard Feynman
https://en.wikipedia.org/wiki/Richard_Feynman
Cosine distance: 0.3798
Query: Victor Hugo
                              Victor Hugo
https://en.wikipedia.org/wiki/Victor_Hugo
Cosine distance: 0.0000
Query: Dostoïevski
                             Fyodor Dostoyevsky
https://en.wikipedia.org/wiki/Fyodor_Dostoevsky
Cosine distance: 0.5275
Query: Martin Luther King
                            Martin Luther King Jr.
https://en.wikipedia.org/wiki/Martin_Luther_Kin...
Cosine distance: 0.1056
Query: Cervantes
                              Annabel Cervantes
https://en.wikipedia.org/wiki/Annabel_Cervantes
Cosine distance: 0.3169


In [32]:
print("Query: Cervantes")
distances, neighbors = kNN.kneighbors(query_matrix[4])
for dist, neighbor in zip(distances[0], neighbors[0]):
    print(df.iloc[neighbor].to_string(index=False))
    print(f"Cosine distance: {dist:.4f}")

Query: Cervantes
                              Annabel Cervantes
https://en.wikipedia.org/wiki/Annabel_Cervantes
Cosine distance: 0.3169
                              Miguel de Cervantes
https://en.wikipedia.org/wiki/Miguel_de_Cervantes
Cosine distance: 0.3583
                              Lorna Dee Cervantes
https://en.wikipedia.org/wiki/Lorna_Dee_Cervantes
Cosine distance: 0.3583


In [33]:
%timeit kNN.kneighbors(query_matrix, n_neighbors=1)

53.9 ms ± 514 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### HNSW Approach

Recent research on ANN algorithms and machine learning embeddings has led to the creation of a number of ANN libraries. Examples of such libraries are *Facebook AI Similarity Search* (FAISS) and *Non-Metric Space Library* (NMSLIB). For an overview and benchmarks of these libraries, see [ANN Benchmarks](https://github.com/erikbern/ann-benchmarks).

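We index the sparse count matrix with NMSLIB's HNSW method, using its sparse cosine similarity space (`cosinesimil_sparse`).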

In [41]:
# HNSW construction graph parameters
M = 30
efConstruction = 200

index = nmslib.init(
    method="hnsw",
    space="cosinesimil_sparse",
    data_type=nmslib.DataType.SPARSE_VECTOR
)
index.addDataPointBatch(count_matrix)
index.createIndex({
    "M": M,
    "efConstruction": efConstruction,
    "indexThreadQty": 2, 
    "post": 0 # No post-processing of the graph
})

In [42]:
index.setQueryTimeParams({"efSearch": 150})

In [43]:
K = 3
result = index.knnQueryBatch(query_matrix[2], k=K)
neighbors, distances = result[0]


print("Query: Dostoïevski")
for neighbor, dist in zip(neighbors, distances):
    print(df.iloc[neighbor.item()].to_string(index=False))
    # The index returns cosine distances; 1 - distance is the similarity
    print(f"Cosine similarity: {1 - dist.item()}")

Query: Dostoïevski
                             Fyodor Dostoyevsky
https://en.wikipedia.org/wiki/Fyodor_Dostoevsky
Cosine similarity: 0.47245562076568604
                             Andrey Dostoyevsky
https://en.wikipedia.org/wiki/Andrey_Dostoevsky
Cosine similarity: 0.47245562076568604
                              Lyubov Dostoevskaya
https://en.wikipedia.org/wiki/Lyubov_Dostoevskaya
Cosine similarity: 0.4583492875099182


In [44]:
K = 1
results = index.knnQueryBatch(query_matrix, k=K)
for query, result in zip(queries, results):
    nneighbor, dist = result
    print(f"Query: {query}")
    print(df.iloc[nneighbor.item()].to_string(index=False))
    print(f"Cosine similarity: {1 - dist.item()}")

Query: Feynman
                              Richard Feynman
https://en.wikipedia.org/wiki/Richard_Feynman
Cosine similarity: 0.6201736330986023
Query: Victor Hugo
                              Victor Hugo
https://en.wikipedia.org/wiki/Victor_Hugo
Cosine similarity: 1.0
Query: Dostoïevski
                             Andrey Dostoyevsky
https://en.wikipedia.org/wiki/Andrey_Dostoevsky
Cosine similarity: 0.47245562076568604
Query: Martin Luther King
                            Martin Luther King Jr.
https://en.wikipedia.org/wiki/Martin_Luther_Kin...
Cosine similarity: 0.8944271802902222
Query: Cervantes
                              Annabel Cervantes
https://en.wikipedia.org/wiki/Annabel_Cervantes
Cosine similarity: 0.6831300258636475


In [45]:
%timeit index.knnQueryBatch(query_matrix, k=K)

3.2 ms ± 263 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## References

- Principles of Mathematical Analysis, 3rd edition (Rudin, 1976)
- A relational model of data for large shared data banks (E. F. Codd, 1970) [PDF](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf)
- Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces (P. N. Yianilos, 1993) [PDF](http://algorithmics.lsi.upc.edu/docs/practicas/p311-yianilos.pdf)
- Database system concepts, 7th edition (A. Silberschatz, H. F. Korth, S. Sudarshan, 2019)
- Relational Algebra (Wikipedia) [Link](https://en.wikipedia.org/wiki/Relational_algebra)
- Metric space (Wikipedia) [Link](https://en.wikipedia.org/wiki/Metric_space)
- String metric (Wikipedia) [Link](https://en.wikipedia.org/wiki/String_metric)
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (Y. A. Malkov, D. A. Yashunin, 2016) [Arxiv](https://arxiv.org/abs/1603.09320)
- Billion-scale similarity search with GPUs (J. Johnson, M. Douze, H. Jégou, 2019) [Arxiv](https://arxiv.org/abs/1702.08734) / [Github: FAISS library](https://github.com/facebookresearch/faiss)
- ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms (M. Aumüller, E. Bernhardsson, A. Faithfull, 2019) [Arxiv](https://arxiv.org/abs/1807.05614) / [Github](https://github.com/erikbern/ann-benchmarks)
- Engineering Efficient and Effective Non-Metric Space Library (L. Boytsov, B. Naidan, 2013) [PDF](http://boytsov.info/pubs/sisap2013.pdf) / [Github: NMSLIB](https://github.com/nmslib/nmslib)
- Product Quantization for Nearest Neighbor Search (H. Jégou, M. Douze, C. Schmid, 2011) [PDF](https://hal.inria.fr/inria-00514462v2)
- Video Google: A Text Retrieval Approach to Object Matching in Videos (J. Sivic, A. Zisserman, 2003) [PDF](https://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf)