# Locality-sensitive hashing for detecting coreferences

Problem: We want to find mentions that could refer to the same entity. Currently we compare each mention with every other mention, and if mention 1 contains mention 0 as a word, we consider mention 0 and mention 1 potential coreferences. The number of comparisons grows quadratically with the number of mentions, so this approach does not scale well.
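
To make the cost of the current approach concrete, here is a minimal sketch of the pairwise comparison. It assumes that mentions are plain strings and that "contains as a word" means a case-insensitive match against the whitespace-separated tokens of the other mention; the function name `naive_coreference_candidates` is made up for illustration and is not part of the existing code.

```python
def naive_coreference_candidates(mentions):
    """Compare every mention with every other one (quadratic in len(mentions))."""
    candidates = {m: set() for m in mentions}
    for short in mentions:
        for long in mentions:
            if short == long:
                continue
            # Assumption: "contains as a word" = `short` equals one of the
            # whitespace-separated tokens of `long`, ignoring case.
            if short.lower() in long.lower().split():
                candidates[short].add(long)
    return candidates


example = ["Jimi", "Jimi Hendrix", "Hendrix", "Eagles"]
pairs = naive_coreference_candidates(example)
print(pairs["Jimi"])    # {'Jimi Hendrix'}
print(pairs["Eagles"])  # set()
```

The two nested loops are what LSH is meant to avoid: every mention is checked against all others, even those that share no word at all.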

Solution: explore locality-sensitive hashing (LSH) to reduce the number of comparisons. While optimized software is available for this, I think a good start is a solution without external dependencies: there is no need to check compatibility with other requirements, and the data are already preprocessed, which should keep the task computationally simple.

How does LSH work?
1. Shingling: convert text to sparse vectors. I will start with shingling at the word level and think about alternatives later.
    - One-hot encoding
    - Define vocabulary of all shingles
2. MinHashing: create signatures by randomly reordering the vocabulary and recording, for each reordering, the first shingle that is in mention $i$.
3. Banding: cut the signature into bands and assign all mentions to buckets using a hash function.
    - For a fixed signature length, more bands means shorter bands, so more mentions collide and the buckets get larger
    - Need to use the same function for the same band $j$ for all mentions. Can use different functions for different bands, but not required.
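
The three steps can be sketched in plain Python as follows. This is only an illustration under assumptions, not the code in the `lsh` module: all function names are made up, shingles are lower-cased words, mentions are assumed to be non-empty, and Python's built-in dictionary hashing stands in for the bucket hash function.

```python
import random
from collections import defaultdict


def shingle(mention):
    # Step 1a: word-level shingles (assumption: lower-cased whitespace tokens)
    return set(mention.lower().split())


def build_vocab(shingle_sets):
    # Step 1b: the vocabulary is the sorted union of all shingles
    return sorted(set().union(*shingle_sets))


def one_hot(shingles, vocab):
    # Step 1c: sparse binary vector over the vocabulary
    return [1 if s in shingles else 0 for s in vocab]


def minhash_signature(vector, permutations):
    # Step 2: for each random reordering of the vocabulary positions, record
    # the first position whose bit is set (assumes at least one bit is set)
    return [next(pos for pos in perm if vector[pos]) for perm in permutations]


def band_buckets(signatures, n_bands):
    # Step 3: cut each signature into n_bands pieces and bucket them.
    # The band number j is part of the key, so bands never mix; the dict
    # hashes the (j, band) tuple, i.e. one hash function for all bands.
    rows = len(next(iter(signatures.values()))) // n_bands
    buckets = defaultdict(set)
    for idx, sig in signatures.items():
        for j in range(n_bands):
            band = tuple(sig[j * rows:(j + 1) * rows])
            buckets[(j, band)].add(idx)
    return buckets


mentions = {0: "jimi", 1: "jimi hendrix", 2: "eagles", 3: "jimi"}
shingle_sets = {i: shingle(m) for i, m in mentions.items()}
vocab = build_vocab(shingle_sets.values())
rng = random.Random(0)
permutations = [rng.sample(range(len(vocab)), len(vocab)) for _ in range(20)]
signatures = {
    i: minhash_signature(one_hot(s, vocab), permutations)
    for i, s in shingle_sets.items()
}
buckets = band_buckets(signatures, n_bands=10)
# Only mentions that share at least one bucket need to be compared
candidate_groups = {frozenset(b) for b in buckets.values() if len(b) > 1}
print(candidate_groups)
```

Identical mentions (0 and 3) always land in the same buckets, and "eagles" never shares a bucket with the others because it has no shingle in common with them. Whether "jimi" and "jimi hendrix" meet depends on the random reorderings and on the number of bands.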

In [1]:
from load_coreferences import load_coreferences
import lsh 
import copy

import faiss 


0. Load data 

In [2]:
raw_mentions = load_coreferences()


In [3]:
mentions = {i: m for i, m in enumerate(raw_mentions)}
# stack them on top of each other 

mentions_scaled = copy.copy(mentions)

idx = len(mentions_scaled)
scaling_factor = 20
for i in range(1, scaling_factor):
    for idx_old in mentions.keys():
        m = mentions[idx_old]
        mentions_scaled[idx] = m 
        idx += 1

## with FAISS

In [4]:

mylsh = lsh.LSHBitSampling(mentions=mentions_scaled, shingle_size=4)

mylsh.faiss_neighbors(k=10, nbits=100)
neighbors = mylsh.neighbors_to_dict(mentions_scaled)
mylsh.summarise() # note: 50% of the time here is spent extracting the neighbors! can this be made faster, i.e. with numpy?

# note: this approach uses quite some memory because it stores the full vectors and shingles. change?
# here we see the problem: there are 20 duplicates of every mention, so the 10 nearest neighbors are all duplicates in the same bucket
    # how to avoid? we do not know the number of duplicates beforehand... count? somehow create unique mentions to avoid this? 

Took 0.550947904586792 seconds to classify 3480 mentions


Notes 
- It is important to use the binary vectors. With the min-hashed vectors, longer words end up closer to longer ones (and shorter to shorter)
- higher k -> higher recall
    - low k may not capture the actual coreferences, i.e. k=4 missed them for "eagles", "rosati", "belo"
    - k=10 fixes the problem for "eagles" and "belo" but not for "rosati"


Next steps
- scaling -- seems very good 
- precision/recall of whole pipeline (including the search for coreferences)
- how easy is it to integrate FAISS into REL?
- a problem could arise when many identical mentions refer to the same (single) mention, i.e. 10 times "jimi" and once "jimi hendrix"
    - I think we would miss "jimi hendrix" for k=10
    - Solutions: larger k? -> quadratic cost again 
    - Restrict the Hamming distance of the neighbors, i.e. forbid them to be in the same bucket (I think this would also exclude the query mention itself)
    - do some more testing for this 
    - or just use the unique mentions
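
The last option, using only the unique mentions, could look roughly like this. It is a sketch under the assumption that mentions are stored as an `{index: text}` dictionary like `mentions_scaled`; the helper `deduplicate` is hypothetical and does not exist in the `lsh` module.

```python
from collections import defaultdict


def deduplicate(mentions):
    """Collapse identical mention strings and remember where each one occurred."""
    positions = defaultdict(list)
    for idx, text in mentions.items():
        positions[text].append(idx)
    # Re-index the distinct strings so they can be fed to the neighbor search
    unique = dict(enumerate(positions))
    return unique, dict(positions)


# 10 times "jimi" and once "jimi hendrix", as in the example above
mentions = {i: "jimi" for i in range(10)}
mentions[10] = "jimi hendrix"

unique, positions = deduplicate(mentions)
print(unique)                     # {0: 'jimi', 1: 'jimi hendrix'}
print(positions["jimi hendrix"])  # [10]
```

With only two distinct strings left, the duplicates of "jimi" can no longer fill up the k nearest neighbors, and `positions` maps the result back to the original indices.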






## with MinHashing (does not scale well)

In [8]:
len(mentions_scaled)

mylsh = lsh.LSHMinHash(mentions=mentions_scaled, shingle_size=3, signature_size=200, n_buckets=2)

mylsh.cluster()
mylsh.summarise()

# TODO: adjust the sizing (signature size, number of bands) according to the LSH rules so that we get log-time complexity!

took 49.10193204879761 seconds for 3480 mentions
average, min, max cluster size: 78.08, 39, 259
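
For choosing the sizes, the standard result for banded MinHash is useful: with $b$ bands of $r$ rows each, two mentions with Jaccard similarity $s$ share at least one bucket with probability $1 - (1 - s^r)^b$. The snippet below only evaluates this formula; how `signature_size` and `n_buckets` of `lsh.LSHMinHash` map onto $b$ and $r$ is an assumption that needs to be checked against the implementation, and the settings shown are arbitrary examples.

```python
def candidate_probability(s, n_bands, rows_per_band):
    """Probability that two items with Jaccard similarity s share a bucket."""
    return 1 - (1 - s ** rows_per_band) ** n_bands


# Example settings that all use a signature of length 200 (b * r = 200)
for n_bands, rows in [(2, 100), (20, 10), (50, 4)]:
    probs = [candidate_probability(s, n_bands, rows) for s in (0.2, 0.5, 0.8)]
    print(n_bands, rows, [round(p, 4) for p in probs])
```

Few long bands make collisions rare even for similar mentions, while many short bands make almost everything a candidate, so the split controls the trade-off between recall and the number of comparisons.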


Define comparison groups

next steps
- how does it scale?
    - try it out with the prototype?
- optimize for the use case (number of hashes, size of subvectors)
- numpy? use external library? 
    - https://github.com/spotify/annoy
    - https://faiss.ai/
- look at typical coreferences: use other examples for the above! 
- should we change the definition of what is a coreference? 
- add test data, i.e. precision/recall for the input data? (after proper classification)
    - also add some mentions that are not coreferences (why not all?)


## Old stuff: trying out FAISS

In [6]:
mylsh = lsh.LSH(mentions=mentions_scaled, shingle_size=3, signature_size=200, n_buckets=2)

mylsh._build_vocab()
mylsh._one_hot_encode()
mylsh._min_hash()
mylsh._make_bands()




In [10]:
#mylsh.mentions[0]["signature"]

In [24]:
import numpy as np
d = 64                           # dimension -- number of columns
nb = 100000                      # database size -- number of vectors (=number of rows)
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

True
100000


In [47]:
# help(type(faiss.IndexFlatL2))
sum(xb[:10, :] != xb[:10])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [48]:
k = 4                     # we want to see 4 nearest neighbors
D, I = index.search(xb[:10], k) 
print(I)
print(D)

# xb[:10] is the query matrix: the first 10 rows. maybe more intuitive: xb[:10, :]
    # i.e., for the first 10 vectors in the database, we look for the k closest neighbors in the whole database (which includes the query vectors themselves)
# what are the outputs?
    # D: distance matrix nq-by-k: number of rows = size of query, number of columns = number of neighbors
        # ordered by increasing distance, i.e. D[:, 0] is the distance to the nearest neighbor, D[:, 1] is the distance to the second nearest neighbor
        # the distance is defined by the index type (L2 here; IndexFlatL2 reports squared L2 distances)
    # I: row i contains the IDs of the neighbors of query vector i, sorted by increasing distance
        # i.e., below, the nearest neighbor of vector 0 is itself. the next nearest neighbor is vector 393
        # the nearest neighbor of vector 1 is also itself. 




[[   0  393  363   78]
 [   1  555  277  364]
 [   2  304  101   13]
 [   3  173   18  182]
 [   4  288  370  531]
 [   5  936  817 1316]
 [   6   35  142 1021]
 [   7  175  415  673]
 [   8   18  434   84]
 [   9 1076  622  801]]
[[0.        7.1751738 7.20763   7.2511625]
 [0.        6.3235645 6.684581  6.799946 ]
 [0.        5.7964087 6.391736  7.2815123]
 [0.        7.2779055 7.527987  7.6628466]
 [0.        6.7638035 7.2951202 7.3688145]
 [0.        6.8761454 7.136672  7.2297354]
 [0.        6.765587  7.6871905 7.940848 ]
 [0.        6.2155056 6.7339525 6.762247 ]
 [0.        7.0762296 7.300755  7.542604 ]
 [0.        6.2345862 6.455964  6.6127834]]
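
To check the reading of `D` and `I`, the same search can be written as a brute-force loop in plain Python. This is a toy re-implementation for illustration, not how FAISS works internally; the function name `knn_squared_l2` and the tiny database are made up. It uses squared L2 distances, which is what `IndexFlatL2` reports.

```python
def knn_squared_l2(database, queries, k):
    """Brute-force k nearest neighbors; returns (distances, ids) like index.search."""
    all_dists, all_ids = [], []
    for q in queries:
        # Squared L2 distance from the query to every database vector
        dists = [sum((a - b) ** 2 for a, b in zip(q, x)) for x in database]
        # IDs of the k closest vectors, sorted by increasing distance
        order = sorted(range(len(database)), key=lambda i: dists[i])[:k]
        all_ids.append(order)
        all_dists.append([dists[i] for i in order])
    return all_dists, all_ids


database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
D, I = knn_squared_l2(database, database[:2], k=3)
print(I)  # [[0, 1, 2], [1, 0, 2]]
print(D)  # [[0.0, 1.0, 4.0], [0.0, 1.0, 5.0]]
```

As in the FAISS output, each query taken from the database finds itself first with distance 0, and each row is ordered by increasing distance.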


In [13]:
k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[0.        7.1751738 7.20763   7.2511625]
 [0.        6.3235645 6.684581  6.799946 ]
 [0.        5.7964087 6.391736  7.2815123]
 [0.        7.2779055 7.527987  7.6628466]
 [0.        6.7638035 7.2951202 7.3688145]]
[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]


try alternative distances

In [102]:
nbits = 15 # is this the length of the signature?? but the signature is already in the dense vector? 
index = faiss.IndexLSH(d, nbits)   # build the index
print(index.is_trained)
index.add(xb)                 # add vectors to the index
print(index.ntotal)

True
100000


In [103]:
k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check -- for short nbits, it may not even assign itself as the first closest neighbor
print(I) # the nearest neighbors change when changing the number of bits (and therefore buckets)
print(D) # the distance here indicates whether the comparison records are in the same bucket or not
    # what do the integers mean here? 0 = same bucket, 1=? 2=?
    # the distance is the hamming distance? -- taken from the signature? ie, signature more different if distance is larger? 


# so nbits refers to the length of the subvector that we cut for comparing hashes? this would mean that nbits cannot be larger than the dimensionality?
    # i am still not sure how to go from the theory/simple prototype to the faiss library 



[[  0  86  73  29]
 [  1 215 239 199]
 [  2 528 282  69]
 [  3 754 415  61]
 [  4 178 204   8]]
[[0. 1. 1. 1.]
 [0. 1. 1. 1.]
 [0. 0. 1. 1.]
 [0. 1. 1. 2.]
 [0. 1. 1. 1.]]
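
One possible reading of the integer distances above: each vector is reduced to a binary code of `nbits` bits, and the reported distance is the Hamming distance between two codes (the number of bits that differ). The sketch below illustrates that idea with signs of random projections in plain Python. It is an assumption about what `faiss.IndexLSH` does, not a description taken from its source, and all names in it are made up.

```python
import random


def random_hyperplanes(d, nbits, seed=0):
    # One random direction per output bit
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(nbits)]


def encode(vector, planes):
    # Bit j is 1 if the vector lies on the non-negative side of hyperplane j
    return [1 if sum(w * x for w, x in zip(p, vector)) >= 0 else 0 for p in planes]


def hamming(a, b):
    # Number of positions in which the two codes differ
    return sum(x != y for x, y in zip(a, b))


planes = random_hyperplanes(d=4, nbits=16)
a = [1.0, 0.5, -0.2, 0.3]
b = [1.0, 0.4, -0.1, 0.3]  # close to a
c = [-x for x in a]        # points in the opposite direction of a

code_a, code_b, code_c = (encode(v, planes) for v in (a, b, c))
print(hamming(code_a, code_a))  # 0: identical codes
print(hamming(code_a, code_b))  # small for similar vectors
print(hamming(code_a, code_c))  # 16: every sign flips
```

In this sketch the number of bits is independent of the input dimension (16 bits for 4-dimensional vectors), and a distance of 0 means two vectors have exactly the same code. Whether the same holds for `nbits` in FAISS still needs to be confirmed.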


In [219]:
mylsh = lsh.LSH(mentions=mentions_scaled, shingle_size=4, signature_size=300, n_buckets=2) # shingle_size=5 does not work...

mylsh.cluster()


signatures = [v["vector"] for v in mylsh.mentions.values()] # use the binary vectors, not the min-hashed ones! (see Wikipedia https://en.wikipedia.org/wiki/Locality-sensitive_hashing#cite_note-IndykMotwani98-5)
 # min hash is for a different approach!
xb = np.stack([np.array(s) for s in signatures ]).astype('float32')

# signature = mylsh.mentions[0]["signature"]
# len(signature)

d = xb.shape[1]
nbits = 100 # is this the length of the signature?? but the signature is already in the dense vector? 
index = faiss.IndexLSH(d, nbits)   # build the index
print(index.is_trained)
index.add(xb)                 # add vectors to the index
print(index.ntotal)

k = 10                 
D, I = index.search(xb, k) # sanity check -- for short nbits, it may not even assign itself as the first closest neighbor
print(I) # the nearest neighbors change when changing the number of bits (and therefore buckets)
print(D) # the distance here indicates whether the comparison records are in the same bucket or not


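def extract_neighbors(mentions, I):
    """Map each mention text to the texts of its nearest neighbors."""
    neighbors = {}
    for row, mention in enumerate(mentions.values()):
        neighbor_ids = list(I[row])[1:] # ignore the first hit (usually the query itself)
        neighbors[mention] = [mentions[j] for j in neighbor_ids]
    return neighbors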


neighbors = extract_neighbors(mentions_scaled, I=I)
neighbors


True
174
[[  0 139  19 ...  24 105  16]
 [  1 152  25 ... 134  31  97]
 [  2  17 142 ... 112  83   3]
 ...
 [171 132 120 ... 147 125  39]
 [172  11  43 ...  52  84 116]
 [173 162 140 ... 119 139  60]]
[[ 0. 25. 38. ... 39. 41. 41.]
 [ 0. 24. 39. ... 42. 42. 43.]
 [ 0. 31. 38. ... 42. 43. 43.]
 ...
 [ 0. 33. 34. ... 41. 41. 41.]
 [ 0. 33. 36. ... 41. 41. 42.]
 [ 0. 33. 38. ... 40. 41. 41.]]
