# Locality-sensitive hashing for detecting coreferences

Problem: We want to find mentions that possibly refer to the same entity. Currently we compare every mention with every other mention: if mention 1 contains mention 0 as a word, we treat mention 0 and mention 1 as possible coreferences. The number of comparisons grows quadratically with the number of mentions, so this approach does not scale well.
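
To make the cost of the current approach concrete, here is a minimal sketch of the all-pairs check. It assumes that "contains as a word" means an exact match against one of the whitespace-separated tokens; the function name `pairwise_candidates` and the example mentions are hypothetical and not taken from the project code.

```python
from itertools import permutations


def pairwise_candidates(mentions):
    """Return ordered pairs (i, j) where mention i is a whole word of mention j.

    Assumption: "contains as a word" is read as an exact match against one of
    the whitespace-separated tokens of the longer mention.
    """
    pairs = []
    # Every ordered pair is inspected, i.e. n * (n - 1) comparisons.
    for i, j in permutations(range(len(mentions)), 2):
        if mentions[i] in mentions[j].split():
            pairs.append((i, j))
    return pairs


# Hypothetical mentions, for illustration only.
example = ["Smith", "John Smith", "Acme Corp", "Smith & Sons"]
pairs = pairwise_candidates(example)
n_comparisons = len(example) * (len(example) - 1)
print(pairs, n_comparisons)
```

Doubling the number of mentions roughly quadruples the number of comparisons, which is the scaling problem LSH is meant to address.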

Solution: explore locality-sensitive hashing (LSH) to reduce the number of comparisons. Optimized software for this exists, but I think a solution without external dependencies is a good start: there is no need to check compatibility with other requirements, and the data are already preprocessed, which should keep the task computationally simple.

How does LSH work?
1. Shingling: convert text to sparse vectors. I will start with shingling at the word level and think about alternatives later.
    - One-hot encoding
    - Define vocabulary of all shingles
2. MinHashing: create signatures by randomly reordering the vocabulary several times and recording, for each reordering, the first shingle that occurs in mention $i$.
3. Banding: cut the signature into bands and, using a hash function, assign all mentions to buckets band by band.
    - More bands (with a fixed signature length, so fewer rows per band) make collisions more likely, which means larger buckets
    - The same hash function must be used for a given band $j$ across all mentions. Different bands may use different functions, but this is not required.
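
The three steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation in the `lsh` module used below: the function names (`word_shingles`, `lsh_buckets`, `candidate_pairs`) and the parameters `n_hashes` and `n_bands` are hypothetical, shingles are lowercased words, and Python's built-in dictionary hashing of the band tuple stands in for the bucket hash function.

```python
import random
from collections import defaultdict
from itertools import combinations


def word_shingles(text):
    """Step 1: word-level shingles (lowercased, an assumption of this sketch)."""
    return set(text.lower().split())


def lsh_buckets(mentions, n_hashes=20, n_bands=10, seed=0):
    """Steps 2 and 3: MinHash signatures, then banding into buckets."""
    rng = random.Random(seed)
    shingle_sets = {i: word_shingles(m) for i, m in mentions.items()}
    vocab = sorted(set().union(*shingle_sets.values()))

    # Each random reordering of the vocabulary maps a shingle to its rank.
    rankings = []
    for _ in range(n_hashes):
        order = vocab[:]
        rng.shuffle(order)
        rankings.append({shingle: rank for rank, shingle in enumerate(order)})

    rows = n_hashes // n_bands
    buckets = defaultdict(set)
    for i, shingles in shingle_sets.items():
        if not shingles:
            continue  # an empty mention has no signature
        # The minimum rank identifies the first shingle of the mention
        # in the reordered vocabulary.
        signature = tuple(min(rank[s] for s in shingles) for rank in rankings)
        for b in range(n_bands):
            band = signature[b * rows:(b + 1) * rows]
            # The band index is part of the key, so bands never mix.
            buckets[(b, band)].add(i)
    return buckets


def candidate_pairs(buckets):
    """Mentions sharing at least one bucket become candidate pairs."""
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical mentions, for illustration only.
demo = {0: "John Smith", 1: "john smith", 2: "Acme Corp"}
print(candidate_pairs(lsh_buckets(demo)))
```

Only mentions that land in the same bucket for at least one band need to be compared afterwards, which is where the reduction in comparisons comes from.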

In [1]:
from load_coreferences import load_coreferences
import lsh 
import copy


0. Load data 

In [2]:
raw_mentions = load_coreferences()


In [3]:
mentions = {i: m for i, m in enumerate(raw_mentions)}
# Duplicate the mentions so that the input is `scaling_factor` times larger.

mentions_scaled = copy.copy(mentions)

idx = len(mentions_scaled)
scaling_factor = 3
for _ in range(1, scaling_factor):
    for m in mentions.values():
        # append each original mention again under a fresh index
        mentions_scaled[idx] = m
        idx += 1

In [4]:
len(mentions_scaled)

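# Use a separate name so that the `lsh` module is not shadowed when the cell is re-run.
lsh_model = lsh.LSH(mentions=mentions_scaled, shingle_size=3, signature_size=200, n_buckets=2)

lsh_model.cluster()
lsh_model.summarise()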

# adjust the sizing according to the rules so that we have log-time complexity!


took 1.4494690895080566 seconds for 522 mentions
average, min, max cluster size: 11.38, 5, 41
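
One way to reason about the sizing is the standard banding formula: with $b$ bands of $r$ rows each, two mentions with Jaccard similarity $s$ share at least one bucket with probability $1 - (1 - s^r)^b$. The sketch below tabulates this for a signature of length 200, as in the call above. Whether this is the rule the comment in the cell refers to is an assumption, and the helper `candidate_probability` and the chosen band counts are hypothetical.

```python
def candidate_probability(similarity, n_bands, rows_per_band):
    """Probability that two mentions share at least one bucket."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** n_bands


signature_size = 200  # same signature length as in the call above
for n_bands in (10, 20, 50, 100):
    rows = signature_size // n_bands
    # Approximate similarity at which the probability rises most steeply.
    threshold = (1.0 / n_bands) ** (1.0 / rows)
    p_half = candidate_probability(0.5, n_bands, rows)
    p_high = candidate_probability(0.8, n_bands, rows)
    print(f"bands={n_bands:3d} rows={rows:2d} threshold~{threshold:.2f} "
          f"P(s=0.5)={p_half:.3f} P(s=0.8)={p_high:.3f}")
```

More bands lower the similarity threshold, so more pairs become candidates; fewer bands make the filter stricter.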


Define comparison groups

Next steps
- how does it scale?
    - try it out with the prototype?
- optimize for the use case (number of hashes, size of subvectors)
- numpy? use external library? 
    - https://github.com/spotify/annoy
    - https://faiss.ai/
- look at typical coreferences: use other examples for the above! 
- should we change the definition of what is a coreference? 
- add test data, i.e. precision/recall for the input data? (after proper classification)
    - also add some mentions that are not coreferences (why not all?)
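
For the evaluation idea in the list above, a minimal sketch of pair-level precision and recall could look as follows. It assumes that labeled coreferent pairs are available as index pairs and that pairs are treated as unordered; the function name `pair_precision_recall` and the example pairs are hypothetical.

```python
def pair_precision_recall(candidate_pairs, true_pairs):
    """Compare candidate pairs against labeled coreferent pairs.

    Pairs are treated as unordered, so (1, 0) and (0, 1) count as the same pair.
    """
    candidates = {tuple(sorted(p)) for p in candidate_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    hits = len(candidates & truth)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical pairs, for illustration only.
precision, recall = pair_precision_recall(
    candidate_pairs=[(0, 1), (2, 3), (1, 0)],
    true_pairs=[(0, 1), (4, 5)],
)
print(precision, recall)
```

Including mentions that are not coreferences in the test data matters here: without them, precision cannot distinguish a selective bucketing from one that puts everything into a single bucket.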
