# Locality-sensitive hashing for detecting coreferences

Problem: We want to find mentions that could refer to the same entity. Currently we compare every mention with every other mention, and if mention 1 contains mention 0 as a word, we consider mention 0 and mention 1 potential coreferences. The number of comparisons grows quadratically with the number of mentions, so this approach does not scale well.
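
For reference, the pairwise baseline described above can be sketched as follows. This is a minimal sketch, not the code currently in use: the name `naive_candidates`, whitespace tokenization, and exact (case-sensitive) matching are all assumptions.

```python
from itertools import permutations

def naive_candidates(mentions):
    """Return index pairs (i, j) where mention i occurs as a word in mention j.

    Hypothetical helper: it checks every ordered pair, so the number of
    comparisons grows quadratically with the number of mentions.
    """
    words = [set(m.split()) for m in mentions]
    pairs = []
    for i, j in permutations(range(len(mentions)), 2):
        # skip exact duplicates; keep cases like "Hendrix" inside "Jimi Hendrix"
        if mentions[i] != mentions[j] and mentions[i] in words[j]:
            pairs.append((i, j))
    return pairs

example = ["Hendrix", "Jimi Hendrix", "Bonn"]
print(naive_candidates(example))  # [(0, 1)]
```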

Solution: use locality-sensitive hashing (LSH) to reduce the number of comparisons. Optimized software for this exists, but I think a good start is a solution without external dependencies: there is no need to check compatibility with other requirements, and the data are already preprocessed, which should keep the task computationally simple.

How does LSH work?
1. Shingling: convert text to sparse vectors. I will start with shingling at the word level and think about alternatives later.
    - One-hot encoding
    - Define vocabulary of all shingles
2. MinHashing: create signatures by randomly reordering the vocabulary and, for each reordering, recording the first shingle that occurs in mention $i$. Each reordering contributes one entry to the signature.
3. Cut the signature into bands and use a hash function to assign every mention to one bucket per band.
    - More bands (shorter bands for a fixed signature size) means larger buckets
    - The same hash function must be used for band $j$ across all mentions. Different bands can use different functions, but this is not required.
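
The three steps above can be sketched in plain Python. This is a minimal illustration, not the implementation in `lsh`: all function names are hypothetical, the shingles are character k-grams (an assumption), and Python's built-in tuple hashing stands in for the band hash function.

```python
import random
from collections import defaultdict

def shingle(text, k):
    """Set of character k-shingles; strings of length <= k yield one shingle."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signatures(shingle_sets, signature_size, seed=0):
    """One signature entry per random reordering of the vocabulary."""
    vocab = sorted(set().union(*shingle_sets))
    rng = random.Random(seed)
    signatures = [[] for _ in shingle_sets]
    for _ in range(signature_size):
        order = vocab[:]
        rng.shuffle(order)
        rank = {s: r for r, s in enumerate(order)}
        for sig, shingles in zip(signatures, shingle_sets):
            # rank of the first shingle (in the shuffled order) the mention contains
            sig.append(min(rank[s] for s in shingles))
    return signatures

def band_buckets(signatures, band_length):
    """Group mention indices whose signatures agree on an entire band."""
    buckets = defaultdict(set)
    for idx, sig in enumerate(signatures):
        for start in range(0, len(sig), band_length):
            band = tuple(sig[start:start + band_length])
            # the band position is part of the key, so bands never mix
            buckets[(start, band)].add(idx)
    return buckets

def candidate_pairs(buckets):
    """All index pairs that share at least one bucket."""
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs

mentions = ["British", "Britain", "British", "Bonn"]
sets = [shingle(m, 3) for m in mentions]
sigs = minhash_signatures(sets, signature_size=20)
pairs = candidate_pairs(band_buckets(sigs, band_length=2))
print(sorted(pairs))
```

Identical mentions always get identical signatures and therefore always share a bucket. Mentions with no shingle in common, such as "British" and "Bonn", never do. Whether "British" and "Britain" become a candidate pair depends on the random reorderings.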

Next steps, 4/1/23

1. Adjust lsh to use only lists, no dict
    - fix LSHBitSampling and LSHMinHash_nonp
2. Write tests
    - [x] correctness of cols_to_int
    - [ ] notebook with speed tests for different sorting options (not for now)
    - [x] correctness of coreference classification: precision and recall for the clustering
    - [x] optimize for current use case (shingle size, size of bands, ...)
3. Understand better
    - [ ] what do shingles do? smaller ->?
    - [ ] what does band size do? how does it interact with shingle size? does one compensate for the other? does one scale better than the other? (for optimization)
4. Document
    - [ ] write what the functions/classes do
    - [ ] `cols_to_int`:
        - write down/explain the idea/mechanics
        - how far can we go until we hit an overflow error? what to do in this case?
            - advise user to change the size of the bands (?)
            - switch to string operations, which are (much) slower but should still work?
5. Integrate with REL
    - tests?
    - compare timing of the new and old approach
6. Optimize further?
    - alternative for minhashing? improve speed or effectiveness
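
On the band-size question in the list above, standard LSH theory gives a useful starting point: if two mentions have Jaccard similarity $s$ and the signature is cut into $b$ bands of $r$ entries each, they share at least one bucket with probability $1 - (1 - s^r)^b$. The sketch below evaluates this for a given signature size; the function name `candidate_probability` is hypothetical, and it assumes the signature size is a multiple of the band length.

```python
def candidate_probability(s, band_length, signature_size):
    """Probability that two items with Jaccard similarity s share a bucket in at least one band."""
    n_bands = signature_size // band_length
    return 1 - (1 - s ** band_length) ** n_bands

# compare short and long bands for the same signature size
for s in (0.2, 0.5, 0.8):
    short = candidate_probability(s, band_length=2, signature_size=50)
    long_ = candidate_probability(s, band_length=5, signature_size=50)
    print(s, round(short, 3), round(long_, 3))
```

Shorter bands raise the collision probability at every similarity level, which increases recall but also produces larger clusters.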

In [1]:
from load_coreferences import load_coreferences
import lsh 
import copy

import faiss 
import numpy as np

0. Load data 

In [2]:
raw_mentions = load_coreferences(drop_duplicates=False)


In [3]:
len(raw_mentions)

632

In [5]:
# index the mentions by their position
mentions = {i: m for i, m in enumerate(raw_mentions)}

# scale up the data by stacking copies of the mentions on top of each other;
# with scaling_factor = 1 no copies are added
mentions_scaled = copy.copy(mentions)

idx = len(mentions_scaled)
scaling_factor = 1
for _ in range(1, scaling_factor):
    for m in mentions.values():
        mentions_scaled[idx] = m
        idx += 1

## MinHashing with numpy

In [10]:
len(mentions_scaled)

mylsh = lsh.LSHMinHash(mentions=mentions_scaled, shingle_size=3, signature_size=50, band_length=2)

mylsh.cluster()
mylsh.summarise()

took 0.2159130573272705 seconds for 632 mentions
average, min, max cluster size: 17.8, 0, 49



## Notes for using numpy 
- What is the exact time complexity here? It is certainly more than linear; check the theory.


### How to integrate with REL?

In [4]:
# out = {i: {"shingles": lsh.k_shingle(m, 4)} for i, m in zip(range(len(mentions_rel), mentions)) }
# mentions_rel
mentions_rel = [
    'German', 'British', 'Brussels', 'European Commission', 'German',
    'British', 'Germany', 'European Union', 'Britain', 'Commission', 
    'European Union', 'Franz Fischler', 'Britain', 'France', 'BSE', 
    'Spanish', 'Loyola de Palacio', 'France', 'Britain', 'BSE', 'British', 'German', 
    'British', 'Europe', 'Germany', 'Bonn', 'British', 'Germany', 'Britain', 'British'
]

mylsh = lsh.LSHMinHash(mentions=mentions_rel, shingle_size=4, signature_size=300, band_length=2)
mylsh.cluster()
mylsh.summarise()



took 0.024751663208007812 seconds for 30 mentions
average, min, max cluster size: 8.73, 0, 19


## with FAISS

In [6]:

mylsh = lsh.LSHBitSampling(mentions=mentions_scaled, shingle_size=4)

mylsh.faiss_neighbors(k=10, nbits=100)
neighbors = mylsh.neighbors_to_dict(mentions_scaled)
mylsh.summarise() # note: 50% of the time here is taken for extracting the neighbors! can this be made faster? ie with numpy?

# note: this approach uses quite a lot of memory because it stores the full vectors and shingles. change?
# here we see the problem: there are 20 duplicates of every mention, so the 10 nearest neighbors all come from the same bucket
    # how to avoid this? we do not know the number of duplicates beforehand. count them? create unique mentions first?

AttributeError: 'LSHBitSampling' object has no attribute 'mentions'

Notes 
- It is important to use the binary vectors. With the min-hashed vectors, long words end up closer to other long words (and short words to short ones).
- Higher k -> higher recall
    - a low k may not capture the actual coreferences; e.g., for k=4 it missed them for "eagles", "rosati", "belo"
    - using k=10 fixes the problem for "eagles" and "belo" but not for "rosati"


Next steps
- scaling -- seems very good 
- precision/recall of whole pipeline (including the search for coreferences)
- how easy is it to integrate FAISS into REL?
- a problem could arise when many mentions refer to the same (single) mention, e.g. 10 times "jimi" and once "jimi hendrix".
    - I think we would miss "jimi hendrix" for k=10
    - Possible solutions:
        - larger k? -> quadratic cost again
        - restrict the Hamming distance of the neighbors, i.e. forbid them to be in the same bucket (I think this would also already exclude the element itself)
        - or just use the unique mentions
    - do some more testing for this
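
The last option, searching over the unique mentions only, can be sketched without FAISS. This is a minimal sketch assuming duplicates are exact string matches; `group_duplicates` is a hypothetical helper, not part of `lsh`.

```python
from collections import defaultdict

def group_duplicates(mentions):
    """Map each distinct mention string to the list of positions where it occurs."""
    positions = defaultdict(list)
    for idx, mention in enumerate(mentions):
        positions[mention].append(idx)
    return dict(positions)

mentions = ["jimi"] * 10 + ["jimi hendrix"]
positions = group_duplicates(mentions)
unique_mentions = list(positions)  # run the neighbor search over these
print(unique_mentions)             # ['jimi', 'jimi hendrix']
print(len(positions["jimi"]))      # 10
```

With the search run over `unique_mentions`, k counts distinct strings, so the ten copies of "jimi" can no longer crowd "jimi hendrix" out of the neighbor list. The `positions` mapping then carries each result back to all original occurrences.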






## with MinHashing (does not scale well)

In [8]:
len(mentions_scaled)

mylsh = lsh.LSHMinHash_nonp(mentions=mentions_scaled, shingle_size=3, signature_size=200, n_buckets=2)

mylsh.cluster()
mylsh.summarise()

# adjust the sizing according to the rules so that we have log-time complexity!

took 49.10193204879761 seconds for 3480 mentions
average, min, max cluster size: 78.08, 39, 259
