# Locality Sensitive Hashing

In this notebook we will:
* Process the NLTK Twitter sample tweets and represent each tweet (a sequence of words) as a vector using a `Bag-of-Words (BOW)` model, in which the order of the words is ignored.
* Use locality sensitive hashing (LSH) and k-nearest neighbors to find tweets that are similar to a given tweet.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import twitter_samples
from utils import process_tweet, cosine_similarity, distance_cosine_score
import pickle


### Load Twitter Dataset from nltk library

In [2]:
# check available datasets
twitter_samples.fileids()


['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [41]:
positive_reviews = twitter_samples.strings('positive_tweets.json')
negative_reviews = twitter_samples.strings('negative_tweets.json')
all_reviews = positive_reviews + negative_reviews
all_reviews_arr = np.array(all_reviews)


In [4]:
# check count of samples
print('No. of positive review samples: ', len(positive_reviews))
print('No. of negative review samples: ', len(negative_reviews))
print('Total no. of tweet samples: ', len(all_reviews))

No. of positive review samples:  5000
No. of negative review samples:  5000
Total no. of tweet samples:  10000


In [5]:
# check out sample data
positive_reviews[4]

'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days'

*Note: Each entry in the reviews list represents a document, like the one shown above.*

### Load the English Word Embeddings

In [6]:
with open('./en_embeddings.p', 'rb') as file:
    en_embeddings = pickle.load(file)

#### Document embeddings
* Create a document embedding by summing the embeddings of all words in the document.
* If a word has no known embedding, ignore that word.

In [7]:
def get_document_embedding(document, en_embeddings):
    """
    Creates a document embedding vector by summing the word embeddings of the words present in the document.

    Params:
    -----------
    document: str
        Text of the document (e.g. a tweet)
    en_embeddings: dict
        English word embedding dictionary

    Returns:
    ----------
    document_embedding: numpy array
        The vector representation of the document.
    """

    tokenized_document = process_tweet(document)    # process and tokenize the document

    dim = len(list(en_embeddings.values())[0])      # get the dimensions of word embeddings


    document_embedding = np.zeros((dim, ))          # initialize the document embedding row vector
    for word in tokenized_document:
        document_embedding += en_embeddings.get(word, 0)
    
    return document_embedding
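
To see the summation on something small, here is a minimal sketch. It assumes a hypothetical three-word embedding dictionary (`toy_embeddings`) and uses a lowercase whitespace split as a stand-in for `process_tweet`, whose source lives in `utils` and is not shown here.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "happy": np.array([1.0, 0.0, 2.0]),
    "sad": np.array([-1.0, 0.5, 0.0]),
    "day": np.array([0.0, 1.0, 1.0]),
}

def toy_document_embedding(document, embeddings, dim=3):
    """Sum the embeddings of known words; unknown words contribute nothing."""
    total = np.zeros(dim)
    for word in document.lower().split():   # stand-in for process_tweet
        total += embeddings.get(word, 0)
    return total

# "xyzzy" has no embedding, so only "happy" and "day" are summed.
toy_vec = toy_document_embedding("Happy day xyzzy", toy_embeddings)
print(toy_vec)  # [1. 1. 3.]
```

A document made up entirely of unknown words ends up as the all-zero vector, which is worth keeping in mind when computing cosine similarity later.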


### Create the Document Embedding Matrix for given list of Documents

In [8]:
def get_document_vecs(all_docs, en_embeddings):
    """
    Creates the document embedding matrix for a list of tweets.

    Params:
    ----------
    all_docs: list of str
        all tweets in our dataset.
    en_embeddings: dict
        dictionary with words as the keys and their embeddings as the values.

    Returns:
    ----------
    document_vec_matrix: numpy array
        matrix of tweet embeddings.
    ind2Doc_dict: dict
        dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    """

    dim = len(list(en_embeddings.values())[0])      # get the dimensions of word embeddings

    document_vec_matrix = []        # initialize the document embedding vector matrix
    ind2Doc_dict = {}               # initialize dictionary with indices of tweets as keys and their embeddings as the values

    for idx, doc in enumerate(all_docs):
        doc_vec = get_document_embedding(doc, en_embeddings)
        document_vec_matrix.append(doc_vec)
        ind2Doc_dict[idx] = doc_vec
    
    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack((document_vec_matrix))
    
    return document_vec_matrix, ind2Doc_dict



In [9]:
document_vecs, ind2Tweet = get_document_vecs(all_reviews, en_embeddings)

In [10]:
print(f"length of dictionary {len(ind2Tweet)}")
print(f"shape of document_vecs {document_vecs.shape}")

length of dictionary 10000
shape of document_vecs (10000, 300)


### Finding the most similar tweets using Cosine Similarity
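
The cell in this section relies on `cosine_similarity` imported from `utils`, whose source is not shown here. As a rough sketch of what such a helper might compute, the hypothetical function `cosine_similarity_sketch` below scores a 1-D query against every row of a matrix. The `eps` guard for all-zero rows is an assumption of this sketch, not something known about the real helper.

```python
import numpy as np

def cosine_similarity_sketch(query, matrix, eps=1e-12):
    """Cosine similarity between a 1-D query and every row of a 2-D matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # eps avoids division by zero for all-zero rows
    # (e.g. documents in which no word has a known embedding)
    return dots / np.maximum(norms, eps)

docs = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [0.0, 0.0]])
sims = cosine_similarity_sketch(np.array([2.0, 0.0]), docs)
ranking = np.argsort(sims)[::-1]   # indices ordered from most to least similar
print(sims)
```

Reversing the ascending `argsort` puts the highest similarity first, which is the same pattern the cell below uses.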

In [51]:
# check which tweet in the dataset is similar to the given tweet
my_tweet = 'i am sad'
tweet_embedding = get_document_embedding(my_tweet, en_embeddings)
idx = np.argsort(cosine_similarity(tweet_embedding, document_vecs))[::-1]
print('Entered Tweet: ', my_tweet)
print('Most Similar Tweets are: ')
all_reviews_arr[idx[:10]]

Entered Tweet:  i am sad
Most Similar Tweets are: 


array(["@zoeeylim sad sad sad kid :( it's ok I help you watch the match HAHAHAHAHA",
       "@sahirlodhi Salam dear brother Eid Mubark &amp; very sorry i've missed ur all the shows on this eid..feeling bad+sad :-(",
       '@dischanmedia Its sad to hear about this, thank you so much for the overwhelmingly beautiful games, thank you for your hard work. :)',
       'I get so sad about Cory Monteith like very often. :( As I am sure a lot of people do. Gone too soon hits close to home. What could have been',
       'being sad for no reason sucks because u dunno how to stop being sad so u just gotta chill in ur room and listen to music &amp; b alone :(',
       "Omg happy late birthday @mariahjoyyy I'm so sorry I missed it :( love you though hope you had a lot of fun 😘🎉",
       '@AdityaRajKaul really thought you were one good journo.. But the lure of the gang I see is very strong.. Sad to see you too twisting news :(',
       '@carissakenga omg so sad then on your birthday sorry i didnt se

### Finding the most similar tweets using LSH

#### Choosing the number of planes

* Each plane divides the space into $2$ parts.
* So $n$ planes divide the space into $2^{n}$ hash buckets.
* We want to organize 10,000 document vectors into buckets so that each bucket holds about $16$ vectors.
* For that we need $\frac{10000}{16}=625$ buckets.
* We need the number of planes $n$ such that $2^{n} \geq 625$. Since $\log_{2}625 \approx 9.29$, we round up to $n = 10$, which gives $2^{10} = 1024$ buckets and about $10$ vectors per bucket on average.
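
The arithmetic above can be checked with a few lines of Python. The variable names here are illustrative only and are not used elsewhere in the notebook.

```python
import math

n_vectors = 10_000
target_bucket_size = 16

n_buckets_needed = n_vectors / target_bucket_size      # 625.0
n_planes = math.ceil(math.log2(n_buckets_needed))      # round 9.29 up to 10
actual_buckets = 2 ** n_planes                          # 1024
avg_per_bucket = n_vectors / actual_buckets             # a bit under 10

print(n_buckets_needed, n_planes, actual_buckets, round(avg_per_bucket, 2))
# 625.0 10 1024 9.77
```

Rounding up rather than down keeps the average bucket at or below the target size. With 9 planes there would be only 512 buckets and roughly 20 vectors per bucket.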

In [12]:
N_VECS = len(all_reviews)       # This many vectors.
N_DIMS = len(ind2Tweet[1])      # Vector dimensionality.
print(f"Number of vectors is {N_VECS} and each has {N_DIMS} dimensions.")

Number of vectors is 10000 and each has 300 dimensions.


In [13]:
# The number of planes. ceil(log2(625)) = 10 keeps the average bucket at or below ~16 vectors.
N_PLANES = 10
# Number of times to repeat the hashing to improve the search.
N_UNIVERSES = 25

#### Using Hyperplanes to split the vector space
Use a hyperplane to split the vector space into $2$ parts.
* All vectors whose dot product with a plane's normal vector is positive are on one side of the plane.
* All vectors whose dot product with the plane's normal vector is negative are on the other side of the plane.
* We calculate the dot product with each plane, in the same order for every vector, to get each vector's hash ID as a sequence of bits, like $[0, 1, 1, ... 0]$. Vectors that fall on the same side of every plane share the same hash ID.

#### Assigning a hash bucket to a vector
We use the vector's hash ID to assign the vector to a bucket with the formula below, where $h_{i} \in \{0, 1\}$ is the bit for plane $i$ and $N$ is the number of planes:
$$ \text{hash} = \sum_{i=0}^{N-1} \left( 2^{i} \times h_{i} \right) $$
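
As a small worked example of the formula, the sketch below hashes 2-D vectors against three hypothetical planes (`toy_planes`, one normal vector per column). The helper name `hash_from_signs` is illustrative. It treats a zero dot product as the positive side.

```python
import numpy as np

def hash_from_signs(v, planes):
    """Vectorized form of hash = sum_i 2^i * h_i for a 1-D vector v."""
    bits = (v @ planes >= 0).astype(int)        # h_i: 1 if the dot product is non-negative
    weights = 2 ** np.arange(planes.shape[1])   # 2^0, 2^1, ..., 2^(N-1)
    return int(bits @ weights)

# Three hypothetical planes in 2-D, one normal vector per column.
toy_planes = np.array([[1.0, -1.0,  0.0],
                       [0.0,  1.0, -1.0]])

# dot products [2, -1, -1] -> bits [1, 0, 0] -> hash 1
hash_a = hash_from_signs(np.array([2.0, 1.0]), toy_planes)
# dot products [-1, 4, -3] -> bits [0, 1, 0] -> hash 2
hash_b = hash_from_signs(np.array([-1.0, 3.0]), toy_planes)
print(hash_a, hash_b)  # 1 2
```

With three planes the hash can take any value from 0 to 7, one per combination of sides.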

#### Create the sets of planes
* Create a list of 25 sets of planes (the planes that divide up the region), one set per "universe".
* Each element of this list is a matrix with 300 rows (the word vectors have 300 dimensions) and 10 columns (there are 10 planes in each "universe").

In [14]:
np.random.seed(0)
planes_list = [np.random.normal(size=(N_DIMS, N_PLANES))
            for _ in range(N_UNIVERSES)]

#### Creating Hash Function

In [15]:
def hash_value_of_vector(v, planes):
    """Create a hash for a vector from the side of each plane on which it lies.

    Params:
    ----------
    v: numpy array
        vector of a tweet, as a 1-D array of shape (N_DIMS,)
    planes: numpy array
        matrix of dimension (N_DIMS, N_PLANES) - the set of planes that divide up the region
        
    Returns:
    ----------
    hash_value: int, scalar
        a number which is used as a hash for your vector
    """

    v_sign = np.sign(np.dot(v, planes))     # sign of the dot product with each plane's normal vector: -1, 0, or 1
    v_hash_ids = np.where(v_sign >= 0, 1, 0)        # one hash bit per plane: 1 if the sign is non-negative, otherwise 0

    hash_value = 0      # initialize the hash value that would be assigned to the vector
    for idx, hash_val in enumerate(v_hash_ids):
        hash_value += 2**(idx) * hash_val
    
    return int(hash_value)



In [23]:
# test the hash value generation of the vector
planes = planes_list[0]
v = np.random.normal(size=(300,))
hash_value_of_vector(v, planes)

248

#### Creating a Hash table

Given a hash value for each vector (or tweet), we want to create a hash table. With a hash table, given a hash value, one can quickly look up the vectors stored in the corresponding bucket. This can reduce search time significantly, because a query only needs to be compared against the vectors in its own bucket.

<div style="width:image width px; font-size:100%; text-align:center;"><img src='table.png' alt="alternate text" width="width" height="height" style="width:500px;height:200px;" />  </div>
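
A dictionary-based hash table can be sketched on toy data as follows. The names `toy_hash`, `toy_planes`, `toy_vecs`, and `buckets` are hypothetical. Using `defaultdict` so that only non-empty buckets are stored is a choice of this sketch, not a requirement of the method.

```python
from collections import defaultdict
import numpy as np

def toy_hash(v, planes):
    """Hash a 1-D vector by the side of each plane on which it lies."""
    bits = (v @ planes >= 0).astype(int)
    return int(bits @ (2 ** np.arange(planes.shape[1])))

rng = np.random.default_rng(0)
toy_planes = rng.normal(size=(4, 3))     # 4 dimensions, 3 planes -> at most 8 buckets
toy_vecs = rng.normal(size=(20, 4))      # 20 random vectors to index

buckets = defaultdict(list)              # hash value -> list of row indices
for i, v in enumerate(toy_vecs):
    buckets[toy_hash(v, toy_planes)].append(i)

# Looking up the bucket of row 0 returns every row that shares its hash.
query_bucket = buckets[toy_hash(toy_vecs[0], toy_planes)]
print(len(buckets), "non-empty buckets; row 0 shares a bucket with rows", query_bucket)
```

Storing row indices rather than the vectors themselves is enough when the original matrix is kept around, since the vectors can be recovered by indexing.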

In [17]:
def make_hash_table(vecs, planes):
    """
    Creates a hash table for the given list of vectors against set of planes dividing the vector space into sub-regions.

    Params:
    ----------
    vecs: numpy array or list
        vectors to be hashed, one vector per row (or per list element).
    planes: numpy array
        the matrix of planes in a single "universe", with shape (embedding dimensions, number of planes).

    Returns:
    ----------
    hash_table: dict
        a dictionary where the keys are hashes, values are lists of vectors (hash buckets)
    id_table: dict
        a dictionary where the keys are hashes, values are list of vectors id's (it's used to know which tweet corresponds to the hashed vector)
    """

    # number of planes is the number of columns in the planes matrix
    num_of_planes = planes.shape[1]

    # number of buckets is 2^(number of planes)
    num_buckets = 2**num_of_planes

    # create the hash table as a dictionary.
    # Keys are integers (0,1,2.. number of buckets)
    # Values are empty lists
    hash_table = {i:[] for i in range(num_buckets)}

    # create the id table as a dictionary.
    # Keys are integers (0,1,2... number of buckets)
    # Values are empty lists
    id_table = {i:[] for i in range(num_buckets)}

    # for each vector in 'vecs'
    for i, v in enumerate(vecs):
        # calculate the hash value for the vector
        h = hash_value_of_vector(v,planes)

        # store the vector into hash_table at key h,
        # by appending the vector v to the list at key h
        hash_table[h].append(v)

        # store the vector's index 'i' (each document is given a unique integer 0,1,2...)
        # the key is the h, and the 'i' is appended to the list at key h
        id_table[h].append(i)


    return hash_table, id_table

In [24]:
np.random.seed(0)
planes = planes_list[0]  # get one 'universe' of planes to test the function
vec = np.random.rand(1, 300)
tmp_hash_table, tmp_id_table = make_hash_table(document_vecs, planes)

print(f"The hash table at key 0 has {len(tmp_hash_table[0])} document vectors")
print(f"The id table at key 0 has {len(tmp_id_table[0])}")
print(f"The first 5 document indices stored at key 0 are {tmp_id_table[0][0:5]}")

The hash table at key 0 has 1 document vectors
The id table at key 0 has 1
The first 5 document indices stored at key 0 are [2137]


#### Creating all Hash tables

In [19]:
hash_tables = []
id_tables = []
for universe_id in range(N_UNIVERSES):  # build one hash table per universe (25 in total)
    print('working on hash universe #:', universe_id)
    planes = planes_list[universe_id]
    hash_table, id_table = make_hash_table(document_vecs, planes)
    hash_tables.append(hash_table)
    id_tables.append(id_table)

working on hash universe #: 0
working on hash universe #: 1
working on hash universe #: 2
working on hash universe #: 3
working on hash universe #: 4
working on hash universe #: 5
working on hash universe #: 6
working on hash universe #: 7
working on hash universe #: 8
working on hash universe #: 9
working on hash universe #: 10
working on hash universe #: 11
working on hash universe #: 12
working on hash universe #: 13
working on hash universe #: 14
working on hash universe #: 15
working on hash universe #: 16
working on hash universe #: 17
working on hash universe #: 18
working on hash universe #: 19
working on hash universe #: 20
working on hash universe #: 21
working on hash universe #: 22
working on hash universe #: 23
working on hash universe #: 24


#### Normal K-NN


In [20]:
def nearest_neighbor(v, candidates, k=1, metric=distance_cosine_score):
    """
    Computes the k nearest neighbors of the vector v among the candidate vectors.

    Params:
    ----------
    v: numpy array
        the row vector whose nearest neighbors are to be found.
    candidates: numpy array
        candidate vectors (one per row) among which to search for the nearest neighbors of v.
    k: int
        number representing the top k nearest neighbors of v to search for.
    metric: function
        callable used as a distance metric (smaller means closer), default is distance_cosine_score

    Returns:
    ----------
    knn_idx: numpy array
        indices (into candidates) of the k nearest neighbors of v.
    """

    distance_scores = []
    for vec in candidates:
        score = metric(v, vec)
        distance_scores.append(score)
    
    knn_idx = np.argsort(distance_scores)[:k]

    return knn_idx       
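
The loop above calls `metric` once per candidate. As a sketch of a vectorized alternative, the hypothetical function `knn_vectorized` below computes all distances at once. It assumes the metric is cosine distance, defined as 1 minus cosine similarity. Whether `distance_cosine_score` from `utils` is defined exactly this way is not shown here.

```python
import numpy as np

def knn_vectorized(v, candidates, k=1, eps=1e-12):
    """Indices of the k candidates with the smallest cosine distance to v."""
    v = np.asarray(v, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # eps guards against division by zero for all-zero vectors
    norms = np.maximum(np.linalg.norm(candidates, axis=1) * np.linalg.norm(v), eps)
    distances = 1.0 - (candidates @ v) / norms   # assumed definition of cosine distance
    return np.argsort(distances)[:k]             # smallest distance first

cands = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
q = np.array([1.0, 0.1])
top2 = knn_vectorized(q, cands, k=2)
print(top2)  # [0 1]
```

The query points almost along the first candidate, so index 0 is the closest and the diagonal vector at index 1 comes second.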

In [59]:
# check which tweet in the dataset is similar to the given tweet using Naive KNN
my_tweet = 'i am sad'
tweet_embedding = get_document_embedding(my_tweet, en_embeddings)
idx = nearest_neighbor(tweet_embedding, document_vecs, k=10, metric=distance_cosine_score)
print('Entered Tweet: ', my_tweet)
print('Most Similar Tweets are: ')
all_reviews_arr[idx]

Entered Tweet:  i am sad
Most Similar Tweets are: 


array(['@hanbined sad pray for me :(((',
       "@RabihAntoun :( so sad for us. We're losers",
       'So Sad :( https://t.co/GEx8wFhJhy', 'im so sad :(',
       '@AhamSharmaFC ohh so sad :( @StarPlus @FCManmarzian @ManmarzianFC',
       ':( ♫ Sad by @maroon5 (with zikra, Lusi, and Hasya) — https://t.co/1zKAnQbheZ',
       'this is so sad :((((((((', 'Etienne is making me sad :(',
       "Now I'm sad :( https://t.co/Ribf3SkrDI", '@Samcityyy how sad :-('],
      dtype='<U152')

#### Approximate K-NN

The `approximate_knn` function finds a subset of candidate vectors that
are in the same "hash bucket" as the input vector 'v'.  Then it performs
the usual k-nearest neighbors search on this subset (instead of searching
through all 10,000 tweets).
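
The idea can be sketched end to end on toy data without relying on notebook globals. All names below (`lsh_hash`, `build_tables`, `candidate_ids`, `approx_knn_sketch`) are hypothetical, and the sketch ranks candidates by cosine similarity directly rather than through `nearest_neighbor`.

```python
import numpy as np
from collections import defaultdict

def lsh_hash(v, planes):
    """Hash a 1-D vector by the side of each plane on which it lies."""
    bits = (v @ planes >= 0).astype(int)
    return int(bits @ (2 ** np.arange(planes.shape[1])))

def build_tables(vecs, planes_list):
    """Build one {hash: [row indices]} table per universe."""
    tables = []
    for planes in planes_list:
        table = defaultdict(list)
        for i, v in enumerate(vecs):
            table[lsh_hash(v, planes)].append(i)
        tables.append(table)
    return tables

def candidate_ids(v, planes_list, tables):
    """Union of the ids that share a bucket with v in at least one universe."""
    ids = set()
    for planes, table in zip(planes_list, tables):
        ids.update(table.get(lsh_hash(v, planes), []))
    return sorted(ids)

def approx_knn_sketch(v, vecs, planes_list, tables, k=1):
    """Rank only the LSH candidates by cosine similarity to v."""
    ids = candidate_ids(v, planes_list, tables)
    if not ids:
        return []
    subset = vecs[ids]
    norms = np.maximum(np.linalg.norm(subset, axis=1) * np.linalg.norm(v), 1e-12)
    sims = (subset @ v) / norms
    order = np.argsort(-sims)[:k]          # highest similarity first
    return [ids[j] for j in order]

rng = np.random.default_rng(0)
toy_vecs = rng.normal(size=(200, 8))                             # 200 toy vectors
toy_planes_list = [rng.normal(size=(8, 4)) for _ in range(5)]    # 5 universes, 4 planes each
toy_tables = build_tables(toy_vecs, toy_planes_list)

query = toy_vecs[7]
neighbors = approx_knn_sketch(query, toy_vecs, toy_planes_list, toy_tables, k=3)
print(neighbors[0])  # 7, because the query is itself one of the indexed vectors
```

Using more universes enlarges the candidate set, which tends to improve recall at the cost of comparing against more vectors.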

In [25]:
def approximate_knn(v, planes_list, k=1, num_universes_to_use=N_UNIVERSES):
    """
    Finds the KNN of a vector by using a subset of sample space using Locality Sensitive Hashing(LSH).

    Params:
    ----------
    v: numpy array
        vector whose nearest neighbors are to be found.
    planes_list: list
        list of sets of planes (one matrix per universe) to be used for LSH.
    k: int
        number representing the top k nearest neighbors of v to search for.
    num_universes_to_use: int
        the no. of set of planes/ universes to use.

    Returns:
    ----------
    nearest_neighbor_ids: list
        list of indices of k nearest neighbors of v.
    """
    assert num_universes_to_use <= N_UNIVERSES

    # Vectors that will be checked as possible nearest neighbor
    vecs_to_consider_l = list()

    # list of document IDs
    ids_to_consider_l = list()

    # create a set for ids to consider, for faster checking if a document ID already exists in the set
    ids_to_consider_set = set()

    # loop through the universes of planes
    for universe_id in range(num_universes_to_use):

        # get the set of planes from planes_list for this particular universe_id
        planes = planes_list[universe_id]

        # get the hash value of the vector for this set of planes
        hash_value = hash_value_of_vector(v, planes)

        # get the hash table for this particular universe_id
        hash_table = hash_tables[universe_id]

        # get the list of document vectors for this hash table, where the key is the hash_value
        document_vectors_l = hash_table[hash_value]

        # get the id_table for this particular universe_id
        id_table = id_tables[universe_id]

        # get the subset of documents to consider as nearest neighbors from this id_table dictionary
        new_ids_to_consider = id_table[hash_value]

        # loop through the subset of document vectors to consider
        for i, new_id in enumerate(new_ids_to_consider):

            # if the document ID is not yet in the set ids_to_consider...
            if new_id not in ids_to_consider_set:
                # access document_vectors_l at index i to get the embedding
                document_vector_at_i = document_vectors_l[i]

                # append the embedding to the list of vectors to consider as possible nearest neighbors
                vecs_to_consider_l.append(document_vector_at_i)

                # append the new_id (the index for the document) to the list of ids to consider
                ids_to_consider_l.append(new_id)

                # also add the new_id to the set of ids to consider
                # (use this to check if new_id is not already in the IDs to consider)
                ids_to_consider_set.add(new_id)


    # Now run k-NN on the smaller set of vecs-to-consider.
    print("Fast considering %d vecs" % len(vecs_to_consider_l))

    # convert the list of vectors to consider into a numpy array
    vecs_to_consider_arr = np.array(vecs_to_consider_l)

    # call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)

    # Use the nearest neighbor index list as indices into the ids to consider
    # create a list of nearest neighbors by the document ids
    nearest_neighbor_ids = [ids_to_consider_l[idx]
                            for idx in nearest_neighbor_idx_l]

    return nearest_neighbor_ids


In [61]:
# check which tweet in the dataset is similar to the given tweet using Approximate KNN
my_tweet = 'i am sad'
tweet_embedding = get_document_embedding(my_tweet, en_embeddings)
idx = approximate_knn(tweet_embedding, planes_list, k=10, num_universes_to_use=N_UNIVERSES)
print('Entered Tweet: ', my_tweet)
print('Most Similar Tweets are: ')
all_reviews_arr[idx]

Fast considering 1628 vecs
Entered Tweet:  i am sad
Most Similar Tweets are: 


array(['@Samcityyy how sad :-(',
       '@AhamSharmaFC ohh so sad :( @StarPlus @FCManmarzian @ManmarzianFC',
       "@RabihAntoun :( so sad for us. We're losers",
       ':(((((((((( so sad', '@lostboxuk Very sad! :(',
       'Etienne is making me sad :(',
       "Now I'm sad :( https://t.co/Ribf3SkrDI",
       "Nobodies up with me now, I'm sad :(",
       '@archietalanay dont be sad :(((((( ily',
       'So Sad :( https://t.co/GEx8wFhJhy'], dtype='<U152')