# Ref

[Challenges with music identification - Random Prrojection and LSH](https://github.com/santhoshhari/Locality-Sensitive-Hashing)

# Random Projection

representing high-demensional data in low-dimentional feature space

It gained traction for its ability to approximately preserve relations(pairwise distance or cosine similarity)


Notations : 

1. high-demensional data $D_{d \times n}$ -  (dimensions $d$, observations $n$)

2. porjections $P_{k \times n}$ - (dimensions $k$, observations $n$)

3. random projection matrix $R_{k \times d}$ - (low dimensions $k$, high dimensions $d$)

where $k << d$

$R_{k \times d}$ are called random vectors and the elements of these random vectors are drawn indenpendently from gaussion distribution(zero mean, unit variance)

# LSH using Random Projection Method

construct a table of all possible bins where

1. each bins is made up of similar items

2. each bin can be represented by bitwise hash value

3. which is a nunber made up of sequence of 1's and 0's(e.g. 110110, 111001)

In this representation, two observations with same bitwise hash values are more likely to be similar than those with different hashes

Algo

<br>

1. Create `k` random vectors of length `d` each, where `k` is the size of bitwiese hash values and `d` is the dimension of the feature vector (the hash_size is the same as low-dimension k)

<br>

2. For each random vector, compute the `dot product` of the random vector and the observations. If the result of the dot product is positive, assign the bit values as 1 else 0

<br>

3. Concatenate all the bit values computed for `k` dot products

<br>

4. Repeat the above two steps for all observations to compute hash values

<br>

5. Group observations with same hash values together to create a LSH table



In [2]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

In [82]:
import numpy as np
from collections import defaultdict
np.random.seed(42)

In [219]:
class HashTable:
    def __init__(self, hash_size : int, input_dimensions : int) -> None:
        self.hash_size = hash_size
        self.input_dimensions = input_dimensions
        self.hash_table = defaultdict(dict)
        # projections R_{k x d}
        self.projections = np.random.rand(self.hash_size, self.input_dimensions)
    
    def generate_hash(self, input_vector : np.ndarray) -> str:
        bools = (np.dot(input_vector, self.projections.T) > 0).astype('int')
        return ''.join(bools.astype('str'))
        
    def __setitem__(self, label, input_vector : np.ndarray) -> None:
        hash_value = self.generate_hash(input_vector)
        self.hash_table[hash_value][label] = input_vector
        
    def __getitem__(self, hash_code : str) -> dict:
        return self.hash_table[hash_code]
    
    def __repr__(self) -> None:
        return self.hash_table.__str__()

In [220]:
vec1 = np.random.randn(5)
vec2 = np.random.randn(5)
vec3 = np.random.randn(5)
print(vec1, vec2, vec3, sep='\n')

[-1.60644632  0.20346364 -0.75635075 -1.42225371 -0.64657288]
[-1.081548    1.68714164  0.88163976 -0.00797264  1.47994414]
[ 0.07736831 -0.8612842   1.52312408  0.53891004 -1.03724615]


In [221]:
# k=4, d=20
projections = np.random.randn(2, 5)
projections

array([[-0.19033868, -0.87561825, -1.38279973,  0.92617755,  1.90941664],
       [-1.39856757,  0.56296924, -0.65064257, -0.48712538, -0.59239392]])

In [222]:
for idx,v in enumerate([vec1, vec2, vec3]):
    
    bools = (np.dot(v, projections.T) > 0).astype('int')
    hash_code = ''.join(bools.astype('str'))

    print(f'v_{idx+1}',bools, hash_code, sep='\n')
    print()

v_1
[0 1]
01

v_2
[1 1]
11

v_3
[0 0]
00



In [223]:
def cos_sim(vec1 : np.ndarray, vec2 : np.ndarray) -> float:
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1)*np.linalg.norm(vec2))

In [224]:
cos_sim(vec1, vec2), cos_sim(vec1, vec3), cos_sim(vec2, vec3)

(0.07465665021191253, -0.3095602328229634, -0.31148046840627885)

The hash code really preserve similarity!

In [225]:
random_projection_hash = HashTable(hash_size=2, input_dimensions=5)

for idx,v in enumerate([vec1, vec2, vec3]):
    random_projection_hash[f'vec_{idx + 1}'] = v


In [226]:
print(random_projection_hash)

defaultdict(<class 'dict'>, {'00': {'vec_1': array([-1.60644632,  0.20346364, -0.75635075, -1.42225371, -0.64657288])}, '11': {'vec_2': array([-1.081548  ,  1.68714164,  0.88163976, -0.00797264,  1.47994414])}, '01': {'vec_3': array([ 0.07736831, -0.8612842 ,  1.52312408,  0.53891004, -1.03724615])}})


In [227]:
random_projection_hash['00']

{'vec_1': array([-1.60644632,  0.20346364, -0.75635075, -1.42225371, -0.64657288])}

In [228]:
random_projection_hash['01']

{'vec_3': array([ 0.07736831, -0.8612842 ,  1.52312408,  0.53891004, -1.03724615])}

In [229]:
random_projection_hash['11']

{'vec_2': array([-1.081548  ,  1.68714164,  0.88163976, -0.00797264,  1.47994414])}

The intuition behind this idea is that 

1. if two points are aligned completely, i.e have perfect correlation from origin, they will be in the same hash bin

2. if two points separated by 180 degrees will be in different bins

3. two points 90 degrees aprar have 50% probability to be in the same bins


Due to the randomness, it is not likely that all similar item are grouped correctly. 

To overcome this limitation, a common practice is to create multiple hash tables and consider an observation `a` to be simiar to `b`

**If they are in same bin in at least one of the tables**

Below is the code snippet to construct multiple hash tables

In [236]:
class LSH:
    def __init__(self, num_tables : int, hash_size : int, input_dimensions : int):
        self.num_tables = num_tables
        self.hash_size = hash_size
        self.input_dimensions = input_dimensions
        self.tables = []
        for i in range(self.num_tables):
            self.tables.append(
                HashTable(self.hash_size, self.input_dimensions)
            )
    def __setitem__(self, label : str, input_vector : np.ndarray):
        for t in self.tables:
            t[label] = input_vector
    
    def __getitem__(self, label) -> list:
        res = []
        for t in self.tables:
            res.extend(t[label])
        return list(set(res))
    def __repr__(self):
        return self.tables.__str__()

In [237]:
lsh = LSH(num_tables=3, hash_size=2, input_dimensions=5)
# random_projection_hash = HashTable(hash_size=2, input_dimensions=5)

for idx,v in enumerate([vec1, vec2, vec3]):
    lsh[f'vec_{idx + 1}'] = v


In [242]:
for t in lsh.tables:
    print(t)
    print()

defaultdict(<class 'dict'>, {'00': {'vec_1': array([-1.60644632,  0.20346364, -0.75635075, -1.42225371, -0.64657288])}, '11': {'vec_2': array([-1.081548  ,  1.68714164,  0.88163976, -0.00797264,  1.47994414]), 'vec_3': array([ 0.07736831, -0.8612842 ,  1.52312408,  0.53891004, -1.03724615])}, '01': {}, 'vec_1': {}})

defaultdict(<class 'dict'>, {'00': {'vec_1': array([-1.60644632,  0.20346364, -0.75635075, -1.42225371, -0.64657288])}, '11': {'vec_2': array([-1.081548  ,  1.68714164,  0.88163976, -0.00797264,  1.47994414]), 'vec_3': array([ 0.07736831, -0.8612842 ,  1.52312408,  0.53891004, -1.03724615])}, '01': {}, 'vec_1': {}})

defaultdict(<class 'dict'>, {'00': {'vec_1': array([-1.60644632,  0.20346364, -0.75635075, -1.42225371, -0.64657288])}, '11': {'vec_2': array([-1.081548  ,  1.68714164,  0.88163976, -0.00797264,  1.47994414]), 'vec_3': array([ 0.07736831, -0.8612842 ,  1.52312408,  0.53891004, -1.03724615])}, '01': {}, 'vec_1': {}})



# TODO

1. figure out **If they are in same bin in at least one of the tables**

2. get LSH class work