LSH with NumPy

In [58]:
import numpy as np 

## Converting multiple columns to a single unique number: How does it work?

Idea: write the numbers next to each other from left to right, so that `A = [3, 10]` becomes `310`. Before the summation, each entry has to be scaled so that its last digit sits exactly one order of magnitude above the leading digit of its right-hand neighbor. For this we need to know the order of magnitude (power of 10) of each entry in the original array. At the end, we multiply `A` element-wise by the multiplication factor `[100, 1]`, that is `[1e2, 1e0]`, and sum the result: `3 * 100 + 10 * 1 = 310`.

1. Take the floor of log10 of the array to get the existing power of each entry: `existing_powers = [0, 1]`.
2. Take the cumulative sum of these powers along the row from right to left, which yields `cumsum = [1, 1]`.
3. Build the multiplication factor
    1. Each entry needs one extra order of magnitude per entry to its right: `required_powers = [1, 0]`, i.e. factors of `[1e1, 1e0]`.
    2. Because a 10 sits to the right of the 3, the factor for 3 needs another order of magnitude. More generally, the powers contributed by the entries to the right are `cumsum - existing_powers`, here `[1, 0]`.
    3. The exponent of the multiplication factor is `cumsum - existing_powers + required_powers`, here `[2, 0]`, which gives the factor `[1e2, 1e0]`.
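
The steps for `A = [3, 10]` can be traced one at a time. This is only a minimal sketch for a single row; the variable names `exponents`, `factor` and `result` are chosen here for illustration, and the entries are assumed to be positive integers.

```python
import numpy as np

A = np.array([3, 10])

# Step 1: power of 10 of each entry
existing_powers = np.floor(np.log10(A)).astype(int)             # [0, 1]

# Step 2: cumulative sum of the powers from right to left
cumsum_powers = np.cumsum(existing_powers[::-1])[::-1]          # [1, 1]

# Step 3: one extra order of magnitude per entry to the right
required_powers = np.arange(len(A))[::-1]                       # [1, 0]

exponents = cumsum_powers - existing_powers + required_powers   # [2, 0]
factor = 10 ** exponents                                        # [100, 1]
result = int(np.sum(A * factor))                                # 310
print(result)
```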


In [59]:
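a = np.array([[1,4,10], [14, 12, 3], [1, 100, 39]])
# Power of 10 of each entry (number of digits - 1); entries must be >= 1
existing_powers = np.floor(np.log10(a))
n_positions = a.shape[1]
n_mentions = a.shape[0]

# Cumulative sum of the powers within each row, from right to left
cumsum_powers = np.fliplr(np.cumsum(np.fliplr(existing_powers), axis=1))
print(f"sum_powers: \n {cumsum_powers}")

# One extra order of magnitude per entry to the right: [n_positions-1, ..., 1, 0]
req_powers = [x for x in reversed(range(n_positions))]
req_powers = np.tile(req_powers, (n_mentions, 1))

# Exponent of the multiplication factor for each entry
mult_factor = cumsum_powers - existing_powers + req_powers
# Multiplying by a column of ones sums each row
summationvector = np.ones((n_positions, 1))
out = np.matmul(a * 10**mult_factor, summationvector)

# The exponents are floats, so the results print as floats (e.g. 1410.0)
for i in range(out.shape[0]):
    print("".join(str(x) for x in out[i,]))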

sum_powers: 
 [[1. 1. 1.]
 [2. 1. 0.]
 [3. 3. 1.]]
1410.0
14123.0
110039.0
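
The same computation can be wrapped in a function. The following is a sketch rather than a definitive implementation: the name `concat_columns` is hypothetical, it assumes a 2-D array of positive integers whose concatenation fits into `int64`, and it uses integer exponents and broadcasting in place of the float exponents and `np.tile` used above, so the results come out as integers.

```python
import numpy as np

def concat_columns(a):
    """Concatenate the decimal digits of each row of `a` into one integer.

    Hypothetical helper; assumes positive integers and results within int64.
    """
    a = np.asarray(a, dtype=np.int64)
    existing_powers = np.floor(np.log10(a)).astype(np.int64)
    # Cumulative sum of the powers within each row, from right to left
    cumsum_powers = np.cumsum(existing_powers[:, ::-1], axis=1)[:, ::-1]
    # [n_positions - 1, ..., 1, 0], broadcast across all rows
    req_powers = np.arange(a.shape[1])[::-1]
    exponents = cumsum_powers - existing_powers + req_powers
    return np.sum(a * 10 ** exponents, axis=1)

a = np.array([[1, 4, 10], [14, 12, 3], [1, 100, 39]])
out = concat_columns(a)
print(out)  # values: 1410, 14123, 110039

# Cross-check against plain string concatenation of each row
expected = [int("".join(str(x) for x in row)) for row in a]
assert out.tolist() == expected
```

One caveat when using the result as a key: concatenation alone does not distinguish rows such as `[1, 23]` and `[12, 3]`, which both map to `123`.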
