# Lab: Finding Similar Items
Data Mining 2021/2022  
Danny Plenge and Gosia Migut  
Revised by Aleksander Buszydlik

**WHAT** This *optional* lab consists of several programming exercises and insight questions. These exercises let you practice the theory covered in [Chapter 3][1] of "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, and J. D. Ullman.  

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam.  

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use [StackOverflow][2] to discuss the questions with your peers. For additional questions and feedback, please consult the TAs during the assigned lab session. The answers to these exercises will not be provided.

[1]: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
[2]: https://stackoverflow.com/c/tud-cs/questions

#### Summary
In the following exercises you will create algorithms for finding similar items in a dataset. 
* Exercise 1: Shingling   
* Exercise 2: MinHashing
* Exercise 3: Locality Sensitive Hashing


## Exercise 1: Shingling

As you learned during the lecture, shingling allows us to assess the similarity between two documents, which is useful, for example, in plagiarism detection. A k-shingle is any sequence of k consecutive characters that appears in the document. If two documents are similar, they will share many of the same k-shingles. The right value of k depends on the application but, ideally, k should be large enough that the probability of any given k-shingle appearing in a document is relatively low. In this exercise you will implement a set of functions which will allow us to compare the similarity of two arbitrary strings.
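
To get a feel for how k affects this probability, consider a rough count. As an assumption made only for this illustration, suppose the text is restricted to the 26 lowercase letters plus the space character. There are then $27^k$ possible k-shingles, so for k = 5:

### <center> $27^5 = 14{,}348{,}907$ </center>

Real text is far from uniformly distributed over these characters, so the number of shingles that actually occur is smaller. The count only indicates the order of magnitude.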

### Step 1: Implement `shingle_string`

First we will implement the `shingle_string` function. It takes a string and the size parameter k as arguments, cuts the string into shingles of size k, and returns the set of resulting shingles.

For example, if the input string is "shingling" and k = 2, the resulting ShingleSet should be: {"sh", "hi", "in", "ng", "gl", "li"}

Implement this function and verify that it works as intended.

In [None]:
import numpy as np

def shingle_string(string, k):
    """
    This function takes as argument some string and cuts it up in shingles of size k.
    For example, input ("shingling", 2) -> {"sh", "hi", "in", "ng", "gl", "li"}
    :param string: The input string
    :param k: The size of the shingles
    :return: A set of shingles of size k
    """    
    shingles = set()
    
    # START ANSWER
    # END ANSWER    

    return shingles


assert shingle_string("shingling", 1) == set({"s", "h", "i", "n", "g", "l"})
assert shingle_string("shingling", 2) == set({"sh", "hi", "in", "ng", "gl", "li"})
assert shingle_string("shingling", 9) == set({"shingling"})
assert shingle_string("shingling", 10) == set()

$\textbf{Question 1}$: What would be the output of `shingle_string` with k set to 5? Will the resulting set increase or decrease in size? 

### Step 2: Implement `jaccard_distance`

Next, we will implement the `jaccard_distance` function which takes as input two sets and computes the distance between them. Remember that the Jaccard distance can be calculated as follows: 

### <center> $d(A, B) = 1 - \frac{| A \cap B|}{|A \cup B|}$ </center>
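
As a worked illustration of the formula (the two sets are chosen here only as an example), take $A = \{sh, hi, ng, gl, li\}$ and $B = \{sh, hi, ng, gl, le, es\}$. The sets share 4 shingles and their union contains 7 distinct shingles, so:

### <center> $d(A, B) = 1 - \frac{4}{7} \approx 0.429$ </center>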



In [None]:
from numpy.testing import assert_almost_equal

def jaccard_distance(a, b):
    """
    This function takes as input two sets and computes the distance between them -> 1 - length(intersection)/length(union).
    :param a: The first set to compare
    :param b: The second set to compare
    :return: The (Jaccard) distance between set 'a' and 'b' (0 <= distance <= 1)
    """    
    
    distance = -1.0
    
    # START ANSWER
    # END ANSWER    

    return distance

assert jaccard_distance({"sh", "hi", "ng", "gl", "li"}, {"sh", "hi", "ng", "gl", "li"}) == 0
assert jaccard_distance({"sh", "hi", "ng", "gl", "li"}, {"sa", "am", "mp", "pl", "le"}) == 1
assert_almost_equal(jaccard_distance({"sh", "hi", "ng", "gl", "li"}, {"sh", "hi", "ng", "gl", "le", "es"}), 0.429, 3)

### Step 3: Apply `shingle_string` and `jaccard_distance`

Create two separate ShingleSets with k set to 5 (using `shingle_string` from step 1) from the following strings:  
* _The plane was ready for touch down_
* _The quarterback scored a touchdown_

Are these sentences very similar? Do you expect that the Jaccard distance between these two sentences will be large or small?  
Calculate the Jaccard distance between these two sets using the function implemented in step 2.

In [None]:
s1 = "The plane was ready for touch down"
s2 = "The quarterback scored a touchdown"

def jaccard_distance_on_strings(s1, s2):
    """
    This function calculates the Jaccard distance between two strings.
    :param s1: The first string to compare
    :param s2: The second string to compare
    :return: The (Jaccard) distance between strings 's1' and 's2' (0 <= distance <= 1)
    """   

    # START ANSWER
    # END ANSWER
    
assert_almost_equal(jaccard_distance_on_strings(s1, s2), 0.966, 3)

$\textbf{Question 2}$: The Jaccard distance you calculated for the above sentences should be approximately 0.97.
What would happen if we lowered `k` to 1? Would it increase or decrease the distance between the two sets? Which `k` do you think would be appropriate for these two sentences? 

### Step 4: Implement `jaccard_distance_stripped`

Both sentences from step 3 contain whitespace, but it does not appear to contribute much to the actual meaning of the sentences. One option is to strip all whitespace from the sentences before cutting them into shingles. Create a function that removes all whitespace from the strings before creating any shingles, and calculate the Jaccard distance again.

In [None]:
def jaccard_distance_stripped(s1, s2):
    """
    This function computes the Jaccard distance between two sets of shingles after removing all whitespace from the original strings.
    :param s1: The first string to compare
    :param s2: The second string to compare
    :return: The (Jaccard) distance between strings 's1' and 's2' (0 <= distance <= 1)
    """  
    
    # START ANSWER
    # END ANSWER

assert_almost_equal(jaccard_distance_stripped(s1, s2), 0.888, 3)

$\textbf{Question 3}$: Did the Jaccard distance between the two sets increase or decrease compared to step 3? Why is that?

## Exercise 2: MinHashing

We have successfully measured the similarity between two strings. However, when working with a large set of documents, this approach may be computationally too expensive. To that end, we employ MinHashing, which allows us to efficiently estimate the Jaccard distance between documents. You will now learn how to create a MinHash signature matrix for a set of documents. In the following exercises you are given 4 ShingleSets, `s1` to `s4`, with `k = 1`.
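
The idea behind the estimate can be illustrated with a minimal sketch. It is not the implementation asked for in the steps below: it uses explicit random permutations of a small, made-up shingle universe instead of hash functions, and the names `minhash_signature` and `estimated_jaccard_distance` are hypothetical. The fraction of signature positions on which two sets disagree approximates their Jaccard distance.

```python
import random


def minhash_signature(items, permutations):
    """Return one min-hash value per permutation for the set `items`."""
    # Each permutation maps a shingle to its position in a shuffled order;
    # the min-hash is the smallest position among the shingles of the set.
    return [min(perm[x] for x in items) for perm in permutations]


def estimated_jaccard_distance(sig_a, sig_b):
    """Estimate the Jaccard distance as the fraction of disagreeing positions."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return 1 - agree / len(sig_a)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
universe = list("abcdefgh")  # made-up shingle universe, for illustration only

permutations = []
for _ in range(2000):
    order = universe[:]
    rng.shuffle(order)
    permutations.append({shingle: pos for pos, shingle in enumerate(order)})

set_a = {"a", "b", "c", "d"}
set_b = {"c", "d", "e", "f"}

true_distance = 1 - len(set_a & set_b) / len(set_a | set_b)  # 1 - 2/6
estimate = estimated_jaccard_distance(
    minhash_signature(set_a, permutations),
    minhash_signature(set_b, permutations),
)
print(true_distance, estimate)
```

With more permutations the estimate tends to get closer to the true distance, at the cost of longer signatures.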

In [None]:
s1 = {"a", "b"}
s2 = {"a", "c"}
s3 = {"d", "c"}
s4 = {"g", "b", "a"}

# Initialize shingle sets
sets = [s1, s2, s3, s4]

### Step 1: Create a hashing function

Create a function which hashes an integer $x$ given the parameters $\alpha$ and $\beta$. This function should hash the value $x$ using the following formula:

### <center> $h(x) = (x \cdot \alpha + \beta) \bmod n$ </center>

where $x$ is an integer and $n$ is the number of unique shingles across all sets. For example, given $\alpha = 1$, $\beta = 1$, $x=3$ and $n=2$ you should get $h(x) = (3 \cdot 1 + 1) \bmod 2 = 0$.

In [None]:
class HashFunction:
    """
    This HashFunction class can be used to create a hash function parameterized by alpha and beta.
    """
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def hashf(self, x, n):
        """
        Returns a hash given an integer x and n.
        :param x: The value to be hashed
        :param n: The number of unique shingles of all sets
        :return: The hashed value x given alpha and beta
        """
        
        hash_value = 0
        
        # START ANSWER
        
        # make some changes
        
        # END ANSWER
    
        return hash_value

# Assume alpha and beta equal 1
h1 = HashFunction(1,1)

# Check the implementation
assert h1.hashf(3, 2) == 0
assert h1.hashf(4, 4) == 1
assert h1.hashf(5, 7) == 6

$\textbf{Question 4}$: In order to gain some insight into computing minhash signature matrices, compute by hand the matrix for the sets of shingles given above using the hash functions:
* $h_1$ where $\alpha=1$ and $\beta=1$
* $h_2$ where $\alpha=3$ and $\beta=1$

Make sure to do this computation by hand! Refer to the slides and other study materials if you forgot how to do this.  

### Step 2: Computing the signature matrix

Next we are going to create two functions: 
* `shingle_space` which will return all unique shingles among the sets 
* `compute_signature` which will create the minhash signature matrix from our sets s1-s4 given a number of hash functions.

For the latter, you can make use of the pseudocode below.
  
```
foreach shingle (x, index) in the shingle space do 
    foreach ShingleSet S do
        if x ∈ S then
            foreach hash function h do
                signature(h, S) = min(h(index), signature(h, S))
            end
        end
    end
end
```

In [None]:
# Initialize a list of hash functions
hashes = list()

h1 = HashFunction(1,1)
h2 = HashFunction(3,1)

hashes.append(h1)
hashes.append(h2)

In [None]:
def shingle_space(sets):
    """
    Sets up the total shingle space given the list of shingles (sets).
    :param sets: A list of ShingleSets
    :return: The ShingleSpace set
    """
    space = set()
    
    # START ANSWER
    # END ANSWER
    
    return space

assert shingle_space([{"a", "b"}, {"b"}, {"a", "c"}, {"b", "c", "d"}]) == set({"a", "b", "c", "d"})
assert shingle_space([{"u", "v"}, {"u", "v", "x"}, {"y", "z"}, {"u", "y", "z"}]) == set({"u","v", "x", "y", "z"})

In [None]:
import numpy as np
import sys

space = shingle_space(sets)

def compute_signature(space, hashes, sets):
    """
    This function will calculate the minhash signature matrix from our sets s1-s4 
    using the list of hash functions (hashes) and the shingle space (space)
    :param space: The union of all unique shingles among the sets
    :param hashes: The list of hash functions of arbitrary length
    :param sets: The list of ShingleSets
    :return: The Minhash signature matrix for the given sets of shingles
    """
    
    result = np.full((len(hashes), len(sets)), sys.maxsize)
    sorted_space = sorted(space)
    
    # START ANSWER        
    # END ANSWER
    
    return result

compute_signature(space, hashes, sets)

In [None]:
# This part will allow you to test your code
test_hashes = list()

h3 = HashFunction(2, 3)
h4 = HashFunction(4, 2)

test_hashes.append(h3)
test_hashes.append(h4)

test_sets = [{"u", "v"}, {"u", "v", "x"}, {"y", "z"}, {"u", "y", "z"}]
test_space = shingle_space(test_sets)
             
assert np.array_equal(compute_signature(test_space, test_hashes, test_sets), np.array([[0, 0, 1, 1], [1, 0, 3, 2]]))

$\textbf{Question 5}$: Compute the minhash signature matrix using the function you have just implemented. Verify that the result of your implementation is correct by comparing the result of the program to your manual calculation.

## Exercise 3: Locality Sensitive Hashing

Finally, we will implement a simple algorithm for Locality Sensitive Hashing. Say that you have access to millions of documents and want to find the similar ones. Any attempt to systematically scan through such a large corpus of documents is unlikely to work. Instead, we can use probability theory to our advantage and find as many matches as possible. Of course, we may find some pairs of documents which are not similar at all (false positives). We may also miss some similar documents (false negatives). Nevertheless, in most cases that is a small price to pay for an otherwise very efficient technique. Even better, using LSH we are in control of the probability of FPs and FNs which makes it applicable to different scenarios.

Let's use the functions implemented in the previous exercises to compute a Locality-Sensitive Hashing table using the banding technique for minhashes as described in the lecture and in the book.
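
The control over false positives and false negatives mentioned above comes from the choice of the number of bands $b$ and the number of rows per band $r$. According to the analysis in Chapter 3 of the book, two sets with Jaccard similarity $s$ (that is, one minus their Jaccard distance) become a candidate pair with probability $1 - (1 - s^r)^b$. The sketch below evaluates this expression. The function name `candidate_probability` is hypothetical, and the formula is idealized: it assumes that signature segments only land in the same bucket when they are identical.

```python
def candidate_probability(s, r, b):
    """Probability that two sets with Jaccard similarity s agree in at least one band."""
    # s ** r is the chance that all r rows of a single band agree;
    # 1 - (1 - s ** r) ** b is the chance that at least one of the b bands agrees.
    return 1 - (1 - s ** r) ** b


rows_per_band = 5
bands = 20  # 100 hash functions split into bands of 5 rows each

for similarity in (0.2, 0.5, 0.8):
    p = candidate_probability(similarity, rows_per_band, bands)
    print(f"similarity {similarity:.1f} -> candidate with probability {p:.3f}")
```

Trying other values of `rows_per_band` and `bands` shows how the threshold between rarely suggested and almost always suggested pairs moves.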

### Step 1: Generate random hash functions

For this exercise we will need many hash functions. Construct a class which can create a hash function with a random $\alpha$ and $\beta$.  
**Hint:** You can use `random.randint()` to generate a random integer in a given range.

In [None]:
import random

class RandomHashFunction:
    """
    This RandomHashFunction class can be used to create a hash function with a randomly chosen alpha and beta
    """
    def __init__(self, alpha, beta):
        # START ANSWER
        pass  # placeholder so that the class parses; replace it with your answer
        # END ANSWER
        
    def hashf(self, x, n):
        """
        Returns a random hash given an integer x and n
        :param x: The value to be hashed
        :param n: The number of unique shingles of all sets
        :return: The hashed value x given alpha and beta
        """
        hash_value = 0
        
        # START ANSWER
        # END ANSWER
        
        return hash_value
    

### Step 2: Find potential candidates

Now, create a function which, given a minhash table, computes the candidates using the LSH technique. For this you may use the pseudocode given below.  
  
```
# Initialize buckets
foreach band do
    foreach set do
        s = a column segment of length r, for this band and set
        add set to buckets[hash(s)]
    end
end
```  
   
```
# Retrieve candidates
foreach item in buckets[hash(s)] do
    add [set, item] to the list of candidates
end

```

**Hint:** You can use Python's built-in `hash()` function to calculate the bucket in which the string should be stored.  
**Hint:** You can use `itertools.combinations()` to find all pairs of potential candidates.
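
The two hinted calls can be tried in isolation first. The snippet below is only a small illustration with made-up values (`num_buckets`, `segment` and `bucket_members` are hypothetical) and is not the solution to this step.

```python
import itertools

num_buckets = 10        # hypothetical number of buckets, for illustration only
segment = "3 0 7 1 4"   # hypothetical string form of one column segment

# hash() may return a negative integer; in Python the % operator with a
# positive modulus still yields an index in the range [0, num_buckets).
bucket_index = hash(segment) % num_buckets

# All unordered pairs of set indices that ended up in the same bucket.
bucket_members = [0, 2, 3]
pairs = list(itertools.combinations(bucket_members, 2))

print(bucket_index, pairs)
```

Note that string hashes are randomized between interpreter runs unless `PYTHONHASHSEED` is set, so `bucket_index` may differ from run to run. Within a single run, equal strings always receive the same hash.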

In [None]:
import itertools
def compute_candidates(mhs, bs, r):
    """
    This function computes the candidates using the LSH technique given a Minhash table
    :param mhs: The minhash signature matrix
    :param bs: The number of buckets
    :param r: The rows per band
    :return: The list of candidates
    """
    
    # The number of signature rows must be divisible by r; b is the number of bands
    assert mhs.shape[0] % r == 0
    b = mhs.shape[0] // r
    result = set()
    buckets = list()
  
    for i in range(bs):
        buckets.append(list())

    # Initialize the buckets
    for i in range(int(b)):
        for j in range(mhs.shape[1]):
            # Take a segment from an mhs column
            col_segment = mhs[i*r:(i+1)*r,[j]]
            
            # Convert the column segment into a string
            s = np.array2string(col_segment.flatten(), separator = '')
            s = s[1:len(s)-1]
            
            # Append the index of the set to the corresponding bucket in the buckets list
            # START ANSWER 
            # END ANSWER
    
    
    # Retrieve the candidates
    for item in buckets:   
        item = set(item)
        
        # Add all the pairs of the potential nearest neighbors in the bucket to the resulting set. 
        # START ANSWER
        # END ANSWER
        
    return result

$\textbf{Question 6}$: An important issue with this algorithm is that it will work suboptimally if you index the buckets as `buckets[hash(s)]` instead of `buckets[hash(s), band]`. Why is this the case?  

### Step 3: Compute the LSH for our shingle sets
As before, compute the minhash signature matrix, this time using 100 random hash functions. Use 10000 buckets and 5 rows per band.

In [None]:
# Initialize a list for the 100 random hash functions
rhashes = [RandomHashFunction(100, 100) for i in range(100)]

# Calculate the Minhash Signature Matrix
mhs = compute_signature(space, rhashes, sets)

# Apply Locality-Sensitive Hashing to find candidate pairs
result = compute_candidates(mhs, 10000, 5)

for x in result:
    jd = jaccard_distance(sets[x[0]], sets[x[1]])
    e1 = x[0] + 1
    e2 = x[1] + 1
    if jd < 0.5:
        print("-- ShingleSets: {} within tolerance -- jaccard distance {}".format((e1, e2), jd))
    else:
        print("-- ShingleSets: {} not within tolerance -- jaccard distance {}".format((e1, e2), jd))

$\textbf{Question 7}$: If you run the code multiple times you may notice that sometimes you get different candidates. Why is that the case?

$\textbf{Question 8}$: Run your code 10 times. Write down on a piece of paper which candidate pairs are suggested and how many times each of them is suggested. How does this relate to the Jaccard distance between the two sets in each candidate pair (not in terms of formulas, just an indication)? To verify your understanding, compute the Jaccard distance between all possible pairs of ShingleSets and compare this to the frequencies (how many times a pair is suggested as a candidate).

$\textbf{Question 9}$: Why (or when) would you use this algorithm?

$\textbf{Question 10}$: What will happen if the number of buckets is too small? For example, what would happen if we only used 10 buckets?  

$\textbf{Question 11}$: What is the effect of the number of rows per band? What will happen if we set the number of rows per band to 1? What will happen if you set the number of rows per band to the length of the signature?  