# Lecture 7b Measures of Distance
__Math 3280: Data Mining__

__Outline__
1. Jaccard Similarity
2. Euclidean Distance
3. Cosine Distance

__Reading__ 
* Leskovec, Sections 3.1, 3.5

A common problem in data science is comparing one piece of data to another. This is useful in machine learning, and it helps find similar items and identify patterns.

The simplest method is to compare every item with every other item. However, if there are 1 million items, that is roughly half a trillion ($\binom{10^6}{2} \approx 5\times 10^{11}$) pairs to look through.
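To make the scale concrete, here is a quick count of the comparisons a brute-force search would need. It assumes each unordered pair is compared once, which gives $n(n-1)/2$ pairs (counting ordered pairs instead would give about $10^{12}$). The variable names are only for illustration.

```python
import math

n = 1_000_000  # number of items

# Number of unordered pairs: n choose 2 = n(n-1)/2
pairs = math.comb(n, 2)

print(f"{pairs:,} pairs")  # 499,999,500,000 pairs
```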

In this chapter, we'll look at a technique, Locality-Sensitive Hashing, which focuses only on pairs likely to be similar (candidate pairs), while ignoring other pairs.

* Jaccard Similarity
* Shingling
* Minhashing
* Locality-Sensitive Hashing (LSH)
  * LSH for documents

## Jaccard Similarity
The Jaccard Similarity measures how similar two pieces of information are. Each item being compared (for example, each row of a dataset) is treated as a set.
The calculation is
$$J(S,T) = \frac{|S\cap T|}{|S\cup T|}$$

A simple example:
$$A = \{1, 3, 5\} \qquad B = \{3, 4, 5, 6\}$$
Venn diagram (Square brackets encompass elements of A, round brackets encompass elements of B):
$$\Big[1 \Big( 3, 5 \Big] 4, 6\Big)$$

There are 5 distinct elements in total, so $|A\cup B| = 5$. Only 2 elements are in both sets, so $|A\cap B| = 2$.
$$J(A,B) = \frac{|A\cap B|}{|A\cup B|} = \frac{2}{5}$$
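The same calculation can be sketched with Python's built-in sets. The function name `jaccard_similarity` and its handling of two empty sets are assumptions made for this illustration, not something fixed by the lecture.

```python
def jaccard_similarity(s, t):
    """Jaccard similarity of two sets: |S ∩ T| / |S ∪ T|."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(s & t) / len(union)

A = {1, 3, 5}
B = {3, 4, 5, 6}

print("J(A,B) =", jaccard_similarity(A, B))  # 2/5 = 0.4
```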

There are two similarity calculations:
* Jaccard Similarity
  * The union is the set of distinct elements: each possible value is counted once, no matter how often it appears
$$|A\cup B| = \big|\{1, 3, 4, 5, 6\}\big| = 5 \qquad J(A,B) = \frac{2}{5}$$
* Jaccard Bag Similarity
  * The union is all elements of both sets combined, repeats included, as if they were two bags mixed together
$$|A\cup B| = \big|\{1, 3, 5, 3, 4, 5, 6\}\big| = 7 \qquad J_B(A,B) = \frac{2}{7}$$

Example #2: You create a shopping list including,
* Milk (2), eggs, bread, chips (3), salsa

But you forget the shopping list. So, you get what you can remember, plus some additional things:
* Milk (3), eggs, chips (1), salsa, yogurt, cheese, ice cream

What is the Jaccard similarity?
$$|list \cap purchased| = |\{\text{milk, eggs, chips, salsa}\}|=4$$
$$|list \cup purchased| = |\{\text{milk, eggs, bread, chips, salsa, yogurt, cheese, ice cream}\}|=8$$
$$J(list, purchased) = \frac{|list \cap purchased|}{|list \cup purchased|} = \frac{4}{8}=0.5$$

Notice that we did not repeat milk or chips. The Jaccard Similarity considers only distinct items, not repeats. The Jaccard Bag Similarity does count repeats:
* Chips were on the list 3 times, but we bought only 1, so there is only one (1) matched pair
* Milk was bought 3 times, but was on the list only 2 times, so there are only two (2) matched pairs
  * $|list \cap purchased| = |\{\text{milk, milk, eggs, chips, salsa}\}| = 5$
* The union is all items from both bags, repeats included (8 from the list plus 9 purchased)
  * $|list \cup purchased| = |\{\text{milk, milk, eggs, bread, chips, chips, chips, salsa, milk, milk, milk, eggs, chips, salsa, yogurt, cheese, ice cream}\}| = 17$

$$J_B(list, purchased) = \frac{|list \cap purchased|}{|list \cup purchased|} = \frac{5}{17}\approx 0.294$$
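The bag calculation can be sketched with `collections.Counter`, whose `&` operator keeps the minimum count of each item. The function name `jaccard_bag_similarity` and the choice to return 0 for two empty bags are assumptions for this illustration.

```python
from collections import Counter

def jaccard_bag_similarity(s, t):
    """Bag similarity: matched pairs divided by the total size of both bags."""
    s, t = Counter(s), Counter(t)
    matched = sum((s & t).values())            # min count of each shared item
    total = sum(s.values()) + sum(t.values())  # both bags mixed together
    if total == 0:
        # Assumption: two empty bags get similarity 0 (an arbitrary convention).
        return 0.0
    return matched / total

shopping_list = Counter({"milk": 2, "eggs": 1, "bread": 1, "chips": 3, "salsa": 1})
purchased = Counter({"milk": 3, "eggs": 1, "chips": 1, "salsa": 1,
                     "yogurt": 1, "cheese": 1, "ice cream": 1})

print("J_B =", jaccard_bag_similarity(shopping_list, purchased))  # 5/17
```

Note that two identical bags score $\frac{1}{2}$ under this definition, not 1, because every item is counted twice in the union.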

Another example:

|     |  S  |  T  |
| --- | --- | --- |
| x_0 |  1  |  0  |
| x_1 |  0  |  1  |
| x_2 |  0  |  0  |
| x_3 |  1  |  1  |
| x_4 |  0  |  1  |
| x_5 |  1  |  0  |
| x_6 |  1  |  1  | 
| x_7 |  0  |  0  |
| x_8 |  1  |  1  |  
| x_9 |  0  |  1  |

To compute the Jaccard Similarity of the columns $S$ and $T$, we look only at positive results (entries with a "1"). The intersection is the set of rows where both $S$ and $T$ are 1:
$$|S\cap T| = 3$$

The union is the set of rows where either $S$ or $T$ has a 1:
$$|S\cup T| = 8$$

Instead of listing every datapoint, we can simply count how often each combination of values occurs.

|  S  |  T  |  #  |
| --- | --- | --- |
|  0  |  0  |  2  |
|  0  |  1  |  3  |
|  1  |  0  |  2  |
|  1  |  1  |  3  |

or, looking at it with a confusion matrix,

|      |  S=1  |  S=0  |
| ---: | :---: | :---: |
|  T=1 |   3   |   3   |
|  T=0 |   2   |   2   |

$$|S\cap T| = 3 \qquad |S \cup T| = 3+3+2 = 8$$

Either way, the Jaccard Similarity is,
$$|S\cap T| = 3 \qquad |S\cup T| = 8 \qquad J(S,T) = \frac{|S\cap T|}{|S\cup T|} = \frac{3}{8}$$

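For the Jaccard Bag Similarity, the union counts every 1 in $S$ (there are 5) plus every 1 in $T$ (there are 6):
$$J_B(S,T) = \frac{3}{5+6} = \frac{3}{11}$$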
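As a check, the counts can be reproduced directly from the two columns of the table. This is a minimal sketch using plain Python lists; the variable names `both` and `either` are illustrative only.

```python
# Membership columns copied from the table above (rows x_0 .. x_9).
S = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
T = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]

both = sum(1 for s, t in zip(S, T) if s and t)    # rows with S=1 and T=1
either = sum(1 for s, t in zip(S, T) if s or t)   # rows with S=1 or T=1

jaccard = both / either                  # 3/8
jaccard_bag = both / (sum(S) + sum(T))   # 3/(5+6) = 3/11

print("J(S,T)   =", both, "/", either, "=", jaccard)
print("J_B(S,T) =", both, "/", sum(S) + sum(T), "=", jaccard_bag)
```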

The Jaccard Similarity can be used in a variety of ways:
* Similarity of Documents
* Plagiarism
* Mirror Pages
* Articles from the Same Source
* __Collaborative Filtering__
  * On-line Purchases
  * Movie Ratings

## Euclidean Distance
We saw the Euclidean Distance at the beginning of the semester. The Euclidean Distance is the physical straight-line distance between two points. It follows from the Pythagorean Theorem and equals the 2-norm of the difference between the two points.
$$d(x,y) = \lVert x - y \rVert_2 = \sqrt{\sum_i (x_i - y_i)^2} \qquad\qquad \lVert x \rVert_2 = \sqrt{\sum_i x_i^2}$$

Because the 2-norm is the most common norm, we usually drop the subscript.
$$\lVert x \rVert_2 = \lVert x \rVert$$

In [None]:
import numpy as np

def norm(x,n):
  # n-norm of the vector x; abs() keeps the result correct for odd n and negative entries
  return np.sum(np.abs(x)**n)**(1/n)

a = np.array([1,3])

print("2-norm = ", norm(a,2))
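The cell above computes the norm of a single vector. As a sketch of the distance between two points, the norm can be applied to their difference. Here `euclidean_distance` is a name chosen for illustration, and NumPy's `np.linalg.norm` (which defaults to the 2-norm for vectors) stands in for the hand-written `norm`.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between points x and y: ||x - y||_2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.norm(x - y)

a = np.array([1, 3])
b = np.array([2, 5])

# The difference is (-1, -2), so the distance is sqrt(1 + 4) = sqrt(5).
print("d(a,b) = ", euclidean_distance(a, b))
```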

## Cosine Distance
A small Euclidean distance implies that two points are almost identical. However, is there a way to tell that two points have similar attributes even when their magnitudes are very different?

Instead of looking at the physical distance between two points, we look at whether the two points lie in similar, opposite, or perpendicular directions from the origin. We do this using the cosine definition of the dot product.
$$\vec{x}\cdot\vec{y} = \lVert x \rVert\lVert y \rVert \cos\theta \qquad\qquad \cos\theta = \frac{\vec{x}\cdot\vec{y}}{\lVert x \rVert\lVert y \rVert} \qquad\qquad \theta = \arccos\left(\frac{\vec{x}\cdot\vec{y}}{\lVert x \rVert\lVert y \rVert}\right)$$

__The angle $\theta$ is the cosine distance.__ However, we often look at $\cos\theta$ instead (usually called the cosine similarity) because it is easier to interpret.

| The two points are             | $\theta$                     | $\cos\theta$              |
| -----------------------------: | :--------------------------: | :-----------------------: |
|                                | $0 \le \theta \le \pi$       | $-1 \le \cos\theta \le 1$ |
|       in similar directions if | $\theta\approx 0$            | $\cos\theta \approx 1$    |
| in perpendicular directions if | $\theta\approx\frac{\pi}{2}$ | $\cos\theta \approx 0$    |
|      in opposite directions if | $\theta\approx\pi$           | $\cos\theta \approx -1$   |



In [5]:
def dot_product(x,y):
  # Multiply matching components, then add them up
  return np.sum(x*y)

a = np.array([1,3])
b = np.array([2,5])

print("Dot Product = ", dot_product(a,b))

Dot Product =  17


In [6]:
a = np.array([1,3])
b = np.array([2,5])
c = np.array([7,3])
d = np.array([-4,2])
e = np.array([-5,-5])

print(" cos(a,b) = ", dot_product(a,b) / (norm(a,2) * norm(b,2)))
print(" cos(a,c) = ", dot_product(a,c) / (norm(a,2) * norm(c,2)))
print(" cos(a,d) = ", dot_product(a,d) / (norm(a,2) * norm(d,2)))
print(" cos(a,e) = ", dot_product(a,e) / (norm(a,2) * norm(e,2)))

 cos(a,b) =  0.9982743731749958
 cos(a,c) =  0.6643638388299197
 cos(a,d) =  0.14142135623730948
 cos(a,e) =  -0.8944271909999159
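
The values printed above are $\cos\theta$. To recover the angle $\theta$ itself, apply $\arccos$. The following is a minimal sketch, assuming both vectors are nonzero. The name `cosine_angle` is illustrative, and the clipping step is an added safeguard against floating-point round-off pushing the ratio slightly outside $[-1, 1]$.

```python
import numpy as np

def cosine_angle(x, y):
    """Angle (in radians, between 0 and pi) between nonzero vectors x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Guard against round-off such as 1.0000000000000002 before arccos.
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return np.arccos(cos_theta)

a = np.array([1, 3])
others = {"b": [2, 5], "c": [7, 3], "d": [-4, 2], "e": [-5, -5]}

for name, v in others.items():
    theta = cosine_angle(a, v)
    print(f" theta(a,{name}) = {theta:.4f} rad = {np.degrees(theta):.1f} deg")
```

Because $\arccos$ returns values between $0$ and $\pi$, the angle never exceeds 180 degrees.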
