# Lecture 7 - Big Data Algorithms Part 2

## TODAY


- Finding (Filter) Similar Items
- Association Rules / Recommender Systems

# FINDING SIMILAR ITEMS
for instance, finding near-duplicate pages

---

### Many other problems can be expressed as finding “similar” sets, that is, finding near neighbors in a high-dimensional space

#### Examples:

- Pages with similar words - For duplicate detection, classification by topic
- Customers who purchased similar products - Products with similar customer sets
- Images with similar features - Users who visited similar websites

## The problem could be stated as: 

$N$ data points  $x_1, x_2,...$ where each is high dimensional, and $N$ is big.

$x_i$ could be for instance, 

- large documents
- images
- structured data

Goal: 

- Find all pairs of data points where $d(x_i,x_j) < s$, where $d$ is a distance measure and $s$ is a distance threshold;
- The naïve solution compares every pair and takes $O(N^2)$ time

## Jaccard Similarity and Distance of Sets

- Jaccard Similarity:

$$SIM(C1, C2) = \frac{|C1 ∩ C2|}{|C1 ∪ C2|}$$

- Jaccard Distance:

$$D(C1, C2) = 1-\frac{|C1 ∩ C2|}{|C1 ∪ C2|}$$
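
The two formulas above translate almost directly into set operations. The following is a minimal sketch; the function names `jaccard_similarity` and `jaccard_distance` are illustrative, and treating two empty sets as identical is an assumption made here to avoid a division by zero.

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """|C1 ∩ C2| / |C1 ∪ C2|. Two empty sets are treated as identical (assumption)."""
    union = c1 | c2
    if not union:
        return 1.0
    return len(c1 & c2) / len(union)


def jaccard_distance(c1: set, c2: set) -> float:
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)


a = {"ab", "bc", "cd"}
b = {"bc", "cd", "de"}
print(jaccard_similarity(a, b))  # 2 shared elements out of 4 in total -> 0.5
print(jaccard_distance(a, b))    # 0.5
```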


---


<img src="images/jaccard.png" style="width:60%"/>



### Naive Solution

For each element $x_i$, calculate the distance $d(x_i,x_j)$ to every other element $x_j$.

This means $\frac{N(N-1)}{2}$ calculations


Suppose we need to apply this to N = 10 million records


If each comparison takes $1\mu s$, the roughly $5 \times 10^{13}$ comparisons would take more than a year ....
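
The estimate can be checked with a few lines of arithmetic. This sketch only assumes the figures stated in the text (10 million records, one microsecond per comparison) and a 365-day year.

```python
n = 10_000_000                      # number of records
pairs = n * (n - 1) // 2            # N(N-1)/2 distinct pairs
seconds = pairs * 1e-6              # one microsecond per comparison
years = seconds / (365 * 24 * 3600)

print(f"{pairs:.3e} comparisons, about {years:.2f} years")
```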


##  Finding Similar Items Efficiently

By comparing only candidate pairs instead of all pairs among the N items, we can greatly reduce the cost of the naive approach. The idea is to select, a priori, the pairs that are likely to be similar:



```
Shingling => Min-Hashing => Locality-Sensitive Hashing
```

<img src="images/lsh.png" style="width:60%"/>



#### Shingling 
- Convert documents to sets

Basic principle:

A document is a string of characters. Define a k-shingle for a document to be
any substring of length k found within the document.

> Considering the string D = "abcdabd", and we pick k = 2. Then the set of 2-shingles for D is D1 = {ab,bc,cd,da,bd}.

> k should be picked large enough that the probability of any given shingle appearing in any given document is low.

Thus, if our corpus of documents is emails, picking k = 5 should be fine. To see why, suppose that only letters and a general white-space character appear in emails (randomly). If so, then there would be $27^5 = 14,348,907$ possible shingles. Since the typical email is much smaller than 14 million characters long, we would expect k = 5 to work well, and indeed it does.

> But certain letters are used more than others....

A good rule of thumb is to imagine that there are only 20 characters (the most used ones) and estimate the number of k-shingles as $20^k$

Finally, instead of using substrings directly as shingles, for instance

$D1 = \{ab,bc,cd,da,bd\}$

we can pick a hash function that maps strings of length k to some number of buckets and treat the resulting bucket number as the shingle, for example:

$h(D1) = \{1, 5, 7, 4, 2\}$
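
A possible sketch of shingling, first with plain substrings and then with hashed shingles. The helper names are illustrative, and `zlib.crc32` is just one convenient deterministic hash chosen here; the notes do not prescribe a particular hash function or bucket count.

```python
import zlib


def k_shingles(text: str, k: int) -> set:
    """All distinct substrings of length k found in text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def hashed_shingles(text: str, k: int, n_buckets: int) -> set:
    """Replace each shingle by the number of the bucket it hashes to."""
    return {zlib.crc32(s.encode("utf-8")) % n_buckets for s in k_shingles(text, k)}


print(sorted(k_shingles("abcdabd", 2)))   # ['ab', 'bc', 'bd', 'cd', 'da']
print(hashed_shingles("abcdabd", 2, n_buckets=2**16))
```

Note that "ab" occurs twice in the string but only once in the set, and that hashing can only keep or reduce the number of distinct elements.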


Each document D can be represented as a binary (0 or 1) vector in the space of k-shingles

- Each unique shingle is a dimension
- Vectors are very sparse

> Documents that have lots of shingles in common have similar text, even if the text appears in different order - we can now measure the common shingles with **Jaccard Distance**.

If we use a binary vector over the k-shingles to define each document, we can use bitwise calculations:

- set intersection as bitwise AND
- set union as bitwise OR


C1 = 10111; 
C2 = 10011

- **Size of intersection** = 3; 
- **size of union** = 4,

- **Jaccard similarity (not distance)** = 3/4

- **Distance**: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
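
The same numbers can be reproduced with integer bit operations. This is a small sketch in which each document vector is stored as a Python integer; the function name `jaccard_bits` is illustrative.

```python
def jaccard_bits(c1: int, c2: int) -> float:
    """Jaccard similarity of two bit vectors stored as integers."""
    intersection = bin(c1 & c2).count("1")  # bitwise AND = set intersection
    union = bin(c1 | c2).count("1")         # bitwise OR  = set union
    return intersection / union if union else 1.0


c1 = 0b10111
c2 = 0b10011
similarity = jaccard_bits(c1, c2)
print(similarity, 1 - similarity)  # 0.75 0.25
```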

#### Min-Hashing 

- Convert large sets to short signatures, while preserving similarity

So, we have a very sparse matrix of this form:

<img src="images/lsh_2.png" style="width:30%"/>

A naive approach would force us to compare every pair of columns of this matrix in full. Instead, we can try to find a short signature for each column such that the similarity of signatures approximates the similarity of columns:


KEY IDEA:

“hash” each column C to a small signature h(C) such that:

- h(C) is small enough that the signature fits in RAM
- sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

or at least, find a hash function h(·) such that:

- If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)   
- If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

> There is a suitable hash function for the Jaccard similarity: It is called Min-Hashing

1) Consider a random permutation $π$ of the rows of the sparse Boolean matrix of shingles

2) Define a “hash” function $h_π(C)$ as the index of the first row (in the permuted order $π$) in which column C has value 1

3) Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
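
Step 2 can be sketched for a single permutation as follows. The column is represented by the set of row indices where it has a 1, and the permutation is given as the order in which rows are visited; both representations, and the example values, are assumptions made for illustration.

```python
def min_hash(column_rows: set, permuted_order: list) -> int:
    """Index, in the permuted order, of the first row where the column has a 1."""
    for position, row in enumerate(permuted_order):
        if row in column_rows:
            return position
    raise ValueError("the column has no 1s")


column = {1, 4}            # rows 1 and 4 contain a 1 (hypothetical column)
pi = [2, 4, 0, 3, 1]       # rows are visited in this order (hypothetical permutation)
print(min_hash(column, pi))  # row 4 is reached at position 1 -> 1
```

A signature is then the list of such values for several independent permutations.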


<img src="images/minhashing_1.png" style="width:60%"/>


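#### Why is Min-Hashing a good indicator of similarity?

For a random permutation $π$, the probability that $h_π(C1) = h_π(C2)$ equals the Jaccard similarity $sim(C1, C2)$. The similarity of two signatures is the fraction of the hash functions in which they agree, so it estimates the Jaccard similarity of the columns.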
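
A small experiment can illustrate this. The sketch below builds signatures from explicit random permutations (fine for a toy example, although real implementations usually simulate permutations with hash functions) and compares the signature agreement with the true Jaccard similarity. The column contents, the number of rows and the number of permutations are arbitrary choices.

```python
import random


def signatures(columns, n_rows, n_hashes, seed=0):
    """Min-hash signatures; every column is hashed with the same permutations."""
    rng = random.Random(seed)
    sigs = [[] for _ in columns]
    for _ in range(n_hashes):
        position = list(range(n_rows))
        rng.shuffle(position)        # position[row] = rank of row in this permutation
        for sig, col in zip(sigs, columns):
            sig.append(min(position[row] for row in col))
    return sigs


def signature_similarity(s1, s2):
    """Fraction of hash functions on which two signatures agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


c1 = set(range(0, 60))
c2 = set(range(20, 80))              # true Jaccard similarity: 40 / 80 = 0.5
s1, s2 = signatures([c1, c2], n_rows=100, n_hashes=1000)
print(signature_similarity(s1, s2))  # an estimate that should be close to 0.5
```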


#### Locality-Sensitive Hashing (LSH)

- Focus on pairs of signatures likely to be from similar documents

Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s=0.8)

**LSH GENERAL IDEA:** Use a function f(x,y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated

**OPTION 1)** Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows


<img src="images/minhashmatrix.png" style="width:30%"/>

This means $M(i,x)=M(i,y)$ for at least a fraction $s$ of the values of $i$

**LSH MAIN IDEA:** Hash columns of the signature matrix M several times

This improves performance once more:

<img src="images/LSH_1.png" style="width:50%"/>

- Divide matrix M into b bands of r rows   
- For each band, hash its portion of each column to a hash table with k buckets   
- Make k as large as possible
- Candidate column pairs are those that hash to the same bucket for ≥ 1 band
- Tune b and r to catch most similar pairs, but few non-similar pairs
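
The banding steps above can be sketched as follows. Using the band's tuple of values directly as a dictionary key is a simplification adopted here: it behaves like a hash table with so many buckets that only identical bands collide. The function name and the toy signatures are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures: dict, b: int, r: int) -> set:
    """Pairs of columns whose signatures are identical in at least one band.

    signatures maps a column name to a signature of length b * r.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's portion of the column
            buckets[key].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


sigs = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 2, 9, 9, 5, 7],
    "z": [8, 8, 8, 8, 8, 8],
}
print(lsh_candidates(sigs, b=3, r=2))  # {('x', 'y')}: they agree on the first band
```

Only the candidate pairs returned here need their similarity checked in full.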

<img src="images/LSH_3.png" style="width:50%"/>

**Pick:**
- The number of Min-Hashes (rows of M)
- The number of bands b, and
- The number of rows r per band 

to balance false positives/negatives
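
To see how b and r shape this trade-off, one can use the standard result for the banding technique: two columns with similarity s agree on all r rows of a given band with probability $s^r$, so they become a candidate pair with probability $1 - (1 - s^r)^b$. The sketch below tabulates this for one choice of parameters; the values r = 5 and b = 20 are only an example.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s share at least one band."""
    return 1 - (1 - s ** r) ** b


for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r=5, b=20), 4))
```

Increasing r makes each band stricter (fewer false positives), while increasing b gives similar pairs more chances to collide (fewer false negatives).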

---

# Association Rules
---

Given:

- A large set of **items**
- A large set of **baskets**

the main objective is to answer something like this:

```
People who bought {x,y,z} tend to buy {v,w}

```

This is an **association rule**.

----

Consider the following Market-Basket with 5 transactions:

<img src="images/market_basket.png" style="width:30%"/>

The form of an **association rule** is 

$$I → j$$

where I is a set of items and j is an item. The implication of this association rule is that if all of the items in I appear in some basket, then j is “likely” to appear in that basket as well. Here, "likely" is measured by the confidence of the rule, defined below.





### Definitions

**Support $S(I + j)$:** the fraction of baskets that contain all the items of $I$ together with $j$

$$S(I + j) = \frac{\#(I \cup \{j\})}{T}$$

where $T$ is the total number of baskets and $\#(X)$ is the number of baskets containing every item of $X$


**Confidence $C(I → j)$:** the fraction of the baskets containing all of $I$ that also contain $j$

$$C(I → j) = \frac{\#(I \cup \{j\})}{\#I} = \frac{S(I + j)}{S(I)}$$


In the previous case, for example for the rule {Milk, Diaper} → {Beer}, S = 0.4 and C = 0.67
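
The definitions can be sketched in a few lines. The basket contents below are an assumption: the market-basket table is only available as an image, so this uses a common five-transaction textbook example that reproduces the values S = 0.4 and C = 0.67. The function names are illustrative.

```python
# Assumed basket contents (the original table is an image and may differ)
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset: set, baskets: list) -> float:
    """Fraction of baskets that contain every item of itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs: set, rhs: set, baskets: list) -> float:
    """S(lhs + rhs) / S(lhs)."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)


print(support({"Milk", "Diaper", "Beer"}, baskets))                  # 0.4
print(round(confidence({"Milk", "Diaper"}, {"Beer"}, baskets), 2))   # 0.67
```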









Given a support threshold **s**, sets of items that appear in at least s baskets are called **frequent itemsets**


<img src="images/market_basket.png" style="width:30%"/>

Example of Rules:

- {Milk,Diaper} → {Beer} (S=0.4, C=0.67) 
- {Milk,Beer} → {Diaper} (S=0.4, C=1.0) 
- {Diaper,Beer} → {Milk} (S=0.4, C=0.67) 
- {Beer} → {Milk,Diaper} (S=0.4, C=0.67) 
- {Diaper} → {Milk,Beer} (S=0.4, C=0.5) 
- {Milk} → {Diaper,Beer} (S=0.4, C=0.5)

Comments:

- All the rules above correspond to the same itemset: {Milk, Diaper, Beer}
- Rules obtained from the same itemset have identical support but can have different confidence





### How to Mine Association Rules?


#### brute-force
A brute-force approach for mining association rules is to compute the support and confidence for every possible rule
and keep the ones above specific thresholds.

This approach is **prohibitively expensive** because there are exponentially many rules that can be extracted from a data set.

For $N$ transactions with $d$ items each, we would have to evaluate $S$ and $C$ for $M$ candidate itemsets per transaction, where $M = 2^d-1$ (as we will see). This means a complexity of

$$O(NM)$$

<img src="images/transactions_1.png" style="width:50%"/>

For instance, $N=100$ transactions with 20 items each ($d = 20$) will need about $100 \times 2^{20} = 104 \space 857 \space 600$ comparisons (!!) 

---

#### Two step approach:

1.  Generate all frequent itemsets (sets of items whose ```Support ≥ s```)
2.  Generate high confidence association rules from each frequent itemset

–  Each rule is a binary partition of a frequent itemset: a rule can only have high support if the itemset it is built from is frequent (that is exactly what high support means!)


> Frequent itemset generation is the hardest operation

----

### Generating Frequent Itemsets ($M$):

If we have $d$ items in a transaction, the number of possible non-empty candidate itemsets is given by 

$$\sum_{i=1}^{d} \binom{d}{i} = 2^d - 1$$
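
The count of non-empty itemsets can be verified numerically. This short sketch sums the binomial coefficients for itemset sizes 1 to d; the function name is illustrative.

```python
from math import comb


def n_itemsets(d: int) -> int:
    """Number of non-empty itemsets that can be formed from d items."""
    return sum(comb(d, i) for i in range(1, d + 1))


print(n_itemsets(20), 2 ** 20 - 1)  # 1048575 1048575
```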

This search needs something that reduces the number of candidates. What if we could filter them using some previously known information? 

> Let's imagine that a pair $w$ is not very frequent. What happens to any itemset that contains $w$?

<img src="images/apriori.png" style="width:50%"/>

> It is not very frequent either! 

We can state:
> If a set of items I appears at least s times, so does every subset J of I. Conversely, if J is not frequent, no superset of J can be frequent.

So we can reduce our search to pairs:


1. Read the baskets and count in main memory the occurrences of each individual item 
    - Needs memory proportional to the number of items
2. Read the baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1), then filter them by support.
    - Needs the list of frequent items plus memory proportional to the square of the number of frequent items
3. Then move on to the 3-itemsets, and so on

#### Let's see an example of apriori usage:

CONSIDER: 

- Ck = candidate k-tuples = those that might be frequent sets (support > s) based on information from the pass for k–1
- Lk = the set of truly frequent k-tuples


EXAMPLE: 

- C1 ={{b}{c}{j}{m}{n}{p}}
- Count the support of itemsets in C1
- Prune non-frequent: L1 = { b, c, j, m }
- Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
- Count the support of itemsets in C2
- Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
- Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} } (with full subset pruning only {b,c,m} would remain, since {b,j} and {m,j} are not in L2)
- Count the support of itemsets in C3
- Prune non-frequent: L3 = { {b,c,m} }
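
The candidate/prune loop can be sketched as below. This is a compact in-memory illustration rather than a tuned implementation, and the baskets are hypothetical: they were chosen so that, with s = 3, the frequent sets come out the same as L1, L2 and L3 in the example, but they are not the data behind the figure. Note that this sketch applies full subset pruning when generating candidates.

```python
from itertools import combinations


def apriori(baskets: list, s: int) -> dict:
    """Return {itemset: count} for every itemset contained in at least s baskets."""
    def count_frequent(candidates):
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= s}

    items = {item for basket in baskets for item in basket}
    frequent = count_frequent({frozenset([i]) for i in items})   # L1
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # C_k: unions of two frequent (k-1)-sets that contain exactly k items ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... keeping only those whose (k-1)-subsets are all frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        frequent = count_frequent(candidates)                    # L_k
        result.update(frequent)
        k += 1
    return result


# Hypothetical baskets, not the original data
baskets = [{"b", "c", "m"}, {"b", "m", "n"}, {"b", "c", "m", "j"},
           {"c", "j"}, {"b", "c", "m"}, {"c", "j", "p"}]
freq = apriori(baskets, s=3)
for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```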


<img src="images/apriori_2.png" style="width:60%"/>


----

### Generate high confidence association rules from each frequent itemset

Find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement.
(Support is already assured for being a "frequent itemset")


–  Given {A,B,C,D} as a frequent itemset, these are the candidate rules:

- ABC → D | A → BCD | AB → CD | BD → AC
- ABD → C | B → ACD | AC →  BD | CD → AB
- ACD → B | C → ABD | AD →  BC
- BCD → A | D → ABC | BC → AD

How to compute the confidence (C) from the supports:

$$C(A,B → C,D) = \frac{S(A,B,C,D)}{S(A,B)} = \frac{\#(A,B,C,D)}{\#(A,B)}$$

where the last expression is the definition of C(A,B → C,D)
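
Rule generation from one frequent itemset can be sketched as follows. The function names are illustrative, and the basket counts used in the second part are hypothetical values for a three-item set, included only to show the confidence filter.

```python
from itertools import combinations


def candidate_rules(itemset) -> list:
    """All rules lhs -> rhs where lhs and rhs split the itemset into two non-empty parts."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules


def rules_above(itemset, counts: dict, min_conf: float) -> list:
    """Keep the rules whose confidence #(itemset) / #(lhs) reaches min_conf."""
    whole = counts[frozenset(itemset)]
    kept = []
    for lhs, rhs in candidate_rules(itemset):
        conf = whole / counts[frozenset(lhs)]
        if conf >= min_conf:
            kept.append((lhs, rhs, conf))
    return kept


print(len(candidate_rules("ABCD")))  # 14 candidate rules, as listed above

# Hypothetical basket counts for the itemset {b, c, m} and its subsets
counts = {frozenset("b"): 4, frozenset("c"): 5, frozenset("m"): 4,
          frozenset("bc"): 3, frozenset("bm"): 4, frozenset("cm"): 3,
          frozenset("bcm"): 3}
print(rules_above("bcm", counts, min_conf=0.9))
```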



## PCY (Park, Chen, Yu)

In pass 1 of A-Priori, most memory is idle (not used)

- We store only individual item counts

FIRST IDEA:

In addition to item counts, maintain a hash table with as many buckets as fit in memory. Each pair of items is hashed to a bucket, and for each bucket we keep only the count of the pairs hashed to it;

So, in Pass 1 of PCY we:

#### PCY PASS 1
- We store individual item counts;
- Store a count of hashed pairs of items for each bucket;


> OBSERVATIONS: 
> - If a bucket contains a frequent pair, then the bucket is surely frequent
> - However, even without any frequent pair, a bucket can still be frequent

So, for a bucket with total count less than s, none of its pairs can be frequent, and so **pairs that hash to this bucket can be eliminated as candidates**, even if both of their items are frequent.

#### PCY PASS 2

Only count pairs of frequent items that hash to frequent buckets


SECOND IDEA:

Implementation can use a "kind of" Bloom Filter:

> We keep a bit vector instead of the bucket counts: 1 means the bucket "is frequent", that is, its count is at least s
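
Both passes and the bit vector can be put together in a short sketch. Everything here is illustrative: the function names, the use of `zlib.crc32` as the pair hash, the number of buckets and the baskets are assumptions, and a real implementation would stream the baskets from disk instead of holding them in a list.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket_of(pair: tuple, n_buckets: int) -> int:
    """Deterministic hash of a pair of item names to a bucket number."""
    return zlib.crc32("|".join(pair).encode("utf-8")) % n_buckets


def pcy_frequent_pairs(baskets: list, s: int, n_buckets: int) -> dict:
    # Pass 1: count individual items and the number of pairs hashed to each bucket
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[bucket_of(pair, n_buckets)] += 1

    # Between passes: keep the frequent items and compress the buckets into a bit vector
    frequent_items = {item for item, n in item_counts.items() if n >= s}
    bitmap = [n >= s for n in bucket_counts]

    # Pass 2: count only pairs of frequent items that hash to a frequent bucket
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket_of(pair, n_buckets)]:
                pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= s}


# Hypothetical baskets
baskets = [{"b", "c", "m"}, {"b", "m", "n"}, {"b", "c", "m", "j"},
           {"c", "j"}, {"b", "c", "m"}, {"c", "j", "p"}]
print(pcy_frequent_pairs(baskets, s=3, n_buckets=7))
```

The bucket filter never discards a truly frequent pair, so the result does not depend on the number of buckets; more buckets only mean fewer useless candidates counted in pass 2.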

<img src="images/PCY.png" style="width:50%" />

