# 05 Frequent Itemsets
__Math 3280 - Data Mining__ : Snow College : Dr. Michael E. Olson

* Leskovec, Chapter 6
-----

We learned earlier in the semester about how similar two objects may be to each other. We'll turn now to association between two kinds of objects. That is, there may not be any similarity between two objects, but there may be a relationship between them. For example, milk and bread are not very similar at all, but they are frequently bought together.

A __market-basket__ model describes relationships between two kinds of objects. 
* *items*
* *baskets* (sometimes called *transactions*), each consisting of a set of items called an *itemset*
    * We usually assume that the size of the itemset is much smaller than the total number of items

## Frequent
A set of items is said to be "frequent" if it appears as a subset of many baskets. Thinking of this mathematically, take an itemset $I$. 
* Define the __support__ of $I$ to be the number of baskets for which $I$ is a subset
* $I$ is frequent if $Support(I) \geq s$ where $s$ is the __support threshold__

Consider the following sets of letters:
* {A, B, C}
* {B, C, F}
* {B, C, D}
* {C, D, E}
* {B, C, E}
* {A, C, D}
* {B, E, F}
* {B, C, E}

For this example, let's set our support threshold to $s = 5$.

A __singleton set__ is a set of just one item. The supports for all the singleton sets are:
$$Support(A) = 2 \qquad Support(B) = 6 \qquad Support(C) = 7 \qquad Support(D) = 3 \qquad Support(E) = 4 \qquad Support(F) = 2$$

From this, only items $B$ and $C$ are frequent, since their supports are at least $s$.

The support for the __doubleton__ subset $I_1 = \{B, E\}$ would be,

$$Support(I_1) = \text{\# of times }\{B,E\}\text{ appears} = 3$$

If we have a support threshold of $s = 5$, then the set $I_1$ is not frequent.

The support for the subset $I_2 = \{B, C\}$ would be,

$$Support(I_2) = \text{\# of times }\{B,C\}\text{ appears} = 5$$

Thus $I_2$ is frequent.
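The counts above can be checked with a few lines of Python. This is only a minimal sketch: the function name `support` and the choice to store each basket as a Python `set` are assumptions made for illustration, not something prescribed by the textbook. An itemset is treated as frequent when its support is at least $s$, which is what makes $\{B, C\}$ (support 5) frequent in the example.

```python
# The eight example baskets from above, each stored as a Python set.
baskets = [
    {"A", "B", "C"}, {"B", "C", "F"}, {"B", "C", "D"}, {"C", "D", "E"},
    {"B", "C", "E"}, {"A", "C", "D"}, {"B", "E", "F"}, {"B", "C", "E"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= basket` is Python's subset test for sets.
    return sum(1 for basket in baskets if itemset <= basket)

s = 5  # support threshold used in this example

# Strings such as "BE" are shorthand here for the itemset {B, E}.
for items in ["A", "B", "C", "D", "E", "F", "BE", "BC"]:
    count = support(items, baskets)
    label = "frequent" if count >= s else "not frequent"
    print(sorted(items), count, label)
```

Looping over every basket for every itemset is fine for eight baskets, but it would be far too slow for a real dataset.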

How can we use this information? Sometimes, the results of this calculation are not very useful. For example, milk and eggs are frequently purchased together, but that tells us little, since both are popular items that most shoppers buy anyway. Hot dogs and mustard are not similar items either, but learning that they are frequently bought together is more useful. This opens a marketing tactic: offer a sale on hot dogs, but increase the price of mustard. When people buy hot dogs because they are on sale, they may say, "Oh, I need mustard," and they'll get it regardless of the price.

Common applications of frequent itemsets:
1. *Related concepts*: words that often appear in conjunction with a topic. For example, how often does the word "civil" come up in an article about "engineering"?
2. *Plagiarism*: sentences that appear in different documents. A document that has a large number of sentences with high support may be indicative of a plagiarized document.
3. *Biomarkers*: genes or proteins that frequently appear together with certain diseases.

## Association Rules
Now, we want to look at how often different subsets appear together. A common application of this would be recommendation systems ("customers who bought what is in your cart also bought [item]"). We write the association of an itemset $I$ with an item $j$ as $I\to j$.

So, how likely is it that a basket containing $I$ also contains $j$? We'll measure this by defining the __confidence__ of $I\to j$ as,
$$Confidence(I\to j) = \frac{Support(I\cup \{j\})}{Support(I)}$$

Using our earlier example, how likely is the set $I = \{B, C\}$ to be associated with $\{E\}$?
$$Confidence(\{B,C\} \to \{E\}) = \frac{Support(\{B, C, E\})}{Support(\{B, C\})} = \frac{2}{5}$$

Another way to think of this: The set $\{B,C\}$ appears 5 times. Of those 5 times, $E$ appears twice.
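As a quick check of this calculation, here is a minimal Python sketch of the confidence formula. The helper names `support` and `confidence` are assumptions for illustration, and `Fraction` is used only so the result prints as an exact fraction like the ones in these notes. The sketch also assumes $Support(I) > 0$; otherwise the division is undefined.

```python
from fractions import Fraction

# The eight example baskets from earlier in these notes.
baskets = [
    {"A", "B", "C"}, {"B", "C", "F"}, {"B", "C", "D"}, {"C", "D", "E"},
    {"B", "C", "E"}, {"A", "C", "D"}, {"B", "E", "F"}, {"B", "C", "E"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    """Support(I union {j}) / Support(I), assuming Support(I) > 0."""
    I = set(I)
    return Fraction(support(I | {j}, baskets), support(I, baskets))

# Confidence of {B, C} -> E ("BC" is shorthand for the itemset {B, C}).
print(confidence("BC", "E", baskets))
```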

The confidence is useful as long as $Support(I)$ is fairly large. However, the confidence means more when the association rule reflects a true relationship. So, we define the __interest__ of an association rule as the difference between the confidence and the fraction of baskets that contain $j$.
$$Interest(I\to j) = Confidence(I\to j) - \frac{Support(\{j\})}{\text{\# of baskets}}$$

The advantage to this is that if $I$ and $j$ aren't associated, then $Confidence(I\to j) \approx \frac{Support(\{j\})}{\text{\# of baskets}}$, so $Interest(I\to j) \approx 0$.

Using our earlier example,
$$Interest(\{B,C\}\to\{E\}) = Confidence(\{B,C\} \to \{E\}) - \frac{Support(E)}{8}$$
$$Interest(\{B,C\}\to\{E\}) = \frac{Support(\{B, C, E\})}{Support(\{B, C\})} - \frac{Support(E)}{8} = \frac{2}{5} - \frac{4}{8} = -\frac{1}{10}$$

What do the numbers mean?
* If the interest is high (strongly positive), then the presence of $I$ in a basket makes $j$ much more likely
* If the interest is highly negative, then the presence of $I$ in a basket discourages $j$
* If the interest is near 0, then any association between $I$ and $j$ is likely coincidental

How does this compare with others? Find the Confidence and Interest of $\{A,B\}\to\{C\}$, $\{B,F\}\to\{E\}$, $\{C,D\}\to\{A\}$, and $\{B,C\}\to\{F\}$.
$$Confidence(\{A,B\}\to\{C\}) = \frac{1}{1} \qquad Interest(\{A,B\}\to\{C\}) = 1 - \frac{7}{8} = \frac{1}{8} = 0.125$$
$$Confidence(\{B,F\}\to\{E\}) = \frac{1}{2} \qquad Interest(\{B,F\}\to\{E\}) = \frac{1}{2} - \frac{4}{8} = 0\qquad\qquad$$
$$Confidence(\{C,D\}\to\{A\}) = \frac{1}{3} \qquad Interest(\{C,D\}\to\{A\}) = \frac{1}{3} - \frac{2}{8} = \frac{1}{12} = 0.0833$$
$$Confidence(\{B,C\}\to\{F\}) = \frac{1}{5} \qquad Interest(\{B,C\}\to\{F\}) = \frac{1}{5} - \frac{2}{8} = -\frac{1}{20} = -0.05$$
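These values can be recomputed directly from the baskets. The following is a sketch under the same assumptions as before: `support`, `confidence`, and `interest` are illustrative helper names, itemsets are written as short strings, and the second term of the interest is the fraction of all baskets that contain $j$.

```python
from fractions import Fraction

# The eight example baskets from earlier in these notes.
baskets = [
    {"A", "B", "C"}, {"B", "C", "F"}, {"B", "C", "D"}, {"C", "D", "E"},
    {"B", "C", "E"}, {"A", "C", "D"}, {"B", "E", "F"}, {"B", "C", "E"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    """Support(I union {j}) / Support(I), assuming Support(I) > 0."""
    I = set(I)
    return Fraction(support(I | {j}, baskets), support(I, baskets))

def interest(I, j, baskets):
    """Confidence(I -> j) minus the fraction of baskets containing j."""
    return confidence(I, j, baskets) - Fraction(support({j}, baskets), len(baskets))

# The rules discussed in this section, written as (I, j) pairs.
rules = [("BC", "E"), ("AB", "C"), ("BF", "E"), ("CD", "A"), ("BC", "F")]
for I, j in rules:
    c = confidence(I, j, baskets)
    i = interest(I, j, baskets)
    print(f"{sorted(I)} -> {j}: confidence = {c}, interest = {i}")
```

Using exact fractions avoids floating-point round-off, which makes it easy to see when an interest is exactly 0.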

With large datasets, a "reasonably high" support threshold is often around 1\% of the baskets.

## A-Priori Algorithm
This method is not too difficult. However, as is always the problem in the real world, we are dealing with very large amounts of data that can't always be held in main memory, much less used for calculations there. So, to simplify the process, we'll look at the __A-Priori__ algorithm, which only counts itemsets made up of frequent items.

A full implementation of the A-Priori algorithm relies on a few mathematical methods that we won't go over here; they are covered in Section 6.2.2 of the *Leskovec* textbook.

The basic principle of A-Priori is to decrease the number of calculations. But in order to do this, we make 2 passes through the data instead of just 1.

__Pass 1__: Go through the data and count the support of each item
* Select only the items whose support is at least the support threshold (again, often around 1\%)
* Begin with a list of $n$ items, narrow that down to $m$ items, where $m$ is a small fraction of $n$.

__Pass 2__: For each basket, keep only the frequent items and count every pair of them. Do this for all baskets. This loses nothing, because a pair can only be frequent if both of its items are frequent: every basket that contains the pair also contains each item on its own.

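The number of possible pairs grows roughly with the square of the number of items, since $m$ items form $\binom{m}{2} \approx m^2/2$ pairs. So if 50\% of the items are eliminated, only about 25\% of the pairs remain to be counted. This saves a lot on time and memory.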
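The two passes can be sketched in Python as follows. This is a simplified illustration, not the textbook's implementation: the function name `apriori_pairs` is made up for this example, and the baskets are assumed to fit in an in-memory list, whereas a real application would read them from disk once per pass.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori sketch: return {pair: support} for frequent pairs."""
    # Pass 1: count every item, then keep the items with support >= s.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # Pass 2: in each basket keep only the frequent items, then count
    # every pair of them. Sorting gives each pair one canonical order.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        pair_counts.update(combinations(kept, 2))

    # Report only the pairs that are themselves frequent.
    return {pair: c for pair, c in pair_counts.items() if c >= s}

# The eight example baskets from earlier in these notes.
baskets = [
    {"A", "B", "C"}, {"B", "C", "F"}, {"B", "C", "D"}, {"C", "D", "E"},
    {"B", "C", "E"}, {"A", "C", "D"}, {"B", "E", "F"}, {"B", "C", "E"},
]

print(apriori_pairs(baskets, s=5))  # {('B', 'C'): 5}
```

With $s = 5$, only $B$ and $C$ survive Pass 1, so Pass 2 counts a single candidate pair instead of all 15 pairs of the six items.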

--------
## Homework
* Exercise 6.1.1 a,b,c
* Exercise 6.1.2
* Exercise 6.1.5 a,b - Find both the confidence and the interest of each association rule
* Exercise 6.1.7 a
* Exercise 6.2.6 a