# Apriori Algorithm

Apriori works by repeatedly joining and pruning the items in a dataset. We first define `min_support`, the threshold that determines which itemsets to keep. The algorithm proceeds as follows:
1. Separate each transaction's list into individual items
2. Keep the items whose support is at least the threshold and delete the ones that fall below it
3. Join the remaining items and list every combination
4. Repeat from step 2 with the new combinations until no combination meets the threshold
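
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `mlxtend` implementation: the function name `apriori_sketch` is hypothetical, and the transactions are the same ones used in the next section.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Step 1: split the transactions into single-item candidates
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # Step 2: keep the candidates whose support reaches the threshold
        survivors = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)

        # Step 3: join survivors into candidates that are one item larger,
        # pruning any candidate with a subset that did not survive
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(sub) in survivors for sub in combinations(c, k - 1))
        }
        # Step 4: the loop repeats from step 2 with the larger candidates
    return frequent


transactions = [
    ['Math', 'Physics', 'Chemistry', 'Biology'],
    ['Math', 'Physics', 'Chemistry'],
    ['Math', 'Economy', 'Sociology'],
    ['Chemistry', 'Biology'],
    ['Sociology', 'Geography'],
    ['Physics', 'Chemistry'],
    ['Chemistry', 'Sociology', 'Biology'],
    ['Math', 'Economy', 'Physics'],
]

frequent = apriori_sketch(transactions, min_support=0.2)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning in step 3 relies on the fact that an itemset cannot be frequent unless all of its subsets are frequent, which is what keeps the number of candidates small.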

## 1. Generating Frequent Itemsets
We're going to create a dataset that contains lists of items. This is the primary dataset we will use to find the association rules between items.

In [8]:
dataset = [
    ['Math', 'Physics', 'Chemistry', 'Biology'],
    ['Math', 'Physics', 'Chemistry'],
    ['Math', 'Economy', 'Sociology'],
    ['Chemistry', 'Biology'],
    ['Sociology', 'Geography'],
    ['Physics', 'Chemistry'],
    ['Chemistry', 'Sociology', 'Biology'],
    ['Math', 'Economy', 'Physics'],
]

Since the `apriori` function expects the data as a one-hot encoded DataFrame, we can transform our dataset using `TransactionEncoder`.

In [9]:
# Import the necessary library
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Encode the dataset and turn it into dataframe
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

df

,Biology,Chemistry,Economy,Geography,Math,Physics,Sociology
0,True,True,False,False,True,True,False
1,False,True,False,False,True,True,False
2,False,False,True,False,True,False,True
3,True,True,False,False,False,False,False
4,False,False,False,True,False,False,True
5,False,True,False,False,False,True,False
6,True,True,False,False,False,False,True
7,False,False,True,False,True,True,False
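
To see what the encoder produces, here is a rough plain-Python equivalent. `one_hot_encode` is a hypothetical helper, not part of `mlxtend`; it assumes the columns are the distinct items in sorted order, which matches the table above.

```python
def one_hot_encode(transactions):
    """Return (columns, rows); rows[i][j] is True if transaction i contains columns[j]."""
    # One column per distinct item, in sorted order
    columns = sorted({item for t in transactions for item in t})
    # One row per transaction, one boolean per column
    rows = [[col in t for col in columns] for t in transactions]
    return columns, rows


transactions = [
    ['Math', 'Physics', 'Chemistry', 'Biology'],
    ['Math', 'Physics', 'Chemistry'],
    ['Math', 'Economy', 'Sociology'],
    ['Chemistry', 'Biology'],
    ['Sociology', 'Geography'],
    ['Physics', 'Chemistry'],
    ['Chemistry', 'Sociology', 'Biology'],
    ['Math', 'Economy', 'Physics'],
]

columns, rows = one_hot_encode(transactions)
print(columns)
print(rows[0])
```

Each row has exactly as many `True` cells as the transaction has items, so the column means of this table are the single-item supports.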


We're going to choose **20%** as our threshold value. Thus, we only keep itemsets that appear in at least 20% of the transactions in the dataset.

In [13]:
from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.2)

,support,itemsets
0,0.375,(0)
1,0.625,(1)
2,0.25,(2)
3,0.5,(4)
4,0.5,(5)
5,0.375,(6)
6,0.375,"(0, 1)"
7,0.25,"(1, 4)"
8,0.375,"(1, 5)"
9,0.25,"(2, 4)"
10,0.375,"(4, 5)"
11,0.25,"(1, 4, 5)"


For better readability, we can set `use_colnames=True` to convert these integer column indices into the respective item names.

In [14]:
apriori(df, min_support=0.2, use_colnames=True)

,support,itemsets
0,0.375,(Biology)
1,0.625,(Chemistry)
2,0.25,(Economy)
3,0.5,(Math)
4,0.5,(Physics)
5,0.375,(Sociology)
6,0.375,"(Chemistry, Biology)"
7,0.25,"(Math, Chemistry)"
8,0.375,"(Chemistry, Physics)"
9,0.25,"(Economy, Math)"
10,0.375,"(Math, Physics)"
11,0.25,"(Math, Chemistry, Physics)"
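
The support values in the table can be checked by hand. As a sketch, write $T$ for the list of transactions and $X$ for an itemset (both symbols are notation assumed here, not taken from the library). Support is then the fraction of transactions that contain every item of $X$:

```latex
\operatorname{supp}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|}
\qquad
\operatorname{supp}(\{\text{Chemistry}, \text{Physics}\}) = \frac{3}{8} = 0.375
\qquad
\operatorname{supp}(\{\text{Geography}\}) = \frac{1}{8} = 0.125 < 0.2
```

Chemistry and Physics appear together in transactions 0, 1, and 5, which gives 0.375. Geography appears only in transaction 4, so it falls below the threshold and is missing from the output. With 8 transactions, a threshold of 0.2 means an itemset must appear in at least 2 of them, since 0.2 × 8 = 1.6.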


## 2. Selecting and Filtering Results

Let's assume we are only interested in itemsets of length at least 2 that have a support of at least 20 percent. First, we create the frequent itemsets via `apriori` and add a new column that stores the length of each itemset.

In [15]:
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
# Store the number of items in each itemset
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

,support,itemsets,length
0,0.375,(Biology),1
1,0.625,(Chemistry),1
2,0.25,(Economy),1
3,0.5,(Math),1
4,0.5,(Physics),1
5,0.375,(Sociology),1
6,0.375,"(Chemistry, Biology)",2
7,0.25,"(Math, Chemistry)",2
8,0.375,"(Chemistry, Physics)",2
9,0.25,"(Economy, Math)",2
10,0.375,"(Math, Physics)",2
11,0.25,"(Math, Chemistry, Physics)",3


In [16]:
# Keep the itemsets with at least 2 items and a support of at least 20%
frequent_itemsets[(frequent_itemsets['length'] >= 2) & (frequent_itemsets['support'] >= 0.2)]

,support,itemsets,length
6,0.375,"(Chemistry, Biology)",2
7,0.25,"(Math, Chemistry)",2
8,0.375,"(Chemistry, Physics)",2
9,0.25,"(Economy, Math)",2
10,0.375,"(Math, Physics)",2
11,0.25,"(Math, Chemistry, Physics)",3


Using the pandas API, we can also select entries based on the `itemsets` column.

In [17]:
frequent_itemsets[frequent_itemsets['itemsets'] == {'Economy', 'Math'}]

,support,itemsets,length
9,0.25,"(Economy, Math)",2
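
The comparison works regardless of item order because `mlxtend` stores each itemset as a `frozenset`, and a `frozenset` compares equal to any set with the same elements. The following plain-Python sketch illustrates this with a few hand-written itemsets; the variable names are illustrative only.

```python
# A few itemsets written out by hand, in the same type mlxtend uses
itemsets = [
    frozenset(['Chemistry', 'Biology']),
    frozenset(['Math', 'Economy']),
    frozenset(['Math', 'Chemistry', 'Physics']),
]

# Equality ignores the order in which the items were written
target = {'Economy', 'Math'}
exact_matches = [s for s in itemsets if s == target]

# A superset test selects every itemset that contains a given item
with_math = [s for s in itemsets if s >= {'Math'}]

print(exact_matches)
print(len(with_math))
```

The same superset test can be passed to `apply` on the `itemsets` column to build a boolean mask, for example to keep every itemset that contains Math.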


## 3. Working with Sparse Representations

To save memory, we can represent the transaction data in a sparse format. This is especially useful if you have many products and small transactions.

In [18]:
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df

,Biology,Chemistry,Economy,Geography,Math,Physics,Sociology
0,1,1,0,0,1,1,0
1,0,1,0,0,1,1,0
2,0,0,1,0,1,0,1
3,1,1,0,0,0,0,0
4,0,0,0,1,0,0,1
5,0,1,0,0,0,1,0
6,1,1,0,0,0,0,1
7,0,0,1,0,1,1,0
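
To get a feel for how much a sparse layout can save, one can compare the number of cells in the dense table with the number of nonzero cells. The sketch below is only a rough illustration in plain Python: it stores the nonzero cells as a coordinate list, which is one simple sparse layout and not necessarily the one used internally by pandas.

```python
transactions = [
    ['Math', 'Physics', 'Chemistry', 'Biology'],
    ['Math', 'Physics', 'Chemistry'],
    ['Math', 'Economy', 'Sociology'],
    ['Chemistry', 'Biology'],
    ['Sociology', 'Geography'],
    ['Physics', 'Chemistry'],
    ['Chemistry', 'Sociology', 'Biology'],
    ['Math', 'Economy', 'Physics'],
]
columns = sorted({item for t in transactions for item in t})

# Dense layout: one cell for every (transaction, item) pair
dense_cells = len(transactions) * len(columns)

# Sparse layout: only the (row, column) coordinates of the nonzero cells
coords = [(i, columns.index(item)) for i, t in enumerate(transactions) for item in t]

density = len(coords) / dense_cells
print(dense_cells, len(coords), round(density, 3))
```

In this small example about 39% of the cells are nonzero, so the saving is modest. The number of nonzero cells grows with the total number of purchased items, while the dense table grows with the number of distinct products, so the gap widens as the catalogue gets larger and the transactions stay small.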


In [19]:
apriori(sparse_df, min_support=0.2, use_colnames=True, verbose=1)

Processing 15 combinations | Sampling itemset size 3


,support,itemsets
0,0.375,(Biology)
1,0.625,(Chemistry)
2,0.25,(Economy)
3,0.5,(Math)
4,0.5,(Physics)
5,0.375,(Sociology)
6,0.375,"(Chemistry, Biology)"
7,0.25,"(Math, Chemistry)"
8,0.375,"(Chemistry, Physics)"
9,0.25,"(Economy, Math)"
10,0.375,"(Math, Physics)"
11,0.25,"(Math, Chemistry, Physics)"
