# Apriori

Association rule learning is a family of methods for discovering relationships between variables, and Apriori is one of them. The name comes from __a priori__, which is Latin for _from the former_. This is fitting, as Apriori estimates how likely one circumstance is given that another has occurred. From a business perspective, it can be thought of as:

> "People who bought X also bought Y..."

Essentially, Apriori forms connections between seemingly unrelated variables in a dataset.

## Definitions

1. __Support:__ The fraction of all instances in which X occurs.

$$\large \mathit{support}(X) = \frac{\text{# of X Instances}}{\text{# of Total Instances}} $$

2. __Confidence:__ The fraction of instances containing X that also contain Y.

$$\large \mathit{confidence}(X \to Y) = \frac{\text{# of X & Y Instances}}{\text{# of X Instances}} $$

3. __Lift:__ Lift is used with many machine learning models; for Apriori, it measures how strong the connection between two variables is. In terms of the formula, it is how much more likely Y is among datapoints containing X than among all datapoints. A lift above 1 means that X makes Y more likely.

$$\large \mathit{lift}(X \to Y) = \frac{\mathit{confidence}(X \to Y)}{\mathit{support}(Y)} $$
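To make the three definitions concrete, here is a minimal sketch that computes them on a handful of made-up transactions. The data and the helper names (`support`, `confidence`, `lift`) are assumptions for illustration only. Lift is computed as confidence divided by the overall support of Y, following the description of lift as the improvement over choosing from all datapoints.

```python
# Toy transactions, invented purely for illustration.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """How much more likely Y is given X than Y is overall."""
    return confidence(x, y, transactions) / support(y, transactions)

x, y = {"bread"}, {"butter"}
print("support(bread)           =", round(support(x, transactions), 3))
print("confidence(bread→butter) =", round(confidence(x, y, transactions), 3))
print("lift(bread→butter)       =", round(lift(x, y, transactions), 3))
```

Here bread appears in 3 of 5 baskets, butter appears in 2 of those 3, and butter appears in 3 of 5 baskets overall, so the lift is only slightly above 1.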

## Algorithm

1. Set a minimum support and confidence; this is to prevent weak correlations from being considered and wasting computational power. 

2. Generate all itemsets _(subsets of items)_ in the dataset that meet the minimum support.
3. Generate all rules _(connections)_ from those itemsets that meet the minimum confidence.
4. Sort all rules by decreasing lift so that the most important connection can be considered first. 
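The four steps above can be sketched in plain Python. This is a simplified illustration, not the implementation used by any particular library; the function name `apriori_sketch` and the toy data are assumptions made for the example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support, min_confidence):
    """Return rules as (lhs, rhs, support, confidence, lift), sorted by lift."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 2: level-wise search for itemsets that meet min_support.
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        survivors = {}
        for itemset in level:
            sup = support(itemset)
            if sup >= min_support:
                survivors[itemset] = sup
        frequent.update(survivors)
        k += 1
        # Join survivors into candidates one item larger, keeping only those
        # whose (k-1)-item subsets all survived (the Apriori property).
        level = {
            a | b
            for a in survivors
            for b in survivors
            if len(a | b) == k
            and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))
        }

    # Step 3: split each frequent itemset into lhs -> rhs and filter by confidence.
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    rule_lift = conf / frequent[rhs]
                    rules.append((tuple(sorted(lhs)), tuple(sorted(rhs)),
                                  sup, conf, rule_lift))

    # Step 4: strongest connections first.
    rules.sort(key=lambda rule: rule[4], reverse=True)
    return rules

# Step 1: thresholds chosen arbitrarily for this toy data.
toy = [{"bread", "butter"}, {"bread", "butter", "jam"},
       {"bread"}, {"butter", "jam"}, {"milk"}]
for rule in apriori_sketch(toy, min_support=0.4, min_confidence=0.6):
    print(rule)
```

The pruning in step 2 is what keeps the search tractable: an itemset can only be frequent if every smaller subset of it is frequent, so most candidates are discarded without ever being counted.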

<hr>

## Code

__Setting up the Dataset:__

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('market_basket_optimization.csv', header = None)

__NOTE:__ This dataset differs from those seen so far: it is a _CSV_ file containing a list of purchases from a grocery store. Each row is the set of purchases made by one buyer, and each column holds one item bought. The first line shouldn't be treated as a header, because it too contains a buyer's purchases.
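As a small illustration of this layout, the sketch below loads a three-row stand-in for the file. The item names and the `sample_csv` buffer are made up for the example and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: one basket per line, no header row.
sample_csv = io.StringIO(
    "shrimp,almonds,avocado\n"
    "burgers,eggs\n"
    "chutney\n"
)

# header=None keeps the first basket as data instead of using it as column names.
sample = pd.read_csv(sample_csv, header=None)
print(sample)

# Shorter baskets are padded with NaN up to the width of the longest row.
print("NaN cells:", int(sample.isna().sum().sum()))

# One possible way to recover the baskets without the padding.
baskets = [[item for item in row if pd.notna(item)] for row in sample.values]
print(baskets)
```

Because baskets have different sizes, the resulting frame is padded with `NaN`, which is why the cells need some handling before they are passed to an Apriori implementation.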

<hr>

__Performing Apriori:__

In [2]:
# Convert every cell to a string so all items share one type; empty cells (NaN) become the string 'nan'.
filtered_dataset = [[str(element) for element in row] for row in dataset.values]

# apyori is not bundled with the usual data science libraries; install it from PyPI (pip install apyori).
from apyori import apriori
rules = apriori(filtered_dataset, min_support = ((3 * 7) / 5000), min_confidence = 0.2, 
                min_lift = 3, min_length = 2, max_length = 2)

__Reasoning:__
* _Minimum Support_ : An item should be bought at least 3 times a day, which is at least 3 × 7 = 21 times per week. Support is then (# of transactions containing the item) / (# of total transactions); with the total taken to be around 5000 per week, this gives 21 / 5000 = 0.0042.


* _Minimum Confidence & Lift_ : apyori filters rules by all three thresholds: *support*, *confidence*, and _lift_. It does not return them sorted, so any ordering (for example by lift) has to be applied afterwards. 0.2 and 3 were chosen as significant values for the minimum confidence and lift respectively.


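* _Minimum Length & Maximum Length_ : No relationship should be drawn from a single item, and for now we only look at relationships between exactly two items. Note that `max_length` is the keyword doing the work here; `min_length` does not appear among apyori's documented keyword arguments and may have no effect.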
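The support threshold reasoning above is simple arithmetic, sketched here with a hypothetical helper (`min_support_for` is not part of apyori) so the assumptions are easy to change:

```python
def min_support_for(purchases_per_day, days, total_transactions):
    """Smallest support for an item bought `purchases_per_day` times a day."""
    return (purchases_per_day * days) / total_transactions

# Values taken from the reasoning above: 3 purchases a day over a 7-day week,
# with roughly 5000 transactions in that week (the notebook's own estimate).
min_support = min_support_for(purchases_per_day=3, days=7, total_transactions=5000)
print(min_support)
```

If the weekly transaction count is estimated differently, only the last argument changes, and the threshold shrinks as the total grows.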


<hr>

__Visualizing the Results:__

In [3]:
results = list(rules)

# Printing the first 5 rules, in the order apyori returned them (for demonstration simplicity):
for result in results[:5]:
    print(result)

RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])
RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)])
RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)])
RelationRecord(items=frozenset({'herb & pepper', 'ground beef'}), support=0.015997866951073192, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}), items_add=frozenset({'ground beef'}), confide

<hr>

__Visualizing the Results (Clean Format):__

In [4]:
def inspect(results):
    """Flatten apyori records into (lhs, rhs, support, confidence, lift) rows.

    Only the first ordered statistic and the first item on each side are kept,
    which is enough for two-item rules.
    """
    rows = []
    for result in results:
        stat = result.ordered_statistics[0]
        rows.append((tuple(stat.items_base)[0], tuple(stat.items_add)[0],
                     result.support, stat.confidence, stat.lift))
    return rows

results = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
# Sort by lift so the strongest connections come first.
results.nlargest(n = 10, columns = 'Lift')

|   | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 6 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 5 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 4 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 3 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
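
As a rough sanity check on how to read these columns, the first row can be worked backwards. This assumes lift is the confidence divided by the overall support of the right-hand item, in line with the description of lift under Definitions:

$$\large \mathit{support}(\text{chicken}) = \frac{\mathit{confidence}(\text{light cream} \to \text{chicken})}{\mathit{lift}(\text{light cream} \to \text{chicken})} = \frac{0.290598}{4.843951} \approx 0.060 $$

In other words, about 6% of all baskets contain chicken, while about 29% of the baskets containing light cream do, which is where the lift of roughly 4.8 comes from.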
