# Association Rules
## What are Association Rules?
Association Rules are a kind of rule-based machine learning used to discover relationships between items that frequently occur together in a dataset.

## Where did they come from?
Association Rules were designed to discover relationships in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. Data scientists were looking for items that were commonly bought together so they could use the data to offer customers better deals. For example, the rule {onions, potatoes} => {burger} found in a supermarket's sales data would indicate that a customer who buys onions and potatoes together is also likely to buy burgers.

Association Rules have since been used to find relationships in many other kinds of data. Examples include web analytics (how someone behaves on a website), intrusion detection in cybersecurity (a strange series of behaviours might indicate a hack) and bioinformatics in healthcare.

Association Rules are most commonly built using the Apriori and Eclat algorithms. We're going to be using Apriori for this notebook exercise.

## The Brief
Using supermarket data, find products that are commonly bought together, with a special interest in the sale of limes.

## Import the required libraries 

You may need to install some using `pip install [package name]`

In [319]:
import numpy as np # linear algebra
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Source & Store

Download the Instacart dataset and import it using `pd.read_csv`. You'll need to read `order_products__prior.csv` and `products.csv`. The cell below reads only the first 20,000 rows of `order_products__prior.csv`.

In [320]:
order_products_prior = pd.read_csv("instacart/order_products__prior.csv", nrows=20000)
order_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [321]:
products = pd.read_csv("instacart/products.csv")
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


## Join the data

The data is supplied pretty much exactly how Instacart stored it in their database. We need to join the files together in order to get it into a usable state.

Currently the `order_products__prior.csv` doesn't have any product names - it only has IDs - so we can map the product names to each product ID using the `products.csv` data.

You'll learn more about how this is done in Sprint 4 when we look at SQL. For now use the following block of code, and have a read through. Try and summarise what each line is doing and write it below. No worries if not all of it makes sense for now.

With the data merged, we can count the occurrences of each product name to see what the most purchased items are. 

In [322]:
order_baskets = pd.merge(order_products_prior,products, on=['product_id'], how='inner')
order_baskets['product_name'].value_counts().rename("freq")

Banana                                                               321
Bag of Organic Bananas                                               236
Organic Strawberries                                                 163
Organic Baby Spinach                                                 137
Organic Hass Avocado                                                 129
Organic Avocado                                                      108
Strawberries                                                          95
Large Lemon                                                           91
Organic Raspberries                                                   87
Limes                                                                 77
Organic Yellow Onion                                                  72
Organic Whole Milk                                                    70
Organic Garlic                                                        69
Organic Fuji Apple                                 
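
To see what the inner merge and the frequency count are doing, here is a minimal pure-Python sketch. The product lookup uses the first three rows of `products.csv` shown earlier; the order rows and the variable names (`order_rows`, `name_by_id`, `joined`, `freq`) are invented for illustration and are not part of the dataset or of pandas.

```python
from collections import Counter

# Product lookup: the first three rows of products.csv shown above.
name_by_id = {
    1: "Chocolate Sandwich Cookies",
    2: "All-Seasons Salt",
    3: "Robust Golden Unsweetened Oolong Tea",
}

# Hypothetical (order_id, product_id) rows, made up for this example.
# Product 999 has no entry in the lookup.
order_rows = [(100, 1), (100, 3), (101, 1), (102, 2), (102, 999)]

# Inner join: keep only rows whose product_id exists in the lookup
# and attach the matching product name.
joined = [
    (order_id, product_id, name_by_id[product_id])
    for order_id, product_id in order_rows
    if product_id in name_by_id
]

# Rough equivalent of value_counts(): how often each product name occurs.
freq = Counter(name for _, _, name in joined)
print(freq.most_common())
```

The row for product 999 disappears because an inner join keeps only rows that have a match on both sides, which is what `how='inner'` asks for in `pd.merge`.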

## Grouping orders

At the moment each item in a transaction sits on its own row in the dataframe. We need to group the products of each transaction into a single array, which also reduces the number of rows in the dataframe.
Once we have done that, take a look at the first few entries to see how it's changed.

In [323]:
order_baskets = order_baskets.groupby(['order_id']).product_name.apply(np.array).reset_index()
order_baskets.head()

Unnamed: 0,order_id,product_name
0,2,"[Organic Egg Whites, Michigan Organic Kale, Ga..."
1,3,[Total 2% with Strawberry Lowfat Greek Straine...
2,4,"[Plain Pre-Sliced Bagels, Honey/Lemon Cough Dr..."
3,5,"[Bag of Organic Bananas, Just Crisp, Parmesan,..."
4,6,"[Cleanse, Dryer Sheets Geranium Scent, Clean D..."
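
The grouping step can be pictured with a small pure-Python sketch. The rows below are made up for illustration (they are not taken from the Instacart files), and the dictionary stands in for the grouped dataframe.

```python
from collections import defaultdict

# Hypothetical (order_id, product_name) rows, one row per purchased item.
rows = [
    (1, "Banana"),
    (1, "Limes"),
    (2, "Large Lemon"),
    (1, "Organic Avocado"),
    (2, "Limes"),
]

# Collect every product name under its order_id, like groupby(...).apply(...).
baskets = defaultdict(list)
for order_id, product_name in rows:
    baskets[order_id].append(product_name)
baskets = dict(baskets)

print(baskets)
print(len(rows), "rows became", len(baskets), "baskets")
```

Five item rows collapse into two baskets, which is the same shrinking effect described above for the real dataframe.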


## Explore & Transform

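To work efficiently, Association Rules need the baskets in a one-hot encoded table: one row per transaction, one column per product, and a `True`/`False` value saying whether that product was in the basket. Because each basket holds only a handful of the thousands of available products, almost every cell is `False`. A table like this is called a sparse matrix.

Storing a sparse matrix in a sparse format, where only the cells that hold a value are kept, cuts out a lot of time and memory because the computer doesn't have to look at the great many empty cells. Note that `TransactionEncoder` returns an ordinary (dense) boolean array unless you pass `sparse=True` to `transform`, so the cell below builds a regular dataframe that happens to be mostly `False`.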
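
As a rough illustration of the encoding, here is a hand-rolled sketch on three made-up baskets. It is not how `TransactionEncoder` is implemented; the baskets and variable names are assumptions, and the alphabetical column order is chosen to resemble the sorted column headers in the output further down.

```python
# Three hypothetical baskets, invented for this example.
baskets = [
    ["Banana", "Limes"],
    ["Banana", "Large Lemon", "Limes"],
    ["Organic Avocado"],
]

# One column per distinct product, sorted alphabetically.
columns = sorted({item for basket in baskets for item in basket})

# Dense one-hot table: one row per basket, one boolean per product.
encoded = [[item in basket for item in columns] for basket in baskets]

# Sparse view of the same table: keep only the positions of the True cells.
sparse_rows = [[i for i, flag in enumerate(row) if flag] for row in encoded]

print(columns)
for row in encoded:
    print(row)
print(sparse_rows)
```

The dense table stores 12 cells while the sparse view stores only the 6 positions that are `True`; with thousands of products per row the saving becomes much larger.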

In [324]:
#transactions = order_baskets.iloc[:,1:].values
te = TransactionEncoder()
te_ary = te.fit(order_baskets['product_name']).transform(order_baskets['product_name'])
dataset = pd.DataFrame(te_ary, columns=te.columns_)
dataset

Unnamed: 0,& Go! Hazelnut Spread + Pretzel Sticks,0% Fat Blueberry Greek Yogurt,0% Fat Free Organic Milk,0% Fat Organic Greek Vanilla Yogurt,0% Greek Strained Yogurt,0% Greek Yogurt Black Cherry on the Bottom,0% Milkfat Greek Plain Yogurt,0% Milkfat Greek Yogurt Honey,1 % Lowfat Milk,1 Apple + 1 Mango Fruit Bar,...,Zucchini Noodles,from Concentrate Mango Nectar,gel hand wash sea minerals,in Gravy with Carrots Peas & Corn Mashed Potatoes & Meatloaf Nuggets,of Norwich Original English Mustard Powder Double Superfine,smartwater® Electrolyte Enhanced Water,vitaminwater® XXX Acai Blueberry Pomegranate,with Crispy Almonds Cereal,with Olive Oil Mayonnaise,with Olive Oil Mayonnaise Dressing
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


## Create Rules: Most common set of items bought

We can create a rule that will determine the most frequent itemsets; that is, which collections of items are bought most often.
The `support` of a given itemset is the proportion of all transactions in which that itemset was bought. A support of 0.01 for a given itemset means that the itemset was bought in 1% of all transactions. We also keep only itemsets of exactly 2 items, using a lambda function to compute the length of each itemset.
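
To make the definition of support concrete, here is a brute-force sketch on four made-up baskets. It simply counts every pair of items, which is not how Apriori prunes its search, and the baskets and the 0.5 threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Four hypothetical baskets, invented for this example.
baskets = [
    {"Banana", "Limes"},
    {"Banana", "Limes", "Large Lemon"},
    {"Banana", "Organic Avocado"},
    {"Limes", "Large Lemon"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = number of baskets containing the pair / total number of baskets.
n_baskets = len(baskets)
support = {pair: count / n_baskets for pair, count in pair_counts.items()}

# Keep only the pairs that reach the minimum support.
min_support = 0.5
frequent_pairs = {pair: s for pair, s in support.items() if s >= min_support}
print(frequent_pairs)
```

Here `("Banana", "Limes")` and `("Large Lemon", "Limes")` each appear in 2 of the 4 baskets, so both have a support of 0.5 and survive the filter.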

In [325]:
frequent_itemsets = apriori(dataset, min_support=0.01, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets[ (frequent_itemsets['length'] == 2)]
frequent_itemsets

Unnamed: 0,support,itemsets,length
108,0.01034,"[Bag of Organic Bananas, Organic Baby Spinach]",2
109,0.01871,"[Bag of Organic Bananas, Organic Hass Avocado]",2
110,0.013786,"[Bag of Organic Bananas, Organic Raspberries]",2
111,0.016248,"[Bag of Organic Bananas, Organic Strawberries]",2
112,0.010832,"[Banana, Cucumber Kirby]",2
113,0.011817,"[Banana, Honeycrisp Apple]",2
114,0.01034,"[Banana, Large Lemon]",2
115,0.016741,"[Banana, Organic Avocado]",2
116,0.016741,"[Banana, Organic Baby Spinach]",2
117,0.014279,"[Banana, Organic Fuji Apple]",2


## Create Rules: First set of Association Rules

Now we are going to create a set of rules to find the items that are often bought alongside a given itemset.
We know from the last couple of steps that the support is the frequency with which an itemset appears in the dataset.

Confidence is how often a rule is found to be true. More precisely, for a given rule from an antecedent itemset X to a consequent itemset Y, the confidence of that rule is equal to the support of X and Y together (the combined itemset X ∪ Y), divided by the support of X alone. For example, if {eggs, milk} => {bread} has a confidence of 0.6 and the itemset {eggs, milk} has a support of 0.0001, then eggs and milk appear together in 0.01% of all transactions, and 60% of the time that eggs and milk are bought, bread is bought as well.
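
The confidence calculation can be checked by hand on a toy example. The sketch below uses ten made-up baskets, chosen so that the {eggs, milk} => {bread} rule comes out at 0.6; the helper names `support` and `confidence` are illustrative and are not mlxtend functions.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Ten hypothetical baskets, invented for this example.
baskets = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk", "bread"},
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk"},
    {"bread"},
    {"milk"},
    {"eggs"},
    {"bread", "milk"},
    {"eggs", "bread"},
]

print(support({"eggs", "milk"}, baskets))                # 5 of 10 baskets
print(support({"eggs", "milk", "bread"}, baskets))       # 3 of 10 baskets
print(confidence({"eggs", "milk"}, {"bread"}, baskets))  # 3 of those 5
```

Eggs and milk appear together in 5 baskets, and 3 of those also contain bread, so the rule holds 60% of the time.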

> Create a set of association rules using the `apriori` algorithm.
> Then print out a summary of your rules and inspect the first rule

*Hint: some sensible default options to get you started*
- Start with a low support (the itemset appears infrequently) and a reasonably high confidence (the items are often bought together).
- Filter on `antecedent_len` to keep only the rules whose antecedent contains at least 2 items

In [326]:
frequent_itemsets = apriori(dataset, min_support=0.003, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules[ (rules['antecedent_len'] >= 2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
62,"(Organic Hass Avocado, Organic Raspberries)",(Bag of Organic Bananas),0.00837,0.116199,0.003939,0.470588,4.04985,0.002966,1.669402,2
63,"(Organic Hass Avocado, Organic Strawberries)",(Bag of Organic Bananas),0.010832,0.116199,0.003939,0.363636,3.12943,0.00268,1.38883,2
64,"(Boneless Skinless Chicken Breasts, Organic Ba...",(Banana),0.004431,0.15805,0.003447,0.777778,4.92108,0.002746,3.788774,2
65,"(Boneless Skinless Chicken Breasts, Banana)",(Organic Baby Spinach),0.00837,0.067454,0.003447,0.411765,6.104337,0.002882,1.585327,2
66,"(Organic Strawberries, Organic Avocado)",(Banana),0.00837,0.15805,0.003447,0.411765,2.605278,0.002124,1.431315,2
67,"(Organic Zucchini, Organic Baby Spinach)",(Banana),0.007386,0.15805,0.003447,0.466667,2.952648,0.002279,1.578656,2
68,"(Organic Zucchini, Banana)",(Organic Baby Spinach),0.007386,0.067454,0.003447,0.466667,6.918248,0.002948,1.748523,2
69,"(Organic Blackberries, Organic Strawberries)",(Banana),0.006893,0.15805,0.003447,0.5,3.163551,0.002357,1.6839,2
70,"(Organic Blackberries, Banana)",(Organic Strawberries),0.006893,0.080256,0.003447,0.5,6.230061,0.002893,1.839488,2


Note: when you inspect a rule you'll get something like this:
```
antecedents - (Organic Hass Avocado, Organic Raspberries)
consequents - (Bag of Organic Bananas)
antecedent support - 0.008370
consequent support - 0.116199
support - 0.003939
confidence - 0.470588
lift - 4.04985
leverage - 0.002966
conviction - 1.669402
antecedent_len - 2
```

Note: It might look complicated, but the important things are:
- *antecedents* stands for "when people buy these products"
- *consequents* stands for "they are likely to buy this, too"
- *support* is how often the whole rule (antecedents and consequents together) appears across all transactions
- *confidence* is how often the rule is found to be true
- *lift* is explained below
- *antecedent_len* is the number of items in the antecedent itemset

When you lower the support further, as in the next cell, you might notice that the confidence is 1 for some of the rules. That means the rule is correct 100% of the time, but how can that be?
If the support is really low, such as 0.0001, then the rule only needs to appear in 0.01% of the transactions to be accepted. In a dataset with 20,000 transactions that is just 2 transactions: if both of the 2 people who bought the antecedent also bought the consequent, the rule has a confidence of 1. This is why a support that is too low can give you useless rules.
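
The arithmetic behind that warning can be sketched in a few lines. The helper `min_transactions` is a made-up name for illustration, not an mlxtend function.

```python
import math


def min_transactions(min_support, n_transactions):
    """Smallest number of transactions an itemset needs to reach min_support."""
    # Subtract a tiny epsilon so floating-point noise cannot push the
    # product just above a whole number before rounding up.
    return math.ceil(min_support * n_transactions - 1e-9)


# The example from the text: a support of 0.0001 in 20,000 transactions.
needed = min_transactions(0.0001, 20_000)
print(needed)  # 2

# If the antecedent occurs in only those 2 transactions and the
# consequent is in both of them, the confidence is 2 / 2.
print(needed / needed)  # 1.0

# Row 86 of the results below shows the same effect: the rule's support
# and its antecedent support are both 0.001969, so their ratio is 1.
print(0.001969 / 0.001969)  # 1.0
```

A confidence of 1 built on two or three baskets says very little about how customers behave in general.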

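Note: If it is taking too long to run, increase the support. By default mlxtend's `apriori` evaluates itemsets of every length, which can make it slow. You can cap the itemset size with its `max_len` argument, but there is no minimum-length option, so shorter itemsets have to be filtered out afterwards, as we do here.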

In [327]:
frequent_itemsets = apriori(dataset, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules[ (rules['antecedent_len'] >= 2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
85,"(100% Raw Coconut Water, Banana)",(Organic Raspberries),0.002462,0.042836,0.001969,0.80,18.675862,0.001864,4.785820,2
86,"(Organic Raspberries, 100% Whole Wheat Bread)",(Bag of Organic Bananas),0.001969,0.116199,0.001969,1.00,8.605932,0.001741,inf,2
87,"(Apple Cinnamon GoGo Squeez, Organic Strawberr...",(Banana),0.001969,0.158050,0.001477,0.75,4.745327,0.001166,3.367799,2
88,"(Apple Cinnamon GoGo Squeez, Banana)",(Organic Strawberries),0.001477,0.080256,0.001477,1.00,12.460123,0.001359,inf,2
89,"(Apple Honeycrisp Organic, Organic Broccoli)",(Banana),0.001477,0.158050,0.001477,1.00,6.327103,0.001244,inf,2
90,"(Apple Honeycrisp Organic, Organic Reduced Fat...",(Banana),0.001969,0.158050,0.001477,0.75,4.745327,0.001166,3.367799,2
91,"(Apple Honeycrisp Organic, Organic Broccoli)",(Organic Baby Spinach),0.001477,0.067454,0.001477,1.00,14.824818,0.001377,inf,2
92,"(Organic Broccoli, Organic Baby Spinach)",(Apple Honeycrisp Organic),0.001477,0.026096,0.001477,1.00,38.320755,0.001439,inf,2
93,"(Asparagus, Organic Strawberries)",(Bag of Organic Bananas),0.001477,0.116199,0.001477,1.00,8.605932,0.001305,inf,2
94,"(Asparagus, Organic Yellow Onion)",(Organic Baby Spinach),0.001969,0.067454,0.001477,0.75,11.118613,0.001344,3.730182,2


## Lift

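The lift of a rule is the ratio of the observed support of all items in the rule to the support that would be expected if the antecedent and the consequent were independent. Equivalently, lift is the confidence of the rule divided by the support of the consequent: how often the consequent is bought when the antecedent is present, compared with how often it is bought on average.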

If the rule had a lift of 1, it would imply that the occurrence of the antecedent and that of the consequent are independent of each other.

If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.

If the lift is < 1, that suggests the items may be substitutes for one another. This means that the presence of one item has a negative effect on the presence of the other item, and vice versa.
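
As a quick check of the definition, the sketch below recomputes the lift of the first rule in the earlier results, (Organic Hass Avocado, Organic Raspberries) => (Bag of Organic Bananas), from the support values printed in that row. The function names are illustrative only, and because the printed inputs are rounded the result agrees with the table only to about three significant figures.

```python
def lift(support_xy, antecedent_support, consequent_support):
    """Observed joint support divided by the support expected under independence."""
    return support_xy / (antecedent_support * consequent_support)


def describe(lift_value):
    """Plain-language reading of a lift value."""
    if lift_value > 1:
        return "bought together more often than expected"
    if lift_value < 1:
        return "bought together less often than expected"
    return "no association"


# Values copied from the rule's row in the output above.
support_xy = 0.003939
antecedent_support = 0.008370
consequent_support = 0.116199

lift_value = lift(support_xy, antecedent_support, consequent_support)
confidence = support_xy / antecedent_support

print(round(lift_value, 2))                       # table reports 4.04985
print(round(confidence / consequent_support, 2))  # same ratio via confidence
print(describe(lift_value))
```

Both routes give a lift of roughly 4, meaning this basket combination turns up about four times more often than it would if the antecedent and consequent were bought independently.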

This time, let's filter by lift to take a look at some of the rules with a higher lift.

In [330]:
frequent_itemsets = apriori(dataset, min_support=0.003, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules[ (rules['antecedent_len'] >= 2) &
       (rules['lift'] > 2) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
20,"(Organic Hass Avocado, Organic Raspberries)",(Bag of Organic Bananas),0.00837,0.116199,0.003939,0.470588,4.04985,0.002966,1.669402,2
21,"(Boneless Skinless Chicken Breasts, Organic Ba...",(Banana),0.004431,0.15805,0.003447,0.777778,4.92108,0.002746,3.788774,2
22,"(Boneless Skinless Chicken Breasts, Banana)",(Organic Baby Spinach),0.00837,0.067454,0.003447,0.411765,6.104337,0.002882,1.585327,2
23,"(Organic Strawberries, Organic Avocado)",(Banana),0.00837,0.15805,0.003447,0.411765,2.605278,0.002124,1.431315,2
24,"(Organic Zucchini, Organic Baby Spinach)",(Banana),0.007386,0.15805,0.003447,0.466667,2.952648,0.002279,1.578656,2
25,"(Organic Zucchini, Banana)",(Organic Baby Spinach),0.007386,0.067454,0.003447,0.466667,6.918248,0.002948,1.748523,2
26,"(Organic Blackberries, Organic Strawberries)",(Banana),0.006893,0.15805,0.003447,0.5,3.163551,0.002357,1.6839,2
27,"(Organic Blackberries, Banana)",(Organic Strawberries),0.006893,0.080256,0.003447,0.5,6.230061,0.002893,1.839488,2
