# Association Rule Mining

Techniques for identifying underlying relationships between items.

It is used for __Market Basket Analysis (MBA)__.

E.g. a shampoo + conditioner combo product: shops can combine related items to increase sales. You won't find shampoo bundled with, say, port :)

- Given a database of transactions, each transaction is considered as a list of items. 
- Identify frequent patterns
- Find all the rules that correlate the presence of one set of items with that of another set of items
- Most commonly used for market basket analysis
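
To make the idea of "frequent patterns" concrete, here is a minimal sketch that counts how often each pair of items appears together. The toy transactions and the helper name `count_pairs` are made up for illustration; they are not part of the dataset used later.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; each transaction is a list of items.
transactions = [
    ["shampoo", "conditioner", "soap"],
    ["shampoo", "conditioner"],
    ["soap", "toothpaste"],
    ["shampoo", "conditioner", "toothpaste"],
]


def count_pairs(transactions):
    """Count in how many transactions each unordered pair of items appears."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) drops duplicates and gives every pair one canonical order
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(transactions)
print(pair_counts.most_common(1))  # [(('conditioner', 'shampoo'), 3)]
```

Pairs with a high count are the candidates a shop would look at first; the metrics described below (support, confidence, lift) turn these raw counts into comparable numbers.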

The relationships will vary from shop to shop because the target audience is different.

When you identify the related items, you can:
1. Place the items close to each other
2. Package/market them together
3. Run targeted adverts
4. Offer a combined discount

Association rule mining is done here with the Apriori algorithm.
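
As a rough illustration of how Apriori searches for frequent itemsets, the sketch below grows itemsets one item at a time and discards any candidate that has an infrequent subset. It is a simplified teaching version under stated assumptions, not the implementation used by the `apyori` package; the function name `frequent_itemsets` and the toy data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    all_items = {item for basket in baskets for item in basket}
    current = {frozenset([item]) for item in all_items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Join step: union two frequent (k-1)-itemsets that differ by one item
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Hypothetical toy data
toy = [["pizza", "coke"], ["pizza", "coke", "chips"], ["pizza", "chips"], ["coke"]]
freq = frequent_itemsets(toy, min_support=0.5)
print(freq[frozenset({"pizza", "coke"})])  # 0.5
```

The prune step is what makes Apriori practical: an itemset can only be frequent if all of its subsets are frequent, so most large candidates never need to be counted.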

Apriori Algorithm: using the data and the Apriori algorithm, you will find the following:
- Support - shows how __popular__ an item is
- Confidence - gives the likelihood of one item being purchased when the other item is purchased
- Lift - specifies how much more likely one item is to be bought when the other is bought, compared with its usual popularity; a value <= 1 specifies that it is unlikely that the 2 items will be bought together

### Apriori algorithm:

__Only lists are accepted for apriori algorithm__

Objective: to find the 3 parameters of the Apriori algorithm (Support, Confidence and Lift) between 2 items, Pizza and Coke.

Scenario:

We have 1,000 customer transaction records. Of these 1,000 transactions,
100 contain Coke and 150 contain Pizza.
Of the 150 transactions that contain Pizza, 50 contain Coke as well.

Using this data, we are going to find the support, confidence and lift
values between Pizza and Coke.

__Support__ :

--> It is a parameter which tells us the popularity of an item.

    Support(Pizza) = No. of transactions containing Pizza /
                     Total no. of transactions
                   = 150 / 1000 = 0.15

    Support(Coke)  = 100 / 1000 = 0.10

__Confidence__ : 

--> It is a parameter that tells us about the likelihood of one product 
being purchased when the other product is already bought.

    Confidence(Pizza -> Coke) = No. of transactions containing both Pizza and Coke /
                                No. of transactions containing Pizza
                              = 50 / 150 = 33.3 %

The rule is written Pizza -> Coke because the denominator counts the Pizza
transactions: of the customers who bought Pizza, 33.3 % also bought Coke.

__Lift__ : 

--> It is a parameter which refers to the increase in the ratio of sales 
of Coke when Pizza has already been bought.

    Lift(Pizza -> Coke) = Confidence(Pizza -> Coke) / Support(Coke)
                        = 0.333 / 0.10
                        = 3.33

This value of 3.33 indicates that a customer who buys Pizza is about 3.3 times
as likely to buy Coke as a customer picked at random.

--> The higher the lift, the higher the chances of the 2 items being bought together.

--> If the lift value is 1, the two items are bought independently of each other;
    if it is < 1, the two items are unlikely to be bought together.
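
The arithmetic for the Pizza and Coke scenario can be checked with a few lines of Python. This is a small sketch using the standard definitions confidence(A -> B) = support(A and B) / support(A) and lift(A -> B) = confidence(A -> B) / support(B); the variable names are only illustrative.

```python
# Counts from the scenario
total = 1000   # all transactions
pizza = 150    # transactions containing Pizza
coke = 100     # transactions containing Coke
both = 50      # transactions containing both Pizza and Coke

support_pizza = pizza / total            # 0.15
support_coke = coke / total              # 0.10

# Of the Pizza buyers, what fraction also bought Coke?
confidence_pizza_to_coke = both / pizza  # 0.333...
# Of the Coke buyers, what fraction also bought Pizza?
confidence_coke_to_pizza = both / coke   # 0.5

lift_pizza_to_coke = confidence_pizza_to_coke / support_coke
lift_coke_to_pizza = confidence_coke_to_pizza / support_pizza

print(round(lift_pizza_to_coke, 2))  # 3.33
print(round(lift_coke_to_pizza, 2))  # 3.33
```

Confidence depends on the direction of the rule, while lift comes out the same in both directions because it compares the joint frequency with the product of the two individual supports.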


# 1. Import packages

In [1]:
!pip install apyori
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori



# 2. Import the data

In [2]:
store_data = pd.read_csv('store_data.csv', header = None)
store_data.head(n=10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
5,low fat yogurt,,,,,,,,,,,,,,,,,,,
6,whole wheat pasta,french fries,,,,,,,,,,,,,,,,,,
7,soup,light cream,shallot,,,,,,,,,,,,,,,,,
8,frozen vegetables,spaghetti,green tea,,,,,,,,,,,,,,,,,
9,french fries,,,,,,,,,,,,,,,,,,,


# 3. Iterate over the rows and columns in the data

In [3]:
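# apriori expects a list of transactions, so build one list of item names per row
# i -> refers to the row index
# j -> refers to the column index
values = store_data.values
n_rows, n_cols = store_data.shape  # 7501 rows x 20 columns for this dataset
records = []
for i in range(n_rows):
    # str() turns empty cells (NaN) into the string 'nan'
    records.append([str(values[i, j]) for j in range(n_cols)])

print(records[0])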

['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil']
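
Because `str()` turns every empty cell into the string `'nan'`, shorter transactions end up containing `'nan'` as if it were a product. One possible alternative, sketched below, skips the empty cells while keeping every row. The helper names `is_missing` and `rows_to_transactions` are hypothetical, and the sample rows stand in for the real data.

```python
import math


def is_missing(value):
    """True for the float NaN that marks an empty cell."""
    return isinstance(value, float) and math.isnan(value)


def rows_to_transactions(rows):
    """Turn rows of cells into lists of item names, skipping empty cells."""
    return [[str(v) for v in row if not is_missing(v)] for row in rows]


nan = float("nan")
sample_rows = [
    ["burgers", "meatballs", "eggs", nan],
    ["chutney", nan, nan, nan],
]
print(rows_to_transactions(sample_rows))
# [['burgers', 'meatballs', 'eggs'], ['chutney']]
```

With the DataFrame from this notebook, the corresponding call would presumably be `rows_to_transactions(store_data.values.tolist())`. The outputs shown in the rest of this notebook were produced with the `'nan'` strings kept, so switching to this version would change the rules that are reported.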


# 4. Initialize the apriori

In [4]:
# Initialize the apriori i.e. apply the Apriori algorithm to the dataset
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)

# records - the list of transactions from which the association rules are extracted
# min_support - lower bound on support: only itemsets whose support is at least this value are kept.
#               Choosing it is trial and error - try different values to see which works best
# min_confidence - minimum threshold for confidence
# min_lift - minimum threshold for lift
# min_length - intended to require at least 2 items per rule. apyori's documented length
#              option is max_length, so this argument may be silently ignored

# apriori returns a generator, so convert it to a list to keep the results
association_results = list(association_rules)

print(association_results)

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]), RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]), RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]), RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.015997866951073192, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}), items_add=frozenset({'ground beef'}), con

In [5]:
# index 0 shows just the first of the results
print(association_results[0])

RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])


A lift of 4.84 means that customers who buy light cream are about 5 times as likely to also buy chicken as customers in general.

Presenting the data as in the cell above will not be meaningful to the customer, so you need to make it look better.

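You cannot simply drop the rows that contain nan/null: most transactions have fewer than 20 items and so contain empty cells, and dropping them would affect your data.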
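
The `RelationRecord` printed above nests its fields, so reading them by name can be easier to follow than reading them by position. The sketch below rebuilds the first record with stand-in `namedtuple` types that mirror the field names visible in the printed output; with `apyori` installed, the objects in `association_results` would be used instead. The helper name `describe` is hypothetical.

```python
from collections import namedtuple

# Stand-in types mirroring the field names shown in the printed output
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)

# The first record from the output above, rebuilt by hand
record = RelationRecord(
    items=frozenset({"light cream", "chicken"}),
    support=0.004532728969470737,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"light cream"}),
            items_add=frozenset({"chicken"}),
            confidence=0.29059829059829057,
            lift=4.84395061728395,
        )
    ],
)


def describe(record):
    """Return one formatted line per rule stored in a record."""
    lines = []
    for stat in record.ordered_statistics:
        # items_base -> items_add gives the direction of the rule
        base = ", ".join(sorted(stat.items_base))
        add = ", ".join(sorted(stat.items_add))
        lines.append(
            f"{base} -> {add} | support={record.support:.4f} "
            f"confidence={stat.confidence:.3f} lift={stat.lift:.2f}"
        )
    return lines


for line in describe(record):
    print(line)
# light cream -> chicken | support=0.0045 confidence=0.291 lift=4.84
```

Reading `items_base` and `items_add` keeps the direction of each rule explicit and copes with rules that involve more than two items.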

# 6. Improve the presentation

In [7]:
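for item in association_results:
    # First field of the record: the frozenset of items in the rule
    pair = item[0]
    items = [x for x in pair]
    # A frozenset has no guaranteed order, so the arrow printed here may not match
    # the rule's real direction (items_base -> items_add), and only the first two
    # items are shown
    print("Rule: " + items[0] + " -> " + items[1])

    # Second field of the record: the support
    print("Support: " + str(item[1]))

    # Third field: the list of ordered statistics; take its first entry and read
    # the confidence (index 2) and the lift (index 3)
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")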

Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
Rule: escalope -> mushroom cream sauce
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
Rule: ground beef -> tomato sauce
Support: 0.005332622317024397
Confidence: 0.3773584905660377
Lift: 3.840659481324083
Rule: olive oil -> whole wheat pasta
Support: 0.007998933475536596
Confidence: 0.2714932126696833
Lift: 4.122410097642296
Rule: shrimp -> pasta
Support: 0.005065991201173177
Confidence: 0.3220338983050847
Lift: 4.506672147735896
Rule: light cream -> nan
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
Rule: chocolate -> shrimp
Support: 0.005332622317024397
Confidence: 0.2325