# PROJECT: Apriori Algorithm using mlxtend and efficient-apriori 
#### date of creation: 26-1-2019
#### Purpose: 
1. How to install the libraries mlxtend, efficient-apriori and apyori
2. Simple examples of the Apriori algorithm using mlxtend and efficient-apriori

# Association Rule?
A rule with a high lift value describes items that occur together more frequently than would be expected given the number of transactions and product combinations. In the small example below, however, all the retained rules have the same lift.
We can also look for rules where the confidence is high. This part of the analysis is where domain knowledge comes in handy. Since I do not have that, I’ll just look for a couple of illustrative examples.

##### Association rules
are normally written like this: {Diapers} -> {Beer} which means that there is a strong relationship between customers that purchased diapers and also purchased beer in the same transaction.

In the above example, {Diapers} is the antecedent and {Beer} is the consequent. Both antecedents and consequents can have multiple items. In other words, {Diapers, Gum} -> {Beer, Chips} is a valid rule.

##### Support 
is the relative frequency with which the items of a rule appear together in the transactions. In many instances, you may want to look for high support to make sure the relationship is useful. However, there may be instances where a low support is useful if you are trying to find “hidden” relationships.

##### Confidence 
is a measure of the reliability of the rule. A confidence of 0.5 for the rule {Diapers, Gum} -> {Beer, Chips} would mean that in 50% of the cases where Diapers and Gum were purchased, the purchase also included Beer and Chips. For product recommendation, a 50% confidence may be perfectly acceptable, but in a medical situation this level may not be high enough.

##### Lift 
is the ratio of the observed support of a rule to the support expected if its antecedent and consequent were independent (see Wikipedia). As a rule of thumb, a lift value close to 1 suggests that the two sides of the rule are independent. Lift values > 1 are generally more “interesting” and could indicate a useful rule pattern, while values below 1 mean the items occur together less often than expected.
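
As a rough worked example of the three measures, the sketch below computes them in plain Python for the rule {Diapers} -> {Beer}. The five transactions and the helper name `support` are made up for illustration and are not part of the original notebook.

```python
# Hypothetical toy data: each transaction is the set of items bought together.
transactions = [
    {"Diapers", "Beer", "Chips"},
    {"Diapers", "Beer"},
    {"Diapers", "Gum"},
    {"Beer", "Chips"},
    {"Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent, consequent = {"Diapers"}, {"Beer"}

# Support of the rule: how often both sides appear together.
rule_support = support(antecedent | consequent, transactions)
# Confidence: of the transactions with the antecedent, how many also have the consequent.
confidence = rule_support / support(antecedent, transactions)
# Lift: confidence relative to how common the consequent is overall.
lift = confidence / support(consequent, transactions)

print(f"support={rule_support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

In this toy data the lift comes out slightly above 1, so beer appears with diapers a little more often than independence would predict.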

References for more details:
1. https://en.wikipedia.org/wiki/Association_rule_learning
2. https://en.wikipedia.org/wiki/Apriori_algorithm


# Installation of Libraries: mlxtend, efficient-apriori, apyori

!pip install --upgrade pip  # upgrade pip before installing any library

### Lib 1: mlxtend 

!pip install mlxtend  # to install mlxtend lib execute this command

In [25]:
import mlxtend
print('mlxtend version',mlxtend.__version__)

mlxtend version 0.14.0


### Lib 2: efficient-apriori (from PyPI)
link: https://pypi.org/project/efficient-apriori/

!pip install efficient-apriori  #install efficient-apriori

In [26]:
import efficient_apriori
print('efficient_apriori version',efficient_apriori.__version__)

efficient_apriori version 0.4.5


### Lib 3: apyori

Reference: https://pypi.org/project/apyori/

!pip install apyori     #installing apyori

In [27]:
import apyori
print('apyori version',apyori.__version__)

apyori version 1.1.1


# Output of Lib 1: mlxtend

# Creating input for our algorithm:

In [28]:
import numpy as np   # needed for np.nan below
import pandas as pd  # needed for pd.DataFrame below

pdTransactionEx = pd.DataFrame({'Transaction_ID': \
                                [1,2, 2,3,4,4,5,5,'C6',7,7,8,9,9,'C10', 'C10', np.nan,11],\
                                'Purchased': \
                                [
                                    'A ', 
                                    'A', 'B ', 
                                    'C', 
                                    'D', 'A', 
                                    ' A', 'C',
                                    ' A ', 
                                    'A', 'B', 
                                    'E', 
                                    'M', ' A  ', 
                                    'E', 'F',
                                    'C',
                                    np.nan
                                ],
                                'Quantity': [11,2, 42,32,41,45,50,50,6,79,70,18,29,39,45, 33, 88,55]
                               
                               })
pdTransactionEx

Unnamed: 0,Transaction_ID,Purchased,Quantity
0,1,A,11
1,2,A,2
2,2,B,42
3,3,C,32
4,4,D,41
5,4,A,45
6,5,A,50
7,5,C,50
8,C6,A,6
9,7,A,79


### Cleaning dataframe

In [29]:
pdTransactionEx['Purchased'] = pdTransactionEx['Purchased'].str.strip()
# Strip surrounding whitespace so that 'A', ' A' and 'A ' all count as the same item
pdTransactionEx.dropna(axis=0, subset=list(pdTransactionEx.columns), inplace=True)
# Drop every row that has NaN in any of the listed columns; axis=0 means that rows are dropped
pdTransactionEx

Unnamed: 0,Transaction_ID,Purchased,Quantity
0,1,A,11
1,2,A,2
2,2,B,42
3,3,C,32
4,4,D,41
5,4,A,45
6,5,A,50
7,5,C,50
8,C6,A,6
9,7,A,79


In [30]:
pdTransactionEx['Transaction_ID'] = pdTransactionEx['Transaction_ID'].astype('str')

In [31]:
pdTransactionEx.dtypes

Transaction_ID    object
Purchased         object
Quantity           int64
dtype: object

In [32]:
pdTransactionEx = pdTransactionEx[~pdTransactionEx['Transaction_ID'].str.contains('C')]
# Remove special transactions, e.g. those whose ID contains 'C'
pdTransactionEx

Unnamed: 0,Transaction_ID,Purchased,Quantity
0,1,A,11
1,2,A,2
2,2,B,42
3,3,C,32
4,4,D,41
5,4,A,45
6,5,A,50
7,5,C,50
9,7,A,79
10,7,B,70


In [33]:
pdTransactionEx = pdTransactionEx.reset_index(drop=True)  # some rows were removed, so renumber the index

In [34]:
print('Transaction IDs:', list(pdTransactionEx.Transaction_ID.unique()))
print('Products purchased:', list(pdTransactionEx.Purchased.unique()))

Transaction IDs: ['1', '2', '3', '4', '5', '7', '8', '9']
Products purchased: ['A', 'B', 'C', 'D', 'E', 'M']


## Now creating a table which will hold products purchased in each transaction. This will be the input to our algorithm.

In [35]:
# In each transaction how much quantity of each product is purchased
pdTransactionEx.groupby(['Transaction_ID', 'Purchased'])['Quantity']\
          .sum().unstack().reset_index().fillna(0)\
          .set_index('Transaction_ID')

Transaction_ID,A,B,C,D,E,M
1,11.0,0.0,0.0,0.0,0.0,0.0
2,2.0,42.0,0.0,0.0,0.0,0.0
3,0.0,0.0,32.0,0.0,0.0,0.0
4,45.0,0.0,0.0,41.0,0.0,0.0
5,50.0,0.0,50.0,0.0,0.0,0.0
7,79.0,70.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,18.0,0.0
9,39.0,0.0,0.0,0.0,0.0,29.0


In [36]:
# Alternatively, use pivot_table; aggfunc='sum' is passed explicitly because the default aggregation is the mean
pdTransactionEx.pivot_table(values='Quantity', index=['Transaction_ID'], columns='Purchased', aggfunc='sum').fillna(0)

Transaction_ID,A,B,C,D,E,M
1,11.0,0.0,0.0,0.0,0.0,0.0
2,2.0,42.0,0.0,0.0,0.0,0.0
3,0.0,0.0,32.0,0.0,0.0,0.0
4,45.0,0.0,0.0,41.0,0.0,0.0
5,50.0,0.0,50.0,0.0,0.0,0.0
7,79.0,70.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,18.0,0.0
9,39.0,0.0,0.0,0.0,0.0,29.0


In [37]:
basket = pdTransactionEx.groupby(['Transaction_ID', 'Purchased'])['Quantity']\
          .sum().unstack().reset_index().fillna(0)\
          .set_index('Transaction_ID')

### applying one hot encoding

In [38]:
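def encode_units(x):
    # Any positive quantity means the item was purchased; zero or negative means it was not.
    # (The original two-branch version returned None for values between 0 and 1.)
    return 1 if x > 0 else 0

basket.applymap(encode_units)  # the output looks like this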

Transaction_ID,A,B,C,D,E,M
1,1,0,0,0,0,0
2,1,1,0,0,0,0
3,0,0,1,0,0,0
4,1,0,0,1,0,0
5,1,0,1,0,0,0
7,1,1,0,0,0,0
8,0,0,0,0,1,0
9,1,0,0,0,0,1


In [39]:
# The table is mostly zeros; any positive value is converted to 1
# and anything less than or equal to 0 is set to 0.
basket_sets = basket.applymap(encode_units)
basket_sets.drop('M', inplace=True, axis=1)  # drop the column for product M if it should be excluded from the analysis
basket_sets

Transaction_ID,A,B,C,D,E
1,1,0,0,0,0
2,1,1,0,0,0
3,0,0,1,0,0
4,1,0,0,1,0
5,1,0,1,0,0
7,1,1,0,0,0
8,0,0,0,0,1
9,1,0,0,0,0
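
The one-hot table can also be turned into the list-of-tuples format that efficient_apriori takes as input later in this notebook. A minimal sketch in plain Python, with the rows above copied by hand; the names `one_hot_rows` and `to_transactions` are illustrative only and do not come from either library.

```python
# Rows of the one-hot table above, copied by hand:
# transaction ID -> list of 0/1 flags in column order.
columns = ["A", "B", "C", "D", "E"]
one_hot_rows = {
    "1": [1, 0, 0, 0, 0],
    "2": [1, 1, 0, 0, 0],
    "3": [0, 0, 1, 0, 0],
    "4": [1, 0, 0, 1, 0],
    "5": [1, 0, 1, 0, 0],
    "7": [1, 1, 0, 0, 0],
    "8": [0, 0, 0, 0, 1],
    "9": [1, 0, 0, 0, 0],
}


def to_transactions(rows, columns):
    """Turn each 0/1 row into a tuple of the item names that are present."""
    return [
        tuple(item for item, flag in zip(columns, flags) if flag)
        for flags in rows.values()
    ]


transactions = to_transactions(one_hot_rows, columns)
print(transactions)
```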


In [44]:
basket_sets['A'].sum()

6

In [64]:
# IMPORTING mlxtend LIBRARY:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import numpy as np

In [65]:
frequent_itemsets  = apriori(basket_sets, min_support=0.07, use_colnames=True)
# use_colnames=True reports each itemset by its column names instead of integer column indices
frequent_itemsets 

Unnamed: 0,support,itemsets
0,0.75,(A)
1,0.25,(B)
2,0.25,(C)
3,0.125,(D)
4,0.125,(E)
5,0.25,"(A, B)"
6,0.125,"(A, C)"
7,0.125,"(A, D)"


In [66]:
#Now that the data is structured properly, we can generate frequent item sets that have a support of at least 7% 
#(this number was chosen so that I could get enough useful examples):
frequent_itemsets  = apriori(basket_sets, min_support=0.07, use_colnames=True)
frequent_itemsets 

Unnamed: 0,support,itemsets
0,0.75,(A)
1,0.25,(B)
2,0.25,(C)
3,0.125,(D)
4,0.125,(E)
5,0.25,"(A, B)"
6,0.125,"(A, C)"
7,0.125,"(A, D)"
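
The supports in this table can be cross-checked without any library. The sketch below is not the Apriori algorithm itself, which prunes candidates level by level; it simply tries every combination of items, which is only feasible for a handful of products. The transactions are read off the one-hot table by hand.

```python
from itertools import combinations

# Transactions read off the one-hot basket table (product M already dropped).
transactions = [
    {"A"}, {"A", "B"}, {"C"}, {"A", "D"},
    {"A", "C"}, {"A", "B"}, {"E"}, {"A"},
]
min_support = 0.07
items = sorted(set().union(*transactions))

frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        # Count the transactions that contain every item of the candidate.
        count = sum(set(candidate) <= t for t in transactions)
        support = count / len(transactions)
        if support >= min_support:
            frequent[candidate] = support

for itemset, support in frequent.items():
    print(itemset, support)
```

With eight transactions, a threshold of 0.07 keeps every itemset that occurs at least once, which is why the same eight itemsets appear here and in the table.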


In [84]:
apriori(basket_sets, min_support=0.0, use_colnames=True) # output for minimum support 0: itemsets that never occur are listed too

Unnamed: 0,support,itemsets
0,0.75,(A)
1,0.25,(B)
2,0.25,(C)
3,0.125,(D)
4,0.125,(E)
5,0.25,"(A, B)"
6,0.125,"(A, C)"
7,0.125,"(A, D)"
8,0.0,"(E, A)"
9,0.0,"(C, B)"


#### The final step is to generate the rules with their corresponding support, confidence and lift:

In [80]:
# Keep only the rules whose lift is at least 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(A),(B),0.75,0.25,0.25,0.333333,1.333333,0.0625,1.125
1,(B),(A),0.25,0.75,0.25,1.0,1.333333,0.0625,inf
2,(A),(D),0.75,0.125,0.125,0.166667,1.333333,0.03125,1.05
3,(D),(A),0.125,0.75,0.125,1.0,1.333333,0.03125,inf


In [63]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(A),(B),0.75,0.25,0.25,0.333333,1.333333,0.0625,1.125
1,(B),(A),0.25,0.75,0.25,1.0,1.333333,0.0625,inf
2,(A),(D),0.75,0.125,0.125,0.166667,1.333333,0.03125,1.05
3,(D),(A),0.125,0.75,0.125,1.0,1.333333,0.03125,inf


In [81]:
association_rules(frequent_itemsets, metric="lift", min_threshold=0) # a threshold of 0 also keeps the rules with lift below 1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(A),(B),0.75,0.25,0.25,0.333333,1.333333,0.0625,1.125
1,(B),(A),0.25,0.75,0.25,1.0,1.333333,0.0625,inf
2,(A),(C),0.75,0.25,0.125,0.166667,0.666667,-0.0625,0.9
3,(C),(A),0.25,0.75,0.125,0.5,0.666667,-0.0625,0.5
4,(A),(D),0.75,0.125,0.125,0.166667,1.333333,0.03125,1.05
5,(D),(A),0.125,0.75,0.125,1.0,1.333333,0.03125,inf


# Output of Lib 2: efficient_apriori

In [47]:
from efficient_apriori import apriori  # note: this replaces the apriori name imported from mlxtend above

In [48]:
#Input for efficient_apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
#Notice that in every transaction with eggs present, bacon is present too. 
#Therefore, the rule {eggs} -> {bacon} is returned with 100 % confidence.
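
The claim about {eggs} -> {bacon} can be checked by brute force before running the library. The sketch below lists every single-item rule whose confidence is exactly 100 % on these three transactions; the helper `count` is illustrative and not part of efficient_apriori.

```python
from itertools import permutations

transactions = [("eggs", "bacon", "soup"),
                ("eggs", "bacon", "apple"),
                ("soup", "bacon", "banana")]


def count(items):
    """Number of transactions containing all of `items`."""
    return sum(set(items) <= set(t) for t in transactions)


all_items = sorted({item for t in transactions for item in t})

# Every ordered pair (lhs, rhs) of distinct items is a candidate rule {lhs} -> {rhs}.
# Confidence is 1 exactly when every transaction with lhs also contains rhs.
certain_rules = [
    (lhs, rhs)
    for lhs, rhs in permutations(all_items, 2)
    if count([lhs, rhs]) == count([lhs])
]
print(certain_rules)
```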

In [49]:
itemsets, rules = apriori(transactions, min_support=0.2,  min_confidence=1)

In [50]:
pd.DataFrame(itemsets)  # inspect the itemsets returned by apriori

Unnamed: 0,Unnamed: 1,Unnamed: 2,1,2,3
apple,,,,,
apple,bacon,,,,
apple,bacon,eggs,,,1.0
apple,eggs,,,,
bacon,,,,,
bacon,banana,,,,
bacon,banana,soup,,,1.0
bacon,eggs,,,,
bacon,eggs,soup,,,1.0
bacon,soup,,,,


In [51]:
pd.DataFrame(rules)

Unnamed: 0,0
0,"{apple} -> {bacon} (conf: 1.000, supp: 0.333, ..."
1,"{apple} -> {eggs} (conf: 1.000, supp: 0.333, l..."
2,"{banana} -> {bacon} (conf: 1.000, supp: 0.333,..."
3,"{eggs} -> {bacon} (conf: 1.000, supp: 0.667, l..."
4,"{soup} -> {bacon} (conf: 1.000, supp: 0.667, l..."
5,"{banana} -> {soup} (conf: 1.000, supp: 0.333, ..."
6,"{apple, eggs} -> {bacon} (conf: 1.000, supp: 0..."
7,"{apple, bacon} -> {eggs} (conf: 1.000, supp: 0..."
8,"{apple} -> {bacon, eggs} (conf: 1.000, supp: 0..."
9,"{banana, soup} -> {bacon} (conf: 1.000, supp: ..."


In [52]:
# Print out every rule with 2 items on the left hand side,
# 1 item on the right hand side, sorted by lift
rules_rhs = filter(lambda rule: len(rule.lhs) == 2 and len(rule.rhs) == 1, rules)
for rule in sorted(rules_rhs, key=lambda rule: rule.lift):
    print(rule)  # prints the rule with its confidence, support, lift and conviction

{apple, eggs} -> {bacon} (conf: 1.000, supp: 0.333, lift: 1.000, conv: 0.000)
{banana, soup} -> {bacon} (conf: 1.000, supp: 0.333, lift: 1.000, conv: 0.000)
{eggs, soup} -> {bacon} (conf: 1.000, supp: 0.333, lift: 1.000, conv: 0.000)
{apple, bacon} -> {eggs} (conf: 1.000, supp: 0.333, lift: 1.500, conv: 333333333.333)
{bacon, banana} -> {soup} (conf: 1.000, supp: 0.333, lift: 1.500, conv: 333333333.333)


###### NOTE: 
- The PyPI page of this package mentions that if your data is too large to fit into memory, you may pass a function returning a generator instead of a list. In that case min_support will most likely have to be a large value, or the algorithm will take very long to terminate.
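
The note above can be sketched as follows. This is only an assumption about how such a function might look: the name `make_transaction_source`, the comma-separated file layout and the temporary file are all made up for the demo, and the call to `apriori` is left as a comment so that the block runs without the package installed.

```python
import os
import tempfile


def make_transaction_source(path):
    """Return a zero-argument function that yields one transaction per line of `path`."""
    def generate():
        with open(path) as handle:
            for line in handle:
                items = tuple(item.strip() for item in line.split(",") if item.strip())
                if items:  # skip blank lines
                    yield items
    return generate


# Demo with a small temporary file standing in for a large dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("eggs,bacon,soup\neggs,bacon,apple\nsoup,bacon,banana\n")

source = make_transaction_source(tmp.name)
first_pass = list(source())
second_pass = list(source())  # each call creates a fresh generator, so the file can be re-read
os.remove(tmp.name)
print(first_pass)

# With efficient-apriori installed, `source` would be passed in place of the list:
# itemsets, rules = apriori(source, min_support=0.2, min_confidence=1)
```

A function is passed rather than a single generator because the data has to be read more than once, and a generator is exhausted after one pass.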