
# Association Rules - Apriori

In [None]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules

### Theory of Apriori Algorithm
There are three major components of the Apriori algorithm:

  **Support**  
  **Confidence**  
  **Lift**

Suppose we have a record of 1,000 customer transactions, and we want to find the Support, Confidence, and Lift for two items, e.g. burgers and ketchup. Out of the 1,000 transactions, 100 contain ketchup while 150 contain a burger. Out of the 150 transactions where a burger is purchased, 50 contain ketchup as well. We will use these counts in every calculation in this section.

**Support**  
Support refers to the default popularity of an item. It is calculated as the number of transactions containing the item divided by the total number of transactions. Suppose we want to find the support for item B. This can be calculated as:

In [None]:
Support(B) = (Transactions containing (B))/(Total Transactions)

For instance, if 100 out of 1000 transactions contain Ketchup, then the support for Ketchup can be calculated as:

In [None]:
Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)
Support(Ketchup) = 100/1000
                 = 10%

**Confidence**  
Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought. Mathematically, it can be represented as:

In [None]:
Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)

Coming back to our problem, Burger and Ketchup were bought together in 50 transactions, while burgers were bought in 150 transactions. The likelihood of buying ketchup when a burger is bought is the confidence of Burger -> Ketchup, which can be written as:

In [None]:
Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)

Confidence(Burger→Ketchup) = 50/150
                           = 33.3%

**Lift**  
Lift(A -> B) refers to the increase in the ratio of sale of B when A is sold. Lift(A -> B) can be calculated by dividing Confidence(A -> B) by Support(B). Mathematically it can be represented as:

In [None]:
Lift(A→B) = (Confidence (A→B))/(Support (B))

Coming back to our Burger and Ketchup problem, the Lift(Burger -> Ketchup) can be calculated as:

In [None]:
Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))

Lift(Burger→Ketchup) = 33.3/10
                     = 3.33

**Lift** basically tells us that a customer who buys a burger is 3.33 times more likely to buy ketchup than a customer picked at random. A Lift of 1 means there is no association between products A and B. A Lift greater than 1 means products A and B are bought together more often than would be expected if they were independent. Finally, a Lift of less than 1 means the two products are bought together less often than expected.
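The three measures can be checked with a few lines of Python. This is a minimal sketch that only uses the counts from the burger and ketchup example; the variable names are chosen here for illustration.

```python
# Counts taken from the burger/ketchup example in the text
total_transactions = 1000
ketchup_transactions = 100
burger_transactions = 150
burger_and_ketchup = 50

# Support(Ketchup): share of all transactions that contain ketchup
support_ketchup = ketchup_transactions / total_transactions
# Confidence(Burger -> Ketchup): share of burger transactions that also contain ketchup
confidence_burger_ketchup = burger_and_ketchup / burger_transactions
# Lift(Burger -> Ketchup): confidence relative to the baseline support of ketchup
lift_burger_ketchup = confidence_burger_ketchup / support_ketchup

print(f"Support(Ketchup)            = {support_ketchup:.1%}")
print(f"Confidence(Burger->Ketchup) = {confidence_burger_ketchup:.1%}")
print(f"Lift(Burger->Ketchup)       = {lift_burger_ketchup:.2f}")
```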



### Importing the Dataset
Now let's import the dataset and see what we're working with. Download the dataset and place `Transactions.csv` in the same folder as this notebook (or change the path in the code below to match the location of the file on your computer) and execute the following script:

In [None]:
trans = pd.read_csv('Transactions.csv', sep=',', header = None) 

In [None]:
print(type(trans))

In [None]:
trans.head()

Each row of the dataset represents items that were purchased together on the same day at the same store. The dataset is sparse: a relatively high percentage of the cells are NA, NaN or equivalent.
These NaNs make the table hard to read. Let's find out how many unique items there actually are in the table.

In [None]:
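# Collect unique item names across ALL columns (not just the first) and drop NaN
items = pd.Series(trans.values.ravel()).dropna().unique()
items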

In [None]:
len(items)

### Data Preprocessing
To make use of the `apriori` function provided by the mlxtend library, we need to convert the dataset into the format it expects. `apriori` requires a DataFrame whose values are either 0 and 1 or True and False. Our data consists entirely of strings (item names), so we need to one-hot encode it, which we do here with a custom loop.

### One Hot Encoding

In [None]:
encoded_vals = []
all_items = set(items)
for _, row in trans.iterrows():  # Iterate over DataFrame rows as (index, Series) pairs
    # Items that appear in this transaction (NaN padding is never in all_items)
    present = all_items.intersection(row)
    # 1 if the item was bought in this transaction, 0 otherwise
    labels = {item: int(item in present) for item in items}
    encoded_vals.append(labels)
ohe_trans = pd.DataFrame(encoded_vals)

In [None]:
ohe_trans
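To see what the encoding does on a small scale, here is a minimal sketch on three made-up baskets. The names `toy_transactions`, `toy_items` and `toy_encoded` are hypothetical and are not part of the dataset; the idea is the same as in the loop above, with one 0/1 flag per item per transaction.

```python
# Three made-up transactions (illustration only, not from Transactions.csv)
toy_transactions = [
    ["burger", "ketchup"],
    ["burger"],
    ["milk", "ketchup"],
]

# Every distinct item becomes one column
toy_items = sorted({item for basket in toy_transactions for item in basket})

# One dict per transaction: 1 if the item is in the basket, 0 otherwise
toy_encoded = [
    {item: int(item in basket) for item in toy_items}
    for basket in toy_transactions
]

for row in toy_encoded:
    print(row)
```

Passing a list like `toy_encoded` to `pd.DataFrame` gives a table with one column per item, which is the shape `ohe_trans` has.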

# Applying Apriori - Working with Sparse Representations

The `apriori` function from the mlxtend library provides a fast and efficient Apriori implementation.

__Parameters__

__df__ : One-hot-encoded DataFrame, i.e. a DataFrame that has 0 and 1 or True and False as values.

__min_support__ : Floating point value between 0 and 1 that indicates the minimum support required for an itemset to be selected. Support is computed as (number of observations with the item) / (total number of observations).

__use_colnames__ : If True, the column names are preserved in the itemsets, which makes them more readable.

__max_len__ : Maximum length of the itemsets generated. If not set, all possible lengths are evaluated.

__verbose__ : Shows the number of iterations if >= 1 and low_memory is True. If = 1 and low_memory is False, shows the number of combinations.

__low_memory__ : If True, uses an iterator to search for combinations above min_support. low_memory=True should only be used for large datasets when memory resources are limited, because this implementation is approx. 3–6x slower than the default.

In [None]:
freq_items = apriori(ohe_trans, min_support=0.02, use_colnames=True, verbose=1)
freq_items
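To make the role of `min_support` concrete, here is a minimal pure-Python sketch of the level-wise Apriori idea: keep only itemsets that reach the minimum support, and build larger candidates only from smaller itemsets that are already frequent. It illustrates the technique on made-up baskets and is not how mlxtend implements it; `frequent_itemsets` and `toy_baskets` are hypothetical names.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s: support(s) for s in singles}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        result.update(current)
        k += 1
    return result

# Made-up baskets for illustration
toy_baskets = [
    ["burger", "ketchup", "fries"],
    ["burger", "ketchup"],
    ["burger", "fries"],
    ["milk"],
]
toy_frequent = frequent_itemsets(toy_baskets, min_support=0.5)
for itemset, sup in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a minimum support of 0.5, `milk` is dropped at the first level, and the pair `{ketchup, fries}` is dropped at the second, so no three-item candidate survives the prune step.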

### Applying Association Rules

The next step is to derive association rules from the frequent itemsets. To do so, we use the `association_rules` function that we imported from `mlxtend.frequent_patterns`.

`association_rules` needs a few arguments. The first is the DataFrame of frequent itemsets returned by `apriori`. The `metric` parameter selects the measure used to filter the rules, and `min_threshold` is the minimum value of that metric a rule must reach to be kept. The minimum support was already applied when the frequent itemsets were mined, through the `min_support` argument of `apriori`. The `min_confidence`, `min_lift` and `min_length` parameters are the ones used with the apyori package later in this notebook.

A support threshold can be reasoned out from purchase frequency. Let's suppose that we want rules for only those items that are purchased at least 5 times a day, or 7 x 5 = 35 times in one week, since our dataset is for a one-week time period. The support for those items can be calculated as 35/7500 ≈ 0.0047. In this section we use a minimum support of 0.02 for the itemsets and a minimum confidence of 30% (0.3) for the rules. These values are mostly just arbitrarily chosen, so you can play with them and see what difference it makes in the rules you get back.

Execute the following script:

In [None]:
# Metric can be set to confidence, lift, support, leverage and conviction.

rules = association_rules(freq_items, metric="confidence", min_threshold=0.3)
rules.head()
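The filtering step can be illustrated without any library. The sketch below takes a dictionary of itemset supports, splits each itemset into an antecedent and a consequent, and keeps the rules whose confidence reaches a threshold. It is an illustration of the idea under simplified assumptions, not mlxtend's implementation; `rules_from_supports` and `toy_supports` are hypothetical names, and the support values come from the burger and ketchup example.

```python
from itertools import combinations

def rules_from_supports(supports, min_confidence):
    """Return a list of (antecedent, consequent, confidence, lift) tuples."""
    rules_found = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                confidence = sup / supports[antecedent]
                lift = confidence / supports[consequent]
                if confidence >= min_confidence:
                    rules_found.append((antecedent, consequent, confidence, lift))
    return rules_found

# Supports from the worked example: 150, 100 and 50 out of 1000 transactions
toy_supports = {
    frozenset(["burger"]): 0.15,
    frozenset(["ketchup"]): 0.10,
    frozenset(["burger", "ketchup"]): 0.05,
}

toy_rules = rules_from_supports(toy_supports, min_confidence=0.3)
for antecedent, consequent, confidence, lift in toy_rules:
    print(sorted(antecedent), "->", sorted(consequent),
          f"confidence={confidence:.3f}", f"lift={lift:.2f}")
```

Both directions of the rule share the same lift, because lift is symmetric, while their confidences differ.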

In [None]:
# association_rules = apriori(freq_items, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
# association_results 

`association_rules` returns a pandas DataFrame, so the rules can be viewed directly; no conversion to a list is needed at this point.

### Viewing the Results
Let's first find the total number of rules returned by `association_rules`. Execute the following script:

In [None]:
print(len(rules))

### Visualizing results

__1. Support vs Confidence__

In [None]:
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()

__2. Support vs Lift__

In [None]:
plt.scatter(rules['support'], rules['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()

__3. Lift vs Confidence__

In [None]:
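fit = np.polyfit(rules['lift'], rules['confidence'], 1)  # straight-line fit
fit_fn = np.poly1d(fit)
plt.plot(rules['lift'], rules['confidence'], 'yo', rules['lift'], fit_fn(rules['lift']))
plt.xlabel('lift')
plt.ylabel('confidence')
plt.title('Lift vs Confidence')
plt.show()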

# Applying Apriori - as a list

The apyori library we are going to use next requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list. Currently, we have the data in the form of a pandas DataFrame. To convert it into a list of lists, execute the following script:


In [None]:
# Build the list of transactions (one inner list per row) using a for loop
transactions = []
for row in trans.values:
    # Skip the NaN padding and convert each remaining item to a string
    transactions.append([str(item) for item in row if pd.notna(item)])

In [None]:
#!pip install apyori
# Note: this import replaces the mlxtend apriori imported at the top of the notebook
from apyori import apriori

In [None]:
print(type(transactions))

In [None]:
transactions

In [None]:
len(transactions)

In [None]:
transactions[0]

In [None]:
transactions[-1]

### Implementing Apriori
We can now specify the parameters of apyori's `apriori` function:  
__The List__  
__min_support__  
__min_confidence__  
__min_lift__  
__min_length (the minimum number of items that you want in your rules, typically 2)__ 

Let’s suppose that we want only items that are purchased at least 40 times in a month. The support for those items can be calculated as 40/7500 = 0.0053. The minimum confidence for the rules is 20% or 0.2. Similarly, we specify the value for lift as 3 and finally, min_length is 2 since we want at least two products in our rules. These values are mostly just arbitrarily chosen and they need to be fine-tuned empirically.
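The threshold arithmetic above is simply a purchase count divided by the number of transactions. A minimal sketch, assuming the rounded figure of 7500 transactions used in the text (`min_support_for` is a hypothetical helper name):

```python
def min_support_for(purchases, total_transactions):
    """Support of an item that appears in `purchases` transactions."""
    return purchases / total_transactions

# At least 40 purchases out of roughly 7500 transactions
threshold = min_support_for(40, 7500)
print(round(threshold, 4))
```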

In [None]:
# Named rule_generator so it does not shadow mlxtend's association_rules function
rule_generator = apriori(transactions, min_support=0.0053, min_confidence=0.2, min_lift=3, min_length=2)
results = list(rule_generator)

In [None]:
results[0]

In [None]:
for result in results[:10]:
    # First rule derived from this itemset; it records the rule direction
    stat = result.ordered_statistics[0]
    print('Rule: ' + ', '.join(stat.items_base) + '  --> ' + ', '.join(stat.items_add))
    print('Support {:.4f}'.format(result.support))
    print('Confidence {:.4f}'.format(stat.confidence))
    print('Lift {:.4f}'.format(stat.lift))
    print('============\n')
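Each result is a nested record, so it can help to flatten the records into one row per rule. The sketch below shows one way to do that. To stay self-contained it defines stand-in namedtuples that mirror the field names read in the loop above (`items`, `support`, `ordered_statistics`, and `items_base`, `items_add`, `confidence`, `lift`); the records in `toy_results` are made up, and `flatten_rules` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins mirroring the shape of the result records (assumption for
# illustration; with real output, pass `results` to flatten_rules instead).
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)

toy_results = [
    RelationRecord(
        items=frozenset(["burger", "ketchup"]),
        support=0.05,
        ordered_statistics=[
            OrderedStatistic(frozenset(["burger"]), frozenset(["ketchup"]), 1 / 3, 10 / 3),
            OrderedStatistic(frozenset(["ketchup"]), frozenset(["burger"]), 0.5, 10 / 3),
        ],
    )
]

def flatten_rules(records):
    """Return one dict per rule, covering every ordered statistic of every record."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows

toy_rows = flatten_rules(toy_results)
for row in toy_rows:
    print(row)
```

A list of rows like this can be passed to `pd.DataFrame` to sort or filter the rules by any of the measures.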

In [None]:
# Display the full list of mined rules
results