## Association Rule Mining 
Association rule mining is a technique for identifying underlying relationships between different items. Take the example of a supermarket where customers can buy a variety of items. There is usually a pattern in what customers buy. For instance, mothers with babies often buy baby products such as milk and diapers together.

If items A and B are frequently bought together, several steps can be taken to increase profit. For example:

1. A and B can be placed together, so that a customer who buys one of the products doesn't have to go far to buy the other.
1. People who buy one of the products can be targeted through an advertising campaign to buy the other.
1. Collective discounts can be offered on these products if the customer buys both of them.
1. A and B can be packaged together.

The process of identifying associations between products is called association rule mining.

The most prominent practical application of association rule mining is recommending products based on the products already present in the user's cart. Walmart in particular has made great use of the technique to suggest products to its users.

### Apriori Algorithm for Association Rule Mining

Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. 

The Apriori algorithm relies on three important measures: support, confidence, and lift. Each of them is explained below:

1. Support:
The support of an item I is the ratio of the number of transactions containing I to the total number of transactions:
$\text{Support}(I) = \frac{\text{number of transactions containing } I}{\text{total number of transactions}}$
1. Confidence:
Confidence is the likelihood that item B is also bought when item A is bought. It is the number of transactions in which A and B are bought together, divided by the number of transactions in which A is bought:
$\text{Confidence}(A \rightarrow B) = \frac{\text{number of transactions containing both } A \text{ and } B}{\text{number of transactions containing } A}$
1. Lift:
Lift(A → B) is the factor by which the likelihood of buying B increases when A is bought. A lift of 1 means A and B are bought independently of each other, while a lift greater than 1 means they are bought together more often than chance would predict.
$\text{Lift}(A \rightarrow B) = \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)}$
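
To make the three measures concrete, here is a minimal sketch that computes them by hand on a tiny set of transactions. The transactions and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of mlxtend.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# milk appears in 3 of 5 transactions
print(support({"milk"}, transactions))
# 2 of the 3 milk transactions also contain diapers
print(round(confidence({"milk"}, {"diapers"}, transactions), 3))
# (2/3) / (3/5): slightly above 1, so a weak positive association
print(round(lift({"milk"}, {"diapers"}, transactions), 3))
```

Because confidence divided by the support of the consequent is symmetric in A and B, `lift({"milk"}, {"diapers"}, ...)` and `lift({"diapers"}, {"milk"}, ...)` give the same value, while the two confidences can differ.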



### 1. Installation
[Mlxtend](http://rasbt.github.io/mlxtend/) (machine learning extensions) is a Python library of useful tools for day-to-day data science tasks.

In [1]:
pip install mlxtend --upgrade

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth, association_rules

### Importing the Dataset
Now let's import the dataset and see what we're working with.

* Dataset: [Online Retail data (UCI Machine Learning Repository)](http://archive.ics.uci.edu/ml/machine-learning-databases/00352/)

In [3]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Loading the Data
data = pd.read_excel('Online Retail.xlsx')
data.shape

In [4]:
# Exploring the columns of the data
data.columns


In [None]:
# Exploring the different regions of transactions
data.Country.unique()

In [None]:
data.groupby('Country').count()

### Data Preprocessing

1. We will drop the rows without an invoice number.
1. We will drop all cancelled transactions (invoice numbers containing the letter 'C').
1. We will drop all items that appear fewer than 4 times.
1. We will split the data according to the region of the transaction.


In [None]:
# Stripping extra spaces in the description
data['Description'] = data['Description'].str.strip()
 
# Dropping the rows without any invoice number
data.dropna(axis = 0, subset =['InvoiceNo'], inplace = True)
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
 
# Dropping all cancelled transactions (invoice numbers containing 'C')
data = data[~data['InvoiceNo'].str.contains('C')]

In [None]:
data.describe()

In [None]:
data.head()

In [None]:
min_freq = 3

In [None]:
# Dropping all items that appear fewer than 4 times (count must exceed min_freq)
count_item = data.groupby('Description').count().sort_values('InvoiceNo', ascending=False)
count_item = count_item[count_item['InvoiceNo']>min_freq]
print(count_item)

data = data[data['Description'].isin(list(count_item.index))]

In [None]:
# Transactions done in USA
basket_USA = (data[data['Country'] =="USA"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
 
# Transactions done in Canada
basket_CAN = (data[data['Country'] =="Canada"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Transactions done in France
basket_France = (data[data['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Transactions done in United Kingdom
basket_UK = (data[data['Country'] =="United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [None]:
basket_UK.shape

In [None]:
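# Defining the one-hot encoding function to make the data suitable for the concerned libraries
def hot_encode(x):
    # 1 if the item was bought at least once in the invoice, otherwise 0
    # (a single condition, so every input value maps to 0 or 1, never None)
    return 1 if x >= 1 else 0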
 
# Encoding the datasets
basket_encoded = basket_USA.applymap(hot_encode)
basket_USA = basket_encoded
 
basket_encoded = basket_CAN.applymap(hot_encode)
basket_CAN = basket_encoded
 
basket_encoded = basket_France.applymap(hot_encode)
basket_France = basket_encoded

basket_encoded = basket_UK.applymap(hot_encode)
basket_UK = basket_encoded

### Apply Apriori Algorithm
For large sets of data, there can be hundreds of items in hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items.

This process can be extremely slow due to the number of combinations. To speed it up, we perform the following steps:

1. Set a minimum value for support and confidence. This means that we are only interested in rules for items that occur often enough on their own (i.e. support) and that co-occur with other items often enough (i.e. confidence).
1. Extract all itemsets whose support is higher than the minimum threshold.
1. Select all rules from these itemsets whose confidence is higher than the minimum threshold.
1. Order the rules by descending lift.
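
The itemset extraction step above can be sketched in plain Python. This is a simplified, unoptimized illustration of the level-wise idea behind Apriori (grow itemsets one item at a time and discard any candidate that has an infrequent subset), not the implementation used by mlxtend. The function name `frequent_itemsets` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search returning {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

# Toy transactions (hypothetical data, for illustration only)
toy = [
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"bread", "butter"},
]

for itemset, sup in sorted(frequent_itemsets(toy, 0.4).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

With a minimum support of 0.4, the candidate {milk, diapers, bread} is never counted against the data: its subset {diapers, bread} is not frequent, so the prune step drops it. This pruning is what keeps the number of combinations manageable.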

In [None]:
min_sup = 0.05
min_conf = 0.5

In [None]:
# Now, let us return the items and itemsets with at least 5% support:
# Building the Apriori model
frq_items = apriori(basket_France, min_support = min_sup, use_colnames = True)
print(frq_items.head())
print(frq_items.shape)

In [None]:
# Building the FP growth model
frq_items1 = fpgrowth(basket_France, min_support=min_sup, use_colnames=True)
print(frq_items1.head())

In [None]:
frq_items1.shape

### Selecting and Filtering Results

The advantage of working with pandas DataFrames is that we can use their convenient features to filter the results.
For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 5 percent (the value of `min_sup`). Using the frequent itemsets created above, we first add a new column that stores the length of each itemset:


In [None]:
frq_items['length'] = frq_items['itemsets'].apply(lambda x: len(x))
print(frq_items.head())
frq_items1['length'] = frq_items1['itemsets'].apply(lambda x: len(x))
print(frq_items1.head())

In [None]:
print(frq_items[ (frq_items['length'] == 2) & (frq_items['support'] >= min_sup) ])
print(frq_items1[ (frq_items1['length'] == 2) & (frq_items1['support'] >= min_sup) ])

The `association_rules` function takes DataFrames of frequent itemsets as produced by the `apriori`, `fpgrowth`, or `fpmax` functions in `mlxtend.frequent_patterns`. To demonstrate its usage, we apply it to the frequent itemsets generated earlier by each model:

a) apriori model

In [None]:
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric ="confidence", min_threshold = min_conf)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())

b) FP growth model

In [None]:
rules1 = association_rules(frq_items1, metric ="confidence", min_threshold = min_conf)
rules1 = rules1.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules1.head())
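
To show what the confidence and lift columns in these rule tables mean, here is a plain Python sketch of rule generation from a dictionary of frequent itemsets. It is a conceptual illustration only, not how `association_rules` is implemented. The function name `rules_from_itemsets` and the support values are made up for the example.

```python
from itertools import combinations

# Hypothetical frequent itemsets with their supports (toy values)
supports = {
    frozenset({"milk"}): 0.6,
    frozenset({"diapers"}): 0.6,
    frozenset({"bread"}): 0.6,
    frozenset({"milk", "diapers"}): 0.4,
    frozenset({"milk", "bread"}): 0.4,
}

def rules_from_itemsets(supports, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                conf = sup / supports[antecedent]
                if conf >= min_confidence:
                    rules.append({
                        "antecedent": set(antecedent),
                        "consequent": set(consequent),
                        "support": sup,
                        "confidence": conf,
                        "lift": conf / supports[consequent],
                    })
    # Order the rules by descending lift
    return sorted(rules, key=lambda rule: rule["lift"], reverse=True)

for rule in rules_from_itemsets(supports, min_confidence=0.5):
    print(rule["antecedent"], "->", rule["consequent"],
          round(rule["confidence"], 3), round(rule["lift"], 3))
```

The lookups `supports[antecedent]` and `supports[consequent]` rely on every subset of a frequent itemset also being present in the dictionary, which holds for the output of Apriori because every subset of a frequent itemset is itself frequent.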

### Viewing the Results
Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

* at least 2 antecedents
* a confidence > 0.95
* a lift score > 7

We could compute the antecedent length as follows:

In [None]:
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.95) &
       (rules['lift'] > 7) ]

In [None]:
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()

In [None]:
plt.scatter(rules['support'], rules['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs lift')
plt.show()

In [None]:
plt.scatter(rules['lift'], rules['confidence'], alpha=0.5)
plt.xlabel('lift')
plt.ylabel('confidence')
plt.title('lift vs confidence')
plt.show()