# Apriori

**Apriori** is an **association rule learning** algorithm, a different way of learning which items in the data tend to occur together. The idea is captured by the familiar statement "people who bought this also bought that". This is what **recommendation systems** at companies such as Amazon, Netflix and Spotify do: they **recommend/predict** what customers will buy, watch or listen to, based on **prior** knowledge of what they bought, watched or listened to before. In this part we will learn how to build a simple model of this kind.

Association rule learning is very different from what we learned before. In supervised learning (regression and classification) we predicted a dependent variable and knew what to predict. In unsupervised learning (clustering) we learned patterns in the data in order to create a new dependent variable. Now we will learn association rules inside a collection of movies, transactions or songs. This is very important for retail and e-commerce companies.

We will learn two models under association rule learning, **Apriori** and **Eclat** (we consider Apriori the better of the two).

## Importing the libraries

The `scikit-learn` library does not include an **Apriori model**, so we will use another library, `apyori` (**apyori.py** is a Python implementation of the Apriori algorithm). Google Colab comes with many libraries and packages pre-installed (including the deep learning library `TensorFlow`), but `apyori` is not one of them. We therefore install it with the `pip` command, which first downloads the `apyori` module from the web and then installs it inside this Google Colab notebook.

In [1]:
!pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5953 sha256=18aca28c71101999bc8fcbbcacddcec31bd24beafce0fd27b6fc093506beacad
  Stored in directory: /root/.cache/pip/wheels/c4/1a/79/20f55c470a50bb3702a8cb7c94d8ada15573538c7f4baebe2d
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2
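
Before running the install cell it can be handy to check whether a module is already available. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration and is not part of `apyori` or Colab.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None

# Only a hint is printed here; the actual installation is still done with pip.
if not is_installed("apyori"):
    print("apyori is missing; run `pip install apyori` first")
```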


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Preprocessing

The dataset is about market basket optimization. Picture a lively town in the south of France, where people go out a lot and love visiting the local grocery stores, cafés and shops. Imagine that you are the owner of one of these stores, selling food and other delicious stuff, and that you would like to boost your sales. You want to offer some new deals to your customers (buy this and get something else for free!), and to do that you need to identify the best association rules among the products they buy.

We are going to use association rule learning to find the strongest rules of the form "if customers buy this product, then they have a high chance of buying that other product", and we will measure that chance. The free product in a deal should be one that is associated with products the customers already buy. So the owner hires a data scientist, who collects the data of all the transactions and looks for the deals that customers are most likely to take up. The owner then sets the prices within each "buy this, get that" pair so that the deal remains profitable.

Each row of the dataset corresponds to one transaction, and for each transaction we know the products that the customer bought.

In [3]:
# The file has no column names. Without header = None, pandas would take the first row
# (the first customer's transaction) as the column names and we would lose that transaction.
# header = None tells pandas that there is no header row.
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
dataset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7497,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7498,chicken,,,,,,,,,,,,,,,,,,,
7499,escalope,green tea,,,,,,,,,,,,,,,,,,


Here we will not create a matrix of features $X$ and a dependent variable vector $y$, because association rule learning is a totally different kind of task. We also don't have to split the dataset into training and test sets, because the rules are learned from the whole dataset.

The Apriori model is trained with the `apriori()` function of the `apyori` module. This function takes the dataset as input, but in a certain format, and that format is not a pandas DataFrame. So we have to convert the pandas DataFrame into a **list of transactions** that the `apriori()` function can read.

In [9]:
transactions = []   # Initialize as an empty list.
n_rows, n_cols = dataset.shape   # 7501 transactions, up to 20 products each.
for i in range(n_rows):   # Iterate over all the transactions of the dataset.
  # The inner loop iterates over all the columns of one transaction. apriori() expects every
  # item to be a string, so each cell is converted with str(). Transactions with fewer than
  # 20 products have NaN in the remaining cells, which str() turns into the string 'nan'.
  transactions.append([str(dataset.values[i, j]) for j in range(n_cols)])


In [10]:
# Print the first two transactions
transactions[0:2]

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers',
  'meatballs',
  'eggs',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan']]

The result is a list of lists: a list of transactions, where each transaction is a list of products. Missing products appear as the string `'nan'`.
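
The loop above keeps every empty cell as a `'nan'` string inside the transaction. As an alternative, here is a minimal sketch that builds the list of transactions with the standard `csv` module and skips empty cells. The helper name `read_transactions` is made up for this illustration, and the sample rows are copied from the first rows of the dataset shown earlier. This variant has not been run against the full file here.

```python
import csv
import io

def read_transactions(lines):
    """Turn CSV rows into lists of product names, skipping empty cells."""
    transactions = []
    for row in csv.reader(lines):
        items = [cell.strip() for cell in row if cell.strip()]
        if items:   # ignore completely empty rows
            transactions.append(items)
    return transactions

# Small in-memory sample shaped like rows 1 to 3 of the dataset.
sample = io.StringIO(
    "burgers,meatballs,eggs\n"
    "chutney\n"
    "turkey,avocado\n"
)
transactions_clean = read_transactions(sample)
print(transactions_clean)
```

With the real file, the same function could be called as `read_transactions(open('Market_Basket_Optimisation.csv', newline=''))`. The support of a pair of real products only counts transactions that contain both products, so it should not depend on whether the `'nan'` placeholders are kept.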

## Training the Apriori model on the dataset

Now we are ready to train the Apriori model on the whole dataset. The transactions list created above is the input to the `apriori()` function. This function not only trains the Apriori model on the dataset but also returns the final rules together with their supports, confidences and lifts. The Apriori model is mostly used to find correlations and association rules among the items in transactions.
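
To make the three metrics concrete, here is a minimal sketch that computes support, confidence and lift by hand on a toy example. The five transactions and the function names are invented for this illustration. They are not taken from the dataset and are not the `apyori` implementation.

```python
# Toy transactions, invented for illustration only.
toy_transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
    {"butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(base, add, transactions):
    """Among transactions containing `base`, the fraction that also contain `add`."""
    return support(set(base) | set(add), transactions) / support(base, transactions)

def lift(base, add, transactions):
    """Confidence of the rule divided by the overall support of `add`."""
    return confidence(base, add, transactions) / support(add, transactions)

s = support({"bread", "butter"}, toy_transactions)       # 2 of 5 transactions
c = confidence({"bread"}, {"butter"}, toy_transactions)  # 2 of the 3 bread transactions
l = lift({"bread"}, {"butter"}, toy_transactions)        # (2/3) / (3/5)
print(s, c, l)
```

Here butter appears in 3 of the 5 transactions overall, but in 2 of the 3 transactions that contain bread, so the lift of the rule bread → butter is slightly above 1.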

In [12]:
from apyori import apriori   # Import the apriori() function from the apyori library.

# 'rules' holds the output of apriori(). The 'transactions' parameter takes the dataset on which the model is trained.
# The values of min_support, min_confidence, min_lift etc. depend on the business problem.
# - min_support: we do not want to compute every possible rule, only rules relevant enough to have a support of
#   at least min_support. Using common sense, we consider pairs of products that appear in at least 3
#   transactions a day. Taking the 7501 transactions to cover one week, that is 3*7 = 21 transactions, so
#   min_support = 21/7501 ≈ 0.003: the products in a rule appear together in at least about 0.3% of transactions.
# - min_confidence: found by trying several values. min_confidence = 0.2 (customers who buy product A also buy
#   product B at least 20% of the time) gives neither too few nor too many product combinations.
# - min_lift: from experience, a general choice is a value between 3 and 8.
# - min_length, max_length: the min/max number of products we want in a rule. For a "buy 1 get 1" offer we set
#   both to 2. For a "buy 2 get 1" offer, set min_length = 3 and max_length = 3.
rules = apriori(transactions = transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)
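
The arithmetic behind `min_support = 0.003` can be checked in a few lines. This is a minimal sketch under the assumptions stated in the comments: a threshold of 3 purchases a day and a dataset that covers one week. The variable names are made up for the illustration.

```python
import math

n_transactions = 7501    # number of transactions in the dataset
min_daily_count = 3      # assumed threshold: the pair is bought at least 3 times a day
days_covered = 7         # assumption: the dataset spans one week

min_count = min_daily_count * days_covered    # 21 transactions
exact_support = min_count / n_transactions    # about 0.0028

# Rounding up to 0.003 makes the threshold slightly stricter than 21 transactions.
effective_min_count = math.ceil(0.003 * n_transactions)
print(min_count, round(exact_support, 4), effective_min_count)
```

So the rounded value 0.003 corresponds to a pair appearing in at least 23 of the 7501 transactions rather than 21, a small difference that is fine for a rule-of-thumb threshold.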

## Visualising the results

We have the rules, and now we want to look at them together with their support, confidence and lift. We will see which pair of products is the best candidate for a "buy 1 get 1" offer.

### Displaying the first results coming directly from the output of the apriori function

In [14]:
# Put the rules into a list
results = list(rules)

In [48]:
results

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0

So, all the rules are listed. For example, the first pair of products is (light cream, chicken). This means that "if customers buy light cream, they have a 29% chance (confidence = 0.29) of buying chicken". This makes sense, as French people love to put some light cream with lemon as a sauce for their chicken, a traditional French meal. The support = 0.0045 means this pair of products appears in 0.45% of all transactions. All the other pairs of products and their statistics are listed in the same way.

All the rules have **support $\ge$ min_support, confidence $\ge$ min_confidence and lift $\ge$ min_lift**.
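
The numbers of the first rule can be turned back into transaction counts. This is a minimal sketch that assumes a total of 7501 transactions (the number of rows looped over earlier) and uses the standard relations confidence = support(pair) / support(base) and lift = confidence / support(added item). The variable names are made up for the illustration.

```python
n_transactions = 7501                  # assumed total number of transactions
support_pair = 0.004532728969470737    # support of {light cream, chicken}
confidence = 0.29059829059829057       # confidence of light cream -> chicken
lift = 4.84395061728395                # lift of light cream -> chicken

pair_count = support_pair * n_transactions   # transactions containing both products
base_count = pair_count / confidence         # transactions containing light cream
chicken_support = confidence / lift          # share of all transactions containing chicken

print(round(pair_count), round(base_count), round(chicken_support, 3))
```

Under these assumptions, about 34 transactions contain both products, about 117 contain light cream, and chicken appears in roughly 6% of all transactions. Buying light cream therefore raises the chance of buying chicken from about 6% to about 29%, which is what the lift of 4.84 expresses.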

### Putting the results well organised into a Pandas DataFrame

Here we put the rules and their statistics into a well-organized table, a pandas DataFrame. We can then sort the rules by a metric in descending order, say `lift`. **We treat `lift` as the most relevant metric for measuring the strength of an association rule.**
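
For reference, lift can be written in terms of support and confidence. This is the standard definition. The notation $A \rightarrow B$ for the rule "if $A$ is bought, then $B$ is bought" is introduced here only for illustration.

$$
\text{lift}(A \rightarrow B) = \frac{\text{confidence}(A \rightarrow B)}{\text{support}(B)} = \frac{\text{support}(A \cup B)}{\text{support}(A)\,\text{support}(B)}
$$

A lift of 1 means that $A$ and $B$ appear together exactly as often as they would if they were bought independently. A lift above 1 means that buying $A$ makes $B$ more likely than its overall frequency suggests, which is why a higher lift indicates a stronger rule.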

In [50]:
# Each record in the list of rules behaves like a tuple of 3 elements:
# (items, support, ordered_statistics). We use indexing to extract the relevant information.

def inspect(results):   # The inspect function takes the list of rules as input.
    lhs         = [tuple(result[2][0][0])[0] for result in results]   # items_base: the product that is bought
    rhs         = [tuple(result[2][0][1])[0] for result in results]   # items_add: the associated product
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

# pd.DataFrame() takes the output of inspect(), adds the column names and returns a pandas DataFrame.
results_in_DataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
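
The numeric indices in `inspect()` are hard to read. The printed output above shows that each record has the named fields `items`, `support` and `ordered_statistics`, and each statistic has `items_base`, `items_add`, `confidence` and `lift`. Below is a minimal sketch that uses those names instead. To stay self-contained it defines stand-in namedtuples with the same field names rather than importing `apyori`. The function name `inspect_by_name` is made up for the illustration. Unlike `inspect()`, it emits one row per ordered statistic and joins multi-item sides into one string.

```python
from collections import namedtuple

# Stand-ins mirroring the field names visible in the printed output.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

def inspect_by_name(results):
    """Return one (lhs, rhs, support, confidence, lift) row per ordered statistic."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            lhs = ", ".join(sorted(stat.items_base))
            rhs = ", ".join(sorted(stat.items_add))
            rows.append((lhs, rhs, record.support, stat.confidence, stat.lift))
    return rows

# The first rule from the output above, rebuilt by hand as sample input.
sample = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004532728969470737,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"light cream"}),
                items_add=frozenset({"chicken"}),
                confidence=0.29059829059829057,
                lift=4.84395061728395,
            )
        ],
    )
]
rows = inspect_by_name(sample)
print(rows)
```

The returned rows can be passed to `pd.DataFrame()` with the same column names as before.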

### Displaying the results unsorted

In [51]:
results_in_DataFrame

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |


### Displaying the results sorted by descending lifts

**Lift** is the metric we use to measure the relevance of an association rule. We display the rules in descending order of lift so that the most relevant ones come first.

In [56]:
results_in_DataFrame.sort_values('Lift', ascending = False)   # Sort the DataFrame by the "Lift" column in descending order.

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |


In [57]:
# Another way of sorting: take the n rows with the largest "Lift", returned in descending order.

results_in_DataFrame.nlargest(n = 10, columns = 'Lift')

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
