## Finding sets of items that commonly occur together in a dataset
The Apriori algorithm is a way of finding "association rules" between items in a dataset.
It is famously used in supermarkets to identify patterns in the items that customers
put in their shopping baskets. For example, the rule [onions, potatoes] => [burger] found in the
sales data of a supermarket would indicate that if a customer buys onions
and potatoes together, they are likely to also buy burgers.

Given a shopping basket: 
<img src= https://m.media-amazon.com/images/I/71yi45a8U1L._AC_SX679_.jpg  width="300">

Association Rule analysis was developed to answer this question:

<img src= https://editor.analyticsvidhya.com/uploads/13952Market-basket-analysis.png  width="500">

But it has found many applications beyond supermarkets, for identifying patterns of occurrence in many kinds of dataset.

The contents of 5 shopping baskets might look like this:

<img src= https://miro.medium.com/max/940/1*908489_PRdpPMctC6MT6OQ.png   width="300">

It could be represented as:

<img src= https://miro.medium.com/max/854/1*V-ODhD4KOevmBBDTFKiHUA.png   width="300">

The association rules could be:

<img src= https://miro.medium.com/max/770/1*meE1hYNAn0B9iV6DQznBXg.png   width="300">

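where "Support" measures how much of the historical data supports the rule (the fraction of baskets that contain all of its items), and "Confidence" indicates the probability that the rule holds (of the baskets that contain the left-hand side, the fraction that also contain the right-hand side).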
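
To make these two measures concrete, here is a minimal sketch that computes them by hand. The five baskets and the helper names `support` and `confidence` are made up for illustration; they are not the baskets shown in the images above.

```python
# Hypothetical baskets, invented for illustration only
baskets = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burger"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support of both sides over support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Rule under test: [onions, potatoes] => [burger]
rule_support = support({"onions", "potatoes", "burger"}, baskets)
rule_confidence = confidence({"onions", "potatoes"}, {"burger"}, baskets)
print(rule_support, rule_confidence)
```

In this toy data the rule is supported by 2 of the 5 baskets (support 0.4), and it holds in 2 of the 3 baskets that contain both onions and potatoes (confidence of about 0.67).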

In [1]:
!pip install apyori --quiet #Install the apyori library 

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import io

### Load the data from a CSV file describing the contents of shopping baskets
Every row holds the contents of one basket. Each column (Item 1, Item 2, ...) holds one item from that basket, and unused slots are left empty.

In [3]:
pd.set_option('display.max_columns', 6)
df2=pd.read_csv("groceriesDS4M.csv")   

In [4]:
df2.head(4)  # see the data: the first shopping basket contains citrus fruit, semi-finished bread, margarine, etc.

Unnamed: 0,Item 1,Item 2,Item 3,...,Item 30,Item 31,Item 32
0,citrus fruit,semi-finished bread,margarine,...,,,
1,tropical fruit,yogurt,coffee,...,,,
2,whole milk,,,...,,,
3,pip fruit,yogurt,cream cheese,...,,,


In [5]:
df2.shape  # The dataset has 9835 shopping baskets, each containing at most 32 items

(9835, 32)

In [6]:
# find item sets 
from apyori import apriori

In [7]:
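# Build a list of transactions from the first 5000 baskets, using the first 20 item columns of each row.
# Empty cells (NaN) become the string "nan" after str(), so "nan" is counted as an item.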
transactions = []
for i in range(0, 5000):
  transactions.append([str(df2.values[i,j]) for j in range(0, 20)])

# Just for illustration  
print("Transaction 1:", transactions[0][0], transactions[0][1], transactions[0][2], transactions[0][3])
print("Transaction 2:",transactions[1][0], transactions[1][1], transactions[1][2], transactions[1][3])

Transaction 1: citrus fruit semi-finished bread margarine ready soups
Transaction 2: tropical fruit yogurt coffee nan
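
Because `str()` turns empty cells into the string "nan" (as in Transaction 2 above), the algorithm sees "nan" as just another item. One possible alternative, sketched here on a tiny hypothetical DataFrame (`demo` is a stand-in, not the groceries data), is to drop missing values while building each transaction.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the groceries table: one basket per row, NaN marks an empty slot
demo = pd.DataFrame({
    "Item 1": ["citrus fruit", "tropical fruit", "whole milk"],
    "Item 2": ["margarine", "yogurt", np.nan],
    "Item 3": ["ready soups", np.nan, np.nan],
})

# Keep only the cells that actually hold an item
clean_transactions = [row.dropna().tolist() for _, row in demo.iterrows()]
print(clean_transactions)
```

The results in this notebook were produced with the original loop, so this is only a suggestion for a cleaner input.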


## Call the Apriori function with arguments that define:
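* **Items** in a transaction form an item set
* **Support** is the fraction of transactions in the data set that contain the item set (`min_support` sets the minimum)
* **Confidence** is the probability that the association holds: of the transactions containing the first product, the fraction that also contain the second (`min_confidence` sets the minimum)
* **Lift** tells us how much more likely two products are to be bought together than if they were bought independently. For example, a lift of 3.33 for Burger => Ketchup means that customers who buy a burger are 3.33 times more likely to buy ketchup than customers in general. A lift of 1 means there is no association between products A and B. A lift greater than 1 means products A and B are bought together more often than chance would suggest (`min_lift` sets the minimum)
* **min_length** and **max_length** are intended to bound the number of items in an item set; with both set to 2, only associations between pairs of items are considered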
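
Written as formulas, the standard definitions of the three measures for a rule $A \Rightarrow B$ over $N$ transactions are as follows. The symbols $A$, $B$ and $N$ are introduced here for illustration and do not appear in the apyori API.

$$
\mathrm{support}(A \cup B) = \frac{\text{number of transactions containing both } A \text{ and } B}{N}
$$

$$
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
$$

$$
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)} = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)}
$$

The last form shows why a lift of 1 corresponds to independence: the two products then appear together exactly as often as their individual frequencies would predict.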


In [8]:
rules = apriori(transactions = transactions, min_support = 0.003, min_confidence = 0.3, min_lift = 3, min_length = 2, max_length = 2)

In [9]:
# Python's list() constructor converts the generator returned by apriori() into a list
results = list(rules)

In [10]:
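def inspect(results):  # build one (product1, product2, support, confidence, lift) row per result
    # Each result is (items, support, ordered_statistics); only the first ordered statistic
    # (items_base, items_add, confidence, lift) of each result is used here.
    product1 = [tuple(result[2][0][0])[0] for result in results]  # first item of the rule's left-hand side
    product2 = [tuple(result[2][0][1])[0] for result in results]  # first item of the rule's right-hand side
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(product1, product2, supports, confidences, lifts))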

# Extract the associations found 

In [11]:
AssociatedItems = pd.DataFrame(inspect(results), columns = ['product1', 'product2', 'Support', 'Confidence', 'Lift'])

# Print out the top associations found between items in shopping baskets


In [12]:
print(AssociatedItems)

           product1            product2  Support  Confidence      Lift
0     baking powder  whipped/sour cream   0.0062    0.326316  4.293629
1            liquor        bottled beer   0.0042    0.396226  4.739550
2            grapes      tropical fruit   0.0064    0.344086  3.393353
3  processed cheese         white bread   0.0048    0.315789  7.483163
4     roll products     root vegetables   0.0042    0.355932  3.095063
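
The table above is printed in the order the rules were found, not by strength. A minimal sketch of ranking the rules by lift with pandas follows; the DataFrame is rebuilt from the values printed above so the snippet runs on its own, and the name `top` is introduced here for illustration.

```python
import pandas as pd

# Values copied from the printed table above
AssociatedItems = pd.DataFrame({
    "product1": ["baking powder", "liquor", "grapes", "processed cheese", "roll products"],
    "product2": ["whipped/sour cream", "bottled beer", "tropical fruit", "white bread", "root vegetables"],
    "Support": [0.0062, 0.0042, 0.0064, 0.0048, 0.0042],
    "Confidence": [0.326316, 0.396226, 0.344086, 0.315789, 0.355932],
    "Lift": [4.293629, 4.739550, 3.393353, 7.483163, 3.095063],
})

# Strongest associations first
top = AssociatedItems.sort_values("Lift", ascending=False).reset_index(drop=True)
print(top.head(3))
```

Ranked this way, the strongest pair in this output is processed cheese with white bread, with a lift of about 7.48.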


### So we can see that "baking powder" is often associated with "whipped/sour cream" in shopping baskets