## Introduction

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets.

The Apriori algorithm is used for mining frequent itemsets and the association rules derived from them. It generally operates on a database containing a huge number of transactions, for example the items customers buy at a Big Bazar.

Stores can use the mined rules to make related products easier for customers to find, which in turn can improve the sales performance of the store.

### Components of Apriori algorithm

#### Support

Support refers to the frequency of occurrence of an itemset in a dataset. It is defined as the proportion of transactions in the dataset that contain the itemset. For example, consider a dataset of sales transactions in a retail store that sells milk, bread, cheese, eggs, butter, and yogurt. If the itemset {milk, bread} appears in 5 out of 10 transactions in the dataset, then its support is 5/10 = 0.5, or 50%.

In the Apriori algorithm, itemsets with a support value above the minimum defined support threshold are considered frequent and are used to generate candidate itemsets for the next iteration of the algorithm.

$$\text{Support}(A) = \frac{\text{Number of transactions in which } A \text{ occurs}}{\text{Number of all transactions}}$$
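As a minimal sketch of support as the proportion of transactions containing an itemset, the snippet below reproduces the {milk, bread} example. The `transactions` list and the `support` helper are made up for illustration and are not part of any library.

```python
# Hypothetical toy data: 10 transactions, 5 of which contain both milk and bread
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "eggs"},
    {"milk", "bread", "cheese"},
    {"milk", "bread", "yogurt"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"cheese", "yogurt"},
    {"eggs", "butter"},
    {"yogurt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"milk", "bread"}, transactions))  # 0.5
```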


#### Confidence

Confidence is a measure of the strength of the association between two items in an itemset. It is defined as the conditional probability that item B appears in a transaction, given that item A appears in the same transaction.

$$\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{sup}(A \cup B)}{\text{sup}(A)}$$

If the confidence value exceeds a specified threshold, it indicates that item B is likely to be purchased with item A. For instance, if the confidence of the rule "bread" ⇒ "butter" is 0.8, then 80% of the transactions that contain "bread" also contain "butter". This can be useful for recommending products to customers or for optimizing product placement in a store.
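The conditional probability can be computed directly from two support values. The sketch below uses a made-up list of eight transactions (an assumption for illustration) in which 4 of the 5 bread transactions also contain butter; `support` and `confidence` are hypothetical helper names.

```python
# Hypothetical toy data: bread appears in 5 transactions, 4 of them with butter
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"butter"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """Estimate P(B | A) as support(A and B together) / support(A)."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

print(confidence({"bread"}, {"butter"}, transactions))  # 0.8
```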

#### Lift

Lift measures the strength of the association between two items. It is defined as the ratio of the support of the two items occurring together to the product of the supports of the individual items. The lift for any two items can be calculated using the following formula:

$$\text{Lift}(A \rightarrow B) = \frac{\text{Support}(A \text{ and } B)}{\text{Support}(A) \times \text{Support}(B)}$$

If the lift value is greater than 1, it indicates a positive association between the two items, which means that they are bought together more often than would be expected if they were independent. A lift value of exactly 1 indicates that the two items are independent and there is no association between them, while a value less than 1 indicates a negative association, meaning that the two items are bought together less often than would be expected if they were independent.
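A short sketch of the lift calculation, taking lift as the joint support divided by the product of the individual supports. The toy `transactions` and the helper names are assumptions made for illustration only.

```python
# Hypothetical toy data: bread and butter each appear in 5 of 8 transactions,
# and together in 4 of them
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"butter"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(a, b, transactions):
    """Joint support divided by the product of the individual supports."""
    joint = support(set(a) | set(b), transactions)
    return joint / (support(a, transactions) * support(b, transactions))

print(lift({"bread"}, {"butter"}, transactions))  # 1.28, i.e. a positive association
```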

### Steps in Apriori Algorithm

1. Define minimum support threshold - This is the minimum support (the number or proportion of transactions containing an item set) that an item set must reach to be considered frequent. The support threshold is usually set by the user based on the size of the dataset and domain knowledge.
2. Generate a list of frequent 1-item sets - Scan the entire dataset to identify the items that meet the minimum support threshold. These item sets are known as frequent 1-item sets.
3. Generate candidate item sets - In this step, the algorithm generates a list of candidate item sets of length k+1 from the frequent k-item sets identified in the previous step.
4. Count the support of each candidate item set - Scan the dataset again to count the number of times each candidate item set appears in the dataset.
5. Prune the candidate item sets - Remove the item sets that do not meet the minimum support threshold.
6. Repeat steps 3-5 until no more frequent item sets can be generated.
7. Generate association rules - Once the frequent item sets have been identified, the algorithm generates association rules from them. Association rules are rules of form A -> B, where A and B are item sets. The rule indicates that if a transaction contains A, it is also likely to contain B.
8. Evaluate the association rules - Finally, the association rules are evaluated based on metrics such as confidence and lift.
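Steps 1 to 6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any library: the function name `apriori_frequent_itemsets` and the toy transactions are hypothetical, support is expressed as a proportion, and the candidate step also discards candidates that have an infrequent subset (the prior-knowledge property that gives the algorithm its name).

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 2: frequent 1-item sets
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 1
    while current:
        # Step 3: join frequent k-item sets into (k+1)-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Keep only candidates whose k-item subsets are all frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # Steps 4 and 5: count support and prune
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1  # Step 6: repeat with the next level
    return frequent

# Hypothetical toy transactions
toy = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With this toy data the three single items and the three pairs are frequent, while the triple {milk, bread, butter} appears in only 2 of 5 transactions and is pruned.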

## Implementing market basket analysis

### Import the Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

### Importing the Dataset

In [2]:
data = pd.read_csv('store_data.csv')

In [4]:
data.head()

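| | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | low fat yogurt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |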


Each row is one transaction, and each column holds one of the items in that transaction. A NaN tells us that the transaction has no further items, that is, it contains fewer items than there are columns.

This dataset has no header row, but by default the pd.read_csv function treats the first row as the header, so the first transaction was used as column names above. To fix this, pass the header=None option to pd.read_csv, as shown below:

In [5]:
data = pd.read_csv('store_data.csv', header=None)

In [6]:
data.head()

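| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |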


In [10]:
data.shape

(7501, 20)

### Data Preprocessing

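Currently we have the data in the form of a pandas DataFrame. The apriori function of the apyori library expects the transactions as a list of lists, so we convert the DataFrame, with each inner list holding the items of one transaction: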

In [8]:
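records = []
n_rows, n_cols = data.shape  # (7501, 20) for this dataset
for i in range(n_rows):
    # str() also turns every missing value (NaN) into the string 'nan'
    records.append([str(data.values[i, j]) for j in range(n_cols)])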

In [28]:
records[0:3]

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers',
  'meatballs',
  'eggs',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['chutney',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan']]
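Note that the conversion above leaves the literal string 'nan' in every short transaction, and an itemset miner will treat 'nan' like any other item. As a hedged sketch of one way to avoid this, the placeholders can be filtered out before mining. The `toy_records` list below is a shortened stand-in for `records`, and `clean_records` is a hypothetical name. The outputs shown in the rest of this section were produced from the unfiltered `records`, so mining the filtered lists would give different results.

```python
# Shortened, hypothetical stand-in for the first rows of `records`
toy_records = [
    ['burgers', 'meatballs', 'eggs', 'nan', 'nan'],
    ['chutney', 'nan', 'nan', 'nan', 'nan'],
    ['turkey', 'avocado', 'nan', 'nan', 'nan'],
]

# Drop the 'nan' placeholders so only real items remain in each transaction
clean_records = [[item for item in row if item != 'nan'] for row in toy_records]

print(clean_records)
# [['burgers', 'meatballs', 'eggs'], ['chutney'], ['turkey', 'avocado']]
```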

### Applying Apriori

In [19]:
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
results = list(association_rules)

In [20]:
len(results)

48

In [21]:
print(results[0])

RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])
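The three numbers in this record are tied together by the definitions of confidence and lift, so the supports of the individual items can be recovered from them. The snippet below is a small arithmetic check using only the values printed above. The variable names are illustrative, and the implied transaction counts assume the 7501 rows reported by data.shape.

```python
# Values copied from the printed RelationRecord (light cream -> chicken)
support_pair = 0.004532728969470737  # support of {light cream, chicken}
confidence = 0.29059829059829057     # confidence(light cream -> chicken)
lift = 4.84395061728395              # lift(light cream -> chicken)

# confidence = support_pair / support(light cream)
support_light_cream = support_pair / confidence
# lift = confidence / support(chicken)
support_chicken = confidence / lift

print(round(support_light_cream, 4))  # 0.0156
print(round(support_chicken, 4))      # 0.06

# Implied transaction counts, assuming 7501 transactions
n_transactions = 7501
print(round(support_pair * n_transactions))         # 34
print(round(support_light_cream * n_transactions))  # 117
print(round(support_chicken * n_transactions))      # 450
```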


In [22]:
# putting output into a pandas dataframe
def tables(results):
    rows = []
    for result in results:
        # Use the first rule derived from each itemset
        stat = result.ordered_statistics[0]
        # Only one item from each side of the rule is kept
        item1 = tuple(stat.items_base)[0]
        item2 = tuple(stat.items_add)[0]
        rows.append((item1, item2, result.support, stat.confidence, stat.lift))
    return rows
output_DataFrame = pd.DataFrame(tables(results), columns = ['Item 1', 'Item 2', 'Support', 'Confidence', 'Lift'])

In [23]:
output_DataFrame

| | Item 1 | Item 2 | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 3 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 4 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 5 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 6 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | light cream | | 0.004533 | 0.290598 | 4.843951 |
| 8 | chocolate | shrimp | 0.005333 | 0.232558 | 3.254512 |
| 9 | ground beef | spaghetti | 0.004799 | 0.571429 | 3.281995 |


In [26]:
# Displaying the results sorted by descending order of Lift column

output_DataFrame.nlargest(n = 10, columns = 'Lift')

| | Item 1 | Item 2 | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 7 | light cream | | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 11 | pasta | | 0.005866 | 0.372881 | 4.700812 |
| 28 | pasta | | 0.005066 | 0.322034 | 4.515096 |
| 6 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 27 | whole wheat pasta | | 0.007999 | 0.271493 | 4.130772 |
| 5 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 21 | spaghetti | ground beef | 0.006399 | 0.393443 | 4.00436 |
| 41 | spaghetti | | 0.006399 | 0.393443 | 4.00436 |


### Limitations of Apriori Algorithm
Despite being simple, the Apriori algorithm has some limitations, including:

- It can waste time generating and counting a large number of candidate itemsets when there are many frequent itemsets.
- Its efficiency drops when a large number of transactions must be processed with limited memory capacity.
- It requires high computational power because it needs to scan the entire database repeatedly.