# Association Rule Learning Practice

The goal of this practice is to find a promotional product: a visitor who buys one or two products receives it as a gift, so it should be the product most strongly associated with the product or products they bought.
Two methods are presented here: Apriori and Eclat. With Apriori, we draw conclusions from three indicators (support, confidence, and lift), whereas Eclat is based only on the support of the product group.
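
To make the three indicators concrete, here is a minimal sketch that computes them by hand. The toy transactions and the helper names `support`, `confidence`, and `lift` are illustrative assumptions, not part of the bakery data or of any library.

```python
# Toy transactions (hypothetical, not taken from the bakery data).
transactions = [
    {"Coffee", "Cake"},
    {"Coffee", "Bread"},
    {"Coffee", "Cake", "Bread"},
    {"Tea"},
    {"Bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs divided by the support of rhs."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

print(support({"Cake", "Coffee"}, transactions))
print(confidence({"Cake"}, {"Coffee"}, transactions))
print(lift({"Cake"}, {"Coffee"}, transactions))
```

A lift above 1 means the two sides appear together more often than their individual popularity would suggest; a lift near 1 means they look independent.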

The data belongs to a bakery called "The Bread Basket", located in the historic center of Edinburgh. This bakery presents a refreshing offer of Argentine and Spanish products. The data set contains the following columns: Date (YYYY-MM-DD format), Time (HH:MM:SS format), Transaction, and Item. Rows that share the same Transaction value belong to the same transaction, which is why the data set has fewer transactions than observations.
https://www.kaggle.com/aboliveira/bakery-market-basket-analysis/data

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
import warnings
warnings.filterwarnings("ignore")

## Importing the dataset

In [3]:
dataset = pd.read_csv('BreadBasket.csv')

In [4]:
dataset

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
2,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam
...,...,...,...,...
21288,2017-04-09,14:32:58,9682,Coffee
21289,2017-04-09,14:32:58,9682,Tea
21290,2017-04-09,14:57:06,9683,Coffee
21291,2017-04-09,14:57:06,9683,Pastry


# Data Preprocessing

First, remove the 'NONE' items.

In [5]:
dataset = dataset[dataset.Item != 'NONE']

In [6]:
dataset

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
2,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam
...,...,...,...,...
21288,2017-04-09,14:32:58,9682,Coffee
21289,2017-04-09,14:32:58,9682,Tea
21290,2017-04-09,14:57:06,9683,Coffee
21291,2017-04-09,14:57:06,9683,Pastry


Our models need lists of items, one list per transaction, so we can drop the date and time columns.

In [7]:
dataset.drop(columns=['Date', 'Time'], inplace=True)

Next, let's build a 2D array in which each row is one transaction. Before that, check for duplicates.

In [8]:
dataset.duplicated()

0        False
1        False
2         True
3        False
4        False
         ...  
21288    False
21289    False
21290    False
21291    False
21292    False
Length: 20507, dtype: bool

`True` means that the same transaction and item pair already appeared in an earlier row, so we drop those rows.

In [9]:
dataset.drop_duplicates(inplace = True)

Now create the 2D array of items, where each row is one transaction.

In [10]:
X = []
for tr in range(1, dataset['Transaction'].max()+1):
    items = dataset.loc[dataset['Transaction'] == tr]['Item'].tolist()
    if len(items) != 0:
        X.append(items)

In [11]:
len(X)  # smaller than the largest transaction id, because data for some transactions is missing

9465
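
The two steps above (dropping repeated transaction and item pairs, then collecting the items of each transaction) can be illustrated without pandas on a handful of rows. This is a minimal sketch, assuming rows are `(transaction, item)` tuples. The helper `group_items` and the variable `X_toy` are hypothetical names; the first five rows mirror the head of the data set, and the last row is made up to show a gap in the ids.

```python
def group_items(rows):
    """Group (transaction_id, item) rows into one item list per transaction.

    Repeated (transaction_id, item) pairs are kept only once, and
    transactions are returned in ascending id order.
    """
    baskets = {}
    for transaction_id, item in rows:
        items = baskets.setdefault(transaction_id, [])
        if item not in items:  # skip a pair that was already seen
            items.append(item)
    return [baskets[tid] for tid in sorted(baskets)]

rows = [
    (1, "Bread"),
    (2, "Scandinavian"),
    (2, "Scandinavian"),   # duplicate pair, kept once
    (3, "Hot chocolate"),
    (3, "Jam"),
    (5, "Coffee"),         # hypothetical row: id 4 has no data
]
X_toy = group_items(rows)
print(X_toy)
```

Ids with no rows simply produce no list, which is why the number of lists can be smaller than the largest transaction id.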

# Apriori

Let's check the most popular items, since they are likely candidates for our goal.

In [12]:
transaction_count = dataset.groupby(by='Item')[['Transaction']].count().sort_values(by='Transaction', ascending=False)
transaction_count.head()

Item,Transaction
Coffee,4528
Bread,3097
Tea,1350
Cake,983
Pastry,815


In [13]:
def convert_to_percentage(x):
    return 100 * x / float(x.sum())

transaction_percentage = transaction_count.apply(convert_to_percentage)
transaction_percentage.head()

Item,Transaction
Coffee,23.974162
Bread,16.397522
Tea,7.147774
Cake,5.204638
Pastry,4.315137
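
Note that these percentages are shares of all item rows, not of transactions. Support, as used by Apriori, is measured per transaction: Coffee appears in 4528 of the 9465 transactions, which is about 47.8%. Below is a minimal sketch of computing per-transaction item support, assuming a list of item lists shaped like `X`; `item_support` and the toy data are hypothetical.

```python
from collections import Counter

def item_support(transactions):
    """Return {item: fraction of transactions containing the item}."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))  # count each item once per transaction
    n = len(transactions)
    return {item: count / n for item, count in counts.items()}

# Toy data (hypothetical, not the bakery transactions).
toy = [["Coffee", "Bread"], ["Coffee"], ["Tea"], ["Coffee", "Cake"]]
supports = item_support(toy)
for item, value in sorted(supports.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {value:.2f}")
```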


## Training the Apriori model on the dataset

In [14]:
from apyori import apriori
rules = apriori(transactions = X, min_support = 0.001, min_confidence = 0.1, min_lift = 3, min_length = 2, max_length = 3)

## Visualising the results

#### Putting the results into a well-organised Pandas DataFrame (code taken from the course)

In [15]:
results = list(rules)

In [16]:
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))
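
`inspect` keeps only the first rule statistic of each result and only the first item on each side, so a rule over three items is displayed as if it were a pair. The sketch below is a variant that keeps every rule and every item. It assumes the result layout implied by the indexing above: items, support, then a list of (base items, added items, confidence, lift). The namedtuples are stand-ins so that the example runs without apyori, and `inspect_full` is a hypothetical name.

```python
from collections import namedtuple

# Stand-ins mirroring the layout indexed by inspect():
# result[0] items, result[1] support, result[2] list of rule statistics.
Record = namedtuple("Record", ["items", "support", "ordered_statistics"])
Stat = namedtuple("Stat", ["items_base", "items_add", "confidence", "lift"])

def inspect_full(results):
    """One row per rule, with all items on each side joined into a string."""
    rows = []
    for result in results:
        for stat in result[2]:
            lhs = ", ".join(sorted(stat[0]))
            rhs = ", ".join(sorted(stat[1]))
            rows.append((lhs, rhs, result[1], stat[2], stat[3]))
    return rows

# Hypothetical record, not taken from the bakery output.
sample = [
    Record(
        items=frozenset({"A", "B", "C"}),
        support=0.01,
        ordered_statistics=[
            Stat(frozenset({"A"}), frozenset({"B", "C"}), 0.5, 4.0),
            Stat(frozenset({"A", "B"}), frozenset({"C"}), 0.8, 3.2),
        ],
    )
]
for row in inspect_full(sample):
    print(row)
```

The returned rows have the same five fields as those of `inspect`, so they can be loaded into a DataFrame with the same column names.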

In [17]:
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

#### Displaying the results sorted by descending lifts

In [18]:
resultsinDataFrame.nlargest(n = 15, columns = 'Lift')

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
26,Extra Salami or Feta,Coffee,0.001479,0.368421,56.243633
3,Extra Salami or Feta,Salad,0.00169,0.421053,40.255183
14,Alfajores,Juice,0.001057,0.434783,11.274568
4,Fudge,Jam,0.002536,0.169014,11.265622
12,Salad,Spanish Brunch,0.001268,0.121212,6.67019
37,Coke,Sandwich,0.001057,0.47619,6.628151
30,Spanish Brunch,Coffee,0.002007,0.110465,5.361807
17,Jammie Dodgers,Bread,0.001585,0.12,5.139367
29,Jammie Dodgers,Coffee,0.001373,0.104,5.048
21,Cake,Soup,0.001162,0.169231,4.913403


In my opinion, we should use small thresholds (especially for support), because there are many transactions and each contains only a few items (often just one).
The most valuable indicator is lift, which shows how much more likely a customer is to buy the product on the right-hand side when they buy the product on the left-hand side, taking into account how popular the products are. So we can recommend to the bakery the pairs with the highest lift: Extra Salami or Feta and Coffee, Extra Salami or Feta and Salad, and Alfajores (a cookie) and Juice. Other pairs also look logical, such as Spanish Brunch and Coffee, or Jammie Dodgers (a cookie) and Coffee, but their confidence is only about 10%.
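
As a sanity check on the table, lift is bounded by the popularity of the right-hand side. Using the counts reported earlier (Coffee appears in 4528 of the 9465 transactions):

```latex
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
  \le \frac{1}{\mathrm{supp}(B)},
\qquad
\mathrm{supp}(\{\text{Coffee}\}) = \frac{4528}{9465} \approx 0.478
\;\Rightarrow\;
\mathrm{lift}(A \Rightarrow \{\text{Coffee}\}) \le \frac{1}{0.478} \approx 2.09
```

A row that shows Coffee on the right-hand side with a lift of 56.24 therefore cannot have Coffee alone on that side. Since `inspect` displays only the first item of each side, such rows most likely come from itemsets with three items, so the full rule is worth checking before a promotion is chosen.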

# Eclat

## Training the Eclat model on the dataset

In [19]:
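# Eclat is emulated here with the same apriori function: looser thresholds,
# and only the support of each result is kept in the cells below.
rules = apriori(transactions = X, min_support = 0.001, min_confidence = 0.05, min_lift = 2, min_length = 2, max_length = 5)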


### Displaying the first results coming directly from the output of the apriori function

In [20]:
results = list(rules)

def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    return list(zip(lhs, rhs, supports))

resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Product 1', 'Product 2', 'Support'])

### Displaying the results sorted by descending supports

In [21]:
resultsinDataFrame.nlargest(n = 10, columns = 'Support')

Unnamed: 0,Product 1,Product 2,Support
58,Cake,Coffee,0.006867
16,Cookies,Juice,0.006128
24,Juice,Sandwich,0.005811
34,Sandwich,Soup,0.005494
13,Coke,Sandwich,0.005177
35,Sandwich,Truffles,0.003803
70,Cookies,Coffee,0.003698
71,Cookies,Coffee,0.003698
82,Sandwich,Coffee,0.003592
27,Mineral water,Sandwich,0.003275


This algorithm is based on support: the number of transactions containing the product group divided by the total number of transactions.
According to this, the pairs to choose are cake and hot chocolate, and cookies and juice, which look realistic.
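
The cells above reuse the `apriori` function and keep only support. For comparison, Eclat is usually described in terms of a vertical layout: each item maps to the set of ids of the transactions that contain it, and the support of an itemset is obtained by intersecting those sets. Below is a minimal sketch of that idea, restricted to pairs. `eclat_pairs` and the toy transactions are hypothetical and not part of the notebook.

```python
from itertools import combinations

def eclat_pairs(transactions, min_support):
    """Return {(item_a, item_b): support} for pairs meeting min_support."""
    n = len(transactions)
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)
    # Keep only items that are frequent on their own.
    frequent = {i: t for i, t in tidsets.items() if len(t) / n >= min_support}
    pairs = {}
    for a, b in combinations(sorted(frequent), 2):
        # Support of the pair = size of the intersection of the two id sets.
        pair_support = len(frequent[a] & frequent[b]) / n
        if pair_support >= min_support:
            pairs[(a, b)] = pair_support
    return pairs

# Toy data (hypothetical, not the bakery transactions).
toy = [
    ["Coffee", "Cake"],
    ["Coffee", "Cake", "Bread"],
    ["Coffee", "Bread"],
    ["Tea"],
]
toy_pairs = eclat_pairs(toy, 0.5)
for pair, pair_support in sorted(toy_pairs.items(), key=lambda kv: -kv[1]):
    print(pair, pair_support)
```

Because only set intersections are needed, no confidence or lift is computed, which matches the support-only ranking used in this section.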