# Mine relevant rules using a model

1. Download & merge the collected point-of-sale data, which you created in class.
https://github.com/doublebyte1/bts_market_basket/
2. Harmonize dataset, if necessary. 
3. Mine association rules using one of these algorithms: Apriori, Eclat, FP-growth. 
4. Produce a brief report: 
    * Explain every step of the way, in order to reproduce the results. 
    * If necessary, add code or link to repository. 
    * Present the results.
 
You may use any tool or combination of tools you like.

For this exercise I chose Python, partly to learn how to apply these algorithms in that language (we used R and Weka in class).

I originally attempted the task with the Apyori library and the Apriori algorithm, but it took so long to run that I switched to FP-growth, using the Python library pyfpgrowth.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  
from apyori import apriori  # didn't end up using this library
import pyfpgrowth

Note: I cleaned the data directly in Excel before importing it here. This mostly involved simplifying item names to ensure consistency, and translating some Catalan items into English.

In [3]:
data = pd.read_csv('Merged class transactions file - Sheet1.csv')
data = data.fillna('')  # convert NaN padding in shorter transactions to empty strings
data.head()

Unnamed: 0,item 1,item 2,item 3,item 4,item 5,item 6,item 7,item 8,item 9,item 10,...,item 13,item 14,item 15,item 16,item 17,item 18,item 19,item 20,item 21,item 22
0,wine,ham,nachos,tea,croquetas,patè,,,,,...,,,,,,,,,,
1,water,milk,dough,egg,cheese,dumplings,spinach,croquetas,ham,cucumbers,...,,,,,,,,,,
2,water,detergent,salsa,gnocchi,,,,,,,...,,,,,,,,,,
3,orange,plum,salmon,cashew,pistachios,cheese,tomato,shampoo,conditioner,yogurt,...,,,,,,,,,,
4,coke,coconut juice,cereal,potato,yoghurt,cheese,milk,pizza,sausage,turkey,...,,,,,,,,,,


In [4]:
data.shape

(59, 22)

We can see that there are 59 transactions in this dataset, with the largest transaction containing 22 items.

### Data Preprocessing

Apyori and the mlxtend TransactionEncoder used below both expect the dataset as a list of lists: the outer list is the whole dataset and each inner list is one transaction. Our data is currently a pandas DataFrame, so we convert it:

In [5]:
records = []
n_rows, n_cols = data.shape  # 59 transactions, up to 22 items each
for i in range(n_rows):
    records.append([str(data.values[i, j]) for j in range(n_cols)])

In [6]:
len(records)

59

In [7]:
from mlxtend.preprocessing import TransactionEncoder

The code below learns the unique labels (items) in the dataset and transforms the list of lists into a one-hot encoded boolean array.

The labels that the transaction encoder found are saved in te.columns_.
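To make the encoding concrete, here is a plain-Python sketch of the same idea on a toy input. It is not mlxtend's implementation; `one_hot` and the toy transactions are names and data made up for illustration. It also shows why the empty strings used as padding end up as a column of their own.

```python
def one_hot(transactions):
    """Return (sorted unique labels, one boolean row per transaction)."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[label in set(t) for label in columns] for t in transactions]
    return columns, rows

# Toy data: the first transaction is padded with an empty string.
toy = [["wine", "ham", ""], ["water", "ham", "milk"]]
columns, rows = one_hot(toy)
print(columns)  # ['', 'ham', 'milk', 'water', 'wine']
print(rows[0])  # [True, True, False, False, True]
```

The first label is the empty string, which is why it has to be dropped before mining.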

In [8]:
te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
te.columns_

['',
 'almond',
 'apple',
 'avocado',
 'bacon',
 'banana',
 'bath gel',
 'beef',
 'beer',
 'black chocolate',
 'bolsa plastico',
 'bread',
 'breakfast',
 'broccoli',
 'calcots',
 'caldo',
 'caneloni',
 'capers',
 'carrot',
 'cashew',
 'cashew nuts',
 'cereal',
 'cheese',
 'chicken',
 'chicken breast',
 'chicken burger',
 'chiken breasts',
 'chips',
 'chocolate',
 'coca-cola',
 'cocktail al horno',
 'coconut juice',
 'coconut milk',
 'cogollo',
 'coke',
 'conditioner',
 'cookie',
 'cookies',
 'cream',
 'croquetas',
 'cucumber',
 'cucumbers',
 'cups',
 'deodorant',
 'detergent',
 'dough',
 'dumplings',
 'egg',
 'empanada',
 'fish',
 'flan',
 'formatge havar',
 'fruit',
 'fruit cocktail',
 'fruit juice',
 'fruit mix',
 'gnocchi',
 'grape',
 'grapefruit',
 'grapes',
 'ham',
 'hazelnut chocolate',
 'herb mix',
 'honey',
 'hummus',
 'juice',
 'ketchup',
 'kiwi',
 'lasagne',
 'lays artesanas',
 'lays offerta',
 'lemon',
 'lime',
 'limonate',
 'mango',
 'meat',
 'milk',
 'milka chocolate',
 'm
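
The label list above still contains near-duplicates that survived the manual cleaning, for example 'yoghurt' and 'yogurt', 'cucumber' and 'cucumbers', or 'chicken breast' and 'chiken breasts'. Such variants split the support of what is really one item. A minimal sketch of how they could be harmonized in Python follows; the mapping is hypothetical (which spelling to keep, and whether 'coca-cola' and 'coke' mean the same product, are assumptions), and `harmonize` / `harmonize_transaction` are illustrative helpers, not part of the notebook.

```python
# Hypothetical synonym map: the variants come from the label list above,
# but the choice of canonical spelling is an assumption of this sketch.
CANONICAL = {
    "yoghurt": "yogurt",
    "cucumbers": "cucumber",
    "cookies": "cookie",
    "grapes": "grape",
    "chiken breasts": "chicken breast",
    "cashew nuts": "cashew",
    "coca-cola": "coke",
}

def harmonize(item):
    """Lower-case, collapse whitespace, and map known variants to one label."""
    cleaned = " ".join(str(item).lower().split())
    return CANONICAL.get(cleaned, cleaned)

def harmonize_transaction(items):
    """Harmonize every item, drop empty cells, and remove duplicates."""
    seen = []
    for item in items:
        label = harmonize(item)
        if label and label not in seen:
            seen.append(label)
    return seen

example = ["Yoghurt", " cucumbers", "yogurt", "", "Coca-Cola"]
print(harmonize_transaction(example))  # ['yogurt', 'cucumber', 'coke']
```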

I then converted the boolean array into a DataFrame and dropped the empty-string column, which only represents the padding (missing values) in shorter transactions.

In [9]:
df = pd.DataFrame(te_ary, columns=te.columns_)
df = df.drop(columns=[''])  # remove the empty-string "item" left over from padding
df.head()

Unnamed: 0,almond,apple,avocado,bacon,banana,bath gel,beef,beer,black chocolate,bolsa plastico,...,turkey breast,varios,vegetables,water,water with gas,wheat,wheat triangles,wine,yoghurt,yogurt
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,False,False,False,True,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


Now we're ready to apply the FP-growth algorithm. The pyfpgrowth library expects a list of lists (transactions) as input, so here we convert the DataFrame to that format. (Note: I converted the boolean array to a DataFrame and then back into a list because it was the easiest way to drop the missing values.)

In [10]:
lists = [df.columns[row.astype(bool)].tolist() for row in df.values]
lists

[['croquetas', 'ham', 'nachos', 'patè', 'tea', 'wine'],
 ['bacon',
  'cheese',
  'croquetas',
  'cucumbers',
  'dough',
  'dumplings',
  'egg',
  'ham',
  'milk',
  'spinach',
  'water'],
 ['detergent', 'gnocchi', 'salsa', 'water'],
 ['cashew',
  'cheese',
  'conditioner',
  'orange',
  'pistachios',
  'plum',
  'salad',
  'salmon',
  'shampoo',
  'tomato',
  'yogurt'],
 ['banana',
  'cereal',
  'cheese',
  'coconut juice',
  'coke',
  'milk',
  'pizza',
  'potato',
  'sausage',
  'turkey',
  'yoghurt'],
 ['avocado',
  'cereal',
  'cheese',
  'coconut juice',
  'ham',
  'ketchup',
  'lemon',
  'milk',
  'onion',
  'potato',
  'quinoa'],
 ['carrot',
  'cereal',
  'chips',
  'cucumber',
  'milk',
  'pepper',
  'pork',
  'potato',
  'shampoo',
  'tangerine',
  'wine'],
 ['sauce'],
 ['beer', 'mountain dew'],
 ['fruit mix', 'mango'],
 ['avocado',
  'beef',
  'beer',
  'cereal',
  'chicken',
  'chocolate',
  'egg',
  'fruit',
  'ham',
  'honey',
  'kiwi',
  'limonate',
  'mushroom',
  'olive
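
As an alternative to the round trip through the boolean DataFrame, the padding could also be filtered out of the list of lists directly. This is a sketch on toy records (made up for illustration, with one duplicated item to show the de-duplication); sorting the set gives the same alphabetical, duplicate-free form as the output above.

```python
# Toy stand-in for `records`: rows padded with empty strings.
records = [
    ["wine", "ham", "nachos", "", ""],
    ["water", "detergent", "water", "", ""],
]

# Keep non-empty items only, de-duplicate, and sort alphabetically.
transactions = [sorted({item for item in row if item != ""}) for row in records]
print(transactions)  # [['ham', 'nachos', 'wine'], ['detergent', 'water']]
```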

Below, patterns is a dictionary that maps each frequent itemset in our transactions to its support count, using a minimum support count of 3 (an itemset must appear in at least 3 of the 59 transactions).

In [13]:
patterns = pyfpgrowth.find_frequent_patterns(lists, 3) # minimum support count of 3

In [14]:
patterns

{('water',): 3,
 ('orange',): 3,
 ('shampoo',): 3,
 ('tomato',): 3,
 ('coke',): 3,
 ('lemon',): 3,
 ('cheese', 'lemon'): 3,
 ('pepper',): 3,
 ('mushroom',): 3,
 ('paella',): 3,
 ('pasta',): 3,
 ('vegetables',): 3,
 ('olives',): 3,
 ('pineapple',): 3,
 ('salad',): 4,
 ('salad', 'yogurt'): 3,
 ('salmon',): 4,
 ('banana',): 4,
 ('cereal',): 4,
 ('cereal', 'milk'): 3,
 ('cereal', 'potato'): 4,
 ('cereal', 'milk', 'potato'): 3,
 ('avocado',): 4,
 ('pork',): 4,
 ('sauce',): 4,
 ('wine',): 5,
 ('cheese', 'milk'): 3,
 ('milk', 'potato'): 3,
 ('onion',): 5,
 ('bread', 'onion'): 3,
 ('beer',): 5,
 ('fruit',): 5,
 ('snacks',): 5,
 ('yogurt',): 6,
 ('cheese', 'yogurt'): 3,
 ('chicken',): 6,
 ('bread', 'chicken'): 3,
 ('chocolate',): 7,
 ('other',): 7,
 ('egg', 'ham'): 3,
 ('cheese', 'egg'): 4,
 ('chips',): 8,
 ('chips', 'potato'): 3,
 ('ham',): 9,
 ('cheese', 'ham'): 3,
 ('potato',): 9,
 ('cheese', 'potato'): 3,
 ('cheese',): 11,
 ('bread',): 11}
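
To illustrate what a support count means, the sketch below counts itemsets by brute force on a toy dataset. This is not FP-growth, which reaches the same counts far more efficiently via a prefix tree; `frequent_itemsets`, `max_size` and the toy transactions are assumptions made for the example. On the real data, a count of 3 corresponds to a relative support of 3/59, roughly 5.1%.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=3):
    """Count every itemset up to max_size and keep those seen min_count times."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # sorted so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

toy = [
    ["cereal", "milk", "potato"],
    ["cereal", "milk", "potato", "cheese"],
    ["cereal", "potato"],
    ["milk", "cheese"],
]
frequent = frequent_itemsets(toy, min_count=3)
for itemset, n in frequent.items():
    print(itemset, n, f"relative support = {n / len(toy):.2f}")
```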

Next, we generate the association rules from these patterns, with a minimum confidence threshold of 0.1.

I then saved the rules into a DataFrame for readability.

In [15]:
rules = pyfpgrowth.generate_association_rules(patterns, 0.1)  # minimum confidence of 0.1

In [16]:
rules

{('cheese',): (('potato',), 0.2727272727272727),
 ('lemon',): (('cheese',), 1.0),
 ('salad',): (('yogurt',), 0.75),
 ('yogurt',): (('cheese',), 0.5),
 ('cereal',): (('milk', 'potato'), 0.75),
 ('potato',): (('cheese',), 0.3333333333333333),
 ('cereal', 'milk'): (('potato',), 1.0),
 ('cereal', 'potato'): (('milk',), 0.75),
 ('milk', 'potato'): (('cereal',), 1.0),
 ('bread',): (('chicken',), 0.2727272727272727),
 ('onion',): (('bread',), 0.6),
 ('chicken',): (('bread',), 0.5),
 ('ham',): (('cheese',), 0.3333333333333333),
 ('chips',): (('potato',), 0.375)}
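
The confidence of a rule A -> B is the support count of A and B together divided by the support count of A, e.g. lemon -> cheese is 3 / 3 = 1.0. The `rules` output above holds a single consequent per antecedent, as one would expect from a dictionary keyed by antecedent; for instance cheese -> egg (4 / 11, about 0.36) does not appear even though it clears the 0.1 threshold. The sketch below recomputes confidences from a subset of the support counts shown earlier and keeps every rule in a list. `all_rules` and `patterns_subset` are illustrative names, and this is not a description of pyfpgrowth's internals.

```python
from itertools import combinations

# Support counts copied from the `patterns` output above (subset only).
patterns_subset = {
    ("cheese",): 11,
    ("lemon",): 3,
    ("potato",): 9,
    ("cheese", "lemon"): 3,
    ("cheese", "egg"): 4,
    ("cheese", "potato"): 3,
}

def all_rules(patterns, min_confidence):
    """Return every (antecedent, consequent, confidence) above the threshold."""
    rules = []
    for itemset, count in patterns.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                if antecedent not in patterns:
                    continue  # no support count available for this antecedent
                consequent = tuple(i for i in itemset if i not in antecedent)
                confidence = count / patterns[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules

found = all_rules(patterns_subset, 0.1)
for lhs, rhs, conf in found:
    print(lhs, "->", rhs, round(conf, 3))
```

With this subset, cheese appears as the antecedent of three separate rules (lemon, egg and potato) instead of one.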

In [17]:
rules_df = pd.DataFrame.from_dict(rules, orient='index')
rules_df = rules_df.reset_index(level=0)  # move the antecedent tuples into a column
rules_df.columns = ['LHS', 'RHS', 'Confidence']
rules_df

Unnamed: 0,LHS,RHS,Confidence
0,"(cheese,)","(potato,)",0.272727
1,"(lemon,)","(cheese,)",1.0
2,"(salad,)","(yogurt,)",0.75
3,"(yogurt,)","(cheese,)",0.5
4,"(cereal,)","(milk, potato)",0.75
5,"(potato,)","(cheese,)",0.333333
6,"(cereal, milk)","(potato,)",1.0
7,"(cereal, potato)","(milk,)",0.75
8,"(milk, potato)","(cereal,)",1.0
9,"(bread,)","(chicken,)",0.272727


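Above we can see the rules for our dataset: the antecedent (LHS), the consequent (RHS) and the confidence value.

The itemsets that best predict the purchase of other items, i.e. the rules with the highest confidence, are:

**lemon -> cheese (confidence 1)**

**cereal + milk -> potato (confidence 1)**

**milk + potato -> cereal (confidence 1)**

Each of these rules rests on only 3 of the 59 transactions, so the perfect confidence should be read with caution.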
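
Confidence alone does not account for how common the consequent is. Lift, a standard companion measure not computed in the notebook, divides the confidence by the consequent's relative support; values well above 1 suggest the items co-occur more often than chance would predict. The sketch below applies it to the three rules above, using the support counts from the `patterns` output and the 59 transactions reported by `data.shape`. The `lift` helper is an illustrative name.

```python
N_TRANSACTIONS = 59  # number of transactions, from data.shape above

# (antecedent, consequent, confidence, support count of the consequent)
top_rules = [
    ("lemon", "cheese", 1.0, 11),
    ("cereal + milk", "potato", 1.0, 9),
    ("milk + potato", "cereal", 1.0, 4),
]

def lift(confidence, consequent_count, n_transactions):
    """Lift = confidence / relative support of the consequent."""
    return confidence / (consequent_count / n_transactions)

for lhs, rhs, conf, rhs_count in top_rules:
    print(f"{lhs} -> {rhs}: lift = {lift(conf, rhs_count, N_TRANSACTIONS):.2f}")
```

With so few supporting transactions, these lift values are as fragile as the confidences they are derived from.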
