## Market Basket Analysis - ADA - Allison Walker

For this exercise, we will use the file: transactions.csv

This file contains a tabular market basket dataset in CSV format.
Each row contains one transaction, i.e., a basket of items.
Items are separated by ",".
Each row has the same number of columns; shorter baskets are padded with empty placeholders.
Each item is counted at most once per row, i.e., each item = 1 unit.
The data has been pre-processed so that naming is consistent, i.e., each item is always spelled the same way.
The goal of this exercise is to calculate the support of bread and the support of rice cakes in this dataset.

S(bread)=

S(rice cakes)=

 

Please produce a small report presenting the results and explaining every step, so that the results can be reproduced (e.g., tools, approach).

 

N.B: you can use any tool, or combination of tools you like!

In [188]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sn

%matplotlib inline

In [189]:
# First we import the CSV into a DataFrame (16 columns, no header row)
# and replace the NaN placeholders with empty strings

data = pd.read_csv('transactions.csv', names=range(1, 17), index_col=False)
data = data.fillna('')
data.head(6)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,dishwasher soap,toilet papper,,,,,,,,,,,,,,
1,batteries,,,,,,,,,,,,,,,
2,onion,rice cakes,,,,,,,,,,,,,,
3,eggs,,,,,,,,,,,,,,,
4,bread,,,,,,,,,,,,,,,
5,orange,carrot,beef,potato,rice,soup pasta,basilico,zuchini,lemon,beetrot,rice milk,pumpkin,tomato,eggs,parsnip,leek


In [190]:
# Now we convert the dataframe to a list of lists so that it can be processed using TransactionEncoder

new = data.values.tolist()
len(new)

34
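As an alternative to loading the file into a DataFrame first, the list of baskets can be built directly with the standard library. The sketch below is only an illustration: the `sample` text is a hypothetical three-row stand-in for transactions.csv, and `read_baskets` is a helper name introduced here. It drops the empty placeholders while reading, so no empty-string label would appear later.

```python
import csv
import io

# Hypothetical sample in the same layout as transactions.csv:
# fixed column count, with empty placeholders padding short baskets.
sample = io.StringIO(
    "onion,rice cakes,,\n"
    "eggs,,,\n"
    "bread,rice cakes,eggs,\n"
)

def read_baskets(handle):
    """Return one list of item names per row, skipping empty placeholders."""
    return [
        [item.strip() for item in row if item.strip()]
        for row in csv.reader(handle)
    ]

baskets = read_baskets(sample)
print(baskets)
```

With the real file, `read_baskets(open('transactions.csv', newline=''))` would be the assumed call.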

In [191]:
from mlxtend.preprocessing import TransactionEncoder

In [192]:
# the below code learns the unique labels in the dataset(items) and transforms the list of lists into
# a One-Hot Encoded boolean array

te = TransactionEncoder()
te_ary = te.fit(new).transform(new)

# the labels that the transaction encoder found are:
te.columns_

['',
 'apple',
 'basilico',
 'batteries',
 'beans',
 'beef',
 'beetrot',
 'bread',
 'bread sticks',
 'broccoli',
 'buckwheat cakes',
 'cabbage',
 'carob spread',
 'carrot',
 'chicken',
 'chickpeas',
 'chocolate bread',
 'cinamon sticks',
 'clams',
 'coffee',
 'compote',
 'cookies',
 'dehodorant',
 'dishwasher cleaning soap',
 'dishwasher soap',
 'dishwasher tablets',
 'eggs',
 'fish',
 'ham',
 'hummus',
 'kitchen paper',
 'leek',
 'lemon',
 'lentills',
 'miso',
 'mushrooms',
 'olive oil',
 'onion',
 'orange',
 'parsnip',
 'pastry pasta',
 'pear',
 'plastic bag',
 'potato',
 'pumpkin',
 'rice',
 'rice cakes',
 'rice milk',
 'rubbish bags',
 'salmon',
 'sardine',
 'soup pasta',
 'spinaches',
 'tea',
 'toilet cleaning soap',
 'toilet papper',
 'tomato',
 'tonic water',
 'tortilla',
 'tuna',
 'unknown',
 'water',
 'zuchini']
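To make the encoding step concrete, here is a minimal plain-Python sketch of what a one-hot boolean encoding looks like on a hypothetical toy dataset. It is not mlxtend's implementation, only an illustration of the idea: collect the sorted unique labels, then mark for each basket which labels it contains.

```python
# Hypothetical toy baskets (not taken from transactions.csv).
baskets = [
    ["onion", "rice cakes"],
    ["eggs"],
    ["bread", "rice cakes", "eggs"],
]

# Step 1: learn the unique labels, sorted alphabetically.
items = sorted({item for basket in baskets for item in basket})

# Step 2: one row per basket, one boolean per label.
rows = [[item in basket for item in items] for basket in baskets]

print(items)
for row in rows:
    print(row)
```

Each row has exactly one boolean per label, so the column sums give the number of baskets containing each item.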

In [195]:
# Next we convert the array back to a DataFrame and drop the first column
# (the empty-string label, which comes from the placeholders that replaced NaN)

df = pd.DataFrame(te_ary, columns=te.columns_)
df = df.drop('', axis=1)
df.columns

Index(['apple', 'basilico', 'batteries', 'beans', 'beef', 'beetrot', 'bread',
       'bread sticks', 'broccoli', 'buckwheat cakes', 'cabbage',
       'carob spread', 'carrot', 'chicken', 'chickpeas', 'chocolate bread',
       'cinamon sticks', 'clams', 'coffee', 'compote', 'cookies', 'dehodorant',
       'dishwasher cleaning soap', 'dishwasher soap', 'dishwasher tablets',
       'eggs', 'fish', 'ham', 'hummus', 'kitchen paper', 'leek', 'lemon',
       'lentills', 'miso', 'mushrooms', 'olive oil', 'onion', 'orange',
       'parsnip', 'pastry pasta', 'pear', 'plastic bag', 'potato', 'pumpkin',
       'rice', 'rice cakes', 'rice milk', 'rubbish bags', 'salmon', 'sardine',
       'soup pasta', 'spinaches', 'tea', 'toilet cleaning soap',
       'toilet papper', 'tomato', 'tonic water', 'tortilla', 'tuna', 'unknown',
       'water', 'zuchini'],
      dtype='object')

In [194]:
# To validate, we can see that we have 34 transactions and 62 items

df.shape

(34, 62)

### Support calculation

The support calculation is as follows:

S(itemset) = number of transactions containing the itemset / total number of transactions

In order to calculate the support of bread and rice cakes, we therefore need:
* the count of transactions with bread
* the count of transactions with rice cakes
* the total number of transactions
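The definition above can be sketched as a small function, independent of pandas and mlxtend. The `support` helper and the toy `baskets` below are hypothetical, introduced only to illustrate the formula; they are not the notebook's data.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for basket in baskets if wanted <= set(basket))
    return hits / len(baskets)

# Hypothetical toy data: 5 transactions.
baskets = [
    ["bread", "eggs"],
    ["rice cakes"],
    ["bread", "rice cakes"],
    ["eggs"],
    ["onion"],
]

print(support(baskets, ["bread"]))                # 2 of 5 baskets
print(support(baskets, ["rice cakes"]))           # 2 of 5 baskets
print(support(baskets, ["bread", "rice cakes"]))  # 1 of 5 baskets
```

The same function handles single items and larger itemsets, because the subset test `wanted <= set(basket)` requires all requested items to be present.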


In [178]:
# count of transactions with bread

len(df[df['bread'] == True])

4

In [179]:
# count of transactions with rice cakes

len(df[df['rice cakes'] == True])

7

In [1]:
# the total number of transactions

len(df)

34

In [187]:
# Support of Bread:

round(len(df[df['bread'] == True]) / len(df), 3)

0.118

In [186]:
# Support of Rice Cakes:

round(len(df[df['rice cakes'] == True]) / len(df), 3)

0.206

Support indicates how frequently an item appears in the data: it is the fraction of transactions that contain the item.

S(bread) = 4 / 34 ≈ 0.118, which means bread appears in 11.8% of transactions.

S(rice cakes) = 7 / 34 ≈ 0.206, which means rice cakes appear in 20.6% of transactions.
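If the support of every item is of interest rather than just two, one possible approach is to count each item once per basket and divide by the number of baskets. The sketch below uses hypothetical toy baskets, not the contents of transactions.csv.

```python
from collections import Counter

# Hypothetical toy data: 4 transactions.
baskets = [
    ["bread", "eggs"],
    ["rice cakes"],
    ["bread", "rice cakes", "onion"],
    ["eggs"],
]

# set(basket) ensures an item is counted at most once per transaction.
counts = Counter(item for basket in baskets for item in set(basket))
supports = {item: count / len(baskets) for item, count in counts.items()}

# Print items from most to least frequent, ties broken alphabetically.
for item, value in sorted(supports.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{item}: {value:.1%}")
```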
