# Market Basket

*   Market basket assignment: Select a dataset of interest to you and perform a market basket analysis, including finding frequent itemsets and mining association rules. This assignment is a little more subjective than previous assignments. Before starting, discuss your dataset with me. You can use the code from the text or any off-the-shelf method that performs the Apriori algorithm.
*   Ernesto: If you are using the apriori algorithm code from the textbook, note that the code in the textbook is written for Python 2.7 and there are some minor differences. There is a GitHub repo with code for Python 3. I did not test-drive the code in that repository, but hopefully it works out of the box. https://github.com/pbharrin/machinelearninginaction3x



### Import and Install

In [None]:
from google.colab import files
uploaded = files.upload()

Saving groceries.csv to groceries.csv


In [None]:
! pip install mlxtend
! pip install xlrd
! pip install apyori
! pip install py-votesmart

Collecting apyori
  Downloading https://files.pythonhosted.org/packages/5e/62/5ffde5c473ea4b033490617ec5caa80d59804875ad3c3c57c0976533a21a/apyori-1.1.2.tar.gz
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-cp36-none-any.whl size=5975 sha256=7698730412fb24af08d35eee83600ea217ba23993af9fb400a42975155743541
  Stored in directory: /root/.cache/pip/wheels/5d/92/bb/474bbadbc8c0062b9eb168f69982a0443263f8ab1711a8cad0
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2
Collecting py-votesmart
  Downloading https://files.pythonhosted.org/packages/05/0a/cb61da0be92ba1e2970645ea1ac7730771b4416645c373bd76f66dbaf2a5/py_votesmart-0.4.4-py2.py3-none-any.whl
Collecting simplejson>=1.8
  Downloading https://files.pythonhosted.org/packages/73/96/1e6b19045375890068d7342cbe280dd64ae73fd90b9735b5efb8d1e044a1/simplejson-3.17.2-cp36-cp36m-manylinux

## Actual apriori code

In [None]:
import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt
import sys
from numpy.linalg import inv
from sklearn.model_selection import train_test_split
from apyori import apriori
import csv
from itertools import combinations

# Each line of groceries.csv is one transaction with a variable number of
# comma-separated items, so read it row by row rather than as a rectangular
# table. (Newer pandas versions do not accept sep='\n', and padding short rows
# would add 'None' strings to the transactions as if they were items.)
with open('groceries.csv', newline='') as f:
    transactions = [[item for item in row if item] for row in csv.reader(f)]

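# Mining association rules with Apriori
# (min_length is not among the keyword arguments apyori documents, so it likely has no effect here)
rule_list = apriori(transactions, min_support = 0.003, min_confidence = 0.3, min_lift = 3, min_length = 2)

# Execution: the rules are generated lazily, so materialize them before printing
results = list(rule_list)
for record in results:
    print(record)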

RelationRecord(items=frozenset({'hamburger meat', 'Instant food products'}), support=0.003050330452465684, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Instant food products'}), items_add=frozenset({'hamburger meat'}), confidence=0.379746835443038, lift=11.42143769597027)])
RelationRecord(items=frozenset({'beef', 'root vegetables'}), support=0.017386883579054397, ordered_statistics=[OrderedStatistic(items_base=frozenset({'beef'}), items_add=frozenset({'root vegetables'}), confidence=0.3313953488372093, lift=3.0403668431100312)])
RelationRecord(items=frozenset({'bottled beer', 'liquor'}), support=0.004677173360447382, ordered_statistics=[OrderedStatistic(items_base=frozenset({'liquor'}), items_add=frozenset({'bottled beer'}), confidence=0.4220183486238532, lift=5.260520226508993)])
RelationRecord(items=frozenset({'root vegetables', 'herbs'}), support=0.007015760040671073, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herbs'}), items_add=frozenset({'root vege
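
The nested records printed above are hard to scan. A minimal sketch of flattening them into one row per rule follows. The two namedtuples are stand-ins that mirror the field names visible in the output, so the snippet runs without `apyori` installed; `flatten_rules` is a hypothetical helper, not part of the library. The sample values are copied from the output above.

```python
from collections import namedtuple

# Stand-ins that mirror the field names visible in the printed output above.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])


def flatten_rules(records):
    """Return one dict per rule, sorted by lift in descending order."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return sorted(rows, key=lambda row: row["lift"], reverse=True)


# Two records copied from the output above.
sample = [
    RelationRecord(
        frozenset({"beef", "root vegetables"}), 0.017386883579054397,
        [OrderedStatistic(frozenset({"beef"}), frozenset({"root vegetables"}),
                          0.3313953488372093, 3.0403668431100312)]),
    RelationRecord(
        frozenset({"bottled beer", "liquor"}), 0.004677173360447382,
        [OrderedStatistic(frozenset({"liquor"}), frozenset({"bottled beer"}),
                          0.4220183486238532, 5.260520226508993)]),
]

rows = flatten_rules(sample)
for row in rows:
    print(f"{row['antecedent']} => {row['consequent']}: "
          f"support={row['support']:.4f}, confidence={row['confidence']:.3f}, lift={row['lift']:.2f}")
```

With the real library, `results` from the cell above could presumably be passed to `flatten_rules` directly, since its records expose the same attribute names.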

# Helpful definitions

* Support of an item x is the ratio of the number of transactions in which x appears to the total number of transactions, i.e. the empirical probability P(x).

* Confidence (x => y) = supp(x and y together) / supp(x) is the likelihood of item y being purchased when item x is purchased, i.e. an estimate of P(y | x). It does not account for how popular y is on its own.

* Lift (x => y) = conf(x => y) / supp(y) measures the 'interestingness' of a rule: how much more likely y is to be purchased when x is purchased, compared with the overall popularity of y.

### Lift
* Lift (x => y) = 1 means that x and y appear together exactly as often as expected if they were independent.
* Lift (x => y) > 1 means that there is a positive association within the itemset, i.e., the products x and y are bought together more often than expected under independence.
* Lift (x => y) < 1 means that there is a negative association within the itemset, i.e., the products x and y are bought together less often than expected under independence.

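* Conviction (x => y) = (1 - supp(y)) / (1 - conf(x => y)), range: [0, inf). A value of 1 means x and y are independent; the ratio is unbounded (division by zero) when conf(x => y) = 1.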
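
To make these definitions concrete, here is a small sketch that computes all four measures for one rule. It uses the same four-transaction toy data that the textbook's `loadDataSet` function returns later in this notebook. The helper names `support` and `rule_metrics` are hypothetical, and the code assumes the antecedent occurs in at least one transaction.

```python
import math

# Toy transactions (same values as the textbook's loadDataSet example).
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(x, y, transactions):
    """Support, confidence, lift and conviction for the rule x => y."""
    supp_xy = support(set(x) | set(y), transactions)
    supp_x = support(x, transactions)  # assumed to be non-zero
    supp_y = support(y, transactions)
    confidence = supp_xy / supp_x
    lift = confidence / supp_y
    # Conviction is unbounded when the rule always holds (confidence == 1).
    conviction = math.inf if confidence == 1 else (1 - supp_y) / (1 - confidence)
    return {"support": supp_xy, "confidence": confidence,
            "lift": lift, "conviction": conviction}


m = rule_metrics({3}, {2}, transactions)
print(m)
```

For the rule 3 => 2, items 2 and 3 each appear in three of the four transactions but together in only two, so the lift comes out below 1: a negative association in this toy data.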

## Apriori from scratch

In [None]:
# CODE FROM TEXTBOOK CHAPTER 11
import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt
import sys
from numpy.linalg import inv

def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

# Build C1, the list of candidate 1-itemsets (one per distinct item, no repeats)
def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # use frozenset so each itemset can be used as a key in a dict
    return list(map(frozenset, C1))

def scanD(D, Ck, minSupport): # D: list of transactions (as sets), Ck: candidate itemsets
    ssCnt = {} # candidate itemset -> number of transactions that contain it
    for tid in D:
        for can in Ck:
            if can.issubset(tid): # every item of the candidate is in the transaction
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D)) # total number of transactions
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key]/numItems # support = fraction of transactions containing the itemset
        if support >= minSupport:
            retList.insert(0,key)
        supportData[key] = support
    return retList, supportData

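# Generate Ck, the candidate k-itemsets, by joining frequent (k-1)-itemsets
def aprioriGen(Lk, k): # Lk: frequent (k-1)-itemsets, k: size of the itemsets to create
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            # sort before slicing: set iteration order is arbitrary, so the
            # prefix must be taken from the sorted items to be comparable
            L1 = sorted(Lk[i])[:k-2]; L2 = sorted(Lk[j])[:k-2]
            if L1==L2: # if first k-2 elements are equal
                retList.append(Lk[i] | Lk[j]) # set union
    return retList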

def apriori(dataSet, minSupport = 0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

# Association Rule-Generation Function
def generateRules(L, supportData, minConf=0.7):  #supportData is a dict coming from scanD
    bigRuleList = []
    for i in range(1, len(L)):#only get the sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList         

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = [] #create new list to return
    for conseq in H:
        conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
        if conf >= minConf: 
            print(freqSet-conseq,'-->',conseq,'conf:',conf)
            brl.append((freqSet-conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

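def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    # Note: for itemsets of 3+ items, generateRules passes 1-item consequents
    # here and they are merged before being scored, so rules with a
    # single-item consequent are never evaluated for those itemsets.
    m = len(H[0])
    if (len(freqSet) > (m + 1)): #try further merging
        Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):    #need at least two sets to merge
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)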
            
def pntRules(ruleList, itemMeaning):
    for ruleTup in ruleList:
        for item in ruleTup[0]:
            print(itemMeaning[item])
        print("           -------->")
        for item in ruleTup[1]:
            print(itemMeaning[item])
        print("confidence: %f" % ruleTup[2])
        print()       #print a blank line
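
The join step in `aprioriGen` is the least obvious part of the code above, so here is a standalone sketch of it. `join_step` is a hypothetical name, and it takes the prefix from the sorted items, which is an assumption on my part about the intended behavior. Merging only itemsets that share their first k-2 sorted items yields each candidate once instead of once per pair.

```python
def join_step(frequent, k):
    """Join (k-1)-itemsets that share their first k-2 sorted items into k-itemsets."""
    candidates = []
    for i in range(len(frequent)):
        for j in range(i + 1, len(frequent)):
            prefix_i = sorted(frequent[i])[:k - 2]
            prefix_j = sorted(frequent[j])[:k - 2]
            if prefix_i == prefix_j:
                candidates.append(frequent[i] | frequent[j])  # set union
    return candidates


# Three frequent 2-itemsets; only {0, 1} and {0, 2} share the prefix [0].
L2 = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
C3 = join_step(L2, 3)
print(C3)  # [frozenset({0, 1, 2})]
```

Taking the union of every pair would produce {0, 1, 2} three times here; the prefix test produces it once.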

Pseudocode for scanD:

    For each transaction tran in the dataset:
        For each candidate itemset can:
            Check to see if can is a subset of tran
            If so, increment the count of can
    For each candidate itemset:
        If the support meets the minimum, keep this item
    Return list of frequent itemsets
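
The pseudocode above can be sketched compactly with `collections.Counter`. This is an illustrative variant rather than the textbook's function: the name `frequent_itemsets` and its return order are assumptions made for this example.

```python
from collections import Counter


def frequent_itemsets(transactions, candidates, min_support):
    """Count candidate occurrences and keep those meeting min_support."""
    counts = Counter()
    for tran in transactions:
        for can in candidates:
            if can <= tran:  # can is a subset of tran
                counts[can] += 1
    n = len(transactions)
    support = {can: count / n for can, count in counts.items()}
    frequent = [can for can, s in support.items() if s >= min_support]
    return frequent, support


# Same toy data as loadDataSet; the candidates are all single items.
transactions = [frozenset(t) for t in [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]]
candidates = [frozenset([item]) for item in range(1, 6)]

frequent, support = frequent_itemsets(transactions, candidates, min_support=0.5)
print(sorted(sorted(itemset) for itemset in frequent))  # [[1], [2], [3], [5]]
```

Item 4 appears in only one of the four transactions (support 0.25), so it is the only single item dropped at a minimum support of 0.5.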

In [None]:
import csv

# Read each line of groceries.csv as one transaction (a list of item names).
# Rows have different lengths, so reading them directly avoids padding values
# that would otherwise have to be filtered out again.
with open('groceries.csv', newline='') as f:
    mb = [[item for item in row if item] for row in csv.reader(f)] # our market basket

min_support = 0.003 # keep itemsets that occur in at least 0.3% of the baskets in mb
min_conf = 0.5
# mb_C1 = createC1(mb)
# retList, supportData = scanD(mb, mb_C1, min_support)

freq_itemset, supportData = apriori(mb, min_support)
k = 2 # size of item sets
#retList = aprioriGen(L[0], k)
rules = generateRules(freq_itemset, supportData, min_conf)
#ap_retList = aprioriGen(retList, len(mb))


frozenset({'rice'}) --> frozenset({'other vegetables'}) conf: 0.5270270270270271
frozenset({'specialty cheese'}) --> frozenset({'other vegetables'}) conf: 0.5
frozenset({'baking powder'}) --> frozenset({'whole milk'}) conf: 0.5202312138728323
frozenset({'cereals'}) --> frozenset({'whole milk'}) conf: 0.6428571428571428
frozenset({'rice'}) --> frozenset({'whole milk'}) conf: 0.6081081081081081
frozenset({'tropical fruit', 'root vegetables', 'citrus fruit'}) --> frozenset({'other vegetables', 'whole milk'}) conf: 0.5535714285714286


In [None]:
# 1st entry of market basket data
print(mb[0])

['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups']


In [None]:
# print(retList)

In [None]:
# support
print(supportData)

{frozenset({'citrus fruit'}): 0.08276563294356888, frozenset({'margarine'}): 0.05826131164209456, frozenset({'ready soups'}): 0.0018301982714794102, frozenset({'semi-finished bread'}): 0.017691916624300967, frozenset({'coffee'}): 0.057854600915099134, frozenset({'tropical fruit'}): 0.10493136756481952, frozenset({'yogurt'}): 0.13950177935943062, frozenset({'whole milk'}): 0.25551601423487547, frozenset({'cream cheese '}): 0.03965429588205389, frozenset({'meat spreads'}): 0.004270462633451958, frozenset({'pip fruit'}): 0.07564819522114896, frozenset({'condensed milk'}): 0.010269445856634469, frozenset({'long life bakery product'}): 0.03701067615658363, frozenset({'other vegetables'}): 0.1934926283680732, frozenset({'abrasive cleaner'}): 0.0034570411794611084, frozenset({'butter'}): 0.05541433655312659, frozenset({'rice'}): 0.0075241484494153535, frozenset({'rolls/buns'}): 0.18383324860193187, frozenset({'UHT-milk'}): 0.03345195729537367, frozenset({'bottled beer'}): 0.08022369089984749,

In [None]:
# frequent item set
print(freq_itemset)

[[frozenset({'nuts/prunes'}), frozenset({'jam'}), frozenset({'ketchup'}), frozenset({'light bulbs'}), frozenset({'kitchen towels'}), frozenset({'nut snack'}), frozenset({'liver loaf'}), frozenset({'syrup'}), frozenset({'mayonnaise'}), frozenset({'roll products '}), frozenset({'instant coffee'}), frozenset({'vinegar'}), frozenset({'sauces'}), frozenset({'rum'}), frozenset({'soups'}), frozenset({'pet care'}), frozenset({'liquor'}), frozenset({'skin care'}), frozenset({'house keeping products'}), frozenset({'mustard'}), frozenset({'tea'}), frozenset({'meat'}), frozenset({'dish cleaner'}), frozenset({'female sanitary products'}), frozenset({'cleaner'}), frozenset({'frozen fish'}), frozenset({'dog food'}), frozenset({'finished products'}), frozenset({'specialty cheese'}), frozenset({'cake bar'}), frozenset({'popcorn'}), frozenset({'dental care'}), frozenset({'soft cheese'}), frozenset({'Instant food products'}), frozenset({'canned fruit'}), frozenset({'male cosmetics'}), frozenset({'cling f

In [None]:
# rules
# (frozenset({'cereals'}), frozenset({'whole milk'}), 0.6428571428571428) has the largest confidence level, which makes sense
print(rules)

[(frozenset({'rice'}), frozenset({'other vegetables'}), 0.5270270270270271), (frozenset({'specialty cheese'}), frozenset({'other vegetables'}), 0.5), (frozenset({'baking powder'}), frozenset({'whole milk'}), 0.5202312138728323), (frozenset({'cereals'}), frozenset({'whole milk'}), 0.6428571428571428), (frozenset({'rice'}), frozenset({'whole milk'}), 0.6081081081081081), (frozenset({'tropical fruit', 'root vegetables', 'citrus fruit'}), frozenset({'other vegetables', 'whole milk'}), 0.5535714285714286)]


In [None]:
print("The association rule with the highest confidence is 'a purchase of cereals implies a purchase of whole milk', at 64%, and this does seem to make sense as one item tends to be used with the other.")
print("In regard to the other association rules, other vegetables tends to be included in a lot of them, as it has a very large support in this dataset; however, the confidence is a bit lower than for the previously described rule, as the items are not as")
print("strongly associated. We chose a minimum confidence of 50% because it cuts off a lot of the rules that may just involve items that are generally bought a lot, but not necessarily items that would be specifically bought together.")

The association rule with the highest confidence is 'a purchase of cereals implies a purchase of whole milk', at 64%, and this does seem to make sense as one item tends to be used with the other.
In regard to the other association rules, other vegetables tends to be included in a lot of them, as it has a very large support in this dataset; however, the confidence is a bit lower than for the previously described rule, as the items are not as
strongly associated. We chose a minimum confidence of 50% because it cuts off a lot of the rules that may just involve items that are generally bought a lot, but not necessarily items that would be specifically bought together.
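
Because whole milk and other vegetables have such large support, confidence alone can overstate a rule. One way to check this is to divide each rule's confidence by the support of its consequent, which gives the lift defined earlier. The sketch below does that with numbers copied from the `supportData` and `rules` outputs above. `add_lift` is a hypothetical helper, and the sixth rule is left out because the support of its two-item consequent is not visible in the truncated output.

```python
# Support values copied from the supportData output above.
support = {
    frozenset({"whole milk"}): 0.25551601423487547,
    frozenset({"other vegetables"}): 0.1934926283680732,
}

# (antecedent, consequent, confidence) tuples copied from the rules output above.
rules = [
    (frozenset({"rice"}), frozenset({"other vegetables"}), 0.5270270270270271),
    (frozenset({"specialty cheese"}), frozenset({"other vegetables"}), 0.5),
    (frozenset({"baking powder"}), frozenset({"whole milk"}), 0.5202312138728323),
    (frozenset({"cereals"}), frozenset({"whole milk"}), 0.6428571428571428),
    (frozenset({"rice"}), frozenset({"whole milk"}), 0.6081081081081081),
]


def add_lift(rules, support):
    """Append lift = confidence / supp(consequent) to each rule tuple."""
    return [(a, c, conf, conf / support[c]) for a, c, conf in rules]


ranked = sorted(add_lift(rules, support), key=lambda rule: rule[3], reverse=True)
for antecedent, consequent, conf, lift in ranked:
    print(f"{', '.join(sorted(antecedent))} => {', '.join(sorted(consequent))}: "
          f"conf={conf:.3f}, lift={lift:.3f}")
```

On these numbers the ranking by lift differs from the ranking by confidence: rice => other vegetables comes out on top with a lift of about 2.7, while cereals => whole milk has a lift of about 2.5.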
