# Association Analysis / Pattern Mining


Association analysis is a data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. 

In retail, association analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.

In this activity we will write Python functions to compute the support and confidence of item sets in a sample grocery store transaction dataset. We also implement the Apriori algorithm, which starts from the frequent individual items in the dataset and builds association rules from them.

In [1]:
import sys, getopt

Read the grocery store CSV file, which lists the sample transactions.

In [15]:
def read_data(file_name):
    result = list()
    with open(file_name, 'r') as file_reader:
        for line in file_reader:
            order_set = set(line.strip().split(','))
            result.append(order_set)
    return result


input_file = "data/grocery_store.csv"
data = read_data(input_file)

orders = data[1:]
print("Number of item sets : " + str(len(orders)))
orders


Number of item sets : 20


[{'Bread', 'Eggs', 'Oranges', 'Yogurt'},
 {'Bananas', 'Beef', 'Bread', 'Chicken'},
 {'Bread', 'Milk', 'Spinach', 'Yogurt'},
 {'Bread', 'Eggs'},
 {'Bananas', 'Bread', 'Milk', 'Salad', 'Spinach'},
 {'Bread', 'Salad'},
 {'Bread', 'Milk', 'Salad'},
 {'Chicken', 'Milk', 'Oranges'},
 {'Beef', 'Yogurt'},
 {'Beef', 'Milk'},
 {'Bananas', 'Bread', 'Eggs'},
 {'Bread', 'Salad'},
 {'Oranges'},
 {'Eggs', 'Milk', 'Yogurt'},
 {'Chicken'},
 {'Bread'},
 {'Yogurt'},
 {'Bread', 'Milk', 'Spinach'},
 {'Chicken', 'Eggs'},
 {'Bananas', 'Beef', 'Bread', 'Eggs', 'Spinach'}]

## Support Count 

The support count of an itemset is the number of transactions that contain every item in the itemset.

In [4]:
def support_count(orders, item_set):
    count = 0
    for order in orders:
        if item_set.issubset(order):
            count += 1
    return count

### Exercise 1:
Using the `support_count` function above, compute the support count of {Bread, Milk}.


In [6]:
support_count(orders, {'Bread', 'Milk'})

4

## Support 

The support of an itemset is the fraction of transactions in the data set that contain every item in the itemset, that is, its support count divided by the total number of transactions. The higher the support, the more frequently the itemset occurs. Rules with high support are preferred because they are likely to apply to a large number of future transactions.

In [8]:
def support(orders, item_set):
    N = len(orders)
    return support_count(orders, item_set)/float(N)

### Exercise 2:
Using the `support` function above, compute the support of {Bread, Milk}.

In [9]:
support(orders, {'Bread', 'Milk'})

0.2
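To see how support varies across the dataset, the sketch below computes the support of every individual item in one pass. It is only an illustration: `item_supports` and `sample_orders` are names introduced here, and the transactions are retyped from the output printed earlier rather than read from the CSV file.

```python
from collections import Counter

# The 20 transactions printed earlier in this notebook, one per line.
RAW_ORDERS = """\
Bread,Eggs,Oranges,Yogurt
Bananas,Beef,Bread,Chicken
Bread,Milk,Spinach,Yogurt
Bread,Eggs
Bananas,Bread,Milk,Salad,Spinach
Bread,Salad
Bread,Milk,Salad
Chicken,Milk,Oranges
Beef,Yogurt
Beef,Milk
Bananas,Bread,Eggs
Bread,Salad
Oranges
Eggs,Milk,Yogurt
Chicken
Bread
Yogurt
Bread,Milk,Spinach
Chicken,Eggs
Bananas,Beef,Bread,Eggs,Spinach"""

sample_orders = [set(line.split(',')) for line in RAW_ORDERS.splitlines()]


def item_supports(orders):
    """Return {item: support} for every individual item (hypothetical helper)."""
    # Each order is a set, so an item is counted at most once per order.
    counts = Counter(item for order in orders for item in order)
    total = float(len(orders))
    return {item: count / total for item, count in counts.items()}


supports = item_supports(sample_orders)

# Print items from most to least frequent, breaking ties alphabetically.
for item, value in sorted(supports.items(), key=lambda kv: (-kv[1], kv[0])):
    print("{:<8} {:.2f}".format(item, value))
```

On these transactions Bread has the highest support (12 of 20 orders, 0.6) and Oranges the lowest (3 of 20, 0.15).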

## Confidence 

Confidence is the conditional probability that a transaction containing the items on the left-hand side of a rule also contains the items on the right-hand side. It is estimated as the support count of both sides taken together divided by the support count of the left-hand side. The higher the confidence, the more likely it is that the right-hand-side items are purchased along with the left-hand-side items.

In [10]:
def confidence(orders, left, right):
    left_count = support_count(orders, left)
    if left_count == 0:
        # The left-hand side never occurs, so avoid dividing by zero.
        return 0.0
    both = right.union(left)
    both_count = support_count(orders, both)
    return both_count/(float(left_count))

### Exercise 3:
Using the `confidence` function above, compute the confidence of {Bread, Milk} -> {Spinach}.

In [11]:
confidence(orders, {'Bread','Milk'}, {'Spinach'})

0.75
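The result can be checked by hand. Writing $\sigma(X)$ for the support count of an itemset $X$ (a notation assumed here for brevity, not used elsewhere in this notebook), confidence is a ratio of two support counts:

$$
\mathrm{conf}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
$$

Four orders contain both Bread and Milk, and three of those also contain Spinach:

$$
\mathrm{conf}(\{\text{Bread}, \text{Milk}\} \rightarrow \{\text{Spinach}\}) = \frac{3}{4} = 0.75
$$

Confidence is not symmetric. All four orders with Salad also contain Bread, but only four of the twelve orders with Bread contain Salad:

$$
\mathrm{conf}(\{\text{Salad}\} \rightarrow \{\text{Bread}\}) = \frac{4}{4} = 1,
\qquad
\mathrm{conf}(\{\text{Bread}\} \rightarrow \{\text{Salad}\}) = \frac{4}{12} \approx 0.33
$$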

## Apriori 

Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules that highlight general trends in the database. This has applications in domains such as <a href="https://en.wikipedia.org/wiki/Affinity_analysis">market basket analysis</a>.
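The level-wise idea described above (start from frequent single items, then grow them one item at a time) can be sketched as follows. This is a minimal illustration of frequent item set mining only, not the notebook's rule-finding implementation: `frequent_itemsets`, `keep_frequent`, and `sample_orders` are names introduced here, and the transactions are retyped from the output printed earlier.

```python
from itertools import combinations

# The 20 transactions printed earlier in this notebook, one per line.
RAW_ORDERS = """\
Bread,Eggs,Oranges,Yogurt
Bananas,Beef,Bread,Chicken
Bread,Milk,Spinach,Yogurt
Bread,Eggs
Bananas,Bread,Milk,Salad,Spinach
Bread,Salad
Bread,Milk,Salad
Chicken,Milk,Oranges
Beef,Yogurt
Beef,Milk
Bananas,Bread,Eggs
Bread,Salad
Oranges
Eggs,Milk,Yogurt
Chicken
Bread
Yogurt
Bread,Milk,Spinach
Chicken,Eggs
Bananas,Beef,Bread,Eggs,Spinach"""

sample_orders = [set(line.split(',')) for line in RAW_ORDERS.splitlines()]


def frequent_itemsets(orders, minsup):
    """Return {frozenset: support} for every itemset with support >= minsup."""
    total = float(len(orders))

    def keep_frequent(candidates):
        # Measure each candidate's support and keep those meeting the threshold.
        kept = {}
        for candidate in candidates:
            value = sum(1 for order in orders if candidate <= order) / total
            if value >= minsup:
                kept[candidate] = value
        return kept

    # Level 1: frequent individual items.
    items = set().union(*orders)
    level = keep_frequent(frozenset([item]) for item in items)

    frequent = {}
    size = 1
    while level:
        frequent.update(level)
        size += 1
        # Join step: merge pairs of frequent itemsets into candidates of the next size.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: a candidate can only be frequent if all its subsets
        # one item smaller are frequent too.
        pruned = [c for c in joined
                  if all(frozenset(sub) in level
                         for sub in combinations(c, size - 1))]
        level = keep_frequent(pruned)
    return frequent


result = frequent_itemsets(sample_orders, 0.2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("{} {:.2f}".format(sorted(itemset), result[itemset]))
```

With `minsup` 0.2 on these transactions, the sketch should find nine frequent single items (every item except Oranges) and five frequent pairs, each of which contains Bread. No set of three items reaches the threshold, so the search stops there.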

In [12]:
def apriori(orders, minsup, minconf):
    """Accepts a list of item sets (i.e. orders) and returns a list of
    association rules matching support and confidence thresholds. """
    candidate_items = set()

    for items in orders:
        candidate_items = candidate_items.union(items)

    print("Candidate items are {}".format(candidate_items))

    def apriori_next(item_set=frozenset()):
        """Accepts a single item set and returns list of all association rules
        containing item_set that match support and confidence thresholds.
        """
        result = []

        if len(item_set) == len(candidate_items):
            # Recursion base case.
            return result

        elif not item_set:
            # Seed the search with every single item meeting the support threshold.
            for item in candidate_items:
                single = {item}
                if support(orders, single) >= minsup:
                    result.extend(apriori_next(single))

        else:
            # Try extending item_set with each remaining item; keep the rule
            # item_set -> item only if it meets both thresholds.
            for item in candidate_items.difference(item_set):
                extended = item_set.union({item})
                if (confidence(orders, item_set, {item}) >= minconf
                        and support(orders, extended) >= minsup):
                    result.append((item_set, item))
                    result.extend(apriori_next(extended))

        return result

    return apriori_next()

### Exercise 4:
Using the `apriori` function above, find all rules with `minsup` 0.2 and `minconf` 0.75.

In [13]:
apriori(orders, 0.2, 0.75)

Candidate items are set(['Bananas', 'Beef', 'Spinach', 'Eggs', 'Salad', 'Oranges', 'Yogurt', 'Chicken', 'Milk', 'Bread'])


[({'Bananas'}, 'Bread'), ({'Spinach'}, 'Bread'), ({'Salad'}, 'Bread')]