# **2-Laboratory-15-10-2020**

| Credits to the authors of the exercises: Andrea Pasini, Giuseppe Attanasio, Flavio Giobergia <br />
| Master of Science in Data Science and Engineering, Politecnico di Torino, A.A. 2020-21

## Online Retail Dataset 
The Online Retail Data Set is a dataset made available on the UCI Machine Learning repository. It contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based online retailer. The original version of the dataset is available on UCI ML as a .xlsx file. For your convenience, we are also making it available as a CSV file at the following URL. <br />
Each of the 541,909 rows contains an item that has been purchased by someone. Items can be grouped into invoices (you can think of these as receipts), where each invoice has been issued for a specific buyer, and can contain multiple items. The columns contained in the CSV file are the following:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.

### Questions
1. First, you need to load the dataset into memory, using the csv module. Make sure you identify all valid rows. Also consider that rows having an InvoiceNo that starts with C should be discarded, as they indicate that the invoice is about a cancelled purchase.

In [2]:
import csv

head = ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 
        'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
OR_dataset = []
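with open('../Datasets/online_retail.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # discard malformed rows (wrong number of fields) and cancelled invoices
        if len(row) != len(head) or row[0].startswith("C"):
            continue
        OR_dataset.append(row)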
            

In [3]:
OR_dataset[0]

['536365',
 '85123A',
 'WHITE HANGING HEART T-LIGHT HOLDER',
 '6',
 '12/1/2010 8:26',
 '2.55',
 '17850',
 'United Kingdom']
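The filtering rule from question 1 can be checked on a tiny in-memory sample before running it on the full file. The sketch below is only an illustration: the three sample rows and the helper name `load_valid_rows` are made up for this example and are not part of the lab material.

```python
import csv
import io

# Made-up sample in the same column layout as the CSV file:
# one valid row, one cancellation, one row with a missing field.
sample = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom\n"
    "C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom\n"
    "536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850\n"
)

def load_valid_rows(handle, n_columns=8):
    """Return rows that have the expected number of fields and are not cancellations."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    valid = []
    for row in reader:
        if len(row) != n_columns:
            continue  # malformed row
        if row[0].startswith("C"):
            continue  # cancelled invoice
        valid.append(row)
    return valid

rows = load_valid_rows(sample)
print(len(rows))  # 1
```

Checking the field count before looking at `row[0]` avoids an IndexError on empty lines.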

2. Now that you have a dataset of items, you should aggregate it at an “invoice” level. For each invoice (identified by InvoiceNo) there can be multiple items (from multiple rows) in the dataset. For each invoice, you should build a list of all items belonging to it. 

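Looking at the rows, the invoice numbers appear in increasing order, so the rows could be aggregated by comparing each row with the previous one. The solution below uses a dictionary keyed by InvoiceNo instead, which gives the same result and does not depend on the ordering.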

In [10]:
invoices = {}

for transaction in OR_dataset:
    # transaction[0] is the InvoiceNo, transaction[2] the Description:
    # create an empty list the first time an invoice is seen,
    # then append the item to the list of that invoice
    invoices.setdefault(transaction[0], []).append(transaction[2])

In [11]:
invoices["536365"]

['WHITE HANGING HEART T-LIGHT HOLDER',
 'WHITE METAL LANTERN',
 'CREAM CUPID HEARTS COAT HANGER',
 'KNITTED UNION FLAG HOT WATER BOTTLE',
 'RED WOOLLY HOTTIE WHITE HEART.',
 'SET 7 BABUSHKA NESTING BOXES',
 'GLASS STAR FROSTED T-LIGHT HOLDER']
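An equivalent way to group items by invoice is `collections.defaultdict`, which creates the empty list automatically on first access. This is a minimal sketch on made-up rows; the name `invoices_by_no` and the three-field row layout are assumptions chosen for the example.

```python
from collections import defaultdict

# Made-up rows: (InvoiceNo, StockCode, Description)
rows = [
    ("536365", "85123A", "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("536365", "71053", "WHITE METAL LANTERN"),
    ("536366", "22633", "HAND WARMER UNION JACK"),
]

invoices_by_no = defaultdict(list)
for invoice_no, _stock_code, description in rows:
    # a missing key is initialised to [] before the append
    invoices_by_no[invoice_no].append(description)

print(dict(invoices_by_no))
```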

3. You should now have a list (one for each invoice) of lists (each list containing the items bought for that invoice). Now, we need to convert this into a matrix form. Of the many possible formats, we will use the one expected by the Mlxtend library, which is as follows. Given an ordered list of M possible items (in this case, all possible products that can be bought), and given N itemsets (in this case, invoices), we should build a matrix of N rows and M columns, where the cell at row i and column j is 1 if invoice i contains item j and 0 otherwise. 

First, we need a collection of all the possible items. Instead of a list we use a set, in order to avoid duplicate items. After that we can build the matrix in the format expected by Mlxtend.

In [12]:
items = set()

for elements in invoices.values():
    items.update(elements)

# drop the empty description (''), if present
# (set.pop() would remove an arbitrary element instead)
items.discard('')

# sort the items so that the column order is deterministic
items = sorted(items)

Now let's try to build the matrix with a toy example

In [18]:
t = [ ['a','b','c'],['b','c'],['a','c','d'] ]
head = ['a','b','c','d']

m = [ [ int(x in row) for x in head  ] for row in t ]
m

[[1, 1, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]]
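On the full dataset the membership test `x in row` runs once per (invoice, item) pair, so it can help to turn each transaction into a set first. The following sketch shows that variant on the same toy data; the helper name `one_hot` is chosen for this example only.

```python
def one_hot(transactions, columns):
    """Build an N x M 0/1 matrix: one row per transaction, one column per item."""
    matrix = []
    for transaction in transactions:
        present = set(transaction)  # constant-time membership tests
        matrix.append([int(item in present) for item in columns])
    return matrix

toy = [['a', 'b', 'c'], ['b', 'c'], ['a', 'c', 'd']]
# derive the ordered column list from the data itself
columns = sorted({item for transaction in toy for item in transaction})
encoded = one_hot(toy, columns)
print(columns)
print(encoded)
```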

Nice, now we can proceed with the dataset. Here the rows are the invoices (the values of "invoices") and the columns are the sorted "items".

In [11]:
matrix = [ [ int(x in row) for x in items  ] for row in invoices.values() ]

The first transaction contains 7 items, so the first row of the matrix should sum to 7:

In [17]:
sum(matrix[0])

7

In [21]:
import pandas as pd

df = pd.DataFrame(data=matrix, columns=items)

In [22]:
df.head()

| | 4 PURPLE FLOCK DINNER CANDLES | 50'S CHRISTMAS GIFT BAG LARGE | DOLLY GIRL BEAKER | I LOVE LONDON MINI BACKPACK | I LOVE LONDON MINI RUCKSACK | NINE DRAWER OFFICE TIDY | OVAL WALL MIRROR DIAMANTE | RED SPOT GIFT BAG LARGE | SET 2 TEA TOWELS I LOVE LONDON | SPACEBOY BABY GIFT SET | ... | wrongly coded 20713 | wrongly coded 23343 | wrongly coded-23343 | wrongly marked | wrongly marked 23343 | wrongly marked carton 22804 | wrongly marked. 23343 in box | wrongly sold (22719) barcode | wrongly sold as sets | wrongly sold sets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


4. With the df that you defined in the previous exercise, you can now use the fpgrowth function, which is described in detail in the official documentation. The first argument required is the previously built DataFrame, df. The second is the minimum support (minsup), i.e. the minimum fraction of the entire dataset in which the itemset should show up for it to be considered “frequent”. Try using different values of minsup, such as 0.5, 0.1, 0.05, 0.02, 0.01. How many results do you obtain as minsup varies?
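To make the definition of support concrete, the sketch below computes it by hand on a made-up list of four transactions. The function name `support` and the toy data are assumptions for illustration; this is not how fpgrowth works internally.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)

toy = [['a', 'b', 'c'], ['b', 'c'], ['a', 'c', 'd'], ['a', 'd']]
print(support({'c'}, toy))       # 0.75
print(support({'a', 'c'}, toy))  # 0.5
```

With minsup = 0.6, the itemset {'c'} would count as frequent on this toy data, while {'a', 'c'} would not.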

In [45]:
from mlxtend.frequent_patterns import fpgrowth

fi_5 = fpgrowth(df, 0.5)
print(len(fi_5))
print(fi_5.to_string())

0
Empty DataFrame
Columns: [support, itemsets]
Index: []


In [46]:
fi_05 = fpgrowth(df, 0.05)
print(len(fi_05))
print(fi_05.to_string())

9
    support itemsets
0  0.088880   (3918)
1  0.056641    (244)
2  0.062046   (2054)
3  0.051506   (2395)
4  0.082432   (1866)
5  0.050000   (2046)
6  0.083745   (2915)
7  0.065869   (2471)
8  0.056293   (3195)


In [47]:
fi_02 = fpgrowth(df, 0.02)
print(len(fi_02))
print(fi_02.to_string())

215
      support            itemsets
0    0.088880              (3918)
1    0.056641               (244)
2    0.030386              (1746)
3    0.024363              (2034)
4    0.023243              (1082)
5    0.047104              (1833)
6    0.048263              (2753)
7    0.041737               (165)
8    0.038649               (161)
9    0.035058              (3516)
10   0.033243              (2907)
11   0.030849               (164)
12   0.028919              (3026)
13   0.027992              (2064)
14   0.023591              (3795)
15   0.045174              (2439)
16   0.043166              (3984)
17   0.038108              (3980)
18   0.022510              (3966)
19   0.029112              (2847)
20   0.025019              (1771)
21   0.062046              (2054)
22   0.051506              (2395)
23   0.047529              (1864)
24   0.046371              (1879)
25   0.036564              (1854)
26   0.033861              (1865)
27   0.033784              (2393)
28   0.032

In [50]:
fi_01 = fpgrowth(df, 0.01)
print(len(fi_01))

1038
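The itemsets printed above are column indices rather than product names, because `use_colnames` defaults to `False` in mlxtend's `fpgrowth` and `apriori`. Passing `use_colnames=True` should return the names directly; alternatively, the indices can be mapped back to names by hand. The sketch below shows that manual mapping on made-up stand-ins (`columns` plays the role of the sorted `items` list, and `index_itemsets` the role of the `itemsets` column).

```python
# Made-up stand-ins for the sorted column list and the extracted itemsets
columns = ['BREAD', 'BUTTER', 'JAM', 'MILK']
index_itemsets = [frozenset({0}), frozenset({0, 1}), frozenset({1, 2, 3})]

def decode(itemset, columns):
    """Map a set of column indices to a tuple of column names, in column order."""
    return tuple(columns[i] for i in sorted(itemset))

named = [decode(itemset, columns) for itemset in index_itemsets]
print(named)
```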


6. Consider the itemsets extracted for minsup = 0.02. How many items are contained? Which ones would you consider the most useful?

The result contains 215 itemsets. I think the most useful ones are the itemsets with more than one item, because they reveal associations between products that are bought together. 

7. Extract the association rules from the frequent itemsets extracted with minsup = 0.01. You can find the description of association_rules() in the official documentation. You can use the confidence as the metric to identify the rules, and a minimum threshold of 0.85 (feel free to vary these values and observe how the results vary).

In [55]:
from mlxtend.frequent_patterns import association_rules

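# keyword arguments keep the call independent of the positional order of the parameters
association_rules(fi_01, metric='confidence', min_threshold=0.85)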

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (3557, 725) | (2871) | 0.014749 | 0.040541 | 0.012664 | 0.858639 | 21.179756 | 0.012066 | 6.787287 |
| 1 | (3995, 3557, 726) | (2871) | 0.011737 | 0.040541 | 0.010077 | 0.858553 | 21.177632 | 0.009601 | 6.783155 |
| 2 | (1864, 1877, 1879) | (1866) | 0.011969 | 0.082432 | 0.010386 | 0.867742 | 10.526705 | 0.009399 | 6.937706 |
| 3 | (3348, 3302) | (3349) | 0.011429 | 0.020347 | 0.010232 | 0.89527 | 43.999051 | 0.009999 | 9.354101 |
| 4 | (2665, 2915) | (1607) | 0.015521 | 0.040811 | 0.013359 | 0.860697 | 21.089915 | 0.012726 | 6.885608 |
| 5 | (2665, 2915, 3013) | (1607) | 0.013012 | 0.040811 | 0.011699 | 0.89911 | 22.031167 | 0.011168 | 9.507258 |
| 6 | (2665, 2915, 1607) | (3013) | 0.013359 | 0.043243 | 0.011699 | 0.875723 | 20.251084 | 0.011121 | 7.698554 |
| 7 | (2665, 3013) | (1607) | 0.023707 | 0.040811 | 0.021197 | 0.894137 | 21.909313 | 0.020229 | 9.060649 |
| 8 | (2665, 1607) | (3013) | 0.024865 | 0.043243 | 0.021197 | 0.852484 | 19.713703 | 0.020122 | 6.485804 |
| 9 | (2921) | (2920) | 0.012124 | 0.014903 | 0.010888 | 0.898089 | 60.260387 | 0.010707 | 9.66626 |


Each rule has the form "antecedents" -> "consequents". The antecedent support is the support of the antecedent itemset on its own, i.e. the fraction of invoices that contain it; the consequent support is defined in the same way for the consequent. The "support" column is the support of the union of the two itemsets (Antecedent U Consequent), and the confidence is the ratio between this support and the antecedent support.
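The remaining columns can be derived from the three supports using the standard definitions of confidence, lift, leverage and conviction. The sketch below recomputes them for the last rule of the table, (2921) -> (2920); the helper name `rule_metrics` is made up for this example, and because the inputs are the rounded values shown in the table, the results only agree with it up to rounding.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive confidence, lift, leverage and conviction from the three supports."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # conviction is unbounded when the rule never fails (confidence == 1)
    if confidence == 1:
        conviction = float('inf')
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return confidence, lift, leverage, conviction

# Rounded values copied from the rule (2921) -> (2920)
confidence, lift, leverage, conviction = rule_metrics(0.012124, 0.014903, 0.010888)
print(round(confidence, 3), round(lift, 1), round(leverage, 6), round(conviction, 2))
```

A lift far above 1, as here, means the two products appear together much more often than they would if they were bought independently.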

8. (*) Rerun the experiments from point 4 with apriori(). Do the results match with the ones found by FP-Growth? Is Apriori faster or slower than FP-Growth?

In [58]:
import timeit
from mlxtend.frequent_patterns import apriori

# number=1 means that it executes the function only once
print("Apriori: ",timeit.timeit(lambda: apriori(df, 0.02), number=1))
print("FP-Growth: ",timeit.timeit(lambda: fpgrowth(df, 0.02), number=1))

Apriori:  6.299921000000268
FP-Growth:  4.1681743000003735


In this single run with minsup = 0.02, FP-Growth (about 4.2 s) was roughly 1.5 times faster than Apriori (about 6.3 s).
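A single timed call can be affected by whatever else the machine is doing, so a slightly more robust comparison repeats the measurement and keeps the best time. This is a minimal sketch with `timeit.repeat`; the helper name `time_best` is made up, and the workload is a stand-in for `lambda: apriori(df, 0.02)` or `lambda: fpgrowth(df, 0.02)`.

```python
import timeit

def time_best(func, repeat=5):
    """Return the best wall-clock time (in seconds) over `repeat` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeat))

# Stand-in workload, used here only so that the example runs on its own
best = time_best(lambda: sum(range(100_000)))
print(f"best of 5: {best:.6f} s")
```

Taking the minimum rather than the mean follows the advice in the timeit documentation: slower runs mostly reflect interference from other processes.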

## COCO Dataset
COCO Dataset is a large-scale object detection, segmentation, and captioning dataset. It offers a large number of images (from various contexts) with annotations (i.e. structured information on the contents of the image). These annotations describe, in particular, the objects contained within the image. This dataset is a JSON file. You can open it using the already introduced json module. The file contains a list of images and, for each image, the annotation key contains all the annotations available. <br />

In this exercise, you will implement your own version of Apriori and use it on the COCO dataset to extract frequent itemsets (i.e. groups of annotations that often co-occur within the same image, such as 'car' and 'traffic light'). Note that, while this entire exercise is optional, we recommend you try to solve it anyway. You may argue that there are libraries already implementing these and other algorithms. Despite that, we believe it is important, for a data scientist, to know the underlying theory as well as some implementation details. You can learn the former from textbooks, but you will only be faced with the latter when actually working on the implementation.
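Before running Apriori, each image has to become one transaction, i.e. the set of distinct annotations it contains. The sketch below shows the idea on a made-up miniature of the file. The key names `image_id` and `annotations`, and the fact that annotations are plain strings, are assumptions for illustration; the real file may be laid out differently.

```python
import json

# Made-up miniature of the JSON file (key names are assumptions)
raw = '''
[
  {"image_id": 1, "annotations": ["car", "traffic light", "person"]},
  {"image_id": 2, "annotations": ["car", "traffic light"]},
  {"image_id": 3, "annotations": ["dog", "person", "person"]}
]
'''

images = json.loads(raw)

# one transaction per image: distinct annotations, sorted for reproducibility
transactions = [sorted(set(image["annotations"])) for image in images]
print(transactions)
```

Removing duplicates matters because an itemset either occurs in an image or it does not; how many times an object appears is irrelevant for the support count.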

In [287]:
trans = [ ['a','b'],
          ['b','c','d'],
          ['a','c','d','e'],
          ['a','d','e'],
          ['a','b','c'],
          ['a','b','c','d'],
          ['b','c'],
          ['a','b','c'],
          ['a','b','d'],
          ['b','c','e'] ]

In [288]:
def count_freq(array,C,k,minsup):
    """
        if k == 1, count all the items
        otherwise we have to check if the candidates are 
        subset of any transaction
    """
    dic = {}
    if k == 1:
        # scan the transactions
        for row in array:
            # scan the transactions' elements
            for x in row:
                # define or update the item in the dictionary
                dic[x] = dic.get(x, 0) + 1
    else:
        # scan the candidates
        for subset in C:
            # ['a','b'] will become 'ab' (this key only works for one-character items)
            key = ''.join(subset)
            # scan the transactions (the argument, not the global variable trans)
            for row in array:
                # check if the candidate is a subset of the transaction
                if set(subset).issubset(row):
                    dic[key] = dic.get(key, 0) + 1

    # keep only the itemsets whose count is strictly greater than minsup
    dic = {key: count for key, count in dic.items() if count > minsup}

    # order by itemset
    return dict(sorted(dic.items()))

def gen_candidates(L,k):
    C = []
    keys = list(L.keys())
    for i,x in enumerate(keys):
        for y in keys[i+1:]:
            # join two frequent (k-1)-itemsets that share their first k-2 items
            if x[:k-2] == y[:k-2]:
                C.append(list(x + y[k-2:]))

    return C

In [289]:
# k = 0
# minsup = 0.1

def aprioriVBad(array):
    output = []
    
    L = count_freq(array,[],1,1)
    output.append(L)

    k = 2
    while len(L) != 0:
        
        C = gen_candidates(L,k)
        L = count_freq(array,C,k,1)
        output.append(L)
        k += 1
        
    output.remove({})
    
    return output

aprioriVBad(trans)

[{'a': 7, 'b': 8, 'c': 7, 'd': 5, 'e': 3},
 {'ab': 5,
  'ac': 4,
  'ad': 4,
  'ae': 2,
  'bc': 6,
  'bd': 3,
  'cd': 3,
  'ce': 2,
  'de': 2},
 {'abc': 3, 'abd': 2, 'acd': 2, 'ade': 2, 'bcd': 2}]
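The string keys used above ('ab', 'abc') only work while every item is a single character, which will not hold for COCO annotations such as 'traffic light'. The sketch below is one possible variant that represents itemsets as frozensets and adds the classic prune step (every (k-1)-subset of a candidate must itself be frequent). The name `apriori_sketch` and the `min_count` parameter are assumptions for this example; `min_count=2` corresponds to the "count > 1" rule used above.

```python
from itertools import combinations

def apriori_sketch(transactions, min_count):
    """Return {frozenset: count} for itemsets found in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: one scan of the transactions per candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

trans = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'], ['a', 'd', 'e'],
         ['a', 'b', 'c'], ['a', 'b', 'c', 'd'], ['b', 'c'], ['a', 'b', 'c'],
         ['a', 'b', 'd'], ['b', 'c', 'e']]
result = apriori_sketch(trans, min_count=2)
print(len(result))  # 19
```

On the toy transactions this yields the same 5 + 9 + 5 itemsets as the output above, only keyed by frozenset instead of by concatenated string.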

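Now I want to compare it with the apriori function from mlxtend. Note that the two calls do not use the same threshold: apriori(df, 0.01) keeps every itemset that appears in at least one of the 10 transactions, while aprioriVBad keeps only the itemsets that appear more than once, so the timings are only indicative.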

In [268]:
import timeit
import pandas as pd
from mlxtend.frequent_patterns import apriori

head = ['a','b','c','d','e']

m = [ [ int(x in row) for x in head  ] for row in trans ]
df = pd.DataFrame(data=m, columns=head)

print("Default\t:\t",timeit.timeit(lambda: apriori(df, 0.01), number=1))
print("Mine\t:\t",timeit.timeit(lambda: aprioriVBad(trans), number=1))

Default	:	 0.006940999999642372
Mine	:	 0.00023750000036670826
