# Hand In 3 - Frequent Itemsets, Random Walks and Sequence Segmentation
Due: May 15th 2020, 23:59

This is a mandatory handin to be done in groups of 2-3 students, who shall submit:
1. A report in **PDF format**. The report should contain your experimental results; and
2. Python code in a zip-file. 

The PDF should **not** be in the zip-file and your report should **not** be a part of the Python code. 
In other words, your report should be self-contained and your code should be there to document
that you actually did what you claim :-).

Submission should be done in Blackboard by **May 15th, 23:59**.

## Problem 1 - Frequent Itemsets
We have learned the Apriori and FP-Growth algorithms for mining frequent itemsets.

1. Develop an implementation of both.
2. Run an experiment and show to what extent FP-Growth has an advantage.
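
One way to set up the comparison in step 2 is a small timing harness that runs a miner over a range of support thresholds. The sketch below is only an illustration: `time_miner` and `brute_force_miner` are hypothetical names that are not part of the handout, and the brute-force miner is a stand-in so that the block runs on its own. You would pass in your own Apriori and FP-Growth functions instead.

```python
import itertools
import time


def brute_force_miner(T, support):
    """Stand-in miner (hypothetical): enumerate every itemset, keep the frequent ones.

    `support` is a proportion of transactions, e.g. 0.5 for 50%.
    Exponential in the number of items, so only usable on toy data.
    """
    items = sorted({i for t in T for i in t})
    transactions = [set(t) for t in T]
    frequent = []
    for size in range(1, len(items) + 1):
        for itemset in itertools.combinations(items, size):
            count = sum(1 for t in transactions if t.issuperset(itemset))
            if count / len(transactions) >= support:
                frequent.append(list(itemset))
    return frequent


def time_miner(miner, T, supports, repeats=3):
    """Return {support: best wall-clock time in seconds over `repeats` runs}."""
    timings = {}
    for s in supports:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            miner(T, s)
            best = min(best, time.perf_counter() - start)
        timings[s] = best
    return timings


toy = [[1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3]]
timings = time_miner(brute_force_miner, toy, supports=[0.2, 0.4, 0.6])
for s, seconds in timings.items():
    print("support=%.1f  time=%.6fs" % (s, seconds))
```

Taking the best of several repeats reduces noise from other processes. Plotting the timings of both miners against the support threshold is one way to show where the gap between them opens up.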

Obtain the anonymized real-world `retail market basket` data from: http://fimi.ua.ac.be/data/.
This data comes from an anonymous Belgian retail store, and was donated by Tom Brijs from Limburgs Universitair Centrum, Belgium. The original data contains 16,470 different items and 88,162 transactions. You may only work with the top-50 items in terms of occurrence frequency.

_Hint:_ We have used this dataset before.

In [1]:
import numpy as np
import networkx as nx
%matplotlib inline
import matplotlib.pyplot as plt
import random
import itertools

# Local imports
import sys
sys.path.append('../utilities')
from load_data import load_market_basket, load_dblp_citations

In [4]:
# Load the retail data
transactions = load_market_basket()

def filter_transactions(T, k=50):
    """
        Keep only the top k items in the transactions.
        Remove transactions that become empty.
    """
    # Count occurrences of each item (item ids are non-negative integers)
    n_items = max((i for t in T for i in t), default=-1) + 1
    counts = [0] * n_items
    for t in T:
        for i in t:
            counts[i] += 1

    # Sort and select top k
    counts = np.array(counts)
    order  = np.argsort(counts)[::-1] # reverse the sorted order

    indexes_to_keep = order[:k]       # Keep the top k items
    index_set = set(indexes_to_keep)  # Convert to python set for efficiency

    # Filter transactions, dropping those that become empty
    T_new = []
    for t in T:
        kept = [i for i in t if i in index_set]
        if kept:
            T_new.append(kept)
    return T_new

T = filter_transactions(transactions, k=50)

In [23]:
# Tiny function for generating rule strings from two itemsets
# Ex: rule([1, 2], [5]) outputs "[1, 2] => [5]"
rule  = lambda lhs, rhs: "%s => %s" % (str(lhs), str(rhs))


def listify(L):
    return [[l] for l in L]

def get_occurences(I, T):
    occurences = 0
    for t in T:
        contains_all = True
        for i in I:
            if not i in t:
                contains_all = False
                break
        if contains_all:
            occurences += 1
            
    return occurences
            
def get_all_items(T):
    items = []
    for t in T:
        for i in t:
            if not i in items:
                items.append(i)
    return items

def is_frequent(I, T, support):
    return get_occurences(I, T) / len(T) >= support


def join_pair(L1, L2):
    result = []
    
    for l1 in L1:
        result.append(l1)
    for l2 in L2:
        if not l2 in result:
            result.append(l2)
            
    return result

def contains(candidate, C_k):
    return get_occurences(candidate, C_k) > 0

def flatten(L):
    
    flat_list = []
    for sublist in L:
        for item in sublist:
            flat_list.append(item)
            
    return flat_list
   

def apriori_algorithm(T, support=0.05, min_confidence=0.7):
    """
        Apriori algorithm for mining association rules.
        Inputs:
            T:               A list of lists; each inner list contains integer item ids.
                             Example: T = [[1, 2, 5], [2, 3, 4], [1, 6]]
            support:         The minimum proportion of transactions that must contain an itemset to keep it.
            min_confidence:  Minimum confidence for the algorithm to output the rule.
        
        Outputs:
            rules:           List of tuples [(rule:str, confidence:float), ... ]
                             Example: [("(1, 2) => (5)", 0.84), ("(3, 4) => (7)", 0.75)]
    """
    
    ### TODO Your code here
    items = get_all_items(T)
    k = 1
    C = [[]]
    L =  [[]]
    C.append(listify(items))
    
    while len(C[k]) > 0:
        print("k", k)
        
        # frequent itemset generation
        L.append([])
        for I in C[k]:
            if is_frequent(I, T, support):
                L[k].append(I)
        
        # candidate generation: join pairs of frequent k-itemsets into (k+1)-itemsets.
        # Support is checked at the top of the next iteration, so it is not tested here.
        C.append([])
        for l1 in L[k]:
            for l2 in L[k]:
                candidate = join_pair(l1, l2)
                if len(candidate) == k+1:
                    if not contains(candidate, C[k+1]):
                        C[k+1].append(candidate)
        
        k += 1
    
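    # find rules X => Y where X, Y and their union are all frequent
    rules = []
    itemsets = flatten(L)
    for X in itemsets:
        for Y in itemsets:
            candidate = join_pair(X, Y)
            # X and Y must be disjoint, and their union must itself be frequent
            if len(candidate) == len(X) + len(Y) and contains(candidate, itemsets):
                confidence = get_occurences(candidate, T) / get_occurences(X, T)
                if confidence >= min_confidence:
                    rules.append((rule(X, Y), confidence))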
    
    
    ### TODO Your code here
    return rules
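
To make the quantities in the docstring concrete, here is a minimal worked example of support and confidence on toy transactions. The helper names `support_of` and `confidence_of` are made up for this illustration and are not part of the handout. Support is taken as a proportion of transactions, as in `is_frequent` above.

```python
def support_of(itemset, T):
    """Proportion of transactions in T that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in T if itemset.issubset(t)) / len(T)


def confidence_of(lhs, rhs, T):
    """Confidence of the rule lhs => rhs: support(lhs and rhs together) / support(lhs)."""
    return support_of(set(lhs) | set(rhs), T) / support_of(lhs, T)


toy = [[1, 2, 5], [2, 3, 4], [1, 2], [1, 2, 3]]

print(support_of([1, 2], toy))        # 3 of the 4 transactions contain both 1 and 2
print(confidence_of([1], [2], toy))   # every transaction with 1 also contains 2
print(confidence_of([2], [1], toy))   # 3 of the 4 transactions with 2 also contain 1
```

Note that confidence is not symmetric: `[1] => [2]` holds in every transaction containing 1, while `[2] => [1]` does not.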

In [10]:
rules = apriori_algorithm(T, support=0.05, min_confidence=0.7)
rules = sorted(rules, key=lambda x: x[1], reverse=True)

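print("%-8s \t %s" % ("Conf.", "Rule"))
for rule_str, confidence in rules:
    # confidence is a fraction in [0, 1]; scale it before printing as a percentage
    print("%7.2f%% \t %s" % (100 * confidence, rule_str))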


In [24]:
def get_frequent_item_count_dict(T, support):
    """
        Build the header table for the FP-tree.
        `support` is an absolute count: an item is frequent if it
        occurs at least `support` times.
    """
    all_item_count = {}

    # Count every occurrence of every item (including the first one)
    for t in T:
        for item in t:
            if item not in all_item_count:
                all_item_count[item] = {'count': 0, 'node_pointer': None, 'word': item}
            all_item_count[item]['count'] += 1

    # Keep only the frequent items; these form the header table
    frequent_item_count = {item: entry for item, entry in all_item_count.items()
                           if entry['count'] >= support}

    return frequent_item_count, list(all_item_count.keys())


def build_fp_tree(T, support):
    frequent_item_count, items = get_frequent_item_count_dict(T, support)

    fp_tree = {
        'parent': None,
        'children': {},
        'word_count': 1,
        'link': None,
        'word': None
    }

    for transaction in T:
        # Drop infrequent items; skip transactions that become empty
        no_unfreq_transaction = [w for w in transaction if w in frequent_item_count]
        if not no_unfreq_transaction:
            continue

        # Sort by descending count; break ties by item id so that the order
        # is the same in every transaction
        sorted_transaction = sorted(no_unfreq_transaction,
                                    key=lambda item: (-frequent_item_count[item]['count'], item))

        current_node = fp_tree
        for item in sorted_transaction:
            if item not in current_node['children']:
                new_node = {
                    'parent': current_node,
                    'children': {},
                    'word_count': 0,
                    'link': None,
                    'word': item
                }
                current_node['children'][item] = new_node

                # Append the new node itself (not its parent) to the item's linked list
                if frequent_item_count[item]['node_pointer'] is None:
                    frequent_item_count[item]['node_pointer'] = new_node
                else:
                    tmp_node = frequent_item_count[item]['node_pointer']
                    while tmp_node['link'] is not None:
                        tmp_node = tmp_node['link']
                    tmp_node['link'] = new_node

            current_node['children'][item]['word_count'] += 1
            current_node = current_node['children'][item]

    return fp_tree, frequent_item_count


def get_prefixes_of_item(item):
    paths, counts = [], []
    current_node = item['node_pointer']

    # Follow the linked list of all tree nodes that carry this item
    while current_node is not None:
        path = []
        traversal_node = current_node['parent']
        # Walk up to, but not including, the root (whose word is None)
        while traversal_node['parent'] is not None:
            path += [traversal_node['word']]
            traversal_node = traversal_node['parent']

        paths += [path]
        counts += [current_node['word_count']]
        current_node = current_node['link']

    return paths, counts


def get_intersection(l1, l2):
    intersection = []
    
    for item in l1:
        if item in l2:
            intersection += [item]
    
    return intersection

def get_many_intersection(list_of_lists):
    intersection = list_of_lists[0]
            
    for l in list_of_lists:
        intersection = get_intersection(intersection, l)    
    
    return intersection


def fp_growth(T, support):
    """
        FPGrowth algorithm for mining frequent item sets.
        Inputs:
            T:                   A list of lists; each inner list contains item ids.
                                 Example: T = [[1, 2, 5], [2, 3, 4], [1, 6]]
            support:             The minimum number of occurrences (an absolute count,
                                 not a proportion) needed to keep an itemset.

        Outputs:
            frequent_itemsets:   List of frequent itemsets
                                 Example: [[1, 2, 5], [1, 6]]
    """
    fp_tree, frequent_item_count = build_fp_tree(T, support)

    frequent_patterns = []

    for item in frequent_item_count.keys():
        # The item on its own is frequent
        frequent_patterns += [[item]]

        # Conditional pattern base: each prefix path, repeated once per occurrence.
        # The root carries the word None, so it is dropped if a path includes it.
        paths, counts = get_prefixes_of_item(frequent_item_count[item])
        conditional_T = []
        for path, count in zip(paths, counts):
            prefix = [w for w in path if w is not None]
            if prefix:
                conditional_T += [prefix] * count

        # Every pattern that is frequent in the conditional base is frequent together with item
        for pattern in fp_growth(conditional_T, support):
            frequent_patterns += [pattern + [item]]

    return frequent_patterns

In [27]:
test_data = [['I1','I2','I5'],
             ['I2','I4'],
             ['I2','I3'],
             ['I1','I2','I4'],
             ['I1','I3'],
             ['I2','I3'],
             ['I1','I3'],
             ['I1','I2','I3','I5'],
             ['I1','I2','I3']]

fp_growth(test_data, 1)


## Problem 2 - Random Walks
We introduced the notion of hitting time as the expected length of a random walk 
between two nodes, that is, the expected number of steps before a simple random walk starting 
from a vertex $v$ first reaches a vertex $u$. Your task is to compute the average 
hitting time in the 
`cit-DBLP` dataset from the [citation dataset collection](http://networkrepository.com/cit.php) 
in the [Network Repository](http://networkrepository.com/networks.php).
Here, average is defined across all pairs of nodes, considered in both directions. 
You may ignore pairs that are not connected by a (directed) path. 
Implement an algorithm that computes this average hitting time and report your 
result.

_Hint:_ We have used this dataset before.
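
For intuition, on a small graph the hitting times to a fixed target $u$ can be computed exactly rather than simulated: they satisfy $h(u) = 0$ and, for $v \neq u$, $h(v) = 1 + \frac{1}{d(v)} \sum_{w : v \to w} h(w)$, where $d(v)$ is the out-degree of $v$. The sketch below solves this linear system with exact rational arithmetic. It is an illustration only: the function name `exact_hitting_times` and the adjacency-dict input are assumptions, every node must appear as a key, and every node must be able to reach the target (otherwise the hitting time is infinite and the elimination fails). It is not meant to scale to the full `cit-DBLP` graph.

```python
from fractions import Fraction


def exact_hitting_times(successors, target):
    """Expected number of steps for a simple random walk to first reach `target`.

    successors: dict mapping every node to the list of its out-neighbours.
    Returns a dict {node: hitting time to target} with exact Fractions.
    """
    nodes = [v for v in successors if v != target]
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)

    # Augmented matrix [A | b], one row per non-target node v:
    #   h(v) - (1/d(v)) * sum over successors w != target of h(w) = 1
    M = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for v in nodes:
        r = index[v]
        M[r][r] += 1
        M[r][n] = Fraction(1)
        out = successors[v]
        for w in out:
            if w != target:
                M[r][index[w]] -= Fraction(1, len(out))

    # Gauss-Jordan elimination; raises StopIteration if the system is singular
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]

    times = {v: M[index[v]][n] for v in nodes}
    times[target] = Fraction(0)
    return times


# Directed 3-cycle 0 -> 1 -> 2 -> 0: the walk is deterministic.
cycle = {0: [1], 1: [2], 2: [0]}
print({v: float(t) for v, t in exact_hitting_times(cycle, target=2).items()})

# Complete directed graph on 3 nodes: each step hits the target with probability 1/2.
complete = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print({v: float(t) for v, t in exact_hitting_times(complete, target=2).items()})
```

Small cases like these are handy for checking a simulation-based estimate: on the complete graph every step reaches the target with probability 1/2, so the expected number of steps is 2.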

In [2]:
# Method will load a list of all pairs of nodes that are 
# connected to each other by an edge.
# Additionally, it will load precomputed positions for plotting nodes.
# It is, however, not a pretty plot but the computatio
edges, pos = load_dblp_citations()
print("Number of edges: ", len(edges))
print("Number of nodes: ", len(pos))
print("Position example: ", pos[1])

Number of edges:  49743
Number of nodes:  12591
Position example:  [0.03738057 0.02658339]


In [56]:
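def random_walk(G, u, v, max_walk_length=100):
    """
        Simulate one simple random walk starting at u.
        Returns the number of steps taken to reach v, or -1 if the walk
        gets stuck in a node without out-edges or runs out of steps.
    """
    p = u
    for step in range(1, max_walk_length + 1):
        neighbors = list(G.successors(p))

        if len(neighbors) == 0:
            return -1

        p = random.choice(neighbors)

        if p == v:
            return step

    return -1

def hitting_time(G, n1, n2, k=10, max_walk_length=100):
    """
        Monte Carlo estimate of the hitting time from n1 to n2, averaged over
        those of the k walks that reach n2. Returns None if no walk reaches n2.
        Failed walks are discarded, so long hitting times are underestimated.
    """
    walk_lengths = []

    for _ in range(k):
        walk_length = random_walk(G, n1, n2, max_walk_length)

        if walk_length > 0:
            walk_lengths.append(walk_length)

    if len(walk_lengths) == 0:
        return None

    return np.mean(walk_lengths)

def avg_hitting_time(G):
    nodes = list(G.nodes())

    total_hitting_time = 0
    reached_pairs = 0

    for n1 in nodes:
        for n2 in nodes:
            if n1 == n2:
                continue  # only pairs of distinct nodes are considered

            ht = hitting_time(G, n1, n2)
            if ht is not None:
                # Pairs that were never reached are ignored, not counted as 0
                total_hitting_time += ht
                reached_pairs += 1

    if reached_pairs == 0:
        return float('nan')

    return total_hitting_time / reached_pairs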

In [None]:
# Citations are directed, so read the edge list straight into a DiGraph
G = nx.read_edgelist("../exercises/cit-DBLP.edges", nodetype=int, create_using=nx.DiGraph)

#G = nx.DiGraph()
#G.add_edges_from([(1, 4), (2, 3), (2, 5), (3, 1), 
#                  (4, 3), (5, 4), (5, 2), (5, 6), 
#                  (5, 3), (6, 3), (6, 0), (7, 1),  
#                  (7, 3), (0, 1)])

avg_hitting_time(G)

## Problem 3 - Sequence Segmentation
The Dynamic Programming algorithm for optimally segmenting a sequence $S$ of length $n$ 
into $B$ segments, that we have introduced, is expressed by the following recursive equation:

$$
E(i, b) = \min_{j < i}\left[ E(j, b-1) + Err(j+1, i)\right]
$$

where $Err(j+1, i)$ is the error of a segment that contains items from $j+1$ to $i$.
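
The recurrence can be turned into a table-filling procedure directly. The following is a minimal sketch, assuming that $Err$ is the sum of squared deviations from the segment mean (the handout does not fix the error function) and using prefix sums so that each $Err(j+1, i)$ is evaluated in constant time. The name `optimal_segmentation` and the back-pointer table are illustrative choices.

```python
def optimal_segmentation(S, B):
    """Split S into B contiguous non-empty segments minimising total squared error.

    Returns (minimum error, list of segment end positions), positions 1-indexed.
    """
    n = len(S)
    if not 1 <= B <= n:
        raise ValueError("need 1 <= B <= len(S)")

    # Prefix sums of values and of squares give Err(a, b) in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(S, start=1):
        prefix[i] = prefix[i - 1] + x
        prefix_sq[i] = prefix_sq[i - 1] + x * x

    def err(a, b):
        """Squared error of the segment covering items a..b (1-indexed, inclusive)."""
        total = prefix[b] - prefix[a - 1]
        return (prefix_sq[b] - prefix_sq[a - 1]) - total * total / (b - a + 1)

    INF = float("inf")
    # E[b][i] plays the role of E(i, b); back[b][i] stores the minimising j.
    E = [[INF] * (n + 1) for _ in range(B + 1)]
    back = [[0] * (n + 1) for _ in range(B + 1)]
    E[0][0] = 0.0

    for b in range(1, B + 1):
        for i in range(b, n + 1):
            for j in range(b - 1, i):
                cost = E[b - 1][j] + err(j + 1, i)
                if cost < E[b][i]:
                    E[b][i] = cost
                    back[b][i] = j

    # Walk the back-pointers from (n, B) to recover the segment ends.
    ends, i = [], n
    for b in range(B, 0, -1):
        ends.append(i)
        i = back[b][i]
    return E[B][n], ends[::-1]


error, ends = optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], B=3)
print(error, ends)
```

On this toy sequence the three constant runs are recovered with zero error. The sketch keeps the full tables `E` and `back`, each with $(B+1)(n+1)$ entries, because the back-pointers are what allow the segment boundaries to be read off at the end.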

Answer the following questions:

**1. What is the default space-complexity of this algorithm?**
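$O(nB)$: the table stores one value $E(i, b)$ for every prefix length $i \le n$ and every number of segments $b \le B$.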

**2. If we are willing to recompute some tabulated results, can we then reduce the 
    default space-complexity? _Exactly how_? What is the space-complexity then?**


    
**3. What is the cost of using the above space-efficiency technique in terms of time-complexity?**




**4. For the sub-problem of segmenting the $i$-prefix of sequence $S$ into $b$ segments, consider 
    the segment $M(i, b)$ that contains (if such segment exists) the middle item of 
    index $\lfloor \frac{n}{2} \rfloor$. The boundaries of $M(i, b)$ can be detected and tabulated 
    along with each $E(i, b)$ solution. Based on this observation, devise a method that reduces 
    the time-complexity burden identified in (3). 
    _(hint: use [divide-and-conquer](https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm))_**
    
    
**5. What is the time complexity when using the technique proposed in (4)?**



**Disclaimer:** 
As this is the final handin and we are getting close to the exam, rehandins are not an option.
Therefore, we strongly encourage you to get started early and, if you get too stuck, make sure
to send Frederik an email early (not right up to the deadline). For a faster reply, use
[fhvilshoj@cs.au.dk](mailto:fhvilshoj@cs.au.dk) and **not** ~201206000@ post.au.dk~. 

For those of you, who are not used to analyzing algorithms: by time-complexity and space-complexity, 
we refer to the theoretical computation time and memory usage, respectively, as a function of the problem size, i.e., as a 
function of $n$ and $B$ in Problem 3. We use [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation)
to specify this. You should **not** infer it by implementing it in practice ;-) 
Again, when in doubt, shoot Frederik an email. 