# Q2. FP-Tree

Suppose you are given some transactions and a vocabulary that maps term indices to terms. Please use
the FP-tree (FP-growth) algorithm to discover the frequent itemsets.


## Data Description:

* topic-i.txt

Input file for the frequent pattern mining algorithm. Each line represents a transaction, given as the indices of its terms.

format: term1_index term2_index term3_index ...

Columns are separated by a space.

* vocab.txt

Dictionary that maps term index to term.

format: term_index term

Columns are separated by a space.

* pattern-i.txt:

The file you need to submit, which contains your result for this frequent pattern mining task. Each line represents one frequent itemset, and the lines are sorted in descending order of support count.

format: support_count term1 term2 ...

support_count and term1 are separated by a tab, while terms are separated by a space.

Here we give an example:
```
233 rule association
230 random
227 finding
203 mining pattern
```
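
As a minimal sketch of the output format described above, the snippet below builds the lines for a few patterns. The helper name `format_pattern_line` and the `patterns` list are hypothetical and only reuse the values from the example.

```python
def format_pattern_line(support_count, terms):
    """Return one line: support count, a tab, then space-separated terms."""
    return str(support_count) + "\t" + " ".join(terms)

# Hypothetical (support_count, terms) pairs, taken from the example above
patterns = [(203, ["mining", "pattern"]), (233, ["rule", "association"]), (230, ["random"])]

# Sort by descending support count before writing
ordered_patterns = sorted(patterns, key=lambda p: p[0], reverse=True)
lines = [format_pattern_line(count, terms) for count, terms in ordered_patterns]
print("\n".join(lines))
```
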
## Questions:

(a) Please write a program to implement the FP-growth algorithm and find all frequent itemsets with `support >= 400` in the given dataset.

(b) Based on the result of (a), please print out those FP-conditional trees whose height is **larger than 1**.

![Q2-output](Q2-output.png)

## Define preprocessing functions

* `read_file()` reads a txt file into a nested list, one inner list per transaction
* `create_init_set()` builds the transaction dictionary (itemset -> number of occurrences) that serves as the initial input to `create_fp_tree()`
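
To illustrate what the transaction dictionary looks like, here is a small sketch that uses `collections.Counter` in place of the hand-written loop. The toy transactions are hypothetical and do not come from the dataset.

```python
from collections import Counter

# Toy transactions (hypothetical term indices, not from the dataset)
toy_transactions = [["390", "382"], ["382", "390"], ["248"], ["390", "382", "723"]]

# frozenset makes each transaction hashable and order-insensitive,
# so ["390", "382"] and ["382", "390"] collapse into one key
toy_trans_dict = Counter(frozenset(t) for t in toy_transactions)

for itemset, count in toy_trans_dict.items():
    print(sorted(itemset), count)
```

Identical transactions are merged here, so the tree-building step can insert each distinct itemset once with its count instead of once per occurrence.
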

In [1]:
def read_file(file):
    """
    Read files from txt 
    """
    items_bought = [] # origin data
    with open(file,"r") as f:
        for line in f.readlines():
            items_bought.append(line.strip().split())
    return items_bought

def create_init_set(items_bought):
    trans_dict={}
    for trans in items_bought:
        key = frozenset(trans)
        if key not in trans_dict:
            trans_dict[key] = 1
        else:
            trans_dict[key] += 1
    return trans_dict

## Define the class of TreeNode and make a vocab dict

In [2]:
# Build vocab_dict, which maps a term index to its term
vocab_dict = {}
with open("vocab.txt", 'r') as f:
    for line in f:
        # Split on the first run of whitespace so that both tab- and space-separated files work
        term = line.strip().split(None, 1)
        if len(term) == 2:
            vocab_dict[term[0]] = term[1]

In [3]:
# Define the structure of TreeNode
class TreeNode:
    def __init__(self, name_value, num_occur, parent_node):
        self.name = name_value
        self.node_link = None
        self.count = num_occur
        self.parent = parent_node
        self.children = {}
    
    # Increase the count of this node
    def inc(self, num_occur):
        self.count += num_occur
    
    # Print the tree's structure in the terminal
    def display(self, index=1):
        print('  '*index, self.name, ' ', self.count)
        for child in self.children.values():
            child.display(index+1)
    
    # Calculate the height of the tree (a lone root has height 1)
    def get_height(self, index=1):
        mid = 0
        for child in self.children.values():
            # Compute each child's height once; calling it twice per child
            # makes the recursion exponential in the tree depth
            mid = max(mid, child.get_height(index+1))
        return max(index, mid)
    
    # Transform tree to a nested list
    # According to the output EXAMPLE => [parent, children]
    def transform_to_list(self):
        if self.name in vocab_dict:
            node_info = vocab_dict[self.name] + "    " + str(self.count)
        else:
            node_info = self.name + "    " + str(self.count)
        if len(self.children) == 0:
            return node_info
        
        local_list = []
        for child in self.children.values():
            local_list.append(child.transform_to_list())

        return [node_info, local_list]
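
The height convention matters for question (b): a lone root counts as height 1, so "height larger than 1" means the tree has at least one item node. A minimal sketch of that convention follows; `MiniNode` is a hypothetical stand-in for `TreeNode` that keeps only what the height computation needs.

```python
class MiniNode:
    """Hypothetical stand-in for TreeNode, keeping only what height needs."""
    def __init__(self, name):
        self.name = name
        self.children = {}

    def add(self, name):
        # Return the existing child with this name, or create it
        if name not in self.children:
            self.children[name] = MiniNode(name)
        return self.children[name]

    def height(self):
        # A lone node counts as height 1; each level of children adds 1
        return 1 + max((c.height() for c in self.children.values()), default=0)

root = MiniNode("Null Set")
print(root.height())          # root only
root.add("390")
print(root.height())          # root plus one level of children
root.add("382").add("248")
print(root.height())          # root plus a two-node branch
```
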

## Define FP-growth function

In [4]:
def update_header(node_to_test, target_node):
    """
    Update header table
    
    :node_to_test: the entry of the node
    :target_node: Target node to link
    """
    # Walk to the last node of the chain (the one whose node_link is None)
    while node_to_test.node_link is not None:
        node_to_test = node_to_test.node_link
    # Link the target_node to the end of the chain
    node_to_test.node_link = target_node
            
def update_fp_tree(items, fp_tree, header_table, count):
    """
    Update FP Tree
    
    :items: items in frequent itemsets
    :fp_tree: TreeNode
    :header_table: header table
    :count: added count
    """
    if items[0] in fp_tree.children:
        # If items[0] is already a child node, increase that child's count
        fp_tree.children[items[0]].inc(count)
    else:
        # If items[0] is not a child node, create a new branch
        fp_tree.children[items[0]] = TreeNode(items[0], count, fp_tree)
        # Append the new node to this item's node-link chain in the header table
        if header_table[items[0]][1] is None:
            header_table[items[0]][1] = fp_tree.children[items[0]]
        else:
            update_header(header_table[items[0]][1], fp_tree.children[items[0]])
    # Recurse on the remaining items
    if len(items) > 1:
        update_fp_tree(items[1:], fp_tree.children[items[0]], header_table, count)
            
def create_fp_tree(trans_dict, threshold=400):
    """
    Create FP Tree
    
    :trans_dict: transaction dictionary
    :threshold: bound that decides whether to remove the item in transactions. Default value = 400
    
    :return: fp_tree, header_table, filtered_trans_dict
    """
    header_table = {}
    
    # First Scan: Construct the header table and sort it
    for items in trans_dict:
        for item in items:
            header_table[item] = header_table.get(item, 0) + trans_dict[items]
    # Delete the elements with frequency < threshold
    for k in set(header_table.keys()):
        if header_table[k] < threshold:
            del(header_table[k]) 
    # Sort the header table by descending frequency
    header_table = dict(sorted(header_table.items(), key=lambda p:p[1], reverse=True))
    # Frequent items: the items whose frequency >= threshold
    freq_itemsets = set(header_table.keys())
    
    if len(freq_itemsets) == 0:
        return None, None, None
    
    for k in header_table:
        header_table[k] = [header_table[k], None]

    # Initialize the FP Tree with root node
    fp_tree = TreeNode('Null Set', 1, None)
    
    # Second Scan: remove the infrequent item in transaction and sort it
    filtered_trans_dict = {}
    for trans, count in trans_dict.items():
        local_dict = {}
        for item in trans:
            if item in freq_itemsets:
                local_dict[item] = header_table[item][0]
        if len(local_dict) > 0:
            # Sort the transaction's items by descending frequency, breaking ties by
            # item name so that every transaction uses the same global order
            ordered_items = [x[0] for x in sorted(local_dict.items(), key=lambda p: (-p[1], p[0]))]
            
            # Build the filtered transaction dictionary
            key = frozenset(ordered_items)
            if key not in filtered_trans_dict:
                filtered_trans_dict[key] = count
            else:
                filtered_trans_dict[key] += count

            # Update FP Tree using ordered items
            update_fp_tree(ordered_items, fp_tree, header_table, count)
    
    # Sort the filtered transaction dictionary as an output
    filtered_trans_dict = dict(sorted(filtered_trans_dict.items(), key=lambda p:p[1], reverse=True))
    
    return fp_tree, header_table, filtered_trans_dict

## Generate FP-Tree from txt file

In [5]:
# Test for topic-0.txt
items_bought = read_file("topic-0.txt")
trans_dict = create_init_set(items_bought)
fp_tree, header_table, filtered_trans_dict = create_fp_tree(trans_dict, threshold=400)

# fp_tree.display()
print(filtered_trans_dict)
header_table

{frozenset({'248'}): 668, frozenset({'390'}): 624, frozenset({'458'}): 500, frozenset({'298'}): 328, frozenset({'382'}): 324, frozenset({'118'}): 303, frozenset({'382', '390'}): 296, frozenset({'473'}): 217, frozenset({'723'}): 208, frozenset({'225'}): 170, frozenset({'382', '723'}): 123, frozenset({'382', '225'}): 108, frozenset({'473', '248'}): 67, frozenset({'458', '248'}): 52, frozenset({'298', '248'}): 47, frozenset({'248', '390'}): 45, frozenset({'298', '390'}): 38, frozenset({'473', '390'}): 37, frozenset({'248', '118'}): 37, frozenset({'382', '458'}): 33, frozenset({'382', '248'}): 28, frozenset({'118', '390'}): 26, frozenset({'225', '248'}): 23, frozenset({'458', '390'}): 23, frozenset({'225', '390'}): 23, frozenset({'723', '390'}): 23, frozenset({'382', '723', '390'}): 23, frozenset({'298', '458'}): 21, frozenset({'382', '248', '390'}): 20, frozenset({'382', '298', '390'}): 20, frozenset({'723', '118'}): 20, frozenset({'298', '723'}): 18, frozenset({'458', '118'}): 18, frozen

{'390': [1288, <__main__.TreeNode at 0x24a69a5d470>],
 '382': [1163, <__main__.TreeNode at 0x24a69a5dcf8>],
 '248': [1087, <__main__.TreeNode at 0x24a69a5dd68>],
 '458': [720, <__main__.TreeNode at 0x24a69a5d518>],
 '298': [560, <__main__.TreeNode at 0x24a69a5dc88>],
 '723': [528, <__main__.TreeNode at 0x24a69a5d6a0>],
 '118': [509, <__main__.TreeNode at 0x24a69a5dcc0>],
 '473': [488, <__main__.TreeNode at 0x24a69a5d630>],
 '225': [416, <__main__.TreeNode at 0x24a69a5db00>]}

## Begin to Mine FP-tree

To mine the FP-tree, we start from an item in the header table, follow its node links to every node that carries the item, and ascend from each of those nodes toward the root to collect its prefix path. These prefix paths, weighted by the node counts, form the item's conditional pattern base. We then build a conditional FP-tree from the pattern base and mine it recursively to obtain the frequent itemsets.
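
As a rough illustration of what a conditional pattern base contains, the sketch below computes one without building a tree. It assumes the transactions are already sorted in the global frequency order, in which case the prefix preceding an item in each transaction matches the prefix path found by ascending the tree. The helper `toy_cond_pattern_base` and the toy data are hypothetical.

```python
from collections import Counter

def toy_cond_pattern_base(ordered_transactions, item):
    """Collect the prefix preceding `item` in each frequency-ordered transaction."""
    base = Counter()
    for trans, count in ordered_transactions:
        if item in trans:
            prefix = trans[:trans.index(item)]
            if prefix:  # an empty prefix contributes nothing to the pattern base
                base[frozenset(prefix)] += count
    return base

# Hypothetical (transaction, count) pairs, already sorted by descending item frequency:
# a and b each occur 6 times, c occurs 4 times
ordered = [(("a", "b", "c"), 2), (("a", "c"), 1), (("b", "c"), 1), (("a", "b"), 3)]
base_c = toy_cond_pattern_base(ordered, "c")
print(dict(base_c))
```

Feeding a dictionary like `base_c` back into the tree-building step is what produces the conditional FP-tree for `c`.
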

In [6]:
# Ascend the FP-tree from a node to the root, collecting item names along the way
def ascend_fp_tree(leaf_node, prefix_path):
    if leaf_node.parent is not None:
        prefix_path.append(leaf_node.name)
        ascend_fp_tree(leaf_node.parent, prefix_path)
        
# Find the conditional pattern base of base_pat
def find_cond_pat_base(base_pat, header_table):
    tree_node = header_table[base_pat][1]
    cond_pats = {}
    while tree_node is not None:
        prefix_path = []
        ascend_fp_tree(tree_node, prefix_path)
        if len(prefix_path) > 1:
            # prefix_path[0] is base_pat itself; accumulate rather than overwrite
            # in case two paths contain the same set of items
            key = frozenset(prefix_path[1:])
            cond_pats[key] = cond_pats.get(key, 0) + tree_node.count
        tree_node = tree_node.node_link
    return cond_pats

In [7]:
def mine_fp_tree(in_tree, header_table, threshold, pre_fix, freq_item_list, min_height):
    """
    Mine the frequent itemsets from an FP-tree
    
    :in_tree: not used
    :header_table: header table of the tree being mined
    :threshold: the minimum support
    :pre_fix: the frequent itemset accumulated by the enclosing recursion levels
    :freq_item_list: output list that collects [count, itemset] pairs
    :min_height: conditional FP-trees taller than this are printed
    """

    # Visit items in ascending order of support; the sort key must be the count p[1][0], not the item name p[0]
    items_list = [(v[0], v[1][0]) for v in sorted(header_table.items(), key=lambda p:p[1][0])]
    
    for item, count in items_list:
        
        new_freq_set = pre_fix.copy()
        
        new_freq_set.add(vocab_dict[item]) 
        #new_freq_set.add(item) # test code
        
        freq_item_list.append([count, new_freq_set])
        
        # Construct conditional pattern base
        cond_pattern_base = find_cond_pat_base(item, header_table)
        # print(cond_pattern_base)
        # Build conditional FP trees
        cond_tree, cond_header_table, _ = create_fp_tree(cond_pattern_base, threshold)
        
        # If FP-conditional tree exists
        if cond_tree:
            # Print the FP-conditional tree if its height is larger than min_height
            if cond_tree.get_height() > min_height:
                print("FP-cond Tree of" + str(new_freq_set))
                print("height = " + str(cond_tree.get_height()))
                print("list form of cond tree:")
                print(cond_tree.transform_to_list())
                cond_tree.display()

        # If header_table exists
        if cond_header_table:
            mine_fp_tree(cond_tree, cond_header_table, threshold, new_freq_set, freq_item_list, min_height)

In [8]:
def mine_from_txt(topic_file, pattern_file, threshold=400, min_height=1):
    """
    Mine frequent itemsets pattern from topic file
    
    :topic_file: (string) the path of topic file
    :pattern_file: (string) the path to store the pattern file
    :threshold: the minimum support
    :min_height: the minimum height of FP-conditional tree to print
    """
    items_bought = read_file(topic_file) # Read topic.txt file to initialize dataset => nested list
    trans_dict = create_init_set(items_bought) # list => transaction dict
    
    fp_tree, header_table, _ = create_fp_tree(trans_dict, threshold) # Create FP-tree and header table

    freq_items = [] # frequent items pattern list
    mine_fp_tree(fp_tree, header_table, threshold, set([]), freq_items, min_height)
    
    # Sort freq_items by descending support count
    freq_items = sorted(freq_items, key=lambda p:p[0], reverse=True)
    print("frequent items pattern:")
    print(freq_items)
    
    # Write the patterns to the txt file: support count, a tab, then space-separated terms
    # (the indices were already replaced with terms from vocab.txt in mine_fp_tree)
    with open(pattern_file, "w") as f:
        for count, vocabs in freq_items:
            f.write(str(count) + "\t" + " ".join(vocabs) + "\n")

```
# IMPORTANT
# This is test code and CANNOT RUN as-is. To run it, first switch mine_fp_tree to the
# commented-out `new_freq_set.add(item)` line, because these items are not in vocab_dict.
data = [
            ['a', 'b', 'd', 'e', 'f', 'g'],
            ['a', 'f', 'g'],
            ['b', 'd', 'e', 'f'],
            ['a', 'b', 'd'],
            ['a', 'b', 'e', 'g']
        ]
        
trans_set = create_init_set(data)

fp_tree, header_table, itemsets = create_fp_tree(trans_set, 3)

freq_items = [] # frequent items pattern list
mine_fp_tree(fp_tree, header_table, 3, set([]), freq_items, 1)
    
# Sort freq_items by descending support count
freq_items = sorted(freq_items, key=lambda p:p[0], reverse=True)
print("frequent items pattern:")
print(freq_items)
```
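
One way to sanity-check the miner on this toy dataset is to compare its result with a brute-force count. The sketch below is not part of the FP-growth implementation: `brute_force_frequent` is a hypothetical helper that enumerates every candidate itemset, which is only feasible for a handful of items.

```python
from itertools import combinations

def brute_force_frequent(transactions, threshold):
    """Count every candidate itemset directly; only viable for tiny datasets."""
    items = sorted({item for trans in transactions for item in trans})
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing every item of the candidate
            support = sum(1 for trans in transactions if set(candidate) <= set(trans))
            if support >= threshold:
                frequent[frozenset(candidate)] = support
    return frequent

data = [
    ['a', 'b', 'd', 'e', 'f', 'g'],
    ['a', 'f', 'g'],
    ['b', 'd', 'e', 'f'],
    ['a', 'b', 'd'],
    ['a', 'b', 'e', 'g'],
]

expected = brute_force_frequent(data, 3)
for itemset, support in sorted(expected.items(), key=lambda p: (-p[1], sorted(p[0]))):
    print(support, sorted(itemset))
```

If the test code above is run with raw item names, every `[count, itemset]` pair in `freq_items` should have a matching entry in `expected`.
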

## Print the FP-conditional tree and write pattern in file

According to the question description, for each topic we need to write a `pattern-i.txt` file and print the FP-conditional trees whose height is larger than 1. Therefore we do this:

In [9]:
# Process all topic files
for i in range(5):
    topic_file = "topic-" + str(i) + ".txt"
    pattern_file = "pattern-" + str(i) + ".txt"
    print("processing " + topic_file)
    mine_from_txt(topic_file, pattern_file, threshold=400, min_height=1)
    print("write the pattern in " + pattern_file)
    print()

processing topic-0.txt
FP-cond Tree of{'mining'}
height = 2
list form of cond tree:
['Null Set    1', ['data    413']]
   Null Set   1
     390   413
frequent items pattern:
[[1288, {'data'}], [1163, {'mining'}], [1087, {'algorithm'}], [720, {'graph'}], [560, {'time'}], [528, {'pattern'}], [509, {'tree'}], [488, {'efficient'}], [416, {'rule'}], [413, {'mining', 'data'}]]
write the pattern in pattern-0.txt

processing topic-1.txt
frequent items pattern:
[[1488, {'learning'}], [1050, {'using'}], [819, {'model'}], [715, {'based'}], [582, {'classification'}], [488, {'feature'}], [474, {'clustering'}], [463, {'network'}]]
write the pattern in pattern-1.txt

processing topic-2.txt
FP-cond Tree of{'retrieval'}
height = 2
list form of cond tree:
['Null Set    1', ['information    475']]
   Null Set   1
     190   475
frequent items pattern:
[[1226, {'web'}], [1211, {'information'}], [1114, {'retrieval'}], [863, {'based'}], [757, {'system'}], [707, {'search'}], [564, {'document'}], [490, {'lang