# Hierarchical Clustering

This notebook walks through how to implement hierarchical clustering in pure Python code.
The walkthrough is inspired by the book "Programming Collective Intelligence" by Toby Segaran.

Define the cluster node class. Each node holds a left and a right reference to other nodes in the tree, so it is the basic unit the tree is built from.

In [3]:
class cluster_node:
    def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):
        self.left=left # left and right child nodes (None for a leaf)
        self.right=right
        self.vec=vec # the cluster's feature vector
        self.id=id # id used to tell leaves from merged (non-leaf) nodes
        self.distance=distance # distance between the two merged children (0.0 for a leaf)
        self.count=count # only used for weighted average
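
To make the node structure concrete, here is a minimal sketch that builds two leaves and one merged node by hand. The class is repeated so the snippet runs on its own. Setting `count` to the sum of the children's counts is an assumption about how that field is meant to be used; the tree builder later in this notebook leaves it at its default.

```python
import numpy as np

class cluster_node:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None, count=1):
        self.left = left
        self.right = right
        self.vec = vec
        self.id = id
        self.distance = distance
        self.count = count

# Two leaves: each wraps one input vector and gets a non-negative id.
a = cluster_node(np.array([0.0, 0.0]), id=0)
b = cluster_node(np.array([2.0, 2.0]), id=1)

# A merged node: averaged vector, both leaves as children, a negative id,
# and the distance between the two children.
# count = a.count + b.count is an assumption, see the note above.
gap = float(np.sqrt(np.sum((a.vec - b.vec) ** 2)))
parent = cluster_node((a.vec + b.vec) / 2.0, left=a, right=b,
                      distance=gap, id=-1, count=a.count + b.count)

print(parent.vec, parent.count)
print(parent.left.id, parent.right.id)
```

A leaf is recognizable by its missing children and its non-negative id, while a merged node carries both children and a negative id.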


Distance functions for measuring how far apart two nodes are. The default is the usual L2 (Euclidean) distance, and an L1 alternative is also provided (https://en.wikipedia.org/wiki/Loss_function).


In [4]:
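from numpy import sqrt

def L2dist(v1,v2):
    # Euclidean (L2) distance between two NumPy vectors
    return sqrt(sum((v1-v2)**2))
    
def L1dist(v1,v2):
    # Manhattan (L1) distance between two NumPy vectors
    return sum(abs(v1-v2))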

# def Chi2dist(v1,v2):
#     return sqrt(sum((v1-v2)**2))
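
As a quick check of the two distances, the sketch below evaluates them on a 3-4-5 triangle. The function names `l2_distance` and `l1_distance` are stand-ins chosen for this example, and it assumes the vectors are NumPy arrays.

```python
import numpy as np

def l2_distance(v1, v2):
    # Square root of the sum of squared differences.
    return np.sqrt(np.sum((v1 - v2) ** 2))

def l1_distance(v1, v2):
    # Sum of absolute differences.
    return np.sum(np.abs(v1 - v2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(l2_distance(a, b))  # 5.0
print(l1_distance(a, b))  # 7.0
```

The L1 distance is never smaller than the L2 distance for the same pair of points, so switching between them can change which clusters get merged first.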

The main tree construction function. It takes a list of items and returns a tree built from them.
Running hcluster() on a matrix with feature vectors as rows creates and returns the cluster tree.

In [5]:
from numpy import array

def hcluster(features,distance=L2dist):
    # cluster the rows of the "features" matrix
    distances={}
    # cache of pairwise distances already computed, keyed by (id, id)
    currentclustid=-1
    # id for the next merged (non-leaf) node; merged nodes get negative ids
    
    # initialize a list of nodes, each node is an item from the input list.
    clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]
    
    # repeatedly merge the closest pair until only one node is left in the
    # list; the while loop keeps running as long as there is more than one.
    while len(clust)>1:
        # placeholder for the closest pair (indices into clust)
        lowestpair=(0,1)
        
        # use the distance function passed in to get a starting value for the smallest distance.
        closest=distance(clust[0].vec,clust[1].vec)
        
        
        # loop through every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i+1,len(clust)):
                # distances is the cache of distance calculations
                # check if we calculate this already or not
                if (clust[i].id,clust[j].id) not in distances: 
                    distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
                
                d=distances[(clust[i].id,clust[j].id)]
        
                # try to get the lowest pair every time
                if d<closest:
                    closest=d
                    lowestpair=(i,j)
        
        # merge by taking the unweighted average of the two closest clusters' vectors.
        mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 \
            for i in range(len(clust[0].vec))]


        # create the new cluster and assign the two merged nodes as its left and right children.
        newcluster=cluster_node(array(mergevec),left=clust[lowestpair[0]],
                             right=clust[lowestpair[1]],
                             distance=closest,id=currentclustid)
        
        # cluster ids that weren't in the original set are negative
        currentclustid-=1
        del clust[lowestpair[1]] # remove the two merged nodes from the list, higher index first so the
        del clust[lowestpair[0]] # lower index stays valid; they remain reachable as the new node's children
        clust.append(newcluster)

    return clust[0] # only one cluster node object is left in the list at this point, and
                    # it is the root of the tree.
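
To see the merge order on a small input, here is a condensed, self-contained sketch of the same bottom-up procedure. It is a simplified restatement for illustration, not the function above: `Node` and `build_tree` are hypothetical names, the distance is fixed to Euclidean, and the distance cache is left out.

```python
import numpy as np

class Node:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec, self.left, self.right = vec, left, right
        self.distance, self.id = distance, id

def build_tree(features):
    # Start with one leaf per row; leaves get ids 0, 1, 2, ...
    nodes = [Node(np.array(f, dtype=float), id=i) for i, f in enumerate(features)]
    next_id = -1  # merged nodes get negative ids
    while len(nodes) > 1:
        # Find the closest pair (i < j); ties keep the first pair found.
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = float(np.sqrt(np.sum((nodes[i].vec - nodes[j].vec) ** 2)))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge the pair into a new node holding the unweighted average.
        merged = Node((nodes[i].vec + nodes[j].vec) / 2.0,
                      left=nodes[i], right=nodes[j], distance=d, id=next_id)
        next_id -= 1
        del nodes[j]  # delete the higher index first so i stays valid
        del nodes[i]
        nodes.append(merged)
    return nodes[0]

# Two tight pairs that are far away from each other.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
root = build_tree(points)

print(root.id, root.left.id, root.right.id)
print(root.left.left.id, root.left.right.id,
      root.right.left.id, root.right.right.id)
```

On this input the two nearby pairs are merged first, and the root joins the two resulting clusters at a much larger distance, which is what makes a distance threshold useful for cutting the tree.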

The following function helps us to extract the clusters recursively from the tree. It will traverse from the top until a node with distance value smaller than some threshold is found.

In [6]:
def extract_clusters(clust,dist):
    # extract list of sub-tree clusters from hcluster tree with distance<dist
    if clust.distance<dist:
        # we have found a cluster subtree
        return [clust] 
    else:
        # check the right and left branches
        cl = []
        cr = []
        if clust.left is not None: 
            cl = extract_clusters(clust.left,dist=dist)
        if clust.right is not None: 
            cr = extract_clusters(clust.right,dist=dist)
        return cl+cr 
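
The effect of the threshold is easiest to see on a tree built by hand. The sketch below uses a stripped-down `Node` class and a function named `cut_tree`, both hypothetical stand-ins that follow the same recursion; the distances are made-up values for illustration.

```python
class Node:
    """Minimal stand-in for a tree node (illustration only)."""
    def __init__(self, id, distance=0.0, left=None, right=None):
        self.id = id
        self.distance = distance
        self.left = left
        self.right = right

def cut_tree(node, threshold):
    # Stop descending as soon as a node's merge distance is below the threshold.
    if node.distance < threshold:
        return [node]
    subtrees = []
    if node.left is not None:
        subtrees += cut_tree(node.left, threshold)
    if node.right is not None:
        subtrees += cut_tree(node.right, threshold)
    return subtrees

# Four leaves, two tight pairs, and a root joining the pairs far apart.
leaves = [Node(i) for i in range(4)]
near = Node(-1, distance=1.0, left=leaves[0], right=leaves[1])
far = Node(-2, distance=1.0, left=leaves[2], right=leaves[3])
root = Node(-3, distance=14.0, left=near, right=far)

print([n.id for n in cut_tree(root, 20.0)])  # [-3]
print([n.id for n in cut_tree(root, 5.0)])   # [-1, -2]
print([n.id for n in cut_tree(root, 0.5)])   # [0, 1, 2, 3]
```

A large threshold returns the whole tree as one cluster, a small one splits it down to individual leaves, and values in between return the intermediate sub-trees.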

extract_clusters() returns a list of sub-trees, one per cluster. To get the leaf nodes that contain the object ids, traverse each sub-tree and return a list of its leaves.

In [7]:
def get_cluster_elements(clust):
    # return ids for elements in a cluster sub-tree
    if clust.id>=0:
        # non-negative id means that this is a leaf (leaf ids start at 0)
        return [clust.id]
    else:
        # check the right and left branches
        cl = []
        cr = []
        if clust.left is not None: 
            cl = get_cluster_elements(clust.left)
        if clust.right is not None: 
            cr = get_cluster_elements(clust.right)
        return cl+cr
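
A short sketch of the leaf traversal on a hand-built tree follows. `Node` and `leaf_ids` are hypothetical stand-ins for this example. Because leaf ids start at 0, the leaf test has to accept 0; a test of `id > 0` would silently drop the first item.

```python
class Node:
    """Minimal stand-in for a tree node (illustration only)."""
    def __init__(self, id, left=None, right=None):
        self.id = id
        self.left = left
        self.right = right

def leaf_ids(node):
    # Leaves carry non-negative ids; merged nodes carry negative ids.
    if node.id >= 0:
        return [node.id]
    ids = []
    if node.left is not None:
        ids += leaf_ids(node.left)
    if node.right is not None:
        ids += leaf_ids(node.right)
    return ids

# Leaves 0 and 1 are merged first, then joined with leaf 2.
pair = Node(-1, left=Node(0), right=Node(1))
root = Node(-2, left=pair, right=Node(2))

print(leaf_ids(root))  # [0, 1, 2]
print(leaf_ids(pair))  # [0, 1]
```

The returned ids are indices into the original list of items, so they can be used directly to look up the rows that belong to each cluster.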
