# Gene Expression Clustering by Hierarchical Clustering
## Adam Kim

*This program was completed for CSI 4352 (Introduction to Data Mining).*

*This function creates the gene -> feature dictionary.*

*The gene index is the key, and the gene's 12 data points are the value.*

In [None]:
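# Assumed alias: the original import cell is not shown in this notebook
import csv as myCSV

def init_DB(filename):

    myDB        = {}
    rowCount    = 0

    # newline='' lets the csv module handle line endings itself;
    # the with-block closes the file when reading is done
    with open(filename,newline='') as handle:

        csvHandle = myCSV.reader(handle,delimiter='\t',quotechar='"')

        for row in csvHandle:

            for point in row:

                # Skip empty fields (e.g. produced by trailing tabs)
                if point:
                    myDB.setdefault(rowCount,[]).append(float(point))

            rowCount += 1

    return myDB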
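
To make the shape of the dictionary concrete, here is a minimal sketch that parses a small in-memory tab-separated sample in the same way. The helper name `parse_rows` and the sample values are made up for illustration; the real input has 12 features per gene.

```python
import csv
import io

def parse_rows(handle):
    """Map each row index to its list of float features, skipping empty fields."""
    db = {}
    reader = csv.reader(handle, delimiter='\t', quotechar='"')
    for row_index, row in enumerate(reader):
        features = [float(point) for point in row if point]
        if features:  # rows with no usable fields get no entry
            db[row_index] = features
    return db

# Hypothetical 2-gene, 3-feature sample (not taken from the assignment data).
sample = io.StringIO("0.5\t1.0\t-2.0\n1.5\t0.0\t3.0\n")
demo_db = parse_rows(sample)
print(demo_db)
```

Each key is a row (gene) index and each value is that gene's feature vector, which is the structure the later distance functions rely on.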

*This function creates the initial clusters.*

*This assignment uses an agglomerative (bottom-up) approach, so each gene is initialized as a single-member cluster.*

*Initially there should be 500 clusters with 1 gene in each cluster.*

In [None]:
def init_clusters(myDB):

    clusters = []

    for gene in myDB:
        clusters.append([gene])

    return clusters

*This function builds the initial distance matrix for the genes.*

*It uses average-link distance between clusters and Manhattan (L1) distance between cluster members.*

*The distance from a gene to itself is set to infinity so that the zeros on the diagonal are never picked as the minimum.*

In [None]:
def init_distance_matrix(clusters,myDB):

    distMatrix = []

    i = 0

    while i < len(clusters):

        j = 0

        row = []

        while j < len(clusters):

            if i != j:

                avgLinkDist     = 0.0
                rootCluster     = clusters[i]
                neighborCluster = clusters[j]

                for gene1 in rootCluster:

                    for gene2 in neighborCluster:

                        avgLinkDist += manhattan(gene1,gene2,myDB)

                n           = float(len(rootCluster))
                m           = float(len(neighborCluster))
                avgLinkDist /= n
                avgLinkDist /= m
                
                row.append(avgLinkDist)
                
            else:

                row.append(float('inf'))

            j+=1

        distMatrix.append(row)

        i+=1
        
    return distMatrix
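
As a small worked example of the average-link computation, the sketch below sums the pairwise L1 distances between two clusters and divides by the number of pairs. The function name `average_link` and the toy data are illustrative assumptions, not part of the original program.

```python
def average_link(cluster_a, cluster_b, db):
    """Mean pairwise Manhattan (L1) distance between members of two clusters."""
    total = 0.0
    for gene_a in cluster_a:
        for gene_b in cluster_b:
            total += sum(abs(x - y) for x, y in zip(db[gene_a], db[gene_b]))
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical 2-feature data for three genes.
toy_db = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [4.0, 0.0]}

# Pairs: d(0,2) = 4 + 0 = 4.0 and d(1,2) = 3 + 1 = 4.0, so the average is 4.0.
dist = average_link([0, 1], [2], toy_db)
print(dist)
```

Dividing once by `len(cluster_a) * len(cluster_b)` is equivalent to the two successive divisions by `n` and `m` used in the function above.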

*This function calculates the Manhattan (L1) distance between two genes.*

In [None]:
def manhattan(gene1,gene2,myDB):

    dimsA           = myDB[gene1]
    dimsB           = myDB[gene2]
    manhattanDist   = 0.0

    assert( len(dimsA) == len(dimsB) )

    itr = 0

    while itr < len(dimsA):

        manhattanDist += abs( dimsA[itr] - dimsB[itr] )

        itr += 1

    return manhattanDist
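
For comparison, the same distance can be written more compactly with `zip`. This is only a sketch: it takes the two feature vectors directly rather than gene indices, and the name `manhattan_pairs` is hypothetical.

```python
def manhattan_pairs(dims_a, dims_b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    if len(dims_a) != len(dims_b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(dims_a, dims_b))

# |1 - 2| + |2 - 0| + |3 - 3.5| = 1 + 2 + 0.5 = 3.5
d = manhattan_pairs([1.0, 2.0, 3.0], [2.0, 0.0, 3.5])
print(d)
```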

*This is the AGNES (agglomerative nesting) algorithm.*

*Start with clusters of size 1.*

*Repeatedly merge the two closest clusters until all members end up in a single cluster.*

*Results are reported when the number of clusters reaches 50, 30, and 10.*

In [None]:
def AGNES(clusters,myDB):

    distMatrix = init_distance_matrix(clusters,myDB)

    # This is used to report results when cluster number is 50,30,10
    
    sizes = [50,30,10]
    
    while len(clusters) > 1:

        '''
        This code here outputs results when cluster number is 50,30,10
        '''
        if len(clusters) in sizes:
            sizes.remove(len(clusters))
            report(clusters,len(clusters))

        # Object used to track closest clusters: [distance, cluster a, cluster b]
        dIndex, aIndex, bIndex = 0, 1, 2
        minDist = [float('inf'),-1,-1]

        i=0
        while i < len(distMatrix):

            '''
            j=0
            
            This traverses the entire square.
            However, matrix is mirrored along diagonal.
            This is because distance is symmetric function.
            Distance from a to b is same distance from b to a
            Therefore duplicate information exists.
            
            To optimize, only read triangle.
            This reduces readtime from n^2 to 0.5n^2
            '''
            
            j = i # This traverses only the triangle
            
            while j < len(distMatrix):
                
                if distMatrix[i][j] < minDist[dIndex] and i != j:
                    
                    minDist[aIndex] = i
                    minDist[bIndex] = j
                    
                    minDist[dIndex] = distMatrix[i][j]
                    
                j+=1
                
            i+=1

        a = minDist[aIndex]
        b = minDist[bIndex]

        '''
        Here we merge cluster a with cluster b and delete cluster b
            within the clusters data structure.
        '''

        # Update a <- a+b, delete b in CLUSTERS
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

        '''
        It is further necessary to update the distance matrix.

        First, delete the row containing cluster b's distance information
            because it is no longer needed
        '''
        
        # Delete b row in DISTANCE MATRIX
        del distMatrix[b]

        '''
        Next, delete the column containing cluster b's distance information
            because it is no longer needed either

            After these steps, the distance matrix size goes
                from n by n to n-1 by n-1
        '''
        
        # Delete b column in DISTANCE MATRIX
        for row in distMatrix:
            del row[b]
            
        rowA    = []
        k       = 0

        '''
        Here cluster b's distance information is fully removed
            from distance matrix.

        Now it is necessary to update distance information
            for cluster a (note a is merged with b)

        I will do this by creating a row with recalculated distances
            from a to all other clusters (note b will not be calculated
            because it was removed in the previous step)
        '''

        rootCluster = clusters[a] # newly married cluster

        while k < len(clusters):

            if k != a:

                avgLinkDist     = 0.0
                neighborCluster = clusters[k]

                for gene1 in rootCluster:

                    for gene2 in neighborCluster:

                        avgLinkDist += manhattan(gene1,gene2,myDB)

                n           = float(len(rootCluster))
                m           = float(len(neighborCluster))
                avgLinkDist /= n
                avgLinkDist /= m

                rowA.append(avgLinkDist)

            else:
                
                rowA.append(float('inf'))

            k+=1

        # Here new row for a <- a+b built

        '''
        Here a row containing cluster a's new distances is created.
        Now it is necessary to update cluster a's row and column in the
            distance matrix.
        '''

        # Update cluster a's row in distance matrix
        distMatrix[a] = rowA

        # Update cluster a's column in distance matrix
        rowItr = 0
        while rowItr < len(distMatrix):
            distMatrix[rowItr][a] = rowA[rowItr]
            rowItr += 1
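
To see the merge loop in isolation, here is a compact sketch on toy data. It is a simplification under stated assumptions: it recomputes average-link distances from scratch on every pass instead of maintaining the distance matrix incrementally as the function above does, and the names `l1`, `avg_link`, and `agnes_cut` are hypothetical.

```python
def l1(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def avg_link(c1, c2, db):
    """Average-link distance: mean L1 distance over all cross-cluster pairs."""
    return sum(l1(db[g1], db[g2]) for g1 in c1 for g2 in c2) / (len(c1) * len(c2))

def agnes_cut(db, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[gene] for gene in db]
    while len(clusters) > k:
        best = (float('inf'), -1, -1)
        for i in range(len(clusters)):
            # Distance is symmetric, so only the upper triangle is scanned.
            for j in range(i + 1, len(clusters)):
                d = avg_link(clusters[i], clusters[j], db)
                if d < best[0]:
                    best = (d, i, j)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected by the deletion
    return clusters

# Hypothetical 1-feature data with two obvious pairs and one outlier.
toy_db = {0: [0.0], 1: [1.0], 2: [10.0], 3: [11.0], 4: [50.0]}
print(agnes_cut(toy_db, 3))
print(agnes_cut(toy_db, 2))
```

Recomputing every distance keeps the sketch short but costs far more than updating only the merged cluster's row and column, which is why the full function rebuilds just row and column `a` after each merge.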

*Driver for AGNES Algorithm.*

In [None]:
def main():

    myDB        = init_DB(fetch_filename())

    clusters    = init_clusters(myDB)

    print('\nAGNES typically runs in 10 to 60 seconds depending on hardware.\n')

    print('Outputting to *_clusters_result-AdamKim.txt.\n')

    AGNES(clusters,myDB)

if __name__ == "__main__":
    main()

Output with cut points at 10, 30, and 50 clusters.

For brevity, only the output for 50 clusters is attached below. Each entry lists the cluster size followed by its member genes.

In [None]:
310 : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 
       37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 
       54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 
       71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 
       88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 
       146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 
       160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 
       174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 
       188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 
       203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 
       218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 
       233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 248, 249, 
       250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 
       265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 277, 278, 279, 280, 
       281, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 295, 296, 297, 298, 
       299, 300, 301, 302, 303, 304, 305, 306, 313, 314, 319, 320, 323, 400, 401, 
       404, 426, 482}
    
46 : {374, 376, 377, 378, 380, 383, 384, 385, 386, 387, 388, 389, 390, 392, 
      393, 394, 396, 397, 398, 399, 402, 403, 405, 406, 407, 409, 414, 415, 
      427, 431, 432, 433, 434, 435, 436, 437, 438, 442, 443, 444, 445, 446, 
      447, 449, 450, 451}

25 : {342, 354, 355, 359, 360, 363, 364, 365, 366, 368, 369, 373, 412, 413, 
      416, 417, 420, 457, 458, 459, 460, 461, 462, 463, 480}

13 : {282, 283, 335, 336, 337, 338, 341, 349, 350, 353, 466, 478, 479}

13 : {294, 307, 308, 310, 311, 312, 315, 316, 317, 318, 321, 322, 381}

13 : {356, 357, 367, 370, 371, 372, 467, 468, 469, 470, 471, 472, 47}

7 : {410, 411, 419, 440, 441, 452, 464}
    
2 : {327, 328}

2 : {329, 332}

2 : {343, 344}

2 : {345, 361}

2 : {429, 453}

2 : {477, 493}

2 : {484, 486}

2 : {490, 491}

3 : {379, 382, 428}

3 : {430, 439, 448}

3 : {497, 498, 499}

4 : {276, 330, 483, 485}

4 : {309, 324, 325, 326}

4 : {395, 421, 422, 423}

4 : {481, 488, 494, 495}

5 : {334, 339, 340, 347, 348}

1 : {246}

1 : {247}

1 : {331}

1 : {333}

1 : {346}

1 : {351}

1 : {352}

1 : {358}

1 : {362}

1 : {375}

1 : {391}

1 : {408}

1 : {418}

1 : {424}

1 : {425}

1 : {454}

1 : {455}

1 : {456}

1 : {465}

1 : {474}

1 : {475}

1 : {476}

1 : {487}

1 : {489}

1 : {492}

1 : {496}

1 : {500}
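
The `report` function called from `AGNES` is not shown in this notebook. As a rough, hypothetical sketch of how lines in the `size : {members}` layout above could be produced, the helper below formats a list of clusters. The name `format_clusters`, the 1-based label offset, and the largest-first ordering are all assumptions, not the author's implementation.

```python
def format_clusters(clusters, offset=1):
    """Return one 'size : {members}' line per cluster, largest cluster first."""
    lines = []
    for members in sorted(clusters, key=len, reverse=True):
        # Assumption: dictionary keys are 0-based and labels are printed 1-based.
        labels = ", ".join(str(gene + offset) for gene in sorted(members))
        lines.append(f"{len(members)} : {{{labels}}}")
    return lines

# Hypothetical clusters over five genes.
for line in format_clusters([[0, 1], [4], [2, 3]]):
    print(line)
```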