# Hierarchical Clustering - UPGMA

---

## Before Class
In this week's class we will be implementing UPGMA for hierarchical clustering. We will start with the implementation of a distance matrix today.

Prior to class, please do the following:
1. Review the structure of a distance matrix
2. Compare Hamming and Levenshtein distances
3. Re-familiarize yourself with numpy arrays

---
## Learning Objectives

1. Implement a simple distance metric
2. Build a distance matrix using the metric and a set of alignments

---
## Background
Today we will start implementing a frequently used hierarchical clustering algorithm from Sokal and Michener (1958) called UPGMA (unweighted pair group method using arithmetic averages). The first step of this algorithm is to build a distance matrix for all of the alignments that we will be clustering. To do this today, we will use the Hamming distance between sequences. The Hamming distance is the number of positions at which two strings differ, which equals the minimum number of substitutions needed to make the two strings match exactly. Unlike a general edit distance such as Levenshtein, it does not allow insertions or deletions, so it requires that the strings are the same length. Luckily, our previous work on alignments results in strings that are all the same length and optimally aligned! We will be using those strings as input.
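
As a compact restatement of the definition, consider two aligned strings $a$ and $b$ of equal length $n$. The symbols $a$, $b$, $n$ and $d_H$ are introduced here only for illustration, and $\mathbf{1}[\cdot]$ is 1 when its condition holds and 0 otherwise:

```latex
d_H(a, b) = \sum_{i=1}^{n} \mathbf{1}[a_i \neq b_i]
```

For example, `TA-TTTA` and `TA-TTAA` differ only at the sixth position, so their Hamming distance is 1.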


---
## Imports

In [1]:
import numpy as np

---
## Distance metrics for comparing sequences

### Hamming Distance



In [2]:
def hamming_distance(alignment1, alignment2):
    ''' Function to calculate Hamming distance between two alignments
    
    Args: 
        alignment1 (str): first sequence that has already been aligned
        alignment2 (str): second sequence that has already been aligned

    Returns:
        distance (int): Hamming distance between the two alignments
    
    '''
    distance = 0
    # Pair up the characters at each position of the two alignments.
    # Note: zip stops at the shorter string, so inputs should be the same length.
    for char1, char2 in zip(alignment1, alignment2):
        if char1 != char2:
            # Increment whenever the two characters do not match.
            distance += 1
    return distance
    

In [3]:
test1 = "AATTCC"
test2 = "XXYYZZ"

cache = zip(test1, test2)
list(cache)

[('A', 'X'), ('A', 'X'), ('T', 'Y'), ('T', 'Y'), ('C', 'Z'), ('C', 'Z')]

In [4]:
hamming_distance("AATTCC", "AAtcgh")

4

In [5]:
# Example data from the slides:
alignment1 = "TA-TTTA"
alignment2 = "TA-TTAA"
print(hamming_distance(alignment1, alignment2))

alignment1 = "TA-TTCCA"
alignment2 = "TA-TTAAC"
print(hamming_distance(alignment1, alignment2))

alignment1 = "TA-TTAAC"
alignment2 = "TA-TTAAC"
print(hamming_distance(alignment1, alignment2))

1
3
0
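
Because `zip` stops at the end of the shorter string, `hamming_distance` silently ignores trailing characters when its inputs differ in length. A minimal sketch of a stricter variant follows. The name `hamming_distance_strict` and the choice to raise `ValueError` are assumptions made for illustration, not part of the assignment.

```python
def hamming_distance_strict(alignment1, alignment2):
    """Hamming distance that rejects inputs of different lengths (illustrative variant)."""
    if len(alignment1) != len(alignment2):
        raise ValueError("alignments must be the same length")
    # Count the positions where the paired characters differ.
    return sum(char1 != char2 for char1, char2 in zip(alignment1, alignment2))


print(hamming_distance_strict("TA-TTTA", "TA-TTAA"))  # prints 1
```

Failing early here avoids building a distance matrix from sequences that were never aligned to a common length.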


In [6]:
def build_distance_matrix(alignments): 
    ''' Function to build a distance matrix from a list of alignments
    This is a number of alignments x number of alignments matrix with 
    all pairwise distances (and 0 along the diagonal).
    All alignments must be same length!
    
    Args: 
        alignments (list of strings): a list of our sequence alignments

    Returns:
        distance_matrix (np.array of floats): n x n distance matrix
    
    '''
    # The matrix has one row and one column per alignment, so its size is
    # len(alignments) x len(alignments).
    # The diagonal is 0 because the Hamming distance between an alignment
    # and itself is 0.
    n = len(alignments)

    # Compute the Hamming distance for every ordered pair of alignments,
    # row by row, and collect the results in a flat list.
    distances = []
    for index1 in range(n):
        for index2 in range(n):
            distances.append(hamming_distance(alignments[index1], alignments[index2]))

    # Convert the flat list of n * n distances into an n x n float matrix.
    distance_matrix = np.array(distances, dtype=float).reshape(n, n)
    return distance_matrix


In [7]:
# Example data from the slides:
alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]
D = build_distance_matrix(alignments)
print(D)

[[0. 1. 0. 2. 2.]
 [1. 0. 1. 2. 1.]
 [0. 1. 0. 2. 2.]
 [2. 2. 2. 0. 1.]
 [2. 1. 2. 1. 0.]]
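
The matrix above is symmetric, since the Hamming distance from one alignment to another is the same in both directions. One possible variant, sketched here, computes each pair once and mirrors it. The names `hamming` and `build_distance_matrix_symmetric` are hypothetical and are defined locally so the sketch runs on its own.

```python
import numpy as np


def hamming(a, b):
    # Local helper so the sketch is self-contained; counts mismatched positions.
    return sum(x != y for x, y in zip(a, b))


def build_distance_matrix_symmetric(alignments):
    """Fill only the upper triangle and mirror it (illustrative variant)."""
    n = len(alignments)
    distance_matrix = np.zeros((n, n), dtype=float)  # diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(alignments[i], alignments[j])
            distance_matrix[i, j] = d
            distance_matrix[j, i] = d  # the distance is symmetric
    return distance_matrix


alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]
D_sym = build_distance_matrix_symmetric(alignments)
print(D_sym)
```

This roughly halves the number of Hamming distance calls, which matters little for five alignments but adds up for larger inputs.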


In [8]:
mylist = ["amelia", "ameeia", "ammlia", "aaelia"]
for index_1 in range(len(mylist)):
    for index_2 in range(index_1 + 1, len(mylist)):   # Start at index_1 + 1 to skip self-comparisons and pairs already seen.
        print(mylist[index_1], mylist[index_2])
        test = hamming_distance(mylist[index_1], mylist[index_2])
        # print(test)
        

amelia ameeia
amelia ammlia
amelia aaelia
ameeia ammlia
ameeia aaelia
ammlia aaelia
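
The same unordered pairs can also be generated with `itertools.combinations` from the standard library. This is an alternative sketch, not the approach used in the cell above.

```python
from itertools import combinations

mylist = ["amelia", "ameeia", "ammlia", "aaelia"]

# combinations(mylist, 2) yields each unordered pair exactly once,
# in the same order as the nested index loops.
pairs = list(combinations(mylist, 2))
for first, second in pairs:
    print(first, second)
```

With four items this gives six pairs, matching the six lines printed by the nested loops.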


In [16]:
alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]
index1 = 0
alignments[index1]
test_list = []

for index2 in range(index1 + 1, len(alignments)):
    # print(alignments[index2])
    # print(alignments[index1], alignments[index2])
    test = hamming_distance(alignments[index1], alignments[index2])
    test_list.append(test)
# print(test_list)
test_final = sum(test_list)
print(test_final)
type(test_final)

5


int

In [78]:
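# Size the matrix by the number of alignments; the sum of distances above only happens to equal 5 as well.
matrix = np.zeros((len(alignments), len(alignments)), dtype = int)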
matrix

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [16]:
test1 = ["AATTCCTT"]
matrix = np.zeros(shape = (len(test1), len(test1)))
matrix

array([[0.]])

In [72]:
matrix = np.empty((len(test1), len(test1)), dtype = float)
matrix

array([[0.]])

In [73]:
test = np.zeros((3,4))
test

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [18]:
matrix = np.zeros((4,5))
matrix

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [17]:
def get_min_distance(matrix):
    ''' Function to find the smallest off-diagonal value in the distance
    matrix provided. This is used in the UPGMA algorithm.
    
    Args: 
        matrix (2D numpy array): a distance matrix

    Returns:
        min (float): The smallest distance in the matrix
        pos (tuple): The x and y position of the smallest distance
    
    '''
    min_score = None
    pos = (0, 0)
    n_rows, n_cols = matrix.shape
    # The matrix is symmetric, so only the upper triangle needs to be scanned.
    for index1 in range(n_rows):  # x coordinate
        # Starting at index1 + 1 skips the diagonal, which is always 0.
        for index2 in range(index1 + 1, n_cols):  # y coordinate
            # Keep the smallest off-diagonal value seen so far.
            if min_score is None or matrix[index1][index2] < min_score:
                min_score = matrix[index1][index2]
                pos = (index1, index2)

    return min_score, pos

# The minimum is taken over the off-diagonal entries only.
# Identical alignments have an off-diagonal distance of 0, which is a valid minimum.

In [18]:
# Example data from the slides:
alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]

D = build_distance_matrix(alignments)
get_min_distance(D)

(0.0, (0, 2))

Expected output:
(0.0, (0,2))
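
The same search can be written with NumPy by masking the diagonal before taking the minimum. This is a sketch under the assumption that the input is a square matrix with at least two rows. The name `get_min_distance_np` is hypothetical, and the example matrix is copied from the output shown earlier.

```python
import numpy as np


def get_min_distance_np(matrix):
    """Smallest off-diagonal value and its position (illustrative NumPy variant)."""
    masked = np.array(matrix, dtype=float)  # copy so the input is not modified
    np.fill_diagonal(masked, np.inf)        # exclude the diagonal from the search
    flat_index = np.argmin(masked)          # first occurrence of the minimum
    row, col = np.unravel_index(flat_index, masked.shape)
    return float(masked[row, col]), (int(row), int(col))


D = np.array([
    [0, 1, 0, 2, 2],
    [1, 0, 1, 2, 1],
    [0, 1, 0, 2, 2],
    [2, 2, 2, 0, 1],
    [2, 1, 2, 1, 0],
], dtype=float)
print(get_min_distance_np(D))  # (0.0, (0, 2))
```

Because `np.argmin` returns the first occurrence in row-major order, ties resolve to the position with the smallest row index, which is in the upper triangle for a symmetric matrix.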

In [84]:
for index in range(0, 4):
    print(index)

0
1
2
3


In [62]:
print(D)

[[0. 1. 0. 2. 2.]
 [1. 0. 1. 2. 1.]
 [0. 1. 0. 2. 2.]
 [2. 2. 2. 0. 1.]
 [2. 1. 2. 1. 0.]]


In [71]:
x = D.min()
print(x)
diag = np.diag(D)
print(diag)

if x not in diag:
    final = np.argwhere(D == x)
    print(final)

0.0
[0. 0. 0. 0. 0.]


In [66]:
np.argwhere(D == x)

array([[0, 0],
       [0, 2],
       [1, 1],
       [2, 0],
       [2, 2],
       [3, 3],
       [4, 4]])
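
The `np.argwhere` result above includes the diagonal and both mirrored positions. One way to keep only the entries above the diagonal is to filter on row < column, as in the sketch below. Note the assumption that the global minimum also occurs off the diagonal. If the only zeros were on the diagonal, the filtered result would be empty.

```python
import numpy as np

D = np.array([
    [0, 1, 0, 2, 2],
    [1, 0, 1, 2, 1],
    [0, 1, 0, 2, 2],
    [2, 2, 2, 0, 1],
    [2, 1, 2, 1, 0],
], dtype=float)

matches = np.argwhere(D == D.min())
# Keep only positions strictly above the diagonal (row < column).
upper = matches[matches[:, 0] < matches[:, 1]]
print(upper.tolist())  # [[0, 2]]
```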

In [None]:
# Draft: scan every cell of the distance matrix and track the smallest
# off-diagonal value, skipping the diagonal by position rather than by value.
matrix = D
min_score = None
pos = (0, 0)

n_rows, n_cols = matrix.shape
for index_1 in range(n_rows):
    for index_2 in range(n_cols):
        if index_1 == index_2:
            continue  # diagonal: distance of an alignment to itself
        if min_score is None or matrix[index_1][index_2] < min_score:
            min_score = matrix[index_1][index_2]
            pos = (index_1, index_2)

In [204]:
np.diag(D)

array([0., 0., 0., 0., 0.])

In [74]:
for i in range(0,5):
    print(i)

0
1
2
3
4


In [205]:
np.min(D)

0.0

In [61]:
x = D.min()
final = None

if x not in np.diag(D):
    final = x
    print(final)

In [35]:
test = list(D.shape)
test[0]
type(test[0])
print(range(test[0]))
for i in range(test[0]):
    print(i)

range(0, 5)
0
1
2
3
4


In [30]:
test1 = D.shape
test1[0]

5