# Hierarchical Clustering - UPGMA

---

## Before Class
In this week's class we will be implementing UPGMA for hierarchical clustering. We will start today by implementing a distance matrix.

Prior to class, please do the following:
1. Review the structure of a distance matrix
2. Compare Hamming and Levenshtein distances
3. Re-familiarize yourself with numpy arrays

---
## Learning Objectives

1. Implement a simple distance metric
2. Build a distance matrix using the metric and a set of alignments

---
## Background
Today we will start implementing a frequently used hierarchical clustering algorithm from Sokal and Michener (1958) called UPGMA (unweighted pair group method using arithmetic averages). The first step of this algorithm requires that we build a distance matrix for all of the alignments that we will be clustering. To accomplish this today, we will be using the Hamming distance between the sequences. The Hamming distance is the number of positions at which two strings differ (equivalently, the number of single-character substitutions required to make the two strings match exactly). Unlike a general edit distance such as Levenshtein, it does not allow insertions or deletions, so it requires that the strings are the same length. Luckily, our previous work on alignments results in strings that are all the same length and optimally aligned! We will be using those strings as input.
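
To make the difference between the two metrics from the pre-class reading concrete, here is a minimal sketch that compares them side by side. The function names `hamming` and `levenshtein` are illustrative only and are not part of the class code; the Levenshtein version is a standard row-by-row dynamic programming formulation, shown only for comparison.

```python
def hamming(a, b):
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        previous = current
    return previous[-1]


# Aligned, equal-length strings: Hamming applies directly
print(hamming("TA-TTTA", "TA-TTAA"))

# Unaligned strings of different lengths: only Levenshtein is defined
print(levenshtein("TATTTA", "TACTTAA"))
```

Because our inputs are already aligned (gaps are explicit `-` characters), the simpler Hamming distance is sufficient for this class.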


---
## Imports

In [7]:
import numpy as np

---
## Distance metrics for comparing sequences

### Hamming Distance



In [8]:
def hamming_distance(alignment1, alignment2): 
    ''' Function to calculate Hamming distance between two alignments
    
    Args: 
        alignment1 (str): first sequence that has already been aligned
        alignment2 (str): second sequence that has already been aligned

    Returns:
        distance (int): Hamming distance between the two alignments
    
    '''
    # Make sure that the alignments are the same length
    assert len(alignment1) == len(alignment2)
    
    # Initialize distance
    distance = 0
    
    # Compare the alignments position by position and add to distance where they differ
    for base_1, base_2 in zip(alignment1, alignment2):
        if base_1 != base_2:
            distance += 1
            
    return distance

In [9]:
# Example data from the slides:
alignment1 = "TA-TTTA"
alignment2 = "TA-TTAA"
print(hamming_distance(alignment1, alignment2))

alignment1 = "TA-TTCCA"
alignment2 = "TA-TTAAC"
print(hamming_distance(alignment1, alignment2))

alignment1 = "TA-TTAAC"
alignment2 = "TA-TTAAC"
print(hamming_distance(alignment1, alignment2))

1
3
0
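
Since we are brushing up on numpy arrays, the same calculation can also be sketched with an elementwise comparison instead of an explicit loop. This is only one possible variant: the name `hamming_distance_np` is hypothetical, and raising `ValueError` rather than using `assert` is an assumption made here to match the error handling style of the distance matrix code.

```python
import numpy as np


def hamming_distance_np(alignment1, alignment2):
    """Hypothetical NumPy variant of hamming_distance (not from the slides)."""
    if len(alignment1) != len(alignment2):
        raise ValueError("Undefined for alignments of unequal length")

    # Split each string into an array of single characters
    bases_1 = np.array(list(alignment1))
    bases_2 = np.array(list(alignment2))

    # The elementwise comparison gives a boolean array; count the True entries
    return int(np.count_nonzero(bases_1 != bases_2))


print(hamming_distance_np("TA-TTTA", "TA-TTAA"))
print(hamming_distance_np("TA-TTCCA", "TA-TTAAC"))
print(hamming_distance_np("TA-TTAAC", "TA-TTAAC"))
```

Note that a gap character (`-`) is treated like any other symbol, so a gap aligned against a base counts as one mismatch.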


In [10]:
def build_distance_matrix(alignments): 
    ''' Function to build a distance matrix from a list of alignments
    This is a number of alignments x number of alignments matrix with 
    all pairwise distances (and 0 along the diagonal).
    All alignments must be same length!
    
    Args: 
        alignments (list of strings): a list of our sequence alignments

    Returns:
        distance_matrix (np.array of floats): n x n distance matrix
    
    '''
    # Make sure that all alignments are the same length
    for i in range(1, len(alignments)):
        if len(alignments[0]) != len(alignments[i]):
            raise ValueError("Undefined for alignments of unequal length")
       
    # Initialize a matrix of zeros (floats)
    distance_matrix = np.zeros((len(alignments), len(alignments)), dtype=float)
    
    # Compare all of the alignments and store their distances
    for i, alignment1 in enumerate(alignments):
        for j, alignment2 in enumerate(alignments):
            distance_matrix[i, j] = hamming_distance(alignment1, alignment2)
            
    return distance_matrix    


In [11]:
# Example data from the slides:
alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]

D = build_distance_matrix(alignments)
print(D)

[[0. 1. 0. 2. 2.]
 [1. 0. 1. 2. 1.]
 [0. 1. 0. 2. 2.]
 [2. 2. 2. 0. 1.]
 [2. 1. 2. 1. 0.]]
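
The output above is symmetric with zeros on the diagonal, so each pair only needs to be compared once. The following is a rough sketch of that idea, assuming a small stand-in `hamming` helper and a hypothetical function name `build_distance_matrix_symmetric`; it is an alternative to the version above, not a replacement for it.

```python
from itertools import combinations

import numpy as np


def hamming(a, b):
    """Stand-in for hamming_distance so that this sketch runs on its own."""
    return sum(x != y for x, y in zip(a, b))


def build_distance_matrix_symmetric(alignments):
    """Hypothetical variant that computes each pairwise distance only once."""
    n = len(alignments)
    if any(len(a) != len(alignments[0]) for a in alignments):
        raise ValueError("Undefined for alignments of unequal length")

    # The diagonal stays 0 because every alignment is identical to itself
    distance_matrix = np.zeros((n, n), dtype=float)

    # combinations yields every pair (i, j) with i < j exactly once
    for i, j in combinations(range(n), 2):
        distance = hamming(alignments[i], alignments[j])
        distance_matrix[i, j] = distance
        distance_matrix[j, i] = distance  # mirror across the diagonal

    return distance_matrix


alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]
D = build_distance_matrix_symmetric(alignments)
print(D)
print(np.array_equal(D, D.T))
```

For n alignments this makes n(n-1)/2 comparisons instead of n², which matters little for five sequences but adds up for larger inputs.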


In [12]:
def get_min_distance(matrix):
    ''' Function to find the smallest off-diagonal value in the distance
    matrix provided. This is used in the UPGMA algorithm.
    
    Args: 
        matrix (2D numpy array): a distance matrix

    Returns:
        minimum (float): The smallest off-diagonal distance in the matrix
        position (tuple): The (row, column) indices of that distance
    
    '''
    
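    # Set the starting minimum value to be large; position stays None
    # if the matrix has no off-diagonal entries (fewer than two rows)
    minimum = float('inf')
    position = None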

    # Iterate through half the matrix to find the minimum score
    # We could do the whole matrix, but this is a bit more efficient
    # because the matrix is symmetric
    for i in range(0, matrix.shape[0]):
        for j in range(i+1, matrix.shape[1]):
            if matrix[i][j] < minimum:
                minimum = matrix[i][j]
                position = (i,j)

    return minimum, position


In [14]:
# Example data from the slides:
alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]

D = build_distance_matrix(alignments)
get_min_distance(D)

(0.0, (0, 2))

Expected output:
(0.0, (0,2))
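
The same search over the upper triangle can also be sketched with numpy indexing. The name `get_min_distance_np` is hypothetical, and raising an error for matrices with fewer than two rows is an assumption of this sketch rather than behavior required by the class code. Ties are resolved the same way as in the loop version, because `np.triu_indices` lists positions in row-major order and `np.argmin` returns the first occurrence of the minimum.

```python
import numpy as np


def get_min_distance_np(matrix):
    """Hypothetical NumPy variant of get_min_distance."""
    matrix = np.asarray(matrix, dtype=float)

    # Row and column indices of every entry strictly above the diagonal
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    if rows.size == 0:
        raise ValueError("Need at least two items to find a minimum distance")

    # Pull out the upper-triangle values and locate the first smallest one
    values = matrix[rows, cols]
    best = np.argmin(values)

    return float(values[best]), (int(rows[best]), int(cols[best]))


D = np.array([
    [0., 1., 0., 2., 2.],
    [1., 0., 1., 2., 1.],
    [0., 1., 0., 2., 2.],
    [2., 2., 2., 0., 1.],
    [2., 1., 2., 1., 0.],
])
print(get_min_distance_np(D))
```

Skipping the diagonal matters here: every diagonal entry is 0, so a plain minimum over the whole matrix would always report a sequence as closest to itself.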