# Creating a Distance Matrix

## Background Info

One approach to building phylogenies is **distance-based phylogeny**, which constructs a tree from evolutionary distances calculated between pairs of taxa. In this problem, we consider a distance function based on the *Hamming distance*, which compares two homologous strands of DNA by counting the minimum possible number of point mutations that could have occurred on the evolutionary path between them.

* **Distance-based phylogeny**: The use of a distance matrix to construct a phylogeny.
* **Homologous**: Descending from the same ancestor.

## Problem

For two strings $s_1$ and $s_2$ of equal length, the **p-distance** between them, denoted $d_p(s_1, s_2)$, is the proportion of corresponding symbols that differ between $s_1$ and $s_2$. For a distance function $d$ on $n$ taxa $s_1, \ldots, s_n$, the distances between pairs of taxa can be encoded in a **distance matrix** $D$, where $D_{i,j} = d(s_i, s_j)$.

**Given**: A collection of $n \ (n \le 10)$ DNA strings $s_1,...,s_n$ of equal length. Strings are given in FASTA format.<br>
**Return**: The matrix $D$ corresponding to the p-distance $d_p$ on the given strings.
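
As a quick check of the definition, here is a minimal sketch that computes the p-distance for two short strings. The function name `p_distance` and the example strings are illustrative assumptions and are not part of the problem statement.

```python
def p_distance(s1, s2):
    """Return the proportion of positions at which s1 and s2 differ."""
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # Count mismatching positions, then divide by the string length.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)


# Hypothetical example strings: they differ at 1 of 4 positions.
print(p_distance("ACGT", "ACGA"))  # 0.25
```

The length check is a design choice for the sketch: the problem guarantees equal-length strings, but failing loudly makes a malformed input easier to spot.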

# Solution Explained

To obtain the p-distance matrix, we calculate the p-distance between every pair of the given DNA strings, including each string paired with itself (which gives the zeros on the diagonal).

In [2]:
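# read in the sample FASTA file containing the DNA strings
with open('../Sample_Creating_a_Distance_Matrix.txt', 'r') as f:
    l = f.read().splitlines()
d = {}
for s in l:
    if s.startswith('>'):
        # header line: start a new record named after the text following '>'
        name = s[1:]
        d[name] = ''
    else:
        # sequence line: append to the current record (a sequence may span lines)
        d[name] += s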
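
Since the same parsing loop is needed again for the full dataset, it could be wrapped in a helper. The following is only a sketch: the name `parse_fasta` is hypothetical, and it accepts any iterable of lines so that it works on an open file object or on a list of strings.

```python
def parse_fasta(lines):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            name = line[1:]
            records[name] = ''
        elif name is not None:
            # Sequence lines may wrap, so append to the current record.
            records[name] += line
    return records


# Hypothetical in-memory input; with a file, pass the open file object instead.
example = ['>seq_1', 'ACGT', 'ACGT', '>seq_2', 'TTTT']
print(parse_fasta(example))  # {'seq_1': 'ACGTACGT', 'seq_2': 'TTTT'}
```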

In [3]:
def p_matrix(dic):
    """Print the p-distance matrix"""
    for id1 in dic.keys():
        print(get_row_of_p_dist(id1, dic))

def get_row_of_p_dist(id1, dic):
    """Return the p-distances between DNA with id1 and all the given DNA sequences"""
    p_dist_row = ''
    for id2 in dic.keys():
        p_dist_row += ' ' + str(p_dist(id1, id2, dic))
    return p_dist_row[1:]

def p_dist(id1, id2, dic):
    """Return the p-distance between DNA sequence with id1 and DNA sequence with id2"""
    s1 = dic[id1]
    s2 = dic[id2]
    # -- code 1: using for loop
    # divisor = len(s1)
    # dividend = 0
    # for i in range(len(s1)):
    #     if s1[i] != s2[i]:
    #         dividend += 1
    # return dividend / divisor
    # -- code 2: using a generator expression over paired symbols
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)



In [4]:
p_matrix(d)

0.0 0.4 0.1 0.1
0.4 0.0 0.4 0.3
0.1 0.4 0.0 0.2
0.1 0.3 0.2 0.0
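
The matrix is symmetric with a zero diagonal, so only the pairs with $i < j$ need to be computed. A minimal sketch under that observation follows. The names `distance_matrix` and `format_matrix` are hypothetical, the input strings are made up for illustration, and the five-decimal formatting is an assumption rather than something taken from this notebook.

```python
def distance_matrix(seqs):
    """Return the p-distance matrix of equal-length strings as a list of lists."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            # D[i][j] == D[j][i], so fill both entries at once.
            matrix[i][j] = matrix[j][i] = diff / len(seqs[i])
    return matrix


def format_matrix(matrix, decimals=5):
    """Format each row as space-separated values with a fixed number of decimals."""
    return '\n'.join(' '.join(f'{x:.{decimals}f}' for x in row) for row in matrix)


# Made-up input strings, only to exercise the functions.
seqs = ['AAAA', 'AAAT', 'TTTT']
print(format_matrix(distance_matrix(seqs)))
```

Returning the matrix as data and formatting it separately keeps the computation reusable, for example as input to a tree-building step.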


## Actual Dataset

In [6]:
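# read in the actual dataset, parsed the same way as the sample file
with open('../rosalind_pdst.txt', 'r') as f:
    l = f.read().splitlines()
d = {}
for s in l:
    if s.startswith('>'):
        # header line: start a new record
        name = s[1:]
        d[name] = ''
    else:
        # sequence line: append to the current record
        d[name] += s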

In [7]:
p_matrix(d)

0.0 0.5823244552058111 0.4745762711864407 0.3305084745762712 0.662227602905569 0.6719128329297821 0.4963680387409201 0.559322033898305 0.6416464891041163
0.5823244552058111 0.0 0.324455205811138 0.5181598062953995 0.49515738498789347 0.5 0.5847457627118644 0.5096852300242131 0.3365617433414044
0.4745762711864407 0.324455205811138 0.0 0.3196125907990315 0.5799031476997578 0.6222760290556901 0.49031476997578693 0.3268765133171913 0.5121065375302664
0.3305084745762712 0.5181598062953995 0.3196125907990315 0.0 0.635593220338983 0.6464891041162227 0.3196125907990315 0.49031476997578693 0.5968523002421308
0.662227602905569 0.49515738498789347 0.5799031476997578 0.635593220338983 0.0 0.463680387409201 0.6549636803874092 0.6331719128329297 0.3135593220338983
0.6719128329297821 0.5 0.6222760290556901 0.6464891041162227 0.463680387409201 0.0 0.6731234866828087 0.6707021791767555 0.30145278450363194
0.4963680387409201 0.5847457627118644 0.49031476997578693 0.3196125907990315 0.6549636803874092 0.

## Problem solved!