Dan Shea  
2021-06-16  

#### Problem
For two strings $s_1$ and $s_2$ of equal length, the p-distance between them, denoted $d_p(s_1,s_2)$, is the proportion of corresponding symbols that differ between $s_1$ and $s_2$.
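
As a minimal sketch of this definition, the p-distance can be computed by counting the mismatched positions and dividing by the string length. The name `p_distance` and the explicit input checks are illustrative assumptions and are not part of the problem statement.

```python
def p_distance(s1: str, s2: str) -> float:
    """Proportion of aligned positions at which s1 and s2 differ."""
    if len(s1) != len(s2):
        raise ValueError("p-distance is defined only for strings of equal length")
    if not s1:
        raise ValueError("p-distance is undefined for empty strings")
    # Each True in the generator counts as 1, so sum() yields the mismatch count.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)


# One differing position out of four gives 1/4.
print(p_distance("ACGT", "ACGA"))  # 0.25
```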

For a general distance function $d$ on $n$ taxa $s_1,s_2,\ldots,s_n$ (taxa are often represented by genetic strings), we may encode the distances between pairs of taxa via a distance matrix $D$ in which $D_{i,j}=d(s_i,s_j)$.
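
The encoding above can be sketched for an arbitrary distance function. This sketch assumes that $d$ is symmetric and that $d(s,s)=0$, so only pairs with $i<j$ are evaluated and the diagonal is left at zero. The names `distance_matrix` and `hamming_fraction` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def distance_matrix(
    taxa: Sequence[str], d: Callable[[str, str], float]
) -> List[List[float]]:
    """Build D with D[i][j] = d(taxa[i], taxa[j]).

    Assumes d is symmetric and d(s, s) == 0.
    """
    n = len(taxa)
    D = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair once and mirror it across the diagonal.
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = d(taxa[i], taxa[j])
    return D


def hamming_fraction(a: str, b: str) -> float:
    """Example distance: fraction of mismatched positions (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


D = distance_matrix(["AAAA", "AAAT", "TTTT"], hamming_fraction)
print(D[0][1], D[0][2], D[1][2])  # 0.25 1.0 0.75
```

Passing the distance as a parameter keeps the matrix construction independent of the particular metric, which mirrors the "general distance function" wording of the definition.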

__Given:__ A collection of $n$ $(n \leq 10)$ DNA strings $s_1,\ldots,s_n$ of equal length (at most 1 kbp). Strings are given in FASTA format.

__Return:__ The matrix $D$ corresponding to the p-distance $d_p$ on the given strings. As always, note that your answer is allowed an absolute error of $0.001$.
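
One plausible reading of the error allowance is an entry-by-entry comparison with an absolute tolerance. The sketch below shows such a check; the helper name `within_tolerance` is hypothetical, and it is not claimed to be how answers are actually graded.

```python
import math


def within_tolerance(got, expected, tol=0.001):
    """Return True if both matrices have the same shape and every pair of
    corresponding entries differs by at most tol in absolute value."""
    if len(got) != len(expected):
        return False
    for row_got, row_expected in zip(got, expected):
        if len(row_got) != len(row_expected):
            return False
        for g, e in zip(row_got, row_expected):
            # rel_tol=0.0 makes this a purely absolute comparison.
            if not math.isclose(g, e, rel_tol=0.0, abs_tol=tol):
                return False
    return True


print(within_tolerance([[0.0, 0.4004]], [[0.0, 0.4]]))  # True
print(within_tolerance([[0.0, 0.402]], [[0.0, 0.4]]))   # False
```

Under this reading, printing five decimal places is comfortably precise, since rounding at the fifth decimal changes a value by at most $0.000005$.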

##### Sample Dataset
```
>Rosalind_9499
TTTCCATTTA
>Rosalind_0942
GATTCATTTC
>Rosalind_6568
TTTCCATTTT
>Rosalind_1833
GTTCCATTTA
```
##### Sample Output
```
0.00000 0.40000 0.10000 0.10000
0.40000 0.00000 0.40000 0.30000
0.10000 0.40000 0.00000 0.20000
0.10000 0.30000 0.20000 0.00000
```

In [1]:
from Bio import SeqIO

In [2]:
import math

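def compute_d(i, j):
    # p-distance: fraction of aligned positions at which the two sequences differ.
    # Assumes len(i) == len(j), as the problem guarantees; zip() would otherwise
    # silently truncate to the shorter sequence.
    mismatch = sum(1 for a, b in zip(i, j) if a != b)
    return mismatch / len(i)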

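def print_ans(d):
    # d is a flattened n*n matrix in row-major order; recover n from its length.
    # math.isqrt (Python 3.8+) avoids floating-point rounding in the square root.
    i = math.isqrt(len(d))
    for a in range(i):
        for b in range(i):
            print(f'{d[a*i+b]:0.5f}', end=' ')
        print('')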

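def parse_file_print_ans(filename):
    with open(filename, 'r') as fh:
        # Read every FASTA record and keep only its sequence.
        seqs = [s.seq for s in SeqIO.parse(fh, 'fasta')]
    n = len(seqs)
    # Flattened n*n matrix in row-major order; the diagonal stays at 0.0.
    d = [0.0] * n**2
    for i in range(n - 1):
        # d is symmetric, so compute each unordered pair (i, j) with i < j once
        # and mirror it, instead of revisiting pairs and the diagonal.
        for j in range(i + 1, n):
            distance = compute_d(seqs[i], seqs[j])
            d[i*n+j] = distance
            d[j*n+i] = distance
    print_ans(d)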

In [3]:
parse_file_print_ans('sample.txt')

0.00000 0.40000 0.10000 0.10000 
0.40000 0.00000 0.40000 0.30000 
0.10000 0.40000 0.00000 0.20000 
0.10000 0.30000 0.20000 0.00000 


In [4]:
parse_file_print_ans('rosalind_pdst.txt')

0.00000 0.30277 0.48827 0.46162 0.48294 0.56183 0.60661 0.48721 0.31770 
0.30277 0.00000 0.31130 0.55544 0.57356 0.44776 0.50000 0.31663 0.47548 
0.48827 0.31130 0.00000 0.61087 0.63966 0.55544 0.58529 0.47761 0.58529 
0.46162 0.55544 0.61087 0.00000 0.46162 0.66311 0.68124 0.63539 0.30171 
0.48294 0.57356 0.63966 0.46162 0.00000 0.65458 0.67058 0.63966 0.29957 
0.56183 0.44776 0.55544 0.66311 0.65458 0.00000 0.50000 0.30810 0.62900 
0.60661 0.50000 0.58529 0.68124 0.67058 0.50000 0.00000 0.34328 0.65565 
0.48721 0.31663 0.47761 0.63539 0.63966 0.30810 0.34328 0.00000 0.59701 
0.31770 0.47548 0.58529 0.30171 0.29957 0.62900 0.65565 0.59701 0.00000 
