Problem
-------

For two strings *s1* and *s2* of equal length, the p-distance between them, denoted *dp(s1,s2)*, is the proportion of corresponding symbols that differ between *s1* and *s2*.
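
As a minimal sketch of this definition, the function below counts mismatched positions and divides by the string length. The name `p_distance` and the two example strings are illustrative assumptions, not part of the problem data.

```python
def p_distance(s1, s2):
    """Return the proportion of positions at which s1 and s2 differ."""
    if len(s1) != len(s2):
        raise ValueError("p-distance needs strings of equal length")
    if not s1:
        raise ValueError("p-distance is undefined for empty strings")
    # zip pairs up corresponding symbols; count the pairs that disagree
    mismatches = sum(1 for a, b in zip(s1, s2) if a != b)
    return mismatches / float(len(s1))

# Illustrative strings: they differ at 4 of their 10 positions
example = p_distance("TTTCCATTTA", "GATTCATTTC")
print("{0:.5f}".format(example))  # prints 0.40000
```

Dividing by `float(len(s1))` keeps the result a true fraction even under an interpreter where integer division truncates.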

For a general distance function *d* on *n* taxa *s1*, *s2*, …, *sn* (taxa are often represented by genetic strings), we may encode the distances between pairs of taxa via a distance matrix *D* in which *Di,j* = *d(si, sj)*.
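
The encoding can be sketched for any distance function passed in as a callable. Here `build_distance_matrix`, `mismatch_fraction`, and the three short taxa are hypothetical names and data chosen only for illustration.

```python
def build_distance_matrix(taxa, d):
    """Return D as a list of rows, where D[i][j] = d(taxa[i], taxa[j])."""
    return [[d(si, sj) for sj in taxa] for si in taxa]

def mismatch_fraction(a, b):
    """One possible distance: the fraction of positions that differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / float(len(a))

# Hypothetical taxa of equal length
taxa = ["ACGT", "ACGA", "TCGA"]
D = build_distance_matrix(taxa, mismatch_fraction)

for row in D:
    print(" ".join("{0:.2f}".format(value) for value in row))
```

For a distance like this one, which is symmetric and gives zero for identical strings, *D* has a zero diagonal and equals its own transpose, so only the entries above the diagonal need to be computed if speed matters.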

**Given:** A collection of *n* (*n* ≤ 10) DNA strings *s1*, …, *sn* of equal length (at most 1 kbp). Strings are given in FASTA format.

**Return:** The matrix *D* corresponding to the p-distance *dp* on the given strings. As always, note that your answer is allowed an absolute error of 0.001.
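
The error allowance means the entries can be printed with a fixed number of decimals. The sketch below checks that five decimals stay well inside the 0.001 tolerance. The helper `format_entry` and the string lengths tried are assumptions made for this check.

```python
TOLERANCE = 0.001  # absolute error allowed by the problem statement

def format_entry(value):
    """Format one matrix entry with five digits after the decimal point."""
    return "{0:.5f}".format(value)

# Largest rounding error over every possible p-distance k/n
# for a few illustrative string lengths n (at most 1000, i.e. 1 kbp)
worst = 0.0
for n in (3, 7, 999, 1000):
    for k in range(n + 1):
        exact = k / float(n)
        worst = max(worst, abs(float(format_entry(exact)) - exact))

print(worst < TOLERANCE)  # prints True
```

Rounding to five decimals changes a value by at most about 0.000005, so fewer digits would also pass, but five leaves a comfortable margin.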



In [1]:
def readTab(infile):
    """Read a text file; return one list of tab-separated fields per line."""
    output = []
    with open(infile, 'r') as input_file:
        for input_line in input_file:
            # a line without tabs (e.g. a plain sequence line) yields a one-element list
            output.append(input_line.strip().split('\t'))
    return output
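def extract_fasta(fasta):
    """Collect FASTA records from readTab() output.

    Returns a dict mapping each header to its full sequence,
    plus the list of headers in file order.
    """
    sequences = {}
    headers = []
    current = None
    for fields in fasta:
        line = fields[0]
        if line.startswith(">"):
            # a header line starts a new record
            current = line
            headers.append(current)
            sequences[current] = ""
        elif current is not None:
            # a sequence may span several lines; append to the current record
            sequences[current] += line
    return sequences, headers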
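def Hamming_distance(string1, string2, normalize):
    """Count mismatched positions, formatted to 5 decimals.

    If normalize is true, the count is divided by the string length,
    which gives the p-distance.
    """
    if len(string1) != len(string2):
        raise ValueError("strings must have equal length")
    diffs = 0
    for a, b in zip(string1, string2):
        if a != b:
            diffs += 1
    dist = float(diffs)
    if normalize:
        dist = dist / float(len(string1))
    return "{0:.5f}".format(dist)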
def distance_matrix(fasta):
    """Build the p-distance matrix as rows of formatted strings, in FASTA header order."""
    sequences, headers = extract_fasta(fasta)
    matrix = []  # named 'matrix' so it does not shadow this function
    for i in headers:
        line = []
        for j in headers:
            # Hamming_distance already returns a formatted string
            line.append(Hamming_distance(sequences[i], sequences[j], True))
        matrix.append(line)
    return matrix
def matrix_toString(matrix):
    output = ""
    for i in matrix:
        output = output + " ".join(i) + "\n"
    return output

In [2]:
test = readTab("dist_matrix.fasta")
test_matrix = distance_matrix(test)
print(matrix_toString(test_matrix))

0.00000 0.40000 0.10000 0.10000
0.40000 0.00000 0.40000 0.30000
0.10000 0.40000 0.00000 0.20000
0.10000 0.30000 0.20000 0.00000



In [3]:
final = readTab("rosalind_pdst.txt")
final_matrix = distance_matrix(final)
print(matrix_toString(final_matrix))

0.00000 0.64302 0.47558 0.29884 0.56512 0.45930 0.62907 0.57209
0.64302 0.00000 0.62442 0.58023 0.64419 0.47558 0.47326 0.31047
0.47558 0.62442 0.00000 0.31744 0.34186 0.49070 0.63721 0.57442
0.29884 0.58023 0.31744 0.00000 0.49767 0.31860 0.59070 0.49186
0.56512 0.64419 0.34186 0.49767 0.00000 0.58953 0.66395 0.62674
0.45930 0.47558 0.49070 0.31860 0.58953 0.00000 0.45349 0.30233
0.62907 0.47326 0.63721 0.59070 0.66395 0.45349 0.00000 0.30814
0.57209 0.31047 0.57442 0.49186 0.62674 0.30233 0.30814 0.00000

