# [Consensus and Profile](http://rosalind.info/problems/cons/)

**Problem**

A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.

Say that we have a collection of DNA strings, all having the same length $n$. Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1,j}$ represents the number of times that 'A' occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the $j$th position, and so on for 'G' and 'T' (see below).

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

DNA Strings	
```
A T C C A G C T
G G G C A A C T
A T G G A T C T
A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T
```

Profile	
```
A   5 1 0 0 5 5 0 0
C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6
```

Consensus	
```
A T G C A A C T
```
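
The definitions above can be checked against the example with a short sketch. It assumes the strings are already plain Python strings of equal length; the names `build_profile` and `consensus_from_profile` are illustrative and not part of the problem statement. Ties are broken in favor of the first symbol in `ACGT` order, which is only one of the allowed answers.

```python
def build_profile(strings, alphabet="ACGT"):
    """Return {symbol: [count at position 0, count at position 1, ...]}."""
    # zip(*strings) yields one tuple per column (position) of the collection.
    columns = list(zip(*strings))
    return {s: [col.count(s) for col in columns] for s in alphabet}


def consensus_from_profile(profile, alphabet="ACGT"):
    """Pick a most common symbol per position (first in alphabet on ties)."""
    n = len(profile[alphabet[0]])
    # max() returns the first maximal element, which gives the tie-break rule.
    return "".join(max(alphabet, key=lambda s: profile[s][j]) for j in range(n))


strings = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
           "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile = build_profile(strings)
print(consensus_from_profile(profile))
for symbol in "ACGT":
    print(symbol, *profile[symbol])
```

Storing one row per symbol keeps the data in the same 4×n orientation as the profile matrix in the problem statement.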

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
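
When a position has a tie, every symbol that reaches the column maximum is a valid choice, so the number of valid consensus strings is the product of the tie counts over all positions. The sketch below enumerates them. It assumes the profile is given as a dict of per-symbol count lists, a representation chosen only for this example, and `all_consensus_strings` is a hypothetical helper name.

```python
from itertools import product


def all_consensus_strings(profile, alphabet="ACGT"):
    """Yield every consensus string allowed by the profile."""
    n = len(profile[alphabet[0]])
    choices = []
    for j in range(n):
        best = max(profile[s][j] for s in alphabet)
        # Every symbol that reaches the column maximum is a valid choice.
        choices.append([s for s in alphabet if profile[s][j] == best])
    for combo in product(*choices):
        yield "".join(combo)


# Toy profile for the strings "AC" and "GC": position 0 is tied between A and G.
toy_profile = {"A": [1, 0], "C": [0, 2], "G": [1, 0], "T": [0, 0]}
print(list(all_consensus_strings(toy_profile)))  # ['AC', 'GC']
```

Since the problem accepts any one of these strings, a solution only needs the first, but enumerating them can help when comparing against a differing reference answer.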

**Sample Dataset**
```
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
```

**Sample Output**
```
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
```

In [None]:
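def parse(data):
    """Parse FASTA text into a dict mapping record name to DNA string."""
    dna_strings = {}
    name = ''

    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on line[0]
        if line.startswith('>'):
            name = line[1:]
            dna_strings[name] = ''
        else:
            # Sequence lines may wrap, so append to the current record.
            dna_strings[name] += line

    return dna_strings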

In [None]:
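def get_profile(dna_strings):
    """Return one [A, C, G, T] count vector per position.

    The result is n x 4, the transpose of the 4 x n profile matrix
    described in the problem statement.
    """
    dna_string_length = len(dna_strings[0])

    profile = []

    letter_indexes = {
        'A': 0,
        'C': 1,
        'G': 2,
        'T': 3
    }

    for i in range(dna_string_length):
        profile_vector = [0, 0, 0, 0]

        # Count the symbol each string has at position i.
        for dna_string in dna_strings:
            profile_vector[letter_indexes[dna_string[i]]] += 1

        profile.append(profile_vector)

    return profile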
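
A quick sanity check for a profile in this position-major layout: each string contributes exactly one symbol per position, so every count vector should sum to the number of strings. A minimal sketch, using the hypothetical helper `check_profile` and a hand-written two-string example:

```python
def check_profile(profile, num_strings):
    """Return True if every per-position count vector sums to num_strings."""
    return all(sum(counts) == num_strings for counts in profile)


# Position-major profile for the strings "AC" and "GC": [A, C, G, T] per position.
example_profile = [[1, 0, 1, 0], [0, 2, 0, 0]]
print(check_profile(example_profile, 2))  # True
```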

In [None]:
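def get_consensus(profile):
    """Build a consensus string from per-position [A, C, G, T] counts."""
    letter_indexes = {
        0: 'A',
        1: 'C',
        2: 'G',
        3: 'T'
    }

    consensus = ''

    for p in profile:
        # index() returns the first maximum, so ties resolve in A, C, G, T order.
        top_letter = letter_indexes[p.index(max(p))]
        consensus += top_letter

    return consensus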

In [None]:
# Use a context manager so the file is closed after reading.
with open('./dataset.txt', 'r') as file:
    data = file.read().strip()

fasta_dna_strings = parse(data)
# Strings are indexable, so there is no need to convert them to lists.
dna_strings = list(fasta_dna_strings.values())

profile = get_profile(dna_strings)
consensus = get_consensus(profile)

In [None]:
print(consensus)

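# profile holds one [A, C, G, T] vector per position, so transpose it
# into one labelled row per symbol before printing.
formatted_profile = [
    ['A:'],
    ['C:'],
    ['G:'],
    ['T:']
]

for i in range(len(profile[0])):      # i: symbol index (0..3)
    for j in range(len(profile)):     # j: position in the strings
        formatted_profile[i].append(profile[j][i])

for row in formatted_profile:
    print(' '.join(str(n) for n in row))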
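
The nested loops above transpose the position-major profile into one row per symbol. The same transposition can be sketched with `zip(*profile)`. The function name `format_profile` is illustrative, and the input is assumed to be a list of `[A, C, G, T]` count vectors like the one `get_profile` returns.

```python
def format_profile(profile, alphabet="ACGT"):
    """Return one 'X: c1 c2 ...' line per symbol from a position-major profile."""
    # zip(*profile) turns n vectors of 4 counts into 4 rows of n counts.
    rows = zip(*profile)
    return [f"{symbol}: {' '.join(map(str, row))}"
            for symbol, row in zip(alphabet, rows)]


# Position-major profile for the strings "AC" and "GC".
example_profile = [[1, 0, 1, 0], [0, 2, 0, 0]]
print("\n".join(format_profile(example_profile)))
```

Returning the lines instead of printing them directly makes the formatting easy to compare against the expected sample output.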