# Computing GC Content

http://rosalind.info/problems/gc/

## Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
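As a quick check of the definition above, the following minimal sketch recomputes the 37.5% figure for "AGCTATAG" and confirms that its reverse complement has the same GC-content. The helper names `gc_percent` and `reverse_complement` are illustrative and are not part of the problem statement.

```python
def gc_percent(dna):
    """Return the percentage of symbols in dna that are 'G' or 'C'."""
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)

def reverse_complement(dna):
    """Swap A<->T and C<->G, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

example = "AGCTATAG"
print(gc_percent(example))                      # 37.5
print(gc_percent(reverse_complement(example)))  # 37.5
```

Complementing swaps every G with a C and vice versa, so the combined count of the two symbols is unchanged, and reversing does not affect counts at all.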

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
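One way to read the format description above is as a stream of records, each opened by a '>' line. The sketch below is an illustration under that reading, not Rosalind's own parser: `iter_fasta` and `ROSALIND_ID` are hypothetical names, and the two-record input is made up for the example.

```python
import re

def iter_fasta(text):
    """Yield (label, sequence) pairs from FASTA-formatted text."""
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new label closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)

# Pattern for the "Rosalind_xxxx" ID described above (four digits).
ROSALIND_ID = re.compile(r"Rosalind_\d{4}")

# Made-up input, used only to exercise the parser.
sample = ">Rosalind_0001\nACGT\nGGCC\n>Rosalind_0002\nTTAA\n"
records = list(iter_fasta(sample))
print(records)
print(all(ROSALIND_ID.fullmatch(label) for label, _ in records))
```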

**Given**: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

**Return**: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
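The error allowance can be pictured as a simple absolute-difference check. This is a sketch only: it assumes the 0.001 tolerance is applied to the printed percentage, and `within_tolerance` is a hypothetical helper, not Rosalind's actual checker.

```python
def within_tolerance(answer, expected, tol=0.001):
    """True if answer differs from expected by at most tol (absolute error)."""
    return abs(answer - expected) <= tol

full_precision = 60.91954022988506      # value printed for the sample dataset
rounded = round(full_precision, 6)      # same value kept to six decimals

print(within_tolerance(rounded, full_precision))  # True
print(within_tolerance(60.9, full_precision))     # False: off by about 0.02
```

Under this reading, printing the percentage with three or more decimal places is enough to stay inside the allowance.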



In [1]:
# Sample Dataset
lines = ['>Rosalind_6404',
'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC',
'TCCCACTAATAATTCTGAGG',
'>Rosalind_5959',
'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT',
'ATATCCATTTGTCAGCAGACACGC',
'>Rosalind_0808',
'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC',
'TGGGAACCTGCGGGCAGTAGGTGGAAT']

In [31]:
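import operator

def fasta_read(lines):
    # Map each FASTA label (without the leading '>') to its full sequence.
    strings = {}
    i = 0
    while i < len(lines):
        key = lines[i][1:]
        i += 1
        string = ''
        # Concatenate sequence lines until the next label line.
        # startswith() also tolerates empty lines, unlike lines[i][0].
        while i < len(lines) and not lines[i].startswith('>'):
            string += lines[i]
            i += 1
        strings[key] = string
    return strings

def gc_content(string):
    # Fraction (0 to 1) of symbols that are 'G' or 'C'.
    return sum(map(string.count, ['G', 'C'])) / len(string)

def solution(lines):
    strings = fasta_read(lines)
    gcs = {k: gc_content(v) for k, v in strings.items()}
    # ID of the string with the highest GC fraction.
    max_k = max(gcs.items(), key=operator.itemgetter(1))[0]
    print(max_k)
    print(gcs[max_k] * 100)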


In [32]:
# Sample Output
solution(lines)

Rosalind_0808
60.91954022988506


In [34]:
# Read the downloaded dataset, stripping trailing newlines from each line.
with open('/Users/eculbertson/Downloads/rosalind_gc.txt') as f:
    strings = [line.strip() for line in f]
solution(strings)

Rosalind_9752
52.517162471395885


In [35]:
len(strings)

126