### Rosalind Exercise 1 - Counting DNA Nucleotides

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [1]:
# read the dataset
dataset = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
# count how many times each nucleotide occurs in the string
print(dataset.count('A'), dataset.count('C'), dataset.count('G'), dataset.count('T'))

20 12 17 21
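
As an alternative sketch, `collections.Counter` tallies every symbol in a single pass instead of scanning the string four times. The helper name `count_nucleotides` is an illustrative assumption and is not part of the original notebook.

```python
from collections import Counter

def count_nucleotides(s):
    """Return the counts of 'A', 'C', 'G' and 'T' in s, in that order."""
    counts = Counter(s)  # one pass over the string
    # A Counter returns 0 for symbols that never occur
    return tuple(counts[base] for base in "ACGT")

dataset = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
print(*count_nucleotides(dataset))  # prints: 20 12 17 21
```

For strings of at most 1000 nt either approach is fast; the single pass mainly matters for much longer inputs.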


### Rosalind Exercise 2 - Transcribing DNA into RNA

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

In [2]:
DNA = 'GATGGAACTTGACTACGTAAATT'
# transcribe the coding strand by replacing every 'T' with 'U'
print(DNA.replace("T", "U"))

GAUGGAACUUGACUACGUAAAUU


### Rosalind Exercise 3 - Complementing a Strand of DNA

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement c of s.

In [5]:
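DNA = "AAAACCCGGT"
# complement each base using lowercase placeholders, so that a base already swapped
# is not swapped back by a later replace; then restore uppercase and reverse the string
print(DNA.replace('A', 't').replace('T', 'a').replace('C', 'g').replace('G', 'c').upper()[::-1])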

ACCGGGTTTT
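
The same result can be sketched with `str.maketrans` and `str.translate`, which map all four symbols in one step and so need no lowercase placeholders. The names `COMPLEMENT` and `reverse_complement` are illustrative assumptions, not part of the original notebook.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every symbol of s, then reverse the result."""
    return s.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAACCCGGT"))  # prints: ACCGGGTTTT
print(reverse_complement("GTCA"))        # prints: TGAC
```

Complementing and reversing commute, so the order of the two steps does not change the answer.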


### Rosalind Exercise 4 - Computing GC Content

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
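
The definition above can be checked with a minimal sketch. The function name `gc_content` and the `COMPLEMENT` table are illustrative assumptions; the example string and the 37.5% figure come from the text.

```python
def gc_content(s):
    """Return the GC-content of the DNA string s as a percentage."""
    return 100 * (s.count('C') + s.count('G')) / len(s)

# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

s = "AGCTATAG"
print(gc_content(s))  # prints: 37.5
# The reverse complement swaps C with G and A with T, so the C+G total is unchanged
print(gc_content(s.translate(COMPLEMENT)[::-1]))  # prints: 37.5
```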

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
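
The FASTA layout described above can be read with a short parser. This is a minimal sketch: the function name `parse_fasta` and the two sample records are hypothetical, and the parser assumes every string is preceded by a label line.

```python
def parse_fasta(text):
    """Return a dict mapping each label to its string, joined across lines."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            # A label line starts a new record
            label = line[1:]
            records[label] = ''
        elif line and label is not None:
            # Any other non-empty line continues the current string
            records[label] += line
    return records

# Hypothetical input; the second string is split over two lines
sample = """>Rosalind_0001
ACGT
>Rosalind_0002
GGCC
TTAA
"""
print(parse_fasta(sample))
# prints: {'Rosalind_0001': 'ACGT', 'Rosalind_0002': 'GGCCTTAA'}
```

Joining continuation lines matters because FASTA files commonly wrap a long string over several lines.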

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
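
One way to read the 0.001 allowance is as an absolute-error check. The sketch below assumes that interpretation; the helper name `within_tolerance` and the sample values are illustrative only.

```python
def within_tolerance(answer, expected, tolerance=0.001):
    """Return True if answer differs from expected by at most tolerance."""
    return abs(answer - expected) <= tolerance

print(within_tolerance(37.5004, 37.5))  # prints: True
print(within_tolerance(37.51, 37.5))    # prints: False
```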

In [6]:
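# file with 3 DNA strings in FASTA format, used to compute the GC-content
gc_content = {}  # maps each ID to [number of 'C'/'G' symbols, string length]
with open('../datasets/rosalind_gc.txt') as f:
    for line in f:  # scan every line of the file
        line = line.strip()
        if line.startswith('>'):  # header line: start a new record
            current_id = line[1:]
            gc_content[current_id] = [0, 0]
        elif line:  # sequence line; a record may span several lines
            gc_content[current_id][0] += line.count('C') + line.count('G')
            gc_content[current_id][1] += len(line)
# pick the record with the highest GC-content and report it as a percentage
best_id = max(gc_content, key=lambda k: gc_content[k][0] / gc_content[k][1])
gc, length = gc_content[best_id]
print(best_id, f'{100 * gc / length:.6f}')

Rosalind_0808 60.919540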


### Rosalind Exercise 6 - Translating RNA into Protein

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.
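
The protein alphabet described above can be derived directly from the excluded letters. This is a minimal sketch; the names `EXCLUDED`, `AMINO_ACIDS` and `is_protein_string` are illustrative assumptions.

```python
import string

# Letters of the English alphabet that do not denote a common amino acid
EXCLUDED = set("BJOUXZ")
AMINO_ACIDS = ''.join(c for c in string.ascii_uppercase if c not in EXCLUDED)

def is_protein_string(s):
    """Return True if every symbol of s is one of the 20 amino acid letters."""
    return all(c in AMINO_ACIDS for c in s)

print(AMINO_ACIDS)       # prints: ACDEFGHIKLMNPQRSTVWY
print(len(AMINO_ACIDS))  # prints: 20
```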

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.
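
As a compact sketch, the standard codon table can be generated rather than typed out, by enumerating codons in U, C, A, G order and pairing them with a 64-character amino acid string. The names `BASES`, `AMINO_ACIDS` and `codon_table` are illustrative assumptions, and this sketch marks stop codons with `'*'` rather than the word `'Stop'`.

```python
from itertools import product

BASES = "UCAG"
# Amino acids for the 64 codons enumerated in UCAG order; '*' marks a stop codon
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last position fastest: UUU, UUC, UUA, UUG, UCU, ...
codon_table = {
    ''.join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(len(codon_table))    # prints: 64
print(codon_table['AUG'])  # prints: M
```

An explicit dictionary is easier to audit by eye; the generated form is shorter and less prone to typing errors.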

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by s.

In [7]:
# Dictionary mapping each RNA codon to its amino acid symbol (or 'Stop')
protain_dictionary ={   'UUU': 'F',     'CUU': 'L',     'AUU': 'I',     'GUU': 'V',
                        'UUC': 'F',     'CUC': 'L',     'AUC': 'I',     'GUC': 'V',
                        'UUA': 'L',     'CUA': 'L',     'AUA': 'I',     'GUA': 'V',
                        'UUG': 'L',     'CUG': 'L',     'AUG': 'M',     'GUG': 'V',
                        'UCU': 'S',     'CCU': 'P',     'ACU': 'T',     'GCU': 'A',
                        'UCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
                        'UCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
                        'UCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
                        'UAU': 'Y',     'CAU': 'H',     'AAU': 'N',     'GAU': 'D',
                        'UAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
                        'UAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
                        'UAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
                        'UGU': 'C',     'CGU': 'R',     'AGU': 'S',     'GGU': 'G',
                        'UGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
                        'UGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
                        'UGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G'}

In [10]:
RNA = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"  # RNA string
protain = ''  # protein string
for i in range(0, len(RNA) - 2, 3):  # read the RNA one codon (3 symbols) at a time
    amino_acid = protain_dictionary[RNA[i:i+3]]  # look up the codon in the dictionary
    if amino_acid == 'Stop':  # a stop codon ends translation
        break
    protain += amino_acid  # append the symbol to the protein string
print(protain)

MAMAPRTEINSTRING
