### [Counting DNA Nucleotides](https://rosalind.info/problems/dna/) 

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG".

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [7]:
from collections import Counter
data = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
c = Counter(data)

In [23]:
with open('datasets/rosalind_dna.txt','r') as f:
    data = f.read().strip('\n')

In [20]:
c = Counter(data)

In [21]:
c

Counter({'T': 212, 'C': 263, 'G': 247, 'A': 221})

In [22]:
print(c['A'],c['C'],c['G'],c['T'],sep=' ')

221 263 247 212


### [Transcribing DNA into RNA](https://rosalind.info/problems/rna/)
Problem  
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

Sample Dataset
```
GATGGAACTTGACTACGTAAATT
```
Sample Output
```
GAUGGAACUUGACUACGUAAAUU
```

In [41]:
with open('datasets/rosalind_rna.txt','r') as f:
    s = f.read().strip('\n')
# s = 'GATGGAACTTGACTACGTAAATT'
print(s)

s.replace('T','U')

CCGTCTGAGGTTCACTAAAAGTTTCCATCCAGCCTGTATTCGATAGGAGTCATCTCCGTGAGGAAAGGTGTGTAAGGAAATCGTCGATAGGAACGAAGCAAGTGTTTCACCCTTCTCATTAGTGGGCTGTATGGTGACTCGAGGGAAGCCCAAGATAAAGGGCTATTGTACATTGACCCTCATTTTATAATCGTAATGATGTTAGTCCCAAGCCTGGCCCTGCCAGGCAGCAAGTAGGCACAGATCTTCCCGACGATTCAAGAAAGCATCATGCCTCTTCCCGGTATTTTCTTCAGGAAGCAGCCCGTCGGGTGAAACTCTGCTACAGGTTTGGCGCATAAGCGCTCAGGGATCGATCGGAATAAACAGCTAAGGAATTTTTTAAGTAACGGTCCCAATACAAGATTACGCCTGCACAGGGCGAGTGGGCCCTACTCGCGGCCTAGGTGCATGCATAGCTACGCGCAACACGACTCGGTGACTTCCGGAATGGTTTTCCGAAGCAAAGGCGCTCTTTTCACGTCCGAAATTTCCATGGTGTTAAACTCGTCATGAGAGGACCTTGGCCAGCTAACCATTTCGCGGTTGCTACCAGGTATTGTGAAGGTCAAGAGCAGGATGCTAAACATAGGAGAGGCAATATTGATGTAGCCTGAGGGCCAGCAGCCTTTTCTGGAGGTTGAAATTCCCCTTCCCCTCCTACAACACGTCGCGTTAACTGCAAGAGCATTAGGACGACGTTCCGACAATTGGCACCCTAGGACGCCCGTGACCCTAGCCCGTCATAAATTACCAAGAGCGACCGTTTAGTGATGCGTGTAAGATAGGATAGAACAAAGATGAGATCCTCCACAGTTGTCCCATAGCTCTCCATTATGACCATATATGGAGGGGCGAAACAAACATAGAATATGATATTAATTCCCGGATGTCATAA


'CCGUCUGAGGUUCACUAAAAGUUUCCAUCCAGCCUGUAUUCGAUAGGAGUCAUCUCCGUGAGGAAAGGUGUGUAAGGAAAUCGUCGAUAGGAACGAAGCAAGUGUUUCACCCUUCUCAUUAGUGGGCUGUAUGGUGACUCGAGGGAAGCCCAAGAUAAAGGGCUAUUGUACAUUGACCCUCAUUUUAUAAUCGUAAUGAUGUUAGUCCCAAGCCUGGCCCUGCCAGGCAGCAAGUAGGCACAGAUCUUCCCGACGAUUCAAGAAAGCAUCAUGCCUCUUCCCGGUAUUUUCUUCAGGAAGCAGCCCGUCGGGUGAAACUCUGCUACAGGUUUGGCGCAUAAGCGCUCAGGGAUCGAUCGGAAUAAACAGCUAAGGAAUUUUUUAAGUAACGGUCCCAAUACAAGAUUACGCCUGCACAGGGCGAGUGGGCCCUACUCGCGGCCUAGGUGCAUGCAUAGCUACGCGCAACACGACUCGGUGACUUCCGGAAUGGUUUUCCGAAGCAAAGGCGCUCUUUUCACGUCCGAAAUUUCCAUGGUGUUAAACUCGUCAUGAGAGGACCUUGGCCAGCUAACCAUUUCGCGGUUGCUACCAGGUAUUGUGAAGGUCAAGAGCAGGAUGCUAAACAUAGGAGAGGCAAUAUUGAUGUAGCCUGAGGGCCAGCAGCCUUUUCUGGAGGUUGAAAUUCCCCUUCCCCUCCUACAACACGUCGCGUUAACUGCAAGAGCAUUAGGACGACGUUCCGACAAUUGGCACCCUAGGACGCCCGUGACCCUAGCCCGUCAUAAAUUACCAAGAGCGACCGUUUAGUGAUGCGUGUAAGAUAGGAUAGAACAAAGAUGAGAUCCUCCACAGUUGUCCCAUAGCUCUCCAUUAUGACCAUAUAUGGAGGGGCGAAACAAACAUAGAAUAUGAUAUUAAUUCCCGGAUGUCAUAA'

### [Complementing a Strand of DNA](https://rosalind.info/problems/revc/)  
Problem

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string sc formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement sc of s.

Sample Dataset
```
AAAACCCGGT
```
Sample Output
```
ACCGGGTTTT
```

In [44]:
with open('datasets/rosalind_revc.txt','r') as f:
    dna = f.read().strip('\n')
# dna = 'AAAACCCGGT'
complements = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
reverse_complement = dna[::-1]
reverse_complement = ''.join([complements[ele] for ele in reverse_complement])
reverse_complement

'CGCTTGCATGATTAAGGTCGTCTATTTCATACGCTTTTATGGACGGACTGAAGAAACTCGTCTACTTTCGGCGAGGCAAGCGACGCAAGGATGCCTGAGTGCTGGACGCGGTCCGCTGACCACCTCTAGGCGATCTCTCACATGCATGCTCTACTATTTCCCTTAGTACTGATAACGATAGCACGTTTACGCCCTGTACAGCGAACGTTCGCTCAGAGTCTGGCGGCAATAGCTTCAAATGGAAGCTGAACGTATTACCACATCAGTGTTTGCCGATGCGGTAGATACCTTGCGGTATACGCCCTAAATCCGTGTATTACAGCAATTAATATGACTCGGCGCGGTCGACGGTTTAGCCGTTCTTCCGTCCTTGACACGAACAATGAGCAATGCGTGAGCATTTATAGGGATGCCTACAGGGTTCTCTCAGGTAGGCGATATGATAGAGGGACTGCCCACGCCTAATCATCTGCAACTGTAGACCGGTTATCCCGACAGCAGCACAGCTGCATTGACTCCGTCACCCCCCTGCTCTTCCAAACAGTCAATCTAGTACGTGGTTGTGTACGACTTGATATATCGCTTAGTCTACAATCGGGATCCACACGTGGTGAAAATGCACCCTAGAGGAGCCGTTGGCGGCACGACGACACCACAGTAACAATCCTACGAAGACTGTTAATTCGGCACGAGATAGTGGAGATCCAATACTCTTGGGTTCGTGGGTGGGTATGCTCTTTTGATGAACCTACTCCAGGATGTCTCCTGCAAAACCTCTTATATGCCTAAGACCCATTAAGATTTATCTAGCAGTGGGTTACGAACGAATTTAGGCTTGCTCGGGGGACACAGGGTCCACTGTCACATTCGGTGTCCACGCTTAGGGTTCAAGGAAACTCTTCTTTACGCTGTTTCTAGGACGGGGAAGCAAGGGCACATGAGGAAGCGACAGTTCTGTCGTATCACTGTTCTCC'
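As an alternative to the dictionary lookup above, here is a minimal sketch using `str.maketrans` and `str.translate` from the standard library. The names `COMPLEMENT` and `reverse_complement_of` are illustrative, and the sketch assumes the input contains only the uppercase symbols A, C, G and T.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement_of(dna):
    # Complement every symbol, then reverse the result.
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement_of('AAAACCCGGT'))  # ACCGGGTTTT
```

Complementing first and reversing second gives the same string as reversing first, because the two operations act independently on each position.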

### [Rabbits and Recurrence Relations](https://rosalind.info/problems/fib/)

Problem

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi, -\sqrt{2}, 0, \pi)$ and the infinite sequence of odd numbers $(1, 3, 5, 7, 9, \ldots)$. We use the notation $a_n$ to represent the $n$-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40 and k≤5.

Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

Sample Dataset

```5 3```

Sample Output

```19```

### Different ways of computing Fibonacci numbers

**Let's see which method is fastest by computing the 10th term**

In [125]:
def recursive_fib(n):
    '''
    Recursive Fibonacci solution
    Runtime: O(branches^depth) --> O(2^n)
    Branches represents number of times each recursive call branches
    In this case it is 2 because function is called twice  
    '''
    if n <= 1:
        return n
    return recursive_fib(n-1) + recursive_fib(n-2)

In [126]:
%%timeit
recursive_fib(10)

14.4 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [127]:
def recursive_memo_fib(n):
    '''
    Top down recursive solution with memoization
    Runtime: O(n)
    '''
    def fibonacci(i, memo):
        if i <= 1:
            return i
        if memo[i] == 0:
            memo[i] = fibonacci(i-1, memo) + fibonacci(i - 2, memo)
        return memo[i]
    return fibonacci(n, [0]*(n+1))

In [128]:
%%timeit
recursive_memo_fib(10)

3.37 µs ± 80.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
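The hand-rolled memo list above can also be replaced by the standard library's `functools.lru_cache`. The following is a minimal sketch of that variant; `cached_fib` is an illustrative name, not part of the original notebook.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache: each n is computed only once
def cached_fib(n):
    # Same recurrence as recursive_fib; the decorator stores earlier results.
    if n <= 1:
        return n
    return cached_fib(n - 1) + cached_fib(n - 2)

print(cached_fib(10))  # 55
```

Because the cache persists between calls, timing this with `%%timeit` would mostly measure cache lookups after the first run, so the number would not be comparable with the timings above.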


In [146]:
def iterative_fib(n):
    '''
    Iterative Fibonacci solution. Bottom up
    runtime: O(n)
    
    In a basic timing test with n = 10, this was the fastest of the versions here
    '''
    a, b = 0, 1
    for i in range(n):
        print(a,b)  # debug trace of each step; comment out before timing
        a, b = b, a + b
        
    return a

In [147]:
iterative_fib(5)

0 1
1 1
1 2
2 3
3 5


5

In [130]:
%%timeit
iterative_fib(10)

505 ns ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [162]:
def fib_list(n):
    '''
    Generates list of Fibonacci sequence iteratively. Bottom up
    '''
    a, b = 0, 1
    sequence = []
    for i in range(n):
        a, b = b, a + b
        sequence.append(a)
    return sequence

In [163]:
# %%timeit
fib_list(10)

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

In [133]:
def fib(n):
    '''
    Another way of storing previous values in list
    Returns the nth term
    '''
    if n <= 1:
        return n
    memo = [0,1]
    for i in range(2,n):
        memo.append(memo[i - 1] + memo[i - 2])
    return memo[n - 1] + memo[n - 2]

In [134]:
%%timeit
fib(10)

1.3 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [152]:
def iterative_fib_k_rabbits(n,k):
    '''
    Iterative Fibonacci solution. Bottom up
    runtime: O(n)
    Each reproduction-age rabbit pair produces k rabbit pairs each month: F(n) = F(n-1) + k*F(n-2)
    '''
    a, b = 0, 1
    for i in range(n):
        a, b = b, k*a + b
    return a

In [160]:
with open('datasets/rosalind_fib.txt','r') as f:
    data = f.read().strip('\n').split()
data = [int(x) for x in data]

iterative_fib_k_rabbits(data[0],data[1])

66507086889

### [Mortal Fibonacci Rabbits](https://rosalind.info/problems/fibd/)

There are several ways to solve this problem; two are shown below.

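def iterative_fib_mortal_rabbits(n,m=1):
    '''
    Iterative Fibonacci solution. Bottom up
    runtime: O(n*m), since each month rebuilds a list of m age buckets
    Each rabbit pair dies after m months
    '''
    # ages[i] = number of pairs that are i months old
    ages = [1] + [0]*(m-1)
    for i in range(n-1):
        # every pair aged 1+ months produces one newborn pair; the oldest bucket drops off
        ages = [sum(ages[1:])] + ages[:-1]
    return sum(ages)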

In [199]:
with open('datasets/rosalind_fibd.txt','r') as f:
    data = f.read().strip('\n').split()
data = [int(x) for x in data]
iterative_fib_mortal_rabbits(data[0],data[1])

61115936848684058

In [197]:
iterative_fib_mortal_rabbits(9,3)

9

In [206]:
def population(n, m=1):
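    '''
    Recurrence relation solution: F(n) = F(n-1) + F(n-2) - F(n-m-1),
    where the subtracted term counts the pairs that die in month n.
    The list is padded with leading zeros so that h[-m-1] exists from the first step.
    '''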
    h = [0]*(m-2) + [1, 1, 1]
    print(h)
    for i in range(n-2):
        h += [h[-1] + h[-2] - h[-m-1]]
    print(h)
    return h[-1]

In [207]:
population(9,3)

[0, 1, 1, 1]
[0, 1, 1, 1, 2, 2, 3, 4, 5, 7, 9]


9

In [202]:
population(82,19)

61115936848684058
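The recurrence behind `population` can be spelled out with explicit month indices instead of zero padding. This is a sketch under the assumption that m is at least 2 and that every mature pair produces exactly one new pair per month; `mortal_rabbits` is an illustrative name.

```python
def mortal_rabbits(n, m):
    # f[i] = number of pairs alive in month i (1-based; f[0] is unused padding).
    f = [0, 1, 1]
    for i in range(3, n + 1):
        total = f[i - 1] + f[i - 2]   # survivors plus newborns, ignoring deaths
        if i == m + 1:
            total -= 1                # the founding pair dies
        elif i > m + 1:
            total -= f[i - m - 1]     # pairs born m months earlier die
        f.append(total)
    return f[n]

print(mortal_rabbits(6, 3))  # 4
print(mortal_rabbits(9, 3))  # 9
```

The three cases make the boundary explicit: no deaths happen before month m + 1, only the founding pair dies in month m + 1, and afterwards the number of deaths equals the number of pairs born m months earlier.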

### [Computing GC Content](https://rosalind.info/problems/gc/)

In [278]:
'''
steps:
1. split up the different strings in the FASTA file
2. calculate gc content for each string
3. Return string identifier and gc content for one with highest gc content
'''
with open('datasets/rosalind_gc.txt','r') as f:
    data = f.read()
# sample = '''>Rosalind_6404
# CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
# TCCCACTAATAATTCTGAGG
# >Rosalind_5959
# CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
# ATATCCATTTGTCAGCAGACACGC
# >Rosalind_0808
# CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
# TGGGAACCTGCGGGCAGTAGGTGGAAT
# '''
import re

def get_gc(fasta_str):
    dna_id = re.findall(r'Rosalind_\d{4}',fasta_str)
    dna_strings = re.split(r'>Rosalind_\d{4}',fasta_str)
    dna_strings = [line.replace('\n','') for line in dna_strings][1:]
    
    gc_max = 0
    id_max = ''
    for i in range(len(dna_strings)):
        gc = 0
        for letter in dna_strings[i]:
            if letter == 'G' or letter == 'C':
                gc += 1
        gc = gc/len(dna_strings[i]) * 100
        if gc > gc_max:
            gc_max = gc
            id_max = dna_id[i]
    return (id_max, gc_max)

In [279]:
for item in get_gc(data):
    print(item)

Rosalind_3370
52.8281750266809
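The three steps listed in the cell above can also be sketched without regular expressions, which avoids relying on identifiers of the form `Rosalind_` plus four digits. The helper names `parse_fasta`, `gc_content` and `highest_gc`, and the toy records below, are illustrative assumptions; the sketch assumes every record has a non-empty sequence.

```python
def parse_fasta(text):
    # Map each header (without the leading '>') to its concatenated sequence.
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            label = line[1:]
            records[label] = []
        elif label is not None:
            records[label].append(line)
    return {name: ''.join(chunks) for name, chunks in records.items()}

def gc_content(dna):
    # Percentage of symbols that are 'G' or 'C'.
    return 100 * sum(base in 'GC' for base in dna) / len(dna)

def highest_gc(text):
    records = parse_fasta(text)
    best = max(records, key=lambda name: gc_content(records[name]))
    return best, gc_content(records[best])

toy = ">seq_a\nATAT\nGC\n>seq_b\nGGCC\nAT\n"
print(highest_gc(toy))  # seq_b has 4 of 6 symbols in 'GC'
```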


### [Counting Point Mutations](https://rosalind.info/problems/hamm/)

In [295]:
with open('datasets/rosalind_hamm.txt','r') as f:
    data = f.read().strip('\n').split('\n')

# data = ['GAGCCTACTAACGGGAT','CATCGTAATGACGGCCT']
def hamming_distance(strand_1, strand_2):
    '''
    Count number of corresponding characters that differ between two strings.
    Strings are same length.
    '''
    count = 0
    for i in range(len(strand_1)):
        if strand_1[i] != strand_2[i]:
            count += 1
    return count

In [296]:
hamming_distance(data[0],data[1])

515
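A more compact way to count mismatches is to pair the symbols with `zip`. The sketch below adds an explicit length check, which is an assumption on top of the original function; `hamming` is an illustrative name.

```python
def hamming(strand_1, strand_2):
    # The Hamming distance is only defined for strings of equal length.
    if len(strand_1) != len(strand_2):
        raise ValueError('strands must have the same length')
    # True counts as 1 when summed, so this counts the differing positions.
    return sum(x != y for x, y in zip(strand_1, strand_2))

print(hamming('GAGCCTACTAACGGGAT', 'CATCGTAATGACGGCCT'))  # 7
```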

### [Mendel's First Law](https://rosalind.info/problems/iprb/)

In [386]:
from itertools import combinations
def quantify_dominant_allele(k,m,n):
    '''
    Given:
        k - homozygous dominant
        m - heterozygous
        n - homozygous recessive
    Returns probability that two randomly selected mating organisms will produce an individual possessing a dominant allele.
    
    
    6 possible unique genotype pairings - listed in d except for aa-aa, which cannot produce a dominant allele
    
    Steps:
    1. Get a list of possible genotype pairings - C(k+m+n, 2) 
    2. Determine number of each possible pairing
    3. For each pairing get number of possible offspring with dominant allele --> # pairings * 4 * rate of dominant phenotypes
    4. Sum counts of offspring for each pairing. Divide by total number of possible offspring (# pairings * 4). Return this probability.
    
    '''
    l = ['AA']*k + ['Aa']*m + ['aa']*n
    combos = list(combinations(l,2))
    # d = dominant counts
    d = {'AA-AA':0,
         'AA-Aa':0,
         'AA-aa':0,
         'Aa-Aa':0,
         'Aa-aa':0}
    for pair in combos:
        if 'AA' in pair[0] and 'AA' in pair[1]:
            d['AA-AA'] += 1 * 4
        elif 'AA' in pair[0] and 'Aa' in pair[1]:
            d['AA-Aa'] += 1 * 4
        elif 'AA' in pair[0] and 'aa' in pair[1]:
            d['AA-aa'] += 1 * 4
        elif 'Aa' in pair[0] and 'Aa' in pair[1]:
            d['Aa-Aa'] += 1 * 4 * 0.75
        elif 'Aa' in pair[0] and 'aa' in pair[1]:
            d['Aa-aa'] += 1 * 4 * 0.5
    d_sum = 0
    for pair in d:
         d_sum += d[pair]
    return d_sum/(len(combos)*4)

In [387]:
quantify_dominant_allele(2,2,2)

0.7833333333333333

In [388]:
quantify_dominant_allele(26,23,21)

0.7868530020703933

**Looking at other solutions**

In [449]:
def firstLaw(k,m,n):
    '''
    Subtract the probability of getting recessive-only offspring
    1 - recessive probability = dominant probability
    within recessive probability:
    1st term: n-n --> 100% chance
    2nd term: m-n --> 50% chance * 2 (2 possible pairings)
    3rd term: m-m --> 25% chance
    Calculate each probability of sampling without replacement (products)
    Then add these products to get event of recessive
    '''
    N = float(k+m+n)
    return(1 - 1/N/(N-1)*(n*(n-1) + n*m + m*(m-1)/4.))

In [450]:
firstLaw(26,23,21)

0.7868530020703934
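The complement formula in `firstLaw` can be cross-checked with exact arithmetic. The sketch below uses `fractions.Fraction` and counts ordered pairs of parents; `dominant_probability` is an illustrative name, and the sketch assumes at least two organisms in the population.

```python
from fractions import Fraction

def dominant_probability(k, m, n):
    total = k + m + n
    ordered_pairs = total * (total - 1)   # ordered draws without replacement
    # Weighted count of ordered pairs whose offspring is homozygous recessive.
    recessive = (Fraction(n * (n - 1))              # aa x aa: always recessive
                 + Fraction(1, 2) * (2 * m * n)     # Aa x aa in either order: 1/2
                 + Fraction(1, 4) * (m * (m - 1)))  # Aa x Aa: 1/4
    return 1 - recessive / ordered_pairs

print(dominant_probability(2, 2, 2))         # 47/60
print(float(dominant_probability(2, 2, 2)))
```

The exact result 47/60 agrees with the 0.7833... returned by `quantify_dominant_allele(2,2,2)` above.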

### [Translating RNA into Protein](https://rosalind.info/problems/prot/)

In [418]:
rna_codon_table = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
    "UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",}

In [421]:
def translate(rna_string): 
    protein_string = ''
    # read the string one codon (3 nucleotides) at a time
    for i in range(0,len(rna_string),3):
        amino_acid = rna_codon_table[rna_string[i:i+3]]
        if amino_acid == 'STOP':
            break
        protein_string += amino_acid
    return protein_string

In [422]:
# s = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA';
with open('datasets/rosalind_prot.txt','r') as f:
    s = f.read().strip('\n')
translate(s)

'MEGSPKLFKARMCGCAVTISGKLLGPPQRYRIRIVMSRARRSLSCLENRHGRVIYSPSCADDNLQSRAHHPLCLYINNACTRPLPAVQELASSFQGITLVHSDRPWSQIRTEWTRGKGAVSRAKAHEWDPKLYYYLRAVAWTNTTGVLIITGLGCNEQSARFILRMSFMVMCFTSERHLGCASTRALIITLGHITILIIALLWKQRSAAIQGGNYLSQPTCERYTVQVLLQSPKRRRVTRCPEQPLSRTWASSPSSSPCLDRTLQNWISPRLDSHEPGSAPGLQQLLPLRVDHTVLFRAANFVPTIKLYVKGGVREGHTGRTCSSQNRSGKYPAFVLVPPTTTQPSDPRVSPPIMQRGDPPVILPNNTWFSFRATKIPLSPTLRPNTDTVCCRERGLRFSLWSLIYESVLYRGEKRSLLHFTLCPLIRTIATALAPRICMHRRSSCPGQVVVLPKLDTRIKPLSYGGNLSLRYGPLTQHGFQYVVKAVLQSILVSYGSNGEQSRSRTSNSHRSFARTSCTAYVTDVTYHSSQFGLQKNSFHITYTQEWTDHDRPQACIMGPEEHGTPSYVDGIPPRSGWAVLPHILCPRSSVVINATIKALRQHSFLPHHCEFSYESWSIGHVTYRKSEGRLKQYLFFLASRPRQDSMDACSVRQYVSEEQRSSPFCTSLPIKIVLRIFTGGVVPDPKSGLCSPLRADAVEELNLANPLIDVPPRKGFGLGGSPMMDQDSRRRLTYDTAAESIAISDTSSLFKLCTRICYVFACIRRVETHLNDFGKSCSVRDPLVRYIYIKGSRIIKYRSRKKFKNQLHWRGRNMRFLNQLCQYNVDTSLSGIKRGGITSRQRLASLQLKLGVHFYKVTDGKDTLVVRILNSPVLLGTHHSVRSGTSRDDCRCYCLVSPWERSVNGPAREHGDDISSISEVDTGGGNVAEAGSEYLLNRQTMPTHLERPYKIQPLKYAASEKARTEGSPLLEPTPINAGSANTCCQSPAVEKNAPSVS

### [Finding a Motif in DNA](https://rosalind.info/problems/subs/)

In [432]:
def find_substring(s, sub_s):
    idx = []
    for i in range(len(s)):
        if s[i:i+len(sub_s)] == sub_s:
            idx.append(i+1)
    return idx

In [439]:
# sub_s = 'ATAT'
# s = 'GATATATGCATATACTT'
with open('datasets/rosalind_subs.txt','r') as f:
    data = f.read().splitlines()
s = data[0]
sub_s = data[1]
for num in find_substring(s, sub_s):
    print(num)

2
43
106
153
169
221
228
315
361
395
455
483
498
523
549
556
563
582
604
693
700
807
834
863
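The same search can be sketched with `str.find`, which jumps directly to the next match instead of slicing at every index. `find_all` is an illustrative name; like `find_substring`, it reports 1-based positions and includes overlapping matches.

```python
def find_all(s, sub):
    positions = []
    start = s.find(sub)
    while start != -1:
        positions.append(start + 1)      # convert to 1-based position
        start = s.find(sub, start + 1)   # advance by one so overlaps are found
    return positions

print(find_all('GATATATGCATATACTT', 'ATAT'))  # [2, 4, 10]
```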


### [Calculating Expected Offspring](https://rosalind.info/problems/iev/)

In [454]:
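def expected_offspring(genotypes):
    '''
    genotypes: number of couples per pairing, ordered AA-AA, AA-Aa, AA-aa, Aa-Aa, Aa-aa, aa-aa.
    Returns the expected number of dominant-phenotype offspring (2 offspring per couple).
    '''
    prob_dominant = [1,1,1,0.75,0.5,0]  # chance that one offspring of each pairing is dominant
    total_offspring = 0
    for i in range(len(genotypes)):
        total_offspring += genotypes[i]*2*prob_dominant[i]
    return total_offspring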

In [453]:
expected_offspring([1,0,0,1,0,1])

3.5

In [456]:
with open('datasets/rosalind_iev.txt','r') as f:
    genotypes = f.read().strip('\n').split()
genotypes = [int(x) for x in genotypes]
expected_offspring(genotypes)

153475.0

### [Calculating Protein Mass](https://rosalind.info/problems/prtm/)

In [476]:
def protein_mass(protein_str):
    # mass_table.txt holds one "<amino acid letter> <mass>" pair per line
    with open('datasets/mass_table.txt','r') as f:
        data = f.read().splitlines()
    mass_table = {}
    for ele in data:
        amino_acid, mass = ele.split()
        mass_table[amino_acid] = float(mass)
    # total mass is the sum of the masses of the individual residues
    total_mass = 0
    for letter in protein_str:
        total_mass += mass_table[letter]
    return total_mass

In [477]:
protein_mass('SKADYEK')

821.3919199999999

In [478]:
with open('datasets/rosalind_prtm.txt','r') as f:
    protein_str = f.read().strip('\n')
protein_mass(protein_str)

106920.8708400006

### [Inferring mRNA from Protein](https://rosalind.info/problems/mrna/)

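**Exploring modulo rules: congruence**

If a ≡ b (mod n) and c ≡ d (mod n), then a + c ≡ b + d (mod n). The cells below check this for n = 11.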

In [492]:
a=29
b=73
c=10
d=32
n=11

In [483]:
a % n

7

In [484]:
b % n

7

In [488]:
(c) % n

10

In [489]:
(d) % n

10

In [490]:
(a + c) % n

6

In [491]:
(b + d) % n

6
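The same property holds for multiplication, which is the rule the mRNA count relies on. The sketch below reuses the values from the cells above and is only a numerical check for this one example, not a proof.

```python
a, b = 29, 73   # a and b leave the same remainder mod n
c, d = 10, 32   # c and d leave the same remainder mod n
n = 11

# Sums of congruent numbers are congruent.
print((a + c) % n, (b + d) % n)      # 6 6

# Products of congruent numbers are congruent as well.
print((a * c) % n, (b * d) % n)      # 4 4

# So operands can be reduced before multiplying without changing the result.
print(((a % n) * (c % n)) % n)       # 4
```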

In [514]:
with open('datasets/rna_codon_table.txt','r') as f:
    table = f.read().splitlines()
# d = {}
# for ele in table:
#     codon, amino_acid = ele.split()
#     d[codon] = amino_acid
d_inverse = {}
for ele in table:
    codon, amino_acid = ele.split()
    if amino_acid in d_inverse:
        d_inverse[amino_acid] += [codon]
    else:
        d_inverse[amino_acid] = [codon]
d_inverse

{'F': ['UUU', 'UUC'],
 'L': ['UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'],
 'S': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'],
 'Y': ['UAU', 'UAC'],
 'Stop': ['UAA', 'UAG', 'UGA'],
 'C': ['UGU', 'UGC'],
 'W': ['UGG'],
 'P': ['CCU', 'CCC', 'CCA', 'CCG'],
 'H': ['CAU', 'CAC'],
 'Q': ['CAA', 'CAG'],
 'R': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
 'I': ['AUU', 'AUC', 'AUA'],
 'M': ['AUG'],
 'T': ['ACU', 'ACC', 'ACA', 'ACG'],
 'N': ['AAU', 'AAC'],
 'K': ['AAA', 'AAG'],
 'V': ['GUU', 'GUC', 'GUA', 'GUG'],
 'A': ['GCU', 'GCC', 'GCA', 'GCG'],
 'D': ['GAU', 'GAC'],
 'E': ['GAA', 'GAG'],
 'G': ['GGU', 'GGC', 'GGA', 'GGG']}

In [517]:
def total_rna_strings(protein_str):
    '''
    Iterate through string
    For each amino acid letter, count number of different codons --> inverse codon table
    Multiply the counts together (including the 3 stop codons) to get total number of possible rna strings
    Return total possible rna strings modulo 1,000,000
    '''
    total_strs = 1
    for letter in protein_str:
        total_strs *= len(d_inverse[letter])
    total_strs *= len(d_inverse['Stop'])
    return total_strs % 1000000

In [518]:
total_rna_strings('MA')

12

In [519]:
with open('datasets/rosalind_mrna.txt','r') as f:
    protein_str = f.read().strip('\n')
total_rna_strings(protein_str)

450176
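Python integers do not overflow, so taking the modulus once at the end works here. In languages with fixed-width integers, the usual approach is to reduce after every multiplication, which the congruence rule allows. A minimal self-contained sketch follows; the codon counts are hard-coded from the standard RNA codon table rather than read from `rna_codon_table.txt`, and `count_rna_strings` is an illustrative name.

```python
# Number of codons encoding each amino acid in the standard RNA codon table.
CODON_COUNTS = {
    'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3,
    'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6,
    'T': 4, 'V': 4, 'W': 1, 'Y': 2,
}
STOP_CODONS = 3  # UAA, UAG, UGA

def count_rna_strings(protein, modulus=1_000_000):
    # Start with the choices for the stop codon, then reduce after each step
    # so the running product never exceeds modulus * 6.
    total = STOP_CODONS % modulus
    for amino_acid in protein:
        total = (total * CODON_COUNTS[amino_acid]) % modulus
    return total

print(count_rna_strings('MA'))  # 12
```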