<a href="https://colab.research.google.com/github/cmrn-rhi/bioinformatics-practice/blob/master/Rosalind_Bioinformatics_Stronghold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Rhiannon Cameron - [Rosalind Progress Profile](http://rosalind.info/users/cmrn-rhi/)*

*Python 3.6*

# Bioinformatics Stronghold
Learning bioinformatics and programming through problem solving

## Counting DNA Nucleotides





[Problem](http://rosalind.info/problems/dna/)

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

**Given:** A DNA string s of length at most 1000 nt.

**Return:** Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

<h4> Sample Dataset </h4>

```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
```

<h4> Sample Output </h4>

```
20 12 17 21
```


<h4> Version 1 </h4>

In [0]:
import collections

def dna_nuc_count(dna_string):
  """
  DNA Nucleotide Count
  
  Given: A DNA string (A, C, G, T) s of length at most 1000 nt.
  Return: Four integers (separated by spaces) counting the respective number of
          times that symbols 'A','C','G', and 'T' occur in s.
  """
  # ensure incoming data are all uppercase
  dna_string = dna_string.upper()
  
  # create ordered dictionary to ensure correct output order
  dna_dict = collections.OrderedDict()
  dna_dict['A'] = 0
  dna_dict['C'] = 0
  dna_dict['G'] = 0
  dna_dict['T'] = 0
  
  # count nucleotides and store associated count in dictionary
  for nucleotide in dna_string:
      dna_dict[nucleotide] += 1
  
  # return nucleotide counts separated by spaces
  return ' '.join([str(i) for i in dna_dict.values()])
  
  
  """
  Alternative Output
  Returns nucleotide counts as a tuple instead of a space-separated string;
          better format for inputting parameters in future functions. Works
          with an unordered dictionary.
  
  """
  # return dna_dict['A'], dna_dict['C'], dna_dict['G'], dna_dict['T']

In [0]:
# Sample Dataset Test

test_string = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
dna_nuc_count(test_string)

In [0]:
# Dataset Test

test_string = 'GAAGGAATGGGCTCATGCTGGTTTGCCCAACGGCTCCTCAACATGAAAGCCTGAAGTACACGTGAGATATATACCATTCCATACGCACCCGCGTACCGGGCATGGGAGTACAACCTGACGGCTCCGGCCGTGAGATCGGGTCATCTTGGCAGTGACACTAGGATGTTACAGGGTTAACGGACAGTTTGAACGCAGCTGTGATGGCGGCGAGCAATGAACTCAATCCATCACGAGTCACCTGGGTTAAACGGAGTATAAAGCTACTTCCGTCGGACTGGTCTTGTGGTCTGAGAAAATTGGATCGTCCGGCGAGACTAGTCCGGTACTCGAGTTAGGGTTCGCATTCCCGGGCTTTCCTCAGAGGGCTCACTACCTAGTGAACTTAGGAAGATCTACTCTGTGCGGCGTGTCAATTAGACAATGCTTTATGGAACGGGGCACTGGACAATTAGTGGTAAAAACTGAATCGCCCCCACGTGAAACTGCTCACGCTACATGCTGAATTGTTAACCCCTATAACAGTCTCTGTCAGGACTAGAACGGCTGGCGATCATATCCGGCGACGACTTTGCTCTCGTTTCAGAGGTATTGATCCTGCAGGCAGAAGGCGACCCCCGATTTTCATTCCTTTCCTGGTGTCATTTTGCTCAATATGGTATGATATAAGGTAGATGCGAATGTGTCTATATGTTGTCGAGCGCGCTTAGCCATTCGGTCACGCGTATTGACCATGGTTCGTCTAGATCATCACCCTCTCATAAGAGGAATGCCCCGGCAAGCAAGGGTGTCGCGTTTAGGGAGACTTTGTTGCCATAGCACGTGGAGAGAGTAAGAGGGGAAGCGCATAAGCGGGTGCTACGACCCGAAGCTGTCTCGGATGGCTATGTATCCTGGGGTAGGGACAGTAGAAAAGCGTGTAATGGCCGATACATCAGTCGCAATCCCGGGGCGTAGCTCTCGATAACTCTCCT'
dna_nuc_count(test_string)

In [0]:
# Dataset Test - From File

with open('rosalind_dna.txt') as infile:
  test_string = infile.readline()
  test_string_clean = test_string.rstrip('\n')
  print(dna_nuc_count(test_string_clean))

<h4> Version 2 (Better) </h4>

In [0]:
def dna_nuc_count2(dna):
  """
  DNA Nucleotide Count v2
  
  Given: A DNA string (A, C, G, T) s of length at most 1000 nt.
  Return: Four integers (separated by spaces) counting the respective number of
          times that symbols 'A','C','G', and 'T' occur in s.
  """
  # counts are returned in the required A, C, G, T order
  return dna.count("A"), dna.count("C"), dna.count("G"), dna.count("T")

<h4> Version 3 (Simple) </h4>

In [0]:

print(*map(input().count, "ACGT"))
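
A further variant, sketched here as an assumption rather than taken from the original notebook, uses `collections.Counter` to tally every symbol in a single pass. The function name `dna_nuc_count4` is hypothetical.

```python
from collections import Counter

def dna_nuc_count4(dna):
  """Return the counts of 'A', 'C', 'G' and 'T' as a space-separated string."""
  # Counter tallies every symbol in a single pass over the string
  counts = Counter(dna.upper())
  # a symbol that never occurs yields 0 rather than raising KeyError
  return ' '.join(str(counts[nt]) for nt in 'ACGT')

sample = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
print(dna_nuc_count4(sample))
```

Iterating over `'ACGT'` fixes the output order, so no ordered dictionary is needed.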

## Transcribing DNA into RNA


[Problem](http://rosalind.info/problems/rna/)

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

**Given:** A DNA string t having length at most 1000 nt.

**Return:** The transcribed RNA string of t.

**Sample Dataset**


```
GATGGAACTTGACTACGTAAATT
```


**Sample Output**


```
GAUGGAACUUGACUACGUAAAUU
```



<h4> Version 1 </h4>

In [0]:
def dna_to_rna(dna_string):
  """
  Given: a DNA string (A, C, G, T), having length of at most 1000 nt.
  Return: the transcribed RNA (A, C, G, U) string.
  
  """
  rna_string = ''
  
  for nucleotide in dna_string:
    if nucleotide == 'T':
      rna_string += 'U'
    else:
      rna_string += nucleotide
      
  return rna_string

In [0]:
# Sample Dataset Test

test_string = 'GATGGAACTTGACTACGTAAATT'
dna_to_rna(test_string)

In [0]:
# Dataset Test

dna_to_rna(input())

<h4> Version 2 </h4>

In [0]:
# User Input

dna_string = input()
print(dna_string.replace("T", "U"))

In [0]:
# From File

with open('rosalind_rna.txt') as infile:
  print(infile.read().replace("T", "U"))

UAUUUGCUAACCAUAGACCGAGAAGUCAACAAAUCGUGAAGCGUGACUAUUUCGGGCGCUACUUCACCCCGACGCCCGGCAUUUCACGCUUGGUGUGUACUACCACGGGGCCGCGGCGGAAUCCCACGCGGAGCUAACCCGUUGGGAUCAUGCAUGCGCCUCUCGGUGGUCAACGGAUGAACUAGUACACCGUCGACGUAAACAAGCUUAUCGCUACCUGGAGGAUUACUGCUGAACUAGUAAACCGGUACCCGGUAAGCCGAAUGCCUGUGAUCGAUGAGUUCUGGGUGUGCAGGAAGUACAUCGGUUAAAGUCUUGGCAGACAACUACCUUAUUUGUCCCCCAGGUGGAAUUACGAGGUGAGGCACUGACGGUAUAACAAGGAAGAGCAAUCAUUUUACACCGGACCUGUGGCCAACACCAAGGCACAGCAGUAUGCGGGAAUGACGCAAAGUCUUUGCACGAGACGGGCAAGCAGUGGCUGUAAAGUUGGACACAGUUGACUACAUUAGAGAUCUAAUACGCGUUCCAUAUCCGACUGGAGCUGGCUAAACAGGUAGAACACAUCGGCCAUUAGCAGAAGCUGUUGAAUUUGGAUCUAAUCCAAAACUAAACUCCAGUUAUGAACUCCCAAAUGAACACAACUCUCGGCGCCCAGUAGAAAUUCGUGCCUCAUUGUUCGGCGCUUGGUACACGUCGGGGCCUGGUGAGUAGUAUAGAUUUGCGCUCGUACAUUACUUCCCGCGCCGCGGGGAGAGUUUUAAAACGUGGUCUACUUGGGAUCUGAGCCUGAUAUGAAAUGGGAGAUGACAUCAGUGUACGGCGCAGCCGCACACGUCAGAAACGAAUAACUUUUUCGUAGCGCCAUUUGCCAAAGGGCUGAUUGUUGAUUGAGGAGCUCAAUCGUAUUGUAAGUCACUUGACAGCCCCUGCCUCUGGA



<h4> Version - Bash </h4>

`tr T U < rosalind_rna.txt`

## Complementing a Strand of DNA

[Problem](http://rosalind.info/problems/revc/)

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string s<sup>c</sup> formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

**Given:** A DNA string s of length at most 1000 bp.

**Return:** The reverse complement s<sup>c</sup> of s.

**Sample Dataset**


```
AAAACCCGGT
```


**Sample Output**


```
ACCGGGTTTT
```



<h4> Version 1 </h4>

In [0]:
def dna_rev_comp(dna_str):
  """
  DNA Reverse Complement
  
  Given: a DNA string (A, C, G, T) of length at most 1000 bp.
  Return: the reverse complement (rev_comp) of the DNA string.
  """
  
  # t & c temporarily hold positions; t will become T and c will become C
  temp_str = dna_str.replace("A", "t").replace("G", "c")
  temp_str2 = temp_str.replace("T", "A").replace("C", "G")
  comp_str = temp_str2.replace("t", "T").replace("c", "C")
  
  return comp_str[::-1]

In [0]:
# Sample Dataset Test

test_string = 'AAAACCCGGT'
dna_rev_comp(test_string)

In [0]:
# From File

with open('rosalind_revc.txt') as infile:
  # strip the trailing newline so it does not end up at the start of the output
  print(dna_rev_comp(infile.read().strip()))

<h4> Version 2 (Better) </h4>

In [0]:
def dna_rev_comp2(dna_str):
  """
  DNA Reverse Complement v2
  
  Given: a DNA string (A, C, G, T) of length at most 1000 bp.
  Return: the reverse complement (rev_comp) of the DNA string.
  """
  
  # t & c temporarily hold positions; t will become T and c will become C
  comp_str = dna_str.replace("A", "t").replace("G", "c").replace("T", "A")\
             .replace("C", "G").replace("t", "T").replace("c", "C")
  
  return comp_str[::-1]

<h4> Version 3 (Best) </h4>

In [0]:
def dna_rev_comp3(dna_str):
  """
  DNA Reverse Complement v3
  
  Given: a DNA string (A, C, G, T) of length at most 1000 bp.
  Return: the reverse complement (rev_comp) of the DNA string.
  """
  comp_str = { "A" : "T", "C" : "G", "G" : "C", "T" : "A" }

  return "".join([comp_str[n] for n in dna_str][::-1])
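
One more option, offered as a sketch rather than as part of the original notebook, is a translation table built with `str.maketrans`, which complements every symbol in one call. The names `COMPLEMENT` and `dna_rev_comp4` are hypothetical.

```python
# translation table mapping each nucleotide to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def dna_rev_comp4(dna_str):
  """Return the reverse complement of a DNA string."""
  # complement every symbol, then reverse the result
  return dna_str.translate(COMPLEMENT)[::-1]

print(dna_rev_comp4('AAAACCCGGT'))
```

Symbols that are not in the table (for example a trailing newline) pass through unchanged instead of raising `KeyError`, so input read from a file should still be stripped first.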

## Rabbits & Recurrence Relations

[Problem](http://rosalind.info/problems/fib/)

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if F<sub>n</sub> represents the number of rabbit pairs alive after the n-th month, then we obtain the Fibonacci sequence having terms F<sub>n</sub> that are defined by the recurrence relation F<sub>n</sub> = F<sub>n−1</sub> + F<sub>n−2</sub> (with F<sub>1</sub> = F<sub>2</sub> = 1 to initiate the sequence).

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

**Given:** Positive integers n≤40 and k≤5.

**Return:** The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

**Sample Dataset**


```
5 3
```


**Sample Output**


```
19
```



In [0]:
def fib_recur(n, k):
  """
  Fibonacci Recursion
  
  Given: Positive integers n <= 40 and k <= 5
    n = months
    k = rabbit pairs per litter
  Return: the total number of rabbit pairs that will be present after n months,
          if we begin with 1 pair and in each generation, every pair of 
          reproduction-age rabbits produces a litter of k rabbit pairs.
  """
  # Base cases: no rabbits at month 0; the first pair at month 1
  if n == 0 or n == 1:
    return n
  # Recursive case: last month's pairs plus k offspring pairs for every pair
  # alive two months ago
  else:
    return fib_recur(n-1, k) + fib_recur(n-2, k)*k
 

In [0]:
fib_recur(29, 5)

1850229480761
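
The plain recursion above recomputes the same terms many times, so it slows down sharply as n approaches 40. A minimal dynamic-programming sketch (not part of the original notebook; the name `fib_iter` is hypothetical) keeps only the two most recent terms and builds upwards from the smallest cases:

```python
def fib_iter(n, k):
  """Return the number of rabbit pairs after n months with litters of k pairs."""
  # prev and curr hold the pair counts for months 0 and 1 respectively
  prev, curr = 0, 1
  for _ in range(n - 1):
    # next month = this month's pairs + k offspring pairs for every pair
    # that was alive the month before
    prev, curr = curr, curr + k * prev
  return curr

print(fib_iter(5, 3))
print(fib_iter(29, 5))
```

Each term is computed once, so the running time grows linearly with n.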

## Computing GC Content

[Problem](http://rosalind.info/problems/gc/)

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
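
The worked example in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `gc_content` is hypothetical, and the reverse complement is written out by hand.

```python
def gc_content(dna):
  """Return the percentage of symbols in dna that are 'G' or 'C'."""
  return 100 * (dna.count('G') + dna.count('C')) / len(dna)

# 3 of the 8 symbols are 'G' or 'C'
print(gc_content('AGCTATAG'))
# reverse complement of 'AGCTATAG', written out by hand
print(gc_content('CTATAGCT'))
```

Both calls give the same percentage, as the text states for any string and its reverse complement.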

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

**Given**: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

**Return**: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

**Sample Dataset**


```
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
```


**Sample Output**



```
Rosalind_0808
60.919540
```



<h4> Version 1 - BioPython </h4>

In [0]:
# install BioPython (includes NumPy).
!pip install biopython

In [0]:
# import Bio SeqIO package from BioPython.
from Bio import SeqIO

def gc_content_greatest(dna_strings):
  """
  Computing Greatest GC Content
  
  Given: file containing <= 10 DNA strings (A, C, G, T), FASTA format,
         length <= 1kbp/string.
  Return: DNA string ID with the highest GC content, followed by the GC content
          of that string. Default error of 0.001.
          
  nt_count = total nucleotide count of DNA string.
  gc_count = total GC nucleotide counts of DNA string.
  gc_percent = GC content; percent of nucleotides in DNA string that are G+C.
  gc_greatest = DNA string with the highest GC content percent.
  id_greatest = ID for DNA string with the highest GC content percent.
  """
  gc_greatest = 0
  
  for seq_record in SeqIO.parse(dna_strings, "fasta"):
    nt_count = len(seq_record)
    gc_count = 0  

    for nt in seq_record.seq:
      if nt == 'G' or nt == 'C':
        gc_count += 1

    gc_percent = (gc_count/nt_count)*100

    # store/update record of ID and GC% of sequence with greatest GC%
    if gc_greatest < gc_percent:
      gc_greatest = gc_percent
      id_greatest = seq_record.id

  print(id_greatest)
  print(round(gc_greatest, 3)) 

In [0]:
gc_content_greatest("rosalind_gc.fasta")

<h4>Version 2 - BioPython (Best)</h4>

*Be sure to install package using first code block from Version 1.*

In [0]:
# import Bio SeqIO package from BioPython.
from Bio import SeqIO
# import GC function: calculates G+C content, returns percentage as float.
from Bio.SeqUtils import GC

def gc_content_greatest2(dna_strings):
  """
  Computing Greatest GC Content
  
  Given: file containing <= 10 DNA strings (A, C, G, T), FASTA format, length
          <= 1kbp/string.
  Return: DNA string ID with the highest GC content, followed by the GC content
          of that string. Default error of 0.001.
          
  gc_greatest = DNA string with the highest GC content percent.
  id_greatest = ID for DNA string with the highest GC content percent.
  """

  gc_greatest = 0
  id_greatest = ""

  with open(dna_strings, 'r') as infile:

    for record in SeqIO.parse(infile, "fasta"):
        if gc_greatest < GC(record.seq):
            gc_greatest = GC(record.seq)
            id_greatest = record.id

  print(id_greatest)
  print(round(gc_greatest,3))

In [0]:
gc_content_greatest2("rosalind_gc.fasta")

<h4>Version 3 (Python Only)</h4>

In [0]:
def gc_content_greatest3(dna_strings):
  """
  Computing Greatest GC Content
  
  Given: file containing <= 10 DNA strings (A, C, G, T), FASTA format, 
         length <= 1kbp/string.
  Return: DNA string ID with the highest GC-content, followed by the GC-content
          of that string. Default error of 0.001.
          
  gc_dict = dictionary where DNA string IDs are stored as keys while associated
            DNA string GC content is stored as values.
  gc_percent = GC content; percent of nucleotides in DNA string that are GC.
  seq_id = DNA string ID.
  """
   
  with open(dna_strings, 'r') as infile:
    data = infile.read()
    gc_dict = {}
    
    # iterate over sequence blocks; beginning indicated by '>'
    for block in data.split(">")[1:]: 
        
        # split data at newline; [0] is ID while [1:] is DNA string.
        parts = block.split("\n")      
        # assign sequence ID
        seq_id = parts[0]  
        # join all sequence strings into one string
        seq = ''.join(parts[1:])
        gc_percent = 100 * ( seq.count("G") + seq.count("C") ) / float(len(seq))
        # insert sequence ID and associated GC content into dictionary
        gc_dict[seq_id] = gc_percent
        
  # ID of the sequence with the greatest GC content
  id_greatest = max(gc_dict, key=gc_dict.get)
  print(id_greatest)
  print(round(gc_dict[id_greatest], 3))

In [0]:
gc_content_greatest3("rosalind_gc.txt")

## Counting Point Mutations

[Problem](http://rosalind.info/problems/hamm/)

Given two strings s and t of equal length, the **Hamming distance** (the minimum number of symbol substitutions required to transform one string into the other) between s and t, denoted d<sub>H</sub>(s,t), is the number of corresponding symbols that differ in s and t. See Figure 2.
<br />
<br />

![Figure 2.](http://rosalind.info/media/problems/hamm/Hamming_distance.png)

<sub>Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.</sub>
<br />
<br />
**Given**: Two DNA strings s and t of equal length (not exceeding 1 kbp).

**Return**: The Hamming distance d<sub>H</sub>(s,t).
<br />
<br />

**Sample Dataset**


```
GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT
```


**Sample Output**


```
7
```



<h4>Version 1</h4>

In [0]:
def calc_point_mutation(s,t):
  """
  Counting Point Mutations
  
  Given: two DNA strings (A, C, G, T) s and t of equal length (<= 1kbp).
  Return: the Hamming distance (the minimum number of symbol substitutions
          required to transform one string into the other) between s and t.
          
  pt_mut = point mutation; corresponding nucleotides differ between s and t.
  """
  pt_mut = 0
  
  for nt in range(len(s)):
    if s[nt] != t[nt]:
      pt_mut += 1
  
  return pt_mut

In [0]:
# Sample Dataset

calc_point_mutation('GAGCCTACTAACGGGAT','CATCGTAATGACGGCCT')

In [0]:
# Test Dataset - From File

with open("rosalind_hamm.txt", 'r') as infile:
  data = infile.read()
  # separate sequences by newlines
  seqs = data.split("\n")

calc_point_mutation(seqs[0],seqs[1])

<h4>Version 2 (Better)</h4>

*`zip` pairs up the symbols of s and t lazily, yielding one tuple at a time, so the full list of pairs is never held in memory.*

In [0]:
def calc_point_mutation2(s,t):
  """
  Counting Point Mutations
  
  Given: two DNA strings (A, C, G, T) s and t of equal length (<= 1kbp).
  Return: the Hamming distance (the minimum number of symbol substitutions
          required to transform one string into the other) between s and t.
          
  pt_mut = point mutation; corresponding nucleotides differ between s and t.
  """
  return sum(nt1 != nt2 for nt1, nt2 in zip(s,t))

In [0]:
# Sample Dataset
calc_point_mutation2('GAGCCTACTAACGGGAT','CATCGTAATGACGGCCT')
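
Note that `zip` stops silently at the end of the shorter string, so Version 2 would return a count even for inputs of unequal length. The sketch below adds an explicit length check; it is an optional extension, and the name `calc_point_mutation3` is hypothetical.

```python
def calc_point_mutation3(s, t):
  """Return the Hamming distance between two strings of equal length."""
  # the Hamming distance is only defined for strings of equal length
  if len(s) != len(t):
    raise ValueError('strings must be of equal length')
  # each mismatch evaluates to True, which sum() counts as 1
  return sum(nt1 != nt2 for nt1, nt2 in zip(s, t))

print(calc_point_mutation3('GAGCCTACTAACGGGAT', 'CATCGTAATGACGGCCT'))
```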

## Mendel's First Law

[Problem](http://rosalind.info/problems/iprb/)

...

**Given**: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

**Return**: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

**Sample Dataset**


```
2 2 2
```


**Sample Output**


```
0.78333
```



<h4>Version 1</h4>

In [0]:
def dominant_allele_prob(k,m,n):
  """
  Dominant Allele Probability
  
  Given: three positive ints k, m, and n representing interfertile populations;
         k = individuals homozygous dominant for a factor.   (e.g. AA)
         m = individuals heterozygous for a factor.          (e.g. Aa)
         n = individuals homozygous recessive for a factor.  (e.g. aa)
  Return: the probability that two randomly selected mating organisms will
          produce an individual possessing a dominant allele.
  """
  
  # Punnett square dominant factor probabilities.
    # AA+AA = 1
    # AA+Aa = 1
    # AA+aa = 1
    # Aa+Aa = 0.75
    # Aa+aa = 0.5
    # aa+aa = 0
  
  total = k + m + n
  
  # AA+AA event
  kk_prob = (k/total) * ((k-1)/(total-1))
  # AA+Aa and Aa+AA events
  km_prob = ((k/total)*(m/(total-1)) + ((m/total)*(k/(total-1))))
  # AA+aa and aa+AA events
  kn_prob = ((k/total) * (n/(total-1)) + ((n/total) * (k/(total-1))))
  # Aa+Aa event
  mm_prob = ((m/total) * ((m-1)/(total-1))) * 0.75
  # Aa+aa and aa+Aa events
  mn_prob = ((m/total) * (n/(total-1)) + ((n/total) * (m/(total-1)))) * 0.5
  # aa+aa event (never yields a dominant allele, so it contributes 0)
  nn_prob = ((n/total) * ((n-1)/(total-1))) * 0
  
  return round(kk_prob + km_prob + kn_prob + mm_prob + mn_prob + nn_prob, 5)

In [0]:
# Sample Dataset
dominant_allele_prob(2,2,2)

0.78333

In [0]:
# Test Dataset
dominant_allele_prob(28,19,15)

<h4>Version 2 (Simpler)</h4>

In [0]:
def dominant_allele_prob2(k,m,n):
  """
  Dominant Allele Probability
  
  Given: three positive ints k, m, and n representing interfertile populations;
         k = individuals homozygous dominant for a factor.   (e.g. AA)
         m = individuals heterozygous for a factor.          (e.g. Aa)
         n = individuals homozygous recessive for a factor.  (e.g. aa)
  Return: the probability that two randomly selected mating organisms will
          produce an individual possessing a dominant allele.
  """
  
  # Punnett square non-dominant factor probabilities.
    # AA+AA = 0
    # AA+Aa = 0
    # AA+aa = 0
    # Aa+Aa = 0.25
    # Aa+aa = 0.5
    # aa+aa = 1
  
  total = k + m + n
  
  # Aa+Aa event resulting in aa.
  # (m*(m-1)*0.25) / (total*(total-1))
  
  # Aa+aa and aa+Aa events resulting in aa.
  # ((m*n) + (n*m))*0.5 / (total*(total-1))
  
  # aa+aa event.
  # n*(n-1) / (total*(total-1))
  
  prob_aa = (((m*n) + (m*(m-1)*0.25) + (n*(n-1))) / (total*(total-1)))
  
  # return 1 minus probability of non-dominant to get dominant probability.
  return round(1 - prob_aa, 5)

In [0]:
# Sample Dataset
dominant_allele_prob2(2,2,2)
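
To check the arithmetic without floating-point rounding, the same complement calculation can be done with exact fractions. This is a sketch only; `dominant_allele_prob3` is a hypothetical name and the use of `fractions.Fraction` is an addition to the original approach.

```python
from fractions import Fraction

def dominant_allele_prob3(k, m, n):
  """Return the exact probability of a dominant phenotype as a Fraction."""
  total = k + m + n
  # number of ordered pairs of distinct organisms
  pairs = total * (total - 1)
  # weighted count of ordered pairs whose offspring is aa:
  # Aa+Aa (weight 1/4), Aa+aa in either order (2*m*n pairs, weight 1/2), aa+aa
  aa = Fraction(m * (m - 1), 4) + m * n + n * (n - 1)
  return 1 - aa / pairs

exact = dominant_allele_prob3(2, 2, 2)
print(exact, round(float(exact), 5))
```

For the sample dataset the exact value is 47/60, which rounds to the expected 0.78333.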

## Translating RNA into Protein

[Problem](http://rosalind.info/problems/prot/)

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

**Given**: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

**Return**: The protein string encoded by s.

**Sample Dataset**


```
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
```


**Sample Output**


```
MAMAPRTEINSTRING
```

<h4>Version 1</h4> (1/6 Translations)

In [0]:
def translate_rna(rna_string):
  """
  Translate RNA into Protein
  
  Given: an RNA string s corresponding to a strand of mRNA (length <=10 kbp).
  Return: the protein string (amino acid chain) encoded by s.
          (One of six possible translations)
  """
  # RNA codon table
  codon_dict = {
    "UUU":"F",    "CUU":"L", "AUU":"I", "GUU":"V",
    "UUC":"F",    "CUC":"L", "AUC":"I", "GUC":"V",
    "UUA":"L",    "CUA":"L", "AUA":"I", "GUA":"V",
    "UUG":"L",    "CUG":"L", "AUG":"M", "GUG":"V",
    "UCU":"S",    "CCU":"P", "ACU":"T", "GCU":"A",
    "UCC":"S",    "CCC":"P", "ACC":"T", "GCC":"A",
    "UCA":"S",    "CCA":"P", "ACA":"T", "GCA":"A",
    "UCG":"S",    "CCG":"P", "ACG":"T", "GCG":"A",
    "UAU":"Y",    "CAU":"H", "AAU":"N", "GAU":"D",
    "UAC":"Y",    "CAC":"H", "AAC":"N", "GAC":"D",
    "UAA":"stop", "CAA":"Q", "AAA":"K", "GAA":"E",
    "UAG":"stop", "CAG":"Q", "AAG":"K", "GAG":"E",
    "UGU":"C",    "CGU":"R", "AGU":"S", "GGU":"G",
    "UGC":"C",    "CGC":"R", "AGC":"S", "GGC":"G",
    "UGA":"stop", "CGA":"R", "AGA":"R", "GGA":"G",
    "UGG":"W",    "CGG":"R", "AGG":"R", "GGG":"G"}
  
  aa_chain = ''
  
  # find first occurrence of a start codon.
  start = rna_string.find('AUG')
  # truncate sequence preceding start codon.
  trim_string = rna_string[start:]

  for i in range(0, len(trim_string),3):
    codon = trim_string[i:i+3]
    
    if codon in codon_dict:
      # check for stop codon.
      if codon_dict[codon] == "stop":
        break
      # if not stop codon, add the amino acid to the chain.
      else:
        aa_chain += codon_dict[codon]
      
  return aa_chain

In [0]:
# Sample Dataset
translate_rna('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA')

'MAMAPRTEINSTRING'

In [0]:
# Test Dataset - From File
with open("rosalind_prot.txt", 'r') as infile:
  data = infile.read()
  
translate_rna(data)

'MSSPLGWPAVGLRIVDDPGKLINLGPKTEGLRFYRRAGSGLPETFNAHKHVRRRTAIVTPTVYDKRLQTPTKHHYATAIVGDNPKLRRVLPSIRGHPTCGRHLHYEDSDSSTRMNCRRMRGLLNMLARRSSSCERRCISLQGSLFDPAGSKVSGFLDPVTQHCSITAQTPLLIFSCSHYIADNEDCSKLNFHRLFVWVLAPFTVTQYEYTPSHSAHQVQLSNRPPQSMPSVQTYINTNANRFLMRRDVTLRGSTSDGKNIGDPERGSRTRSTARASPASTSVSRIKSIKLSEDFWTYAMLCLFPIAIYCQHRPRFNYDIYGYMPIEASLLQNAKSIRWSWPLTTMVYLVKRCPEPRHESPRKTRNNVYGPNTTWGSIPHDIPAHNARPPLSLRETRSPLQYYVGRLKTKVCRISAAIQSRKNPKHSYTRMKASACYGDNCMRDSFGLSPNDSVSSGQQSFIRDYQKTASTPFVTLGYGIERNVWCLPRYQTLLTLANASQRLWAHLVVLSRTKELKSFFTINVTFFLKYIGNIDCISFRYSPTISYIYVFRIWHTGVRYVEIGTHWGGHEGALLIPTKTCFRMFLGPGTYSALGLRFLHFRNGSLCELFSHSHNGPYLQHVQYVKTESLMTLCLTDTMIEWPYNGEVLKPASRQICDCPTCRCEHWYHKSILRLPGLKNRRANNCEFICKCDLLQSTHGYRANLSMSNAKQLLSRGPSTRVWDRSKYTPRSEYTSLVFEEMMNRSRAWPNMSESISKTGRLKVRGNSDLRSDKDPVNPVADKQLKRDQLVYTCPIRTPGLCVLFIASGNIPGTYLGVSCAENEAAYKGLPRARSSSHYTPLYGSLRVGSRLADGKLGCNLSAVLDRCTRNAFRYVNIATQKPSELSSSDLLPLSQNTQNYVPSIVLWRMRDKKMDCQSVRPEVSRSSPRVEDGQVAGVAKVHKHFALRRCCHFSWLQQKLKILNHLVSFHTFSLSVERTVSPTPQPTYRIAQTYAKLTE
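
The 64-entry codon dictionary can also be generated instead of typed out. The sketch below is an alternative, not the notebook's own method: it assumes the standard genetic code listed in U, C, A, G order with `*` marking stop codons. Unlike Version 1, it reads from the first symbol rather than searching for `AUG`, and it stops at the first unrecognised codon. All names are hypothetical.

```python
from itertools import product

BASES = 'UCAG'
# amino acids of the standard genetic code, in the order of the codons
# produced by product(BASES, repeat=3); '*' marks a stop codon
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_rna_compact(rna_string):
  """Translate an RNA string from its first symbol up to the first stop codon."""
  aa_chain = []
  for i in range(0, len(rna_string) - 2, 3):
    aa = CODON_TABLE.get(rna_string[i:i+3])
    # stop at a stop codon or at anything that is not a valid codon
    if aa is None or aa == '*':
      break
    aa_chain.append(aa)
  return ''.join(aa_chain)

print(translate_rna_compact('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'))
```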

<h4>Version 2 - BioPython (In Progress)</h4>(1/6 Translations)

In [0]:
# install BioPython (includes NumPy).
!pip install biopython

In [0]:
from Bio.Seq import Seq

def extend_seq(seq_string):
    """ 
    Extend Sequence
    
    Given: DNA/RNA string.
    Return: Original DNA/RNA string extended with 'N' such that the length is 
            a multiple of 3 if it is not already so. Avoids "Biopython Warning:
            Partial codon" and potential future errors.
    
    """
    remainder = len(seq_string) % 3
    
    if remainder == 0:
      return seq_string
    else:
      return seq_string + Seq('N' * (3 - remainder))
      #return '\n'.rstrip(seq_string) + str('N' * (3 - remainder))


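**Issue:** Cannot currently get **extend_seq()** to work as it keeps adding a **\n** when concatenating the return variables. This happens with both string and Seq types (though it should happen with neither) and I cannot seem to successfully strip it. A likely cause is that the input read with `infile.read()` already ends with a newline, so the padding is appended after it; note also that `'\n'.rstrip(seq_string)` strips characters from the string `'\n'` rather than from `seq_string`.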
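
One possible workaround, assuming the stray `\n` comes from the file contents rather than from the concatenation itself, is to remove all whitespace before padding. The sketch below works on plain `str` values only, and the name `extend_seq_str` is hypothetical.

```python
def extend_seq_str(seq_string):
  """Return seq_string without whitespace, padded with 'N' to a multiple of 3."""
  # drop newlines and any other whitespace picked up when reading a file
  cleaned = ''.join(seq_string.split())
  remainder = len(cleaned) % 3
  if remainder == 0:
    return cleaned
  # pad with 'N' so the final codon is complete
  return cleaned + 'N' * (3 - remainder)

print(repr(extend_seq_str('AUGGC\n')))
```

The result could then be wrapped in `Seq(...)` before translation if the newline was indeed the cause.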

In [0]:
from Bio.Seq import Seq

def translate_rna(rna_string):
  """
  Translate RNA into Protein
  
  Given: a file containing an RNA string s corresponding to a strand of mRNA 
         (length <=10 kbp).
  Return: the protein string (amino acid chain) encoded by s (one of six
          possible translations). Will raise "BiopythonWarning: Partial codon"
          if trailing nucleotides do not form a codon but will still output
          the protein string.
          
  """  
  with open(rna_string, 'r') as infile:
    data = Seq(infile.read())
    #data = infile.read()
    
    #data_extend = extend_seq(data)
    #data_extend = Seq(extend_seq(data))
  
  return str(data.translate())[:-1]
  #return type(data_extend), type(data)

In [0]:
#Test Dataset - From File
translate_rna("rosalind_prot.txt")



'MSSPLGWPAVGLRIVDDPGKLINLGPKTEGLRFYRRAGSGLPETFNAHKHVRRRTAIVTPTVYDKRLQTPTKHHYATAIVGDNPKLRRVLPSIRGHPTCGRHLHYEDSDSSTRMNCRRMRGLLNMLARRSSSCERRCISLQGSLFDPAGSKVSGFLDPVTQHCSITAQTPLLIFSCSHYIADNEDCSKLNFHRLFVWVLAPFTVTQYEYTPSHSAHQVQLSNRPPQSMPSVQTYINTNANRFLMRRDVTLRGSTSDGKNIGDPERGSRTRSTARASPASTSVSRIKSIKLSEDFWTYAMLCLFPIAIYCQHRPRFNYDIYGYMPIEASLLQNAKSIRWSWPLTTMVYLVKRCPEPRHESPRKTRNNVYGPNTTWGSIPHDIPAHNARPPLSLRETRSPLQYYVGRLKTKVCRISAAIQSRKNPKHSYTRMKASACYGDNCMRDSFGLSPNDSVSSGQQSFIRDYQKTASTPFVTLGYGIERNVWCLPRYQTLLTLANASQRLWAHLVVLSRTKELKSFFTINVTFFLKYIGNIDCISFRYSPTISYIYVFRIWHTGVRYVEIGTHWGGHEGALLIPTKTCFRMFLGPGTYSALGLRFLHFRNGSLCELFSHSHNGPYLQHVQYVKTESLMTLCLTDTMIEWPYNGEVLKPASRQICDCPTCRCEHWYHKSILRLPGLKNRRANNCEFICKCDLLQSTHGYRANLSMSNAKQLLSRGPSTRVWDRSKYTPRSEYTSLVFEEMMNRSRAWPNMSESISKTGRLKVRGNSDLRSDKDPVNPVADKQLKRDQLVYTCPIRTPGLCVLFIASGNIPGTYLGVSCAENEAAYKGLPRARSSSHYTPLYGSLRVGSRLADGKLGCNLSAVLDRCTRNAFRYVNIATQKPSELSSSDLLPLSQNTQNYVPSIVLWRMRDKKMDCQSVRPEVSRSSPRVEDGQVAGVAKVHKHFALRRCCHFSWLQQKLKILNHLVSFHTFSLSVERTVSPTPQPTYRIAQTYAKLTE

## Finding a Motif in DNA

[Problem](http://rosalind.info/problems/subs/)


Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

_**Note:** in Python we use 0-based numbering whereas this example is using 1-based numbering_

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).
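
The difference between the 1-based positions used in the problem text and Python's 0-based indexing can be made concrete with the example string from above. This is only an illustrative sketch.

```python
s = 'AUGCUUCAGAAAGGUCUUACG'

# 1-based positions of every 'U', matching the numbering used in the text
u_positions = [i + 1 for i, symbol in enumerate(s) if symbol == 'U']
print(u_positions)

# the text's s[2:5] (1-based, end inclusive) corresponds to
# s[1:5] in Python (0-based, end exclusive)
print(s[1:5])
```

This is why the solutions below add 1 to each Python index before reporting a location.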

**Given:** Two DNA strings s and t (each of length at most 1 kbp).

**Return:** All locations of t as a substring of s.

**Sample Dataset**


```
GATATATGCATATACTT
ATAT
```


**Sample Output**


```
2 4 10
```



<h4>Version 1</h4>

In [0]:
def get_dna_motif(s,t):
  """
  Find DNA Motif Locations

  Given: two strings s and t (each of length <= 1kbp).
  Return: all locations of t as a substring of s.
  """
  sub_position = []
  
  for i in range(len(s)):
   
    if s[i:i+len(t)] == t:
      sub_position.append(i+1)
    
  print(*sub_position)

In [0]:
# Sample Dataset
get_dna_motif('GATATATGCATATACTT','ATAT')

2 4 10


In [0]:
# Test Dataset
get_dna_motif('CGGATCTAAACGAGTTCTAAACATCTAAACAATTCTAAACCGGCTCTAAACGGTGTTCTAAACTCTAAACCGTCTAAACTCTAAACTCTAAACTAAGCTCTCTAAACTGTTTCTAAACGGAGGGTCTAAACGTCTAAACCTTCTAAACAATCTAAACTCTCTAAACCTCTAAACGTACTCTAAACTCTAAACTCTAAACTTCTAAACCTCTAAACGTAATCTAAACTATCTAAACTTTCTAAACGGTCTAAACGGCTACGCTTCTAAACTCTCTAAACTTGAAGATCTCTAAACTCTAAACTCTAAACTCTAAACATTCTAAACTTCTAAACTTCTAAACGCTGGATCTAAACACTTCTAAACATCTAAACTGATCGCTTCTAAACTTCTAAACTCTAAACTTCTAAACGCTCTAAACAACGAGAGTCTAAACACTCTAAACTCTAAACCATCTAAACAGGATTAACTTCTAAACTTCTAAACTCCTCTAAACATCTAAACTATGTCTAAACGCAATCTATCTAAACATCTAAACTCTAAACTCTAAACTCTAAACGATATTCTAAACCGATCTAAACAATCTAAACCTCTAAACCATCTAAACTCTAAACGTCTAAACGCCTTCTAAACTATCTAAACTCTAAACTAATCTCTAAACTCTAAACCTCTAAACTCTAAACGGCGTGGTCTAAACTCTAAACGTCTAAACTCTAAACCCTTCTAAACTCTAAACGAAATCTAAACCCGTCTAAACATCTAAACTCTAAACCCCCGAATCTAAACTCTAAACTAATCTAAACTCTAAACATCTAAACTTCTAAACTTCTAAACCAGCCCTCATAAAGGGCACTTAATCTAAACATCTAAACCATGTCTAAACGGGCGTCTAAACTGTTCTCCGTCTAAACTCTAAACGTACTGATTCTAAAC','TCTAAACTC')

In [0]:
# Test Dataset - From File
with open('rosalind_subs.txt') as infile:
  sequence, motif = infile.read().split()
get_dna_motif(sequence, motif)

<h4>Version 2</h4>

In [0]:
def get_dna_motif2(s,t):
  """
  Find DNA Motif Locations

  Given: two strings s and t (each of length <= 1kbp).
  Return: all locations of t as a substring of s.
  """
  print(*[i+1 for i in range(len(s)) if s[i:i+len(t)] == t])

In [0]:
# Sample Dataset
get_dna_motif2('GATATATGCATATACTT','ATAT')
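
A further variant, sketched here as an assumption rather than taken from the notebook, lets `str.find` do the searching and returns the locations as a list instead of printing them. The name `get_dna_motif3` is hypothetical.

```python
def get_dna_motif3(s, t):
  """Return the 1-based locations of t as a substring of s."""
  positions = []
  # str.find returns -1 once no further match exists
  i = s.find(t)
  while i != -1:
    positions.append(i + 1)
    # resume one symbol later so overlapping matches are still found
    i = s.find(t, i + 1)
  return positions

print(*get_dna_motif3('GATATATGCATATACTT', 'ATAT'))
```

Returning a list keeps the function reusable; the caller decides how to format the output.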

## Consensus and Profile

[Problem](http://rosalind.info/problems/cons/)

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write A<sub>i,j</sub> to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P<sub>1,j</sub> represents the number of times that 'A' occurs in the jth position of one of the strings, P<sub>2,j</sub> represents the number of times that C occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

**Given:** A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

**Return:** A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

**Sample Dataset:**


```
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
```


**Sample Output:**


```
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
```



<h4> Version 1 - Biopython </h4>

In [0]:
# install BioPython (includes NumPy).
!pip install biopython
# import Bio SeqIO package from BioPython.
from Bio import SeqIO

Collecting biopython
  Downloading https://files.pythonhosted.org/packages/ed/77/de3ba8f3d3015455f5df859c082729198ee6732deaeb4b87b9cfbfbaafe3/biopython-1.74-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
Installing collected packages: biopython
Successfully installed biopython-1.74


In [0]:
# import NumPy
import numpy as np
# import Counter subclass from collections library.
from collections import Counter

In [0]:
def consensus_profile(fasta_file):
  """
  Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) 
         in FASTA file format.
  Return: A single consensus string, even if multiple exist, and profile matrix
         for the collection.
          
  dna_matrix = a matrix where the rows are DNA strings and the columns are 
         relative nucleotide positions.
  profile_matrix = a 4 x n matrix of counts where the rows are 'A', 'C', 'G', 'T'
         and the columns are relative nucleotide positions.
  """
  dna_lists = []
  consensus = ""
  
  with open(fasta_file, 'r') as infile:

    for record in SeqIO.parse(infile, "fasta"):
      # store each sequence as a list of single characters.
      dna_lists.append(list(record.seq))
  
  # convert to sequence array.
  dna_matrix = np.vstack(dna_lists)
 
  # transpose so that each item is one column (position) of the dna matrix.
  temp_matrix = zip(*dna_matrix)
  
  # count nucleotide frequencies per column and save as counters (dict subclass).
  nuc_freq = [Counter(col) for col in temp_matrix]
  
  # generate consensus sequence.
  for counter in nuc_freq:
    consensus += max(counter, key=counter.get)
    
  # generate profile matrix (rows = 'A','C','G','T'; columns = positions).
  profile_matrix = (np.array([[counter.get(char,0) for char in 'ACGT']
                   for counter in nuc_freq]).T)
    
  print(consensus)
  
  for i,letter in enumerate('ACGT'):
    print(f"{letter}: {' '.join([str(count) for count in profile_matrix[i]])}")

In [0]:
# Sample Dataset - From File
consensus_profile('rosalind_sample.fasta')

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6


In [0]:
# Test Dataset - From File
consensus_profile('rosalind_cons.txt')

TCGGTTCTTCATATGTTCACCGAGGCCGACAACAGGTGGTAGTGCCTACTTCTGAGTGCTAGGAATCCATGGCAAGGCGCTTCGAGTCCCTTCCAGTTATTCTTGGTCAGTTCTGAAAACCAGCCTGGGTAGATTGTAGGTCACCGCTAATACAACACGGTCTTTTCGACGGTGCATCTCATTCAACCTCTCATACTCTGAGTCCCTAAATACAGAAGCTTAGGGTTTTTTACCCCTGCAACTCATGCTCGAGCTCAGTAAAGTAGTGGACGCTGAGCTGGACTCACCTGACTACCCATTCAACAATAAATAGGGGCGCAAGCTACGGCCGGTTCGCGCGGCTTAGAACCATCTTTTTAAGTCCGCTTGAGATCTTCCTGTTCGCGACCATGAGCTGCACCATACTTGTCTCTTTGGCCTGTGACCTAACTTAGATAGACACACTCTTTCCCGTGGTATCACCAGTTTAATACCTATGCTCCATGGCAGCCACCGGTAGCAGGAGGACGAATACAAATTCCGTGGAAGAACAATACAGAGGATTGGCTCATACAGCAGAGCTTGAATGTTGGGGATTAAACCACGATTCGCCCTCATATAGAATCTGCTACGGGATCTCTTAACGGATGGTATCTATTTAAGTACTGACATTAGTACACGGTGCACGGTGGTCCTGGGCAAGGGGGTGCCTGCTAGTCCGATAAGGTCCTGACGCGTCGAGTAATGATTCCCGGTAGGCGGGAGTAGTACAACTACAGAGTTGACGAAGAAAAGTATAGGCACTTGGCATGGCGTCGTACAATATGTGCCGTTTTGAACGAGACACGGCTTTTCACCTACACATCCAATACAAACGCTCTACCTTGCACGAGCTGAACGGTTAAGACCCCTTGACCATACTAGCGAACCTGA
A: 3 2 2 1 2 3 2 2 0 4 4 3 5 2 1 2 1 1 3 3 1 1 4 3 3 0 1 2 4 3 5 6 4 4 2 0 1 2 3 3 5 0 

<h4> Version 2 - Biopython (by Glen)</h4>

In [0]:
# Read FASTA to a np array
sequence_array = np.array([record.seq for record in SeqIO.parse('rosalind_sample.fasta','fasta')])

# generate counters for each column of the sequence array
counters = [Counter(column) for column in sequence_array.T]

# generate the profile matrix
profile_matrix = np.array([[counter.get(char,0) for char in "ACGT"] for counter in counters])

print("".join([counter.most_common(1)[0][0] for counter in counters]))
for i,letter in enumerate("ACGT"):
  print(f"{letter}: {' '.join([str(count) for count in profile_matrix.T[i]])}")

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
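
Both versions above rely on Biopython and NumPy. As a rough sketch of the same idea using only the standard library, assuming the sequences have already been parsed into a list of equal-length strings (the function name `profile_and_consensus` is hypothetical):

```python
from collections import Counter

def profile_and_consensus(sequences):
  """Return (consensus, profile) for equal-length DNA strings.

  profile maps each of 'A', 'C', 'G', 'T' to a list of per-position counts.
  """
  # zip(*sequences) yields one tuple per column (position).
  counters = [Counter(column) for column in zip(*sequences)]
  profile = {nuc: [c[nuc] for c in counters] for nuc in 'ACGT'}
  # ties are broken in favour of the earliest letter in 'ACGT'.
  consensus = ''.join(max('ACGT', key=lambda nuc: c[nuc]) for c in counters)
  return consensus, profile

sample = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC',
          'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
consensus, profile = profile_and_consensus(sample)
print(consensus)
for nuc in 'ACGT':
  print(f"{nuc}: {' '.join(map(str, profile[nuc]))}")
```

Because a missing key in a `Counter` counts as 0, nucleotides absent from a column need no special handling.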


## Mortal Fibonacci Rabbits

**[Problem](http://rosalind.info/problems/fibd/)**

Recall the definition of the Fibonacci numbers from “Rabbits and Recurrence Relations”, which followed the recurrence relation F<sub>n</sub> = F<sub>n−1</sub> + F<sub>n−2</sub> and assumed that each pair of rabbits reaches maturity in one month and produces a single pair of offspring (one male, one female) each subsequent month.

Our aim is to somehow modify this recurrence relation to achieve a dynamic programming solution in the case that all rabbits die out after a fixed number of months. See Figure 4 for a depiction of a rabbit tree in which rabbits live for three months (meaning that they reproduce only twice before dying).

**Given:** Positive integers n≤100 and m≤20.

**Return:** The total number of pairs of rabbits that will remain after the n-th month if all rabbits live for m months.

**Sample Dataset**


```
6 3
```


**Sample Output**


```
4
```



<h4>Version 1</h4>

In [0]:
def mortal_rabbits(n,m):
  """
  Given: Positive integers n (<=100) and m (<=20).
  Return: The total number of pairs of rabbits that will remain after the n-th 
          month if all rabbits live for m months and are only able to reproduce
          after the first month (to reach maturity).

  n = number of months.
  m = number of months all rabbits live.
  """
  
  # starting pair and potential offspring pairs.
  # list stores how many pairs are alive and at what age.
  # e.g. if m = 3; [0] = 1 month, [1] = 2 months, [2] = 3 months
  pairs = [1] + [0]*(m-1)

  # iterate over total months, excluding starting month shown above.
  for month in range(n-1):
    # number of rabbit pairs each month; new offspring from existing adults 
    # plus number of rabbit pairs that have not died (i.e. exclude last index).
    pairs = [sum(pairs[1:])] + pairs[:-1]

  # total number of surviving rabbit pairs.
  return sum(pairs)

In [0]:
# Sample Dataset
mortal_rabbits(6,3)

4

In [0]:
# Test Dataset
mortal_rabbits(94,17)

19546080223167369637
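
To see how the age-bucket list evolves, the following sketch repeats the update step from `mortal_rabbits` and records the list for each month of the sample input (n = 6, m = 3). The helper name `rabbit_age_trace` is hypothetical and exists only for illustration.

```python
def rabbit_age_trace(n, m):
  """Return the per-age pair counts for months 1..n (index 0 = newborns)."""
  pairs = [1] + [0]*(m-1)
  history = [pairs]
  for month in range(n-1):
    # newborns come from every pair older than one month; the oldest die.
    pairs = [sum(pairs[1:])] + pairs[:-1]
    history.append(pairs)
  return history

# print month number, pairs per age bucket, and total pairs alive.
for month, pairs in enumerate(rabbit_age_trace(6, 3), start=1):
  print(month, pairs, sum(pairs))
```

The monthly totals come out as 1, 1, 2, 2, 3, 4, and the final total matches the sample output.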

## Calculating Expected Offspring

[Problem](http://rosalind.info/problems/iev/)

For a random variable X taking integer values between 1 and n, the expected value of X is E(X) = ∑<sub>k=1</sub><sup>n</sup> k × Pr(X=k). The expected value offers us a way of taking the long-term average of a random variable over a large number of trials.

As a motivating example, let X be the number on a six-sided die. Over a large number of rolls, we should expect to obtain an average of 3.5 on the die (even though it's not possible to roll a 3.5). The formula for expected value confirms that E(X) = ∑<sub>k=1</sub><sup>6</sup> k × Pr(X=k) = 3.5.

More generally, a random variable for which every one of a number of equally spaced outcomes has the same probability is called a uniform random variable (in the die example, this "equal spacing" is equal to 1). We can generalize our die example to find that if X is a uniform random variable with minimum possible value a and maximum possible value b, then E(X) = (a+b)/2. You may also wish to verify that for the die example, if Y is the random variable associated with the outcome of a second die roll, then E(X+Y) = 7.
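
The die example can be checked directly from the definition. This is a small sketch using exact fractions; the helper name `expected_value` is an illustrative assumption.

```python
from fractions import Fraction

def expected_value(distribution):
  """Expected value of a discrete random variable given {value: probability}."""
  return sum(value * prob for value, prob in distribution.items())

# a fair six-sided die: each face 1..6 has probability 1/6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
print(float(expected_value(die)))      # 3.5, which equals (1 + 6) / 2
# expectation is additive, so two dice give E(X+Y) = E(X) + E(Y).
print(float(2 * expected_value(die)))  # 7.0
```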

**Given:** Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each genotype pairing for a given factor. In order, the six given integers represent the number of couples having the following genotypes:

AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa

**Return:** The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.

**Sample Dataset**


```
1 0 0 1 0 1
```


**Sample Output**


```
3.5
```



<h4>Version 1</h4>

In [0]:
def expected_offspring(given):
  """
  Given: Six non-negative integers (each <=20,000) as a string. The integers 
         correspond to the number of couples in a population possessing each 
         genotype pairing for a given factor. In order, the six given integers 
         represent the number of couples having the following genotypes:

         AA-AA AA-Aa AA-aa Aa-Aa Aa-aa aa-aa

  Return: The expected number of offspring displaying the dominant phenotype in 
          the next generation, under the assumption that every couple has 
          exactly two offspring.

  """
  # Punnett square probabilities of an offspring with the recessive phenotype.
    # AA-AA = 0
    # AA-Aa = 0
    # AA-aa = 0
    # Aa-Aa = 0.25
    # Aa-aa = 0.5
    # aa-aa = 1

  dataset = list(map(int, given.split()))
  # each integer counts couples, so the sum is the total number of couples.
  total_couples = sum(dataset)

  # determine probability of offspring with recessive phenotype.
  rec_prob = ((dataset[3]*0.25) + (dataset[4]*0.5) + dataset[5]) / total_couples

  # return probability of offspring with dominant phenotype multiplied by
  # the number of offspring (two per couple).
  return round(((1 - rec_prob)*total_couples*2),1)

10000000 loops, best of 3: 55.6 ns per loop


In [0]:
# Sample Dataset
expected_offspring('1 0 0 1 0 1')

3.5

In [0]:
# Test Dataset
expected_offspring('19024 18068 16651 19205 16761 17082')

153054.5

<h4>Version 2 (by Sharno)</h4>

In [0]:
def expected_offspring2(given):
  """
  Given: Six non-negative integers (each <=20,000) as a string. The integers 
         correspond to the number of couples in a population possessing each 
         genotype pairing for a given factor. In order, the six given integers 
         represent the number of couples having the following genotypes:

         AA-AA AA-Aa AA-aa Aa-Aa Aa-aa aa-aa

  Return: The expected number of offspring displaying the dominant phenotype in 
          the next generation, under the assumption that every couple has 
          exactly two offspring.

  """
  
  return sum([a*int(b) for a,b in zip([2,2,2,1.5,1,0], given.split())])

10000000 loops, best of 3: 54.5 ns per loop
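
The weights `[2, 2, 2, 1.5, 1, 0]` used in Version 2 are twice the probability that a single child of each pairing shows the dominant phenotype. As a sketch of where those probabilities come from, the Punnett square for each pairing can be enumerated; the names below are hypothetical.

```python
from itertools import product
from fractions import Fraction

def dominant_probability(parent1, parent2):
  """Probability that one child of the two genotypes shows the dominant phenotype."""
  # each parent passes on one of its two alleles with equal probability.
  children = list(product(parent1, parent2))
  dominant = sum(1 for child in children if 'A' in child)
  return Fraction(dominant, len(children))

pairings = [('AA', 'AA'), ('AA', 'Aa'), ('AA', 'aa'),
            ('Aa', 'Aa'), ('Aa', 'aa'), ('aa', 'aa')]
# two offspring per couple, so each weight is 2 * P(dominant).
weights = [float(2 * dominant_probability(p1, p2)) for p1, p2 in pairings]
print(weights)  # [2.0, 2.0, 2.0, 1.5, 1.0, 0.0]
```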


## Overlap Graphs

[Problem](http://rosalind.info/problems/grph/)

A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v).

For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O<sub>k</sub> in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).

**Given:** A collection of DNA strings in FASTA format having total length at most 10 kbp.

**Return:** The adjacency list corresponding to O<sub>3</sub>. You may return edges in any order.

**Sample Dataset**


```
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
```


**Sample Output**


```
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
```
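
As a minimal sketch of how the adjacency list above could be produced, every ordered pair of strings can be tested for a length-k suffix/prefix match. It assumes the FASTA records have already been read into a dictionary mapping labels to sequences, each at least k long; the name `overlap_edges` is hypothetical.

```python
from itertools import permutations

def overlap_edges(records, k=3):
  """Return directed edges (label_s, label_t) of the overlap graph O_k.

  records maps each FASTA label to its DNA string.
  """
  edges = []
  # permutations yields every ordered pair of distinct records.
  for (label_s, s), (label_t, t) in permutations(records.items(), 2):
    # s != t follows the definition above and rules out directed loops.
    if s != t and s[-k:] == t[:k]:
      edges.append((label_s, label_t))
  return edges

sample = {
  'Rosalind_0498': 'AAATAAA',
  'Rosalind_2391': 'AAATTTT',
  'Rosalind_2323': 'TTTTCCC',
  'Rosalind_0442': 'AAATCCC',
  'Rosalind_5013': 'GGGTGGG',
}
for tail, head in overlap_edges(sample):
  print(tail, head)
```

Comparing all ordered pairs is quadratic in the number of strings, which is acceptable for a total input of at most 10 kbp.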



## Finding a Shared Motif

[Problem](http://rosalind.info/problems/lcsm/)

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, "CG" is a common substring of "ACGTACGT" and "AACCGTATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "ACGTACGT" and "AACCGTATA".

Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".
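
Both examples above can be checked by brute force. This sketch tries every substring of the shortest string, longest first, and collects those present in all strings. It is meant only to illustrate the definition, and the name `longest_common_substrings` is hypothetical.

```python
def longest_common_substrings(strings):
  """Return the set of all longest substrings shared by every string."""
  shortest = min(strings, key=len)
  # try candidate lengths from longest to shortest and stop at the first hit.
  for length in range(len(shortest), 0, -1):
    candidates = {shortest[i:i+length] for i in range(len(shortest)-length+1)}
    common = {c for c in candidates if all(c in s for s in strings)}
    if common:
      return common
  return set()

print(longest_common_substrings(['ACGTACGT', 'AACCGTATA']))      # {'CGTA'}
print(sorted(longest_common_substrings(['AACC', 'CCAA'])))       # ['AA', 'CC']
```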

**Given:** A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

**Return:** A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

**Sample Dataset**


```
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
```


**Sample Output**


```
AC
```



<h4> Version 1 </h4>