<a href="https://colab.research.google.com/github/cmrn-rhi/bioinformatics-practice/blob/master/Rosalind_Bioinformatics_Stronghold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Rhiannon Cameron - [Rosalind Profile](http://rosalind.info/users/cmrn-rhi/)*

*Python 3*

# Bioinformatics Stronghold
Learning bioinformatics and programming through problem solving

## Counting DNA Nucleotides





[Problem](http://rosalind.info/problems/dna/)

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

**Given:** A DNA string s of length at most 1000 nt.

**Return:** Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

<h4> Sample Dataset </h4>

```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
```

<h4> Sample Output </h4>

```
20 12 17 21
```


<h4> Version 1 </h4>

In [0]:
import collections

def dna_nuc_count(dna_string):
  """
  DNA Nucleotide Count
  
  Given: A DNA string s of length at most 1000 nt.
  Return: Four integers (separated by spaces) counting the respective number of
          times that symbols 'A','C','G', and 'T' occur in s.
  """
  # ensure incoming data are all uppercase (str.upper returns a new string)
  dna_string = dna_string.upper()
  
  # create ordered dictionary to ensure correct output order
  dna_dict = collections.OrderedDict()
  dna_dict['A'] = 0
  dna_dict['C'] = 0
  dna_dict['G'] = 0
  dna_dict['T'] = 0
  
  # count nucleotides and store associated count in dictionary
  for nucleotide in dna_string:
      dna_dict[nucleotide] += 1
  
  # return nucleotide counts separated by spaces
  return ' '.join([str(i) for i in dna_dict.values()])
  
  # Alternative output:
  # Return the four counts as a tuple instead of a space-separated string;
  # a better format for passing as parameters to future functions, and it
  # also works with an unordered dictionary.
  # return dna_dict['A'], dna_dict['C'], dna_dict['G'], dna_dict['T']

In [0]:
# Sample Dataset Test

test_string = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
dna_nuc_count(test_string)

In [0]:
# Dataset Test

test_string = 'GAAGGAATGGGCTCATGCTGGTTTGCCCAACGGCTCCTCAACATGAAAGCCTGAAGTACACGTGAGATATATACCATTCCATACGCACCCGCGTACCGGGCATGGGAGTACAACCTGACGGCTCCGGCCGTGAGATCGGGTCATCTTGGCAGTGACACTAGGATGTTACAGGGTTAACGGACAGTTTGAACGCAGCTGTGATGGCGGCGAGCAATGAACTCAATCCATCACGAGTCACCTGGGTTAAACGGAGTATAAAGCTACTTCCGTCGGACTGGTCTTGTGGTCTGAGAAAATTGGATCGTCCGGCGAGACTAGTCCGGTACTCGAGTTAGGGTTCGCATTCCCGGGCTTTCCTCAGAGGGCTCACTACCTAGTGAACTTAGGAAGATCTACTCTGTGCGGCGTGTCAATTAGACAATGCTTTATGGAACGGGGCACTGGACAATTAGTGGTAAAAACTGAATCGCCCCCACGTGAAACTGCTCACGCTACATGCTGAATTGTTAACCCCTATAACAGTCTCTGTCAGGACTAGAACGGCTGGCGATCATATCCGGCGACGACTTTGCTCTCGTTTCAGAGGTATTGATCCTGCAGGCAGAAGGCGACCCCCGATTTTCATTCCTTTCCTGGTGTCATTTTGCTCAATATGGTATGATATAAGGTAGATGCGAATGTGTCTATATGTTGTCGAGCGCGCTTAGCCATTCGGTCACGCGTATTGACCATGGTTCGTCTAGATCATCACCCTCTCATAAGAGGAATGCCCCGGCAAGCAAGGGTGTCGCGTTTAGGGAGACTTTGTTGCCATAGCACGTGGAGAGAGTAAGAGGGGAAGCGCATAAGCGGGTGCTACGACCCGAAGCTGTCTCGGATGGCTATGTATCCTGGGGTAGGGACAGTAGAAAAGCGTGTAATGGCCGATACATCAGTCGCAATCCCGGGGCGTAGCTCTCGATAACTCTCCT'
dna_nuc_count(test_string)

In [0]:
# Dataset Test - From File

with open('rosalind_dna.txt') as infile:
  test_string = infile.readline()
  test_string_clean = test_string.rstrip('\n')
  print(dna_nuc_count(test_string_clean))

<h4> Version 2 (Better) </h4>

In [0]:
def dna_nuc_count2(dna):
  """
  DNA Nucleotide Count v2
  
  Given: A DNA string s of length at most 1000 nt.
  Return: A tuple of four integers counting the respective number of
          times that symbols 'A','C','G', and 'T' occur in s.
  """
  # order must be A, C, G, T to match the expected output
  return dna.count("A"), dna.count("C"), dna.count("G"), dna.count("T")

<h4> Version 3 (Simple) </h4>

In [0]:

print(*map(input().count, "ACGT"))
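
For comparison, the standard library's `collections.Counter` can do the tallying in a single pass. This is only a sketch, not one of the original versions; the function name `dna_nuc_count_counter` is made up for illustration.

```python
from collections import Counter

def dna_nuc_count_counter(dna):
  """Return the counts of 'A', 'C', 'G' and 'T' as a space-separated string."""
  # Counter returns 0 for symbols that never occur, so no KeyError is raised
  counts = Counter(dna.upper())
  return ' '.join(str(counts[nt]) for nt in 'ACGT')

sample = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
print(dna_nuc_count_counter(sample))  # 20 12 17 21
```

Unlike Version 1, unexpected symbols (such as a stray newline) are counted but simply left out of the result rather than raising an error.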

## Transcribing DNA into RNA


[Problem](http://rosalind.info/problems/rna/)

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

**Given:** A DNA string t having length at most 1000 nt.

**Return:** The transcribed RNA string of t.

**Sample Dataset**


```
GATGGAACTTGACTACGTAAATT
```


**Sample Output**


```
GAUGGAACUUGACUACGUAAAUU
```



<h4> Version 1 </h4>

In [0]:
def dna_to_rna(dna_string):
  """
  Given: a DNA string, having length of at most 1000 nt.
  Return: the transcribed RNA string.
  
  """
  rna_string = ''
  
  for nucleotide in dna_string:
    if nucleotide == 'T':
      rna_string += 'U'
    else:
      rna_string += nucleotide
      
  return rna_string

In [0]:
# Sample Dataset Test

test_string = 'GATGGAACTTGACTACGTAAATT'
dna_to_rna(test_string)

In [0]:
# Dataset Test

dna_to_rna(input())

<h4> Version 2 </h4>

In [0]:
# User Input

dna_string = input()
print(dna_string.replace("T", "U"))

In [0]:
# From File

with open('rosalind_rna.txt') as infile:
  print(infile.read().replace("T", "U"))

UAUUUGCUAACCAUAGACCGAGAAGUCAACAAAUCGUGAAGCGUGACUAUUUCGGGCGCUACUUCACCCCGACGCCCGGCAUUUCACGCUUGGUGUGUACUACCACGGGGCCGCGGCGGAAUCCCACGCGGAGCUAACCCGUUGGGAUCAUGCAUGCGCCUCUCGGUGGUCAACGGAUGAACUAGUACACCGUCGACGUAAACAAGCUUAUCGCUACCUGGAGGAUUACUGCUGAACUAGUAAACCGGUACCCGGUAAGCCGAAUGCCUGUGAUCGAUGAGUUCUGGGUGUGCAGGAAGUACAUCGGUUAAAGUCUUGGCAGACAACUACCUUAUUUGUCCCCCAGGUGGAAUUACGAGGUGAGGCACUGACGGUAUAACAAGGAAGAGCAAUCAUUUUACACCGGACCUGUGGCCAACACCAAGGCACAGCAGUAUGCGGGAAUGACGCAAAGUCUUUGCACGAGACGGGCAAGCAGUGGCUGUAAAGUUGGACACAGUUGACUACAUUAGAGAUCUAAUACGCGUUCCAUAUCCGACUGGAGCUGGCUAAACAGGUAGAACACAUCGGCCAUUAGCAGAAGCUGUUGAAUUUGGAUCUAAUCCAAAACUAAACUCCAGUUAUGAACUCCCAAAUGAACACAACUCUCGGCGCCCAGUAGAAAUUCGUGCCUCAUUGUUCGGCGCUUGGUACACGUCGGGGCCUGGUGAGUAGUAUAGAUUUGCGCUCGUACAUUACUUCCCGCGCCGCGGGGAGAGUUUUAAAACGUGGUCUACUUGGGAUCUGAGCCUGAUAUGAAAUGGGAGAUGACAUCAGUGUACGGCGCAGCCGCACACGUCAGAAACGAAUAACUUUUUCGUAGCGCCAUUUGCCAAAGGGCUGAUUGUUGAUUGAGGAGCUCAAUCGUAUUGUAAGUCACUUGACAGCCCCUGCCUCUGGA



<h4> Version - Bash </h4>

`tr T U < rosalind_rna.txt`

## Complementing a Strand of DNA

[Problem](http://rosalind.info/problems/revc/)

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

**Given:** A DNA string $s$ of length at most 1000 bp.

**Return:** The reverse complement $s^c$ of $s$.

**Sample Dataset**


```
AAAACCCGGT
```


**Sample Output**


```
ACCGGGTTTT
```



<h4> Version 1 </h4>

In [0]:
def dna_rev_comp(dna_str):
  """
  DNA Reverse Complement
  
  Given: a DNA string (dna_str) of length at most 1000 bp.
  Return: the reverse complement (rev_comp) of the DNA string.
  """
  
  # t & c temporarily hold positions; t will become T and c will become C
  temp_str = dna_str.replace("A", "t").replace("G", "c")
  temp_str2 = temp_str.replace("T", "A").replace("C", "G")
  comp_str = temp_str2.replace("t", "T").replace("c", "C")
  
  return comp_str[::-1]

In [0]:
# Sample Dataset Test

test_string = 'AAAACCCGGT'
dna_rev_comp(test_string)

In [0]:
# From File

with open('rosalind_revc.txt') as infile:
  # strip the trailing newline so it does not end up at the front of the
  # reversed string
  print(dna_rev_comp(infile.read().strip()))

<h4> Version 2 (Better) </h4>

In [0]:
def dna_rev_comp2(dna_str):
  """
  DNA Reverse Complement v2
  
  Given: a DNA string (dna_str) of length at most 1000 bp.
  Return: the reverse complement (rev_comp) of the DNA string.
  """
  
  # t & c temporarily hold positions; t will become T and c will become C
  comp_str = dna_str.replace("A", "t").replace("G", "c").replace("T", "A")\
             .replace("C", "G").replace("t", "T").replace("c", "C")
  
  return comp_str[::-1]

<h4> Version 3 (Best) </h4>

In [0]:
def dna_rev_comp3(dna_str):
  """
  DNA Reverse Complement v3
  
  Given: a DNA string (dna_str) of length at most 1000 bp.
  Return: the reverse complement (rev_comp) of the DNA string.
  """
  comp_str = { "A" : "T", "C" : "G", "G" : "C", "T" : "A" }

  return "".join([comp_str[n] for n in dna_str][::-1])
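
A further option, sketched here on the assumption that the input contains only the uppercase symbols 'A', 'C', 'G', and 'T', is a translation table built with `str.maketrans`. The names `COMPLEMENT` and `dna_rev_comp_translate` are illustrative and not part of the original notebook.

```python
# translation table mapping each nucleotide to its complement
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def dna_rev_comp_translate(dna_str):
  """Return the reverse complement of an uppercase DNA string."""
  # complement every symbol in one pass, then reverse the result
  return dna_str.translate(COMPLEMENT)[::-1]

print(dna_rev_comp_translate('GTCA'))        # TGAC
print(dna_rev_comp_translate('AAAACCCGGT'))  # ACCGGGTTTT
```

Characters missing from the table are passed through unchanged, so a stray newline would not raise a `KeyError` as it would with the dictionary lookup in Version 3.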

## Rabbits & Recurrence Relations

[Problem](http://rosalind.info/problems/fib/)

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence).

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

**Given:** Positive integers n≤40 and k≤5.

**Return:** The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

**Sample Dataset**


```
5 3
```


**Sample Output**


```
19
```



In [0]:
def fib_recur(n, k):
  """
  Fibonacci Recursion
  
  Given: Positive integers n <= 40 and k <= 5
    n = number of months
    k = rabbit pairs per litter
  Return: the total number of rabbit pairs that will be present after n months,
          if we begin with 1 pair and in each generation, every pair of 
          reproduction-age rabbits produces a litter of k rabbit pairs.
  """
  # Base cases: no pairs at month 0; the first pair is still maturing at month 1
  if n == 0 or n == 1:
    return n
  # Recursive case: last month's pairs plus k new pairs for every pair that
  # was alive two months ago
  else:
    return fib_recur(n-1, k) + fib_recur(n-2, k)*k
 

In [0]:
fib_recur(29, 5)

1850229480761
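
The problem statement mentions dynamic programming, whereas the recursive function recomputes the same subproblems many times and slows down sharply as n approaches 40. A minimal bottom-up sketch of the same recurrence follows; the name `fib_dp` is an assumption for illustration.

```python
def fib_dp(n, k):
  """Rabbit pairs after n months when each mature pair yields k new pairs."""
  # prev and curr hold the pair counts for two consecutive months (F1 = F2 = 1)
  prev, curr = 1, 1
  for _ in range(n - 2):
    # next month = this month's pairs + k offspring pairs per pair alive last month
    prev, curr = curr, curr + k * prev
  return curr

print(fib_dp(5, 3))   # 19
print(fib_dp(29, 5))  # 1850229480761
```

Only the two most recent terms are kept, so the loop runs in time linear in n and uses constant memory.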

## Computing GC Content

[Problem](http://rosalind.info/problems/gc/)

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
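
The worked example above can be checked directly. The helper names `gc_percentage` and `reverse_complement` below are illustrative, and the sketch assumes the string contains only uppercase 'A', 'C', 'G', and 'T'.

```python
def gc_percentage(dna):
  """Percentage of symbols in dna that are 'G' or 'C'."""
  return (dna.count('G') + dna.count('C')) / len(dna) * 100

def reverse_complement(dna):
  """Reverse complement, assuming only uppercase A, C, G and T."""
  return dna.translate(str.maketrans('ACGT', 'TGCA'))[::-1]

example = 'AGCTATAG'
print(gc_percentage(example))                      # 37.5
print(gc_percentage(reverse_complement(example)))  # 37.5
```

Complementing swaps each 'G' with a 'C' and reversing changes only the order, so the combined count of 'G' and 'C' is unchanged.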

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

**Given:** At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

**Return:** The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

**Sample Dataset**


```
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
```


**Sample Output**



```
Rosalind_0808
60.919540
```



In [0]:
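def gc_content(dna_strings):
  """
  Computing GC Content
  
  Given: FASTA file containing <= 10 DNA strings (length <= 1kbp/string).
  Return: DNA string ID with the highest GC-content, followed by the GC-content
          of that string. Default error of 0.001.
  """
  
  # each line starting with ">" holds an ID, which becomes a dictionary key;
  # the sequence lines that follow are joined into that key's value
  fasta_dict = {}
  with open(dna_strings) as infasta:
    for line in infasta:
      line = line.strip()
      if line.startswith('>'):
        seq_id = line[1:]
        fasta_dict[seq_id] = ''
      elif line:
        fasta_dict[seq_id] += line
  
  # iterate through dictionary checking GC content in values,
  # keeping a record of the ID with the highest GC content
  id_greatest, gc_greatest = None, 0.0
  for seq_id, seq in fasta_dict.items():
    gc_percent = (seq.count('G') + seq.count('C')) / len(seq) * 100
    if gc_percent > gc_greatest:
      id_greatest, gc_greatest = seq_id, gc_percent
  
  # return ID and value with highest GC content; print each on its own line
  return id_greatest, gc_greatest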
  

<h4> Version 2 - BioPython </h4>

In [0]:
!pip install biopython # includes NumPy

In [0]:
# import Bio
from Bio import SeqIO

In [0]:


# track the best record across the whole file, not per record
id_greatest = None
gc_greatest = 0

for seq_record in SeqIO.parse("rosalind_fasta_example.fasta", "fasta"):
  # print(seq_record.id)
  # print(repr(seq_record.seq))
  # print(len(seq_record))
  nt_count = 0
  gc_count = 0
  
  for nt in seq_record.seq:
    
    nt_count += 1
    
    if nt == 'G' or nt == 'C':
      gc_count += 1
      
  # compute the percentage once per record, after all nucleotides are counted
  gc_percent = (gc_count/nt_count)*100
    
  if gc_percent > gc_greatest:
    gc_greatest = gc_percent
    id_greatest = seq_record.id
     
  
# report only the overall highest, after every record has been checked
print(id_greatest)
print(f"{gc_greatest:.6f}")
    

Rosalind_0808
60.919540
