<a href="https://colab.research.google.com/github/ebatty/CodingBootcamp/blob/main/content/Session5_Exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 5 Exercises

## DNA Parsing

In [1]:
import numpy as np
import matplotlib.pyplot as plt

We will be parsing a real DNA sequence to translate it into the corresponding protein sequence.

DNA consists of a sequence of nucleotides (A, G, T, or C). In the genetic code, each group of three consecutive nucleotides forms a codon that translates to a single amino acid. There are only 20 standard amino acids, so we can use a look-up table to pair each codon with its amino acid. There are also stop codons, which signal that translation of the DNA should stop at that point.

In this way, we can work through a DNA sequence, taking each group of three nucleotides (the first three, then the next three, and so on) and translating it to the corresponding amino acid. The resulting sequence of amino acids constitutes the protein that the DNA sequence encodes.
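As a rough illustration of the reading process described above, the sketch below splits a made-up nine-letter sequence into triplets and looks each one up. The names `toy_table`, `toy_sequence`, `toy_codons`, and `toy_amino_acids` are hypothetical and only for this example; the table is a three-entry stand-in, not the full look-up table used in the exercises.

```python
# Hypothetical three-entry look-up table, just for this illustration.
toy_table = {'ATG': 'M', 'AGC': 'S', 'TGA': 'STOP'}
toy_sequence = 'ATGAGCTGA'

# Take consecutive, non-overlapping groups of three letters.
toy_codons = [toy_sequence[i:i + 3] for i in range(0, len(toy_sequence), 3)]
print(toy_codons)        # ['ATG', 'AGC', 'TGA']

# Translate each codon by looking it up in the table.
toy_amino_acids = [toy_table[codon] for codon in toy_codons]
print(toy_amino_acids)   # ['M', 'S', 'STOP']
```

This is only one way to express the idea; the exercises below build up the same steps one at a time.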

We are getting our data from a public repository of DNA sequences from NCBI. We will be looking at a DNA sequence from a Golden Retriever. The data can be found here.

The cell below assigns the data to variables. You do not need to do anything.

`dna_sequence` contains the DNA sequence.

`dna_codons` stores the pairings from triplets/codons to amino acids in a dictionary. The triplets are the keys and the amino acids are the values.

`true_translation` contains the translated protein (from the NCBI website, under CDS/translation).

In [2]:
dna_sequence = 'ATGAGCGAGTCGAGCTCGAAGTCCAGCCAGCCTTTGGCCTCCAAGCAGGAAAAGGACGGCACTGAGAAGCGAGGGCGGGGCCGGCCGCGCAAGCAGCCTCCGAAGGAACCCAGTGAAGTGCCAACACCTAAGAGACCTCGGGGCCGACCAAAGGGGAGCAAAAACAAGGGTGCTGCCAAGACCCGGAAAACTACCACAACTCCAGGGAGGAAACCGAGGGGCAGACCCAAAAAACTGGAGAAGGAGGAAGAAGAGGGCATCTCGCAGGAGTCCTCCGAAGAGGAGCAGTGA'

dna_codons = {'TTT' : 'F', 'CTT' : 'L', 'ATT' : 'I', 'GTT' : 'V',
           'TTC' : 'F', 'CTC' : 'L', 'ATC' : 'I', 'GTC' : 'V',
           'TTA' : 'L', 'CTA' : 'L', 'ATA' : 'I', 'GTA' : 'V',
           'TTG' : 'L', 'CTG' : 'L', 'ATG' : 'M', 'GTG' : 'V',
           'TCT' : 'S', 'CCT' : 'P', 'ACT' : 'T', 'GCT' : 'A',
           'TCC' : 'S', 'CCC' : 'P', 'ACC' : 'T', 'GCC' : 'A',
           'TCA' : 'S', 'CCA' : 'P', 'ACA' : 'T', 'GCA' : 'A',
           'TCG' : 'S', 'CCG' : 'P', 'ACG' : 'T', 'GCG' : 'A',
           'TAT' : 'Y', 'CAT' : 'H', 'AAT' : 'N', 'GAT' : 'D',
           'TAC' : 'Y', 'CAC' : 'H', 'AAC' : 'N', 'GAC' : 'D',
           'TAA' : 'STOP', 'CAA' : 'Q', 'AAA' : 'K', 'GAA' : 'E',
           'TAG' : 'STOP', 'CAG' : 'Q', 'AAG' : 'K', 'GAG' : 'E',
           'TGT' : 'C', 'CGT' : 'R', 'AGT' : 'S', 'GGT' : 'G',
           'TGC' : 'C', 'CGC' : 'R', 'AGC' : 'S', 'GGC' : 'G',
           'TGA' : 'STOP', 'CGA' : 'R', 'AGA' : 'R', 'GGA' : 'G',
           'TGG' : 'W', 'CGG' : 'R', 'AGG' : 'R', 'GGG' : 'G'
           }

true_translation = 'MSESSSKSSQPLASKQEKDGTEKRGRGRPRKQPPKEPSEVPTPK\
RPRGRPKGSKNKGAAKTRKTTTTPGRKPRGRPKKLEKEEEEGISQESSEEEQ'

1) Index into `dna_sequence` to get the first three nucleotides (ATG). Store this as a variable called `codon`.

2) Use the dna_codons dictionary to translate this codon to the corresponding amino acid. Store this as a variable called protein.

Spoilers: you should get the amino acid M

3) Index into `dna_sequence` to get the second codon (the next three nucleotides, AGC). Translate this to an amino acid and add it to the `protein` string you created in step 2.

Spoilers: protein should now be "MS"

4) Get the length of `dna_sequence` (how many letters are there in this string?).

5) Create a for loop within which you grab and print each codon. So, on the first iteration of the for loop, your code should print the first three letters in dna_sequence (ATG). On the next iteration, you should print the next three letters (AGC), and so on.

6) Within your for loop, translate each codon to the corresponding amino acid. How can we set things up so we get protein at the end as a string of all the translated amino acids?

Hint: Initialize `protein` as an empty string with `protein = ''`

(Advanced) 7) When we hit a stop codon (the amino acid is 'STOP'), we want to stop translating. Add this to the code above.

If our sequence had been CCCCATAGTGGGAGCTAG, translating every codon would give 'PHSGSSTOP', since TAG is a stop codon. We do not want to include the 'STOP', so `protein` should end up as 'PHSGS'.
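Spoilers: one possible way to stop at a stop codon is a `break` statement, sketched here on the short sequence CCCCATAGTGGGAGCTAG. The names `example_codons`, `example_sequence`, and `example_protein` are hypothetical, and the dictionary is a subset of the codon table covering only the codons that appear in this sequence.

```python
# Hypothetical subset of the codon table, enough for this sequence only.
example_codons = {'CCC': 'P', 'CAT': 'H', 'AGT': 'S',
                  'GGG': 'G', 'AGC': 'S', 'TAG': 'STOP'}
example_sequence = 'CCCCATAGTGGGAGCTAG'

example_protein = ''
# Step through the sequence three letters at a time.
for start in range(0, len(example_sequence), 3):
    codon = example_sequence[start:start + 3]
    amino_acid = example_codons[codon]
    if amino_acid == 'STOP':
        break  # leave the loop before 'STOP' is added to the string
    example_protein += amino_acid

print(example_protein)   # PHSGS
```

Checking for 'STOP' before appending is what keeps it out of the result. The same pattern should carry over to `dna_sequence` and `dna_codons`.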

## Function Practice

1. Write a function `greet_guests` that will greet guests by name. The inputs should be the name and the greeting type. Try to do this using f-strings!

For example,

`greet_guests('Ella', 'Hello')` should return the string "Hello, Ella!"

`greet_guests('Mary', 'Hi')` should return the string "Hi, Mary!"

`greet_guests('Xander', 'Whatsup')` should return the string "Whatsup, Xander!"


2. In your function above, make the default value for the greeting type 'Hello'.

So, `greet_guests('Ella')` should return the string "Hello, Ella!"
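If f-strings or default values are unfamiliar, the sketch below shows both on an unrelated, hypothetical function called `describe_pet` (it is not part of the exercise). The second parameter has a default, so it can be left out when calling the function.

```python
def describe_pet(name, species='dog'):
    """Return a short description; `species` defaults to 'dog'."""
    # An f-string substitutes the values of the variables inside {}.
    return f"{name} is a {species}."

print(describe_pet('Rex'))          # Rex is a dog.
print(describe_pet('Tom', 'cat'))   # Tom is a cat.
```

Parameters with default values must come after parameters without them in the function definition.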

3. Write a function `rescale` that takes an array as input and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0.

Hint: If L and H are the lowest and highest values in the original array, then the rescaled value of value v should be (v-L) / (H-L).

Note: This problem is modified from Software Carpentry CC-BY materials https://swcarpentry.github.io/python-novice-inflammation/08-func.html
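The formula in the hint can be tried on a small array before wrapping it in a function. This is a minimal sketch with hypothetical names (`values`, `lowest`, `highest`, `scaled`); it relies on NumPy applying the arithmetic to every element at once.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 10.0])
lowest = values.min()     # L in the hint, here 2.0
highest = values.max()    # H in the hint, here 10.0

# (v - L) / (H - L), applied elementwise to the whole array.
scaled = (values - lowest) / (highest - lowest)
print(scaled)             # elements: 0.0, 0.25, 0.5, 1.0
```

Note that an array whose values are all equal gives H - L = 0, a case this sketch does not handle.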

In [None]:
import numpy as np

input_array = np.array([[3, 5, 1, 2], [6, 7, 5, 1]])



(Advanced) 4. We want to create a function that takes in an array, finds the second largest value of the array, and returns it. Specifically, we want the second largest of the unique values. For example, using the function on the array `a = np.array([100, 100, 50])` should return 50, not 100.
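Spoilers: one possible building block is `np.unique`, which returns the distinct values of an array sorted in ascending order. The sketch below only demonstrates that behavior on the example array; `unique_values` is a hypothetical name, and turning this into a function (and deciding what to do when there are fewer than two unique values) is left to the exercise.

```python
import numpy as np

a = np.array([100, 100, 50])

# Duplicates are removed and the result is sorted from low to high.
unique_values = np.unique(a)
print(unique_values)        # contains 50 and 100, in that order

# Negative indices count from the end, so -2 is the second-to-last entry.
print(unique_values[-2])    # 50
```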

